electronics

Article A Noniterative Radix-8 CORDIC with Low Latency and High Efficiency

Wenming Tang * and Feng Xu

The Key Laboratory for Information Science of Electromagnetic Waves, School of Information Science and Technology, Fudan University, Shanghai 200433, China; [email protected] * Correspondence: [email protected]

 Received: 6 August 2020; Accepted: 1 September 2020; Published: 17 September 2020 

Abstract: An efficient, noniterative Radix-8 (NR-8) coordinate rotation digital computer (CORDIC) algorithm is proposed for low-latency and high-efficiency computation of the sine, cosine, or phase-shift functions. The function values are computed using only angles in the narrow range [0, π/12] rather than the wide range [0, π/2]. The algorithm is expressed by a formula that replaces the traditional iterative process with a complex multiplier. The results obtained from the simulation and the experiment on an FPGA show that the NR-8 CORDIC algorithm operates well: the 16-bit precision output has an absolute error of only 0.012% when computing the sine or cosine function with a step of 0.001°. Compared with the best conventional CORDIC algorithm, the clock latency of this algorithm decreases to less than 50%, and it needs only half of the logic resources and consumes half of the power. The algorithm also has advantages over other recently improved CORDIC algorithms and requires less than half of their clock latency, even for a 23-bit precision output. Therefore, this algorithm has potential applications in real-time systems such as radar digital beamforming.

Keywords: CORDIC; sine and cosine; phase shift; FPGA; digital beamforming

1. Introduction

As one of the most common transcendental functions, the sine or cosine function has been widely used in real-time digital systems, such as radar, ultrasound, robotics, communication and so on [1–7]. The accuracy and efficiency of the computation of the functions are two key requirements for evaluating the performance of these systems. For this purpose, many methods for calculating the sine or cosine function have been developed, such as approximation methods and so on [8–10]. However, these methods have the disadvantage of either high complexity or high latency, and thus an efficient method is needed to provide accurate and efficient computation for real-time systems. Fortunately, the coordinate rotation digital computer (CORDIC) algorithm [11] can provide accurate and efficient computations by working iteratively and decomposing the calculation into a series of addition, subtraction and shift operations, which enables it to be widely used in digital circuits to implement the computations of trigonometric and exponential functions, and so forth [12]. However, as an iterative algorithm, the accuracy of the CORDIC algorithm strongly relies on the number of iterations, so increasing the iteration number increases the clock latency and thus lowers the efficiency of the computations. To further enhance the efficiency, more progress has been made by improving the architecture of the CORDIC algorithm to achieve a more efficient algorithm, such as the Scaling-Free (SF) CORDIC, Radix-4 CORDIC, Radix-8 CORDIC and low-latency Hybrid (LLH) CORDIC algorithms [13–17].

Electronics 2020, 9, 1521; doi:10.3390/electronics9091521 www.mdpi.com/journal/electronics

Some of the improved CORDIC algorithms have been widely used in radar digital beamforming (DBF) systems. For instance, Lee et al. developed a CORDIC-based algorithm to be used in Multi-Gbps MIMO systems, which is implemented by a Virtex-6 FPGA using 49,752 slices, and the algorithm needs 260 ns (250 MHz, 65 clock periods) of latency due to the many iterations required for computations [4]. Similarly, Jun et al. described look-ahead, pipelined CORDIC-based adaptive filters and their application to adaptive beamforming [5], and the pipeline level m depends on the m-bit precision. However, the CORDIC algorithm often requires many iterations to converge, which has become a major bottleneck for real-time applications.

In this work, a new noniterative Radix-8 (NR-8) CORDIC algorithm is proposed for low-latency implementation on FPGAs. In the process of the development of an NR-8 CORDIC algorithm, three steps were taken: (1) The NR-8 CORDIC algorithm was derived from the conventional Radix-2 CORDIC one. (2) The input angle θ was set to a narrow range by simultaneously transforming the input variables $x_0$ and $y_0$. (3) A formula was deduced and optimized. These steps can narrow the selected range of the iteration angle and realize a noniterative formula of the CORDIC algorithm; besides, the algorithm can be accelerated by the multiplier module readily available in FPGAs [18]. As a result, the algorithm can reduce the 7–17 clock latencies of the conventional CORDIC (16-bit precision) algorithm to a three-clock latency, needs fewer logic resources and consumes less power. Compared with the LLH algorithm [16], it has great advantages in terms of time and resources.

The rest of this paper is structured as follows. Section 2 presents the derivation from the conventional CORDIC algorithm. In Section 3, the proposed NR-8 CORDIC is introduced. Section 4 presents its FPGA implementation and analysis. Section 5 introduces the application of the NR-8 CORDIC in radar DBF. Finally, a conclusion is made according to the results obtained from the above sections.

2. Conventional CORDIC Rotator Algorithm

The CORDIC algorithm usually operates in rotation mode or vector mode [11,12], following linear, circular or hyperbolic coordinate trajectories. In this paper, we focus on the rotation mode using circular trajectory.

The rotation mode is depicted in Figure 1, where θ is the angle between the vectors $V_0(x_0, y_0)$ and $V_d(x_d, y_d)$. As the vector $V_0$ rotates counterclockwise to the vector $V_d$, the coordinate $(x_d, y_d)$ can be described as in Equation (1):

$$\begin{bmatrix} x_d \\ y_d \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = \cos\theta \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{1}$$

Figure 1. The CORDIC vector rotation model.

If the initial vector $(x_0, y_0)$ is set to $x_0 = 1$, $y_0 = 0$, Equation (1) can be used to compute $\cos\theta$ and $\sin\theta$.

θ is decomposed into a series of micro angles, each of which corresponds to one step rotation as shown in Figure 1 and described as in Equation (2):

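$$\theta = \sum_{i=0}^{n} \theta_i, \quad \theta_i = \tan^{-1}\left(\sigma_i R^{-(i+1)}\right) \tag{2}$$

where n denotes the number of rotations, R denotes the radix, $R = 2^l$, $l \in \mathbb{N}$, $\theta_i$ denotes the micro angles and $\sigma_i$ denotes the selection factors, defined as all integers within the interval $\sigma_i \in [-R/2, R/2]$.

Substituting Equation (2) into Equation (1) yields Equation (3):

$$\begin{bmatrix} x_d \\ y_d \end{bmatrix} = \prod_{i=0}^{n} \begin{bmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = \left(\prod_{i=0}^{n} \cos\theta_i\right) \times \prod_{i=0}^{n} \begin{bmatrix} 1 & -\tan\theta_i \\ \tan\theta_i & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{3}$$

Equation (3) describes the computation process illustrated in Figure 1. Apparently, the recursion formula of the ith rotation can be written as Equation (4):

$$\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix} = \cos\theta_i \begin{bmatrix} 1 & -\tan\theta_i \\ \tan\theta_i & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{4}$$

If R equals 2, it is the conventional Radix-2 (R-2) CORDIC iterative algorithm. If R equals 4, it becomes a conventional Radix-4 (R-4) CORDIC iterative algorithm. If R equals 8, it becomes a conventional Radix-8 (R-8) CORDIC iterative algorithm, and thus Equation (3) can be written as Equation (5):

$$\begin{bmatrix} x_d \\ y_d \end{bmatrix} = K \times \prod_{i=0}^{n} \begin{bmatrix} 1 & -\sigma_i R^{-(i+1)} \\ \sigma_i R^{-(i+1)} & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{5}$$

where the scale factor K is defined as Equation (6):

$$K = \prod_{i=0}^{n} \cos\theta_i = \prod_{i=0}^{n} \left(1 + \sigma_i^2 R^{-2(i+1)}\right)^{-1/2} \tag{6}$$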

The conventional CORDIC algorithm is implemented in an iterative fashion, in which the input angle is approached step by step using the series of micro angles. After the ith iteration, the residual error angle is defined as in Equation (7):

$$z_{i+1} = \theta - \sum_{j=0}^{i} \theta_j, \quad i = 0, 1, 2, \cdots, n \tag{7}$$

where $z_0 = \theta$. For the next iteration, an optimal factor $\sigma_{i+1}$ is selected so that the residual error angle becomes minimal, which can be calculated by using Equation (8):

$$\sigma_i = \arg\min \left| z_i - \sigma_i R^{-(i+1)} \right| = \arg\min \left| \omega_i - \sigma_i \right|, \quad \text{s.t. } \sigma_i \in [-R/2, R/2] \tag{8}$$

where $\omega_i = z_i R^{i+1}$. It can be directly solved using a rounding operation via Equation (9):

$$\sigma_i = \begin{cases} R/2, & \omega_i \geq (R-1)/2 \\ -R/2, & \omega_i \leq -(R-1)/2 \\ U(\omega_i), & \text{else} \end{cases} \tag{9}$$

where the function $U(\omega_i)$ rounds each element of $\omega_i$ to the nearest integer.

In each iteration, as the R number increases, the number of iterations decreases, but the selection factor $\sigma_i$ increases, thus increasing the complexity of the conventional CORDIC algorithm. Hence, we found that R-8 CORDIC (R = 8) is a good balance between complexity and efficiency. However, the current iterative R-8 algorithm still needs several iterations. For example, six iterations are necessary for a 16-bit depth digital signal processing application. To address this problem, this paper proposes a noniterative form of the R-8 CORDIC (NR-8 CORDIC) algorithm.
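The iterative procedure of Equations (2) and (5)–(9) can be illustrated with a short floating-point sketch. It is only a reference model under stated assumptions, not the hardware implementation discussed in this paper: the names `select_digit` and `cordic_rotate` are hypothetical, the scale factor K of Equation (6) is accumulated directly instead of being precomputed, and the input angle is assumed to lie inside the convergence range of the digit set (for R = 8 and $\sigma_i \in [-4, 4]$, roughly |θ| < 0.53 rad).

```python
import math


def select_digit(omega: float, radix: int) -> int:
    """Selection factor of Equation (9): round omega, clamped to [-R/2, R/2]."""
    half = radix // 2
    if omega >= (radix - 1) / 2:
        return half
    if omega <= -(radix - 1) / 2:
        return -half
    return round(omega)


def cordic_rotate(x0: float, y0: float, theta: float,
                  radix: int = 8, n_iter: int = 6):
    """Rotate (x0, y0) by theta with the iterative radix-R recursion.

    Assumes theta is inside the convergence range of the digit set.
    """
    x, y, z = x0, y0, theta
    gain = 1.0                                # running product of cos(theta_i), i.e. K
    for i in range(n_iter):
        step = radix ** -(i + 1)              # R^-(i+1)
        sigma = select_digit(z / step, radix)  # omega_i = z_i * R^(i+1)
        a = sigma * step                      # tan(theta_i)
        x, y = x - a * y, y + a * x           # micro-rotation without scaling
        gain *= 1.0 / math.sqrt(1.0 + a * a)  # cos(theta_i), Equation (6)
        z -= math.atan(a)                     # residual angle, Equation (7)
    return gain * x, gain * y


if __name__ == "__main__":
    c, s = cordic_rotate(1.0, 0.0, 0.2)
    print(c, math.cos(0.2))
    print(s, math.sin(0.2))
```

With `x0 = 1` and `y0 = 0` the two outputs approximate cos θ and sin θ, and every additional iteration shrinks the residual angle by roughly a factor of R.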

3. Noniterative Radix-8 CORDIC Algorithm

We propose a noniterative computation structure of the R-8 CORDIC algorithm by iterating the data in a narrow input angle interval, using an explicit formula of solution, simplifying the scale factor and transforming the input variables $x_0$ and $y_0$ to accelerate the convergence of the algorithm.

3.1. Narrow Input Angle θ Range

Conventionally, one only needs to consider the input angle $\theta \in [0, \pi/2]$ of the first quadrant, from which the rest of the quadrants can be easily computed by invoking the symmetry property of the sine or cosine function. Thus, the rest of the quadrants can be mapped to the first quadrant by simple transformation. In this article, we first narrow the input angle interval into an angle range of $[0, \pi/12]$. The first quadrant of the coordinate system is equally divided into six regions, marked from A to F, each of which spans an angle of $\pi/12$. Then the angle $\theta \in [0, \pi/2]$ can be folded to the range of $\varphi \in [0, \pi/12]$. The CORDIC output mappings between θ and ϕ are given in Table 1. Accordingly, the input variables $x_0, y_0$ need to be changed to $x_0', y_0'$, respectively. Therefore, we can readily compute the output values in the angle range of ϕ according to the CORDIC algorithm, on the basis of which, and as shown in Table 1, the output values $x_d, y_d$ in the whole range of θ are achieved with ease ($\sqrt{3}$ is calculated by using the Taylor series, and a matrix is defined as $R_T(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$).

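Table 1. The CORDIC output mappings between θ and ϕ.

| Regions | θ | $x_0', y_0'$ | $x_d, y_d$ |
|---|---|---|---|
| A, $[0, \frac{\pi}{12})$ | $\varphi$ | $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ | $\begin{bmatrix} x_d \\ y_d \end{bmatrix} = R_T(\varphi) \begin{bmatrix} x_0' \\ y_0' \end{bmatrix}$ |
| B, $[\frac{\pi}{12}, \frac{\pi}{6})$ | $\frac{\pi}{6} - \varphi$ | $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} \sqrt{3}x_0 - y_0 \\ -x_0 - \sqrt{3}y_0 \end{bmatrix}$ | $\begin{bmatrix} x_d \\ -y_d \end{bmatrix} = R_T(\varphi) \begin{bmatrix} x_0' \\ y_0' \end{bmatrix}$ |
| C, $[\frac{\pi}{6}, \frac{\pi}{4})$ | $\frac{\pi}{6} + \varphi$ | $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} \sqrt{3}x_0 - y_0 \\ x_0 + \sqrt{3}y_0 \end{bmatrix}$ | $\begin{bmatrix} x_d \\ y_d \end{bmatrix} = R_T(\varphi) \begin{bmatrix} x_0' \\ y_0' \end{bmatrix}$ |
| D, $[\frac{\pi}{4}, \frac{\pi}{3})$ | $\frac{\pi}{3} - \varphi$ | $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} x_0 - \sqrt{3}y_0 \\ -\sqrt{3}x_0 - y_0 \end{bmatrix}$ | $\begin{bmatrix} x_d \\ -y_d \end{bmatrix} = R_T(\varphi) \begin{bmatrix} x_0' \\ y_0' \end{bmatrix}$ |
| E, $[\frac{\pi}{3}, \frac{5\pi}{12})$ | $\frac{\pi}{3} + \varphi$ | $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} x_0 - \sqrt{3}y_0 \\ \sqrt{3}x_0 + y_0 \end{bmatrix}$ | $\begin{bmatrix} x_d \\ y_d \end{bmatrix} = R_T(\varphi) \begin{bmatrix} x_0' \\ y_0' \end{bmatrix}$ |
| F, $[\frac{5\pi}{12}, \frac{\pi}{2}]$ | $\frac{\pi}{2} - \varphi$ | $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \begin{bmatrix} -y_0 \\ -x_0 \end{bmatrix}$ | $\begin{bmatrix} x_d \\ -y_d \end{bmatrix} = R_T(\varphi) \begin{bmatrix} x_0' \\ y_0' \end{bmatrix}$ |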
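The folding idea behind Table 1 can be checked numerically with a small sketch. It is written under assumptions rather than taken from the paper: the function names `fold`, `rot` and `rotate_folded` are hypothetical, exact floating-point values are used for $\sqrt{3}$ and for the rotation by ϕ (standing in for the CORDIC core), and the input transformations are written in the form that satisfies the identity $R_T(\theta)v = M R_T(\varphi) R_T(-\alpha) M v$ with $M = \mathrm{diag}(1, -1)$ for the mirrored regions B, D and F (α = π/6, π/3, π/2); the region F entry in particular is an assumed reading.

```python
import math

SQRT3 = math.sqrt(3.0)


def rot(phi: float, v):
    """Exact rotation R_T(phi) applied to the vector v = (x, y)."""
    c, s = math.cos(phi), math.sin(phi)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def fold(theta: float, x0: float, y0: float):
    """Map theta in [0, pi/2] to (phi, (x0', y0'), negate_y) by region A-F."""
    region = min(int(theta // (math.pi / 12)), 5)
    if region == 0:   # A: theta = phi
        return theta, (x0, y0), False
    if region == 1:   # B: theta = pi/6 - phi
        return math.pi / 6 - theta, ((SQRT3 * x0 - y0) / 2, (-x0 - SQRT3 * y0) / 2), True
    if region == 2:   # C: theta = pi/6 + phi
        return theta - math.pi / 6, ((SQRT3 * x0 - y0) / 2, (x0 + SQRT3 * y0) / 2), False
    if region == 3:   # D: theta = pi/3 - phi
        return math.pi / 3 - theta, ((x0 - SQRT3 * y0) / 2, (-SQRT3 * x0 - y0) / 2), True
    if region == 4:   # E: theta = pi/3 + phi
        return theta - math.pi / 3, ((x0 - SQRT3 * y0) / 2, (SQRT3 * x0 + y0) / 2), False
    # F: theta = pi/2 - phi (assumed form of the input transformation)
    return math.pi / 2 - theta, (-y0, -x0), True


def rotate_folded(theta: float, x0: float, y0: float):
    """Rotate by theta using only a rotation by the folded angle phi."""
    phi, v, negate_y = fold(theta, x0, y0)
    xd, yd = rot(phi, v)
    return (xd, -yd) if negate_y else (xd, yd)


if __name__ == "__main__":
    theta = math.radians(50.0)
    print(rotate_folded(theta, 1.0, 0.0), (math.cos(theta), math.sin(theta)))
```

In a hardware setting only `rot` would be replaced by the narrow-range CORDIC core; the pre- and post-processing reduce to a few shift-and-add operations and a sign change.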

3.2. Explicit Formula of Convergence

Equation (5) can be computed naturally by using iterations. The scale factor K is temporarily ignored for the sake of simplicity. The iterative formula of Equation (5) is given as follows. Let us define

$$a_i = \sigma_i 8^{-(i+1)} \tag{10}$$

If i = 1, then

$$\begin{cases} x_1 = x_0 - a_0 y_0 \\ y_1 = y_0 + a_0 x_0 \end{cases}$$

If i = 2, then

$$\begin{cases} x_2 = (1 - a_0 a_1)x_0 - (a_0 + a_1)y_0 \\ y_2 = (1 - a_0 a_1)y_0 + (a_0 + a_1)x_0 \end{cases}$$

If i = 3, then

$$\begin{cases} x_3 = (1 - a_0 a_1 - a_0 a_2 - a_1 a_2)x_0 - (a_0 + a_1 + a_2 - a_0 a_1 a_2)y_0 \\ y_3 = (1 - a_0 a_1 - a_0 a_2 - a_1 a_2)y_0 + (a_0 + a_1 + a_2 - a_0 a_1 a_2)x_0 \end{cases}$$

If i = 4, then

$$\begin{cases} x_4 = (1 - a_0 a_1 - a_0 a_2 - a_0 a_3 - a_1 a_2 - a_1 a_3 - a_2 a_3 + a_0 a_1 a_2 a_3)x_0 - (a_0 + a_1 + a_2 + a_3 - a_0 a_1 a_2 - a_0 a_1 a_3 - a_0 a_2 a_3 - a_1 a_2 a_3)y_0 \\ y_4 = (1 - a_0 a_1 - a_0 a_2 - a_0 a_3 - a_1 a_2 - a_1 a_3 - a_2 a_3 + a_0 a_1 a_2 a_3)y_0 + (a_0 + a_1 + a_2 + a_3 - a_0 a_1 a_2 - a_0 a_1 a_3 - a_0 a_2 a_3 - a_1 a_2 a_3)x_0 \end{cases} \tag{11}$$

A deductive formula can be summarized as

$$\begin{cases} x_n = A_n \times x_0 - B_n \times y_0 \\ y_n = A_n \times y_0 + B_n \times x_0 \end{cases} \tag{12}$$

where $A_n$, $B_n$ are respectively defined as

$$\begin{cases} A_n = 1 - \sum\limits_{i=1}^{k} (-1)^{i+1} f_a\left(C_n^{2i}\right), & n \in \mathbb{N} \\ B_n = \begin{cases} \sum\limits_{i=1}^{k} (-1)^{i+1} f_a\left(C_n^{2i-1}\right), & n = 2k,\ k \in \mathbb{N} \\ \sum\limits_{i=0}^{k} (-1)^{i} f_a\left(C_n^{2i+1}\right), & n = 2k+1,\ k \in \mathbb{N} \end{cases} \end{cases} \tag{13}$$

where the function $f_a(C_n^m)$, $0 \leq m \leq n$, is defined as the product of m different elements selected from the set $\{a_0, a_1, \cdots, a_{n-1}\}$, as described in Equation (14):

$$f_a\left(C_n^m\right) = a_{i_1} a_{i_2} \ldots a_{i_m}, \quad (i_1, i_2, \ldots, i_m) \in I(m) \tag{14}$$

where $I(m)$ denotes all possible combinatorial sets of m unique indices selected from $\{0, 1, \cdots, n-1\}$. Apparently, there are a total of $C_n^m$ sets in $I(m)$, and, as Equation (11) shows, the products over all of these sets are summed in Equation (13). Substituting Equation (10) into the products in Equation (14), we have the following inequality:

$$F(m) = a_{i_1} a_{i_2} \ldots a_{i_m} = \frac{\sigma_{i_1} \sigma_{i_2} \ldots \sigma_{i_m}}{8^{S(m)}} \leq \frac{4^m}{8^{S(m)}} \tag{15}$$

where

$$S(m) = (i_1 + 1) + (i_2 + 1) + \ldots + (i_m + 1) \tag{16}$$

Note that the equality in Equation (15) holds if and only if $\sigma_{i_1} = \sigma_{i_2} = \ldots = \sigma_{i_m} = 4$. It can be observed that Equation (16) has a minimum of $S(m) = \frac{(1+m)m}{2}$. Figure 2 shows that $F(m)$ quickly vanishes as m or $S(m)$ increases, from which we have two observations as follows:

Observation (1): If $m \geq 3$, when $S(m) \geq 7$, $F(m) < 0.000031$, which can be ignored.

Observation (2): If $m = 1$ or $m = 2$, when $S(m) \geq 6$, $F(m) < 0.000061$, which can be ignored.

Note that $S(m) \geq 8$ is necessary in order to achieve a high accuracy. Thus, we can ignore the terms that satisfy any of the above two conditions in $f_a(C_n^m)$ of Equation (13) to greatly simplify computation. For example, the terms $a_0 a_1 a_2$, $a_0 a_2 a_3$, $\cdots$, $a_1 a_2 a_3$, $\cdots$, $a_0 a_1 a_2 a_3$, $\cdots$, $a_1 a_2 a_3 a_4$, $\cdots$, $a_2 a_3 a_4$, and $a_0 a_4$, $\cdots$, $a_2 a_3$, $\cdots$, $a_5$, $\cdots$ can be ignored. Thus, the variables $A_n$, $B_n$ in Equation (13) can be simplified as

$$\begin{bmatrix} A_n \\ B_n \end{bmatrix} = \begin{bmatrix} 1 - a_0 \sum\limits_{i=1}^{3} a_i - a_1 a_2 \\ \sum\limits_{i=0}^{4} a_i \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 - a_0 \sum\limits_{i=1}^{n} a_i - a_1 \sum\limits_{i=2}^{n} a_i \\ \sum\limits_{i=0}^{n} a_i \end{bmatrix} \tag{17}$$
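The effect of dropping the higher-order products can be examined with a brief numerical sketch. It relies on assumptions that are not part of the paper: the function names are hypothetical, $f_a(C_n^m)$ is taken as the sum of all m-fold products (an elementary symmetric polynomial), and the sample values of $a_i$ are chosen by hand to resemble a 16-bit case. The exact $A_n$, $B_n$ are also cross-checked against the product $\prod_i (1 + j a_i)$, whose real and imaginary parts they must equal.

```python
from itertools import combinations
from math import prod


def elem_sym(a, m: int) -> float:
    """Sum of all products of m distinct elements of a (sum over I(m))."""
    return sum(prod(c) for c in combinations(a, m))


def exact_AB(a):
    """A_n and B_n of Equation (13) as alternating sums of elem_sym."""
    n = len(a)
    A = sum((-1) ** (m // 2) * elem_sym(a, m) for m in range(0, n + 1, 2))
    B = sum((-1) ** (m // 2) * elem_sym(a, m) for m in range(1, n + 1, 2))
    return A, B


def product_AB(a):
    """Same quantities from the complex product of (1 + j*a_i)."""
    z = complex(1.0, 0.0)
    for ai in a:
        z *= complex(1.0, ai)
    return z.real, z.imag


def truncated_AB(a):
    """Right-hand form of Equation (17): keep only the dominant terms."""
    A = 1.0 - a[0] * sum(a[1:]) - a[1] * sum(a[2:])
    B = sum(a)
    return A, B


if __name__ == "__main__":
    # Hand-picked sample values (an assumption, not data from the paper).
    a = [2 / 8, 8 / 2**7, 7 / 2**10, 7 / 2**13, 7 / 2**16]
    print("exact    ", exact_AB(a))
    print("truncated", truncated_AB(a))
```

For such inputs the discarded terms are dominated by $a_0 a_1 a_2$, so the printed difference gives a feel for how large the truncation error of Equation (17) can become.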

Figure 2. Decreasing of F(m) as m or S(m) increases.

3.3. Scale Factor

Now we consider the scale factor K of Equation (6). It can be written as a Taylor series, i.e.,

$$K = \prod_{i=0}^{n} \left(1 - \frac{1}{2}a_i^2 + \frac{3}{8}a_i^4 - \frac{5}{16}a_i^6 + \ldots\right) \approx C\left(1 - \frac{1}{2}a_1^2\right) \tag{18}$$

where $C = 1 - \frac{1}{2}a_0^2 + \frac{3}{8}a_0^4$, and apparently we can calculate all the values of C and $\frac{1}{2}a_1^2$ by enumerating all values of $a_0$ and $a_1$, respectively. In order to speed up the parallel computation of the NR-8 CORDIC algorithm, we can compensate the input variables $x_0', y_0'$ instead of $A_n$, $B_n$ by the scale factor K. More details about this process are described in the oncoming section.

3.4. Transformation of the Inputs x0 and y0

According to Table 1, the input angle $\theta \in [0, \pi/2]$ can be folded to the range of $[0, \pi/12]$. Accordingly, the input variables $x_0, y_0$ should also be transformed to $x_0', y_0'$. The transformation rules are as follows:

If $\theta \in [0, \frac{\pi}{12})$, then $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$.

If $\theta \in [\frac{\pi}{12}, \frac{\pi}{6})$, then $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} \sqrt{3}x_0 - y_0 \\ -x_0 - \sqrt{3}y_0 \end{bmatrix}$.

If $\theta \in [\frac{\pi}{6}, \frac{\pi}{4})$, then $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} \sqrt{3}x_0 - y_0 \\ x_0 + \sqrt{3}y_0 \end{bmatrix}$.

If $\theta \in [\frac{\pi}{4}, \frac{\pi}{3})$, then $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} x_0 - \sqrt{3}y_0 \\ -\sqrt{3}x_0 - y_0 \end{bmatrix}$.

If $\theta \in [\frac{\pi}{3}, \frac{5\pi}{12})$, then $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \frac{1}{2} \begin{bmatrix} x_0 - \sqrt{3}y_0 \\ \sqrt{3}x_0 + y_0 \end{bmatrix}$.

If $\theta \in [\frac{5\pi}{12}, \frac{\pi}{2}]$, then $\begin{bmatrix} x_0' \\ y_0' \end{bmatrix} = \begin{bmatrix} -y_0 \\ -x_0 \end{bmatrix}$.

$\sqrt{3}$ can be calculated by using the Taylor series, which is $\sqrt{3} = 2 - \frac{1}{4} - \frac{1}{64} - \frac{1}{512} \approx 1.732$. The input variables $x_0', y_0'$ are multiplied by the scale factor K to compensate for the gain due to iteration, which produces two new variables, $x_0'', y_0''$, described as,

$$\begin{bmatrix} x_0'' \\ y_0'' \end{bmatrix} = K \times \begin{bmatrix} x_0' \\ y_0' \end{bmatrix} \tag{19}$$
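Both constants above lend themselves to a quick numerical sanity check. The sketch below is an illustration under assumptions, not the paper's circuit: `sqrt3_times` is a hypothetical integer helper that applies the shift-and-subtract form $2 - 2^{-2} - 2^{-6} - 2^{-9}$, and `scale_exact` and `scale_approx` are hypothetical helpers that compare Equation (6) with the approximation of Equation (18) for hand-picked $a_i$.

```python
import math


def sqrt3_times(v: int) -> int:
    """Multiply a non-negative integer by about sqrt(3) using shifts only."""
    return (v << 1) - (v >> 2) - (v >> 6) - (v >> 9)


def scale_exact(a) -> float:
    """Scale factor K of Equation (6) with a_i = tan(theta_i)."""
    return math.prod(1.0 / math.sqrt(1.0 + ai * ai) for ai in a)


def scale_approx(a) -> float:
    """Approximation of Equation (18): K ~ C * (1 - a_1^2 / 2)."""
    c = 1.0 - a[0] ** 2 / 2 + 3 * a[0] ** 4 / 8
    return c * (1.0 - a[1] ** 2 / 2)


if __name__ == "__main__":
    one = 1 << 15                       # 1.0 in a Q15-style scaling (assumption)
    print(sqrt3_times(one) / one, math.sqrt(3))
    a = [2 / 8, 8 / 2**7, 7 / 2**10]    # hand-picked sample values
    print(scale_exact(a), scale_approx(a))
```

Because $a_0$ takes only three values and $a_1$ only a few more, both factors of the approximation can be stored as small constant tables, which is what makes compensating the inputs by K inexpensive.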

Let $A = A_n$, $B = B_n$ and the explicit formula of $x_n$ and $y_n$ in Equation (12) can be rewritten as

$$\begin{cases} x_n = A \times x_0'' - B \times y_0'' \\ y_n = A \times y_0'' + B \times x_0'' \end{cases} \Leftrightarrow (x_n + y_n i) = (A + Bi)\left(x_0'' + y_0'' i\right) \tag{20}$$

As a result, the final outputs $x_d, y_d$ in Equation (1) can be expressed by $x_n, y_n$ in Equation (20), respectively, and Equation (20) can be easily implemented by using complex multipliers [19].
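Equation (20) is a single complex multiplication, which the following sketch spells out in two ways. The function names are hypothetical, and the three-multiplication variant (often attributed to Gauss) is shown only as a common way of mapping a complex product onto fewer hardware multipliers; the paper does not state which form its multiplier module uses.

```python
def rotate_direct(A: int, B: int, x: int, y: int):
    """Equation (20) with four real multiplications."""
    return A * x - B * y, A * y + B * x


def rotate_three_mult(A: int, B: int, x: int, y: int):
    """Same product (A + jB)(x + jy) with three real multiplications."""
    k1 = x * (A + B)
    k2 = A * (y - x)
    k3 = B * (x + y)
    return k1 - k3, k1 + k2


if __name__ == "__main__":
    # Integer operands stand in for fixed-point values (an assumption).
    print(rotate_direct(32600, 4100, 31000, -200))
    print(rotate_three_mult(32600, 4100, 31000, -200))
```

Both functions return the same pair for integer inputs; the second trades one multiplier for three extra additions.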

4. Implementation and Analysis

In this section, the architecture and performance of the NR-8 CORDIC algorithm are discussed with a simulation and an FPGA implementation.

4.1. Noniterative Implementation

After narrowing the input angle range from θ to ϕ, we set the initial angle of $z_0$ in Equation (7) to be $z_0 = \varphi \in [0, \pi/12]$. Thus, we have $w_0 = z_0 R = 8\varphi$ and $8\varphi \in [0, 2\pi/3]$, so $8\varphi < 2.5$. Following from Equations (7)–(9), $\sigma_0$ is obtained as $\sigma_0 = U(w_0) = U(8\varphi)$ ($U(8\varphi)$ rounds each element of $8\varphi$ to the nearest integer and $U(8\varphi) \in \{0, 1, 2\}$). Subsequently, the residual $z_1$ can be described as,

$$z_1 = z_0 - \tan^{-1}(\sigma_0/8) = \begin{cases} z_0, & \sigma_0 = 0 \\ z_0 - \tan^{-1}(1/8), & \sigma_0 = 1 \\ z_0 - \tan^{-1}(1/4), & \sigma_0 = 2 \end{cases} \tag{21}$$

Apparently, $\tan^{-1}(\bullet)$ in Equation (21) has only three values, which can be implemented by using a small look-up table or registers. For the residual $z_i$, $i \geq 2$, $z_i$ can be described as,

$$z_i = z_{i-1} - \tan^{-1}\left(\sigma_{i-1} 8^{-i}\right) \approx z_{i-1} - \sigma_{i-1} 8^{-i} \tag{22}$$

where an approximation of $\tan^{-1}(x) \approx x$, $x < 1/16$ is taken. The error bound of such an approximation can be easily estimated to be $8.119 \times 10^{-5}$.

By considering m-bit fixed-point processing, where all variables are stored in an FPGA as m-bit integers, we use $Z_i$, $i = 0, 1, 2, \ldots$ to denote the fixed-point integer of $z_i$, $i = 0, 1, 2, \ldots$ (i.e., $Z_i = \lfloor z_i 2^m \rfloor$) for the sake of simplicity, where $\lfloor \bullet \rfloor$ denotes rounding down.

From Equation (21), it can be found that the residual error angle is $|z_1| < 1/8$ (i.e., $|Z_1| = |\lfloor z_1 2^m \rfloor| < 2^{m-3}$) so the bit width of $Z_1$ is $m - 2$. To denote the bit width of an n-bit fixed-point variable X, we use the form of $X[n-1 : 0]$. For example, $Z_1$ is expressed as $Z_1[m-3 : 0]$.

Now, let us expand $Z_1$ to the following form,

$$Z_1 = Z_1[m-3 : m-7]\,2^{m-7} + Z_1[m-8 : m-10]\,2^{m-10} + \ldots + Z_1[m-2-3 \times q : 0] \tag{23}$$

where $q = \left\lceil \frac{m-5}{3} \right\rceil$. Here, $\lceil \bullet \rceil$ denotes rounding up.

Accordingly, $z_1$ can be rewritten as,

$$\begin{aligned} z_1 &= Z_1[m-3 : m-7]\,2^{-7} + Z_1[m-8 : m-10]\,2^{-10} + \ldots + Z_1[m-2-3 \times q : 0]\,2^{-m} \\ &\approx \tan^{-1}\left(Z_1[m-3 : m-7]\,2^{-7}\right) + \tan^{-1}\left(Z_1[m-8 : m-10]\,2^{-10}\right) + \ldots + \tan^{-1}\left(Z_1[m-2-3 \times q : 0]\,2^{-m}\right) \end{aligned} \tag{24}$$

According to Equation (2), $\theta_i = \tan^{-1}\left(\sigma_i 8^{-(i+1)}\right)$, we found that the variables $\sigma_1, \sigma_2, \ldots, \sigma_q$ and $a_1, a_2, \ldots, a_q$ (in Equation (10), $a_i = \sigma_i 8^{-(i+1)}$) should be selected, which can satisfy both the computation of Equation (5) and automatically fulfill the equation of $\theta = \sum_{i=0}^{q} \theta_i$. Thus, it is not necessary to follow the iterative formula of the solutions in Equations (8) and (9). Instead, from the proposed expansion in Equations (23) and (24), we can directly give the variables as,

$$\begin{cases} \sigma_1 = Z_1[m-3 : m-7] \\ \sigma_2 = Z_1[m-8 : m-10] \\ \quad \vdots \\ \sigma_q = Z_1[m-2-3 \times q : 0] \end{cases} \tag{25}$$

$$a_i = \begin{cases} \sigma_0 2^{-3}, & i = 0 \\ \sigma_i 2^{-(4+3i)}, & \text{else} \end{cases} \tag{26}$$

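As a result, the residual $z_1$ is rewritten as,

$$z_1 = \tan^{-1}\frac{\sigma_1}{2^7} + \tan^{-1}\frac{\sigma_2}{2^{10}} + \ldots + \tan^{-1}\frac{\sigma_q}{2^m} \approx \frac{\sigma_1}{2^7} + \frac{\sigma_2}{2^{10}} + \ldots + \frac{\sigma_q}{2^m} = \sum_{j=1}^{q} a_j \tag{27}$$

Likewise, the variables $z_2, z_3, \cdots$ can be expressed as,

$$z_i = \sum_{j=i}^{q} a_j \tag{28}$$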
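The bit-field selection of Equations (23)–(26) can be mimicked in software with shifts and masks. The sketch below is a minimal model under assumptions: the name `split_residual` is hypothetical, only the m = 16 layout (q = 4, fields [13:9], [8:6], [5:3], [2:0]) or other widths with (m − 7) divisible by 3 are handled, and Python's arithmetic right shift stands in for reading the signed top field of a two's complement word.

```python
def split_residual(Z1: int, m: int = 16):
    """Split the signed fixed-point residual Z1 into sigma_1 .. sigma_q.

    sigma_1 is the signed 5-bit field Z1[m-3 : m-7]; the remaining
    sigma_i are unsigned 3-bit fields, as in Equation (25).
    """
    assert (m - 7) % 3 == 0, "sketch only covers widths like m = 16"
    assert -(1 << (m - 3)) <= Z1 < (1 << (m - 3)), "|Z1| must be below 2^(m-3)"
    q = (m - 7) // 3 + 1
    sigma1 = Z1 >> (m - 7)                 # arithmetic shift keeps the sign
    rest = Z1 & ((1 << (m - 7)) - 1)       # unsigned lower bits
    sigmas = [sigma1]
    for i in range(2, q + 1):
        shift = m - 4 - 3 * i              # weight 2^(m-4-3i), Equation (26)
        sigmas.append((rest >> shift) & 0b111)
    return sigmas


def weights(m: int = 16):
    """Fixed-point weights of sigma_1 .. sigma_q, i.e. 2^m * 2^-(4+3i)."""
    q = (m - 7) // 3 + 1
    return [1 << (m - 4 - 3 * i) for i in range(1, q + 1)]


if __name__ == "__main__":
    Z1 = -3563                             # sample residual (an assumption)
    s = split_residual(Z1)
    print(s, sum(si * wi for si, wi in zip(s, weights())) == Z1)
```

Since the weighted fields add up to $Z_1$ exactly, all $\sigma_i$ are available at once, without the sequential selection of Equations (8) and (9).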

Note that the original iterative formula of approximation of the input angle is now replaced by the new formula in Equations (25)–(28), which becomes directly computable. The computation process of variables $\sigma_i$ is shown in Figure 3.

Figure 3. The architectures of the variable $\sigma_i$ and the variable $Z_i$ ($Z_i = \lfloor z_i \times 2^m \rfloor$).

In summary, the computation takes the following steps:

1. Compute $\sigma_0$ via rounding $8\varphi$.
2. Compute $Z_1$ via the constant values stored in registers and one subtractor in Equation (21).
3. Compute $\sigma_i$, $i = 1, \ldots, q$ by directly fetching bits from $Z_1$ as in Equation (25).
4. Compute A and B as in Equation (20) using $\sigma_i$ at the third step, all of which are small integers. For example, $\sigma_0 \in [0, 2]$ is a 2-bit unsigned integer, and $\sigma_1 \in [-8, 8]$ is a 5-bit signed integer, while $\sigma_i \in [0, 7]$, $i = 2, 3, \cdots, q$ is an unsigned integer no greater than 3-bit.

Thus, with A and B stored as m-bit fixed-point values (i.e., scaled by $2^m$), Equation (17) can be rewritten as

$$\begin{bmatrix} A \\ B \end{bmatrix} = 2^m \begin{bmatrix} A_n \\ B_n \end{bmatrix} = \begin{bmatrix} 2^m - \dfrac{\sigma_0}{2^3} \times Z_1 - \dfrac{\sigma_1}{2^7} \times Z_2 \\ 2^{(m-3)} \times \sigma_0 + Z_1 \end{bmatrix} \tag{29}$$
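The four steps above, together with Equations (18)–(20) and the right-hand form of Equation (17), can be put together in a compact behavioral model. This is a sketch under assumptions and not the authors' implementation: the name `nr8_sincos` is hypothetical, m = 16 is fixed, the input is an already folded angle $\varphi \in [0, \pi/12]$ with $x_0' = 1$, $y_0' = 0$, and floating-point arithmetic is used everywhere except for the quantization of $z_1$.

```python
import math

M = 16                                                   # fixed-point width (assumption)
ATAN_TABLE = (0.0, math.atan(1 / 8), math.atan(2 / 8))   # constants of Equation (21)


def nr8_sincos(phi: float):
    """Approximate (cos(phi), sin(phi)) for phi in [0, pi/12] without iterating."""
    # Step 1: sigma_0 by rounding 8*phi (only 0, 1 or 2 can occur).
    sigma0 = min(round(8 * phi), 2)
    # Step 2: residual z_1 with one subtraction, then quantize to Z_1.
    Z1 = math.floor((phi - ATAN_TABLE[sigma0]) * 2**M)
    # Step 3: sigma_1 is the signed top field Z1[m-3 : m-7].
    sigma1 = Z1 >> (M - 7)
    a0 = sigma0 / 2**3
    a1 = sigma1 / 2**7
    z1 = Z1 / 2**M                                       # equals the sum of a_1 .. a_q
    z2 = z1 - a1                                         # equals the sum of a_2 .. a_q
    # Step 4: A and B from the truncated formula.
    A = 1.0 - a0 * z1 - a1 * z2
    B = a0 + z1
    # Scale factor of Equation (18), applied to the inputs as in Equation (19).
    K = (1.0 - a0**2 / 2 + 3 * a0**4 / 8) * (1.0 - a1**2 / 2)
    x, y = K * 1.0, K * 0.0
    # Complex multiplication of Equation (20).
    return A * x - B * y, A * y + B * x


if __name__ == "__main__":
    for deg in (0.0, 5.0, 10.0, 15.0):
        phi = math.radians(deg)
        c, s = nr8_sincos(phi)
        print(deg, c - math.cos(phi), s - math.sin(phi))
```

The printed differences show the combined effect of the truncation in Equation (17), the approximation $\tan^{-1}(x) \approx x$ and the simplified scale factor; a bit-accurate model would additionally quantize A, B and K.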

Since all $\sigma_i$ are small integers, their multiplication computations in Equation (29) can be easily implemented by using shifting and additions. According to the above deduction for the NR-8 CORDIC algorithm, the implementation of the digital circuit structure of the proposed NR-8 CORDIC algorithm is shown in Figure 4. The contents of the green dashed box can be implemented with a Digital Signal Processing (DSP) module. Therefore, no iterative process is required, and thus the NR-8 CORDIC algorithm only takes three clock cycles for computation:

Cycle 1: Fold the angle $\theta \in [0, \pi/2]$ to the range of $\varphi \in [0, \pi/12]$, and transform the input variables from $x_0, y_0$ to $x_0', y_0'$, according to Table 1, Section 3.4 and Figure 4. Compute $\sigma_0$, $Z_1$ via rounding $8\varphi$ and using the equation of $Z_i = \lfloor z_i \times 2^m \rfloor$ and a three-entry register in Equation (21), respectively.

Cycle 2: Directly fetch the values of $\sigma_i$ and $z_i$, ($i = 1, 2, \cdots, q$) from $Z_1$ as in Equations (25)–(28), respectively, which are substituted to Equation (29) for computing A, B, and meanwhile compensate the amplitude of the variables $x_0', y_0'$ through the equations $x_0'' = K \times x_0'$, $y_0'' = K \times y_0'$.

Figure 4. The architecture diagram of the NR-8 CORDIC algorithm on an FPGA. * The multipliers can be optimized as shifters and adders.

Cycle 3: Compute the final results $\begin{bmatrix} x_d \\ y_d \end{bmatrix}$ according to Equation (20) and Table 1 by using the multiplier module [18].

4.2. Resource Utilization and Performance Analysis

Here, two comparisons are presented for analyzing resource utilization (RU) and performance as follows.

4.2.1. RU Comparison of Conventional CORDIC Algorithms

The NR-8 CORDIC algorithm and several conventional algorithms are implemented on a Xilinx FPGA (xcku040-ffva1156) including the evaluations of the critical RU, clock latency and power consumption, and these conventional algorithms are R-2, R-4 and R-8 [11–13,15]. Note that in the experiments, the 16, 8 and 6-level pipelines are used for R-2, R-4 and R-8 CORDIC cores to achieve the same accuracy, respectively [15]. Table 2 lists RU comparisons of the R-2, R-4 and R-8 CORDIC algorithms with the NR-8 CORDIC algorithm by using a synthesis tool (Vivado 2019.2, Xilinx, San Jose, CA, USA, 2019). The results demonstrate that the proposed NR-8 CORDIC algorithm has advantages over the conventional algorithms in many aspects, such as the RUs of Configurable Logic Block (CLB) Lookup Tables (LUTs), flip-flop (FF), DSPs, clock latency, power consumption and so forth. For example, for a 16-bit precision output, compared with the corresponding parameters of the R-2, R-4 and R-8 CORDIC algorithms in Table 2, the proposed NR-8 algorithm only requires one-half to one-eighth the RU and reduces clock latency to one-half to one-sixth and power consumption to one-half. Then, we implemented the algorithm in Verilog Hardware Description Language (HDL) using a pipelined approach. The place and route tool reports the worst negative slack and the worst hold slack as 0.302 ns and 0.024 ns, respectively, when using a clock frequency of 250 MHz. Compared with the conventional ones, such as the CORDIC IP core (6.0) from Xilinx with 16-bit precision and three iterations, the power of the NR-8 CORDIC algorithm significantly decreases to below 70%, and the proposed algorithm only needs one-third of the flip-flops, though the low power consumption LUTs utilization increases by 43%.

Table 2. Utilization comparison of the NR-8 CORDIC algorithm with the conventional CORDIC algorithms with 16-bit precision.

| Algorithms | CLB LUTs (242,400)/UT (%) a | FF (484,800)/UT (%) | DSPs (1920)/UT (%) | Clock Latency b | Power (Dynamic/Static) (W) |
|---|---|---|---|---|---|
| R-2 [11] | 1095/0.45 | 785/0.16 | 2/0.1 | 17 | 0.071/0.479 |
| R-4 [12] | 975/0.4 | 329/0.07 | 5/0.26 | 9 | 0.066/0.478 |
| R-8 [15] | 880/0.36 | 234/0.05 | 6/0.31 | 7 | 0.065/0.478 |
| NR-8 | 300/0.12 | 98/0.02 | 5/0.21 | 3 | 0.031/0.478 |

a A CLB contains eight 6-input LUTs and 16 flip-flops [19]. b The working clock frequency is 250 MHz.
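The ratios quoted in the text (one-half to one-eighth of the RU, one-half to one-sixth of the clock latency, about one-half of the dynamic power) can be recomputed from Table 2. The following is a minimal sketch in Python; the dictionary layout and the function name `ratios_vs` are illustrative choices and not part of the original work, and only the LUT, FF, latency and dynamic power columns are included.

```python
# Values copied from Table 2 (16-bit precision, 250 MHz clock).
TABLE2 = {
    "R-2":  {"lut": 1095, "ff": 785, "latency": 17, "dyn_power_w": 0.071},
    "R-4":  {"lut": 975,  "ff": 329, "latency": 9,  "dyn_power_w": 0.066},
    "R-8":  {"lut": 880,  "ff": 234, "latency": 7,  "dyn_power_w": 0.065},
    "NR-8": {"lut": 300,  "ff": 98,  "latency": 3,  "dyn_power_w": 0.031},
}


def ratios_vs(baseline: str) -> dict:
    """Return the NR-8 / baseline ratio for every metric in TABLE2."""
    nr8, ref = TABLE2["NR-8"], TABLE2[baseline]
    return {metric: nr8[metric] / ref[metric] for metric in nr8}


if __name__ == "__main__":
    for name in ("R-2", "R-4", "R-8"):
        line = ", ".join(f"{k}={v:.2f}" for k, v in ratios_vs(name).items())
        print(f"NR-8 / {name}: {line}")
```

A ratio below 1 means that the NR-8 implementation uses less of the corresponding resource than the baseline.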

4.2.2. Performance Comparison of Newly Developed CORDIC Algorithms

The performance of the newly developed algorithms [12–15] is compared with that of the NR-8 CORDIC algorithm in Table 3. The conventional Radix-X CORDIC algorithms, such as the R-2 CORDIC with m-bit precision, require m iterations. Normally, the number of iterations decreases as X in Radix-X increases, while the complexity and timing (critical path) of the algorithms remain almost unchanged. The high-performance R-4 CORDIC algorithm [14] requires m/2 iterations, O(m) complexity and low latencies. The low-latency hybrid (LLH) CORDIC algorithm [16] requires 3m/8 + 1 iterations and a higher complexity of O(3m). Although the high-performance/low-latency (HPLL) CORDIC algorithm [17] has low latency, its inherent iterative structure is not conducive to pipeline optimization for improving the speed. For the NR-8 CORDIC algorithm, when the precision is less than 24 bits, the complexity is less than $O(2^q)$ with $q = \lceil (23-5)/3 \rceil = 6$, and $\sigma_i$ has only seven types ($i \in \{0, 1, 2, 3, 4, 5, 6\}$). Thus, Equations (15) and (26) are rewritten as

\[
\begin{cases}
F(m) = a_{i_1} a_{i_2} \cdots a_{i_m}, \quad (i_1, i_2, \ldots, i_m) \in \{0, 1, 2, 3, 4, 5, 6\}, \quad 0 \le m \le 6 \\
a_i = \begin{cases} \sigma_0 2^{-3}, & i = 0 \\ \sigma_i 2^{-(4+3i)}, & i = 1, 2, 3, 4, 5, 6 \end{cases}
\end{cases}
\tag{30}
\]

We can conclude from Equation (30) that when $4 \le m \le 6$, the maximum of $|F(m)|$ is attained if and only if $a_{i_1} a_{i_2} \cdots a_{i_m} = a_0 a_1 a_2 a_3$, and it is given as

\[
\max_{4 \le m \le 6} |F(m)| = |a_0 a_1 a_2 a_3| = \frac{|\sigma_0 \sigma_1 \sigma_2 \sigma_3|}{2^3 \times 2^7 \times 2^{10} \times 2^{13}} \le \frac{2 \times 8 \times 7 \times 7}{2^{33}} \le \frac{1}{2^{23}}
\tag{31}
\]
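The last inequality in Equation (31) can be checked with exact rational arithmetic. The sketch below assumes that the numerator bound reads 2 × 8 × 7 × 7, i.e., that these are the largest magnitudes of σ0 to σ3; the constant and function names are illustrative.

```python
from fractions import Fraction

# Assumed upper bounds on |sigma_0|..|sigma_3|, read from the numerator of Equation (31).
SIGMA_BOUNDS = (2, 8, 7, 7)
# Exponents of the power-of-two denominators of a_0..a_3 in Equation (30).
SHIFTS = (3, 7, 10, 13)


def max_product() -> Fraction:
    """Upper bound on |a_0 a_1 a_2 a_3| as an exact rational number."""
    bound = Fraction(1)
    for sigma_max, shift in zip(SIGMA_BOUNDS, SHIFTS):
        bound *= Fraction(sigma_max, 2 ** shift)
    return bound


if __name__ == "__main__":
    bound = max_product()
    print(bound, float(bound), bound <= Fraction(1, 2 ** 23))
```

Because 2 × 8 × 7 × 7 = 784 is smaller than 2^10, the product stays below 2^-23, which is the reason the higher-order terms can be neglected at 23-bit precision.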

Apparently, when the NR-8 CORDIC algorithm requires 23-bit precision, the following approximations hold for $4 \le m \le 6$: $F(m) \approx 0$ and $f_a(C_n^m) \approx 0$ (with $f_a(C_n^m)$ from Equation (14)). The most time-consuming path is attributed to the computation of the variables A and B. According to the above analysis and Equation (13), A and B can be simplified as

\[
\begin{bmatrix} A \\ B \end{bmatrix} = \begin{bmatrix} A_n \\ B_n \end{bmatrix} =
\begin{bmatrix}
2^m - \dfrac{\sigma_0 z_1}{2^3} - \dfrac{\sigma_1 z_2}{2^7} - \dfrac{\sigma_2 z_3}{2^{10}} - \dfrac{\sigma_3 z_4}{2^{13}} \\
2^{(m-3)} \times \sigma_0 + z_1 - \dfrac{\sigma_0 \sigma_1 z_2}{2^{10}} - \dfrac{\sigma_0 \sigma_2 z_3}{2^{13}}
\end{bmatrix}
\tag{32}
\]

Equation (32) can be realized with a two-clock latency in the pipeline. Therefore, only a four-clock latency is required for the NR-8 CORDIC algorithm with 23-bit precision, and the complexity is less than O(15). For instance, the LLH CORDIC algorithm [16] requires $\frac{3}{8} \times 23 + 1 \approx 10$ iterations and hence a 10-clock latency, whereas the NR-8 CORDIC algorithm needs only a four-clock latency, which is less than 50% of that of the LLH CORDIC algorithm.
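The latency comparison at 23-bit precision is simple arithmetic and can be reproduced as follows. This is a sketch under two assumptions that are not stated explicitly in the text: each LLH iteration costs one clock cycle, and a fractional iteration count is rounded up. The names `llh_iterations` and `NR8_LATENCY_23BIT` are illustrative.

```python
import math


def llh_iterations(m: int) -> int:
    """Iterations of the LLH CORDIC [16]: 3m/8 + 1, rounded up to a whole iteration."""
    return math.ceil(3 * m / 8 + 1)


# Clock cycles stated in the text for the NR-8 CORDIC at 23-bit precision.
NR8_LATENCY_23BIT = 4

if __name__ == "__main__":
    m = 23
    llh = llh_iterations(m)
    print(f"LLH: {llh} cycles, NR-8: {NR8_LATENCY_23BIT} cycles, "
          f"ratio = {NR8_LATENCY_23BIT / llh:.0%}")
```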

Table 3. Performance comparison of newly developed CORDIC algorithms with m-bit precision.

| Algorithms | Conventional CORDIC [12,13,15]: R-2 | Conventional CORDIC [12,13,15]: R-4 | Conventional CORDIC [12,13,15]: R-6 | High-Performance R-4 [14] | Low-Latency Hybrid (LLH) [16] | High-Performance/Low-Latency [17] | Proposed NR-8 CORDIC |
|---|---|---|---|---|---|---|---|
| Iterations | $m + 1$ | $(1/2)m$ | $(3/8)m$ | $m/2$ | $(3/8)m + 1$ | - | 0 |
| Complexity a | $O(2m)$ | $O(2m)$ | $O((15/8)m)$ | $O(m)$ | $O(3m)$ | 16 adders/28 adders ($m = 16$) | $O(2^{\lceil (m-5)/3 \rceil})$ |
| Timing (critical path) b | $T_{add/sub}$ | $T_{add/sub}$ | $T_{add/sub}$ | $T_{add/sub}$ | $2T_{add/sub}$ | $2T_{add/sub}$ | $2T_{add/sub}$ |
| Latency ($m = 16$) | 17 | 9 | 7 | 8 | 6 | $68T_{FA}/26T_{FA}$ c | 3 |

a Based on analysis of the critical rotator module. $O(\cdot)$: order in terms of full adders. -: not reported. b $T_{add/sub}$ means the adder/subtractor delay. c $T_{FA}$ means a full adder delay.

4.3. Error Analysis

4.3.1. Comparisons with Low-Latency Hybrid (LLH) CORDIC

Following the literature [16], a simulation was performed to compute the cosine and sine functions for angles θ ranging from 0 to π/2 in steps of π/500. σ0, σ1, z1 and z2 come from Equations (25)–(28), and x0 and y0 come from Table 1 and Figure 4. For m-bit precision, the critical descriptive code of the NR-8 CORDIC algorithm is given in Algorithm 1.

Algorithm 1. The descriptive codes of the NR-8 CORDIC.

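$x_0 = 2^m,\ y_0 = 0;$

$A = 2^m - \mathrm{fix}\left((\sigma_0 \times z_1)/2^3\right) - \mathrm{fix}\left((\sigma_1 \times z_2)/2^7\right);$

$B = 2^{(m-3)} \times \sigma_0 + z_1;$

$K = 2^{15} - 2^8 \times \sigma_0^2 + 3 \times \sigma_0^4 - \sigma_1^2;$

$x_0'' = \mathrm{fix}\left((x_0' \times K)/2^{15}\right);$

$y_0'' = \mathrm{fix}\left((y_0' \times K)/2^{15}\right);$

$x_d = A \times x_0'' - B \times y_0'';$

$y_d = \mathrm{sign\_sel}(A \times y_0'' + B \times x_0'');$

$\mathrm{NR8}_{\cos} = R_{\cos}/2^{2m};\ \mathrm{NR8}_{\sin} = R_{\sin}/2^{2m};$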
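The constant K in Algorithm 1 has the form of a truncated series for a CORDIC scale factor. The sketch below evaluates K and compares K/2^15 with the product of 1/sqrt(1 + t^2) for the tangents t = σ0/2^3 and t = σ1/2^7. This interpretation, the reference function `k_reference` and the digit ranges |σ0| ≤ 2 and |σ1| ≤ 8 are assumptions inferred from Equations (30) and (31); they are not stated in this section.

```python
import math


def k_fixed(sigma0: int, sigma1: int) -> int:
    """Scale factor K of Algorithm 1, expressed in units of 2**-15."""
    return 2**15 - 2**8 * sigma0**2 + 3 * sigma0**4 - sigma1**2


def k_reference(sigma0: int, sigma1: int) -> float:
    """Assumed reference: product of 1/sqrt(1 + t**2) for t = sigma0/2**3 and sigma1/2**7."""
    t0 = sigma0 / 2**3
    t1 = sigma1 / 2**7
    return 1.0 / math.sqrt((1 + t0 * t0) * (1 + t1 * t1))


def worst_case_error(sigma0_max: int = 2, sigma1_max: int = 8) -> float:
    """Largest |K/2**15 - reference| over the assumed digit ranges."""
    return max(
        abs(k_fixed(s0, s1) / 2**15 - k_reference(s0, s1))
        for s0 in range(-sigma0_max, sigma0_max + 1)
        for s1 in range(-sigma1_max, sigma1_max + 1)
    )


if __name__ == "__main__":
    print(f"K(0, 0) = {k_fixed(0, 0)}")
    print(f"worst-case deviation = {worst_case_error():.2e}")
```

Under these assumptions, the integer expression tracks the floating-point scale factor closely while needing only squarings, shifts and additions.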

Two functions, cos θ and sin θ, are produced by using standard functions from MATLAB, and the amplitude errors are described as

\[
\begin{cases}
\delta_{NR8c} = |\mathrm{NR8}_{\cos} - \cos\theta| \\
\delta_{NR8s} = |\mathrm{NR8}_{\sin} - \sin\theta|
\end{cases}
\tag{33}
\]

Figure 5a shows the values of cosine and sine produced by the NR-8 CORDIC algorithm. Figure 5b,c compare the errors for the cosine and sine functions, respectively, between the NR-8 CORDIC and the LLH CORDIC [16] with 16-bit precision. The symbols $\delta_{NR8c}$ and $\delta_{NR8s}$ stand for the absolute differences of cosine and sine, respectively, between the value computed by the NR-8 CORDIC and the theoretical value produced by MATLAB functions. Similarly, the symbols $\delta_{LLHc}$ and $\delta_{LLHs}$ denote the absolute differences of cosine and sine, respectively, between the value computed by the LLH CORDIC [16] and the theoretical value produced by MATLAB functions. It is found that the maximum errors reported in the literature [16] are $\mathrm{MAX}(\delta_{LLHc}) = 8.04 \times 10^{-4}$ and $\mathrm{MAX}(\delta_{LLHs}) = 5.50 \times 10^{-4}$ for the cosine and sine functions, respectively, which decrease significantly to $\mathrm{MAX}(\delta_{NR8c}) = 9.20 \times 10^{-5}$ and $\mathrm{MAX}(\delta_{NR8s}) = 9.01 \times 10^{-5}$ in the NR-8 CORDIC algorithm, thus indicating that the proposed NR-8 algorithm has high precision. Moreover, according to our analyses of the structures of the two algorithms, similar results should be obtained for 24-bit precision.

10-3 10-4 1 1 6 LLHs LLHc NR8s NR8c 0.5 0.5 3 NR8sin

Amplitude NR8cos Error(Deviation)

0 Error(Deviation) 0 0 0 /4 /2 0 /4 /2 0 /4 /2 0:(/500):/2 0:(/500):/2 0:(/500):/2 (a) (b) (c) Figure 5. 5.( a()a The) The curves curves of cosine of cosine and sine and computed sine computed by the NR-8 by the CORDIC NR-8 algorithm. CORDIC (balgorithm) Comparison. (b) ofComparison errors for cosineof errors between for cosine the NR-8 between and the LLHNR-8 CORDIC and the withLLH 16-bitCORDIC precision. with 16 (c-bit) Comparison precision. ( ofc) errorsComparison for sine of between errors for the sine NR-8 between and the the LLH NR CORDIC-8 and the with LLH 16-bit CORDIC precision. with 16-bit precision. 4.3.2. Comparison of Conventional CORDIC Algorithms 4.3.2. Comparison of Conventional CORDIC Algorithms Here, we analyze the computation errors cos θ and sin θ, as calculated by the R-2, R-4, R-8 and Here, we analyze the computation errorsM cos θ and sin θ , as calculated by the R-2, R-4, R-8 NR-8 CORDIC algorithms. When x0 = 2 1,M y0 = 0 (M 16) and a series of angles θ from 0 to 90◦ and NR-8 CORDIC algorithms. When xy00−2  1,  0 ≤( M  16 ) and a series of angles θ from 0 with angle steps of 1, 0.1, 0.01 and 0.001◦ are used, the values of cosθ and sinθ are computed by the algorithmsto 90° with above angle using steps FPGA of 1, and 0.1, simulated 0.01 and by0.001 ModelSim° are used SE, (10.6e).the values The errors of cos can and be calculatedsin are bycomputed differing by the the above algorithms values above from thoseusing computed FPGA and by simulated MATLAB by using ModelSim float-point SE (10.6e) computation. The errors and roundingcan be calculated to M-bit by integers. differing Figure the 6above shows values the maximum from those absolute computed errors by (MAE) MATLAB for the using cos floatθ and-point sinθ computation and rounding to M-bit integers. 
Figure 6 shows the maximum absolute errors (MAE) functions, which are denoted as δcos(x), δsin(x) (steps = 1, 0.1, 0.01, 0.001◦), and the corresponding rootfor the mean cos squared and sin errors functions (RMSE), arewhich shown are indenoted Figure 7as. Theδcos proposed(x ),δ sin (x ) algorithm(steps = 1, was 0.1, simulated 0.01, 0.001 by°), ModelSimand the corresponding SE and MATLAB root fixed-point mean squared processing, errors and(RMSE) verified are by shown using in the Figure FPGA. 7 We. The obtained proposed the Electronics 2020, 12, x FOR PEER REVIEW 14 of 18 samealgorithm results, was indicating simulated that by the ModelSim algorithm SE is and feasible MATLAB in engineering fixed-point implementation. processing, and verified by using the FPGA. We obtained the same results, indicating that the algorithm is feasible in 8 8 8 8  (1)  (1)  (1)  (1)  (1)  (1)  (1) engineeringcos implementation.sin cos sin cos sin cos cos(0.1) sin(0.1) cos(0.1) sin(0.1) cos(0.1) sin(0.1) cos(0.1) cos(0.01) sin(0.01) cos(0.01) sin(0.01) cos(0.01) sin(0.01) cos(0.01) 6  (0.001)  (0.001) 6  (0.001)  (0.001) 6  (0.001)  (0.001) 6  (0.001) cos sin cos sin cos sin cos 4 4 4 4
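The error metrics used in this comparison can be sketched as follows. The reference is the floating-point cosine rounded to an M-bit integer, as described above. The device under test here is a deliberately simple stand-in (`truncated_cos`, which truncates instead of rounding); it is a hypothetical placeholder and not a model of any CORDIC core in this paper. The function names are illustrative.

```python
import math


def quantize(value: float, bits: int) -> int:
    """Round a value in [-1, 1] to an integer scaled by 2**bits - 1."""
    return round(value * (2**bits - 1))


def truncated_cos(theta: float, bits: int) -> int:
    """Stand-in fixed-point cosine (truncation instead of rounding); NOT a CORDIC model."""
    return math.floor(math.cos(theta) * (2**bits - 1))


def mae_rmse(bits: int, step_deg: float) -> tuple[float, float]:
    """MAE and RMSE (in LSBs) of truncated_cos against the rounded float reference."""
    n_steps = round(90 / step_deg)
    errors = []
    for k in range(n_steps + 1):
        theta = math.radians(k * step_deg)
        errors.append(truncated_cos(theta, bits) - quantize(math.cos(theta), bits))
    mae = max(abs(e) for e in errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


if __name__ == "__main__":
    for step in (1, 0.1, 0.01):
        mae, rmse = mae_rmse(16, step)
        print(f"step = {step} deg: MAE = {mae} LSB, RMSE = {rmse:.3f} LSB")
```

Replacing `truncated_cos` with the integer output of a simulated core would give the per-algorithm, per-step values of the kind plotted against M in Figures 6 and 7.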

Figure 6. The relations between maximum absolute errors (MAEs) and the input bit width M of the (a) NR-8, (b) R-8, (c) R-4 and (d) R-2 CORDIC algorithms. The angles θ change from 0 to 90° in steps of 1, 0.1, 0.01 and 0.001°.

Figure 7. The relations between root mean squared errors (RMSEs) and the input bit width M of the (a) NR-8, (b) R-8, (c) R-4 and (d) R-2 CORDIC algorithms. The angles θ change from 0 to 90° in steps of 1, 0.1, 0.01 and 0.001°.

From Figures 6 and 7, it is found that the proposed NR-8 CORDIC algorithm is the most sensitive to the bit width M, and as the M value decreases, both the MAE and RMSE values decrease sharply. Even when M equals 15 or 16, the MAE and RMSE values of the proposed NR-8 CORDIC algorithm are almost as small as those of the other algorithms. Note that when M is smaller than 15, the MAE and RMSE values of the NR-8 CORDIC algorithm are much smaller than those of the other algorithms. As a consequence, the overall MAE and RMSE values of the NR-8 CORDIC algorithm are relatively small in comparison with those of the conventional algorithms. In addition, the angle step has less influence on the MAE and RMSE values of the proposed NR-8 CORDIC algorithm than on those of the conventional algorithms. Specifically, both the MAE and RMSE values for the cosine function calculated using the NR-8 CORDIC algorithm are almost the same as the corresponding values for the sine function, indicating that the outputs of the cosine and sine functions are mostly orthogonal. However, for the other conventional algorithms, the orthogonality is relatively weak. Moreover, we repeated the statistical test more than 1000 times and found that all of the MAEs and RMSEs for the cosine and sine functions fall within the corresponding ranges described above, which supports the significance of our proposed method. Therefore, we conclude that the NR-8 CORDIC algorithm developed in this paper has lower clock latency, less complexity and lower power consumption, giving it higher efficiency than the other algorithms and making it a potential candidate for real-time systems such as radar digital beamforming.

5. Application of the NR-8 CORDIC Algorithm to DBF

The diagram of the DBF mode for the MIMO millimeter wave radar is shown in Figure 8. The interface between the FPGA and the ADC is the LVDS bus, and the phase shift of the transmission (TX) is implemented by the FPGA, which transmits commands to the AWR1243 registers through the SPI bus. The desired steering angles are defined to be $\beta_1, \beta_2, \cdots, \beta_n$ for n TX antennas. The received I, Q complex data from the ADC for each reception (RX) channel go through a DSP module that includes the range and Doppler FFT. RX DBF is performed to steer the RX beam towards the same $\beta_i$. After the corresponding phase delay, the echoes are summed to achieve beamforming [20,21].

According to Equation (1), if the input vector $\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ is not a constant vector and the input angle θ is a desired value, the vector $\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ will be phase shifted by θ. Thus, we can realize the beam delay of the desired steering angles with the NR-8 CORDIC algorithm. In Figure 4, if $x_0, y_0$ are replaced by the I, Q complex data, respectively, and the angle θ is replaced by $\beta_i$, the output values $x_d, y_d$ will be obtained as the corresponding beam delay vector.

Figure 8. The diagram of digital beamforming (DBF) mode for the MIMO millimeter wave radar.

In this section, the NR-8 CORDIC algorithm is applied to the phase delay for DBF on a 77 GHz MIMO millimeter wave radar system based on TI AWR1243 chips. The experimental device of DBF for the MIMO millimeter wave radar is shown in Figure 9. The related parameters are as follows: the sampling rate fs = 4 MHz, the bandwidth bw = 1120 MHz and the number of sampling points N = 16, where the bandwidth refers to the bandwidth of the radio frequency. However, if the baseband signal is implemented by the radar chip with digital down converters, the frequency will be reduced from 1120 MHz to less than 1 MHz. Therefore, we can use fs = 4 MHz for sampling. The sample points are listed below: {I, Q} = {−168 − 64i, −60 − 224i, 224 − 88i, 148 + 84i, 52 + 196i, −164 + 60i, −36 − 132i, 276 − 4i, 128 + 164i, −84 + 284i, −188 + 56i, 12 − 168i, 128 + 40i, −64 + 264i, −236 + 104i, −132 − 204i}. A corner reflector is placed in front of the radar at a distance of 3 m and an azimuth of 20°.
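The phase shift of the I/Q samples can be illustrated with an ideal floating-point rotation in place of the NR-8 core. The sketch below rotates the 16 sample points listed above by a steering angle β and measures the phase difference at the spectral peak. The rotation direction (counterclockwise), the plain DFT and the function names are assumptions made for illustration; the fixed-point behavior of the NR-8 CORDIC is not modeled.

```python
import cmath
import math

# I/Q sample points listed in the text, written as I + jQ.
SAMPLES = [
    -168 - 64j, -60 - 224j, 224 - 88j, 148 + 84j, 52 + 196j, -164 + 60j,
    -36 - 132j, 276 - 4j, 128 + 164j, -84 + 284j, -188 + 56j, 12 - 168j,
    128 + 40j, -64 + 264j, -236 + 104j, -132 - 204j,
]


def rotate(i: float, q: float, beta_deg: float) -> tuple[float, float]:
    """Ideal rotation of (I, Q) by beta degrees (counterclockwise assumed)."""
    c, s = math.cos(math.radians(beta_deg)), math.sin(math.radians(beta_deg))
    return i * c - q * s, q * c + i * s


def dft(samples: list[complex]) -> list[complex]:
    """Plain O(N^2) discrete Fourier transform, adequate for N = 16."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def peak_phase_shift_deg(beta_deg: float) -> float:
    """Phase difference at the spectral peak between the rotated beam and the echo."""
    beam = [complex(*rotate(z.real, z.imag, beta_deg)) for z in SAMPLES]
    echo_spec, beam_spec = dft(SAMPLES), dft(beam)
    peak = max(range(len(echo_spec)), key=lambda k: abs(echo_spec[k]))
    return math.degrees(cmath.phase(beam_spec[peak] / echo_spec[peak]))


if __name__ == "__main__":
    for n in range(1, 11):
        beta = 1.5 * n
        print(f"beam_{n}: desired {beta:5.1f} deg, "
              f"measured {peak_phase_shift_deg(beta):.3f} deg")
```

With an ideal rotation the measured shift equals β up to floating-point rounding, so any residual observed on hardware can be attributed to the fixed-point rotator.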

Figure 9. The experimental device of DBF for the MIMO millimeter wave radar.

One echo (I/Q) is taken for phase shift angles θ from 1.5° to 15° with a 1.5° step, producing 10 phases. Let $x_0 = I$ and $y_0 = Q$ in Figure 4. The phase shift effects are shown in Figure 10, where the red and green lines represent the I and Q signals, respectively. The 1st and 10th symbols in the illustration represent beam delays of 1.5° (beam_1) and 15° (beam_10), respectively. The FFT is applied to each beam, and the phase errors are listed in Table 4. The variables $p$, $\tilde{\Delta}p$, $\Delta p$ and $\delta_{\Delta p}$ are described as follows:

$p = \mathrm{ANGLE}(\mathrm{FFT}(\text{beam\_}n \text{ or } \{I, Q\}))_{PEAK}$, the phase angle of the nth delay beam or of {I, Q}.

$\tilde{\Delta}p = \mathrm{ANGLE}(\mathrm{FFT}(\text{beam\_}n))_{PEAK} - \mathrm{ANGLE}(\mathrm{FFT}(\{I, Q\}))_{PEAK}$, the phase difference between the nth delay beam and the original echo {I, Q}.

$\Delta p$ = the desired steering angles.

$\delta_{\Delta p} = \tilde{\Delta}p - \Delta p$ = the error of the phase shift.

Here, $n = 1, 2, \cdots, 10$, and the functions FFT(beam_n) and FFT(beam_1) represent the FFT of the nth delay beam and the first delay beam, respectively. Then, the phase differences corresponding to the peak values of the spectral lines are obtained. In Table 4, the small error bound, $|\delta_{\Delta p}| \le 0.064°$, assures that the NR-8 CORDIC algorithm can be applied in real-time systems such as radar digital beamforming with a high precision of the phase shift.

Table 4. Phase errors of FFT transform.

| Beams | (I,Q) | Beam_1 | Beam_2 | Beam_3 | Beam_4 | Beam_5 | Beam_6 | Beam_7 | Beam_8 | Beam_9 | Beam_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $p$ (°) | −129.305 | −127.807 | −126.354 | −124.814 | −123.361 | −121.804 | −120.241 | −118.812 | −117.344 | −115.858 | −114.290 |
| $\tilde{\Delta}p$ (°) | 0 | 1.498 | 2.951 | 4.491 | 5.944 | 7.501 | 9.064 | 10.493 | 11.961 | 13.447 | 15.015 |
| $\Delta p$ (°) | 0 | 1.5 | 3.0 | 4.5 | 6.0 | 7.5 | 9.0 | 10.5 | 12.0 | 13.5 | 15.0 |
| $\delta_{\Delta p}$ (°) | 0 | −0.002 | −0.049 | −0.009 | −0.056 | 0.001 | 0.064 | −0.007 | −0.039 | −0.053 | 0.015 |
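The error row of Table 4 follows directly from the measured peak phases and the desired steering angles. The following sketch recomputes it from the first row of the table; the function name `phase_errors` is illustrative.

```python
# Peak phase angles p (degrees) from Table 4: original echo first, then Beam_1..Beam_10.
P_DEG = [-129.305, -127.807, -126.354, -124.814, -123.361, -121.804,
         -120.241, -118.812, -117.344, -115.858, -114.290]
STEP_DEG = 1.5  # increment of the desired steering angle between beams


def phase_errors(p_deg: list[float], step_deg: float) -> list[float]:
    """Return (p_n - p_echo) - n * step for n = 1..len(p_deg) - 1, rounded to 3 decimals."""
    echo = p_deg[0]
    return [round((p - echo) - n * step_deg, 3)
            for n, p in enumerate(p_deg[1:], start=1)]


if __name__ == "__main__":
    errors = phase_errors(P_DEG, STEP_DEG)
    print(errors)
    print("max |error| =", max(abs(e) for e in errors), "deg")
```

The largest magnitude occurs at Beam_6 and equals the 0.064° bound quoted in the text.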

Figure 10. The effects of the phase shift, showing the partial amplification of the delay beams.

Overall, our algorithm is mainly based on noniterative methods, whereas the majority of the conventional algorithms are based on iterative methods; thus, our algorithm offers low latency and high efficiency for high-precision output. For the normal CORDIC algorithm, an increase in precision raises the computational complexity because the key modules must change; in detail, the number of adders required to fulfil the same task, such as the task in Figure 4, will increase. For example, the number of adders used to obtain the values of An and Bn from Equation (13) grows with the computational complexity. For the R-2 CORDIC algorithm, m iterators are required to achieve the m-bit precision output, each of which needs an adder and a subtractor; thus, the computational complexity can be expressed by a term of O(m). Meanwhile, the data we obtained are based on statistical tests over several trials, and the results are found to be very reliable with a relatively low error, indicating that the NR-8 CORDIC algorithm can be applied in these fields with low latency and high efficiency.

6. Conclusions

The proposed NR-8 CORDIC algorithm has low latency, low complexity and low RU in comparison with the conventional R-X CORDIC and some newly developed CORDIC algorithms. In particular, when the output precision m is less than 24 bits, this algorithm has great advantages; e.g., the clock latency can be reduced from 10 to 4 with much lower complexity. This algorithm adopts a narrow input angle range to obtain a high calculation speed, and it uses a uniform output formula to efficiently compute the sine and cosine functions or the phase shift in a noniterative fashion. Therefore, this algorithm is of great value in time-critical applications, such as DBF, robot controllers, FFT, signal modulation and demodulation, recently developed rapid convolutional neural networks (CNNs) [22,23] and so on. We anticipate that the algorithm will provide higher precision, lower complexity and lower clock latency after further optimization in the future.
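For comparison with the noniterative scheme summarized above, the O(m) cost attributed to the conventional R-2 CORDIC can be illustrated with a minimal floating-point sketch of the textbook rotation-mode iteration. The function name cordic_r2 and the use of Python floats are assumptions for illustration only; this is not the authors' FPGA design, in which the multiplications by 2^-i would be realized as shifts.

```python
import math

def cordic_r2(theta, m=16):
    """Textbook radix-2 CORDIC (rotation mode); returns (cos, sin) of theta in radians."""
    # Elementary rotation angles atan(2^-i), one per iteration.
    angles = [math.atan(2.0 ** -i) for i in range(m)]
    # Pre-compensate the accumulated gain of the m micro-rotations.
    gain = 1.0
    for i in range(m):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, theta
    for i in range(m):  # m sequential iterations, hence O(m) latency
        d = 1.0 if z >= 0 else -1.0  # rotate toward a zero residual angle
        # One add and one subtract per iteration; the right-hand side uses the old x and y.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y

c, s = cordic_r2(0.5, m=16)
print(abs(c - math.cos(0.5)), abs(s - math.sin(0.5)))
```

Each extra bit of precision adds one more pass through the loop, which is the latency growth that a noniterative formulation is meant to avoid.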

Author Contributions: W.T. and F.X. developed the theory, performed the experiment, and drafted the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Acknowledgments: This work was supported by The Key Laboratory for Information Science of Electromagnetic Waves, School of Information Science and Technology, Fudan University.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Fang, L.; Xie, Y.; Li, B.; Chen, H. Generation scheme of chirp scaling phase functions based on floating-point CORDIC. J. Eng. 2019, 2019, 7436–7439.
2. Vyas, P.; Vachhani, L. CORDIC-Based Azimuth Calculation and Obstacle Tracing via Optimal Sensor Placement on a Mobile Robot. IEEE/ASME Trans. Mechatron. 2016, 21, 2317–2329.
3. Wong, C.C.; Liu, C.C. FPGA realisation of inverse kinematics for biped robot based on CORDIC. Electron. Lett. 2013, 49, 332–334.
4. Lee, H.; Oh, K.; Cho, M.; Jang, Y.; Kim, J. Efficient Low-Latency Implementation of CORDIC-Based Sorted QR Decomposition for Multi-Gbps MIMO Systems. IEEE Trans. Circuits Syst. II Express Brief 2018, 65, 1375–1379.
5. Jun, M.; Parhi, K.K.; Deprettere, E.F. Annihilation-Reordering Look-Ahead Pipelined CORDIC-Based RLS Adaptive Filters and Their Application to Adaptive Beamforming. IEEE Trans. Signal Process. 2000, 48, 2414–2431.
6. Nikolov, S.I.; Jensen, J.A.; Tomov, B.G. Fast parametric beamformer for synthetic aperture imaging. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 2008, 55, 1755–1767.
7. Pilato, L.; Fanucci, L.; Saponara, S. Real-Time and High-Accuracy Arctangent Computation Using CORDIC and Fast Magnitude Estimation. Electronics 2017, 6, 22.
8. Lakshmi, B.; Dhar, A.S. CORDIC architectures: A survey. VLSI Des. 2010, 2010, 794891.
9. Ylostalo, J. Function approximation using polynomials. IEEE Signal Process. Mag. 2006, 23, 99–102.
10. Ercegovac, M.D.; Lang, T.; Muller, J.M.; Tisserand, A. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers. IEEE Trans. Comput. 2000, 49, 628–637.
11. Volder, J.E. The CORDIC Trigonometric Computing Technique. IEEE Trans. Electr. Comput. 1959, 8, 330–334.
12. Meher, P.K.; Valls, J.; Juang, T.B.; Sridharan, K.; Maharatna, K. 50 Years of CORDIC: Algorithms, Architectures, and Applications. IEEE Trans. Circuits Syst. I Regul. Pap. 2009, 56, 1893–1907.
13. Maharatna, K.; Banerjee, S.; Grass, E.; Krstic, M.; Troya, A. Modified virtually scaling-free adaptive CORDIC rotator algorithm and architecture. IEEE Trans. Circuits Syst. Video Technol. 2005, 15, 1463–1474.
14. Antelo, E.; Villalba, J.; Bruguera, J.D.; Zapata, E.L. High performance rotation architectures based on the radix-4 CORDIC algorithm. IEEE Trans. Comput. 1997, 46, 855–870.
15. Rudagi, J.; Subbaraman, S. Comparative Analysis of Radix-2, Radix-4, Radix-8 CORDIC Processors. In Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, India, 23–24 November 2017; pp. 378–382.
16. Shukla, R.; Ray, K. Low latency hybrid CORDIC algorithm. IEEE Trans. Comput. 2014, 63, 3066–3078.
17. Wu, C.S.; Wu, A.Y.; Lin, C.H. A High-Performance/Low-Latency Vector Rotational CORDIC Architecture Based on Extended Elementary Angle Set and Trellis-Based Searching Schemes. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 2003, 50, 589–601.
18. DSP48 Macro v3.0, Xilinx Inc., USA. 2015. Available online: https://www.xilinx.com/support/documentation/ip_documentation/xbip_dsp48_macro/v3_0/pg148-dsp48-macro.pdf (accessed on 12 August 2020).
19. UltraScale Architecture Configurable Logic Block User Guide. Xilinx Inc., USA. 2017. Available online: https://www.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf (accessed on 12 August 2020).
20. Fischman, M.A.; Le, C. Digital beamforming developments for the joint NASA/Air Force Space Based Radar. In Proceedings of the IGARSS 2004, 2004 IEEE International Geoscience and Remote Sensing Symposium, Anchorage, AK, USA, 20–24 September 2004; pp. 687–690.
21. Lialios, D.I.; Ntetsikas, N.; Paschaloudis, K.D.; Zekios, C.L.; Georgakopoulos, S.V.; Kyriacou, G.A. Design of True Time Delay Millimeter Wave Beamformers for 5G Multibeam Phased Arrays. Electronics 2020, 9, 1331.
22. Cao, Y.X.; Xiao, W.A.; Jia, J. A Cordic-based Acceleration Method on FPGA for CNN Normalization layer. In Proceedings of the 2020 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), Shenzhen, China, 23 May 2020.
23. Parmar, Y.; Sridharan, K. A Resource-Efficient Multiplierless Systolic Array Architecture for Convolutions in Deep Networks. IEEE Trans. Circuits Syst. II Express Brief 2020, 67, 370–374.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).