electronics

Article A Low Complexity, High Throughput DoA Estimation Chip Design for Adaptive Beamforming

Kuan-Ting Chen 1, Wei-Hsuan Ma 2, Yin-Tsung Hwang 1,* and Kuan-Ying Chang 2

1 Department of Electrical Engineering, National Chung Hsing University, Taichung City 402, Taiwan; [email protected] 2 ILITEK co. ltd, HsinChu 302, Taiwan; [email protected] (W.-H.M.); [email protected] (K.-Y.C.) * Correspondence: [email protected]

 Received: 11 March 2020; Accepted: 10 April 2020; Published: 13 April 2020 

Abstract: Direction of Arrival (DoA) estimation is essential to adaptive beamforming widely used in many radar and wireless communication systems. Although many estimation algorithms have been investigated, most of them focus on the performance enhancement aspect but overlook the computing complexity or the hardware implementation issues. In this paper, a low-complexity yet effective DoA estimation algorithm and the corresponding hardware accelerator chip design are presented. The proposed algorithm features a combination of signal sub-space projection and parallel matching pursuit techniques, i.e., applying signal projection first before performing matching pursuit from a codebook. This measure helps minimize the interference from noise sub-space and makes the matching process free of extra orthogonalization computations. The computing complexity can thus be reduced significantly. In addition, estimations of all signal sources can be performed in parallel without going through a successive update process. To facilitate an efficient hardware implementation, the computing scheme of the estimation algorithm is also optimized. The most critical part of the algorithm, i.e., calculating the projection , is largely simplified and neatly accomplished by using QR decomposition. In addition, the proposed scheme supports parallel matches of all signal sources from a beamforming codebook to improve the processing throughput. The algorithm complexity analysis shows that the proposed scheme outperforms other well-known estimation algorithms significantly under various system configurations. The performance simulation results further reveal that, subject to a beamforming codebook with a 5◦ angular resolution, the Root Mean Square (RMS) error of angle estimations is only 0.76◦ when Signal to Noise Ratio (SNR) = 20 dB. The estimation accuracy outpaces other matching pursuit based approaches and is close to that of the classic Estimation of Signal Parameters Via Rotational Invariance Techniques (ESPRIT) scheme but requires only one fifth of its computing complexity. In developing the hardware accelerator design, pipelined Coordinate Digital Computer (CORDIC) processors consisting of simple adders and shifters are employed to implement the basic trigonometric operations needed in QR decomposition. A systolic array architecture is developed as the computing kernel for QR decomposition. Other computing modules are also realized using various linear systolic arrays and chained together seamlessly to maximize the computing throughput. A Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm CMOS process was chosen as the implementation technology. The gate count of the chip design is 454.4k, featuring a core size of 0.76 mm2, and can operate up to 333 MHz. This suggests that one DoA estimation, with up to three signal sources, can be performed every 2.38 µs.

Keywords: DoA estimation; adaptive beamforming; chip design; hardware accelerator; QR decomposition

Electronics 2020, 9, 641; doi:10.3390/electronics9040641 www.mdpi.com/journal/electronics Electronics 2020, 9, 641 2 of 27

1. Introduction Beamforming is a technique of transmitting signal to or receiving signal from a designated direction. Because of the discrepancy in treating signals from different directions, it is also known as filtering in the space domain. Beamforming has been widely used in radar, wireless communication and medical imagining systems and requires an antenna (or transducer) array to accomplish it. By properly controlling the phases and amplitudes of signal on each antenna, the transmission or reception signal gain along specific directions can be enhanced while those along other directions will be suppressed. The benefits of beamforming include improved link budget, extended service range and interference suppression. In adaptive beamforming, the beams are not fixed. Instead, they are steered dynamically toward the directions of signal sources. Although not all beamforming applications require explicit directional information of the signal sources, DoA estimation is essential in radar signal processing and establishing wireless communication links. Explicit DoA information also helps achieve better interference cancellation.

1.1. Related Works of DoA Estimation The basics of DoA estimation is exploring the phase differences of signals received from an antenna array to distinguish the signal direction. In the presence of multiple signal sources and background noises, the orthogonal property between the noise and signal subspaces is assumed. The first step of estimation is calculating the auto-correlation matrix using the received antenna array 1 signals. A uniformly spaced (by a pitch of 2 ) linear antenna array is assumed for 1-D (azimuth angle) estimation. The Radio Frequency (RF) signals are down converted to baseband and form a signal vector x(t) in each digital sampling. If the RF modulation does include both in-phase and quadrature phase components, an additional Hilbert transform is needed to obtain complex-valued signal vectors. Conventional estimation schemes can be classified into three categories, i.e., noise subspace, signal sub-space and matching pursuit. For the noise subspace approach, the noise subspace is determined using the auto-correlation matrix. A MUSIC (MUltiple SIgnal Classification) scheme and its variations [1–3] perform Eigenvalue Decomposition (EVD) to obtain the noise sub-space spanned by the Eigen vectors corresponding to smaller Eigen values. The Eigen vectors of noise sub-space are used to construct a spatial frequency spectrum function. A spatial frequency scanning procedure is next performed and the peaks of the spectrum indicate the directions of signal sources. In [4,5], a ROOT-MUSIC scheme, aimed at eliminating the expensive spatial frequency scanning procedure by finding the roots of a nonlinear polynomial, was proposed. In [6], the noise space estimation is improved by applying oblique projection along the signal sub-space. For the signal sub-space approach, on the contrary, the signal sub-space is extracted via auto-correlation . Instead of spectrum scanning, the signal directions are computed directly. ESPRIT (Estimation of Signal Parameters Via Rotational Invariance Techniques) [7–9] is considered a classic scheme of this category. It applies the singular value decomposition first and uses the left singular matrix to represent the signal sub-space. By finding the transform matrix between the two sub-matrices of the singular matrix, all signal directions can be determined simultaneously following an additional Eigen value decomposition. Enhanced schemes such as [10] employs an additional time-frequency data model exhibiting multi-invariance to improve the estimation accuracy and scheme [11] can tackle the coherent signal detection problem. These two categories of estimation are computationally intensive because of various matrix decompositions, which requires iterative computations to obtain the results. The third category of estimation scheme assumes a codebook (or dictionary matrix) consisting of steering vectors is available. A steering vector outlines the phase differences of received signals of the antenna array subject to a specific direction of arrival. The estimation is performed by finding the best match from the codebook using some greedy schemes such as Matching Pursuit (MP) [12–14]. The size of the codebook indicates the angular resolution of the estimation. Since the codebook is not orthogonal (steering vectors are not mutually orthogonal), the accuracy of matching can be degraded if the angular Electronics 2020, 9, 641 3 of 27 separation of two signal sources is small. In view of this, an enhancement using Orthogonal Matching Pursuit (OMP) [15], which calls for an extra orthogonalization process after each match, was proposed. Other improvements include the scheme in [16], which introduces an extra majorization–minimization measure, and the scheme in [17], which applies a conjugate gradient method with subtraction method. In [18], the performance of the OMP scheme is enhanced with iterative local searching. In [19], the OMP scheme is applied to hybrid non-uniform array. Although this category of estimation schemes is considered less computation demanding than the first two counterparts, expensive computations such as pseudo matrix inverse computations are still needed.

1.2. Motivations and Contributions Note that the frequency of performing DoA estimation in adaptive beamforming can be high, in particular, in fast time varying environments such as for vehicular communications. In addition, array signal processing is computationally intensive and the algorithms usually exhibit a complexity of O(N3), where N is the antenna array size. For radar systems employing hundreds of antenna, the computing requirement is humongous. Therefore, conventional programmable digital signal processor approaches may not be able to process the data in real time and only the hardwired approaches are feasible. Hardwired approaches also provide much better power consumption than programmable approaches. To achieve an efficient hardware accelerator design, we need to 1) reduce the computing complexity of the algorithm, and 2) map the algorithm neatly to a hardware architecture supporting massive parallel processing and pipelined operations. In this paper, we aim at tackling these two issues and develop a low complexity, high throughput, DoA estimation accelerator design in chip. The contributions of this work can be summarized as follow: In the algorithm development perspective, various design measures were devised to facilitate low complexity DoA estimations. These include applying a signal sub-space projection to enhance the accuracy of matching pursuit, calculating only a subset of auto-correlation matrix to avoid computing redundancy, and simplifying the projection matrix computation via QR factorization, which factorizes a matrix to the product of a unitary matrix Q and an upper R. These measures successfully reduce the computing complexity to only one fifth that of the ESPRIT scheme while largely maintaining the estimation accuracy. In the hardware accelerator design perspective, an effective algorithm to hardware mapping was conducted. Efficient systolic array designs were developed for each computing module and integrated seamlessly to avoid any data disruption between modules. The data input scheme was taken into consideration when determining data buffering and hardware allocation. This avoids excessive computing concurrency and minimizes the hardware complexity. An efficient QR factorization design based on the Coordinate Rotation Digital Computer (CORDIC) units was also developed. All of these design measures and innovations can be equally applied to likewise designs. Finally, a chip design in TSMC 40 nm CMOS process technology was presented to demonstrate the significant speed up can be achieved. The gate count of the chip design is 454.4 k and can operate up to 333 MHz. This suggests that one DoA estimation, with up to three signal sources, can be performed every 2.38 µs. The rest of the paper is organized as follows: in Section2, a brief recap of three DoA estimation schemes, i.e., ESPRIT, Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP), used in the performance comparison is given first. The proposed projection based parallel matching pursuit (P2MP) scheme is next presented. In Section3, the performance evaluations of the proposed scheme against existing ones are described. The evaluations include computing complexity, estimation accuracy and how DoA estimation enhances Frequency Modulated Continuous Waveform (FMCW) radar detection. The performance robustness of the proposed scheme to the assumption of the number of signal sources is also verified. In Section4, the target application of the proposed DoA estimation scheme and the hardware design specifications are first introduced. The system architecture, timing diagram, and individual hardware module designs are next elaborated followed by a chip implementation report and comparison with existing works. Electronics 2020, 9, 641 4 of 27

2. Projection and Parallel Matching Pursuit Based DoA Estimation Electronics 2020, 9, x FOR PEER REVIEW 4 of 26 2.1. Recap of Conventional DoA Estimation Schemes Before introducing the proposed DoA estimation schemes, three popular schemes are reviewed first toBefore highlight introducing computing the proposedcomplexity DoA issue estimation and the difference schemes, threewith the popular proposed schemes one. areThe reviewed first one firstis the to ESPRIT highlight scheme computing [7–9], complexitywhich is a classic issue and signal the sub-space difference withbased the approach. proposed The one. corresponding The first one ispseudo the ESPRIT code is scheme shown in [7– Figure9], which 1. An is auto-correlati a classic signalon sub-space matrix is estimated based approach. first followed The corresponding by a singular pseudomatrix decomposition. code is shown in The Figure number1. An of auto-correlation signal sources matrixcan be isdetermined estimated firstby counting followed the by number a singular of matrixsignificant decomposition. singular values. The numberThe column of signal vectors sources of the can left be singular determined matrix by countingcorresponding the number to those of significant singular values. The column vectors of the left singular matrix corresponding to those significant singular values form a sub-matrix , which spans the signal sub-space. is further significant singular values form a sub-matrix L , which spans the signal sub-space. L is further divided divided into two sub-matrices and byS taking the N-1 and the last N-1 rows,S respectively. This intomeans two each sub-matrices matrix is derivedL1 and L using2 by taking signals the fromN-1 N and-1 out the of last N Nantennas.-1 rows, respectively.There thus exists This a means transform each matrix is derived using signals from N-1 out of N antennas. There thus exists a transform ψ between between and . The transform matrix can be calculated by taking the pseudo matrix Linversion.1 and L2. Finally, The transform the Eigen matrix decompositionψ can be calculated of is performed by taking and the pseudothe angular matrix information inversion. of Finally, Eigen thevalues Eigen indicate decomposition the directions of ψ ofis performedarrival. The and advantage the angular of this information approach ofis that Eigen no values prior information indicate the directionsof the number of arrival. of signal The advantage sources is of needed. this approach In addition, is that noall priorsignal information sources can of thebe numberidentified of signalsimultaneously. sources is needed.The obvious In addition, shortcoming all signal of this sources scheme can is be its identified computing simultaneously. complexity. Both The singular obvious shortcomingvalue and Eigen of this decompositions scheme is its computing are computationally complexity. Both intensive singular and value can and only Eigen be solved decompositions through areiterative computationally procedures intensive and the andcomputing can only complexity be solved through of each iterative iteration procedures is of order and O( theN3) computing [20]. The 3 complexitypseudo inverse of each calculation, iteration which is of order requires O(N two)[20 matrix]. The pseudomultiplications inverse calculation,and one matrix which inversion, requires also two 3 matrixbears a multiplicationscomplexity of O( andN3). one These matrix computing inversion, requirements also bears a become complexity a big of hurdle O(N ).to These high throughput computing requirementsrate in implementation. become a big hurdle to high throughput rate in implementation.

Figure 1. Pseudo code of the Estimation of Signal ParametersParameters Via Rotational Invariance Techniques (ESPRIT) DirectionDirection ofof ArrivalArrival (DoA)(DoA) estimationestimation algorithm.algorithm.

In contrast toto calculatingcalculating the the angles angles of of DoA DoA directly, directly, the the matching matching pursuit pursuit based based approaches approaches [12 [12––14] try14] totry identify to identify them them from from a codebook a codebook with awith pre-defined a pre-defined angular angular resolution. resolution. Unlike theUnlike ESPRIT the scheme,ESPRIT theyscheme, are freethey of are expensive free of matrixexpensive decompositions matrix deco andmpositions inversions. and Instead, inversions. a sparse Instead, approximation a sparse algorithm,approximation which algorithm, finds the which "best matching"finds the "best projections matching" of multidimensional projections of multidimensional data onto the span data ofonto an over-completethe span of andictionary over-complete matrix, dictionary is conducted. matrix Best, is matchingconducted. means Best ama minimumtching means number a minimum of vectors innumber the dictionary of vectors matrix in the are dictionary selected. matrix The selection are selected. or matching The selection process or is matching in general process greedy is and in conductedgeneral greedy progressively. and conducted The original progressively. MP scheme The works original well MP in scheme the presence works of well only in one the signal presence source of butonly the one performance signal source degrades but the as performance the number degrad of signales sourcesas the number increase. of Therefore,signal sources the enhancedincrease. schemeTherefore, OMP the [ 15enhanced] was presented. scheme OMP Figure [15]2 shows was pres a combinedented. Figure pseudo 2 shows code ofa combined the MP /OMP pseudo schemes. code Theof the residual MP/OMP matrix schemes.R is initially The residual set to thematrix auto-correlation R is initially matrixset to theRxx auto-correlation. Assume there arematrixK signal . Assume there are K signal sources, they are identified one by one using a for loop. In each loop iteration, the residual matrix is multiplied with the codebook matrix D, and the vector (in the codebook) yields the maximum norm 2 value after multiplication is considered the “best match”. The corresponding index p is added to the set P. For OMP, an additional step 8 is performed, i.e., the matched vector is appended to a matrix A. After the vector selection, in the MP scheme, the

Electronics 2020, 9, 641 5 of 27 sources, they are identified one by one using a for loop. In each loop iteration, the residual matrix is multiplied with the codebook matrix D, and the vector (in the codebook) yields the maximum norm 2 valueElectronics after 2020 multiplication, 9, x FOR PEER REVIEW is considered the “best match”. The corresponding index p is added to5 of the 26 set P. For OMP, an additional step 8 is performed, i.e., the matched vector is appended to a matrix A. Aftercontribution the vector of the selection, selected in vector the MP is scheme,subtracted the from contribution the residual of the matrix selected (step vector 9). In is the subtracted OMP scheme, from thean orthogonalization residual matrix (step process 9). In (step the OMP 10) is scheme, performed. an orthogonalization It uses all the selected process (stepvectors 10) instead is performed. of the Itcurrent uses all one the to selected calculate vectors the residual instead matrix of the R. current In other one words, to calculate the projection the residual of residual matrix matrixR. In other R on words,the vector the projectionspace spanned of residual by all matrix the selectedR on the vectors vector is space removed. spanned This by calls all the for selected a pseudo vectors matrix is removed.inversion, Thispinv( callsA), defined for a pseudo as: matrix inversion, pinv(A), defined as:

() =( ∙)1 ∙ (1) pinv(A) = (AH A)− AH (1) · · where the superscript H indicates taking the complex-conjugate and transpose of the matrix. This is wherethe major the superscriptdistinction betweenH indicates the takingoriginal the MP complex-conjugate scheme and the orthogonal and transpose MP ofscheme. the matrix. In addition, This is thethe majorresidual distinction matrix is between recalculated the original in each MP loop scheme iteration and while the orthogonal that of the MP MP scheme. scheme In is addition,updated theincrementally. residual matrix The orthogonalization is recalculated in process each loop facilitates iteration better while matching that of theresult MP in scheme subsequent is updated vector incrementally.selection [15]. The orthogonalizationcomputing complexity, process however, facilitates increases better matching significantly. result The in subsequentmatching process vector selectionproceeds [until15]. either The computing the matrix complexity,norm of the however,residual matrix increases is smaller significantly. than a threshold The matching or the process loop proceedsiteration untilends. eitherIt should thematrix also be norm noted of that the residualthe matching matrix processes is smaller in than both a threshold schemes εareor basically the loop iterationsequential ends. and Itcannot should be alsoprocessed be noted in thatparallel. the matchingIn summary processes of this inrecap, both the schemes MP and are the basically OMP sequentialschemes, in and spite cannot of a be limited processed angular in parallel. resolution In summary in estimation of this confined recap, the by MP the and codebook the OMP size, schemes, does inexhibit spite ofa asignificant limited angular advantage resolution in computing in estimation complexity confined byover the codebookthe ESPRIT size, scheme. does exhibit The aorthogonalization significant advantage process in employed computing in complexity the OMP sc overheme the helps ESPRIT improve scheme. the estimation The orthogonalization performance processin the presence employed of inmultiple the OMP signal scheme sources, helps but, improve in the themeantime, estimation negatively performance impacts in thethe presencecomputing of multiplecomplexity. signal sources, but, in the meantime, negatively impacts the computing complexity.

FigureFigure 2.2. CombinedCombined pseudo code ofof thethe MatchingMatching PursuitPursuit/Orthogonal/Orthogonal Matching Pursuit (MP(MP/OMP)/OMP) DoADoA estimationestimation algorithms.algorithms.

2.2.2.2. Proposed DoA Estimation Algorithm InIn viewview ofof thethe concernsconcerns discusseddiscussed earlier,earlier, aa newnew scheme,scheme, calledcalled ProjectionProjection andand ParallelParallel MatchingMatching 2 PursuitPursuit (P(P2MP) is proposed. Our Our idea is combining the merits of sub-spacesub-space andand codebookcodebook searchsearch approaches, i.e., applying signal projection first before performing matching pursuit. In other words, the matching pursuit is performed in the signal sub-space. This measure helps minimize the interference from noise sub-space and makes the matching process free of extra orthogonalization computations. The computing complexity can thus be reduced significantly. In addition, estimations of all signal sources can be performed in parallel without going through a successive update process. Further optimization measure is employed to alleviate the computing efforts of signal projection.

Electronics 2020, 9, 641 6 of 27 approaches, i.e., applying signal projection first before performing matching pursuit. In other words, the matching pursuit is performed in the signal sub-space. This measure helps minimize the interference from noise sub-space and makes the matching process free of extra orthogonalization computations. The computing complexity can thus be reduced significantly. In addition, estimations of all signal sourcesElectronics can 2020 be, 9 performed, x FOR PEER inREVIEW parallel without going through a successive update process. Further6 of 26 optimization measure is employed to alleviate the computing efforts of signal projection. Instead ofInstead usingexpensive of using expensive pseudo matrixpseudo inversion, matrix inversion, the projection the projection matrix matrix calculation calculation is accomplished is accomplished by applyingby applying a partial a partial QR factorization QR factorization to the auto-correlationto the auto-correlation matrix. matrix. Figure3 Figure shows 3 the shows proposed the proposed P 2MP 2 DoAP MP estimation DoA estimation scheme. scheme.

FigureFigure 3. Pseudo3. Pseudo code code of the of projection the projection and parallel and matchingparallel matching pursuit based pursuit DoA based estimation DoA algorithm. estimation algorithm. Assume there are K signal sources, the degree of signal sub-space is thus K. Therefore, we may chooseAssume the first there (or arbitrary) are K signalK columns sources, fromthe degreeRxx to of form signal a sub-matrixsub-space isV thusS, which K. Therefore, is supposed we may to spanchoose the signalthe first sub-space. (or arbitrary) Given K columns a codebook from matrix toD NformM consisting a sub-matrix of M dimension-, which is Nsupposedsteering to × vectors,span the each signal of them sub-space. is projected Given to thea codebook signal sub-space matrix first.× To consisting accomplish of this,M dimension- a projectionN matrixsteering correspondingvectors, each of to themVS is is calculated. projected Equationto the signal (2) su showb-space the expressionfirst. To accomplish of a projection this, a matrix, projection which matrix is equalcorresponding to multiplying to V S iswith calculated. its pseudo Equation inverse (2) matrix show pinvthe expr(VS).ession of a projection matrix, which is equal to multiplying with its pseudo inverse matrix pinv().   1 t − t PS = VS VS V S VS (2) =· ∙(· ∙) · ∙ (2) AA straightforward straightforward computing computing scheme scheme would would require require 3 complex-valued 3 complex-valued matrix matrix multiplications multiplications and oneand matrix one matrix inversion. inversion. To alleviate To alleviate the computing the comp demand,uting demand, a QR decomposition a QR decomposition is applied is to appliedVS first. to first. VS = QS,N K RK K (3) =,×× · ∙×× (3) where QS,N K is a unitary matrix consisting of K orthonormal vectors and RK K is an upper triangular × × matrix.where The,× choice is a unitary of QR matrix decomposition consisting of is K that orthonormal it can be vectors efficiently and supported× is an upper in hardware triangular matrix. The choice of QR decomposition is that it can be efficiently supported in hardware

implementation. This will be elaborated later in Section 3. The calculation of projection matrix is then simplified subject to the decomposition result. =⋅HH ⋅−1 ⋅ PVVVSSSS() V S =⋅HH ⋅−1 ⋅ HH QRSSSS() RQ QR RQ (4) =⋅−−11HHHH ⋅ =⋅ QRRSSSS() R R Q Q Q

H ⋅= ⋅ H Note that QQSS IKK× due to its unitary property and QQSS yields an N by N projection matrix. The simplification result indicates the projection matrix can be obtained by a simple matrix

Electronics 2020, 9, 641 7 of 27 implementation. This will be elaborated later in Section3. The calculation of projection matrix PS is then simplified subject to the decomposition result.

1 = ( H )− H PS VS VS VS VS · · · 1 H H − H H (4) = QSR (R QS QSR) R QS · · 1 · = Q R R 1(RH)− RHQH = Q QH S · − · S S· S H H Note that Q QS = IK K due to its unitary property and QS Q yields an N by N projection S · × · S matrix. The simplification result indicates the projection matrix can be obtained by a simple . After obtaining the projection matrix, all vectors a(θi) in the codebook are projected to the signal sub-space spanned by VS and those K vectors with largest norm 2 values after projection correspond to the DoA of K signal sources, respectively. Since all vector projections and the following norm 2 value comparisons can be performed in parallel (as indicated by the forall loop), this can greatly speed up the matching process while purely sequential processes are performed in the MP and OMP schemes. The most complicated computing module of the proposed scheme is QR decomposition. There are various computing schemes [21–24] using either Householder Transforms (HT) or Modified Gram Schmidt (MGS). It is demonstrated in [25] that a Givens Rotation (GR) based approach is the most efficient one. A (real-valued) Givens rotation is a trigonometric operation that rotates a 2-D vector by an angle. If the vector is rotated to one of the two coordinate axes, the corresponding element of the vector becomes zero (or nullified). In QR decomposition, the upper triangular matrix R can be obtained by applying a sequence of Givens rotations to nullify all sub-diagonal elements. Givens rotation is unitary and the product of multiple Givens rotations is also unitary. If the matrix is complex-valued, complex-valued counterparts should be used. A complex-valued value Givens rotation can be realized by applying three real-valued Givens rotations as shown in Equations (5) and (6). " # " # " # x e jφ1 0 x || || = − (5) jφ2 y 0 e− · y " # " # " # r cos(θ) sin(θ) x = || || (6) 0 sin(θ) cos(θ) · y − The first two convert complex-valued operands x, y respectively to their norm 2 values and the third one performs the nullification. Figure4 shows the pseudo code of QR decomposition. The complex-valued VS is first made into a real-valued counterpart by stacking the real part on top of the imaginary part. The nullifications proceed from the first to the last (the Kth) column (the outer loop) and, in each column (the inner loops), proceed from the bottom to the top. This processing order prevents the occurrences of fill-in. Notation (i, j, θ) stands a real-valued Givens rotation operating th th < on the i and the j rows of VeS by an angle θ. In implementation practices, the value of θ is not calculated explicitly. The details will be addressed when discussing the hardware implementation. Electronics 2020, 9, x FOR PEER REVIEW 7 of 26 norm 2 value comparisons can be performed in parallel (as indicated by the forall loop), this can greatly speed up the matching process while purely sequential processes are performed in the MP and OMP schemes. The most complicated computing module of the proposed scheme is QR decomposition. There are various computing schemes [21–24] using either Householder Transforms (HT) or Modified Gram Schmidt (MGS). It is demonstrated in [25] that a Givens Rotation (GR) based approach is the most efficient one. A (real-valued) Givens rotation is a trigonometric operation that rotates a 2-D vector by an angle. If the vector is rotated to one of the two coordinate axes, the corresponding element of the vector becomes zero (or nullified). In QR decomposition, the upper triangular matrix R can be obtained by applying a sequence of Givens rotations to nullify all sub-diagonal elements. Givens rotation is unitary and the product of multiple Givens rotations is also unitary. If the matrix is complex-valued, complex-valued counterparts should be used. A complex-valued value Givens rotation can be realized by applying three real-valued Givens rotations as shown in Equations (5) and (6).

− jφ ||xx || 1 =⋅e 0  − jφ  (5) ||yy || 0 e 2  rxcos(θθ ) sin( ) || || =⋅        (6) 0sin()cos()|||| − θθ y 

The first two convert complex-valued operands x, y respectively to their norm 2 values and the third one performs the nullification. Figure 4 shows the pseudo code of QR decomposition. The complex-valued VS is first made into a real-valued counterpart by stacking the real part on top of the imaginary part. The nullifications proceed from the first to the last (the Kth) column (the outer loop) and, in each column (the inner loops), proceed from the bottom to the top. This processing order prevents the occurrences of fill-in. Notation ℜ(,ij ,θ ) stands a real-valued Givens rotation operating

th th  θ θ on the i and the j rows of VS by an angle . In implementation practices, the value of is not Electronics 2020, 9, 641 8 of 27 calculated explicitly. The details will be addressed when discussing the hardware implementation.

FigureFigure 4.4. Pseudo code of Givens rotation basedbased QRQR decomposition.decomposition. 2.3. Computing Complexity Analysis 2.3. Computing Complexity Analysis The performance evaluation of the proposed DoA estimation scheme against prior arts starts with The performance evaluation of the proposed DoA estimation scheme against prior arts starts the computing complexity analysis. Four DoA estimation schemes, the proposed P2MP, ESPRIT, MP with the computing complexity analysis. Four DoA estimation schemes, the proposed P2MP, ESPRIT, and OMP, are compared. The results are summarized in Table1. The measurement unit is a real-valued MP and OMP, are compared. The results are summarized in Table 1. The measurement unit is a real- multiplication. Since we focus on the hardware implementation of DoA estimation schemes, the complexity of other type of computation is estimated as the ratio of its circuit complexity to that of a multiplier. The computing complexity of Givens rotation follows the estimation given in [14]. There are various iterative computing algorithms for singular value decomposition (SVD) and Eigen decomposition. The estimations are based on the numbers reported in [20]. The computing complexity of auto-correlation matrix estimation is excluded because it is common to all estimation schemes. Analytical forms of complexity formulae are derived for each scheme. Three parameters, i.e., N: the size of antenna array, K: the number of signal sources, and M: the size of codebook, are included. Table1 shows the comparison results. The values subject to di fferent combinations of (K, N, M) are illustrated in Figure5.

Table 1. Computing complexity analysis of different DoA estimation schemes.

Scheme Complexity Formula ESPIRIT 21K3 + 3K2(N 1) + 7K + 21N3  −  MP (M + 1.5)N2 + (M + 0.5)N K OMP (N3 + (M + 1)N2 + (M + 2)N + 1)K proposed 2N3 + (M + K + 1)N2 + (M 1)N −

Apparently, the ESPRIT scheme has the highest complexity because of the expensive SVD and ED computations. The number of antenna N is the dominant factor. The coefficient associated with the N3 term is also the largest one because of the iterations in SVD and ED computation. The OMP scheme has the second highest complexity. Its complexity grows linearly with the number of signal sources, K, because of the orthogonalization process performed after every matching. The size of the codebook, M, also impacts the complexity in a linear way. The complexity of the MP scheme is lower than that of the OMP scheme by roughly 30%. The proposed P2MP scheme exhibits the lowest complexity in all but the (2,8,13) configuration. In that case, the MP scheme leads by a 3% margin. Although the number of signal sources, K, plays a role, it mainly affects the complexity of QR decomposition and its contribution Electronics 2020, 9, x FOR PEER REVIEW 8 of 26 valued multiplication. Since we focus on the hardware implementation of DoA estimation schemes , the complexity of other type of computation is estimated as the ratio of its circuit complexity to that of a multiplier. The computing complexity of Givens rotation follows the estimation given in [14]. There are various iterative computing algorithms for singular value decomposition (SVD) and Eigen decomposition. The estimations are based on the numbers reported in [20]. The computing complexity of auto-correlation matrix estimation is excluded because it is common to all estimation schemes. Analytical forms of complexity formulae are derived for each scheme. Three parameters, Electronicsi.e., N: the2020 size, 9, 641 of antenna array, K: the number of signal sources, and M: the size of codebook,9 ofare 27 included. Table 1 shows the comparison results. The values subject to different combinations of (K, N, M) are illustrated in Figure 5. to the overall complexity is less than direct proportionality. The complexity of QR decomposition is also independentTable of 1. the Computing codebook complexity size, M. Theanalysis complexities of different ofDoA MP estimation/OMP schemes, schemes. however, grow linearly with the value of K. This explains the complexity advantage of the proposed scheme against the other three schemes.Scheme Note that the complexityComplexity of the proposed formula DoA scheme is independent of the Field of ViewESPIRIT (FoV). It is primarily aff21ected +3 by the( codebook−1) + 7 size. + 21 If angular resolution of steering vectors remains unchanged, then doubling the FoV would mean a twice large codebook. This will MP (( + 1.5) + (+0.5)) affect the parallel matching pursuit part but not the signal sub-space projection part. If the codebook size is fixed andOMP a coarse angular resolution( +(+1) is adopted to+ cover(+2 a wider)+1) FoV, it will not affect scheme and the hardwareproposed design. 2 +(++1) +(−1)

40000 35000 30000 25000 20000 15000 10000 5000 0 (2,8,13) (2,8,17) (3,8,19) (4,12,25)

ESPRIT MP OMP Proposed

Figure 5. Computing complexity comparisons of didifferentfferent (K,, N, M) parameter combinations.

3. SimulationApparently, Results the ESPRIT and Performance scheme has Analysis the highest complexity because of the expensive SVD and ED computations. The number of antenna N is the dominant factor. The coefficient associated with 3.1. Simulation Settings the term is also the largest one because of the iterations in SVD and ED computation. The OMP schemeFor has the performancethe second highest analysis, complexity. we assume Its the comp antennalexity array grows size, linearlyN, is 8 andwith two the codebook number of versions signal aresources, adopted. K, because The first of onethe consistsorthogonalization of 19 steering process vectors perf (Mormed= 19) after uniformly every spanningmatching. an The FoV size (Field of the of View)codebook, of 90 M◦., The also angular impacts resolution the complexity is thus in 5◦ a. Thelinear second way. oneThe hascomplexity 10 steering of the vectors MP scheme with an is angular lower resolutionthan that equalof the to OMP 10◦. Itscheme should by be notedroughly that 30%. the half-powerThe proposed beam P2 widthMP scheme for a sized exhibitsN antenna the lowest array cancomplexity be calculated in all asbut follows the (2,8,13) [26]: configuration. In that case, the MP scheme leads by a 3% margin. Although the number of signal sources, K, plays a role, it mainly affects the complexity of QR λ 100 decomposition and its contribution toθ the= overall50 =complexity(deg is), less than direct proportionality. The (7) 3db Nd N complexity of QR decomposition is also independent of the codebook size, M. The complexities of whereMP/OMPd = schemes,λ/2 is the however, inter-antenna grow element linearly spacing. with the For valueN = 8,ofθ 3Kdb. isThis roughly explains 12.5 ◦the. For complexity version 1 codebook,advantage theof the angular proposed resolution scheme of theagainst codebook the other is much three finer schemes. than beamNote resolutionthat the complexity and the overlap of the betweenproposed two DoA adjacent scheme beams is independent is high. This of the suggests Field of a moreView challenging(FoV). It is DoAprimarily estimation affected problem. by the Forcodebook version size. 2 codebook, If angular theresolution angular of resolution steering vectors is similar remains to θ 3unchanged,db, which suggests then doubling a relaxed the DoAFoV estimationwould mean problem. a twice Thelarge simulations codebook. areThis subject will affect to four the Signal parallel to Noisematching Ratio pursuit (SNR) part settings, but not i.e., the 10, 20, 30 and 40 dB. The evaluation indexes are the percentage of incorrect estimates and the Root Mean Square (RMS) value of estimation error in incident angles based on the results of 100 simulation trials per case. The percentage is calculated on a per signal source basis. For the three codebook based schemes, an estimate error occurs if an incorrect vector is selected or a vector is chosen multiple times. For the ESPRIT scheme, an estimate error occurs if the estimated angle deviates from the correct one by half the codebook’s angular resolution. In each simulation trial, the directions of signal sources are randomly selected from the angles defined in the codebook. This is to exclude the factor of estimation error attributed to limited angular resolution of the codebook. Electronics 2020, 9, 641 10 of 27

3.2. Simulation Results and Discussions We first simulate the single signal source case. No estimation error occurs in all four schemes. This means DoA estimation problem for single signal source is less challenging and can be correctly carried out by all four schemes with a reasonable SNR setting, i.e., greater than 10 dB. For cases with multiple signal sources, the number K is first set to 3. In addition, the receiving powers of the three signal sources are different to mimic the scenario of different distances. The powers of the second and the third signal sources are 3 and 6 dB, respectively, lower than that of the first signal source. The simulation results for version 1 codebook are shown in Figure6, where Figure6a corresponds to the percentage of incorrect estimation and Figure6b shows the RMS value of angle error. Among the 4 schemes, the ESPRIT scheme exhibits the best estimation accuracy and the estimation error can be reduced to zero if the SNR value is sufficiently large, i.e., above 20 dB. The result is not surprising because the ESPRIT scheme adopts a most complicated approach to achieve the accuracy. The proposed scheme is ranked in the second place and the error percentage and the RMS value dwindles rapidly as the SNR value increases. It can also achieve error free estimation when the SNR value reaches 30 dB. The benefit of signal sub-space approach stands out when the noise factor diminishes. In case of inferior SNR setting (10 dB), the percentage is 10.3%. However, in most cases, vectors neighboring to the correct one are selected, which means the estimation error is mostly less than 5◦, the angular resolution of the codebook. This percentage also drops drastically as the SNR value improves. For the MP scheme, the error percentage ranges from 15% to 18.7% and does not decrease monotonically with the increase of SNR value. Further examination on the simulation results reveal that estimate errors occur most likely when the angular separation of two signal sources is smaller than the angular resolution of the codebook. The subtraction of the contribution from the selected vector performed in each iteration in the MP scheme may partially remove the signal from the second source. This can be considered as a “masking” effect. This effect is also true for the orthogonalization process performed in the OMP scheme. Another explanation is because the MP and OMP scheme employs a greedy and sequential search strategy. If the first match is incorrect, which happens most likely when none of signal sources has a dominant power over others, the error propagates to the subsequent matching process. The phenomenon of error propagation cannot be corrected through the orthogonalization process and is irrelevant to the SNR value. Note that the proposed scheme, after sub-space projection, employs a parallel matching strategy and is thus free of the error propagation problem. Although the error percentage of the OMP scheme reduce slightly with the SNR value, the trend of reduction is slow. In terms of the RMS value, the ESPRIT scheme still takes the lead. The RMS value of the proposed scheme is 4.39◦ when SNR = 10 dB, reduces to 0.76◦ when SNR reaches 20 dB, and finally becomes zero when SNR is 30 dB or larger. The OMP scheme performs slightly better than the MP scheme and the RMS value gradually reduces to less than 3◦. For version 2 codebook, the simulation results are shown in Figure7. Because of a smaller codebook with a coarser angular resolution, the error percentages and the RMS values of all DoA schemes or the proposed scheme improve significantly. For the ESPRIT scheme, it can achieve error free estimations. For the proposed scheme, errors only occur in the low SNR region is only 0.3% and the RMS value is roughly 0.4◦ and both subside quickly to zero when SNR value increases. The error percentages and the RMS values of both MP and OMP scheme also improve significantly using this version of codebook. This is because of the increased angular separation of two neighboring signal sources, which alleviates the masking effect in estimation. Electronics 2020, 9, x FOR PEER REVIEW 10 of 26 process performed in the OMP scheme. Another explanation is because the MP and OMP scheme employs a greedy and sequential search strategy. If the first match is incorrect, which happens most likely when none of signal sources has a dominant power over others, the error propagates to the subsequent matching process. The phenomenon of error propagation cannot be corrected through the orthogonalization process and is irrelevant to the SNR value. Note that the proposed scheme, after sub-space projection, employs a parallel matching strategy and is thus free of the error propagation problem. Although the error percentage of the OMP scheme reduce slightly with the SNR value, the trend of reduction is slow. In terms of the RMS value, the ESPRIT scheme still takes the lead. The RMS value of the proposed scheme is 4.39° when SNR = 10 dB, reduces to 0.76° when SNR reaches 20 dB, and finally becomes zero when SNR is 30 dB or larger. The OMP scheme performs slightly better than the MP scheme and the RMS value gradually reduces to less than 3°. For version 2 codebook, the simulation results are shown in Figure 7. Because of a smaller codebook with a coarser angular resolution, the error percentages and the RMS values of all DoA schemes or the proposed scheme improve significantly. For the ESPRIT scheme, it can achieve error free estimations. For the proposed scheme, errors only occur in the low SNR region is only 0.3% and the RMS value is roughly 0.4° and both subside quickly to zero when SNR value increases. The error percentages and the RMS values of both MP and OMP scheme also improve significantly using this version of codebook.Electronics This2020 is ,because9, 641 of the increased angular separation of two neighboring signal11 of 27 sources, which alleviates the masking effect in estimation.

percentage of incorrect estimates 20% 18% 16% 14% ESPRIT 12% 10% MP 8% OMP 6% proposed 4% 2% 0% 10 dB 20 dB 30 dB 40 dB

(a) Percentage of incorrect estimates

RMS of angle errors (degree) 6

5

4 ESPRIT 3 MP

2 OMP proposed 1

0 10 dB 20 dB 30 dB 40 dB

(b) Root Mean Square (RMS) of angle estimation errors

Figure 6. Estimation performance comparisons for K = 3 using codebook version 1 (M = 19 and 5◦ angular resolution). Electronics 2020, 9, x FOR PEER REVIEW 11 of 26

Figure 6. Estimation performance comparisons for K = 3 using codebook version 1 (M = 19 and 5° angularElectronics resolution).2020, 9, 641 12 of 27

percentage of incorrect estimates 1.20%

1.00%

0.80% ESPRIT

0.60% MP OMP 0.40% proposed 0.20%

0.00% 10 dB 20 dB 30 dB 40 dB

(a)

RMS of angle errors (degree)

1.6

1.4

1.2

1 ESPRIT 0.8 MP 0.6 OMP

0.4 proposed

0.2

0 10 dB 20 dB 30 dB 40 dB

(b)

Figure 7.Figure Estimation 7. Estimation performance performance comparisons comparisons for for KK= =3 3 using using codebook codebook version version 2 (M = 210 (M and = 1010◦ and 10° angular angularresolution). resolution). (a) Percentage (a) Percentage of ofincorrect incorrect estimates; estimates; (b ()b RMS) RMS of angle of angle estimation estimation errors. errors. We also conducted simulations on the cases of two signal sources (K = 2), the power of the second We signalalso conducted source is 3 dB simulations lower than theon firstthe one.cases Since of two the estimationsignal sources complexity (K = 2), decreases the power with of the theK second signal sourcevalue, is only 3 dB version lower 1 cookbook, than the a first more one. challenging Since the one, estimation with a finer angularcomplexity resolution decreases is used with in the K value, onlysimulations. version The 1 cookbook, results are shown a more in Figure challenging8. In terms one, of percentage with a finer of error angular estimates, resolution The ESPRIT is used in simulations.scheme The is zeroresults and are the proposedshown in scheme Figure is almost8. In terms error freeof percentage except for the of lowest error SNR estimates, setting. BothThe ESPRIT scheme is zero and the proposed scheme is almost error free except for the lowest SNR setting. Both MP and OMP schemes yield identical results in this simulations and error percentage drop from 8% to 5% as SNR value increases. The RMS of angle estimation errors, however, are confined to less than 1.4°. It is speculated that the effectiveness of the orthogonalization process adopted in OMP is more significant when there are more signal sources.

Electronics 2020, 9, 641 13 of 27

MP and OMP schemes yield identical results in this simulations and error percentage drop from 8% to 5% as SNR value increases. The RMS of angle estimation errors, however, are confined to less than 1.4 . It is speculated that the effectiveness of the orthogonalization process adopted in OMP is more Electronics 2020◦ , 9, x FOR PEER REVIEW 12 of 26 significant when there are more signal sources.

percentage of incorrect estimates 9.00% 8.00% 7.00% 6.00% ESPRIT 5.00% MP 4.00% OMP 3.00% proposed 2.00% 1.00% 0.00% 10 dB 20 dB 30 dB 40 dB

(a)

RMS of angle errors (degree)

1.4

1.2

1

0.8 ESPRIT MP 0.6 OMP 0.4 proposed

0.2

0 10 dB 20 dB 30 dB 40 dB

(b)

Figure 8.Figure Estimation 8. Estimation performance performance comparisons comparisons for for KK ==2 2 using using codebook codebook version version 1 (M = 119 (M and = 519◦ and 5° angular resolution).angular resolution). (a) Percentage (a) Percentage of incorrect of incorrect estimates; estimates; (b ()b RMS) RMS of angleof angle estimation estimation errors. errors.

In conclusion, the proposed P2MP scheme outperform the two MP based schemes and can achieve an estimation performance close to that of the ESPRIT scheme but requires only a small fraction of the computing efforts.

3.3. Evaluation of K Value Selection

Recall that the number of signal source, K, is a priori information to the proposed P 2 MP scheme in constructing the matrix. In practice, this information could be either not available or incorrect. We will next evaluate how resilient the proposed scheme is to the accuracy of K value. We assume the correct number of signal sources is 3 and use different K values ranging from 1 to 8 to perform the DoA estimation. The antenna array size and the codebook (with 5° angular resolution) definition are identical to those used in the previous case of simulations. The power attenuations of the second and the third signal sources with respect to the first one are -7 and -10 dB, respectively. The larger the power attenuation is, the more difficult the DoA estimation becomes because the

Electronics 2020, 9, 641 14 of 27

In conclusion, the proposed P2MP scheme outperform the two MP based schemes and can achieve an estimation performance close to that of the ESPRIT scheme but requires only a small fraction of the computing efforts.

3.3. Evaluation of K Value Selection Recall that the number of signal source, K, is a priori information to the proposed P2MP scheme in constructing the VS matrix. In practice, this information could be either not available or incorrect. We will next evaluate how resilient the proposed scheme is to the accuracy of K value. We assume the correct number of signal sources is 3 and use different K values ranging from 1 to 8 to perform the DoA estimation. The antenna array size and the codebook (with 5◦ angular resolution) definition are identical to those used in the previous case of simulations. The power attenuations of the second and the third signal sources with respect to the first one are -7 and -10 dB, respectively. The larger the power attenuation is, the more difficult the DoA estimation becomes because the weakest signal source tends to be masked by the strongest one. The simulations are conducted by using two SNR settings, i.e., 20 and 10 dB. Table2 summarize the simulation results based on 100 trials each. When the selection of K value is less than the correct number of signal sources, i.e., K = 1 or 2, the RMS values of estimation are considerably large in both SNR cases. In particular, the percentages of incorrect matching are well above 30%. This is because the constructed VS matrix is short in rank to span the signal sub-space properly. The larger the rank deficiency is, the worse the estimation performance becomes. When the K value reaches 3, it means the rank of VS matrix is equal to the number of signal sources and the RMS value of estimation errors reduces dramatically to less than 1 ◦ and 5◦ in SNR = 20 dB and SNR = 10 dB cases, respectively. The percentage of incorrect matching also drops to 1.7% and 12.6%, respectively. In the high SNR case, the estimation performance improves slightly as the K value increases to 4. When K equals 5, the estimation error diminishes to zero. When the K value reaches 8, i.e., the number of antennas N, VS equals Rxx. If Rxx is full ranked, the noise sub-space is completely nullified, and the projection matrix PS becomes an identity matrix. In other words, no projection is performed and the accuracy of estimation degrades drastically. In the low SNR case, further increase in k value beyond 3 yields constant improvement. When k equals 7, the error percentage is only 0.7%. The trend of performance improvement with the increase of the K value can be explained as follows: In the proposed scheme, the first K columns of the Rxx matrix are used to construct the VS matrix. Despite of its simplicity, these K vectors may not always span the signal sub-space faithfully, in particular in the low SNR cases. As more vectors added to the basis, the span would approximate the real signal sub-space provided the noise power is not significant. Actually, if the proper vectors are selected, K of them would be sufficient. This, however, requires extra computing efforts to select. In conclusion, if the K value falls between the correct number of signal sources and N 1, the estimation performance − is guaranteed. This suggests the performance of the proposed scheme is rather robust to the K value selection. However, for low SNR applications, a larger K value is suggested to yield better results.

Table 2. Evaluation of K value on the DoA estimation accuracy of the proposed scheme.

K Value 1 2 3 4 5 6 7 8

SNR 20dB 29.1◦ 17.3◦ 0.67◦ 0.29◦ 0◦ 0◦ 0◦ 30◦ RMS of estimation error SNR 10dB 33.65◦ 23.7◦ 4.83◦ 3.53◦ 2.34◦ 0.91◦ 0.41◦ 31.83◦ SNR 20dB 80.3% 58.3% 1.7% 1% 0% 0% 0% 90% Percentage of incorrect matching SNR 10dB 60% 33.3% 12.6% 7.3% 3.7% 1.3% 0.7% 80.3%

3.4. Detection Results of FMCW Radar In this subsection, we will evaluate how the DoA estimation and beamforming affect the FMCW radar detection results. We will use Minimum Variance Distortionless Response (MVDR) beamforming Electronics 2020, 9, 641 15 of 27 technique [27] to combine the baseband signals received from 8 antennas. The weight vector can be calculated as 1 Rxx− h0 WMVDR = · (8) hHR 1 h 0 xx− · 0 where h0 is the steering vector found in DoA estimation. Follow the design sepcs defined in Table3, Figure9 shows the fast Fourier transform (FFT) spectrum of a single antenna signal when there are 3 signalElectronics sources 2020, 9 located, x FOR PEER at 1.2 REVIEW m, 1.8 m and 2.3 m away from the FMCW radar, respectively. The14 power of 26 attenuation is assumed to be proportional to the square of the distance. The SNR setting is 10 dB. dB. There are three more prominent peaks indicating the distances of the three signal sources. Note There are three more prominent peaks indicating the distances of the three signal sources. Note that that the third peak is not as significant as the first two and it is possible that this detection may fail. the third peak is not as significant as the first two and it is possible that this detection may fail. On the On the contrary, some noise spikes can be mistreated as the signal and cause false alarm. In Figure contrary, some noise spikes can be mistreated as the signal and cause false alarm. In Figure 10, three 10, three FMCW spectrums are shown and each one is the result of applying the MVDR beamforming FMCW spectrums are shown and each one is the result of applying the MVDR beamforming using the using the steering vectors obtained from DoA estimation, respectively. Apparently, beamforming steering vectors obtained from DoA estimation, respectively. Apparently, beamforming successfully successfully suppresses the interference from other directions and simplify the peak detection efforts, suppresses the interference from other directions and simplify the peak detection efforts, provided the provided the DoA estimation results are correct. DoA estimation results are correct.

TableTable 3.3. Specificationsspecifications ofof thethe radarradar systemsystem andand thethe designdesign requirementsrequirements ofof thethe DoADoA estimation.estimation.

ChirpChirp frequency Frequency Rangerange 7777 GHz~78 GHz~78 GHzGHz MaxMax Beat Beat Frequency Frequency 3.2 MHz3.2 MHz BandwidthBandwidth 11 GHz GHz AntennaAntenna array array size size 8 8

SweepSweep time time 2020μµs FoVFoV 90◦ 90° ChirpChirp rate rate ±11 G G Hz/20 Hz/20 µμss SweepSweep angle angle Range Range 45 −~4545°~45° ± − ◦ ◦ SamplingSampling rate 12.812.8 M SamplesSamples/s/s AngularAngular resolution resolution 5◦ 5° SnapshotsSnapshots of Number 256256 points CodebookCodebook size size 19 19 MaximumMaximum range 9.69.6 m FFTFFT size size 256 256 DoA Estimation rate One estimate per sweep Maximum no. of targets 3 DoA Estimation rate One estimate per sweep Maximum no. of targets 3

FigureFigure 9.9. FFTFFT spectrumspectrum ofof FrequencyFrequency Modulated Modulated Continuous Continuous Waveform Waveform (FMCW) (FMCW) radar radar signal signal from from a singlea single antenna. antenna.

Electronics 2020, 9, 641 16 of 27

Electronics 2020, 9, x FOR PEER REVIEW 15 of 26

(a) Beamforming toward the direction of first signal source

(b) Beamforming toward the direction of the second signal source

(c) Beamforming toward the direction of the third signal source

Figure 10. FFT spectrums of FMCW radar signal after applying Minimum Variance Distortionless Response (MVDR) beamforming. Electronics 2020, 9, 641 17 of 27

Although there are many Constant False Alarm Rate (CFAR) based FMCW signal detection schemes, for simplicity, we use the FindPeak function in MatLab with adjusted parameter setting to locate the beat frequencies. Table3 summarizes the FMCW radar detection results with and without beamforming. The results cover two SNR values, i.e., 10 dB and 20 dB. For the non-beamforming case, only the distance matters. For the beamforming case (P2MP estimation + MVDR beamforming), a correct detection means both the arriving angle and the distance are correct. For SNR = 10 dB, the detection rate with beamforming can be boosted from 77.53% to 90.13%. The false alarm rate can be reduced to less than 1% as well. For SNR = 20 dB, the detection rate with beamforming can be further improved to 99.13%. The false alarm rate without beamforming is significantly lower in high SNR operations. Applying beamforming can further minimize this error rate. In Table4, we also list the average "prominence" value of the detected peaks in the FFT spectrum. It is clear that beamforming can greatly enhance the prominence values of these peaks to facilitate better detection results.

Table 4. FMCW detection results with and without beamforming.

Signal Source Number= 3, Codebook ( 45◦:5◦:45◦) Average Results Based on 500 Simulation− Trials SNR 10 dB 20 dB Detection rate w/o beamforming 77.53% 77.93% Detection rate for P2MP+MVDR 90.13% 99.13% beamforming False alarm rate without 28.93% 2.73% beamforming False alarm rate for P2MP+MVDR 0.87% 0.53% beamforming Prominence Simulation signal 1 signal 2 signal 3 signal 1 signal 2 signal 3 Average prominence of detected 53.71 23.44 9.44 53.32 22.65 9.37 peaks w/o beamforming Average prominence of detected peaks for P2MP+MVDR 218.71 85.11 28.97 204.49 86.21 46.62 beamforming

4. Hardware Accelerator Design and Chip Implementation Results Based on the presented DoA estimation scheme, an efficient hardware accelerator design is developed. The target application is a mm-wave Frequency Modulated Continuous Waveform (FMCW) radar system, which transmits a continuous wave signal with frequency varies linearly up and down within a specified period (sweep time). The frequency difference between the transmitted and received signals (beat frequency) is a constant, which is proportional to the time lap between the transmission and the reception of signal. The time lap equals the round trip time between the radar and the target and can be used to calculate the distance by multiplying with the light speed.

4.1. Design Specs and System Architecture The specifications of the FMCW radar system equipped with an antenna array and the design requirements of the DoA estimation are listed in Table3. The parameters crucial to the DoA estimation include antenna array size (N = 8), the codebook size (M = 19), the sweep time (20 µs), and the baseband signal sampling rate (12.8 MS/s). The goal is achieving one estimation per sweep. There are 256 samples in each sweep. Since the maximum detection range of this radar system is set to 9.6 m, the largest time lap between transmission and receiving is 64 ns. That is, the beat frequency would become stable 64 ns after the sweep starts. For a 12.8 MHz beat signal sampling rate, this interval is less than the sampling period, which is 78 ns. In other words, invalid data accounts for only 1 out of 256 sample vectors received in a chirp. Therefore, we did not drop the first sample vector deliberately and a Electronics 2020, 9, 641 18 of 27

256-point FFT is performed to obtain the spectrum in range detection. One fourth of these samples, i.e., 64, are used for calculating the auto-correlation matrix. This also means the collection of 64 samples needed for auto-correlation matrix calculation would consume one fourth of the time budget available. What remains to be performed in a sweep includes DoA estimation, possible beamforming if required, and distance calculation. Therefore, the time budget for the DoA estimation is set to a quarter of the sweep time, as well. The maximum number of signal sources (targets), the K value, is set to 3. The target system clock rate is set to 8 times the sampling rate, which is 102.4 MHz. The number of the clock cycles available for the DoA estimation is thus 102.4 MHz 5 µs = 512. × Figure 11 shows the system block diagram of the hardware design of the P2MP DoA estimation Electronics 2020, 9, x FOR PEER REVIEW 18 of 26 Electronicsscheme. 2020 It contains, 9, x FOR PEER four REVIEW majorbuilding blocks, i.e., auto-correlation matrix calculation (AM),18 of QR 26 decomposition, projection matrix calculation (PM) and codebook search via matching pursuit (MP). Fixed point simulations are conducted first to evaluate the performance loss due to finite precision Fixed point simulations are conducte conductedd first first to evaluate the performanceperformance loss due to finitefinite precision calculations and then determine the required word length for each module. The results are shown in calculations and then determine the required word length for each module. The results are shown in Figure 12, where the three-tuple number associated with each module represents the sign bit, and the Figure 1212,, where the three-tuple number associated withwith each module represents the sign bit, and the bit widths of the integral and fractional parts, respectively. The implementation loss in terms of the bit widths of the integral and fractional parts, respectively.respectively. The implementation loss in terms of the RMS value of estimation error is small than 1°. RMS value of estimation error is small than 11°.◦.

Input Input Input delay Q Data Input delay R Q Q S Q Data line R [8×8] Q S Q 8×W line [8×3] [8×8] [8×3] 8×W [8×3] [8×3]

Outer Outer H product & VS P=QS × QSH product & V P=QS × QS R matrix [8×3] S R [8×8] R XX matrix [8×3] R [8×8] estimationXX estimation PM QRQR PM AMAM

K Vector K Max K vector Vector D Indexes Max K vector Norm 2 D Indexes selection Norm 2 Ψ [19×8] output selection calculation Ψ [19×8] output calculation [19×8] [19×8] MPMP

Figure 11. System block diagram of the DoA estimation hardware accelerator design. Figure 11. SystemSystem block block diagram of the DoA esti estimationmation hardware accelerator design. D Data D Data <1,0,11> <1,0,11>

OUTPUT INPUT Data OUTPUT INPUT Data AM QR PM MP Norm Max DoA <1,7,6> AM QR PM MP Norm Max DoA <1,7,6> <0,5,0> <1,7,13> <1,0,13> <1,0,11> <1,0,11> <1,1,12> <0,5,0> <1,7,13> <1,0,13> <1,0,11> <1,0,11> <1,1,12> AM QR PM MP AM QR PM MP

Figure 12. Word length specs for the computing modules. Figure 12. WordWord length specs for the computing modules.

4.2.4.2. AlgorithmAlgorithm MappingMapping andand ModuleModule DesignsDesigns 4.2. Algorithm Mapping and Module Designs InIn thethe AMAM module,module, becausebecause thethe inputinput datadata raterate isis boundedbounded byby thetheanalog-to-digital analog-to-digital conversionconversion In the AM module, because the input data rate is bounded by the analog-to-digital conversion (ADC)(ADC) samplingsampling raterate whilewhile thethe systemsystem clockclock raterate runsruns 88 timestimes higher,higher, 88 datadata itemsitems acquiredacquired inin eacheach (ADC) sampling rate while the system clock rate runs 8 times higher, 8 data items acquired in each samplingsampling areare fedfed to the AM module in in 8 8 consecutiv consecutivee clock clock cycles. cycles. An An input input delay delay line line is isemployed employed to sampling are fed to the AM module in 8 consecutive clock cycles. An input delay line is employed to buffer the data. Figure 13 shows the systolic array design for the AM module by computing the outer- buffer the data. Figure 13 shows the systolic array design for the AM module by computing the outer- product of input vector and then then taking average. Only 3 (or K) sets of complex-valued Multiply- product of input vector and then then taking average. Only 3 (or K) sets of complex-valued Multiply- And-Accumulate (MAC) units are needed. Each MAC unit is responsible of computing one column And-Accumulate (MAC) units are needed. Each MAC unit is responsible of computing one column of . The input delay line serves as a data pipeline passing input data from one MAC unit to of . The input delay line serves as a data pipeline passing input data from one MAC unit to another. Since only K columns of the are required, the length of the delay line is K. Another set another. Since only K columns of the are required, the length of the delay line is K. Another set of registers taking the complex conjugate of input data are employed to hold another vector needed of registers taking the complex conjugate of input data are employed to hold another vector needed in outer product computations. One vector outer product can be performed every eight clock cycles in outer product computations. One vector outer product can be performed every eight clock cycles and computations of 64 consecutive input vectors can be seamlessly chained together. It takes 64 × 8 and computations of 64 consecutive input vectors can be seamlessly chained together. It takes 64 × 8 + K-1 clock cycles to complete the computations. The extra K-1 clock cycles arise because the + K-1 clock cycles to complete the computations. The extra K-1 clock cycles arise because the operations of K MAC units are skewed by each skewed by one clock cycle. This meets the time budget operations of K MAC units are skewed by each skewed by one clock cycle. This meets the time budget as specified before. as specified before.

Electronics 2020, 9, 641 19 of 27 to buffer the data. Figure 13 shows the systolic array design for the AM module by computing the outer-product of input vector and then then taking average. Only 3 (or K) sets of complex-valued Multiply-And-Accumulate (MAC) units are needed. Each MAC unit is responsible of computing one column of Rxx. The input delay line serves as a data pipeline passing input data from one MAC unit to another. Since only K columns of the Rxx are required, the length of the delay line is K. Another set of registers taking the complex conjugate of input data are employed to hold another vector needed in outer product computations. One vector outer product can be performed every eight clock cycles and computations of 64 consecutive input vectors can be seamlessly chained together. It takes 64 8 × + K-1 clock cycles to complete the Rxx computations. The extra K-1 clock cycles arise because the operations of K MAC units are skewed by each skewed by one clock cycle. This meets the time budget

asElectronics specified 2020 before., 9, x FOR PEER REVIEW 19 of 26 Complex Complex ... D D r r Multiplier Adder 8K 1K

D ...

Complex Complex ...

D D r82 r12 ...... ComplexMultiplier ComplexAdder X X1 D D r r 8 Multiplier Adder 81 11 D Complex conjugate D

FigureFigure 13. 13.Systolic Systolic array array design design of of the the AM AM module. module.

TheThe QR QR decomposition decomposition is is the the most most complicated complicated computing computing module. module. As As described described in in Section section 2.2 2.2,, thethe decomposition decomposition is achievedis achieved by applying by applying a sequence a sequence of Givens rotations,of Givens which rotations, involve which trigonometric involve operationstrigonometric such operations as sine() and such cosine(). as sine() Although and cosine(). there Although are various there computing are various schemes, computing the Coordinate schemes, Rotationthe Coordinate Digital ComputerRotation Digital (CORDIC) Computer scheme (CORDIC) is adopted scheme for its is effi adoptedciency in for hardware its efficiency implementation. in hardware Theimplementation. CORDIC scheme The CORDIC computes scheme various computes trigonometric various functions trigonometric by performing functions by a performing sequence of a micro-rotationssequence of micro-rotations iteratively using iteratively shift and using add shift operations and add only. operations The number only. The of iterations number requiredof iterations is, inrequired general, is, equal in general, to the bitequal width to the of bit the width desired of datathe desired precision. data A precision. normalization A normalization process is needed process inis theneeded final stagein the to final restore stage the to correct restore vector the corr norm.ect Thisvector corresponds norm. This to corresponds a constant multiplication. to a constant Inmultiplication. CORDIC, there In CORDIC, are two computing there are two modes. computing In the mo vectordes. In mode, the vector the rotation mode, the nullifies rotation one nullifies entry andone turnsentry theand other turns to the the other norm to 2the value norm of 2 a value 2-D vector. of a 2-D In vector. the rotation In the mode,rotation the mode, vector the is vector rotated is subjectrotated to subject a specified to a angle.specified In QRangle. decomposition, In QR decompos the vectorition, modethe vector is used mode when is nullifyingused when the nullifying leading elementthe leading of a row element while of the a rotation row while mode the is appliedrotation when mode performing is applied subsequent when performing rotations insubsequent the same row.rotations Note in that the the same values row. rotation Note that angle the θvaluesneed notrotation be computed angle  need directly. not Inbe thecomputed CORDIC directly. scheme, In it can be represented by a sequence of 1 indicating the direction of each micro-rotation. Therefore, the CORDIC scheme, it can be represented± by a sequence of ±1 indicating the direction of each micro- therotation. leading Therefore, vector mode the CORDICleading vector unit can mode simply CORDIC pass the unit micro-rotation can simply sequencepass the tomicro-rotation the trailing rotationsequence mode to the CORDIC trailing unitsrotation in themode same CORDIC row. Figure units 14in theshows same the row. CORDIC Figure unit 14 shows designs the for CORDIC vector andunit rotation designs modes, for vector respectively. and rotation To reducemodes, therespecti numbervely. of To clock reduce cycles the needed number in of the clock iterative cycles process, needed twoin the micro-rotations iterative process, are performed two micro-rotations per clock cycle. are performed Fixed point per simulation clock cycle. results Fixed indicate point 8 simulation iterations areresults suffi cientindicate to achieve8 iterations the requiredare sufficient data to precision. achieve the In total,required four data plus precision. one clock In cycles total, are four needed plus one to perform a 2 2 real-valued Givens rotation. The extra one is for the normalization. clock cycles× are needed to perform a 2 × 2 real-valued Givens rotation. The extra one is for the normalization. K K

>>> ± >>> ± × X0" >>> ± >>> ± × Xn" X0 Xn

Y0 Yn >>> ± >>> ± 0 >>> ± >>> ± × Yn"

Iteration Iteration Iteration Iteration K 0,2,4,6 1,3,5,7 0,2,4,6 1,3,5,7 (a) (b)

Figure 14. CORDIC unit designs. (a) vector mode; (b) rotation mode.

After elaborating the basic function unit CORDIC, the systolic architecture of the QR decomposition is presented. As illustrated in Figure 15, the QR module consists of a triangular

systolic array followed by a rectangular one. The triangularization of the matrix is performed in the triangular systolic array, where each circle indicates a Processing Element (PE) for complex- valued Givens rotation. As stated before, one complex-valued Givens rotation is implemented by three real-valued Givens rotation. Therefore, each PE design consists of three CORDIC units. The

sized 8 × 3 matrix is admitted from the bottom of the triangular array and the columns of the

Electronics 2020, 9, x FOR PEER REVIEW 19 of 26 Complex Complex ... D D r r Multiplier Adder 8K 1K

D ...

Complex Complex ...

D D r82 r12 ...... ComplexMultiplier ComplexAdder X X1 D D r r 8 Multiplier Adder 81 11 D Complex conjugate D

Figure 13. Systolic array design of the AM module.

The QR decomposition is the most complicated computing module. As described in section 2.2, the decomposition is achieved by applying a sequence of Givens rotations, which involve trigonometric operations such as sine() and cosine(). Although there are various computing schemes, the Coordinate Rotation Digital Computer (CORDIC) scheme is adopted for its efficiency in hardware implementation. The CORDIC scheme computes various trigonometric functions by performing a sequence of micro-rotations iteratively using shift and add operations only. The number of iterations required is, in general, equal to the bit width of the desired data precision. A normalization process is needed in the final stage to restore the correct vector norm. This corresponds to a constant multiplication. In CORDIC, there are two computing modes. In the vector mode, the rotation nullifies one entry and turns the other to the norm 2 value of a 2-D vector. In the rotation mode, the vector is rotated subject to a specified angle. In QR decomposition, the vector mode is used when nullifying the leading element of a row while the rotation mode is applied when performing subsequent rotations in the same row. Note that the values rotation angle  need not be computed directly. In the CORDIC scheme, it can be represented by a sequence of ±1 indicating the direction of each micro- rotation. Therefore, the leading vector mode CORDIC unit can simply pass the micro-rotation sequence to the trailing rotation mode CORDIC units in the same row. Figure 14 shows the CORDIC unit designs for vector and rotation modes, respectively. To reduce the number of clock cycles needed in the iterative process, two micro-rotations are performed per clock cycle. Fixed point simulation results indicate 8 iterations are sufficient to achieve the required data precision. In total, four plus one Electronics 2020, 9, 641 20 of 27 clock cycles are needed to perform a 2 × 2 real-valued Givens rotation. The extra one is for the normalization. K K

>>> ± >>> ± × X0" >>> ± >>> ± × Xn" X0 Xn

Y0 Yn >>> ± >>> ± 0 >>> ± >>> ± × Yn"

Iteration Iteration Iteration Iteration K 0,2,4,6 1,3,5,7 0,2,4,6 1,3,5,7 (a) (b)

FigureFigure 14. 14. CORDICCORDIC unit unit designs. designs. (a ()a vector) vector mode; mode; (b) rotation mode. mode.

After elaboratingelaborating the the basic basic function function unit CORDIC, unit CO theRDIC, systolic the architecture systolic architecture of the QR decomposition of the QR isdecomposition presented. As is illustrated presented. in FigureAs illustrated 15, the QR in moduleFigure consists15, the ofQR a triangularmodule consists systolic of array a triangular followed

bysystolic a rectangular array followed one. The by triangularizationa rectangular one. of The the triangularizationVS matrix is performed of the in matrix the triangular is performed systolic in Electronicsarray,the 2020 triangular where, 9, x FOR each systolic PEER circle REVIEW array, indicates where a Processingeach circle Elementindicates (PE) a Processing for complex-valued Element (PE) Givens for complex- rotation.20 of 26 Asvalued stated Givens before, rotation. one complex-valued As stated before, Givens one rotationcomplex-valued is implemented Givens byrotation three real-valuedis implemented Givens by rotation. Therefore, each PE design consists of three CORDIC units. The sized 8 3 V matrix is resultantthree R real-valued matrix are Givens obtained rotation. at Therefore,the PEs alongeach PE hypotenuse. design consists The of three matrix CORDIC× is S units.calculated The in a admitted from the bottom of the triangular array and the columns of the resultant R matrix are obtained rectangularsized 8 array × 3 containing matrix is admitted 8 × 3 PEs. from A sizedthe bottom 8 × 8 identityof the triangular matrix isarray fed toand the the array columns from of the the bottom at the PEs along hypotenuse. The Q matrix is calculated in a rectangular array containing 8 3 PEs. and the matrix is available at theS top of the array. The leading PE in each row operates× in the A sized 8 8 identity matrix is fed to the array from the bottom and the Q matrix is available at the × S vectortop mode of the and array. passes The leading the micro-rotations PE in each row operates sequence in the to vector the modedownstream and passes PEs, the micro-rotationsall operating in the rotationsequence mode. to Figure the downstream 16 shows PEs,a detailed all operating systolic in thetriangular rotation mode.array design Figure 16 for shows R matrix a detailed calculation after substitutingsystolic triangular each arraycomplex-valued design for R matrixPE with calculation three CORDIC after substituting units. Label each “GV” complex-valued means vector PE mode operationwith threeand label CORDIC “GR” units. means Label rotation “GV” means mode vector operat modeion. operation For the andlabel label at the “GR” right means end rotation of each row, “R-I” modemeans operation. the CORDIC For the units label atin the this right row end is ofresponsible each row, “R-I” for meansconverting the CORDIC the operand units in thisto real-valued row is responsible for converting the operand to real-valued by eliminating its imaginary part. Label “R-R” by eliminating its imaginary part. Label “R-R” means the Givens rotation for the two converted (to means the Givens rotation for the two converted (to real-valued) operands. The detailed design of the real-valued)rectangular operands. array can The be likewisedetailed derived. design of the rectangular array can be likewise derived.

Qs matrix output q11 q21 q31 q41 q81 r13 q12 q22 q32 q42 q82 r23 q13 q23 q33 q43 q83 r33 R matrix output r12 r22

r11

Triangular array Rectangular array for R matrix for Q matrix Vs matrix input Identity matrix input FigureFigure 15. 15. SystolicSystolic array array design design ofof the the QR QR decomposition decomposition module. module.

GV GR R-R M3 GV R-I M3 GV GR GR GR R-R M2 GV GR R-I M2 GV GR GR GR GR GR R-R M1 GV GR GR R-I M1

Figure 16. Systolic tri-array design for R matrix using real-valued Coordinate Rotation Digital Computer (CORDIC) units.

The PM module performs a matrix multiplication to obtain the projection matrix. As shown in

Figure 17, the design is a linear systolic array that consists of eight complex-valued MACs. The matrix is inputted to the module column by column in three consecutive clock cycles and a column of the P matrix is obtained every three clock cycles. For the Matching pursuit module, the projection of the codebook, i.e., = ∙ is performed first. The codebook need to store 19 steering vectors and each vector is 8-dimensional with 24-bit complex-valued entries (12-bit for real part and 12-bit for imaginary part). The total memory size is only 456 bytes. Although we can use memory compiler to generate a small on-chip SRAM module for it, to support computing parallelism, these data are stored in registers so that they can be accessed in parallel. The projection is, again, a matrix-matrix multiplication and differs from the counterpart in the PM module only in matrix size. A similar linear systolic array design with eight MACs is adopted here. However, it takes 8 clock cycles to complete the projection of one steering vector from the codebook and there are 19 vectors in total. The norm 2 value of each projected vector is next calculated. Since this value is used in the magnitude comparison process, the square of the norm 2 value is calculated instead to eliminate square root operations. As

Electronics 2020, 9, x FOR PEER REVIEW 20 of 26

resultant R matrix are obtained at the PEs along hypotenuse. The matrix is calculated in a rectangular array containing 8 × 3 PEs. A sized 8 × 8 identity matrix is fed to the array from the bottom

and the matrix is available at the top of the array. The leading PE in each row operates in the vector mode and passes the micro-rotations sequence to the downstream PEs, all operating in the rotation mode. Figure 16 shows a detailed systolic triangular array design for R matrix calculation after substituting each complex-valued PE with three CORDIC units. Label “GV” means vector mode operation and label “GR” means rotation mode operation. For the label at the right end of each row, “R-I” means the CORDIC units in this row is responsible for converting the operand to real-valued by eliminating its imaginary part. Label “R-R” means the Givens rotation for the two converted (to real-valued) operands. The detailed design of the rectangular array can be likewise derived.

Qs matrix output q11 q21 q31 q41 q81 r13 q12 q22 q32 q42 q82 r23 q13 q23 q33 q43 q83 r33 R matrix output r12 r22

r11

Triangular array Rectangular array for R matrix for Q matrix Vs matrix input Identity matrix input Electronics 2020, 9, 641 21 of 27 Figure 15. Systolic array design of the QR decomposition module.

GV GR R-R M3 GV R-I M3 GV GR GR GR R-R M2 GV GR R-I M2 GV GR GR GR GR GR R-R M1 GV GR GR R-I M1

FigureFigure 16. 16.Systolic Systolic tri-array tri-array design design for R formatrix R matrix using real-valued using real-valued Coordinate Coordinate Rotation Digital Rotation Computer Digital (CORDIC)Computer units.(CORDIC) units.

TheThe PMPM module module performs performs a a matrix matrix multiplication multiplication to to obtain obtain the the projection projection matrix. matrix. AsAs shownshown inin Figure 17, the design is a linear systolic array that consists of eight complex-valued MACs. The QS Figure 17, the design is a linear systolic array that consists of eight complex-valued MACs. The matrixmatrix is is inputted inputted to to the the module module column column by by column column in in three three consecutive consecutive clock clock cycles cycles and and a column a column of theof the P matrix P matrix is obtained is obtained every every three three clock clock cycles. cycl Fores. the For Matching the Matching pursuit pursui module,t module, the projection the projection of the H codebook, i.e., Ψ = D PS is performed first. The codebook need to store 19 steering vectors and each of the codebook, i.e., =· ∙ is performed first. The codebook need to store 19 steering vectors vectorand each is 8-dimensional vector is 8-dimensional with 24-bit complex-valuedwith 24-bit comple entriesx-valued (12-bit entries for real (12-bit part and for 12-bitreal part for imaginaryand 12-bit part).for imaginary The total part). memory Thesize total is memory only 456 size bytes. is only Although 456 bytes. we canAlthough use memory we can compiler use memory to generate compiler a smallto generate on-chip a SRAMsmall moduleon-chip forSRAM it, to module support for computing it, to support parallelism, computing these dataparallelism, are stored these in registers data are sostored that theyin registers can be accessedso that they in parallel. can be accessed The projection in parallel. is, again, The projection a matrix-matrix is, again, multiplication a matrix-matrix and dimultiplicationffers from the and counterpart differs from in the counterpart PM module in only the inPM matrix module size. only A in similar matrix linearsize. A systolic similar arraylinear designsystolic with array eight design MACs with is eight adopted MACs here. is However,adopted he itre. takes However, 8 clock it cycles takes to 8 completeclock cycles the to projection complete ofthe one projection steering of vector one steering from the vector codebook from and the therecodebook are 19 and vectors there inare total. 19 vectors The norm in total. 2 value The of norm each 2 Electronicsprojectedvalue of 2020 each vector, 9 ,projected x FOR is next PEER vector calculated. REVIEW is next Since calculated. this value Since is this used value in theis used magnitude in the magnitude comparison comparison process,21 of 26 theprocess, square the of square the norm of the 2 value norm is 2 calculated value is calculat insteaded to instead eliminate to squareeliminate root sq operations.uare root operations. As there are As thereeight are clock eight cycles clock for cycles the magnitude for the magnitude comparison, comp a singlearison, complex-valued a single complex-valued MAC unit MAC would unit meet would the meettiming the demands. timing demands. The next The maximal next maximalK value selectionK value selection can be easily can be implemented easily implemented by chaining by chaining3 (or K) 3comparators. (or K) comparators. Finally, Finally, the indexes the indexes of the K oflargest the K steering largest steering vectors arevectors exported. are exported. ... Complex Complex q Q2 Complex Complex D PM 8 Q Mult. Adder D 8 1 Mult. Adder ... Complex Complex ... QS q D PM2 PM 2 ComplexMult. ComplexAdder q D PM1 [8×3] 1 Mult. Adder [8×8] Q1Q2 ...

q1 q2 ... q8

Figure 17. Linear systolic array design for the PM module. Figure 17. linear systolic array design for the PM module. 4.3. System Timing 4.3. System Timing Figure 18 depicts the timing diagram of the overall hardware accelerator design. The first 512 plusFigure two clock 18 depicts cycles are the allotted timing fordiagram the auto-correlation of the overall matrix.hardware This accelerator latency is design. actually The bounded first 512 by plusthe input two clock data rate,cycles which are allotted takes 512 for clockthe auto-correla cycles to admittion matrix. 64 sample This vectors. latency The is actually AM module bounded spends by thejust input two more data clockrate, which cycles totakes complete 512 clock computations cycles to admit after 64 all sample input data vectors. are available. The AM Themodule QR modulespends justconsumes two more 91 clockclock cyclescycles toto generatecompleteR computationsand Q matrices. after all The input PM moduledata are spendsavailable. another The QR 8 module3 clock S × consumescycles to compute 91 clock thecycles projection to generate matrix. R and The MP matrices. module completesThe PM module the projection spends ofanother a steering 8 × 3vector clock cyclesevery eightto compute clock cycles. the projection The projected matrix. vector The MP is immediately module completes passed the to next projection unit to of compute a steering its normvector 2 everysquare eight value. clock The cycles. vector The projections projected and vector the is norm imme 2diately square passed value calculations to next unit areto compute performed its innorm the 2same square pace value. to form The a vector two-stage projections computing and pipeline. the norm In 2 thesquare final value phase, calculations it takes six are clock performed cycles to doin the the same pace to form a two-stage computing pipeline. In the final phase, it takes six clock cycles to do the sorting and output the indexes of the estimation results. In total, it takes 795 clock cycles to complete the auto-correlation matrix calculation and DoA estimation. Taking the AM calculation part away, the DoA estimation itself requires 283 clock cycles. This meets the design specs that the estimation should be completed within 512 clock cycles (5 μs).

System Timing Schedule of the P2MP

Time T1 T2 T3 T4 ⁞ T512 T513 T514 T515 T516 ⁞ T604 T605 T606 T607 T608 ⁞ T627 T628 T629 DataIN IN(1-1) IN(1-2) IN(1-3) IN(1-8) ... IN(64-8) AM AM QR QR PM P(1) ... P(8)

Time T630 ⁞ T637 T638 T639 ⁞ T780 T781 T782 ⁞ T790 T791 T792 T793 T794 T795 MP MP(1) ... MP(19) Norm Norm(1) ... Norm(19) Max DOA(1) DOA(2) DOA(3)

Figure 18. System Timing chart of the P2MP DoA estimation Hardware Design.

4.4. Chip Design The hardware accelerator design is coded in Verilog and synthesized by using the Synopsys Design Compiler tool. An ARM standard cell library is adopted in the backend design and the target implementation technology is a 40nm TSMC CMOS process. Figure 19 shows the final layout view accomplished by using Synopsys IC Compiler. The accelerator chip design details are summarized in Table 5, where layout area, gate count and power consumption are based on the report from IC Compiler. The timing information is obtained from the PrimeTime-PX tool. The total chip die size is 2.61 mm, while the internal core design occupies an area of 0.76 mm. The core design is roughly divided into four parts, i.e., MP (matching pursuit), QRD, PM (Projection Matrix) and AM (Auto-

Electronics 2020, 9, x FOR PEER REVIEW 21 of 26 there are eight clock cycles for the magnitude comparison, a single complex-valued MAC unit would meet the timing demands. The next maximal K value selection can be easily implemented by chaining 3 (or K) comparators. Finally, the indexes of the K largest steering vectors are exported. ... Complex Complex q Q2 Complex Complex D PM 8 Q Mult. Adder D 8 1 Mult. Adder ... Complex Complex ... QS q D PM2 PM 2 ComplexMult. ComplexAdder q D PM1 [8×3] 1 Mult. Adder [8×8] Q1Q2 ...

q1 q2 ... q8

Figure 17. linear systolic array design for the PM module.

4.3. System Timing Figure 18 depicts the timing diagram of the overall hardware accelerator design. The first 512 plus two clock cycles are allotted for the auto-correlation matrix. This latency is actually bounded by the input data rate, which takes 512 clock cycles to admit 64 sample vectors. The AM module spends just two more clock cycles to complete computations after all input data are available. The QR module consumes 91 clock cycles to generate R and matrices. The PM module spends another 8 × 3 clock cycles to compute the projection matrix. The MP module completes the projection of a steering vector Electronicsevery eight2020 clock, 9, 641 cycles. The projected vector is immediately passed to next unit to compute its22 norm of 27 2 square value. The vector projections and the norm 2 square value calculations are performed in the same pace to form a two-stage computing pipeline. In the final phase, it takes six clock cycles to do sortingthe sorting and outputand output the indexes the indexes of the of estimation the estimation results. results. In total, In it total, takes 795it takes clock 795 cycles clock to completecycles to thecomplete auto-correlation the auto-correlation matrix calculation matrix calculation and DoA estimation.and DoA estimation. Taking the Taking AM calculation the AM calculation part away, part the DoAaway, estimation the DoA itselfestimation requires itself 283 clockrequires cycles. 283 This clock meets cycles. the designThis meets specs the that design the estimation specs that should the beestimation completed should within be 512completed clock cycles within (5 µ512s). clock cycles (5 μs).

System Timing Schedule of the P2MP

Time T1 T2 T3 T4 ⁞ T512 T513 T514 T515 T516 ⁞ T604 T605 T606 T607 T608 ⁞ T627 T628 T629 DataIN IN(1-1) IN(1-2) IN(1-3) IN(1-8) ... IN(64-8) AM AM QR QR PM P(1) ... P(8)

Time T630 ⁞ T637 T638 T639 ⁞ T780 T781 T782 ⁞ T790 T791 T792 T793 T794 T795 MP MP(1) ... MP(19) Norm Norm(1) ... Norm(19) Max DOA(1) DOA(2) DOA(3)

2 Figure 18. System Timing chart of the P2MP DoA estimation Hardware Design. 4.4. Chip Design 4.4. Chip Design The hardware accelerator design is coded in Verilog and synthesized by using the Synopsys The hardware accelerator design is coded in Verilog and synthesized by using the Synopsys Design Compiler tool. An ARM standard cell library is adopted in the backend design and the target Design Compiler tool. An ARM standard cell library is adopted in the backend design and the target implementation technology is a 40nm TSMC CMOS process. Figure 19 shows the final layout view implementation technology is a 40nm TSMC CMOS process. Figure 19 shows the final layout view accomplished by using Synopsys IC Compiler. The accelerator chip design details are summarized accomplished by using Synopsys IC Compiler. The accelerator chip design details are summarized in Table5, where layout area, gate count and power consumption are based on the report from in Table 5, where layout area, gate count and power consumption are based on the report from IC IC Compiler. The timing information is obtained from the PrimeTime-PX tool. The total chip die Compiler. The timing information is obtained from the PrimeTime-PX tool. The total chip die size is size is 2.61 mm2, while the internal core design occupies an area of 0.76 mm2. The core design is 2.61 mm, while the internal core design occupies an area of 0.76 mm. The core design is roughly roughly divided into four parts, i.e., MP (matching pursuit), QRD, PM (Projection Matrix) and AM divided into four parts, i.e., MP (matching pursuit), QRD, PM (Projection Matrix) and AM (Auto- (Auto-correlation Matrix), using different colors. The equivalent gate count in total is 454.4 k. Figure 20 shows the breakdown of hardware complexity measured in gate count. Among these function units, QRD is the most complex one and accounts for one half (49.9%) of the total gate count. MP is ranked in the second place and consumes 30.9% of the total gate count. AM and PM are of similar complexities and the percentages are 8.4% and 7.6%, respectively. The remaining 3.2% is attributed to the I/O buffer. The clock rate of the core design can reach 333MHz and the corresponding power consumptions for the core design is 83mW. The total DoA estimation takes 795 (= 512 + 283) clock cycles. Note that the first 512 clock cycles are actually bounded by the time span needed to collect 64 input samples for auto-correlation matrix calculation. The remaining 283 clock cycles are dedicated to DoA estimation. The original time budget is 10 µs (1024 clock cycles @102.4MHz). The implementation result is 2.4 µs (795 clock cycles @333MHz). If we exclude the AM calculation part, the estimation itself requires only 0.85 µs. This suggests a throughput rate of 1.18 M estimates/sec.

4.5. Comparison with Other DoA Hardware Implementation Works Regarding the hardware design efficiency of the proposed work, our literature survey cannot identify any other work with chip design. However, there have been works [28–31] with hardware implementations reported in recent years and all of them are using the FPGA technology. In industry practices, FPGA is usually employed in the cases of system prototyping or when product volume is not large enough to economically justify the Application Specific Integrated Circuit (ASIC) implementation. The design efforts and the NRE (Non-Recurring Engineering) cost of ASIC are significantly higher than those of the FPGA. The unit cost of ASIC, once the economical volume is reached, however, can be lower than that of FPGA. In this paper, we choose the ASIC approach simply to demonstrate how DSP intensive applications can benefit from a carefully crafted ASIC design in terms of speed and throughput. Electronics 2020, 9, x FOR PEER REVIEW 22 of 26 correlation Matrix), using different colors. The equivalent gate count in total is 454.4 k. Figure 20 shows the breakdown of hardware complexity measured in gate count. Among these function units, QRD is the most complex one and accounts for one half (49.9%) of the total gate count. MP is ranked in the second place and consumes 30.9% of the total gate count. AM and PM are of similar complexities and the percentages are 8.4% and 7.6%, respectively. The remaining 3.2% is attributed to the I/O buffer. The clock rate of the core design can reach 333MHz and the corresponding power consumptions for the core design is 83mW. The total DoA estimation takes 795 (= 512 + 283) clock cycles. Note that the first 512 clock cycles are actually bounded by the time span needed to collect 64 input samples for auto-correlation matrix calculation. The remaining 283 clock cycles are dedicated to DoA estimation. The original time budget is 10 s (1024 clock cycles @102.4MHz). The implementationElectronics result2020, 9 ,is 641 2.4 s (795 clock cycles @333MHz). If we exclude the AM calculation part, 23 of 27 the estimation itself requires only 0.85 s. This suggests a throughput rate of 1.18 M estimates/sec.

MP

QRD

AM

PM

Figure 19.Figure Layout 19. of Layoutthe proposed of the proposedDoA estimation DoA estimation accelerator accelerator chip design chip. design. Table 5. Accelerator chip design summary. Table 5. Accelerator chip design summary. Item Specification Item Specification Technology process TSMC 40 nm CLN 40 G Technology process TSMC 40 nm CLN 40 G Cell library Cell libraryARM Cell-based ARM Design Cell-based Kit Design Kit Voltage (Core/IO) Voltage (Core/IO)0.9/3.3 V 0.9/3.3 V Maximum clock frequencyMaximum clock frequency333 MHz 333 MHz Minimum clock periodMinimum clock period3ns 3ns 2 Core Size Core Size 0.76 mm 0.76 mm2 2 Die Size Die Size 2.61 mm 2.61 mm2 IO Pad IO Pad75 75 Gate Count 454.4 k Gate Count 454.4 k Power (Chip/Core) 177/83 mW Power (Chip/Core) 177/83 mW Latency 512 (auto-correlation matrix) + 283 (DoA estimation) Cycles 512 (auto-correlation matrix) + 283 (DoA Latency  Initiation interval 1.536 s (512clockestimation) cycles) Cycles Throughput 1.18 M estimates/sec Initiation interval 1.536 µs (512clock cycles) Throughput 1.18 M estimates/sec

Table6 lists the design indexes of the proposed work and five other designs. In this table, we list the DoA estimation scheme, antenna array size, input word length, clock rate, latency, initiation interval and throughput results associated with each design. Note that there are various factors affecting the implementation results. These include the antenna array size (or the problem size), the adopted estimation algorithm complexity, the accuracy of estimation, the maximum number of signal sources, the codebook size, the hardware complexity, the implementation technology, and so on. It is, thus, not likely to come out with a leveled index taking every factor into account. Therefore, the comparison emphasizes on the merits of hardware design and does not consider the discrepancy in algorithm complexity and estimation accuracy. However, some basic normalization measures are adopted for a fair comparison. First of all, the computing latency is measured in terms of clock cycles rather than the Electronics 2020, 9, 641 24 of 27 absolute time span. This can effectively eliminate the factor of implementation technology. Secondly, the generic problem complexity is considered. Despite the differences in DoA estimation schemes, all of them bear a algorithm complexity of order N3, where N is the antenna array size. We set the problem complexity of N = 8 as 1 and the problem complexity of N = 4 becomes 1/8. This leads to a normalized latency by taking the problem complexity into account. Electronics 2020, 9, x FOR PEER REVIEW 23 of 26

14.5 34.5 38.1 I/O Buffer 140.4 MP QRDU AM PM

226.6

FigureFigure 20.20. Logiclogic complexity complexity breakdown of chip design.

4.5. ComparisonTable with 6. Performance Other DoA comparison Hardware ofImplementation related DoA estimation Works hardware implementations.

Regarding the hardware[28][ design 29efficiency][ of th30][e proposed work,31 our] literature survey This Work cannot identify any other work with chip design. HoweveLU decomp.r, there haveLDL been workCholeskys [28–31] withProjection hardware and estimation scheme Bartlett MUSIC implementations reported in recent years and all ESPRITof them are usingdecomp. the FPGADecomp. technology.parallel In industry MP Antenna array size 1 4 1 8 1 4 1 4 1 4 1 8 practices, FPGA is usually× employed in× the cases of× system prototyping× or when× product volume× is Norm. problem not large enough to economically1/8 1justify the A 1/8pplication 1Specific/8 Integrated 1/8 Circuit 1 (ASIC) implementation.complexity The design efforts and the NRE (Non-Recurring Engineering) cost of ASIC are significantlyImplementation higher than FPGA those of the FPGA FPGA. The FPGAunit cost of ASIC, FPGA once the FPGA economical volume ASIC is reached,Design however, effort can medium be lower than medium that of FPGA. medium In this mediumpaper, we choose medium the ASIC highapproach simplyNRE to costdemonstrate medium how DSP intensive medium applications medium can mediumbenefit from mediuma carefully crafted high ASIC designUnit in terms cost of speed medium and throughput. medium medium medium medium low InputTable word 6 length lists the design 8 bits indexes of 16 bitsthe proposed 20 bitswork and five 20 bits other designs. 20 bits In this table, 21 bits we list the DoAClock estimation Rate scheme, 225 MHz antenna 160 MHzarray size, 56.47 inpu MHzt word 52.5 length, MHz clock 54.3 rate, MHz latency, 333 initiation MHz interval and throughput results associated with23,438 each LUTs design.21,870 Note LUTs that21,956 there LUTs are various factors 54,100 LUTs 454.4 k affectingCircuit complexity the implementation NA results. These include10 BRAMs the antenna10 BRAMs array size10 (or BRAMs the problem size), the 64 DSP48s gates adopted estimation algorithm complexity, the accura265 DSP48scy of estimation,233 DSP48s the ma230ximum DSP48s number of signal sources,Initiation the interval codebook 0.804size,µ sthe hardware 93.4µs complexi 2.82ty,µs the implementation 3.77µs technology, 4.05µs and 1.536 soµ son. It is, thus,Throughput not likely to come out with a leveled index taking every factor into account. Therefore, the 1.24 M 10.7 K 0.36 M 0.27 M 0.25 M 0.65 M comparison(estimates/ sec)emphasizes on the merits of hardware design and does not consider the discrepancy in algorithmLatency (cycles)complexity and 181 estimation 14,944 accuracy. Ho 159wever, some 198 basic normalization 220 measures 512+283 are adoptedNorm. for latency a fair comparison. 1448 First of 14,944 all, the comp 1272uting latency 1584 is measured 1760 in terms of clock 795 cycles rather than the absolute time span. This can effectively eliminate the factor of implementation technology.From Table Secondly,6, design the [ 29generic] and theproblem proposed complexity work target is considered. a 1 8 antenna Despite array the while differences the remaining in DoA × fourestimation use a 1 schemes,4 antenna all of array. them This bear means a algorithm the problem complexity complexity of order of N a3 1, where4 antenna N is the array antenna is roughly array × × onesize. eighth We set that the of problem a 1 8 antenna complexity array. of As N for = 8 the as input1 and word the problem length, whichcomplexity may a ffofect N the= 4 widthbecomes of data 1/8. × pathThis design,leads to mosta normalized designs adoptlatency word by taking length the between problem 16 complexity and 21 bits. into The account. reported clock rates vary fromFrom 52.5 MHz Table (design 6, design [31 [29]]) to and 333 the MHz proposed (the proposed work target work). a 1 × FPGA, 8 antenna because array of while its programmable the remaining featurefour use and a 1 × a 4 lot antenna of logic array. overhead, This means can barelythe problem meet thecomplexity speed achieved of a 1 × 4 byantenna ASIC. array The is hardware roughly one eighth that of a 1×8 antenna array. As for the input word length, which may affect the width of data path design, most designs adopt word length between 16 and 21 bits. The reported clock rates vary from 52.5 MHz (design [31]) to 333 MHz (the proposed work). FPGA, because of its programmable feature and a lot of logic overhead, can barely meet the speed achieved by ASIC. The hardware complexity, although listed in the comparison table, is not compared because there is no credible complexity conversion formulas between FPGA and ASIC technologies. Note that BRAM and DSP48 in FPGA implementations indicate embedded memory module and hardwired multiplier, respectively. The latency is measured as number of clock cycles, regardless of the clock rate, needed

Electronics 2020, 9, 641 25 of 27 complexity, although listed in the comparison table, is not compared because there is no credible complexity conversion formulas between FPGA and ASIC technologies. Note that BRAM and DSP48 in FPGA implementations indicate embedded memory module and hardwired multiplier, respectively. The latency is measured as number of clock cycles, regardless of the clock rate, needed to complete the estimation. This latency includes both the time spent in auto-correlation matrix calculation and the time to perform DoA estimation. From Table6, designs [ 28,30,31] all can accomplish the estimation in less than 220 clock cycles. This is because they work on a smaller antenna array. The proposed design takes 512 clock cycles in calculating the auto-correlation matrix and another 283 clock cycles in performing the estimation. The total latency is thus 795 clock cycles. Note that in our design, the auto-correlation matrix calculation is performed in pace with the data input rate. This is considered an input rate constrained design and the hardware circuitry is minimized to just meet the throughput requirement. It takes 512 clock cycles to collect the needed 64 sample vectors and, in the mean time, calculates the auto-correlation matrix incrementally. Once all sample vectors are admitted, it actually spends two additional clock cycles only to wrap up the entire auto-correlation matrix calculation. If we take the problem complexity into consideration, the normalized latency of the proposed design is the shortest one in all designs. The initiation interval indicates the time interval between two successive DoA estimations and its reciprocal corresponds to the throughput. For the proposed design, the auto-correlation matrix calculation module and the remaining computing modules can work in parallel. Therefore, the initiation interval is bounded by the latency of auto-correlation matrix calculation. This corresponds to a 1.536 µs interval when the clock rate is 333 MHz. The corresponding throughput is thus 0.65 M estimates/sec. Except for design [29], all other designs have successfully confined their initiation intervals to less than 4 µs. Design [29] uses relatively fewer FPGA resources than other FPGA based design. It also deals with a 1 8 rather than a 1 4 antenna array. That may explain × × why it has the largest initiation interval (lowest throughput) among all designs. Design [28], due to its smaller problem complexity and a much smaller data path width, shows the highest throughput rate. In summary, this table provides an overview of how DoA estimation algorithms are implemented in hardware. Most of these hardwired designs can initiate a new estimation in just a few micro-seconds, which are much faster than the software based designs can achieve, usually measured in the unit of mili-seconds. The proposed design also features the shortest normalized latency.

5. Conclusions In conclusion, an efficient DoA estimation scheme and its hardware accelerator design in chip implementation is presented in this paper. The proposed scheme features a combination of signal sub-space projection and parallel matching pursuit techniques, i.e., applying signal projection first before performing matching pursuit from a codebook. This measure helps minimize the interference from noise sub-space and makes the matching process free of extra orthogonalization computations. It thus exhibits the lowest computing complexity and the second best estimation accuracy among the schemes under comparison. We also evaluate the impact of K value selection on the estimation accuracy and demonstrate how this DoA estimation scheme improves the FMCW radar detection. After algorithm performance verifications, the corresponding hardware accelerator chip design is also developed and can accomplish one multiple-source (up to 3) DoA estimation every 1.536 µs, which indicates a processing throughput as high as 0.65 M estimates per second. Hardware design comparison with 5 other related works is also conducted. Although the comparison result is not conclusive due to the discrepancy in many design factors, the proposed design excels others in that it spends the fewest clock cycles (after normalization) in performing the estimation.

Author Contributions: Conceptualization, K.-T.C., W.-H.M. and Y.-T.H.; methodology, K.-T.C. and W.-H.M.; software, W.-H.M. and K.-Y.C.; validation, W.-H.M. and K.-Y.C.; writing—original draft preparation, K.-T.C. and W.-H.M.; writing—review and editing, Y.-T.H.; project administration, Y.-T.H.; funding acquisition, Y.-T.H.; All authors have read and agreed to the published version of the manuscript. Electronics 2020, 9, 641 26 of 27

Funding: This research is supported by the Ministry of Science & Technology, Taiwan, Project no. MOST 107-2218-E-005-013. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Petre, S.; Nehorai, A. MUSIC, maximum likelihood, and Cramer-Rao bound. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 720–741. 2. Yan, F.; Jin, M.; Qiao, X. Low-complexity DOA estimation based on compressed MUSIC and its performance analysis. IEEE Trans. Signal Process. 2013, 61, 1915–1930. [CrossRef] 3. Ali, H.; Shubair, R.M.; Salahat, E. Enhanced DOA estimation algorithms using MVDR and MUSIC. In Proceedings of the IEEE International Conference on Current Trends in Information Technology (CTIT), Dubai, United Arab Emirates, 11–12 December 2013. 4. Rao, B.D.; Hari, K.V.S. Performance analysis of root-MUSIC. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 1939–1949. [CrossRef] 5. Marius, P.; Gershman, A.B.; Haardt, M. Unitary root-MUSIC with a real-valued eigen decomposition: A theoretical and experimental performance study. IEEE Trans. Signal Process. 2000, 48, 1306–1314. 6. Tanaka, A.; Imai, H. MUSIC-Based DoA Estimation by Oblique Projection along the signal sub-space. In Proceedings of the IEEE Workshop on Statistical Signal Processing (SSP), Gold Coast, Australia, 29 June–2 July 2014. 7. Richard, R.; Kailath, T. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 984–995. 8. Sylvie, M.; Marsal, A.; Benidir, M. The propagator method for source bearing estimation. Signal Process. 1995, 42, 121–138. 9. Zheng, G.; Chen, B.; Yang, M. Unitary ESPRIT algorithm for bistatic MIMO radar. Electron. Lett. 2012, 48, 179–181. [CrossRef] 10. Lin, J.; Ma, X.; Yan, S.; Hao, C. Time-Frequency Multi-Invariance ESPRIT for DOA Estimation. IEEE Antennas Wirel. Propag. Lett. 2016, 15, 770–773. [CrossRef] 11. Zhang, W.; Han, Y.; Jin, M.; Li, X.-S. An Improved ESPRIT-Like Algorithm for Coherent Signals DOA Estimation. IEEE Commun. Lett. 2020, 24, 339–343. [CrossRef] 12. Mallat, S.; Zhang, Z. Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Process. 1994, 41, 3397–3415. [CrossRef] 13. McClure, M.R.; Carin, L. Matching pursuits with a wave-based dictionary. IEEE Trans. Signal Process. 1997, 45, 2912–2927. [CrossRef] 14. Karabulut, G.; Kurt, T.; Yongaçoglu, A. Estimation of directions of arrival by matching pursuit (EDAMP). Eurasip J. Wirel. Commun. Netw. 2005, 2005, 618605. [CrossRef] 15. Tropp, J.; Gilbert, A.C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory 2007, 53, 4655–4666. [CrossRef] 16. Zhang, X.; Li, Y.; Yuan, Y.; Jiang, T.; Yuan, Y. Low-Complexity DOA Estimation via OMP and Majorization- Minimization. In Proceedings of the 2018 IEEE Asia-Pacific Conference on Antennas and Propagation (APCAP), Auckland, New Zealand, 5–8 August 2018. 17. Nasu, T.; Kikuma, N.; Sakakibara, K. Performance Improvement of DOA Estimation Using Conjugate Gradient Method with Subtraction Scheme. In Proceedings of the 2018 IEEE International Workshop on Electromagnetics: Applications and Student Innovation Competition (iWEM), Nagoya, Japan, 29–31 August 2018. 18. Wang, W.; Wu, R. High Resolution Direction of Arrival (DOA) Estimation Based on Improved Orthogonal Matching Pursuit (OMP) Algorithm by Iterative Local Searching. Sensors 2013, 13, 11167–11183. [CrossRef] 19. Aghababaiyan, K.; Shah-Mansouri, V.; Maham, B. High-Precision OMP-Based Direction of Arrival Estimation Scheme for Hybrid Non-Uniform Array. IEEE Commun. Lett. 2020, 24, 354–357. [CrossRef] 20. Wilkinson, J.H.; Bauer, F.L.; Reinsch, C. Linear Algebra. In Handbook for Automatic Computation (2); Springer: Berlin/Heidelberg, Germany, 1971; ISBN 978-3-662-39778-7. 21. Tang, C.F.T.; Liu, K.J.R.; Tretter, S.A. On systolic arrays for recursive complex Householder transformations with applications to array processing. In Proceedings of the IEEE Acoustics, Speech, and Signal Processing, Toronto, ON, Canada, 14–17 April 1991; pp. 1033–1036. Electronics 2020, 9, 641 27 of 27

22. Chung, K.-L.; Yan, W.-M. The complex Householder transform. IEEE Trans. Signal Process. 1997, 45, 2374–2376. [CrossRef] 23. Lin, K.-H.; Chang, R.C.-H.; Lin, H.-L.; Wu, C.-F. Analysis and Architecture Design of a Downlink M-Modification MC-CDMA System Using the Tomlinson–Harashima Precoding Technique. IEEE Trans. Veh. Technol. 2008, 57, 1387–1397. 24. Singh, C.K.; Prasad, S.H.; Balsara, P.T. VLSI Architecture for Matrix Inversion using Modified Gram-Schmidt based QR Decomposition. In Proceedings of the International VLSI Design Conference, Bangalore, India, 6–10 January 2007; pp. 836–841. 25. Hwang, Y.-T.; Chen, W.-D. Design and Implementation of a High Throughput Fully-Parallel Complex-Valued QR Factorization Chip. IET Circ. Devices Syst. 2011, 5, 424–432. [CrossRef] 26. Richards, M. Fundamentals of Radar Signal Processing; McGraw-Hill: New York, NY, USA, 2005. 27. Zaharov, V.; Teixeira, M. SMI-MVDR beamformer implementations for large antenna array and small sample size. IEEE Trans. Circ. Syst. I Reg. Pap. 2008, 55, 3317–3327. [CrossRef] 28. Unlersen, F.M.; Yaldiz, E.; Imeci, S.T. FPGA based fast Bartlett, “DoA estimator for ULA antenna using parallel computing”. Appl. Comput. Electromagn. Soc. J. 2018, 33, 450–459. 29. Yan, J.; Huang, Y.; Xu, H.; Vandenbosch, G.A.E. Hardware acceleration of MUSIC based DoA estimator in MUBTS. In Proceedings of the 8th European Conference Antennas Propag (EuCAP), The Hague, The Netherlands, 6–11 April 2014; pp. 2561–2565. 30. Hussain, A.A.; Tayem, N.; Butt, M.O.; Soliman, A.-H.; Alhamed, A.; Alshebeili, S. FPGA Hardware Implementation of DOA Estimation Algorithm Employing LU Decomposition. IEEE Access 2018, 6, 17666–17680. [CrossRef] 31. Hussain, A.A.; Tayem, N.; Soliman, A.-H.; Radaydeh, R.M. FPGA-Based Hardware Implementation of Computationally Efficient Multi-Source DOA Estimation Algorithms. IEEE Access 2019, 7, 88845–88858. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).