Received 8 October 2019; revised 31 March 2020; accepted 19 April 2020. Date of publication 30 April 2020; date of current version 5 June 2020. The review of this article was arranged by Associate Editor Cedric Richard. Digital Object Identifier 10.1109/OJSP.2020.2991586 Efficient and Self-Recursive Delay Vandermonde Algorithm for Multi-Beam Antenna Arrays SIRANI M. PERERA 1, ARJUNA MADANAYAKE 2 (Member, IEEE), AND RENATO J. CINTRA 3,4 (Senior Member, IEEE) 1Department of Mathematics, Embry-Riddle Aeronautical University, Daytona Beach, FL 32114 USA 2Department of Electrical and Computer Engineering, Florida International University, Miami, FL 33174 USA 3Departamento de Estatística, Universidade Federal de Pernambuco, Recife 50711970, Brazil 4Department of Electrical and Computer Engineering, University of Calgary, Calgary, AB T2N 1N4, Canada

CORRESPONDING AUTHOR: SIRANI M. PERERA (e-mail: [email protected]) This work was supported by the National Science Foundation with the Award number 1902283, and in part by CNPq under Grant 306437/2015-5 and 308423/2019-4.

ABSTRACT This paper presents a self-contained factorization for the delay Vandermonde (DVM), which is the super class of the discrete , using sparse and companion matrices. An efficient DVM algorithm is proposed to reduce the complexity of radio-frequency (RF) N-beam analog beamform- ing systems. There exist applications for wideband multi-beam beamformers in wireless communication networks such as 5G/6G systems, system capacity can be improved by exploiting the improvement of the signal to noise ratio (SNR) using coherent summation of propagating waves based on their directions of propagation. The presence of a multitude of RF beams allows multiple independent wireless links to be established at high SNR, or used in conjunction with multiple-input multiple-output (MIMO) wireless systems, with the overall goal of improving system SNR and therefore capacity. To realize such multi-beam beamformers at acceptable analog circuit complexities, we use sparse factorization of the DVM in order to derive a low arithmetic complexity DVM algorithm. The paper also establishes an error bound and stability analysis of the proposed DVM algorithm. The proposed efficient DVM algorithm is aimed at implementation using analog realizations. For purposes of evaluation, the algorithm can be realized using both digital hardware as well as software defined radio platforms.

INDEX TERMS Delay , efficient algorithms, self-recursive algorithms, complexity and performance of algorithms, approximation algorithms, wireless communications, beamforming, software defined radio.

I. INTRODUCTION 6G wireless networks are built on mm-wave (mmW) bands The demand for wireless data communication networks hav- (typically, 20–600 GHz), which are of sufficiently high fre- ing increased capacity and data transfer rates is growing at quencies to allow the necessary growth in system bandwidth. a tremendous rate. The wireless networks are on the verge of A 5G wireless data connection may operate around 60 GHz rolling out their newest generation of mobile networks; the so- (indoors) and operate at about 200–1000 MHz of bandwidth, called fifth generation (5G) network, which is expected to be which shows 10- to 25-fold growth in capacity [29]. Such the underlying data transfer network of emerging technologies levels of growth are typical of a new generation of wireless such as the wireless internet of things (IoT), networks on chip, networks. In this paper, we address the problem of obtaining a body-area networks and cyberphysical systems (CPS). Expo- multitude of directional mmW RF beams using digital signal nentially growing demands for capacity require exponentially processing. The paper aims to reduce to arithmetic complexity growing bandwidth for a given SNR level. Emerging 5G and of the beamforming operation that is based on multiplication

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 64 VOLUME 1, 2020 FIGURE 1. A linear array with analog antenna outputs xk (t ) for achieving N-beam wideband multi-beam beamformer using left) an analog DVM circuit = that realizes a spatial linear transform y AN x employing true time delay analog blocks for RF beams yk (t ); and right) the digital signal flow graph for a parallel digital hardware or software defined radio approach in which an M-point discrete Fourier transform (DFT) is computed along the temporal dimension for binning the sampled wideband signals in temporal frequency domain. Each signal component temporal frequency bin is applied to the α = − jωτ ω = 2πm . DVM algorithm by computing the efficient DVM algorithm for M separate values of e where M In both cases, the DVM algorithm leads to ψ = , ,..., N-beams by coherently combining radio waves in N discrete directions k 1 2 N. Here, temporal sample blocks consist of M samples for the fast Fourier transforms (FFTs). of a Vandermonde matrix with an input vector obtained from processing receiver that operates on an N-element array of the array of antennas. The proposed multi-beam signal proces- antennas by applying a spatial linear transform AN across sor uses a low-complexity algorithm that enables acceptable the array outputs xk (t ), k = 1, 2,...,N in continuous-time digital circuit complexity in order to achieve multi-beam RF to produce a number of continuous-time outputs yk (t ) corre- beams for a base-station. sponding to the RF beams. The paper is organized as follows. Section II introduces The linear transform typically takes the form of a spatial the concept of wideband multi-beam beamforming for wire- discrete Fourier transform, implemented via spatial FFT, and less communications. Section III proposes a self-contained leads to N number of RF beams corresponding to directions factorization for the DVM and an efficient DVM algorithm, pertaining to spatial frequencies of the incident waves that while Section IV contains the derivation of arithmetic com- match DFT bin frequencies 2πk/N. The use of an FFT pro- plexity and elaborate numerical results of the proposed DVM duces beams that have a frequency dependent axis because algorithm. Section V furnishes the derivation of a theoreti- it can be shown that the beam orientation is a function of cal error bound, establishes numerical results, and addresses wavelength. the numerical stability of the proposed DVM algorithm. In By progressively delaying each antenna by a multiple of Section VI, applications of the DVM matrix factorization a constant time delay, the RF energy can be directed in a in the engineering discipline will be discussed. Finally, particular direction in a frequency independent manner. For jω Section VII concludes the paper. example, let Xk (e ) be the Fourier transform of the input xk (t ) for the array outputs k = 1, 2,...,N. The application II. MULTI-BEAM WIDEBAND ARRAY of a linear delay of duration τ causes a corresponding phase SIGNAL PROCESSING rotation by ωτ. In the frequency domain, the output of the jω − jωτ A. REVIEW OF ANTENNA BEAMFORMING delay becomes Xk (e )e . The application of delays τkl The coherent combination of multiple antennas is known as to the kth antenna, where l ∈ Z is an integer causes signal beamforming [15], [30]. Fig. 1 shows an overview of an array components to be rotated by the frequency dependent phase

VOLUME 1, 2020 65 M. PERERA ET AL.: EFFICIENT AND SELF-RECURSIVE DELAY VANDERMONDE ALGORITHM FOR MULTI-BEAM ANTENNA ARRAYS

ωτkl. Delay and sum operations - described by a Vander- analog delay lines. In analog realizations the DVM fast al- monde matrix (DVM) by input vector product- leads to N RF gorithm is realized in an analog circuit consisting of wide- beams that have beam orientations ψk, k = 1, 2,...,N that band delays, amplifiers and adders. The low arithmetic com- have no dependence on the wavelength. plexity of the proposed algorithms will result in correspond- Example beamformers for arrays can be found in [2], [3], ingly low circuit complexity for wideband analog multi-beam [17], [18], [26], [32], [39], [40], [43]–[45]. These systems are beamforming circuits. For purposes of fast verification using limited in number of beams, although they do possess rela- software models, the proposed DVM algorithms assumes a tively high numbers of antennas, due to the inherent computa- particular input frequency (i.e., a constant ω) for the incident tional/circuit complexity of such multi-beam systems. Unlike waves. Such a simplification allows us to compute the rele- DFT beamformers, delay-based Vandermonde matrix multi- vant matrix-vector product using a computer based numerical beam beamformers have frequency independent beam angles- simulation model. a property that is desired in wideband basestations. During software-based numerical verification, we assume the input signals are over-sampled in time. This is because all computer models much be discrete in time. For sam- B. WIDEBAND N-BEAM ALGORITHMS pled digital signal processing systems, the algorithms can For a linear-array with inter-element spacing x and speed be applied at a particular value of ω in the temporal fre- of light c, the marginal time-difference of arrival between quency domain by first computing temporal FFTs for the ψ τ = x ψ antennas for direction is c sin .Thekth true-time- antenna channels, and then applying the DVM algorithm for delay RF beam can be realized by coherently summing the each bin of a temporal M-point DFTs. In such temporally- antenna signals l = 0, 1,...,N − 1 such that the beam output sampled software/digital implementations, each input xk (t ) − j2πm/M is yk (t ) = x0(t ) + x1(t − kτ ) +···+x(t − k(N − 1)τ ). becomes xk (e , t ), m = 0, 1,...,M − 1 and corre- − j2πm/M Assuming inter-element spacing x = λmin/2, the N si- sponding outputs yk (e , t ) where t = MTnfor tem- multaneous RF beams in the discrete domain can be written poral sample period T and M-point temporal FFT block num- as the product of an N × N DVM containing the frequency- ber n. There needs to be M parallel digital circuits, or M dependent phase-rotations and an input signal vector with ele- calls to the DVM algorithm in software realizations of the ments xk (t ) consisting of the spatial signals from the uniform fast algorithm to process all of the temporal frequency bins. linear array of antennas. There will be call to the DVM algorithm for each temporal The wideband multi-beam beamforming algorithm there- FFT output bin at e− j2πm/M in order to support wideband fore consists of the computation of a DVM-vector product operation, and therefore M number of calls to the algorithm y = AN x at time t ∈ R where x and y are input and out- in order to process wideband signals transformed by M-point put vectors containing signals xk (t ) and yk (t ) respectively. In temporal FFTs. In our analysis, the code implementations our previous work [26], we have proposed a low-complexity assume M = 1 because the aim is to numerically model the DVM algorithm using the product of complex 1-band upper efficient DVM algorithm. and lower matrices. The DVM algorithm in [26] extends the The reported arithmetic complexities scale linearly with M results in [21], [22], [42] utilizing complex nodes without for finer temporal M-point DFTs. For notational simplicity, considering quasiseparability and displacement equations as we will simply use xk (t ) and yk (t ) for describing the effi- in [12], [13], [23]. Moreover, we have addressed error bounds cient DVM algorithm keeping in mind the above mentioned and stability of the DVM algorithm in [26] by filling the gaps details when considering analog circuit or software/digital in [21], [22], [42]. There are several mathematical techniques implementations. available to derive radix-2 and split-radix FFT algorithms, as described in [5], [9], [16], [28], [33], [38]. On the other hand, among the known approximation algorithms which run in quadratic arithmetic time to compute Vandermonde matrices III. SELF-CONTAINED FACTORIZATIONS AND FAST DVM BEAMFORMING ALGORITHM by a vector, the author in [25] presented linear arithmetic time , − The DVM A = [αkl]N N 1 , is a Vandermonde structured approximation algorithm to compute Vandermonde matrices N k=1,l=0 − jωτ by vector. But our main intention in this paper is to propose matrix with complex entries. Here, we defined α ≡ e for an exact algorithm with self-contained factors not an approx- notational convenience. Recall τ = x/c is a time delay. On imation algorithm. the other hand, the DFT matrix is also a well known Vander- th Even though the derivation of size N DFT into two size monde structured matrix having N roots of unity as nodes. In N contrast to the DFT, however, the DVM does not necessarily 2 DFTs can be done easily, the extension of this idea to the DVM is cumbersome as the useful DFT properties are not possess nice properties, such as unitary, periodicity, symme- necessarily present in the DVM case. However, one could try, and circular shift. still derive an efficient algorithm for the delay Vandermonde The DVM is defined using distinct complex nodes α,α2,...,αN matrix as an polynomial evaluation problem. and hence it is non-singular. The matrix AN = ˜ ˜ = αkl N−1 For wideband analog inputs where ω spans a continuous can be scaled as AN AN DN , where AN [ ]k,l=0 and = αk N−1 range of values, the phase rotations are to be realized using DN diag[ ]k=0 . In the following, we will provide a

66 VOLUME 1, 2020 self-contained sparse factorization for A˜ N followed by the holds α,α2,...,αN DVM (i.e. AN ) over complex nodes . ˜ = ˘ ˜ , A N ,α2 C N D N A N ,α2 (6) Lemma 3.1: Let the scaled delay Vandermonde 2 2 2 2 ˜ = αkl N−1 matrix AN,α [ ]k,l=0 be defined by nodes where 2 N−1 t ⎡ ⎤ {1,α,α ,...,α }∈C and N = 2 (t ≥ 1). Then A˜ N,α can 00··· 0 −w0 be factored into ⎢ ⎥ ⎡ ⎤ ⎢10··· 0 −w ⎥ N ⎢ 1 ⎥ 2 ⎢ ⎥ I N C ··· −w ⎢ N ⎥ C N = ⎢01 0 2 ⎥ (7) A˜ N 2 2 ⎢ ⎥ T ,α2 ⎢ ⎥ 2 . . . . A˜ ,α = P 2 ⎢ ⎥ , (1) ⎢...... ⎥ N N ˜ ⎣ ⎦ ⎣. . . . . ⎦ A N ,α2 N N 2 ˜ α 2 2 ˜ D N C N D N 00··· 1 −w N − 2 2 2 2 1 ˜ = is the of the polynomial p(z) with coeffi- where PN is the even-odd , A N ,α2 2 N − N 2k 2 1 N −1 N −1 cients w (i = 0, 1,..., − 1) and D˘ N = diag(α ) .By α2kl 2 ˜ = αl 2 i 2 k=0 [ ]k,l=0, I N is the , D N diag[ ]l=0 , and 2 2 2 ˜ using (6), non-singularity of A N ,α2 , and induction on N, one C N is the companion matrix defined by the monic polynomial 2 2 can easily show p(z) = (z − 1)(z − α2)(z − α4) ···(z − αN−2) N N Proof: We show (1) by divide-and-conquer technique. We ˜ 2 = ˘ 2 ˜ A N ,α2 C N D N A N ,α2 (8) first permute rows of A˜ N by multiplying with PN and then 2 2 2 2 write the result as the block matrices: N ˘ 2 = ˆ for any even number N. Note that, D N D N . Thus PN A˜ N,α 2 2 ⎡ ⎤ ⎡ ⎤ N N −1 ˜ ˜ 2 N −1 2k N +l 2 ⎢ A N ,α2 A N ,α2 C N ⎥ ⎢ α2kl 2 α 2 ⎥ ⎢ 2 2 2 ⎥ ⎢ k,l=0 ⎥ (2) ˜ = ⎢ ⎥ = ⎢ k,l=0 ⎥ PN AN,α ⎢ N ⎥ ⎣ N ⎦ −1 N ,α2 ⎣ N −1 (2k+1) N +l 2 ⎦ ˜ ˜ α 2 ˜ 2 ˜ α(2k+1)l 2 α 2 A N ,α2 D N A N ,α2 C N D N k,l=0 2 2 2 2 2 k,l=0 and we get the result.  ˜ Now, we consider (1,2), (2, 1), and (2, 2) blocks of PN AN,α Corollary 3.2: Let the delay Vandermonde matrix AN,α = ˜ kl N,N−1 2 N (2) and represent each of these by A N ,α2 and the product of [α ] be defined by nodes {α,α ,...,α } and N = 2 k=1,l=0 diagonal matrices. 2t (t ≥ 1). Then the DVM can be factored into For (1,2) block of (2) we get: ¯ A N ,α2 D N N − = T 2 2 N 2 1 N N − AN,α P 2k +l −1 2 1 N ¯ α 2 = diag(αkN ) 2 · α2kl . (3) A N ,α2 D N k=0 , = 2 2 k,l=0 k l 0 ⎡ ⎤ N 2 (9) For (2, 1) block of (2) we get: ⎢ I N C N ⎥ ⎢ 2 2 ⎥ ⎢ ⎥ N −1 N −1 N − DN k+ l 2 2 l 1 ⎣ N ⎦ α(2 1) = α2kl · diag(α ) 2 . (4) N l=0 ˜ α 2 2 ˜ k,l=0 k,l=0 D N C N D N 2 2 2 For (2, 2) block of (2) we get: N , N −1 N −1 = α2kl 2 2 ¯ = 1 2 N where A N ,α2 [ ] = , = and D N diag[ 2k ] = . −1 2 k 1 l 0 2 α k 0 (2k+1) N +l 2 α 2 Proof: This can easily be seen through the scaling of (1) k,l=0 by DN and D¯ N .  2 N N − N Note that in order to compute the companion matrix N −1 2 1 −1 = α 2 αkN 2 α2kl αl 2 . diag( )k=0 diag( )l=0 C N we have to compute the coefficients of the polyno- k,l=0 (5) 2 = − − α2 − α4 ··· − αN−2 = N + Thus by (3), (4), and (5) we can state (2) as: mial p(z) (z 1)(z )(z ) (z ) z 2 N −1 ⎡ ⎤ 2 w · zi. One can do this by setting p(0)(z) = 1 and ˜ ˆ ˜ i=0 i N A N ,α2 D N A N ,α2 2 ⎢ 2 2 2 ⎥ (k+1) = − α2k (k) = , ,..., N − P A˜ ,α = ⎣ ⎦ , p N (z )p N for k 0 1 2 1. Then take N N 2 2 ˜ ˜ α N ˆ ˜ ˜ N A N ,α2 D N 2 D N A N ,α2 D N ( 2 ) 2 2 2 2 2 p N (z) which is p(z). The following lemma gives this 2 N − N − kN 2 1 l 2 1 procedure. where Dˆ N = diag(α ) = and D˜ N = diag(α ) = . 2 k 0 2 l 0 Lemma 3.3: N W = − Let be an even number, = − − α2 − α4 ··· − αN 2 = 2 4 N−2 k 2i Set p(z) (z 1)(z )(z ) (z ) {1, z , z ,...,z }, and q(z) = = v · z , where N i 1 i N −1 k+1 2 + 2 w · i w ∈ C ≤ N − 2 · = w · 2i z i=0 i z where i . The following equality k 2 2. Then the coefficients of z q(z) i=1 i z

VOLUME 1, 2020 67 M. PERERA ET AL.: EFFICIENT AND SELF-RECURSIVE DELAY VANDERMONDE ALGORITHM FOR MULTI-BEAM ANTENNA ARRAYS can be computed by Proof: One can easily use induction for k ≥ 2toshow ⎡ ⎤ ⎡ ⎤ (12).  w v ⎢ 0 ⎥ ⎢ 0⎥ Remark 3.5: Although the factorization for the DVM can ⎢ . ⎥ ⎢ . ⎥ be stated as in Corollary 3.2, we should recall here that ⎢ . ⎥ ⎢ . ⎥ ⎢ ⎥ ⎢ ⎥ the classical Vandermonde matrix (V ) is extremely ill- ⎢w ⎥ N ⎢v ⎥ ⎢ k+1⎥ Z O −1 ⎢ k⎥ ⎢ ⎥ = 2 ⎢ ⎥ (10) conditioned and in fact the condition number of the matrix ⎢ 0 ⎥ e N −1 0 ⎢ 0 ⎥ ⎢ ⎥ 2 ⎢ ⎥ V grows exponentially with the size [11], [24], [37]. In this ⎢ . ⎥ ⎢ . ⎥ ⎣ . ⎦ ⎣ . ⎦ paper, we will study how bad the complex structured DVM can be in terms of the choices for nodes in Section V. 0 0 We will first state the following algorithm based on Lemma ⎡ ⎤ 00··· 00 3.3 to compute the coefficients of the polynomial ⎢ ⎥ ⎢ .⎥ p z = z − z − α2 z − α4 ··· z − αN−2 ⎢10 0 .⎥ ( ) ( 1)( )( ) ( ) ⎢ ⎥ . N −1 Z = ⎢ .. .⎥ 2 (13) where ⎢01 0 . .⎥ is the lower N i ⎢ ⎥ = z 2 + w · z . ⎢. . . . ⎥ i ⎣...... 0⎦ i=0 ··· 0 010 Later, the coefficients of p(z) will be used to construct the N N N companion matrix C N defined in (7). of size ( − 1) × ( − 1), e N − = zeros(1, − 2) 1 , 2 2 2 2 1 2 = N − , Algorithm 3.6: (com(N,α)). O N −1 zeros( 2 1 1) 2 Input: Even N, and α ∈ C Proof: It is obvious that polynomials in W satisfy the re- (0) (0) (0) k 2 k−1 N w w ··· w = currence relation z = z · z for k = 1, 2,..., − 1 with 1) Set 0 1 N − 10··· 0 2 2 1 0 z = 1. By we can easily get: 2) For k = 1:N1 − 1, ⎡ ⎤ (k) 2 2 4 ··· N−2 w z 1 z z z ⎢ 0 ⎥ ⎢ w(k) ⎥ 1 Z O − − ⎢ ⎥ = N1 1 − α2(k 1) · Z O N ⎢ ⎥ I − −1 . e − 0 − 1 z2 z4 ··· zN 2 2 (11) ⎣ . ⎦ N1 1 e N − 0 2 1 w(k) N1−1 N ⎡ ⎤ = 0 ··· 0 z w(k−1) 0 ⎢ (k−1)⎥ Multiplying (11) by the column of the coefficients we get the ⎢w ⎥ × ⎢ 1 ⎥ result.  ⎢ . ⎥ ⎣ . ⎦ Lemma 3.3 can be used to compute the coefficients (k−1) 2 4 w of the polynomial p(z) = (z − 1)(z − α )(z − α ) ···(z − N1−1 N − N −1 αN 2 = 2 + 2 w · i ) z i=0 i z efficiently. Hence the compan- 3) Take w w ··· w − = 0 1 N1 1 ion matrix C N can be computed efficiently using the − − − 2 w(N1 1) w(N1 1) ··· w(N1 1) Lemma 3.3. 0 1 N1−1 To compute the self-contained DVM factorization, first we Output: Coefficients of p(z) (except the leading coeffi- {w , w , w ,...,w } calculate the powers of the companion matrix. We will use the cient as p(z) is monic) i.e. 0 1 2 N1−1 N 2 satisfying 13. following result for the calculation of C N . 2 We will use the output of Algorithm 3.6 (i.e. com(N,α)) Corollary⎡ 3.4: Let N = 2t (t ≥⎤2), m = 2k (k ≥ 2), and to construct the companion matrix CN1 (7). Following the 00··· 0 −w ⎢ 0 ⎥ self-contained DVM factorization (9), one has to compute ⎢ ··· −w ⎥ C ⎢10 0 1 ⎥ the powers of the companion matrix N1 (7). Corollary 3.4 ⎢ ··· −w ⎥ N suggests the following algorithm to compute Cm for 2 ≤ m ≤ = ⎢01 0 2 ⎥ 2 N1 C N . Then C N can be com- 2 ⎢ ⎥ = t1 ≥ ⎢. . . . . ⎥ 2 N1, where m 2 (t1 1). ⎣...... ⎦

00··· 1 −w N − Algorithm 3.7: (comp(N,α)) 2 1 puted via = t ≥ = N α ∈ C Input: N 2 ( 1), N1 2 , and m m m = 2 · 2 , 1) Set w = w0 w1 ··· wN −1 and C N C N C N (12) ⎡ ⎤ 1 2 2 2 Z N N ⎢ −w ⎥ for 2 ≤ m ≤ , where wi for i = 0, 1,..., − 1 are com- = ⎣ ⎦ 2 2 CN1 eN −1 puted as in Lemma 3.3. 1

68 VOLUME 1, 2020 − 2) for m = 2:N1 = ,α2, N1 1 m m s1 : dvm(N1 [ri]i=0 ), Cm = C 2 C 2 s2 := dvm(N ,α2, [r ]N ), N1 N1 N1 1 i i=N1 end = T T , T T y : PN (s1 s2 ) . Output: CN1 . = N1 Output: y AN,αz.

We will use the output of the Algorithm⎡ 3.7 (i.e.⎤ N1 IV. COMPLEXITY OF DVM ALGORITHMS IN C ⎢ 1 N1 ⎥ The number of additions and multiplications required to carry comp(N,α)) to construct C˜ N s.t. C˜ N = ⎣ ⎦ N N1 out a computation is called the arithmetic complexity. In D˜ N α 1 C D˜ N 1 N1 1 this section the arithmetic complexities of the proposed self- ≥ for all N 4. The self-contained factorization for the scaled recursive scaled DVM and DVM algorithms are established. DVM i.e. Lemma 3.1 together with algorithms 3.6 and 3.7 lead us to establish a recursive radix-2 scaled DVM algorithm N−1 A. ARITHMETIC COMPLEXITY OF DVM ALGORITHMS to compute A˜ ,α = [αkl] as stated next. N k,l=0 Here we analyze the arithmetic complexity of the self- recursive scaled DVM and DVM algorithms presented in Algorithm 3.8: (sdvm(N,α,z)). Section III. Let #a and #m denote the number of complex ad- = t ≥ = N α ∈ C Input: N 2 (t 1), N1 2 , , and ditions and complex multiplications, respectively, required to z ∈ Rn or Cn. compute y = A˜ N,αz or y = AN,αz for scaled DVM and√ DVM. ˜ 1) Set CN . Note that we do not count multiplication by ±1, ± −1, and 2) If N = 2, then permutation. 11 Lemma 4.1: Let N = 2t (t ≥ 2) be given. The arithmetic y = z. 1 α complexity on computing the scaled DVM Algorithm 3.8 is 3) If N ≥ 4, then given by u := C˜ N z, − 1 = ,α2, N1 1 #a(sDVM, N) = Nt + 4t − N , v1 : sdvm(N1 [ui]i=0 ), 2 N 2 v2 := sdvm(N1,α , [ui] = ), i N1 3 1 y := PT (v1T , v2T )T . #m(sDVM, N) = Nt + 4t − 2N. (14) N 2 2 Output: y = A˜ N,αz. Proof: Referring to the sdvm(N,α,z) algorithm, we get Remark 3.6: Recall that the delay Vandermonde matrix ˜ N (i.e. AN,α) and scaled delay Vandermonde matrix (i.e. AN,α) #a(sDVM, N) = 2 · #a sDVM, + #a C˜ . (15) = ˜ · = αk N−1 2 N is related via AN,α AN,α DN , where DN diag( )k=0 . Thus, once the Algorithm 3.8 is executed, we can scale the N The matrix C˜ is constructed using D˜ and C 2 . Moreover, output of the algorithm by DN to obtain a DVM algorithm. N Although the computational cost of the DVM algorithm re- to compute the powers of the Companion matrix C 2 we duces in this fashion, the resulting DVM algorithm won’t be have used the divide-and-conquer technique via algorithm self-recursive. comp(N,α). Since m = 2t in algorithm comp(N,α), by In the following we will state a recursive radix-2 DVM solving a homogeneous first order linear difference equa- algorithm with the help of Corollary 3.2, algorithms 3.6, and tion with respect to t(t ≥ 1) (i.e. for m = 2t (t ≥ 1) solving − ¯ #a/#m(Cm, 2t ) − 2 · #a/#m(Cm, 2t 1) = 0 with initial con- ¯ DN1 3.7. For notation convenience, we define D¯ N = dition #a(Cm, 2) = m − 1or#m(Cm, 2) = m respectively), D¯ N1 m = m2 − m m = m2 ≥ we could obtain #a(C ) 2 2 and #m(C ) 2 .This for all N 4. N fact together with the construction of C˜ using D˜ and C 2 , and Algorithm 3.10: (dvm(N,α,z)). = N m 2 gives us: = t ≥ = N α ∈ C Input: N 2 (t 1), N1 2 , , and 2 2 n n ˜ = N + N , ˜ = N + 3N (16) z ∈ R or C . #a CN 4 2 #m CN 4 2 ˜ ¯ 1) Set DN , CN , and DN . Using the above result we can write (15) as 2) If N = 2, then 1 α N N2 N y = z. #a(sDVM, N) = 2 · #a sDVM, + + 1 α2 2 4 2 ≥ 3) If N 4, then Since N = 2t , the above simplifies to the first order difference = u : DN z, equation with respect to t ≥ 2 ˜ v := CN u, ¯ t t−1 t−1 t−1 r := D¯ Nv, #a(sDVM, 2 ) − 2 · #a sDVM, 2 = 4 + 2 .

VOLUME 1, 2020 69 M. PERERA ET AL.: EFFICIENT AND SELF-RECURSIVE DELAY VANDERMONDE ALGORITHM FOR MULTI-BEAM ANTENNA ARRAYS

Solving the above difference equation using the initial condi- tion #a(sDVM, 2) = 2, we can obtain 1 1 1 #a(sDVM, 2t ) = Nt + 4t − N. 2 2 2 Referring the scaled DVM Algorithm 3.8 and (16), we could obtain another first order difference equation with respect to t ≥ 2 − − − #m(sDVM, 2t ) − 2 · #m sDVM, 2t 1 = 4t 1 + 3 · 2t 1.

Solving the above difference equation using the initial condi- tion #m(sDVM, 2) = 1, we can obtain 3 1 #m(sDVM, 2t ) = Nt + 4t − 2N. 2 2  Lemma 4.2: Let N = 2t (≥ 2), The arithmetic complexity on computing the DVM Algorithm 3.10 is given by 1 #a(DV M, N) = Nt + 4t − N , FIGURE 2. Addition and multiplication counts in computing the scaled 2 DVM and DVM algorithms vs the direct matrix-vector computation. 7 1 7 #m(DV M, N) = Nt + 4t − N. (17) 2 2 2 ,α, Proof: Referringtothedvm(N z) algorithm, we get B. NUMERICAL RESULTS FOR THE COMPLEXITY OF N DVM ALGORITHMS #a(DVM, N) = 2 · #a DVM, + #a (D ) 2 N Numerical results for the arithmetic complexity of the pro- posed algorithms derived via Lemma 4.1 and 4.2 will be ¯ + #a C˜ N + #a D¯ N (18) shown in this section. Figure 2 shows the arithmetic com- plexity of the proposed algorithms vs the direct matrix-vector ¯ computations with the matrix size varying from 4 × 4to By following the structures of DN and D¯ N we get 4096 × 4096. We consider the direct computation of the ma- = , = #a(DN ) 0 #m(DN ) N trix A˜ N by the vector z cost N(N − 1) additions and multi- ¯ ¯ (19) #a D¯ N = 0, #m D¯ N = N plications (refer to Direct sDVM in Figure 2) and, the matrix 2 AN by the vector z cost N(N − 1) additions and N multi- Thus by using the above and (16), we could state (18) as the plications (refer to Direct DVM in Figure 2). Following the first order difference equation with respect to t ≥ 1 Figure 2, the scaled DVM and DVM algorithms have the same addition counts and the similar multiplication counts. When t t−1 t−1 t−1 #a(DVM, 2 ) − 2 · #a DVM, 2 = 4 + 2 . the size of the matrices increases the proposed algorithms require fewer addition and multiplication counts as opposed Solving the above difference equation using the initial condi- to the direct matrix-vector computation. Moreover, for large tion #a(DVM, 2) = 2, we can obtain N, the proposed algorithms have saved ≈ 50% of addition 1 1 1 and multiplication counts as opposed to the direct brute-force #a(DVM, 2t ) = Nt + 4t − N. 2 2 2 matrix-vector calculation. As we couldn’t distinguish the ex- plicit addition and multiplication counts between the proposed Now by using the dvm(N,α,z) algorithm, (16), and (19), algorithms through the Figure 2, we have included the explicit we could obtain another first order difference equation with counts using the Tables II and III in Appendix VII. These respect to t ≥ 2 counts are based on the results obtained in Lemma 4.1 and − − − #m(DVM, 2t ) − 2 · #m DVM, 2t 1 = 4t 1 + 7 · 2t 1. 4.2.

Solving the above difference equation using the initial condi- tion #m(DVM, 2) = 2, we can obtain V. ANALYTIC AND NUMERICAL ERROR BOUNDS OF DVM ALGORITHMS 7 1 7 #m(DVM, 2t ) = Nt + 4t − N. A. THEORETICAL BOUNDS 2 2 2 Error bounds of computing the scaled DVM and DVM algo-  rithms is the main concern in this section. To do so, we use the

70 VOLUME 1, 2020 perturbation of the product of matrices (stated in [14]). Fol- αk, we get lowing the sdvm(N,α,z) and dvm(N,α,z) algorithms, we k have to compute weights α for k = 0, 1,...,N − 1. These  ˜ ≤ γ ˜ = , ,..., − . C(s) 2t−1−s+3 C(s) for s 0 1 t 2 (22) weights affect the accuracy of the DVM algorithms. Thus, we αk will assume that the computed weights are used and satisfy with the use of complex arithmetic. By considering the com- = , ,..., − for all k 0 1 N 1 puted weights αk and evaluation at the weight α2 in each step k k i.e. using (20); α = α + k, | k|≤μ, (20) where μ := cu, u is the unit roundoff, and c is a constant that C˜ (s) = C˜ (s) + C˜ (s), |C˜ (s)|≤μ|C˜ (s)|. (23) depends on the method [38]. Let’s recall the perturbation of the product of matrices − 11 stated in [14, Lemma 3.7] i.e. if A + A ∈ RN×N satisfies Since A˜ (t − 1) has 2t 1 block diagonal matrices of , k k 1 α |A |≤δ |A | for all k, then k k k we get m m m m +  − ≤ + δ − , (Ak Ak ) Ak (1 k ) 1 Ak A˜ (t − 1) ≤ γ3 A˜ (t − 1) . k=0 k=0 k=0 k=0 |δ | < N + δ ±1 = ˜ − α2 where k u. Moreover, recall k=1(1 k ) By evaluating A(t 1) at the weight in each step, we + θ |θ |≤ Nu = γ γ + ≤ γ 1 N where N 1−Nu : N and k u k+1, obtain γ + γ + γ γ ≤ γ + from [14, Lemma 3.1 and k j k j k j Lemma 3.3], and for x, y ∈ C, fl(x ± y) = (x + y)(1√+ δ) A˜ (s) = A˜ (s) + A˜ (s), |A˜ (s)|≤μ|A˜ (s)|. where |δ|≤u, fl(xy) = (xy)(1 + δ) where |δ|≤ 2γ2 from [14, Lemma 3.5]. Thus overall, In the following, we will prove the error bound on comput- = ··· − ˜ − + − ing the scaled DVM and DVM algorithms. y P(0) P(1) P(t 2)(A(t 1) E(t 1)) = ˜ = t ≥ Theorem 5.1: Let y fl(AN z), where N 2 (t 2), be × (C˜ (t − 2) + E(t − 2))···(C˜ (1) + E(1))(C˜ (0) + E(0))z, computed using the sdvm(N,α,z) algorithm, and assume that (20) holds. Then ˜ where |E(s)|≤(μ + γ2t (1 + μ))|C(s)| for s = 0, 1,...,(t − | − |≤ μ + γ + μ | ˜ − | η = tη 2) and E(t 1) ( 3(1 )) A(t 1) .Let |y −y| ≤ |P(0)||P(1)| ···|P(t − 2)| A˜ (t − 1) μ + γ + μ 1 − tη ( 2t (1 )). Hence C˜ (t − 2) ··· C˜ (1) C˜ (0) |z| , (21) |y −y| ≤ (1 + η)t − 1 |P(0)||P(1)| ···|P(t − 2)| ˜ ˜ ˜ ˜ | | where η = (μ + γ2t (1 + μ)). × A(t − 1) C(t − 2) ··· C(1) C(0) z Proof: Using the sdvm(N,α,z) algorithm and the com- tη ˜ ≤ |P(0)||P(1)| ···|P(t − 2)| A˜ (t − 1) puted matrices C(s) at the step numbers (number of exe- 1 − tη cutions/iterations of the algorithm) s = 0, 1, 2,...,t − 2in terms of computed weights αk for k = 0, 1 ··· , N − 1, we get × C˜ (t − 2) ··· C˜ (1) C˜ (0) |z|  = ··· − ˜ − Hence the result. y fl P(0) P(1) P(t 2) A(t 1) t Theorem 5.2: Let y = fl(AN z), where N = 2 (t ≥ 2), be computed using the dvm(N,α,z) algorithm, and assume that × C˜ (t − 2)···C˜ (2)C˜ (1)C˜ (0) z (20) holds. Then (3t − 2)η = ··· − ˜ − + ˜ − |y −y| ≤ |P(0)||P(1)| ···|P(t − 2)| P(0) P(1) P(t 2) A(t 1) A(t 1) 1 − (3t − 2)η ¯ ¯ ¯ × C˜ (t − 2) + C˜ (t − 2) ··· C˜ (2) + C˜ (2) × |A(t − 1)| D¯ (t − 2) ···D¯ (1) D¯ (0) × ˜ − ··· ˜ ˜ × C˜ (1) + C˜ (1) C˜ (0) + C˜ (0) z, C(t 2) C(1) C(0) × |D(t − 2)| ···|D(1)||D(0)||z| (24) s T where P(s):= 2 block diagonal matrices of P t−s and 2 ˜ = s ˜ η = μ + γ t + μ C(s): 2 computed block diagonal matrices of C2t−s .Us- where ( 2 (1 )). ing the fact that each C˜ (s) is computed using the powers of Proof: Using the dvm(N,α,z) algorithm and the com- N t−1−s ¯ companion matrix C 2 with 2 non-zero entry per row, D˜ puted matrices D¯ (s), C˜ (s), and D(s) at the step numbers (ex- with each one having one non-zero entry per row, and weight ecution/iteration step of the algorithm) s = 0, 1,...,t − 2in

VOLUME 1, 2020 71 M. PERERA ET AL.: EFFICIENT AND SELF-RECURSIVE DELAY VANDERMONDE ALGORITHM FOR MULTI-BEAM ANTENNA ARRAYS terms of computed weights αk for k = 0, 1 ...,N − 1, we get Together with (22) and (23) and overall, y = P(0) P(1) ···P(t − 2)(A(t − 1) + E(t − 1)) y = fl P(0) P(1) ···P(t − 2) A(t − 1) ¯ ¯ × (D¯ (t − 2) + E1(t − 2))···(D¯ (0) + E1(0)) × ¯ − ··· ¯ ¯ ¯ D(t 2) D(2)D(1)D(0) × (C˜ (t − 2) + E2(t − 2))···(C˜ (0) + E2(0)) × − + − ··· + , × C˜ (t − 2)···C˜ N (2)C˜ (1)C˜ (0) (D(t 2) E3(t 2)) (D(0) E3(0))z × − ··· ¯ D(t 2) DN (2)D(1)D(0) z where |E1(s)|≤(μ + γ2(1 + μ))|D¯ (s)|, |E2(s)|≤ μ + γ + μ | ˜ | | |≤ μ + γ + ( 2t (1 )) C(s) , and E3(s) ( 2(1 = P(0) P(1) ···P(t − 2) A(t − 1) + A(t − 1) μ))|D(s)| for s = 0, 1, 2, ···(t − 2) and |E(t − 1)|≤ μ + γ + μ | − | η = μ + γ + μ ( 4(1 )) A(t 1) .Let ( 2t (1 )) and ¯ ¯ ¯ ¯ × D(t − 2) + D(t − 2) ··· D(2) + D(2) η1 = (μ + γ4(1 + μ)). Hence ¯ ¯ ¯ ¯ t−1 2t−1 × D¯ (1) + D¯ (1) D¯ (0) + D¯ (0) |y −y| ≤ (1 + η) (1 + η1) − 1 |P(0)| ···|P(t − 2)| × | − | ¯ − ··· ¯ ¯ × C˜ (t − 2) + C˜ (t − 2) ··· C˜ (2) + C˜ (2) A(t 1) D(t 2) D(1) D(0) × C˜ (t − 2) ···C˜ (1) C˜ (0) × C˜ (1) + C˜ (1) C˜ (0) + C˜ (0) × |D(t − 2)| ···|D(1)||D(0)||z| × D(t − 2) + D(t − 2) ··· D(2) + D(2) (3t − 2)η ≤ |P(0)||P(1)| ···|P(t − 2)| × D(1) + D(1) D(0) + D(0) z. 1 − (3t − 2)η ¯ ¯ ¯ × |A(t − 1)| D¯ (t − 2) ···D¯ (1) D¯ (0) = s T ¯ = where P(s): 2 block diagonal matrices of P t−s , D(s): 2 s ¯ ˜ = s × C˜ (t − 2) ··· C˜ (1) C˜ (0) 2 computed block diagonal matrices of D2t−s , C(s): 2 ˜ = s computed block diagonal matrices of C2t−s , and D(s): 2 × |D(t − 2)| ···|D(1)||D(0)||z| computed block diagonal matrices of D2t−s . Using the fact ¯ that each D(s) and D(s) have one non-zero entry per row and Hence the result.  following complex arithmetic, we get Lemma 5.1 shows that the forward error bound of the pro- posed scaled DVM algorithm depends on the size of the ma- ¯ ¯ ˜ = , ,..., − D¯ (s) ≤ γ2 D¯ (s) for s = 0, 1,...,t − 2. trices N, norms of the matrices C(k)fork 0 1 t 2, and the computed weights. Also, Lemma 5.2 shows that the and forward error bound of the proposed DVM algorithm depends D(s) ≤ γ2 D(s) for s = 0, 1,...,t − 2. on the size of the matrices N, norms of the matrices D¯ (k), ˜ = , ,..., − By considering the computed weights αk and evaluation at the C(k), and D(k)fork 0 1 t 2, and the computed weight α2 in each step i.e. using (20), we have weights. Thus, the error bound of the proposed scaled DVM and DVM algorithms rapidly increase with the size of the matrices and norms of the powers of matrices. Hence, the ¯ = ¯ +  ¯ , | ¯ |≤μ| ¯ | D(s) D(s) D(s) D(s) D(s) proposed algorithms can not be computed stably for large and matrices. This will further be shown through the numerical D(s) = D(s) + D(s), |D(s)|≤μ|D(s)|. results in Section V-B. 1 α Since A(t − 1) has 2t−1 block diagonal matrices of , B. NUMERICAL RESULTS α2 1 In this section, we state numerical results in connection to we get the stability of the proposed Algorithm 3.10 using MATLAB (R2014a version) with machine 2.2204e-16. For-  − ≤ γ | − | . A(t 1) 4 A(t 1) ward error results are presented by taking the exact solutions as the output of the scaled DVM or DVM algorithm com- By evaluating A(t − 1) at the weight α2 in each step, we puted with the double precision and the computed value as obtain the output of the proposed sdvm(N,α,z)ordvm(N,α,z) algorithms with single precision. We will show numerical A(s) = A(s) + A(s), |A(s)|≤μ|A˜ (s)|. results for matrix sizes from 4 × 4 to 128 × 128 with |α|=1.

72 VOLUME 1, 2020 TABLE 1. Forward Error in Calculating the Scaled DVM and DVM to implement the delays. A microwave transmission line of − πi Algorithms With α = e 32 , Uniformly Distributed Random Input in the length l approximates to sufficient accuracy the time delay Interval (0,1), Say z1 for Each N, and Uniformly Distributed Random Input T where T = αl/c for which α ≤ 1 is the velocity factor of With Real and Imaginary Parts in the Interval (0,1) for Each N, Say z 2 the transmission line. Typically, these transmission lines can be a length of cable of copper track (coplanar waveguide) on a printed circuit board. When the physical size requirements necessitate smaller circuits, transmission lines can be approx- imated using analog all-pass filters that can be implemented using integrated circuits [1], [41]. Unlike analog DVM circuits requiring delays, digital DVM implementations, may either be in software, using computer software realizations where the speeds of operation are rel- We compare the relative forward error e of the proposed atively low (for example, graphics processor units), or in scaled DVM and DVM algorithms defined by custom digital hardware integrated circuits, for high-speed realizations based on very large scale integration. In both  −  = y yˆ 2 , cases, the true time delays found as a basic building block of e   y 2 the DVM algorithm will be approximated using discrete time interpolation filters [27]. For example, various time delays where y = A˜ z or y = A z is the exact solution computed N N can be rational fractions of the digital systems clock sample using the scaled DVM or DVM algorithm, respectively, with period, and can therefore be approximately realized using both double precision and yˆ is the computed solution of the al- finite impulse response digital interpolation filters as well as gorithms sdvm(N,α,z)ordvm(N,α,z), respectively, with infinite impulse response digital interpolation filters. A de- single precision. tailed discussion of the possible approaches for real-time im- Table I shows numerical results for the forward error of the plementation of the DVM algorithms, albeit analog or digital, proposed sdvm(N,α,z) and dvm(N,α,z) algorithms with remains for a future exploration. |α|=1, and random real and complex inputs z1 and z2,re- spectively, of the scaled DVM and DVM, say Err-sDVM and B. LOW-COMPLEXITY ALGORITHMS BASED ON Err-DVM, respectively. π MATRIX APPROXIMATION α = − i ≥ As shown in Table I, when e 32 and N 128, the In several applied contexts, the physics of the problem admits MATLAB output will produce NaN for the forward error an appreciable level of error tolerance. For instance, this is of the proposed algorithms. This is because the nodes will illustrated in the context of still image compression [6], video be repeated and hence the resulting singular matrices while encoding [7], beamforming [34], motion tracking [10], and > the proposed algorithms are executed for N 64. Even if biomedical image processing [8]. Therefore, the exact oper- α = − πi |α|= e 32 ,but 1 and not a root of unity, we have to com- ation of a given matrix computation can be relaxed into an pute very large powers of matrices (recall that we compute approximate calculation that is carefully tailored to demand α powers of companion matrices having large powers of ’s) a lower arithmetic complexity when compared to the original and differences of close numbers. Thus, the entries resulting exact computation. This can be accomplished by deriving an from such operations cannot be represented as conventional approximate matrix based on the exact matrix. floating-point values and hence lead to undefined numerical Approximate matrices can be designed by several methods, values through MATLAB output. This is also evident from including rough inspection, number representation in dyadic the theoretical error bounds obtained in Section V. rationals, and integer optimization, to cite a few. Integer op- timization is often the method of choice due to its generality. VI. FUTURE ENGINEERING TASKS The general framework is described as follows: A. ANALOG AND/OR DIGITAL CIRCUITS THAT REALIZE THE ˆ ∗ = ˆ , DVM ALGORITHM T arg min error(T T) (25) Tˆ ∈MP (N) Engineering applications require the real-time implementa- tion of the DVM algorithm using a variety of computational where T is the matrix to be approximated and Tˆ is a candidate platforms. High-speed applications revolving around wireless matrix defined over a low-complexity matrix set [35]. The communications and radar systems typically necessitate ana- error function is closely linked to the physics of the context log implementations, which operate on analog signals from where the approximation is intended to be applied. Common an array of sensors, such as antennas. These analog imple- error functions are the Frobenius norm or the mean square mentations typically employ approximations to ideal time error [7]. The search space MP(N)isthesetofN × N ma- delays in the signal flow graphs, using techniques such as trices with entries defined over the low-complexity integer transmission line segments, passive resistor-capacitor lattice set P. A particular common choice for the set P includes ={ , ± , ± } 2 filters, or other types of analog delays. Analog realizations, in the set of trivial multiplicands P1 0 1 2 [5] or P1 for their most direct form, utilize microwave transmission lines approximations over complex integers [34].

VOLUME 1, 2020 73 M. PERERA ET AL.: EFFICIENT AND SELF-RECURSIVE DELAY VANDERMONDE ALGORITHM FOR MULTI-BEAM ANTENNA ARRAYS

For the DVM matrices, there are two major approaches TABLE 2. Arithmetic Complexity of the Scaled DVM Algorithm vs Direct for deriving approximations: (i) directly approximating the Computation non-factorized delay Vandermonde matrix by means of solv- ing (25) and (ii) approximating only the non-trivial multipliers in the DVM factorized form (Corollary 3.2). The former ap- proach has the advantage of being less restrictive, but a fast algorithm (factorization) of the obtained approximation is left to be derived. On the other hand, by approximating from the factorized form, one has the fast algorithm readily available by construction, however the derived approximation is tied to the particular structure of the considered factorization. As demonstrated in the context of trigonometric discrete trans- forms, approximations lead to a tunable trade-off between TABLE 3. Arithmetic Complexity of the DVM Algorithm vs Direct performance and arithmetic complexity, often resulting in dra- Computation matic reductions in computational cost. DVM matrices could benefit from similar strategies.

VII. CONCLUSION We have proposed an efficient and self-recursive DVM algo- rithm having sparse factors. Arithmetic complexities of the proposed algorithm are provided to show that the proposed algorithm is much more efficient than the direct computation of DVM by a vector. The theoretical error bound on comput- ing the proposed algorithm is established. Numerical results of the forward relative error are utilized to analyze the stability of the proposed algorithm. The proposed algorithm lowers the APPENDIX B computational complexity of the computation of N parallel RF FREQUENTLY USED ABBREVIATIONS AND NOTATIONS beams using an array of antennas as detailed in the preceding A. ABBREVIATIONS analysis. Engineering approaches to real-time implementation of the proposed fast algorithms generally take two forms: 1) Cyberphysical systems CPS analog implementations, employing signals which are contin- Delay Vandermonde matrix DVM uous in time, continuous in their range, and free of aliasing Discrete Fourier transform DFT and quantization effects, and 2) digital implementations, em- FFT ploying signals which are discrete in time (i.e., sampled se- Forward error of the DVM algorithm Err-DVM quences), discrete in their range (e.g., quantized to be in a set Forward error of the scaled DVM algorithm Err-sDVM of known values), and therefore susceptible to both aliasing Integrated circuit IC and quantization noise. Typically, discrete domain signals are Internet of things IoT processed using digital electronics and software, while analog mm wave mmW signals are processed using RF integrated circuits, microwave- multiple-input multiple-output MIMO and mm-wave passive circuits, or photonic integrated circuits. radio-frequency RF The algorithms proposed here are agnostic to the type of im- Scaled delay Vandermonde matrix sDVM plementation and lend themselves to all types of engineering Signal to noise ratio SNR approaches based on photonics, analog circuits and digital systems, and hybrids thereof. The potential applications of B. NOTATIONS such systems span emerging 5G/6G wireless networks, wire- less IoT, CPS, radio astronomy instrumentation, radar and Circular frequency ω wireless sensing systems, among others. Companion matrix of p(z) C N 2 N − l 2 1 D˜ N = diag[α ] = APPENDIX A 2 l 0 ADDITION AND MULTIPLICATION COUNTS DVM AN := AN,α = αkl N,N−1 The explicit addition and multiplication counts of the pro- [ ]k=1,l=0 posed scaled DVM and DVM algorithms (proved in Lemma DVM algorithm dvm(N,α,z) 4.1 and 4.2) opposed to the direct matrix-vector (which we Node α ≡ e− jωτ call as the Direct Add and Direct Multi) computation are Number of additions in computing #a(DV M, N) shown in Tables II and 3). DVM algorithm

74 VOLUME 1, 2020 Number of additions [16] S. G. Johnson and M. Frigo, “A modified split-radix FFT with fewer in computing #a(sDVM, N) arithmetic operations,” IEEE Trans. Signal Process., vol. 55, no. 1, pp. 111–119, Jan. 2007. scaled DVM algorithm [17] K. Kibaroglu, M. Sayginer, and G. M. Rebeiz, “An ultra low-cost 32- Number of multiplications element 28 GHz phased-array transceiver with 41 dBm EIRP and 1.0– in computing #m(DV M, N) 1.6Gbps 16-QAM link at 300 meters,” in Proc. IEEE Radio Freq. Integr. Circuits Symp., 2017, pp. 73–76. DVM algorithm [18] H. Krishnaswamy and H. Hashemi, “A 4-channel 4-beam 24-to-26 GHz Number of multiplications spatio-temporal RAKE radar transceiver in 90nm CMOS for vehicu- in computing #m(sDVM, N) lar radar applications,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2010, pp. 214–215. scaled DVM algorithm [19] X. Li et al., “A 39 GHz MIMO transceiver based on dynamic multi- Polynomial with zeros α2k p(z) beam architecture for 5G communication with 150 meter coverage,” in Proc. IEEE/MTT-S Int. Microw. Symp. IMS, 2018, pp. 489–491. Scaled DVM A˜ N := A˜ N,α = αkl N−1 [20] T. Okuyama et al., “Experimental evaluation of digital beamforming [ ]k,l=0 for 5G multi-site massive MIMO,” in Proc. 20th Int. Symp. Wireless Scaled DVM algorithm sdvm(N,α,z) Personal Multimedia Commun., 2017, pp. 476–480. Speed of light c [21] H. Oruc and H. K. Akmaz, “Symmetric functions and the Vandermonde matrix,” J. Comput. Appl. Math., vol. 172, pp. 49–64, 2004. Temporal frequency f [22] H. Oruc and G. M. Phillips, “Explicit factorization of the Vandermonde true-time-delay τ matrix,” Linear Algebra Appl., vol. 315, pp. 113–123, 2000. [23] V. Y. Pan, “Fast approximate computations with Cauchy matrices and polynomials,” Math. Comput., vol. 86, pp. 2799–2826, 2017. [24] V. Y. Pan, “How bad are vandermonde matrices?” SIAM J. Matrix Anal., REFERENCES vol. 37, no. 2, pp. 676–694, 2016. [1] P. Ahmadi, B. Maundy, A. S. Elwakil, L. Belostotski, and A. [25] V. Y. Pan, “Transformations of matrix structures work again,” Linear Madanayake, “A new second-order all-pass filter in 130-nm (CMOS),” Algebra Appl., vol. 465, pp. 1–32, 2015. IEEE Trans. Circuits Syst. II: Express Briefs, vol. 63, no. 3, pp. 249– [26] S. M. Perera et al., “Wideband N-beam arrays with low-complexity 253, Mar. 2016. algorithms and mixed-signal integrated circuits,” IEEE J. Sel. Topics [2] A. H. Aljuhani, T. Kanar, S. Zihir, and G. M. Rebeiz, “A scalable Signal Process., vol. 12, no. 2, pp. 368–382, May 2018. dual-polarized 256-element ku-band SATCOM phased-array transmit- [27] J. G. Proakis and D. K. Manolakis, Digital Signal Processing. London, ter with 36.5 dBW EIRP per polarization,” in Proc. 48th Eur. Microw. U.K.: Pearson, 2007. Conf., 2018, pp. 938–941. [28] K. R. Rao, D.N. Kim, J. J. Hwang, Fast Fourier Transform: Algorithm [3] A. H. Aljuhani, T. Kanar, S. Zihir, and G. M. Rebeiz, “A scalable dual- and Applications. New York, NY, USA: Springer, 2010. polarized 256-element ku-band phased-array SATCOM receiver with ◦ [29] T. S. Rappaport et al., “Millimeter wave mobile communications for 5G ±70 beam scanning,” in Proc. IEEE/MTT-S Int. Microw. Symp. IMS, cellular: It will work!” IEEE Access, vol. 1, pp. 335–349, 2013. 2018, pp. 1203–1206. [30] B. Sadhu et al., “7.2 A 28 GHz 32-element phased-array transceiver IC [4] V. Ariyarathna et al., “Analog approximate-FFT 8/16-beam algo- with concurrent dual polarized beams and 1.4 degree beam-steering res- rithms, architectures and CMOS circuits for 5G beamforming MIMO olution for 5G communication,” in Proc. IEEE Int. Solid-State Circuits transceivers,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, Conf., 2017, pp. 128–129. pp. 466–479, Sep. 2018. [31] B. Sadhu et al., “A 28-GHz 32-element TRX phased-array IC with con- [5] R. E. Blahut, Fast Algorithms for Signal Processing. Cambridge, U.K.: current dual-polarized operation and orthogonal phase and gain control Cambridge Univ. Press, 2010. for 5G communications,” IEEE J. Solid-State Circuits, vol. 52, no. 12, [6] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, “Low-complexity pp. 3373–3391, Dec. 2017. 8×8 transform for image compression,” Electron. Lett., vol. 44, [32] M. Sayginer and G. M. Rebeiz, “An 8-element 2–16 GHz phased array pp. 1249–1250, 2008. receiver with reconfigurable number of beams in SiGe BiCMOS,” in [7] V. Britanak, P. Yip, and K. R. Rao, Discrete Cosine and Sine Trans- Proc. IEEE MTT-S Int. Microw. Symp., 2016, pp. 1–3. forms, New York, NY, USA: Academic, 2007. [33] G. Strang, Introduction to Applied Mathematics. Cambridge, MA, USA: [8] R. J. Cintra, F. M. Bayer, Y. Pauchard, and A. Madanayake, “Signal Wesley-Cambridge, 1986. processing and machine learning for biomedical big data,” in Low- [34] D. Suarez, R. J. Cintra, F. M. Bayer, S. Kulasekera, and A. Madanayake, Complexity DCT Approximations Biomed. Signal Process. Big Data, “Multi-beam RF aperture using multiplierless FFT approximation,” Boca Raton, FL, USA: CRC Press, pp. 151–176, 2018. Electron. Lett., vol. 50, pp. 1788–1790, 2014. [9] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calcu- [35] C. J. Tablada, F. M. Bayer, and R. J. Cintra, “A class of DCT approxima- lation of complex Fourier series,” Math. Comp., vol. 19, pp. 297–301, tions based on the Feig-Winograd algorithm,” Signal Process., vol. 113, 1965. pp. 38–51, 2015. [10] V. A. Coutinho, R. J. Cintra, and F. M. Bayer, “Low-complexity mul- [36] N. Tawa, T. Kuwabara, Y. Maruta, M. Tanio, and T. Kaneko, “28 tidimensional DCT approximations for high-order tensor data decor- GHz downlink multi-user MIMO experimental verification using 360 relation,” IEEE Trans. Image Process., vol. 26, no. 5, pp. 2296–2310, element digital AAS for 5G massive MIMO,” in Proc. 48th Eur. Microw. May 2017. Conf., 2018, pp. 934–937. [11] W. Gautschi and G. Inglese, “Lower bounds for the condition number of [37] E. E. Tyrtyshnikov, “How bad are hankel matrices?” Numerische Math- Vandermonde matrix,” Numerische Mathematik, vol. 52, pp. 241–250, ematik, vol. 67, no. 2, pp. 261–269, 1994. 1988. [38] C. Van Loan, Computational Frameworks for the Fast Fourier Trans- [12] I. Gohberg and V. Olshevsky, “Complexity of multiplication with vec- form. Philadelphia, PA, USA: SIAM, 1992. tors for structured matrices,” Linear Algebra Appl., vol. 202, pp. 163– [39] H. L. Van Trees, Optimum Array Processing: Part IV of Detection, 192, 1994. Estimation, and Modulation Theory, Hoboken, NJ, USA: Wiley, 2002. [13] I. Gohberg and V. Olshevsky, “Fast algorithms with preprocessing for [40] B. Ustundag, K. Kibaroglu, M. Sayginer, and G. M. Rebeiz, “A wide- matrix-vector multiplication problems,” J. Complexity, vol. 10, pp. 411– band high-power multi-standard 23–31 GHz 2×2 quad beamformer 427, 1994. chip in SiGe with > 15 dBm OP1dB per channel,” in Proc. IEEE Radio [14] N. J. Higham, Accuracy and Stability Of Numerical Algorithms. Freq. Integr. Circuits Symp., 2018, pp. 60–63. Philadelphia, PA, USA: SIAM, 1996. [41] C. Wijenayake, Y. Xu, A. Madanayake, L. Belostotski, and L. T. Bruton, [15] J. Hu and Z. Hao, “A two-dimensional beam-switchable patch array “RF analog beamforming fan filters using CMOS all-pass time delay antenna with polarization-diversity for 5G applications,” in Proc. IEEE approximations,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 59, MTT-S Int. Wireless Symp., 2018, pp. 1–3. no. 5, pp. 1061–1073, May 2012.

VOLUME 1, 2020 75 M. PERERA ET AL.: EFFICIENT AND SELF-RECURSIVE DELAY VANDERMONDE ALGORITHM FOR MULTI-BEAM ANTENNA ARRAYS

[42] S. L. Yang, “On the LU factorization of the Vandermonde matrix,” ARJUNA MADANAYAKE (Member, IEEE) received the B.Sc. degree in Discrete Appl. Math., vol. 146, pp. 102–105, 2005. engineering (First Class Hons.) from the University of Moratuwa, Sri Lanka, [43] Y. Yang, O. Gurbuz, and G. M. Rebeiz, “An 8-element 400 GHz phased- in 2001, and the Ph.D. and M.Sc. degrees in electrical engineering from arrayin45nmCMOSSOI,”inProc. IEEE MTT-S Int. Microw. Symp., the University of Calgary, Canada, in 2008 and 2004, respectively. He is 2015, pp. 1–3. an Associate Professor of electrical and computer engineering at Florida [44] L. Zhang and H. Krishnaswamy, “24.2 A 0.1-to-3.1 GHz 4-element International University (FIU) in Miami, Florida. His research interests are MIMO receiver array supporting analog/RF arbitrary spatial filtering,” in one- and multi-dimensional signal processing, antenna arrays, mm-wave in Proc. IEEE Int. Solid-State Circuits Conf., 2017, pp. 410–411. systems, microwave engineering, analog circuits, integrated circuit design, [45] S. Zihir, O. D. Gurbuz, A. Karroy, S. Raman, and G. M. Rebeiz, “A FPGA architectures and VLSI realizations of algorithms. His research is 60 GHz 64-element wafer-scale phased-array with full-reticle design,” supported by the Defense Advanced Research Projects Agency (DARPA), the in Proc. IEEE MTT-S Int. Microw. Symp., 2015, pp. 1–3. Office of Naval Research (ONR), and the US National Science Foundation (NSF).

SIRANI M. PERERA received the B.Sc. degree in mathematics (First Class Hons.) from the University Sri Jayewardenepura, Sri Lanka, in 2004, the RENATO J. CINTRA (Senior Member, IEEE) received the D.Sc. degree in Part III Tripos in Mathematics (Hons.) from the University of Cambridge, electrical engineering from the Universidade Federal de Pernambuco (UFPE), U.K., in 2006, and the Ph.D. degree in mathematics from the University of Recife, Brazil. He joined as a Faculty Member of with the Centre for Natural Connecticut, USA, in 2012. She was the first Sri Lankan selected by the Cen- and Exact Sciences, UFPE, in 2005. He has held visiting professor appoint- ter for Mathematical Sciences at the University of Cambridge, UK with full ments at the University of Akron, Ohio, USA; the Département Informa- Shell Centenary Scholarship (under The Cambridge Commonwealth Trust). tique, INSA Lyon, France; the INRIA/IRISA Lab, University of Rennes 1, Since 2015, she has been working as an Assistant Professor in Mathematics France; and the University of Calgary, Canada. He was an Associate Editor at Embry-Riddle Aeronautical University (ERAU), USA. She is working for the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS;andSpringer in the field of Numerical Linear Algebra and Scientific Computing. Her Circuits, Systems, and Signal Processing. Currently he serves as an Associate research scope includes structured matrices (e.g. Vandermonde, Quasisepa- Editor for IET Circuits, Devices and Systems;andJournal of Communication rable, Semiseparable, Bezoutian, and Banded), matrices with displacement and Information Systems; and as Subject Editor for Electronics Letters.His structure, DCT, DST, DFT, FFT, fast and stable algorithms, complexity and long-term topics of research include approximation theory for discrete trans- performance of algorithms, signal processing, and image processing. Perera forms, theory and methods for digital signal/image processing, and proba- is a member of SIAM. Her research is supported the National Science Foun- bilistic methods. dation (NSF).

76 VOLUME 1, 2020