Learning the MMSE Channel Estimator

David Neumann, Thomas Wiese, Student Member, IEEE, and Wolfgang Utschick, Senior Member, IEEE

The authors are with the Professur für Methoden der Signalverarbeitung, Technische Universität München, 80290 München, Germany (email: {d.neumann, thomas.wiese, utschick}@tum.de). A shorter version of this paper was presented at the 21st International ITG Workshop on Smart Antennas (WSA), Berlin, Germany, 2017.

Abstract—We present a method for estimating conditionally Gaussian random vectors with random covariance matrices, which uses techniques from the field of machine learning. Such models are typical in communication systems, where the covariance matrix of the channel vector depends on random parameters, e.g., angles of propagation paths. If the covariance matrices exhibit certain Toeplitz and shift-invariance structures, the complexity of the MMSE channel estimator can be reduced to O(M log M) floating point operations, where M is the channel dimension. While in the absence of structure the complexity is much higher, we obtain a similarly efficient (but suboptimal) estimator by using the MMSE estimator of the structured model as a blueprint for the architecture of a neural network. This network learns the MMSE estimator for the unstructured model, but only within the given class of estimators that contains the MMSE estimator for the structured model. Numerical simulations with typical spatial channel models demonstrate the generalization properties of the chosen class of estimators to realistic channel models.

Index Terms—channel estimation; MMSE estimation; machine learning; neural networks; spatial channel model

I. INTRODUCTION

Accurate channel estimation is a major challenge in the next generation of wireless communication networks, e.g., in cellular massive MIMO [1], [2] or millimeter-wave [3], [4] networks. In setups with many antennas and low signal-to-noise ratios (SNRs), errors in the channel estimates are particularly devastating, because the array gain cannot be fully realized. Since a large array gain is essential in such setups, there is currently a lot of research going on concerning the modeling and verification of massive MIMO and/or millimeter wave channels [5], [6] and the question how these models can aid channel estimation [7].

For complicated stochastic models, the minimum mean squared error (MMSE) estimates of the channel cannot be calculated in closed form. A common strategy to obtain computable estimators is to restrict the estimator to a certain class of functions and then find the best estimator in that class. For example, we could restrict the estimator to the class of linear operators. The linear MMSE (LMMSE) estimator is then represented by the optimal linear operator, i.e., the linear operator that minimizes the mean squared error (MSE). In some special cases, the matrix that represents the optimal linear estimator can be calculated in closed form; in other cases, it has to be calculated numerically. We know that the LMMSE estimator is the MMSE estimator for jointly Gaussian distributed random variables. Nonetheless, it is often used in non-Gaussian settings and performs well for all kinds of distributions that are not too different from a Gaussian.

In the same spirit, we present a class of low-complexity channel estimators, which contain a convolutional neural network (CNN) as their core component. These CNN estimators are composed of convolutions and some simple nonlinear operations. The CNN-MMSE estimator is then the CNN estimator with optimal convolution kernels such that the resulting estimator minimizes the MSE. These optimal kernels have to be calculated numerically, and this procedure is called learning. Just as the LMMSE estimator is optimal for jointly Gaussian random variables, the CNN-MMSE estimator is optimal for a specific idealized channel model (essentially a single-path model as described by the ETSI 3rd Generation Partnership Project (3GPP) [8]). In numerical simulations, we find that the CNN-MMSE estimator works well for the channel models proposed by the 3GPP, even though these violate the assumptions under which the CNN-MMSE estimator is optimal.

Once we have learned the CNN-MMSE estimator from real or simulated channel realizations, the computational complexity required to calculate a channel estimate is only O(M log M) floating point operations (FLOPS). Despite this low complexity, the performance of the CNN-MMSE estimator does not trail far behind that of the unrestricted MMSE estimator, which is very complex to compute. Since the learning procedure is performed off-line, it does not add to the complexity of the estimator.

One assumption of the idealized channel model mentioned above is that the covariance matrices have Toeplitz structure. This assumption has also motivated other researchers to propose estimators that exploit this structure. For example, in [4] and [9], methods from the area of compressive sensing are used to approximate the channel vector as a linear combination of k steering vectors, where k is much smaller than the number of antennas M. With these methods, a complexity of O(M log M) floating point operations can be achieved if efficient implementations are used. Although of similar complexity, the proposed CNN-MMSE estimator significantly outperforms the compressive-sensing-based estimators in our simulations.

In [10], the maximum likelihood estimator of the channel covariance matrix within the class of all positive semi-definite Toeplitz matrices is constructed. This estimated covariance matrix is then used to estimate the actual channel. However, even the low-complexity version of this covariance matrix estimator relies on the solution of a convex program with M variables, i.e., its complexity is polynomial in the number of antennas. There also exists previous work on learning-based channel estimation [11]–[14], but with a completely different focus in terms of system model and estimator design.

In summary, our main contributions are the following:

• We derive the MMSE channel estimator for conditionally normal channel models, i.e., the channel is normally distributed given a set of parameters, which are also modelled as random variables.
• We show how the complexity of the MMSE estimator can be reduced to O(M log M) if the channel covariance matrices are Toeplitz and have a shift-invariance structure.
• We use the structure of this MMSE estimator to define the CNN estimators, which have O(M log M) complexity.
• We describe how the variables of the neural network can be optimized/learned for a general channel model using stochastic gradient methods.
• We introduce a hierarchical learning algorithm, which helps to avoid local optima during the learning procedure.
A. Notation

The transpose and conjugate transpose of X are denoted by X^T and X^H, respectively. The trace of a square matrix X is denoted by tr(X) and the Frobenius norm of X is denoted by ‖X‖_F. Two matrices A, B ∈ C^{M×M} are asymptotically equivalent, which we denote by A ≍ B, if

  lim_{M→∞} ‖A − B‖_F² / M = 0.   (1)

We write exp(x) and |x|² to denote element-wise application of exp(·) and |·|² to the elements of x. The kth entry of the vector x is denoted by [x]_k; similarly, for a matrix X, we write [X]_{mn} to denote the entry in the mth row and nth column. The circular convolution of two vectors a, b ∈ C^M is denoted by a ∗ b ∈ C^M. Finally, diag(x) denotes the square matrix with the entries of the vector x on its diagonal, and vec(X) is the vector obtained by stacking all columns of the matrix X into a single vector.

II. CONDITIONALLY NORMAL CHANNELS

We consider a base station with M antennas, which receives uplink training from a single-antenna terminal. We assume a frequency-flat, block-fading channel, i.e., we get independent observations in each coherence interval. After correlating the received training signals with the pilot sequence transmitted by the mobile terminal, we get observations of the form

  y_t = h_t + z_t,  t = 1, ..., T   (2)

with the channel vectors h_t and additive Gaussian noise z_t ∼ N_C(0, Σ). For the major part of this work, we assume that the noise covariance is a scaled identity Σ = σ² I with known σ². The channel vectors are assumed to be conditionally Gaussian distributed given a set of parameters δ, i.e., h_t|δ ∼ N_C(0, C_δ). In contrast to the fast-fading channel vectors, the covariance matrix C_δ is assumed to be constant over the T channel coherence intervals. That is, T denotes the coherence interval of the covariance matrix in number of channel coherence intervals.

The parameters, which describe, for example, angles of propagation paths, are also considered as random variables with distribution δ ∼ p(δ), which is known. In summary, we have

  y_t | h_t ∼ N_C(h_t, Σ)   (3)

with known noise covariance matrix Σ and hierarchical prior

  h_t | δ ∼ N_C(0, C_δ),  δ ∼ p(δ).   (4)

Example. Conditionally normal channels appear in typical channel models for communication scenarios, e.g., in those defined by the 3GPP for cellular networks [8]. There, the covariance matrices are of the form

  C_δ = ∫_{−π}^{π} g(θ; δ) a(θ) a(θ)^H dθ,   (5)

where g(θ; δ) ≥ 0 is a power density function corresponding to the parameters δ and where a(θ) denotes the array manifold vector of the antenna array at the base station for an angle θ. As an example, in the 3GPP urban micro and urban macro scenarios, g(θ; δ) is a superposition of several scaled density functions of a Laplace-distributed random variable with standard deviations 2° and 5°, respectively. The Laplace density models the scattering of the received power around the center of the propagation path. Its standard deviation is denoted as the per-path angular spread.

III. MMSE CHANNEL ESTIMATION

Our goal is to estimate h_t for each t given all observations Y = [y_1, ..., y_T] and knowledge of the model (3), (4). For a fixed δ, the MMSE estimator can be given analytically, since conditioned on δ, the observation y_t is jointly Gaussian distributed with the channel vector h_t. Also, given the parameters δ, the observations of different coherence intervals are independent. The conditional MMSE estimate of the channel vector h_t is [7], [15]

  E[h_t | Y, δ] = W_δ y_t   (6)

with

  W_δ = C_δ (C_δ + Σ)^{−1}.   (7)

Given the parameters δ, the estimate of h_t only depends on y_t. Since the parameters δ and, thus, the covariance matrix C_δ are unknown random variables, the MMSE estimator for our system model is given by

  ĥ_t = E[h_t | Y]   (8)
      = E[ E[h_t | Y, δ] | Y ]   (9)
      = E[W_δ y_t | Y]   (10)
      = E[W_δ | Y] y_t   (11)
      = Ŵ_⋆(Y) y_t,   (12)

where we use the law of total expectation and (6) to get the final result. We note that the observations are filtered by the MMSE estimate Ŵ_⋆ of the filter W_δ. Hence, the main difficulty of the non-linear channel estimation lies in the calculation of Ŵ_⋆ from the observations Y.
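As a concrete illustration of (6) and (7), the following minimal numpy sketch (our own illustration, not part of the paper's simulation code) computes the conditional MMSE estimate for a known covariance matrix; the randomly generated covariance matrix is only a placeholder.

```python
import numpy as np

def mmse_filter(C_delta, Sigma):
    # W_delta = C_delta (C_delta + Sigma)^{-1}, cf. (7); a matrix inverse is
    # transparent here, a linear solve would be preferable in practice
    return C_delta @ np.linalg.inv(C_delta + Sigma)

# toy example with M = 4 antennas and scaled-identity noise
M, sigma2 = 4, 0.5
A = (np.random.randn(M, M) + 1j * np.random.randn(M, M)) / np.sqrt(2 * M)
C_delta = A @ A.conj().T                      # placeholder PSD covariance matrix
W = mmse_filter(C_delta, sigma2 * np.eye(M))
y = np.random.randn(M) + 1j * np.random.randn(M)
h_hat = W @ y                                 # conditional MMSE estimate, cf. (6)
```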

Using Bayes' theorem to express the posterior distribution of δ as

  p(δ|Y) = p(Y|δ) p(δ) / ∫ p(Y|δ) p(δ) dδ   (13)

we can write the MMSE filter as

  Ŵ_⋆ = ∫ p(δ|Y) W_δ dδ = ∫ p(Y|δ) W_δ p(δ) dδ / ∫ p(Y|δ) p(δ) dδ.   (14)

The MMSE estimation in (12), (14) can be interpreted as follows. We first calculate Ŵ_⋆ as a convex combination of filters W_δ with weights p(δ|Y) for known covariance matrices C_δ and then apply the resulting filter Ŵ_⋆ to the observation.

By manipulating p(Y|δ), we obtain the following expression for the MMSE filter, which shows that Ŵ_⋆ depends on Y only through the scaled sample covariance matrix

  Ĉ = (1/σ²) Σ_{t=1}^{T} y_t y_t^H.   (15)

Lemma 1. If the noise covariance matrix is Σ = σ² I, the MMSE filter Ŵ_⋆ in (14) can be calculated as

  Ŵ_⋆(Ĉ) = ∫ exp( tr(W_δ Ĉ) + T log|I − W_δ| ) W_δ p(δ) dδ / ∫ exp( tr(W_δ Ĉ) + T log|I − W_δ| ) p(δ) dδ   (16)

with Ĉ given by (15) and W_δ given by (7).

Proof. See Appendix A.

Note that the scaled sample covariance matrix Ĉ is a sufficient statistic to calculate the MMSE filter Ŵ_⋆. Moreover, if we define Ĥ = [ĥ_1, ..., ĥ_T], we see from Ĥ = Ŵ_⋆(Ĉ) Y that we use all data to construct the sample covariance matrix Ĉ and the filter W_δ and then apply the resulting filter to each observation individually to calculate the channel estimate. This structure is beneficial for applications in which we are only interested in the estimate of the most recent channel vector. In such a case, we can apply an adaptive method to track the scaled sample covariance matrix. That is, given the most recent observation y, we apply the update

  Ĉ ← α Ĉ + β y y^H   (17)

with suitable α, β > 0 and then calculate the channel estimate

  ĥ = Ŵ_⋆(Ĉ) y.   (18)

IV. MMSE ESTIMATION AND NEURAL NETWORKS

For arbitrary prior distributions p(δ), the MMSE filter as given by Lemma 1 cannot be evaluated in closed form. To make the filter computable, we need the following assumption.

Assumption 1. The prior p(δ) is discrete and uniform, i.e., we have a grid {δ_i : i = 1, ..., N} of possible values for δ and

  p(δ_i) = 1/N,  ∀ i = 1, ..., N.   (19)

Under this assumption, we can evaluate the MMSE estimator of W_δ as

  Ŵ_GE(Ĉ) = [ (1/N) Σ_{i=1}^{N} exp( tr(W_{δ_i} Ĉ) + b_i ) W_{δ_i} ] / [ (1/N) Σ_{i=1}^{N} exp( tr(W_{δ_i} Ĉ) + b_i ) ]   (20)

where W_{δ_i} is obtained by evaluating (7) for δ_i and

  b_i = T log|I − W_{δ_i}|.   (21)

If Assumption 1 does not hold, e.g., if p(δ) describes a continuous distribution, expression (20) is only approximately true if the grid points δ_i are chosen as random samples from p(δ). In this case, the estimator (20) is a heuristic, suboptimal estimator, which neglects that the true distribution of δ is continuous. We refer to this estimator as the gridded estimator (GE). By the law of large numbers, the approximation error vanishes as the number of samples N is increased, but this also increases the complexity of the channel estimation.

We can improve the performance of the estimator for a fixed N by interpreting W_{δ_i} and b_i as variables that can be optimized instead of using the values in (7) and (21). This is the idea underlying the learning-based approaches, which we describe in the following.

Let us first analyze the structure of the gridded estimator. If we consider the vectorization of Ŵ_GE(Ĉ), i.e.,

  vec(Ŵ_GE(Ĉ)) = A_GE exp(A_GE^T vec(Ĉ) + b) / ( 1^T exp(A_GE^T vec(Ĉ) + b) )   (22)

where A_GE = [vec(W_{δ_1}), ..., vec(W_{δ_N})] ∈ C^{M²×N} and b = [b_1, ..., b_N], we see that the function (20) can be visualized as the block diagram shown in Fig. 1. A slightly more general structure is depicted in Fig. 2, which is readily identified as a common structure of a feed-forward neural network (NN) with two linear layers, which are connected by a nonlinear activation function.

Fig. 1. Block diagram of the gridded estimator Ŵ_GE.

Fig. 2. Neural network with two layers and activation function φ(x).

The gridded estimator Ŵ_GE is a special case of the neural network in Fig. 2, which uses the softmax function

  φ(x) = exp(x) / (1^T exp(x))   (23)

as activation function and the specific choices A^(1) = A_GE^T, A^(2) = A_GE, b^(1) = b, and b^(2) = 0 for the variables.

To formulate the learning problem mathematically, we define the set of all functions that can be represented by the NN in Fig. 2 as

  W_NN = { f: C^{M²} → C^{M²},  f(x) = A^(2) φ(A^(1) x + b^(1)) + b^(2),
           A^(1) ∈ C^{N×M²}, A^(2) ∈ C^{M²×N}, b^(1) ∈ C^N, b^(2) ∈ C^{M²} }.   (24)
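For concreteness, a minimal numpy sketch of the gridded estimator (20) follows (our own illustration; the list W_list of precomputed filters W_{δ_i} and the offsets b from (21) are assumed to be available):

```python
import numpy as np

def gridded_estimator(C_hat, W_list, b):
    # log-weights tr(W_i C_hat) + b_i of (20); for Hermitian W_i and C_hat the
    # trace is real, so discarding the imaginary part only removes round-off
    logits = np.array([np.einsum('ij,ji->', W, C_hat).real for W in W_list]) + b
    logits -= logits.max()                 # stabilize the softmax numerically
    weights = np.exp(logits)
    weights /= weights.sum()
    # convex combination of the per-grid-point filters, cf. (14) and (20)
    return np.tensordot(weights, np.stack(W_list), axes=1)
```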

The MSE of a given estimator Ŵ(·), which takes the scaled sample covariance matrix Ĉ as input, is given by

  ε(Ŵ(·)) = E[ ‖H − Ŵ(Ĉ) Y‖_F² ].   (25)

The optimal neural network, i.e., the NN-MMSE estimator, is given as the function in the set W_NN that minimizes the MSE,

  vec(Ŵ_NN(·)) = arg min_{vec(Ŵ(·)) ∈ W_NN} ε(Ŵ(·)).   (26)

Since we assume that the dimension N and the activation function φ(·) are fixed, the variational problem in (26) is simply an optimization over the variables A^(ℓ) and b^(ℓ), ℓ = 1, 2.

If we choose the softmax function as activation function, and if Assumption 1 is fulfilled, we have

  ε(Ŵ_GE(·)) = ε(Ŵ_NN(·)) = ε(Ŵ_⋆(·))   (27)

since, in this case, the gridded estimator is the MMSE estimator, Ŵ_GE(·) = Ŵ_⋆(·), and because vec(Ŵ_GE(·)) ∈ W_NN. In general, we have the relation

  ε(Ŵ_GE(·)) ≥ ε(Ŵ_NN(·)) ≥ ε(Ŵ_⋆(·)).   (28)

The optimization problem (26) is a typical learning problem for a neural network, with a slightly unusual cost function. Due to the expectation in the objective function, we have to revert to stochastic gradient methods to find (local) optima for the variables of the neural network. Unlike the gridded estimator (20), which relies on analytic expressions for the covariance matrices C_δ, the neural network estimator merely needs a large data set {(H_1, Y_1), (H_2, Y_2), ...} of channel realizations and corresponding observations to optimize the variables. In fact, we could also take samples of channel vectors and observations from a measurement campaign to learn the NN-MMSE estimator for the "true" channel model. This requires that the SNR during the measurement campaign is significantly larger than the SNR in operation. If, as assumed, the noise covariance matrix is known, the observations can then be generated by adding noise to the channel measurements.

The basic structure of the NN-MMSE estimator is depicted in Fig. 3. The learning of the optimal variables A^(ℓ) and b^(ℓ) is performed off-line and needs to be done only once. During operation, the channel estimates are obtained by first forming the scaled sample covariance matrix Ĉ, which is then fed into the neural network Ŵ_NN(·). Finally, the output Ŵ_NN(Ĉ) of the neural network is applied as a linear filter to the observations Y to get the channel estimates Ĥ.

Fig. 3. Channel estimator with embedded neural network.

With proper initialization and sufficient quality of the training data, the neural network estimator is guaranteed to outperform the gridded estimator, which has the same computational complexity. However, there are two problems with this learning approach, which we address in the following sections. First, finding the optimal neural network Ŵ_NN is too difficult, because the number of variables is huge and the optimization problem is not convex. Second, even if the optimal variables were known, the computation of the channel estimate ĥ_t = Ŵ_NN(Ĉ) y_t is too complex: Evaluating the output of the neural network Ŵ_NN(Ĉ) needs O(M²N) floating point operations due to the matrix-vector products (cf. Fig. 2). For example, if the grid size N needs to scale linearly with the number of antennas M to obtain accurate estimates, the computational complexity scales as O(M³), which is too high for practical applications.

V. LOW-COMPLEXITY MMSE ESTIMATION

With Assumption 1, the gridded estimator Ŵ_GE in (20) is the MMSE estimator. In the following, we introduce additional assumptions, which help to simplify Ŵ_GE. With these assumptions, we get a fast channel estimator, i.e., one with a computational complexity of only O(M log M). Just as for the gridded estimator, the fast estimator is no longer the MMSE estimator if the assumptions are violated. However, in analogy to Sec. IV, the structure of this fast estimator motivates the convolutional neural network (CNN) estimator presented in Sec. VI.

Our approach to reduce the complexity of Ŵ_GE can be broken down into two steps. First, we exploit common structure of the covariance matrices, which occurs for commonly used array geometries. In a second step, we use an approximate shift-invariance structure, which is present in a certain channel model with only a single path of propagation. With those two steps, we reduce the computational complexity from O(M²N) to O(M log M).

A. A Structured MMSE Estimator

In the first step, we replace the filters W_{δ_i} in (20) with structured matrices that use only O(M) variables. Specifically, we make the following assumption.

Assumption 2. The filters W_{δ_i} can be decomposed as

  W_{δ_i} = Q^H diag(w_i) Q   (29)

with a common matrix Q ∈ C^{K×M} and vectors w_i ∈ R^K, where O(K) = O(M).

Note that the requirement O(K) = O(M) ensures the desired dimensionality reduction and w_i ∈ R^K leads to self-adjoint filters. Combining Assumptions 1 and 2, we get the following result.

Theorem 1. Given Assumptions 1 and 2, the MMSE estimator of W_δ simplifies to

  Ŵ_SE(Ĉ) = Q^H diag(ŵ_SE(ĉ)) Q   (30)

where

  ĉ = (1/σ²) Σ_{t=1}^{T} |Q y_t|².   (31)

Moreover, the element-wise filter ŵ_SE is given by

  ŵ_SE(ĉ) = A_SE exp(A_SE^T ĉ + b) / ( 1^T exp(A_SE^T ĉ + b) )   (32)

where the matrix

  A_SE = [w_1, ..., w_N] ∈ R^{K×N}   (33)

contains the element-wise MMSE filters (29), and the entries of the vector

  b = [b_1, ..., b_N]^T ∈ R^N   (34)

are given by (21).

Proof. If we replace the filters W_{δ_i} in (20) with the parametrization in (29), we can simplify the trace expressions according to

  tr(W_{δ_i} Ĉ) = tr( Q^H diag(w_i) Q Ĉ )   (35)
              = tr( diag(w_i) (1/σ²) Σ_{t=1}^{T} Q y_t y_t^H Q^H )   (36)
              = w_i^T ĉ   (37)

as ĉ contains the diagonal elements of the matrix

  (1/σ²) Σ_{t=1}^{T} Q y_t y_t^H Q^H.   (38)

Consequently, the gridded estimator in (20) simplifies to

  Ŵ_SE(ĉ) = [ Σ_{i=1}^{N} exp(w_i^T ĉ + b_i) Q^H diag(w_i) Q ] / [ Σ_{i=1}^{N} exp(w_i^T ĉ + b_i) ]
          = Q^H diag( Σ_{i=1}^{N} exp(w_i^T ĉ + b_i) w_i / Σ_{i=1}^{N} exp(w_i^T ĉ + b_i) ) Q.   (39)

With the definitions of A_SE and b, we can write (39) as (30). ∎

If Assumptions 1 and 2 hold, the MMSE estimates of the channel vectors using the structured estimator (SE) can be calculated as

  ĥ_t = Q^H diag(ŵ_SE(ĉ)) Q y_t,   (40)

i.e., Ŵ_⋆(Ĉ) = Q^H diag(ŵ_SE(ĉ)) Q.

Given ŵ_SE(ĉ), the complexity of the estimator depends only on the number of operations required to calculate matrix-vector products with Q and Q^H. To achieve the desired complexity O(M log M), the matrix Q must have some special structure that enables fast computations. If this is the case, the complexity of the structured estimator is dominated by the calculation of ŵ_SE(ĉ), which is O(NK). In Sec. V-B, we show how the complexity of the calculation of ŵ_SE(ĉ) can be reduced further.
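For the circulant case Q = F of the examples discussed next, (30)–(32) can be evaluated entirely with FFTs. A minimal numpy sketch (our own illustration; A_SE and b as in (33), (34), unitary DFT scaling assumed):

```python
import numpy as np

def structured_estimate(Y, A_SE, b, sigma2):
    # Y: M x T observations; Q = F realized by the FFT (unitary scaling)
    Qy = np.fft.fft(Y, axis=0) / np.sqrt(Y.shape[0])
    c_hat = (np.abs(Qy) ** 2).sum(axis=1) / sigma2      # cf. (31)
    logits = A_SE.T @ c_hat + b                         # softmax input of (32)
    logits -= logits.max()
    p = np.exp(logits)
    p /= p.sum()
    w = A_SE @ p                                        # element-wise filter (32)
    # apply Q^H diag(w) Q to every snapshot, cf. (30) and (40)
    return np.fft.ifft(w[:, None] * Qy, axis=0) * np.sqrt(Y.shape[0])
```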

Examples. For a uniform linear array (ULA), the channel covariance matrices, which have Toeplitz structure, are asymptotically equivalent to corresponding circulant matrices (cf. [16], Appendix B). Since all circulant matrices have the columns of the discrete Fourier transform (DFT) matrix F as eigenvectors, we have the asymptotic equivalence

  C_δ ≍ F^H diag(c_δ) F  ∀δ   (41)

where c_δ contains the diagonal elements of F C_δ F^H. As a consequence, we have a corresponding asymptotic equivalence

  W_δ ≍ F^H diag(w_δ) F  ∀δ   (42)

where w_δ contains the diagonal elements of F W_δ F^H. For a large-scale system, this is a very good approximation [17]. We call the structured estimator that uses Assumption 2 with Q = F the circulant estimator.

To reduce the approximation error for finite numbers of antennas, we can use a more general factorization with Q = F_2, where F_2 ∈ C^{2M×M} contains the first M columns of a 2M × 2M DFT matrix. The class of matrices that can be expressed as

  W_δ = F_2^H diag(w_δ) F_2   (43)

are exactly the Toeplitz matrices [17]. Note that the filters W_δ do not actually have Toeplitz structure, even if the channel covariance matrices are Toeplitz matrices. The Toeplitz assumption only holds in the limit for large numbers of antennas due to the arguments given above, or for low SNR, when the noise covariance matrix dominates the inverse in (7). Nevertheless, the Toeplitz structure is more general than the circulant structure and, thus, yields a smaller approximation error. The estimator that uses Assumption 2 with Q = F_2 is denoted as the Toeplitz estimator.

An analogous result can be derived for uniform rectangular arrays (cf. Appendix C). In this case, the transformation Q is the Kronecker product of two DFT matrices, whose dimensions correspond to the number of antennas in both directions of the array.

A third example with a decomposition as in Assumption 2 is a setup with distributed antennas [18], [19]. For distributed antennas, the covariance matrices are typically modelled as diagonal matrices and, thus, the filters W_δ are diagonal as well. That is, for distributed antennas we simply have Q = I.
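The statement that (43) parametrizes exactly the Toeplitz matrices is easy to verify numerically; a small sketch of such a check (our own illustration):

```python
import numpy as np

M = 8
F2M = np.fft.fft(np.eye(2 * M)) / np.sqrt(2 * M)   # unitary 2M-point DFT matrix
Q = F2M[:, :M]                                     # first M columns -> F_2 in (43)
w = np.random.rand(2 * M)                          # real filter coefficients
W = Q.conj().T @ np.diag(w) @ Q                    # candidate filter, cf. (43)
# Toeplitz check: every entry depends only on the index difference m - n
T_rec = np.array([[W[i - j, 0] if i >= j else W[0, j - i]
                   for j in range(M)] for i in range(M)])
assert np.allclose(W, T_rec)
```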

B. A Fast MMSE Estimator

The main complexity in the evaluation of ŵ_SE(·) stems from the matrix-vector products in (32). The complexity can be reduced by using only matrices A_SE that allow for fast matrix-vector products. One possible choice are the circulant matrices.

In fact, circulant matrices naturally arise in the structured estimator for a single-path channel model with a single parameter δ for the angle of arrival. In this model, the power spectrum is shift-invariant, i.e., g(θ; δ) = g(θ − δ). As a result, for N = K, the samples w_i of the structured estimator ŵ_SE in (32) are approximately shift-invariant, i.e., their entries satisfy [w_i]_j = [w_{i+n}]_{j+n} (the sums are modulo M) and the following assumption is satisfied (more details are given in Appendix D).

Assumption 3. The matrix A_SE in (33) is circulant and given by

  A_SE = F^H diag(F w_0) F   (44)

for some w_0 ∈ R^K, where F is the K-dimensional DFT matrix.

Note that Assumption 3 is, in principle, independent of Assumption 2. We see from the examples that the structure of W_δ and, thus, the choice for Q is motivated by the array geometry, while the assumption that A_SE is circulant is motivated by the physical channel model. The example in Appendix D, which is based on the ULA geometry and the 3GPP channel model, suggests a circulant structure for both the filters W_δ and the matrix A_SE in ŵ_SE(·).

However, we could think of other system setups, where the structure of W_δ in Assumption 2 is different from the structure of A_SE in Assumption 3. As an illustration, consider a toy example where we have an array of antennas along a long corridor, say in an airplane. Then we could reasonably assume diagonal covariance matrices, i.e., Q = I, but at the same time we have a shift invariance for different positions of the users in the corridor, i.e., Assumption 3 also holds.

Given the relationship between circulant matrices and circular convolution, we can write

  A_SE x = F^H diag(F a) F x = a ∗ x   (45)

with a ∈ R^K.
Because of the FFT, the computational complexity of evaluating ŵ_SE(ĉ) reduces to O(M log M) if O(K) = O(M). That is, we get a fast estimator (FE)

  ŵ_FE(ĉ) = w_0 ∗ softmax(w̃_0 ∗ ĉ + b)   (46)

by incorporating the constraint (44) into ŵ_SE. The vector w̃_0 contains the entries of w_0 in reversed order.

VI. LOW-COMPLEXITY NEURAL NETWORK

For most channel models, Assumptions 1, 2, and 3 only hold approximately or not at all. That is, the estimator ŵ_FE in (46) does not yield the MMSE estimator in most practical scenarios. Nevertheless, it is still worthwhile to consider an estimator with a similar structure: Calculating a channel estimate with Ŵ_GE costs O(M²N) FLOPS, while using ŵ_FE only requires O(M log M) operations. As we discuss in the following, another advantage is that the number of variables that have to be learned reduces from O(M²N) to O(M), since we no longer have full matrices, but circular convolutions.

In Sec. IV we discussed how learning can be used to compensate for the approximation error that results from a finite grid size N, i.e., a violation of Assumption 1. Analogously, we can learn the variables of a convolutional neural network inspired by ŵ_FE to compensate for violations of Assumptions 2 and 3. To this end, we define the set of CNNs

  W_CNN = { x ↦ a^(2) ∗ φ( a^(1) ∗ x + b^(1) ) + b^(2),  a^(ℓ) ∈ R^K, b^(ℓ) ∈ R^K, ℓ = 1, 2 }   (47)

and the optimal CNN estimator is the one using

  ŵ_CNN(·) = arg min_{ŵ(·) ∈ W_CNN} ε( Q^H ŵ(·) Q ).   (48)

Again, we assume that the activation function φ(·) is fixed. Thus, the optimization is only with respect to the convolution kernels a^(ℓ) and the vectors b^(ℓ).

Analogously to Sec. IV, if we choose the softmax function as activation function, we have ŵ_FE ∈ W_CNN. Consequently, if Assumptions 1–3 are fulfilled, we get

  ε(Q^H ŵ_FE(·) Q) = ε(Q^H ŵ_CNN(·) Q) = ε(Ŵ_⋆(·)).   (49)

In general, we have

  ε(Q^H ŵ_FE(·) Q) ≥ ε(Q^H ŵ_CNN(·) Q) ≥ ε(Ŵ_⋆(·)).   (50)

Algorithm 1 Learned fast MMSE filter
1: Initialize the variables a^(ℓ) and b^(ℓ) randomly
2: Generate/select a mini-batch of S channel vectors H_s and corresponding observations Y_s (and ĉ_s) for s = 1, ..., S
3: Calculate the stochastic gradient
   g = (1/S) Σ_{s=1}^{S} ∂/∂[a^(ℓ); b^(ℓ)] ‖H_s − Q^H diag(ŵ(ĉ_s)) Q Y_s‖_F²
   with ŵ(x) as stated in (46)
4: Update the variables with a gradient algorithm (e.g., [20])
5: Repeat steps 2–4 until a convergence criterion is satisfied

The stochastic-gradient method that learns the CNN is described in detail in Alg. 1. We want to stress again that the learning procedure is performed off-line and does not add to the complexity of the channel estimation. During operation, the channel estimation is performed by evaluating ŵ_CNN(ĉ) and the transformations involving the Q matrix for given observations. If the variables are learned from simulated samples according to the 3GPP or any other channel model, this algorithm suffers from the same model-reality mismatch as does any other model-based algorithm. The fact that the proposed algorithm can also be trained on true channel realizations puts it at a significant advantage over other non-learning-based algorithms, which have to rely on models only.
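A compact PyTorch sketch of one update of Alg. 1 for the circulant case Q = F follows (our own illustrative reimplementation; the published simulation code [30] may be organized differently):

```python
import torch

K = 128                                   # kernel length; with Q = F we have K = M
params = [torch.randn(K, dtype=torch.float64) * 0.1 for _ in range(4)]
for p in params:
    p.requires_grad_(True)
a1, b1, a2, b2 = params
opt = torch.optim.Adam(params, lr=1e-3)

def circ_conv(a, x):
    # circular convolution a * x realized via the FFT, cf. (45)
    return torch.fft.ifft(torch.fft.fft(a) * torch.fft.fft(x)).real

def w_cnn(c_hat):
    # two-layer CNN from the set (47), here with a ReLU activation
    hidden = torch.relu(circ_conv(a1, c_hat) + b1)
    return circ_conv(a2, hidden) + b2

def training_step(c_hat, y, h):
    # one stochastic-gradient update of Alg. 1 for a single snapshot (S = 1)
    w = w_cnn(c_hat)
    h_est = torch.fft.ifft(w * torch.fft.fft(y))   # Q^H diag(w) Q y with Q = F
    loss = (h_est - h).abs().pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```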

In the simulations, we compare two variants of the CNN estimator. First, we use the softmax activation function φ = exp(·)/(1^T exp(·)). The resulting softmax CNN estimator is a direct improvement over the fast estimator with ŵ_FE, which was derived under Assumptions 1–3. In the second variant, we use a rectified linear unit (ReLU) φ(x) = [x]_+ as activation function, since ReLUs were found to be easier to train than other activation functions [21] (and they are also easier to evaluate than the softmax function).

A. Hierarchical Learning

Local optima are a major issue when learning the neural networks, i.e., when calculating a solution of the nonlinear optimization problem (48). During our experiments, we observed that, especially for a large number of antennas, the learning often gets stuck in local optima. To deal with this problem, we devise a hierarchical learning procedure that starts the learning with a small number of antennas and then increases the number of antennas step by step.

For the single-path channel model, which motivates the circulant structure of the matrices A^(ℓ), the convolution kernel w_0 contains samples of the continuous function w(u; 0), i.e., [w_0]_k = w(2π(k − 1)/K; 0) (cf. Appendix D). If we assume that w(u; 0) is a smooth function, we can quite accurately calculate the generating vector w_0 for a system with M antennas from the corresponding vector of a system with fewer antennas by commonly used interpolation methods.

This observation inspires the following heuristic for initializing the variables a^(ℓ) and b^(ℓ) of a K-dimensional CNN. We first learn the variables of a smaller CNN, e.g., we choose a CNN with dimension K/2. We use the resulting variables to initialize every second entry of the vectors a^(ℓ) and b^(ℓ). The remaining entries can be obtained by numerical interpolation.

For the filter ŵ_CNN(·), it is desirable to have outputs of similar magnitude, irrespective of the dimension K. By doubling the number of entries of the convolution kernels via interpolation, we approximately double the largest absolute value of a^(ℓ) ∗ x. To remedy this issue, we normalize the kernels of the convolution after the interpolation such that we get approximately similar values at the outputs of each layer. This heuristic leads to the hierarchical learning described in Alg. 2.

Algorithm 2 Hierarchical Training
1: Choose a factor β > 1 and a number of stages n
2: Set M_0 = ⌈M/β^n⌉, K_0 = ⌈K/β^n⌉
3: Learn optimal a_0^(ℓ), b_0^(ℓ) ∈ R^{K_0} using Alg. 1, assuming M_0 antennas, with random initializations
4: for i from 1 to n do
5:   Set M_i = ⌈M/β^{n−i}⌉ and K_i = ⌈K/β^{n−i}⌉
6:   Interpolate a_i^(ℓ), b_i^(ℓ) ∈ R^{K_i} from a_{i−1}^(ℓ), b_{i−1}^(ℓ) ∈ R^{K_{i−1}}
7:   Normalize a_i^(ℓ) by dividing by β
8:   Learn optimal a_i^(ℓ), b_i^(ℓ) using Alg. 1, assuming M_i antennas and using a_i^(ℓ), b_i^(ℓ) as initializations
9: end for

The hierarchical learning significantly improves convergence speed and reduces the computational complexity per iteration due to the reduced number of antennas in many learning steps. In fact, for a large number of antennas the hierarchical learning is essential to obtain good performance.

In Fig. 4, we show a standard box plot [22] of the MSE of the estimators obtained by applying the hierarchical and the standard learning procedure. Each data point used to generate the box plot corresponds to a randomly initialized estimator and one run of Alg. 2 with β = 2 and n = 3 for the hierarchical learning and n = 0 for the non-hierarchical learning (and the same total number of iterations). The box plot depicts a summary of the resulting distribution, showing the median and the quartiles in a box and outliers outside of the whiskers as additional dots. The whiskers are at the position of the lowest and highest data point within a distance from the box of 1.5 times the box size. As we can see, without the hierarchical learning, the learning procedure gets stuck in local optima. With the hierarchical approach, we are less likely to be caught in local optima during the learning process.
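A minimal numpy sketch of the interpolation and normalization in steps 6 and 7 of Alg. 2 (our own illustration; for simplicity it assumes the kernel sizes are exact multiples of β):

```python
import numpy as np

def upscale_kernel(a_small, beta=2):
    # step 6: interpolate a kernel from K/beta to K points; the kernels live on
    # a circular grid, so we interpolate with period 1
    K_small = a_small.size
    K = beta * K_small
    grid_small = np.arange(K_small) / K_small
    grid = np.arange(K) / K
    a = np.interp(grid, grid_small, a_small, period=1.0)
    return a / beta                      # step 7: renormalize the kernel
```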
VII. RELATED WORK

In this section, we give a short summary of two alternative channel estimation methods with O(M log M) complexity. These methods will serve as a benchmark in the numerical evaluation of our novel algorithms.

A. ML Covariance Matrix Estimation

The common approach to approximate MMSE channel estimation for unknown covariance matrices is to use a maximum likelihood (ML) estimate of the channel covariance matrix. That is, we first find an ML estimate of the channel covariance matrix C_δ^ML based on the observations Y and then, assuming the estimate is exact, calculate the MMSE estimates of the channel vectors as in (6), (7). The disadvantage of ML estimation is that a general prior p(δ) cannot be incorporated. The likelihood function for the channel covariance matrix given the noise covariance matrix is

  L(C_δ | Y) = (1/π^{MT}) exp( − Σ_{t=1}^{T} y_t^H (C_δ + Σ)^{−1} y_t − T log|C_δ + Σ| )   (51)

and the ML problem reads as

  C_δ^ML = arg max_{C_δ ∈ M} L(C_δ | Y)   (52)

where M is the set of admissible covariance matrices, which has to be included in the set of positive semi-definite matrices S_0^+, i.e., M ⊂ S_0^+.

If M = S_0^+, the ML estimate is given in terms of the sample covariance matrix

  S = (1/T) Σ_{t=1}^{T} y_t y_t^H   (53)

as

  C_δ^ML = Σ^{1/2} P_{S_0^+}( Σ^{−1/2} S Σ^{−1/2} − I ) Σ^{1/2}   (54)

where we use the projection P_{S_0^+}(·) onto the cone of positive semi-definite matrices [23]. The projection P_{S_0^+}(X) of a Hermitian matrix X replaces all negative eigenvalues of X with zeros. For Σ = σ² I, the estimate simplifies to

  C_δ^ML = P_{S_0^+}( Ĉ − σ² I ).   (55)
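A minimal numpy sketch of the projection in (54), (55) via an eigendecomposition (our own illustration):

```python
import numpy as np

def ml_covariance(C_hat, sigma2):
    # P_{S0+}(C_hat - sigma2 I): clip negative eigenvalues, cf. (54), (55)
    eigval, eigvec = np.linalg.eigh(C_hat - sigma2 * np.eye(C_hat.shape[0]))
    return (eigvec * np.maximum(eigval, 0.0)) @ eigvec.conj().T
```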

Fig. 4. Box plot with outliers (marked as dots) of the MSE after learning for 10 000 iterations for hierarchical and non-hierarchical learning. We show results for M = 64 and M = 128 antennas for 50 data points per plot and with the DFT matrix Q = F for the transformation. Scenario with three propagation paths, σ² = 1, T = 1.

Low-Complexity ML Estimation: If we have a ULA at the base station, we know that the covariance matrix has to be a Toeplitz matrix. Thus, we should choose M = T_0^+ as the set of positive semi-definite Toeplitz matrices. In this case, the ML estimate can no longer be given in closed form, and iterative methods have to be used [10], [24], [25].

Since we are interested in low-complexity estimators, we approximate the solution by reducing the constraint set to positive semi-definite, circulant matrices M = C^+. This choice reduces the complexity of the ML estimator significantly [23]. The reason is that all circulant matrices have the columns of the DFT matrix F as eigenvectors. That is, we can parametrize the ML estimate as

  C_δ^ML = F^H diag(c_δ^ML) F   (56)

where c_δ^ML ∈ R^M contains the M eigenvalues of C_δ^ML.

Incorporating (56) into the likelihood function (51), we notice that the estimate of the channel covariance matrix can be given in terms of the estimated power spectrum [23], [26]

  s = (1/T) Σ_{t=1}^{T} |F y_t|²   (57)

where |x|² is the vector of absolute squared entries of x. Specifically, we have the estimated eigenvalues

  c_δ^ML = [s − σ² 1]_+   (58)

where the ith element of [x]_+ is max([x]_i, 0) and where 1 is the all-ones vector. The approximate MMSE estimate of the channel vector in coherence interval t is given by

  ĥ_t = F^H diag(c_δ^ML) diag(c_δ^ML + σ² 1)^{−1} F y_t   (59)

and can be calculated with a complexity of O(M log M) due to the FFT. The almost linear complexity makes the ML approach with the circulant approximation suitable for large-scale wireless systems.
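A minimal numpy sketch of the complete circulant ML estimator (57)–(59) (our own illustration; unitary FFT scaling assumed):

```python
import numpy as np

def ml_circulant_estimate(Y, sigma2):
    # Y: M x T observations; the DFT F is realized by the FFT (unitary scaling)
    Fy = np.fft.fft(Y, axis=0) / np.sqrt(Y.shape[0])
    s = (np.abs(Fy) ** 2).mean(axis=1)       # estimated power spectrum, cf. (57)
    c = np.maximum(s - sigma2, 0.0)          # estimated eigenvalues, cf. (58)
    w = c / (c + sigma2)                     # per-bin Wiener weights, cf. (59)
    return np.fft.ifft(w[:, None] * Fy, axis=0) * np.sqrt(Y.shape[0])
```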

B. Compressive Sensing Based Estimation

The ML-based channel estimation techniques exploit the Toeplitz structure of the covariance matrix, which is a result of regular array geometries and the model (5) with a continuous power density function g. In the 3GPP models, this power density function usually has a very limited angular support, i.e., g(θ, δ) is approximately zero except for θ in the vicinity of the cluster centers δ. The resulting covariance matrices have a very low numerical rank [27]. As a consequence, under such a model, any given realization of a channel vector admits a sparse approximation

  h ≈ D x   (60)

in a given dictionary D ∈ C^{M×Q}, where all but k entries of x are zero. The vector x can be found by solving the sparse approximation problem

  x = arg min_{x ∈ C^Q : |supp(x)| ≤ k} ‖y − D x‖²   (61)

where |supp(x)| denotes the number of nonzero entries of x. This combinatorial optimization problem can be solved efficiently with methods from the area of compressive sensing, e.g., the orthogonal matching pursuit (OMP) algorithm [28] or iterative hard thresholding (IHT) [29].

It is common to use a dictionary D of steering vectors a(θ), where θ varies between −π and π on a grid [4]. For ULAs, this grid can be chosen such that the dictionary D results in an oversampled DFT matrix, which has the advantage that matrix-vector products with this matrix can be computed efficiently. Furthermore, it was shown in [27] that this dictionary is a reasonable choice, at least for the single-cluster 3GPP model, and if the OMP algorithm is used to find the sparse approximation.

The OMP algorithm can be extended to a multiple measurement model

  H ≈ D X   (62)

with a row-sparse matrix X, i.e., each channel realization is approximated as a linear combination of the same k dictionary vectors. Because the selection of the optimal sparsity level k is non-trivial, we use a genie-aided approach in our simulations. The genie-aided OMP algorithm uses the actual channel realizations H to decide about the optimal value for k that maximizes the metric of interest. The result is clearly an upper bound for the performance of the OMP algorithm.
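A minimal numpy sketch of OMP for (61) with an oversampled DFT dictionary (our own illustration; the 4× oversampling matches the choice in Sec. VIII, the exact grid construction is an assumption):

```python
import numpy as np

def omp(y, D, k):
    # orthogonal matching pursuit: greedily select k dictionary columns
    r, support, x_s = y.copy(), [], None
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.conj().T @ r))))
        Ds = D[:, support]
        x_s, *_ = np.linalg.lstsq(Ds, y, rcond=None)  # LS fit on the support
        r = y - Ds @ x_s
    return support, x_s

# 4x oversampled DFT dictionary of ULA steering vectors (uniform grid in sin(theta))
M = 32
u = np.linspace(-1.0, 1.0, 4 * M, endpoint=False)
D = np.exp(1j * np.pi * np.outer(np.arange(M), u)) / np.sqrt(M)
```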

VIII. SIMULATIONS

For the numerical evaluation of the newly introduced algorithms, we focus on the far-field model with a ULA at the base station (cf. Appendix B). We assume that the noise variance σ² and the correct model for the parameters, i.e., the prior p(δ) and the mapping from δ to C_δ, are known. That is, for the off-line learning procedure required by the CNN estimators, we can use the true prior to generate the necessary realizations of channel vectors and observations.

We first consider the single-path model that motivates Assumption 3 (cf. App. D). Even for this idealized model, Assumptions 1–3 only hold approximately. To compare the simple gridded estimator (GE) Ŵ_GE (Assumption 1) with the structured estimator (SE) ŵ_SE (Assumptions 1 and 2) and the fast estimator (FE) ŵ_FE (Assumptions 1–3), we first generate N = 16M samples δ_i ∈ [−π, π] according to a uniform distribution (single-path model). We then evaluate the covariance matrices C_{δ_i} and the MMSE filters W_{δ_i} according to (5) and (7) with a Laplace power density (88) with an angular spread of σ_AS = 2°.

The gridded estimator Ŵ_GE(·) is then given by (20). We have chosen N sufficiently large such that, for the single-path model, the performance of the GE is close to the performance of the (non-gridded) MMSE estimator. For the SE that uses ŵ_SE(·), we consider circulant and Toeplitz structure, i.e., we use the DFT matrix Q = F for the circulant SE and the partial DFT matrix Q = F_2 for the Toeplitz SE, as explained in the examples at the end of Sec. V-A. The coefficients w_{δ_i} in (29) are found by solving a least-squares problem, and the respective estimators can then be evaluated as specified in Theorem 1. Since we have a finite number of antennas, we expect a performance loss compared to Ŵ_GE due to the violation of Assumption 2. For the fast estimator that uses ŵ_FE in (46), we use a circulant structure Q = F, and we only need to calculate w_0 for δ = 0, since the matrices A_SE in ŵ_SE are replaced by circulant convolutions.

As a baseline, we also show the MSE for the genie-aided MMSE estimator, which simply uses W_δ for the correct δ. The per-antenna MSE of the channel estimation for the different approximations for a single snapshot (T = 1) is depicted in Fig. 5 as a function of the number of antennas M for a fixed SNR of 0 dB. We see, indeed, a gap between the gridded estimator and the two structured estimators. As expected, the Toeplitz SE outperforms the circulant SE. For this scenario, the FE yields performance close to the circulant SE. Apparently, the assumption of shift invariance is reasonably accurate.

Fig. 5. MSE per antenna at an SNR of 0 dB for estimation from a single snapshot (T = 1). Channel model with one propagation path with uniformly distributed angle and a per-path angular spread of σ_AS = 2°.

For a large number of antennas, the relative difference in performance of the algorithms diminishes and all algorithms get quite close to the genie-aided estimator. We see that for this simple channel model, there is not much potential for our learning-based methods. However, the results change significantly for a more realistic channel model.
In the following, we consider results for the 3GPP model with three propagation paths, which have different relative path gains. That is, the power density is given by

  g_3p(θ, δ = [δ_1, δ_2, δ_3, p_1, p_2, p_3]^T) = Σ_{i=1}^{3} p_i g_lp(θ, δ_i)   (63)

where the angles δ_i are uniformly distributed. The path gains p_i are drawn from a uniform distribution in the interval [0, 1] and then normalized such that Σ_i p_i = 1. The angular spread of each path is still σ_AS = 2°.

In Figs. 6 and 7, we show the resulting normalized MSE for the numerical simulation with three propagation paths. We see that the gap between the fast estimator and the Toeplitz SE is much larger than in Fig. 5. The fast estimator does not perform well in this scenario, as the shift-invariance assumption is lost when the model contains more than one propagation path.

This is where the learning-based estimators shine, since they can potentially compensate for inaccurate assumptions. We distinguish between the CNN estimator using the softmax activation function and the one with a rectified linear unit (ReLU) as activation function. In both cases, we only show results for Q = F_2, since using Q = F led to consistently worse results. We ran the hierarchical learning procedure described in Sec. VI-A for 10 000 iterations with mini-batches of 20 samples generated from the channel model. We also include results for the ML estimator and the genie-aided OMP algorithm discussed in Sections VII-A and VII-B, respectively. For the OMP algorithm we use a four-times oversampled DFT matrix as dictionary D.

The performance of the softmax CNN estimator shows that it is, indeed, a good idea to use optimized variables instead of plug-in values that were derived under assumptions that fail to hold. It is astonishing that the ReLU-CNN estimator, which has the same computational complexity as the softmax-CNN estimator, significantly outperforms all other estimators of comparable complexity. In fact, the ReLU-CNN estimator even outperforms the more complex Toeplitz SE estimator. This can be explained by the fact that, compared to the single-path model, the number of parameters δ is increased and the choice of N = 16M samples no longer guarantees a small gridding error.
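A small numpy sketch of how such three-path parameter realizations can be drawn (our own illustration; the Laplace density is used unnormalized, as in (88)):

```python
import numpy as np

def sample_three_path(rng, sigma_as=np.deg2rad(2.0)):
    angles = rng.uniform(-np.pi, np.pi, size=3)    # uniformly distributed centers
    gains = rng.uniform(0.0, 1.0, size=3)
    gains /= gains.sum()                           # normalize so the p_i sum to one
    def g3p(theta):
        # power density (63) built from wrapped Laplace densities, cf. (88)
        d = np.abs((theta - angles[:, None] + np.pi) % (2 * np.pi) - np.pi)
        return (gains[:, None] * np.exp(-d / sigma_as)).sum(axis=0)
    return g3p

rng = np.random.default_rng(0)
g = sample_three_path(rng)
theta = np.linspace(-np.pi, np.pi, 512)
density = g(theta)       # can be plugged into (5) to generate C_delta numerically
```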

Fig. 6. MSE per antenna at an SNR of 0 dB for estimation from a single snapshot (T = 1). Channel model with three propagation paths (cf. (63)).

Fig. 7. MSE per antenna for M = 64 antennas and for estimation from a single snapshot (T = 1). Channel model with three propagation paths (cf. (63)).

Finally, we use the 3GPP urban-macro channel model as specified in [8] with a single user placed at different positions in the cell. The parameters used in the simulation are given in Table I. In Fig. 8, we depict the performance in terms of spectral efficiency with respect to the number of available observations.

TABLE I
SIMULATION PARAMETERS FOR FIG. 8

Path-loss coefficient: 3.5
Log-normal shadow fading: 0 dB
Min. distance: 1000 m
Max. distance: 1500 m
SNR at max. distance: −10 dB

Specifically, we use a matched filter in the uplink and evaluate the rate expression

  r = E[ log₂( 1 + |ĥ^H h|² / (σ² ‖ĥ‖²) ) ]   (64)

with Monte Carlo simulations. We assume that the SNR is the same during training and data transmission. Note that this rate expression, which assumes perfect channel state information (CSI) at the decoder, yields an upper bound on the achievable rate, which is a simple measure for the accuracy of the estimated subspace. Commonly used lower bounds on the achievable rate that take the imperfect CSI into account are not straightforward to apply to our system model and are, therefore, not shown.

Fig. 8. Spectral efficiency for M = 64 antennas and an SNR of −10 dB at the cell edge. The urban macro channel model specified in [8] is used to generate the channels.

The performance of both the ML estimator and the ReLU-CNN estimator converges towards the genie-aided estimator for large T. For a small to moderate number of observations, the CNN-based approach is clearly superior. The upper bound on the OMP performance is still lower than the performance of the circulant ML estimator.

We also see that the ReLU-CNN estimator outperforms the Toeplitz SE for a high number of observations. The reason is that we use a fixed number of samples N = 16M, which leads to an error floor with respect to the number of observations. In other words, for higher numbers of observations, the complexity of the gridded estimator has to be increased to improve the estimation accuracy. In contrast, the accuracy of the ReLU-CNN estimator improves without any such increase in complexity.

The simulation code is available online [30].
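A minimal numpy sketch of the Monte Carlo evaluation of (64) (our own illustration):

```python
import numpy as np

def rate_mc(channels, estimates, sigma2):
    # Monte Carlo average of (64) over paired channel/estimate realizations
    rates = []
    for h, h_hat in zip(channels, estimates):
        gain = np.abs(h_hat.conj() @ h) ** 2 / (sigma2 * np.linalg.norm(h_hat) ** 2)
        rates.append(np.log2(1.0 + gain))
    return np.mean(rates)
```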

IX. CONCLUSION

We presented a novel approach to learn a low-complexity channel estimator, which is motivated by the structure of the MMSE estimator. In contrast to other approaches, there are no model parameters which have to be fine-tuned for different channel models. These parameters are learned from channel realizations that could be generated from a model or measured. Despite this lack of explicit fine-tuning, the proposed method outperforms state-of-the-art approaches at a very low computational cost. Although we could consider more general NNs, e.g., by replacing the convolution matrices with arbitrary matrices, our simulation results suggest that this is not worthwhile, at least as long as the 3GPP models are used.

It will be interesting to establish whether the NN estimators perform equally well for channel models in which the Toeplitz assumption is not satisfied. In fact, recent work [31] suggests that the model in (5) based on the far-field assumption does not provide a perfect fit when using large arrays with lots of antennas. However, the structure of the neural network is not required to perfectly fit the channel model, since the optimized variables can compensate for an inappropriate structure, at least partially. The only requirement is that suitable training data for the learning procedure is available.

APPENDIX

A. Proof of Lemma 1

We show Lemma 1 for the slightly more general case with arbitrary full-rank noise covariance matrices Σ. Let S = (1/T) Σ_{t=1}^{T} y_t y_t^H denote the sample covariance matrix. The likelihood of Y in

  ĥ_MMSE = [ ∫ p(Y|δ) W_δ p(δ) dδ / ∫ p(Y|δ) p(δ) dδ ] y   (65)

is proportional to (we only need to consider factors with δ, because other terms cancel out)

  p(Y|δ) ∝ ∏_{t=1}^{T} exp(−y_t^H (C_δ + Σ)^{−1} y_t) / |C_δ + Σ|   (66)
         = exp(−T tr((C_δ + Σ)^{−1} S)) |(C_δ + Σ)^{−1}|^T.   (67)

We express (C_δ + Σ)^{−1} in terms of W_δ as follows: We have

  C_δ = W_δ (C_δ + Σ)   (68)
  ⇔ C_δ + Σ = W_δ (C_δ + Σ) + Σ   (69)
  ⇔ I = W_δ + Σ (C_δ + Σ)^{−1}   (70)
  ⇔ Σ^{−1} (I − W_δ) = (C_δ + Σ)^{−1}.   (71)

If we plug this expression into the likelihood (67), we obtain

  p(Y|δ) ∝ exp(−T tr(Σ^{−1} (I − W_δ) S)) |Σ^{−1} (I − W_δ)|^T   (72)
         ∝ exp(T tr(Σ^{−1} W_δ S)) |I − W_δ|^T   (73)
         = exp(T tr(Σ^{−1} W_δ S) + T log|I − W_δ|)   (74)

since Σ does not depend on δ. If we substitute Σ = σ² I and Ĉ = (T/σ²) S, Lemma 1 follows.

B. Uniform Linear Array

For a uniform linear array (ULA) with half-wavelength spacing at the base station, the steering vector is given by

  a(θ) = [1, exp(iπ sin θ), ..., exp(iπ(M − 1) sin θ)]^H.   (75)

Consequently, the covariance matrix has Toeplitz structure with entries

  [C_δ]_{mn} = ∫_{−π}^{π} g(θ; δ) exp(−iπ(m − n) sin θ) dθ.   (76)

If we substitute u = π sin θ, we get

  [C_δ]_{mn} = (1/2π) ∫_{−π}^{π} f(u; δ) exp(−i(m − n)u) du   (77)

with

  f(u; δ) = 2π [ g(arcsin(u/π); δ) + g(π − arcsin(u/π); δ) ] / √(π² − u²)   (78)

where we extended g periodically beyond the interval [−π, π]. That is, the entries of the channel covariance matrix are Fourier coefficients of the periodic spectrum f(u; δ).

An interesting property of the Toeplitz covariance matrices is that we can define a circulant matrix C̃_δ with the eigenvalues f(2πk/M; δ), k = 0, ..., M − 1, such that C̃_δ ≍ C_δ [16]. That is, to get the elements of the circulant matrices, we approximate the integral in (77) by the summation

  [C̃_δ]_{mn} = (1/M) Σ_{k=0}^{M−1} f(2πk/M; δ) e^{−i(m−n)2πk/M}.   (79)
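A minimal numpy sketch that evaluates (76) numerically for a given power density g (our own illustration; a simple Riemann sum approximates the integral):

```python
import numpy as np

def ula_covariance(g, M, n_grid=2048):
    # numerical evaluation of (76) for a ULA with half-wavelength spacing
    theta = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    dtheta = 2 * np.pi / n_grid
    m = np.arange(M)
    phases = np.exp(-1j * np.pi * np.outer(m, np.sin(theta)))  # row m: e^{-i pi m sin(theta)}
    weighted = phases * g(theta) * dtheta
    # C[m, n] = sum_theta g(theta) e^{-i pi (m - n) sin(theta)} dtheta
    return weighted @ phases.conj().T
```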

C. Uniform Rectangular Array

To work with a two-dimensional array, we need a three-dimensional channel model. That is, in addition to the azimuth angle θ, we also need an elevation angle φ to describe a direction of arrival. Under the far-field assumption, the covariance matrix is given by

  C_δ = ∫_{−π/2}^{π/2} ∫_{−π}^{π} g(θ, φ; δ) a(θ, φ) a(θ, φ)^H dθ dφ.   (80)

For a uniform rectangular array (URA) with half-wavelength spacing at the base station, we have M = M_H M_V antenna elements, where M_H is the number of antennas in the horizontal direction and M_V the number of antennas in the vertical direction. The correlation between the antenna element at position (m, p) and the one at (n, q), given the parameters δ, is given by

  ∫_{−π/2}^{π/2} ∫_{−π}^{π} g(θ, φ; δ) e^{iπ((n−m) sin θ + (q−p) cos θ sin φ)} dθ dφ   (81)
    = ∫_{−π/2}^{π/2} ∫_{−π/2}^{π/2} g̃(θ, φ; δ) e^{iπ((n−m) sin θ + (q−p) cos θ sin φ)} dθ dφ   (82)

where

  g̃(θ, φ; δ) = g(θ, φ; δ) + g(π − θ, φ; δ).   (83)

We can map the square [−π/2, π/2]² bijectively onto the circle with radius π with the substitution u = π sin θ and ν = π cos θ sin φ. The transformed integral can be written as

  ∫_{−π}^{π} ∫_{−π}^{π} f(u, ν; δ) e^{−i((m−n)u + (p−q)ν)} du dν   (84)

with

  f(u, ν; δ) = f̃(u, ν; δ) for u² + ν² ≤ π², and f(u, ν; δ) = 0 otherwise.   (85)

The nonzero entries of the two-dimensional spectrum are given by

  f̃(u, ν; δ) = g̃( arcsin(u/π), arcsin(ν/√(π² − u²)); δ ) / √( (π² − u²)(π² − u² − ν²) ).   (86)

That is, for a URA, the entries of the channel covariance matrix are two-dimensional Fourier coefficients of the periodic spectrum f(u, ν; δ).

We can use the results for the ULA case to show that the URA covariance matrix is asymptotically equivalent to a nested circulant matrix with the eigenvalues f(2πm/M_H, 2πp/M_V; δ), where m = 0, ..., M_H − 1 and p = 0, ..., M_V − 1. The eigenvectors of the nested circulant matrix are given by F_{M_H} ⊗ F_{M_V}, where F_M denotes the M-dimensional DFT matrix.

To show the asymptotic equivalence, we first replace the Toeplitz structure along the horizontal direction by a circulant structure. This yields an asymptotically equivalent matrix due to the results from [16]. Second, we replace the Toeplitz structure along the vertical direction by a circulant structure to get the desired result. Clearly, the asymptotic equivalence only holds if M_H and M_V both go to infinity.

D. Shift Invariance

To get circulant matrices A_SE in the structured estimator ŵ_SE in (32), we need several assumptions. First, we assume that the circulant approximation in (79) holds exactly, i.e., the columns w_i of A_SE contain uniform samples of the continuous filter

  w(u; δ_i) = f(u; δ_i) / ( f(u; δ_i) + σ² ).   (87)

Next, we assume a single parameter δ and shift invariance of the spectrum, i.e., f(u; δ) = f(u − δ), from which w(u; δ) = w(u − δ) follows. Finally, the prior of δ has to be uniform on the same grid that generates the samples of the w_i.

Example. An example that approximately fulfills these assumptions is the 3GPP spatial channel model for a ULA with only a single propagation path. In this case, we only have one parameter for the covariance matrix: the angle of the path center δ, which is uniformly distributed. The power density function of the angle of arrival (cf. (5)) is given by the Laplace density

  g_lp(θ; δ) = exp(−d_2π(δ, θ)/σ_AS)   (88)

where d_2π(θ, δ) is the wrap-around distance between θ and δ and can be thought of as |θ − δ| for most (θ, δ) pairs. In other words, for different δ, the function g_lp(θ; δ) is simply a shifted version of g_lp(θ; 0), i.e., g_lp(θ; δ) = g_lp(θ − δ; 0).

Due to the symmetry of the ULA, we can restrict the parameter δ to the interval [−π/2, π/2] without loss of generality. For angles δ ∈ [−π/4, π/4], i.e., if the cluster center is located at the broadside of the array, the arcsin-transform is approximately linear. As a consequence, the correspondence (88) is approximately true also for the transformed spectrum f(u; δ) (cf. (78)) and, by virtue of (87), the continuous filter is also approximately shift-invariant. This discussion is illustrated by Fig. 9, which shows the continuous filter w_lp(·; δ) for different δ (the peaks are at π sin δ). For δ ∈ [−π/4, π/4], the different filters are approximately shifted versions of the central filter, i.e.,

  w_lp(u; δ) ≈ w_lp(u − δ; 0).   (89)

Fig. 9. Functions w_lp(u; δ) for different δ ∈ [−π/2, π/2] sampled on a uniform grid. The peaks of the graphs are at π sin δ. The graphs for δ ∈ [−π/4, π/4] are depicted with a dashed line-style.

For large M, the approximation error from using (79) is reduced, and we can approximate the matrix A_SE by a circular convolution with uniform samples w_0 of w_lp(u; 0) as convolution kernel. We get a FE by setting A_SE in the structured estimator ŵ_SE to

  A_SE = F^H diag(F w_0) F   (90)

where [w_0]_k = w_lp(2π(k − 1)/K; 0).

An analogous shift invariance can be derived for a uniform rectangular array. In this case, we have a two-dimensional shift invariance and, thus, two-dimensional convolutions. For the case of distributed antennas that appeared in the examples in Sec. V-A, we do not see a straightforward way to make a similar simplification.
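A small numpy sketch that reproduces the observation illustrated in Fig. 9, namely that the peak of w_lp(·; δ) sits at π sin δ (our own illustration; unnormalized Laplace density as in (88)):

```python
import numpy as np

def w_lp(u, delta, sigma2=1.0, sigma_as=np.deg2rad(2.0)):
    # continuous filter (87) for the transformed Laplace spectrum (78), (88)
    def g(theta):
        d = np.abs((theta - delta + np.pi) % (2 * np.pi) - np.pi)  # wrap-around distance
        return np.exp(-d / sigma_as)
    t = np.arcsin(np.clip(u / np.pi, -1.0, 1.0))
    f = 2 * np.pi * (g(t) + g(np.pi - t)) / np.sqrt(np.maximum(np.pi**2 - u**2, 1e-12))
    return f / (f + sigma2)

u = np.linspace(-np.pi, np.pi, 1024, endpoint=False)
for delta in np.deg2rad([0.0, 15.0, 30.0, 45.0]):
    w = w_lp(u, delta)
    print(f"delta = {np.rad2deg(delta):5.1f} deg, peak at u = {u[np.argmax(w)]:+.3f}"
          f" (pi*sin(delta) = {np.pi * np.sin(delta):+.3f})")
```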
the case of distributed antennas that appeared in the examples [23] D. Neumann, M. Joham, L. Weiland, and W. Utschick, “Low-complexity in Sec. V-A, we do not see a straightforward way to make a computation of LMMSE channel estimates in massive MIMO,” in International ITG Workshop on Smart Antennas (WSA), Mar. 2015. similar simplification. [24] J. P. Burg, D. G. Luenberger, and D. L. Wenger, “Estimation of structured covariance matrices,” Proceedings of the IEEE, vol. 70, pp. REFERENCES 963–974, Sep. 1982. [25] T. W. Anderson, “Asymptotically efficient estimation of covariance [1] F. Rusek, D. Persson, B. K. Lau, E. Larsson, T. Marzetta, O. Edfors, matrices with linear structure,” The Annals of , vol. 1, pp. 135– and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges with 141, 1973. very large arrays,” IEEE Signal Process. Mag., vol. 30, pp. 40 –60, Jan. [26] A. Dembo, “The relation between maximum likelihood estimation of 2013. structured covariance matrices and periodograms,” IEEE Trans. Acoust., [2] T. Marzetta, “Noncooperative cellular wireless with unlimited numbers Speech, Signal Process., vol. 34, pp. 1661–1662, Dec. 1986. of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, pp. [27] T. Wiese, L. Weiland, and W. Utschick, “Low-rank approximations 3590 –3600, Nov. 2010. for spatial channel models,” in International ITG Workshop on Smart [3] D. C. Araujo, A. L. F. de Almeida, J. Axnas, and J. C. M. Mota, Antennas (WSA), Munich, Germany, Mar. 2016. “Channel estimation for millimeter-wave very-large MIMO systems,” [28] M. Gharavi-Alkhansari and T. S. Huang, “A fast orthogonal matching in European Conference (EUSIPCO), Sep. 2014, pp. pursuit algorithm,” in IEEE International Conference on Acoustics, 81–85. Speech and Signal Processing, vol. 3, May 1998, pp. 1389–1392. [4] A. Alkhateeb, O. El Ayach, G. Leus, and R. W. Heath, “Channel [29] T. Blumensath and M. E. Davies, “Iterative hard thresholding for estimation and hybrid precoding for millimeter wave cellular systems,” compressed sensing,” vol. 27, no. 3, pp. 265–274, Nov. 2009. IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 831–846, Oct. [30] D. Neumann and T. Wiese, “Simulation code,” https://github.com/ 2014. tum-msv/learning-mmse-est, 2017. [5] M. K. Samimi and T. S. Rappaport, “3-D millimeter-wave statistical [31] X. Gao, O. Edfors, F. Tufvesson, and E. G. Larsson, “Massive MIMO channel model for 5G wireless system design,” IEEE Trans. Microw. in real propagation environments: Do all antennas contribute equally?” Theory Techn., vol. 64, no. 7, pp. 2207–2225, Jul. 2016. IEEE Trans. Commun., vol. 63, pp. 3917–3928, Nov. 2015. [6] S. Sun, T. S. Rappaport, T. A. Thomas, A. Ghosh, H. C. Nguyen, I. Z. Kovacs,´ I. Rodriguez, O. Koymen, and A. Partyka, “Investigation of prediction accuracy, sensitivity, and parameter stability of large-scale propagation path loss models for 5G wireless communications,” IEEE Trans. Veh. Technol., vol. 65, no. 5, pp. 2843–2860, May 2016. [7] H. Yin, D. Gesbert, M. Filippou, and Y. Liu, “A coordinated approach to channel estimation in large-scale multiple-antenna systems,” IEEE J. Sel. Areas Commun., vol. 31, pp. 264 –273, Feb. 2013. [8] 3GPP, “Spatial channel model for multiple input multiple output (MIMO) simulations (release 12),” 3rd Generation Partnership Project (3GPP), TR 25.996 V12.0.0, 2014. [9] X. Rao and V. K. N. Lau, “Distributed compressive CSIT estimation and feedback for FDD multi-user massive MIMO systems,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 
David Neumann received the Dipl.-Ing. degree in electrical engineering from Technische Universität München (TUM) in 2011. He is currently working towards the doctoral degree at the Professorship for Signal Processing at TUM. His research interests include transceiver design for large-scale communication systems and estimation theory.

Thomas Wiese received the Dipl.-Ing. degree in electrical engineering and the Dipl.-Math. degree in mathematics from Technische Universität München (TUM) in 2011 and 2012, respectively. He is currently working towards the doctoral degree at the Professorship for Signal Processing at TUM. His research interests include compressive sensing, sensor array processing, and convex optimization.

Wolfgang Utschick (SM'06) completed several years of industrial training programs before he received the diploma in 1993 and the doctoral degree in 1998, both in electrical engineering and both with honors, from Technische Universität München (TUM), with a dissertation on machine learning. Since 2002, he has been Professor at TUM, where he chairs the Professorship of Signal Processing. He teaches courses on signal processing, stochastic processes, and optimization theory in the field of wireless communications, various application areas of signal processing, and power transmission systems. Since 2011, he has been a regular guest professor at Singapore's new autonomous university, the Singapore Institute of Technology (SIT). He holds several patents in the field of multi-antenna signal processing, has authored and coauthored a large number of technical articles in international journals and conference proceedings, and has received several best paper awards. He has edited several books and is founder and editor of the Springer book series Foundations in Signal Processing, Communications and Networking. Dr. Utschick has been Principal Investigator in multiple research projects funded by the German Research Foundation (DFG) and coordinator of the German DFG priority program Communications Over Interference Limited Networks (COIN). He is a member of the VDE and therein a member of the Expert Group 5.1 for Information and System Theory of the German Information Technology Society. He is a Senior Member of the IEEE, where he is currently chairing the German Signal Processing Section. He has also served as an Associate Editor for the IEEE Transactions on Signal Processing and has been a member of the IEEE Signal Processing Society Technical Committee on Signal Processing for Communications and Networking. Since 2017, he has served as Dean of the Department of Electrical and Computer Engineering at TUM.