Training Signal Design for Estimation of Correlated MIMO Channels with Colored Interference

Yong Liu †, Tan F. Wong ∗†, and William W. Hager ‡

† Wireless Information Networking Group, Department of Electrical & Computer Engineering, University of Florida, Gainesville, Florida 32611-6130. Phone: 352-392-2665, Fax: 352-392-0044. [email protected], [email protected]

‡ Department of Mathematics, University of Florida, Gainesville, Florida 32611-8105. Phone: 352-392-0281, Fax: 352-392-8357. [email protected]. EDICS: MSP-CEST, SPC-APPL

Abstract

In this paper, we study the problem of estimating correlated multiple-input multiple-output (MIMO) channels in the presence of colored interference. The linear minimum mean square error (MMSE) channel estimator is derived and the optimal training sequences are designed based on the MSE of channel estimation. We propose an algorithm to estimate the long-term channel statistics in the construction of the optimal training sequences. We also design an efficient scheme to feed back the required information to the transmitter where we can approximately construct the optimal sequences. Numerical results show that the optimal training sequences provide substantial performance gain for channel estimation when compared with other training sequences.

Index Terms

MIMO system, channel estimation, optimal training sequence, MSE

This work was supported by the National Science Foundation under Grants 0203270 and ANI-0020287.

I. INTRODUCTION

Many multiple-antenna communication systems are designed to perform coherent detection, which requires channel state information (CSI) in the demodulation process. In practical wireless communication systems, the channel parameters are commonly estimated by sending known training symbols to the receiver. The performance of this training-based channel estimation scheme depends on the design of the training signals, a problem that has been extensively investigated in the literature [1]-[9].

It is well known that imperfect knowledge of the channel has a detrimental effect on the rate that the channel can sustain [10]. Training sequences can be designed based on information-theoretic metrics such as the ergodic capacity and the outage capacity of a MIMO channel [1], [2], [3]. The mean square error (MSE) is another commonly used performance measure for channel estimation, and many works [4]-[9] investigate the training sequence design problem for MIMO channels based on the MSE. In [5], the authors study training sequence design for multiple-antenna systems over flat fading MIMO channels in the presence of colored interference. The MIMO channels are assumed to be spatially white, i.e., there is no correlation among the transmit and receive antennas. The optimal training sequences are designed to minimize the channel estimation MSE under a total transmit power constraint, and the resulting design intentionally assigns more transmit power to the subspaces with less interference. In [6], the problem of transmit signal design is investigated for the estimation of spatially correlated MIMO Rayleigh flat fading channels. The optimal training signal is designed under two criteria: minimization of the channel estimation MSE and maximization of the conditional mutual information (CMI) between the channel and the received signal. The authors adopt the virtual channel representation model [11] for correlated MIMO channels. It is shown that the optimal training signal should be transmitted along the strong directions, i.e., those in which more scatterers are present. The power transmitted along these directions is determined by water-filling solutions based on the minimum MSE and maximum CMI criteria.

In the present work, we investigate the problem of estimating correlated MIMO channels with colored interference. We adopt the correlated MIMO channel model from [12], [13], which expresses the channel matrix as a product of the receive correlation matrix, a white “channel” matrix with independent and identically distributed (i.i.d.) entries, and the transmit correlation matrix. This model implies that the transmit and receive correlations are separable, a property that has been verified by field measurements. The colored interference model used here is more suitable than the white noise model when jamming signals and/or co-channel interference are present in the wireless communication system. We consider an interference-limited wireless communication system, i.e., we ignore the thermal noise, which is insignificant compared to the interference. We then show that the covariance matrix of the interference has a form which implies that the temporal and spatial correlations of the interference are separable. The channel estimation MSE is used as the performance metric for the design of training sequences. The optimization problem formulated here minimizes the channel estimation MSE under a power constraint. This is a generalization of two previous optimization problems which are encountered widely in the signal processing area [5], [8], [9], [14]. In [7], the authors encounter essentially the same optimization problem in a different form. Guided by previous optimization results for the special case in [8], they choose to optimize the training sequence matrix over a particular set of matrices which have the same solution structure and eigenvector ordering as our solution. Here we rigorously prove that this particular solution structure and eigenvector ordering are optimal over arbitrary matrices under the power constraint. The optimal training sequence design assigns more power to the transmission directions constructed from the eigen-directions with larger channel gains and the interference subspaces with less interference. In order to implement the channel estimator and construct the optimal training sequences, we propose an algorithm to estimate the long-term channel statistics and design an efficient feedback scheme so that we can approximately construct the optimal sequences at the transmitter. Numerical results show that with the optimal training sequences, the channel estimation MSE can be reduced substantially when compared with the use of other training sequences.

II. SYSTEM MODEL

We consider a single-user link with multiple interferers. The desired user has n_t transmit antennas and n_r receive antennas. We assume that there are M_I interfering signals and that the ith interferer has n_i transmit antennas. The MIMO channel is assumed to be quasi-static (block fading): it varies slowly enough to be considered invariant over a block, but changes to independent values from block to block. We assume that the users employ a frame-based transmission protocol which comprises training and payload data. The received baseband signals at the receive antennas during the training period are given in matrix form by

Y = H S^T + ∑_{i=1}^{M_I} H_i S_i^T,   (1)

where the interference term is denoted E = ∑_{i=1}^{M_I} H_i S_i^T and (·)^T denotes the transpose of a matrix. The n_r × n_t matrix H and the n_r × n_i matrix H_i are the channel gain matrices from the transmitter and from the ith interferer to the receiver, respectively. S is the N × n_t training symbol matrix known to the receiver for estimating the channel gain matrix H of the desired user during the training period; N is the number of training symbols sent from each transmit antenna, and N is usually much larger than n_t. S_i is the N × n_i interference symbol matrix from the ith interferer. We assume that the elements of S_i are identically distributed zero-mean complex random variables, correlated across both space and time, and that the interference processes are wide-sense stationary in time. We consider an interference-limited wireless communication system; hence we ignore the effect of the thermal noise [15].

We adopt the correlated MIMO channel model [12], [13], which models the channel gain matrix H as H = R_r^{1/2} H_w R_t^{1/2}, where R_t models the correlation between the transmit antennas and R_r models the correlation between the receive antennas. We assume that both R_r and R_t are of full rank. The notation (·)^{1/2} stands for the Hermitian square root of a matrix. H_w is a matrix whose elements are independent and identically distributed zero-mean circular-symmetric complex Gaussian random variables with unit variance. Let h_w = vec(H_w), where vec(X) is the vector obtained by stacking the columns of X on top of each other. Then h = vec(H) = (R_t^{1/2} ⊗ R_r^{1/2}) h_w, so that h ∼ CN(0, R_t ⊗ R_r), where CN denotes the complex Gaussian distribution, ⊗ denotes the Kronecker product, and ∼ means “distributed as”. Similarly, the channel gain matrix from the ith interferer to the receiver is H_i = R_r^{1/2} H_{wi} R_{ti}^{1/2} and h_i = vec(H_i) = (R_{ti}^{1/2} ⊗ R_r^{1/2}) h_{wi}.
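As a concrete illustration of this signal model, the following NumPy sketch draws one realization of the received training block Y in (1). The function name, the eigendecomposition-based matrix square root, and the default random generator are our own illustrative choices, not anything prescribed by the model.

```python
import numpy as np

def sample_received_training(S, Rt, Rr, interferers, rng=np.random.default_rng(0)):
    """Draw one realization of Y = H S^T + sum_i H_i S_i^T (Y is nr x N).

    S           : N x nt training matrix
    Rt, Rr      : transmit/receive correlation matrices (Hermitian, full rank)
    interferers : list of (Si, Rti) pairs, Si of size N x ni
    """
    def sqrtm(R):
        # Hermitian square root R^{1/2} via the eigendecomposition of R
        w, V = np.linalg.eigh(R)
        return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

    def cgauss(shape):
        # i.i.d. CN(0,1) entries
        return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

    N, nt = S.shape
    nr = Rr.shape[0]
    H = sqrtm(Rr) @ cgauss((nr, nt)) @ sqrtm(Rt)      # H = Rr^{1/2} Hw Rt^{1/2}
    Y = H @ S.T                                       # desired-user component
    for Si, Rti in interferers:
        Hi = sqrtm(Rr) @ cgauss((nr, Si.shape[1])) @ sqrtm(Rti)
        Y = Y + Hi @ Si.T                             # colored-interference component E
    return Y, H
```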

Using the vec operator, we can write the received signal in (1) in vector form as

y = vec(Y) = (S ⊗ I_{n_r}) h + e,   (2)

where I_{n_r} denotes the n_r × n_r identity matrix and e = vec(E). To derive the linear MMSE channel estimator, we need the following lemma.

Lemma 2.1: E(e) = 0 and the covariance matrix of e is

E(e e^H) = ∑_{i=1}^{M_I} Q_{N_i} ⊗ R_r = Q_N ⊗ R_r,

where

           ⎡ ∑_{k=1}^{n_i} R^i_{k,k}(0)      ···   ∑_{k=1}^{n_i} R^i_{k,k}(N−1) ⎤
Q_{N_i} =  ⎢             ⋮                    ⋱                 ⋮               ⎥ ,
           ⎣ ∑_{k=1}^{n_i} R^i_{k,k}(N−1)    ···   ∑_{k=1}^{n_i} R^i_{k,k}(0)   ⎦

Q_N = ∑_{i=1}^{M_I} Q_{N_i}, Ŝ_i = S_i R_{ti}^{1/2} is called the transformed interference symbol matrix, R^i_{k,k}(τ) = E([Ŝ_i]_{m,k} [Ŝ_i]^*_{m+τ,k}) represents the correlation between [Ŝ_i]_{m,k} and [Ŝ_i]_{m+τ,k}, and (·)^H denotes the conjugate transpose of a matrix.

Proof: See Appendix A.

We note that Q_N captures the temporal correlation of the interference and R_r represents the spatial correlation. The covariance matrix of the interference thus has a Kronecker product form, which implies that the temporal and spatial correlations of the interference are separable. Since (2) is a linear model, the Bayesian Gauss-Markov theorem [16] gives the linear minimum mean square error (LMMSE) estimator of h as

ĥ = [(S ⊗ I_{n_r})^H (Q_N ⊗ R_r)^{-1} (S ⊗ I_{n_r}) + (R_t ⊗ R_r)^{-1}]^{-1} (S ⊗ I_{n_r})^H (Q_N ⊗ R_r)^{-1} y
  = [(S^H Q_N^{-1} S + R_t^{-1})^{-1} S^H Q_N^{-1} ⊗ I_{n_r}] y.

Using the equality vec(AYB) = (B^T ⊗ A) vec(Y), we can rewrite the channel estimator in the more compact matrix form

Ĥ = Y [(S^H Q_N^{-1} S + R_t^{-1})^{-1} S^H Q_N^{-1}]^T.

Hence the channel estimator does not depend on the receive channel correlation matrix R_r. The performance of the channel estimator is measured by the estimation error ε = h − ĥ, whose mean is zero and whose covariance matrix is

C_ε = E[(h − ĥ)(h − ĥ)^H]
    = [(S ⊗ I_{n_r})^H (Q_N ⊗ R_r)^{-1} (S ⊗ I_{n_r}) + (R_t ⊗ R_r)^{-1}]^{-1}
    = (S^H Q_N^{-1} S + R_t^{-1})^{-1} ⊗ R_r.

The diagonal elements of the error covariance matrix C_ε yield the minimum Bayesian MSEs, and their sum is usually referred to as the total MSE. The total MSE is a commonly used performance measure for MIMO channel estimation. Using the fact that tr(A ⊗ B) = tr(A) tr(B), where tr denotes the trace of a matrix, we have

tr(C_ε) = tr((S^H Q_N^{-1} S + R_t^{-1})^{-1} ⊗ R_r) = tr((S^H Q_N^{-1} S + R_t^{-1})^{-1}) tr(R_r).

Thus the minimization of the total MSE over training sequences does not depend on the receive channel correlation matrix. Only the temporal interference correlation matrix Q_N and the transmit correlation matrix R_t need to be considered in obtaining the optimal training sequences.
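The estimator and the total MSE above reduce to a few matrix operations. A minimal sketch, assuming Q_N and R_t are given (the function names are ours):

```python
import numpy as np

def lmmse_channel_estimate(Y, S, QN, Rt):
    """Hhat = Y [(S^H QN^{-1} S + Rt^{-1})^{-1} S^H QN^{-1}]^T, of size nr x nt."""
    A = np.linalg.solve(QN, S)                               # QN^{-1} S
    F = np.linalg.solve(S.conj().T @ A + np.linalg.inv(Rt),  # (S^H QN^{-1} S + Rt^{-1})^{-1} ...
                        A.conj().T)                          # ... times S^H QN^{-1}
    return Y @ F.T

def total_mse(S, QN, Rt, Rr):
    """tr(C_eps) = tr((S^H QN^{-1} S + Rt^{-1})^{-1}) tr(Rr)."""
    A = np.linalg.solve(QN, S)
    G = S.conj().T @ A + np.linalg.inv(Rt)
    return np.real(np.trace(np.linalg.inv(G)) * np.trace(Rr))
```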

III. OPTIMAL TRAINING SEQUENCE DESIGN

In this section, we investigate the problem of designing optimal training sequences for the channel estimation approach considered in the previous section. With the total MSE as the performance measure, the optimization of training sequences can be formulated as follows:

min_S tr(S^H Q_N^{-1} S + R_t^{-1})^{-1}
subject to tr{S^H S} ≤ P,   (3)

where tr{S^H S} ≤ P specifies the power constraint.

Some special cases of this optimization problem (with either Q_N or R_t equal to the identity matrix) have been encountered in joint linear transmitter-receiver design [8], [14], [17] and in training sequence design for channel estimation in MIMO systems [5], [9]. The solution in the special case R_t = I, found for example in [5] and [14], can be expressed in terms of the eigenvalues and eigenvectors of Q_N and a Lagrange multiplier associated with the power constraint. Similarly, the solution in the special case Q_N = I, found for example in [6], [8] and [9], can be expressed in terms of the eigenvalues and eigenvectors of R_t and a Lagrange multiplier associated with the power constraint. The optimization problem introduced here is more difficult. We will show that (3) has a solution that can be expressed as S = UΣV^H, where U and V are unitary matrices of eigenvectors for Q_N and R_t respectively, and Σ is diagonal. Solving (3) involves computing diagonalizations of Q_N and R_t, and finding an ordering for the columns of U and V. The optimal training sequences should be designed according to the following theorem, which summarizes the solution to the optimization problem (3).

Theorem 3.1: Suppose that Q_N and R_t are Hermitian positive definite matrices, and let UΛU^H and VΔV^H be the associated diagonalizations, where the columns of U and V are orthonormal eigenvectors, the corresponding eigenvalues {λ_i} of Q_N are arranged in increasing order, and the corresponding eigenvalues {δ_i} of R_t are arranged in decreasing order. Then the optimal solution of (3) is given by

S = UΣV^H,   (4)

where Σ, which specifies the power allocation, is a diagonal matrix with diagonal elements

σ_i = [ max( √(λ_i/µ) − λ_i/δ_i, 0 ) ]^{1/2}   for 1 ≤ i ≤ M ≜ min{n_t, N},   (5)

with the parameter µ chosen so that ∑_{i=1}^{M} σ_i² = P.

Proof: See Appendix B.

With the optimal training sequences, the channel estimator simplifies to Ĥ = Y U_M^* Γ V_M^T, where Γ = diag{γ_1, ..., γ_M} with γ_i = σ_i δ_i/(σ_i² δ_i + λ_i), the columns of U_M are the eigenvectors of Q_N corresponding to its M smallest eigenvalues, and the columns of V_M are the eigenvectors of R_t corresponding to its M largest eigenvalues.

The design of the optimal training sequences summarized in the above theorem has a clear physical interpretation. Each eigenvector of the transmit correlation matrix R_t represents a transmit direction, and the associated eigenvalue indicates the channel gain in that direction. More power should be assigned to the signals transmitted along the directions with larger channel gains. On the other hand, each eigenvector of the interference temporal correlation matrix Q_N represents an interference subspace, and the corresponding eigenvalue indicates the amount of interference in that subspace. Hence, we should choose the subspaces with the least amount of interference for transmission. The power assignment is determined by a water-filling argument under the finite power constraint. To facilitate the understanding of the water-filling interpretation of the power assignment, we can rewrite the optimal power assignment in the alternative form

σ̃_i = [ max( µ̃ − √λ_i/δ_i, 0 ) ]^{1/2}   for 1 ≤ i ≤ M,

with µ̃ = 1/√µ and σ̃_i = σ_i/λ_i^{1/4}, and with µ̃ chosen so that ∑_{i=1}^{M} √λ_i σ̃_i² = P. Here µ̃ represents the water level, {√λ_i/δ_i} specifies the depth profile, which can be visualized as the surface over which the water is poured, and the volume of each subchannel is weighted by √λ_i in the calculation of the total water volume. A simple algorithm [18], which terminates in at most N steps, can be used to calculate the optimal power assignment (a code sketch follows the algorithm below):

Input: the set of pairs {(λ_i, δ_i)} and the power constraint P.
Output: the water level µ̃ and {σ̃_i}.
1. Choose µ̃ as the maximum of {√λ_i/δ_i} and set L_new = M + 1.
2. Set L_old = L_new. Let I be the set of indices with µ̃ − √λ_i/δ_i ≥ 0, and let L_new be the cardinality of I. Compute µ̃ = (P + ∑_{i∈I} λ_i/δ_i)/∑_{i∈I} √λ_i.
3. If L_new < L_old, go to step 2; otherwise, compute σ̃_i = [max(µ̃ − √λ_i/δ_i, 0)]^{1/2} for 1 ≤ i ≤ M and stop.
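Below is a small Python sketch of this water-filling iteration together with the construction of S in Theorem 3.1. The function names waterfill and optimal_training are ours, and we rely on np.linalg.eigh returning eigenvalues in ascending order to match the orderings required by the theorem.

```python
import numpy as np

def waterfill(lam, delta, P):
    """Steps 1-3 above: returns sigma_i, 1 <= i <= M, for the pairs (lam_i, delta_i)."""
    lam, delta = np.asarray(lam, float), np.asarray(delta, float)
    depth = np.sqrt(lam) / delta                  # depth profile sqrt(lam_i)/delta_i
    mu = depth.max()                              # step 1
    L_new, L_old = len(lam) + 1, None
    while L_old is None or L_new < L_old:         # step 3 loop test
        L_old = L_new
        active = mu - depth >= 0                  # index set I
        L_new = int(active.sum())
        mu = (P + np.sum(lam[active] / delta[active])) / np.sum(np.sqrt(lam[active]))
    sig2_tilde = np.maximum(mu - depth, 0.0)      # tilde sigma_i^2
    return np.sqrt(np.sqrt(lam) * sig2_tilde)     # sigma_i^2 = sqrt(lam_i) * tilde sigma_i^2

def optimal_training(QN, Rt, P):
    """S = U Sigma V^H of Theorem 3.1, with N = QN.shape[0] and nt = Rt.shape[0]."""
    N, nt = QN.shape[0], Rt.shape[0]
    lam, U = np.linalg.eigh(QN)                   # eigenvalues of QN, increasing
    d, V = np.linalg.eigh(Rt)
    delta, V = d[::-1], V[:, ::-1]                # eigenvalues of Rt, decreasing
    M = min(nt, N)
    Sigma = np.zeros((N, nt))
    Sigma[:M, :M] = np.diag(waterfill(lam[:M], delta[:M], P))
    return U @ Sigma @ V.conj().T
```

As a quick consistency check, np.sum(waterfill(lam, delta, P)**2) should equal P up to rounding.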

IV. ESTIMATION OF THE LONG-TERM CHANNEL STATISTICS

To implement the channel estimator and construct the optimal training sequences, we need knowledge of the transmit antenna correlation matrix R_t and the interference covariance matrix Q_N at both the receiver and the transmitter. Since these two matrices are long-term channel characteristics, they can be estimated from the observed training signals at the receiver and then fed back to the transmitter for the construction of the optimal training sequences. In this section, we propose an algorithm to estimate these long-term channel characteristics and design an efficient feedback scheme so that we can approximately construct the optimal training sequences at the transmitter.

Let us assume that the training signal matrix S is sent over a sequence of K packets. During the transmission of each packet, the channel is assumed to be invariant. The received training signals for the nth packet are then given as

y(n) = (S ⊗ I_{n_r}) h(n) + e(n)
     = (S R_t^{1/2} ⊗ R_r^{1/2}) h_w(n) + e(n).

We note that the correlation matrix of the received signal also has the Kronecker product form:

R = E[y(n) y(n)^H]
  = (S R_t^{1/2} ⊗ R_r^{1/2}) E(h_w(n) h_w(n)^H) (R_t^{1/2} S^H ⊗ R_r^{1/2}) + E(e(n) e(n)^H)
  = R_q ⊗ R_r,

where R_q = S R_t S^H + Q_N. We calculate the sample average correlation matrix of the received signal from the previous K packets:

R̂ = (1/K) ∑_{n=1}^{K} y(n) y(n)^H.

If e(n) is Gaussian, R̂ is a sufficient statistic for the estimation of the correlation matrix R. If R = R_q ⊗ R_r, then also R = αR_q ⊗ (1/α)R_r for any α ≠ 0. Hence, R_q and R_r cannot be uniquely identified from observing y(n). Fortunately, the channel estimator and the design of the optimal sequences are invariant to a common scaling of the estimates of R_t and Q_N because

Ĥ′(n) = Y(n) [(S^H (αQ_N)^{-1} S + (αR_t)^{-1})^{-1} S^H (αQ_N)^{-1}]^T = Ĥ(n)

and

tr((S^H (αQ_N)^{-1} S + (αR_t)^{-1})^{-1}) = α tr((S^H Q_N^{-1} S + R_t^{-1})^{-1}).

For the estimation of R_q and R_r, we need to impose an additional constraint on R_r; here we force tr(R_r) = n_r. An iterative flip-flop algorithm [19], [20], [21] can then be used to estimate R_q and R_r. If the received interference signal e(n) is Gaussian distributed, the flip-flop algorithm, when it converges, provides the maximum likelihood estimates (MLEs) of R_q and R_r [19]. When e(n) is not Gaussian, the algorithm gives the estimates of R_q and R_r in the least squares sense. For fixed R̂_r(j−1), the MLE of R_q is obtained as

R̂_q(j) = (1/n_r) ∑_{u=1}^{n_r} ∑_{v=1}^{n_r} σ^r_{uv} { (1/K) ∑_{n=1}^{K} Y_u^T(n) Y_v^*(n) },   (6)

where σ^r_{uv} is the (u,v)th element of R̂_r^{-1}(j−1) and Y_u(n) is the uth row vector of the received signal matrix Y(n). Similarly, for fixed R̂_q(j), the MLE of R_r is obtained as

R̂_r(j) = (1/N) ∑_{u=1}^{N} ∑_{v=1}^{N} σ^q_{uv} { (1/K) ∑_{n=1}^{K} W_u(n) W_v^H(n) },   (7)

where σ^q_{uv} is the (u,v)th element of R̂_q^{-1}(j) and W_u(n) is the uth column vector of the received signal Y(n). To obtain uniquely identifiable R_q and R_r, we then scale R̂_r(j) to make tr(R̂_r(j)) = n_r. We note that the terms inside the braces in (6) and (7) can be computed before running the iterative estimation algorithm to reduce computational complexity. To start the iterative algorithm, an initial value of either R̂_q or R̂_r should be assigned; a natural choice is R̂_r(0) = I_{n_r}. The iterative algorithm then alternates between the calculations of R̂_q and R̂_r until convergence. While it is difficult to analytically prove that the algorithm converges to the MLE, extensive data experiments in statistics [19] show that it converges to the MLE for sample sizes of practical interest. The convergence in our case is also verified by the numerical results in Section V.
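A compact NumPy version of this flip-flop iteration is sketched below; the einsum-based precomputation of the braced terms in (6) and (7) and the stopping rule are implementation choices of ours, not part of the algorithm as stated.

```python
import numpy as np

def flip_flop(Ys, max_iter=100, tol=1e-8):
    """Alternating updates (6)-(7) for (Rq, Rr) from K blocks Ys of shape (K, nr, N)."""
    K, nr, N = Ys.shape
    # Braced terms of (6)/(7), computed once: T[a,m,b,l] = (1/K) sum_n Y[a,m] Y*[b,l]
    T = np.einsum('nam,nbl->ambl', Ys, Ys.conj()) / K
    Rr = np.eye(nr, dtype=complex)                        # natural start Rr(0) = I
    for _ in range(max_iter):
        Wr = np.linalg.inv(Rr)                            # sigma^r_{uv}
        Rq = np.einsum('uv,umvl->ml', Wr, T) / nr         # update (6), N x N
        Wq = np.linalg.inv(Rq)                            # sigma^q_{uv}
        Rr_new = np.einsum('uv,aubv->ab', Wq, T) / N      # update (7), nr x nr
        Rr_new *= nr / np.real(np.trace(Rr_new))          # enforce tr(Rr) = nr
        done = np.linalg.norm(Rr_new - Rr) < tol
        Rr = Rr_new
        if done:
            break
    return Rq, Rr
```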

We then need to estimate R_t and Q_N based on R̂_q. Before doing so, let R(·) denote the range space of a matrix, let R^⊥(·) denote the orthogonal complement of the range of a matrix, and consider the following lemma.

Lemma 4.1: Let L be the linear map defined by L(R_t, Q_N) = S R_t S^H + Q_N, where R_t and Q_N are Hermitian positive semi-definite matrices and S is of full rank. Let D be defined by D = {(R_t, Q_N) : R(Q_N) ⊂ R^⊥(S)}. Then L : D → C^{N×N} is one-to-one. Moreover, given any (R_t, Q_N) and R_q = L(R_t, Q_N), there exists (R_t′, Q_N′) ≠ (R_t, Q_N) such that L(R_t′, Q_N′) = R_q.

Proof: See Appendix C.

Based on the above lemma, we see that estimating Q_N and R_t simultaneously from R̂_q is not possible: we can only determine Q_N uniquely up to R^⊥(S) from R̂_q. Fortunately, this is not much of a limitation when N is large, as shown in Lemma 4.2 below. Let |Q_N|_w be the weak norm of Q_N, defined by |Q_N|_w = √(tr(Q_N^H Q_N)/N).

Lemma 4.2: With the assumption that Q_N is an absolutely summable Hermitian Toeplitz matrix, the difference between the two sequences of matrices Q_N and P_S^⊥ Q_N P_S^⊥ approaches zero in weak norm as N increases, i.e., lim_{N→∞} |Q_N − P_S^⊥ Q_N P_S^⊥|_w = 0.

Proof: See Appendix D.
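As a numerical sanity check on Lemma 4.2 (not part of the paper's development), one can watch the weak norm decay for a simple absolutely summable Toeplitz sequence; the exponential q_k = 0.7^|k| and the random training matrix below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import toeplitz

def weak_norm(A):
    return np.sqrt(np.real(np.trace(A.conj().T @ A)) / A.shape[0])

rng = np.random.default_rng(1)
for N in (16, 64, 256, 1024):
    QN = toeplitz(0.7 ** np.arange(N))                  # q_k = 0.7^|k|, Hermitian Toeplitz
    S = rng.standard_normal((N, 3))                     # nt = 3 training columns, full rank
    Pp = np.eye(N) - S @ np.linalg.inv(S.T @ S) @ S.T   # projector onto R_perp(S)
    print(N, weak_norm(QN - Pp @ QN @ Pp))              # decays roughly like sqrt(nt/N)
```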

Since P_S^⊥ R_q P_S^⊥ = P_S^⊥ Q_N P_S^⊥, we can estimate P_S^⊥ Q_N P_S^⊥ from P_S^⊥ R̂_q P_S^⊥. For notational simplicity, let A denote P_S^⊥ R̂_q P_S^⊥. Since the interference signals are wide-sense stationary in time, Q_N has the form of a Toeplitz matrix, which can be represented by a sequence {q_k; k = 0, ±1, ..., ±(N−1)} with [Q_N]_{k,j} = q_{k−j}. The (i,j)th element of P_S^⊥ Q_N P_S^⊥ is then ∑_l ∑_k p_{il} q_{l−k} p_{kj}, with p_{ij} denoting the (i,j)th element of P_S^⊥. Equating the (i,j)th element of P_S^⊥ Q_N P_S^⊥ with a_{ij}, we obtain a set of linear equations in {q_k}. Noting the Hermitian nature of P_S^⊥ Q_N P_S^⊥ and A and separating the real and imaginary parts of q_k and a_{ij}, we have N² linear equations in the 2N−1 unknowns q_r = [q_0, Re(q_1), Im(q_1), ..., Re(q_{N−1}), Im(q_{N−1})]^T. This set of linear equations can be solved by the least squares approach, and an estimate of Q_N, denoted Q̂_N, can then be constructed from q_r. Although this Q̂_N is only unique up to R^⊥(S), Lemma 4.2 tells us that this is not too severe a deficiency when N is large. In addition, when N is large, Q̂_N can be approximated by a circulant matrix [22] with fixed eigenvectors as:

Q̃_N = F_N Ψ_N F_N^H,   (8)

where F_N is the N × N FFT matrix and Ψ_N is a diagonal matrix containing the eigenvalues {ψ_i} of Q̃_N. Note that we only require the n_t smallest eigenvalues of Q_N and their corresponding eigenvectors in constructing the optimal training sequences. With the circulant matrix approximation (8), this is equivalent to estimating the n_t smallest eigenvalues ψ_i and identifying the corresponding columns of F_N. If we arrange the eigenvalues {ψ_i} of Q̃_N and the eigenvalues {λ_i} of Q_N in increasing order, we have [23] lim_{N→∞} |ψ_i − λ_i| = 0. Thus the n_t smallest eigenvalues of Q̂_N can be used as the estimates of the n_t smallest ψ_i's, and the corresponding columns of F_N are chosen as those closest (in terms of the Euclidean norm) to the eigenvectors associated with the n_t smallest eigenvalues of Q̂_N.

The estimates of the n_t smallest ψ_i's and the n_t indices of the chosen columns of F_N are then fed back to the transmitter for the construction of the optimal training sequences. We note that it is bandwidth efficient to feed back just these indices of F_N instead of the whole eigenvectors of Q̂_N, because the number of training symbols N during the training period is usually large.

To derive an estimator of R_t, we use Lemma 4.2 again. When N is large, R_q ≈ S R_t S^H + P_S^⊥ Q_N P_S^⊥, and hence P_S R_q P_S ≈ P_S S R_t S^H P_S + P_S P_S^⊥ Q_N P_S^⊥ P_S = S R_t S^H. Then, with a full rank S, we can estimate the transmit channel correlation matrix R_t using

R̂_t = (S^H S)^{-1} S^H R̂_q S (S^H S)^{-1}.
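To make this estimation procedure concrete, the sketch below recovers the Toeplitz sequence {q_k} by least squares from A = P_S^⊥ R̂_q P_S^⊥ and forms R̂_t by the formula above. The per-lag basis-matrix construction is our own way of assembling the linear equations described in the text; the circulant approximation (8) then amounts to reading off the DFT of the estimated sequence.

```python
import numpy as np
from scipy.linalg import toeplitz

def estimate_QN_Rt(Rq_hat, S):
    N = Rq_hat.shape[0]
    ShS_inv = np.linalg.inv(S.conj().T @ S)
    Pp = np.eye(N) - S @ ShS_inv @ S.conj().T             # projector onto R_perp(S)
    A = Pp @ Rq_hat @ Pp
    # One coefficient matrix per lag d: B_d = Pp Z_d Pp, [Z_d]_{l,k} = 1 iff l - k = d
    B = [Pp @ np.eye(N, k=-d) @ Pp for d in range(N)]
    cols = [np.concatenate([B[0].real.ravel(), B[0].imag.ravel()])]     # q_0 (real)
    for d in range(1, N):
        Bre = B[d] + B[d].conj().T                        # multiplies Re(q_d)
        Bim = 1j * (B[d] - B[d].conj().T)                 # multiplies Im(q_d)
        cols += [np.concatenate([Bc.real.ravel(), Bc.imag.ravel()]) for Bc in (Bre, Bim)]
    G = np.stack(cols, axis=1)                            # stacked real/imag equations
    rhs = np.concatenate([A.real.ravel(), A.imag.ravel()])
    qr = np.linalg.lstsq(G, rhs, rcond=None)[0]           # q_r of the text
    q = np.concatenate([[qr[0]], qr[1::2] + 1j * qr[2::2]])
    QN_hat = toeplitz(q, q.conj())                        # [QN]_{l,k} = q_{l-k}
    Rt_hat = ShS_inv @ S.conj().T @ Rq_hat @ S @ ShS_inv  # estimator of Rt above
    return QN_hat, Rt_hat
```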


V. NUMERICAL RESULTS

In this section, we present numerical results to show the performance gain achieved by the optimal training sequences. We consider a MIMO system with 3 transmit antennas and 3 receive antennas. The antennas form uniform linear arrays at both the transmitter and the receiver. For a small angle spread, the correlation coefficient between the ith and the jth transmit antennas [12] can be approximated as

[R_t]_{i,j} ≈ (1/2π) ∫_0^{2π} exp{−j2π|i−j| (d_t/λ) sin Δ sin θ} dθ = J_0(2π|i−j| (d_t/λ) sin Δ),

where J_0(x) is the zeroth-order Bessel function of the first kind, Δ is the angle spread, d_t is the antenna spacing, and λ is the wavelength of the narrow-band signal. We set d_t = 0.5λ. In the simulations, we consider two channels with different transmit correlations: a high spatial correlation channel with Δ = 5° and a low spatial correlation channel with Δ = 25°. The receive correlation matrix R_r is calculated in the same way as the transmit correlation matrix, with Δ = 25°. We assume that the channel characteristics are estimated from the observed training signals of the previous 50 packets, i.e., K = 50.
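For reference, the correlation matrices used in these simulations can be generated directly from the J_0 expression above; scipy's j0 provides the Bessel function, and the helper name below is ours.

```python
import numpy as np
from scipy.special import j0

def ula_correlation(n_ant, spread_deg, d_over_lambda=0.5):
    """[R]_{ij} = J0(2 pi |i-j| (d/lambda) sin(Delta)) for a uniform linear array."""
    Delta = np.deg2rad(spread_deg)
    idx = np.arange(n_ant)
    return j0(2 * np.pi * np.abs(idx[:, None] - idx[None, :]) * d_over_lambda * np.sin(Delta))

Rt_high = ula_correlation(3, 5.0)    # high spatial correlation channel (Delta = 5 degrees)
Rt_low  = ula_correlation(3, 25.0)   # low spatial correlation channel (Delta = 25 degrees)
Rr      = ula_correlation(3, 25.0)   # receive correlation (Delta = 25 degrees)
```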

We consider two kinds of interference: co-channel interference from other users in the same wireless system, and jamming signals, which are usually modeled by autoregressive (AR) random processes. We compare the channel estimation performance, in terms of the total MSE, for systems using different sets of training sequences. The following training sequence sets are considered for comparison:
1) the optimal training sequences described in Section III;
2) the approximate optimal training sequences, constructed from the channel and interference statistics obtained with the estimation algorithm proposed in Section IV;
3) the temporally optimal training sequences, for which the transmit channel correlation matrix is assumed to be an identity matrix and only the temporal interference correlation is considered in the design (we also consider the approximate temporally optimal sequences, which are constructed from the channel statistics obtained with the proposed algorithm);
4) the spatially optimal training sequences, for which the interference is assumed to be temporally white and only the transmit correlation is considered in the design (we also consider the approximate spatially optimal sequences, constructed analogously);
5) binary orthogonal sequences, generated from the first n_t columns of the Hadamard matrix; and
6) random sequences, where the training symbols are i.i.d. binary random variables with zero mean and unit variance.

A. Co-channel Interference

In a cellular wireless communication system, co-channel interference (CCI) from other cells exists due to frequency reuse. Hence, the interfering signals have the same signal format as that of the desired user. We can express the interfering signal transmitted from the ith transmit antenna of the mth interferer as

s_i^{(m)}(t) = √(P_m/(n_i N)) ∑_{l=−∞}^{∞} b_{i,l}^{(m)} ψ(t − lT − τ_m),

where P_m is the transmit power of the mth interferer and {b_{i,l}^{(m)}} are the data symbols transmitted from the ith transmit antenna of the mth interferer. The data symbols are assumed to be i.i.d. binary random variables with zero mean and unit variance. In addition, ψ(t) is the symbol waveform and T is the symbol duration. It is assumed that the receiver is synchronized to the desired user but not necessarily to the interfering signals, and τ_m is the symbol timing difference between the mth interferer and the desired user signal. Without loss of generality, we assume 0 ≤ τ_m < T. The elements of the interference symbol matrix S_i are samples of the matched filter output at the receiver at time indices jT. The (j,i)th element of S_i is

s_{j,i}^{(m)} = √(P_m/(n_i N)) ∑_{l=−∞}^{∞} b_{i,l}^{(m)} ψ̂((j − l)T − τ_m),

where ψ̂(t) = ∫_{−∞}^{∞} ψ(t − s) ψ^*(s) ds is the autocorrelation of the symbol waveform. For co-channel interference, the temporal interference correlation is due to the intersymbol interference in the sampled interfering signals. In the simulations, it is assumed that there are two interfering signals, each with two transmit antennas, and the signal-to-interference ratio P/∑_m P_m is set to 0 dB. The ISI-free symbol waveform with raised cosine spectrum [24] is chosen as the symbol waveform. In this case,

ψ(t) = sinc(πt/T) · cos(πβt/T)/(1 − 4β²t²/T²).

We set the roll-off factor β = 0.5, τ_1 = 0.2T, and τ_2 = 0.5T.

In Figs. 1 and 2, we show the total channel estimation MSEs for the high spatial correlation channel and the low spatial correlation channel, respectively. For both cases, the optimal sequences outperform the orthogonal sequences and random sequences significantly. For the high spatial correlation channel, the optimal sequences provide a substantial performance gain over both the spatially optimal sequences and the temporally optimal sequences, and the approximate optimal sequences achieve most of the performance gain obtained by the optimal sequences. For the low spatial correlation channel, the temporally optimal sequences achieve an estimation performance similar to that achieved by the optimal sequences; both provide significant performance gains over the spatially optimal sequences. In this case, the temporal correlation has a stronger impact on channel estimation than the spatial channel correlation, due to the fact that the length of the training sequences N is much larger than the number of transmit antennas n_t. The MSE performance of the approximate optimal sequences is slightly worse than that of the temporally optimal sequences because of the errors in estimating Q_N and R_t. Note that the approximate temporally optimal sequences in turn give performance slightly worse than that given by the approximate optimal sequences.
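For completeness, the CCI entries of Q_N used in these experiments can be approximated numerically from the expressions above. The sketch below evaluates ψ̂ by brute-force quadrature and sums the lagged products over a truncated symbol window; the truncation lengths and step sizes are our own choices, and the output is the temporal correlation of one interfering antenna stream up to the power scaling P_m/(n_i N).

```python
import numpy as np

def raised_cosine(t, T=1.0, beta=0.5):
    """psi(t) = sinc(pi t/T) cos(pi beta t/T) / (1 - 4 beta^2 t^2 / T^2).

    np.sinc(x) = sin(pi x)/(pi x), so np.sinc(t/T) matches sinc(pi t/T) above."""
    t = np.asarray(t, float) + 1e-9   # sidestep the removable singularity at |t| = T/(2 beta)
    return np.sinc(t / T) * np.cos(np.pi * beta * t / T) / (1.0 - 4.0 * beta**2 * t**2 / T**2)

def psi_hat(x, T=1.0, beta=0.5, span=20.0, ds=0.005):
    """psi_hat(x) = int psi(x - s) psi(s) ds, via a truncated Riemann sum (psi is real, even)."""
    s = np.arange(-span, span, ds)
    w = raised_cosine(s, T, beta)
    return np.array([np.sum(raised_cosine(xi - s, T, beta) * w) * ds
                     for xi in np.atleast_1d(x)])

def cci_corr_row(N, tau, T=1.0, beta=0.5, K=30):
    """R(0), ..., R(N-1), with R(m) = sum_k psi_hat(kT - tau) psi_hat((k+m)T - tau)."""
    vals = psi_hat(np.arange(-K, K + N + 1) * T - tau, T, beta)
    return np.array([np.sum(vals[:2 * K + 1] * vals[m:m + 2 * K + 1]) for m in range(N)])
```

The Toeplitz matrix built from cci_corr_row, summed over interferer antennas and scaled by P_m/(n_i N), gives one interferer's contribution to Q_N.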

B. Jamming Signals

We assume that there are two jammers in the system, each with one transmit antenna. The jamming signals are modeled as two first-order AR processes driven by temporally white Gaussian processes {u_{i,t}}, i.e.,

s_{i,t} = α_i s_{i,t−1} + u_{i,t},

where s_{i,t} represents the jamming signal transmitted by the ith jammer at the tth time index, α_i is the temporal correlation coefficient, and u_{i,t} has zero mean and variance σ_{u,i}², which determines the transmit power of the ith jammer. The signal-to-interference ratio is set to 0 dB. We choose α_1 = 0.4 and α_2 = 0.5.

In Figs. 3 and 4, we show the total channel estimation MSEs for the high spatial correlation channel and the low spatial correlation channel, respectively. For the AR jammers, conclusions similar to those in the co-channel interference case can be drawn about the estimation performance achieved by the different training sequences.
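The AR(1) model gives Q_N in closed form: the stationary autocovariance of s_{i,t} is r_i(k) = σ_{u,i}² α_i^{|k|}/(1 − α_i²), so the temporal correlation matrix is Toeplitz. A short sketch follows; the unit driving variances are placeholders, since in the simulations σ_{u,i}² would be set by the 0 dB SIR.

```python
import numpy as np
from scipy.linalg import toeplitz

def ar1_QN(N, alphas=(0.4, 0.5), sigma_u2=(1.0, 1.0)):
    """Sum of Toeplitz autocovariances r_i(k) = sigma_u2_i * alpha_i^|k| / (1 - alpha_i^2)."""
    QN = np.zeros((N, N))
    for a, s2 in zip(alphas, sigma_u2):
        QN += toeplitz(s2 * a ** np.arange(N) / (1.0 - a**2))
    return QN
```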

APPENDICES

A. Proof of Lemma 2.1

Let E = ∑_{i=1}^{M_I} E_i = ∑_{i=1}^{M_I} H_i S_i^T and e_i = vec(E_i). Since h_{wi} ∼ CN(0, I_{n_r n_i}), we have E(e_i) = 0 and hence E(e) = 0. The received signal from the ith interferer can be written as

E_i = H_i S_i^T = R_r^{1/2} H_{wi} R_{ti}^{1/2} S_i^T = R_r^{1/2} H_{wi} Ŝ_i^T.

Since S_i is wide-sense stationary in time, Ŝ_i = S_i R_{ti}^{1/2} is also wide-sense stationary in time. Using the vec operator, we can rewrite the interfering signal from the ith interferer as

e_i = vec(E_i) = (I_N ⊗ R_r^{1/2}) vec(H_{wi} Ŝ_i^T).

The covariance matrix of e_i is given as

E(e_i e_i^H) = E[(I_N ⊗ R_r^{1/2}) vec(H_{wi} Ŝ_i^T) vec(H_{wi} Ŝ_i^T)^H (I_N ⊗ R_r^{1/2})^H]
             = (I_N ⊗ R_r^{1/2}) E[vec(H_{wi} Ŝ_i^T) vec(H_{wi} Ŝ_i^T)^H] (I_N ⊗ R_r^{1/2}).

Let e_i′ = vec(H_{wi} Ŝ_i^T). It is easy to see that the covariance matrix of e_i′ is

              ⎡ ∑_{k=1}^{n_i} R^i_{k,k}(0) I_{n_r}      ···   ∑_{k=1}^{n_i} R^i_{k,k}(N−1) I_{n_r} ⎤
E[e_i′ e_i′^H] = ⎢                 ⋮                       ⋱                   ⋮                   ⎥
              ⎣ ∑_{k=1}^{n_i} R^i_{k,k}(N−1) I_{n_r}    ···   ∑_{k=1}^{n_i} R^i_{k,k}(0) I_{n_r}   ⎦
             = Q_{N_i} ⊗ I_{n_r}.

Then we have

E[e_i e_i^H] = (I_N ⊗ R_r^{1/2})(Q_{N_i} ⊗ I_{n_r})(I_N ⊗ R_r^{1/2}) = Q_{N_i} ⊗ R_r.

The covariance matrix of e is then given as

E[e e^H] = ∑_{i=1}^{M_I} Q_{N_i} ⊗ R_r = Q_N ⊗ R_r.

B. Solution of the optimization problem (3)

We solve the optimization problem by using the method introduced in [25]. First, we analyze the optimal structure of the solution by using the Lagrangian method, then find the optimal power allocation scheme, and finally determine the optimal ordering for the related eigenvector matrices. 1) Solution Structure: We begin by analyzing the structure of an optimal solution to (3). Let us define

T = U^H S V.   (9)

Substituting S = UTV^H in (3) gives the following equivalent optimization problem:

min_T tr(T^H Λ^{-1} T + Δ^{-1})^{-1}   subject to tr(T^H T) ≤ P, T ∈ C^{N×n_t}.   (10)

We now show that the solution to (10) has at most one nonzero in each row and column.

Theorem 5.1: There exists a solution of (10) of the form T = Π_1 Σ Π_2, where Π_1 and Π_2 are permutation matrices and σ_{ij} = 0 for all i ≠ j.

Proof: We argue that it suffices to prove the theorem under the following nondegeneracy assumption:

δ_i ≠ δ_j > 0 and λ_i ≠ λ_j > 0 for all i ≠ j.   (11)

Indeed, since the cost function of (10) is a continuous function of Δ and Λ, and since any λ > 0 and δ > 0 can be approximated arbitrarily closely by vectors δ and λ satisfying the nondegeneracy conditions (11), we conclude that the theorem holds for arbitrary λ > 0 and δ > 0.

There exists an optimal solution of (10) since the feasible set is compact and the cost function is a continuous function of T. Since the eigenvalues of Δ^{1/2} T^H Λ^{-1} T Δ^{1/2} are nonnegative, the eigenvalues of (Δ^{1/2} T^H Λ^{-1} T Δ^{1/2} + I)^{-1} are less than or equal to 1. Also, by [28, Chap. 9, H.1.g], the trace of the product of two positive semi-definite Hermitian matrices is bounded by the dot product of the eigenvalues of the two matrices arranged in decreasing order. It follows that for any choice of T,

tr(T^H Λ^{-1} T + Δ^{-1})^{-1} = tr[Δ(Δ^{1/2} T^H Λ^{-1} T Δ^{1/2} + I)^{-1}] ≤ tr(Δ),

with equality when T = 0. Hence, there exists a nonzero optimal solution of (10), which we denote T̄. The first-order necessary condition for an optimal solution is the following: There exists a scalar γ ≥ 0 such that

(d/dT) tr[(T^H Λ^{-1} T + Δ^{-1})^{-1} + γ T^H T] |_{T=T̄} = 0.   (12)

Let M = T̄^H Λ^{-1} T̄ + Δ^{-1}. Since the derivative [26] of the invertible matrix M satisfies dM^{-1}/dt = −M^{-1} (dM/dt) M^{-1} for every element t of the matrix T, (12) is equivalent to

tr{ γ[T̄^H δT + δT^H T̄] − M^{-2}[T̄^H Λ^{-1} δT + δT^H Λ^{-1} T̄] } = 0

for all matrices δT ∈ C^{N×n_t}. Let Real(z) denote the real part of z ∈ C. Based on the facts that tr(A + A^H) = 2 Real[tr(A)] and tr(AB) = tr(BA), we have Real tr{[γT̄^H − M^{-2} T̄^H Λ^{-1}] δT} = 0. By taking δT either pure real or pure imaginary, we deduce that tr{[γT̄^H − M^{-2} T̄^H Λ^{-1}] δT} = 0 for all δT. By choosing δT to be completely zero except for a single nonzero entry, we conclude that

γT̄^H − M^{-2} T̄^H Λ^{-1} = 0.   (13)

If γ = 0, then T̄ = 0 since both Δ and Λ are invertible. Hence, γ > 0. We multiply (13) on the right by T̄ to obtain

γT̄^H T̄ = M^{-2} T̄^H Λ^{-1} T̄ = (T̄^H Λ^{-1} T̄ + Δ^{-1})^{-2} T̄^H Λ^{-1} T̄.   (14)

Since T̄^H T̄ is Hermitian, we have

(T̄^H Λ^{-1} T̄ + Δ^{-1})^{-2} T̄^H Λ^{-1} T̄ = T̄^H Λ^{-1} T̄ (T̄^H Λ^{-1} T̄ + Δ^{-1})^{-2}.

We will now show that T̄^H Λ^{-1} T̄ and Δ^{-1} commute with each other. We need the following lemma [27, p. 249]:

Lemma 5.1: If A and B are diagonalizable, they share the same eigenvector matrix if and only if AB = BA.

Let A = T̄^H Λ^{-1} T̄ and B = Δ^{-1}. Then we have (A + B)^{-2} A = A (A + B)^{-2}. According to Lemma 5.1, A and (A + B)^{-2} share the same eigenvector matrix. Since A + B and (A + B)^{-2} have the same eigenvector matrix, A and A + B share the same eigenvector matrix. Then we have A(A + B) = (A + B)A. Hence, AB = BA, which implies that T̄^H Λ^{-1} T̄ and Δ^{-1} commute with each other. Since Δ^{-1} is diagonal, it follows from the nondegeneracy assumption that T̄^H Λ^{-1} T̄ is diagonal. Since T̄^H Λ^{-1} T̄ is diagonal, T̄^H T̄ is diagonal by (14).

Since T̄^H Λ^{-1} T̄ and Δ^{-1} are diagonal, both M and M^{-2} are diagonal. Hence, the factor M^{-2} in (13) is diagonal, with real diagonal elements denoted e_j, 1 ≤ j ≤ n_t. By (13), we have γ t̄_{ij} = e_j t̄_{ij}/λ_i. If t̄_{ij} ≠ 0, this further implies that e_j/λ_i = γ ≠ 0. By the nondegeneracy condition (11), no two diagonal elements of Λ are equal. If for any fixed j, t̄_{ij} ≠ 0 for i = i_1 and i_2, then the identity e_j/λ_i = γ yields a contradiction since γ ≠ 0 and λ_{i_1} ≠ λ_{i_2}. Hence, each column of T̄ has at most one nonzero. Since T̄^H T̄ is diagonal, two different columns cannot have their single nonzeros in the same row. This implies that each column and each row of T̄ have at most one nonzero. A suitable permutation of the rows and columns of T̄ gives a diagonal matrix Σ, which completes the proof.

Combining the relationship (9) between T and S with Theorem 5.1, we conclude that problem (3) has a solution of the form S = U Π_1 Σ Π_2 V^H, where Π_1 and Π_2 are permutation matrices. We will show that we can eliminate one of these two permutation matrices. Substituting S = U Π_1 Σ Π_2 V^H in (3), the equivalent optimization problem is obtained as:

min_{Σ,Π_1,Π_2} tr[Σ^H (Π_1^H Λ^{-1} Π_1) Σ + Π_2 Δ^{-1} Π_2^H]^{-1}
subject to ∑_{i=1}^{M} σ_i² ≤ P,   (15)

where M represents the minimum of N and n_t. In the above optimization problem, the minimization is over diagonal matrices Σ with σ_1, ..., σ_M as the diagonal elements, and over two permutation matrices Π_1 and Π_2. Since the symmetric permutations Π_1^H Λ^{-1} Π_1 and Π_2 Δ^{-1} Π_2^H essentially interchange diagonal elements of Λ and Δ, (15) is equivalent to

min_{σ,π_1,π_2} ∑_{i=1}^{M} 1/(σ_i²/λ_{π_1(i)} + 1/δ_{π_2(i)})
subject to ∑_{i=1}^{M} σ_i² ≤ P, π_1 ∈ 𝒫_N, π_2 ∈ 𝒫_{n_t},   (16)

where 𝒫_N is the set of bijections of {1, 2, ..., N} onto itself.

We will now show that the optimal solution only depends on the smallest eigenvalues of Q_N and the largest eigenvalues of R_t.

Lemma 5.2: Let UΛU^H and VΔV^H be diagonalizations of Q_N and R_t respectively, where the columns of U and V are orthonormal eigenvectors. Let σ, π_1, and π_2 denote an optimal solution of (16) and define the sets ℳ = {i : σ_i > 0}, 𝒬 = {λ_{π_1(i)} : i ∈ ℳ}, and ℛ = {δ_{π_2(i)} : i ∈ ℳ}. If ℳ has l elements, then the elements of 𝒬 constitute the l smallest eigenvalues of Q_N, and the elements of ℛ constitute the l largest eigenvalues of R_t, respectively.

Proof: Assume k ∉ ℳ and λ_{π_1(k)} < λ_{π_1(i)} for some i ∈ ℳ. It is easy to see that by interchanging the values of π_1(i) and π_1(k), the new ith term in the cost function is smaller than the previous ith term. This contradicts the assumption that σ and π are optimal. Hence, λ_{π_1(k)} ≥ λ_{π_1(i)}.

Suppose that k ∉ ℳ and δ_{π_2(k)} > δ_{π_2(i)} for some i ∈ ℳ. Let C^− denote the cost due to the sum of the ith term and the kth term before the interchange, and let C^+ denote the cost due to the sum of the ith term and the kth term after the interchange of the values of π_2(i) and π_2(k). We have

C^− = 1/(σ_i²/λ_{π_1(i)} + 1/δ_{π_2(i)}) + δ_{π_2(k)}

and

C^+ = 1/(σ_i²/λ_{π_1(i)} + 1/δ_{π_2(k)}) + δ_{π_2(i)}.

Since δ_{π_2(k)} > δ_{π_2(i)}, we have

C^+ − C^− = − (δ_{π_2(k)} − δ_{π_2(i)}) (σ_i⁴ δ_{π_2(k)} δ_{π_2(i)}/λ_{π_1(i)}² + σ_i² δ_{π_2(k)}/λ_{π_1(i)} + σ_i² δ_{π_2(i)}/λ_{π_1(i)}) / [(σ_i² δ_{π_2(k)}/λ_{π_1(i)} + 1)(σ_i² δ_{π_2(i)}/λ_{π_1(i)} + 1)] < 0.

The cost is reduced by interchanging the values of π_2(i) and π_2(k), which violates the optimality of σ and π. Hence, δ_{π_2(k)} ≤ δ_{π_2(i)}.

Using Lemma 5.2, we now show that one of the permutations in (16) can be deleted if the eigenvalues of Q_N and R_t are arranged in a particular order.

Theorem 5.2: Let UΛU^H and VΔV^H be diagonalizations of Q_N and R_t respectively, where the columns of U and V are orthonormal eigenvectors, the eigenvalues of Q_N are arranged in increasing order, and the eigenvalues of R_t are arranged in decreasing order. Then (16) is equivalent to

min_{σ,π} ∑_{i=1}^{M} 1/(σ_i²/λ_{π(i)} + 1/δ_i)   subject to ∑_{i=1}^{M} σ_i² ≤ P, π ∈ 𝒫_M,   (17)

where σ_i = 0 for i > M.

Proof: Since σ has at most M nonzero entries, and since the elements of 𝒬 are the smallest eigenvalues of Q_N and the elements of ℛ are the largest eigenvalues of R_t, we can assume that π_1(i) ∈ [1, M] and π_2(i) ∈ [1, M] for each i ∈ ℳ. Hence, we restrict the sum in (16) to the indices i ∈ 𝒮 ≜ {π_2^{-1}(j) : 1 ≤ j ≤ M}. Let us define σ_j′ = σ_{π_2^{-1}(j)} and π(j) = π_1(π_2^{-1}(j)). Since π(j) ∈ [1, M] for j ∈ [1, M], it follows that π ∈ 𝒫_M. In (16) we restrict the summation to i ∈ 𝒮 and replace i by π_2^{-1}(j) to obtain

∑_{i∈𝒮} 1/(σ_i²/λ_{π_1(i)} + 1/δ_{π_2(i)}) = ∑_{j=1}^{M} 1/((σ_j′)²/λ_{π(j)} + 1/δ_j),   where ∑_{j=1}^{M} (σ_j′)² ≤ P.

This completes the proof of (17).

Combining the relationship (9) between T and S with Theorems 5.1 and 5.2 yields the following corollary:

Corollary 5.1: Problem (3) has a solution of the form S = UΠΣV^H, where the columns of U and V are orthonormal eigenvectors of Q_N and R_t respectively, with the eigenvalues of Q_N arranged in increasing order and the eigenvalues of R_t arranged in decreasing order, Π is a permutation matrix, and Σ is diagonal.

Proof: Let σ and π be a solution of (17). For i > M, define π(i) = i and σ_i = 0. If Π is the permutation matrix corresponding to π, then the substitution S = UΠΣV^H in the cost function of (3) yields the cost function in (17). Since (16) and (17) are equivalent by Theorem 5.2, S is optimal in (3).

2) The Optimal Σ: We now consider the optimization problem which minimizes the cost function over σ with the permutation π in (17) given. In the next subsection, we find the optimal permutation π based on the solution to the optimization problem considered here. For notational simplicity, let ρ_i denote 1/λ_{π(i)} and q_i denote 1/δ_i. Hence, for fixed π, (17) is equivalent to the following optimization problem:

min_σ ∑_{i=1}^{M} 1/(ρ_i σ_i² + q_i)   subject to ∑_{i=1}^{M} σ_i² ≤ P.   (18)

The solution of (18) can be expressed in terms of a Lagrange multiplier related to the power constraint. The structure of this solution has a water-filling interpretation in the communication literature.

Theorem 5.3: The optimal solution of (18) is given by

σ_i = [ max( √(1/(ρ_i µ)) − q_i/ρ_i, 0 ) ]^{1/2},

where the parameter µ is chosen so that ∑_{i=1}^{M} σ_i² = P.

Proof: Since the minimization of the cost function in (18) is over a closed and bounded set, there exists a solution. At an optimal solution of (18), the power constraint must hold with equality; otherwise, we could multiply σ by a scalar larger than 1 to reduce the value of the cost function. For notational simplicity, let t_i = σ_i². Then the optimization problem (18) is equivalent to

min_t ∑_{i=1}^{M} 1/(ρ_i t_i + q_i)   subject to ∑_{i=1}^{M} t_i = P, t_i ≥ 0.   (19)

Since the cost function is strictly convex and the constraint set is convex, the optimal solution of (19) is unique. The first-order necessary conditions (Karush-Kuhn-Tucker conditions) for an optimal solution of (19) are the following: There exist a scalar µ ≥ 0 and a vector ν ∈ R^M such that

−ρ_i/(ρ_i t_i + q_i)² + µ − ν_i = 0,   ν_i ≥ 0,   t_i ≥ 0,   ν_i t_i = 0,   1 ≤ i ≤ M.   (20)

Due to the convexity of the cost and the constraint, any solution of these conditions is the unique optimal solution of (19). A solution of (20) can be obtained as follows. We define the function

t_i(µ) = [ √(1/(ρ_i µ)) − q_i/ρ_i ]^+.   (21)

Here x^+ = max{x, 0}. This particular value for t_i is obtained by setting ν_i = 0 in (20) and solving for t_i; when the solution is < 0, we set t_i(µ) = 0 (this corresponds to the + operator in (21)). We note that t_i(µ) is a decreasing function of µ which approaches +∞ as µ approaches 0 and which approaches 0 as µ grows to +∞. Hence, the equation

∑_{i=1}^{M} t_i(µ) = P   (22)

has a unique positive solution. Since t_i(ρ_i/q_i²) = 0, we have t_i(µ) = 0 for µ ≥ ρ_i/q_i². Then we have

−ρ_i/(ρ_i t_i(µ) + q_i)² + µ = −ρ_i/q_i² + µ > 0   for µ > ρ_i/q_i².

We deduce that the Karush-Kuhn-Tucker conditions are satisfied when µ is the positive solution of (22).

3) Optimal Eigenvector Ordering: Finally, we need to find an optimal permutation in (17), or equivalently, an optimal ordering for the eigenvalues of Q_N and R_t.

Theorem 5.4: If the eigenvalues {λ_i} of Q_N are arranged in increasing order and the eigenvalues {δ_i} of R_t are arranged in decreasing order, then an optimal permutation in (17) is

π(i) = i,   1 ≤ i ≤ M.   (23)

Proof: Suppose that σ and π are optimal in (17). For convenience, let λ_i stand for λ_{π(i)}. Suppose there exist indices i and j such that i < j, σ_i > 0, σ_j > 0, λ_i > λ_j and δ_i > δ_j (equivalently, ρ_i < ρ_j and q_i < q_j). Let P̄ = σ_i² + σ_j². Since σ is optimal, (t_i, t_j) = (σ_i², σ_j²) must solve the two-term subproblem

C = min_{t_i,t_j} 1/(ρ_i t_i + q_i) + 1/(ρ_j t_j + q_j)   subject to t_i + t_j = P̄, t_i ≥ 0, t_j ≥ 0.   (24)

By the argument in the proof of Theorem 5.3, the solution of (24), when positive, is

t_i = (1/√µ)(1/√ρ_i) − q_i/ρ_i   (25)

and

t_j = (1/√µ)(1/√ρ_j) − q_j/ρ_j,   (26)

with 1/√µ = (P̄ + q_i/ρ_i + q_j/ρ_j)/(1/√ρ_i + 1/√ρ_j), and the corresponding minimum cost is

C = (1/√ρ_i + 1/√ρ_j)² / (P̄ + q_i/ρ_i + q_j/ρ_j).

Now exchange λ_i and λ_j (equivalently, ρ_i and ρ_j). The subproblem becomes

C^+ = min_{t_i,t_j} 1/(ρ_j t_i + q_i) + 1/(ρ_i t_j + q_j)   subject to t_i + t_j = P̄, t_i ≥ 0, t_j ≥ 0.   (27)

Assuming the optimal solution of (24) remains positive after the exchange of ρ_i and ρ_j, we have

C^+ = (1/√ρ_i + 1/√ρ_j)² / (P̄ + q_j/ρ_i + q_i/ρ_j).

We need the following lemma [28]:

Lemma 5.3: If {a_i} and {b_i}, i = 1, ..., n, are two sets of numbers, then

∑_{i=1}^{n} a_{[i]} b_{[n−i+1]} ≤ ∑_{i=1}^{n} a_i b_i ≤ ∑_{i=1}^{n} a_{[i]} b_{[i]},

where a_{[1]} ≥ ... ≥ a_{[n]} denote the components of {a_i} arranged in decreasing order.

By Lemma 5.3, we have q_j/ρ_i + q_i/ρ_j > q_i/ρ_i + q_j/ρ_j since ρ_i < ρ_j and q_i < q_j. Hence C^+ < C, which contradicts the optimality of σ and π. Therefore, whenever δ_i > δ_j for active indices i < j, we must have λ_i < λ_j. In this way, the λ_i's are arranged in increasing order. Since the δ_i's are arranged in decreasing order, we conclude that the associated optimal permutation π is (23).

Now consider the case where the optimal solution of (24) after the exchange is not strictly positive. Since the original solution of (24), before the exchange, is positive, it follows from (25) and (26) that

P̄ > q_i/√(ρ_i ρ_j) − q_j/ρ_j   and   P̄ > q_j/√(ρ_i ρ_j) − q_i/ρ_i.   (28)

After the exchange, the analogous inequalities that must be satisfied to preserve positivity are

P̄ > q_j/√(ρ_i ρ_j) − q_i/ρ_j   (29)

and

P̄ > q_i/√(ρ_i ρ_j) − q_j/ρ_i.   (30)

Note that (30) follows from (28) and the fact that ρ_i < ρ_j and q_i < q_j. If (29) is also satisfied, the proof is complete since the solution of (24) after the exchange of λ_i and λ_j is positive. Now suppose that (29) is violated. In this case, we have

P̄ ≤ q_j/√(ρ_i ρ_j) − q_i/ρ_j.   (31)

Combining (28) and (31), it follows that

max{ q_i/√(ρ_i ρ_j) − q_j/ρ_j, q_j/√(ρ_i ρ_j) − q_i/ρ_i } < P̄ ≤ q_j/√(ρ_i ρ_j) − q_i/ρ_j.   (32)

We show that for all P̄ satisfying (32), C^+ ≤ C. Consequently, by exchanging λ_i and λ_j, the cost cannot increase.

Let C^∗ be the objective function value in (27) corresponding to t_j = 0 and t_i = P̄:

C^∗ = 1/(ρ_j P̄ + q_i) + 1/q_j.

We show that C^∗ ≤ C. Since C^+ ≤ C^∗, we then deduce that C^+ ≤ C. The inequality C^∗ ≤ C is equivalent to the following:

1/(ρ_j P̄ + q_i) + 1/q_j ≤ (1/√ρ_i + 1/√ρ_j)² / (P̄ + q_i/ρ_i + q_j/ρ_j).

Multiplying both sides of the above inequality by (ρ_j P̄ + q_i) q_j (P̄ + q_i/ρ_i + q_j/ρ_j) gives

q_j (P̄ + q_i/ρ_i + q_j/ρ_j) + (ρ_j P̄ + q_i)(P̄ + q_i/ρ_i + q_j/ρ_j) ≤ (1/√ρ_i + 1/√ρ_j)² (ρ_j P̄ + q_i) q_j.

After some rearrangement, this reduces to

f(P̄) = ρ_j P̄² + (q_i + q_j + ρ_j q_i/ρ_i − ρ_j q_j/ρ_i − 2√ρ_j q_j/√ρ_i) P̄ + (q_i/√ρ_i − q_j/√ρ_j)² ≤ 0.

We now evaluate f at the possible endpoints of the interval given in (32). We have

f( q_i/√(ρ_i ρ_j) − q_j/ρ_j ) = (√ρ_j/√ρ_i + 1)(q_i − q_j)( q_i/√(ρ_i ρ_j) − q_j/ρ_j − q_j/√(ρ_i ρ_j) + q_i/ρ_i ) ≤ 0

when q_i/√(ρ_i ρ_j) − q_j/ρ_j ≥ q_j/√(ρ_i ρ_j) − q_i/ρ_i,

f( q_j/√(ρ_i ρ_j) − q_i/ρ_i ) = (q_j/q_i)(ρ_i − ρ_j)( q_j/√(ρ_i ρ_j) − q_i/ρ_i − q_i/√(ρ_i ρ_j) + q_j/ρ_j ) ≤ 0

when q_j/√(ρ_i ρ_j) − q_i/ρ_i ≥ q_i/√(ρ_i ρ_j) − q_j/ρ_j, and

f( q_j/√(ρ_i ρ_j) − q_i/ρ_j ) = q_j (q_i − q_j)(√ρ_i + √ρ_j)(ρ_j − ρ_i)/(ρ_j √ρ_i) ≤ 0.

Since f is convex and nonpositive at the ends of the interval (32), f is nonpositive on the entire interval. This implies that C^∗ ≤ C, and the proof is complete.

C. Proof of Lemma 4.1

Let P_S = S(S^H S)^{-1} S^H be the projection onto R(S) and P_S^⊥ = I − P_S be the projection onto R^⊥(S). First, let (R_t, Q_N), (R_t′, Q_N′) ∈ D. Let R_q = S R_t S^H + Q_N and R_q′ = S R_t′ S^H + Q_N′. Consider P_S^⊥ R_q = P_S^⊥ Q_N = Q_N and P_S R_q = S R_t S^H, and likewise P_S^⊥ R_q′ = P_S^⊥ Q_N′ = Q_N′ and P_S R_q′ = S R_t′ S^H. Since S is of full rank, P_S R_q = P_S R_q′ iff R_t = R_t′. Also, since P_S and P_S^⊥ are projections onto complementary subspaces, R_q = R_q′ iff P_S^⊥ R_q = P_S^⊥ R_q′ and P_S R_q = P_S R_q′, i.e., iff (R_t, Q_N) = (R_t′, Q_N′). Moreover, given (R_t, Q_N), choose R_t′ ≠ R_t and define Q_N′ = Q_N + S R_t S^H − S R_t′ S^H. Since S is of full rank, Q_N′ ≠ Q_N. But R_q′ = S R_t′ S^H + Q_N′ = S R_t S^H + Q_N = R_q.

D. Proof of Lemma 4.2

In addition to the weak norm defined just before the statement of the lemma, we are also interested in the strong norm [22], [23] of a matrix A: ‖A‖ = max_{x: x^H x = 1} [x^H A^H A x]^{1/2} = √λ_max(A^H A), where λ_max represents the largest eigenvalue of a matrix. If A is Hermitian, ‖A‖ = |λ_max(A)|.

Note that Q_N can be represented by a sequence {q_k; k = 0, ±1, ±2, ...} with [Q_N]_{k,j} = q_{k−j}, q_k = q_{−k}^*, and ∑_{k=0}^{∞} |q_k| < ∞. It is shown in [29] that ‖Q_N‖ ≤ 2(|q_0| + 2 ∑_{k=1}^{∞} |q_k|) = 2M_q < ∞. To proceed, we need the following lemma [28]:

Lemma 5.4: For two Hermitian positive semi-definite matrices G and H, λ_max(GH) ≤ λ_max(G) λ_max(H).

Then, we have

‖P_S Q_N‖ = √λ_max(Q_N P_S Q_N) ≤ √(λ_max(Q_N) λ_max(P_S) λ_max(Q_N)) = ‖Q_N‖.   (33)

We now show that the difference between the two matrices goes to zero asymptotically in weak norm. Using the properties of the weak norm, we have

|Q_N − P_S^⊥ Q_N P_S^⊥|_w = |P_S Q_N + Q_N P_S − P_S Q_N P_S|_w ≤ |P_S Q_N|_w + |Q_N P_S|_w + |P_S Q_N P_S|_w.   (34)

We need the following lemma [22], [29]:

Lemma 5.5: Given two n × n matrices G and H, |GH|_w ≤ ‖G‖ |H|_w.

First note that |P_S|_w = √(tr[S(S^H S)^{-1} S^H]/N) = √(tr[I_{n_t}]/N) = √(n_t/N). Then, using the above lemma, we have |Q_N P_S|_w ≤ ‖Q_N‖ |P_S|_w ≤ 2M_q √(n_t/N). Similarly, |P_S Q_N|_w = |Q_N P_S|_w ≤ 2M_q √(n_t/N). Combining Lemma 5.5 and (33), we have |P_S Q_N P_S|_w ≤ ‖P_S Q_N‖ √(n_t/N) ≤ ‖Q_N‖ √(n_t/N) ≤ 2M_q √(n_t/N). Thus, from (34), we have lim_{N→∞} |Q_N − P_S^⊥ Q_N P_S^⊥|_w = 0.

REFERENCES

[1] B. Hassibi and B. M. Hochwald, "How much training is needed in multiple-antenna wireless links?," IEEE Trans. Inform. Theory, vol. 49, no. 4, pp. 951–963, Apr. 2003.
[2] F. Digham, N. Mehta, A. Molisch, and J. Zhang, "Joint pilot and data loading technique for MIMO systems operating with covariance feedback," Intern. Conf. 3G Mobile Commun. Technol., Oct. 2004.
[3] X. Ma, L. Yang, and G. B. Giannakis, "Optimal training for MIMO frequency-selective fading channels," IEEE Trans. Wireless Commun., vol. 4, pp. 453–466, Mar. 2005.
[4] C. Fragouli, N. Al-Dhahir, and W. Turin, "Training-based channel estimation for multiple-antenna broadband transmissions," IEEE Trans. Wireless Commun., vol. 2, pp. 384–391, Mar. 2003.
[5] T. F. Wong and B. Park, "Training sequence optimization in MIMO systems with colored interference," IEEE Trans. Commun., vol. 52, pp. 1939–1947, Nov. 2004.
[6] J. H. Kotecha and A. M. Sayeed, "Transmit signal design for optimal estimation of correlated MIMO channels," IEEE Trans. Signal Processing, vol. 52, pp. 546–557, Feb. 2004.
[7] X. Cai, G. B. Giannakis, and M. D. Zoltowski, "Space-time spreading and block coding for correlated fading channels in the presence of interference," IEEE Trans. Commun., vol. 53, pp. 515–525, Mar. 2005.
[8] S. Zhou and G. B. Giannakis, "Optimal transmitter eigen-beamforming and space-time block coding based on channel correlations," IEEE Trans. Inform. Theory, vol. 49, pp. 1673–1690, July 2003.
[9] M. Biguesh and A. B. Gershman, "MIMO channel estimation: optimal training and tradeoffs between estimation techniques," IEEE Intern. Conf. Commun., 2004.
[10] M. Medard, "The effect upon channel capacity in wireless communications of perfect and imperfect knowledge of the channel," IEEE Trans. Inform. Theory, vol. 46, no. 3, pp. 933–946, May 2000.
[11] A. M. Sayeed, "Deconstructing multi-antenna fading channels," IEEE Trans. Signal Processing, vol. 50, pp. 2563–2579, Oct. 2002.
[12] D. S. Shiu, G. J. Foschini, M. J. Gans, and J. M. Kahn, "Fading correlation and its effect on the capacity of multielement antenna systems," IEEE Trans. Commun., vol. 48, pp. 502–513, Mar. 2000.
[13] C. Chuah, D. Tse, J. Kahn, and R. Valenzuela, "Capacity scaling in MIMO wireless systems under correlated fading," IEEE Trans. Inform. Theory, vol. 48, pp. 637–650, Mar. 2002.
[14] A. Scaglione, P. Stoica, S. Barbarossa, G. B. Giannakis, and H. Sampath, "Optimal designs for space-time linear precoders and decoders," IEEE Trans. Signal Processing, vol. 50, pp. 1051–1064, May 2002.
[15] Y. Song and S. D. Blostein, "Channel estimation and data detection for MIMO systems under spatially and temporally colored interference," EURASIP Journal on Applied Signal Processing, pp. 685–695, May 2004.
[16] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, 1993.
[17] D. Palomar, J. Cioffi, and M. Lagunas, "Joint Tx-Rx beamforming design for multicarrier MIMO channels: a unified framework for convex optimization," IEEE Trans. Signal Processing, vol. 51, pp. 2381–2401, Sept. 2003.
[18] D. Palomar and J. R. Fonollosa, "Practical algorithms for a family of waterfilling solutions," IEEE Trans. Signal Processing, vol. 53, no. 2, pp. 686–695, Feb. 2005.
[19] N. Lu, "Tests on multiplicative covariance structures," Ph.D. thesis, University of Iowa, 2002.
[20] P. J. Brown, M. G. Kenward, and E. E. Bassett, "Bayesian discrimination with longitudinal data," Biostatistics, vol. 2, pp. 417–432, 2001.
[21] P. Dutilleul, "The MLE algorithm for the matrix normal distribution," Journal of Statistical Computation and Simulation, vol. 64, pp. 105–123, 1999.
[22] R. M. Gray, "On the asymptotic eigenvalue distribution of Toeplitz matrices," IEEE Trans. Inform. Theory, vol. 18, pp. 725–730, 1972.
[23] U. Grenander and G. Szego, Toeplitz Forms and Their Applications, Berkeley, CA: Univ. California Press, 1958.
[24] J. G. Proakis, Digital Communications, New York: McGraw-Hill, 2001.
[25] W. W. Hager, Y. Liu, and T. F. Wong, "Optimization of generalized mean square error in signal processing and communication," Linear Algebra and Its Applications, 2006, to appear.
[26] J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, Chichester, West Sussex: Wiley, 1988.
[27] G. Strang, Linear Algebra and Its Applications, Thomson, 2006.
[28] A. W. Marshall and I. Olkin, Inequalities: Theory of Majorization and Its Applications, New York: Academic, 1979.
[29] R. M. Gray, Toeplitz and Circulant Matrices: A Review, revised Aug. 2002. [Online]. Available: http://www-ee.stanford.edu/~gray/toeplitz.pdf

[Plot: total MSE versus training length N for the eight training sequence sets.]

Fig. 1. Comparison of total MSEs obtained using different training sequences. ISI-free symbol waveform and high spatial correlation channel.

[Plot: total MSE versus training length N for the eight training sequence sets.]

Fig. 2. Comparison of total MSEs obtained using different training sequences. ISI-free symbol waveform and low spatial correlation channel.

[Plot: total MSE versus training length N for the eight training sequence sets.]

Fig. 3. Comparison of total MSEs obtained using different training sequences. AR jammers and high spatial correlation channel.

[Plot: total MSE versus training length N for the eight training sequence sets.]

Fig. 4. Comparison of total MSEs obtained using different training sequences. AR jammers and low spatial correlation channel.