Topics in Statistical Signal Processing for Estimation and Detection in Wireless Communication Systems

by Ido Nevat, B.Sc. (Electrical Engineering), Technion - Israel Institute of Technology, 1998.

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

in the Faculty of Engineering
School of Electrical Engineering and Telecommunications
The University of New South Wales

December 2009

Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signature:

Date:

Copyright Statement

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International. I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.

Signature:

Date:

Authenticity Statement

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.

Signature:

Date:

“What is a scientist after all? It is a curious man looking through a keyhole, the keyhole of nature, trying to know what’s going on.”

Jacques-Yves Cousteau

THE UNIVERSITY OF NEW SOUTH WALES

Abstract

Faculty of Engineering
School of Electrical Engineering and Telecommunications

Doctor of Philosophy

by Ido Nevat

During the last decade there has been a steady increase in the demand for high data rates and strong reliability in wireless communication applications. Among the different solutions that have been proposed to cope with this new demand, the utilization of multiple antennas arises as one of the best candidates, since it provides an increase in both reliability and information transmission rate. A Multiple Input Multiple Output (MIMO) structure usually assumes a frequency non-selective characteristic for each channel. However, when the transmission rate is high, the whole channel can become frequency selective. Therefore, the use of Orthogonal Frequency Division Multiplexing (OFDM), which transforms a frequency selective channel into a large set of individual frequency non-selective narrowband channels, is well suited for use in conjunction with MIMO systems.

A MIMO system employing OFDM, denoted MIMO-OFDM, is able to achieve high spectral efficiency. However, the adoption of multiple antenna elements at the transmitter for spatial transmission results in a superposition of multiple transmitted signals at the receiver, weighted by their corresponding multipath channels. This in turn results in difficulties with reception, and poses a real challenge in designing a practical system that can offer a true spectral efficiency improvement.

In addition, as wireless networks continue to expand in geographical size, the distance between the source and the destination can preclude direct communication between them. In such scenarios, a repeater is placed between the source and the destination to achieve end-to-end communication. New advances in electronics and semiconductor technologies have made relay based systems feasible. As a result, these systems have become a hot research topic in the wireless research community in recent years. Potential application areas of cooperative diversity are next generation cellular networks, mobile wireless ad-hoc networks, and mesh networks for wireless broadband access. Besides increasing the network coverage, relays can provide additional diversity to combat the effects of the wireless fading channel. This thesis is concerned with methods to facilitate the use of MIMO, OFDM and relay based systems.

In the first part of this thesis, we concentrate on low complexity algorithms for the detection of symbols in MIMO systems, with various degrees of quality of channel state information. First, we design algorithms for the case where perfect Channel State Information (CSI) is available at the receiver. Next, we design algorithms for the detection of non-uniform symbol constellations where only partial CSI is given at the receiver. These are based on non-convex and stochastic optimisation techniques.

The second part of this thesis addresses primary issues in OFDM receiver design. First we design an iterative receiver for OFDM systems which performs detection, decoding and channel tracking, and aims at minimising the error propagation effect due to erroneous detection of data symbols. Next we focus our attention on channel estimation in OFDM systems where the number of channel taps and the power delay profile are both unknown a priori. Using the Trans Dimensional Markov Chain Monte Carlo (TDMCMC) methodology, we design algorithms to perform joint model order selection and channel estimation.

The third part of this thesis is dedicated to the detection of data symbols in relay systems with non-linear relay functions, where only partial CSI is available at the receiver. In order to design the optimal data detector, the likelihood function needs to be evaluated at the receiver. Since in this case the likelihood function cannot be obtained analytically, or even in closed form, we shall utilise a “Likelihood Free” inference methodology. This is based on the Approximate Bayesian Computation (ABC) theory, and enables the design of novel data sequence detectors.

Acknowledgements

During the course of this thesis I have met and interacted with many interesting people, who influenced the path of my research.

I would first like to thank my academic supervisor Dr. Jinhong Yuan, for supporting me over the years, and for giving me the freedom to explore my ideas. His constant friendly attitude, encouragement, technical insights and constructive criticism have been invaluable. I would also like to thank him for his financial support that enabled me to give my research work full attention and travel to international conferences.

I have been very fortunate to work with Dr. Gareth Peters of the School of Mathematics at the University of NSW. His patience and willingness to share his vast knowledge is inestimable. I will cherish our fruitful discussions and his helpful comments, both on the whiteboard and during our numerous coffee breaks.

Many thanks go to Dr. Miquel Payaró of CTTC, Barcelona, who helped me while I was very confused in the beginning of my studies. I am most grateful to Dr. Ami Wiesel and Dr. Yonina Eldar of Electrical Engineering at the Technion for helping me get a better understanding of Bayesian inference and optimisation techniques. Thanks also go to Dr. Scott Sisson and Dr. Yanan Fan of the School of Mathematics at the University of NSW for sharing their knowledge in Bayesian statistics.

I would also like to thank all my friends and colleagues at the Wireless Communication Lab at UNSW with whom I have had the pleasure of working over the years; thanks go to Imitiaz, Tom, Adeel, Marwan, Anisul, Jonas, David, Giovanni, Tuyen and Nam Tram. Thanks also go to the UNSW staff, especially to Joseph Yiu, who was willing to lend a hand with any technical matter, and to Gordon Petzer and May Park for taking care of all administrative issues. Many thanks also go to the people of ACoRN, especially Dr. Lars Rasmussen and Christine Thursby, for arranging academic activities and making sure we have sufficient funding.

I would like to thank my family for their unwavering support and love throughout my life; without them, none of this would have been possible.

Finally I would like to thank my partner Dr. Karin Avnit, for too many reasons to list here.

Dedicated to my Parents

“And since you know you cannot see yourself,

so well as by reflection, I, your glass,

will modestly discover to yourself,

that of yourself which you yet know not of.”

William Shakespeare


Contents

Originality Statement iii

Copyright Statement iv

Authenticity Statement iv

Abstract vi

Acknowledgements viii

List of Figures xix

List of Tables xxi

List of Algorithms xxiii

Acronyms xxiv

Notations and Symbols xxx

1 Introduction 1
1.1 Motivation ...... 1
1.2 Outline of the dissertation ...... 4
1.3 Research contributions ...... 6

2 Bayesian Inference and Analysis 9
2.1 Introduction ...... 9
2.2 Background ...... 9
2.3 Bayesian Inference ...... 10
2.3.1 Prior Distributions ...... 11
2.3.2 Point Estimates ...... 13
2.3.3 Interval Estimation ...... 16
2.4 Bayesian Model Selection ...... 16

2.4.1 Introduction ...... 16
2.5 Bayesian Estimation Under Unknown Model Order ...... 18
2.5.1 Bayesian Model Averaging ...... 18
2.5.2 Bayesian Model Order Selection ...... 19
2.6 Bayesian Filtering ...... 19
2.6.1 State Space Models ...... 20
2.6.2 Sequential Bayesian Inference ...... 21
2.6.3 Filtering Objectives ...... 22
2.6.4 Sequential Scheme ...... 22
2.6.5 Linear State-Space Models - the Kalman Filter ...... 23
2.7 Consistency Tests ...... 25
2.8 The EM and Bayesian EM Methods ...... 26
2.9 Lower Bounds on the MSE ...... 28
2.10 Bayesian Methodology Summary ...... 29
2.11 Monte Carlo Methods ...... 29
2.11.1 Motivation ...... 29
2.11.2 Monte Carlo Techniques ...... 30
2.11.3 Sampling From Distributions ...... 31
2.11.4 Inversion Sampling ...... 32
2.11.5 Accept-Reject Sampling ...... 33
2.11.6 Importance Sampling ...... 34
2.12 Markov Chain Monte Carlo Methods ...... 35
2.12.0.1 Introduction ...... 35
2.12.1 Basics of Markov Chains ...... 36
2.12.2 Metropolis Hastings Sampler ...... 37
2.12.3 Gibbs Sampler ...... 38
2.12.4 Simulated Annealing ...... 40
2.12.4.1 Introduction ...... 40
2.12.4.2 Methodology ...... 40
2.12.5 Convergence Diagnostics of MCMC ...... 41
2.12.5.1 Burn in Period ...... 41
2.12.5.2 ...... 41
2.13 Trans Dimensional Markov Chain Monte Carlo ...... 42
2.13.1 Introduction ...... 42
2.13.1.1 Posterior densities as proposal densities ...... 44
2.13.1.2 Independent sampler ...... 44
2.13.1.3 Standard Metropolis-Hastings ...... 44
2.14 Stochastic Approximation Markov Chain Monte Carlo ...... 45
2.14.1 Stochastic Approximation ...... 45
2.14.2 Stochastic Approximation Markov Chain Monte Carlo ...... 46
2.15 Approximate Bayesian Computation ...... 47
2.15.1 Introduction ...... 48
2.15.2 Basic ABC Algorithm ...... 48
2.15.3 Data Summaries ...... 50
2.15.4 MCMC-ABC Samplers ...... 50
2.15.5 Distance Metrics ...... 52
2.15.6 ABC methodology summary ...... 53
2.16 Concluding Remarks ...... 54

3 Introduction to Wireless Communication 55
3.1 Introduction ...... 55
3.2 Modeling of Fading Channels ...... 55
3.2.1 Tapped Delay-line Channel Model ...... 57
3.2.2 Doppler Offset ...... 57
3.2.3 Power Delay Profile ...... 58
3.2.4 Coherence Bandwidth ...... 58
3.2.5 Coherence Time ...... 58
3.2.6 Time Selective and Fast Fading Channels ...... 58
3.2.7 Slow Fading Channels ...... 59
3.2.8 Frequency Selective Channels ...... 59
3.2.9 Flat Fading Channels ...... 60
3.3 Channel Models ...... 60
3.3.1 Rayleigh Fading Channels ...... 60
3.3.2 Clarke's / Jake's Model ...... 61
3.3.3 Approximation of Jake's Model ...... 62
3.4 Overview of Multi Antenna Communication Systems ...... 63
3.4.1 The Linear MIMO Channel ...... 63
3.4.2 Channel Model ...... 63
3.4.3 Uncertainty Models for the Channel State Information ...... 64
3.5 Detection Techniques in MIMO Systems ...... 65
3.5.1 Linear Detectors ...... 66
3.5.2 VBLAST Detector ...... 66
3.5.3 Sphere Decoder ...... 67
3.6 Overview of OFDM Systems ...... 69
3.6.1 OFDM Signals and Orthogonality ...... 69
3.6.2 OFDM Symbols Transmission ...... 70
3.6.3 Multi Carrier versus Single Carrier Modulation Schemes ...... 74
3.7 Channel Estimation in OFDM Systems ...... 75
3.7.1 Pilot Aided Channel Estimation ...... 76
3.7.2 Blind Channel Estimation ...... 77
3.7.3 Semi Blind Channel Estimation ...... 78
3.7.4 MIMO-OFDM System Model ...... 79
3.8 Channel Coding ...... 79
3.8.1 Linear Codes ...... 79
3.8.2 Convolutional Coding ...... 80
3.8.3 BICM Technique ...... 80
3.9 Iterative Processing Techniques ...... 81
3.9.1 The Turbo Principle ...... 81
3.9.2 Iterative Detection, Decoding and Estimation ...... 82
3.9.3 Iterative Detector and Decoder Components ...... 82
3.10 Relay Based Communication Systems ...... 84
3.10.1 Introduction ...... 84
3.10.2 Relay System Model ...... 84
3.10.3 MAP Detection in Memoryless Relay Functions ...... 86
3.11 Concluding Remarks ...... 87

4 Detection in MIMO Systems using Power Equality Constraints 89
4.1 Introduction ...... 89
4.2 Background ...... 90
4.3 System Description ...... 91
4.4 Power Equality Constraint Least Square Detection ...... 93
4.4.1 Basic Definitions and Problem Settings ...... 93
4.4.2 Constraint LS Detection for a Specific Power Group ...... 96
4.4.3 PEC-LS Detection with QAM Modulation ...... 97
4.4.4 Ordered PEC-LS Detection with Reduced Number of Power Groups ...... 99
4.5 Improved Ordered Power Equality Constraint Detection ...... 102
4.6 Efficient Implementation and Complexity Analysis ...... 104
4.6.1 Efficient Implementation of Constrained LS Detector ...... 105
4.6.2 Overall Complexity ...... 108
4.7 Simulation Results ...... 109
4.7.1 System Configuration ...... 110
4.7.2 Discussion of Simulation Results ...... 110
4.7.3 Complexity Assessment ...... 113
4.8 Chapter Summary and Conclusions ...... 115

5 Detection of Gaussian Constellations in MIMO Systems under Imperfect CSI 117
5.1 Introduction ...... 117
5.2 Background ...... 118
5.3 System Description ...... 119
5.3.1 Pilot Aided Maximum Likelihood Channel Estimation ...... 121
5.3.2 Detection in a Mismatched Receiver ...... 121
5.4 Bayesian Detection under Channel Uncertainty ...... 123
5.4.1 Optimal MAP Detection ...... 123
5.4.2 Linear MMSE Detection ...... 124
5.4.3 Hidden Convexity Based MAP Detector ...... 124
5.4.4 Bayesian EM based MAP Detector ...... 126
5.4.4.1 Initial guess of x0 ...... 128
5.4.5 MAP Detection with Unknown Noise Variance ...... 128
5.4.6 Efficient Implementation and Complexity Analysis ...... 131
5.5 Simulation Results ...... 132
5.5.1 System Configuration ...... 133
5.5.2 Constellation Design ...... 133
5.5.3 Comparison of Detection Techniques ...... 134
5.6 Chapter Summary and Conclusions ...... 139
5.7 Appendix I ...... 140
5.8 Appendix II ...... 141

6 Iterative Receiver for Joint Channel Tracking and Decoding in BICM-OFDM Systems 143
6.1 Introduction ...... 143
6.2 Background ...... 144
6.3 System Description ...... 147
6.3.1 BICM-OFDM Transmitter ...... 147
6.3.2 Channel Model ...... 148
6.3.3 Autoregressive Modeling ...... 149
6.4 Receiver Structure ...... 150
6.4.1 Soft Demodulator ...... 150
6.4.2 MAP Decoder ...... 152
6.4.3 Channel Tracking with Known Symbols ...... 153
6.4.4 Decision-Directed Based Channel Tracking ...... 153
6.4.5 Robust Estimation ...... 155
6.4.6 Channel Tracking using Adaptive Detection Selection ...... 156
6.4.7 Tracking Quality Indicator using Consistency Test ...... 158
6.4.8 Adaptive Detection Selection Algorithm ...... 159
6.4.9 Discussion - Soft versus Hard Kalman Filter ...... 159
6.5 Estimation Error Analysis ...... 160
6.6 Simulation Results ...... 163
6.6.1 System Configuration ...... 163
6.6.2 Kalman Filter Consistency Tests ...... 163
6.6.3 BER and Channel Estimation Error Results ...... 164
6.7 Chapter Summary and Conclusions ...... 169
6.8 Appendix - LLR Values of the A posteriori Probabilities ...... 170

7 Channel Estimation in OFDM Systems using Trans Dimensional MCMC 173
7.1 Introduction ...... 173
7.2 Background ...... 174
7.3 System Description ...... 175
7.3.1 Channel Model ...... 176
7.4 Channel Estimation with Unknown Channel Length ...... 177
7.4.1 Channel Estimation using Bayesian Model Averaging ...... 177
7.4.2 Channel Estimation using Bayesian Model Order Selection ...... 178
7.4.3 Complexity Issues ...... 179
7.4.4 Simulation Results ...... 179
7.5 Channel Estimation with Unknown Channel Length and Unknown PDP ...... 182
7.6 Trans Dimensional Markov chain Monte Carlo ...... 184
7.6.1 Specification of Within-Model Moves: Metropolis-Hastings within Gibbs ...... 188
7.6.1.1 Specification of Transition Kernel $T(h_{1:L^{(t-1)}}^{(t-1)} \rightarrow h_i^*)$ ...... 188
7.6.1.2 Specification of Transition Kernel $T(\beta^{(t-1)} \rightarrow \beta^*)$ ...... 188
7.6.2 Specification of the Between-Model Moves Transition Kernel ...... 189
7.7 Design of Between-Model Birth and Death Proposal Moves ...... 190
7.7.1 Algorithm 1: Basic Birth Death Moves ...... 190
7.7.2 Algorithm 2: Stochastic Approximation TDMCMC ...... 190
7.7.3 Algorithm 3: Conditional Path Sampling TDMCMC ...... 194
7.7.3.1 Generic Construction of the CPS proposal ...... 194
7.8 Complexity Analysis ...... 198
7.9 Estimator Efficiency via Bayesian Cramér-Rao Type Bounds ...... 198
7.10 Simulation Results ...... 203
7.10.1 System Configuration and Algorithms Initialization ...... 203
7.10.2 Model Sensitivity Analysis ...... 204
7.10.2.1 Sensitivity of Model Order to Prior Choice Pr(L) ...... 204
7.10.2.2 Sensitivity of Model Order to the True Decay Rate β ...... 204
7.10.2.3 Analysis of Posterior Precision for Marginals ...... 205
7.10.2.4 Estimated Pairwise Marginal Posterior Distributions ...... 206
7.10.3 Comparative Performance of Algorithms ...... 207
7.10.4 Algorithm Performance ...... 209
7.11 Chapter Summary and Conclusions ...... 213
7.12 Appendix ...... 214

8 Bayesian Symbol Detection in Wireless Relay Networks Using “Likelihood Free” Inference 217
8.1 Introduction ...... 217
8.1.1 Relay Communications ...... 218
8.1.2 Model and Assumptions ...... 222
8.1.2.1 Prior Specification and Posterior ...... 223
8.1.2.2 Evaluation of the Likelihood Function ...... 224
8.1.3 Inference and MAP Sequence Detection ...... 224
8.2 Likelihood-Free Methodology ...... 225
8.3 Algorithm 1 - MAP Sequence Detection via MCMC-ABC ...... 227
8.3.1 Observations and Synthetic Data ...... 227
8.3.2 Summary Statistics ...... 228
8.3.3 Distance Metric ...... 229
8.3.4 Weighting Function ...... 229
8.3.5 Tolerance Schedule ...... 229
8.3.6 Performance Diagnostic ...... 231
8.4 Algorithm 2 - MAP Sequence Detection via Auxiliary MCMC ...... 231
8.5 Alternative MAP Detectors and Lower Bound Performance ...... 233
8.5.1 Sub-optimal Exhaustive Search Zero Forcing Approach ...... 233
8.5.2 Lower Bound MAP Detector Performance ...... 234
8.6 Simulation Results ...... 234
8.6.1 Analysis of Mixing and Convergence of MCMC-ABC Methodology ...... 235
8.6.2 Analysis of ABC Model Specifications ...... 237
8.6.3 Comparisons of Detector Performance ...... 237
8.7 Chapter Summary and Conclusions ...... 240

9 Conclusions and Future Work 241
9.1 Conclusions ...... 241
9.2 Future Work ...... 243

10 Appendix 245
10.1 Properties of Gaussian Distribution ...... 245
10.2 Definition of circularity ...... 246
10.3 Bayesian Derivation of the Kalman Filter ...... 246
10.3.1 Introduction ...... 246
10.3.2 MMSE Derivation of Kalman Filter ...... 247
10.3.3 MAP Derivation of Kalman Filter ...... 249
10.4 The EM algorithm ...... 252
10.4.1 Introduction ...... 252
10.4.2 Derivation of EM ...... 252
10.4.3 BEM algorithm ...... 254

List of Figures

2.1 Parameter estimation criteria based on the marginal posterior distribution ...... 15
2.2 The inverse transform method to obtain samples ...... 32
2.3 Sample path of a bivariate distribution using the Gibbs sampler ...... 39

3.1 Types of fading channels ...... 56
3.2 MIMO system model ...... 64
3.3 Idea behind the sphere decoder ...... 68
3.4 Block diagram of OFDM transceiver ...... 70
3.5 Frequency domain representation of three OFDM subcarriers ...... 71
3.6 Rate 1/2 convolutional encoder ...... 80
3.7 Block diagram of a BICM encoder ...... 81
3.8 Block diagram of a BICM decoder ...... 81
3.9 Iterative receiver for coded systems ...... 83
3.10 Parallel Relay Channels with one source, L relay nodes and one destination ...... 85

4.1 Quantum Power Level for 16QAM modulation ...... 94
4.2 Decision Boundaries for Φ1 and Φ3 ...... 103
4.3 Decision Boundaries for Φ2 ...... 104
4.4 BER performance of a MIMO system with 16QAM, M = N = 2 ...... 111
4.5 BER performance of a MIMO system with 16QAM, M = N = 4 ...... 112
4.6 BER performance of a MIMO system with 64QAM, M = N = 2 ...... 112
4.7 BER performance of a MIMO system with 64QAM, M = 2, N = 4 ...... 113
4.8 Average number of groups ...... 114

5.1 MIMO system model ...... 120
5.2 (Discrete) finite Gaussian constellation of 16-PAM ...... 120
5.3 Near-Gaussian constellation of 16-PAM with λ = 1/40 ...... 134
5.4 BER performance of MIMO systems with N = 2, M = 1 ...... 135
5.5 BER performance of MIMO systems with N = 4, M = 2 ...... 136
5.6 BER performance of MIMO systems with M = N = 4 ...... 137
5.7 BER performance of MIMO systems with M = N = 8 ...... 137
5.8 BER performance comparison between BEM and BEM-Gibbs detectors ...... 138
5.9 Average number of iterations of the BEM algorithm for different starting points ...... 138

6.1 Block diagram of the BICM-OFDM transmitter ...... 148
6.2 Block diagram of the BICM-OFDM iterative receiver ...... 151

6.3 Number of subcarriers used for different values of confidence interval ...... 164
6.4 BER performance comparison of ADS versus USE-ALL for v = 90 km/hr ...... 165
6.5 Channel estimation error performance comparison of ADS versus USE-ALL ...... 166
6.6 BER performance comparison of ADS versus GENIE-AIDED for ...... 167
6.7 BER performance comparison of ADS versus EM-KALMAN ...... 168
6.8 BER performance comparison of ADS versus IRLS-HM ...... 168

7.1 Channel model with unknown number of taps and power delay profile ...... 176
7.2 Channel estimation MSE performance of BMA and BMOS estimators ...... 180
7.3 BER performance comparison of BMA and BMOS estimators ...... 180
7.4 CIR length estimation, L = 8, SNR = 10 dB ...... 181
7.5 MAP channel order estimation, K = {32, 64, 128}, L = 8 ...... 181
7.6 Sensitivity of MAP estimate from Pr(L|y) to prior mean λ ...... 205
7.7 Sensitivity of MAP estimate from Pr(L|y) to β ...... 206
7.8 Sensitivity of MAP estimate from Pr(L|y) to β ...... 207
7.9 Marginal distribution of L, Pr(L|y), versus SNR ...... 208
7.10 Pairwise marginal posterior distributions for p(h_i, h_j | y) ...... 209
7.11 Average MSE of the marginal posterior model probability, Pr(L = 8 | y) ...... 210
7.12 MSE performance for the OFDM system using CPS-TDMCMC algorithm ...... 211
7.13 BER performance of CPS-TDMCMC algorithm ...... 212

8.1 Two hop relay system with L relay nodes ...... 220
8.2 Comparison of performance for MCMC-ABC with different distance metrics ...... 236
8.3 Maximum distance between the edf and the baseline “true” edf ...... 238
8.4 SER performance of the proposed detector schemes ...... 239

List of Tables

2.1 Summary of point estimators ...... 16
2.2 Bayesian estimation under model uncertainty ...... 20

3.1 OFDM system models summary ...... 74

4.1 Relation between Ω(2,S), Φ(S) and G(2,S) ...... 95
4.2 Decision Boundaries for Φ1 and Φ3 ...... 104
4.3 Decision Boundaries for Φ2 ...... 105
4.4 Complexity of PEC-LS detector components ...... 110
4.5 Number of iterations in the line search as a function of power groups ...... 114

7.1 Computational complexity of within-model moves of the TDMCMC algorithms ...... 199
7.2 Computational complexity of the between-model moves of BD-TDMCMC algorithm ...... 199
7.3 Computational complexity of the CPS-TDMCMC algorithm ...... 200


List of Algorithms

1 Inversion sampling algorithm ...... 32
2 Accept-Reject sampling algorithm ...... 33
3 Importance sampling algorithm ...... 34
4 Metropolis-Hastings algorithm ...... 37
5 Gibbs sampling algorithm ...... 39
6 Generic TDMCMC algorithm ...... 44
7 Rejection algorithm 1 ...... 48
8 Rejection algorithm 2 ...... 49
9 ABC algorithm 1: ǫ-tolerance rejection ...... 49
10 ABC algorithm 2 ...... 50
11 ABC-MCMC sampler ...... 52
12 PEC-LS detector ...... 100
13 OPEC based detector for MIMO systems ...... 101
14 Constrained LS soft detection of the ω-th power group ...... 109
15 Consistency Test and Adaptive Subcarriers Selection ...... 159
16 Generic TDMCMC Algorithm ...... 187
17 SA-TDMCMC: Between-Model Moves Transition Kernel ...... 193
18 CPS-TDMCMC: Between-Model Moves Transition Kernel ...... 196

19 Sampling $(h^*_{1:L^*}, \beta^*, L^*)$ via CPS method ...... 197
20 MAP sequence detection algorithm using MCMC-ABC ...... 230
21 MAP sequence detection algorithm using AV-MCMC ...... 232

Acronyms

“There are three kinds of lies: lies, damned lies, and statistics.”

Benjamin Disraeli

ABC Approximate Bayesian Computation

ACF Autocorrelation Function

ADS Adaptive Detection Selection

AF Amplify and Forward

APP A Posteriori Probabilities

a.s. almost surely

AR Auto Regressive

ARMA Auto Regressive Moving Average

AV Auxiliary Variable

AWGN Additive White Gaussian Noise

BCRLB Bayesian Cram´erRao Lower Bound

BD Birth-Death

BEM Bayesian Expectation Maximisation

BER Bit Error Rate

BICM Bit Interleaved Coded Modulation

BIM Bayesian Information Matrix

BMA Bayesian Model Averaging

BMOS Bayesian Model Order Selection

BPSK Binary Phase Shift Keying

cdf cumulative distribution function

CF Compress and Forward

CFO Carrier Frequency Offset

CFR Channel Frequency Response

CIR Channel Impulse Response

CP Cyclic Prefix

CPS Conditional Path Sampling

CRLB Cram´erRao Lower Bound

CSI Channel State Information

DD Decision Directed

DF Decode and Forward

DFT Discrete Fourier Transform

ECC Error Control Coding

edf empirical distribution function

EF Estimate and Forward

EIV Errors In Variables

EM Expectation Maximisation

FD Frequency Domain

FEC Forward Error Correction

FIR Finite Impulse Response

FFT Fast Fourier Transform

HD Hard Decision

IBI Inter Block Interference

ICI Inter Carrier Interference

IDFT Inverse Discrete Fourier Transform

IFFT Inverse Fast Fourier Transform

IG Inverse Gamma

i.i.d. independent and identically distributed

IOPEC Improved Ordered Power Equality Constraint

IRLS Iteratively Reweighted Least Squares

ISI Inter Symbol Interference

KF Kalman filter

LDSSM Linear Dynamic State Space Model

LLR Log Likelihood Ratio

LMMSE Linear Minimum Mean Squared Error

LOS Line Of Sight

LS Least Squares

LU Lower Upper

MA Moving Average

MMAP Marginal Maximum A Posteriori

MAP Maximum A Posteriori

MB Maxwell Boltzmann

MC Multi Carrier

MCMC Markov Chain Monte Carlo

MF Matched Filter

MH Metropolis Hastings

MIMO Multiple Input Multiple Output

MISO Multiple Input Single Output

ML Maximum Likelihood

MLSE Maximum Likelihood Sequence Estimator

MSE Mean Squared Error

MST Most Significant Taps

MMSE Minimum Mean Squared Error

MV Minimum Variance

NIS Normalized Innovation Squared

NLOS Non Line Of Sight

NSE Normalized Squared Error

OFDM Orthogonal Frequency Division Multiplexing

OPEC Ordered Power Equality Constraint

PAPR Peak to Average Power Ratio

PAM Pulse Amplitude Modulation

PASM Pilot Symbol Aided Modulation

PEC Power Equality Constraint

pdf probability density function

PDP Power Delay Profile

pmf probability mass function

PN Phase Noise

PS Parallel to Serial

PSD Power Spectral Density

PSP Per Survivor Processing

PTM Probability Transition Matrix

QAM Quadrature Amplitude Modulation

QPL Quantum Power Level

QPSK Quadrature Phase Shift Keying

RF Radio Frequency

RJMCMC Reversible Jump Markov Chain Monte Carlo

RMS Root Mean Square

r.v. random variable

SA Stochastic Approximation

SAMC Stochastic Approximation Monte Carlo

SC Single Carrier

SD Soft Decision

SDR Semi-Definite Relaxation

SER Symbol Error Rate

SES Sub-Optimal Exhaustive Search

SES-ZF Sub-Optimal Exhaustive Search - Zero Forcing

SIMO Single Input Multiple Output

SISO Single Input Single Output

SMC Sequential Monte Carlo

SNR Signal to Noise Ratio

SP Serial to Parallel

s.t. such that

SVD Singular Value Decomposition

TD Time Domain

TDMA Time Division Multiple Access

TDMCMC Trans Dimensional Markov Chain Monte Carlo

TRS Trust Region Subproblem

VBLAST Vertical Bell Labs Layered Space Time

VCO Voltage Controlled Oscillator

w.p. with probability

WSSUS Wide Sense Stationary Uncorrelated Scattering

ZF Zero Forcing

Notations and Symbols

“We could, of course, use any notation we want; do not laugh at notations; invent them, they are powerful. In fact, mathematics is, to a large extent, invention of better notations.”

Richard P. Feynman

It shall be assumed that a random variable x can be defined on a probability space of the form (E, ǫ, P), where E represents the space of all outcomes, which may be either discrete or continuous and may be multidimensional; ǫ represents σ(E), the sigma-algebra generated by the space E; and P is a probability measure on the space E. The notation p(dx) shall be used to represent the law or distribution of the random variable x, which is a probability measure given by the image measure on the space in question. In this thesis, all models considered will be either discrete or continuous, open or compact subsets of Euclidean space. Furthermore, it shall be assumed that all distributions of interest admit densities with respect to either the counting measure or the Lebesgue measure.

We now introduce some notation that will be used throughout the thesis. Boldface upper-case letters denote matrices, boldface lower-case letters denote column vectors, and lower-case italics denote scalars.

N, Z, R, C   The sets of all natural, integer, real and complex numbers, respectively.
R_+   The set of all strictly positive real numbers.
Z^{n×m}, R^{n×m}, C^{n×m}   The sets of n × m matrices with integer-, real- and complex-valued entries, respectively. If m = 1, the index can be dropped.
X^T   Transpose of the matrix X.
X^*   Conjugate of the matrix X.
X^H   Complex conjugate and transpose (Hermitian) of the matrix X.
X^{-1}   Inverse of the matrix X.
Tr{A}   Trace of the matrix A.
|X| or Det(X)   Determinant of the matrix X.
vec(X)   Vector constructed with the elements in the diagonal of matrix X.
diag(x)   Matrix constructed with the elements of x on its main diagonal.
X^†   Moore-Penrose pseudo-inverse of the matrix X.
I or I_n   Identity matrix and identity matrix of dimension n × n, respectively.
[H]_{ij}   (i,j)-th component of the matrix H.
|x|   Magnitude of the complex scalar x.
‖x‖²   Squared Euclidean norm of the vector x: ‖x‖² = x^H x.
⌊x⌋   Largest integer smaller than or equal to x.
⌈x⌉   Smallest integer larger than or equal to x.
max, min   Maximum and minimum.
∝   Equal up to a scaling factor (proportional).
≜   Defined as.
CN(m, C)   Complex circularly symmetric Gaussian vector distribution with mean m and covariance matrix C.
N(m, C)   Gaussian vector distribution with mean m and covariance matrix C.
IG(α, β)   Inverse Gamma distribution with shape parameter α and scale parameter β.
χ²_K   Chi-squared distribution with K degrees of freedom.
Poi(·; λ)   Poisson distribution with mean λ.
U[α, β]   Uniform distribution on the support [α, β].
p_X(x), p(x)   Probability density function of the random variable x.
Pr(x)   Probability mass function of the random variable x.
p(x|y)   Conditional distribution of x given y.
p(x, y)   Joint distribution of x and y.
z ∼ p(z)   z is distributed according to p(z).
|A|   Cardinality of the set A, i.e., the number of elements in A.
x_{0:n}   Sequence of vectors x_0, ..., x_n.
x_{-k}   Vector with the k-th component missing: x_{-k} ≜ [x_1, x_2, ..., x_{k-1}, x_{k+1}, ..., x_K].
E{·}   Mathematical expectation.
arg   Argument.
inf   Infimum (greatest lower bound).
lim   Limit.
log(·)   Natural logarithm.
log_a(·)   Base-a logarithm.
Re{·}   Real part.
Im{·}   Imaginary part.
→ a.s.   Almost sure convergence.
→ d   Convergence in distribution.
⊗   Matrix Kronecker product.
δ(·)   Dirac delta function.
I(·)   Indicator function.
T(y)   Summary statistics of y.
ρ(x, y)   Distance metric between x and y.
O(N)   The computational complexity is of order N operations.

Chapter 1

Introduction

“In theory, theory and practice are the same. In practice, they are not.”

Lawrence Peter ”Yogi” Berra

1.1 Motivation

Claude Shannon, an engineer at Bell Labs, is possibly the most important figure in the field of communication theory. Among the many fundamental results in his well-known paper “A Mathematical Theory of Communication” [1], of special importance is his discovery of the capacity formula. Shannon proved that reliable communication between a transmitter and a receiver is possible even in the presence of noise, and laid the foundation of a new field of research, known as information theory. For the particular case of a unit-gain band-limited continuous channel corrupted with Additive White Gaussian Noise (AWGN), Shannon obtained his celebrated formula for the capacity

$$C = W \log\left(\frac{P_T + P_N}{P_N}\right), \qquad (1.1)$$

where $W$, $P_T$ and $P_N$ represent the bandwidth, the average transmitted power, and the noise power, respectively.
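As a quick numerical sketch of (1.1) (an illustration added here, not part of [1]; with a base-2 logarithm the capacity comes out in bits per second, and the bandwidth and SNR values below are arbitrary):

```python
import numpy as np

def awgn_capacity(W, P_T, P_N):
    """Shannon capacity (1.1) of the band-limited AWGN channel.

    Note that log((P_T + P_N) / P_N) = log(1 + P_T / P_N); using a
    base-2 logarithm expresses the capacity in bits per second."""
    return W * np.log2((P_T + P_N) / P_N)

W = 1e6  # 1 MHz of bandwidth (an arbitrary illustrative value)
for snr_db in (0, 10, 20, 30):
    P_T = 10 ** (snr_db / 10)  # transmit power relative to unit noise power
    print(f"{snr_db:2d} dB SNR -> {awgn_capacity(W, P_T, 1.0) / 1e6:.2f} Mbit/s")
```

The printed capacities (about 1.0, 3.5, 6.7 and 10.0 Mbit/s) already display the logarithmic behaviour discussed next: each tenfold increase in transmitted power adds only roughly 3.3 Mbit/s.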

The available electromagnetic bandwidth, W, and the maximum radiated power, P_T, are subject to fundamental physical constraints as well as regulatory and practical constraints, and are therefore limited. One approach to increase capacity is to emit more power. However, due to the logarithmic dependence of the spectral efficiency on the transmitted power, this would be extremely expensive. It may also violate regulatory power masks and cause nonlinearity in the power amplifier. Moreover, the effects of electromagnetic radiation on people's well-being should also be taken into consideration. A second approach to increase capacity would be to utilize a wider electromagnetic band. However, as mentioned before, the radio spectrum is a scarce, and therefore very expensive, resource. Consequently, designers face the challenge of designing wireless systems that are capable of providing increased data rates and improved performance while utilising existing frequency bands and channel conditions.

Foschini and Telatar showed in [2] and [3] that by using multiple antennas at the transmitter and the receiver, Multiple Input Multiple Output (MIMO) systems can increase the capacity without having to increase the transmitted power or use a wider band, provided the channel exhibits rich scattering and its variations are accurately tracked by the receiver. More specifically, the capacity of a MIMO system can grow, in principle, linearly with the minimum of the number of inputs and outputs.
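This linear scaling can be checked numerically with the standard ergodic capacity expression $C = \mathbb{E}\left[\log_2 \det\left(I_N + \frac{\rho}{M} H H^H\right)\right]$ for an i.i.d. Rayleigh channel with M inputs and N outputs. The sketch below is a generic illustration of the results of [2], [3]; the antenna counts, SNR and trial budget are arbitrary choices made here:

```python
import numpy as np

rng = np.random.default_rng(0)

def ergodic_mimo_capacity(M, N, rho, trials=2000):
    """Monte Carlo estimate of E[log2 det(I_N + (rho/M) H H^H)] for an
    N x M channel H with i.i.d. CN(0, 1) entries (Rayleigh fading)."""
    total = 0.0
    for _ in range(trials):
        H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
        total += np.log2(np.linalg.det(np.eye(N) + (rho / M) * H @ H.conj().T).real)
    return total / trials

rho = 10.0  # 10 dB SNR
for m in (1, 2, 4, 8):
    print(f"{m} x {m}: {ergodic_mimo_capacity(m, m, rho):.2f} bit/s/Hz")
```

The printed values grow roughly in proportion to min(M, N), which is the scaling referred to above.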

MIMO systems are therefore a key technology for fulfilling the requirements of future communication networks, and over the last decade there has been a steady growth of research activity in the area of MIMO systems.

A fundamental task of any receiver is the detection of the data sent by the transmitter. Optimal data detection in MIMO systems, i.e., the estimation of the transmitted data, can be carried out using a brute-force search over all possible codewords. However, such a search results in an exponential explosion of the search space, prohibiting its use. Many suboptimal but less complex schemes have been suggested. Still, data detection in MIMO systems is an ongoing field of research, especially under channel uncertainty.
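To make the exponential explosion concrete, here is a sketch of the brute-force maximum-likelihood detector for the linear model y = Hx + n (a generic illustration added here; the channel matrix and noise level are arbitrary):

```python
import itertools
import numpy as np

def ml_detect(y, H, alphabet):
    """Brute-force ML detection: argmin over x in alphabet^M of ||y - Hx||^2.

    The loop visits |alphabet|**M candidates, i.e. the cost is exponential
    in the number of transmit antennas M."""
    M = H.shape[1]
    best, best_cost = None, np.inf
    for cand in itertools.product(alphabet, repeat=M):
        x = np.asarray(cand)
        cost = np.linalg.norm(y - H @ x) ** 2
        if cost < best_cost:
            best, best_cost = x, cost
    return best

# Toy 2x2 QPSK example: 4**2 = 16 candidates. At 64-QAM with M = 4 the
# same search would already visit 64**4 = 16,777,216 candidate vectors.
rng = np.random.default_rng(0)
qpsk = [(a + 1j * b) / np.sqrt(2) for a in (-1, 1) for b in (-1, 1)]
H = np.array([[1.0, 0.5], [0.2, 1.0]], dtype=complex)
x_true = np.array([qpsk[0], qpsk[3]])
y = H @ x_true + 0.05 * (rng.standard_normal(2) + 1j * rng.standard_normal(2))
print(ml_detect(y, H, qpsk))
```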

The OFDM transmission technique [4], [5], [6], [7] has been widely investigated over the last 40 years. OFDM is an attractive multi-carrier modulation technique because of its high spectral efficiency and simple single-tap equaliser structure, as it splits the entire bandwidth into a number of overlapping narrowband subchannels requiring lower symbol rates. Furthermore, Inter Symbol Interference (ISI) and Inter Carrier Interference (ICI) can be easily eliminated by inserting a Cyclic Prefix (CP) in front of each transmitted OFDM block. However, despite these attractive features, some problems remain to be solved. Reliable estimation of time-varying parameters, such as channel fluctuations, frequency offset and synchronisation, while reducing the overhead of pilot symbols, is a challenging task. Another challenge is to perform channel estimation in the absence of a priori knowledge of different physical properties, such as the number of channel taps and the Power Delay Profile (PDP).
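The single-tap equaliser structure follows from the fact that the CP turns linear convolution with the channel into circular convolution, which the DFT diagonalises. A toy sketch added here for illustration (all sizes and channel taps are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
K, Lcp = 64, 8                        # subcarriers and cyclic-prefix length
h = np.array([0.8, 0.4 + 0.3j, 0.2])  # toy CIR, shorter than the CP

X = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), size=K) / np.sqrt(2)  # QPSK
x = np.fft.ifft(X) * np.sqrt(K)       # OFDM modulation (unitary IDFT scaling)
tx = np.concatenate([x[-Lcp:], x])    # prepend the cyclic prefix

rx = np.convolve(tx, h)               # frequency-selective channel, no noise
y = np.fft.fft(rx[Lcp:Lcp + K]) / np.sqrt(K)   # drop CP, unitary DFT

H_f = np.fft.fft(h, K)                # channel frequency response
print(np.allclose(y, H_f * X))        # True: Y_k = H_k * X_k per subcarrier
```

Each subcarrier thus sees a flat scalar channel H_k, so equalisation reduces to one complex division per subcarrier, at the price of the CP overhead.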

Another technique which has recently gained attention is cooperative, or relay-assisted, communication, where several distributed terminals cooperate to transmit and receive their intended signals [8]. This technique is based on the seminal works from the 1970s by van der Meulen [8] and by Cover and El Gamal [9], which introduced a new component, the relay terminal.

In that scenario, the source wishes to transmit a message to the destination, but obstacles degrade the source-destination link quality, or the source-destination distance is too large for robust communication. The message sent by the source is also received by the relay terminal, which can re-transmit that message to the destination. The destination may combine the transmissions from the source and the relay in order to decode the message. This architecture exhibits some properties of MIMO systems and is therefore known as a virtual MIMO system. In contrast to conventional MIMO systems, relay-assisted transmission is able to combat the channel impairments due to shadowing and path loss, because the source-destination and relay-destination links are statistically independent. The detection of the transmitted data is a challenging task in relay systems. This is due to the fact that in many cases the relay performs non-linear processing on the received signal before re-transmitting it to the destination node. In many cases, this prohibits the use of classical estimation techniques, since the likelihood function cannot be obtained analytically.
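A toy generative sketch of the difficulty: with a saturating relay function, simulating the destination observation given the symbols is trivial, while writing down p(y | s) is not, because the Gaussian relay noise is pushed through the non-linearity. The tanh relay function and all parameter values here are illustrative assumptions, not the system model developed in Chapter 8:

```python
import numpy as np

rng = np.random.default_rng(2)

def cnoise(shape, sigma):
    """Circularly symmetric complex Gaussian noise with standard deviation sigma."""
    return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def relay_forward(s, f, h, sigma_r, sigma_d):
    """Source -> relay -> destination with a saturating relay function.

    Sampling y given s is trivial, but p(y | s) has no closed form because
    the Gaussian relay noise passes through the non-linearity g."""
    g = lambda r: np.tanh(r.real) + 1j * np.tanh(r.imag)  # illustrative non-linearity
    r = f * s + cnoise(s.shape, sigma_r)         # signal received at the relay
    return h * g(r) + cnoise(s.shape, sigma_d)   # signal received at the destination

s = np.array([1 + 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # a short QPSK sequence
y = relay_forward(s, f=0.9 + 0.2j, h=0.7 - 0.4j, sigma_r=0.3, sigma_d=0.3)
print(y)
```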

In light of all that has been discussed above, this dissertation is devoted to tackling practical problems in communication systems under realistic conditions, such as data detection in MIMO and relay systems, and channel estimation and receiver design for OFDM systems.

In the first part of this dissertation, the focus is on the design of low complexity data detection schemes for MIMO systems. In the first instance, we develop low complexity symbol detection algorithms for the case where full Channel State Information (CSI) is provided at the receiver. By relaxing the non-convex discrete constellation constraint and replacing it with a non-convex continuous constraint that exploits specific structures of the symbol constellation, we are able to design algorithms with different levels of complexity and performance.

Secondly, we design low complexity algorithms for data detection of non-uniform symbol constellations in MIMO systems when only partial (noisy estimate) CSI is given at the receiver. The resulting non-convex optimisation problem is solved efficiently using the hidden convexity methodology. An alternative solution based on the Bayesian Expectation Maximisation (BEM) methodology is then presented and compared to the hidden convexity based solution in terms of Bit Error Rate (BER) and computational complexity. We then extend this methodology to the case where the noise variance is unknown a priori and needs to be estimated jointly. This shall be achieved using the concept of annealed Gibbs sampling coupled with the BEM approach. The algorithms presented are suitable for both uniform and non-uniform symbol constellations.

In the second part of this dissertation, we focus on receiver design for OFDM systems. First we design an iterative receiver for OFDM systems under high mobility. We propose an algorithm for joint demodulation, data decoding and channel tracking that aims at minimising the error propagation effect due to erroneous detection of data symbols. By monitoring the health parameters of the tracking unit, an algorithm is devised for subcarrier subset selection, leading to a decreased error propagation effect and improved performance.

In the second instance, we design a channel estimation scheme for OFDM systems in the absence of a priori knowledge of the Channel Impulse Response (CIR) length and of the PDP decay rate. This will be achieved using the Trans Dimensional Markov Chain Monte Carlo (TDMCMC) methodology to obtain samples from the intractable posterior distribution and perform numerical integration. We develop three novel algorithms that are based on different sampling methodologies.

The third part of this dissertation is dedicated to data detection in relay systems with non-linear relay functions. This problem is challenging since in most cases there is no analytical expression for the likelihood function. We shall utilise a “Likelihood Free” inference methodology named Approximate Bayesian Computation (ABC) in order to circumvent this problem. We present three novel algorithms: the first is based on Markov Chain Monte Carlo (MCMC)-ABC; the second is based on an auxiliary MCMC methodology, in which we consider the estimation problem in an augmented space. While this enables the use of a traditional MCMC approach, it comes at the expense of a larger state space. The third algorithm is based on a suboptimal exhaustive search Zero Forcing (ZF) detector to perform the detection of the transmitted symbols. This is based on known summary statistics of the channel model and conditional on the mean of the noise at the relay nodes, which allows for an explicit exhaustive search over the parameter space of codewords. We shall compare these three approaches and discuss under which scenarios each should be used.
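The ABC idea underlying the first of these algorithms can be stated in a few lines. The sketch below is the generic ε-tolerance rejection form (developed properly in Chapter 2); the simulator, summary statistic, distance and tolerance are placeholders supplied per problem, and the Gaussian-mean toy model at the bottom is purely illustrative:

```python
import numpy as np

def abc_rejection(y_obs, prior_sample, simulate, summary, distance, eps, n_draws):
    """Generic ABC rejection: keep a prior draw theta whenever synthetic data
    simulated under theta lands within eps of the observations, as measured
    by a distance between summary statistics."""
    accepted = []
    t_obs = summary(y_obs)
    for _ in range(n_draws):
        theta = prior_sample()
        y_sim = simulate(theta)
        if distance(summary(y_sim), t_obs) <= eps:
            accepted.append(theta)
    return accepted  # draws from an approximation to p(theta | y_obs)

# Toy check: infer the mean of a Gaussian with known unit variance.
rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, size=50)
post = abc_rejection(
    y_obs=y,
    prior_sample=lambda: rng.uniform(-5, 5),
    simulate=lambda m: rng.normal(m, 1.0, size=50),
    summary=np.mean,
    distance=lambda a, b: abs(a - b),
    eps=0.1,
    n_draws=20000,
)
print(len(post), np.mean(post))
```

The accepted draws target an approximation to the posterior that tightens as ε shrinks; the trade-off between tolerance, summary statistics and acceptance rate is exactly what Chapter 8 studies for the relay detection problem.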

1.2 Outline of the dissertation

In general terms, the scope of this dissertation is the design of wireless communication systems. The main attention is given to two fundamental problems in wireless communication systems, namely detection of the transmitted symbols and channel estimation. In particular, we consider three system configurations throughout this dissertation. These are OFDM, MIMO and relay systems.

The outline of each of the chapters is as follows:

Chapter 1 presents the motivation of this dissertation. It also presents the outline and lists the contributions of this dissertation.

Chapter 2 - This chapter provides an overview of some basic concepts in Bayesian inference that will be used extensively throughout this dissertation. The aim is to give the reader a thorough understanding of the Bayesian methodology, not only to make later chapters comprehensible, but also to provide the method for general scientific inference.

Chapter 3 - The basics of modern digital communication systems employing channel estimation, channel coding and modulation/detection are described in this chapter. In particular, we provide an overview of different channel models, MIMO and OFDM systems as well as channel estimation and data detection algorithms. General mathematical models representing typical characteristics of these main components as well as the multipath fading transmission channel are presented. The estimation objectives for several digital communication problems are also stated.

Chapter 4 - This chapter deals with the ubiquitous problem of data detection in MIMO systems. We present low complexity algorithms that are based on tightening the constraints of the Zero Forcing (ZF) based detector. This leads to a non-convex optimisation problem that can be solved efficiently. We present several algorithms with different levels of complexity and performance.

Chapter 5 - Here we present algorithms for data detection in MIMO systems with only partial CSI at the receiver. The detector is formulated as a non-convex problem that can be solved efficiently using the hidden convexity methodology. We also present a competitive approach based on the BEM methodology. An extension to the case where the noise variance is not known a priori is also presented, using a stochastic optimisation methodology that is based on the concept of the annealed Gibbs sampler.

Chapter 6 - In this chapter we develop a complete receiver design for OFDM systems, suitable for high mobility. Based on the Turbo principle, we present an algorithm to perform demodulation, decoding and channel tracking. Special attention is given to the problem of error propagation that can lead to poor performance due to divergence of the tracking module. We present an approach to deal with this problem by monitoring the health parameters and adapting the number of subcarriers used accordingly. This leads to reduced misspecification of the state-space model, enabling superior overall performance compared to other methods.

Chapter 7 - This chapter deals with the problem of channel estimation in OFDM systems where the number of channel taps and the PDP decay rate are unknown a priori. We formulate the problem under the Bayesian framework and show that the solution involves solving intractable integrals. In order to circumvent this problem we use the TDMCMC methodology to sample from the intractable posterior density and perform numerical integration. In particular, we develop three novel algorithms and analyse quantities of interest, such as computational complexity, convergence, sensitivity to different choices of priors, and BER.

Chapter 8 - In this chapter we consider the problem of detection of transmitted data in relay systems, where only partial CSI is available at the destination node. We formulate the problem under the Bayesian framework and show that the solution is prohibited by the fact that no analytical expression for the likelihood function is available. We therefore use a “Likelihood Free” methodology to design a novel sampling methodology based on the MCMC-ABC algorithm. We also develop two alternative solutions: an auxiliary variable MCMC approach, in which the addition of auxiliary variables results in closed form expressions for the full conditional posterior distributions; the other detector involves an approximation based on known summary statistics of the channel model and an explicit exhaustive search over the parameter space of codewords. We study the performance of each algorithm under different settings of our relay system model, and make recommendations regarding the choice of algorithms and various parameters.

Chapter 9 - Here we conclude the dissertation and give some topics for future research.

Note that the chapters with original contributions are Chapters 4, 5, 6, 7 and 8.

1.3 Research contributions

To a certain extent, the chapters in this thesis are self-contained and can be read independently. The main contributions of this dissertation are:

• Low complexity data detection algorithms for MIMO systems with different degrees of CSI quality.

• Design of an iterative receiver for OFDM systems under high mobility conditions.

• Channel estimation for OFDM, with no a priori knowledge of the number of taps of the channel and its PDP decay rate.

• Detection of transmitted data in relay systems, with non-linear relay functions, and partial CSI at the receiver.

In the following, a detailed list of the research contributions in each chapter is presented.

Chapter 4 The main results of this chapter deal with the design of low complexity algorithms for data detection in MIMO systems. These results have been accepted for publication in a journal.

• I. Nevat, T. Yang, K. Avnit, and J. Yuan, “Detection for MIMO Systems with High-Level Modulations using Power Equality Constraints,” accepted for publication in IEEE Transactions on Vehicular Technology, January 2010.

Chapter 5 The main results of this chapter deal with the design of low complexity data detection algorithms for MIMO systems, with only partial CSI available at the receiver. These results have been published in three conference papers and one journal.

• I. Nevat, A. Wiesel, J. Yuan, and Y.C. Eldar, “Maximum a-posteriori Estimation in Linear Models With a Gaussian Model Matrix,” in Proc. IEEE Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2007.

• I. Nevat, G.W. Peters, and J. Yuan, “Maximum A-Posteriori Estimation in Linear Models With a Random Gaussian Model Matrix: a Bayesian-EM Approach,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP08), Las Vegas, Nevada, USA, April 2008.

• I. Nevat, G.W. Peters, and J. Yuan, “MAP Estimation in Linear Models With a Random Gaussian Model Matrix: Algorithms and Complexity,” in Proc. IEEE International Symposium on Personal, Indoor and Mobile Radio Communications 2008 (PIMRC08), Cannes, France, October 2008.

• I. Nevat, G.W. Peters, and J. Yuan, “Detection of Gaussian Constellations in MIMO Systems under Imperfect CSI,” IEEE Transactions on Communications, vol. 58, no. 3, March 2010.

Chapter 6 The main results of this chapter deal with the design of an iterative receiver for OFDM systems. These results have been published in three conference papers and one journal.

• I. Nevat and J. Yuan, “Channel Tracking For OFDM Systems Using Measurements Pruning,” in Proc. NEWCOM-ACoRN workshop, Vienna, Austria, September 2006.

• I. Nevat and J. Yuan, “Channel Tracking Using Pruning for MIMO-OFDM Systems Over Gauss-Markov Channels,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP07), Hawaii, USA, April 2007.

• I. Nevat and J. Yuan, “Error Propagation Mitigation for Iterative Channel Tracking, Detection and Decoding of BICM-OFDM Systems,” in Proc. IEEE International Symposium on Wireless Communication Systems 2007 (ISWCS07), Trondheim, Norway, October 2007. This paper won the best paper award.

• I. Nevat and J. Yuan, “Joint Channel Tracking and Decoding for BICM-OFDM Systems using Consistency Tests and Adaptive Detection Selection,” IEEE Transactions on Vehicular Technology, vol. 58, no. 8, pp. 4316-4328, October 2009.

Chapter 7 The main results of this chapter deal with the design of algorithms for channel estimation in OFDM systems, with no a priori knowledge of the number of taps of the channel and its power delay profile decay rate. These results have been published in two conference papers and one journal.

• I. Nevat, G.W. Peters, and J. Yuan, “OFDM CIR Estimation with Unknown Length via Bayesian Model Selection and Averaging,” in Proc. IEEE Vehicular Technology Conference (VTC08), Singapore, May 2008.

• I. Nevat, G.W. Peters, and J. Yuan, “Channel Estimation in OFDM Systems with Unknown Power Delay Profile using Trans-dimensional MCMC via Stochastic Approximation,” in Proc. IEEE Vehicular Technology Conference (VTC09), Barcelona, Spain, April 2009.

• I. Nevat, G.W. Peters, and J. Yuan, “Channel Estimation in OFDM Systems with Unknown Power Delay Profile using Trans-dimensional MCMC,” IEEE Transactions on Signal Processing, vol. 57, no. 9, pp. 3545-3561, September 2009.

Chapter 8 The main results of this chapter deal with the design of algorithms for the detection of the transmitted data in relay systems with non-linear functions and partial CSI at the receiver. These results have been published in one conference paper and one journal.

• I. Nevat, G.W. Peters and J. Yuan, “Coherent Detection for Cooperative Networks with Arbitrary Relay Functions using “Likelihood Free” Inference,” in Proc. NEWCOM-ACoRN workshop, Barcelona, Spain, March 2009.

• G.W. Peters, I. Nevat, S. Sisson, Y. Fan and J. Yuan, “Bayesian symbol detection in wireless relay networks via likelihood-free inference,” accepted for publication in IEEE Transactions on Signal Processing, February 2010.

Other contributions not presented in this dissertation

• I. Nevat, G.W. Peters, S. Sisson, Y. Fan and J. Yuan, “Model Selection, Detection and Channel Estimation for Relay Systems,” to be submitted to IEEE Transactions on Signal Processing.

• C. Kim, T. Lehmann, S. Nooshabadi and I. Nevat, “An Ultra-Wideband Transceiver Architecture for Wireless Endoscopes,” in Proc. IEEE International Symposium on Communications and Information Technologies (ISCIT’07), Sydney, Australia, October 2007.

Chapter 2

Bayesian Inference and Analysis

“Making predictions is hard, especially about the future.”

Niels Bohr

2.1 Introduction

This chapter sets out the essential statistical theory required to understand the material presented in the subsequent chapters. An emphasis is placed on a particular interpretation of general statistical theory, known as Bayesian statistics.

2.2 Background

Our everyday experiences can be summarised as a series of decisions to take actions which manipulate our environment in some way or another. We base our decisions on the results of predictions or inferences of quantities that have some bearing on our quality of life, and we come to arrive at these inferences based on models of what we expect to observe.

Models are designed to capture salient trends or regularities in the observed data with a view to predicting future events. For the majority of real applications the data are far too complex or the underlying processes not nearly well enough understood for the modeller to design a perfectly accurate model. If this is the case, we can hope only to design models that are simplifying approximations of the true processes that generated the data.

The purpose of Bayesian inference is to provide a mathematical machinery that can be used for modeling systems, where the uncertainties of the system are taken into account and the decisions are made according to rational principles. The tools of this machinery are the probability distributions and the rules of probability calculus.

In probabilistic inference there are two well known approaches, namely the frequentist approach and the Bayesian approach. In the frequentist approach (a.k.a. the classical approach), the parameters of interest are handled as deterministic with unknown values, and the estimation is based solely on the observations. This approach is often associated with the work of R.A. Fisher, R. von Mises, J. Neyman and E. Pearson.

The Bayesian approach takes a different view. In a Bayesian analysis, all the quantities have a probability distribution associated with them. The performance of different estimators is assessed based on the average performance over different realizations of the parameters of interest. Bayes' rule provides a means of updating the distribution over parameters from the prior to the posterior distribution in light of observed data. In theory, the posterior distribution captures all information inferred from the data about the parameters. This posterior is then used to make optimal decisions or predictions, or to select between models.

The Bayesian framework depends on the existence of a priori distributions. These priors reflect the user’s knowledge about the quantities of interest before any data have been considered. Many researchers advocating the so-called classical methods of inference oppose the use of priors. They prefer to regard these quantities as having unknown but deterministic values to which it is meaningless to assign a distribution as if it were random. Furthermore, it is often argued that assigning a prior to a proposition means that the resulting inference will be affected by the subjective prior decided by the user. While all these arguments are of great importance, we will not pursue this debate here. We simply note that both the classical and the Bayesian frameworks of inference have their respective merits, and the user should try to be pragmatic and choose the framework most appropriate for their application.

2.3 Bayesian Inference

The Bayesian paradigm is used extensively in estimation theory. Bayesian inference [10] is so named as it centres around Bayes’ rule

\[
p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} = \frac{p(y \mid x)\, p(x)}{\int_{\Omega} p(y \mid x)\, p(x)\, dx}, \qquad (2.1)
\]

where Ω denotes the parameter space of x. The basic blocks of a Bayesian model are the prior model, p(x), containing our original beliefs about x before observing the data, and the likelihood model, p(y|x), determining the stochastic mapping from the parameter x to the measurement y. Using Bayes' rule, it is possible to infer an estimate of the parameter from the measurement. The distribution of the parameter x, conditioned on the observed measurement y, p(x|y), is called the posterior distribution and it is the distribution representing the state of knowledge about the parameter when all the information in the observed measurement y and the model is used.

The distribution p (y) is the evidence or marginal likelihood and is not directly a function of the model parameter x because it is integrated out:

\[
p(y) = \int_{\Omega} p(y \mid x)\, p(x)\, dx. \qquad (2.2)
\]

One of the major problems within Bayesian inference is the calculation of the marginal likelihood (2.2). The calculation of this normalising constant usually complicates matters since this integral is analytically intractable in most practically useful scenarios, possibly due to its high dimensional nature. One thus has to resort to approximation techniques.

A second, and related, problem is that of marginalisation of variables. Consider a random vector that can be partitioned as x = (x_1, x_2). The marginal density p(x_1|y) can be obtained through:

\[
p(x_1 \mid y) = \int p(x_1, x_2 \mid y)\, dx_2. \qquad (2.3)
\]

In general, this integral is analytically intractable.

There is a multitude of references available on this topic, such as [11], [12], [13] and [14].

2.3.1 Prior Distributions

In order to use Bayesian inference, one needs knowledge of the prior distribution p (x) in (2.1). Hence there is the need for principles that can produce the initial distributions required. These are solely based on whatever information we have beforehand. We briefly present the most popular approaches for the choice of a prior distribution.

Non-informative priors - In many practical situations no reliable prior information concerning x exists, or inference based solely on the data is desirable. In this case we typically wish to define a prior distribution p(x) that contains no information about x in the sense that it does not favor one x value over another. We may refer to a distribution of this kind as a noninformative prior for x and argue that the information contained in the posterior about x stems from the data only. In the case that the parameter space is {x_1, ..., x_n}, i.e., discrete and finite, the distribution

\[
p(x_i) = \frac{1}{n}, \qquad i \in \{1, \ldots, n\} \qquad (2.4)
\]

places the same prior probability on any candidate x value. Likewise, in the case of a bounded continuous parameter space, say Φ = [a, b], −∞ < a < b < ∞, the uniform distribution

\[
p(x) = \frac{1}{b - a}, \qquad a < x < b \qquad (2.5)
\]

appears to be noninformative. The uniform prior is not invariant under reparameterisation. Thus, an uninformative prior can be converted, in the case of a different model, to an informative one. One approach that overcomes this difficulty is Jeffreys prior [15], [16], given by

\[
p(x) \propto \sqrt{I(x)}, \qquad (2.6)
\]

where I(x) is the expected Fisher information matrix, having its ij-th element

\[
[I(x)]_{ij} = -\,\mathbb{E}_{y \mid x}\left\{ \frac{\partial^2 \log p(y \mid x)}{\partial x_i\, \partial x_j} \right\}. \qquad (2.7)
\]

This rule has the property that the prior is invariant regardless of any transformation that may be performed on x. Although this invariance property is desirable, there are certain situations where Jeffreys' prior cannot be applied [17].

Conjugate priors - Conjugate prior families were first formalised by Raiffa and Schlaifer [18]. They showed that when choosing a prior from a parametric family, some choices may be more computationally convenient than others. The advantage of using a conjugate family of distributions is that it is generally straightforward to calculate the posterior distribution, which makes it useful in many situations. In particular, it may be possible to select a prior distribution p(θ) which is conjugate to the likelihood p(y|θ), that is, one that leads to a posterior p(θ|y) belonging to the same family as the prior.

It is shown in [19] that exponential families, to which likelihood functions often belong, do in fact have conjugate priors, so that this approach will typically be available in practice.

A conjugate prior is constructed by first factoring the likelihood function into two parts. One factor must be independent of the parameter(s) of interest but may be dependent on the data. The second factor is a function dependent on the parameter(s) of interest and dependent on the data only through the sufficient statistics. The conjugate prior family is defined to be proportional to this second factor. It can be shown [18] that the posterior distribution arising from the conjugate prior is itself a member of the same family as the conjugate prior.

As an example that will be useful in Chapter 5, consider the following posterior:

\[
p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta), \qquad (2.8)
\]

where the prior p(θ) follows an Inverse Gamma (IG) distribution:

\[
p(\theta) = \mathcal{IG}(\theta;\, \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \theta^{-\alpha-1} \exp\left( \frac{-\beta}{\theta} \right), \qquad (2.9)
\]

where α and β are the shape parameter and scale parameter, respectively. The Gamma function, Γ(z), is defined by

\[
\Gamma(z) \triangleq \int_0^{\infty} t^{z-1} \exp\{-t\}\, dt, \qquad (2.10)
\]

for a complex number z with positive real part Re[z] > 0.

The posterior in (2.8) follows a conjugacy structure. This is because the prior, p(θ), follows the IG distribution and the likelihood, p(y|θ), follows a Normal distribution. Due to the conjugacy property of the Normal-Inverse Gamma model [20], the posterior follows an IG distribution:

\[
p(\theta \mid y) = \mathcal{IG}\left(\theta;\, \bar{\alpha}, \bar{\beta}\right), \qquad (2.11)
\]

with posterior hyperparameters

\[
\bar{\alpha} = \alpha + N/2, \qquad (2.12a)
\]
\[
\bar{\beta} = \beta + \frac{\sum_{i=1}^{N} |y[i] - \mu_i|^2}{2}, \qquad (2.12b)
\]

where N is the number of elements in y. The mode of p(θ|y), which is the location where the probability density function attains its maximum value, can be written as

\[
\operatorname{mode}\left[ p(\theta \mid y) \right] = \frac{\bar{\beta}}{\bar{\alpha} + 1}. \qquad (2.13)
\]

This will be useful in finding the Maximum A Posteriori (MAP) estimate which will be defined in the next Section.
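As a concrete illustration of this conjugacy structure, the following Python sketch (with hypothetical hyperparameter values, and assuming the observation means μ_i are known) computes the posterior hyperparameters of (2.12a)-(2.12b) and the MAP estimate via the mode (2.13):

import numpy as np

# Hypothetical prior hyperparameters of the IG prior p(theta)
alpha, beta = 2.0, 1.5

# Synthetic data: known means mu_i, unknown noise variance theta (true value 1)
rng = np.random.default_rng(0)
mu = np.zeros(50)
y = mu + rng.normal(scale=1.0, size=50)

N = len(y)
alpha_post = alpha + N / 2                          # eq. (2.12a)
beta_post = beta + np.sum(np.abs(y - mu) ** 2) / 2  # eq. (2.12b)

# MAP estimate of theta: the mode of the IG posterior, eq. (2.13)
theta_map = beta_post / (alpha_post + 1)
print(alpha_post, beta_post, theta_map)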

2.3.2 Point Estimates

As stated before, Bayesian analysis revolves around the posterior distribution p(x|y). However, there is more than one way to find the value of the parameter of interest. Point estimates for parameters of interest are obtained via the specification of a loss function. The construction of a point estimate, which itself is a random variable, is done by defining a suitable loss function that penalises erroneous estimates, i.e. we specify a loss function which defines the quality of an estimate. The loss function C(x, x̂(y)): E × E → ℝ⁺ is used to define the optimal Bayesian estimator by minimising the expected loss, as follows:

\[
R \triangleq \mathbb{E}\{C(x, \hat{x})\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} C(x, \hat{x})\, p_{x,y}(x, y)\, dx\, dy = \int_{-\infty}^{\infty} I(\hat{x})\, p_y(y)\, dy, \qquad (2.14)
\]

where

\[
I(\hat{x}) = \int_{-\infty}^{\infty} C(x, \hat{x})\, p_{x \mid y}(x \mid y)\, dx. \qquad (2.15)
\]

Since (2.14) is the expectation of a positive quantity I(x̂), it is sufficient to minimise (2.15).

From a probabilistic point of view, a rather appealing point estimate is provided by choosing the value that minimises the variance of the estimation error, referred to as the Minimum Variance (MV), or more commonly as the Minimum Mean Squared Error (MMSE), estimate. This estimator has a loss function defined as C(x, x̂) ≜ ‖x − x̂‖². The MMSE estimator can be derived as follows:

\[
\begin{aligned}
\hat{x}_{MV} &\triangleq \arg\min_{\hat{x}}\, \mathbb{E}\left\{ \|x - \hat{x}\|^2 \mid y \right\} \\
&= \arg\min_{\hat{x}} \left\{ \mathbb{E}\{x^T x \mid y\} - 2\hat{x}^T \mathbb{E}\{x \mid y\} + \hat{x}^T \hat{x} \right\} \qquad (2.16) \\
&= \arg\min_{\hat{x}} \left\{ \|\hat{x} - \mathbb{E}\{x \mid y\}\|^2 + \mathbb{E}\{\|x\|^2 \mid y\} - \|\mathbb{E}\{x \mid y\}\|^2 \right\}.
\end{aligned}
\]

The two last terms in (2.16) are independent of x̂, which leads to the well known MMSE estimate

\[
\hat{x}_{MV} = \mathbb{E}\{x \mid y\} = \int x\, p(x \mid y)\, dx. \qquad (2.17)
\]

Alternatively, one may choose to minimize the hit-or-miss cost function given by

\[
C(x, \hat{x}(y)) = \begin{cases} 0, & \|x - \hat{x}(y)\| \le \epsilon \\ 1, & \text{otherwise}, \end{cases} \qquad (2.18)
\]

where ǫ → 0 is a positive scalar. Optimising this cost function yields the MAP estimator:

\[
\begin{aligned}
\hat{x}_{MAP}(y) &= \arg\max_x\, p(x \mid y) \\
&= \arg\max_x\, \log p(x \mid y) \qquad (2.19) \\
&= \arg\max_x \left\{ \log p(y \mid x) + \log p(x) \right\},
\end{aligned}
\]

where we used the fact that the log function is monotonically increasing. The MAP estimate chooses the model with the highest posterior probability density (the mode).

The most frequently used location measures are the mean, the median and the mode of the posterior distribution, since they all have appealing properties. In the case of a flat prior the mode is equal to the maximum likelihood estimate. For symmetric posterior densities the mean and the median are identical. Moreover, for unimodal symmetric posteriors all three measures coincide. These estimators are illustrated in Figure 2.1.

Figure 2.1: Parameter estimation criteria based on the marginal posterior distribution, indicating the posterior median x̂_Median, the posterior mean x̂_MV and the mode x̂_MAP

In many cases, the computational complexity involved in solving the MMSE estimate in (2.17) is too high for practical applications. Instead, a common approach is to consider the Linear Minimum Mean Squared Error (LMMSE) estimator that achieves the smallest Mean Squared Error (MSE) among all linear estimators, i.e. of the form x̂ = Ay + b. The LMMSE estimator satisfies the following closed form solution

\[
\hat{x}_{LMMSE}(y) = \mathbb{E}\{x\} + \mathbb{E}\{x y^T\}\, \mathbb{E}^{-1}\{y y^T\} \left( y - \mathbb{E}\{y\} \right). \qquad (2.20)
\]

MMSE estimator:

\[
\hat{x}_{MV} = \mathbb{E}\{x \mid y\}.
\]

MAP estimator:

\[
\hat{x}_{MAP}(y) = \arg\max_x \left\{ \log p(y \mid x) + \log p(x) \right\}.
\]

LMMSE estimator:

\[
\hat{x}_{LMMSE}(y) = \mathbb{E}\{x\} + \mathbb{E}\{x y^T\}\, \mathbb{E}^{-1}\{y y^T\} \left( y - \mathbb{E}\{y\} \right).
\]

Table 2.1: Summary of point estimators

Another, less common estimator yields the posterior median. Its cost function is

\[
C(x, \hat{x}(y)) = a\, |x - \hat{x}(y)|, \qquad (2.21)
\]

where a > 0. The point estimators are summarised in Table 2.1.
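When the posterior is available only through samples (as produced by the Monte Carlo methods of Section 2.11), the three point estimates can be approximated directly from the sample set. A minimal Python sketch, assuming a vector of posterior samples is at hand:

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical posterior samples; in practice these would come from an MCMC run
samples = np.random.default_rng(1).gamma(shape=3.0, scale=1.0, size=10000)

x_mmse = samples.mean()        # posterior mean, eq. (2.17)
x_median = np.median(samples)  # posterior median, cost function (2.21)

# MAP estimate: mode of a kernel density estimate of the posterior
kde = gaussian_kde(samples)
grid = np.linspace(samples.min(), samples.max(), 1000)
x_map = grid[np.argmax(kde(grid))]

print(x_mmse, x_median, x_map)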

2.3.3 Interval Estimation

In some cases we are not only interested in the actual value of the estimated parameter x̂, but we would also like to assign confidence regions to it, i.e. subsets C of the parameter space Ψ in which x should lie with high probability [11]. A 100(1 − α)% credibility set for x is a subset C of Ψ such that

\[
1 - \alpha \le p(C \mid y) = \int_C p(x \mid y)\, dx. \qquad (2.22)
\]

This definition enables appealing statements like "The probability that x lies in C given the observed data y is at least 1 − α". This is in contrast to the usual interpretation of confidence intervals, which is based on the frequency of a repeated experiment.
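For a scalar parameter, an equal-tailed 100(1 − α)% credible interval can be read off the empirical quantiles of posterior samples. A minimal sketch, again assuming samples from p(x|y) are available:

import numpy as np

samples = np.random.default_rng(2).normal(loc=1.0, scale=0.5, size=10000)
alpha = 0.05
lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
print(f"{100 * (1 - alpha):.0f}% credible interval: [{lower:.3f}, {upper:.3f}]")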

2.4 Bayesian Model Selection

2.4.1 Introduction

Model selection can have different meanings, depending on the specific problem at hand. There are three broad approaches to understanding model selection, which are labeled as the M_open, M_completed and M_closed modeling perspectives [13]. They can be summarised as follows:

• M_closed: takes the view that the class of models under consideration contains the true model.

• M_completed: takes the view that although a formulated belief model is known, other models are considered due to intractability of analysis.

• M_open: takes the view that none of the models under consideration completely captures the intricate relationship between the inputs and the outputs.

Throughout this thesis the problem of model selection refers to the inference of which model in an M_closed set is the most appropriate one for describing a given observation signal. We shall also restrict ourselves to nested models. A set of models is nested when each model in the set can be described as a special case of the models in the set with higher complexity. The problem can be described as follows: suppose we have a finite set of K possible models, where {M_1, ..., M_K} are model indicators, such that:

\[
\begin{cases}
M_1: & x = [x_1], \\
M_2: & x = [x_1, x_2], \\
& \vdots \\
M_K: & x = [x_1, \ldots, x_K].
\end{cases} \qquad (2.23)
\]

Each model is assigned a prior probability Pr(M_k). We would like to find the posterior model probabilities Pr(M_k|y), k ∈ {1, ..., K}. There are two common approaches to model selection/estimation: the Bayesian Model Order Selection (BMOS) and Bayesian Model Averaging (BMA) methods. BMOS involves the selection of the model, denoted M_i, which most accurately represents the observations, according to some criterion of interest. The BMOS posterior is expressed as

\[
Pr(M_i \mid y) = \frac{p(y \mid M_i)\, Pr(M_i)}{p(y)} = \frac{\int p(y \mid x, M_i)\, p(x \mid M_i)\, Pr(M_i)\, dx}{\sum_{k=1}^{K} \int p(y \mid x, M_k)\, p(x \mid M_k)\, Pr(M_k)\, dx}. \qquad (2.24)
\]

The above analysis reflects the probability of each possible model. One can therefore choose to perform a point estimate, such as the MAP estimate of the model

\[
\widehat{M} = \arg\max_{M_i}\, p(y \mid M_i)\, Pr(M_i). \qquad (2.25)
\]

The BMOS approach establishes all inferences on one of the K possible models. However, it ignores all other models, which means that the information we have on the model uncertainty is discarded. An alternative approach is the BMA method. The BMA considers several or all candidate models in a weighted manner; thus, more information is exploited and better performance can be expected. The BMA is formulated as follows:

\[
\widehat{M} = \sum_{k=1}^{K} k\, Pr(M_k \mid y) = \sum_{k=1}^{K} k\, \frac{p(y \mid M_k)\, Pr(M_k)}{p(y)}. \qquad (2.26)
\]

A thorough discussion on these methods can be found in [21], [22], [23].

2.5 Bayesian Estimation Under Unknown Model Order

In some cases the actual model structure is not of primary interest. Instead, one is interested in estimating a quantity x, where the number of elements of x is unknown. Examples are polynomials of unknown degrees, CIR with unknown number of taps, Auto Regressive Moving Average (ARMA) models of different orders, Finite Impulse Response (FIR) filters of different lengths, many types of mixture models, etc. In these cases, a single integer valued parameter, known as the model order, is sufficient to describe the model complexity. In the following sections we shall discuss two different approaches to Bayesian model selection.

2.5.1 Bayesian Model Averaging

One popular approach to estimation with unknown model order is BMA [21], [22]. In this approach, the inference is based on an average over all possible models in the model space M, instead of a single "best" model. Suppose M ∈ M = {M_1, ..., M_K}, and let x be the quantity of interest. The posterior distribution of x is given by

\[
p(x \mid y) = \sum_{k=1}^{K} p(x \mid y, M_k)\, Pr(M_k \mid y). \qquad (2.27)
\]

The MMSE estimate is given by

\[
\hat{x}_{MMSE} \triangleq \mathbb{E}\{x \mid y\} = \int x\, p(x \mid y)\, dx. \qquad (2.28)
\]

In order to obtain the posterior distribution p(x|y), we need to marginalise over the latent model order k, as follows:

\[
p(x \mid y) = \sum_{k=1}^{K} p(x \mid y, k)\, Pr(k \mid y). \qquad (2.29)
\]

Therefore, the MMSE estimate in (2.28) can be expressed as

\[
\begin{aligned}
\hat{x}_{MMSE} &= \int x \left( \sum_{k=1}^{K} p(x \mid y, k)\, Pr(k \mid y) \right) dx \\
&= \sum_{k=1}^{K} Pr(k \mid y) \int x\, p(x \mid y, k)\, dx \qquad (2.30) \\
&= \sum_{k=1}^{K} Pr(k \mid y)\, \mathbb{E}\{x \mid y, k\}.
\end{aligned}
\]

The BMA approach performs a weighted sum of the MMSE estimates of all possible K models. Hoeting et al. showed in [21] that averaging over all the possible models k = 1, ..., K provides a better estimate than any single model under the logarithmic scoring criterion:

\[
-\mathbb{E}\left\{ \log\left( \sum_{k=1}^{K} Pr(k \mid y)\, p(x \mid y, k) \right) \right\} \le -\mathbb{E}\left\{ \log p(x \mid y, j) \right\}, \qquad j = 1, \ldots, K. \qquad (2.31)
\]

2.5.2 Bayesian Model Order Selection

In contrast to the BMA approach, in the BMOS one first finds the most probable model using the MAP estimate of the model order, k̂_MAP, as in (2.25), and then conditions on this estimate to obtain the MMSE estimate of x. This procedure is composed of the following two steps:

1. k̂_MAP = arg max_{k∈{1,...,K}} Pr(k|y) = arg max_{k∈{1,...,K}} p(y|k) Pr(k).

2. x̂_{MMSE|k̂_MAP} ≜ E{x | y, k̂_MAP}.

The BMA and BMOS approaches are summarised in Table 2.2.

Bayesian model averaging and MMSE estimator:

\[
\hat{x}_{MMSE} = \sum_{k=1}^{K} Pr(k \mid y)\, \mathbb{E}\{x \mid y, k\}.
\]

Bayesian model order selection and MMSE estimator:

1. k̂_MAP = arg max_{k∈{1,...,K}} Pr(k|y) = arg max_{k∈{1,...,K}} p(y|k) Pr(k).

2. x̂_{MMSE|k̂_MAP} ≜ E{x | y, k̂_MAP}.

Table 2.2: Bayesian estimation under model uncertainty
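The following Python sketch contrasts the two procedures of Table 2.2 on a toy problem; the marginal likelihoods p(y|k), model priors Pr(k) and conditional MMSE estimates E{x|y, k} are all hypothetical numbers chosen for illustration:

import numpy as np

# Hypothetical quantities for K = 3 candidate models
marg_lik = np.array([0.2, 0.5, 0.1])    # p(y | k)
prior = np.full(3, 1 / 3)               # Pr(k)
cond_mmse = np.array([0.8, 1.1, 1.6])   # E{x | y, k}

# Posterior model probabilities Pr(k | y) via Bayes' rule
post = marg_lik * prior
post /= post.sum()

# BMA: weighted sum of the conditional MMSE estimates, eq. (2.30)
x_bma = np.sum(post * cond_mmse)

# BMOS: condition on the MAP model order, then take the conditional MMSE
k_map = np.argmax(post)
x_bmos = cond_mmse[k_map]

print(post, x_bma, x_bmos)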

2.6 Bayesian Filtering

In many applications, one needs to extract the signal from data corrupted by additive random noise and interference of different kinds in order to recover the unknown quantities of interest. The data often arrive sequentially in time and, therefore, require on-line decision-making responses. Examples of such applications are speech enhancement [24]; visual tracking [25]; target tracking [26]; stochastic volatility models in economics [27] and many more.


In this thesis we shall concentrate on a particular instance of sequential filtering, that is, state space models.

2.6.1 State Space Models

State-space models have been widely studied within the areas of signal processing and systems and control theory [28], [29], [30]. A state-space model is a model where the relationship between the input signal, the output signal and the noises is provided by a system of first-order difference equations. The state vector x_n contains the quantities of interest of the underlying system up to and including time n, which are needed to determine the future behavior of the system, given the input. The system is described by (2.32a)-(2.32b).

General state-space model with Gaussian noise

\[
x_n = f(x_{n-1}, v_n), \qquad (2.32a)
\]
\[
y_n = h(x_n, w_n), \qquad (2.32b)
\]

where v_n ∼ N(0, Q), w_n ∼ N(0, R), and E{v_n w_n^T} = 0.

The functions f(·) and h(·) can in general be non-linear. We will, however, restrict ourselves to the special case where f(·) and h(·) are linear functions, hence Linear Dynamic State Space Model (LDSSM) systems.

A fundamental property ascribed to the system model is the Markov property.

Definition 2.1. (Markov property). A discrete-time stochastic process {x_n} is said to possess the Markov property if

\[
p(x_{n+1} \mid x_1, \ldots, x_n) = p(x_{n+1} \mid x_n).
\]

This means that the realization of the process at time n contains all information about the past, which is necessary in order to calculate the future behavior of the process. Hence, if the present realization of the process is known, the future is independent of the past.

The state process xn is an unobserved (hidden) Markov process. Information about this process is indirectly obtained from measurements (observations) yn according to the measurement model,

\[
y_n \sim p(y_n \mid x_n). \qquad (2.33)
\]

The observation process {y_n} is assumed to be conditionally independent of the state process {x_n}, i.e.,

\[
p(y_n \mid x_0, \ldots, x_N) = p(y_n \mid x_n), \qquad \forall\, 1 \le n \le N. \qquad (2.34)
\]

2.6.2 Sequential Bayesian Inference

The Bayesian posterior, p(x_{0:n} | y_{1:n}), reflects all the information we have about the state of the system x_{0:n}, contained in the measurements y_{1:n} and the prior p(x_{0:n}), and gives a direct and easily applicable means of combining the two last-mentioned densities via Bayes' theorem:

\[
p(x_{0:n} \mid y_{1:n}) = \frac{p(y_{1:n} \mid x_{0:n})\, p(x_{0:n})}{p(y_{1:n})}. \qquad (2.35)
\]

Taking into account that the observations up to time n are independent given x_{0:n}, the likelihood p(y_{1:n} | x_{0:n}) in the above equation can be factorised as follows:

\[
p(y_{1:n} \mid x_{0:n}) = \prod_{i=1}^{n} p(y_i \mid x_{0:n}), \qquad (2.36)
\]

and, since, conditional on x_i, the measurement y_i is independent of the states at all other times, it is given by:

\[
p(y_{1:n} \mid x_{0:n}) = \prod_{i=1}^{n} p(y_i \mid x_i). \qquad (2.37)
\]

In addition, as a result of the Markov structure of the system in (2.32a), the prior p (x0:n) takes the following form:

\[
p(x_{0:n}) = p(x_0) \prod_{i=1}^{n} p(x_i \mid x_{i-1}), \qquad (2.38)
\]

resulting in the posterior probability density being equal to

\[
p(x_{0:n} \mid y_{1:n}) = \frac{p(x_0) \prod_{i=1}^{n} p(y_i \mid x_i)\, p(x_i \mid x_{i-1})}{p(y_{1:n})}. \qquad (2.39)
\]

2.6.3 Filtering Objectives

Our objective is to obtain the estimates of the state at time n, conditional upon the measurements up to time n, such as, for example, MMSE of xn:

\[
\hat{x}_n^{MMSE} = \mathbb{E}_{p(x_n \mid y_{1:n})}\{x_n\} = \int x_n\, p(x_n \mid y_{1:n})\, dx_n, \qquad (2.40)
\]

or the Marginal Maximum A Posteriori (MMAP), given by:

\[
\hat{x}_n^{MAP} = \arg\max_{x_n}\, p(x_n \mid y_{1:n}). \qquad (2.41)
\]

2.6.4 Sequential Scheme

The probability density of interest, p(x_n | y_{1:n}), can be obtained by marginalisation of (2.39); however, the dimension of the integration in this case grows as n increases. This can be avoided by using a sequential scheme. A recursive formula for the joint probability density can be obtained straightforwardly from (2.39):

\[
p(x_{0:n} \mid y_{1:n}) = p(x_{0:n-1} \mid y_{1:n-1})\, \frac{p(y_n \mid x_n)\, p(x_n \mid x_{n-1})}{p(y_n \mid y_{1:n-1})}, \qquad (2.42)
\]

with the marginal p(x_n | y_{1:n}) also satisfying the recursion [31]

\[
p(x_n \mid y_{1:n-1}) = \int p(x_n \mid x_{n-1})\, p(x_{n-1} \mid y_{1:n-1})\, dx_{n-1}, \qquad (2.43)
\]

\[
p(x_n \mid y_{1:n}) = \frac{p(y_n \mid x_n)\, p(x_n \mid y_{1:n-1})}{p(y_n \mid y_{1:n-1})}, \qquad (2.44)
\]

where

\[
p(y_n \mid y_{1:n-1}) = \int p(y_n \mid x_n)\, p(x_n \mid y_{1:n-1})\, dx_n. \qquad (2.45)
\]

Equations (2.43) and (2.44) are called, respectively, prediction and updating. Although the above expressions appear simple, the integrations involved are usually intractable. One cannot typically compute the normalising constant p(y_{1:n}) and the marginals of p(x_{0:n} | y_{1:n}), in particular p(x_n | y_{1:n}), except for several special cases when the integration can be performed exactly. The problem is of great importance, which is why a great number of different approaches and filters have been proposed.

The most important special case, and the one in which we will be interested in this thesis, occurs when all equations in (2.32a)-(2.32b) are linear and the noise terms are Gaussian. The optimal solution in terms of MSE is provided by the celebrated Kalman filter [32]. In the nonlinear, non-Gaussian case, there is a class of methods, referred to as Sequential Monte Carlo (SMC) methods, available for approximating the optimal solution [33].

An important property of the linear model (2.32a)-(2.32b) is that all density functions involved are Gaussian. This is due to the fact that a linear transformation of a Gaussian random variable will result in a new Gaussian random variable. Furthermore, a Gaussian density function is completely parameterized by two parameters, the first and second order moments, i.e., the mean and the covariance.

2.6.5 Linear State-Space Models - the Kalman Filter

When Rudolf E. Kalman developed the framework for the Kalman Filter, his original derivation did not require the underlying state-space model to be linear, nor all the probability densities to be Gaussian. The only assumptions he made were the following:

1. Consistent minimum variance estimates of the system random variables (i.e. the posterior state distribution) can be calculated by recursively propagating and updating only their first and second order moments.

2. The estimator itself is a linear function of the prior knowledge of the system, summarized

by p(x_n | y_{1:n-1}), and the new observed information, summarized by p(y_n | x_n).

3. Accurate predictions of the state and of the system observations can be calculated.

Based on these assumptions, Kalman derived in [32] a recursive form of the conditional mean of the state, E{x_n | y_{1:n-1}}. In the case of a linear state-space model and Gaussian noise terms, the state-space can be expressed as (2.46a)-(2.46b).

Linear state-space model with Gaussian noise

\[
x_n = A_n x_{n-1} + v_n, \qquad (2.46a)
\]
\[
y_n = B_n x_n + w_n, \qquad (2.46b)
\]

where v_n ∼ N(0, Q), w_n ∼ N(0, R), and E{v_n w_n^T} = 0.

In that case the Kalman filter procedure for estimating xn given the system model in (2.46a)- (2.46b) is given below:

Kalman filter equations

Step 1: a priori estimation (prediction)

\[
\hat{x}_n^- = A_n \hat{x}_{n-1}, \qquad (2.47a)
\]
\[
P_n^- = A_n P_{n-1} A_n^H + Q. \qquad (2.47b)
\]

Step 2: a posteriori estimation (update)

\[
K_n = P_n^- B_n^H \left( B_n P_n^- B_n^H + R \right)^{-1}, \qquad (2.48a)
\]

\[
\hat{x}_n = \hat{x}_n^- + K_n \left( y_n - B_n \hat{x}_n^- \right), \qquad (2.48b)
\]
\[
P_n = P_n^- - K_n B_n P_n^-. \qquad (2.48c)
\]

First, the filter calculates the a priori estimate of the channel, x̂_n^-, and its error covariance matrix

P_n^-, based on the history prior to the current observation y_n, where

\[
P_n^- = \mathbb{E}\left\{ \left( \hat{x}_n^- - x_n \right) \left( \hat{x}_n^- - x_n \right)^H \right\}. \qquad (2.49)
\]

Next, the a posteriori estimates of the channel x̂_n and its error covariance matrix P_n after the observation y_n is available at the receiver are evaluated, where

\[
P_n = \mathbb{E}\left\{ \left( \hat{x}_n - x_n \right) \left( \hat{x}_n - x_n \right)^H \right\}. \qquad (2.50)
\]

The Kalman gain K_n in (2.48a) balances the weight between the predicted estimates and the innovation process. In Appendix 10.3 we derive the Kalman filter from a Bayesian point of view. The Kalman filter shall be utilised in Chapter 6 for the purpose of channel tracking in OFDM systems.
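A minimal Python implementation of the recursion (2.47a)-(2.48c) is sketched below for a hypothetical real-valued, time-invariant model (so the Hermitian transpose reduces to the ordinary transpose); the matrices A, B, Q and R are illustrative choices, not ones used in this thesis:

import numpy as np

def kalman_filter(ys, A, B, Q, R, x0, P0):
    """Run the Kalman recursion over the observation sequence ys."""
    x_hat, P = x0, P0
    estimates = []
    for y_n in ys:
        # Step 1: a priori estimation (prediction), eqs. (2.47a)-(2.47b)
        x_pred = A @ x_hat
        P_pred = A @ P @ A.T + Q
        # Step 2: a posteriori estimation (update), eqs. (2.48a)-(2.48c)
        S = B @ P_pred @ B.T + R
        K = P_pred @ B.T @ np.linalg.inv(S)      # Kalman gain
        x_hat = x_pred + K @ (y_n - B @ x_pred)  # innovation correction
        P = P_pred - K @ B @ P_pred
        estimates.append(x_hat)
    return np.array(estimates)

# Hypothetical 2-state random-walk model with a noisy scalar observation
A, B = np.eye(2), np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
rng = np.random.default_rng(3)
x, ys = np.zeros(2), []
for _ in range(100):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(B @ x + rng.normal(scale=np.sqrt(R[0, 0]), size=1))
est = kalman_filter(ys, A, B, Q, R, x0=np.zeros(2), P0=np.eye(2))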

2.7 Consistency Tests

We now discuss the issue of verifying that the Kalman filter is performing correctly. As mentioned in the previous sections, the Kalman filter is optimal in the sense that it provides the MMSE estimate. However, this relies on perfect knowledge of An, Bn, Q and R. There are instances where some of those quantities may be different from what we have assumed. For that reason it is important to obtain a measure of reliability regarding the performance of the filter. This gives us a quantifiable confidence in the accuracy of our filter estimates. A consistency test indicates whether the state-space model is consistent with the data y. We begin by expressing the distribution of y1:n as

\[
p(y_{1:n}) = p(y_n, y_{1:n-1}) = p(y_n \mid y_{1:n-1})\, p(y_{1:n-1}) = \prod_{i=1}^{n} p(y_i \mid y_{1:i-1}). \qquad (2.51)
\]

Further, we assume that the elements in the product of (2.51) are Gaussian distributed, that is:

\[
p(y_i \mid y_{1:i-1}) = \mathcal{N}\left( \hat{y}_i^-, S_i \right), \qquad (2.52)
\]

where ŷ_i^- = B_i x̂_i^- is the predicted observation, and S_i = B_i P_i^- B_i^T + R is the associated observation covariance matrix. It follows that

\[
p(\epsilon_i \mid y_{1:i-1}) = \mathcal{N}(0, S_i), \qquad (2.53)
\]

where ε_i ≜ y_i − ŷ_i^-. Using (2.53) in (2.51) yields, with the explicit expression for the Gaussian probability distribution function (pdf):

\[
p(y_{1:n}) = \prod_{i=1}^{n} p(\epsilon_i) = \left( \prod_{i=1}^{n} |2\pi S_i|^{-1/2} \right) \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} \epsilon_i^T S_i^{-1} \epsilon_i \right\}. \qquad (2.54)
\]

The distribution in (2.54) is known as the likelihood function of the filter, and the exponent:

\[
\lambda_n = \sum_{i=1}^{n} \epsilon_i^T S_i^{-1} \epsilon_i = \lambda_{n-1} + \epsilon_n^T S_n^{-1} \epsilon_n \qquad (2.55)
\]

is called the modified log-likelihood function. The individual terms ε_n^T S_n^{-1} ε_n are χ²_K distributed with K degrees of freedom, where K is the dimension of y_n. It follows that λ_n is χ²_{Kn} distributed with Kn degrees of freedom. Using these results, a quality threshold can be established by using the χ²_{Kn} distribution, and λ_n can serve as a track quality indicator. When λ_n exceeds the quality threshold, it indicates that a model mismatch has occurred. We shall use these results in Chapter 6 to reduce model mismatch in the iterative receiver.
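A sketch of the corresponding test, assuming the innovations ε_i and covariances S_i have been collected from a filter such as the one sketched above; the threshold is taken from the χ² distribution with Kn degrees of freedom:

import numpy as np
from scipy.stats import chi2

def track_quality(innovations, S_list, confidence=0.95):
    """Accumulate the modified log-likelihood (2.55) and compare it with a
    chi-square quality threshold with K*n degrees of freedom."""
    lam = 0.0
    K = len(innovations[0])  # dimension of each observation y_n
    for n, (eps, S) in enumerate(zip(innovations, S_list), start=1):
        lam += float(eps @ np.linalg.inv(S) @ eps)  # eq. (2.55)
        if lam > chi2.ppf(confidence, df=K * n):
            return n, lam, False  # model mismatch flagged at step n
    return len(innovations), lam, True

Here a returned flag of False indicates that λ_n exceeded the quality threshold, signalling a model mismatch.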

2.8 The EM and Bayesian EM Methods

The Expectation Maximisation (EM) method, introduced by Dempster, Laird and Rubin [34], [35], presents an iterative approach for obtaining the mode of the likelihood function. The strategy underlying the EM algorithm is to separate a difficult maximum likelihood problem into two linked problems, each of which is easier to solve than the original problem. The main property of the EM algorithm is that it ensures an increase in the likelihood function at each iteration. Suppose we wish to find

\[
\hat{x} = \arg\max_x\, p(y \mid x), \qquad (2.56)
\]

where y is the observation vector and x is the unknown vector to be estimated. We assume that p(y|x) is the marginal of some real-valued function p(y, θ|x). It is convenient to think of θ either as missing observations or as latent variables. That is, in situations where it is hard to maximise p(y|x), EM will allow us to accomplish this by working with p(y, θ|x). The likelihood function p(y|x) can be expressed as

\[
p(y \mid x) = \sum_{\theta} p(y, \theta \mid x), \qquad (2.57)
\]

where Σ_θ g(θ) denotes either integration or summation of g(θ) over the whole range of θ. The function p(y, θ|x) is assumed to be non-negative for all x and θ. The Maximum Likelihood

(ML) estimate in (2.56) can be cast as

\[
\begin{aligned}
\hat{x} &= \arg\max_x\, p(y \mid x) \\
&= \arg\max_x\, \log p(y \mid x) \qquad (2.58) \\
&= \arg\max_x\, \log\left( \sum_{\theta} p(y, \theta \mid x) \right).
\end{aligned}
\]

The pivotal problem encountered while maximising (2.58) is that the logarithm of the summation cannot be decoupled. The EM algorithm attempts to compute (2.58) iteratively as follows:

1. Make some initial guess x^0.

2. Expectation step: evaluate

\[
Q(x, x^k) \triangleq \mathbb{E}_{\theta \mid y;\, x^k}\left\{ \log p(y, \theta \mid x) \right\}. \qquad (2.59)
\]

3. Maximisation step: compute

\[
x^{k+1} = \arg\max_x\, Q(x, x^k). \qquad (2.60)
\]

4. Repeat steps 2-3 until convergence.

In the above formulation, the superscript k in x^k denotes the k-th iteration. The full details regarding the derivation of the EM method can be found in Appendix 10.4.
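To make the E- and M-steps concrete, the following Python sketch applies EM to a two-component Gaussian mixture with unit variances, a standard textbook case (not one of the problems treated in this thesis); the latent component labels play the role of θ:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Synthetic data from a two-component mixture
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

w, mu = 0.5, np.array([-1.0, 1.0])  # initial guess x^0: weight and means
for _ in range(50):
    # E-step: posterior responsibilities of the latent labels theta
    lik = np.vstack([w * norm.pdf(y, mu[0], 1), (1 - w) * norm.pdf(y, mu[1], 1)])
    resp = lik / lik.sum(axis=0)
    # M-step: maximise Q(x, x^k), available in closed form for this model
    w = resp[0].mean()
    mu = resp @ y / resp.sum(axis=1)
print(w, mu)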

The main property of the EM algorithm is

\[
p(y \mid x^{k+1}) \ge p(y \mid x^k), \qquad (2.61)
\]

which states that the likelihood of the current estimate x^k is non-decreasing at every iteration.

An important aspect of the EM algorithm is the choice of the unobserved data vector θ. This choice should be made such that the maximisation step is easy, or at least easier than the maximisation of the likelihood function. In general, doing so is not an easy task. In addition, the evaluation of the conditional expectation may also be rather challenging.

Despite the fact that the algorithm can display slow numerical convergence in some cases, the EM algorithm has become a very popular computational method in statistics. One of the main reasons motivating its use is that the implementation of the E-step and M-step is easy for many statistical problems.

The main disadvantages of the EM method are:

1. Convergence to global optimum is not guaranteed.

2. May have slow convergence rate, i.e. has a linear order of convergence.

3. In some problems, the E-step or M-step may be analytically intractable.

The EM can be extended to the Bayesian setting in order to find the MAP estimate. In that case the expectation step in (2.59) is replaced with

\[
Q(x, x^k) \triangleq \mathbb{E}_{\theta \mid y;\, x^k}\left\{ \log p(y, x, \theta) \right\}. \qquad (2.62)
\]

The BEM methodology shall be used in Chapter 5 to perform inference in the presence of nuisance parameters.

2.9 Lower Bounds on the MSE

As shown in Section 2.3.2, the conditional mean E{x|y} is the estimator minimising the MSE. Thus, from a theoretical perspective, finding the MMSE estimator in any given problem is a mechanical task. In practice, however, the complexity involved in computing the conditional mean is often prohibitive. As a result, various alternatives, such as the LMMSE and the MAP techniques, could be used, as these methods can often be solved efficiently. An important goal is to quantify the performance degradation resulting from the use of these suboptimal techniques. One way to do this is to compare the MSE of the method used in practice with the MMSE. Unfortunately, computation of the MMSE is itself infeasible in many cases. Therefore, we are interested in finding simple lower bounds for the MMSE in various settings. As mentioned before, in probabilistic inference there are two well known approaches, namely the frequentist approach and the Bayesian approach. Although the deterministic and Bayesian settings stem from different points of view, a connection between them can be made. Here we concentrate on the Bayesian version of lower bounds on the MSE.

Suppose x̂ = x̂(y) is an estimator of x ∈ ℂ^m given an observation vector y ∈ ℂ^n. Its Bayesian MSE matrix, Σ, is given by

\[
\Sigma = \mathbb{E}_{y,x}\left\{ \left( \hat{x} - x \right) \left( \hat{x} - x \right)^H \right\} = \iint \left( \hat{x} - x \right) \left( \hat{x} - x \right)^H p(y, x)\, dy\, dx, \qquad (2.63)
\]

where E_{y,x} denotes expectation with respect to p(y, x). The Bayesian Cramér-Rao Lower Bound (BCRLB) provides a lower bound on the MSE matrix for random parameters [36], [37].

It is the inverse of the Bayesian Information Matrix (BIM) J

\[
\Sigma \ge C \triangleq J^{-1}, \qquad (2.64)
\]

where the matrix inequality indicates that Σ − C is a positive semi-definite matrix. The BIM for x, J, is defined as

\[
J = \mathbb{E}_{y,x}\left\{ -\Delta_x^x \ln p(y, x) \right\}, \qquad (2.65)
\]

where Δ_β^α is defined as the m × n matrix of second-order partial derivatives with respect to the m × 1 parameter vector β and the n × 1 parameter vector α,

\[
\Delta_{\beta}^{\alpha} = \begin{bmatrix}
\frac{\partial^2}{\partial \beta_1 \partial \alpha_1} & \frac{\partial^2}{\partial \beta_1 \partial \alpha_2} & \cdots & \frac{\partial^2}{\partial \beta_1 \partial \alpha_n} \\
\frac{\partial^2}{\partial \beta_2 \partial \alpha_1} & \frac{\partial^2}{\partial \beta_2 \partial \alpha_2} & \cdots & \frac{\partial^2}{\partial \beta_2 \partial \alpha_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2}{\partial \beta_m \partial \alpha_1} & \frac{\partial^2}{\partial \beta_m \partial \alpha_2} & \cdots & \frac{\partial^2}{\partial \beta_m \partial \alpha_n}
\end{bmatrix}. \qquad (2.66)
\]

The BCRLB will be relevant to results obtained in Chapter 7, where we derive BCRLB-type bounds.

2.10 Bayesian Methodology Summary

The Bayesian methodology provides a mathematically rigorous quantification of uncertainty through the application of probability theory. The subjective prior probabilities are updated in light of data. Within Bayesian inference there exist many cases where intractable calculations prevent its application. Whilst in the past such issues would have been avoided through the use of conjugate prior distributions, these problems can now be overcome using Monte Carlo techniques, which are a class of asymptotically optimal approximate algorithms. We now review a class of Monte Carlo techniques which can be applied to these problems.

2.11 Monte Carlo Methods

2.11.1 Motivation

In Bayesian inference, whenever we attempt to carry out normalisation, marginalisation or expectations, high-dimensional integrals need to be evaluated. Apart from that, many applications involve elements of non-Gaussianity, nonlinearity and nonstationarity. All those reasons preclude the use of analytical integration. In order to obviate this problem, we can either resort to approximation methods, numerical integration or Monte Carlo simulation. Approximation methods, such as Gaussian approximation [38] and variational methods [39], are easy to implement. Yet, they do not take into account all the salient statistical features of the processes under consideration, thereby often leading to poor results. Numerical integration in high dimensions is far too computationally expensive to be of any practical use. Monte Carlo methods provide the middle ground. They lead to better estimates than the approximate methods. This occurs at the expense of extra computing requirements, but the advent of cheap and massive computational power, in conjunction with some recent developments in applied statistics, means that many of these requirements can now be met. Monte Carlo methods are very flexible in that they do not require any assumptions about the probability distributions of the data. From a Bayesian perspective, Monte Carlo methods allow one to compute the full posterior probability distribution.

2.11.2 Monte Carlo Techniques

Monte Carlo methods are commonly used for approximation of intractable integrals and rely on the ability to draw a random sample from the required probability distribution. The idea is to simulate N independent and identically distributed (i.i.d.) samples {x^{(i)}}_{i=1}^{N} from the distribution of interest, which in the Bayesian framework is usually the posterior p(x|y), and use them to obtain an empirical estimate of the distribution:

\[
\hat{p}_N(x \mid y) = \frac{1}{N} \sum_{i=1}^{N} \delta\left( x - x^{(i)} \right). \qquad (2.67)
\]

Using this empirical density, the expected value of x,

\[
\mathbb{E}\{x \mid y\} = \int x\, p(x \mid y)\, dx, \qquad (2.68)
\]

or, in general,

\[
\mathbb{E}\{\vartheta(x) \mid y\} = \int \vartheta(x)\, p(x \mid y)\, dx, \qquad (2.69)
\]

can be obtained consequently by approximating the corresponding integrals by the sums:

\[
\hat{\mathbb{E}}\{x \mid y\} = \int x\, \hat{p}_N(x \mid y)\, dx = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}, \qquad (2.70)
\]

\[
\hat{\mathbb{E}}\{\vartheta(x) \mid y\} = \int \vartheta(x)\, \hat{p}_N(x \mid y)\, dx = \frac{1}{N} \sum_{i=1}^{N} \vartheta(x^{(i)}). \qquad (2.71)
\]

The estimate (2.71) is unbiased with variance proportional to 1/N, and according to the strong law of large numbers we have that

\[
\lim_{N \to \infty} \hat{p}_N(x \mid y) \xrightarrow{a.s.} p(x \mid y), \qquad (2.72)
\]

where a.s. denotes almost sure (a.s.) convergence [33]. If we assume that σ² = E{ϑ²(x)|y} − E²{ϑ(x)|y} < ∞, the Central Limit Theorem can be applied, which gives

\[
\lim_{N \to \infty} \sqrt{N}\left( \hat{\mathbb{E}}\{\vartheta(x) \mid y\} - \mathbb{E}\{\vartheta(x) \mid y\} \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2), \qquad (2.73)
\]

where d denotes convergence in distribution [33]. The aforementioned procedure can easily be carried out provided one can sample from p(x|y). This is usually not the case, however, with p(x|y) being multivariate, non-standard and typically only known up to a normalising constant.
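A minimal sketch of the estimators (2.70)-(2.71), assuming for illustration that the posterior is a standard Gaussian from which we can sample directly:

import numpy as np

rng = np.random.default_rng(5)
N = 100000
samples = rng.normal(size=N)  # i.i.d. draws x^(i) from p(x | y)

e_x = samples.mean()          # eq. (2.70): approximates E{x | y} = 0
e_f = np.mean(samples ** 2)   # eq. (2.71) with theta(x) = x^2; true value 1
print(e_x, e_f)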

2.11.3 Sampling From Distributions

The problem under consideration in this section is to generate samples from some known probability density function, referred to as the target density p(x). However, since we cannot generate samples from p(x) directly, the idea is to employ an alternative density that is simple to draw samples from, referred to as the sampling density s(x). The only restriction imposed on s(x) is that its support should include the support of p(x). When a sample x ∼ s(x) is drawn, the probability that it was in fact generated from the target density can be calculated. This probability can then be used to decide whether x should be considered to be a sample from p(x) or not. This probability is referred to as the acceptance probability, and it is typically expressed as a function of q(x), defined by the following relationship

\[
p(x) \propto q(x)\, s(x). \qquad (2.74)
\]

Depending on the exact details of how the acceptance probability is computed, different methods are obtained. The three most common methods are briefly explained in the next sections.


Figure 2.2: The inverse transform method to obtain samples

2.11.4 Inversion Sampling

A simple yet elegant approach for sampling from distributions was considered by Ulam prior to 1947. If p(x) is a distribution and it is possible to invert its cumulative distribution function (cdf), F, then it is possible to transform a sample, u, from a uniform distribution over [0, 1] into a sample, x, from p(x) by making use of the following transformation:

\[
x = F^{-1}(u). \qquad (2.75)
\]

Actually, it suffices to obtain a generalised inverse of F, a function with the property that F^{-1}(u) = inf_x {F(x) ≥ u}, as the image of the set of points over which the true inverse is multi-valued is a null set. This algorithm is summarised in Algorithm 1 and illustrated in Figure 2.2. Except for simple cases, such as uniform and normal distributions, finding the inverse of the cdf is difficult or impossible; that is, one cannot solve

\[
F(x) = \int_{-\infty}^{x} p(t)\, dt = u. \qquad (2.76)
\]

For this reason, other approaches to obtain samples from difficult distributions have been devised.

Algorithm 1 Inversion sampling algorithm
Input: F(x)
Output: a sample from p(x)
1: Generate u ∼ U[0, 1]
2: Set F(x) = u
3: Invert the cdf and solve for x: x = F^{-1}(u)
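Algorithm 1 in Python for an exponential target with rate λ, whose cdf F(x) = 1 − exp(−λx) inverts in closed form:

import numpy as np

rng = np.random.default_rng(6)
lam = 2.0
u = rng.uniform(size=10000)  # step 1: u ~ U[0, 1]
x = -np.log(1 - u) / lam     # steps 2-3: solve F(x) = u, i.e. x = F^{-1}(u)
print(x.mean())              # should be close to 1 / lam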

2.11.5 Accept-Reject Sampling

The Accept-Reject technique is based on generating a proposal and then accepting or rejecting it with a certain acceptance probability. As a result, a true, independent sample from the required distribution is obtained. Let us assume that we have a distribution p(x) ∝ f(x) that is easy to evaluate at discrete points, but from which samples are difficult to draw, and our goal is to obtain realisations of p(x). In such cases we use the Accept-Reject algorithm [40]. According to the algorithm, a sample is drawn from another distribution q(x) ∝ g(x) and, based on the outcome, a decision is made on whether to keep the sample or reject it. This algorithm is summarised in Algorithm 2. The distribution g(x) must be chosen such that f(x) ≤ c g(x) for some constant c. If the proposal density is quite different from the target one, many candidates are rejected, which results in an unpredictable number of iterations required to complete each step and proves to be extremely computationally expensive in high-dimensional spaces. The Accept-Reject method is widely studied and the interested reader is referred to [11], [41]. The proof of this procedure is easy to obtain:

\[
\begin{aligned}
p\left( x^{(i)} < z \right) &= p\left( x < z \,\middle|\, u \le \frac{f(x)}{c\, g(x)} \right) \\
&= \frac{p\left( x < z,\; u \le \frac{f(x)}{c\, g(x)} \right)}{p\left( u \le \frac{f(x)}{c\, g(x)} \right)} \\
&= \frac{\int_{-\infty}^{z} \int_{0}^{f(x)/(c g(x))} du\; g(x)\, dx}{\int_{-\infty}^{\infty} \int_{0}^{f(x)/(c g(x))} du\; g(x)\, dx} \qquad (2.77) \\
&= \frac{\frac{1}{c} \int_{-\infty}^{z} f(x)\, dx}{\frac{1}{c} \int_{-\infty}^{\infty} f(x)\, dx} \\
&= \int_{-\infty}^{z} p(x)\, dx,
\end{aligned}
\]

which proves the required result.

Algorithm 2 Accept-Reject sampling algorithm
Input: p(x) ∝ f(x), q(x) ∝ g(x) and number of attempts N
Output: x^{(i)}, samples from p(x)
1: Find a constant c such that f(x) ≤ c g(x) for all x
2: for i = 1 ... N do
3:   Generate x from g(x)
4:   Generate u ∼ U[0, 1]
5:   if u ≤ f(x)/(c g(x)) then
6:     x^{(i)} = x (accept)
7:   else
8:     goto step 3
9:   end if
10: end for
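Algorithm 2 in Python for an illustrative target p(x) ∝ f(x) = x(1 − x) on [0, 1] (a Beta(2, 2) density up to normalisation) with a uniform proposal; the constant c = 1/4 satisfies f(x) ≤ c g(x):

import numpy as np

rng = np.random.default_rng(7)

f = lambda x: x * (1 - x)  # unnormalised target on [0, 1]
g = lambda x: 1.0          # uniform proposal density on [0, 1]
c = 0.25                   # f(x) <= c * g(x) for all x in [0, 1]

samples = []
while len(samples) < 10000:
    x = rng.uniform()      # generate a candidate from g
    u = rng.uniform()
    if u <= f(x) / (c * g(x)):
        samples.append(x)  # accept; otherwise draw a new candidate
print(np.mean(samples))    # should be close to 0.5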

2.11.6 Importance Sampling

In the Accept-Reject method, computing f(x) and q(x) and then throwing x away along with all the associated computations seems wasteful. Importance sampling avoids the trouble of trying to sample directly from the target distribution by instead sampling from an importance distribution q(x). The distribution q(x) is selected to have the property that it is simpler to obtain samples from than the target distribution. A correction is then made, since these samples were not taken from the distribution of interest p(x), but instead from the importance distribution q(x). This correction step is known as importance weighting. Integrals of some bounded, integrable function ϑ with respect to the target distribution can be expressed as

\[
\mathbb{E}_{p(x)}\{\vartheta(x)\} = \int \vartheta(x)\, p(x)\, dx = \int \vartheta(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_{q(x)}\left\{ \vartheta(x)\, \frac{p(x)}{q(x)} \right\}, \qquad (2.78)
\]

and may be approximated as

\[
\mathbb{E}_{p(x)}\{\vartheta(x)\} \approx \frac{1}{N} \sum_{i=1}^{N} \vartheta(x^{(i)})\, W(x^{(i)}), \qquad (2.79)
\]

where W(x) = p(x)/q(x) is the correction importance weight. The particles, x^{(i)}, are samples from the importance distribution, q(x). This will produce an unbiased estimate with a variance inversely proportional to the number of particles N. The importance sampling algorithm is summarised in Algorithm 3.

Algorithm 3 Importance sampling algorithm
Input: p(x) ∝ f(x), which is difficult to sample from yet can be evaluated analytically up to a normalisation constant, and number of samples N
Output: {x^{(i)}}_{i=1}^{N}, samples from p(x)
1: for i = 1 ... N do
2:   x^{(i)} ∼ q(x)
3:   w^{(i)} = f(x^{(i)}) / q(x^{(i)})
4: end for
5: for i = 1 ... N do
6:   W^{(i)} = w^{(i)} / Σ_{j=1}^{N} w^{(j)}
7: end for
8: p̂(x) = Σ_{i=1}^{N} W^{(i)} δ_{x^{(i)}}(x)
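Algorithm 3 in Python, estimating E_p{ϑ(x)} for a standard Gaussian target with a wider Gaussian importance distribution (a deliberately safe choice, since its tails dominate those of the target):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
N = 100000

x = rng.normal(scale=2.0, size=N)         # draws from q(x)
w = norm.pdf(x) / norm.pdf(x, scale=2.0)  # unnormalised weights f/q
W = w / w.sum()                           # normalised weights W^(i)

estimate = np.sum(W * x ** 2)             # approximates E_p{x^2} = 1
print(estimate)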

In some situations the described techniques fail or become too computationally intensive. In order to overcome these hurdles, techniques which utilise the framework of Monte Carlo have been developed. One of these techniques, which we now discuss, is the MCMC technique.

2.12 Markov Chain Monte Carlo Methods

2.12.0.1 Introduction

Monte Carlo methods provide a general statistical approach for simulating multivariate distributions by generating an efficient discretised representation of the needed posterior density. MCMC methods can be used to sample from distributions that are complex and have unknown normalisation. This is achieved by relaxing the requirement that the samples should be independent.

A Markov chain generates a correlated sequence of states. Each step in the sequence is drawn from a transition operator T(x′ ← x), which gives the probability of moving from state x to state x′. According to the Markov property, the transition probabilities depend only on the current state, x. In particular, any free parameters σ, e.g. step sizes, in a family of transition operators, T(x′ ← x; σ), cannot be chosen based on the history of the chain. A basic requirement for T is that, given a sample from p(x), the marginal distribution over the next state in the chain is also the target distribution of interest p:

\[
p(x') = \sum_{x} T(x' \leftarrow x)\, p(x), \qquad \text{for all } x'. \qquad (2.80)
\]

By induction, all subsequent steps of the chain will have the same marginal distribution. The transition operator is said to leave the target distribution p stationary. MCMC algorithms often require operators that ensure the marginal distribution over a state of the chain tends to p(x) regardless of the starting state. This requires irreducibility: the ability to reach any x where p(x) > 0 in a finite number of steps, and aperiodicity: no states are only accessible at certain regularly spaced times. For now we note that as long as T satisfies equation (2.80) it can be useful, as the other conditions can be met through combinations with other operators. Since usually p(x) is a complicated distribution, it might seem unreasonable to expect that we could find a transition operator T leaving it stationary. However, it is often easy to construct a transition operator satisfying detailed balance:

\[
T(x' \leftarrow x)\, p(x) = T(x \leftarrow x')\, p(x'), \qquad \text{for all } x, x'. \qquad (2.81)
\]

Detailed balance states that a step starting at equilibrium and transitioning under T has the same probability "forwards", x′ ← x, or "backwards", x ← x′.

To summarise, an MCMC algorithm explores the parameter space by a random walk, but spends more time in regions of higher probability, such that in the long run, the amount of time spent in any region is proportional to the amount of probability in that region.

2.12.1 Basics of Markov Chains

Let (Ω, σ, P) be a probability space, let X = {1, 2, ..., N} be a finite set and let x_n(ω) be a stochastic process in discrete time such that x_n(ω) ∈ X for all n = 0, 1, 2, ... and all ω ∈ Ω. Suppose there is an N × N matrix P (called the Probability Transition Matrix (PTM) of the process) such that for all i, j ∈ X we have

\[
p(x_{n+1} = i \mid x_n = j) = [\mathbf{P}]_{i,j}. \qquad (2.82)
\]

Then we call the process x_n a time-homogeneous Markov chain with finite state space in discrete time. The matrix P has non-negative entries and its columns sum to 1.

Definition 2.2. Let P be a PTM. It is said to be irreducible if for every pair i, j ∈ X there is an n ≥ 1 such that [Pⁿ]_{i,j} > 0. Loosely speaking, a Markov chain is irreducible if (almost) all states communicate; the property corresponds to the existence of a path of positive probability from (almost) any point in the space to (almost) any measurable set. In a Bayesian context, irreducibility guarantees that we can visit all the sets of parameter values in the posterior's support.

Definition 2.3. Let P be an N × N PTM and let 1 ≤ i ≤ N. The set of return times for state i is defined via

\[
R(i) = \left\{ n > 0 : [\mathbf{P}^n]_{i,i} > 0 \right\}.
\]

Definition 2.4. Let P be an N × N PTM, let 1 ≤ i ≤ N and let R(i) be the set of return times for state i. Then the period of state i is defined via

\[
p(i) = \text{g.c.d. } R(i),
\]

where g.c.d. stands for "greatest common divisor".

Definition 2.5. An N × N PTM P is called aperiodic if p(i) = 1 for each i = 1, ..., N. In the discrete state space case, a Markov chain is aperiodic if there exist no cycles of length greater than one, where a cycle is defined as the greatest common divisor of the length of all routes of positive probability between two states.

2.12.2 Metropolis Hastings Sampler

The Metropolis Hastings (MH) algorithm is the most ubiquitous class of MCMC algorithms. The MH is a general algorithm for computing estimates using the MCMC method. It was introduced by W. Keith Hastings in 1970 [42], as a generalization of the algorithm proposed by Nicholas Metropolis et al. in 1953 [43].

An introduction to the MH algorithm is provided by Chib and Greenberg [44]. The idea of the algorithm, which is depicted in Algorithm 4, is borrowed from Accept-Reject sampling, in that the generated samples are either accepted or rejected. However, when a sample is rejected, the current value is used as a sample from the target density.

Algorithm 4 Metropolis-Hastings algorithm
Input: initial x and number of samples N
Output: {x^{(i)}}_{i=1}^{N}, samples from p(x)
1: Initialise x^{(0)} to an arbitrary starting point
2: for i = 1 ... N do
3:   Propose x* ∼ q(x* ← x^{(i-1)})
4:   Compute α(x^{(i-1)}, x*) = min{1, [p(x*) / p(x^{(i-1)})] × [q(x^{(i-1)} ← x*) / q(x* ← x^{(i-1)})]}
5:   Generate u ∼ U[0, 1]
6:   if u ≤ α(x^{(i-1)}, x*) then
7:     x^{(i)} = x* (accept)
8:   else
9:     x^{(i)} = x^{(i-1)} (reject)
10:  end if
11: end for
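Algorithm 4 in Python for a target known only up to a normalising constant, with a Gaussian random-walk proposal; since this proposal is symmetric, the proposal ratio in step 4 cancels:

import numpy as np

rng = np.random.default_rng(9)

def log_target(x):
    # Unnormalised target: a standard Gaussian up to a constant
    return -0.5 * x ** 2

N, step = 10000, 1.0
chain = np.empty(N)
x = 0.0  # arbitrary starting point
for i in range(N):
    x_star = x + step * rng.normal()                # propose x* ~ q(x* <- x)
    log_alpha = log_target(x_star) - log_target(x)  # symmetric q cancels
    if np.log(rng.uniform()) <= log_alpha:
        x = x_star                                  # accept
    chain[i] = x                                    # on rejection, keep x
print(chain.mean(), chain.var())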

The Markov chain created by this algorithm is reversible and has the required target distribution, p(x). The choice of the proposal distribution q(·) is very general; however, an arbitrary selection can lead to slow mixing of the chain and long burn in periods. This will be reflected in the acceptance probability ratio. It is straightforward to show that the Metropolis updating rule satisfies detailed balance, and is therefore a valid MCMC algorithm. This means that long Markov chains simulated from the acceptance rule of the MH algorithm will explore the parameter space with the fraction of time spent in any volume being proportional to the total amount of probability contained in that volume. There have been many studies of optimal acceptance rates using different types of proposal distribution in different dimensions [45]. This will ensure that:

• the chain is not proposing steps which are too large, hence rejecting many of the moves,

• the chain is not proposing steps which are too small, hence accepting most moves, but exploring the state space very slowly.

Popular variants include symmetric proposal distributions [43], the Independence sampler [45], Random walk Metropolis [45], Configurational Bias Monte Carlo [46], Multiple-Try Metropolis [47] and the single component MH algorithm. A very interesting special case of the MH algorithm can be obtained when we adopt the full conditional distributions p(x_i | x_{j≠i}) = p(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_m) as proposal distributions. This algorithm, known as the Gibbs sampler, has been very popular since its development [48]. The following Section describes it in more detail.

2.12.3 Gibbs Sampler

The Gibbs sampler [48] is the most widely used form of the single component MH algorithm. The Gibbs sampler produces a Markov chain by updating one component of the state vector during each iteration. The value of each element at the i-th iteration is sampled from the distribution of that element conditional upon the values of all the other parameters at the (i−1)-th iteration and those parameters which have already been updated at the i-th iteration. Suppose that at the i-th iteration the current state is x^{(i)} = [x_1^{(i)}, x_2^{(i)}, ..., x_K^{(i)}]. If we also know the full conditionals p(x_k | x_1^{(i)}, ..., x_{k-1}^{(i)}, x_{k+1}^{(i)}, ..., x_K^{(i)}), k ∈ {1, ..., K}, the following proposal distribution for k ∈ {1, ..., K} can be used

\[
q\left( x^* \leftarrow x^{(i)} \right) = \begin{cases}
p\left( x_k^* \mid x_{-k}^{(i)} \right), & \text{if } x_{-k}^* = x_{-k}^{(i)} \\
0, & \text{otherwise},
\end{cases} \qquad (2.83)
\]

where x_{-k} denotes all the components of x except the k-th.

The corresponding acceptance probability is:

\[
\begin{aligned}
\alpha\left( x^*, x^{(i)} \right) &= \min\left\{ 1, \frac{p(x^*)\, p\left( x_k^{(i)} \mid x_{-k}^* \right)}{p\left( x^{(i)} \right)\, p\left( x_k^* \mid x_{-k}^{(i)} \right)} \right\} \\
&= \min\left\{ 1, \frac{p\left( x_{-k}^* \right)}{p\left( x_{-k}^{(i)} \right)} \right\} \qquad (2.84) \\
&= 1.
\end{aligned}
\]

As depicted in (2.84), the acceptance probability is equal to one. The Gibbs sampler algorithm is presented in Algorithm 5.

An example of a sample path of a bivariate distribution is given in Figure 2.3. It is easy to see that in every step only one component is updated while the other remains at the same location.

Algorithm 5 Gibbs sampling algorithm
Input: initial x and number of samples N
Output: {x^{(i)}}_{i=1}^{N}, samples from p(x)
1: Initialise x^{(1)} to an arbitrary starting point
2: for i = 1, ..., N do
3:   Sample x_1^{(i+1)} ∼ p(x_1 | x_2^{(i)}, x_3^{(i)}, ..., x_K^{(i)}).
4:   Sample x_2^{(i+1)} ∼ p(x_2 | x_1^{(i+1)}, x_3^{(i)}, ..., x_K^{(i)}).
     ...
5:   Sample x_k^{(i+1)} ∼ p(x_k | x_1^{(i+1)}, ..., x_{k-1}^{(i+1)}, x_{k+1}^{(i)}, ..., x_K^{(i)}).
     ...
6:   Sample x_K^{(i+1)} ∼ p(x_K | x_1^{(i+1)}, x_2^{(i+1)}, ..., x_{K-1}^{(i+1)}).
7: end for

Gibbs sampling is often viewed as an algorithm in its own right. It is, however, simply a special case of the MH algorithm in which a single component is updated during each step, and the proposal distribution which is used is the true conditional distribution of that parameter given the present values of the others. That is, a single iteration of the Gibbs sampler corresponds to the application of K successive MH steps, with the relevant conditional distribution used as the proposal kernel for each step. The consequence of this special choice of proposal distribution is that the MH acceptance probability is always one, and rejection never occurs. We shall use the Gibbs sampler throughout this Thesis, due to its simplicity and efficiency. In particular, it shall be used in Chapters 5, 7 and 8.

Figure 2.3: Sample path of a bivariate distribution using the Gibbs sampler
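A Python sketch of the Gibbs sampler for a bivariate zero-mean Gaussian with correlation ρ, whose full conditionals are the univariate Gaussians used below; it reproduces the kind of componentwise path shown in Figure 2.3:

import numpy as np

rng = np.random.default_rng(10)
rho, N = 0.8, 10000
sd = np.sqrt(1 - rho ** 2)   # conditional standard deviation

x1, x2 = 0.0, 0.0            # arbitrary starting point
chain = np.empty((N, 2))
for i in range(N):
    x1 = rng.normal(loc=rho * x2, scale=sd)  # sample x1 ~ p(x1 | x2)
    x2 = rng.normal(loc=rho * x1, scale=sd)  # sample x2 ~ p(x2 | x1)
    chain[i] = x1, x2
print(np.corrcoef(chain.T))  # empirical correlation should be close to rho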

2.12.4 Simulated Annealing

2.12.4.1 Introduction

In many cases we are interested in finding the maximum of a distribution and not in the actual distribution. For example, if one wishes to find the MAP estimate, one only needs to find the location at which the posterior distribution is maximised and not the actual value of the distribution. By contrast, if one wishes to find the MMSE estimate, one needs to obtain exact samples from the posterior. Simulated annealing is a Monte Carlo optimisation method that attempts to find the global optima, improving on local search, which can get trapped in local optima [49].

Simulated annealing has its roots in statistical physics. For example, when a metal is in a high temperature environment, its molecules can move freely like random walkers. When the temperature is decreased slowly, the moving range of these molecules becomes restricted. Finally, when the temperature is at its lowest, the molecules become localised and fluctuate only in a limited way. Our goal is to reach the ground state of the metal. If the temperature is decreased too quickly, the system becomes trapped in a glassy state, which is neither stable nor uniform. Annealing uses a temperature parameter so that at high temperatures, with some non-zero probability, the algorithm makes unfavorable moves that allow it to move out of local optima. The annealing starts at high temperature and gradually the temperature is lowered so that unfavorable moves become less and less likely.

2.12.4.2 Methodology

Let T^{(i)} denote the temperature at the i-th iteration. The sequence {T^{(i)}}_{i=1}^{∞} is a cooling schedule if lim_{i→∞} T^{(i)} = 0. Let β^{(i)} = 1/T^{(i)} be an inverse temperature parameter. The simulated annealing method involves simulating a non-homogeneous Markov chain whose invariant distribution at iteration i is no longer equal to p(x), but to:

\[
p^{(i)}(x) \propto \left( p(x) \right)^{\beta^{(i)}}. \qquad (2.85)
\]

The reason for doing this is that, under weak regularity assumptions on p(x), p^{(∞)}(x) is a probability density that concentrates itself on the set of global maxima of p(x). Similar to the

MH algorithm, the simulated annealing method with distribution p(x) and proposal distribution q involves sampling a candidate value x* given the current value x^{(i)} according to q(x* ← x^{(i)}). The simulated annealing algorithm is the same as Algorithm 4, with the only difference being the modified acceptance probability, which can be expressed as:

\[
\alpha_{SA}\left( x^{(i)}, x^* \right) = \min\left\{ 1, \frac{\left( p(x^*) \right)^{\beta^{(i)}} q\left( x^{(i)} \leftarrow x^* \right)}{\left( p(x^{(i)}) \right)^{\beta^{(i)}} q\left( x^* \leftarrow x^{(i)} \right)} \right\}. \qquad (2.86)
\]
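A sketch of simulated annealing on a one-dimensional bimodal target, using a symmetric random-walk proposal and a geometric cooling schedule (both hypothetical design choices):

import numpy as np

rng = np.random.default_rng(11)

def log_p(x):
    # Bimodal target, known up to a constant; the global maximum is near x = 2
    return np.log(0.3 * np.exp(-(x + 2) ** 2) + 0.7 * np.exp(-(x - 2) ** 2))

x, T = 0.0, 5.0
for i in range(5000):
    beta = 1.0 / T
    x_star = x + rng.normal(scale=0.5)             # symmetric proposal
    log_alpha = beta * (log_p(x_star) - log_p(x))  # log of eq. (2.86)
    if np.log(rng.uniform()) <= log_alpha:
        x = x_star
    T *= 0.999  # geometric cooling schedule
print(x)        # should settle near the global maximum at x = 2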

2.12.5 Convergence Diagnostics of MCMC

A critical issue in MCMC methods is how to determine when it is safe to stop sampling and use the samples to estimate characteristics of the distribution of interest. The theoretical convergence of MCMC algorithms is an area of active research. In this Section we shall briefly outline a few related aspects. For further reading, see [50], [51].

2.12.5.1 Burn in Period

Since the Markov chain created in MCMC methods is not initialised from the stationary distribution, a burn in period is used. This means that the first L values of the chain are discarded. It is assumed that after the chain has run for L steps, it will have converged to its equilibrium, so that the resulting samples will be from the target distribution. How to choose the burn in period is a hard question to answer. Some convergence diagnostics have been suggested, for example [45] and [52].

2.12.5.2 Autocorrelation Time Series

The autocorrelation time series is a diagnostic tool for MCMC algorithms. The level of correlation in the final samples will affect the accuracy of the Monte Carlo estimate. This is quantified using the autocorrelation time [53]. Assuming the chain is in equilibrium, let x^{(i)} be the value of the chain at the i-th iteration. The autocorrelation, ρ_Γ(k), at lag k for some function Γ(·) is defined as

\[
\rho_{\Gamma}(k) = \frac{\operatorname{cov}\left\{ \Gamma\left( x^{(i)} \right), \Gamma\left( x^{(i+k)} \right) \right\}}{\operatorname{var}\left\{ \Gamma\left( x^{(i)} \right) \right\}}. \qquad (2.87)
\]

Expectation is taken over the values of x^{(i)}, whose density is p(x^{(i)}). The autocorrelation time,

τΓ, for the function Γ is defined as

τ_Γ = Σ_{k=−∞}^{∞} ρ_Γ(k). (2.88)

If N ≫ τ_Γ, then the variance of the estimator of the expected value of Γ(x),

Γ̂ = (1/N) Σ_{i=1}^{N} Γ(x^{(i)}), (2.89)

is approximately var{Γ(x)} τ_Γ / N. This is a factor of τ_Γ larger than that of an estimator based on i.i.d. samples of size N. Therefore, the number of effectively independent samples in a run of length N is roughly N/τ_Γ.

The autocorrelation time measures the efficiency of the Markov chain once it is in equilibrium (i.e., once stationarity has been achieved); it is not an estimate of how long the chain must be run before it approaches stationarity.
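In practice τ_Γ must be estimated from a finite run. The Python sketch below truncates the sum in (2.88) at the first negative empirical autocorrelation, a common heuristic that is our assumption rather than a prescription of the text, and reports the effective sample size N/τ_Γ for a strongly correlated AR(1) chain.

import numpy as np

def autocorrelation_time(chain, max_lag=None):
    # Estimate tau from (2.88) using empirical autocorrelations rho(k),
    # truncating the (two-sided) sum at the first negative value.
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 2
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x.var() * n)
    tau = 1.0
    for k in range(1, max_lag):
        if acf[k] < 0:
            break
        tau += 2.0 * acf[k]        # rho(k) = rho(-k)
    return tau

rng = np.random.default_rng(1)
z = np.zeros(5000)
for i in range(1, len(z)):         # AR(1) chain with correlation 0.9
    z[i] = 0.9 * z[i - 1] + rng.standard_normal()
tau = autocorrelation_time(z)
print(tau, len(z) / tau)           # tau near (1+0.9)/(1-0.9) = 19; ESS roughly N/tau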

2.13 Trans Dimensional Markov Chain Monte Carlo

2.13.1 Introduction

In many cases, when trying to carry out Bayesian analysis in a situation where there is a range of models with parameter spaces of different dimensionality, a common approach is to assign a prior distribution over the collection of competing models. In such situations the posterior distribution over the unknown models and model parameters cannot be analysed using the standard MH framework. In the MH algorithm, any transition and its reverse transition are enabled by the same move type. With dimension changing moves this is not possible: whenever one move increases the number of parameters, the reverse move has to decrease it. One obvious solution would consist of upper bounding the number of possible models by, say, K and running K independent MCMC samplers, each associated with a fixed model order k = 1, ..., K. However, this approach suffers from severe drawbacks. Firstly, it is computationally very expensive, since K can be large. Secondly, the same computational effort is attributed to each value of k. In fact, some of these values are of no interest in practice because they have a very weak posterior model probability Pr(k | x, y). However, in practical settings we do not know a priori on which values of k we should focus our computational effort.

The Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm, proposed by Green [54], [55], solves this problem by extending the MH algorithm to problems where the dimension of the parameter space is variable. The RJMCMC methodology is designed to create a Markov chain whose invariant distribution takes its support on such general state spaces. Up to this Section, we have been comparing densities in the acceptance ratio. However, if we are carrying out model selection, then comparing the densities of objects in different dimensions has no meaning; it is like trying to compare spheres with circles. Instead, we have to be more formal and compare distributions P(dx) = Pr(x ∈ dx) under a common measure of volume. The distribution P(dx) will be assumed to admit a density p(x) with respect to a measure of interest, e.g. Lebesgue (see [56]) in the continuous case: P(dx) = p(x)dx. Hence, the acceptance ratio will now include the ratio of the densities and the ratio of the measures (a Radon-Nikodym derivative [56]). The latter gives rise to a Jacobian term. To compare densities pointwise, we therefore need to map the two models to a common dimension.

Green [54] realised that instead of doing the model search in the full product space, one could focus on disjoint union spaces of the form X = ∪_{k=1}^{K} ({k} × X_k), and the target distribution defined on such a space is then given by

p(k, dx) = Σ_{m=1}^{K} p(m, dx_m) I_{{m}×X_m}(m, x), (2.90)

where K is the number of models in the family and x_m ∈ X_m are the model dependent parameters. To summarise, under Green's formulation, the RJMCMC allows the Markov chain to explore within the sub-spaces and also jump between sub-spaces, say from X_m to X_n.

It is important to mention that to allow this behaviour one must extend each pair of communicating spaces, X_m and X_n, to X_{m,n} = X_m × U_{m,n} and X_{n,m} = X_n × U_{n,m}, and also define a deterministic diffeomorphism (dimension matching function) between these extended spaces, labelled h_{nm}. This means that the user must define the proposal distributions q_{mn}(· | m, x_m) and q_{nm}(· | n, x_n) which go from (n, x_n) to (m, x_m) and back again, the extended state spaces X_{m,n} and X_{n,m}, and the deterministic transform h_{nm} between the spaces. As shown in [33], in a move which goes from (n, x_n) to (m, x_m) one must first generate u_{n,m} ∼ q_{nm}(· | n, x_n) and then evaluate (x_m*, u_{m,n}) = h_{nm}(x_n, u_{n,m}), where the notation x_m* is used for the x_m component of the function h_{nm}. This move will then be accepted according to the acceptance probability of a dimension changing move, shown below:

min{ 1, [p(m, x_m*) / p(n, x_n)] × [q(n | m) / q(m | n)] × [q_{mn}(u_{m,n} | m, x_m*) / q_{nm}(u_{n,m} | n, x_n)] × | det ∂h_{nm}(x_n, u_{n,m}) / ∂(x_n, u_{n,m}) | }, (2.91)

where the determinant term is the Jacobian of the function h_{nm}. The generic TDMCMC algorithm is presented in Algorithm 6. The algorithm generates samples for the model indicators, and the model posterior Pr(k | y) can be estimated by

P̂r(k | y) = (1/N) Σ_{l=1}^{N} I_k(k_l). (2.92)

Some care has to be taken when switching between models with different model indicators. In accordance with the definition of the model indicator, such moves will be called dimension changing moves. Therefore, any TDMCMC algorithm is based on matched pairs of dimension changing moves. We now list a few options for the proposal kernels of the TDMCMC algorithm.

Algorithm 6 Generic TDMCMC algorithm
Input: Initial state of the Markov chain, (x^{(0)}, k^{(0)})
Output: N samples from the joint distribution p(x, k)
1: for i = 1 ... N do
2:   Propose a move from model n to model m with probability q(m | n).
3:   Sample u_{n,m} from a proposal density q_{nm}(u_{n,m} | n, x_n).
4:   Set (x_m*, u_{m,n}) = h_{nm}(x_n, u_{n,m}), where h_{nm}(·) is a bijection between (x_n, u_{n,m}) and (x_m*, u_{m,n}), and u_{n,m} and u_{m,n} play the role of matching the dimensions of both vectors.
5:   Calculate the acceptance probability of the new model (x_m*, m):
     α((x_m*, m), (x_n, n)) = min{ 1, [p(m, x_m*) / p(n, x_n)] × [q(n | m) / q(m | n)] × [q_{mn}(u_{m,n} | m, x_m*) / q_{nm}(u_{n,m} | n, x_n)] × | det ∂h_{nm}(x_n, u_{n,m}) / ∂(x_n, u_{n,m}) | }.
6:   Generate u ∼ U[0, 1]
7:   if u ≤ α then
8:     (x^{(i)}, k^{(i)}) = (x_m*, m) (accept)
9:   else
10:    (x^{(i)}, k^{(i)}) = (x_n, n) (reject)
11:  end if
12: end for

2.13.1.1 Posterior densities as proposal densities

If p_k(x | y) is available in closed form for each model k, then the acceptance probability (2.91) reduces to

α_{mn}(x, x*) = min{ 1, [p_n(x*) q_{nm}(x*, x)] / [p_m(x) q_{mn}(x, x*)] }. (2.93)

2.13.1.2 Independent sampler

If all parameters of the proposed model are generated from the proposal distribution, then (x_m*, u_{m,n}) = (x_n, u_{n,m}) and the Jacobian in (2.91) is one.

2.13.1.3 Standard Metropolis-Hastings

When the proposed model m equals the current model n, Algorithm 6 corresponds to the traditional MH algorithm. The TDMCMC methodology will be used in conjunction with other methodologies in Chapter 7 to solve the problem of channel estimation with an unknown number of channel taps.

2.14 Stochastic Approximation Markov Chain Monte Carlo

It is well known that the MH algorithm is prone to getting trapped in local energy minima when simulating from a system whose energy landscape is rugged. This means that the chain may explore the underlying space very slowly; hence, the mixing rate may be very slow. To overcome the local trap problem, advanced Monte Carlo algorithms have been proposed, such as parallel tempering [57], simulated tempering [58], evolutionary Monte Carlo [59] and dynamic weighting [60], among others. We concentrate on a class of adaptive MCMC processes which aim at behaving as an “optimal” target process via a learning procedure. The special case of adaptive MCMC algorithms governed by Stochastic Approximation (SA) is considered next.

2.14.1 Stochastic Approximation

Consider the problem of finding the unique root ζ of a function h(x). If h(x) can be evaluated exactly for each x and if h is sufficiently smooth, then various numerical methods can be employed to locate ζ. The majority of these numerical procedures, including the popular Newton-Raphson method, are iterative by nature, starting with an initial guess x^{(0)} of ζ and iteratively defining a sequence x^{(n)} that converges to ζ as n → ∞. The update step of Newton-Raphson for finding ζ has the form:

x^{(n)} = x^{(n−1)} − [∇²h(x^{(n−1)})]^{−1} ∇h(x^{(n−1)}). (2.94)

Here ∇h(·) is the gradient and ∇²h(·) is the Hessian matrix. Now consider the situation where only noisy observations of h(x) are available; that is, for any input x one observes y = h(x) + ǫ, where ǫ is a zero mean random error. Unfortunately, standard deterministic methods cannot be used in this problem. In their seminal paper, Robbins and Monro [61] proposed a stochastic approximation algorithm for defining a sequence of design points x^{(n)} targeting the root ζ of h in this noisy case. Start with an initial guess x^{(0)}. At stage n ≥ 1, use the state x^{(n−1)} as the input, observe y^{(n)} = h(x^{(n−1)}) + ǫ^{(n)}, and update the guess, x^{(n−1)}, y^{(n)} → x^{(n)}. More precisely, the Robbins-Monro algorithm defines the sequence {x^{(n)}} as follows: start with x^{(0)} and, for n ≥ 1, set

x^{(n)} = x^{(n−1)} + w^{(n)} y^{(n)} = x^{(n−1)} + w^{(n)} ( h(x^{(n−1)}) + ǫ^{(n)} ), (2.95)

where {ǫ^{(n)}} is a sequence of i.i.d. random variables with mean zero, and the weight sequence {w^{(n)}} satisfies

w^{(n)} > 0, Σ_n w^{(n)} = ∞, Σ_n (w^{(n)})² < ∞. (2.96)

While the stochastic approximation algorithm above works in more general situations, we can develop our intuition by looking at the special case considered in [61], namely when h is bounded, continuous and monotone decreasing. If x^{(n)} < ζ, then h(x^{(n)}) > 0 and we have

E{ x^{(n+1)} | x^{(n)} } = x^{(n)} + w^{(n+1)} ( h(x^{(n)}) + E{ǫ^{(n+1)}} ) = x^{(n)} + w^{(n+1)} h(x^{(n)}) > x^{(n)}. (2.97)

Likewise, if x^{(n)} > ζ, then E{x^{(n+1)} | x^{(n)}} < x^{(n)}. This shows that the move x^{(n)} → x^{(n+1)} will be in the correct direction on average. Some remarks on the conditions in (2.96) are in order. While Σ_n (w^{(n)})² < ∞ is necessary to prove convergence, an immediate consequence of this condition is that w^{(n)} → 0. Clearly w^{(n)} → 0 implies that the effect of the noise vanishes as n → ∞. This, in turn, has an averaging effect on the iterates y^{(n)}. On the other hand, the condition Σ_n w^{(n)} = ∞ washes out the effect of the initial guess x^{(0)}.
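A minimal Python sketch of the Robbins-Monro recursion (2.95), with weights w^{(n)} = 1/n satisfying (2.96); the function h, its root and the noise level are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def h(x):
    return 1.5 - x          # monotone decreasing, root at zeta = 1.5

x = 5.0                     # initial guess x^(0)
for n in range(1, 10001):
    w = 1.0 / n             # satisfies (2.96): sum of w diverges, sum of w^2 converges
    y = h(x) + rng.standard_normal()   # noisy observation y^(n) = h(x^(n-1)) + eps^(n)
    x = x + w * y           # update (2.95)
print(x)                    # converges towards the root zeta = 1.5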

2.14.2 Stochastic Approximation Markov Chain Monte Carlo

Let x be a random variable taking its support on some finite or compact space X with a dominating measure μ. Let p(x) = κ p_0(x) be a probability density on X with respect to μ, with possibly unknown normalising constant κ > 0. We wish to estimate ∫ f dμ, where f is some function depending on p or p_0. For example, suppose p(x) is a prior and g(y | x) is the conditional density of y given x. Then f(x) = g(y | x) p(x) is the unnormalised posterior density of x, and its integral, the marginal density of y, is needed to compute a Bayes factor.

The following Stochastic Approximation Monte Carlo (SAMC) method is introduced in [62]. Let A_1, ..., A_m be a partition of X and let η_i = ∫_{A_i} f dμ, i = 1, ..., m. Take η̂_i^{(0)} as an initial guess, and let η̂_i^{(n)} be the estimate of η_i at iteration n ≥ 1. For notational convenience, write

θ_i^{(n)} = log η̂_i^{(n)}, θ^{(n)} = ( θ_1^{(n)}, ..., θ_m^{(n)} )^T. (2.98)

The probability vector π = (π_1, ..., π_m)^T will denote the desired sampling frequency of the A_i's; that is, π_i is the proportion of time we would like the chain to spend in A_i. The choice of π is flexible and does not depend on the particular partition {A_1, ..., A_m}. The generic SA-MCMC algorithm of [62] is presented in the following:

Stochastic Approximation Markov Chain Monte Carlo Algorithm

1. Initialisation: start with an initial estimate θ^{(0)} and for n ≥ 0 do:

2. Sampling: draw a sample z^{(n+1)} from the working density using any MCMC method:
   p(z | θ^{(n)}) ∝ Σ_{i=1}^{m} f(z) exp{−θ_i^{(n)}} I_{A_i}(z), z ∈ X.

3. Weight update: update the working estimate θ^{(n)} recursively by setting

   θ^{(n+1)} = θ^{(n)} + w^{(n+1)} ( ζ^{(n+1)} − π ),

   where w^{(n)} is as in (2.96) and ζ^{(n+1)} = ( I_{A_1}(z^{(n+1)}), ..., I_{A_m}(z^{(n+1)}) )^T.

It turns out that, in the case where no A_i is empty, the observed sampling frequency π̂_i of A_i converges to π_i. This shows that π̂_i is independent of the probability ∫_{A_i} p dμ. Consequently, the resulting chain will not get stuck in regions of high probability, as a standard Metropolis chain might. The SA-MCMC method will be used in Chapter 7 to design a TDMCMC based algorithm.
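The following Python sketch runs the SA-MCMC recursion on a toy bimodal density whose support is partitioned into energy rings. The energy function, the partition edges and the gain sequence are illustrative assumptions; the point is that the weight update pushes the realised visit frequencies towards π, so the chain keeps crossing the low-probability barrier between the modes.

import numpy as np

rng = np.random.default_rng(4)

def U(x):                              # energy; target f(x) = exp(-U(x)) is bimodal
    return 0.25 * (x * x - 4.0) ** 2

edges = np.array([0.5, 1.5, 3.0, 6.0]) # energy levels defining regions A_1..A_5

def region(x):
    return int(np.searchsorted(edges, U(x)))

m = len(edges) + 1
theta = np.zeros(m)                    # working estimates theta^(n)
pi = np.full(m, 1.0 / m)               # desired sampling frequencies
x, visits = 2.0, np.zeros(m)
for n in range(1, 200001):
    # MH step targeting the working density f(x) exp(-theta_i) on A_i
    x_prop = x + rng.standard_normal()
    log_ratio = (-U(x_prop) - theta[region(x_prop)]) - (-U(x) - theta[region(x)])
    if np.log(rng.uniform()) < log_ratio:
        x = x_prop
    # weight update: theta <- theta + w^(n+1) (zeta^(n+1) - pi)
    zeta = np.zeros(m)
    zeta[region(x)] = 1.0
    theta += (10.0 / max(n, 10)) * (zeta - pi)
    visits[region(x)] += 1
print(visits / visits.sum())           # visit frequencies approach pi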

2.15 Approximate Bayesian Computation

This section deals with the class of Bayesian statistical models which involve intractability in the likelihood model. These classes of models are typically referred to as either likelihood-free or Approximate Bayesian Computation (ABC) models, and these terms will be used interchangeably throughout. The term “intractability” will be used with a slight abuse of terminology; in particular, it can refer to settings in which the likelihood: cannot be expressed in a closed analytic form; can only be written down analytically as a function, operation or integral expression which cannot be solved analytically; cannot be directly evaluated point-wise; or whose point-wise evaluation involves a prohibitive computational cost. We shall present the ABC model approximation to the true posterior in its most general form, and describe how evaluation of the intractable likelihood is circumvented.

2.15.1 Introduction

There are a large number of models for which computation of the likelihood is either impossible or very time consuming. The likelihood function, p(y | x), is of fundamental importance to both the frequentist and the Bayesian schools of statistical inference. The frequentist approach is based on ML estimation, in which we aim to find x̂ = arg max_x p(y | x), whereas the Bayesian approach is centred around finding the posterior distribution p(x | y) ∝ p(y | x) p(x). Clearly, if the likelihood is unknown, both of these approaches may be impossible, and we will need to perform likelihood-free inference.

Likelihood-free inference techniques that can be applied in this situation have been developed over the previous decade and are often known as ABC methods. The first use of ABC ideas was in the field of genetics, as developed by Tavare et al. [63] and Pritchard et al. [64].

The development of the ABC methodology has led to intractable likelihood models being considered in several different research disciplines: finance [65], [66]; statistics [67]; ecology [68]; extreme value theory [69]; protein networks [70]; and operational risk [65]. A detailed overview of likelihood-free techniques is found in [71], and the theoretical properties of sampling algorithms working with such ABC posterior models are studied in [72].

The basic algorithm is based upon the rejection method, but has since been extended in various ways. Marjoram et al. [73] suggested an approximate MCMC algorithm, Sisson et al. [74] a sequential importance sampling approach, and Beaumont, Zhang and Balding [75] improved the accuracy by performing local-linear regression on the output.

2.15.2 Basic ABC Algorithm

The simplest ABC algorithm is based upon the rejection algorithm, first given by von Neumann in 1951 [76] and described in Algorithm 7.

Algorithm 7 Rejection algorithm 1
Input: p(x)
Output: A sample from p(x | y)
1: Draw a sample x* from p(x)
2: Accept x* with probability p(y | x*)

It is clear that this algorithm cannot be implemented as p(y | x) is unknown. However, there is an alternative version of this algorithm which does not depend on explicit knowledge of the likelihood, but only requires that we can simulate from the model, as depicted in Algorithm 8.

However, difficulties arise in the case where the observations stem from a continuous space, since it would then be impossible to obtain any samples, as the probability of the event y* = y is virtually 0.

Algorithm 8 Rejection algorithm 2
Input: p(x)
Output: A sample from p(x | y)
1: Draw a sample x* from p(x)
2: Simulate data y* from p(y | x*)
3: Accept x* if y* = y

A solution to this problem is to relax the requirement of equality, and to accept x* values when the simulated data is close to the real data. To define closeness we require a metric ρ on the state space X, with ρ(·, ·): X × X → R+, and a tolerance ǫ. The resulting algorithm, our first ABC algorithm, is depicted in Algorithm 9.

Algorithm 9 ABC algorithm 1: ǫ-tolerance rejection
Input: p(x)
Output: A sample from p(x | ρ(y, y*) < ǫ)
1: Draw a sample x* from p(x)
2: Simulate data y* from p(y | x*)
3: Accept x* if ρ(y, y*) ≤ ǫ

Unlike Algorithms 7 and 8, Algorithm 9 only approximates the posterior density. Accepted x* values do not form a sample from the posterior distribution, but from some distribution that is an approximation to it. The accuracy of the algorithm (measured by some suitable distance measure) depends on ǫ in a non-trivial manner. Algorithm 9 is obviously approximate when ǫ ≠ 0. The output from the ǫ-tolerance rejection algorithm is thus associated with the distribution

p(x | ρ(y, y*) ≤ ǫ) ∝ p(x) p_x(ρ(y, y*) ≤ ǫ), (2.99)

with

p_x(ρ(y, y*) ≤ ǫ) = ∫ I[ρ(y, y*) ≤ ǫ] p(y* | x) dy*. (2.100)

The choice of ǫ is therefore paramount for good performance of the method. If ǫ is too large, the approximation is poor; as ǫ → ∞ the method amounts to simulating from the prior, since all simulations are accepted (as p_x(ρ(y, y*) ≤ ǫ) → 1 when ǫ → ∞). If ǫ is sufficiently small, p(x | ρ(y, y*) ≤ ǫ) is a good approximation of p(x | y). There is no approximation when ǫ = 0, since the ǫ-tolerance rejection algorithm then corresponds to the exact rejection algorithm, but the acceptance probability may be too low to be practical. Selecting the “right” ǫ is thus crucial. It is customary to pick ǫ such that the acceptance probability is around 20%.
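A minimal Python sketch of the ǫ-tolerance rejection scheme of Algorithm 9 for a toy Gaussian model, comparing data sets through their sample means (anticipating the data summaries of the next subsection); the prior, the model and the value of ǫ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)
y_obs = rng.normal(1.0, 1.0, size=50)      # observed data; true mean 1.0

def abc_rejection(n_draws=100000, eps=0.05):
    accepted = []
    for _ in range(n_draws):
        x_star = rng.normal(0.0, 2.0)                      # draw from the prior p(x)
        y_star = rng.normal(x_star, 1.0, size=len(y_obs))  # simulate from the model
        if abs(y_obs.mean() - y_star.mean()) <= eps:       # rho(y, y*) <= eps
            accepted.append(x_star)
    return np.array(accepted)

post = abc_rejection()
# Approximate posterior mean/std; note the low acceptance rate typical of ABC.
print(post.mean(), post.std(), len(post))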

2.15.3 Data Summaries

For problems with large amounts of high-dimensional data, Algorithm 9 will be impractical, as the simulated data will never closely match the observed data. The standard approach to reducing the number of dimensions is to use a summary statistic, T(y), which should summarise the important parts of the data. We then adapt Algorithm 9 so that parameter values are accepted if the summary of the simulated data is close to the summary of the real data. This approach also lends itself naturally to cases where y has continuous components. The resulting procedure is presented in Algorithm 10.

Algorithm 10 ABC algorithm 2
Input: p(x)
Output: A sample from p(x | ρ(T(y), T(y*)) < ǫ)
1: Draw a sample x* from p(x)
2: Simulate data y* from p(y | x*)
3: Accept x* if ρ(T(y), T(y*)) ≤ ǫ

The summary statistic T (y) is called a sufficient statistic for x if and only if

p(x | y) = p(x | T(y)), (2.101)

i.e., if the conditional distribution of x given the summary equals the posterior distribution. The idea is that if T(y) is known, then the full data set, y, cannot provide any extra information about x. If a sufficient statistic is available, then Algorithm 10 is essentially the same as Algorithm 9. However, for problems where both the posterior distribution and the likelihood function are unknown, it will not in general be possible to determine whether a statistic is sufficient.

The ABC algorithms presented so far still suffer from two serious problems:

1. The rejection based algorithms are inefficient. Only a fraction of the proposed samples are actually accepted, while most of the samples are discarded.

2. In the majority of cases, directly sampling from the target ABC posterior via inversion of the cdf is not achievable.

For these two reasons, a class of sampling algorithms known as MCMC-ABC has been developed. We present it in the following Section.

2.15.4 MCMC-ABC Samplers

As mentioned before, the ABC rejection algorithm is inefficient because it continues to draw samples in regions of the parameter space that are clearly not useful. One approach to this problem is to derive an MCMC-ABC algorithm. The hope is that the Markov chain will spend more time in interesting regions of high probability compared to ABC rejection.

It is shown in [71] that the ABC method embeds an “intractable” target posterior distribution, denoted by p(x | y), into an augmented model,

p(x, z, y) = p(y | z, x) p(z | x) p(x), (2.102)

where z ∈ X is an auxiliary vector on the same space as y. In this augmented Bayesian model, the density p(y | z, x) weights the intractable posterior. A popular example that we shall utilise is given by

p(y | z, x) ∝ 1 if ρ(T(y), T(z)) ≤ ǫ, and 0 otherwise. (2.103)

This makes a Hard Decision (HD) to reward a summary statistic of the augmented auxiliary variables, T(z), lying within an ǫ-tolerance of the summary statistic of the actual observed data, T(y), as measured by the distance metric ρ. In cases where the data set is small, a direct comparison of the actual data with the auxiliary variables is feasible; summary statistics are then not required, and the Euclidean distance can be used as the distance metric. Other, more sophisticated choices of weighting function and distance metric are considered in [72], [71] and [66]. Hence, in the ABC context the intractable target posterior marginal distribution, p(x | y), that we are interested in is given by:

p_ABC(x | y) ∝ ∫_X p(y | z, x) p(z | x) p(x) dz
            = p(x) E_{p(z|x)}{ p(y | z, x) }
            ≈ (p(x)/N) Σ_{n=1}^{N} p(y | z^{(n)}, x), (2.104)

where z^{(1)}, ..., z^{(N)} are sampled realisations of data from the (intractable) likelihood. The ABC-MCMC sampler is presented in Algorithm 11.

Additionally, we note that the tolerance ǫ should typically be set as low as possible for a given computational budget. In the ABC context, the achievable tolerance depends on the choice of algorithm used to sample from the ABC posterior distribution.

In Chapter 8 we shall introduce a novel weighting function, which will serve as an alternative to the HD rule. Instead of using the HD weighting function, which rewards summary statistics of the augmented auxiliary variables, we shall introduce a Soft Decision (SD) weighting function that penalises summary statistics as a non-linear function of the distance between summary statistics. That is, even though the weighting may be small, it remains non-zero, unlike under the HD rule. The intention of the new rule is to improve the mixing rate of the Markov chain.

Algorithm 11 ABC-MCMC sampler
Input: p(x)
Output: {x^{(i)}}_{i=1}^{N}, samples from p(x | ρ(T(y), T(y*)) < ǫ)
1: Initialise x^{(0)} to an arbitrary starting point
2: for i = 1, ..., N do
3:   Propose x* ∼ q(x* ← x^{(i−1)})
4:   Simulate data y* from p(y | x*)
5:   if ρ(T(y), T(y*)) ≤ ǫ then
6:     Compute α(x*, x^{(i−1)}) = min{ 1, [p(x*) q(x^{(i−1)} ← x*)] / [p(x^{(i−1)}) q(x* ← x^{(i−1)})] }
7:     Generate u ∼ U[0, 1]
8:     if u ≤ α(x*, x^{(i−1)}) then
9:       x^{(i)} = x* (accept)
10:    else
11:      x^{(i)} = x^{(i−1)} (reject)
12:    end if
13:  else
14:    x^{(i)} = x^{(i−1)} (reject)
15:  end if
16: end for
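For concreteness, a minimal Python sketch of Algorithm 11 with the HD weighting (2.103), a symmetric random-walk proposal (so the q-ratio in step 6 cancels), and the same toy Gaussian model as above; all constants are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
y_obs = rng.normal(1.0, 1.0, size=50)
T_obs = y_obs.mean()                       # summary statistic T(y)

def log_prior(x):                          # assumed N(0, 2^2) prior on the mean
    return -0.5 * (x / 2.0) ** 2

def abc_mcmc(n_iter=50000, eps=0.05, step=0.5):
    x = 0.0
    chain = np.empty(n_iter)
    for i in range(n_iter):
        x_star = x + step * rng.standard_normal()           # propose x*
        y_star = rng.normal(x_star, 1.0, size=len(y_obs))   # simulate pseudo-data
        if abs(T_obs - y_star.mean()) <= eps:               # HD rule (2.103)
            # prior ratio only: the proposal ratio cancels for symmetric q
            if np.log(rng.uniform()) < log_prior(x_star) - log_prior(x):
                x = x_star
        chain[i] = x                                        # otherwise reject
    return chain

chain = abc_mcmc()
print(chain[5000:].mean(), chain[5000:].std())   # discard burn-in, then summarise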

2.15.5 Distance Metrics

Having obtained summary statistic vectors T(y) and T(z), likelihood-free methodology then measures the distance between these vectors using a distance metric, denoted generically by ρ(T(y), T(z)). The most popular example is the basic (squared) Euclidean distance metric, which sums the squared error between each pair of summary statistics as follows:

ρ(T(y), T(z)) = Σ_{i=1}^{dim(T)} ( T_i(y) − T_i(z) )². (2.105)

Recently, more advanced choices have been proposed and their impact on the methodology has been assessed. These include: the scaled Euclidean distance, given by

ρ(T(y), T(z)) = Σ_{i=1}^{dim(T)} W_i ( T_i(y) − T_i(z) )²; (2.106)

the Mahalanobis distance:

ρ(T(y), T(z)) = ( T(y) − T(z) )^T Σ^{−1} ( T(y) − T(z) ); (2.107)

Lp norm:

ρ(T(y), T(z)) = [ Σ_{i=1}^{dim(T)} | T_i(y) − T_i(z) |^p ]^{1/p}; (2.108)

and the city block distance:

ρ(T(y), T(z)) = Σ_{i=1}^{dim(T)} | T_i(y) − T_i(z) |. (2.109)

In particular, we note that distance metrics which include information regarding the correlation between the summary statistics produce estimates of the marginal posterior which are, for a finite computational budget, typically more accurate and lead to greater efficiency in the simulation algorithms utilised. The ABC theory will be useful in Chapter 8, where we develop algorithms for systems with intractable likelihood.
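The metrics (2.105)-(2.109) translate directly into code. In the Python sketch below, t_y and t_z are summary statistic vectors, and the weights w, covariance matrix cov and order p are user-supplied inputs (assumed, for illustration only):

import numpy as np

def euclidean(t_y, t_z):
    return np.sum((t_y - t_z) ** 2)                      # (2.105)

def scaled_euclidean(t_y, t_z, w):
    return np.sum(w * (t_y - t_z) ** 2)                  # (2.106)

def mahalanobis(t_y, t_z, cov):
    d = t_y - t_z
    return d @ np.linalg.solve(cov, d)                   # (2.107)

def lp_norm(t_y, t_z, p):
    return np.sum(np.abs(t_y - t_z) ** p) ** (1.0 / p)   # (2.108)

def city_block(t_y, t_z):
    return np.sum(np.abs(t_y - t_z))                     # (2.109)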

2.15.6 ABC methodology summary

The ABC methodology is still relatively new and in its infancy, and there are many open questions to be answered. While ABC methods are only approximate, as their name suggests, they have many advantages. Firstly, they are almost trivial to use and code once one is able to simulate from the model. Secondly, changes to the model or data structure can easily be incorporated without any changes to the inference mechanism. Finally, if the likelihood function is unavailable or prohibitively expensive to compute, then standard MCMC methods cannot be used, which is the main reason to use and develop ABC methods.

There are many technical questions which remain to be answered. For example, it is currently not known how accurate the approximation p_ABC(x | y) is, or how the accuracy depends on the choice of metric and ǫ. Another problem is that if data summaries need to be used, the intuition of the practitioner is relied upon when choosing a good summary statistic. A notion of approximate sufficiency is still needed, along with a methodical way of finding summaries which are nearly sufficient and which therefore capture the pertinent parts of the data.

2.16 Concluding Remarks

In this chapter we have presented background material on Bayesian inference. The main points presented in this chapter are:

• We formulated estimation objectives under the Bayesian framework and derived a few common estimators.

• We discussed the model selection problem and parameter estimation under model uncertainty.

• We provided an overview of Bayesian sequential filtering and the Kalman filter.

• We presented the EM and BEM methods to obtain the ML and MAP estimates, respectively.

• We presented an overview of Monte Carlo methods, in particular the MCMC methodology for Bayesian inference and the TDMCMC methodology for Bayesian inference under model uncertainty.

• We formulated the SA and SA-MCMC methodologies to overcome the local trap problem.

• We presented an overview of the ABC methodology that enables inference in models for which the likelihood does not exist or is intractable.

Chapter 3

Introduction to Wireless Communication

“Essentially, all models are wrong, but some are useful.”

George Edward Pelham Box

3.1 Introduction

This chapter provides a brief overview of wireless communications. The presentation is not intended to be exhaustive and does not provide new results; rather, it is intended to provide the necessary background for understanding the following chapters.

3.2 Modeling of Fading Channels

The importance of understanding radio propagation channels for the successful design of communication systems cannot be overstated. In earlier times, the wireless medium was viewed as an obstacle or a limiting factor in designing reliable communication links. However, decades of research and the subsequent insights have changed this paradigm. Modern day communication systems tend instead to exploit the channel behaviour for increased performance.

Wireless channels operate through electromagnetic radiation from the transmitter to the receiver. The transmitted signal propagates through the physical medium, which contains obstacles and reflecting surfaces, causing multiple reflected copies of the same source signal to arrive at the receiver at different times. The reflected signals are directly influenced by the material properties of the surfaces they reflect from or permeate, such as dielectric constant, permeability, conductivity and thickness. The aforementioned effects due to the propagation medium are represented as an abstract entity called the channel.

Figure 3.1: Types of fading channels. Fading divides into large scale fading and small scale fading; small scale fading is further classified by signal dispersion (flat fading, frequency selective fading) and by time variance of the channel (fast fading, slow fading).

The effect of multiple wavefronts is represented as multiple paths in a channel. If the transmitter or the receiver is mobile, the channel is said to be time-varying. In order to recover the transmitted signal at the receiver, it is essential to know some information about the channel; obtaining this information is referred to as channel estimation. The cancellation of channel effects is referred to as equalization.

In principle, one could solve the electromagnetic field equations, in conjunction with the transmitted signal, to find the electromagnetic field impinging on the receiver antenna. However, this task is not practical, since in order to do so one must take into account the physical properties (e.g. location, material type, etc.) of the obstructions, such as the ground, buildings and vehicles. Since solving the field equations is too complex a task, a simpler model which is mathematically tractable is used. Although only an approximation of the real physical environment, it gives good performance. The general term fading is used to describe fluctuations in the envelope of a transmitted radio signal. However, when speaking of such fluctuations, one must consider whether the observation has been made over short distances or long distances. For a wireless channel, the former case will show rapid fluctuations in the signal's envelope, while the latter will give a more slowly varying, averaged view. For this reason, the first scenario is formally called small-scale fading or multi-path, while the second scenario is referred to as large-scale fading or path loss.

Large-scale fading affects only the strength of the received signal, and will not be considered in this thesis. We will look, however, at different types of small-scale fading, both in terms of the signal dispersion and the time variance of the channel (see Figure 3.1). We shall now briefly outline a few important properties of wireless channels.

3.2.1 Tapped Delay-line Channel Model

We represent the Channel Impulse Response (CIR) of a time-varying channel as an input-output relation between the transmit and receive antennas whose impulse response is modeled by L propagation paths [77],

h(τ, t) = Σ_{l=0}^{L−1} a_l(t) δ(τ − τ_l(t)), (3.1)

where h(τ, t) is the response at time t to an impulse transmitted at time t − τ, and {a_l(t)}_{l=0}^{L−1} are the channel coefficients at time t. The corresponding Channel Frequency Response (CFR) can be expressed as

H(f, t) = ∫_{−∞}^{∞} h(τ, t) exp{−j2πfτ} dτ = Σ_{l=0}^{L−1} a_l(t) exp{−j2πfτ_l(t)}. (3.2)

The expressions in (3.1)-(3.2) are elegant and easy to work with. Next we discuss different aspects and properties of wireless channels.

3.2.2 Doppler Offset

Due to the relative motion between the transmitter and the receiver, each multipath wave experiences a frequency shift. The phenomenon is termed the Doppler shift, and is directly proportional to the velocity and direction of motion of the mobile with respect to the direction of arrival of the received multipath wave. Thus, the Doppler shift can be written as,

f_d = (v/λ) cos(α) = (v f_c / c) cos(α), (3.3)

where v is the speed of the mobile, α is the direction of motion of the mobile with respect to the direction of arrival of the multipath, c is the speed of light, and λ and f_c are the wavelength and carrier frequency of the radio signal, respectively. In general, if we introduce any form of acceleration or change of direction between the transmitter and receiver (e.g., driving in a curve), the Doppler shift will become time dependent.

3.2.3 Power Delay Profile

The Power Delay Profile (PDP) is defined as the squared magnitude of the CIR,

P(τ, t) = Σ_{l=0}^{L−1} a_l²(t) δ(τ − τ_l(t)). (3.4)

The PDP represents the relative received power as a function of excess delay with respect to the first path. PDPs are usually found by averaging instantaneous PDP measurements. For indoor Radio Frequency (RF) channels, the PDP of the channel usually has an exponentially decaying form.

3.2.4 Coherence Bandwidth

The coherence bandwidth is a statistical measure of the range of frequencies over which the channel can be considered flat (i.e., a channel which passes all spectral components with approximately equal gain and linear phase) [78]. The coherence bandwidth is an important parameter in the design of many wireless systems. In particular, in OFDM systems, if the subcarrier spacing is set to be less than the coherence bandwidth of the channel, each subcarrier is affected by a flat channel and thus simple one-tap equalization can be used.

3.2.5 Coherence Time

The coherence time, T_c, is the time domain dual of the Doppler spread and is used to characterise the time-varying nature of the frequency dispersiveness of the channel in the time domain. T_c is the time over which the correlation of the channel responses decreases by 3 dB. For example, in OFDM systems, to avoid fast fading effects, the OFDM symbol length needs to be shorter than the coherence time of the channel.

3.2.6 Time Selective and Fast Fading Channels

If the CIR changes rapidly within the symbol duration, then the channel is said to be a fast fading channel. Under such conditions, the coherence time of the channel, T_c, is smaller than the symbol period of the transmitted signal, T_s. This causes frequency dispersion (also called time-selective fading) due to Doppler spreading, which leads to signal distortion. In the frequency domain, mobility results in a frequency spread of the signal which depends on the operating frequency and the relative speed between the transmitter and receiver, also known as the Doppler spread [79]. Therefore, a signal undergoes fast fading or time selective fading if [78]

T_s > T_c, (3.5)

and

B_s < B_d, (3.6)

where B_s is the bandwidth of the transmitted signal and B_d is the Doppler spread of the channel.

3.2.7 Slow Fading Channels

In a slow fading channel, the CIR changes at a rate much slower than the transmitted baseband signal. In this case, the channel may be assumed to be static over one or several reciprocal bandwidth intervals. In the frequency domain, this implies that the Doppler spread of the channel is much less than the bandwidth of the baseband signal. Therefore, a signal undergoes slow fading if [78]

Ts << Tc, (3.7) and

Bs >> Bd. (3.8)

3.2.8 Frequency Selective Channels

If the channel possesses a constant gain and linear phase response only over a bandwidth that is smaller than the bandwidth of the transmitted signal, then the channel creates frequency selective fading on the received signal. Under such conditions, the CIR has a multipath delay spread which is greater than the reciprocal bandwidth of the transmitted message waveform. When this occurs, the received signal includes multiple versions of the transmitted waveform which are attenuated (faded) and delayed in time, and hence the received signal is distorted. As a result, the channel induces Inter Symbol Interference (ISI) [78]. For frequency selective fading, the spectrum of the transmitted signal has a bandwidth which is greater than the coherence bandwidth B_c of the channel. Thus, a signal undergoes frequency selective fading if

B_s >> B_c, (3.9)

and

T_s < σ_τ, (3.10)

where σ_τ is the Root Mean Square (RMS) value of the delay spread.

3.2.9 Flat Fading Channels

If the mobile radio channel has a constant gain and linear phase response over a bandwidth which is greater than the bandwidth of the transmitted signal, then the received signal will undergo frequency flat fading or, simply, flat fading. Flat fading channels are sometimes referred to as narrowband channels, since the bandwidth of the applied signal, B_s, is narrow compared to the channel's coherence bandwidth, B_c. To summarise, a signal undergoes flat fading if [78]

Bs << Bc, (3.11) and

Ts >> στ . (3.12)

3.3 Channel Models

Many channel models have been suggested for wireless communication, [80], [81], [82] to cite just a few. We now review two probabilistic channel models which are widely used, mainly due to their simplicity and their ability to form a good approximation of real physical channels.

3.3.1 Rayleigh Fading Channels

The Rayleigh fading channel model is the simplest model for the channel filter taps. It is based on the assumption that there are a large number of statistically independent reflected and scattered paths with random amplitudes in the delay window corresponding to a single tap. Each tap h_l is composed of the sum of many independent random variables, so that by invoking the Central Limit Theorem it can be modelled as a zero-mean complex Gaussian random variable:

h_l ∼ CN(0, σ_l²). (3.13)

The magnitude |h_l| of the l-th tap is a Rayleigh random variable with density

p(x) = (x/σ_l²) exp{−x²/(2σ_l²)}, x ≥ 0, (3.14)

and the squared magnitude |h_l|² is exponentially distributed with density

p(x) = (1/σ_l²) exp{−x/σ_l²}, x ≥ 0. (3.15)

Rayleigh fading corresponds to a Non Line Of Sight (NLOS) situation. The other common fading model is the Rician fading model which corresponds to a Line Of Sight (LOS) situation. Further information about fading models can be found in [83].

3.3.2 Clarke’s / Jake’s Model

The Rayleigh fading channel model provides a statistical view of each channel component h_l. This, however, is only part of the full behaviour of the channel, as it does not provide information about the statistical behaviour of the channel over time. A statistical quantity that models the relationship of the channel taps over time is the tap gain autocorrelation function R_l[m], which is defined as

R_l[m] ≜ E{ h_l[n] h_l*[n + m] }. (3.16)

In Jake's model, the transmitter is fixed and the mobile receiver is moving at speed v. The transmitted signal is assumed to be scattered by stationary objects around the mobile. There are K paths, the k-th path arriving at an angle θ_k ≜ 2πk/K, k ∈ {0, ..., K−1}, with respect to the direction of motion. The scattered path arriving at the mobile at angle θ has a delay τ_θ(t) and a time invariant gain a_θ. The input/output relationship is given by

y(t) = Σ_{k=0}^{K−1} a_{θ_k} x(t − τ_{θ_k}(t)). (3.17)

The most common scenario assumes a uniform power distribution and an isotropic antenna gain pattern. This models the situation where the scatterers are located in a ring around the mobile. Making the assumption that the phase is uniformly distributed in [0, 2π] and i.i.d. across all angles θ, the tap gain is a sum of many small independent contributions, one from each angle. By the Central Limit Theorem, the process can be approximated as Gaussian. It can be shown that the process is stationary with an autocorrelation function R_l[m] given by

R_l[m] = 2a²π J_0(2πm f_d / W), (3.18)

where W is the bandwidth and J_0(·) is the zeroth-order Bessel function of the first kind:

J_0(x) ≜ (1/π) ∫_0^π exp{jx cos θ} dθ. (3.19)

The power spectral density S_l(f), defined on [−1/2, +1/2], is given by [84]

S_l(f) = 1 / ( π f_d T_s √( 1 − (f/(f_d T_s))² ) ) for |f| ≤ f_d T_s, and S_l(f) = 0 otherwise, (3.20)

where f_d is the Doppler offset and T_s = 1/W is the symbol duration. It can be verified that taking the inverse Fourier transform of (3.20) yields (3.18).

3.3.3 Approximation of Jake’s Model

Since (3.20) is a nonrational function, precise fitting of the theoretical statistics is impossible with an Auto Regressive Moving Average (ARMA) model of any order. However, our goal is to capture the dynamics of the wireless channel accurately while remaining mathematically tractable for implementation.

To simulate the structured variations of a time-selective wireless fading channel, three types of linear models are usually considered [85], [77]:

1. Auto Regressive (AR) or ‘All-pole’ model.

2. Moving Average (MA) or ‘All-zero’ model.

3. ARMA model.

Out of these three, the AR model is the most frequently used because of its simplicity and ease of design (i.e., the equations that determine its parameters are linear). In [85], it was demonstrated that it is possible to capture most of the channel tap dynamics using a low-order AR model.

The Yule-Walker equations [77] can be used to determine the parameters of the model. The Levinson-Durbin recursive algorithm, proposed by Norman Levinson in 1947 and improved by James Durbin in 1960, can be used to solve these equations. The key to the algorithm is the recursive computation of the filter coefficients, beginning with the first order and increasing the order recursively, using the lower order solutions to obtain the solution to the next higher order.

Selecting the model order is another difficult problem in developing a linear model. The information theoretic results in [86] show that a first order AR model provides a sufficiently accurate model for time-selective fading channels. As shown in [87], a simple Gauss-Markov model can capture most of the channel tap dynamics and is suitable for channel tracking; therefore, we will adopt it henceforth. Thus, using discrete-time notation, h_l varies according to

h_l[n] = α h_l[n−1] + v[n], (3.21)

where α is the AR coefficient which accounts for the variations in the channel due to the Doppler shift, and v[n] is zero-mean complex Gaussian noise with variance σ_v², statistically independent of h_l[n−1]. Using the Yule-Walker equations [77], α and σ_v² can be calculated as

α = R[1] = 2a_l²π J_0( 2π f_c v / (W c) ), (3.22a)
σ_v² = 1 − α², (3.22b)

where a_l is the variance of h_l.
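A short Python sketch of the recursion (3.21), with α obtained from (3.22a) under an assumed unit-variance normalisation of the taps (so that α = R[1]/R[0] = J_0(2πf_c v/(W c))); the carrier frequency, mobile speed and bandwidth are illustrative assumptions.

import numpy as np
from scipy.special import j0

rng = np.random.default_rng(7)

fc, v, W, c = 2e9, 30.0, 1e6, 3e8        # assumed: 2 GHz carrier, 30 m/s, 1 MHz
alpha = j0(2 * np.pi * fc * v / (W * c)) # (3.22a), unit-variance normalisation
sigma_v2 = 1.0 - alpha ** 2              # (3.22b)

n = 10000
h = np.empty(n, dtype=complex)
h[0] = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
for k in range(1, n):
    v_k = np.sqrt(sigma_v2 / 2) * (rng.standard_normal() + 1j * rng.standard_normal())
    h[k] = alpha * h[k - 1] + v_k        # Gauss-Markov recursion (3.21)
print(alpha, np.var(h))                  # stationary tap variance stays near 1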

Now that we have introduced the basic elements of the wireless channel, we shall discuss a few transmission techniques that take advantage of the properties of the channel.

3.4 Overview of Multi Antenna Communication Systems

A large suite of techniques, known collectively as Multiple Input Multiple Output (MIMO) communications, has been developed in the past several years to effectively exploit the spatial domain when multiple antennas are used at both ends of the wireless communication link. Following Telatar's paper [3], extraordinary capacity gains have been demonstrated for MIMO systems over conventional Single Input Single Output (SISO) systems. In addition to diversity gain and array gain, MIMO systems can offer a so-called multiplexing gain by using parallel data streams (usually called spatial modes or eigen subchannels) within the same frequency band at no additional power expense. In the presence of rich scattering, MIMO links offer capacity gains that are proportional to the minimum of the number of channel inputs and outputs.

3.4.1 The Linear MIMO Channel

Although MIMO channels arise in many different communication scenarios, such as wireline systems or frequency selective systems [88], [89], [90], in this thesis we concentrate on flat-fading uncorrelated MIMO channels. The MIMO system model with n_T transmit antennas and n_R receive antennas is depicted in Figure 3.2.

3.4.2 Channel Model

Figure 3.2: MIMO system model (n_T transmit antennas, channel matrix H, n_R receive antennas and receiver).

The MIMO channel is characterised by its transition probability density function, p(y | x), which describes the probability of receiving the vector y conditioned on the fact that the vector x was actually transmitted. Common to MIMO communication systems is that, under some assumptions, the input-output relation can be described rather accurately by the following linear model:

y = Hx + w, (3.23)

where x ∈ A^{n_T} is the transmitted symbol vector, y is the received vector, and H ∈ C^{n_R×n_T} represents the linear response of the channel, such that its element [H]_{ij} denotes the channel path gain between the j-th transmit and i-th receive antenna. In the vast majority of cases (and in this dissertation in particular), the noise terms are modelled as i.i.d. circularly symmetric complex Gaussian random variables with zero mean and covariance matrix R_w = I σ_w², or, formally:

E{w} = 0, (3.24)
E{ww^H} = R_w. (3.25)

3.4.3 Uncertainty Models for the Channel State Information

In many practical cases we do not have full knowledge of the CSI. This may be attributed to a noisy channel estimate, quantised values, etc. In the most typical communication setup, channel estimation is performed at the receiver during a training period in which the transmitter sends a pilot or training sequence. Consequently, the quality of the CSI can be categorised into three different situations:

• No CSI: the receiver does not have any knowledge about the values of the CSI or its statistics. Under these conditions, the detection of symbols is termed non-coherent detection.

• Perfect CSI: the receiver has full knowledge of the instantaneous channel realisation. Under these conditions, the detection of symbols is termed coherent detection.

• Imperfect CSI: the receiver has inaccurate knowledge of the parameters describing the channel. For example, the receiver may be informed of the estimated channel matrix Ĥ ≠ H, with corresponding error covariance matrix C. In that case, a channel model that can be used is one in which the channel is a random matrix H ∼ CN(Ĥ, C).

3.5 Detection Techniques in MIMO Systems

MIMO systems transmit parallel data sequences simultaneously in the same frequency band. As a result, the receiver has the difficult task of distinguishing between the multiple data sequences. To solve this problem, several MIMO detection algorithms have been proposed in the literature. In general, these algorithms differ in their performance merits, such as BER, expected complexity, etc. The Maximum A Posteriori (MAP) detector minimises the error probability, and in the case of perfect CSI can be expressed as

x̂ = arg max_{x ∈ A^{n_T}} p(x | H, y) = arg max_{x ∈ A^{n_T}} p(y | H, x) Pr(x), (3.26)

where A is the constellation used to transmit the symbols. The likelihood function in this case can be expressed as

p(y | H, x) = ( 1/(πσ_w²)^{n_R/2} ) exp{ −‖y − Hx‖² / σ_w² }. (3.27)

If there is no a priori information at the input, or if the input symbols are equiprobable, the Maximum Likelihood (ML) detector is equivalent to the MAP detector and can be expressed as

x̂ = arg max_{x ∈ A^{n_T}} p(y | H, x) = arg max_{x ∈ A^{n_T}} ( 1/(πσ_w²)^{n_R/2} ) exp{ −‖y − Hx‖² / σ_w² } = arg min_{x ∈ A^{n_T}} ‖y − Hx‖². (3.28)

In both cases, the exponential growth of the search space, |A|^{n_T}, prohibits the use of brute-force ML detection, i.e., simply evaluating ‖y − Hx′‖² for all possible x′ ∈ A^{n_T}. Therefore, more efficient but possibly suboptimum detectors need to be studied. We now provide a brief overview of some of these detection methods.

3.5.1 Linear Detectors

In order to reduce the complexity of the optimal ML detector, many suboptimal schemes have been devised. The two most basic ones are the Zero Forcing (ZF) and the Minimum Mean Squared Error (MMSE) detectors. The ZF approach solves the set of linear equations by forcing the noise component w to zero or, equivalently, solves the unconstrained ML estimation problem, which gives

x̃ = H†y = x + H†w. (3.29)

The solution in (3.29) needs to be quantised to the closest lattice point in the constellation A,

x̂_ZF = Q_A[x̃]. (3.30)

Since the signal components are fully decoupled, this quantisation can be performed separately for each symbol. The main disadvantage of the ZF detector is that, in case some column vectors of H are close to parallel, the corresponding components of w will be significantly amplified by the multiplication with H†. This noise enhancement can become infinite for singular matrices. To alleviate this problem, the MMSE detector has been suggested. This approach balances residual interference and noise enhancement by finding the matrix G_MMSE such that

G_MMSE = arg min_G E{ ‖Gy − x‖² }. (3.31)

2 GMMSE = arg min E Gy x . (3.31) G k − k n o Defining the estimation error e = Gy x we obtain from the principle of orthogonality −

E eyT = 0, (3.32)  and finally

1 σ2 − G = HH H + w I HH . (3.33) MMSE σ2  x  In the limit for high Signal to Noise Ratio (SNR) the MMSE based detector approaches the ZF one.
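A minimal Python sketch of the ZF and MMSE detectors (3.29)-(3.33). The per-symbol quantiser Q_A is implemented as a nearest-point search, and unit symbol variance σ_x² = 1 is assumed for illustration.

import numpy as np

def quantize(x_soft, constellation):
    # Per-symbol mapping to the nearest constellation point (the quantiser Q_A).
    idx = np.abs(x_soft[:, None] - constellation[None, :]).argmin(axis=1)
    return constellation[idx]

def zf_detect(y, H, constellation):
    x_soft = np.linalg.pinv(H) @ y                 # (3.29): H^dagger y
    return quantize(x_soft, constellation)         # (3.30)

def mmse_detect(y, H, constellation, sigma_w2, sigma_x2=1.0):
    # (3.33): G = (H^H H + (sigma_w^2/sigma_x^2) I)^{-1} H^H
    G = np.linalg.solve(H.conj().T @ H + (sigma_w2 / sigma_x2) * np.eye(H.shape[1]),
                        H.conj().T)
    return quantize(G @ y, constellation)

At high SNR the regularisation term (σ_w²/σ_x²)I vanishes and mmse_detect indeed reduces to zf_detect, consistent with the limit noted above.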

3.5.2 VBLAST Detector

The Vertical Bell Labs Layered Space Time (VBLAST) algorithm is an improved variant of the decision-feedback equalisation strategy and was introduced by Foschini in [91]. In the VBLAST algorithm, rather than jointly decoding all the transmit signals, we first decode the “strongest” signal, then subtract this strongest signal from the received signal, proceed to decode the strongest of the remaining transmit signals, and so on. The optimum detection order in such a nulling and cancelling strategy is from the strongest to the weakest signal. Assuming that the channel is known at the receiver, the main steps of the VBLAST algorithm can be summarised as follows:

1. Nulling: an estimate of the strongest transmit signal is obtained by nulling out all the weaker transmit signals (say using zero forcing criterion).

2. Slicing: the estimated signal is detected to obtain the data bit.

3. Cancellation: the detected data bits are remodulated and the channel is applied to estimate the corresponding vector signal contribution at the receiver. The resulting vector is then subtracted from the received signal vector, and the algorithm returns to the nulling step until all transmit signals are decoded.

3.5.3 Sphere Decoder

As mentioned before, ML detection involves an exhaustive search, and so the computational complexity is exponential in the length of the codeword. The sphere decoding algorithm [92] was proposed to lower this computational complexity. Considerable research has gone into sphere decoding in the last decade [93], [94], [95]. This has resulted in the emergence of quite a few sphere decoders with various variants to facilitate the decoding process. The conventional sphere decoders have been replaced by sphere decoders where the search proceeds independently of the initial radius [96]. There are also list sphere decoders, where more than one solution can be found [97]. The size of the list can be as large as the constellation size, in which case the complexity is the same as that of ML decoding, or it can be much smaller, in which case the number of points scanned is reduced. The principle of the sphere decoding algorithm is to search for the closest lattice point to the received signal within a sphere of a given radius, where each codeword is represented by a lattice point in a lattice field [98]. In the two-dimensional problem illustrated in Figure 3.3, one can easily restrict the search by drawing a circle around the received signal just large enough to enclose one lattice point, eliminating the search over all points outside the circle.

In order to derive the sphere decoder, we rewrite the ML detection rule in (3.28) as

x̂ = arg min_{x ∈ A^{n_T}} ‖y − Hx‖² = arg min_{x ∈ A^{n_T}} (x − x̃)^H H^H H (x − x̃), (3.34)

where x̃ is the unconstrained ML estimate of x, defined in (3.29). Based on the Fincke-Pohst method [99], a lattice point which lies inside the sphere with radius d has to fulfil the condition

d² ≥ ‖y − Hx‖² = (x − x̃)^H H^H H (x − x̃) + ‖y‖² − ‖Hx̃‖². (3.35)

Figure 3.3: Idea behind the sphere decoder (a search area drawn around the soft decision, enclosing nearby constellation points).

By defining d′² = d² − ‖y‖² + ‖Hx̃‖², (3.35) can be rewritten as

d′² ≥ (x − x̃)^H H^H H (x − x̃). (3.36)

The matrix H^H H can be decomposed into triangular matrices via the Cholesky decomposition, so that H^H H = U^H U, where U is an upper triangular matrix.

Further simplification of (3.35) yields

d′² ≥ (x − x̃)^H H^H H (x − x̃)
    = (x − x̃)^H U^H U (x − x̃)
    = Σ_{i=1}^{n_R} U_{i,i}² ( (x_i − x̃_i) + Σ_{j=i+1}^{n_R} (U_{i,j}/U_{i,i}) (x_j − x̃_j) )²
    = U_{n_R,n_R}² (x_{n_R} − x̃_{n_R})² + U_{n_R−1,n_R−1}² ( x_{n_R−1} − x̃_{n_R−1} + (U_{n_R−1,n_R}/U_{n_R−1,n_R−1}) (x_{n_R} − x̃_{n_R}) )² + ... . (3.37)

Because of the upper triangular nature of U, one can begin the evaluation with the last element of x, as

U_{n_R,n_R}² (x_{n_R} − x̃_{n_R})² ≤ d′², (3.38)

which leads to

⌈ x̃_{n_R} − d′/U_{n_R,n_R} ⌉ ≤ x_{n_R} ≤ ⌊ x̃_{n_R} + d′/U_{n_R,n_R} ⌋. (3.39)

The method employs an iterative search: for every x_{n_R} satisfying (3.39), one defines d′²_{n_R−1} = d′² − U_{n_R,n_R}² (x_{n_R} − x̃_{n_R})², and a new condition can be written as

U_{n_R−1,n_R−1}² ( x_{n_R−1} − x̃_{n_R−1|n_R} )² ≤ d′²_{n_R−1}, where x̃_{n_R−1|n_R} ≜ x̃_{n_R−1} − (U_{n_R−1,n_R}/U_{n_R−1,n_R−1}) (x_{n_R} − x̃_{n_R}), (3.40)

which is equivalent to

⌈ x̃_{n_R−1|n_R} − d′_{n_R−1}/U_{n_R−1,n_R−1} ⌉ ≤ x_{n_R−1} ≤ ⌊ x̃_{n_R−1|n_R} + d′_{n_R−1}/U_{n_R−1,n_R−1} ⌋. (3.41)

In a similar fashion, one proceeds to x_{n_R−2}, and so on, stating nested necessary conditions for all elements of x.

To ensure that the sphere contains a solution, the initial radius must be large enough to enclose at least one lattice point.

A comprehensive overview of different detection schemes can be found in [100], [101].

3.6 Overview of OFDM Systems

Orthogonal Frequency Division Multiplexing (OFDM) is nowadays ubiquitous, used both for achieving high data rates and for combating multipath fading in wireless communications. In this multi-carrier modulation scheme, data is transmitted by dividing a single wideband stream into several narrowband parallel bit streams. Each narrowband stream is modulated onto an individual carrier. The narrowband channels are orthogonal to each other and are transmitted simultaneously. In doing so, the symbol duration is increased proportionately, which reduces the effects of ISI induced by multipath Rayleigh-faded environments. The spectra of the subcarriers overlap each other, making OFDM more spectrally efficient than conventional multicarrier communication schemes.

3.6.1 OFDM Signals and Orthogonality

In OFDM systems, subchannels (subcarriers) are obtained via an orthogonal transformation on each block of data (OFDM symbol) comprising N subcarriers. Orthogonal transformations are used so that at the receiver, the inverse transformation can be used to demodulate the data without error in the absence of noise. Weinstein [6] proposed the Discrete Fourier Transform (DFT) for multicarrier modulation. The DFT exhibits the desired orthogonality and can be implemented efficiently through the Fast Fourier Transform (FFT) algorithm.


Figure 3.4: Block diagram of an OFDM transceiver (frequency domain processing: modulation, serial-to-parallel conversion, IFFT; time domain processing: cyclic extension, D/A conversion and RF up-conversion, the channel, and the corresponding receiver chain).

OFDM schemes use rectangular pulses for data modulation; hence a given subchannel has significant spectral overlap with a large number of adjacent subchannels (see Figure 3.5). When the channel distortion is mild relative to the channel bandwidth, the data can be demodulated with a very small amount of interference from the other subchannels, due to the orthogonality of the transformation. However, in order to completely remove the ISI, a Cyclic Prefix (CP) is inserted in front of every OFDM symbol. The CP is a copy of the tail of the OFDM symbol [102]. For complete ISI removal, the length G of the CP must be longer than the essential support L of the CIR. The length of the OFDM symbol after insertion of the CP is denoted by P = N + G. In the following Section we provide a detailed overview of data transmission in OFDM systems.

3.6.2 OFDM Symbols Transmission

OFDM maps a symbol vector containing N symbols (corresponding to N subcarriers) at time n, d[n] = [d_1[n], ..., d_N[n]]^T ∈ C^N, where the subscript i ∈ {1, ..., N} is the carrier index, according to

s[n] = T_CP W_N^H d[n], (3.42)

Figure 3.5: Frequency-domain representation (amplitude versus normalised frequency) of three OFDM subcarriers $f_1$, $f_2$, $f_3$.

The CP insertion is described via the matrix

$$\mathbf{T}_{CP} \triangleq \begin{bmatrix} \mathbf{I}_{CP} \\ \mathbf{I}_N \end{bmatrix} \in \mathbb{R}^{P \times N}, \qquad (3.43)$$

where the matrix $\mathbf{I}_{CP} \in \mathbb{R}^{G \times N}$ denotes the last $G$ rows of the identity matrix $\mathbf{I}_N \in \mathbb{R}^{N \times N}$. The unitary DFT matrix $\mathbf{W}_N \in \mathbb{C}^{N \times N}$ has elements

$$[\mathbf{W}_N]_{i,k} \triangleq \frac{1}{\sqrt{N}} \exp\left( \frac{-j 2\pi i k}{N} \right), \quad i,k \in \{0, \ldots, N-1\}. \qquad (3.44)$$

After parallel-to-serial conversion, $\mathbf{s}[n]$ is transmitted over the multipath channel. We express the CIR in vector notation as

$$\mathbf{h} = [h_0, \ldots, h_{L-1}]^T, \qquad (3.45)$$

where we assume that $L < G$. Let

$$\mathbf{H}_{ISI} \triangleq \begin{bmatrix}
h_0 & 0 & \cdots & \cdots & 0 \\
\vdots & \ddots & \ddots & & \vdots \\
h_{L-1} & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & h_{L-1} & \cdots & h_0
\end{bmatrix} \in \mathbb{C}^{P \times P} \qquad (3.46)$$

be the lower triangular Toeplitz channel matrix and let

$$\mathbf{H}_{IBI} \triangleq \begin{bmatrix}
0 & \cdots & 0 & h_{L-1} & \cdots & h_1 \\
\vdots & & & \ddots & \ddots & \vdots \\
\vdots & & & & \ddots & h_{L-1} \\
\vdots & & & & & 0 \\
\vdots & & & & & \vdots \\
0 & \cdots & \cdots & \cdots & \cdots & 0
\end{bmatrix} \in \mathbb{C}^{P \times P} \qquad (3.47)$$

be the upper triangular Toeplitz channel matrix. We can express the received signal as

$$\mathbf{r}[m] = \mathbf{H}_{ISI}\,\mathbf{s}[m] + \mathbf{H}_{IBI}\,\mathbf{s}[m-1] + \mathbf{w}[m], \qquad (3.48)$$

where the first term represents the ISI within the current OFDM symbol, the second term corresponds to Inter Block Interference (IBI) between the two consecutive OFDM block transmissions at time $m$ and time $m-1$, and $\mathbf{w}[m] \in \mathbb{C}^P$ is the AWGN, assumed to be i.i.d. complex Gaussian with zero mean and variance $\sigma_w^2$.

At the receiver the CP of length G is removed, and a DFT is performed on the remaining N samples. The CP removal can be represented by the matrix

$$\mathbf{R}_{CP} \triangleq [\mathbf{0}_{N \times G} \;\; \mathbf{I}_N] \in \mathbb{R}^{N \times P}, \qquad (3.49)$$

which removes the first $G$ entries from the vector $\mathbf{r}[m] \in \mathbb{C}^P$ when the product $\mathbf{R}_{CP}\,\mathbf{r}[m]$ is formed. As long as $G \ge L$,

$$\mathbf{R}_{CP}\,\mathbf{H}_{IBI} = \mathbf{0}_{N \times P}, \qquad (3.50)$$

which indicates that the ISI between two consecutive OFDM symbols is completely eliminated. Finally, the received signal can be written as

$$\begin{aligned}
\mathbf{y}[m] &= \mathbf{W}_N \mathbf{R}_{CP}\,\mathbf{r}[m] \\
&= \mathbf{W}_N \mathbf{R}_{CP} \mathbf{H}_{ISI}\,\mathbf{s}[m] + \mathbf{W}_N \mathbf{R}_{CP}\,\mathbf{w}[m] \\
&= \mathbf{W}_N \mathbf{R}_{CP} \mathbf{H}_{ISI} \mathbf{T}_{CP} \mathbf{W}_N^H\,\mathbf{d}[m] + \mathbf{W}_N \mathbf{R}_{CP}\,\mathbf{w}[m] \\
&= \mathbf{W}_N \mathbf{H} \mathbf{W}_N^H\,\mathbf{d}[m] + \mathbf{W}_N \mathbf{R}_{CP}\,\mathbf{w}[m],
\end{aligned} \qquad (3.51)$$

where

$$\mathbf{H} \triangleq \mathbf{R}_{CP} \mathbf{H}_{ISI} \mathbf{T}_{CP} = \mathbf{W}_N^H \operatorname{diag}(\mathbf{g})\, \mathbf{W}_N, \qquad (3.52)$$

where the CFR $\mathbf{g} \in \mathbb{C}^N$ is defined as the DFT of the CIR,

$$\mathbf{g} \triangleq \mathbf{W}_{N \times L}\,\mathbf{h}, \qquad (3.53)$$

where $\mathbf{W}_{N \times L}$ is the partial DFT matrix containing the first $L$ columns of $\mathbf{W}_N$. Using (3.52) we rewrite (3.51) as

$$\mathbf{y}[m] = \operatorname{diag}(\mathbf{g})\,\mathbf{d}[m] + \mathbf{z}[m], \qquad (3.54)$$

where the elements of $\mathbf{z}[m] \triangleq \mathbf{W}_N \mathbf{R}_{CP}\,\mathbf{w}[m]$ are white with variance $\sigma_w^2$. Hence, the covariance matrix of $\mathbf{z}[m]$ has a diagonal structure with identical elements:

$$\begin{aligned}
\mathbf{R}_{z[m]} &= \mathbb{E}\left\{ \mathbf{W}_N \mathbf{R}_{CP}\,\mathbf{w}[m] \left( \mathbf{W}_N \mathbf{R}_{CP}\,\mathbf{w}[m] \right)^H \right\} \\
&= \sigma_w^2\, \mathbf{W}_N \mathbf{R}_{CP} \mathbf{I}_P \mathbf{R}_{CP}^H \mathbf{W}_N^H \\
&= \sigma_w^2\, \mathbf{I}_N.
\end{aligned} \qquad (3.55)$$

In an OFDM system, according to (3.54), every element of the symbol vector d[m] is transmitted over an individual frequency-flat subcarrier. Note that (3.54) can also be expressed as

$$\begin{aligned}
\mathbf{y}[m] &= \operatorname{diag}(\mathbf{d}[m])\,\mathbf{g} + \mathbf{z}[m] \\
&= \operatorname{diag}(\mathbf{d}[m])\,\mathbf{W}_{N \times L}\,\mathbf{h} + \mathbf{z}[m].
\end{aligned} \qquad (3.56)$$

For the detection of symbols, (3.54) is the more convenient form, while for the purpose of channel estimation, (3.56) is the more convenient one. These two equivalent system models are summarised in Table 3.1.

OFDM system model for symbol detection:   $\mathbf{y}[m] = \operatorname{diag}(\mathbf{g})\,\mathbf{d}[m] + \mathbf{z}[m]$

OFDM system model for channel estimation:   $\mathbf{y}[m] = \operatorname{diag}(\mathbf{d}[m])\,\mathbf{W}_{N \times L}\,\mathbf{h} + \mathbf{z}[m]$

Table 3.1: OFDM system models summary
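As a numerical sanity check of the derivation above (our own sketch; the sizes, seed and variable names are illustrative, not from the thesis), the following Python fragment constructs the matrices of (3.43), (3.46) and (3.49) explicitly and verifies that the cascade in (3.51)-(3.52) is diagonal, with the CFR of (3.53) on its diagonal:

```python
import numpy as np

# Toy check of (3.42)-(3.54): with a CP at least as long as the CIR, the
# cascade W_N R_CP H_ISI T_CP W_N^H is diagonal, carrying the CFR.
N, G, L = 8, 3, 3                                        # subcarriers, CP, taps (L <= G)
P = N + G
rng = np.random.default_rng(0)

h = rng.standard_normal(L) + 1j * rng.standard_normal(L)  # CIR (3.45)
W = np.fft.fft(np.eye(N)) / np.sqrt(N)                    # unitary DFT matrix (3.44)

T_cp = np.vstack([np.eye(N)[-G:], np.eye(N)])             # CP insertion (3.43)
R_cp = np.hstack([np.zeros((N, G)), np.eye(N)])           # CP removal (3.49)
H_isi = sum(np.eye(P, k=-l) * h[l] for l in range(L))     # lower Toeplitz (3.46)

H_eff = W @ R_cp @ H_isi @ T_cp @ W.conj().T              # as in (3.51)-(3.52)
g = np.sqrt(N) * (W[:, :L] @ h)                           # CFR (3.53); the sqrt(N)
                                                          # accounts for the unitary
                                                          # DFT normalisation
print(np.allclose(H_eff, np.diag(np.diag(H_eff))))        # True: diagonalized
print(np.allclose(np.diag(H_eff), g))                     # True: diagonal equals CFR
```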

3.6.3 Multi Carrier versus Single Carrier Modulation Schemes

We now list a few pros and cons of using Multi Carrier (MC) techniques. The main advantages of MC over Single Carrier (SC) modulation schemes are:

1. Narrowband interference - MC systems are robust against narrowband interference, because such interference affects only a small number of sub-carriers.

2. Equalization - in MC systems, equalization is very simple. This is because the time dispersive channel is transformed into parallel channel gains on each subcarrier. Therefore, equalization can be implemented using a one-tap equalizer. In SC systems, in contrast, a careful design of the equalizer needs to be implemented in order to mitigate the ISI effect.

3. Adaptive schemes - in MC systems, different modulation, power allocation and coding schemes can be assigned to each subcarrier in a natural fashion [103], [104]. For example, a subcarrier in a deep fade can be assigned Binary Phase Shift Keying (BPSK) modulation and a low power allocation, while a subcarrier on a "good" channel can be assigned a high-order modulation scheme (e.g. 64 Quadrature Amplitude Modulation (QAM)). Thus, the channel can be used more efficiently.

The main disadvantages of MC are:

1. Frequency offsets and phase noise - the sensitivity of MC systems to frequency offsets and phase noise is well known. When the receiver's Voltage Controlled Oscillator (VCO) is not oscillating at exactly the same carrier frequency as the transmitter's VCO, both Carrier Frequency Offset (CFO) and Phase Noise (PN) may occur. Both CFO and PN result in ICI, as the subcarriers are no longer orthogonal and interfere with each other. Because OFDM divides the spectral allotment into many narrow subcarriers, each with a small carrier spacing, it can be very sensitive to CFO and PN errors [105], [106]. The characteristics of this interference are similar to additive white Gaussian noise, and it leads to a degradation of the overall SNR.

2. High Peak to Average Power Ratio (PAPR) - the main cause of a large PAPR is symbol phases on the subcarriers lining up so as to constructively form a peak in the time-domain signal. The signal transmitted by the OFDM system is the superposition of all signals transmitted in the narrowband subchannels. By the Central Limit Theorem, the transmitted signal is approximately Gaussian distributed, leading to high peak values compared to the average power. A system design that does not take this into account will have a high clip rate: each signal sample that is beyond the saturation limit of the power amplifier suffers either clipping to this limit value or other non-linear distortion, both creating additional bit errors at the receiver [107], [108]. The PAPR is defined as

$$\operatorname{PAPR}\{s[n]\} \triangleq \frac{\max |s[n]|^2}{\mathbb{E}\left\{ |s[n]|^2 \right\}}. \qquad (3.57)$$

Throughout this thesis we shall not deal with the ICI, CFO, PN and PAPR phenomena, but instead assume that the aforementioned effects are compensated for.
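For concreteness, the PAPR of (3.57) can be evaluated numerically; the following short sketch (ours, with arbitrary parameters) draws a random QPSK-modulated OFDM symbol and reports its PAPR:

```python
import numpy as np

# Illustrative PAPR computation for (3.57): an N-subcarrier OFDM symbol
# built from random QPSK data. Values are examples, not thesis results.
N = 256
rng = np.random.default_rng(1)
d = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)  # QPSK
s = np.fft.ifft(d) * np.sqrt(N)        # time-domain OFDM symbol (unitary IDFT)
papr_db = 10 * np.log10(np.max(np.abs(s) ** 2) / np.mean(np.abs(s) ** 2))
print(f"PAPR = {papr_db:.1f} dB")      # typically around 8-12 dB for N = 256
```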

3.7 Channel Estimation in OFDM Systems

In mobile wireless communications systems, the channel is time-varying because of the relative motion between the transmitter and the receiver, which causes the propagation paths to vary. Most modern digital receivers rely on coherent detection, which requires knowledge of the fading amplitude and phase. Channel estimation is therefore a vital task for the receiver in order to obtain satisfactory performance. It can be carried out in either the time or the frequency domain: in the time domain, the CIR $\mathbf{h}$ in (3.56) is estimated, while in the frequency domain, the CFR $\mathbf{g}$ in (3.53) is estimated.

Channel estimation methods can be classified as blind, semi-blind and pilot-aided. Blind algorithms do not require any training data and exploit statistical or structural properties of communication signals. Semi-blind methods combine blind criteria with a limited amount of pilot data. Pilot-aided methods, on the other hand, rely on a set of known symbols interleaved with the data in order to acquire the channel estimate.

The advantage of estimating $\mathbf{h}$ instead of $\mathbf{g}$ lies in the fact that the number of elements in $\mathbf{h}$ (that is, the number of channel taps) is usually much smaller than the number of elements in $\mathbf{g}$ (that is, the number of subcarriers). In a typical OFDM system the number of subcarriers can reach hundreds or even thousands, while the number of channel taps is usually fewer than ten. This means that for the same number of pilot symbols, a smaller MSE can be achieved using time-domain channel estimation techniques. We shall therefore concentrate in the sequel on the estimation of the CIR $\mathbf{h}$.

We now provide a short review of these methods.

3.7.1 Pilot Aided Channel Estimation

Pilot Symbol Aided Modulation (PASM) based schemes obtain the estimate on the basis of known pilot symbols that are interleaved among the transmitted data symbols, see [109], [110], [111], [112], [113].

• Least squares channel estimation: The simplest approach to PASM channel estimation in OFDM systems is the Least Squares (LS) approach. In this case, no a priori information is assumed to be known about the statistics of the channel taps. Based on (3.56), the LS estimate of $\mathbf{h}$ (which is also the ML solution in the case of additive Gaussian noise) can be expressed as

$$\hat{\mathbf{h}} = \left( \mathbf{A}^H \mathbf{A} \right)^{-1} \mathbf{A}^H \mathbf{y}, \qquad (3.58)$$

where $\mathbf{A} \triangleq \operatorname{diag}(\mathbf{d}[m])\,\mathbf{W}_{N \times L}$. The MSE of the LS estimator, $\epsilon$, can be written as

$$\epsilon = \sigma_z^2 \operatorname{Tr}\left\{ \left( \mathbf{A}^H \mathbf{A} \right)^{-1} \right\}. \qquad (3.59)$$

It is important to note that (3.58) minimises the quantity $\| \mathbf{y} - \mathbf{A}\hat{\mathbf{h}} \|^2$ and not $\| \mathbf{h} - \hat{\mathbf{h}} \|^2$.

• MMSE channel estimation: In the case where the channel is known to be Rayleigh fading, we can use the Bayesian framework to find quantities such as the MAP or MMSE estimates. The MMSE estimate can be written as

$$\hat{\mathbf{h}} = \left( \mathbf{A}^H \mathbf{A} + \sigma_w^2 \mathbf{R}_h^{-1} \right)^{-1} \mathbf{A}^H \mathbf{y}, \qquad (3.60)$$

where $\mathbf{R}_h$ is the covariance matrix of $\mathbf{h}$. The MSE of the MMSE estimator can be written as

$$\epsilon = \operatorname{Tr}\left\{ \operatorname{cov}(\mathbf{h}) - \left( \mathbf{A}^H \mathbf{A} + \sigma_w^2 \mathbf{R}_h^{-1} \right)^{-1} \mathbf{A}^H \operatorname{cov}(\mathbf{y}, \mathbf{h}) \right\}. \qquad (3.61)$$

Unlike (3.58), the MMSE estimator does minimise the cost function $\| \mathbf{h} - \hat{\mathbf{h}} \|^2$.
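The two estimators above are straightforward to realize numerically. The sketch below (ours; the pilot pattern, sizes and the assumed tap covariance are illustrative) implements (3.58) and (3.60) for the model (3.56):

```python
import numpy as np

# Pilot-aided LS (3.58) and MMSE (3.60) CIR estimation for the model
# y = diag(d) W_{NxL} h + z. All names and parameter values are examples.
N, L, sigma_w = 64, 4, 0.1
rng = np.random.default_rng(2)

W = np.fft.fft(np.eye(N)) / np.sqrt(N)          # unitary DFT matrix
d = rng.choice([-1, 1], N) + 0j                 # known BPSK pilots
h = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)
R_h = np.eye(L) / L                             # assumed Rayleigh tap covariance

A = np.diag(d) @ W[:, :L]                       # A = diag(d[m]) W_{NxL}
z = sigma_w * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
y = A @ h + z

h_ls = np.linalg.solve(A.conj().T @ A, A.conj().T @ y)                    # (3.58)
h_mmse = np.linalg.solve(A.conj().T @ A + sigma_w**2 * np.linalg.inv(R_h),
                         A.conj().T @ y)                                  # (3.60)
print(np.linalg.norm(h - h_ls), np.linalg.norm(h - h_mmse))
```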

3.7.2 Blind Channel Estimation

The term blindness means that the receiver has knowledge of neither the transmitted sequence nor the CIR. Channel identification, equalization or demodulation is then performed using only statistical or structural properties of the communication signals, e.g. cyclostationarity, higher-order statistics, or the finite-alphabet property. The need for higher data rates motivates the search for blind channel estimation methods: in OFDM systems, the CP typically occupies up to 25% of the transmitted data, and if pilot symbols are used for channel estimation and synchronization purposes, those may require another 15%-20% of the remaining data. Therefore, blind estimators are of interest, especially in the case of slowly time-varying channels. Blind methods may be classified as follows: correlation-based methods, subspace methods, methods exploiting the finite-alphabet property, and maximum likelihood estimation.

• Decision Directed (DD) methods: In this approach, the detection of the symbols at time $n$, $\hat{\mathbf{d}}[n]$, is carried out conditional on the channel estimate at time $n-1$, $\hat{\mathbf{h}}[n-1]$; see for example [114] and [115]:

$$\hat{\mathbf{d}}[n] = \arg\max_{\mathbf{d}} \; p\left( \mathbf{y}[n] \,\middle|\, \mathbf{d}, \hat{\mathbf{h}}[n-1] \right). \qquad (3.62)$$

Next, the channel estimate at time $n$ is updated conditional on the currently detected symbols $\hat{\mathbf{d}}[n]$, used as virtual pilots:

$$\hat{\mathbf{h}}[n] = \arg\max_{\mathbf{h}[n]} \; p\left( \mathbf{h}[n] \,\middle|\, \hat{\mathbf{d}}[n], \mathbf{y}[n] \right). \qquad (3.63)$$

DD methods are effective at moderate to high SNRs and on slowly varying channels, when reliable symbol decisions are available to the receiver. However, the method is prone to error propagation. We will discuss DD in detail in Chapter 6 (a schematic code sketch of one DD iteration is given at the end of this subsection).

• Precoding based approach: These methods employ non-redundant precoding at the transmitter side in SISO [116], [117] or MIMO systems [118]. With non-redundant precoding, the block length remains unchanged, but a specific correlation structure is induced at the transmitter, e.g. by correlating each carrier with a reference carrier. A proper balance must be found between the level of transmitter-induced correlation, which leads to ICI, and a small channel estimation variance, which may in turn improve the system performance.

• Correlation based approach: This approach takes advantage of the specific structure of the CP. Since the CP is periodic, it introduces redundancy as well as cyclostationarity. Cyclostationary signals have the property that their statistics, such as the mean or the autocorrelation function, are periodic [119]. Linear time-invariant filtering does not affect cyclostationarity; consequently, periodicity is expected in the time-varying correlation at the output of the channel. Cyclostationary statistics carry information on the channel amplitude and phase and allow blind channel estimation [120], [121], [122].

• Joint channel estimation and detection methods: It is possible to perform joint estimation of the channel and the symbols [123], [124] or, alternatively, to perform symbol detection by marginalisation over the channel parameters.

If we are interested only in the detection of the transmitted symbols, then the channel h can be viewed as a nuisance parameter and can be integrated out. Thus, the MAP detector can be expressed as

$$\hat{\mathbf{d}} = \arg\max_{\mathbf{d}} \int \Pr\left( \mathbf{d} \,\middle|\, \mathbf{y}, \mathbf{h} \right) p(\mathbf{h}) \, d\mathbf{h}. \qquad (3.64)$$

In most cases the solution of (3.64) is hard to obtain due to the intractability of the integral. Another approach is then to perform joint channel estimation and symbol detection. This can be carried out by expressing the joint posterior

$$p(\mathbf{h}, \mathbf{d} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{h}, \mathbf{d}) \Pr(\mathbf{d})\, p(\mathbf{h}). \qquad (3.65)$$

Based on (3.65), one can obtain various quantities, such as the ML, MMSE or MAP estimates. For example, the joint MAP estimate can be obtained by solving

$$\left( \hat{\mathbf{h}}, \hat{\mathbf{d}} \right) = \arg\max_{\mathbf{h}, \mathbf{d}} \; p(\mathbf{y} \mid \mathbf{h}, \mathbf{d}) \Pr(\mathbf{d})\, p(\mathbf{h}). \qquad (3.66)$$

The complexity of such approaches can be prohibitive in practice, and other approaches have been suggested, such as Turbo receivers, which will be discussed in Section 3.9.

Other blind approaches include finite-alphabet-property based methods [125], [126] and subspace based methods [127], [128], [129], which are not discussed in this thesis.
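The DD iteration (3.62)-(3.63) sketched above can be illustrated with a few lines of code. The sketch below is ours and is deliberately simplified: the MAP channel update of (3.63) is replaced by a smoothed per-subcarrier least-squares update, and QPSK symbols over the diagonal model (3.54) are assumed.

```python
import numpy as np

def dd_step(y, g_prev, alpha=0.1):
    """One decision-directed iteration on y = diag(g) d + z with QPSK data.

    y      : received frequency-domain vector for one OFDM symbol
    g_prev : previous CFR estimate (per subcarrier)
    alpha  : smoothing factor; a leaky update tempers error propagation
    """
    # (3.62): detect symbols using the previous channel estimate
    eq = y / g_prev
    d_hat = (np.sign(eq.real) + 1j * np.sign(eq.imag)) / np.sqrt(2)
    # (3.63), simplified: refresh the channel estimate, treating the
    # detected symbols as virtual pilots
    g_new = (1 - alpha) * g_prev + alpha * (y / d_hat)
    return d_hat, g_new
```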

3.7.3 Semi Blind Channel Estimation

Semi-blind methods are based on limited training data used in conjunction with blind algorithms [125], [130]. Semi-blind methods possess three benefits over blind methods: the ambiguities inherent in blind methods may be resolved, convergence speeds are improved, and more effective and robust tracking of time-varying channels is achieved.

3.7.4 MIMO-OFDM System Model

On the one hand, OFDM is an effective technique to combat multipath fading in wireless communications systems. On the other hand, the capacity of wireless communications systems can be improved by using MIMO techniques. By combining the two, OFDM can transform a frequency-selective MIMO channel into a set of parallel frequency-flat MIMO channels, as long as the channel length is smaller than the CP length. As a result, the receiver complexity decreases and the advantages of each technique are retained.

3.8 Channel Coding

During data transmission, the original signal is likely to be corrupted by the channel and by the noise at the receiver. This causes the signal to be received with errors, which reduces the reliability of reconstructing the original information from the received data. In order to alleviate this problem, Error Control Coding (ECC) can be used: by adding redundant information to the transmitted data, it helps to correct the received errors and reconstruct the original data. Using ECC, a coded system can achieve the same BER at a lower SNR than a comparable uncoded system [131]; this difference is known as the coding gain.

3.8.1 Linear Codes

A code is linear if the sum $\mathbf{c} + \mathbf{c}'$ of any two length-$N$ code words $\mathbf{c}, \mathbf{c}' \in C$ is again a code word in $C$. It follows that the code $C$ is a $K$-dimensional subspace of the vector space of all $2^N$ binary length-$N$ vectors. $K$ linearly independent code words in $C$ form a basis of the subspace $C$, i.e. any code word $\mathbf{c} \in C$ can be uniquely expressed as a linear combination of these $K$ linearly independent vectors. These $K$ basis vectors entirely define the code and are commonly arranged as the rows of a $K \times N$ generator matrix $\mathbf{G}$. This offers a convenient linear encoding rule from the set of information words to the set of code words:

$$\mathbf{c} = \mathbf{u}\mathbf{G}. \qquad (3.67)$$

The columns of G correspond to the code word positions, the rows to the information word positions. The encoding mapping is systematic if the K information bits, u, are contained in the code word c.

Alternatively, the code $C$ may be defined as the null space of an $(N-K) \times N$ parity check matrix $\mathbf{H}$:

$$\mathbf{c}\mathbf{H}^T = \mathbf{0}_{1 \times (N-K)}, \qquad (3.68)$$

where $\mathbf{0}_{1 \times (N-K)}$ is the all-zero vector of length $N-K$. The columns of $\mathbf{H}$ correspond to the code word positions, the rows to the parity check equations fulfilled by a valid code word.
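The encoding rule (3.67) and the parity check (3.68) are easily illustrated numerically. The sketch below (ours) uses the systematic (7,4) Hamming code as a stand-in example, since the thesis does not fix a particular block code:

```python
import numpy as np

# Encoding (3.67) and parity check (3.68) over GF(2), using the (7,4)
# Hamming code as an illustrative linear code.
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])          # systematic K x N generator
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])          # (N-K) x N parity check

u = np.array([1, 0, 1, 1])                     # information word
c = u @ G % 2                                  # code word, c = uG
print(c, (c @ H.T) % 2)                        # parity checks: all zeros
```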

Figure 3.6: Rate-1/2 convolutional encoder (input $u_i$, two delay elements $D$, output streams $c_{i,1}$ and $c_{i,2}$).

3.8.2 Convolutional Coding

Convolutional codes were introduced by Elias [132] and are now broadly utilized in many communication settings. These codes are highly structured, allowing simple implementation and good performance at short block lengths. The Viterbi algorithm [133] is an efficient implementation of optimum ML word-based decoding for convolutional codes; its basic concept is the sequential computation of the path metric and the tracking of survivor paths in the trellis. The algorithm was extended in [134] to produce soft outputs (the SOVA algorithm). Convolutional codes are usually linear codes. An example of a rate-1/2 convolutional code is shown in Figure 3.6. The code words of a convolutional code are the output sequence of a linear encoder circuit fed by the information bits. This code construction sets additional constraints on the characteristics of the corresponding $\mathbf{G}$ and $\mathbf{H}$ matrices.
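A minimal software model of such an encoder is given below (ours). The generator polynomials $(5,7)$ in octal are an assumption made for illustration; the exact taps of the encoder in Figure 3.6 are not recoverable from the figure:

```python
# Rate-1/2 feed-forward convolutional encoder with two delay elements,
# in the spirit of Figure 3.6. Generators (5,7) in octal are assumed.
def conv_encode(bits, g1=0b101, g2=0b111):
    state = 0
    out = []
    for b in bits:
        state = ((state << 1) | b) & 0b111           # shift register contents
        out += [bin(state & g1).count("1") % 2,      # first output stream c_{i,1}
                bin(state & g2).count("1") % 2]      # second output stream c_{i,2}
    return out

print(conv_encode([1, 0, 1, 1]))   # -> [1, 1, 0, 1, 0, 0, 1, 0]
```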

3.8.3 BICM Technique

Bit Interleaved Coded Modulation (BICM), first suggested by Zehavi [135], is the serial concatenation of a code, an interleaver and a mapper, as depicted in Figure 3.7. The information bits are processed by a single encoder and a random interleaver $\Pi$. The coded and interleaved bit sequence $\mathbf{c}$ is partitioned into $N_s$ subsequences $\mathbf{c}_n$ of length $M$:

$$\mathbf{c} = \left( \mathbf{c}_1, \ldots, \mathbf{c}_n, \ldots, \mathbf{c}_{N_s} \right), \quad \text{with} \quad \mathbf{c}_n = \left( c_{n,1}, \ldots, c_{n,m}, \ldots, c_{n,M} \right). \qquad (3.69)$$

The bits $(c_{n,1}, \ldots, c_{n,M})$ are mapped at time index $n$ to a symbol $x_n$ chosen from the $2^M$-ary signal constellation $X$ according to the binary labeling map $\mu : \{0,1\}^M \rightarrow X$.


Figure 3.7: Block diagram of a BICM encoder

The optimum BICM receiver, depicted in Figure 3.8, would implement the overall ML decoder. However, the complexity of a joint ML demapper and decoder is not manageable. Therefore, we separate the demapping and decoding tasks and consider BICM receivers both without and with iterative demapping and decoding.

 ∧  (cL ) uL i  yn ,mndem   Demapper Π−1 Decoder

Figure 3.8: Block diagram of a BICM decoder

3.9 Iterative Processing Techniques

3.9.1 The Turbo Principle

Practical coding structures that perform close to the capacity limit are of great interest. Indeed, although Shannon's theory proved the existence of such codes, it did not provide a mechanism for designing them; in fact, it did not even prove the existence of capacity-approaching codes with tractable decoding complexity. In 1993, Claude Berrou's research group proposed a coding structure, referred to as the turbo code [136], operating close to Shannon's bound while exhibiting reasonable decoding complexity. Perhaps even more than the proposed coding structure, the most groundbreaking contribution of Berrou's team lies in the iterative processing used for decoding the received observations. The so-called turbo decoder consists of two low-complexity decoders iteratively exchanging soft information about the transmitted bits. Owing to its outstanding performance when applied to the decoding of turbo codes, the so-named turbo principle has since been applied to a variety of other receiver tasks: demodulation [137], equalization [138], and multi-user joint reception [139]. Note that the turbo principle was originally developed in a rather ad-hoc way for turbo codes. More recently, a mathematical framework under the name of factor graphs [140] has provided insightful ways to develop iterative algorithms, based on a graphical representation of a problem or system.

3.9.2 Iterative Detection, Decoding and Estimation

Iterative ("Turbo") processing techniques have received increasing attention following the discovery of the powerful Turbo codes. The turbo principle can be applied not only to channel decoders, but also to a wide variety of combinations of detectors, decoders, equalizers, multiuser detectors, coded modulators, joint source/channel coders, etc.

Communications systems usually consist of a collection of cascaded system blocks. For example, consider a receiver consisting of a symbol detector and a decoder. In a conventional system, the detector makes a hard decision about each symbol based on the received signal; its decision is then passed to the decoder, which decides what the transmitted data bits were. This solution, though simple, comes at a price, since significant information is lost when the information about a symbol is truncated to a hard decision. If the confidence level of the detector is passed along with its symbol decision, approximately 2-3 dB of performance can be gained at high SNRs [77]. However, this performance is still far from optimal, because earlier stages receive none of the information collected by the later stages in the chain. The optimal solution is a Maximum Likelihood Sequence Estimator (MLSE), requiring the construction and evaluation of a super-trellis that includes both the channel and code effects; this way, the estimation process considers the joint effects of the channel and the coder. Although this approach is optimal, it is computationally prohibitive. Iterative processing methods provide an alternative that passes information from later stages back to earlier stages. For iterative processing to work, the individual sub-blocks must produce MAP, or soft-output, estimates of the quantities that they estimate; namely, both the detector and the decoder must produce MAP or soft-output estimates of the transmitted bits.

3.9.3 Iterative Detector and Decoder Components

An example of an iterative detector and decoder is shown in Figure 3.9. At each iteration of the loop, the detector makes a decision about the coded bits $c[n]$ by considering the received signal $\mathbf{y}[n]$, the a priori information about the coded bits from the previous iteration, and knowledge of the system structure, which includes the channel structure, modulation type, noise statistics, etc. The superscript $j$ indicates the $j$-th iteration of the turbo processing algorithm. The system blocks indicated by $\Pi$ and $\Pi^{-1}$ are called the interleaver and deinterleaver, respectively; their purpose and function will be described later. By applying Bayes' rule we get the following factorization:

$$\Pr^j(c[n] = 1 \mid y[n]) = \frac{p^j(y[n] \mid c[n] = 1)\, \Pr^{j-1}(c[n] = 1)}{p(y[n])}, \qquad (3.70a)$$
$$\Pr^j(c[n] = 0 \mid y[n]) = \frac{p^j(y[n] \mid c[n] = 0)\, \Pr^{j-1}(c[n] = 0)}{p(y[n])}. \qquad (3.70b)$$


Figure 3.9: Iterative receiver for coded systems

Instead of posterior distributions, usually Log Likelihood Ratio (LLR) values are used.

$$\begin{aligned}
\Lambda_1^j(c[n]) &= \log \frac{\Pr^j(c[n] = 1 \mid y[n])}{\Pr^j(c[n] = 0 \mid y[n])} \\
&= \underbrace{\log \frac{p(y[n] \mid c[n] = 1)}{p(y[n] \mid c[n] = 0)}}_{\lambda_1^j(c[n])} + \underbrace{\log \frac{\Pr^{j-1}(c[n] = 1)}{\Pr^{j-1}(c[n] = 0)}}_{\lambda_2^{j-1}(c[n])}. \qquad (3.71)
\end{aligned}$$

The quantity $\lambda_1^j(c[n])$ is called the extrinsic information produced by the detector: it is the information about the coded bit extracted from the received signal and from the a priori information about the other coded bits, but not from the a priori probability of $c[n]$ itself. The quantity $\lambda_2^{j-1}(c[n])$ is the a priori LLR of $c[n]$, generated from the decoder's output of the previous iteration. The extrinsic information $\lambda_1^j(c[n])$ is sent to the channel decoder, which uses it as a priori information. The channel decoder then uses the information passed by the detector, together with knowledge of the code structure, to calculate an a posteriori LLR.

The decoder's LLR is again expressed as a sum of the extrinsic likelihood $\lambda_2^j(c[n])$, gleaned from its input, the code structure and all coded bits except $c[n]$, and the a priori likelihood $\lambda_1^j(c[n])$. The extrinsic information $\lambda_2^j(c[n])$ is fed back to the first block as a priori information about the coded bits.
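The LLR decomposition of (3.71) is easy to reproduce numerically for a toy channel. The sketch below (ours) assumes BPSK signalling over a real AWGN channel with variance var, for which the channel term of the LLR reduces to 2y/var:

```python
# Numerical illustration of the LLR split in (3.71) for a scalar observation
# y = x + w, with x = +1 for c = 1 and x = -1 for c = 0, w ~ N(0, var).
def llr_split(y, var, prior_llr):
    lam1 = 2.0 * y / var                 # log p(y|c=1)/p(y|c=0): channel/extrinsic part
    return lam1 + prior_llr              # total a posteriori LLR, cf. (3.71)

print(llr_split(y=0.8, var=1.0, prior_llr=0.5))   # -> 2.1
```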

It is important to note that the above equations hold only if the inputs to the individual sub-blocks are independent. Obviously, the sequence of coded bits is not independent, because the parity bits are generated from the data bits and hence there is some correlation among them. To remedy this, an interleaver/deinterleaver device that shuffles the bits is inserted to make the bit sequence appear random. The block on which it operates must be large (on the order of 1,000 bits or more); for simplicity, we employ a random interleaver here. Another advantage of the interleaver is that it disperses burst errors evenly throughout the frame: bits that happen to be transmitted during a long-lasting fade of the channel get shuffled, and more confident decisions about their new neighbours help reconstruct the original bits. Remarkably, after a few iterations the decisions become refined and the estimates become significantly more confident; the overall performance of the receiver approaches that of the MLSE receiver.

3.10 Relay Based Communication Systems

3.10.1 Introduction

In a relay-based communication system, transmission between the source and the destination is achieved through intermediate transceiver units called relays. The defining property of relay channels is that certain terminals, the relays, receive, process, and re-transmit some information-bearing signal(s) of interest in order to improve the performance of the system. The relay channel, first introduced by van der Meulen in 1971 [141], has recently received considerable attention due to its potential in wireless applications. In [9], Cover and El-Gamal introduced two relaying strategies, commonly referred to as Decode and Forward (DF) and Estimate and Forward (EF). Relaying techniques have the potential to provide spatial diversity, improve energy efficiency, and reduce the interference level of wireless channels. A number of relay strategies have been studied in the literature. These include Amplify and Forward (AF) [142], where the relay sends a scaled version of its received signal to the destination; the AF scheme is attractive because of its simple operation at the relay nodes. Other strategies include demodulate-and-forward [142], in which the relay demodulates individual symbols and retransmits; DF [143], in which the relay decodes the entire message, re-encodes it and re-transmits it to the destination; and Compress and Forward (CF) [9], where the relay sends a quantized version of its received signal. With cooperative communications, the design of the encoder and decoder at the source and destination is accompanied by the design of the functionality of the relay nodes. The choice of the relay function affects different aspects of the system, such as the achievable capacity [144], [145] or SNR optimality [146]. Clearly, the most desirable schemes are those that achieve the optimality criteria with minimal processing complexity at the relays. Memoryless relay functions are highly relevant to this objective, due to their simplicity.

3.10.2 Relay System Model

We consider a relay network model as illustrated in Figure 3.10, which consists of a single source node, a destination node, and $L$ relay nodes. In this model, the relays facilitate the transmission from the source to the destination by cooperating with the source. All the relays work in half-duplex mode, in which they cannot transmit and receive at the same time on the same frequency band. In the first time slot, the source broadcasts to all the relays.


Figure 3.10: Parallel relay channel with one source, $L$ relay nodes and one destination.

In the second time slot, the relays transmit to the destination over $L$ orthogonal subchannels. The $L$ orthogonal subchannels can be realized in time division, frequency division, or code division. The source broadcasts the signal (symbol or codeword) $s$ to all the relays. The received signal at the $i$-th relay, $r^{(i)}$, is

$$r^{(i)} = h^{(i)} s + w^{(i)}, \quad i \in \{1, \ldots, L\}, \qquad (3.72)$$

where $h^{(i)}$ is the channel coefficient between the source and the $i$-th relay and $w^{(i)}$ is the additive noise at the $i$-th relay.

The memoryless relay processing function of the $i$-th relay node (possibly different at each relay) is denoted by $f^{(i)}$, $i \in \{1, \ldots, L\}$, and can be either linear or non-linear. Next, each relay transmits its signal on an orthogonal subchannel. The corresponding received signal at the destination, $y^{(i)}$, can be expressed as

$$\begin{aligned}
y^{(i)} &= f^{(i)}\!\left( r^{(i)} \right) g^{(i)} + v^{(i)} \\
&= f^{(i)}\!\left( h^{(i)} s + w^{(i)} \right) g^{(i)} + v^{(i)}, \quad i \in \{1, \ldots, L\},
\end{aligned} \qquad (3.73)$$

where $g^{(i)}$ is the channel coefficient between the $i$-th relay and the destination and $v^{(i)}$ is the additive noise at the destination during the $i$-th slot.

3.10.3 MAP Detection in Memoryless Relay Functions

Detection of the transmitted symbols is a fundamental task of the receiver node. Since the channel gains are mutually independent, the received signals $\{y^{(i)}\}_{i=1}^L$ are conditionally independent given $s$. The MAP decision rule is given by

$$\begin{aligned}
\hat{s} &= \arg\max_{s \in S} \; \prod_{i=1}^{L} \Pr\left( s \,\middle|\, y^{(i)} \right) \\
&= \arg\max_{s \in S} \; \prod_{i=1}^{L} p\left( y^{(i)} \,\middle|\, s \right) \Pr(s). \qquad (3.74)
\end{aligned}$$

In the simple case where the relay function has the linear form $f(\beta) = \alpha\beta$, with $\alpha$ a constant, the likelihood can be obtained analytically as

$$p\left( y^{(i)} \,\middle|\, s \right) = \mathcal{CN}\left( \alpha s h^{(i)} g^{(i)},\; \alpha^2 \left| g^{(i)} \right|^2 \sigma_w^2 + \sigma_v^2 \right). \qquad (3.75)$$
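For the linear (AF) relay function, the likelihood (3.75) can thus be evaluated directly for every candidate symbol. The following sketch (ours; all parameter values are illustrative) computes the per-symbol log-likelihood and the resulting single-relay ML decision:

```python
import numpy as np

# Evaluating the AF likelihood (3.75): a complex Gaussian density with a
# symbol-dependent mean and a symbol-independent variance. QPSK is used
# here as an example alphabet.
def af_loglik(y_i, s, h_i, g_i, alpha, var_w, var_v):
    mean = alpha * s * h_i * g_i
    var = alpha**2 * np.abs(g_i)**2 * var_w + var_v
    return -np.abs(y_i - mean)**2 / var - np.log(np.pi * var)

S = np.array([1+1j, 1-1j, -1+1j, -1-1j]) / np.sqrt(2)
ll = [af_loglik(0.3+0.1j, s, h_i=0.9, g_i=1.1, alpha=1.0, var_w=0.1, var_v=0.1)
      for s in S]
print(S[np.argmax(ll)])   # ML decision from this single-relay observation
```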

In general, the relay function may not be of a linear form. Conditional on $s$, we know the distribution at the relay, $r^{(i)} \mid s$:

$$p^{(i)}\left( r^{(i)} \,\middle|\, s \right) = p\left( s h^{(i)} + w^{(i)} \,\middle|\, s \right) = \mathcal{CN}\left( s h^{(i)}, \sigma_w^2 \right). \qquad (3.76)$$

However, finding the distribution of the random variable after the non-linear function is applied, i.e. the distribution of $\tilde{f}\left( r^{(i)} \right) \triangleq f\left( r^{(i)} \right) g^{(i)}$ given $s$, involves the following change of variables formula:

$$p\left( \tilde{f}\left( r^{(i)} \right) \,\middle|\, s \right) = p_{r^{(i)}}\left( \tilde{f}^{-1}\left( r^{(i)} \right) \,\middle|\, s \right) \left| \frac{\partial \tilde{f}^{-1}}{\partial r^{(i)}} \right|, \qquad (3.77)$$

which cannot always be written down analytically for an arbitrary $\tilde{f}$. The second, more serious complication is that even in cases where this density is known, one must then solve a convolution to obtain the likelihood:

$$p\left( y^{(i)} \,\middle|\, s \right) = \left( p_{\tilde{f}} * p_{v^{(i)}} \right)\left( y^{(i)} \right) = \int_{-\infty}^{\infty} p_{\tilde{f}}\left( z \,\middle|\, s \right) p_{v^{(i)}}\left( y^{(i)} - z \right) dz. \qquad (3.78)$$

Typically this is intractable to evaluate pointwise. In Chapter 8 we shall develop algorithms that overcome these problems using ABC theory.

3.11 Concluding Remarks

In this chapter we have presented an overview of the fundamental concepts of wireless communications. The main points presented in this chapter are:

• We gave an overview of different aspects and properties of wireless channels.
• We presented a few common statistical fading channel models.
• We provided an overview of multiple antenna systems and data detection techniques.
• We discussed transmission and reception in OFDM systems.
• We presented different families of channel estimation techniques for OFDM systems.
• Error correction codes and iterative receiver principles were presented.
• We presented an overview of wireless relay networks and discussed the problem of data detection in the presence of non-linear relay functions.

Chapter 4

Detection in MIMO Systems using Power Equality Constraints

“The Americans have need of the telephone, but we do not. We have plenty of messenger boys.”

Sir William Preece, chief engineer of the British Post Office, 1876

4.1 Introduction

In this chapter we present novel algorithms for the detection of data symbols in MIMO systems with perfect Channel State Information (CSI) at the receiver. In the proposed approach, each transmitted symbol vector in the multi-dimensional constellation is classified into one group from a finite set of groups, each containing vectors of equal power. For each of these groups, we relax the non-convex discrete constraint and replace it with a non-convex continuous Power Equality Constraint (PEC). This results in a series of non-convex optimization problems, one per power group. Using the hidden convexity methodology [147], these optimization problems can be solved efficiently, and a list of "soft candidates" is produced. Once the list is assembled, the soft candidates are quantized (rounded) to yield hard candidates, out of which the one that best fits the received signal vector is chosen as the final detection solution. Although the number of power groups under consideration can be considerable, we will show that only a small number of power groups is relevant in selecting the final detection solution, so a significant complexity reduction can be attained. An appealing property of the proposed detection scheme is that the algorithm does not require knowledge of the noise variance, which in many cases is unknown a priori.

In addition to the proposed detection approach, an improved detection algorithm with a heuristic search is also presented. Based on the list of soft candidates, a local search is performed in order to improve the detection performance. Numerical results show that the proposed detection algorithms significantly outperform the Minimum Mean Squared Error (MMSE) detector. For a MIMO system with four transmit antennas, four receive antennas and 16-QAM, the performance improvement is between 2 and 8 dB over the MMSE detector, depending on whether the heuristic search is used. The main contributions presented in this chapter are as follows:

• Algorithm 1 - Power Equality Constraint Least Squares (PEC-LS) detector: We present a novel detection scheme for MIMO systems with high-level modulation constellations. The proposed detection approach is based on a relaxation of the ML optimization problem in a multidimensional constellation. Each symbol vector in the multidimensional constellation can be classified into a finite set of equi-power groups, leading to a set of non-convex optimization problems that can be solved efficiently using the hidden convexity methodology.

• Algorithm 2 - Ordered Power Equality Constraint (OPEC) detector: We present a novel algorithm which significantly reduces the complexity of Algorithm 1 without any performance degradation. This is achieved by sorting the set of soft solutions according to their MSE and producing only a subset of hard-decision candidates.

• Algorithm 3 - Improved Ordered Power Equality Constraint (IOPEC) detector: The purpose of this algorithm is to enhance the BER performance of the previous two algorithms. This is achieved by incorporating a local search in the neighborhood of the soft solutions.

• Performance analysis and complexity reduction: In order to reduce the complexity of the proposed detectors, we present an efficient implementation of the proposed algorithms. We separate the computational operations into two categories:

1. Pre-Processing Phase - contains operations that are common to all the power groups and depend only on the current observation vector.
2. Processing Phase - contains operations that are group specific.

We show that it is possible to design our algorithms in such a way that most of the complexity burden lies in the Pre-Processing Phase, leading to a low overall complexity. We present a complexity analysis of the proposed algorithms and show that they have the same order of complexity as the MMSE detector.

4.2 Background

MIMO systems arise in many modern communication channels, such as multiple access and multiple antenna channels. It is well known that MIMO systems can yield vast capacity and error-performance gains over traditional single antenna systems if a rich-scattering environment is properly exploited [2]. In order to realize these gains, the system must be able to efficiently detect the transmitted symbols at the receiver. The optimal method for detecting the transmitted symbols is the ML detector, which minimizes the error probability. Unfortunately, the ML detector requires solving a combinatorial optimization problem whose complexity is exponential in the number of transmit antennas and in the modulation rate, which makes it impractical for many applications [148]. Therefore, several suboptimal detectors have been proposed as alternatives to the ML detector. The most common suboptimal detectors are linear detectors, such as the Matched Filter (MF), the de-correlator or Zero Forcing (ZF) detector, and the Minimum Mean Squared Error (MMSE) detector. However, the performance of these linear detectors is far from ML performance, and their gap from the ML detector grows as the constellation size increases.

Apart from the ML detector and the linear detectors, a different approach that can be taken is to implement a two-step detector:

• In the first step, an approximate solution is found based on a relaxation of the discrete ML problem into a tractable continuous optimization problem. For example, a Semi-Definite Relaxation (SDR) detector has been proposed, where the discrete constellation constraint is replaced by a polynomial constraint [149], [150]. Other examples are given in [151], where the algorithm is based on a relaxation of the ML problem using a quadratic non-convex constraint, and in [152], using a convex continuous constraint.

• In the second step, heuristic search methods can be used to further improve the initial detection solution [153], [154].

4.3 System Description

Consider a MIMO system consisting of $M$ transmit antennas and $N$ receive antennas. The relationship between the transmitted symbol vector $\mathbf{s}$ and the received vector $\mathbf{y}$ is determined by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{w}, \qquad (4.1)$$

where $\mathbf{H} \in \mathbb{C}^{N \times M}$ denotes the flat fading channel matrix, consisting of independent complex Gaussian entries, which is assumed to be known by the receiver [155]. In (4.1), $\mathbf{s} \in \mathbb{C}^{M \times 1}$ stands for the transmitted symbol vector, where the elements of $\mathbf{s}$ belong to some known complex constellation $S$ with cardinality $D$. The additive noise vector $\mathbf{w}$ is of size $N \times 1$ with i.i.d. complex random elements, $w_i \sim \mathcal{CN}(0, \sigma_w^2)$. For clarity of presentation, throughout this chapter we use as an example a 16-QAM constellation and $M = 2$, so that $S = $ 16-QAM and $D = 16$, although our approach is general and can be used with any constellation size and any number of antennas.

First, we provide a short overview of some well-known detectors. If the input has a flat prior, the ML detector minimizes the error probability and can be written as

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2, \quad \text{s.t. } \mathbf{s} \in S^M. \qquad (4.2)$$

To find the solution of (4.2), which is a combinatorial problem, a brute-force search over all $D^M$ candidates (or lattice points) must be used [156]. However, this is impractical as $M$ and $D$ become large.

Now, we review two linear detectors: the ZF and the MMSE detector. In both, as a simple relaxation, it is assumed that the elements of $\mathbf{s}$ belong to the $M$-dimensional continuous complex plane, $\mathbf{s} \in \mathbb{C}^M$. This relaxation results in the following optimization problem:

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2, \quad \text{s.t. } \mathbf{s} \in \mathbb{C}^M. \qquad (4.3)$$

The solution of (4.3) leads to the well-known ZF detector, given as [157]

$$\hat{\mathbf{s}}_{ZF} = \mathcal{Q}\left[ \left( \mathbf{H}^H \mathbf{H} \right)^{-1} \mathbf{H}^H \mathbf{y} \right], \qquad (4.4)$$

where $\mathcal{Q}[\mathbf{x}]$ denotes component-wise quantization of $\mathbf{x}$ according to the symbol alphabet used. This is, however, suboptimal in general, because the multiplication by $\left( \mathbf{H}^H \mathbf{H} \right)^{-1} \mathbf{H}^H$ introduces correlations between the noise components, and small eigenvalues of $\mathbf{H}^H \mathbf{H}$ lead to large errors due to noise amplification. The problem of noise enhancement in ZF has been addressed by MMSE detection: by combining the same relaxation as in the ZF detector with minimization of the MSE, the MMSE detector provides a trade-off between noise amplification and interference suppression and achieves improved performance relative to the ZF detector. The MMSE solution is given by [157]

$$\hat{\mathbf{s}}_{MMSE} = \mathcal{Q}\left[ \left( \mathbf{H}^H \mathbf{H} + \frac{\sigma_w^2}{E_s} \mathbf{I} \right)^{-1} \mathbf{H}^H \mathbf{y} \right], \qquad (4.5)$$

where $E_s$ is the average power of the constellation and the noise variance $\sigma_w^2$ needs to be known by the receiver. Although this is a significant improvement over the ZF detector, the performance of the MMSE detector is still far from that of the ML detector. Therefore, there has been considerable effort in developing nonlinear approximations of the ML detector.
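For reference, both linear detectors can be stated in a few lines of code. The sketch below (ours; the alphabet is passed as a parameter, so any slicer — including the 16-QAM one of the running example — can be used) implements (4.4) and (4.5):

```python
import numpy as np

# The linear detectors (4.4) and (4.5) for y = Hs + w, with element-wise
# quantization Q[.] to a given complex alphabet.
def quantize(x, alphabet):
    return alphabet[np.argmin(np.abs(x[:, None] - alphabet[None, :]), axis=1)]

def zf(y, H, alphabet):
    s = np.linalg.solve(H.conj().T @ H, H.conj().T @ y)            # (4.4)
    return quantize(s, alphabet)

def mmse(y, H, alphabet, var_w, Es):
    M = H.shape[1]
    A = H.conj().T @ H + (var_w / Es) * np.eye(M)
    return quantize(np.linalg.solve(A, H.conj().T @ y), alphabet)  # (4.5)
```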

In the rest of the chapter, we propose a detector based on the solution of an optimization problem with a relaxed continuous non-convex constraint set. Since this constraint is significantly tighter than the one used in the ZF and MMSE detectors, a performance gain in terms of BER is expected.

4.4 Power Equality Constraint Least Square Detection

In this section, we propose a PEC-LS detector for MIMO systems with high-level QAM modulations. In the proposed detector, the set of $D^M$ possible symbol combinations is replaced by a relaxed constraint set. This relaxation leads to non-convex optimization problems that can be solved efficiently. First, we introduce the background for our approach and provide some useful definitions.

4.4.1 Basic Definitions and Problem Settings

Definition 4.1. Quantum Power Level (QPL): Let Φ (S) be a set of all possible power levels that are associated with a system of a single transmit antenna and a constellation S, that is

$$\Phi(S) \triangleq \left\{ P \in \mathbb{R} \;\middle|\; \exists\, s \in S : |s|^2 = P \right\}. \qquad (4.6)$$

Then $\Phi(S)$ is called the quantum power level set of the constellation $S$. In the case of $S = $ 16-QAM, as depicted in Figure 4.1, we have

$$\Phi(16\text{-QAM}) = \{2, 10, 18\}, \quad |\Phi(16\text{-QAM})| = 3. \qquad (4.7)$$

Definition 4.2. Symbol Vector Power Group: Let Ω (M,S) be a set of all possible powers of a transmitted signal vector that are associated with a system of M transmit antennas and constellation S. Then the symbol vector power group set is defined as

$$\Omega(M,S) \triangleq \left\{ P \in \mathbb{R} \;\middle|\; \exists\, \mathbf{s} \in S^M : \|\mathbf{s}\|^2 = \sum_{m=1}^{M} |s[m]|^2 = P \right\}, \qquad (4.8)$$

where $\mathbf{s} = [s[1], \ldots, s[m], \ldots, s[M]]^T$ and $s[m]$ is the symbol from the $m$-th transmit antenna.


Figure 4.1: QPL for 16QAM modulation

The cardinality of $\Omega(M,S)$ depends on the structure of the constellation $S$ as well as on the number of transmit antennas $M$. Note that

$$\Omega(1,S) = \Phi(S), \qquad (4.9)$$

and also that the elements of $\Omega(M,S)$ are composed of sums of $M$ elements from $\Phi(S)$. Generally, the cardinality of the set $\Omega(M,S)$ can be upper bounded by $|\Psi(M,S)|$, where $\Psi(M,S)$ is the set of permutations choosing $M$ values out of the $|\Omega(1,S)|$ possible values with repetition, that is,

$$|\Omega(M,S)| \le |\Omega(1,S)|^M \triangleq |\Psi(M,S)|. \qquad (4.10)$$

For example, consider the setting $M = 2$ and $S = $ 16-QAM. In this case, the possible power combinations are:

$$\Psi(2,S) = \big\{ \underbrace{(2+2)}_{4},\; \underbrace{(2+10)}_{12},\; \underbrace{(2+18)}_{20},\; \underbrace{(10+2)}_{12},\; \underbrace{(10+10)}_{20},\; \underbrace{(10+18)}_{28},\; \underbrace{(18+2)}_{20},\; \underbrace{(18+10)}_{28},\; \underbrace{(18+18)}_{36} \big\}, \qquad (4.11)$$

and its cardinality is equal to nine. The number of unique elements in the power group set $\Omega(2,S)$ is only five, that is,

$$\Omega(2,S) = \{4, 12, 20, 28, 36\}, \quad |\Omega(2,S)| = 5. \qquad (4.12)$$

The $\omega$-th element in $\Omega(M,S)$ is denoted by $\Omega_\omega(M,S)$, $\omega \in \{1, \ldots, |\Omega(M,S)|\}$, and is referred to as a power group, as it represents the power level of a group of transmitted symbol vectors.

Group representation for $M = 2$, $S = $ 16-QAM:

ω   Ω_ω(2,S)   G_ω(2,S)                        |G_ω(2,S)|
1   4          (2+2)                           1
2   12         (2+10), (10+2)                  2
3   20         (2+18), (10+10), (18+2)         3
4   28         (10+18), (18+10)                2
5   36         (18+18)                         1

Table 4.1: Relation between $\Omega(2,S)$, $\Phi(S)$ and $\mathcal{G}(2,S)$

Definition 4.3. The set of combinations of QPL elements that compose a power group $\Omega_\omega(M,S)$ is denoted by $\mathcal{G}_\omega(M,S)$, and each element of $\mathcal{G}_\omega(M,S)$ is denoted by $\mathcal{G}_\omega^\mu(M,S)$, where $\mu \in \{1, \ldots, |\mathcal{G}_\omega(M,S)|\}$.

The relation between $\Omega(2,S)$, $\Phi(S)$ and $\mathcal{G}(2,S)$ in our example is depicted in Table 4.1, where every row represents a power group. From Table 4.1, it is clear that the number of power groups in $\Omega(2,S)$ is five. For example, the element $\omega = 3$ with power group $\Omega_3(2,S) = 20$ corresponds to symbol vectors $\mathbf{s}$ such that $\|\mathbf{s}\|^2 = 20$; for this power group there are $|\mathcal{G}_3(2,S)| = 3$ combinations, namely $\mathcal{G}_3(2,S) = \{(2+18), (10+10), (18+2)\}$. It is clear that $|\Psi(M,S)| = \sum_{\omega=1}^{|\Omega(M,S)|} |\mathcal{G}_\omega(M,S)|$.

Now, using the above definitions, we introduce an approximate solution of the ML detection problem, based on replacing the discrete constraint in (4.2) with a continuous non-convex constraint set. Unlike the constraints used in the detectors reviewed in Section 4.3, our proposed detector uses a power equality constraint on the transmitted symbol vector. This relaxation of the ML problem is significantly tighter than the one in (4.4) and (4.5), where the relaxation is $\mathbf{s} \in \mathbb{C}^M$. However, since there are $|\Omega(M,S)|$ possible power groups, there will be $|\Omega(M,S)|$ constrained detectors, one for each power group. Therefore, we obtain the following $|\Omega(M,S)|$ independent optimization problems:

$$\hat{\mathbf{s}}_\omega = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2, \quad \text{s.t. } \|\mathbf{s}\|^2 = \Omega_\omega(M,S), \qquad (4.13)$$

where $\Omega_\omega(M,S)$ is the $\omega$-th element in $\Omega(M,S)$ and $\omega \in \{1, \ldots, |\Omega(M,S)|\}$. By solving (4.13) once for each power group, a set of $|\Omega(M,S)|$ soft candidates (SC) is assembled as

$$\hat{\mathbf{s}}^{SC} \triangleq \left\{ \hat{\mathbf{s}}^{SC}_\omega,\; \omega \in \{1, \ldots, |\Omega(M,S)|\} \right\}, \qquad (4.14)$$

and their corresponding squared Euclidean distances to the received signal $\mathbf{y}$ are defined as

$$\Delta^{SC} \triangleq \left\{ \epsilon_\omega = \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{SC}_\omega \right\|^2,\; \omega \in \{1, \ldots, |\Omega(M,S)|\} \right\}. \qquad (4.15)$$

The sets $\left\{ \hat{\mathbf{s}}^{SC}, \Delta^{SC} \right\}$ will serve as the starting point for our proposed algorithms in the following sections.
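The construction of $\Phi(S)$ and $\Omega(M,S)$ is purely combinatorial and easily automated. The following sketch (ours) reproduces the sets (4.7) and (4.12) for the running 16-QAM, $M = 2$ example:

```python
import numpy as np
from itertools import product

# Enumerating the QPL set (4.6) and the power groups (4.8) for M antennas
# and a square-QAM alphabet; reproduces (4.7) and (4.12).
pam = np.array([-3, -1, 1, 3])
S = np.array([a + 1j * b for a in pam for b in pam])        # 16-QAM points
qpl = sorted({int(round(abs(s) ** 2)) for s in S})          # Phi(S)
M = 2
groups = sorted({sum(c) for c in product(qpl, repeat=M)})   # Omega(M,S)
print(qpl, groups)   # -> [2, 10, 18] [4, 12, 20, 28, 36]
```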

4.4.2 Constraint LS Detection for a Specific Power Group

We now discuss the solution of (4.13) for a specific power group. Problem (4.13) constitutes LS estimation with the additional constraint that the candidate vectors lie on the hypersphere $\|\mathbf{s}\|^2 = \Omega_\omega(M,S)$. This problem is non-convex, since the relaxed set $\|\mathbf{s}\|^2 = \Omega_\omega(M,S)$ does not define a convex set. However, this seemingly intractable problem has been studied extensively [147], [158], [159] as part of the Trust Region Subproblem (TRS), and can be solved efficiently. In this work we use the hidden convexity methodology [147] to solve problem (4.13), as it caters for an efficient solution to the MIMO detection problem.

In general, a non-convex minimization problem is called a hidden convex minimization problem [147] if there exists an equivalent transformation such that the transformed minimization problem is convex. In [151], problem (4.13) was solved using the above mentioned methodology. We now present this solution.

Theorem 4.4. ([151]) The solution to

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2, \quad \text{s.t. } \|\mathbf{s}\|^2 = P, \qquad (4.16)$$

is

$$\hat{\mathbf{s}} = \left( \mathbf{H}^H \mathbf{H} + \eta \mathbf{I} \right)^{-1} \mathbf{H}^H \mathbf{y}, \qquad (4.17)$$

where $\eta \ge -\lambda_{\min}\left( \mathbf{H}^H \mathbf{H} \right)$ is the unique root of

$$\|\hat{\mathbf{s}}\|^2 = P. \qquad (4.18)$$

It is interesting to see that the solution (4.17) has a similar form to the linear MMSE receiver in (4.5), where the constant positive regularization term $\sigma_w^2 / E_s$ of the MMSE receiver is replaced by the symbol-dependent term $\eta$. We note that $\eta$ can be either negative (de-regularization) or positive (regularization). An appealing property of Theorem 4.4 is that this estimator does not require knowledge of the noise variance $\sigma_w^2$, which may not be known in practice.

We now discuss the implementation of Theorem 4.4. The only issue is the evaluation of the single parameter $\eta$. This can be done by trying different values of $\eta$ in (4.17) until the one that satisfies (4.18) is found. Since $\|\hat{\mathbf{s}}_\eta\|^2 = \left\| \left( \mathbf{H}^H \mathbf{H} + \eta \mathbf{I} \right)^{-1} \mathbf{H}^H \mathbf{y} \right\|^2$ is monotonically decreasing in $\eta \ge -\lambda_{\min}\left( \mathbf{H}^H \mathbf{H} \right)$, we can find the value of $\eta$ that satisfies (4.18) using a simple line search, such as a bisection search [160]. The search range is $\eta_{\text{left}} \le \eta \le \eta_{\text{right}}$, where $\eta_{\text{left}}$ can be chosen such that

$$\|\hat{\mathbf{s}}_{\eta_{\text{left}}}\|^2 \ge P \quad \text{and} \qquad (4.19a)$$
$$\eta_{\text{left}} \ge -\lambda_{\min}\left( \mathbf{H}^H \mathbf{H} \right), \qquad (4.19b)$$

and $\eta_{\text{right}}$ can be chosen such that

$$\|\hat{\mathbf{s}}_{\eta_{\text{right}}}\|^2 \le P. \qquad (4.20)$$

In Section 4.6, we derive ηleft and ηright that enable an efficient line search on a reduced range.
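A direct transcription of this line search is given below (ours). The bracket endpoints are crude placeholders consistent with (4.19)-(4.20), rather than the refined values of Section 4.6:

```python
import numpy as np

# Bisection search for the regularization parameter of (4.17)-(4.18):
# ||s(eta)||^2 is monotonically decreasing for eta > -lambda_min(H^H H),
# so interval halving converges to the root of ||s(eta)||^2 = P.
def solve_eta(H, y, P, tol=1e-9):
    lam_min = np.linalg.eigvalsh(H.conj().T @ H)[0]
    lo, hi = -lam_min + 1e-6, 1e6     # assumed bracket: ||s(lo)||^2 >= P >= ||s(hi)||^2

    def power(eta):
        M = H.shape[1]
        s = np.linalg.solve(H.conj().T @ H + eta * np.eye(M), H.conj().T @ y)
        return np.linalg.norm(s) ** 2, s

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        p, _ = power(mid)
        lo, hi = (mid, hi) if p > P else (lo, mid)
    return power(0.5 * (lo + hi))[1]  # soft candidate with ||s||^2 ~= P
```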

Since the constraint in (4.16) relates to the power levels of hyperspheres rather than to the signal constellation points, we call the solution in (4.17) a soft solution (or a soft candidate). After the soft solution is obtained, it should be properly rounded to the $M$-dimensional signal constellation of the corresponding power level, yielding a hard candidate. The details are given in the next section.

4.4.3 PEC-LS Detection with QAM Modulation

As stated previously, the soft solution for a specific power group can be computed efficiently. In a MIMO system with QAM modulation there are a number of power groups, each yielding a soft solution. Using the set of soft candidates $\hat{\mathbf{s}}^{SC}$ found in (4.14), a set of $|\Omega(M,S)|$ hard-decision candidates can be produced by

$$\hat{\mathbf{s}}^{HD}_\omega = \mathcal{Q}_\omega\left[ \hat{\mathbf{s}}^{SC}_\omega \right], \quad \omega \in \{1, \ldots, |\Omega(M,S)|\}, \qquad (4.21)$$

where $\mathcal{Q}_\omega[\cdot]$ denotes the hard-decision operation for the $\omega$-th element of $\Omega(M,S)$.

We now elaborate on the $\mathcal{Q}_\omega[\cdot]$ operation. Unlike the MMSE and ZF detectors, for which the quantization can be carried out element-wise, the quantization in the proposed detector needs to be carried out vector-wise. This is due to the additional constraint on the power of the transmitted vectors: the hard-decision solution derived from a soft solution must satisfy the power constraint of the group.

A direct implementation of (4.21) is cumbersome, since the quantization operation for a given power group involves choosing the combination of $M$ (hard-decision) symbols that has the minimum Euclidean distance to the soft solution, among all possible combinations of symbols that compose the power group. This can be formulated as:

$$\hat{\mathbf{s}}^{HD}_\omega = \arg\min_{\mathbf{s} : \|\mathbf{s}\|^2 = \Omega_\omega(M,S)} \left\| \hat{\mathbf{s}}^{SC}_\omega - \mathbf{s} \right\|^2 = \arg\min_{\mathbf{s} : \|\mathbf{s}\|^2 = \Omega_\omega(M,S)} \left\{ \sum_{m=1}^{M} \left| \hat{s}^{SC}_\omega[m] - s[m] \right|^2 \right\}, \qquad (4.22)$$

where $s[m]$ refers to the $m$-th element of $\mathbf{s}$. For example, for $M = 2$, $S = $ 16-QAM and $\omega = 3$, the constraint $\Omega_3(M,S) = 20$ can be satisfied by $16 + 64 + 16 = 96$ combinations. A direct implementation of (4.22) involves computing the squared Euclidean distances between 96 candidate vectors and the soft-decision vector of the power group, then comparing these distances and choosing the candidate vector closest to the soft candidate, yielding a hard-decision vector for that power group.

In order to reduce the complexity of the above hard-decision operation, we now present a method which performs the quantization operation on each group efficiently. In the proposed method, instead of generating a single hard-decision candidate for each power group, we produce a set of $|\mathcal{G}_\omega(M,S)|$ hard-decision candidates for the $\omega$-th power group. For each combination of QPL elements given by $\mathcal{G}_\omega(M,S)$, the transmitted signal power from every antenna is specified; therefore, the quantization operation in (4.21) can be implemented element-wise and we have

$$\min \left\| \hat{\mathbf{s}}^{SC}_\omega - \mathbf{s} \right\|^2 = \min \left\{ \sum_{m=1}^{M} \left| \hat{s}^{SC}_\omega[m] - s[m] \right|^2 \right\} = \sum_{m=1}^{M} \min \left| \hat{s}^{SC}_\omega[m] - s[m] \right|^2. \qquad (4.23)$$

Given a QPL, the hard decisions for the signal from each transmit antenna lie on a circle. As a result, the element-wise quantization can be carried out via a simple phase rounding, denoted by $\mathcal{Q}_{\mathcal{G}^\mu_\omega}[\cdot]$.

For each power group indexed by $\omega$, $|\mathcal{G}_\omega(M,S)|$ hard-decision candidates are generated using the phase rounding, and they form the hard-decision candidate set of this power group, denoted by $\left\{ \hat{\mathbf{s}}^{HD}_{\omega,1}, \hat{\mathbf{s}}^{HD}_{\omega,2}, \ldots, \hat{\mathbf{s}}^{HD}_{\omega,|\mathcal{G}_\omega(M,S)|} \right\}$. We then compute the squared Euclidean distances of the hard-decision candidates to the received signal $\mathbf{y}$, denoted by

$$\epsilon^{HD}_{\omega,i} = \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{HD}_{\omega,i} \right\|^2, \quad i \in \{1, \ldots, |\mathcal{G}_\omega(M,S)|\}. \qquad (4.24)$$

Then, the hard decision with the smallest squared Euclidean distance to the received signal is chosen as the hard decision of the power group. That is,

$$\hat{\mathbf{s}}^{HD}_\omega = \arg\min_{i \in \{1, \ldots, |\mathcal{G}_\omega(M,S)|\}} \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{HD}_{\omega,i} \right\|^2, \qquad (4.25)$$

and the corresponding hard distance is

$$\epsilon^{HD}_\omega = \min_{i \in \{1, \ldots, |\mathcal{G}_\omega(M,S)|\}} \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{HD}_{\omega,i} \right\|^2. \qquad (4.26)$$

For example, in the case $M = 2$, $S = $ 16-QAM, $\omega = 3$, $\Omega_3(M,S) = 20$, there are three possible combinations of QPL symbols ($\mathcal{G}_3(2,S) = \{(2+18), (10+10), (18+2)\}$). Accordingly, phase rounding is applied to the soft candidate $\hat{\mathbf{s}}^{SC}_3 = [\hat{s}^{SC}_3[1], \hat{s}^{SC}_3[2]]^T$ in three ways. In the first instance, $\hat{s}^{SC}_3[1]$ is rounded to a symbol with $|s|^2 = 2$ and $\hat{s}^{SC}_3[2]$ to a symbol with $|s|^2 = 18$; in the second, $\hat{s}^{SC}_3[1]$ is rounded to a symbol with $|s|^2 = 10$ and $\hat{s}^{SC}_3[2]$ to a symbol with $|s|^2 = 10$; in the third, $\hat{s}^{SC}_3[1]$ is rounded to a symbol with $|s|^2 = 18$ and $\hat{s}^{SC}_3[2]$ to a symbol with $|s|^2 = 2$. For this power group, the proposed method needs three squared-Euclidean-distance computations and three phase rounding operations, compared with 96 squared-Euclidean-distance computations in the direct implementation of (4.22). Therefore, the proposed method is less complex.
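The phase rounding operation itself is a one-line search over the relevant QPL circle; the sketch below (ours, for 16-QAM) makes this concrete:

```python
import numpy as np

# Phase rounding for (4.23): given the QPL assigned to an antenna, the hard
# decision is the constellation point on that power circle whose phase is
# closest to the phase of the soft value.
pam = np.array([-3, -1, 1, 3])
S = np.array([a + 1j * b for a in pam for b in pam])   # 16-QAM points

def phase_round(soft, power):
    ring = S[np.isclose(np.abs(S) ** 2, power)]        # points on the QPL circle
    return ring[np.argmin(np.abs(np.angle(ring / soft)))]

print(phase_round(0.9 + 2.8j, 10))   # -> (1+3j), the closest power-10 point
```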

Repeating the phase rounding and hard-decision operations for all power groups, we obtain a set of $|\Omega(M,S)|$ hard-decision candidates $\left\{ \hat{\mathbf{s}}^{HD}_1, \hat{\mathbf{s}}^{HD}_2, \ldots, \hat{\mathbf{s}}^{HD}_{|\Omega(M,S)|} \right\}$ and their corresponding squared Euclidean distances to the received vector $\mathbf{y}$, denoted by $\left\{ \epsilon^{HD}_1, \epsilon^{HD}_2, \ldots, \epsilon^{HD}_{|\Omega(M,S)|} \right\}$. Finally, the candidate with the minimal squared Euclidean distance over all power groups is chosen as the detection output, that is,

$$\hat{\mathbf{s}}_{HD} = \arg\min_{\omega \in \{1, \ldots, |\Omega(M,S)|\}} \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{HD}_\omega \right\|^2. \qquad (4.27)$$

We refer to the detector proposed above as the PEC-LS detector; the algorithm is depicted in Algorithm 12.

4.4.4 Ordered PEC-LS Detection with Reduced Number of Power Groups

In the previous section we showed that the solution of the PEC-LS detection is found by solving a series of optimization problems. For a system with a large number of transmit antennas and a high modulation level, the number of phase rounding and Euclidean distance computations grows exponentially with the number of antennas. In this subsection, we present an inherent property of the PEC-LS algorithm. Using this property, we can significantly reduce the detection complexity by discarding a number of power groups that are irrelevant in finding the detection output. We use the following Lemma to explain this inherent property of the PEC-LS algorithm; after that, we propose an Ordered PEC-LS (OPEC) algorithm.

Algorithm 12 PEC-LS detector
Input: y, H, S, M
Output: ŝ_HD
1: Compose the Quantum Power Level set Φ(S) according to (4.6).
2: Compose the Symbol Vector Power Groups Ω(M,S) according to (4.8).
3: for ω = 1 to |Ω(M,S)| do
4:   ŝ_ω^SC = arg min_s ||y − Hs||², s.t. ||s||² = Ω_ω(M,S)
5:   for i = 1 to |G_ω(M,S)| do
6:     ŝ_{ω,i}^HD = Q_{G_ω^i}[ŝ_ω^SC]
7:   end for
8:   ŝ_ω^HD = arg min_{i ∈ {1,...,|G_ω(M,S)|}} ||y − H ŝ_{ω,i}^HD||²
9: end for
10: ŝ_HD = arg min_{ω ∈ {1,...,|Ω(M,S)|}} ||y − H ŝ_ω^HD||²

Lemma 4.5. Let $\epsilon^{HD}_j \triangleq \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{HD}_j \right\|^2$ be the squared distance of the hard-decision solution of the $j$-th power group, $j \in \{1, \ldots, |\Omega(M,S)|\}$. Given another power group indexed by $k$, $k \in \{1, \ldots, |\Omega(M,S)|\}$, if the soft distance satisfies

$$\left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{SC}_k \right\|^2 \ge \epsilon^{HD}_j, \qquad (4.28)$$

then the $k$-th group cannot produce a hard-decision candidate such that $\epsilon^{HD}_k \le \epsilon^{HD}_j$.

Proof. By virtue of Theorem 4.4, the soft value $\hat{\mathbf{s}}^{SC}_k$ minimizes $\left\| \mathbf{y} - \mathbf{H}\mathbf{s} \right\|^2$ over its power group. Therefore, any hard solution $\hat{\mathbf{s}}^{HD}_k$ obtained from $\hat{\mathbf{s}}^{SC}_k$ has a greater (or equal) squared distance, and we have

$$\epsilon^{HD}_k = \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{HD}_k \right\|^2 \ge \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{SC}_k \right\|^2 \ge \left\| \mathbf{y} - \mathbf{H}\hat{\mathbf{s}}^{HD}_j \right\|^2 = \epsilon^{HD}_j, \qquad (4.29)$$

and the Lemma is proved.

Based on the result of Lemma 4.5, we can now develop the OPEC algorithm, which yields the same solution as the PEC-LS algorithm but with a reduced complexity. We recall the set of |Ω(M,S)| soft detections in (4.14):

ŝ^SC = { ŝ_ω^SC }, ω ∈ {1, ..., |Ω(M,S)|}, (4.30)

associated with |Ω(M,S)| soft Euclidean distances

∆^SC = { ǫ_ω = ‖y − Hŝ_ω^SC‖² }, ω ∈ {1, ..., |Ω(M,S)|}. (4.31)

In the detection, we first sort the |Ω(M,S)| power groups according to their corresponding soft Euclidean distances in ascending order. Then we find the hard-decision solution according to the sorted groups. Suppose that quantization and selection of the best hard-decision candidate, as discussed in Section 4.4.3, has been performed on the i-th power group, i ∈ {1, ..., |Ω(M,S)|}. Its hard squared distance ǫ_i is compared with the soft squared distances of the remaining groups (ǫ_j, i < j ≤ |Ω(M,S)|). If any of the remaining power groups' soft error is greater than the current hard-decision squared distance, all the power groups from that group onwards can be omitted from the candidate list, since they cannot generate a hard decision which is closer to the received vector than the current hard decision. Note that this algorithm is deterministically equivalent to the PEC-LS detector in terms of detected symbols and bears no degradation in terms of BER. We refer to this algorithm as the OPEC detector; it is depicted in Algorithm 13. In Section 4.7, we will assess the complexity reduction achieved by this algorithm relative to the previously presented PEC-LS algorithm.

Algorithm 13 OPEC based detector for MIMO systems
Input: y, H, S, M
Output: ŝ_HD
1: Initialize: ω = 1, ǫ_HD = ∞.
2: Compose the Quantum Power Levels set QPL(M,S) according to (4.6).
3: Compose the Symbol Vector Power Groups Ω(M,S) according to (4.8).
4: for ω = 1 to |Ω(M,S)| do
5:   ŝ_ω^SC = arg min_s ‖y − Hs‖², s.t. ‖s‖² = Ω_ω(M,S)
6:   ǫ_ω = ‖y − Hŝ_ω^SC‖²
7: end for
8: [∆^sorted, index] = sort(∆^SC) in ascending order
9: ŝ_ω^sorted = ŝ_{index(ω)}^SC, ω ∈ {1, ..., |Ω(M,S)|}
10: for ω = 1 to |Ω(M,S)| do
11:   if ǫ_HD ≤ ǫ_ω^sorted then
12:     return
13:   else
14:     for µ = 1 to |G_{index(ω)}(M,S)| do
15:       ŝ_µ^HD = Q_{G_{index(ω)}}^µ(ŝ_ω^sorted)
16:       if ‖y − Hŝ_µ^HD‖² ≤ ǫ_HD then
17:         ŝ_HD = ŝ_µ^HD
18:         ǫ_HD = ‖y − Hŝ_µ^HD‖²
19:       end if
20:     end for
21:   end if
22: end for
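The early-termination logic of Algorithm 13 can be summarized by the following Python skeleton. It is a sketch only: soft_solve and quantize stand for the constrained LS solver of Section 4.6 and the phase-rounding operator Q, respectively, and are assumed to be supplied by the caller.

import numpy as np

def opec_detect(y, H, groups, soft_solve, quantize):
    # groups: dict mapping vector power -> list of QPL combinations (see sketch above)
    # soft_solve(p): constrained LS soft solution for power p (Algorithm 14)
    # quantize(s, combo): phase-round soft vector s onto QPL combination combo
    soft = {p: soft_solve(p) for p in groups}
    soft_err = {p: np.linalg.norm(y - H @ soft[p]) ** 2 for p in groups}
    best, eps_hd = None, np.inf
    for p in sorted(groups, key=soft_err.get):   # ascending soft Euclidean distance
        if eps_hd <= soft_err[p]:
            break                                # Lemma 4.5: prune remaining groups
        for combo in groups[p]:
            cand = quantize(soft[p], combo)
            eps = np.linalg.norm(y - H @ cand) ** 2
            if eps <= eps_hd:
                best, eps_hd = cand, eps
    return best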

4.5 Improved Ordered Power Equality Constraint Detection

In this Section we propose an improved detection scheme, based on the OPEC detector. From the previous Section, we see that there are primarily two steps: in the first step we find the soft solution; in the second step, based on the soft solution, we find the hard-decision candidate for each power group according to (4.22), such that ‖ŝ_ω^SC − ŝ_ω^HD‖² is minimized. However, this hard-decision solution may not minimize ‖y − Hŝ_ω^HD‖² in that power group, which is the hard squared distance to the received signal.

Motivated by this observation, we present an approach to improve our proposed OPEC algorithm, in which a local search around each soft candidate ŝ_ω^SC is incorporated. The basic idea of the method is that, in performing the hard decision by phase rounding each element, we choose the two constellation points that have the smallest phase differences to the soft solution. This differs from the previously proposed algorithm, where we choose only the single point with the smallest phase difference. This can be seen as a local search algorithm [161] around the soft decision ŝ_ω^SC.

The starting point of the improved detector is the same as that of the OPEC detector, which obtains |Ω(M,S)| soft solutions as in (4.14). To find the two points with the smallest phase differences from the soft solution, we define the neighborhood from which the two points will be chosen. Since we are interested in a simple phase rounding operation, the neighborhood is naturally defined under a phase criterion. We define the neighborhood N(d) of d, where d is an element of a soft candidate, as

N(d) = { (s_i, s_k) ∈ S′, i ≠ k | max{ |∠d − ∠s_i|, |∠d − ∠s_k| } < |∠d − ∠s_j| ∀ j ∈ {1, ..., D′}, j ≠ i, k }, (4.32)

where S′ is a subgroup of S with size D′ which corresponds to a QPL group.

In summary, the procedure for the improved detection is similar to the OPEC algorithm. The distinction is that instead of creating one hard-decision candidate for each power group, as in the OPEC algorithm, 2^M hard-decision candidates are created in the improved OPEC algorithm. For example, in the case of M = 2, S = 16 QAM we now have up to 4 × |Ψ(2,S)| = 36 candidates instead of |Ψ(2,S)| = 9, as in the PEC-LS algorithm. In the ordered algorithm, the number of hard-decision candidates is much smaller than in the one without ordering.

In order to create the neighborhood N(d) for the Improved Ordered PEC-LS, modified decision boundaries are required. By investigating the structure of the 16 QAM signal, we observe two different types of decision boundaries. The first type serves the elements of Ω(M,S) that are composed of power levels Φ₁(S) and Φ₃(S), as these contain only four constellation points with the same arguments. The second type is used for the elements of Ω(M,S) that are composed of power level Φ₂(S), as there are eight constellation points with the same arguments (see Figure 4.1).

Now, we describe a general method for generating the decision boundaries for the neighborhood N(d) in (4.32). The decision rule for the decision region is based on phase distance. Each decision region contains two adjacent constellation points and is fully described by a start phase α and an end phase β. In order to construct the decision region boundaries, the following optimization problem needs to be solved:

α = ½ { ∠(s₂) + ∠( arg min_{x ∈ S, x ≠ s₁} [ (∠(s₁) − ∠(x)) Mod (2π) ] ) }, (4.33a)
β = ½ { ∠(s₁) + ∠( arg min_{x ∈ S, x ≠ s₂} [ (∠(x) − ∠(s₂)) Mod (2π) ] ) }, (4.33b)

where (x) Mod (2π) is the modulus of x with respect to 2π, and the two points are ordered such that ∠s₁ < ∠s₂. The resulting values α and β represent the right and left region boundaries for the combination of s₁ and s₂, respectively. The two types of decision boundaries are illustrated in Figures 4.2 and 4.3, and are given in Tables 4.2 and 4.3.
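As a numerical sanity check (ours, not part of the original text), the Φ₂ boundaries of Table 4.3 can be reproduced by noting that the boundary between two adjacent candidate pairs is the phase equidistant from the two non-shared points:

import numpy as np

# Constellation points of power level Phi_2 (|s|^2 = 10), in increasing phase order
pts = np.array([3+1j, 1+3j, -1+3j, -3+1j, -3-1j, -1-3j, 1-3j, 3-1j])
ph = np.sort(np.angle(pts) % (2 * np.pi))
prev, nxt = np.roll(ph, 1), np.roll(ph, -1)
prev[0] -= 2 * np.pi                 # unwrap the circular neighbours across phase 0
nxt[-1] += 2 * np.pi
bounds = ((prev + nxt) / 2) % (2 * np.pi)
print(np.round(np.sort(bounds) / np.pi, 4))
# -> [0.1476 0.3524 0.6476 0.8524 1.1476 1.3524 1.6476 1.8524], as in Table 4.3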

Once the hard candidates for each power combination, as specified in (4.11), have been chosen, the one with the smallest squared distance is selected as the hard candidate for that power combination. Next, by using (4.27), a hard candidate is selected as the final solution. We refer to this detector as the IOPEC detector.

Figure 4.2: Decision Boundaries for Φ₁ and Φ₃

Figure 4.3: Decision Boundaries for Φ₂

Region | Phase of s | Candidates
Region 1 | π/4 ≤ ∠(s) ≤ 3π/4 | (1+i), (−1+i)
Region 2 | 3π/4 ≤ ∠(s) ≤ 5π/4 | (−1+i), (−1−i)
Region 3 | 5π/4 ≤ ∠(s) ≤ 7π/4 | (−1−i), (1−i)
Region 4 | 7π/4 ≤ ∠(s) ≤ π/4 | (1−i), (1+i)

Table 4.2: Decision Boundaries for Φ₁ and Φ₃ as depicted in Figure 4.2

4.6 Efficient Implementation and Complexity Analysis

In this Section we discuss the efficient implementation of the above algorithms and analyze their complexity. In the implementation of the proposed detectors we separate the necessary computations into two groups:

• Pre-Processing phase: contains operations that are common to all power groups Ω(M,S) and depend only on the current observation vector y.

• Processing phase: contains operations that are specific to each power group Ω_ω(M,S).

Region | Phase of s | Candidates
Region 1 | 1.8524π ≤ ∠(s) ≤ 0.1476π | (3−i), (3+i)
Region 2 | 0.1476π ≤ ∠(s) ≤ 0.3524π | (3+i), (1+3i)
Region 3 | 0.3524π ≤ ∠(s) ≤ 0.6476π | (1+3i), (−1+3i)
Region 4 | 0.6476π ≤ ∠(s) ≤ 0.8524π | (−1+3i), (−3+i)
Region 5 | 0.8524π ≤ ∠(s) ≤ 1.1476π | (−3+i), (−3−i)
Region 6 | 1.1476π ≤ ∠(s) ≤ 1.3524π | (−3−i), (−1−3i)
Region 7 | 1.3524π ≤ ∠(s) ≤ 1.6476π | (−1−3i), (1−3i)
Region 8 | 1.6476π ≤ ∠(s) ≤ 1.8524π | (1−3i), (3−i)

Table 4.3: Decision Boundaries for Φ₂ as depicted in Figure 4.3

The overall complexity can be expressed as:

C = C_PP + C_P, (4.34)

where C_PP and C_P are the complexities of the Pre-Processing and Processing phases, respectively.

4.6.1 Efficient Implementation of Constrained LS Detector

In order to implement the proposed detectors, the set of soft candidates (equation (4.13)) needs to be assembled first. We begin by presenting an efficient way to obtain the solutions of the set of optimization problems in (4.13).

As mentioned in Section 4.4, the constrained least squares in (4.13) of every power group can be implemented using a simple line search, such as bi-section search. At each iteration of the line search, a new value

‖s_η‖² = ‖( H^H H + ηI )⁻¹ H^H y‖², (4.35)

which is a function of η, needs to be evaluated, until the value that satisfies the equality ‖s_{η_opt}‖² = Ω_ω(M,S) is found. The operation in (4.35) can be evaluated efficiently using the following equality

( H^H H + ηI )⁻¹ H^H = V Λ_η U^T, (4.36)

where [U, Q, V] = SVD(H) is the Singular Value Decomposition (SVD) of H, U and V are unitary matrices and Q is a diagonal matrix containing the singular values λ_m, m = 1, ..., M. The matrix Λ_η of dimensions M × N (assume M ≤ N) is defined by

Λ_η = [ diag( λ₁/(λ₁² + η), λ₂/(λ₂² + η), ..., λ_M/(λ_M² + η) ) | 0_{M×(N−M)} ], (4.37)

i.e., an M × N matrix whose (m, m) entries are λ_m/(λ_m² + η) and whose remaining entries are zero. Then (4.35) can be expressed as

‖s_η‖² = ‖V Λ_η U^T y‖² = ‖Λ_η U^T y‖² = ‖Λ_η r‖², (4.38)

where the second equality holds because V is a unitary matrix, and r ≜ U^T y is a common operation among all groups which can be implemented in the Pre-Processing phase. The SVD operation is common to all power groups, depends only on the channel realization H, and is also part of the Pre-Processing phase. The implementation of (4.13), (4.14) and (4.15) for the ω-th power group is presented in Algorithm 14.

Having obtained an efficient implementation of each iteration of the line search, we now assess the number of iterations needed for the line search to converge to a desired ending tolerance δ. The bi-section line search has a linear convergence rate. Defining the size of the initial bracketing interval of the ω-th power group R_ω as

R_ω = η_right − η_left, (4.39)

where η_left ≥ −λ_min(H^H H), according to Theorem 4.4. In order to reduce the number of iterations of the line search, we derive tight η_left and η_right, such that ‖s_{η_left}‖² ≥ Ω_ω(M,S) ≥ ‖s_{η_right}‖².

To this end we recall that ‖s_η‖² = ‖( H^H H + ηI )⁻¹ H^H y‖² is monotonically decreasing in η, where η ≥ −λ_min(H^H H), and that

‖s_η‖² = ‖Λ_η r‖² = Σ_{m=1}^{M} ( λ_m / (λ_m² + η) )² |r[m]|². (4.40)

1. Deriving η_left: Relying on (4.17) and (4.40), we define η_left ≜ −λ_min(H^H H) + θ, θ ≥ 0. Then we have

‖s_{η_left}‖² = Σ_{m=1}^{M} ( λ_m / (λ_m² − λ_min(H^H H) + θ) )² |r[m]|², (4.41)

where r[m] is the m-th element of r. In order to derive η_left, we require that

Σ_{m=1}^{M} ( λ_m / (λ_m² − λ_min(H^H H) + θ) )² |r[m]|² ≥ Ω_ω(M,S). (4.42)

Also, we note that

λ_m²(H) = λ_m(H^H H), m ∈ {1, ..., M}, (4.43)

and defining λ_k ≜ min{λ_m}, m ∈ {1, ..., M}, we have

‖s_{η_left}‖² ≥ (λ_k² / θ²) |r[k]|². (4.44)

Therefore, as long as (λ_k²/θ²)|r[k]|² ≥ Ω_ω(M,S), we have ‖s_{η_left}‖² ≥ Ω_ω(M,S), which calls for

θ ≤ ( λ_k / √(Ω_ω(M,S)) ) |r[k]|. (4.45)

Finally we obtain

η_left ≜ −λ_min(H^H H) + θ = −λ_k² + ( λ_k / √(Ω_ω(M,S)) ) |r[k]|. (4.46)

2. Deriving η_right: We now derive the value of η_right, where we assume η_right ≥ 0. Using (4.40), we have

‖s_η‖² ≤ Σ_{m=1}^{M} ( λ_m / η )² |r[m]|², (4.47)

and in particular we require that

(1/η_right²) Σ_{m=1}^{M} λ_m² |r[m]|² ≤ Ω_ω(M,S). (4.48)

Then we obtain

η_right = √( (1/Ω_ω(M,S)) Σ_{m=1}^{M} λ_m² |r[m]|² ). (4.49)

3. Line Search Interval Rω:

The size of the initial bracketing interval of the ω-th power group R_ω can be expressed as

R_ω = η_right − η_left = √( (1/Ω_ω(M,S)) Σ_{m=1}^{M} λ_m² |r[m]|² ) + λ_k² − ( λ_k / √(Ω_ω(M,S)) ) |r[k]|. (4.50)

The number of iterations required, assuming that R_ω ≥ δ ≥ 0, is ITR_ω = log₂(R_ω/δ) [160]. This value is a random number since it depends on H and y, and it differs for each power group. However, it does not depend on the SNR, which means that the complexity of the line search is constant with respect to the SNR. The average number of iterations ITR_ω will be evaluated via simulations in the next Section.

4.6.2 Overall Complexity

We now discuss the overall complexity of the OPEC detector and compare it with that of the MMSE detector. The complexity of the algorithms is evaluated in terms of the number of complex multiplications (CM) and the number of complex additions (CA) required to carry them out. The complexity of the different elements of the OPEC detector is given in Table 4.4. Recalling (4.34), the approximate overall complexity is

C = C_PP + C_P
  ≈ CM( 4NM² + 8M³ + N² ) + CA( 2NM² + 4M³ + N² )    [Pre-Processing, C_PP]
  + CM( 2M Σ_{ω=1}^{|Ω(M,S)|} ITR_ω + |Ω(M,S)| (M² + NM + 2N) + f(SNR)(NM + 2N) )
  + CA( (M − 1) Σ_{ω=1}^{|Ω(M,S)|} ITR_ω + |Ω(M,S)| (M(M − 1) + NM + N − 1) + f(SNR)(NM + 2N) ), (4.51)

where f(SNR) denotes the number of power groups visited as a function of the SNR.

In comparison, the overall complexity of the MMSE detector in (4.5) can be approximated as

C_MMSE ≈ CM( M³ + 2M²N + MN ) + CA( M³ − M² + M²N + MN + 2M ). (4.52)

The complexity introduced by the proposed OPEC algorithm lies primarily in two parts. The first part is the computation of the bi-section line searches for obtaining the set of soft candidates ŝ_ω^SC, ω ∈ {1, ..., |Ω(M,S)|}. In particular, each line search requires a number of iterations to converge, denoted by ITR_ω. With the properly defined search range given by η_left and η_right, the number of iterations in the line search is usually small (less than 10 for M = N = 2 and less than 13 for M = N = 4). Therefore, the complexity of this part is not significant.

The second part lies in computing the hard-decision Euclidean distances for the QPL combinations under consideration. By employing the OPEC algorithm, the number of relevant power groups, denoted by f(SNR), decreases rapidly with the SNR. As shown in the next Section, at a high SNR f(SNR) is around 1; therefore, the complexity of this part is small. From the above, we see that the MMSE and the proposed detection schemes have the same order of complexity in terms of the number of antennas.

Algorithm 14 Constrained LS soft detection of the ω-th power group
Input: H, y, Ω_ω(M,S), η_left, η_right, Λ, r ≜ U^T y
Output: ŝ, ǫ
1: η_L = η_left
2: η_R = η_right
3: repeat
4:   η = (η_L + η_R)/2
5:   ‖s‖² = Σ_{m=1}^{M} ( λ_m/(λ_m² + η) )² |r[m]|²
6:   δ̃ = ‖s‖² − Ω_ω(M,S)
7:   if δ̃ > 0 then
8:     η_L = η
9:   else
10:    η_R = η
11:  end if
12: until |δ̃| ≤ δ
13: ŝ^SC = V Λ_η r
14: ǫ^SC ≜ ‖y − Hŝ^SC‖²
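A compact Python sketch of Algorithm 14, combining the SVD identity (4.36)-(4.38) with the bracketing bounds (4.46) and (4.49), is given below. It assumes N ≥ M and complex-valued inputs; the function name and the tolerance defaults are illustrative choices of ours.

import numpy as np

def constrained_ls_soft(H, y, power, tol=1e-2, max_iter=100):
    # Pre-processing (shared by all power groups): SVD of H and r = U^T y
    U, lam, Vh = np.linalg.svd(H, full_matrices=False)   # H = U @ diag(lam) @ Vh
    r = U.conj().T @ y
    norm2 = lambda eta: np.sum((lam / (lam ** 2 + eta)) ** 2 * np.abs(r) ** 2)
    # Bracketing interval from (4.46) and (4.49)
    k = np.argmin(lam)
    eta_l = -lam[k] ** 2 + lam[k] * np.abs(r[k]) / np.sqrt(power)
    eta_r = np.sqrt(np.sum(lam ** 2 * np.abs(r) ** 2) / power)
    for _ in range(max_iter):                            # bi-section line search
        eta = 0.5 * (eta_l + eta_r)
        d = norm2(eta) - power
        if abs(d) <= tol:
            break
        if d > 0:
            eta_l = eta          # ||s_eta||^2 too large: increase eta
        else:
            eta_r = eta
    s = Vh.conj().T @ ((lam / (lam ** 2 + eta)) * r)     # s = V Lambda_eta r
    return s, np.linalg.norm(y - H @ s) ** 2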

4.7 Simulation Results

In this Section, we evaluate the performance of the proposed detectors via Monte Carlo simulations. First, we describe the simulation setup.

Operation | Implementation category | Number of repetitions | Complex multiplications | Complex additions
SVD(H) | Pre-Processing | 1 | 4NM² + 8M³ [162] | 4NM² + 8M³ [162]
r ≜ U^T y | Pre-Processing | 1 | N² | N²
a ≜ Λ_η r | Processing | Σ_{ω=1}^{|Ω(M,S)|} ITR_ω | M | 0
‖s‖² = a^T a | Processing | Σ_{ω=1}^{|Ω(M,S)|} ITR_ω | M | M − 1
ŝ^SC = V a | Processing | |Ω(M,S)| | M² | M(M − 1)
ǫ^SC ≜ ‖y − Hŝ^SC‖² | Processing | |Ω(M,S)| | NM + 2N | NM + N − 1
ǫ^HD ≜ ‖y − Hŝ^HD‖² | Processing | f(SNR) | NM + 2N | NM + N − 1

Table 4.4: Complexity of PEC-LS detector components divided into Pre-Processing and Processing categories

4.7.1 System Configuration

In the simulations, we assume that the elements of the channel matrix H are independent and identically distributed complex Gaussian random variables with zero mean and unit variance. We consider an ergodic channel, in which the fading coefficients for different symbol durations are uncorrelated. The additive noise is spatially and temporally white complex Gaussian with zero mean and variance σ_w². In the simulations, we consider MIMO systems with 16 QAM and 64 QAM. The signal bit energy to noise power spectral density ratio is defined by

E_b/N_o = E_s N / ( 2σ_w² M log₂ D ), (4.53)

where D is the total number of points in the QAM constellation. The ending tolerance for the line search was set to δ = 10⁻². We found that smaller values of δ did not yield any improvement in terms of BER performance.

4.7.2 Discussion of Simulation Results

In all the simulations, we use the MMSE detector as our benchmark detector. In Figure 4.4, we show the BER for a MIMO system with 16 QAM and M = N = 2, where the MMSE detector, the OPEC detector and the IOPEC detector are considered.

Figure 4.4: BER performance of a MIMO system with 16QAM, M = N = 2

At a BER of 10⁻⁴, we observe that the proposed OPEC detector performs about 3 dB better than the benchmark MMSE detector. Under the same conditions, the IOPEC detector is more than 8 dB better than the MMSE detector. Figure 4.5 shows the BER results for a MIMO system with 16 QAM and M = N = 4. At a BER of 10⁻⁴, we see that the proposed OPEC detector outperforms the benchmark MMSE detector by 2.5 dB, while the IOPEC detector outperforms the MMSE detector by 8 dB. We now extend our simulations to scenarios with 64 QAM, where the spectral efficiency is 6M bits/sec/Hz. In Figure 4.6, we show the BER for a MIMO system with 64 QAM and M = N = 2. Again, we consider the MMSE, OPEC and IOPEC detectors. At a BER of 10⁻⁴, we observe that the proposed OPEC detector performs about 4.5 dB better than the benchmark MMSE detector. Under the same conditions, the IOPEC detector is about 10 dB better than the MMSE detector. Compared to the scenario with 16 QAM, a larger performance improvement is achieved by the proposed detectors. In Figure 4.7, we show the BER for a MIMO system with 64 QAM and M = 2, N = 4. At a BER of 10⁻⁴, the proposed OPEC and IOPEC detectors outperform the MMSE detector by about 4 dB and 9 dB, respectively.

Figure 4.5: BER performance of a MIMO system with 16QAM, M = N = 4

Figure 4.6: BER performance of a MIMO system with 64QAM, M = N = 2

Figure 4.7: BER performance of a MIMO system with 64QAM, M = 2, N = 4

4.7.3 Complexity Assessment

In this Section we present simulation results regarding the overall complexity of the proposed algorithms. First, we present the complexity reduction that can be achieved using the OPEC detector in comparison to the PEC-LS detector. Next we present the number of iterations needed for the line search to converge.

Figure 4.8 presents the number of power groups visited as a function of the SNR in order to generate a hard-decision solution. Here, we consider a MIMO system with 16 QAM. The upper plot depicts the number of groups visited for the case of M = N = 2 out of all five power groups, and the lower plot for M = N = 4 out of all nine power groups. As the figure shows, the reduction is significant and the number of groups visited decreases as the SNR increases. At a high SNR, just over one group on average had to be visited in all cases. Next, we examine the number of iterations required by the line search to converge to the desired ending tolerance δ = 10⁻² with 16 QAM. Table 4.5 depicts the average number of iterations for MIMO systems with M = N = 2 and M = N = 4. As expected, the number of iterations does not depend on the value of the SNR.

Figure 4.8: Average number of groups visited; upper figure M = N = 2 (out of 5 groups), lower figure M = N = 4 (out of 9 groups)

2×2 MIMO system:
Power | Average Number of Iterations
4 | 4.7
12 | 7.3
20 | 8.8
28 | 9.8
36 | 10.5

4×4 MIMO system:
Power | Average Number of Iterations
8 | 5.7
16 | 7.2
24 | 8.4
32 | 9.4
40 | 10.3
48 | 11.1
56 | 11.8
64 | 12.3
72 | 13.0

Table 4.5: Number of iterations in the line search as a function of the power group, for 2×2 (left) and 4×4 (right) MIMO systems

4.8 Chapter Summary and Conclusions

In this chapter, novel algorithms for symbol detection in MIMO systems with high-level modulation constellations were presented. First, we introduced the concept of the proposed method and derived the basic PEC-LS algorithm. Next we presented the OPEC algorithm, which significantly reduces the complexity of the PEC-LS algorithm without any performance degradation. Based on that, we presented the IOPEC detection algorithm, which incorporates a local search around the constrained least squares solutions. It was shown that the proposed detectors significantly outperform the conventional MMSE detector for MIMO systems with high-level modulations in terms of BER. In contrast to the MMSE detector, the proposed schemes do not assume any knowledge of the noise variance, which may not be known in practice. We also presented efficient ways to reduce the overall complexity of the proposed schemes. It was shown that the overall complexity of the proposed detectors is of the same order as that of the MMSE detector, which makes them suitable for real-time applications.

Chapter 5

Detection of Gaussian Constellations in MIMO Systems under Imperfect CSI

“The world is full of magical things patiently waiting for our wits to grow sharper.”

Bertrand Russell

5.1 Introduction

In this chapter we present algorithms for detection of symbols in MIMO systems, where the symbols are taken from Gaussian-like constellations and in the presence of channel estimation errors (imperfect Channel State Information (CSI)).

Under this framework we develop computationally efficient approximations of the MAP detector. The detectors are based on a relaxation of the discrete nature of the digital constellation and on the channel estimation error statistics. This results in a non-convex optimisation problem.

First, we solve this optimisation problem using the hidden convexity optimisation approach. The solution involves two nested line searches that can be evaluated efficiently.

Next, we develop an alternative approach for solving the optimisation problem using the Bayesian Expectation Maximisation (BEM) approach. Although the BEM method is suboptimal, we demonstrate via simulations that comparable BER performance is achieved.

In addition, we extend the proposed detection scheme to the case where the noise variance is unknown a priori. We present a modified BEM approach with an annealed Gibbs sampling

optimisation technique to perform joint noise variance estimation and symbol detection. Simulation results for a random MIMO system show that the proposed algorithms outperform the LMMSE receiver in terms of BER.

The main contributions presented in this chapter are as follows:

• We present two novel MAP detection schemes for (discrete) Gaussian constellations in MIMO systems under imperfect CSI.

• We extend the detection scheme to the case where the noise variance is unknown a priori. This is achieved by using a stochastic optimisation solution based on the concept of the annealed Gibbs sampling technique.

• We present low complexity implementations of the proposed algorithms.

• We analyse different aspects of the proposed algorithms, such as complexity and convergence rate.

5.2 Background

MIMO systems can yield vast capacity increases when a rich scattering environment is properly exploited [2], [3]. A MIMO system employs multiple antennas at both the transmitter and the receiver, and its capacity increases linearly with the minimum of the number of transmit and receive antennas. However, in practice the channel conditions must be estimated, since perfect channel knowledge is never available a priori. Typically, this is performed in two distinct stages. The transmitter first sends a known sequence of symbols, which allows channel estimation to be performed at the receiver. Next, the information symbol sequence is transmitted, and sequence detection is performed at the receiver using the estimated channel. In practice, the channel estimation procedure can be aided by transmitting pilot symbols that are known at the receiver. System performance depends on the quality of the channel estimate and the number of pilot symbols, as shown in [163]. It is desirable to minimize the number of transmitted pilot symbols in order to maximize spectral efficiency.

We consider the case in which one wishes to transmit symbols designed to achieve the maximum throughput of a channel. In the linear Gaussian channel model that we consider, it is well known that in order to achieve capacity, powerful coding schemes must be combined with shaping methods which result in a near-Gaussian distribution of the symbols' amplitudes [164], [165]. Two practical schemes that obtain shaping gain are "trellis shaping" [166] and "shell mapping" [167]. Though theory states that the signal points should be chosen from a continuous Gaussian distribution, in practice the constellations are finite, which means that the optimal gain cannot be achieved. In performing constellation shaping, we note that approximating the optimal

Gaussian with a discrete distribution can be achieved in many ways. We focus here on the Maxwell-Boltzmann (MB) distribution [168].

The optimal detector for digital constellations can be implemented by using a brute-force ap- proach which searches over all symbol possibilities. Typically this is impractical due to the massive computational burden it presents. Several alternative suboptimal, but computationally more tractable, receivers have been considered in the literature e.g. the sphere detector [169], semidefinite relaxation [149] and VBLAST [170]. The most common class of suboptimal detec- tors is the class of linear detectors, i.e. the MF, the decorrelator or ZF, and the MMSE detectors [155].

In this work we focus on the MMSE and the MAP detectors. A common practice in detection schemes is to consider a relaxation of the discrete problem. The relaxation leads to a continuous optimization problem which, although suboptimal, is in many cases tractable and simpler to solve. The majority of the literature concentrates on the basic case, in which it is assumed that the channel matrix is completely specified [155]. In this setting, the MMSE, LMMSE and MAP estimators coincide and have a simple closed form solution with identical detection performance. In contrast, when a noisy channel estimate is considered, the MMSE, LMMSE and MAP approaches lead to different estimators. In fact, we will show that the solution of the MMSE leads to an intractable integration, whereas the MAP estimator can be found efficiently.

5.3 System Description

Consider a flat fading MIMO communication system of M transmit and N receive antennas. The data stream is multiplexed to M data substreams and transmitted by M transmit antennas simultaneously. The baseband equivalent model of the received signal vector at the instant of sampling can be represented by

y = Hx + w, (5.1)

where y ∈ C^{N×1} is the received vector, x ∈ C^{M×1} is the transmitted signal vector, and w ∈ C^{N×1} represents the additive noise vector, modeled as a circularly symmetric Gaussian distributed random vector with zero mean and covariance matrix E{ww^H} = σ_w² I. The (i,j)-th component of the channel matrix H ∈ C^{N×M} represents the channel gain between the j-th transmit and the i-th receive antenna. The MIMO system is depicted in Figure 5.1. The input symbol vector x is taken from a (discrete) finite Gaussian distributed signal set D with E{|x_i|²} = σ_x², i ∈ [1, ..., M]. This choice of signal distribution has the property that it reduces the average transmitted power. The notion of (discrete) Gaussian symbols is explained below.

Figure 5.1: MIMO system model

A continuous Gaussian distribution is quantized into M states of width ∆x, and the associated discrete probability of the i-th state is given by

Pr(x_i) = ( 1/√(2πσ_x²) ) ∫_{(i−1)∆x}^{i∆x} exp( −x²/(2σ_x²) ) dx, (5.2)

as depicted in Figure 5.2. To achieve this in practice, one resorts to an approximation. The choice we consider in this chapter is the MB distribution, motivated by [171], [172] and [173].
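For illustration, the bin probabilities of (5.2) can be computed with a few lines of Python; the values of σ_x and the 16-state grid below are arbitrary choices of ours, not parameters taken from the text.

import numpy as np
from scipy.stats import norm

# Bin probabilities of eq. (5.2): integrate a zero-mean Gaussian over bins of
# width dx; sigma_x and the grid are illustrative choices.
sigma_x, dx = 6.0, 2.0
edges = np.arange(-16, 17, dx)            # 17 edges -> 16 bins centred on odd integers
p = np.diff(norm.cdf(edges, loc=0.0, scale=sigma_x))
p /= p.sum()                              # renormalise the truncated tails
print(np.round(p, 4))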

Figure 5.2: (Discrete) finite Gaussian constellation of 16-PAM

5.3.1 Pilot Aided Maximum Likelihood Channel Estimation

In most practical cases, the receiver does not have a priori knowledge of the realization of the channel matrix H. In this case, the channel needs to be estimated using a training sequence, i.e. using a set of data symbols X_p ∈ C^{M×T} which is known to the receiver, where T is the length of the training sequence. Different channel estimation techniques can be employed depending on the prior knowledge about the statistics of H. In the case where no a priori knowledge about H exists, the ML channel estimate is

Ĥ = Y_p X_p^H ( X_p X_p^H )⁻¹, (5.3)

where Y_p ∈ C^{N×T} is the observation matrix. The minimal estimation error is achieved when using an orthogonal pilot sequence [163]

(P/M) X_p X_p^H = I, (5.4)

where P is the total transmitted power for the training vectors. Thus the channel matrix can be expressed as

Ĥ = H + ∆, [∆]_{i,j} ∼ CN(0, σ_h²), (5.5)

where σ_h² ≜ σ_w² M / P.

In general, the overall system model for the MIMO system with imperfect CSI can be expressed as:

y = Hx + w, (5.6a)
Ĥ = H + ∆, (5.6b)
[∆]_{i,j} ∼ CN(0, σ_h²), (5.6c)
σ_h² = σ_w² M / P. (5.6d)

5.3.2 Detection in a Mismatched Receiver

In this Section we review two detection schemes for MIMO symbols ignoring channel uncertainty. In this regard, we perform the following:

1. The receiver estimates the channel matrix H from X_p and Y_p and produces Ĥ.

2. The receiver performs detection of the transmitted symbols using the channel estimate Ĥ instead of the true channel matrix H, assuming that the estimate is exact.

First, we consider optimal detection of the transmitted signal vector x based on the received signal y using the MAP criterion.

x̂_MAP(y) = arg max_{x ∈ D^M} Pr(x | y) = arg max_{x ∈ D^M} p(y | x) Pr(x), (5.7)

where D is the modulation alphabet. We now relax the discrete constraint over x and instead assume it is distributed according to a continuous Gaussian distribution. Due to the Gaussian assumptions on the signal x and the additive noise, we have

p(y | x) = CN( Ĥx, σ_w² I ), (5.8a)
p(x) = CN( 0, σ_x² I ). (5.8b)

Therefore, the MAP detector in (5.7) can be expressed as

x̂_MAP(y) = arg max_{x ∈ D^M} ( 1/(πσ_w²)^N ) exp( −‖y − Ĥx‖²/σ_w² ) · ( 1/(πσ_x²)^M ) exp( −‖x‖²/σ_x² )
          = arg min_{x ∈ D^M} { ‖y − Ĥx‖²/σ_w² + ‖x‖²/σ_x² }. (5.9)

To find the solution of (5.9), which is a combinatorial problem, a brute-force search over all |D|^M possibilities can be employed. However, this is impractical as M and |D| increase. Instead, a low complexity suboptimal detector based on the MMSE criterion can be derived. This detector minimizes the MSE between the multiple inputs and the filtered multiple outputs. Considering that the variables x and y are jointly Gaussian, the Bayesian MMSE detector is

x̂_MMSE(y) = E{x | y} = ( Ĥ^H Ĥ + η_MMSE I )⁻¹ Ĥ^H y, (5.10)

where η_MMSE = σ_w²/σ_x². The detected symbols are given by

x̂_{D−MMSE}(y) = Q[ x̂_MMSE(y) ], (5.11)

where Q[a] is defined as the operation of rounding a to the nearest lattice point in the signal constellation.
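A minimal sketch of the mismatched detection chain (5.10)-(5.11) in Python, assuming a real PAM-type constellation array for the rounding step (the function name is ours):

import numpy as np

def mmse_mismatched(H_hat, y, sig_w2, sig_x2, constellation):
    # Soft MMSE filtering with the channel estimate treated as exact, eq. (5.10)
    M = H_hat.shape[1]
    x_soft = np.linalg.solve(H_hat.conj().T @ H_hat + (sig_w2 / sig_x2) * np.eye(M),
                             H_hat.conj().T @ y)
    # Q[.]: per-component rounding to the nearest constellation point, eq. (5.11)
    idx = np.argmin(np.abs(x_soft[:, None] - constellation[None, :]), axis=1)
    return constellation[idx]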

5.4 Bayesian Detection under Channel Uncertainty

We now discuss the optimal and suboptimal detection schemes for the transmitted symbols, based on both the estimated channel matrix Ĥ and its estimation error matrix ∆. In this Section we consider the knowledge of the uncertainty of the channel matrix as part of our system model, and we incorporate this knowledge into the detection formulation.

First we present the optimal detection scheme and the suboptimal LMMSE receiver for this model. Then we present two suboptimal detection schemes with low complexity. The first one finds the global minimum point of the optimization problem using a simple one-dimensional line search. The second method uses the BEM to find a maximum of the posterior.

5.4.1 Optimal MAP Detection

As in Section 5.3.2, the MAP detector is given by (5.7), but after considering the channel estimation error, we now have

p(y | x) = CN( Ĥx, (σ_h² ‖x‖² + σ_w²) I ), (5.12a)
p(x) = CN( 0, σ_x² I ), (5.12b)

where σ_h² is the variance of the channel estimation error, given in (5.5). Note that in (5.12b) we relax the discrete constraint over x and instead assume it is distributed according to a continuous Gaussian distribution. Therefore, the MAP detector can be written as

x̂_MAP(y) = arg max_{x ∈ D^M} { 1/( π^N (σ_h²‖x‖² + σ_w²)^N ) exp( −‖y − Ĥx‖² / (σ_h²‖x‖² + σ_w²) ) · 1/( π^M σ_x^{2M} ) exp( −‖x‖²/σ_x² ) }
          = arg min_{x ∈ D^M} { N log( σ_h²‖x‖² + σ_w² ) + ‖y − Ĥx‖² / (σ_h²‖x‖² + σ_w²) + ‖x‖²/σ_x² }. (5.13)

The solution of (5.13) is computationally intensive. Hence, we investigate two low complexity suboptimal detectors.

5.4.2 Linear MMSE Detection

First we present the MMSE detector of x for the model in (5.6a)-(5.6d). This can be written as

x̂_MMSE(y) = E{x | y} = E{ E{x | y, H} | y } = E{ ( H^H H + (σ_w²/σ_x²) I )⁻¹ H^H y | y }. (5.14)

Unfortunately, it is clear that the computational complexity involved in solving (5.14) is too great for practical applications. Instead, a common approach is to consider the LMMSE estimator. The LMMSE minimizes the expected value of the squared error, using a linear transformation A of y, so that

x̂_LMMSE(y) = Ay. (5.15)

The LMMSE estimate is given by

x̂_LMMSE(y) = cov(x, y) cov(y, y)⁻¹ y = ( Ĥ^H Ĥ + η_LMMSE I )⁻¹ Ĥ^H y, (5.16)

where η_LMMSE = Mσ_h² + σ_w²/σ_x². For the discrete signal constellation, we obtain

x̂_{D−LMMSE}(y) = Q[ x̂_LMMSE(y) ]. (5.17)

In the next two Sections we present two novel algorithms for finding the MAP detector under the system model (5.6a)-(5.6d). The first is based on the solution of a non-convex optimisation problem and the second is based on the BEM approach.

5.4.3 Hidden Convexity Based MAP Detector

Here we develop a novel near-optimal approach for the MAP detector. We suggest to relax the discrete constraint over x and instead assume it stems from a continuous Gaussian distribution. Therefore, the detector can be written as

xD MAP (y) = [xC MAP (y)] , (5.18) − Q − b b 5.4 Bayesian Detection under Channel Uncertainty 125 where xC MAP (y) is the solution to the system with a continuous random vector x with mul- − tivariate Gaussian distribution, resulting in b 2 y Hx 2 2 2 2 − x xC MAP (y) = arg min N log σh x + σw + 2 + k 2k  (5.19) − M k k 2 2 σ x C σ h x + σw x ∈    k kb  b   Problem (5.19) is an M -dimensional, nonlinear and nonconvex optimization program. In [174] and [175] the authors presented a method to transform a similar problem into a tractable form which can be solved efficiently. Under this setting, the vector x was treated as an unknown deterministic vector. In our setting, the vector x is treated as a random Gaussian vector. This difference results in an additional quadratic term in the MAP objective function, namely x 2/σ2, which incorporates the a priori information about the random vector x. The following k k x theorem shows that this method can also be applied in the MAP problem.

Theorem 5.1. For any t ≥ 0, let

f(t) = min_{x: ‖x‖² = t} ‖y − Ĥx‖² (5.20)

and denote the optimal argument by x(t). Then, the MAP estimator of x in the model (5.19) is x(t*), where t* is the solution to the following unimodal optimization problem:

t* = arg min_{t ≥ 0} { f(t)/(σ_h² t + σ_w²) + N log(σ_h² t + σ_w²) + t/σ_x² }. (5.21)

Proof. By introducing a slack variable t = ‖x‖², we can rewrite (5.19) as (5.21) using f(t) defined in (5.20). In Appendix 5.7, we prove that the line search in (5.21) is unimodal in t ≥ 0.

The change of variables in Theorem 5.1 allows for an efficient solution of the MAP problem. This is due to:

1. There are standard methods for evaluating f(t) in (5.20) for any t ≥ 0.

2. The unimodality of (5.21) ensures that an efficient one-dimensional search can find the global optimum.

First we provide a simple method for evaluating f(t) in (5.20). This is a quadratically constrained LS problem.

Lemma 5.2. ([151], [176]): The solution to

f(t) = min_{x: ‖x‖² = t} ‖y − Ĥx‖² (5.22)

is

x(t) = ( Ĥ^H Ĥ + ηI )^† Ĥ^H y, (5.23)

where η ≥ −λ_min( Ĥ^H Ĥ ) is the unique root of the equation

‖x(t)‖² = t. (5.24)

The derivation of Lemma 5.2 can be found in [151], [176].

The only parameter that needs to be evaluated is η. This can be done by trying different values of η in (5.23) until one that satisfies (5.24) is found. This is made simpler by the fact that ‖( Ĥ^H Ĥ + ηI )^† Ĥ^H y‖² is monotonically decreasing in η, which enables us to find a value of η that satisfies (5.24) using a simple line search, such as bi-section [177]. The search range is −λ_min( Ĥ^H Ĥ ) ≤ η ≤ η_max. In order to reduce the search range we develop the following upper and lower bounds, η_min ≤ η ≤ η_max (see Appendix 5.8):

η_max ≤ √( (1/t) Σ_{m=1}^{M} λ_m² |r[m]|² ), (5.25a)
η_min ≥ −λ_k² + ( λ_k/√t ) |r[k]|, (5.25b)

where λ_m, m ∈ [1, ..., M], are the singular values of Ĥ (so that λ_m² are the eigenvalues of Ĥ^H Ĥ), λ_k ≜ min{λ_m}, m ∈ [1, ..., M], and r ≜ U^T y, where [U, Q, V] = SVD(Ĥ) is the SVD of Ĥ.

Next, f(t) can be evaluated by plugging the appropriate x(t) into ‖y − Ĥx(t)‖². Now that we have an efficient method for evaluating f(t), it remains to solve (5.21). The unimodality property ensures that this line search can be efficiently implemented using the Golden section search [177]. Theoretically, the search region is 0 ≤ t ≤ ∞. However, in practice the search can be confined to 0 ≤ t ≤ t_max, where t_max is a sufficiently large upper bound.

In the limit of an infinite number of amplitudes in the signal constellation, x̂_{D−MAP} in (5.19) is effectively equal to x̂_MAP, and it is optimal. In that case, the detection problem, generally considered to be exponentially complex, can be solved using our approach with linear complexity.
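The nested line searches of this Section can be sketched as follows. This is an illustrative implementation, not the exact one used here: the inner bi-section uses the bracketing bounds (5.25a)-(5.25b), the outer search uses SciPy's bounded scalar minimizer in place of a hand-written Golden section search, and the t_max heuristic is our own.

import numpy as np
from scipy.optimize import minimize_scalar

def map_detect_hc(H_hat, y, sig_h2, sig_w2, sig_x2, t_max=None):
    U, lam, Vh = np.linalg.svd(H_hat, full_matrices=False)   # assumes N >= M
    r = U.conj().T @ y
    N = len(y)

    def x_of_t(t):
        # Inner bi-section for eta with ||x(t)||^2 = t (Lemma 5.2, bounds (5.25))
        lo = -lam.min() ** 2 + 1e-9
        hi = np.sqrt(np.sum(lam ** 2 * np.abs(r) ** 2) / t)
        for _ in range(60):
            eta = 0.5 * (lo + hi)
            n2 = np.sum((lam / (lam ** 2 + eta)) ** 2 * np.abs(r) ** 2)
            lo, hi = (eta, hi) if n2 > t else (lo, eta)
        return Vh.conj().T @ ((lam / (lam ** 2 + eta)) * r)

    def objective(t):
        # Outer objective, eq. (5.21)
        f = np.linalg.norm(y - H_hat @ x_of_t(t)) ** 2
        return f / (sig_h2 * t + sig_w2) + N * np.log(sig_h2 * t + sig_w2) + t / sig_x2

    if t_max is None:
        t_max = 10 * H_hat.shape[1] * sig_x2      # heuristic search limit (ours)
    t_star = minimize_scalar(objective, bounds=(1e-9, t_max), method='bounded').x
    return x_of_t(t_star)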

5.4.4 Bayesian EM based MAP Detector

In this Section we provide an alternative solution to the one suggested in Section 5.4.3, for comparison. Here we develop a detection scheme based on the BEM methodology. The

BEM methodology is a general technique for finding MAP estimates where the model depends on unobserved latent variables [175], [178], [179], [180], [181]. This method is based on the classical EM algorithm [34], [35], and is known to converge to a stationary point of the posterior density corresponding to a mode, though convergence to the global mode is not guaranteed, as is well known for the standard EM methodology.

This missing data problem is overcome by considering the hidden variables as random variables and averaging over their distribution. The BEM method is best summarized by the following three steps:

1. Initial guess of the parameters

2. Replace the missing values by their expectations given the guessed parameters (E)

3. Estimate parameters by maximising the expectation (M)

Steps (2) and (3) are repeated until convergence. We now provide a solution for finding the MAP estimator of x using the BEM method. For this purpose we rewrite (5.6a) as

y = Gx + w, (5.26)

where G is a Gaussian random matrix with mean Ĥ and per-element variance σ_h², as defined in (5.5). At each BEM iteration, the algorithm maximizes the expected log likelihood (with respect to G, given y and parameterized by x_n). Let x_n denote the estimate of x at iteration n. Then, at iteration n + 1, we update the estimate as

x_{n+1} = arg max_x E_{G|y;x_n}{ log p(y, G, x) }
        = arg max_x E_{G|y;x_n}{ log( p(y | G, x) p(G) p(x) ) }
        = arg max_x E_{G|y;x_n}{ log( p(y | G, x) p(x) ) }
        = arg max_x E_{G|y;x_n}{ log p(y | G, x) } + log p(x)
        = arg min_x E_{G|y;x_n}{ (1/σ_w²) ‖y − Gx‖² } + (1/σ_x²) ‖x‖²
        = arg min_x { (1/σ_w²) ( −2 y^H E_{G|y;x_n}{G} x + x^H E_{G|y;x_n}{G^H G} x ) + (1/σ_x²) x^H x }
        = arg min_x { (1/σ_w²) ( −2 y^H Φ₁(y, x_n) x + x^H Φ₂(y, x_n) x ) + (1/σ_x²) x^H x }, (5.27)

where

Φ₁(y, x_n) = E_{G|y;x_n}{ G }, (5.28a)
Φ₂(y, x_n) = E_{G|y;x_n}{ G^H G }. (5.28b)

Next, we take the derivative of (5.27) with respect to x, setting it to 0, to obtain the following recursion:

x_{n+1} = ( Φ₂(y, x_n) + (σ_w²/σ_x²) I )⁻¹ Φ₁(y, x_n)^T y. (5.29)

The expectations in (5.28a) can be evaluated based on the jointly Gaussian optimal MMSE estimation theory [182]. To demonstrate this, first using the Kronecker product, we rewrite (5.26) as

y = ( x^T ⊗ I ) g + w, (5.30)

where g = vec(G). The Bayesian MMSE estimate for (5.30) becomes

E_{g|y;x_n}{ g } = vec(Ĥ) + ( x_n^T ⊗ I )^H ( y − Ĥx_n ) / ( ‖x_n‖² + σ_w²/σ_h² ), (5.31)

cov_{g|y;x_n}{ g } = σ_h² I − σ_h⁴ ( x_n x_n^T ⊗ I ) / ( σ_h² ‖x_n‖² + σ_w² ). (5.32)

Now using simple algebraic manipulations, we can find Φ1 and Φ2 explicitly as

Φ₁(y, x_n) = Ĥ + ( y − Ĥx_n ) x_n^T / ( ‖x_n‖² + σ_w²/σ_h² ), (5.33a)
Φ₂(y, x_n) = Φ₁(y, x_n)^T Φ₁(y, x_n) + σ_h² N ( I − x_n x_n^T / ( ‖x_n‖² + σ_w²/σ_h² ) ). (5.33b)
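A hedged Python sketch of the BEM recursion follows, using the LMMSE solution (5.16) as the starting point (as Section 5.4.4.1 below suggests). Note that, for complex-valued signals, the transposes in (5.29) and (5.33a)-(5.33b) are implemented here as Hermitian transposes; the fixed iteration count is illustrative.

import numpy as np

def bem_map_detect(H_hat, y, sig_h2, sig_w2, sig_x2, n_iter=10):
    N, M = H_hat.shape
    # LMMSE starting point, eq. (5.16)
    x = np.linalg.solve(H_hat.conj().T @ H_hat + (M * sig_h2 + sig_w2 / sig_x2) * np.eye(M),
                        H_hat.conj().T @ y)
    for _ in range(n_iter):
        c = np.vdot(x, x).real + sig_w2 / sig_h2      # ||x_n||^2 + sig_w^2/sig_h^2
        Phi1 = H_hat + np.outer(y - H_hat @ x, x.conj()) / c             # eq. (5.33a)
        Phi2 = (Phi1.conj().T @ Phi1
                + sig_h2 * N * (np.eye(M) - np.outer(x, x.conj()) / c))  # eq. (5.33b)
        # M-step as a linear solve, cf. eq. (5.45)
        x = np.linalg.solve(Phi2 + (sig_w2 / sig_x2) * np.eye(M),
                            Phi1.conj().T @ y)
    return x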

5.4.4.1 Initial guess of x0

Since our problem is non-convex, convergence to the optimal solution is not guaranteed. It is well known that in such settings the initial starting point x₀ influences whether the solution converges to the optimal solution. As a starting point we suggest taking x̂_LMMSE (equation (5.16)) as the initial guess, and therefore x₀ = x̂_LMMSE.

5.4.5 MAP Detection with Unknown Noise Variance

In many practical scenarios, the noise variance σ_w² is unknown. Here we derive a novel MAP detection algorithm for the model (5.6a)-(5.6d) in which inference is performed for both x and σ_w². To achieve this we need to solve

( x̂, σ̂_w² ) = arg max_{(x, σ_w²)} p( x, σ_w² | y ). (5.34)

Before presenting the solution, we note that the BEM approach presented before would be difficult to perform efficiently in the new problem setting, since now the expectation step involves a double integral. We propose an alternative approach which separates the problem into an iterative procedure consisting of the BEM algorithm as before, and a conditional optimization. Here, we introduce a new variable ζ, given by

ζ = σ_h² ‖x‖² + σ_w². (5.35)

Our solution to this problem will be achieved utilizing the concept of annealed Gibbs sampling [183]. The idea is that since we are only after the maximum of the posterior, instead of working with the actual distribution p (x, ζ y), we work with a ‘heated’ version of it. The concept comes | from statistical physics, which has become popular in probabilistic optimization theory. When this concept is applied to find the mode of a distribution, one scheme is to consider the target posterior raised to a power, which is analogous to a temperature. As the temperature increases, the mass concentrates on the mode. Asymptotically, as the temperature goes to infinity, the result is a Dirac mass at the mode.

Here we adopt a Gibbs sampling approach. This allows us to separate the problem into iterations which involve the original BEM to optimize one of the full conditionals, p(x | y, ζ), as before, and the optimization of p(ζ | y, x). We can make an additional computational saving by introducing conjugacy into the model, which gives us a parametric full conditional for ζ with an analytic expression for the mode. We iterate the following procedure:

ζ̂_n = arg max_ζ p( ζ | y, x ) |_{x = x̂_{n−1}}, (5.36a)
x̂_n = arg max_x p( x | y, ζ ) |_{ζ = ζ̂_n}. (5.36b)

We define the prior density for (ζ | x) to be an Inverse Gamma (IG) distribution over the support ζ > 0, given by

p( ζ | x; α, β ) = IG( ζ; α, β ) = ( β^α / Γ(α) ) ζ^{−α−1} exp( −β/ζ ), (5.37)

with shape parameter α and scale parameter β. This is a popular choice of prior, since it includes the option of an uninformative prior, and also caters for conjugacy under the Normal-IG model [20]. We are free to specify the parameters of the prior as we wish. One way to do this when no a priori information is available is to use a diffuse prior. We specify the parameters of the prior as follows: we choose the shape parameter α = 2. Next, we center the prior in such a way that it includes some knowledge of the model whilst remaining diffuse. We set the mean of the IG distribution, given by E{ζ} = β/(α − 1), to be ‖x‖², leading to β = ‖x‖².

Having described the Bayesian model, we first write the conditional distributions

p( ζ | x, y ) ∝ p( y | x, ζ ) p( ζ | x ) = CN( Ĥx, ζI ) IG( α, β ), (5.38a)
p( x | ζ, y ) ∝ p( y | ζ, x ) p( ζ | x ) p( x ) = CN( Ĥx, ζI ) IG( α, β ) CN( 0, σ_x² I ). (5.38b)

Due to the conjugacy of the Normal-IG model in (5.38a)-(5.38b) [20], the full posterior for ζ follows an IG distribution, and is given by

p( ζ | x, y ) = IG( ᾱ, β̄ ), (5.39)

with hyperparameters

ᾱ = α + N/2, (5.40a)
β̄ = β + ( Σ_{i=1}^{N} | y[i] − (Ĥx)[i] |² ) / 2, (5.40b)

where N is the number of elements in y. The method described in (5.36a)-(5.36b) is performed as follows:

BEM-Annealed Gibbs Detector

Step 1: evaluate

ζ̂_n = arg max_ζ p( ζ | y, x ) |_{x = x̂_{n−1}}
     = β̄_{n−1} / ( ᾱ_{n−1} + 1 )
     = ( 2β + Σ_{i=1}^{N} | y[i] − (Ĥx̂_{n−1})[i] |² ) / ( 2α + N + 2 ). (5.41)

Step 2: simulate using the BEM

x̂_n = arg max_x p( x | y, ζ ) |_{ζ = ζ̂_n}. (5.42)

The procedure in (5.41)-(5.42) is repeated until a certain precision is reached. We denote this algorithm as BEM-Gibbs.
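A sketch of the BEM-Gibbs loop is given below. It is illustrative only: the x-update in Step 2 is a simplified Gaussian MAP stand-in for the full BEM recursion (the text uses the BEM of Section 5.4.4 with ζ̂_n fixed), and setting the prior scale β from the initial estimate of x is our own reading of the diffuse-prior choice above.

import numpy as np

def bem_gibbs_detect(H_hat, y, sig_x2, n_iter=10, alpha=2.0):
    N, M = H_hat.shape
    x = np.linalg.solve(H_hat.conj().T @ H_hat + np.eye(M), H_hat.conj().T @ y)
    beta = np.vdot(x, x).real           # prior scale beta = ||x||^2 (alpha = 2)
    for _ in range(n_iter):
        # Step 1: mode of the IG full conditional for zeta, eq. (5.41)
        res = y - H_hat @ x
        zeta = (2 * beta + np.vdot(res, res).real) / (2 * alpha + N + 2)
        # Step 2: maximize p(x | y, zeta); here a Gaussian MAP update with
        # effective noise variance zeta (a stand-in for the BEM recursion)
        x = np.linalg.solve(H_hat.conj().T @ H_hat + (zeta / sig_x2) * np.eye(M),
                            H_hat.conj().T @ y)
    return x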

5.4.6 Efficient Implementation and Complexity Analysis

We now propose low complexity implementations of the algorithms developed in Sections 5.4.3 and 5.4.4 and discuss their overall complexity.

First we assess the complexity of the detector presented in Section 5.4.3. The algorithm is composed of two nested line searches. At each iteration of the outer loop (over t) one needs to evaluate ‖x(t)‖² = ‖( Ĥ^H Ĥ + ηI )^† Ĥ^H y‖² for different values of η. This computationally intensive operation can be implemented with low complexity using the following equality:

( Ĥ^H Ĥ + ηI )⁻¹ Ĥ^H = V Λ_η U^T, (5.43)

where [U, Q, V] = SVD(Ĥ) is the SVD of Ĥ, U and V are unitary matrices and Q is a diagonal matrix containing the singular values λ_m, m ∈ [1, ..., M]. The matrix Λ_η of dimensions M × N (assume M ≤ N) is diagonal with elements λ_m/(λ_m² + η), m ∈ [1, ..., M], on its main diagonal. Then, we can derive a low complexity implementation of (5.23), as

‖x_η‖² = ‖V Λ_η U^T y‖² = ‖Λ_η U^T y‖² = ‖Λ_η r‖² = Σ_{m=1}^{M} ( λ_m/(λ_m² + η) )² |r[m]|², (5.44)

where the second equality holds because V is a unitary matrix, r ≜ U^T y, and r[m] is the m-th element of r.

The most intensive part is the SVD operation, which needs to be performed once per block (since the channel is assumed block fading and remains constant over a block length). Its complexity per MIMO symbol is (4N²M + 8NM² + 9M³)/L flops [184], where L is the block length. Thus, if N and M are of the same order of magnitude, and ignoring all but the leading terms, the complexity is O(N³). Since the outer line search consists only of scalar operations and does not depend on either x or y, its cost is negligible. Next we assess the complexity of the inner line search. The vector operations have quadratic complexity and can be upper bounded by O(N²). For the inner loop, implemented using bi-section, the convergence rate is linear. The number of iterations required to achieve a given tolerance in the solution is ITR₁ = log₂(ǫ₀/ǫ) [185], where ǫ₀ ≜ η_max − η_min is the size of the initial bracketing interval, and ǫ is the desired ending tolerance. For the outer line search, using the Golden section search, each iteration approximately reduces the original interval by a factor of 0.618; therefore, after ITR₂ iterations the updated search interval is 0.618^{ITR₂} t_max [185], where t_max is the original interval size.

To summarize, the overall complexity of the hidden convexity based algorithm developed in Section 5.4.3 is O(N³). An important feature of this algorithm is that its overall complexity is deterministic and does not depend on σ_h² and σ_w², which makes it appealing for real-time systems.

Next we evaluate the overall complexity of the BEM based detector, presented in Section 5.4.4. The most intensive part in the BEM detector is the matrix inversion in (5.29). However, the matrix inversion can be avoided by rewriting (5.29) as

( Φ₂(y, x_n) + (σ_w²/σ_x²) I ) x_{n+1} = Φ₁(y, x_n)^T y. (5.45)

2 c ( M 3 + 4M 2 + 2M 2N + 4NM + 2M + N) [flops]. (5.46) BEM ≈ 3

As previously, if N and M are of the same order of magnitude, and ignoring all terms except the leading terms, we can simplify the overall complexity of the BEM to be

C_BEM ≈ c_BEM · ITR_BEM = (8/3) N³ ITR_BEM, (5.47)

where ITR_BEM is the number of iterations required for convergence. ITR_BEM is a random variable which depends on the values of σ_w², σ_h² and x₀; therefore, the complexity of the BEM is random and can only be evaluated via simulations. Simulations show that the algorithm converges in fewer than 10 iterations, depending on σ_h² and σ_w².

To summarise, both algorithms have a similar complexity of O(N³) flops. The advantage of the algorithm based on the hidden convexity methodology in Section 5.4.3 is that its complexity is constant and does not depend on the system parameters. This makes it more suitable for real-time applications.

5.5 Simulation Results

In this Section, we compare the performance of the proposed detectors via simulation. First, we describe the simulated MIMO system.

5.5.1 System Configuration

In the simulations performed, the elements of the channel matrix H are assumed independent and identically distributed complex Gaussian random variables with zero mean and unit variance. Additionally, we assume that the channel coefficients remain constant within each frame and change independently from frame to frame. The additive channel noise is spatially and temporally white complex Gaussian with zero mean and variance σ_w². The average power of the constellation is E_s, and the SNR is defined as ME_s/σ_w². Different numbers of transmit and receive antennas have been used in the simulations. We use frame-by-frame transmission, with each frame consisting of L = 128 MIMO symbols. At the beginning of each frame, 4 equipower orthogonal pilot symbols were dedicated to the channel estimation phase, according to Section 5.3.1. The energy of each pilot vector is 4E_s.

5.5.2 Constellation Design

In this example we illustrate the detection performance on a Gaussian-like distribution. A popular choice in the literature [171] for approximating a Gaussian is the MB distribution, motivated in [172] and [173]. The MB distribution [168] gives symbol probabilities Pr(x_j), j = 1, ..., |D|, which are obtained using

Pr(x_j) = C(λ) exp( −λ|x_j|² ), λ ≥ 0, (5.48)

with a normalizing constant,

C(λ) = ( Σ_{x_j} exp( −λ|x_j|² ) )⁻¹. (5.49)

The parameter λ governs the trade-off between the average power of the signal points and the entropy rate. For λ = 0 a uniform distribution results, whereas for λ → ∞ only the two minimum energy signal points are used. In practice, creating a non-uniform input distribution is a non-trivial task. One efficient approach involves a look-up table; see [171] for practical implementation details.

In this example we illustrate the use of an MB distribution on a 16-ary Pulse Amplitude Modulation (PAM) constellation, with the signal set D = {−15, −13, ..., 13, 15} and λ = 1/40. The 16-PAM distribution is depicted in Figure 5.3. To analyze the goodness of fit between the MB discrete constellation and the true continuous Gaussian, we plot the cdf of the MB distribution in (5.48)-(5.49) versus the cdf of a continuous Gaussian with zero mean and variance σ² = Σ_{k=1}^{16} Pr(x_k) |x_k|². The results are depicted in the bottom plot of Figure 5.3.
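The MB probabilities for this 16-PAM example take only a few lines to compute; the following sketch (ours) evaluates (5.48)-(5.49) with λ = 1/40:

import numpy as np

x = np.arange(-15, 17, 2)          # 16-PAM signal set {-15, -13, ..., 15}
lam = 1.0 / 40.0
p = np.exp(-lam * np.abs(x) ** 2)  # eq. (5.48), unnormalised
p /= p.sum()                       # C(lambda) of eq. (5.49)
var = np.sum(p * np.abs(x) ** 2)   # variance of the shaped constellation
print(np.round(p, 4), var)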

Figure 5.3: Near-Gaussian constellation of 16-PAM with λ = 1/40. The upper plot depicts the discrete constellation; the lower plot shows the goodness of fit between the MB discrete constellation and the true continuous Gaussian

5.5.3 Comparison of Detection Techniques

We now compare the BER results of the proposed detectors. These are the following:

1. The MMSE detector with perfect CSI, referred to as MMSE CSI. This will serve as a performance reference.

2. The MMSE detector of the mismatched receiver, i.e. ignoring channel uncertainty (equation (5.11)), referred to as MMSE NCU.

3. The LMMSE detector which takes channel uncertainty into account (equation (5.17)), referred to as LMMSE CU.

4. The proposed MAP detector (equation (5.18)), referred to as MAP CU.

5. The BEM based detector (equation (5.29)), referred to as BEM CU.

6. The BEM-Gibbs based detector (equations (5.41)-(5.42)), referred to as BEM-Gibbs.

We first compare the BER performance for different system configurations.

BER performance of non-symmetric MIMO systems

Here we compare the BER results for two non-symmetric configurations, namely (N = 2, M = 1) and (N = 4, M = 2). The BER results are depicted in Figures 5.4 and 5.5, respectively. First, we compare the BER performance of the two extreme cases, where there is perfect CSI (MMSE CSI) and where the mismatched detector MMSE NCU is used. The BER gap is about 7 dB for both MIMO configurations. In both configurations the LMMSE CU detector is about 0.5 dB better than the MMSE NCU detector. The MAP detector, MAP CU, produces almost 1 dB gain over the LMMSE CU detector. We also note that the BEM CU provides the same BER as the MAP CU detector.

Figure 5.4: BER performance of MIMO systems with N = 2, M = 1

BER performance of symmetric MIMO systems

Here we compare the BER results for two symmetric configurations, namely (N = 4, M = 4) and (N = 8, M = 8). These results are depicted in Figures 5.6 and 5.7. The BER gap between the MMSE CSI and the mismatched detector MMSE NCU is about 15 dB in both cases. In both configurations the LMMSE CU detector is about 1 dB better than the MMSE NCU detector. The proposed MAP detector, MAP CU, produces almost 2 dB gain over the LMMSE CU detector for the (N = 4, M = 4) system and above 2 dB for (N = 8, M = 8). We also notice that here too the MAP CU and the BEM CU detectors provide comparable results.

Figure 5.5: BER performance of MIMO systems with N = 4, M = 2

In conclusion, the simulation results show the improvement obtained by using the proposed MAP detector over the LMMSE detector. We note that the performance gain is more noticeable as the number of transmit and receive antennas increases. We also note that in all cases the MAP CU and BEM CU detectors provide comparable BER performance. This is because the starting point of the BEM CU detector lies in the region of the global maximum of the cost function in (5.19).

BER performance of BEM CU versus BEM-Gibbs detection schemes
We now present the BER performance of the BEM-Gibbs detector presented in Section 5.4.5. This detector extends the BEM CU detector to the case where the noise variance σ_w² is unknown a priori. The BER results are presented in Figure 5.8 for the system configurations (N = 4, M = 4) and (N = 8, M = 8). The results show a minor degradation of less than 0.5 dB in performance in comparison to the BEM CU detector, which has a priori knowledge of the noise variance. Based on these results we conclude that the annealed Gibbs sampling methodology is suitable for the task of symbol detection under noise uncertainty.

Sensitivity of the BEM CU detector to initialisation
As discussed before, the BEM may be sensitive to the initial starting point of the algorithm, x_0. We now examine the performance of the BEM detector under two different starting points, x_0 \in \{0, \hat{x}_{LMMSE}\}, and different values of σ_h². We found that in both cases the algorithm almost always (99.4% of the time) converges to the same value \hat{x} and as a result yields a very similar BER. Only in a small fraction of cases does the starting point x_0 not lie around the mode of the global maximum, and therefore the algorithm does not converge to the true mode of the posterior.


Figure 5.6: BER performance of MIMO systems with M = N = 4

However, the number of iterations required to converge differs between the two cases. These results for the case N = 4, M = 4 are depicted in Figure 5.9. The number of iterations required by the initial guess x_0 = 0 is much larger than for x_0 = \hat{x}_{LMMSE}. The number of iterations in both cases also depends on the values of σ_h² and σ_w².


Figure 5.7: BER performance of MIMO systems with M = N = 8


Figure 5.8: BER performance comparison between BEM and BEM-Gibbs detectors


Figure 5.9: Average number of iterations of the BEM algorithm for different starting points

5.6 Chapter Summary and Conclusions

In this chapter we discussed the detection of near-Gaussian constellations in MIMO systems under imperfect channel estimation. We derived the MAP estimator and provided an efficient method for finding it by transforming the multi-dimensional, nonlinear and nonconvex problem into a simple tractable form. Next, we proposed a detection scheme with linear complexity for near-Gaussian digitally modulated symbols, based on the new MAP estimator. We also presented a competing approach for finding the MAP estimator using the BEM methodology and showed that comparable BER performance can be obtained. Finally, we considered the case where the variance of the additive noise is unknown a priori. Based on the BEM methodology and using the concept of annealed Gibbs sampling, we developed an iterative solution to this problem. Simulation results showed the performance gain offered by our approach over the LMMSE detector in terms of BER.

5.7 Appendix I

In this Appendix, we show that (5.21) is unimodal in t \geq 0.

Proof. First, we show that f(t) is convex in t \geq 0. In [151] and [184], it was shown that strong duality holds in this special case and that f(t) is equal to the value of its dual program

f(t) = \max_{\eta} \left\{ y^H y - y^H H \left( H^H H + \eta I \right)^{\dagger} H^H y - \eta t \right\} \quad \text{s.t.} \quad H^H H + \eta I \succeq 0, \;\; H^H y \in \mathcal{R}\left( H^H H + \eta I \right).   (5.50)

Thus, f(t) is the pointwise maximum of a family of affine functions and is therefore convex in t \geq 0. Next, we will show that

r(t) = \frac{f(t)}{\sigma_h^2 t + \sigma_w^2} + N \log\left( \sigma_h^2 t + \sigma_w^2 \right) + \frac{t}{\sigma_x^2}   (5.51)

is unimodal in t \geq 0. We use the following result from [184]: if r'(t) = 0 implies r''(t) > 0 for any t \geq 0, then r(t) is unimodal in t \geq 0. The condition r'(t) = 0 states that

r'(t) = \frac{f'(t)}{\sigma_h^2 t + \sigma_w^2} - \frac{f(t)\,\sigma_h^2}{\left( \sigma_h^2 t + \sigma_w^2 \right)^2} + \frac{N \sigma_h^2}{\sigma_h^2 t + \sigma_w^2} + \frac{1}{\sigma_x^2} = 0.   (5.52)

Multiplying (5.52) by \frac{\sigma_h^2}{\sigma_h^2 t + \sigma_w^2} and rearranging yields

\frac{f(t)\,\sigma_h^4}{\left( \sigma_h^2 t + \sigma_w^2 \right)^3} = \frac{N \sigma_h^4}{\left( \sigma_h^2 t + \sigma_w^2 \right)^2} + \frac{\sigma_h^2 f'(t)}{\left( \sigma_h^2 t + \sigma_w^2 \right)^2} + \frac{\sigma_h^2}{\sigma_x^2 \left( \sigma_h^2 t + \sigma_w^2 \right)}.   (5.53)

The second derivative of r(t) is

r''(t) = \frac{f''(t)}{\sigma_h^2 t + \sigma_w^2} - \frac{f'(t)\,\sigma_h^2}{\left( \sigma_h^2 t + \sigma_w^2 \right)^2} - \frac{f'(t)\,\sigma_h^2}{\left( \sigma_h^2 t + \sigma_w^2 \right)^2} + \frac{2 f(t)\,\sigma_h^4}{\left( \sigma_h^2 t + \sigma_w^2 \right)^3} - \frac{N \sigma_h^4}{\left( \sigma_h^2 t + \sigma_w^2 \right)^2}.   (5.54)

Using (5.53), we replace the term \frac{f(t)\,\sigma_h^4}{\left( \sigma_h^2 t + \sigma_w^2 \right)^3} in (5.54), which results in

r''(t) = \frac{f''(t)}{\sigma_h^2 t + \sigma_w^2} + \frac{2 \sigma_h^2}{\sigma_x^2 \left( \sigma_h^2 t + \sigma_w^2 \right)} + \frac{N \sigma_h^4}{\left( \sigma_h^2 t + \sigma_w^2 \right)^2}.   (5.55)

Now, f(t) is convex, which means that f''(t) \geq 0. Therefore, the first term of (5.55) is non-negative. The second and third terms are positive since σ_x² > 0, σ_h² > 0 and σ_w² > 0.

5.8 Appendix II

Here we derive upper and lower bounds on the search region of Lemma 5.2.

Deriving η_min: We first define

\eta_{\min} \triangleq -\lambda_{\min}\left( H^H H \right) + \theta, \quad \theta > 0,   (5.56)

then using (5.44), we have

\left\| x_{\eta_{\min}} \right\|^2 = \sum_{m=1}^{M} \left( \frac{\lambda_m}{\lambda_m^2 - \lambda_{\min}(H^H H) + \theta} \right)^2 \left| r[m] \right|^2.   (5.57)

In order to derive η_min, we require that

\sum_{m=1}^{M} \left( \frac{\lambda_m}{\lambda_m^2 - \lambda_{\min}(H^H H) + \theta} \right)^2 \left| r[m] \right|^2 > t.   (5.58)

Also, we note that

\lambda_m^2(H) = \lambda_m\left( H^H H \right), \quad m \in \{1, \ldots, M\},   (5.59)

and defining \lambda_k \triangleq \min \{\lambda_m\}, m \in \{1, \ldots, M\}, we have

\left\| x_{\eta_{\min}} \right\|^2 > \frac{\lambda_k^2}{\theta^2} \left| r[k] \right|^2.   (5.60)

Therefore, as long as \frac{\lambda_k^2}{\theta^2} \left| r[k] \right|^2 > t, we have \left\| x_{\eta_{\min}} \right\|^2 \geq t, which requires

\theta < \frac{\lambda_k}{\sqrt{t}} \left| r[k] \right|.   (5.61)

Finally we obtain

\eta_{\min} = -\lambda_k^2 + \frac{\lambda_k}{\sqrt{t}} \left| r[k] \right|.   (5.62)

Deriving η_max: We now derive the value of η_max, where we assume η_max \geq 0. Using (5.44), we have

\left\| x_{\eta_{\max}} \right\|^2 \leq \sum_{m=1}^{M} \left( \frac{\lambda_m}{\eta_{\max}} \right)^2 \left| r[m] \right|^2,   (5.63)

and in particular we require that

\frac{1}{\eta_{\max}^2} \sum_{m=1}^{M} \lambda_m^2 \left| r[m] \right|^2 \leq t.   (5.64)

Then we obtain

\eta_{\max} = \sqrt{ \frac{1}{t} \sum_{m=1}^{M} \lambda_m^2 \left| r[m] \right|^2 }.   (5.65)

Chapter 6

Iterative Receiver for Joint Channel Tracking and Decoding in BICM-OFDM Systems

“It’s a mistake to think you can solve any major problems just with potatoes.”

Douglas Adams

6.1 Introduction

In this chapter we present an iterative receiver that performs channel tracking and symbol decoding in Bit Interleaved Coded Modulation (BICM) - Orthogonal Frequency Division Multiplexing (OFDM) systems over time-varying channels with high mobility of the mobile station.

The proposed scheme alternates between a MAP decoder and a Kalman filter (KF) that tracks the time-varying channel taps. As presented in Chapter 2, the KF attains the optimal solution in terms of MMSE. However, a key assumption made in developing the KF is that the state-space model is perfectly known and no model mismatch is present.

As discussed in Chapter 3, Decision Directed (DD) detection methods make use of the decoded symbols to form the state-space model. The state-space model may be misspecified due to misdecoded symbols, which causes a model mismatch. Because of the lack of robustness of the KF to outliers, filter divergence may occur and give rise to error propagation, bringing about the well-known "error floor" phenomenon.

We present an approach to address the problem of state-space mismatch. In particular, we introduce a new component in the iterative receiver, which evaluates the correctness of the filter's performance. By performing a statistically based comparison of the innovation process against the theoretically predicted values, the new component adapts the number of decoded symbols used for channel tracking according to their reliability, thus reducing the mismatch effect. We present an analysis of the effect of misdecoded symbols on the KF and analyse how misdecoded symbols affect the Channel Frequency Response (CFR) estimation error. Through extensive simulations we compare the BER of the proposed algorithm to other methods presented in the literature and show that the presented method can significantly reduce the error propagation effect, leading to superior BER performance. The main contributions presented in this chapter are as follows:

• We present a complete framework for the iterative joint detection, decoding and channel tracking.

• We introduce a new component to monitor the health parameters of the channel tracking module, designed to reduce state-space model mismatch due to misdecoded symbols.

• We present an asymptotic analysis of the influence of misdecoded symbols on the CFR estimation error.

• We compare the proposed scheme to other algorithms to show its performance improvement.

6.2 Background

Modern wireless communication systems require high data rate transmission over wireless channels. OFDM [186] has received a considerable amount of attention over the last few years due to its ability to transform a frequency selective channel into a group of low-rate parallel flat channels, thereby increasing the symbol duration, canceling the IBI and leading to a simple one-tap equalization. The BICM scheme [135] was shown to have good performance for fading channels. It consists of a concatenation of an encoder, a bit-wise random interleaver and a symbol mapper. The combination of a proper BICM and OFDM was shown to be able to exploit the frequency diversity inherent in frequency selective fading channels and can achieve the full diversity order of L for L-tap frequency selective channels [187].

Coherent OFDM transmission requires the estimation of the gain of each subcarrier (OFDM tone), and the performance of the system relies on the accuracy of this estimate. This problem is further complicated by the fast time-varying nature of channels, pertinent to mobile environments. The optimal receiver for BICM-OFDM performs joint channel tracking and symbol decoding, in which channel estimation is performed for each possible data sequence [77]. It usually suffers from a high computational burden. Some suboptimal methods, such as Per Survivor Processing (PSP), have been presented in order to reduce complexity [188]. Iterative (turbo) processing has gained a lot of attention since its introduction [136]. Referred to as turbo equalization, this method iterates decoding and equalization for the same set of received data, by exchanging probabilistic information about data symbols iteratively. Turbo equalization can achieve close-to-optimal performance but with much lower complexity than a joint ML receiver [189].

For OFDM systems, upon modeling the channel dynamics by an AR process, the KF [32] can be utilized for tracking channel parameters. There are two primary classes of channel tracking algorithms for OFDM systems: Time Domain (TD) methods, which track the CIR [190], and Frequency Domain (FD) methods, which estimate the CFR of the OFDM tones [191]. TD methods have a few advantages over FD methods. Firstly, they require the estimation of a relatively small number of parameters, i.e. the channel taps, whereas in FD methods the number of parameters to be estimated is equal to the number of subcarriers, which can be several hundred in a typical OFDM system. Secondly, TD methods are less sensitive to misdecoded symbols. Using FD methods, a single error can cause a phase rotation of the estimated channel, leading to a complete loss of track on that subcarrier. This problem can be addressed using a frequency domain MMSE combiner [191]. This hurdle does not occur using the TD approach, since, as we show, misdecoded symbols are spread over the whole spectrum and are not concentrated in a single subcarrier. Regardless of which method is used, known symbols (pilots) are required at the beginning of the transmission for initial channel estimation, and periodically thereafter, in order to ensure that the algorithm does not diverge.

DD methods for channel tracking perform well as long as the symbols are correctly decoded and do not contain many misdecoded symbols. However, in the case of time selective channels, the channel may fluctuate rapidly and change significantly between consecutive OFDM symbols. Therefore, DD based channel tracking may contain a significant amount of errors. The KF is optimal in the minimum variance sense [192]. However, it assumes that the underlying model is completely specified and correct. The performance of a KF based upon nominal parameter values may be severely degraded in the presence of parameter deviations, i.e. an incorrect data matrix. This can cause the KF to diverge from the real channel trajectory. This effect is referred to as error propagation and is responsible for the "error floor" phenomenon, as reported in [87] and references therein. Misdetections can be seen as the addition of contaminating noise or, equivalently, as outliers. In the presented system, misdecoded symbols are the sole reason for model mismatch, and the system can be considered as an Errors In Variables (EIV) model [193].

One simple approach to overcome the problem is frequent use of pilot symbols in order to keep track of the channel estimate [194]. However, this results in a lower spectral efficiency of the system.

Another type of approach for handling this problem is to use the soft outputs of the decoder for channel tracking. In [195], the authors designed a soft input KF based on the soft information provided by the decoder. Other methods, based on the EM algorithm, use the soft values of the decoded symbols as input to the channel estimation algorithm [196], [197]. The EM-Kalman treats the data symbols as the hidden data and the CIR as the parameter to be estimated. The maximization step utilizes the KF, while the expectation step is based on the log-likelihood function conditioned on the incomplete data. However, as reported in [198], the channel-tap estimates are biased, which can severely degrade the receiver performance, especially at the first iteration of the turbo equalizer.

The sensitivity of the KF to outliers is well known, and it is said that the KF lacks robustness. There are various definitions of robustness [199–201]. One common measure of robustness is the breakdown point [202]. The breakdown point of an estimator measures the maximal percentage of the data points that may be contaminated before the estimate becomes completely corrupted. Robust estimation can be formulated under the M-estimator framework [203]. Many robust estimators have been constructed, and they differ in the cost (or loss) function used. At the core of the M-estimators lies the idea of downweighting the influence of outliers by replacing the well known residual squared error (a.k.a. L2 norm) cost function with another function of the residuals [204], [205]. The M-estimators are based solely on the values of the residuals, since no other indication of the presence of outliers exists. According to [206], when regression in an EIV model is considered, any M-estimator with an unbounded score function (such as the Huber score function) asymptotically has a breakdown point of zero. As an alternative to overcome this hurdle, the Redescending M-estimators approach [207] has been suggested, and it can be more efficient than the Huber M-estimator under some conditions. A robust version of the KF would have to satisfy two objectives: be as nearly optimal as possible when there are no outliers (no misdecoded symbols); and be resistant to outliers (misdecoded symbols) when they do occur. An attempt to robustify the KF against both system uncertainties and outliers was suggested in [208], based on a robust least squares method [209]. A scheme to reject outliers in MIMO based receivers was suggested in [210].

In contrast to the above mentioned methods, we take a different approach to addressing the error propagation problem. At the core of our approach lies the fact that we have access to extra information regarding the statistics of the EIV model given by the MAP decoder. We can therefore use a cost function which is based on both the residuals and the decoder's output LLR values. By analyzing the decoder's output LLR values, we can estimate the credibility of the decoded symbols and choose whether to include them in, or trim them from, the system model. This can robustify the KF against the presence of outliers (misdecoded symbols), thus delivering a smaller estimation error and ultimately a lower BER. Our approach is similar to the Redescending M-estimators, in the sense that our cost function is set to zero for detected outliers, with the prominent distinction that our criterion for trimming does not rely solely on the residual values but also on the information given by the decoder.

More specifically, the proposed scheme performs iterative symbol decoding and TD channel tracking using three components: a soft demodulator, a symbol decoder and a channel estimator. The soft demodulator computes the soft values of the coded bits in terms of LLR values. These LLRs are used by the decoder as bit metrics. The output of the decoder is fed to both the channel estimator, to track the channel taps, and the soft demodulator, as a priori information for the iterative process. We use the KF for channel tracking. The data matrix required by the KF is provided by the decoder output of the previous iteration in hard-decision form.

In order to reject outliers, we introduce a new component, namely Adaptive Detection Selection (ADS). This component contains a consistency test of the KF based on the innovation (residual) process [211]. As long as the empirical statistics agree with the expected theoretical results, the KF is assumed to be working under nominal conditions and therefore all decoded symbols are used for forming the data matrix. In this case, all available data are used for channel tracking. If, however, they are in disagreement, the KF is becoming inconsistent, leading to its divergence. When this is detected, an adaptive threshold is used to select the most robust symbols, based on their LLR values, to be included in the state-space model. By doing so we reduce the number of misdecoded symbols used for channel tracking. The adaptive threshold varies according to the test statistics of the innovation process. As long as the innovation test statistics produce a good match, the threshold is relatively low; when disagreement occurs, the threshold increases. Using the proposed approach, we reduce state-space mismatch, preventing the filter's divergence. We show that the proposed scheme can significantly reduce the error propagation effect, leading to a reduced BER and approaching the performance of a system with perfect CSI. Simulation results show that the proposed algorithm is more robust and achieves better error performance than soft-value based methods, such as the EM-Kalman, and it also outperforms M-estimator based receivers [212].

6.3 System Description

6.3.1 BICM-OFDM Transmitter

We consider a BICM-OFDM transmitter with N subcarriers, as shown in Figure 6.1. A sequence of information bits is encoded with a channel encoder, interleaved and mapped to complex-valued symbols of an M-ary modulation alphabet set \mathcal{A} = \{a_1, \ldots, a_{|\mathcal{A}|}\}. Throughout this chapter, all derivations are given for Quadrature Phase Shift Keying (QPSK) modulation, but can be easily extended to other constellations. The data symbols are multiplexed to OFDM subcarriers. After modulation of the OFDM symbols via an N-point Inverse Discrete Fourier Transform (IDFT), the signal is transmitted over a frequency selective, time-varying channel. We assume that the channel state is fixed (static) over the duration of a single OFDM symbol, but can change significantly between consecutive OFDM symbols.

Figure 6.1: Block diagram of the BICM-OFDM transmitter (rate-1/2 convolutional encoder, interleaver, QPSK mapping, OFDM modulator)

The use of a proper CP eliminates IBI between consecutive OFDM symbols. The length of the CP is assumed to be larger than the CIR length, and it is also assumed that ICI caused by Doppler offset is fully compensated for. At the receiver, after CP removal and DFT, the signal at time index n is given by [186]

y_n = D_n W_L h_n + z_n,   (6.1)

where y_n is an N \times 1 observation vector in the frequency domain, D_n is an N \times N diagonal matrix whose diagonal entries contain the encoded data symbols, and z_n is an N \times 1 vector of i.i.d. complex zero-mean Gaussian noise with covariance matrix R = I\sigma^2, assumed to be uncorrelated with the channel. The vector h_n = [h_n[0], h_n[1], \ldots, h_n[L-1]]^T is the discrete CIR of length L at time instant n, with components h_n[l], l \in \{0, \ldots, L-1\}. W_L is an N \times L partial DFT matrix, defined as W_L \triangleq \left[ e^{-j 2\pi n l / N} \right]_{n \in \{0, \ldots, N-1\};\, l \in \{0, \ldots, L-1\}}. The CFR, denoted by h_n with elements h_n[k], k \in \{1, \ldots, N\}, can be written as

h_n[k] = \frac{1}{\sqrt{N}} \sum_{l=0}^{L-1} h_n[l]\, e^{-j 2\pi k l / N}.   (6.2)
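As an illustration of (6.1)-(6.2), the following sketch builds the partial DFT matrix W_L and maps a CIR to its CFR. The values N = 64 and L = 14 anticipate the simulated system of Section 6.6, and the random CIR is purely illustrative.

```python
import numpy as np

# Minimal sketch of the CIR-to-CFR mapping via the partial DFT matrix W_L.
N, L = 64, 14                             # illustrative sizes
rng = np.random.default_rng(0)

n = np.arange(N)[:, None]                 # subcarrier (row) index
l = np.arange(L)[None, :]                 # tap (column) index
W_L = np.exp(-2j * np.pi * n * l / N)     # N x L partial DFT matrix

# An arbitrary complex CIR, purely for illustration.
h = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2)

# CFR of all N subcarriers; the 1/sqrt(N) factor matches (6.2).
H = W_L @ h / np.sqrt(N)
print(H[:4])
```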

6.3.2 Channel Model

We consider a multipath fading channel according to the Jakes model [84], consisting of L independent impulses. The channel impulse response is given by

h(t, \tau) = \sum_{l=0}^{L-1} \gamma_l(t)\, \delta(\tau - \tau_l),   (6.3)

where \tau_l is the delay of the l-th path and \gamma_l(t) is the corresponding complex amplitude. In (6.3),

the \gamma_l(t) are wide-sense stationary complex Gaussian processes, independent of each other. The correlation of the l-th path is [191]

C_{\gamma_l}(\Delta t) = E\left\{ \gamma_l(t + \Delta t)\, \gamma_l^*(t) \right\} = \sigma_l^2 J_0(2\pi f_D \Delta t),   (6.4)

where \sigma_l^2 is the power of the l-th path, J_0(x) is the zeroth-order Bessel function of the first kind and f_D is the maximum Doppler frequency in Hertz, given by f_D = \frac{v f_c}{c}, where v is the mobile speed, f_c is the carrier frequency and c is the speed of light.

Applying this model to the discrete-time OFDM system we obtain

C_{\gamma_l}(j) = \sigma_l^2 J_0(2\pi f_D T_s |j|),   (6.5)

where T_s is the OFDM symbol duration and j is an integer. In this case, the theoretical power spectral density of the received fading signal has the well-known U-shaped bandlimited form [84]

S_l(f) = \begin{cases} \dfrac{1}{\pi f_D T_s \sqrt{1 - \left( \frac{f}{f_D T_s} \right)^2}}, & |f| \leq f_D T_s, \\ 0, & \text{else.} \end{cases}   (6.6)

6.3.3 Autoregressive Modeling

Since (6.6) is a nonrational function, precise matching of the theoretical statistics is impossible with an ARMA model. However, our goal is to accurately capture the dynamics of the wireless channel while remaining mathematically tractable for implementation. It is possible to capture most of the channel tap dynamics by using a low-order AR model [85]. The model inaccuracy can be made arbitrarily small by increasing the model order p. With an AR model, the approximation of the CIR h_n is

h_n = \sum_{i=1}^{p} A(i)\, h_{n-i} + v_n,   (6.7)

where A(i), i \in \{1, \ldots, p\}, are matrices of size L \times L, and v_n is an L \times 1 zero-mean i.i.d. circular complex Gaussian vector with correlation matrix Q_{vv}(j) given by [213]

Q_{vv}(j) = E\left\{ v_n v_{n+j}^H \right\} = \sigma_v^2 I \delta(j).   (6.8)

Due to the independence assumption on the channel taps, the matrices A(i), i \in \{1, \ldots, p\}, and Q_{vv}(j) are diagonal. The solution for the diagonal entries can be found by using the Yule-Walker equations [192].

Information-theoretic results from [86] have demonstrated that a first-order Markovian model offers sufficient accuracy to model the Rayleigh time-varying channel. As shown in [87], a simple Gauss-Markov model can capture most of the channel tap dynamics and is suitable for channel tracking. Therefore, we utilise a first-order Markov model to describe the channel's dynamics. The extension to a higher-order AR model is straightforward. For the first-order AR model, we can rewrite the channel impulse response h_n in (6.7) as

h_n = A h_{n-1} + v_n.   (6.9)

Solving the linear system of the Yule-Walker equations for p = 1 (AR(1) model) yields the entries of the A and Q matrices. Matrix A is a diagonal matrix with the value \alpha = J_0(2\pi f_D T_s) on its diagonal, and Q is a diagonal matrix with the values \sigma_l^2 \left( 1 - |\alpha|^2 \right), l \in \{0, \ldots, L-1\}, on its diagonal.
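The following sketch computes these AR(1) parameters. The carrier frequency, terminal speed, OFDM symbol duration T_s and power profile are illustrative assumptions only, not values prescribed by this section.

```python
import numpy as np
from scipy.special import j0

# AR(1) fit of the Jakes channel per (6.9): the Yule-Walker solution gives
# alpha = J0(2*pi*fD*Ts) and per-tap noise power sigma_l^2 * (1 - |alpha|^2).
c = 3e8
fc, v = 5e9, 90.0 / 3.6                  # assumed 5 GHz carrier, 90 km/hr
Ts = 40e-6                               # assumed OFDM symbol duration
fD = v * fc / c                          # maximum Doppler frequency (Hz)

L = 14
sigma_l2 = np.exp(-np.arange(L) / 4.0)   # assumed exponential power profile
sigma_l2 /= sigma_l2.sum()

alpha = j0(2 * np.pi * fD * Ts)
A = alpha * np.eye(L)                    # state transition matrix in (6.9)
Q = np.diag(sigma_l2 * (1 - np.abs(alpha) ** 2))   # process noise covariance
print("fD = %.1f Hz, alpha = %.6f" % (fD, alpha))
```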

6.4 Receiver Structure

In this Section we present an iterative DD channel tracking and decoding algorithm for the BICM-OFDM system. Since a joint estimation and detection scheme is too complex to realize, a suboptimal solution is considered. In particular, we present an iterative receiver which consists of three submodules: a soft demodulator, a MAP decoder and a channel tracker. Throughout the iterative process, each submodule tends to provide increasingly reliable information to the other submodules, so as to improve the overall performance in a progressive manner. The receiver is depicted in Figure 6.2. The details of the components are now explained.

6.4.1 Soft Demodulator

Based on the received signal y_n, the estimated channel coefficients \hat{h}_n, and the a priori information fed back from the decoder, the soft demodulator yields the LLRs of the coded bits. During the first iteration we use the channel estimate from the previous block, \hat{h}_{n-1}, as our initial estimate (the DD approach), and for the remaining iterations we use the estimate given by the KF from the previous iteration. For clarity, we omit the time dependency n when it is not needed.

The OFDM equation for the k-th subcarrier, k \in \{1, \ldots, N\}, is

y[k] = D[k, k]\, h[k] + z[k],   (6.10)

where y[k] is the scalar observation, h[k] is the CFR, D[k, k] is the QPSK symbol and z[k] is the complex Gaussian noise with variance \sigma^2. The coded bits are assumed to be random and statistically uncorrelated due to the random interleaver in the BICM system.

Figure 6.2: Block diagram of the BICM-OFDM iterative receiver (OFDM demodulator, soft demodulator, de-interleaver, SISO decoder, interleaver, QPSK mapping, subcarrier selection, Kalman filter and consistency test)

Since the actual CFR h[k] is unknown, we use the estimated CFR \hat{h}[k] in the demodulator instead. The symbol

D[k, k] is composed of two bits, d_0[k] and d_1[k], as D[k, k] = [d_0[k], d_1[k]]. The LLRs of the A Posteriori Probabilities (APP) for the first and second bits, d_0[k] and d_1[k], respectively, are defined as

L(d_0[k] \mid y[k], \hat{h}[k]) \triangleq \ln \frac{Pr(d_0[k] = 1 \mid y[k], \hat{h}[k])}{Pr(d_0[k] = 0 \mid y[k], \hat{h}[k])},   (6.11)

L(d_1[k] \mid y[k], \hat{h}[k]) \triangleq \ln \frac{Pr(d_1[k] = 1 \mid y[k], \hat{h}[k])}{Pr(d_1[k] = 0 \mid y[k], \hat{h}[k])}.   (6.12)

Using the max-log approximation and the fact that the additive noise is zero-mean white circularly symmetric complex Gaussian, after some straightforward manipulations (see Appendix 6.8), we obtain

L(d_0[k] \mid y[k], \hat{h}[k]) \approx L_a(d_0[k]) + \frac{1}{2\sigma^2} \min\left[ \delta_{00}[k],\, \delta_{01}[k] - 2\sigma^2 L_a(d_1[k]) \right] - \frac{1}{2\sigma^2} \min\left[ \delta_{10}[k],\, \delta_{11}[k] - 2\sigma^2 L_a(d_1[k]) \right],   (6.13)

L(d_1[k] \mid y[k], \hat{h}[k]) \approx L_a(d_1[k]) + \frac{1}{2\sigma^2} \min\left[ \delta_{00}[k],\, \delta_{10}[k] - 2\sigma^2 L_a(d_0[k]) \right] - \frac{1}{2\sigma^2} \min\left[ \delta_{01}[k],\, \delta_{11}[k] - 2\sigma^2 L_a(d_0[k]) \right],   (6.14)

where L_a(d_i[k]), i = 0, 1, represents the a priori LLR of bit d_i[k], and \delta_{mn}[k] is given by

\delta_{mn}[k] = \left| y[k] - D_{mn}[k]\, \hat{h}[k] \right|^2,   (6.15)

where D_{mn}[k] stands for the QPSK symbol with \{d_0[k] = m, d_1[k] = n\}. The LLRs L_a(d_0[k]) and L_a(d_1[k]) are initialized to zero for the first iteration, since there is no a priori information on the coded bits; for the remaining iterations, these values are provided by the soft decoder outputs. The LLRs of the APP at the demodulator's output, L(d_i[k] \mid y[k], \hat{h}[k]), minus the a priori LLRs at the demodulator's input, L_a(d_i[k]), generate the extrinsic LLRs of the coded digits, denoted by L_e(d_i[k]). This is given by

L_e(d_i[k]) = L(d_i[k] \mid y[k], \hat{h}[k]) - L_a(d_i[k]).   (6.16)

After de-interleaving, the extrinsic information L_e(d_i[k]) is fed into the soft decoder as its a priori information, denoted by L_a(c_j).
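A compact sketch of the max-log computation (6.13)-(6.15) for a single subcarrier is given below. The Gray mapping used for D_mn is an assumption for illustration, since the text specifies QPSK but not a particular bit labeling.

```python
import numpy as np

def maxlog_llr_qpsk(y, h_hat, sigma2, La0=0.0, La1=0.0):
    """Max-log LLRs (6.13)-(6.14) for one subcarrier.

    The Gray mapping D_mn = ((2m-1) + 1j*(2n-1))/sqrt(2) for bits
    (d0, d1) = (m, n) is an illustrative assumption.
    """
    delta = {}
    for m in (0, 1):
        for n in (0, 1):
            D_mn = ((2 * m - 1) + 1j * (2 * n - 1)) / np.sqrt(2)
            delta[m, n] = np.abs(y - D_mn * h_hat) ** 2   # (6.15)
    # (6.13): a priori term plus the two max-log minima over d0 = 0 and d0 = 1
    L0 = (La0
          + min(delta[0, 0], delta[0, 1] - 2 * sigma2 * La1) / (2 * sigma2)
          - min(delta[1, 0], delta[1, 1] - 2 * sigma2 * La1) / (2 * sigma2))
    # (6.14): same structure over d1 = 0 and d1 = 1
    L1 = (La1
          + min(delta[0, 0], delta[1, 0] - 2 * sigma2 * La0) / (2 * sigma2)
          - min(delta[0, 1], delta[1, 1] - 2 * sigma2 * La0) / (2 * sigma2))
    return L0, L1
```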

6.4.2 MAP Decoder

The MAP decoder is a soft-input soft-output decoder. It receives the LLR values of the input bits from the soft demodulator, L_a(c_j), and produces the LLRs of the APP of the coded bits,

L(c_j). The log-likelihood ratio of the coded bit c_j is given by

L(c_j) = \ln \left( \frac{Pr(c_j = 1 \mid L(D \mid y, \hat{h}))}{Pr(c_j = 0 \mid L(D \mid y, \hat{h}))} \right),   (6.17)

where L(D \mid y, \hat{h}) = \left[ \ldots, L(d_0[k] \mid y[k], \hat{h}[k]), L(d_1[k] \mid y[k], \hat{h}[k]), \ldots \right]^T, and its elements L(d_i[k] \mid y[k], \hat{h}[k]), k \in \{1, \ldots, N\}, i \in \{0, 1\}, are the LLR values of the i-th bit of the k-th subcarrier, as shown in (6.13) and (6.14). The MAP decoder outputs are fed back to both the soft demodulator and the channel tracker for the next iteration. The LLRs of the coded bits can be directly fed back to the KF for channel tracking at the next iteration. In the meantime, by subtracting the a priori information at the input of the MAP decoder, we obtain the extrinsic information of the coded bits, denoted by

L_e(c_j), given by

L_e(c_j) = L(c_j) - L_a(c_j).   (6.18)

The extrinsic information is interleaved and used as the a priori information L_a(d_i[k]) for the soft demodulator at the next iteration, as shown in (6.13) and (6.14). At the last iteration, the

LLR of the coded bits, L(c_j), is used to detect the original data stream.

6.4.3 Channel Tracking with Known Symbols

Consider the AR model for the channel dynamics (6.9) and the OFDM model (6.1). A state-space model for the OFDM system can be written as

h_n = A h_{n-1} + v_n,   (6.19)
y_n = D_n W_L h_n + z_n.   (6.20)

If the data symbols are known at the receiver (training phase), the CIR h_n can be tracked using the KF as follows [32]. Step 1: a priori estimation (prediction)

\hat{h}_n^- = A \hat{h}_{n-1},   (6.21)
P_n^- = A P_{n-1} A^H + Q.   (6.22)

Step 2: a posteriori estimation (update)

K_n = P_n^- (D_n W_L)^H \left[ D_n W_L P_n^- (D_n W_L)^H + R \right]^{-1},   (6.23)
\hat{h}_n = \hat{h}_n^- + K_n \left( y_n - D_n W_L \hat{h}_n^- \right),   (6.24)
P_n = P_n^- - K_n D_n W_L P_n^-.   (6.25)

First, the filter calculates the a priori estimate of the channel, \hat{h}_n^-, and its error covariance matrix P_n^-, based on the history prior to the current observation y_n, where [214]

P_n^- = E\left\{ \left( \hat{h}_n^- - h_n \right) \left( \hat{h}_n^- - h_n \right)^H \right\}.   (6.26)

Next, the a posteriori estimate of the channel, \hat{h}_n, and its error covariance matrix P_n are evaluated once the observation y_n is available at the receiver, where

P_n = E\left\{ \left( \hat{h}_n - h_n \right) \left( \hat{h}_n - h_n \right)^H \right\}.   (6.27)

The Kalman gain K_n balances the weight between the predicted estimate and the innovation process. The matrices Q and A are defined in Subsection 6.3.3, and R = I\sigma^2 is the covariance matrix of the complex zero-mean Gaussian noise.
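For concreteness, the following sketch transcribes one KF cycle (6.21)-(6.25) into numpy; it is a direct restatement of the equations above, not an optimized implementation.

```python
import numpy as np

def kalman_step(h_prev, P_prev, y, D, W_L, A, Q, R):
    """One prediction/update cycle of the KF in (6.21)-(6.25).

    B = D @ W_L plays the role of the observation matrix D_n W_L;
    all quantities are complex-valued numpy arrays.
    """
    # Step 1: a priori estimation (prediction), (6.21)-(6.22)
    h_pred = A @ h_prev
    P_pred = A @ P_prev @ A.conj().T + Q

    # Step 2: a posteriori estimation (update), (6.23)-(6.25)
    B = D @ W_L
    S = B @ P_pred @ B.conj().T + R          # innovation covariance
    K = P_pred @ B.conj().T @ np.linalg.inv(S)
    innov = y - B @ h_pred                   # innovation process
    h_post = h_pred + K @ innov
    P_post = P_pred - K @ B @ P_pred
    return h_post, P_post, innov, S
```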

6.4.4 Decision-Directed Based Channel Tracking

In the previous Subsection we laid the foundation for channel tracking using the KF. During data transmission, pilot symbols are not available, and DD channel tracking can be used. In order to form the state-space equations (6.19)-(6.20), the symbol matrix \hat{D}_n is obtained from the outputs of the MAP decoder by interleaving the hard decisions of the coded bits and then mapping them to the QPSK signal constellation (gray area in Figure 6.2). However, D_n and \hat{D}_n may differ due to misdecoded symbols. Therefore, using \hat{D}_n instead of D_n in the state-space model (6.19)-(6.20) is an approximation of the actual model.

As long as \hat{D}_n = D_n (meaning that there are no misdecoded symbols), the state-space model in (6.19)-(6.20) is accurate and the KF operates at its nominal values and is therefore optimal in the MMSE sense. However, if \hat{D}_n \neq D_n, the filter may diverge, depending on the model inaccuracies. This can lead to serious performance degradation. Misdetections can be seen as the addition of contaminating noise or, equivalently, an outlier. This is because the OFDM equation for the k-th subcarrier (where we omit the time dependency n for clarity of presentation) can be expressed as

y[k] = d[k]\, h[k] + z[k].   (6.28)

When a misdetection occurs, the misdetected symbol \hat{d}[k] can be expressed as

\hat{d}[k] = d[k] + \tilde{d}[k],   (6.29)

where \tilde{d}[k] is the error term, and (6.28) can be written as

y[k] = \hat{d}[k]\, h[k] - \tilde{d}[k]\, h[k] + z[k].   (6.30)

Consequently, the use of \hat{d}[k] instead of d[k] contributes to the overall noise component, which now accounts for both the AWGN and the misdetection, and is defined as

\tilde{z}[k] \triangleq z[k] - \tilde{d}[k]\, h[k].   (6.31)

The distribution of \tilde{z}[k] can be written as

p(\tilde{z}[k]) = \sum_{\tilde{d}[k]} p\left( z[k] - \tilde{d}[k]\, h[k] \,\middle|\, \tilde{d}[k] \right) Pr\left( \tilde{d}[k] \right).   (6.32)

Since each element p\left( z[k] - \tilde{d}[k]\, h[k] \,\middle|\, \tilde{d}[k] \right) in the summation of (6.32) follows a \mathcal{CN}\left( z[k]; -\tilde{d}[k]\, h[k], \sigma^2 \right) distribution, the overall distribution p(\tilde{z}) is a mixture of Gaussians. This illustrates why the KF is no longer optimal, since the KF assumes a single-mode posterior distribution [214]. The sensitivity of the KF to outliers is well known, and it is said that the KF lacks robustness.

6.4.5 Robust Estimation

A few methods to tackle the problem of robust estimation have been devised. These methods provide tools for statistical problems in which the underlying assumptions are inexact. The most common approach is based on the M-estimator methodology, for which different cost functions have been analyzed. We now briefly review a few of them.

Consider the quadratic (L2 norm) estimator

\rho(x) = \|x\|^2,   (6.33a)
\Psi(x) = 2x,   (6.33b)

where \rho(x) is the cost function and \Psi(x), defined as the derivative of \rho(x), is proportional to the influence function. The influence function measures the impact of a single contaminated observation and can be seen as a local resistance measure, whereas the breakdown point is a global resistance measure. The idea behind the M-estimator methodology is that in order to increase robustness, an estimator must be more forgiving about outlying measurements. One simple approach is to replace the quadratic (or L2 norm) cost function with the absolute value (or L1 norm):

\rho(x) = |x|,   (6.34a)
\Psi(x) = \operatorname{sign}(x).   (6.34b)

Unfortunately, the L1 norm based estimator is not robust either, as it still has a breakdown point of \frac{1}{N}, where N is the dimension of x. Because of this, Huber [203] proposed the minimax estimator

\rho(x) = \begin{cases} \dfrac{\|x\|^2}{2\epsilon} + \dfrac{\epsilon}{2}, & |x| \leq \epsilon, \\ |x|, & |x| > \epsilon, \end{cases}   (6.35a)

\Psi(x) = \begin{cases} \dfrac{x}{\epsilon}, & |x| \leq \epsilon, \\ \operatorname{sign}(x), & |x| > \epsilon, \end{cases}   (6.35b)

where \epsilon is a user-specified predefined threshold. Huber's minimax estimator combines the behavior of the L2 norm for small residuals with that of the L1 norm for large residuals, thus reducing its sensitivity to outliers. Still, Huber's M-estimator has a breakdown point of \frac{1}{N} in the case of an EIV model [206].
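The following Python sketch transcribes the three cost/influence pairs (6.33)-(6.35) for scalar residuals; it is illustrative only, with eps standing in for the user-chosen threshold ε.

```python
import numpy as np

# Cost functions rho(x) and influence-related functions Psi(x) of
# (6.33)-(6.35), written elementwise for scalar or array residuals.
def rho_l2(x):            return x ** 2                       # (6.33a)
def psi_l2(x):            return 2 * x                        # (6.33b)

def rho_l1(x):            return np.abs(x)                    # (6.34a)
def psi_l1(x):            return np.sign(x)                   # (6.34b)

def rho_huber(x, eps):                                        # (6.35a)
    return np.where(np.abs(x) <= eps, x ** 2 / (2 * eps) + eps / 2, np.abs(x))

def psi_huber(x, eps):                                        # (6.35b)
    return np.where(np.abs(x) <= eps, x / eps, np.sign(x))
```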

To further increase robustness, redescending estimators, such as Tukey's biweight, Andrews' sine and the Lorentzian, for which the influence of outliers tends to zero, were derived [207]. A key property of all M-estimator methods is that they are based on examining the residual elements in order to detect the existence of outliers:

x \triangleq \frac{y - \hat{h}\hat{d}}{\sigma},   (6.36)

where \sigma is the standard deviation of the AWGN.

However, in the system under consideration, we can take advantage of the knowledge of the LLR values of the data symbols given by the MAP decoder. These values indicate the probability of the existence of outliers in the system model and provide a valuable source of information.

In the next Subsection we present a novel method to robustify the KF, based on a modified cost function which takes into account both the residuals and the LLR values.

6.4.6 Channel Tracking using Adaptive Detection Selection

In this Section we present a novel algorithm to prevent the KF's divergence due to model mismatch as specified above. Divergence is said to occur when the actual errors in the state estimates become inconsistent with the predicted error covariance. This may cause a breakdown of the channel tracking module. From a robust estimation point of view, our approach has the same goal as the outlier-rejection version of the Redescending M-estimators approach [207], where we aim at completely downweighting outliers (that is, giving them zero influence). However, instead of only using residual analysis, we use a novel approach to decide on the existence of outliers and a rejection mechanism. In contrast to other approaches such as [195–197] that use the soft output values of the MAP decoder in order to perform channel estimation, we aim at minimizing the state-space model mismatch in (6.19)-(6.20). This enables us to use the KF without modifications (and approximations such as in [195]). In order to reject outliers, we need to use only correctly decoded symbols in constructing the state-space model.

To this end we notice that the dimension of y_n is much larger than that of h_n (N \gg L), so the system is overdetermined. Therefore, instead of using all N elements of y_n (and N decoded symbols), we can choose a subset \Omega_n of the total N symbols to be used for tracking the CIR. Conceptually, this amounts to assigning binary weights \{0, 1\} to each of the elements in the observation vector, so that only those with a weight of 1 are taken into account and the rest are pruned. In this case the state-space model takes the form

h_n = A h_{n-1} + v_n,   (6.37)
M_n y_n = M_n \left( D_n W_L h_n + z_n \right),   (6.38)

where M_n is an N \times N diagonal masking matrix whose diagonal entries contain binary values \{0, 1\} according to:

[M_n]_{k,k} = \begin{cases} 1, & k \in \Omega_n, \\ 0, & \text{else}, \end{cases}   (6.39)

where k \in \{1, \ldots, N\} is the subcarrier index.

In determining \Omega_n, two parameters need to be evaluated:

1. \mu_n - the number of decoded symbols used to form \Omega_n (\mu_n \in \{1, \ldots, N\}).

2. Determining which \mu_n symbols out of the N possible ones are to be included in \Omega_n.

We first address the second question: given \mu_n, \Omega_n contains the \mu_n decoded symbols with the highest reliability measure CL[k], k \in \{1, \ldots, N\}, where CL[k] is the reliability measure of the decoded symbols of the k-th subcarrier, defined as:

CL[k] \triangleq \min\left( |L(c_{2k})|,\, |L(c_{2k+1})| \right), \quad k \in \{1, \ldots, N\},   (6.40)

where L(\cdot) is defined in (6.17). The reason for favoring this criterion over one that uses the sum of the LLR magnitudes of the two QPSK bits (|L(c_{2k})| + |L(c_{2k+1})|) is that the latter does not give a clear picture of the reliability of the detection of the QPSK symbol. The former criterion requires that both decoded bits forming the symbol are reliable; in the latter, however, if one of the bits is highly reliable and the other is not, the value of the reliable LLR bit may overshadow the poor LLR value of the other bit, giving the wrong impression that the symbol is reliable.
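The sketch below implements this selection rule together with the masking matrix (6.39); the min-of-magnitudes form of (6.40) follows the reliability argument given above.

```python
import numpy as np

def select_subset(llr, mu):
    """Pick the mu most reliable subcarriers according to (6.40).

    llr is a length-2N array of coded-bit LLRs, bits (2k, 2k+1) forming
    subcarrier k. CL[k] = min(|L(c_2k)|, |L(c_2k+1)|), so a symbol counts
    as reliable only if both of its bits are.
    """
    N = llr.size // 2
    CL = np.minimum(np.abs(llr[0::2]), np.abs(llr[1::2]))
    omega = np.argsort(CL)[::-1][:mu]        # indices forming Omega_n
    M = np.zeros((N, N))
    M[omega, omega] = 1.0                    # masking matrix (6.39)
    return omega, M
```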

Next we address the first question. The value \mu_n should maintain a balance between two contradictory requirements: on the one hand, the more observations (and correctly decoded symbols) we take into account, the smaller the variance of the estimation error of h_n will be. On the other hand, misdecoded symbols that are taken into account may cause the filter to diverge. Moreover, the KF maintains a balance between the state model and the measurements. In the low SNR region the KF "relies" more (gives more weight) on the channel model and less on the observations

(relatively small values of K_n). In the high SNR region the KF "relies" more on the observations and less on the channel model (relatively large values of K_n) [215]. This implies that the penalty for misdecoded symbols in the high SNR region is more severe than at low SNRs.

We therefore suggest the following approach for choosing \mu_n: by evaluating the consistency status of the KF, we can adapt the value of \mu_n accordingly. When the consistency status of the KF is poor, \mu_n is gradually decreased by a fixed amount \eta, thus allowing only very reliable decoded symbols to be used, while when the consistency status is good, \mu_n can be increased, thus incorporating more decoded symbols. We now present a consistency test for the KF.

6.4.7 Tracking Quality Indicator using Consistency Test

In this Subsection we discuss a method to evaluate the performance robustness of the KF. In [211], several consistency tests were suggested. These tests should not be confused with parameter estimation consistency, which is an asymptotic measure; here we rely on a finite sample size. We now present two common measures for testing the consistency of the KF.

To this end we define the innovation process at time n as

\epsilon_n = y_n - \hat{D}_n W_L \hat{h}_n^-,   (6.41)

then the first measure is the Normalized Innovation Squared (NIS) at time n, defined as [211]

NIS_n \triangleq \epsilon_n^T S_n^{-1} \epsilon_n,   (6.42)

where S_n \triangleq \hat{D}_n W_L P_n^- \left( \hat{D}_n W_L \right)^H + R. Under nominal conditions (in our case, when \hat{D}_n = D_n), NIS_n follows a \chi^2_K distribution with K degrees of freedom, where K is the dimension of the observation vector y_n. Based on NIS_n, the second measure is termed the modified log-likelihood. At time instant n it can be written in a recursive manner as (see [211], equation 2-246)

\lambda_n = \lambda_{n-1} + NIS_n.   (6.43)

The distribution of \lambda_n is \chi^2_{nK} with nK degrees of freedom. Notice that the distribution of \lambda_n varies over time, as it depends on n. Using confidence intervals, we can perform hypothesis testing on either NIS_n or \lambda_n to detect the consistency status of the KF, where the null hypothesis is

H_0: Consistent filter   (6.44)

and the alternative hypothesis is

H_1: Inconsistent filter   (6.45)

The null hypothesis is rejected when λn or NISn lie outside the expected confidence interval.

For example, if NIS_n values are tested, then the null hypothesis H_0 is accepted if

NIS_n \in [r_1, r_2],   (6.46)

where the confidence interval [r_1, r_2] is determined such that

p\left( NIS_n \in [r_1, r_2] \right) = 1 - \alpha,   (6.47)

where 1 - \alpha is the confidence level. The smaller \alpha is, the larger the range [r_1, r_2] will be. That means that as \alpha decreases, the test becomes more forgiving, since a larger range of NIS_n is tolerated. In Section 6.6, we assess these two criteria and the influence of different confidence interval values.
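A small sketch of how [r_1, r_2] can be obtained is given below. Splitting α equally between the two tails of the chi-square distribution is one natural choice and is an assumption here, as the text does not prescribe how the interval is placed.

```python
from scipy.stats import chi2

# Two-sided acceptance region for (6.46)-(6.47): under H0, NIS_n is
# chi-square with K degrees of freedom (K = dimension of y_n), so
# equal-tail quantiles give a (1 - alpha) confidence interval.
def nis_interval(K, alpha):
    r1 = chi2.ppf(alpha / 2, df=K)
    r2 = chi2.ppf(1 - alpha / 2, df=K)
    return r1, r2

print(nis_interval(K=64, alpha=0.1))   # e.g. N = 64 observations per symbol
```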

6.4.8 Adaptive Detection Selection Algorithm

We now present the ADS algorithm based on the tracking quality indicator and the adaptive subset selection. At each time n, a consistency test using either NIS_n or the modified log-likelihood is performed. If the null hypothesis H_0 is accepted, the KF is assumed to work at nominal values and therefore all decoded symbols are used for channel tracking. If hypothesis H_1 is declared, the number of decoded symbols used for channel tracking is adaptively reduced, while ensuring that at least L decoded symbols are still used. The procedure is described in Algorithm 15.

Algorithm 15 Consistency Test and Adaptive Subcarrier Selection
Input: L, \eta, y_n, \hat{D}_n, \hat{h}_n^-, r_1, r_2
Output: \mu_n
1: Evaluate the innovation: \epsilon_n = y_n - \hat{D}_n W_L \hat{h}_n^-
2: Evaluate NIS_n according to (6.42) or \lambda_n according to (6.43)
3: Perform hypothesis testing; for example, the NIS_n test is: if NIS_n \notin [r_1, r_2] then H_1, else H_0.
4: if H_1 then
5:   \mu_n = \max[L, \mu_{n-1} - \eta]
6: else
7:   \mu_n = N
8: end if
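A runnable rendering of Algorithm 15's decision logic might look as follows; the defaults N = 64, L = 14 and η = 4 match the system configuration used later in Section 6.6.

```python
def ads_step(nis, mu_prev, r1, r2, N=64, L=14, eta=4):
    """One pass of Algorithm 15 (consistency test + adaptive selection).

    nis is NIS_n from (6.42); [r1, r2] is the acceptance region from (6.47).
    On H1 the subset size shrinks by eta (but never below the L channel
    taps); on H0 all N decoded symbols are used again.
    """
    if r1 <= nis <= r2:              # H0: consistent filter, use everything
        return N
    return max(L, mu_prev - eta)     # H1: trim the least reliable symbols
```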

6.4.9 Discussion - Soft versus Hard Kalman Filter

Our proposed approach uses the conventional KF with hard decisions on only a subset of the N decoded symbols. By doing so we aim at minimizing the amount of incorrect symbol decisions used by the channel tracking module. In contrast, other approaches such as the EM-Kalman [196], [197] and the soft-KF [195] make use of the decoded soft values and incorporate this information in the KF. Both the EM-Kalman and soft-Kalman methods use the soft values to build a modified state-space model. The rationale behind these approaches is that for unreliable decoded symbols, the corresponding variance is significant and downweights their contribution in the KF. However, this means that state-space model mismatch constantly occurs (as E\{\hat{D}_n\} \neq D_n), regardless of whether the decoded symbols are accurate or not. The soft-Kalman does not yield the MMSE estimate but only the LMMSE estimate and is therefore suboptimal. This is due to the fact that the additive noise has a nonlinear component containing the multiplication of two random vectors. Consequently, y_n and h_n cannot be assumed to be jointly Gaussian.

The EM-Kalman algorithm, based on the EM method [35], is known to converge to a stationary point corresponding to a local optimum of the posterior distribution, and its convergence to the global mode is not guaranteed. Therefore, the EM-Kalman method relies on the initial estimate and is sensitive to poor initial estimates, which can occur relatively often in a fast fading channel scenario.

For the proposed method, as long as the ADS component filters out all misdecoded symbols, our approach does not suffer from model mismatch. In that case, the state-space model is accurate and the MMSE estimate is produced for the given number of decoded symbols used for CIR tracking.

6.5 Estimation Error Analysis

In this Section we analyse the impact of misdecoded symbols on the channel estimation error of each subcarrier. For simplicity, we limit the analysis to the high SNR region. Recalling the Kalman gain in (6.23),

K_n = P_n^- \left( \hat{D}_n W_L \right)^H \left[ \hat{D}_n W_L P_n^- \left( \hat{D}_n W_L \right)^H + R \right]^{-1},   (6.48)

and defining B \triangleq \hat{D}_n W_L, we obtain

K_n = P_n^- B^H \left( B P_n^- B^H + R \right)^{-1}
= \left( B^H B \right)^{-1} B^H B P_n^- B^H \left( B P_n^- B^H + R \right)^{-1},

so that

K_n \left( B P_n^- B^H + R \right) = \left( B^H B \right)^{-1} B^H B P_n^- B^H.   (6.49)

In the high SNR region we get

K_n \left( B P_n^- B^H + \lim_{R \to 0} R \right) = \left( B^H B \right)^{-1} B^H B P_n^- B^H
\;\Rightarrow\; K_n B P_n^- B^H = \left( B^H B \right)^{-1} B^H B P_n^- B^H
\;\Rightarrow\; K_n = \left( B^H B \right)^{-1} B^H = \left[ \left( \hat{D}_n W_L \right)^H \hat{D}_n W_L \right]^{-1} \left( \hat{D}_n W_L \right)^H.   (6.50)

Since \hat{D}_n^H \hat{D}_n = 2I for QPSK and W_L^H W_L = NI, we obtain

K_n = \frac{1}{2N} W_L^H \hat{D}_n^H.   (6.51)

Now (6.24) becomes b

\hat{h}_n = \hat{h}_n^- + K_n \left( y_n - \hat{D}_n W_L \hat{h}_n^- \right)
= \hat{h}_n^- + \frac{1}{2N} W_L^H \hat{D}_n^H \left( y_n - \hat{D}_n W_L \hat{h}_n^- \right)
= \frac{1}{2N} W_L^H \hat{D}_n^H y_n.   (6.52)

This result shows that in the high SNR region the KF does not use the channel model at all, but relies only on the observation vector to estimate h_n. In fact, (6.52) is the LS estimate of h_n. We define the Normalized Squared Error (NSE) for a specific subcarrier k as

NSE_k \triangleq \frac{\left| \hat{H}^k - H^k \right|^2}{\left| H^k \right|^2} = \frac{\left| \omega^k \hat{h} - \omega^k h \right|^2}{\left| H^k \right|^2},   (6.53)

where H^k and \hat{H}^k are the CFR and estimated CFR of the k-th subcarrier, respectively, and \omega^k = \left[ e^{-j 2\pi l k / N} \right]_{l \in \{0, \ldots, L-1\}} is a 1 \times L vector. The NSE indicates how much the estimate of the channel response of subcarrier k deteriorates due to the estimation error of the channel impulse response h. Combining (6.52) and (6.53), we obtain

NSE_k = \frac{\left| \frac{1}{2N} \omega^k W_L^H \hat{D}^H y - H^k \right|^2}{\left| H^k \right|^2}
= \frac{\left( \omega^k W_L^H \left( \hat{D} - D \right) y \right)^H \left( \omega^k W_L^H \left( \hat{D} - D \right) y \right)}{4 N^2 \left| H^k \right|^2}.   (6.54)

Defining the data error matrix as \Delta \triangleq \hat{D} - D, we have

NSE_k = \frac{\left( \omega^k W_L^H \Delta y \right)^H \left( \omega^k W_L^H \Delta y \right)}{4 N^2 \left| H^k \right|^2}
= \frac{y^H \Delta^H W_L\, \omega^{k,H} \omega^k\, W_L^H \Delta y}{4 N^2 \left| H^k \right|^2}
= \frac{y^H \Delta^H \Omega^k \Delta y}{4 N^2 \left| H^k \right|^2},   (6.55)

where \Omega^k \triangleq W_L\, \omega^{k,H} \omega^k\, W_L^H is an N \times N deterministic matrix which depends only on the index k. Noting that y = DH + z \approx DH in the high SNR region, we get

NSE_k = \frac{H^H D^H \Delta^H \Omega^k \Delta D H}{4 N^2 \left| H^k \right|^2}.   (6.56)

Letting \Pi \triangleq \Delta D, (6.56) can be rewritten as

NSE_k = \frac{H^H \Pi^H \Omega^k \Pi H}{4 N^2 \left| H^k \right|^2}.   (6.57)

Since \Pi is a multiplication of two diagonal matrices, \Pi is a diagonal matrix as well, with nonzero values only on the diagonal entries where \Delta holds values, that is, only on the entries where misdecoded symbols occur. Defining \Theta^k \triangleq \Pi^H \Omega^k \Pi, we have

NSE_k = \frac{H^H \Theta^k H}{4 N^2 \left| H^k \right|^2}.   (6.58)

Taking a close look at \Theta^k, we can see that \Theta^k is an N \times N matrix with entries at all the permutations of the erroneous subcarriers. For example, if there were two misdecoded symbols in the data vector, on subcarriers 5 and 8, the matrix \Theta^k would have non-zero values at the indices \{(5,5), (5,8), (8,5), (8,8)\}. Generally, in the case of r errors, the number of nonzero entries is r^2. The numerator becomes a combination of the subcarriers that contain errors, H^e, e \in J, and the permutation matrix \Theta^k, where J is the set of subcarriers that contain errors. The number of values that this combination holds grows exponentially with the number of errors. This indicates that in the high SNR region, the effect of misdecoded symbols is severe. Another important observation is that regardless of the location of the errors, all the subcarriers suffer some distortion, since the estimation error smears across all the subcarriers.
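The sketch below constructs Θ^k for two hypothetical symbol errors and prints its support, confirming the {(5,5), (5,8), (8,5), (8,8)} pattern; all sizes and symbol values are illustrative assumptions.

```python
import numpy as np

# Numerical check of the sparsity pattern of Theta^k = Pi^H Omega^k Pi:
# with misdecoded symbols on subcarriers 5 and 8 only, the nonzero entries
# of Theta^k sit exactly on {(5,5), (5,8), (8,5), (8,8)}.
N, L, k = 64, 14, 3                          # illustrative sizes
n = np.arange(N)[:, None]
W_L = np.exp(-2j * np.pi * n * np.arange(L)[None, :] / N)
omega_k = np.exp(-2j * np.pi * k * np.arange(L) / N)[None, :]   # 1 x L

Omega_k = W_L @ omega_k.conj().T @ omega_k @ W_L.conj().T        # N x N

rng = np.random.default_rng(1)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
D = np.diag(rng.choice(qpsk, size=N))        # transmitted symbol matrix
Delta = np.zeros((N, N), complex)
Delta[5, 5] = Delta[8, 8] = np.sqrt(2)       # hypothetical symbol errors
Pi = Delta @ D
Theta_k = Pi.conj().T @ Omega_k @ Pi

print(np.argwhere(np.abs(Theta_k) > 1e-12))  # -> (5,5), (5,8), (8,5), (8,8)
```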

6.6 Simulation Results

In this Section we present the simulation results. First, we describe the simulated system.

6.6.1 System Configuration

An OFDM system employing QPSK modulation and operating at a carrier frequency of f_c = 5 GHz is considered. The available bandwidth is 2 MHz. The OFDM system has N = 64 data subcarriers and a CP of 16 samples. The CIR h_n is assumed to be multipath Rayleigh fading with an independent exponentially decaying power profile, where the channel length L is set to 14. The channel code is a memory-2, rate-1/2 convolutional code with generator polynomials (7, 5) in octal form. A random interleaver is used for bit interleaving, and the interleaved coded bits are mapped onto QPSK symbols. We consider a wide range of terminal velocities: v \in [0, 100] km/hr. The value of \eta was chosen to be 4. Lower values caused the ADS algorithm to react too slowly, so that many misdecoded symbols were used for channel tracking. When larger values of \eta were used, although almost always only correctly decoded symbols were used for channel tracking, some correctly decoded symbols that could have been used were also trimmed, thus not achieving the full potential of the KF. In order to avoid losing track, training of one OFDM symbol every 100 symbols was performed. During the OFDM training symbol, the value of \mu_n was set to 64, since all decoded symbols are correct.

6.6.2 Kalman Filter Consistency Tests

The value of the confidence parameter \alpha is usually selected based on simulations. We tested different values of the confidence level (1 - \alpha). Figure 6.3 depicts the histograms of the number of subcarriers used on average for \alpha \in \{0.01, 0.1, 0.2\}, SNR = 5 dB and v = 90 km/hr. As expected, the larger the value of \alpha used, the smaller the range [r_1, r_2], which means that the

NIS_n test produces fewer null-hypothesis outcomes, and therefore fewer subcarriers are used for channel tracking. We found that among the three \alpha values, \alpha = 0.1 gave the best results over a large range of velocities and SNR values, and it was used in the BER simulations. We also found that there was no significant difference, in terms of BER performance, between the NIS_n (equation (6.42)) and \lambda_n (equation (6.43)) criteria. Since the implementation of the

NIS_n hypothesis test is simpler due to its constant statistical properties, we used NIS_n as our hypothesis test in all simulations.


Figure 6.3: Number of subcarriers used for different values of the confidence parameter \alpha \in \{0.01, 0.1, 0.2\}, for SNR = 5 dB, v = 90 km/hr

6.6.3 BER and Channel Estimation Error Results

In this part, we present the simulation results for different SNR values. The following methods are compared:

1. PERFECT-CSI - the CIR is assumed to be perfectly known at the receiver. This method acts as a lower bound for the BER of the system.

2. ADS - the algorithm proposed in Section 6.4.6.

3. GENIE-AIDED - under this scenario the proposed algorithm uses only correctly decoded symbols. This method acts as a lower bound for the ADS algorithm.

4. USE-ALL - all decoded symbols are used for CIR tracking.

5. EM-KALMAN - the algorithm is based on EM-Kalman approach.

6. IRLS-HM - based on the algorithm presented in [212]. This uses Huber's cost function and is solved using Iteratively Reweighted Least Squares (IRLS). The iterative procedure is terminated once the sum of squared residuals changes by less than

1% from the previous iteration, as recommended in [212]. The \beta parameter was set to 1.3, as it gave the lowest BER at the target BER of 10^{-4}.

First, we present a general comparison of these methods. For terminal velocities under v = 40 km/hr, there was no apparent difference between the methods in terms of BER performance. This is because at relatively slow fading rates, DD based detection is reliable. As the terminal velocity increases above v = 40 km/hr, the ADS BER performance shows an improvement over the other methods. At very high velocities, above v = 200 km/hr, all methods gave poor results, mainly due to the high Doppler rate, which causes the DD based initial detection of the symbols to be very poor. In order to illustrate the BER performance, we present simulation results for a terminal velocity of v = 90 km/hr.

We now compare the USE-ALL, ADS and PERFECT-CSI methods, in order to show the performance improvement of ADS over a conventional system. The BER results are depicted in Figure 6.4. It is clear that the proposed ADS algorithm is substantially better than the USE-ALL method. It is also visible that the iterative process improves the performance of both methods.


Figure 6.4: BER performance comparison of ADS versus USE-ALL for v = 90 km/hr


Figure 6.5: Channel estimation error performance comparison of ADS versus USE-ALL for v = 90 km/hr

However, for v = 90 km/hr the ADS algorithm is about 3 dB better than USE-ALL at a BER of 10^{-4}. As these results show, the USE-ALL method suffers severe performance degradation. This is due to the error propagation phenomenon occurring in the channel tracking component, and can be seen from the channel estimation error results. Figure 6.5 depicts the average NSE as defined in (6.53), where the average is taken over all subcarriers. It is clear that our algorithm achieves a much lower NSE. While USE-ALL cannot keep tracking the correct channel trajectory, our algorithm does not suffer from an apparent error floor in the high SNR region. This results in an overall lower BER, as shown before. Next we compare the ADS and GENIE-AIDED methods. GENIE-AIDED acts as a lower bound for our algorithm, since only correctly decoded symbols are used for channel tracking, and therefore no model mismatch occurs. The BER results are depicted in Figure 6.6. The BER results of ADS are only about 0.5 dB away from those of the GENIE-AIDED case. This result shows that the ADS algorithm is efficient in detecting inconsistencies of the KF and that the subcarrier trimming approach is effective.


Figure 6.6: BER performance comparison of ADS versus GENIE-AIDED for v = 90 km/hr

Lastly, we compare the BER performance of the ADS algorithm with the EM-KALMAN and IRLS-HM methods. The results are depicted in Figures 6.7 and 6.8, respectively. As can be seen, the ADS algorithm provides gains of about 1.4 dB and 0.5 dB over the EM-KALMAN and IRLS-HM algorithms, respectively, at a BER of 10^{-4}. We noticed that as the mobile velocity increases, the EM-KALMAN algorithm's robustness degrades and channel tracking becomes less reliable, resulting in a higher channel estimation error. This leads to inferior BER results for the EM-KALMAN algorithm. In addition, the EM-KALMAN is very sensitive to the initial CIR estimate. In a fast fading environment, when the DD initial estimate is poor, the EM-KALMAN performance is degraded. Therefore, the EM-Kalman method is probably not well suited for DD detection in a fast fading environment and for long data frames. Note that similar results have also been reported in [216]. We also note that while IRLS-HM is based solely on the residual values, it may be useful to consider the generalized M-estimators [207], which make use of the soft values of the symbol matrix \hat{D}_n.

b 6.6 Simulation Results 168

Figure 6.7: BER performance comparison of ADS versus EM-KALMAN for v = 90 km/hr

Figure 6.8: BER performance comparison of ADS versus IRLS-HM for v = 90 km/hr

6.7 Chapter Summary and Conclusions

In this chapter we presented an iterative DD-based channel tracking and symbol decoding scheme using a Kalman filter and a MAP decoder for BICM-OFDM systems. We addressed the problem of state-space mismatch due to misdecoded symbols, which causes the KF to diverge. We showed that in order to perform channel tracking, not all decoded symbols are required; instead it is possible to take into account only those whose reliability is high. We introduced a new component that performs a consistency test on the innovation sequence, and adapts the number of subcarriers used to form the state-space model. This method is capable of reducing the effect of state-space mismatch, thereby considerably decreasing the error propagation effect of the KF. We then presented an analysis of the influence of misdecoded symbols on the KF and how it affects the estimation error of the subcarriers in the frequency domain. Finally, simulation results showed a considerable performance improvement over systems that use soft value based methods and robust estimation techniques.

6.8 Appendix - LLR Values of the A posteriori Probabilities

We define the log-likelihood ratio of bit $d_i$, $i \in \{1, \dots, \log_2(M)\}$, of the received symbol in a constellation of size $M$ in the $k$-th subcarrier, $L\left(d_i[k] \mid y[k], \widehat{h}[k]\right)$, as

$$L\left(d_i[k] \mid y[k], \widehat{h}[k]\right) \triangleq \ln \frac{\Pr\left(d_i[k] = 1 \mid y[k], \widehat{h}[k]\right)}{\Pr\left(d_i[k] = 0 \mid y[k], \widehat{h}[k]\right)}. \quad (6.59)$$

Invoking Bayes' theorem, and omitting the subcarrier index for clarity, (6.59) can be expressed as:

$$L\left(d_i \mid y, \widehat{h}\right) = \ln \frac{\sum_{\alpha \in s_i^{(1)}} p\left(y \mid \alpha, \widehat{h}\right) \Pr(\alpha)}{\sum_{\alpha \in s_i^{(0)}} p\left(y \mid \alpha, \widehat{h}\right) \Pr(\alpha)}, \quad (6.60)$$

where $s_i^{(1)}$ and $s_i^{(0)}$ are two set partitions such that $s_i^{(1)}$ comprises the constellation symbols with $d_i = 1$ and $s_i^{(0)}$ comprises those with $d_i = 0$. We also used the fact that the symbols and the channel are uncorrelated, so that $p\left(\alpha \mid \widehat{h}\right) = p(\alpha)$, and that the normalisation constant $p\left(y \mid \widehat{h}\right)$ cancels out since it appears in both numerator and denominator. Due to the interleaving operation of the BICM encoder, the elements of $\mathbf{d}$ are assumed uncorrelated, that is

$$\Pr(\mathbf{d}) = \prod_{i=1}^{M} \Pr(d_i). \quad (6.61)$$

Consequently, (6.60) can be decomposed into extrinsic information and a priori information:

$$L\left(d_i \mid y, \widehat{h}\right) = \underbrace{\ln \frac{\Pr(d_i = 1)}{\Pr(d_i = 0)}}_{L_a(d_i[k])} + \underbrace{\ln \frac{\sum_{\alpha \in s_i^{(1)}} p\left(y \mid \alpha, \widehat{h}\right) \prod_{k \neq i} \Pr(d_k)}{\sum_{\alpha \in s_i^{(0)}} p\left(y \mid \alpha, \widehat{h}\right) \prod_{k \neq i} \Pr(d_k)}}_{\text{extrinsic}}. \quad (6.62)$$

To obtain a simple solution to (6.59), we use the max-log approximation

$$\ln\left(\exp\{\beta_1\} + \dots + \exp\{\beta_n\}\right) \approx \max_{i \in \{1, \dots, n\}} (\beta_i), \quad (6.63)$$

which in the case of QPSK modulation leads to

$$L\left(d_0 \mid y, \widehat{h}\right) \approx L_a(d_0) + \frac{1}{2\sigma^2}\min\left[\delta_{00},\, \delta_{01} - 2\sigma^2 L_a(d_1)\right] - \frac{1}{2\sigma^2}\min\left[\delta_{10},\, \delta_{11} - 2\sigma^2 L_a(d_1)\right], \quad (6.64)$$

$$L\left(d_1 \mid y, \widehat{h}\right) \approx L_a(d_1) + \frac{1}{2\sigma^2}\min\left[\delta_{00},\, \delta_{10} - 2\sigma^2 L_a(d_0)\right] - \frac{1}{2\sigma^2}\min\left[\delta_{01},\, \delta_{11} - 2\sigma^2 L_a(d_0)\right], \quad (6.65)$$

where $L_a(d_i)$, $i = 0, 1$, represents the a priori LLR of bit $d_i$, and $\delta_{mn}$ is given by

$$\delta_{mn} = \left\| y - D_{mn}\widehat{h} \right\|^2, \quad (6.66)$$

where $D_{mn}$ stands for the QPSK symbol $\{d_0 = m, d_1 = n\}$.
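For concreteness, the max-log computation in (6.64)-(6.66) amounts to four squared distances and two pairwise minimisations per subcarrier. The following Python sketch evaluates these expressions for a single subcarrier; the Gray mapping of bit pairs to unit-energy QPSK symbols is an illustrative assumption, not necessarily the exact labeling used in the simulations.

import numpy as np

def qpsk_maxlog_llrs(y, h_hat, sigma2, La=(0.0, 0.0)):
    """Max-log LLRs of bits (d0, d1) for one subcarrier, per (6.64)-(6.66).

    y      : received sample on the subcarrier
    h_hat  : channel estimate for the subcarrier
    sigma2 : noise variance
    La     : a priori LLRs (La(d0), La(d1)) fed back from the decoder
    """
    # Assumed Gray mapping: bit pair (d0, d1) -> unit-energy QPSK symbol D_mn
    symbols = {(0, 0): (1 + 1j), (0, 1): (1 - 1j),
               (1, 0): (-1 + 1j), (1, 1): (-1 - 1j)}
    symbols = {b: s / np.sqrt(2) for b, s in symbols.items()}

    # delta_mn = |y - D_mn * h_hat|^2, equation (6.66)
    d = {b: abs(y - s * h_hat) ** 2 for b, s in symbols.items()}

    La0, La1 = La
    # Equation (6.64): LLR of bit d0
    L0 = (La0
          + min(d[(0, 0)], d[(0, 1)] - 2 * sigma2 * La1) / (2 * sigma2)
          - min(d[(1, 0)], d[(1, 1)] - 2 * sigma2 * La1) / (2 * sigma2))
    # Equation (6.65): LLR of bit d1
    L1 = (La1
          + min(d[(0, 0)], d[(1, 0)] - 2 * sigma2 * La0) / (2 * sigma2)
          - min(d[(0, 1)], d[(1, 1)] - 2 * sigma2 * La0) / (2 * sigma2))
    return L0, L1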

Chapter 7

Channel Estimation in OFDM Systems using Trans Dimensional MCMC

“Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin.”

John von Neumann

7.1 Introduction

In this chapter we consider the problem of channel estimation for OFDM systems, where the number of channel taps and Power Delay Profile (PDP) value are unknown.

First, we consider the problem of Channel Impulse Response (CIR) estimation with an unknown number of channel taps and a known PDP value. We show that in this case it is possible to obtain analytic expressions for both the Bayesian Model Averaging (BMA) and Bayesian Model Order Selection (BMOS) estimators. Simulation results show that both the BMA and BMOS estimators provide close to optimal BER performance.

Next, we tackle the more complicated problem of CIR estimation with an unknown number of channel taps and an unknown PDP value. Using a Bayesian approach, we construct a nested model in which we jointly estimate the coefficients of the channel taps, the channel order and the decay rate of the PDP. In order to sample from the resulting posterior distribution we develop three novel Trans Dimensional Markov chain Monte Carlo (TDMCMC) algorithms, and analyze and compare their performance. Using the TDMCMC algorithm that produces the best performance in terms of exploration of the model subspaces, we assess its BER performance. It is shown that the proposed algorithms can achieve results very close to the case where both the channel length and the PDP are known. The main contributions presented in this chapter are as follows:

1. CIR estimation with known PDP value and unknown channel length:

• We develop two algorithms for CIR estimation, based on the BMOS and BMA methodologies. We show that closed-form expressions can be obtained analytically.
• We compare the performance of the BMOS and BMA approaches.

2. CIR estimation with unknown PDP value and unknown channel length:

• Algorithm 1: a Birth-Death (BD)-TDMCMC algorithm which uses a simple transition kernel to obtain samples from the posterior distribution.
• Algorithm 2: based on Algorithm 1, we improve the mixing rates of the Markov chain over the between-model subspaces by developing an adaptive learning algorithm. The proposed scheme is based on a Stochastic Approximation (SA) method for stochastic optimisation, first developed by Robbins and Monro [61].
• Algorithm 3: the intention of the third algorithm is twofold: to automate moves between different subspaces of the resulting posterior distribution, and to provide an efficient mixing rate for the between-model moves. This algorithm is based on the Conditional Path Sampling (CPS) methodology introduced in [217]. Although each iteration of this algorithm is more computationally involved than in the other two algorithms, this is justified by obtaining much higher acceptance probabilities for between-model moves, so that fewer iterations are required.
• We assess several aspects of the model in terms of sensitivity to different prior choices for various parameters.
• We perform a detailed analysis of the performance of each of the TDMCMC algorithms. This allows us to contrast the computational effort required under each approach with the estimation performance.
• Several lower bounds on the channel estimation MSE are presented.
• A detailed complexity analysis of the three algorithms is presented.

7.2 Background

OFDM systems often employ coherent detection that requires accurate information about the CIR [186]. This can be achieved in slow fading channels by using pilot symbols at the beginning of each frame so that the channel can be estimated [218]. Based on this channel estimation, the data symbols can be detected during the rest of the frame. When no a priori information about the statistics of the channel taps is known, an ML approach is optimal amongst all unbiased estimators [219]. A wireless channel is typically modeled as an FIR filter with every tap distributed as a complex Gaussian [220]. When the channel taps' second order statistics are known, the MMSE estimate can achieve significant gains compared to the ML estimate [114]. In defining these channels, the number of taps and the PDP need to be a priori known in order for the channel to be estimated using Bayesian inference [182]. These parameters, however, are usually unknown a priori and vary as the mobile station changes its location. Existing approaches for estimating the number of channel taps include the Most Significant Taps (MST) idea developed in [221] to estimate the channel order. In [222], an iterative algorithm that re-estimates the channel order using the generalized Akaike information criterion [223] and cancels its contribution is presented. In [224], after the initial channel estimation phase, an auxiliary function is used to distinguish between real taps and the noise contribution in the estimated channel. In [225], both BMA and BMOS algorithms were proposed for channel estimation based on a finite set of possible channel models. In this chapter we explore a Bayesian model which allows for joint estimation of the channel length, the decay rate of the PDP and the channel coefficients. To achieve this we develop novel TDMCMC algorithms to sample from the resulting posterior distribution in our Bayesian approach and perform numerical integration. TDMCMC algorithms have become popular in many areas including statistics, machine learning and signal processing since their introduction by [54], [226], [227]. Since then they have been successfully developed further in the signal processing literature by [228], [229]. We extend aspects of the methodology in these papers and apply our novel algorithms to find solutions to the estimation and sampling problems posed.

7.3 System Description

We define a sequence of information bits mapped to complex-valued symbols of an $M$-ary modulation alphabet set $\mathcal{A} = \left\{a_1, \dots, a_{|\mathcal{A}|}\right\}$. The data symbols are multiplexed onto $K$ OFDM subcarriers. After modulation of the OFDM symbols via a $K$-point IDFT, the signal is transmitted over a frequency selective, time varying channel. We assume that the channel state is fixed (static) over the duration of one frame, but can change significantly between consecutive frames. The use of a proper CP eliminates IBI between consecutive OFDM symbols. It is assumed that the length of the CP is longer than the maximum path delay and also that ICI caused by Doppler offset is fully compensated for.

At the receiver, after CP removal and DFT, the signal at time index n is given by [186]

$$y[n] = D[n] W_L h[n] + z[n]. \quad (7.1)$$

Here $y[n]$ is a $K \times 1$ observation vector, $D[n]$ is a $K \times K$ diagonal matrix whose diagonal entries contain the data symbols, and $z[n]$ is a $K \times 1$ vector of i.i.d. complex zero-mean Gaussian noise with covariance matrix $R = I\sigma_Z^2$, assumed to be uncorrelated with the channel. The vector $h[n] = [h_1[n], h_2[n], \dots, h_L[n]]^T$ is the discrete CIR at time instant $n$, of length $L$, with components $h_l[n]$, $l = 1, \dots, L$. It is assumed that the channel length $L$ is upper bounded by a predefined maximal possible channel length $L_{max}$. The matrix $W_L$ is a $K \times L$ partial DFT matrix, defined as $W_L \triangleq \frac{1}{\sqrt{K}}\left[e^{-j2\pi kl/K}\right]_{k=0,\dots,K-1;\ l=0,\dots,L-1}$.

7.3.1 Channel Model

We consider a fading multipath channel model whose impulse response is represented as

$$h[n] = \sum_{l=1}^{L} h_l[n]\, \delta(n - l), \quad (7.2)$$

where $L$ is the number of paths and $\delta(\cdot)$ is the Kronecker delta function. It is assumed that $\{h_l[n];\ l = 1, \dots, L\}$ are mutually independent, Wide Sense Stationary Uncorrelated Scattering (WSSUS) circular complex zero-mean Gaussian random processes with covariance given by

$$E\left\{h_k[n]\, h_j^H[n]\right\} = \begin{cases} \sigma_{h_k}^2 = \dfrac{\exp\{-\beta k\}}{\sum_{l=1}^{L} \exp\{-\beta l\}}, & 1 \le k = j \le L, \\ 0, & \text{otherwise.} \end{cases} \quad (7.3)$$

The channel's covariance matrix is $\Sigma_h = \mathrm{diag}\left\{\sigma_{h_l}^2\right\}$, $1 \le l \le L$. The parameter $\beta$ controls the decay rate of each tap's power, and depends on the physical environment in which the system operates. When $\beta = 0$ the tap powers are uniform. As $\beta$ increases, the decay rate increases, and as a result high-order channel taps become less significant as their power decreases. The parameters $L$ and $\beta$ are assumed to be statistically independent. The channel model is depicted in Figure 7.1. Hereafter, for notational simplicity, the block time index, $n$, is omitted.

Figure 7.1: Channel model with L taps and PDP β
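As a concrete illustration of the model in (7.1)-(7.3), the following Python sketch draws one channel realisation with an exponentially decaying PDP and forms a single OFDM observation. The numerical values of K, L, beta and the noise variance are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
K, L, beta, sigma2_z = 64, 8, 0.3, 0.01      # assumed illustrative values

# Normalised per-tap powers, equation (7.3)
powers = np.exp(-beta * np.arange(1, L + 1))
powers /= powers.sum()

# Circular complex zero-mean Gaussian taps with covariance diag(powers)
h = np.sqrt(powers / 2) * (rng.standard_normal(L) + 1j * rng.standard_normal(L))

# K x L partial DFT matrix W_L
k, l = np.meshgrid(np.arange(K), np.arange(L), indexing="ij")
W_L = np.exp(-2j * np.pi * k * l / K) / np.sqrt(K)

# Diagonal matrix of unit-energy QPSK data symbols
d = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2), size=K)
D = np.diag(d)

# Observation model, equation (7.1): y = D W_L h + z
z = np.sqrt(sigma2_z / 2) * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
y = D @ W_L @ h + z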

7.4 Channel Estimation with Unknown Channel Length

In this Section we derive the MMSE CIR estimator under this model using BMA and BMOS criteria for the case that:

1. the CIR length, L, is unknown a priori.

2. the PDP value, β, is known a priori.

7.4.1 Channel Estimation using Bayesian Model Averaging

Here we develop the MMSE estimator for the CIR using the BMA approach. The quantity of interest is the Bayesian MMSE channel estimate, which can be written as [182]

$$\hat{h} = E\{h \mid y\}. \quad (7.4)$$

First, the posterior distribution of $h$ given the data $y$ is

$$p(h \mid y) = \sum_{l=1}^{L_{max}} p(h \mid y, l)\, \Pr(l \mid y). \quad (7.5)$$

The posterior mean of $h$ can be written as

$$\hat{h} = E\{h \mid y\} = \int h\, p(h \mid y)\, dh = \int h \left[\sum_{l=1}^{L_{max}} p(h \mid y, l)\, \Pr(l \mid y)\right] dh = \sum_{l=1}^{L_{max}} \Pr(l \mid y)\, E\{h \mid y, l\}, \quad (7.6)$$

so that $\hat{h}$ is a weighted average of the posterior means under each of the models considered, $E\{h \mid y, l\}$, weighted by their posterior model probabilities $\Pr(l \mid y)$. Given the prior for $l$, $\Pr(l)$, the posterior model probability for $l$ can be expressed as

$$\Pr(l \mid y) = \frac{p(y \mid l)\Pr(l)}{\sum_{l'=1}^{L_{max}} p(y \mid l')\Pr(l')}, \quad (7.7)$$

where the marginal likelihood $p(y \mid l)$ can be written as

$$p(y \mid l) = \int p(y \mid h_l, l)\, p(h_l \mid l)\, dh_l = \mathcal{CN}\Big(0,\ \underbrace{(DW_l)\Sigma_h(DW_l)^H + \sigma_w^2 I}_{\Sigma_{y|l}}\Big). \quad (7.8)$$

We now evaluate $E\{h \mid y, l\}$. Conditional on $l$, $y$ and $h$ are zero-mean jointly Gaussian, and the MMSE estimator for $h$ is expressed as [182]

$$E\{h \mid y, l\} = E\left\{h y^H\right\} E^{-1}\left\{y y^H\right\} y = \Sigma_h (DW_l)^H \Big(\underbrace{(DW_l)\Sigma_h(DW_l)^H + \sigma_w^2 I}_{\Sigma_{y|l}}\Big)^{-1} y. \quad (7.9)$$

Substituting (7.7), (7.8) and (7.9) into (7.6), the BMA estimator of $h$ can be expressed as

$$\hat{h} = \sum_{l=1}^{L_{max}} \frac{\left|\Sigma_{y|l}\right|^{-1/2} \exp\left\{-\frac{1}{2} y^H \Sigma_{y|l}^{-1} y\right\} E\{h \mid y, l\}}{\sum_{l'=1}^{L_{max}} \left|\Sigma_{y|l'}\right|^{-1/2} \exp\left\{-\frac{1}{2} y^H \Sigma_{y|l'}^{-1} y\right\}}. \quad (7.10)$$

7.4.2 Channel Estimation using Bayesian Model Order Selection

We now develop the CIR estimation without a priori knowledge of the CIR length $L$ using a BMOS approach. Using this approach we first find the most probable model using the MAP estimate of the model order, $\hat{l}_{MAP}$, and then condition on this estimate to obtain the MMSE estimate of $h$. This procedure is composed of the following two steps:

1. MAP estimate of the CIR length:

$$\hat{l}_{MAP} = \arg\max_l \Pr(l \mid y) = \arg\max_l p(y \mid l)\Pr(l), \quad (7.11)$$

where $p(y \mid l)$ is obtained using (7.8).

2. MMSE estimate of the CIR:

$$\hat{h} = E\left\{h y^H\right\} E^{-1}\left\{y y^H\right\} y = \Sigma_{h_{\hat{l}_{MAP}}} \left(DW_{\hat{l}_{MAP}}\right)^H \Sigma_{y|\hat{l}_{MAP}}^{-1}\, y, \quad (7.12)$$

where $\Sigma_{h_{\hat{l}_{MAP}}} = \mathrm{diag}\left\{\sigma_{h_l}^2\right\}$, $1 \le l \le \hat{l}_{MAP}$.
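Both estimators reduce to a set of Gaussian marginal-likelihood evaluations over the candidate orders. The following Python sketch computes the posterior model probabilities (7.7), the BMA estimate of (7.10) and the BMOS estimate of (7.11)-(7.12), using the circular complex Gaussian form of the marginal likelihood; the truncated Poisson prior follows the setup of Section 7.4.4, and all function and variable names are illustrative.

import numpy as np
from scipy.stats import poisson

def log_evidence(y, D, W, sigma2_h, sigma2_w, l):
    """log p(y | l) under the zero-mean complex Gaussian model (7.8)."""
    A = D @ W[:, :l]
    Sigma = A @ np.diag(sigma2_h[:l]) @ A.conj().T + sigma2_w * np.eye(len(y))
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.real(y.conj() @ np.linalg.solve(Sigma, y))
    return -logdet - quad                 # additive constants omitted

def bma_bmos(y, D, W, sigma2_h, sigma2_w, L_max, lam=8.0):
    log_post, h_cond = np.empty(L_max), []
    for l in range(1, L_max + 1):
        # Unnormalised log Pr(l | y): evidence times the Poisson prior, (7.7)
        log_post[l - 1] = log_evidence(y, D, W, sigma2_h, sigma2_w, l) \
                          + poisson.logpmf(l, lam)
        # Conditional MMSE estimate (7.9), zero-padded to a common length
        A = D @ W[:, :l]
        Sig_h = np.diag(sigma2_h[:l])
        Sigma = A @ Sig_h @ A.conj().T + sigma2_w * np.eye(len(y))
        h_cond.append(np.pad(Sig_h @ A.conj().T @ np.linalg.solve(Sigma, y),
                             (0, L_max - l)))
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                                          # Pr(l | y)
    h_bma = sum(wi * hi for wi, hi in zip(w, h_cond))     # BMA, (7.10)
    l_map = int(np.argmax(w)) + 1                         # MAP order, (7.11)
    return h_bma, h_cond[l_map - 1], l_map                # BMOS uses (7.12)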

7.4.3 Complexity Issues

The complexity of the BMA-MMSE estimator in (7.10) and of the BMOS-MMSE estimator in (7.12) is actually quite low, as most of the terms can be precalculated. For example, when calculating (7.10) a significant amount of precalculation can be performed: excluding the final multiplication by the observation vector $y$, for each model $l$ all the remaining matrix multiplications can be performed in advance and stored in the system's memory.

7.4.4 Simulation Results

The OFDM system setup is $K = 64$ subcarriers employing QPSK symbols, with $L_{max} = K/4$ (a typical CP length in OFDM systems). The channel is modeled as block Rayleigh fading and the channel length $L$ follows a truncated Poisson distribution $\Pr(l) = \frac{\lambda^l \exp\{-\lambda\}}{C\, l!}$, where $\lambda = 8$ and $C = \sum_{l=1}^{L_{max}} \frac{\lambda^l \exp\{-\lambda\}}{l!}$.

The realization of the channel length $L$ is 8 and $\sigma_{h_l}^2 = \frac{1}{L}$. One OFDM pilot symbol was used for every frame of length 128, and 5000 frames were transmitted for each SNR. The following estimators have been compared:

• An estimator which has knowledge of $l$ (labeled Known L), and serves as the system's lower bound.

• An estimator based on BMA, given by equation (7.10), and an estimator based on BMOS, given by equation (7.12). BMA and BMOS show comparable results (labeled Est L).

• The last estimator assumes a saturated model, i.e. $l = L_{max}$ (labeled $L_{max}$), and does not employ CIR length estimation. As a lower bound for the BER, a detector with known CIR (labeled CSI) is provided.

The channel estimation MSE and BER results for the different estimators are depicted in Figures 7.2 and 7.3, respectively. The results show that the proposed algorithms operate very close to the detector with known CIR length. In Figure 7.4, the histogram of the CIR length estimate (equation (7.11)) for SNR = 10 dB is depicted. This result shows that the MAP estimate is correct over 80% of the time, demonstrating the robustness of the MAP estimate in (7.11). We also studied the performance of this estimator for different numbers of subcarriers ($K = 32, 64, 128$). The average channel length estimate for different SNR values is depicted in Figure 7.5. These results demonstrate the robustness of the MAP estimator, even at low SNR values. It is also evident that the larger $K$ is, the better the MAP estimate becomes. For high SNR values, regardless of the value of $K$, the MAP estimator converges to the actual channel order.

Figure 7.2: Channel estimation MSE of BMA and BMOS estimators

We also studied two scenarios of model misspecification, in which the prior distribution for the model order was a truncated Poisson. In scenario I we specified the true model order significantly below the prior mean, and vice versa in scenario II. The outcome was that for small $K$ ($K < 16$), a small difference in performance was observed between BMOS and BMA at low SNR (SNR < 10 dB).

Figure 7.3: BER performance of BMA and BMOS estimators compared to several lower bounds

Figure 7.4: CIR length estimation, L = 8, SNR = 10 dB

For large $K$ no difference was evident at any SNR. These two effects can be explained by noting that at high SNR the prior influence is very weak, while at low SNR with small $K$ there is sufficient prior influence to induce a performance gain of BMA over BMOS of around 0.5 dB.

Figure 7.5: MAP channel order estimation, K = {32, 64, 128}, L = 8

7.5 Channel Estimation with Unknown Channel Length and Unknown PDP

In the rest of this chapter we solve the problem of channel estimation for the case that:

1. the CIR length, L, is unknown a priori.

2. the PDP value, β, is unknown a priori.

As we will show, adding the PDP as an unknown parameter complicates the problem and the approach demonstrated in the previous Sections can no longer be used.

We shall concentrate on two Bayesian estimators, namely the MMSE and the MAP estimators. Here, for notational convenience we rewrite the OFDM system model in (7.1) as

$$y = DW_L h^T + z, \quad (7.13)$$

where we define $h$ as a row vector, and explicitly label the elements of $h$ under model $L$ as $h_{1:L} = [h_1, \dots, h_L]$. The matrix $D$ is known at the receiver for pilot symbols. In the case where $L$ and $\beta$ are a priori known at the receiver, the MAP, MMSE and LMMSE estimates of the CIR coincide and can be written as

$$\hat{h} = E\{h \mid y, L, \beta\} = E\left\{h^T y^H\right\} E^{-1}\left\{y y^H\right\} y = \left((DW_L)^H (DW_L) + \sigma_z^2 \Sigma_h^{-1}\right)^{-1} (DW_L)^H y. \quad (7.14)$$
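When $L$ and $\beta$ are known, (7.14) is a single regularised least-squares solve. A minimal Python sketch, with $A = DW_L$ and the prior tap powers taken from (7.3); the variable names are illustrative:

import numpy as np

def mmse_known_order(y, A, sigma2_h, sigma2_z):
    """MMSE/MAP CIR estimate for known L and beta, equation (7.14).

    A        : the K x L matrix D W_L
    sigma2_h : length-L vector of prior tap powers from (7.3)
    """
    reg = sigma2_z * np.diag(1.0 / np.asarray(sigma2_h))  # sigma_z^2 Sigma_h^{-1}
    return np.linalg.solve(A.conj().T @ A + reg, A.conj().T @ y)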

Here we build on the model developed in Section 7.4. In that model we conditioned on assumed knowledge of the PDP decay rate $\beta$, and modeled the channel length and channel coefficients as random variables in a Bayesian framework. The estimation of the channel length and channel coefficients was performed by considering BMOS estimators for the MMSE and MAP. The procedure in Section 7.4 allows an explicit solution to be found for the posterior model probabilities $P(L \mid y)$ (see equations (7.7) and (7.8)). This was achievable since the required marginalising integrations were analytic, which in turn simplified the model selection and consequent channel coefficient estimation. We now model the PDP, channel length and channel coefficients as random variables. In this setting, the addition of the estimation of the PDP decay rate $\beta$ complicates the previous solutions. It is no longer possible to take the approach presented in Section 7.4. Instead we must perform joint inference from the posterior of the channel length, PDP and channel coefficients. That is, we have the additional complexity that we model the following model parameters as random variables: the channel model order $L$, the channel coefficients $[h_1, \dots, h_L]$, and the channel power decay rate $\beta$.

In general, we can consider performing either BMOS or BMA as in Section 7.4 [225]. However, in this Section we will focus on BMOS to estimate the desired properties of the channel. We are interested in a BMOS analysis to obtain the MMSE channel estimator conditional on the MAP estimate of the channel length, $L_{MAP}$, which is given by

$$\hat{h}_{MMSE} = E\{h \mid y, L_{MAP}\} = \iint h\, p(h, \beta \mid y, L_{MAP})\, d\beta\, dh, \quad (7.15)$$

and the MAP channel estimator is given by

$$\hat{h}_{MAP} = \arg\max_h \left\{p(h \mid y, L_{MAP})\right\} = \arg\max_h \left\{\int p(h, \beta \mid y, L_{MAP})\, d\beta\right\}, \quad (7.16)$$

where

$$L_{MAP} = \arg\max_L \Pr(L \mid y). \quad (7.17)$$

In order to estimate (7.15) and (7.16) we need an analytic expression for $p(h \mid y)$. Obtaining an expression for $h$ involves marginalizing the joint posterior,

$$p(h \mid y) = \sum_{L=1}^{L_{max}} \int p(h, L, \beta \mid y)\, d\beta. \quad (7.18)$$

In our system model (7.13), typically it will not be possible to obtain an analytical expression for (7.18). Hence, we shall sample from the joint posterior via TDMCMC, and estimate this integral numerically.

Given $T$ samples $\left\{h^{(t)}, L^{(t)}, \beta^{(t)}\right\}_{t=1:T}$ from the distribution $p(h, L, \beta \mid y)$ we can then estimate the quantities $\hat{h}_{MMSE}$, $\hat{h}_{MAP}$ and $\hat{L}_{MAP}$ as follows:

b b b E LMAP = arg max P r (L y) , (7.19) L | b where we define P rE (.) is the empirical histogram estimate of the density.

$$\hat{h}_{MMSE} \approx E\left\{h \mid y, \hat{L}_{MAP}\right\} = \frac{1}{R} \sum_{t=1}^{T} h^{(t)} I\left(L^{(t)} = \hat{L}_{MAP}\right), \quad (7.20)$$

where $R$ is the total number of samples corresponding to model $\hat{L}_{MAP}$.

$$\hat{h}_{MAP} = \arg\max_h p^E\left(h \mid y, \hat{L}_{MAP}\right), \quad (7.21)$$

where $p^E(\cdot)$ in this case is formed from the samples corresponding to model $\hat{L}_{MAP}$.
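In practice, (7.19) and (7.20) amount to a histogram over the visited model orders followed by conditional averaging of the stored samples. A minimal Python sketch, assuming the chain output is stored as a list of variable-length tap vectors and a parallel list of model orders:

import numpy as np

def point_estimates(h_samples, L_samples, L_max, burn_in=0):
    """h_MMSE and L_MAP from TDMCMC output, equations (7.19)-(7.20).

    h_samples : list of 1-D complex arrays, one per iteration (length L^(t))
    L_samples : list of ints, the model order visited at each iteration
    """
    L_arr = np.asarray(L_samples[burn_in:])
    h_post = h_samples[burn_in:]

    # Equation (7.19): empirical histogram estimate of Pr(L | y)
    counts = np.bincount(L_arr, minlength=L_max + 1)[1:]
    L_map = int(np.argmax(counts)) + 1

    # Equation (7.20): average over the R samples that visited model L_map
    in_model = [h for h, L in zip(h_post, L_arr) if L == L_map]
    h_mmse = np.mean(np.stack(in_model), axis=0)
    return h_mmse, L_map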

To proceed, we define the joint posterior as follows:

$$p(h, L, \beta \mid y) \propto p(y \mid h, L, \beta)\, p(h \mid L, \beta)\, \Pr(L)\, p(\beta), \quad (7.22)$$

where the priors for our model are specified as follows:

• $\Pr(L) = Poi(L; \lambda)$, with $L \in \{1, \dots, L_{max}\}$,

• $p(\beta) = U[0, \beta_{max}]$,

• $p(h \mid L, \beta) \sim \mathcal{CN}(0, \Sigma_h)$,

where $Poi(L; \lambda)$ represents a Poisson distribution with mean $\lambda$, and $U[\alpha, \beta]$ represents a uniform distribution on the support $[\alpha, \beta]$.

Note that the approaches we develop are general and they allow for any desired prior structure. The choices of the priors made above are chosen to be uninformative.
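All of the samplers developed below require only the posterior (7.22) up to a normalising constant. The following Python sketch of the unnormalised log posterior combines the likelihood implied by the observation model (7.13), the tap-power prior (7.3), the Poisson prior on L and the uniform prior on beta; the hyperparameter values lam and beta_max are illustrative assumptions.

import numpy as np
from scipy.stats import poisson

def log_posterior(h, L, beta, y, D, W, sigma2_w, lam=8.0, beta_max=5.0):
    """Unnormalised log p(h, L, beta | y), equation (7.22)."""
    if not (0.0 <= beta <= beta_max) or not (1 <= L <= W.shape[1]):
        return -np.inf                       # outside the prior support
    # Prior tap powers, equation (7.3)
    p = np.exp(-beta * np.arange(1, L + 1))
    p /= p.sum()
    # log p(y | h, L, beta): complex Gaussian likelihood of the residual
    r = y - D @ W[:, :L] @ h
    loglik = -len(y) * np.log(np.pi * sigma2_w) - np.sum(np.abs(r) ** 2) / sigma2_w
    # log p(h | L, beta): CN(0, Sigma_h) prior on the taps
    logprior_h = np.sum(-np.log(np.pi * p) - np.abs(h) ** 2 / p)
    # log Pr(L): Poisson prior (truncation constant omitted, common to all L);
    # the uniform prior on beta is constant on its support
    return loglik + logprior_h + poisson.logpmf(L, lam)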

In the following Sections we shall design novel TDMCMC algorithms to obtain samples from the posterior (7.22).

7.6 Trans Dimensional Markov chain Monte Carlo

We begin this Section by defining the following terminology which will be used throughout:

• proposal: shall be used to refer to any generic Markov chain transition kernel, which moves the chain from one state to another probabilistically.

• within-model moves: shall be used to define generic moves which propose to change the state of the Markov chain within a given model subspace, see [54]. We work with nested models, and this corresponds to changes to the parameters $h_{1:L} \mid L$ and $\beta \mid L$ with the model $L$ remaining fixed, i.e. the model subspace is unchanged.

• between-model moves: shall be used to define generic moves which propose to change the state of the Markov chain to move it between different model subspaces, see [54]. This will correspond to increasing or decreasing the channel order, $L$, and proposing the corresponding model parameters in the new model subspace, given by $h_{1:L}$ and $\beta$.

• MH within Gibbs: shall be used to refer to the MH within Gibbs Markov chain Monte Carlo algorithm, see [45].

• birth and death: shall be used to refer to a particular pair of between-model moves, see [54]. Under a birth move the model subspace Markov chain state index ($L$) is incremented by one, i.e. $L + 1$; the first $L$ parameters $h_{1:L}$ remain unchanged and the $(L+1)$-th parameter is sampled from a proposal. The corresponding death move decrements the model subspace Markov chain state index by one, i.e. $L - 1$, discarding the $L$-th parameter $h_L$.

• mixing: shall be used generically to refer to the rate at which the generated Markov chain reaches ergodicity, see [45].

The intention is to estimate $\hat{h}_{MMSE} = \left[\hat{h}_{1,MMSE}, \dots, \hat{h}_{\hat{L}_{MAP},MMSE}\right] \mid \hat{L}_{MAP}$ and $\hat{h}_{MAP} = \left[\hat{h}_{1,MAP}, \dots, \hat{h}_{\hat{L}_{MAP},MAP}\right] \mid \hat{L}_{MAP}$, where we sample realisations of the model parameters $(h_{1:L}, \beta, L)$ jointly from the posterior distribution. To estimate these quantities we must resort to numerical integration, and the approach we consider here is based on MCMC algorithms [45]. In particular we construct the problem such that it requires the use of TDMCMC, see [54], [230], [231] and [217].

We start by specifying the posterior support of each of the parameters of interest, $L$, $\beta$ and $h$. Here $L \in \{1, 2, \dots, L_{max}\}$ can be considered as a model index specifying how many channel taps are present in the channel model vector $h = h_{1:L} = [h_1, \dots, h_L] \in \mathbb{C}^L$. The parameter $\beta \in \mathbb{R}^+$ models the decay rate of the PDP. The joint posterior forms a nested model defined on a disjoint union of subspaces, $\Theta = \biguplus_L \{L\} \times \mathbb{C}^L \times \mathbb{R}^+$.

Having specified the model we now develop novel TDMCMC simulation algorithms to sample from the target posterior distribution (7.22). The first is the basic BD MH within Gibbs sampler, called BD-TDMCMC. The second approach develops the ideas of Contour Monte Carlo, or SA adjusted TDMCMC, for model selection and estimation as presented in [232], denoted SA-TDMCMC. The third is a novel TDMCMC algorithm which automates the construction and sampling of an approximation to the optimal between-model transition distributions; this is based on the CPS MH within Gibbs approach of [217] and is termed CPS-TDMCMC. The BD-TDMCMC algorithm forms a comparison benchmark for the other two approaches. These more sophisticated approaches aim to improve the mixing properties of the Markov chain between model subspaces by increasing the probability of acceptance for between-model transitions in our trans-dimensional Markov chain, when compared to the basic BD sampler.

The SA-TDMCMC utilises a simple between-model proposal mechanism based on the BD-TDMCMC proposal that we develop. The simple BD proposal mechanism is not optimised for the target posterior distribution from which we aim to obtain samples. As such, it will produce proposed transitions to new models which in the majority of cases are unlikely to be accepted under a standard TDMCMC acceptance probability. This results in slow between-model mixing of the Markov chain.

The SA-TDMCMC algorithm attempts to offset the simplistic choice of between-model proposal distribution. It achieves this by increasing the chance of acceptance of a between-model move, adaptively adjusting the acceptance probability of such a move through an SA adjustment. This SA adjustment is based on on-line estimation of the Bayes Factors which weight the acceptance probability, with the intention of improving the between-model mixing rate.

The second approach to improving the acceptance probability for between-model moves is to approximate the optimal between-model proposal distribution. This is an optimal proposal in the sense that it maximises the TDMCMC acceptance probability for a move from one model to another. We shall achieve this through use of the approach proposed in [217] which develops a CPS approximation of the optimal proposal distribution.

We note here that we could combine both these approaches, the SA stage and the approximate optimal between-model proposal under a CPS approximation. This would combine the best of both approaches and should be the preferred choice in terms of between-model mixing of the Markov chain. However, our intention is to compare which of these aspects of TDMCMC algorithms will lead to the largest improvement in mixing of the Markov chain between different models, relative to our basic BD-TDMCMC algorithm.

We introduce generic notation for the within-model (conditional on $L$) proposal mechanism to move from state $(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)})$ at iteration $t-1$ of the Markov chain to state $(h^*_{1:L^{(t-1)}}, \beta^*)$ at iteration $t$, denoted by $T\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}) \rightarrow (h^*_{1:L^{(t-1)}}, \beta^*)\right)$. Such a move will be accepted according to a standard MH acceptance probability. The generic notation for between-model moves, going from state $(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})$ at iteration $t-1$ of the Markov chain to state $(h^*_{1:L^*}, \beta^*, L^*)$ at iteration $t$, will be denoted by $Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$.

In formulating our TDMCMC algorithms we utilise a common within-model move mechanism for all three algorithms. That is, they differ only in how they propose to move between models, i.e. in the specification of $Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$.

Before providing details of how we design these Markov chain proposals, we present the general algorithm used in all three cases; it is given in Algorithm 16.

Note that the acceptance probabilities in (7.23) and (7.24) follow from the requirement that we construct a reversible Markov chain satisfying detailed balance. The details of this can be found in [54], [230].

We now present our choice of $T\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}) \rightarrow (h^*_{1:L^{(t-1)}}, \beta^*)\right)$ for the within-model moves, given by a MH within Gibbs sampler. This is just one of many different approaches which could be utilised, see [45]. We choose a MH within Gibbs sampling framework since we are able to obtain expressions for the full conditional posterior distributions of each of our parameters. We then sample from each full conditional posterior distribution via a MH procedure. For further details of the convergence properties of this algorithm in terms of mixing rate and optimal acceptance probabilities, see for example [233] and [234].

Algorithm 16 Generic TDMCMC Algorithm to obtain samples from (7.22)

Initialize Markov chain state:
1: Initialise parameters randomly or deterministically, e.g. $L^{(0)} = 3$, $h^{(0)}_{1:3} = [0.1, 0.1, 0.1]$, $\beta^{(0)} = 1$
2: Repeat for $t = 1$ to $T$:

Within-Model Moves (conditional on model $L^{(t-1)}$):
3: Sample new proposed states of the Markov chain, $(h^*_{1:L^{(t-1)}}, \beta^*) \sim T\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}) \rightarrow (h^*_{1:L^{(t-1)}}, \beta^*)\right)$
4: Calculate the acceptance probability of the new proposed state,

$$\alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}), (h^*_{1:L^{(t-1)}}, \beta^*)\right) = \min\left(1,\ \frac{p(h^*_{1:L^{(t-1)}}, \beta^*, L^{(t-1)} \mid y)\; T\left((h^*_{1:L^{(t-1)}}, \beta^*) \rightarrow (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)})\right)}{p(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)} \mid y)\; T\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}) \rightarrow (h^*_{1:L^{(t-1)}}, \beta^*)\right)}\right) \quad (7.23)$$

5: Sample $u \sim U[0, 1]$
6: if $u < \alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}), (h^*_{1:L^{(t-1)}}, \beta^*)\right)$ then
7: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^*_{1:L^{(t-1)}}, \beta^*, L^{(t-1)})$
8: else
9: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})$
10: end if

Between-Model Moves:
11: Sample $(h^*_{1:L^*}, \beta^*, L^*) \sim Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$
12: Calculate the acceptance probability of the new proposed state according to

$$\alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}), (h^*_{1:L^*}, \beta^*, L^*)\right) = \min\left(1,\ \frac{p(h^*_{1:L^*}, \beta^*, L^* \mid y)\; Q\left((h^*_{1:L^*}, \beta^*, L^*) \rightarrow (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})\right)}{p(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)} \mid y)\; Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)}\right) \quad (7.24)$$

13: Sample $u \sim U[0, 1]$
14: if $u < \alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}), (h^*_{1:L^*}, \beta^*, L^*)\right)$ then
15: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^*_{1:L^*}, \beta^*, L^*)$
16: else
17: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})$
18: end if

7.6.1 Specification of Within-Model Moves: Metropolis-Hastings within Gibbs

In this Section we utilise a Gibbs sampling framework, and sample from each of the full conditionals via a MH stage. This corresponds to decomposing $T\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}) \rightarrow (h^*_{1:L^{(t-1)}}, \beta^*)\right)$ as follows:

$$T\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}) \rightarrow (h^*_{1:L^{(t-1)}}, \beta^*)\right) = \prod_{i=1}^{L^{(t-1)}} T\left(h^{(t-1)}_i \rightarrow h^*_i\right)\, T\left(\beta^{(t-1)} \rightarrow \beta^*\right). \quad (7.25)$$

7.6.1.1 Specification of Transition Kernel $T\left(h^{(t-1)}_i \rightarrow h^*_i\right)$

1. Sample $h^*_i \sim \mathcal{CN}\left(h^{(t-1)}_i, \sigma_1(i)\right)$, where $\sigma_1(i)$ is set deterministically, though it could be obtained via a process of pretuning, such as in an Adaptive MCMC algorithm, see [235].

2. The proposed new state is then accepted according to an MH rejection stage. Accept $h^*_i$ with probability $a\left(h^{(t-1)}_i, h^*_i\right)$ given by

$$a\left(h^{(t-1)}_i, h^*_i\right) = \min\left(1,\ \frac{p\left(h^*_i \mid h^{(t-1)}_{/i}, L^{(t-1)}, \beta^{(t-1)}, y\right) \mathcal{CN}\left(h^{(t-1)}_i; h^*_i, \sigma_1(i)\right)}{p\left(h^{(t-1)}_i \mid h^{(t-1)}_{/i}, L^{(t-1)}, \beta^{(t-1)}, y\right) \mathcal{CN}\left(h^*_i; h^{(t-1)}_i, \sigma_1(i)\right)}\right) \quad (7.26)$$

with

$$p\left(h_i \mid h_{/i}, L, \beta, y\right) \propto p\left(y \mid h_i, h_{/i}, L, \beta\right) p\left(h_i \mid \beta, L\right). \quad (7.27)$$

7.6.1.2 Specification of Transition Kernel $T\left(\beta^{(t-1)} \rightarrow \beta^*\right)$

1. Sample $\beta^* \sim N\left(\beta^*; \beta^{(t-1)}, \sigma_\beta\right) I(\beta^* > 0)$, where $\sigma_\beta$ is either set deterministically or obtained via a process of pretuning, such as in Adaptive MCMC algorithms.

2. Accept $\beta^*$ with probability $a\left(\beta^{(t-1)}, \beta^*\right)$ given by

$$a\left(\beta^{(t-1)}, \beta^*\right) = \min\left(1,\ \frac{p\left(\beta^* \mid h^{(t-1)}_{1:L^{(t-1)}}, L^{(t-1)}, y\right) N\left(\beta^{(t-1)}; \beta^*, \sigma_\beta\right) I\left(\beta^{(t-1)} > 0\right)}{p\left(\beta^{(t-1)} \mid h^{(t-1)}_{1:L^{(t-1)}}, L^{(t-1)}, y\right) N\left(\beta^*; \beta^{(t-1)}, \sigma_\beta\right) I\left(\beta^* > 0\right)}\right) \quad (7.28)$$

with

$$p(\beta \mid h, L, y) \propto p(y \mid h, L, \beta)\, p(h \mid L, \beta)\, p(\beta). \quad (7.29)$$
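A Python sketch of one deterministic-scan sweep of this within-model kernel follows. It assumes a log-posterior function of the form sketched in Section 7.5 (with $L$ held fixed) and deterministically set proposal scales sigma1 and sigma_beta. Since the complex Gaussian proposal for $h_i$ is symmetric, and the Gaussian terms in (7.28) cancel whenever the proposed beta is positive, both acceptance ratios reduce to posterior ratios.

import numpy as np

rng = np.random.default_rng(1)

def within_model_sweep(h, beta, L, logpost, sigma1=0.1, sigma_beta=0.1):
    """One MH-within-Gibbs sweep, equations (7.25)-(7.29).

    logpost(h, beta) : unnormalised log posterior for the current model L
    """
    h = h.copy()
    for i in range(L):                       # kernel T(h_i -> h_i*), (7.26)
        h_prop = h.copy()
        h_prop[i] += sigma1 * (rng.standard_normal() + 1j * rng.standard_normal())
        # Symmetric proposal: the ratio in (7.26) reduces to a posterior ratio
        if np.log(rng.uniform()) < logpost(h_prop, beta) - logpost(h, beta):
            h = h_prop
    beta_prop = beta + sigma_beta * rng.standard_normal()   # kernel (7.28)
    if beta_prop > 0:                        # reject immediately outside support
        if np.log(rng.uniform()) < logpost(h, beta_prop) - logpost(h, beta):
            beta = beta_prop
    return h, beta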

7.6.2 Specification of the Between-Model Moves Transition Kernel

In this Section we specify the choice of between-model transition kernel $Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$, which is what distinguishes the three algorithms.

Typically, $Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$ would be decomposed as the probability of choosing to perform a move from model $L^{(t-1)}$ to model $L^*$, denoted $q\left(L^{(t-1)} \rightarrow L^*\right)$, followed by a proposal distribution for sampling the new parameters in model $L^*$ conditional on the current parameters in model $L^{(t-1)}$, denoted $q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*)\right)$.

We make the assumption that, for $L^* > L^{(t-1)}$, in our nested model structure we can associate the first $h^{(t-1)}_{1:L^{(t-1)}}$ and $\beta^{(t-1)}$ parameters in model $L^{(t-1)}$ with those in model $L^*$. This gives the new parameters as $(h^*_{1:L^*}, \beta^*, L^*) = \left(\left[h^{(t-1)}_{1:L^{(t-1)}}, h_{L^{(t-1)}+1:L^*}\right], \beta^{(t-1)}, L^*\right)$, and we can now specify the optimal choice for this generic between-model move to sample the new parameters $h_{L^{(t-1)}+1:L^*}$ as

(t 1) (t 1) (t 1) q h − , h ∗ = p(h ∗ h , h − , β ,L , y) 1:L(t−1) 1:∗ L L∗ L∗ (t−1)+1:L∗ 1 1:L(t−1) − ∗ | −   (t 1) (t 1) p(h ∗ h , h − , β ,L , y) (7.30) L∗ 1 L∗ (t−1)+1:L∗ 2 1:L(t−1) − ∗ × − | − ··· (t 1) (t 1) p(h∗ h − , β − ,L∗, y). × L(t−1)+1| 1:L(t−1) However, in practice it is not possible to obtain analytic expressions or to sample from any of the marginalised distributions p(hL∗ i h1:∗ L i 1, β∗,L∗, y). In the first instance this typically − | − − involves solving marginilising integrals which don’t admit an analytic expression. In the second instance, even when marginalising integrals admit an analytic expression, these marginals will typically not be of a standard parametric form and hence are complicated to sample from. We will restrict our proposed moves between models to increasing and decreasing L(t) by one, giving q L(t 1) L as L q L(t 1) L , with q L(t 1) L given by − → ∗ ∗ ∼ − → ∗ − → ∗    (t 1) (t 1) L − + 1 w.p. 0.5, if 1 < L − < Lmax  (t 1) (t 1) (t 1) L − 1 w.p. 0.5, if 1 < L − < Lmax q L − L∗ =  − (7.31) →  (t 1) (t 1) L − 1 w.p. 1, if L − = Lmax    − L(t 1) + 1 w.p. 1, if L(t 1) = 1.  − −    7.7 Design of Between-Model Birth and Death Proposal Moves 190

7.7 Design of Between-Model Birth and Death Proposal Moves

In this Section we specify the transition kernel $q\left(h^{(t-1)}_{1:L^{(t-1)}} \rightarrow h^*_{1:L^*}\right)$ for the three different algorithms.

7.7.1 Algorithm 1: Basic Birth Death Moves TDMCMC (BD-TDMCMC)

At time $t$, given that the state of the chain at time $t-1$ is $\left(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}\right)$, proposing to move from model $L^{(t-1)}$ to a new model $L^*$ proceeds by first determining whether a birth or a death will be proposed, according to (7.31).

If a birth move is proposed, $L^{(t-1)}$ is incremented by one and $h^*_{1:L^*} = \left[h^{(t-1)}_{1:L^{(t-1)}}, h^*_{L^*}\right]$. We obtain $h^*_{L^*}$ by sampling a new value according to a complex Gaussian proposal with zero mean and standard deviation $\sigma_1$ for the real and imaginary components, $N(0, \sigma_1)$, chosen based on the prior model.

In the case of a death move, $L^{(t-1)}$ is decremented by one and $h^*_{1:L^*} = \left[h^{(t-1)}_1, \dots, h^{(t-1)}_{L-1}\right]$.

These BD moves form a reversible pair, so it is sufficient to specify the probability of acceptance for the birth move; the acceptance ratio of the corresponding death move is its reciprocal. The birth acceptance probability is given in equation (7.24) after making the appropriate substitutions for the choice of $Q$ given next:

$$Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right) = q\left(L^{(t-1)} \rightarrow L^*\right) \mathcal{CN}\left(h^*_{L^*}; 0, \sigma_2\right), \quad (7.32)$$

where $\sigma_2$ is set deterministically.
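Combining the move-type distribution (7.31) with the proposal (7.32) and the acceptance probability (7.24), one BD between-model step can be sketched in Python as follows. Here logpost(h, beta, L) denotes an unnormalised log posterior such as the sketch in Section 7.5, and sigma2 is the deterministically set proposal variance; all names are illustrative.

import numpy as np

rng = np.random.default_rng(2)

def q_prob(L_from, L_max):
    """Probability of the selected direction under q(L -> L*), equation (7.31)."""
    return 1.0 if L_from in (1, L_max) else 0.5

def bd_move(h, beta, L, L_max, logpost, sigma2=0.5):
    """One BD-TDMCMC between-model move: (7.24) with Q given by (7.32)."""
    # Draw the move type from q(L -> L*), equation (7.31)
    if L == 1:
        L_star = 2
    elif L == L_max:
        L_star = L - 1
    else:
        L_star = L + 1 if rng.uniform() < 0.5 else L - 1

    def log_cn(x):           # log CN(x; 0, sigma2) density of a single tap
        return -np.abs(x) ** 2 / sigma2 - np.log(np.pi * sigma2)

    if L_star > L:           # birth: append a CN(0, sigma2) tap
        h_new = (rng.standard_normal() + 1j * rng.standard_normal()) \
                * np.sqrt(sigma2 / 2)
        h_star = np.append(h, h_new)
        log_alpha = (logpost(h_star, beta, L_star) + np.log(q_prob(L_star, L_max))
                     - logpost(h, beta, L) - np.log(q_prob(L, L_max))
                     - log_cn(h_new))
    else:                    # death: drop the last tap; the reverse move is a birth
        h_star = h[:-1]
        log_alpha = (logpost(h_star, beta, L_star) + np.log(q_prob(L_star, L_max))
                     + log_cn(h[-1])
                     - logpost(h, beta, L) - np.log(q_prob(L, L_max)))

    if np.log(rng.uniform()) < log_alpha:
        return h_star, L_star
    return h, L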

7.7.2 Algorithm 2: Stochastic Approximation TDMCMC (SA-TDMCMC)

Here we extend the proposal mechanism of the BD trans-dimensional sampling methodology detailed in Algorithm 1. We draw on the work of [62] and [232]. The extension of the TDMCMC involves SA and is based on adaptive Monte Carlo strategies and the convergence results obtained in [236], [237] and [238].

The SA algorithm, also known as Contour Monte Carlo, is itself a generalization of the Wang-Landau algorithm, which is typically used to calculate the spectral density of a physical system. Here we use this methodology in a similar manner to that discussed in [237]; however, we work in a TDMCMC setting. We reiterate that we are working with our target posterior distribution, $p(h_{1:L}, L, \beta \mid y)$, from which we wish to obtain samples. As in [62] we define a mapping $S(h_{1:l}, \beta, l)$ which maps a vector $(h_{1:l}, \beta, l) \in \Theta$ to a model indicator $l$, distinguishing each model subspace of $\Theta$. The choice of this mapping was made specifically because we are aiming to develop a trans-dimensional sampler. Hence, the mapping $S$ effectively partitions the support of the posterior distribution $\Theta$ into subspaces corresponding to each model subspace,

$$E_1 = \{\theta = (h_1, \beta_1, 1) : S(\theta) = 1\}, \quad E_2 = \{\theta = (h_{1:2}, \beta_2, 2) : S(\theta) = 2\}, \quad \dots, \quad E_{L_{max}} = \{\theta = (h_{1:L_{max}}, \beta_{L_{max}}, L_{max}) : S(\theta) = L_{max}\}. \quad (7.33)$$

Next we define a weight attached to each model subspace $E_i$ by

$$g_i = \frac{\int p(h_{1:i}, \beta \mid y, L = i)\, dh_{1:i}\, d\beta}{\pi_i}, \quad i = 1, \dots, L_{max}, \quad (7.34)$$

where $\pi_i > 0$ are pre-specified weights which satisfy the constraint

$$\sum_{i=1}^{L_{max}} \pi_i = 1. \quad (7.35)$$

These quantities relate to the Bayes Factors between model $i$ and model $j$ via the ratio $\frac{g_i}{g_j}$. We now re-express the target posterior distribution as

$$p^*(h_{1:L}, L, \beta \mid y) \propto \sum_{i=1}^{L_{max}} \frac{p(\theta \mid y)}{g_i}\, I(\theta \in E_i), \quad (7.36)$$

where $I(\cdot)$ is an indicator function. Hence, developing an algorithm which samples from $p^*(h_{1:L}, L, \beta \mid y)$ results in a random walk between the desired model subspaces according to the sampling frequencies $\pi = (\pi_1, \dots, \pi_{L_{max}})$. The SA algorithm effectively provides an automated way to learn the optimal weights $g_1, \dots, g_{L_{max}}$ simultaneously for a given initial $\pi = (\pi_1, \dots, \pi_{L_{max}})$. In the SA algorithm we denote by $\hat{g}_i^{(t)}$ the estimate of $g_i$ at iteration $t$, and $\psi(\theta)$ is used as generic notation for the unnormalised target posterior. The estimate of the posterior distribution at iteration $t$ is given by

$$\hat{p}^{(t)}(h_{1:L}, L, \beta \mid y) \propto \sum_{i=1}^{L_{max}} \frac{\psi(\theta)}{\hat{g}_i^{(t)}}\, \pi_i\, I(\theta \in E_i). \quad (7.37)$$

Denote by $\left\{\theta_k^{(t)}\right\}_{k=1:M}$ the samples drawn from $\hat{p}^{(t)}(h_{1:L}, L, \beta \mid y)$ at iteration $t$, and by $\upsilon^{(t)} = \left(\upsilon_1^{(t)}, \dots, \upsilon_{L_{max}}^{(t)}\right)$ the realised sampling frequency of each model, with

$$\upsilon_i^{(t)} = \frac{1}{M} \sum_{k=1}^{M} I\left(\theta_k^{(t)} \in E_i\right). \quad (7.38)$$

The generic SA algorithm of [62] is presented in the following:

Generic Stochastic Approximation Monte Carlo Algorithm

1. Sampling: Draw samples $\theta_k^{(t)}$, $k = 1, \dots, M$, from the working density $\hat{p}^{(t)}(h_{1:L}, L, \beta \mid y)$. This can be achieved via many methods: Importance Sampling, MCMC, Rejection Sampling, TDMCMC, etc.

2. Weight Update: Update the working estimate of $g_i$, $i = 1, \dots, L_{max}$, recursively by setting

$$\log \hat{g}_i^{(t+1)} = \log \hat{g}_i^{(t)} + \gamma_t \left(\upsilon_i^{(t)} - \pi_i\right),$$

where $\gamma_t$ is a prespecified gain factor.

The SA-TDMCMC algorithm is an extension of the BD-TDMCMC algorithm. In this regard the SA-TDMCMC algorithm we develop samples from the target posterior over time by using the BD-TDMCMC algorithm with the addition of a weighting according to the SA learning stage. Hence, the idea is to apply a generic TDMCMC algorithm such that the within-model moves are unchanged. The between-model moves are modified by the SA stage and it is demonstrated in [232] that under BD-TDMCMC frameworks this corresponds to modifying the between-model moves to incorporate the stages as shown in Algorithm 17.

Note that $\gamma_t$, the gain factor or learning rate, is critical for the convergence of this algorithm; it must satisfy the following two conditions [232]:

$$\sum_{t=0}^{\infty} \gamma_t = \infty, \quad (7.39a)$$

$$\sum_{t=0}^{\infty} \gamma_t^2 < \infty. \quad (7.39b)$$

The first condition ensures convergence from any initial starting point, and the second provides an asymptotic damping of the errors introduced by the use of $\upsilon_i^{(t)}$. We shall use the choice suggested in [232],

$$\gamma_t = \left(\frac{\kappa}{\max(\kappa, t)}\right)^{\rho}, \quad \kappa > 0. \quad (7.40)$$

The recommendation of [232] suggests that an appropriate choice for our problem is $\rho = 1$ and $\kappa = 2L_{max}$.

In the model selection context, [232] produces a generalised TDMCMC algorithm. The resulting algorithm adaptively weights the probability of transitioning between different model subspaces according to their posterior model probabilities, which are learnt on-line. Additionally, the guaranteed convergence results of this algorithm ensure the validity of the approach in eventually generating correlated Markov chain samples from the correct stationary distribution given by our target posterior. Hence, the addition of the SA adjustment should help to offset a poorly designed between-model transition kernel, such as a simple BD kernel. That is, our Markov chain will still be able to move between different model subspaces thanks to the adjustment of the acceptance probability based on the on-line learning of Bayes Factor estimates.

Algorithm 17 SA-TDMCMC: Between-Model Moves Transition Kernel

1: Sample $(h^*_{1:L^*}, \beta^*, L^*) \sim Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$, given in (7.32).
2: Calculate the acceptance probability of the new proposed state according to

$$\alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}), (h^*_{1:L^*}, \beta^*, L^*)\right) = \min\left(1,\ \frac{\hat{g}^{(t)}_{L^{(t-1)}}}{\hat{g}^{(t)}_{L^*}} \cdot \frac{p(h^*_{1:L^*}, \beta^*, L^* \mid y)\; Q\left((h^*_{1:L^*}, \beta^*, L^*) \rightarrow (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})\right)}{p(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)} \mid y)\; Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)}\right). \quad (7.41)$$

3: Sample $u \sim U[0, 1]$
4: if $u < \alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}), (h^*_{1:L^*}, \beta^*, L^*)\right)$ then
5: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^*_{1:L^*}, \beta^*, L^*)$
6: else
7: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})$
8: end if
9: Weight Update: Update the working estimate of $g_i$, $i = 1, \dots, L_{max}$, recursively by setting

$$\log \hat{g}_i^{(t+1)} = \log \hat{g}_i^{(t)} + \gamma_t \left(\upsilon_i^{(t)} - \pi_i\right). \quad (7.42)$$
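The SA adjustment itself is a one-line recursion per model index. A Python sketch of the weight update (7.42) together with the gain-factor schedule (7.40); the list of visited model indices is an assumed bookkeeping structure collected over the last block of iterations.

import numpy as np

def sa_update(log_g, visits, t, pi=None, rho=1.0, kappa=None):
    """Update the log-weight estimates per equations (7.40) and (7.42).

    log_g  : current estimates of log g_i, shape (L_max,)
    visits : model indices (1..L_max) visited in the last block of M iterations
    """
    L_max = len(log_g)
    pi = np.full(L_max, 1.0 / L_max) if pi is None else pi
    kappa = 2 * L_max if kappa is None else kappa    # choice suggested in [232]
    gamma_t = (kappa / max(kappa, t)) ** rho         # gain factor, equation (7.40)
    # Realised sampling frequencies upsilon_i^(t), equation (7.38)
    upsilon = np.bincount(np.asarray(visits) - 1, minlength=L_max) / len(visits)
    return log_g + gamma_t * (upsilon - pi)          # equation (7.42)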

7.7.3 Algorithm 3: Conditional Path Sampling TDMCMC (CPS-TDMCMC)

We now develop an algorithm to approximate the optimal birth proposal based on the CPS methodology. This requires sampling from the conditional distributions for the real and imaginary components of the new value of $h^*_{L^*}$. Henceforth we work with the complex parameters $h_{1:L}$ on the real domain, and by a slight abuse of notation we denote them by $h_{1:2L}$; there are twice as many components to reflect the fact that we have $L$ real components and $L$ imaginary components. The proposal for a birth move is then given by

$$Q\left((h^{(t-1)}_{1:2L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:2L^*}, \beta^*, L^*)\right) = q\left(L^{(t-1)} \rightarrow L^*\right)\, q\left(h^{(t-1)}_{1:2L^{(t-1)}} \rightarrow h^*_{1:2L^*}\right), \quad (7.43)$$

where

$$q\left(h^{(t-1)}_{1:2L^{(t-1)}} \rightarrow h^*_{1:2L^*}\right) = p\left(h_{2L^{(t-1)}+2} \mid L^{(t-1)}+1, h_{2L^{(t-1)}+1}, h^{(t-1)}_{1:2L^{(t-1)}}, \beta^{(t-1)}, y\right) \times p\left(h_{2L^{(t-1)}+1} \mid L^{(t-1)}+1, h^{(t-1)}_{1:2L^{(t-1)}}, \beta^{(t-1)}, y\right). \quad (7.44)$$

We have used here the fact that we can decompose the optimal proposal into two distributions; by sampling from the first and conditioning on that sample, we can sample the second component. Clearly, sampling $h_{2L+2}$ and $h_{2L+1}$ is not trivial, since it involves sampling from both the conditional posterior and the marginal conditional posterior of the new parameters. Additionally, sampling $h_{2L+1}$ is further complicated by the fact that its distribution is only known analytically as the integral $p(h_{2L+1} \mid h_{1:2L}, y) = \int p(h_{2L+2}, h_{2L+1} \mid h_{1:2L}, y)\, dh_{2L+2}$. In general this integration cannot be done analytically. Instead we approximate this optimal sampling distribution via a path sampling estimator, as described in Section 2.2 of [217] and [239].

Next we develop the automated TDMCMC algorithm of [217] denoted as CPS-TDMCMC. The intention of using this algorithm is twofold. Firstly, we would like to automate the between- model move mechanism of the MCMC sampler. Secondly, we require that the resulting algorithm has suitably efficient mixing between models. The details of the construction of this proposal mechanism can be found in Section 3.1 of [217].

7.7.3.1 Generic Construction of the CPS proposal

We utilise the density estimator of [239] as a simple marginal estimator based on path sampling and gradients of the log-posterior. It is straightforward to implement, and flexible for the purpose of estimating densities of the form $p(h_{2L+1} \mid h_{1:2L}, y)$. Generically, suppose we have samples $\theta_1, \dots, \theta_t$ from a distribution $\pi(\theta)$ known up to some normalisation constant. We are interested in estimating the $k$-th marginal distribution, $\pi_k(\theta_k)$.

If $\alpha(\theta_k) = \log \pi_k(\theta_k)$ denotes the log of the (unnormalised) marginal posterior distribution, [239] shows that

$$\frac{\partial}{\partial \theta_k} \alpha(\theta_k) = E_{\theta_{-k}}\left\{U_k(\theta)\right\}, \quad (7.45a)$$

$$U_k(\theta) = \frac{\partial}{\partial \theta_k} \log \pi(\theta). \quad (7.45b)$$

An estimator for $E_{\theta_{-k}}\{U_k(\theta)\}$ can be obtained by first ordering the samples according to the $k$-th marginal sample values, $\theta_k^{(1)} < \theta_k^{(2)} < \dots < \theta_k^{(m)}$, $m \le n$, ignoring any duplicates which may arise in the context of the Metropolis algorithm, for example. If $n(i)$ is the number of replicates of each $\theta_k^{(i)}$, then the estimator $\bar{U}_k(i)$ is obtained as

$$\bar{U}_k(i) = n(i)^{-1} \sum_{j=1}^{n(i)} U_k(\theta_j). \quad (7.46)$$

The $\bar{U}_k(i)$ approximates the gradient of $\log \pi_k(\theta_k)$ at each of the points in the sample, and may be utilised to derive a density estimate. Of the density estimation methods presented in [239] we implement the simplest approach, since our interest is in a computationally efficient approximate distribution rather than a perfect density estimate. A stepwise linear approximation to $\pi_k(\theta_k)$ is obtained by arbitrarily setting $\hat{\alpha}(\theta_k^{(1)}) = 0$ and defining

$$\hat{\alpha}(\theta_k^{(i)}) = \hat{\alpha}(\theta_k^{(i-1)}) + \left(\theta_k^{(i)} - \theta_k^{(i-1)}\right) \times \left(\bar{U}_k(i) + \bar{U}_k(i-1)\right)/2, \quad \text{for } i = 2, \dots, m. \quad (7.47)$$

Hence for all points $\theta_k \in [\theta_k^{(1)}, \theta_k^{(m)}]$, the unnormalised density estimate is given by

$$\hat{\pi}_k(\theta_k) = \exp\left\{\hat{\alpha}(\theta_k^{(i)})\right\} \quad \text{for } \theta_k \in [\theta_k^{(i)}, \theta_k^{(i+1)}]. \quad (7.48)$$

The corresponding distribution function is obtained by integrating over $\hat{\pi}_k(\theta)$, with the normalising constant given by the value of the distribution function at the point $\theta_k^{(m)}$. Simulation from $\hat{\pi}_k(\theta)$ may proceed via inversion methods.
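A Python sketch of this estimator follows: it integrates the averaged gradients via (7.47), exponentiates per (7.48), and samples by inversion of the resulting piecewise-constant cdf. The grid and the averaged gradients $\bar{U}_k(i)$ are assumed to be supplied by the surrounding CPS construction, and the Gaussian tails of [217] are omitted for brevity.

import numpy as np

def stepwise_density_sample(grid, grad_bar, rng=np.random.default_rng(3)):
    """Build the stepwise density of (7.47)-(7.48) and draw one sample by inversion.

    grid     : sorted grid points theta_k^(1) < ... < theta_k^(m)
    grad_bar : averaged log-density gradients U_bar_k(i) at each grid point
    """
    # Trapezoidal integration of the gradient, equation (7.47)
    alpha = np.zeros(len(grid))
    for i in range(1, len(grid)):
        alpha[i] = alpha[i - 1] + (grid[i] - grid[i - 1]) * \
                   (grad_bar[i] + grad_bar[i - 1]) / 2
    dens = np.exp(alpha - alpha.max())       # equation (7.48), stabilised
    # Piecewise-constant cell masses and the normalised cdf
    mass = dens[:-1] * np.diff(grid)
    cdf = np.concatenate(([0.0], np.cumsum(mass))) / mass.sum()
    # Inversion sampling: locate the cell, then interpolate within it
    u = rng.uniform()
    i = max(np.searchsorted(cdf, u, side="right") - 1, 0)
    frac = (u - cdf[i]) / (cdf[i + 1] - cdf[i])
    return grid[i] + frac * (grid[i + 1] - grid[i])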

Hence, our CPS birth proposal automates between-model moves by using the estimator of [239] to construct a marginal density estimate which is conditioned upon some subset of the remaining parameters $\theta_{-k}$, e.g. $\pi_k(\theta_k \mid \theta_1, \dots, \theta_{k-1})$. This is achieved by fixing the sample margins at their conditioned values and estimating the density as before. It is therefore feasible to approximate the optimal between-model proposal decomposition in (7.44). This may proceed by first estimating and sampling from the density $p(h_{2L+1} \mid h_{1:2L}, \beta, y)$, then estimating and sampling from $p(h_{2L+2} \mid L+1, h_{2L+1}, h_{1:2L}, \beta, y)$ conditional upon the previously sampled point, and so on.

Details of the construction follow the recommendations made in [217]. For example, when estimating $p(h_{2L+1} \mid h_{1:2L}, \beta, y)$ we use the exact gradient calculations found in Appendix 7.12 to obtain our estimate of $\bar{U}$. Additionally, we utilise a deterministic grid for the values of $h_{2L+1}$ at which we construct our piecewise constant density estimator. This grid is centered on the approximate mode that we obtain by solving for the roots of the gradient expression in Appendix 7.12. Finally, we estimate the average gradient of the log posterior by sampling from the prior for parameters not of interest and not conditioned upon, in our case $h_{2L+2}$.

The proposed CPS-TDMCMC between-model move is presented in Algorithm 18.

Algorithm 18 CPS-TDMCMC: Between-Model Moves Transition Kernel

If performing a move from model $L^{(t-1)}$ to $L^* = L^{(t-1)} + 1$:

1: Sample $(h^*_{1:L^*}, \beta^*, L^*) \sim Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$. This procedure is depicted in Algorithm 19.
2: Calculate the acceptance probability of the new proposed state according to

$$\alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}), (h^*_{1:L^*}, \beta^*, L^*)\right) = \min\left(1,\ \frac{p(h^*_{1:L^*}, \beta^*, L^* \mid y)\; Q\left((h^*_{1:L^*}, \beta^*, L^*) \rightarrow (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})\right)}{p(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)} \mid y)\; Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)}\right).$$

3: Sample $u \sim U[0, 1]$
4: if $u < \alpha\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}), (h^*_{1:L^*}, \beta^*, L^*)\right)$ then
5: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^*_{1:L^*}, \beta^*, L^*)$
6: else
7: $(h^{(t)}_{1:L^{(t)}}, \beta^{(t)}, L^{(t)}) = (h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)})$
8: end if

Clearly, the CPS proposal requires more computational effort to construct than the simple BD proposal. Generally, this is justified by obtaining much higher acceptance probabilities for between-model moves as a result of sampling from an approximation of the optimal between-model proposal. In the next Sections we compare different aspects of the three proposed algorithms in terms of computational complexity, mixing rate of the Markov chain and BER.

Algorithm 19 CPS method to obtain a sample from $(h^*_{1:L^*}, \beta^*, L^*) \sim Q\left((h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}) \rightarrow (h^*_{1:L^*}, \beta^*, L^*)\right)$

1: Construct the approximate optimal proposal for parameter $h_{2L+1}$ using the CPS approach, as follows:

i. Sample $m$ points from the priors for parameters $h_{2L+2}$ and $h_{2L+1}$, denoted $h_{2L+2}(i)$ and $h_{2L+1}(i)$, which are used to create the array

$$\begin{bmatrix} h_1^{(t-1)} & \cdots & h_{2L}^{(t-1)} & h_{2L+1}(1) & h_{2L+2}(1) \\ \vdots & & \vdots & \vdots & \vdots \\ h_1^{(t-1)} & \cdots & h_{2L}^{(t-1)} & h_{2L+1}(m) & h_{2L+2}(m) \end{bmatrix}. \quad (7.49)$$

ii. Use each row of this array to evaluate the expression for the mode of the posterior (Appendix 7.12) for parameter $h_{2L+1}$, and average these results to obtain the centre point of the grid around which we construct the proposal, denoted $\widehat{\mathrm{Mode}}(h_{2L+1})$.

iii. Construct a grid of $n$ points with $n/2$ points on either side of $\widehat{\mathrm{Mode}}(h_{2L+1})$ (linearly or non-linearly spaced), and for each grid point sample $m$ points from the prior for $h_{2L+2}$ and construct the array as in (7.49). We used a linearly spaced grid with width $w$ between adjacent points:

$$\begin{bmatrix} h_1^{(t-1)} & \cdots & h_{2L}^{(t-1)} & \widehat{\mathrm{Mode}}(h_{2L+1}) - \frac{n}{2} w & h_{2L+2}(1) \\ \vdots & & \vdots & \vdots & \vdots \\ h_1^{(t-1)} & \cdots & h_{2L}^{(t-1)} & \widehat{\mathrm{Mode}}(h_{2L+1}) + \frac{n}{2} w & h_{2L+2}(n \times m) \end{bmatrix} \quad (7.50)$$

iv. Use the array in (7.50) of $n$ grid points with $m$ samples per grid point to evaluate the gradient (in Appendix 7.12). Next, for the $i$-th grid point average the $m$ gradient evaluations to obtain an estimate of $\bar{U}_k(i)$. This is repeated for each of the $n$ grid points.

v. Construct the stepwise constant approximation $\hat{p}(h_{2L+1} \mid h_{1:2L}, \beta, y)$ using equations (7.47) and (7.48). Then add Gaussian tails to this approximate distribution on either side of the left and right end points of the grid, see [217, p. 5, equation 9] for details. It is then trivial to normalise this stepwise constant approximation and construct an empirical cdf from which one can easily sample to obtain a new proposed state.

2: Sample the proposal $h^*_{2L+1}$ from the normalised approximation $\hat{p}(h_{2L+1} \mid h_{1:2L}, \beta, y)$.

3: Construct the approximate optimal proposal for parameter $h_{2L+2}$ using the CPS approach, i.e. estimate $p\left(h_{2L+2} \mid L+1, h^*_{2L+1}, h^{(t-1)}_{1:2L}, \beta^{(t-1)}, y\right)$, as follows:

i. Using the array constructed in (7.49), replace the elements corresponding to $h_{2L+1}$ with the sampled point $h^*_{2L+1}$.
ii. As above, use each row of this array to evaluate the expression for the mode of the posterior (Appendix 7.12) for parameter $h_{2L+2}$.
iii. Average these results to obtain the centre point of the grid around which we construct the proposal, denoted $\widehat{\mathrm{Mode}}(h_{2L+2})$.
iv. Repeat steps 1.iii and 1.v above using the newly constructed grid.

4: Sample the proposal $h^*_{2L+2}$ from the normalised approximation $\hat{p}\left(h_{2L+2} \mid h^*_{2L+1}, h^{(t-1)}_{1:2L}, \beta^{(t-1)}, y\right)$.

7.8 Complexity Analysis

The complexity of the class of algorithms presented in the previous Sections can be studied from two perspectives: the more technical involves a theoretical study of the mixing rate of the TDMCMC algorithm under consideration; the other focuses on the computational complexity. A theoretical study of the rate at which the generated Markov chain forgets its initial conditions is well beyond the scope of this thesis. Instead we focus on a computational complexity comparison between the algorithms. The computational cost of each of these algorithms can be split into three parts: the first involves constructing and sampling from the proposal; the second significant computational cost comes from the evaluation of the acceptance probability for the proposed new Markov chain state; and the third is related to the mixing rate of the overall MCMC algorithm, as it affects the length of the Markov chain required to obtain estimators of a desired accuracy. We define the following building blocks and their associated complexity:

1. Sampling a random variable using exact sampling via inversion of the cdf has a complexity of $O(1)$.

2. Evaluation of the likelihood density $p(y \mid h, L, \beta)$ has a complexity of $KL(C_m + C_a) + O(1)$.

3. Evaluation of the prior density $p(h_i \mid \beta, L)$ has a complexity of $O(1)$,

where $C_m$ and $C_a$ represent the operations of complex multiplication and addition, respectively. The complexity of the BD-TDMCMC (Algorithm 1) is presented in Tables 7.1 and 7.2. The additional complexity of the SA-TDMCMC algorithm results from modifying the acceptance probability as shown in (7.41) and updating the weights according to (7.42). The modification of the acceptance probability is an operation of $O(1)$. The updating of the weights adds an extra stage after the acceptance probability calculation, which has complexity $L_{max} O(1)$.

The complexity of the CPS-TDMCMC algorithm is depicted in Table 7.3. We have now explicitly presented the computational complexity of one iteration of the Markov chain for each of the algorithms. It is clear that although the CPS algorithm is more computationally complex than the other algorithms, it is still linear in the dimension of the problem.

7.9 Estimator Efficiency via Bayesian Cramér-Rao Type Bounds

In Section 7.5 we have formulated a Bayesian model which resulted in a posterior distribution (7.22). We discussed that we are interested in approximating the point estimates (7.20) and (7.21) from the posterior model. We then established that in order to obtain estimates of these quantities we would need to be able to sample from the posterior distribution over all models, requiring the TDMCMC methodology.

Complexity of within-model moves in (7.25) for the deterministic scan Metropolis-Hastings within Gibbs sampler

Operation | Number of operations
Sampling $\prod_{i=1}^{L^{(t-1)}} T\left(h^{(t-1)}_{1:L^{(t-1)}} \to h_i^*\right)$ | $L^{(t-1)} O(1)$
Evaluating acceptance probability $a\left(h_i^{(t-1)}, h_i^*\right)$ | $K L^{(t-1)} (C_m + C_a)$
Sampling $T\left(\beta^{(t-1)} \to \beta^*\right)$ | $O(1)$
Evaluating acceptance probability $a\left(\beta^{(t-1)}, \beta^*\right)$ | $K L^{(t-1)} (C_m + C_a)$

Table 7.1: Computational complexity of within-model moves. This part is common to all three TDMCMC algorithms.

Complexity of between-model moves

Operation | Number of operations
Sampling $Q\left(\left(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}\right) \to \left(h^*_{1:L^*}, \beta^*, L^*\right)\right)$ | $O(1)$
Evaluating acceptance probability $\alpha\left(\left(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}\right), \left(h^*_{1:L^*}, \beta^*, L^*\right)\right)$ | $\left(L^* + L^{(t-1)}\right) K (C_m + C_a)$

Table 7.2: Computational complexity of the between-model moves of the BD-TDMCMC algorithm

In this Section we provide some definitions and comparative performance bounds based on the MSE of the point estimator $\hat{h}_{MMSE}$ obtained from the posterior $h \mid y$. These bounds will be defined based on the concept of the BCRLB adapted for the model selection setting.

To clarify this further, we note that obtaining an analytical expression for (7.18) is not feasible; therefore the MMSE estimate for h given y and its MSE cannot be calculated analytically. To assess the performance of the estimator of h we can calculate the MSE for h via simulation. We can then compare this MSE to the BCRLB.

For deterministic parameters, a commonly used lower bound for the MSE of the parameters is the Cramér-Rao Lower Bound (CRLB), given by the inverse of the Fisher information matrix [36]. Since we assume the model parameters are random variables, and we are therefore working with the posterior distribution under a Bayesian framework, we can instead use the BCRLB, or posterior CRLB [240].

Constructing the forward proposal $q\left(h^{(t-1)}_{1:2L^{(t-1)}} \to h^*_{1:2L^*}\right)$

Operation | Number of operations
step (1.i) in Algorithm 19 | $m O(1)$
step (1.ii) in Algorithm 19, equation (7.69) | $m (C_m + C_a)(5K + 4KL)$
step (1.iii) in Algorithm 19 | $(nm + n) O(1)$
step (1.iv) in Algorithm 19, equation (7.66) | $nmK (C_m + C_a)(3 + 2L)$
step (1.v) in Algorithm 19 | $n O(1)$
step (3.i) in Algorithm 19 | $O(1)$
steps (3.ii), (3.iii) and (3.iv) in Algorithm 19 | $nmK (C_m + C_a)(3 + 2L)$

Sampling from the forward proposal $q\left(h^{(t-1)}_{1:2L^{(t-1)}} \to h^*_{1:2L^*}\right)$
steps (2) and (4) in Algorithm 19 | $n O(1)$

Evaluating the acceptance probability $\alpha\left(\left(h^{(t-1)}_{1:L^{(t-1)}}, \beta^{(t-1)}, L^{(t-1)}\right), \left(h^*_{1:L^*}, \beta^*, L^*\right)\right)$
step (2) in Algorithm 18 | $K (C_m + C_a)\left(L^* + L^{(t-1)} + m\left(6n + 8L^{(t-1)}\right)\right)$

Table 7.3: Computational complexity of the between-model moves of the CPS-TDMCMC algorithm

The BCRLB provides a lower bound on the MSE matrix for random parameters. Let $\hat{h}$ denote an estimate of h which is a function of the observations y. The estimation error is $\hat{h} - h$ and the MSE matrix is

\[
\Sigma \triangleq E_{y,h}\left\{\left(\hat{h} - h\right)\left(\hat{h} - h\right)^H\right\}, \tag{7.51}
\]

where $E_{y,h}$ denotes expectation with respect to $p(y, h)$. The BCRLB C provides a lower bound on the MSE matrix Σ. It is the inverse of the BIM J, and therefore $\Sigma \ge C \triangleq J^{-1}$, where the matrix inequality indicates that $\Sigma - C$ is a positive semi-definite matrix. Defining $\Delta^\alpha_\beta$ to be the $L \times K$ matrix of second-order partial derivatives with respect to the $L \times 1$ parameter vector β and the $K \times 1$ parameter vector α, the BIM for h is defined as

\[
J = -E_{y,h}\left\{\Delta^h_h \ln p(y, h)\right\}
  = -E_{y,h}\left\{\Delta^h_h \ln\left(\sum_{L=1}^{L_{\max}} \int p(y, h, L, \beta)\, d\beta\right)\right\}. \tag{7.52}
\]

It is clear that the BCRLB for h cannot be obtained analytically, since it involves the summation over L and the integration over β. This analysis is nonstandard since it involves an unknown model order; as such we will present several settings for which we can define performance bounds, based around the BCRLB.

Here we define the following model settings and associated bounds:

1. BCRLB: this bound is calculated conditional on knowledge of the true L and true β used to generate the data, as such this would represent the lowest bound that we consider for our comparison.

\[
\mathrm{Tr}\{J\} = \mathrm{Tr}\left\{\Sigma_h - E\left\{h y^H\right\} E^{-1}\left\{y y^H\right\} E\left\{y h^H\right\}\right\}
= \mathrm{Tr}\left\{\Sigma_h - \Sigma_h (DW_L)^H \left((DW_L)\Sigma_h (DW_L)^H + R\right)^{-1} (DW_L)\Sigma_h\right\}. \tag{7.53}
\]

2. $B_{|L_{\max},\beta}$: to define this bound we condition on knowledge of β and on a misspecification of L. In particular we consider a saturated model where $L = L_{\max}$ and the true L is less than $L_{\max}$. In this case we consider a bound based around the BCRLB for the saturated model, given by

\[
\mathrm{Tr}\left\{J_{|L_{\max},\beta}\right\} = \mathrm{Tr}\left\{\Sigma_{L_{\max}} - \Sigma_{L_{\max}} (DW_{L_{\max}})^H \left((DW_{L_{\max}})\Sigma_{L_{\max}} (DW_{L_{\max}})^H + R\right)^{-1} (DW_{L_{\max}})\Sigma_{L_{\max}}\right\}, \tag{7.54}
\]

where $\left[\Sigma_{L_{\max}}\right]_{l,l} = \frac{\exp\{-\beta l\}}{\sum_{k=1}^{L_{\max}} \exp\{-\beta k\}}$, $l = 1, \ldots, L_{\max}$. We note that, under the assumption that we have a nested model structure, the following interesting identity holds:

\[
\mathrm{Tr}\left\{J_{|L_{\max},\beta}\right\} > \mathrm{Tr}\{J\}. \tag{7.55}
\]

3. $B_{|\beta}$: in this bound we condition on knowledge of β and we consider two settings, namely BMOS and BMA. We define the lower bound for the BMA case as

\[
\mathrm{Tr}\left\{J^{\mathrm{BMA}}_{|\beta}\right\} = \sum_{l=1}^{L_{\max}} Pr(l \mid y, \beta)\, \mathrm{Tr}\{\Sigma_l\}, \tag{7.56}
\]

where

\[
Pr(l \mid y, \beta) = \frac{p(y \mid l, \beta)\, Pr(l)}{\sum_{l=1}^{L_{\max}} p(y \mid l, \beta)\, Pr(l)}, \tag{7.57}
\]

and the marginal likelihood $p(y \mid l, \beta)$ can be written as

\[
p(y \mid l, \beta) = \int p(y \mid h_l, l, \beta)\, p(h_l \mid l, \beta)\, dh_l
= \mathcal{CN}\left(0, (DW_l)\Sigma_l (DW_l)^H + \sigma_w^2 I\right). \tag{7.58}
\]

In the case of BMOS we would use the expression given in (7.53) after replacing $L_{\max}$ with $L_{MAP}$.

4. $B_{|L}$: this bound corresponds to the case where we condition on knowledge of the model order. However, we do not condition on knowledge of β. Since we base this bound on the BCRLB, we are interested in

\[
J = -E_{y,h}\left\{\Delta^h_h \ln p(y, h \mid L)\right\}
  = -E_{y,h}\left\{\Delta^h_h \ln \int p(y, h, \beta \mid L)\, d\beta\right\}. \tag{7.59}
\]

This cannot be written down analytically; hence, we will approximate it as follows:

\[
\hat{p}(h \mid y, L) = \frac{1}{T}\sum_{t=1}^{T} p\left(h \mid y, L, \beta^{(t)}\right). \tag{7.60}
\]

We then substitute this Monte Carlo estimate (7.60) into (7.59) to obtain an estimate of the bound

\[
\mathrm{Tr}\left\{\hat{J}_{|L}\right\} = -E_{y,h}\left\{\Delta^h_h \ln \int p(y, h, \beta \mid L)\, d\beta\right\}
\approx \frac{1}{ST}\sum_{s=1}^{S}\sum_{t=1}^{T} \mathrm{Tr}\left\{\Sigma_{\min}\left(\beta^{(t)}, y^{(s)}\right)\right\}, \tag{7.61}
\]

where S is the number of realisations used and $\mathrm{Tr}\left\{\Sigma_{\min}\left(\beta^{(t)}, y^{(s)}\right)\right\}$ is defined as in (7.53) with

\[
\left[\Sigma_h\right]_{l,l} = \frac{\exp\left\{-\beta^{(t)} l\right\}}{\sum_{k=1}^{L_{\max}} \exp\left\{-\beta^{(t)} k\right\}}, \quad l = 1, \ldots, L. \tag{7.62}
\]

In the next Section we shall compare the MSE of the proposed algorithms with the aforementioned bounds.

7.10 Simulation Results

In this Section we present the simulation results for the proposed model and sampling algorithms. First we describe the simulated OFDM system.

7.10.1 System Configuration and Algorithms Initialization

Unless stated otherwise, the following specifications were used in the simulations. The OFDM system setup is K = 64 subcarriers employing QPSK symbols and Lmax = K/4. The channel is modeled as block Rayleigh fading and the a priori channel length distribution follows a truncated Poisson distribution

\[
Pr(l) = \frac{\lambda^l \exp\{-\lambda\}}{C\, l!}, \tag{7.63}
\]

where $\lambda = 8$ and $C = \sum_{l=1}^{L_{\max}} \frac{\lambda^l \exp\{-\lambda\}}{l!}$.
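For concreteness, the truncated Poisson prior (7.63) can be evaluated as in the following sketch (Python; the function name is illustrative), where the division by the empirical sum implements the normalising constant C.

    import numpy as np
    from scipy.special import gammaln

    def truncated_poisson_pmf(lam=8.0, l_max=16):
        # Prior Pr(l) of (7.63): Poisson(lambda) truncated to l = 1..Lmax;
        # dividing by the sum of the terms implements the constant C.
        l = np.arange(1, l_max + 1)
        log_p = l * np.log(lam) - lam - gammaln(l + 1)   # log(lambda^l e^{-lambda} / l!)
        p = np.exp(log_p)
        return l, p / p.sum()

    l, prior = truncated_poisson_pmf()
    print(prior[l == 8])   # prior mass on the true model order L = 8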

The prior distribution for the decay parameter is $p(\beta) = U[0, \beta_{\max}]$, with $\beta_{\max} = 1$. In all simulations, the realized channel length was L = 8 and the decay rate was β = 0.1.

The initial state of the Markov chain for each of the algorithms was $L^{(1)} = 3$, $h^{(1)}_{1:3} = [0.1\ 0.1\ 0.1]$ and $\beta^{(1)} = 1$. The standard deviation for the within-model proposal distribution was $\sigma_1 = 0.25$, the standard deviation for the between-model proposal distribution was $\sigma_2 = 0.1$ and the standard deviation for the proposal for β was $\sigma_\beta = 0.25$. In Algorithm 2 the pre-specified weights were set as $\pi_{1:L_{\max}} = \left[\frac{1}{L_{\max}}, \ldots, \frac{1}{L_{\max}}\right]$, $\hat{g}^{(1)}_{1:L_{\max}} = [1, \ldots, 1]$ and the gain factor was set as $\gamma_t = \frac{2 L_{\max}}{\max(2 L_{\max},\, t)}$.

In constructing the CPS proposals in Algorithm 3 the grid specifications were m = 50, n = 50 and w = 0.05. Keeping w constant meant that we used linear spacing, and we also attached Gaussian tails at the left and right end points of the grid as in Algorithm 18.

In the sensitivity studies, data were generated from the true model parameters and Algorithm 1 was run to produce a Markov chain of length 20k after discarding 5k samples as burn-in. These samples were then used to perform analysis of the posterior quantities of interest. In the convergence analysis, 100 independent data sets were generated from the true model using a seeded random generator. For each realization of the observation we ran each algorithm with Markov chains of length 100k and discarded the first 20k samples as burn-in, ensuring each algorithm was compared on the same realised data sets. Estimates were obtained by averaging over the posterior estimated quantities of interest for each realised data set.

7.10.2 Model Sensitivity Analysis

Here we study the sensitivity of the posterior distribution (7.22) to various quantities. The intention is to provide insight into the performance of the proposed model. This will allow us to separate the effects of model sensitivity from the effects of convergence rates of each of our proposed trans-dimensional sampling algorithms. In this sensitivity analysis Section we set SNR = 15 dB and the prior mean of the truncated Poisson model was λ = 8, the true model order.

We begin by analysing the sensitivity of the MAP estimation of the model order to the prior choice for the model order. Then we provide recommendations for specification of the prior on the model order. Secondly, we analyse the sensitivity of the MAP estimate for the model order L to the value of β used to generate the data set. Finally, we analyse the sensitivity of the marginal posterior distributions $p(\beta \mid y)$ and $Pr(L \mid y)$ as a function of the SNR level.

7.10.2.1 Sensitivity of Model Order to Prior Choice Pr(L)

In our analysis we selected a truncated Poisson as the prior for L. Therefore, we study the sensitivity of the MAP estimate of the model order as a function of the user-specified prior mean on the model order. The results of the MAP estimate of L versus λ are depicted in Figure 7.6. Here we choose $\lambda \in \{2, 4, 6, 8, 10, 12, 14, 16\}$. Figure 7.6 demonstrates the estimated marginal posterior distribution of the model order $Pr(L \mid y)$ as a function of the prior mean for the model order. We see that the marginal posterior distribution is only mildly sensitive to the prior mean λ parameterising the truncated Poisson distribution. However, in general we recommend that one should use an uninformative prior choice when performing model selection under our framework. Generally this allows the data to drive the model selection estimation.

7.10.2.2 Sensitivity of Model Order to the True Decay Rate β

In this analysis we perform model order selection for various values of β. To illustrate this we study the MAP estimate of L as a function of the true β used to generate the data, $\beta \in \{0, 0.1, 0.2, 0.6, 0.7, 0.8\}$. The results of this study are depicted in Figure 7.7. These results demonstrate the estimated posterior model probabilities $Pr(L \mid y)$ for a given realization of data. In each study we alter the true value of β used to generate the data. We also present the corresponding PDP for each value of β.

They confirm that as β increases, the tap coefficients become less uniform. In particular, the coefficients of the last few taps become statistically indistinguishable from 0. Hence, as β increases, for our fixed SNR, the difference between the power of each tap increases; in particular the first few taps $h_1, h_2, \ldots$ will dominate compared to $\ldots, h_{L-1}, h_L$.

[Figure 7.6: Sensitivity of the MAP estimate from $Pr(L \mid y)$ to the prior mean λ. Panels show the estimated posterior $Pr(L \mid y)$ against the model order L for λ = 2, 4, 6, 8, 10, 12, 14 and 16.]

This results in many of the channel coefficients at higher orders, i.e. $\ldots, h_{L-1}, h_L$, being indistinguishable from noise and ultimately, as demonstrated, results in the MAP estimate of the model order L being less than the true model order, which we set as L = 8.

This behavior is expected in this model. At low SNR levels this implies that, generally, the true MAP estimate for the posterior model probability for a given data realisation will be significantly lower than the L used to generate the data. This mismatch in model order disappears asymptotically with the data size conditioned upon. Hence, when we use our trans-dimensional sampling algorithms to estimate the posterior model probabilities and take the MAP estimate, these will also be lower than the L used to generate the data.

The impact of this on symbol detection will also be demonstrated in Section 7.10.4. We expect only minor impact on the BER for a given SNR when not detecting taps with very low power relative to the noise level.

7.10.2.3 Analysis of Posterior Precision for the Marginals $p(\beta \mid y)$ and $Pr(L \mid y)$

In Figure 7.8 we assess how the SNR level affects the marginal posterior distribution for β. In particular we demonstrate that at low SNR levels, the distribution of β that results from our model is left skewed and heavy tailed.

[Figure 7.7: Sensitivity of the MAP estimate from $Pr(L \mid y)$ to β. Upper panels: estimated posterior $Pr(L \mid y)$ against L for β = 0, 0.1, 0.2, 0.6, 0.7 and 0.8; lower panels: the corresponding PDP ($\sigma_h^2$ versus tap index) for each β.]

However, the resulting MAP and MMSE estimates are still reasonably accurate. Then, as the SNR increases, the distribution of β becomes more symmetric and centered on the true β value used to generate the data. In Figure 7.9 we also present the marginal distributions of L as a function of SNR. Clearly, we also see that as the SNR increases, the precision of the marginal posterior $Pr(L \mid y)$ increases. The mode of this distribution shifts towards the true value of L used to generate the data, demonstrating the convergence rate as a function of SNR and that no bias is present. In summary, these results show that as the SNR increases, the ability to distinguish taps with low power from noise increases. Hence, this corresponds to a more precise distribution for the marginal posterior $p(\beta \mid y)$ and also results in the posterior model probabilities $Pr(L \mid y)$ producing a MAP estimate for L which corresponds to the true L used to generate the data. Hence these results provide an indication of the rate at which the posterior distribution's precision changes as a function of SNR.

7.10.2.4 Estimated Pairwise Marginal Posterior Distributions of $p(h_i, h_j \mid L_{MAP}, y)$ and $p(h_i, \beta \mid L_{MAP}, y)$

In Figure 7.10 we see the estimated joint pairwise marginal distributions of $p(h_i, h_j \mid L_{MAP}, y)$ and $p(h_i, \beta \mid L_{MAP}, y)$. These demonstrate that no correlation is present between pairs of variables within the posterior distribution. This shows that the rate of convergence of our Markov chain sampler, in our case a Gibbs sampler, will be unaffected by correlation between parameters in the posterior.

[Figure 7.8: Sensitivity of the MAP estimate from $Pr(L \mid y)$ to β; estimated distribution of β ($\beta_{MAP}$ and $\beta_{MMSE}$) versus SNR [dB].]

7.10.3 Comparative Performance of Algorithms

The focus of this Section will be to compare numerically the performance of each of the proposed algorithms. We generated synthetic data in which we know the true model order L = 8. Additionally, we selected SNR = 20 and the true β = 0.1, as we found that this resulted in a very high posterior model probability for $Pr(L = 8 \mid y)$, which simplifies the convergence rate analysis. The advantage of this simulation setup is that we can now assess the convergence of the different samplers with respect to time. This was achieved by plotting the convergence of the MSE in the estimated marginal posterior probability $Pr(L^{(t)} = 8 \mid y)$ as a function of the simulation time of each of the algorithms. Averaging these results over the 100 independent realisations allows us to compare the performance of each algorithm in terms of mixing between different models. In obtaining the MSE estimates, we took the true posterior model probability $Pr(L = 8 \mid y)$ as the posterior model probability after running the BD-TDMCMC algorithm for $10^6$ iterations.
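The convergence comparison just described can be computed along the following lines (a sketch, assuming a vector of sampled model orders from one sampler run; the reference probability stands in for the value obtained from the $10^6$-iteration BD-TDMCMC run, and in the study the resulting curves would be averaged over the 100 realisations).

    import numpy as np

    def running_mse(model_chain, l_target=8, p_ref=0.9):
        # Squared error between the running estimate of Pr(L = l_target | y)
        # and a reference value, as a function of simulation time.
        hits = (model_chain == l_target).astype(float)
        running_est = np.cumsum(hits) / np.arange(1, len(hits) + 1)
        return (running_est - p_ref) ** 2

    # Illustrative chain of sampled model orders from one sampler run.
    rng = np.random.default_rng(0)
    chain = rng.choice([7, 8, 9], size=10_000, p=[0.05, 0.9, 0.05])
    mse_t = running_mse(chain)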

In this study, for the BD Gaussian proposal when sampling two new components $(h_L, h_{2L})$, we simulated using two different values for the variance, $\sigma_{BD} = 0.05$ and $\sigma_{BD} = 0.2$. These were selected arbitrarily, as the simple BD-TDMCMC approach does not provide a method to determine an optimal parameter for the proposal other than via off-line tuning.

[Figure 7.9: Marginal distribution of L, $Pr(L \mid y)$, versus SNR; panels show $Pr(L \mid y)$ against L for SNR = 0, 5, 10, 15 and 20 dB.]

For comparison purposes we did not perform such tuning in this example. In the SA-TDMCMC sampler we used the same proposal as the BD-TDMCMC algorithm and included the stochastic approximation stage. Finally, for the CPS-TDMCMC algorithm we selected the adaptive grid centered on the average estimate of the posterior mode.

In Figure 7.11 we present a comparison between the algorithms (BD-TDMCMC, SA-TDMCMC, CPS-TDMCMC) for the average MSE of the marginal posterior model probability $Pr(L = 8 \mid y)$, averaged over each realised observation set. The plots present the average MSE between the estimated posterior model probability of L = 8 at simulation time t and the true posterior model probability. Note that the CPS proposal performs best in terms of the MSE criterion; however, the SA-TDMCMC algorithm also performs well. We also present the distributions of the posterior model probabilities for these simulations. Clearly these results agree with the average MSE results: the CPS proposal provides the best performance in terms of convergence of the posterior model probabilities in this analysis. However, the computational cost of this approach is significantly higher than the simple BD-TDMCMC or the SA-TDMCMC. When deciding which algorithm to recommend, we point out that the advantage of the CPS-TDMCMC algorithm is that it is largely automated. One only needs to select the number of grid points to include in the approximation of the optimal proposal.

[Figure 7.10: Pairwise marginal posterior distributions for $p(h_i, h_j \mid y)$.]

In contrast, the SA-TDMCMC and BD-TDMCMC require specific choices to be made with respect to properties of the proposal distribution. These will impact the performance and should really be tuned off-line. Depending on the amount of tuning required, the simulation using these approaches could cost as much as the CPS-TDMCMC proposal constructions. In our simulations the most computationally efficient algorithm was BD-TDMCMC, which was 20% more efficient in terms of computer time than the SA-TDMCMC algorithm and 44% more efficient than the CPS-TDMCMC algorithm.

Our analysis suggests that we proceed with use of the CPS algorithm. It had a marginally better convergence rate in this study than the SA-TDMCMC algorithm and is an automated TDMCMC algorithm, requiring only simple, interpretable user choices in terms of the number of grid points to use in constructing the proposal. Having selected the CPS-TDMCMC algorithm, we perform detailed studies of BER versus SNR performance.

7.10.4 Algorithm Performance

In this Section we evaluate both the channel estimation MSE and the BER versus SNR performance of the CPS-TDMCMC algorithm. The frame length was 128 OFDM symbols. At the beginning of each frame, one OFDM symbol was composed of known symbols for the purpose of channel estimation using Algorithm 3.

[Figure 7.11: Average MSE of the marginal posterior model probability $Pr(L = 8 \mid y)$ versus simulation time (number of iterations of the Markov chain), averaged over 100 realisations, for BD-TDMCMC (σ = 0.05, 0.2), SA-TDMCMC (σ = 0.05, 0.2) and CPS-TDMCMC; side panels show the distribution of the posterior model probabilities over the 100 realisations.]

We first evaluate the channel estimation MSE performance of the proposed algorithm in comparison to the bounds presented in Section 7.9. These results are depicted in Figure 7.12. The results demonstrate the following key points:

• As SNR increases, the numerical estimate of the MSE of $\hat{h}_{MMSE}$ obtained using Algorithm 3 converges to the BCRLB. This confirms our previous findings in the sensitivity studies, in that for low SNR the estimated L does not correspond to the true L. As a result, the MSE of the estimator of h is above the BCRLB. However, it is important to point out the following two things:

1. For low SNR values the MSE of the channel estimate is still close to the BCRLB, and converges to it as SNR increases.
2. In the region of SNR values for which the MSE is above the BCRLB, this does not adversely affect the BER performance, as we demonstrate below.

• For all SNRs the numerical estimate of the MSE of $\hat{h}_{MMSE}$ obtained using Algorithm 3 is always significantly below the saturated model bound, $B_{|L_{\max},\beta}$. This is not surprising, since as long as the estimates of β and L are accurate, the saturated model MSE, given by $B_{|L_{\max},\beta}$, should upper bound the MSE of $\hat{h}_{MMSE \mid \beta_{MAP}, L_{MAP}}$.

[Figure 7.12: MSE performance for the OFDM system using the CPS-TDMCMC algorithm with K = 64, L = 8, β = 0.1, in comparison to several bounds: BCRLB, $B_{|L_{\max},\beta}$, $B_{|\beta}$ and $B_{|L}$; MSE versus SNR [dB].]

• The bound $B_{|\beta}$ involves BMA. It is interesting to note that for low SNR the posterior model probabilities, which in this setting we can obtain exactly for each realisation of the data, favour underestimation of L, as we demonstrated. As a result, when calculating this bound for low SNR values we obtain a bound that is not tight relative to the BCRLB, since most of the posterior model probability is assigned to lower model orders.

• When considering $B_{|L}$, we use the estimator $\mathrm{Tr}\{\hat{J}_{|L}\}$ given by (7.61). Comparing the estimate of this bound to the BCRLB, we see that, as expected, it lies between the BCRLB and the MSE of $\hat{h}_{MMSE}$ obtained using Algorithm 3.

Next we evaluate the BER performance of the CPS-TDMCMC algorithm. As a reference, we compared the BER results of the proposed algorithm with two lower bounds:

• MMSE perfect: Here the channel length L and decay parameter β are perfectly known at the receiver, and the channel can be estimated using (7.14). This serves as a lower bound for the proposed algorithm.

• CSI: Here the CIR h is perfectly known at the receiver. This serves as a loose lower bound on the BER of the system.

[Figure 7.13: BER performance of the CPS-TDMCMC algorithm with N = 64, L = 8, β = 0.1; bit error rate versus SNR [dB] for MAP MCMC, MMSE MCMC, MMSE perfect and CSI.]

The detection method for all schemes is based on transforming the estimated channel $\hat{h}$ to the frequency domain and performing one-tap MMSE equalization. The simulation results are depicted in Figure 7.13. These results show that both the MAP and MMSE estimates of the proposed algorithm operate close to MMSE perfect. This demonstrates close to optimal performance of our chosen algorithm in estimating the unknown model parameters, which are then used in the detection scheme, resulting in BER performance close to the lower bound.
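A one-tap MMSE equalizer of the kind used here can be sketched as follows (Python; the symbol energy normalisation and variable names are assumptions for illustration).

    import numpy as np

    def one_tap_mmse_equalize(y_freq, h_time, k, sigma2, es=1.0):
        # Transform the estimated CIR to the frequency domain and apply a
        # per-subcarrier (one-tap) MMSE equalizer.
        h_freq = np.fft.fft(h_time, n=k)            # channel frequency response
        # MMSE one-tap coefficients: conj(H) / (|H|^2 + sigma^2 / Es)
        w = np.conj(h_freq) / (np.abs(h_freq) ** 2 + sigma2 / es)
        return w * y_freq                           # equalized subcarrier symbols

    # Example with K = 64 subcarriers, QPSK symbols and an 8-tap channel.
    K = 64
    rng = np.random.default_rng(1)
    h_hat = (rng.normal(size=8) + 1j * rng.normal(size=8)) / np.sqrt(16)
    s = (2 * rng.integers(0, 2, K) - 1 + 1j * (2 * rng.integers(0, 2, K) - 1)) / np.sqrt(2)
    y = np.fft.fft(h_hat, K) * s + 0.05 * (rng.normal(size=K) + 1j * rng.normal(size=K))
    s_eq = one_tap_mmse_equalize(y, h_hat, K, sigma2=0.005)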

7.11 Chapter Summary and Conclusions

In this chapter we presented novel algorithms for channel estimation in OFDM systems. First, we considered the problem of channel estimation with an unknown number of taps and known PDP value. Using a Bayesian approach we derived the BMA and BMOS estimators of the CIR. Next we considered the problem of CIR estimation where both the length of the channel and the PDP are unknown. We constructed a Bayesian model and then developed three novel TDMCMC algorithms to estimate quantities of the posterior distribution, such as MMSE and MAP estimates of parameters. We then performed various analyses and assessed the performance of the algorithms we developed. Finally, we performed analysis of the channel estimation MSE and of the BER.

7.12 Appendix

In this Appendix, we derive the expression for the gradient of the full log posterior with respect to a generic element $h_i \in h_{1:2L}$. Since we wish to work only with real components, we decompose the model (7.13) as follows:

\[
\tilde{y} = \tilde{A} h^T + \tilde{z}, \tag{7.64}
\]

where

\[
\tilde{A} \triangleq \begin{bmatrix} \mathrm{Re}\{A\} & -\mathrm{Im}\{A\} \\ \mathrm{Im}\{A\} & \mathrm{Re}\{A\} \end{bmatrix}, \tag{7.65a}
\]
\[
\tilde{y} \triangleq \begin{bmatrix} \mathrm{Re}\{y\} \\ \mathrm{Im}\{y\} \end{bmatrix}, \tag{7.65b}
\]
\[
\tilde{z} \triangleq \begin{bmatrix} \mathrm{Re}\{z\} \\ \mathrm{Im}\{z\} \end{bmatrix}, \tag{7.65c}
\]

where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ are the real and imaginary parts of $(\cdot)$, respectively, and $A \triangleq DW_L$. The gradient of the full log posterior with respect to a generic element $h_i \in h_{1:2L}$ can be written as:

\[
\frac{\partial}{\partial h_i} \log p\left(h_{1:2L}, \beta, L \mid y\right) = -\frac{1}{\sigma_z^2}\left(-2\tilde{y}^T \tilde{A}_{:,i} + \tilde{A}_{i,:}\tilde{A}h^T + h\tilde{A}^T\tilde{A}_{:,i}\right) - \frac{2}{\sigma_{h_i}^2}h_i, \tag{7.66}
\]

where $\tilde{A}_{:,i}$ corresponds to the $i$-th column of $\tilde{A}$.

Next we equate (7.66) to zero and solve for $h_i$:

\[
-\frac{2h_i}{\sigma_{h_i}^2} + \frac{2}{\sigma_z^2}\tilde{y}^T\tilde{A}_{:,i} - \frac{1}{\sigma_z^2}\tilde{A}_{i,:}\tilde{A}h^T - \frac{1}{\sigma_z^2}h\tilde{A}^T\tilde{A}_{:,i} = 0. \tag{7.67}
\]

Defining $h_{i=0}$ as h with the element in the i-th location set to zero, and $e_i$ as a column indicator vector whose elements are all zero except for the i-th element, which is one, we can rewrite (7.67) as

\[
-\frac{2h_i}{\sigma_{h_i}^2} + \frac{2}{\sigma_z^2}\tilde{y}^T\tilde{A}_{:,i} - \frac{1}{\sigma_z^2}\tilde{A}_{i,:}\tilde{A}\left(h_{i=0} + h_i e_i\right)^T - \frac{1}{\sigma_z^2}\left(h_{i=0} + h_i e_i\right)\tilde{A}^T\tilde{A}_{:,i} = 0. \tag{7.68}
\]

Rearranging (7.68) we obtain the expression

\[
h_i = \frac{2\tilde{y}^T\tilde{A}_{:,i} - \tilde{A}_{i,:}\tilde{A}h_{i=0}^T - h_{i=0}\tilde{A}^T\tilde{A}_{:,i}}{\frac{2\sigma_z^2}{\sigma_{h_i}^2} + \tilde{A}_{i,:}\tilde{A}e_i + e_i^T\tilde{A}^T\tilde{A}_{:,i}}. \tag{7.69}
\]
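Numerically, the mode in (7.69) can be evaluated as in the sketch below (Python; A and y stand for the real-valued stacked quantities $\tilde{A}$ and $\tilde{y}$ of (7.65), and the two symmetric cross terms of (7.69) have been collected into one, which leaves the value unchanged).

    import numpy as np

    def mode_h_i(A, y, h, i, sigma2_z, sigma2_hi):
        # Closed-form mode of the conditional posterior for one real
        # component h_i, per (7.69) with symmetric terms collected.
        # h_{i=0} is h with its i-th entry set to zero.
        h0 = h.copy()
        h0[i] = 0.0
        a_col = A[:, i]                            # i-th column of A~
        numer = a_col @ y - a_col @ (A @ h0)
        denom = sigma2_z / sigma2_hi + a_col @ a_col
        return numer / denom

    # Tiny example with placeholder real-valued stacked quantities.
    rng = np.random.default_rng(2)
    A = rng.normal(size=(10, 4))    # stands in for A~ (2K x 2L)
    y = rng.normal(size=10)         # stands in for y~
    h = rng.normal(size=4)          # current state of h_{1:2L}
    print(mode_h_i(A, y, h, i=1, sigma2_z=0.1, sigma2_hi=1.0))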

Adaptive Grid Placement Centred on Estimated Posterior Mode

Constructing the CPS proposal in a computationally efficient manner involves concentrating grid points in the support regions in which the posterior has most mass, i.e. around the posterior mode. Estimating the mode of $p(h_i \mid y, h_{/i}, \beta, L)$ follows from the expression above; the grid can then be concentrated around this location.

Chapter 8

Bayesian Symbol Detection in Wireless Relay Networks Using “Likelihood Free” Inference

“You do not really understand something unless you can explain it to your grandmother.”

Albert Einstein

8.1 Introduction

This chapter deals with detection of the transmitted sequence of symbols in cooperative wireless relay networks. We consider a system with multiple relay nodes operating under arbitrary relay processing functions. In addition, we consider a general stochastic model in which imperfect knowledge of the Channel State Information (CSI) at the destination node is assumed. In general for such a system both the ML decision rule and the MAP decision rule do not admit closed form expressions. This is due to the intractability of the likelihood function, which cannot be obtained analytically or evaluated point-wise for non-linear relay functions. Using a likelihood-free Bayesian methodology, we derive an approximate MAP sequence detection scheme for arbitrary relays.

We also develop two novel alternative solutions: an Auxiliary Variable (AV)-Markov Chain Monte Carlo (MCMC) approach; and a suboptimal explicit search zero forcing approach. In the first instance, the addition of auxiliary variables results in closed form expressions for the full conditional posterior distributions. We use this fact to develop an efficient sampler and demonstrate that this works well when small numbers of relay nodes are considered. In the second instance, the detection scheme involves an approximation based on known summary statistics

and an explicit exhaustive search over the parameter space of code words. This performs well for relatively small numbers of relays and a high SNR.

We assess some aspects of the proposed methodology using several simulation studies and compare the proposed algorithms in terms of Symbol Error Rate (SER).

The main contributions presented in this chapter are as follows:

1. We develop a novel MCMC-ABC based detection algorithm. We derive a novel sampling methodology to demonstrate how a statistical concept can solve a problem which would otherwise have been intractable.

2. We introduce a novel extension to the MCMC-ABC sampler by utilising a Soft Decision (SD) weighting function to improve the mixing rate of the Markov chain.

3. We develop a novel deterministic annealing schedule for the tolerance level ǫ.

4. We derive an alternative novel algorithm based on an auxiliary variable model for the joint problem of coherent detection and channel estimation for arbitrary relay functions. We contrast the performance of this approach with the performance of the MCMC-ABC based algorithm.

5. We present a novel Sub-Optimal Exhaustive Search - Zero Forcing (SES-ZF)-MAP detection scheme. This is based on known summary statistics of the channel model, on the mean of the noise at the relay nodes, and using an explicit exhaustive search over the parameter space of code words. We shall demonstrate its performance, whilst acknowledging its flaws.

6. We perform convergence diagnostics for the MCMC-ABC sampler.

8.1.1 Relay Communications

Cooperative communications systems have become a key focus for communications engineers as a result of the pioneering works of [143] and [241]. In a wireless communication system in which co-operation is present, users are able to harness spatial resources to gain diversity, enhance connection capability and throughput.

The relay channel, first introduced by van der Meulen [141] has recently received considerable attention due to its potential in wireless applications. The relaying techniques have the potential to provide spatial diversity, improve energy efficiency, and reduce the interference level of wireless channels, see [242].

There are a number of issues to be considered when designing a relay network; the more important of these include: the topology of the relay network; the number of hops in the relay; the number of relays to include in the network; and the type of relaying function to incorporate. We will focus on single hop relay design in which the number of relays present can be general and the type of relaying function can be general. However, our methodology extends to more sophisticated relay topologies and multiple hop networks.

A number of relay strategies have been studied in the literature. Initially Cover and El-Gamal [9] introduced two relaying strategies commonly referred to as decode-and-forward and estimate-and-forward, which have subsequently been developed and extended. The amplify-and-forward (AF) strategy found in [142] includes an approach in which the relay sends a scaled version of its received signal to the destination. The AF scheme is attractive because of its simple operation at the relay nodes.

Other strategies include: demodulate-and-forward [142], in which the relay demodulates individual symbols and retransmits; decode-and-forward [143], in which the relay decodes the entire message, re-encodes it and re-transmits it to the destination; and compress-and-forward [9], in which the relay sends a quantized version of its received signal. With cooperative communications, the design of the encoder and decoder at the source and destination is accompanied by the design of the functionality of the relay nodes. The encoder receives the transmitted data and introduces carefully designed structure to a data sequence. The intention is to protect it from transmission errors [243]. This added redundancy enables the decoder at the destination node to detect and possibly correct errors caused by corruption from the channel. There are several properties of the relay function that can be designed in order to optimise for particular criteria of the communication system. Some important criteria and the associated relay function design can be found in the studies of [144], [145], who considered maximising the capacity of the system, or the work in [146], which considers an SNR criterion.

From a practical perspective, when engineering an actual relay communications network the most desirable schemes are those that achieve the optimality criteria with minimal processing complexity at the relays. There are many reasons for minimising complexity in the design of the system, related to considerations such as: the location of relays when maintenance is required, and the cost of construction of such relay networks. Simpler relay networks should in theory require less maintenance and may be more robust. In addition there are questions related to quality of service, where more complicated relay functionality may result in longer delays in processing of transmitted signals. Hence, we focus on what are known as memoryless relay functions. These are highly relevant to meet the design objectives briefly mentioned, due to their simplicity.

The system model under consideration is presented in Figure 8.1. We consider the setting in which the channels present in the relay network are treated as stochastic models. We do not know a priori the actual realised channel coefficient values. Instead, we consider partial CSI where we assume we know the distributional statistics of the channel coefficient model.

[Figure 8.1: Two-hop relay system with L relay nodes; the source transmits over channels $h_1, \ldots, h_L$ to the relays (with noise $n_1, \ldots, n_L$), and the relays transmit over channels $g_1, \ldots, g_L$ to the destination (with noise $v_1, \ldots, v_L$).]

In addition, no knowledge in the form of pilot symbols or training sequences is transmitted as part of the transmitted signals. In such a setting we would like to consider the joint problem of channel estimation and sequence detection of the unknown sequence of transmitted symbols. However, we will treat the channel coefficients as nuisance parameters and will numerically integrate them out of our problem formulation in order to concentrate on the critical question of detection. We will focus on M-ary Pulse Amplitude Modulated (PAM) sequences of symbols. In this setting transmitted symbols in the sequence are taken from an alphabet of M possible code words which are points non-uniformly distributed on the complex plane, each point being labeled with a binary sequence of bits. We note that our methodology is general and extends naturally to other constellation and encoding frameworks.

Detection of the transmitted symbols is a fundamental task of the receiver node. This poses a challenge since, as we demonstrate in Section 8.1.2, in general the likelihood model for a general non-linear relay system cannot be evaluated analytically, and hence is intractable from this perspective. As such, the ML and, in a Bayesian setting, MAP receivers cannot be obtained analytically. Only in special cases, such as linear relay functions (e.g. [244]), can the coherent AF relay detector be designed analytically. Under this special case the ML receiver is a linear operation on the received signals, and thus trivial to evaluate. In general for non-linear relay functions this will not be possible.

In particular, our main objective is to focus on a MAP sequence detection scheme with imperfect CSI at the receiver node, for arbitrary memoryless relay functions. Particularly relevant will be non-linear relay functions. We present, compare and make recommendations about three novel alternative approaches to solve this problem.

In the setting of detection and channel estimation one can define a suboptimal solution to the MAP detector. This is possible even in the setting in which the likelihood is intractable. This naive and highly computationally complex algorithm is a SES-ZF solution. A ZF solution is popular in simple system models as it is often efficient to calculate and performs well. In this system we will explain and then demonstrate in simulations how such a simple approach fails, hence justifying the need for the more sophisticated statistical approaches we develop.

Under a SES-ZF approach one must condition on the partial channel state information and then perform an explicit search over the set of all possible symbol sequences. We define the SES-ZF solution for MAP sequence detection as the solution which conditions on the mean of the noise at the relay nodes, and also uses the noisy channel estimates.
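A minimal sketch of this SES-ZF search (Python, assuming for simplicity the identity relay function and hypothetical variable names) makes the exhaustive nature of the detector explicit; note the loop runs over all $M^K$ code words, which is what drives the exponential cost discussed below.

    import itertools
    import numpy as np

    def ses_zf_detect(y, h_hat, g_hat, constellation, k):
        # SES-ZF MAP detection: condition on the channel estimates and on
        # the mean (zero) of the relay noise, then exhaustively search all
        # M^K code words for the one minimising the Gaussian discrepancy.
        best, best_cost = None, np.inf
        for cand in itertools.product(constellation, repeat=k):
            s = np.asarray(cand)
            # Predicted receive signal per relay with relay noise at its
            # mean; identity relay function f(a) = a is assumed here.
            cost = sum(np.sum(np.abs(y[l] - s * h_hat[l] * g_hat[l]) ** 2)
                       for l in range(len(h_hat)))
            if cost < best_cost:
                best, best_cost = s, cost
        return best

    # 2-PAM alphabet, K = 4 symbols, L = 2 relays (placeholder values).
    rng = np.random.default_rng(3)
    h_hat, g_hat = rng.normal(size=2), rng.normal(size=2)
    s_true = rng.choice([-1.0, 1.0], size=4)
    y = [s_true * h_hat[l] * g_hat[l] + 0.05 * rng.normal(size=4) for l in range(2)]
    print(ses_zf_detect(y, h_hat, g_hat, [-1.0, 1.0], k=4))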

Under these conditional assumptions, the SES-ZF solution becomes trivial, since the likelihood of the detected symbols takes a closed form and hence the MAP detector admits a closed form analytic solution. However, we note the following drawbacks of this solution. The first serious drawback is that for low SNR values this can be highly sub-optimal, though for high SNR values it becomes close to optimal.

The second major flaw with the SES-ZF solution in this context of MAP sequence detection is that this approach involves an exponentially increasing computational cost. In particular, the required computational cost of the algorithm grows with the number of relays, the cardinality of the constellation and the length of the data sequence.

Hence, even though the SES-ZF-MAP detector we have defined provides an explicit solution, typically the computational cost makes it infeasible to perform. In order to obviate this problem, we present a novel approach that is based on ABC methodology, or likelihood-free inference [67], to perform the MAP sequence detection.

Notation. Throughout this chapter we adopt the following notation in order to explicitly distinguish between a random variable (r.v.) and its realisation: boldface upper case letters denote random vectors and boldface lower case letters denote realisations of random vectors. Standard upper case letters denote random variables and standard lower case letters denote realisations of a random variable or a constant. We define the following notation that shall be used throughout: $g = g_{1:L} = \left(g^{(1)}, \ldots, g^{(L)}\right)$; $g^{[t]}$ refers to the state of the Markov chain at iteration t; $g_i^{[t]}$ refers to the i-th element of $g^{[t]}$.

8.1.2 Model and Assumptions

Here we present the system model and associated assumptions.

1. Assume a wireless relay network with one source node, transmitting sequences of K symbols, denoted $s = s_{1:K}$.

2. The sequence of K symbols s is drawn from an M-ary pulse amplitude modulation (PAM) scheme.

3. The relays cannot transmit and receive on the same time slot on the same frequency band. We thus consider a half duplex system model in which the data transmission is divided into two steps. In the first step, the source node broadcasts a code word $s \in \Omega$ from the codebook to all the relay nodes. In the second step, the relay nodes then transmit the relay signals to the destination node in orthogonal non-interfering channels. We assume that all channels are independent with a coherence interval larger than the codeword length K.

4. Assume imperfect CSI in which noisy estimates of the channel model coefficients for each relay link are known. This is a standard assumption based on the fact that a training phase has been performed a priori. This involves an assumption regarding the channel coefficients as follows:

• Source to relay: there are L i.i.d. channels parameterized by $\left\{H^{(l)} \sim F\left(\hat{h}^{(l)}, \sigma_h^2\right)\right\}_{l=1}^L$, where $F(\cdot)$ is the distribution of the channel coefficients, $\hat{h}^{(l)}$ is the estimated channel coefficient and $\sigma_h^2$ is the associated estimation error variance.

• Relay to destination: there are L i.i.d. channels parameterized by $\left\{G^{(l)} \sim F\left(\hat{g}^{(l)}, \sigma_g^2\right)\right\}_{l=1}^L$, where $F(\cdot)$ is the distribution of the channel coefficients, $\hat{g}^{(l)}$ is the estimated channel coefficient and $\sigma_g^2$ is the associated estimation error variance.

5. The received signal at the l-th relay is a random vector given by

\[
R^{(l)} = S H^{(l)} + W^{(l)}, \quad l \in \{1, \ldots, L\}, \tag{8.1}
\]

where $H^{(l)}$ is the channel coefficient between the transmitter and the l-th relay, $S \in \Omega_M$ is the transmitted code word and $W^{(l)}$ is the noise realization associated with the relay receiver.

6. The received signal at the destination is a random vector given by

\[
Y^{(l)} = f^{(l)}\left(R^{(l)}\right) G^{(l)} + V^{(l)}, \quad l \in \{1, \ldots, L\}, \tag{8.2}
\]

where $G^{(l)}$ is the channel coefficient between the l-th relay and the receiver, $f^{(l)}\left(r^{(l)}\right) \triangleq \left[f^{(l)}\left(r^{(l)}_1\right), \ldots, f^{(l)}\left(r^{(l)}_K\right)\right]^\top$ is the memoryless relay processing function (with possibly different functions at each of the relays) and $V^{(l)}$ is the noise realization associated with the destination receiver.

7. Conditional on $\left\{h^{(l)}, g^{(l)}\right\}_{l=1}^L$, all received signals are corrupted by zero-mean additive white complex Gaussian noise. At the l-th relay the noise corresponding to the i-th transmitted symbol is denoted by the random variable $W_i^{(l)} \sim \mathcal{CN}\left(0, \sigma_w^2\right)$. At the receiver this is denoted by the random variable $V_i^{(l)} \sim \mathcal{CN}\left(0, \sigma_v^2\right)$. Additionally, we assume the following properties:

\[
E\left\{W_i^{(l)} \overline{W}_j^{(m)}\right\} = E\left\{V_i^{(l)} \overline{V}_j^{(m)}\right\} = E\left\{W_i^{(l)} \overline{V}_j^{(m)}\right\} = 0,
\]

$\forall i, j \in \{1, \ldots, K\}$, $\forall l, m \in \{1, \ldots, L\}$, $i \neq j$, $l \neq m$, where $\overline{W}_j$ denotes the complex conjugate of $W_j$.

8.1.2.1 Prior Specification and Posterior

Here we present the relevant aspects of the Bayesian model and associated assumptions. We begin by specifying the prior models for the sequence of symbols and the unknown channel coefficients. At this stage we note that in terms of system capacity it is only beneficial to transmit a sequence of symbols if it aids detection. This is achieved by having correlation in the transmitted symbol sequence s1:K . We assume that since this is part of the system design, the prior structure for P r (s1:K ) will be known and reflect this information.

1. Under the Bayesian model, the symbol sequence is treated as a random vector S = S1:K .

The prior for the random symbol sequence (code word) $S_{1:K}$ is defined on a discrete support denoted $\Omega_M$ with $|\Omega_M| = M^K$ elements and probability mass function (pmf) denoted by $Pr(s_{1:K})$.

2. The assumption of imperfect CSI is treated under a Bayesian paradigm by formulating priors for the channel coefficients as follows:

• Source to relay: there are L i.i.d. channels parameterized by $\left\{H^{(l)} \sim \mathcal{CN}\left(\hat{h}^{(l)}, \sigma_h^2\right)\right\}_{l=1}^L$, where $\hat{h}^{(l)}$ is the estimated channel coefficient and $\sigma_h^2$ the associated estimation error variance.

• Relay to destination: there are L i.i.d. channels parameterized by $\left\{G^{(l)} \sim \mathcal{CN}\left(\hat{g}^{(l)}, \sigma_g^2\right)\right\}_{l=1}^L$, where $\hat{g}^{(l)}$ is the estimated channel coefficient and $\sigma_g^2$ the associated estimation error variance.

8.1.2.2 Evaluation of the Likelihood Function

The likelihood model $p\left(y^{(l)} \mid s, h^{(l)}, g^{(l)}\right)$ for this relay system is in general computationally intractable. There are two potential difficulties that arise when dealing with non-linear relay functions. The first relates to finding the distribution of the signal transmitted from each relay to the destination. This involves finding the density of the random vector $f^{(l)}\left(SH^{(l)} + W^{(l)}\right) G^{(l)}$ conditional on the realizations S = s, H = h, G = g. This is not always possible for a general non-linear multivariate function $f^{(l)}$. Conditional on S = s, H = h, G = g, we know the distribution of $R^{(l)} \mid s, g, h$:

\[
p_R\left(r^{(l)} \mid s, g, h\right) = p\left(s h^{(l)} + w^{(l)} \mid s, h^{(l)}, g^{(l)}\right) = \mathcal{CN}\left(s h^{(l)}, \sigma_w^2 I\right). \tag{8.3}
\]

However, finding the distribution of the random vector after the non-linear function is applied, i.e. the distribution of $\tilde{f}\left(R^{(l)}\right) \triangleq f\left(R^{(l)}\right) G^{(l)}$ given $s, h^{(l)}, g^{(l)}$, involves the following change of variables formula:

\[
p\left(\tilde{f}\left(r^{(l)}\right) \mid s, h^{(l)}, g^{(l)}\right) = p_R\left(\tilde{f}^{-1}\left(r^{(l)}\right) \mid s, h^{(l)}, g^{(l)}\right) \left|\frac{\partial \tilde{f}^{-1}}{\partial r^{(l)}}\right|, \tag{8.4}
\]

which cannot always be written down analytically for an arbitrary $\tilde{f}$. The second, more serious, complication is that even in cases where the density of the transmitted signal is known, one must then solve a K-fold convolution to obtain the likelihood:

\[
p\left(y^{(l)} \mid s, g, h\right) = p\left(\tilde{f}\left(r^{(l)}\right) \mid s, g, h\right) * p_{V^{(l)}}
= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p\left(\tilde{f}(z) \mid s, g, h\right) p_{V^{(l)}}(y - z)\, dz_1 \ldots dz_K. \tag{8.5}
\]

Typically this will be intractable to evaluate pointwise. However, in the most simplistic case of a relay function, namely $f^{(l)}(\alpha) = \alpha$, the likelihood can be obtained analytically as

\[
p\left(y^{(l)} \mid s, h^{(l)}, g^{(l)}\right) = \mathcal{CN}\left(s h^{(l)} g^{(l)}, \left(\left|g^{(l)}\right|^2 \sigma_w^2 + \sigma_v^2\right) I\right), \tag{8.6}
\]

where I is the identity matrix.

Hence, the resulting posterior distribution involves combining the likelihood given in (8.5) with the priors for the sequence of symbols and the channel coefficients.

8.1.3 Inference and MAP Sequence Detection

Since our primary concern is on SER versus SNR, our goal is oriented towards detection of the transmitted symbols and not the associated problem of channel estimation. We will focus on an approach which samples $S_{1:K}$, $G = G_{1:L}$, $H = H_{1:L}$ jointly from the target posterior distribution.

In particular we consider the MAP sequence detector at the destination node. Therefore the goal is to design a MAP detection scheme for s at the destination, based on the received signals $\left\{y^{(l)}\right\}_{l=1}^L$, the noisy channel estimates given by the partial CSI $\left\{\hat{h}^{(l)}\right\}_{l=1}^L$, $\left\{\hat{g}^{(l)}\right\}_{l=1}^L$, and the noise variances $\sigma_w^2$ and $\sigma_v^2$.

Since the channels are mutually independent, the received signals $\left\{r^{(l)}\right\}_{l=1}^L$ and $\left\{y^{(l)}\right\}_{l=1}^L$ are conditionally independent given s, g, h. Thus, the MAP decision rule, after marginalizing out the unknown channel coefficients, is given by

\[
\hat{s} = \arg\max_{s \in \Omega} \iint \prod_{l=1}^{L} p\left(s, g, h \mid y^{(l)}\right) dh\, dg
= \arg\max_{s \in \Omega} \iint \prod_{l=1}^{L} p\left(y^{(l)} \mid s, h, g\right) Pr(s)\, p(g)\, p(h)\, dh\, dg. \tag{8.7}
\]

The intractability of the likelihood model in (8.5) results in our development of likelihood-free based methodology and the associated MCMC-ABC sampler. Two alternative approaches for MAP detection, based on auxiliary variables and zero forcing methods, will also be considered.

8.2 Likelihood-Free Methodology

Likelihood-free inference describes a suite of methods developed specifically for working with models in which the likelihood is computationally intractable. Here we work with a Bayesian model and consider the likelihood intractability to arise in the sense that we may not evaluate the likelihood pointwise (Section 8.1.3). Additionally, we can only obtain a general expression for the likelihood in terms of a multivariate convolution integral and hence we do not have an explicit closed form expression for the likelihood.

It is shown in [71] that the ABC method we consider here embeds an intractable target posterior distribution, in our case denoted by $p(s_{1:K}, h_{1:L}, g_{1:L} \mid y)$, into a general augmented model

\[
p(s_{1:K}, h_{1:L}, g_{1:L}, x, y) = p(y \mid x, s_{1:K}, h_{1:L}, g_{1:L})\, p(x \mid s_{1:K}, h_{1:L}, g_{1:L})\, Pr(s_{1:K})\, p(h_{1:L})\, p(g_{1:L}), \tag{8.8}
\]

where x is an auxiliary vector on the same space as y. In this augmented Bayesian model, the weighting function $p(y \mid x, s_{1:K}, h_{1:L}, g_{1:L})$ weights the intractable posterior. In this work we consider the hierarchical model assumption, where we work with $p(y \mid x, s_{1:K}, h_{1:L}, g_{1:L}) = p(y \mid x)$ [245].

The mechanism which allows one to avoid the evaluation of the intractable likelihood involves data simulation from the likelihood. That is, given a realisation of the parameters of the model, a synthetic data set, x, is generated. Then summary statistics, D(x), derived from these data are compared to summary statistics of the observed data, D(y), and a distance is calculated, $\rho(D(y), D(x))$. Finally, a weight is given to these parameters according to the weighting function $p(y \mid x)$, which may give greater weight when D(x) and D(y) are close (i.e. where $\rho(D(y), D(x))$ is small).

In this work we examine the MCMC-ABC sampler under two different weighting functions: the popular Hard Decision (HD) rule, and a novel weighting function utilizing a SD rule. The HD rule is given by

\[
p(y \mid x, s_{1:K}, h_{1:L}, g_{1:L}) \propto
\begin{cases}
1, & \text{if } \rho(D(y), D(x)) \le \epsilon, \\
0, & \text{otherwise.}
\end{cases} \tag{8.9}
\]

Hence, the HD weighting function rewards summary statistics of the augmented auxiliary variables, D(x), within an ǫ-tolerance of the summary statistic of the actual observed data, D(y), as measured by the distance metric ρ. The second weighting function utilizes a SD given by

\[
p(y \mid x, s_{1:K}, h_{1:L}, g_{1:L}) \propto \exp\left(-\frac{\rho(D(y), D(x))}{\epsilon^2}\right). \tag{8.10}
\]

The SD weighting function clearly penalizes summary statistics as a non-linear function of the distance between summary statistics. That is, the key difference to the HD rule is that, even though the weighting may be small, it remains non-zero, unlike the HD rule. Hence, in the ABC context, we obtain a general approximation to the intractable full posterior, denoted by $p_{ABC}(s_{1:K}, h_{1:L}, g_{1:L} \mid y, \epsilon)$. We are interested in the marginal target posterior, $Pr(s_{1:K} \mid y)$, which in the ABC framework is approximated by

\[
Pr_{ABC}(s_{1:K} \mid y, \epsilon) \propto \iiint p(y \mid x, s_{1:K}, h_{1:L}, g_{1:L})\, p(x \mid s_{1:K}, h_{1:L}, g_{1:L})\, Pr(s_{1:K})\, p(h_{1:L})\, p(g_{1:L})\, dh\, dg\, dx. \tag{8.11}
\]

Here we are interested in the marginal posterior ABC approximation $Pr_{ABC}(s_{1:K} \mid y, \epsilon)$, as we wish to formulate the MAP detector for the symbols. As discussed in [71], the MCMC class of likelihood-free algorithms is justified on a joint space formulation, in which the stationary distribution of the Markov chain is given by $p_{ABC}(s_{1:K}, x \mid y, \epsilon)$. The corresponding target distribution for the marginal distribution $Pr_{ABC}(s_{1:K} \mid y, \epsilon)$ is then obtained via numerical integration. Note that the marginal posterior distribution $Pr_{ABC}(s_{1:K} \mid y, \epsilon) \to Pr(s_{1:K} \mid y)$ as $\epsilon \to 0$, recovering the "true" (intractable) posterior, assuming that D(y) are sufficient statistics and that the weighting function converges to a point mass on D(y) as $\epsilon \to 0$.
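The two weighting rules can be written directly from (8.9) and (8.10); the sketch below (Python, with illustrative values) simply demonstrates that, for the same distance, the SD rule returns a small but non-zero weight where the HD rule returns exactly zero.

    import numpy as np

    def hd_weight(rho, eps):
        # Hard Decision rule (8.9): indicator of rho <= eps.
        return 1.0 if rho <= eps else 0.0

    def sd_weight(rho, eps):
        # Soft Decision rule (8.10): exp(-rho / eps^2), non-zero everywhere.
        return np.exp(-rho / eps ** 2)

    # For rho beyond the tolerance, HD gives 0 but SD stays positive.
    for rho in (0.5, 2.0):
        print(hd_weight(rho, eps=1.0), sd_weight(rho, eps=1.0))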

Accordingly, the tolerance ǫ is typically set as low as possible for a given computational budget. Typically, this will depend on the choice of algorithm used to sample from the ABC posterior distribution. In this work we focus on the class of MCMC-based sampling algorithms. In the next section we present details of choices that must be made when constructing a likelihood-free inference model.

8.3 Algorithm 1 - MAP Sequence Detection via MCMC-ABC

Here we present a novel algorithm to perform MAP detection of a sequence of transmitted symbols. To achieve this we utilize a MCMC-ABC sampler based on [73] and [69]. In particular we utilize a random scan Metropolis-Hastings within Gibbs sampler. The algorithm is depicted in Algorithm 20, where we use the notation $\Theta \triangleq (S, G, H)$, and the Metropolis-Hastings proposals are given by

• Draw proposal $S^*_{1:K}$ from distribution

\[
q\left(s^{[t-1]}_{1:K} \to s^*_{1:K}\right) = q(i)\, q\left(s_i \mid s_i^{[t-1]}\right) = \frac{1}{K}\,\frac{1}{\log_2 M}\, \delta_{s^{[t-1]}_{1:i-1}}\, \delta_{s^{[t-1]}_{i+1:K}}. \tag{8.12}
\]

• Draw proposal $G^*_i$ from distribution

\[
q\left(g_i^{[t-1]} \to g_i^*\right) = q(l)\, q\left(g^{(l)} \mid g_i^{[t-1]}\right) = \frac{1}{L}\, \mathcal{CN}\left(g_i^{[t-1]}, \sigma^2_{g_{rw}}\right). \tag{8.13}
\]

• Draw proposal $H^*_i$ from distribution

\[
q\left(h_i^{[t-1]} \to h_i^*\right) = q(l)\, q\left(h^{(l)} \mid h_i^{[t-1]}\right) = \frac{1}{L}\, \mathcal{CN}\left(h_i^{[t-1]}, \sigma^2_{h_{rw}}\right), \tag{8.14}
\]

where $\delta_\phi$ denotes a Dirac mass at location φ, and q(i) and q(l) respectively denote the uniform probabilities of choosing indices $i \in \{1, \ldots, K\}$ and $l \in \{1, \ldots, L\}$.

We now present details about the choices made for the ABC components of the likelihood-free methodology.

8.3.1 Observations and Synthetic Data

The data $y = y_{1:K}$ correspond to the observed sequence of symbols at the receiver. The generation of the synthetic data in the likelihood-free approach involves generating auxiliary variables $x^{(l)}_1, \ldots, x^{(l)}_K$ from the model, $p\left(x^{(l)} \mid s, g, h\right)$, for $l = 1, \ldots, L$, to obtain a realization $x = \left(x^{(1)}, \ldots, x^{(L)}\right)^\top$.

This is achieved under the following steps:

1. Sample $W^{(l)}_* \sim \mathcal{CN}\left(0, \sigma_w^2 I\right)$, $l \in \{1, \ldots, L\}$.

2. Sample $V^{(l)}_* \sim \mathcal{CN}\left(0, \sigma_v^2 I\right)$, $l \in \{1, \ldots, L\}$.

3. Evaluate $X^{(l)}_* = f^{(l)}\left(S^* h^{(l)} + W^{(l)}_*\right) g^{(l)} + V^{(l)}_*$, $l \in \{1, \ldots, L\}$.
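These three steps amount to a forward simulation of the relay model (8.1)-(8.2); a sketch is given below (Python, with a clipping non-linearity and all parameter values chosen purely for illustration).

    import numpy as np

    def simulate_relay_obs(s, h, g, f, sigma_w, sigma_v, rng):
        # Steps 1-3 above: draw relay and destination noise, pass the noisy
        # relay input through the memoryless relay function f, and form the
        # synthetic observation x^(l) for each of the L relays.
        L, K = len(h), len(s)
        w = (rng.normal(size=(L, K)) + 1j * rng.normal(size=(L, K))) * sigma_w / np.sqrt(2)
        v = (rng.normal(size=(L, K)) + 1j * rng.normal(size=(L, K))) * sigma_v / np.sqrt(2)
        return np.stack([f(s * h[l] + w[l]) * g[l] + v[l] for l in range(L)])

    # Example: L = 2 relays, K = 8 symbols, a clipping (non-linear) relay.
    rng = np.random.default_rng(4)
    s = rng.choice([-3, -1, 1, 3], size=8).astype(complex)   # 4-PAM
    h = rng.normal(size=2) + 1j * rng.normal(size=2)
    g = rng.normal(size=2) + 1j * rng.normal(size=2)
    clip = lambda a: a / np.maximum(1.0, np.abs(a))          # memoryless non-linearity
    x = simulate_relay_obs(s, h, g, clip, sigma_w=0.1, sigma_v=0.1, rng=rng)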

As discussed at the beginning of Section 8.2, summary statistics D( ) are used in the compar- · ison between the synthetic data and the observed data via the weighting function. Summary statistics are utilised since they provide a more efficient (i.e. lower dimensional) mechanism of comparison of the data than comparing the full data vectors, x and y, directly. In particular, under the Neymann-Fisher factorization theorem, when sufficient statistics are utilised no loss of information is incurred, and the ABC posterior will recover the true posterior when ǫ = 0. When non-sufficient summary statistics are used, the target ABC posterior (8.11) becomes an approximation of the true posterior, even for ǫ = 0.

There are many possible choices of summary statistic. Most critically, we want these as close to sufficient as possible whilst low in dimension. The simplest choice is to use D (y) = y i.e. the complete dataset. This is optimal in the sense that it does not result in a loss of information from the observations. The reason this choice is rarely used is that typically it will result in poor performance of the MCMC-ABC and other algorithms. To understand this, consider the HD rule weighting function in (8.9). In this case, even if the true MAP estimate model parameters were utilized to generate the synthetic data x, it will still become improbable to realise a non-zero weight as ǫ 0. This is made worse as the number of observations increases, through the curse → of dimensionality. As a result, the acceptance probability in the MCMC-ABC algorithm would be zero for long periods, resulting in poor sampler efficiency. See [74] for a related discussion on chain mixing. We note that the exception to this rule is when a moderate tolerance ǫ and small number of observations are used.

The next best choice is to utilize a set of summary statistics which are sufficient for the model parameters. In this special case, there is no loss of data through use of the summary statistics versus the whole data set. Unfortunately, the sufficient statistics for an arbitrary model are in general unknown. A practical alternative to utilizing the entire data set, is to use empirical quantile estimates of the data distribution. Here we adopt the vector of quantiles D (y) =

[q0.1 (y) ,..., q0.9 (y)], where qα (y) denotes the α-level empirical quantile of y.

b b b 8.3 Algorithm 1 - MAP Sequence Detection via MCMC-ABC 229

8.3.3 Distance Metric

For the distance metric ρ (D(y),D(x)), as a component of the weighting function, we use the Mahalanobis distance

\[
\rho(D(y), D(x)) = \left(D(y) - D(x)\right)^\top \Sigma_x^{-1} \left(D(y) - D(x)\right), \tag{8.15}
\]

where $\Sigma_x$ is the covariance matrix of D(x). In Section 8.6, we contrast this distance metric with the scaled Euclidean distance, obtained by substituting $\mathrm{diag}(\Sigma_x)$ for $\Sigma_x$ in the above.

We estimate $\Sigma_x$ as the sample covariance matrix of $D(x)$, based on 2,000 likelihood draws from $p\left(y \mid \tilde{s}_{1:K}, \hat{h}_{1:L}, \hat{g}_{1:L}\right)$, where $\tilde{s}_{1:K}$ is a mode of the prior for the symbols and the channel coefficients are replaced with the partial CSI estimates.

We note that in principle the choice of matrix $\Sigma_x$ is immaterial, in the sense that in the limit as $\epsilon \rightarrow 0$, $Pr_{ABC}\left(s_{1:K} \mid y, \epsilon\right) \rightarrow Pr\left(s_{1:K} \mid y\right)$, assuming sufficient statistics $D(x)$. However, in practice, algorithm efficiency is directly affected by the choice of $\Sigma_x$. We demonstrate this in Section 8.6.
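As an illustration of (8.15) and the covariance estimation just described, consider the following sketch, reusing quantile_summary from the earlier example. The names and the use of numpy's linear solver are incidental choices, not the thesis implementation.

```python
import numpy as np

def estimate_summary_cov(draws):
    """Sample covariance of the summary statistics, estimated from repeated
    likelihood draws (the text uses 2,000 draws at a prior mode)."""
    S = np.array([quantile_summary(x) for x in draws])   # one row per draw
    return np.cov(S, rowvar=False)

def mahalanobis(Dy, Dx, Sigma_x):
    """Mahalanobis distance of (8.15); passing np.diag(np.diag(Sigma_x))
    instead of Sigma_x gives the scaled Euclidean variant in the text."""
    d = Dy - Dx
    return float(d @ np.linalg.solve(Sigma_x, d))
```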

8.3.4 Weighting Function

For the weighting function $p\left(y \mid x, s_{1:K}, h_{1:L}, g_{1:L}\right) = p\left(y \mid x\right)$ we consider the HD weighting function as in (8.9) and the SD weighting function as in (8.10).

8.3.5 Tolerance Schedule

With a HD weighting function, an MCMC-ABC algorithm can experience low acceptance probabilities for extended periods, particularly when the chain explores the tails of the posterior distribution (this is known as “sticking”, c.f. [74]). In order to achieve improved chain convergence, we implement an annealed tolerance schedule during Markov chain burn-in, through

$$\epsilon_t = \max\left(T - 10t,\ \epsilon_{min}\right), \tag{8.16}$$

where $\epsilon_t$ is the tolerance at time $t$ in the Markov chain, $T$ is the overall number of Markov chain samples, and $\epsilon_{min}$ denotes the target tolerance of the sampler. This schedule replaces previous approaches of sampling the tolerance jointly with the parameters of the model and imposing a prior to ensure posterior preference for small $\epsilon$. In doing so we are able to avoid the additional Monte Carlo variance in estimates due to numerically integrating out the tolerance random variable from the posterior. The tolerance is decreased according to a deterministic schedule during the burn-in period and stopped at a final value of $\epsilon_{min}$ after a given mixing criterion based on the acceptance probability of the MCMC-ABC sampler is satisfied.
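For concreteness, the schedule (8.16) amounts to a one-line function; the sketch below is purely illustrative.

```python
def tolerance(t, T, eps_min):
    """Annealed tolerance of (8.16): decreases linearly during burn-in
    and is then held fixed at the target tolerance eps_min."""
    return max(T - 10 * t, eps_min)
```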

There is a trade-off between computational overheads (i.e. the Markov chain acceptance rate) and the accuracy of the ABC posterior distribution relative to the true posterior. In this work we determine $\epsilon_{min}$ via preliminary analysis of the Markov chain sampler mixing rates for a transition kernel with coefficient of variation set to one. In general, practitioners will have a required precision in posterior estimates. This precision can be used directly to determine, for a given computational budget, a suitable tolerance $\epsilon_{min}$.

Algorithm 20 MAP sequence detection algorithm using MCMC-ABC

Initialize Markov chain state:
1: Initialize $t = 0$, $S^{(0)} \sim p(s)$, $g_{1:L}^{[0]} = \hat{g}_{1:L}$, $h_{1:L}^{[0]} = \hat{h}_{1:L}$
2: for $t = 1, \ldots, T$ do
Propose new Markov chain state $\Theta^*$ given $\Theta^{[t-1]}$:
3: Draw an index $i \sim U[1, \ldots, K + 2L]$
4: Draw proposal $\Theta^* = \left[\theta_{1:i-1}^{[t-1]}, \theta^*, \theta_{i+1:K+2L}^{[t-1]}\right]$ from the proposal distribution $q\left(\theta_i^{[t-1]} \rightarrow \theta^*\right)$. (Note, the proposal will depend on which element of the $\theta$ vector is being sampled.)
Evaluation of the ABC posterior (8.11):
5: Generate auxiliary variables $x_1^{(l)}, \ldots, x_K^{(l)}$ from the model $p\left(x^{(l)} \mid \theta^*\right)$, for $l = 1, \ldots, L$, to obtain a realization of $X = x = \left[x^{(1)}, \ldots, x^{(L)}\right]^\top$ by:
(5.a) Sample $W_*^{(l)} \sim \mathcal{CN}\left(0, \sigma_w^2 I\right)$, $l \in \{1, \cdots, L\}$.
(5.b) Sample $V_*^{(l)} \sim \mathcal{CN}\left(0, \sigma_v^2 I\right)$, $l \in \{1, \cdots, L\}$.
(5.c) Evaluate $X_*^{(l)} = f^{(l)}\left(S_* h_*^{(l)} + W_*^{(l)}\right) g_*^{(l)} + V_*^{(l)}$, $l \in \{1, \cdots, L\}$.
6: Calculate a measure of distance $\rho\left(D(y), D(x^*)\right)$
7: Evaluate the acceptance probability

$$\alpha\left(\Theta^{[t-1]}, \Theta^*\right) = \min\left\{1,\ \frac{p_{ABC}\left(\theta^* \mid y, \epsilon_t\right)}{p_{ABC}\left(\theta^{[t-1]} \mid y, \epsilon_{t-1}\right)} \times \frac{q\left(\theta^* \rightarrow \theta^{[t-1]}\right)}{q\left(\theta^{[t-1]} \rightarrow \theta^*\right)}\right\},$$

where $p_{ABC}\left(\theta^* \mid y, \epsilon_t\right)$, depending on whether HD or SD is used, is given by:

$$\mathrm{HD:}\quad p_{ABC}\left(\theta^* \mid y, \epsilon_t\right) \propto \begin{cases} Pr\left(s_{1:K}^*\right) p\left(h_{1:L}^*\right) p\left(g_{1:L}^*\right), & \text{if } \rho\left(D(y), D(x^*)\right) \leq \epsilon_t, \\ 0, & \text{otherwise;} \end{cases}$$

$$\mathrm{SD:}\quad p_{ABC}\left(\theta^* \mid y, \epsilon_t\right) \propto \exp\left(-\frac{\rho\left(D(y), D(x^*)\right)}{\epsilon_t^2}\right) Pr\left(s_{1:K}^*\right) p\left(h_{1:L}^*\right) p\left(g_{1:L}^*\right).$$

8: Sample a random variate $u$, where $U \sim U[0, 1]$.
9: if $u \leq \alpha\left(\Theta^{[t-1]}, \Theta^*\right)$ then
10: $\Theta^{[t]} = \Theta^*$
11: else
12: $\Theta^{[t]} = \Theta^{[t-1]}$
13: end if
14: end for

8.3.6 Performance Diagnostic

Given the previously mentioned mixing issues, one should carefully monitor convergence diagnostics of the resulting Markov chain for a given tolerance schedule. For a Markov chain of length $T$, the performance diagnostic we consider is the autocorrelation evaluated on $\tilde{T} = T - T_b$ post-convergence samples after an initial burn-in period of $T_b$. Denoting by $\left\{\theta_i^{[t]}\right\}_{t=1:\tilde{T}}$ the Markov chain of the $i$-th parameter after burn-in, we define the autocorrelation estimate at lag $\tau$ by

$$\widehat{ACF}\left(\theta_i, \tau\right) = \frac{1}{\left(\tilde{T} - \tau\right)\hat{\sigma}\left(\theta_i\right)} \sum_{t=1}^{\tilde{T}-\tau} \left[\theta_i^{[t]} - \hat{\mu}\left(\theta_i\right)\right]\left[\theta_i^{[t+\tau]} - \hat{\mu}\left(\theta_i\right)\right], \tag{8.17}$$

where $\hat{\mu}\left(\theta_i\right)$ and $\hat{\sigma}\left(\theta_i\right)$ are the estimated mean and standard deviation of $\theta_i$.
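A direct numerical transcription of (8.17) is sketched below for a single real-valued parameter; note that here we normalise by the sample variance, the standard convention for an autocorrelation estimate. Illustrative only.

```python
import numpy as np

def acf(theta, max_lag):
    """Empirical autocorrelation, in the spirit of (8.17), computed on the
    post-burn-in samples theta[0:T_tilde] of a single real-valued parameter."""
    theta = np.asarray(theta, dtype=float)
    T = len(theta)
    mu, var = theta.mean(), theta.var()
    c = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        d = theta[:T - tau] - mu
        c[tau] = np.sum(d * (theta[tau:] - mu)) / ((T - tau) * var)
    return c
```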

8.4 Algorithm 2 - MAP Sequence Detection via Auxiliary MCMC

In this Section we present an alternative to the likelihood-free Bayesian model. We demonstrate that, at the expense of increasing the dimension of the parameter vector, one can develop a standard MCMC algorithm without requiring the ABC methodology.

We augment the parameter vector with the unknown noise realizations at each relay, $w_{1:K}$, to obtain a new parameter vector $\left(s_{1:K}, g_{1:L}, h_{1:L}, w_{1:K}\right)$. The resulting posterior distribution $p\left(s_{1:K}, g_{1:L}, h_{1:L}, w_{1:K} \mid y\right)$ may then be decomposed into the full conditional distributions:

$$Pr\left(s_{1:K} \mid g_{1:L}, h_{1:L}, w_{1:K}, y\right) \propto p\left(y \mid s_{1:K}, g_{1:L}, h_{1:L}, w_{1:K}\right) Pr\left(s_{1:K}\right), \tag{8.18a}$$
$$p\left(g_{1:L} \mid s_{1:K}, h_{1:L}, w_{1:K}, y\right) \propto p\left(y \mid s_{1:K}, g_{1:L}, h_{1:L}, w_{1:K}\right) p\left(g_{1:L}\right), \tag{8.18b}$$
$$p\left(h_{1:L} \mid s_{1:K}, g_{1:L}, w_{1:K}, y\right) \propto p\left(y \mid s_{1:K}, g_{1:L}, h_{1:L}, w_{1:K}\right) p\left(h_{1:L}\right), \tag{8.18c}$$
$$p\left(w_{1:K} \mid s_{1:K}, g_{1:L}, h_{1:L}, y\right) \propto p\left(y \mid s_{1:K}, g_{1:L}, h_{1:L}, w_{1:K}\right) p\left(w_{1:K}\right), \tag{8.18d}$$

which form a block Gibbs sampling framework. Conditioning on the unknown noise random variables at the relays permits a simple closed-form solution for the likelihood and results in tractable full conditional posterior distributions for (8.18a)-(8.18d). In this case the likelihood is given by

$$p\left(y \mid s_{1:K}, g_{1:L}, h_{1:L}, w_{1:K}\right) = \prod_{l=1}^{L} p\left(y^{(l)} \mid s_{1:K}, g^{(l)}, h^{(l)}, w^{(l)}\right), \tag{8.19}$$

where

$$p\left(y^{(l)} \mid s_{1:K}, g^{(l)}, h^{(l)}, w^{(l)}\right) = \mathcal{CN}\left(f^{(l)}\left(S h^{(l)} + W^{(l)}\right) g^{(l)}, \sigma_v^2 I\right). \tag{8.20}$$

The resulting Metropolis-within-Gibbs sampler for this block Gibbs framework is outlined in Algorithm 21, where we define the joint posterior parameter vector $\Theta \triangleq (S, G, H, W)$.
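To illustrate why conditioning on the relay noise renders the likelihood tractable, the sketch below evaluates (8.19)-(8.20) in closed form for given noise realizations. The array shapes and names are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def log_lik_given_noise(y, s, h, g, w, sigma_v, f=np.tanh):
    """Closed-form log-likelihood of (8.19)-(8.20): conditioned on the relay
    noise w, each y^(l) is CN(f(s h_l + w_l) g_l, sigma_v^2 I).
    Shapes: y, w are (L, K); s is (K,); h, g are (L,)."""
    L, K = y.shape
    mean = f(s[None, :] * h[:, None] + w) * g[:, None]
    resid = y - mean
    # density of L*K i.i.d. CN(0, sigma_v^2) residuals, summed in the log domain
    return (-L * K * np.log(np.pi * sigma_v**2)
            - np.sum(np.abs(resid)**2) / sigma_v**2)
```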

The Metropolis-Hastings proposals used to sample from each full conditional distribution are given by (8.12)-(8.14), together with an additional proposal for the auxiliary variables:

• Draw proposal $W_i^*$ from the distribution

$$q\left(w_i^{[t-1]} \rightarrow w_i^*\right) = q(l)\, q(i)\, q\left(w_i^* \mid w_i^{[t-1]}\right) = \mathcal{CN}\left(w_i^{[t-1]}, \tfrac{1}{KL}\sigma_{rw}^2\right). \tag{8.21}$$

Algorithm 21 MAP sequence detection algorithm using AV-MCMC

Initialize Markov chain state:
1: Initialize $t = 0$, $S^{(0)} \sim p(s)$, $g_{1:L}^{[0]} = \hat{g}_{1:L}$, $h_{1:L}^{[0]} = \hat{h}_{1:L}$, $W^{(0)} \sim p(w)$
2: for $t = 1, \ldots, T$ do
Propose new Markov chain state $\Theta^*$ given $\Theta^{[t-1]}$:
3: Draw an index $i \sim U[1, \ldots, K + 2L + KL]$
4: Draw proposal $\Theta^* = \left[\theta_{1:i-1}^{[t-1]}, \theta^*, \theta_{i+1:K+2L+KL}^{[t-1]}\right]$ from the proposal distribution $q\left(\theta_i^{[t-1]} \rightarrow \theta^*\right)$. (Note, the proposal will depend on which element of the $\Theta$ vector is being sampled.)
5: Evaluate the acceptance probability

$$\alpha\left(\Theta^{[t-1]}, \Theta^*\right) = \min\left\{1,\ \frac{p\left(\theta^* \mid y\right)}{p\left(\theta^{[t-1]} \mid y\right)} \times \frac{q\left(\theta^* \rightarrow \theta^{[t-1]}\right)}{q\left(\theta^{[t-1]} \rightarrow \theta^*\right)}\right\}.$$

6: Sample a random variate $u$, where $U \sim U[0, 1]$.
7: if $u \leq \alpha\left(\Theta^{[t-1]}, \Theta^*\right)$ then
8: $\Theta^{[t]} = \Theta^*$
9: else
10: $\Theta^{[t]} = \Theta^{[t-1]}$
11: end if
12: end for

The MCMC-AV approach presents an alternative to the likelihood-free Bayesian model sampler, and produces exact samples from the true posterior following chain convergence. While the MCMC-AV sampler still performs joint channel estimation and detection, the trade-off is that sampling the large number of extra parameters will typically result in the need for longer Markov chains to achieve the same performance as the ABC algorithm (in terms of joint estimation and detection performance). This is especially true in high-dimensional problems, such as when the sequence of transmitted symbols $K$ is long and the number of relays $L$ present in the system is large, or when the posterior distribution of the additional auxiliary variables exhibits strong dependence.

8.5 Alternative MAP Detectors and Lower Bound Performance

One can define a suboptimal solution to the MAP detector, even with an intractable likelihood, via a naive, computationally intensive algorithm based on a ZF solution. The ZF solution is popular in simple system models, where it can be efficient and performs well.

Under a ZF solution one conditions on some knowledge of the partial channel state information, and then performs an explicit search over the set of all possible symbol sequences. To our knowledge, a ZF solution for MAP sequence detection in arbitrary non-linear relay systems has not been defined. Accordingly, we define the ZF solution for MAP sequence detection as the solution which conditions on the mean of the noise at the relay nodes, and also uses the noisy channel estimates given by the partial CSI information, to reduce the detection search space.

8.5.1 Sub-optimal Exhaustive Search Zero Forcing Approach

In this approach we condition on the mean of the noise, $W^{(l)} = 0$, and use the partial CSI estimates of the channel coefficients, $\left\{\hat{h}^{(l)}, \hat{g}^{(l)}\right\}_{l=1}^{L}$, to reduce the dimensionality of the MAP detector search space to just the symbol space $\Omega$. The SES-ZF-MAP sequence detector can be expressed as

$$\begin{aligned} \hat{s} &= \arg\max_{s \in \Omega} \prod_{l=1}^{L} Pr\left(s \mid y^{(l)}, G^{(l)} = \hat{g}^{(l)}, H^{(l)} = \hat{h}^{(l)}, W^{(l)} = 0\right) \\ &= \arg\max_{s \in \Omega} \prod_{l=1}^{L} p\left(y^{(l)} \mid s, G^{(l)} = \hat{g}^{(l)}, H^{(l)} = \hat{h}^{(l)}, W^{(l)} = 0\right) Pr(s). \end{aligned} \tag{8.22}$$

Thus, the likelihood model results in a complex Gaussian distribution for each relay channel, as follows:

$$p\left(y^{(l)} \mid s, G^{(l)} = \hat{g}^{(l)}, H^{(l)} = \hat{h}^{(l)}, W^{(l)} = 0\right) = \mathcal{CN}\left(f^{(l)}\left(s \hat{h}^{(l)}\right) \hat{g}^{(l)}, \sigma_v^2 I\right). \tag{8.23}$$

As a result, the MAP detection can be solved exactly using an explicit search.
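A naive transcription of the SES-ZF-MAP detector of (8.22)-(8.23) is sketched below; it enumerates all $M^K$ candidate sequences, which is exactly the computational burden discussed next. Function and variable names are illustrative assumptions.

```python
import numpy as np
from itertools import product

def ses_zf_map(y, h_hat, g_hat, sigma_v, constellation, K, log_prior, f=np.tanh):
    """Exhaustive-search ZF-MAP detection: set the relay noise to its mean
    (W = 0), plug in the partial-CSI channel estimates, and score every one
    of M^K candidate sequences under the Gaussian likelihood of (8.23)."""
    best, best_score = None, -np.inf
    for cand in product(constellation, repeat=K):
        s = np.array(cand, dtype=float)
        mean = f(s[None, :] * h_hat[:, None]) * g_hat[:, None]   # (L, K)
        loglik = -np.sum(np.abs(y - mean)**2) / sigma_v**2       # up to constants
        score = loglik + log_prior(s)
        if score > best_score:
            best, best_score = s, score
    return best
```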

Note, however, that this approach to symbol detection also involves a very high computational cost, as one must evaluate the posterior distribution for all $M^K$ code words in $\Omega$. It is usual for communications systems to utilise $M$ as either 64-ary PAM or 128-ary PAM, and the number of symbols can be anything from $K = 1$ to $K = 20$, depending on the channel capacity budget for the designed network and the typical operating SNR level. Typically this explicit search is not feasible to perform. However, the sub-optimal ZF-MAP detector provides a comparison for the MCMC-ABC approach: at low SNR it should provide a reasonable upper bound for the SER, and at high SNR an approximately optimal solution.

The SES-ZF-MAP sequence detector can be highly sub-optimal at low SNR values. This is trivial to see, since we are explicitly setting the noise realisations to zero when the variance of the noise distribution is large. By the same reasoning, at high SNR values the ZF approach becomes close to optimal.

8.5.2 Lower Bound MAP Detector Performance

We denote the theoretical lower bound for the MAP detector performance as the oracle MAP detector (OMAP). The OMAP detector involves conditioning on perfect oracular knowledge of the channel coefficients $\left\{h^{(l)}, g^{(l)}\right\}_{l=1}^{L}$ and of the realized noise sequence at each relay, $W^{(l)}$. This results in the likelihood model for each relay channel being complex Gaussian, giving an explicit solution for the MAP detector. Accordingly, the OMAP detector provides the lower bound for the SER performance. The OMAP detector can be expressed as

$$\begin{aligned} \hat{s} &= \arg\max_{s \in \Omega} \prod_{l=1}^{L} Pr\left(s \mid y^{(l)}, G^{(l)} = g^{(l)}, H^{(l)} = h^{(l)}, W^{(l)} = W^{(l)}\right) \\ &= \arg\max_{s \in \Omega} \prod_{l=1}^{L} p\left(y^{(l)} \mid s, G^{(l)} = g^{(l)}, H^{(l)} = h^{(l)}, W^{(l)} = W^{(l)}\right) Pr(s). \end{aligned} \tag{8.24}$$

In this case, the likelihood model results in a complex Gaussian distribution for each relay channel, as follows:

$$p\left(y^{(l)} \mid s, G^{(l)} = g^{(l)}, H^{(l)} = h^{(l)}, W^{(l)} = W^{(l)}\right) = \mathcal{CN}\left(f^{(l)}\left(s h^{(l)} + W^{(l)}\right) g^{(l)}, \sigma_v^2 I\right). \tag{8.25}$$

Clearly, however, this detector is impossible to implement in a real system, since oracular knowledge is not available; it serves purely as a performance benchmark.

8.6 Simulation Results

In this Section we evaluate and compare different aspects of the ABC methodology versus the auxiliary MCMC approach for different model configurations. Additionally, we compare the detection performance of the ABC relay methodology, the auxiliary MCMC approach, the optimal oracle MAP sequence detector and the SES-ZF-MAP sequence detector. The following specifications are used for all the MCMC algorithms presented:

• The length of the Markov chain is $T = 20{,}000$.

• The burn-in period is 5,000 samples.

• For each MCMC algorithm, the proposals for each of the parameters were tuned offline to ensure the average acceptance probability post burn-in was in the range of 0.3 to 0.5.

The following specifications for the relay system model are used for all the simulations performed:

• The symbols are taken from a 4-PAM constellation.

• Each sequence contains two symbols, $K = 2$.

• The prior for the sequence of symbols is $Pr\left((s_1, s_2) = [1, 1]\right) = Pr\left((s_1, s_2) = [-1, 1]\right) = 0.3$; all other symbol combinations are equiprobable.

• The partial CSI error variance is given by $\sigma_g^2 = \sigma_h^2 = 0.1$.

• The nonlinear relay function is given by $f(\cdot) = \tanh(\cdot)$.

These system parameters were utilised as they allow us to compute the zero forcing solution without a prohibitive computational burden.

8.6.1 Analysis of Mixing and Convergence of MCMC-ABC Methodology

We analyze the impact that the ABC tolerance level $\epsilon_{min}$ has on the estimation performance of the channel coefficients and on the mixing properties of the Markov chain for the MCMC-ABC algorithm. The study involves joint estimation of channel coefficients and transmitted symbols at an SNR level of 15 dB, with $L = 5$ relays present in the system. We adopt a scaled Euclidean distance metric with the HD weighting function (8.9) and empirically monitor the mixing of the Markov chain through the Autocorrelation Function (ACF).

In Figure 8.2 we present a study of the ACF of the Markov chains for the channel estimates of $G_1$ and $H_1$ as a function of the tolerance $\epsilon$, and the associated estimated marginal posterior distributions $p\left(g_1 \mid Y\right)$ and $p\left(h_1 \mid Y\right)$.

For large $\epsilon$ the Markov chain mixes over the posterior support efficiently, since when the tolerance is large, the HD weighting function, and therefore the acceptance probability, will regularly be non-zero. In addition, with a large tolerance the posterior almost exactly recovers the prior given by the partial CSI. This is expected, since a large tolerance results in a weak contribution from the likelihood. As the tolerance $\epsilon$ decreases, the posterior distribution precision increases and there is a translation from the prior partial CSI channel estimates to the posterior distribution over the true generated channel coefficients for the given frame of data communications.

Figure 8.2: Comparison of performance for MCMC-ABC with HD weighting and scaled Euclidean distance metric. Subplots on the left of the image display how the estimated ACF changes as a function of tolerance level $\epsilon$ for the estimated channels of the relay system. Subplots on the right of the image display a sequence of smoothed marginal posterior distribution estimates for the first channel in both links of the relay, as the tolerance decreases. Note, the indexing of each marginal distribution with labels $1, 2, \ldots, 5$ corresponds to the tolerances given in the legend on the left-hand plots.

It is evident that the mixing properties of the MCMC-ABC algorithm are impacted by the choice of tolerance level. A decreasing tolerance results in a more accurate posterior distribution, albeit at the price of slower chain mixing. Clearly, the ACF tail decay rate is significantly slower as the tolerance reduces.

Note that, although the results are not presented here, we also performed analysis of all aspects of Algorithm 20 under the setting in which the relay function is linear. We confirmed that the MMSE estimates of the channel coefficients and the MAP sequence detector results were accurate for a range of SNR values. The results presented here are for a much more challenging setting, in which the relay is highly non-linear, given by a hyperbolic tangent function.

8.6.2 Analysis of ABC Model Specifications

We now examine the effect of the distance metric and weighting function on the performance of the MCMC-ABC algorithm as a function of the tolerance. We consider HD and SD weighting functions with both Mahalanobis and scaled Euclidean distance metrics. We consider the estimated ACF of the Markov chains of each of the channel coefficients $G$ and $H$. The SNR level was set to 15 dB and $L = 5$ relays were present. The results are presented for one channel; since all channels are i.i.d., this will be indicative of the performance of all channels.

For comparison, an equivalent ABC posterior precision should be obtained under each algorithm. Since the weighting and distance functions are different, this will result in different $\epsilon_{min}$ values for each choice in the MCMC-ABC algorithm. As a result, the analysis proceeded by first taking a minimum base tolerance value, $\epsilon_b = 0.2$, and running the MCMC-ABC with SD Gaussian weighting and Mahalanobis distance for 100,000 iterations. We ensured that the average acceptance rate was between [0.1, 0.3]. This produced a “true” empirical distribution function (edf), which we used as our baseline estimate of the true cdf. Then, for a range of tolerance values $\epsilon_i = [0.25, 0.5, 0.75, 1]$, we ran the MCMC-ABC algorithm with SD Gaussian weighting and Mahalanobis distance for 20,000 iterations, ensuring the average acceptance rate was in the interval [0.1, 0.3]. This produced a set of random walk standard deviations $\sigma_b(i)$, one for each tolerance, that we used for the analysis of the remaining choices of the MCMC-ABC algorithms. For comparison purposes we recorded the estimated maximum error between the estimated edf for each algorithm and the baseline “true” edf. We repeated the simulations for each tolerance on 20 independently generated data sets.

In Figure 8.3 we present the results of this analysis, which demonstrate that the algorithm producing the most accurate results utilised the SD weighting and Mahalanobis distance. The worst performance involved the HD weighting and scaled Euclidean distance. In this case, at low tolerances the average distance between the edf and the baseline “true” edf was maximal, since the algorithm was not mixing. This demonstrates that such low tolerances under this setting of the MCMC-ABC algorithm will produce poor performance relative to the SD with Mahalanobis distance.

8.6.3 Comparisons of Detector Performance

Finally, we present an analysis of the SER under the MCMC-ABC algorithm (with a SD weighting function and Mahalanobis distance), the MCMC-AV detector algorithm, the SES-ZF and Oracle detectors. Specifically, we systematically study the SER as a function of the number of relays, $L \in \{1, 2, 5, 10\}$, and the SNR $\in \{0, 5, 10, 15, 20, 25, 30\}$. The results of this analysis are presented in Figure 8.4. In summary, these results demonstrate that under our proposed system model and detection algorithms, spatial diversity stemming from an increasing number of relays results in measurable improvements in the SER performance.

Figure 8.3: Maximum distance between the edf and the baseline “true” edf for the first channel (estimated cdf for $G_1$), averaged over 20 independent data realisations.

For example, Figure 8.4 demonstrates that for $L = 1$ there is an insignificant difference between the results obtained for the MCMC-ABC, MCMC-AV and SES-ZF algorithms. However, as $L$ increases, SES-ZF has the worst performance and degrades relative to the MCMC-based approaches, demonstrating the utility obtained by developing a more sophisticated detector algorithm. It is clear that the SES-ZF suffers from an error-floor effect: as the SNR increases, the SER is almost constant for SNR values above 15 dB.

Also in Figure 8.4, the two MCMC-based approaches demonstrate comparable performance for small $L$. However, as $L$ increases, in the high SNR region the difference in performance between the MCMC-AV and MCMC-ABC algorithms increases. This could be due to the greater number of auxiliary parameters to be estimated in the auxiliary-based approach as $L$ increases. In particular, we note that adding an additional relay introduces $K$ additional nuisance parameters into the auxiliary model posterior.

Figure 8.4: SER performance of each of the proposed detector schemes as a function of the number of relay hops, $L$. For each relay setup, the SER is reported as a function of the SNR.

8.7 Chapter Summary and Conclusions

In this chapter, we proposed a novel cooperative relay system model and then obtained novel detection algorithms. In particular, this involved an approximate MAP sequence detector for a coherent multiple relay system, where the relay processing function is non-linear. Using the ABC methodology we were able to perform “likelihood-free” inference on the parameters of interest. Simulation results validate the effectiveness of the proposed scheme. In addition to the ABC approach, we developed an alternative exact novel algorithm, MCMC-AV, based on auxiliary variables. Finally, we developed a sub-optimal ZF solution.

We then studied the performance of each algorithm under different settings of our relay system model, including the size of the network and the noise level present. As a result of our findings, we recommend the use of the MCMC-ABC detector, especially when there are many relays present in the network, or the number of symbols transmitted in each frame is large. In settings where the number of relays is moderate, one could consider using the MCMC-AV algorithm, as its performance was on par with the MCMC-ABC results and it does not involve an ABC approximation.

Chapter 9

Conclusions and Future Work

“Science never solves a problem without creating ten more.”

George Bernard Shaw

9.1 Conclusions

In general terms, this dissertation has dealt with the design of OFDM, MIMO and relay network systems. These fall into two categories: the design of robust techniques for OFDM receiver design, and the design of data detection in various MIMO and relay systems with different levels of knowledge of CSI.

In particular the following problems have been considered:

• Detection of transmitted symbols in MIMO systems with perfect CSI at the receiver.

• Detection of transmitted symbols stemming from non-uniform constellations in MIMO systems with imperfect CSI at the receiver.

• Joint channel tracking and decoding for OFDM systems under high mobility.

• Channel estimation for OFDM systems with unknown channel model order and PDP decay rate.

• Detection of transmitted symbols in relay systems with imperfect CSI at the receiver, and non-linear relay functions.

In Chapter 1 we have presented the motivation, the outline, and the contributions of this dissertation.

In Chapters 2 and 3 we have reviewed some basic concepts concerning state-of-the-art communication systems and Bayesian inference.

Chapter 4 has dealt with the design of data detection in MIMO systems with full CSI at the receiver. By introducing an extra constraint on the power of the transmitted symbols, a set of non-convex optimisation problems had to be solved. We have presented a few algorithms with varying degrees of performance and computational complexity to achieve this goal.

Chapter 5 has been devoted to the design of algorithms for detection of Gaussian-like constellations in MIMO systems when only partial CSI is given at the receiver. The problem was first formulated as a non-convex and non-linear optimisation problem. Using the hidden convexity methodology, we were able to transform this problem into a tractable convex problem that could be solved efficiently. We have presented a competing approach, based on the BEM methodology, that achieved comparable performance. We then extended the problem to the case where the noise variance is also unknown. Using the concept of Gibbs annealing stochastic optimisation, we were able to find a solution with close to optimal performance.

Chapter 6 has presented a full scheme for receiver design for coded OFDM systems. Emphasis was put on reducing the error propagation effect due to misdecoded symbols. We have designed a novel algorithm to deal with this effect that achieves close to optimal performance. A novel method has been considered, based on an adaptive number of symbols used to form the state-space model for the purpose of channel tracking. Using hypothesis testing, we were able to construct a statistical test for the health parameters of the tracking device. This allowed for deciding whether the tracking device was diverging due to model misspecification. By allowing only robustly decoded symbols to be included in the state-space model, a significant reduction in model misspecification was achieved. This led to BER performance close to the optimal case with no model misspecification.

Chapter 7 has dealt with the problem of channel estimation in OFDM systems with unknown model order and PDP decay rate. We formulated the problem under the Bayesian framework and, using the TDMCMC methodology, we were able to construct three different algorithms to sample from the intractable posterior distribution. We analysed the sensitivity of these algorithms to different choices of prior distributions and various SNR values. Simulation results demonstrate that these algorithms can perform close to the case where the model order and PDP decay rate are known.

Chapter 8 has dealt with the design of several algorithms for data detection in relay systems, where only partial CSI is available at the destination node and the relay functions are non-linear. We have shown that under Bayesian modeling, the MAP detector cannot be obtained in a closed-form expression. In order to circumvent this problem we have adopted a “likelihood-free” inference methodology. Using the ABC theory we were able to design an MCMC-ABC algorithm to obtain samples from the posterior distribution and, eventually, the MAP sequence detector.

9.2 Future Work

There exist many possibilities for future work that may extend the results obtained in this dissertation.

Concerning the MIMO setup presented in Chapter 4:

• It would be interesting to extend the solution to the case where only partial CSI is given at the receiver.

• Analytic performance analysis of the expected BER could be derived.

• It would be beneficial to find a reduced-complexity algorithm for very high modulation schemes, e.g. 64-QAM. That would include finding ways to reduce the number of groups visited in a smart way.

Concerning the MIMO setup presented in Chapter 5:

• It would be interesting to find the solution for the case where the MIMO channels are correlated. In that case, a different approach would be needed to solve the underlying optimisation problem.

• The goodness of fit of the discrete Gaussian constellation could be evaluated under different criteria, such as the Kolmogorov–Smirnov test or the Kullback–Leibler divergence.

• Analytic performance analysis of the expected BER could be derived.

• It would be interesting to derive the BCRLB in order to evaluate the MSE performance of the MAP estimator derived.

Concerning the OFDM receiver setup presented in Chapter 6:

• As stated, one could use Generalized M-estimator theory to form the state-space model. It would be interesting to utilise different families of robust estimators to further improve the performance.

• Instead of using the Kalman filter, SMC filters such as particle filters could be used to track the channel variations under non-Gaussian interference due to model misspecification.

• The velocity of the mobile device is assumed known. It would be interesting to jointly estimate this parameter with the other quantities. One could build a stochastic model for the variations in the velocity.

Concerning the OFDM receiver setup presented in Chapter 7:

• The major drawback of the algorithms presented is their computational complexity. It would be beneficial to add an initial stage in which a coarse estimate of the model order and the unknown PDP decay rate is obtained. Based on these estimates, a smaller joint space would need to be explored, reducing the length of the chain that needs to be run and thus the overall complexity.

• It could be interesting to build a dynamic model to include variations in model order and PDP decay rate. That would correspond to a time-varying physical environment.

Concerning the relay network setup presented in Chapter 8:

• Some of the parameters (such as $\epsilon$) need to be pre-calibrated, and change between system models and configurations. Using an SMC approach, a sequential solution for obtaining samples from these distributions could be designed. These samplers may be more efficient and need not be calibrated, thus offering an advantage over the MCMC approach.

• The system model could be extended to cases where model uncertainty is introduced. For example, one interesting direction would be the case where the number of relay nodes is unknown a priori and needs to be estimated.

• Another interesting direction would be the case of possibly malicious relays. Wireless communications use a shared medium and are susceptible to malicious denial-of-service attacks, such as signal jamming, where an adversary tries to corrupt communication by actively transmitting a radio signal and interfering with the receiver. Under that scenario, the destination node would decide if the received signal stems from a friendly or a malicious relay.

Chapter 10

Appendix

“If the facts don’t fit the theory, change the facts.”

Albert Einstein

10.1 Properties of Gaussian Distribution

Definition 10.1. (Gaussian distribution). A random variable $x \in \mathbb{R}^n$ has a Gaussian distribution with mean $m \in \mathbb{R}^n$ and covariance $\Sigma \in \mathbb{R}^{n \times n}$ if it has a probability density of the form

$$p\left(x \mid m, \Sigma\right) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\left(x - m\right)^T \Sigma^{-1}\left(x - m\right)\right).$$

Lemma 10.2. (Joint density of Gaussian variables). If random variables $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ have the Gaussian probability densities

$$x \sim N(m, \Sigma), \tag{10.1a}$$
$$y \mid x \sim N(Hx + u, R), \tag{10.1b}$$

then the joint density of $x, y$ and the marginal distribution of $y$ are given by

$$p(x, y) = N\left(\begin{bmatrix} m \\ Hm + u \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma H^T \\ H\Sigma & H\Sigma H^T + R \end{bmatrix}\right), \tag{10.2a}$$
$$p(y) = N\left(Hm + u,\ H\Sigma H^T + R\right). \tag{10.2b}$$


Lemma 10.3. (Conditional density of Gaussian variables). If the random variables $x$ and $y$ have the joint Gaussian probability density

$$p(x, y) = N\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix}\right), \tag{10.3}$$

then the marginal and conditional densities of $x$ and $y$ are given as follows:

$$p(x) = N(a, A), \tag{10.4a}$$
$$p(y) = N(b, B), \tag{10.4b}$$
$$p(x \mid y) = N\left(a + CB^{-1}(y - b),\ A - CB^{-1}C^T\right), \tag{10.4c}$$
$$p(y \mid x) = N\left(b + C^T A^{-1}(x - a),\ B - C^T A^{-1}C\right). \tag{10.4d}$$
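Lemma 10.3 translates directly into a few lines of numerical code; the following sketch (with hypothetical names, illustrative only) computes the conditional moments of (10.4c).

```python
import numpy as np

def gaussian_conditional(a, b, A, B, C, y):
    """Conditional moments of Lemma 10.3: given the joint Gaussian (10.3),
    x | y is Gaussian with mean a + C B^{-1}(y - b) and covariance
    A - C B^{-1} C^T, as in (10.4c)."""
    mean = a + C @ np.linalg.solve(B, y - b)
    cov = A - C @ np.linalg.solve(B, C.T)
    return mean, cov
```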

Let us consider the case of a random variable $z = z_a + j z_b$, where $z_a$ and $z_b$ denote the real and imaginary parts of $z$, respectively; for simplicity, we assume $z$ has zero mean.

Definition 10.4. The complex-valued random variable $z$ is circularly symmetric if, for any $\alpha$, $z$ and $z_\alpha \triangleq z \exp\{j\alpha\}$ have the same pdf.

The invariance of the pdf to the rotation $\alpha$ explains the name circular symmetry. Defining $A \triangleq |z|$ and $\phi \triangleq \angle z$, the circularity of $z$ is characterized by the following factorization of the pdf:

$$f_z(\zeta) = f_{z_a, z_b}\left(\zeta_a, \zeta_b\right) = f_{A,\Phi}(a, \phi) = \frac{1}{2\pi} f_A(a), \tag{10.5}$$

i.e. the amplitude $A$, with arbitrary $f_A(a)$, is independent of the phase $\phi$, which is uniformly distributed on $[0, 2\pi]$.
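The factorization (10.5) can be checked empirically: for a circularly symmetric complex Gaussian sample, the phase is uniform and independent of the amplitude. A minimal illustrative check (not from the thesis) follows.

```python
import numpy as np

rng = np.random.default_rng(1)
# A circularly symmetric sample: i.i.d. zero-mean real and imaginary parts.
z = rng.standard_normal(100_000) + 1j * rng.standard_normal(100_000)
phase = np.angle(z)   # uniform on (-pi, pi], equivalent to [0, 2*pi) mod 2*pi
amp = np.abs(z)       # independent of the phase
print(np.corrcoef(amp, np.cos(phase))[0, 1])   # close to 0
```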

10.3 Bayesian Derivation of the Kalman Filter

In this Section we derive the Kalman filter from two different points of view: first under the MMSE criterion, and then under the MAP criterion.

10.3.1 Introduction

Our goal is to solve the sequential (recursive) probabilistic inference problem within discrete-time dynamic systems that can be described by a LDSSM. The hidden system state $x_n$, with initial probability density $p(x_0)$, evolves over time as a first-order Markov process according to the conditional probability density $p\left(x_n \mid x_{n-1}\right)$. The observations $y_n$ are conditionally independent given the state and are generated according to the conditional probability density $p\left(y_n \mid x_n\right)$. The LDSSM can be written as the following state-space model

$$x_n = A_n x_{n-1} + v_n, \tag{10.6a}$$
$$y_n = B_n x_n + w_n, \tag{10.6b}$$

where $v_n \sim N(0, Q)$, $w_n \sim N(0, R)$, and $E\left[v_n w_n^T\right] = 0$. The state transition density $p\left(x_n \mid x_{n-1}\right)$ is fully specified by the state equation and the process noise distribution $p(v_n)$, whereas the observation equation and the observation noise distribution $p(w_n)$ fully specify the observation likelihood $p\left(y_n \mid x_n\right)$. The dynamic state-space model, together with the known statistics of the noise random variables as well as the prior distributions of the system states, defines a probabilistic generative model of how the system evolves over time and of how we observe this hidden state evolution.

10.3.2 MMSE Derivation of Kalman Filter

In his seminal paper [32], Kalman's derivation was based only on the assumptions that the system state could be consistently estimated by sequentially updating its first- and second-order moments (mean and covariance), and that the specific form of the estimator be linear, i.e. of the form

$$\hat{x}_n = \hat{x}_n^- + K_n \tilde{y}_n, \tag{10.7}$$

where $\hat{x}_n^-$ represents the state estimate at time $n$ given the observations up to $n - 1$, $K_n$ is a linear gain term (the Kalman gain) and $\tilde{y}_n$ is the innovation signal at time $n$, defined as

$$\tilde{y}_n \triangleq y_n - \hat{y}_n^-. \tag{10.8}$$

The predictions of the state, $\hat{x}_n^-$, and the observation, $\hat{y}_n^-$, are given by

$$\hat{x}_n^- = E\left\{f\left(\hat{x}_{n-1}, v_n\right)\right\}, \tag{10.9a}$$
$$\hat{y}_n^- = E\left\{h\left(\hat{x}_n^-, w_n\right)\right\}, \tag{10.9b}$$

where the expectations are taken over the joint distributions of $x_{n-1}$ and $v_n$, and of $\hat{x}_n^-$ and $w_n$, respectively.

We define the estimation errors as

$$\tilde{x}_n = x_n - \hat{x}_n = x_n - \left(\hat{x}_n^- + K_n \tilde{y}_n\right) = \tilde{x}_n^- - K_n \tilde{y}_n, \tag{10.10}$$

where $\tilde{x}_n^- \triangleq x_n - \hat{x}_n^-$.

We define the error covariance $P_{x_n}$, and the cross-covariance between the state and observation errors $P_{\tilde{x}_n \tilde{y}_n}$, as

$$P_{x_n} \triangleq E\left\{\tilde{x}_n \tilde{x}_n^T\right\} = E\left\{\left(x_n - \hat{x}_n\right)\left(x_n - \hat{x}_n\right)^T\right\}, \tag{10.11a}$$
$$P_{\tilde{x}_n \tilde{y}_n} \triangleq E\left\{\tilde{x}_n^- \tilde{y}_n^T\right\} = E\left\{\left(x_n - \hat{x}_n^-\right)\left(y_n - \hat{y}_n^-\right)^T\right\}. \tag{10.11b}$$

Taking outer products and expectations, the covariance of (10.7) can be expressed as

$$P_n \triangleq E\left\{\tilde{x}_n \tilde{x}_n^T\right\} = E\left\{\left(x_n - \hat{x}_n\right)\left(x_n - \hat{x}_n\right)^T\right\}. \tag{10.12}$$

The Kalman gain $K_n$ is obtained by minimising the trace of the error covariance $P_n$:

$$\begin{aligned} P_n &\triangleq \mathrm{cov}\left(x_n - \hat{x}_n\right) = E\left\{\left(x_n - \hat{x}_n\right)\left(x_n - \hat{x}_n\right)^T\right\} \\ &= \mathrm{cov}\left(x_n - \left(\hat{x}_n^- + K_n \tilde{y}_n\right)\right) \\ &= \mathrm{cov}\left(x_n - \hat{x}_n^- - K_n\left(y_n - B_n \hat{x}_n^-\right)\right) \\ &= \mathrm{cov}\left(x_n - \hat{x}_n^- - K_n\left(B_n x_n + w_n - B_n \hat{x}_n^-\right)\right) \\ &= \mathrm{cov}\left(\left(I - K_n B_n\right)\left(x_n - \hat{x}_n^-\right) - K_n w_n\right). \end{aligned} \tag{10.13}$$

Since the measurement noise $w_n$ is uncorrelated with the other terms,

$$\begin{aligned} P_n &= \mathrm{cov}\left(\left(I - K_n B_n\right)\left(x_n - \hat{x}_n^-\right)\right) + \mathrm{cov}\left(K_n w_n\right) \\ &= \left(I - K_n B_n\right) \mathrm{cov}\left(x_n - \hat{x}_n^-\right)\left(I - K_n B_n\right)^T + K_n\, \mathrm{cov}\left(w_n\right) K_n^T. \end{aligned} \tag{10.14}$$

We seek to find the MMSE estimate of $x_n$, hence

$$\hat{x}_n = \arg\min_{\hat{x}_n} E\left\{\left\|x_n - \hat{x}_n\right\|^2\right\} = \arg\min_{K_n}\left(Tr\left\{P_n\right\}\right). \tag{10.15}$$

The error covariance Pn can be expressed as

$$\begin{aligned} P_n &= P_n^- - K_n B_n P_n^- - P_n^- B_n^T K_n^T + K_n\left(B_n P_n^- B_n^T + R\right) K_n^T \\ &= P_n^- - K_n B_n P_n^- - P_n^- B_n^T K_n^T + K_n S_n K_n^T, \end{aligned} \tag{10.16}$$

where $S_n \triangleq B_n P_n^- B_n^T + R$. Setting the matrix derivative to zero yields

$$\frac{\partial\left(Tr\left\{P_n\right\}\right)}{\partial K_n} = -2\left(B_n P_n^-\right)^T + 2 K_n S_n. \tag{10.17}$$

Solving this yields the Kalman gain

$$K_n = P_n^- B_n^T S_n^{-1}. \tag{10.18}$$
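Collecting these results, one predict/update cycle of the linear Kalman filter can be written compactly as follows. The sketch is illustrative: it uses the gain (10.18)/(10.32), the covariance update in the rearranged form (10.38), and the prediction (10.34) obtained in the MAP derivation below; the names and the use of an explicit matrix inverse are incidental choices.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y, A, B, Q, R):
    """One predict/update cycle of the linear Kalman filter for the
    LDSSM (10.6a)-(10.6b)."""
    # Predict
    x_pred = A @ x_prev                    # state prediction
    P_pred = A @ P_prev @ A.T + Q          # P_n^- = A P_{n-1} A^T + Q  (10.34)
    # Update
    S = B @ P_pred @ B.T + R               # innovation covariance S_n
    K = P_pred @ B.T @ np.linalg.inv(S)    # Kalman gain K_n = P_n^- B^T S^{-1}
    x_new = x_pred + K @ (y - B @ x_pred)  # (10.7)/(10.31)
    P_new = P_pred - K @ B @ P_pred        # P_n = P_n^- - K_n B_n P_n^-  (10.38)
    return x_new, P_new
```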

10.3.3 MAP Derivation of Kalman Filter

Here we derive the Kalman filter under the MAP criterion. The following derivation is based on [246] and [247].

$$p\left(x_n \mid y_{1:n}\right) = \frac{p\left(x_n, y_{1:n}\right)}{p\left(y_{1:n}\right)} = \frac{p\left(x_n, y_n, y_{1:n-1}\right)}{p\left(y_n, y_{1:n-1}\right)}, \tag{10.19}$$

where

$$\begin{aligned} p\left(x_n, y_n, y_{1:n-1}\right) &= p\left(y_n \mid x_n, y_{1:n-1}\right) p\left(x_n, y_{1:n-1}\right) \\ &= p\left(y_n \mid x_n, y_{1:n-1}\right) p\left(x_n \mid y_{1:n-1}\right) p\left(y_{1:n-1}\right) \\ &= p\left(y_n \mid x_n\right) p\left(x_n \mid y_{1:n-1}\right) p\left(y_{1:n-1}\right). \end{aligned} \tag{10.20}$$

Substituting (10.20) into (10.19), we obtain

$$p\left(x_n \mid y_{1:n}\right) = \frac{p\left(y_n \mid x_n\right) p\left(x_n \mid y_{1:n-1}\right) p\left(y_{1:n-1}\right)}{p\left(y_n, y_{1:n-1}\right)} = \frac{p\left(y_n \mid x_n\right) p\left(x_n \mid y_{1:n-1}\right) p\left(y_{1:n-1}\right)}{p\left(y_n \mid y_{1:n-1}\right) p\left(y_{1:n-1}\right)} = \frac{p\left(y_n \mid x_n\right) p\left(x_n \mid y_{1:n-1}\right)}{p\left(y_n \mid y_{1:n-1}\right)}. \tag{10.21}$$

Conditional on xn, yn is Gaussian distributed as

$$p\left(y_n \mid x_n\right) = N\left(B_n x_n, R\right) = C_1 \exp\left(-\frac{1}{2}\left(y_n - B_n x_n\right)^T R^{-1}\left(y_n - B_n x_n\right)\right), \tag{10.22}$$

where $C_1 \triangleq (2\pi)^{-N_y/2} |R|^{-1/2}$. Consider the conditional pdf $p\left(x_n \mid y_{1:n-1}\right)$; its mean and covariance are calculated by

$$E\left\{x_n \mid y_{1:n-1}\right\} = E\left\{A_n x_{n-1} + v_n \mid y_{1:n-1}\right\} = A_n \hat{x}_{n-1} = \hat{x}_n^-, \tag{10.23}$$

and

$$\mathrm{cov}\left(x_n \mid y_{1:n-1}\right) = \mathrm{cov}\left(x_n - \hat{x}_n^-\right) = \mathrm{cov}\left(e_n^-\right), \tag{10.24}$$

where $\hat{x}_n^-$ represents the state estimate at time $n$ given the observations up to $n - 1$, and $e_n^-$ is the state error vector. Denoting the covariance of $e_n^-$ by $P_n^-$, by the Gaussian assumption we obtain

$$p\left(x_n \mid y_{1:n-1}\right) = C_2 \exp\left(-\frac{1}{2}\left(x_n - \hat{x}_n^-\right)^T \left(P_n^-\right)^{-1}\left(x_n - \hat{x}_n^-\right)\right), \tag{10.25}$$

where $C_2 \triangleq (2\pi)^{-N_x/2}\left|P_n^-\right|^{-1/2}$. By substituting (10.22) and (10.25) into (10.21), it follows that

$$p\left(x_n \mid y_{1:n}\right) \propto C_3 \exp\left(-\frac{1}{2}\left(y_n - B_n x_n\right)^T R^{-1}\left(y_n - B_n x_n\right)\right. \tag{10.26}$$
$$\left.\qquad\qquad -\frac{1}{2}\left(x_n - \hat{x}_n^-\right)^T \left(P_n^-\right)^{-1}\left(x_n - \hat{x}_n^-\right)\right), \tag{10.27}$$

where $C_3 \triangleq C_1 C_2$.

In order to find the MAP estimate we need to solve

$$\hat{x}_n = \arg\max_{x_n} p\left(x_n \mid y_{1:n}\right), \tag{10.28}$$

and therefore

$$\left.\frac{\partial \log p\left(x_n \mid y_{1:n}\right)}{\partial x_n}\right|_{x_n = \hat{x}_n} = 0. \tag{10.29}$$

Solving (10.29) yields

$$\hat{x}_n = \left(B_n^T R^{-1} B_n + \left(P_n^-\right)^{-1}\right)^{-1}\left(\left(P_n^-\right)^{-1}\hat{x}_n^- + B_n^T R^{-1} y_n\right). \tag{10.30}$$

By using the matrix inversion lemma, $\hat{x}_n$ can be written as

$$\hat{x}_n = \hat{x}_n^- + K_n\left(y_n - B_n \hat{x}_n^-\right), \tag{10.31}$$

where $K_n$ is the Kalman gain, defined by

$$K_n = P_n^- B_n^T\left(B_n P_n^- B_n^T + R\right)^{-1}. \tag{10.32}$$

Observing that

$$e_n^- = x_n - \hat{x}_n^- = A_n x_{n-1} + v_n - A_n \hat{x}_{n-1} = A_n e_{n-1} + v_n, \tag{10.33}$$

and by virtue of $P_{n-1} = \mathrm{cov}\left(e_{n-1}\right)$, we have

$$P_n^- = \mathrm{cov}\left(e_n^-\right) = A_n P_{n-1} A_n^T + Q. \tag{10.34}$$

Since

$$e_n = x_n - \hat{x}_n = x_n - \hat{x}_n^- - K_n\left(y_n - B_n \hat{x}_n^-\right), \tag{10.35}$$

noting that $e_n^- = x_n - \hat{x}_n^-$ and $y_n = B_n x_n + w_n$, we further have

b en = en− Kn Bnen− + wn − (10.36) = (I K B ) e− K w , − n n n − n n it follows

$$P_n = \mathrm{cov}\left(e_n\right) = \left(I - K_n B_n\right) P_n^-\left(I - K_n B_n\right)^T + K_n R K_n^T. \tag{10.37}$$

Rearranging the above equation, we obtain

$$P_n = P_n^- - K_n B_n P_n^-. \tag{10.38}$$

10.4 The EM algorithm

10.4.1 Introduction

The EM algorithm is one of several general techniques for finding MAP estimates when the model depends on unobserved latent variables. The classical EM algorithm [35] is known to converge to a stationary point corresponding to a local optimum of the posterior distribution, though convergence to the global optimum is not guaranteed. The strategy underlying the EM algorithm is to separate a difficult problem into two linked problems, each of which is easier to solve than the original problem. The problems are separated using marginalization. The EM algorithm consists of two major steps: an expectation step, followed by a maximization step. The expectation is taken with respect to the unknown underlying variables, conditioned on the current estimate of the parameters and the observations. During the maximization step, one maximizes the complete-data likelihood, conditioning on the expectation estimates of the previous step. The algorithm is numerically stable and convergence is typically fast, though one should be careful with likelihood functions which are multi-modal. In such cases, ad hoc methods such as multiple starting points have been proposed to try to obtain global maxima in this framework.

10.4.2 Derivation of EM

To gain more insight into the EM method, let us express the posterior function as follows:

$$p\left(y \mid x\right) = p\left(y \mid x\right) \frac{p\left(\theta \mid x, y\right)}{p\left(\theta \mid x, y\right)} = \frac{p\left(\theta, y \mid x\right)}{p\left(\theta \mid x, y\right)}. \tag{10.39}$$

$$\ln p\left(y \mid x\right) = \ln p\left(\theta, y \mid x\right) - \ln p\left(\theta \mid x, y\right). \tag{10.40}$$

If we take the expectation of both sides of (10.40) with respect to $p\left(\theta \mid y, x_k\right)$, where $x_k$ is the current estimate of $x$, we obtain

$$\ln p\left(y \mid x\right) = E\left\{\ln p\left(\theta, y \mid x\right)\right\} - E\left\{\ln p\left(\theta \mid x, y\right)\right\}, \tag{10.41}$$

where the expectation is taken as

$$E\left\{\ln p\left(\theta \mid y, x\right)\right\} = \int \ln p\left(\theta \mid y, x\right)\, p\left(\theta \mid y, x_k\right)\, d\theta, \tag{10.42}$$

noting that $p\left(y \mid x\right)$ does not depend on $p\left(\theta \mid y, x_k\right)$.

Next, we derive the following inequality for the second term in (10.41), based on [34]:

$$E\left\{\ln p\left(\theta \mid y, x_k\right)\right\} \geq E\left\{\ln p\left(\theta \mid y, x\right)\right\}. \tag{10.43}$$

Using the inequality $\ln a \leq a - 1$, it follows that

$$\begin{aligned} E\left\{\ln p\left(\theta \mid y, x\right)\right\} - E\left\{\ln p\left(\theta \mid y, x_k\right)\right\} &= \int \left(\ln p\left(\theta \mid y, x\right) - \ln p\left(\theta \mid y, x_k\right)\right) p\left(\theta \mid y, x_k\right) d\theta \\ &= \int \ln \frac{p\left(\theta \mid y, x\right)}{p\left(\theta \mid y, x_k\right)}\, p\left(\theta \mid y, x_k\right) d\theta \\ &\leq \int \left(\frac{p\left(\theta \mid y, x\right)}{p\left(\theta \mid y, x_k\right)} - 1\right) p\left(\theta \mid y, x_k\right) d\theta \\ &= 0, \end{aligned} \tag{10.44}$$

and therefore

$$E\left\{\ln p\left(\theta \mid y, x_k\right)\right\} \geq E\left\{\ln p\left(\theta \mid y, x\right)\right\}, \tag{10.45}$$

for any $x$.

The goal of the EM algorithm is to maximise the first term on the right hand side of (10.41) at each iteration. For the time being, let us assume that we can maximise it, that is:

$$E\left\{\ln p\left(\theta \mid y, x_{k+1}\right)\right\} \geq E\left\{\ln p\left(\theta \mid y, x_k\right)\right\}. \tag{10.46}$$

It then follows that the likelihood function also increases at every iteration. To demonstrate this, consider the change in likelihood for a single iteration:

$$\begin{aligned} \ln p\left(y \mid x_{k+1}\right) - \ln p\left(y \mid x_k\right) = &\left(E\left\{\ln p\left(\theta, y \mid x_{k+1}\right)\right\} - E\left\{\ln p\left(\theta, y \mid x_k\right)\right\}\right) \\ &- \left(E\left\{\ln p\left(\theta \mid y, x_{k+1}\right)\right\} - E\left\{\ln p\left(\theta \mid y, x_k\right)\right\}\right). \end{aligned} \tag{10.47}$$

Since we are taking the expectation with respect to $p\left(\theta \mid y, x_k\right)$, the right-hand side of the above equation is positive. As a result, the likelihood function is guaranteed to increase at each iteration. The EM algorithm can be summarised in the following steps:

Initialisation: Start with a guess $x_0$.

E-step: Evaluate the expected log-likelihood density function of the complete data given the current estimate $x_k$, i.e.

$$Q\left(x, x_k\right) = E_{\theta \mid y; x_k}\left\{\log p\left(\theta, y \mid x\right)\right\}.$$

M-step: Compute a new value of $x$ that maximises the expected log-likelihood of the complete data:

$$x_{k+1} = \arg\max_x Q\left(x, x_k\right).$$
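As a concrete instance of these steps, the sketch below runs EM on a standard two-component Gaussian mixture — an illustrative model chosen here, not one treated in this thesis. The E-step computes posterior responsibilities under the current parameters; the M-step re-maximises the expected complete-data log-likelihood in closed form.

```python
import numpy as np

def em_two_gaussians(y, iters=100):
    """Minimal EM iteration for a two-component 1-D Gaussian mixture."""
    mu = np.array([y.min(), y.max()])        # crude initialisation
    var = np.array([y.var(), y.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior probability of each component for each point
        dens = pi / np.sqrt(2 * np.pi * var) \
               * np.exp(-(y[:, None] - mu)**2 / (2 * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates of the mixture parameters
        n = r.sum(axis=0)
        mu = (r * y[:, None]).sum(axis=0) / n
        var = (r * (y[:, None] - mu)**2).sum(axis=0) / n
        pi = n / len(y)
    return mu, var, pi
```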

10.4.3 BEM algorithm

The EM algorithm was originally derived for the ML case. However, it can easily be extended to the Bayesian framework in order to derive the MAP estimate. The BEM algorithm evolves through the same iterative procedure as the EM, but with a different auxiliary function $Q_{BEM}\left(x, x_k\right)$:

$$Q_{BEM}\left(x, x_k\right) = E_{\theta \mid y; x_k}\left\{\log p\left(\theta, y, x\right)\right\}. \tag{10.48}$$

The relationship between the BEM and the EM algorithms can be established by factoring $p\left(\theta, y, x\right)$ as

$$p\left(\theta, y, x\right) = p\left(\theta, y \mid x\right) p\left(x\right). \tag{10.49}$$

It follows that

$$Q_{BEM}\left(x, x_k\right) = Q_{EM}\left(x, x_k\right) + \log p\left(x\right). \tag{10.50}$$

Therefore, the difference between the EM and the BEM algorithms is the bias term $\log p\left(x\right)$, which incorporates our prior belief regarding $x$.

Bibliography

[1] C. Shannon, “The mathematical theory of communication (parts 1 and 2),” Bell Syst. Tech. J, vol. 27, pp. 379–423, 1948.

[2] G. J. Foschini and M. J. Gans, “On limits of wireless communications in a fading envi- ronment when using multiple antennas,” Wireless Personal Communications, vol. 6, pp. 311–335, 1998.

[3] E. Telatar, “Capacity of multi-antenna gaussian channels,” European transactions on telecommunications, vol. 10, no. 6, pp. 585–595, 1999.

[4] R. Chang and R. Gibby, “A theoretical study of performance of an orthogonal multiplexing data transmission scheme,” Communications, IEEE Transactions on [legacy, pre-1988], vol. 16, no. 4, pp. 529–540, 1968.

[5] J. Holsinger, “Digital communication over fixed time-continuous channels with memory- with special application to telephone channels.” MIT Research Laboratory of Electronics, 1964.

[6] S. Weinstein and P. Ebert, “Data Transmission by Frequency-Division Multiplexing Using the Discrete Fourier Transform,” Communications, IEEE Transactions on [legacy, pre- 1988], vol. 19, no. 5 Part 1, pp. 628–634, 1971.

[7] A. Ruiz, J. Cioffi, S. Kasturia, I. Center, and N. Hawthorne, “Discrete multiple tone modulation with coset coding for thespectrally shaped channel,” Communications, IEEE Transactions on, vol. 40, no. 6, pp. 1012–1029, 1992.

[8] E. Van der Meulen, “Transmission of Information in a T-terminal Discrete Memoryless Channel.” Dept. of Statistics, University of California, Berkeley, CA., 1969.

[9] T. Cover and A. Gamal, “Capacity theorems for the relay channel,” Information Theory, IEEE Transactions on, vol. 25, no. 5, pp. 572–584, 1979.

[10] T. Bayes, “An essay towards solving a problem in the doctrine of chances,” Philosophical Transactions, vol. 53, pp. 376–418, 1763.

[11] C. Robert, The Bayesian choice: a decision-theoretic motivation. Springer-Verlag, 1994.

[12] G. Box, G. Tiao, and C. George, Bayesian inference in statistical analysis. Addison- Wesley Reading, Mass, 1973.

[13] A. Smith and J. Bernardo, Bayesian theory. Wiley New York, 2000.

[14] A. Gelman, J. Carlin, H. Stern, and D. Rubin, Bayesian Data Analysis. Texts in Statistical Science. Chapman & Hall/CRC, 2004, vol. 25.

[15] H. Jeffreys, Theory of probability. Oxford, 1961.

[16] B. Clarke and A. Barron, “Jeffreys' prior is asymptotically least favorable under entropy risk,” Journal of Statistical Planning and Inference, vol. 41, no. 1, pp. 37–61, 1994.

[17] P. Lee, Bayesian statistics. Arnold London, UK:, 2004.

[18] H. Raiffa and R. Schlaifer, Applied statistical decision theory. MIT Press, 1968.

[19] C. Morris, “Parametric empirical Bayes inference: theory and applications,” Journal of the American Statistical Association, vol. 78, no. 381, pp. 47–65, 1983.

[20] A. Stuart, J. Ord, and S. Arnold, “Kendall's Advanced Theory of Statistics, vol. 2A,” London: Edward Arnold, 1999.

[21] J. Hoeting, D. Madigan, A. Raftery, and C. Volinsky, “Bayesian model averaging: A tutorial,” Statistical Science, vol. 14, no. 4, pp. 382–417, 1999.

[22] A. Raftery, D. Madigan, and J. Hoeting, “Bayesian Model Averaging for Linear Regres- sion,” Journal of the American Statistical Association, vol. 92, pp. 179–191, 1997.

[23] L. Wasserman, “Bayesian Model Selection and Model Averaging,” Journal of Mathematical Psychology, vol. 44, no. 1, pp. 92–107, 2000.

[24] N. Ma, M. Bouchard, and R. Goubran, “Speech enhancement using a masking thresh- old constrained Kalman filter and its heuristic implementations,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 19–32, 2006.

[25] B. Han, D. Comaniciu, Y. Zhu, and L. Davis, “Sequential kernel density approximation and its application to real-time visual tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1186–1197, 2008.

[26] M. Hurtado, T. Zhao, and A. Nehorai, “Adaptive polarized waveform design for target tracking based on sequential Bayesian inference,” IEEE Transactions on Signal Processing, vol. 56, no. 3, pp. 1120–1133, 2008. [27] A. B. Trolle and E. S. Schwartz, “A general stochastic volatility model for the pricing and forecasting of interest rate derivatives,” no. 12337, Jun. 2006. [Online]. Available: http://ideas.repec.org/p/nbr/nberwo/12337.html

[28] K. Zhou, J. Doyle, and K. Glover, “Robust and optimal control,” 1996.

[29] J. Doyle, B. Francis, and A. Tannenbaum, “Feedback control theory,” 1992.

[30] A. Sage, C. White, and G. Siouris, “Optimum Systems Control,” Systems, Man and Cy- bernetics, IEEE Transactions on, vol. 9, no. 2, pp. 102–103, 1979.

[31] H. Sorenson, “Recursive estimation for nonlinear dynamic systems,” Bayesian Analysis of Time Series and Dynamic Models, pp. 127–166, 1988.

[32] R. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.

[33] A. Doucet and N. De Freitas, Sequential Monte Carlo Methods in Practice. Springer, 2001.

[34] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains,” The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164–171, 1970. [Online]. Available: http://dx.doi.org/10.2307/2239727

[35] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm (with discussion),” JR Statist. Soc, vol. 39, pp. 1–38, 1977.

[36] H. Cramer, “A contribution to the theory of statistical estimation,” Skand. Aktuarietidskr, vol. 29, pp. 85–94, 1946.

[37] C. Rao, “Information and the Accuracy Attainable in the Estimation of Statistical Parameters,” Bull. Calcutta Math. Soc., vol. 37, pp. 81–91, 1945.

[38] Y. Wu, D. Hu, M. Wu, and X. Hu, “A Numerical-Integration Perspective on Gaussian Filters,” IEEE Transactions on Signal Processing, vol. 54, no. 8, pp. 2910–2921, 2006.

[39] A. Honkela and H. Valpola, “Unsupervised variational Bayesian learning of nonlinear mod- els,” Advances in Neural Information Processing Systems, vol. 17, pp. 593–600, 2005.

[40] C. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 2004.

[41] K. Borovkov, Elements of stochastic modeling. World Scientific Pub Co Inc.

[42] W. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, 1970. [43] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, “Equation of State Calculations by Fast Computing Machines,” The Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1092, 1953.

[44] S. Chib and E. Greenberg, “Understanding the Metropolis-Hastings Algorithm,” The American Statistician, vol. 49, pp. 327–335, 1995.

[45] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, 1996.

[46] J. Siepmann and D. Frenkel, “Configurational bias Monte Carlo: a new sampling scheme for flexible chains,” Molecular Physics, vol. 75, no. 1, pp. 59–70, 1992.

[47] J. Liu, F. Liang, and W. Wong, “The Multiple-Try Method and Local Optimization in Metropolis Sampling,” Journal of the American Statistical Association, vol. 95, no. 449, pp. 121–134, 2000.

[48] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721–741, 1984.

[49] P. Laarhoven and E. Aarts, Simulated annealing: theory and applications. Springer, 1987.

[50] M. Cowles and B. Carlin, “Markov Chain Monte Carlo Convergence Diagnostics: A Com- parative Review,” Journal of the American Statistical Association, vol. 91, no. 434, pp. 883–904, 1996.

[51] S. Brooks and G. Roberts, “Convergence assessment techniques for Markov chain Monte Carlo,” Statistics and Computing, vol. 8, no. 4, pp. 319–335, 1998.

[52] A. Gelman, “Inference and monitoring convergence,” Markov Chain Monte Carlo in Prac- tice, pp. 131–143, 1996.

[53] N. Madras and A. Sokal, “The pivot algorithm: A highly efficient Monte Carlo method for the self-avoiding walk,” Journal of Statistical Physics, vol. 50, no. 1, pp. 109–186, 1988.

[54] P. Green, “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika, vol. 82, no. 4, pp. 711–732, 1995.

[55] S. Richardson and P. Green, “On Bayesian Analysis of Mixtures with an Unknown Num- ber of Components (with discussion),” Journal of the Royal Statistical Society: Series B (Methodological), vol. 59, no. 4, pp. 731–792, 1997.

[56] F. Burk, Lebesgue measure and integration: an introduction. Wiley-Interscience, 1998.

[57] C. J. Geyer, “Markov chain Monte Carlo maximum likelihood,” pp. 156–163, 1991.

[59] F. Liang and W. Wong, “Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models,” Journal of the American Statistical Association, vol. 96, no. 454, pp. 653–666, 2001.

[60] W. Wong and F. Liang, “Dynamic weighting in Monte Carlo and optimization,” Proceed- ings of the National Academy of Sciences, vol. 94, no. 26, pp. 14 220–14 224, 1997.

[61] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathe- matical Statistics, pp. 400–407, 1951.

[62] F. Liang, “A Generalized Wang–Landau Algorithm for Monte Carlo Computation,” Journal of the American Statistical Association, vol. 100, no. 472, pp. 1311–1327, 2005.

[63] S. Tavare, D. Balding, R. Griffiths, and P. Donnelly, “Inferring Coalescence Times From DNA Sequence Data,” Genetics, vol. 145, no. 2, pp. 505–518, 1997.

[64] J. Pritchard, “Population growth of human Y chromosomes: a study of Y chromosome microsatellites,” Molecular Biology and Evolution, vol. 16, no. 12, pp. 1791–1798, 1999.

[65] G. Peters and S. Sisson, “Bayesian inference, Monte Carlo sampling and operational risk,” Journal of Operational Risk, vol. 1, no. 3, pp. 27–50, 2006.

[66] G. Peters, M. Wüthrich, and P. Shevchenko, “Chain Ladder Method: Bayesian Bootstrap versus Classical Bootstrap,” Preprint - UNSW statistics.

[67] R. Wilkinson and S. Tavare, “Approximate Bayesian Computation: a simulation based approach to inference,” Preprint.

[68] A. Butler, C. Glasbey, D. Allcroft, and S. Wanless, “A latent Gaussian model for com- positional data with structural zeroes,” JR Stat. Soc. Ser. C (Appl. Stat.), vol. 57, pp. 505–520, 2008.

[69] P. Bortot, S. Coles, and S. Sisson, “Inference for Stereological Extremes,” Journal- American Statistical Association, vol. 102, no. 477, pp. 84–92, 2007.

[70] O. Ratmann, O. Jørgensen, T. Hinkley, M. Stumpf, S. Richardson, and C. Wiuf, “Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum,” PLoS Comput Biol, vol. 3, no. 11, pp. 2266–2276, 2007.

[71] S. Sisson, G. Peters, Y. Fan, and M. Briers, “Likelihood-free samplers ,” Preprint - UNSW statistics.

[72] G. Peters, S. Sisson, and Y. Fan, “On Sequential Monte Carlo, Partial Rejection Control and Approximate Bayesian Computation ,” Preprint - UNSW statistics. [73] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavare, “Markov chain Monte Carlo without likelihoods,” Proceedings of the National Academy of Sciences, vol. 100, no. 26, pp. 15 324– 15 328, 2003.

[74] S. Sisson, Y. Fan, and M. Tanaka, “Sequential Monte Carlo without likelihoods,” Proceed- ings of the National Academy of Sciences, vol. 104, no. 6, pp. 1760–1765, 2007.

[75] M. Beaumont, W. Zhang, and D. Balding, “Approximate Bayesian Computation in Pop- ulation Genetics,” Genetics, vol. 162, no. 4, pp. 2025–2035, 2002.

[76] J. von Neumann, “Various techniques used in connection with random digits,” Applied Math Series, vol. 12, pp. 36–38, 1951.

[77] J. Proakis and M. Salehi, Digital communications. McGraw-Hill New York, 1995.

[78] T. Rappaport, Wireless Communications: Principles and Practice. IEEE Press Piscat- away, NJ, USA, 1996.

[79] P. Robertson and S. Kaiser, “The effects of Doppler spreads in OFDM (A) mobile radio systems,” vol. 1, 1999.

[80] K. Yu and B. Ottersten, “Models for MIMO propagation channels: a review,” Wireless Communications and Mobile Computing, vol. 2, no. 7, pp. 653–666, 2002.

[81] J. Parsons and J. David, The mobile radio propagation channel. Wiley Chichester, 2000.

[82] H. Hashemi, N. Ltd, and A. Calgary, “The indoor radio propagation channel,” Proceedings of the IEEE, vol. 81, no. 7, pp. 943–968, 1993.

[83] M. Pätzold, Mobile Fading Channels. Wiley, 2002.

[84] W. Jakes and D. Cox, Microwave Mobile Communications. Wiley-IEEE Press, 1994.

[85] K. Baddour and N. Beaulieu, “Autoregressive modeling for fading channel simulation,” IEEE Transactions on Wireless Communications, vol. 4, no. 4, pp. 1650–1662, 2005.

[86] H. Wang and P. Chang, “On verifying the first-order Markovian assumption for a Rayleigh-fading channel model,” IEEE Transactions on Vehicular Technology, vol. 45, no. 2, pp. 353–357, 1996.

[87] C. Komninakis, C. Fragouli, A. Sayed, and R. Wesel, “Multi-input multi-output fading channel tracking and equalization using Kalman estimation,” IEEE Transactions on Signal Processing, vol. 50, no. 5, pp. 1065–1076, 2002.

[88] H. El Gamal, A. Hammons Jr, Y. Liu, M. Fitz, and O. Takeshita, “On the Design of Space–Time and Space–Frequency Codes for MIMO Frequency-Selective Fading Channels,” IEEE Transactions on Information Theory, vol. 49, no. 9, pp. 2277–2292, 2003.

[89] A. Maleki-Tehrani, B. Hassibi, and J. Cioffi, “Adaptive equalization of multiple-input multiple-output (MIMO) frequency selective channels,” vol. 1, 1999.

[90] S. Liu and Z. Tian, “Near-optimum soft decision equalization for frequency selective MIMO channels,” IEEE Transactions on Signal Processing, vol. 52, no. 3, pp. 721–733, 2004.

[91] G. Foschini, “Wireless Communication in a Fading Environment When Using Multi-Element Antennas,” Bell Labs Technical Journal, vol. 1, no. 2, pp. 41–59, 1996.

[92] E. Viterbo and J. Boutros, “A universal lattice code decoder for fading channels,” IEEE Transactions on Information Theory, vol. 45, no. 5, pp. 1639–1642, 1999.

[93] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in lattices,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2201–2214, 2002.

[94] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Transactions on Communications, vol. 51, no. 3, pp. 389–399, 2003.

[95] A. Chan and I. Lee, “A new reduced-complexity sphere decoder for multiple antenna systems,” in Proc. IEEE International Conference on Communications (ICC), vol. 1, pp. 460–464, 2002.

[96] C. Schnorr and M. Euchner, “Lattice basis reduction: Improved practical algorithms and solving subset sum problems,” Mathematical Programming, vol. 66, no. 1, pp. 181–199, 1994.

[97] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, 2006.

[98] J. Conway and N. Sloane, Sphere Packings, Lattices and Groups. Springer, 1999.

[99] U. Fincke and M. Pohst, “Improved methods for calculating vectors of short length in a lattice, including a complexity analysis,” Mathematics of Computation, vol. 44, pp. 463–471, 1985.

[100] C. Windpassinger, L. Lampe, R. Fischer, and T. Hehn, “A Performance Study of MIMO Detectors,” IEEE Transactions on Wireless Communications, vol. 5, no. 8, pp. 2004–2008, 2006.

[101] T. Kailath, H. Vikalo, and B. Hassibi, “MIMO receive algorithms,” Space-Time Wireless Systems: From Array Processing to MIMO Communications, p. 302, 2006.

[102] A. Peled and A. Ruiz, “Frequency domain data transmission using reduced computational complexity algorithms,” vol. 5, 1980.

[103] C. Wong, R. Cheng, K. Lataief, and R. Murch, “Multiuser OFDM with adaptive subcarrier, bit, and power allocation,” IEEE Journal on Selected Areas in Communications, vol. 17, no. 10, pp. 1747–1758, 1999.

[104] C. Yih and E. Geraniotis, “Adaptive modulation, power allocation and control for OFDM wireless networks,” in Proc. 11th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), vol. 2, 2000.

[105] T. Pollet, M. Van Bladel, and M. Moeneclaey, “BER sensitivity of OFDM systems to carrier frequency offset and Wiener phase noise,” IEEE Transactions on Communications, vol. 43, no. 2/3/4, pp. 191–193, 1995.

[106] J. Armstrong, “Analysis of new and existing methods of reducing intercarrier interference due to carrier frequency offset in OFDM,” IEEE Transactions on Communications, vol. 47, no. 3, pp. 365–369, 1999.

[107] B. Krongold and D. Jones, “PAR reduction in OFDM via active constellation extension,” IEEE Transactions on Broadcasting, vol. 49, no. 3, pp. 258–268, 2003.

[108] R. Bauml, R. Fischer, and J. Huber, “Reducing the peak-to-average power ratio of multicarrier modulation by selected mapping,” Electronics Letters, vol. 32, no. 22, pp. 2056–2057, 1996.

[109] L. Tong, B. Sadler, and M. Dong, “Pilot-assisted wireless transmissions: general model, design criteria, and signal processing,” IEEE Signal Processing Magazine, vol. 21, no. 6, pp. 12–25, 2004.

[110] S. Coleri, M. Ergen, A. Puri, and A. Bahai, “Channel estimation techniques based on pilot arrangement in OFDM systems,” IEEE Transactions on Broadcasting, vol. 48, no. 3, pp. 223–229, 2002.

[111] Y. Li, “Pilot-symbol-aided channel estimation for OFDM in wireless systems,” IEEE Transactions on Vehicular Technology, vol. 49, no. 4, pp. 1207–1215, 2000.

[112] F. Tufvesson and T. Maseng, “Pilot assisted channel estimation for OFDM in mobile cellular systems,” vol. 3, pp. 1639–1643, 1997.

[113] M. Morelli and U. Mengali, “A comparison of pilot-aided channel estimation methods for OFDM systems,” IEEE Transactions on Signal Processing, vol. 49, no. 12, pp. 3065–3073, 2001.

[114] J. van de Beek, O. Edfors, M. Sandell, S. Wilson, and P. Borjesson, “On channel estimation in OFDM systems,” in Proc. IEEE 45th Vehicular Technology Conference, vol. 2, pp. 815–819, 1995.

[115] J. Ran, R. Grunheid, H. Rohling, E. Bolinth, and R. Kern, “Decision-directed channel estimation method for OFDM systems with high velocities,” vol. 4, pp. 2358–2361, 2003.

[116] A. Petropulu, R. Zhang, and R. Lin, “Blind OFDM channel estimation through simple linear precoding,” IEEE Transactions on Wireless Communications, vol. 3, no. 2, pp. 647–655, 2004.

[117] R. Lin and A. Petropulu, “Linear precoding assisted blind channel estimation for OFDM systems,” IEEE Transactions on Vehicular Technology, vol. 54, no. 3, pp. 983–995, 2005.

[118] W. Bai, C. He, L. Jiang, and H. Zhu, “Blind channel estimation in MIMO-OFDM systems,” in Proc. IEEE Global Telecommunications Conference (GLOBECOM), vol. 1, pp. 317–321, 2002.

[119] W. Gardner and D. Cochran, Cyclostationarity in Communications and Signal Processing. IEEE Press, Piscataway, NJ, 1994.

[120] R. Heath Jr and G. Giannakis, “Exploiting input cyclostationarity for blind channel identification in OFDM systems,” IEEE Transactions on Signal Processing, vol. 47, no. 3, pp. 848–856, 1999.

[121] G. Xu, H. Liu, L. Tong, and T. Kailath, “A least-squares approach to blind channel identification,” IEEE Transactions on Signal Processing, vol. 43, no. 12, pp. 2982–2993, 1995.

[122] E. Serpedin and G. Giannakis, “Blind channel identification and equalization with modulation-induced cyclostationarity,” IEEE Transactions on Signal Processing, vol. 46, no. 7, pp. 1930–1944, 1998.

[123] T. Cui and C. Tellambura, “Joint data detection and channel estimation for OFDM systems,” IEEE Transactions on Communications, vol. 54, no. 4, pp. 670–679, 2006.

[124] R. Chen, H. Zhang, Y. Xu, and X. Liu, “Blind receiver for OFDM systems via sequential Monte Carlo in factor graphs,” Journal of Zhejiang University-Science A, vol. 8, no. 1, pp. 1–9, 2007.

[125] S. Zhou and G. Giannakis, “Finite-alphabet based channel estimation for OFDM and related multicarrier systems,” IEEE Transactions on Communications, vol. 49, no. 8, pp. 1402–1414, 2001.

[126] Y. Song, S. Roy, and L. Akers, “Joint blind estimation of channel and data symbols in OFDM,” in Proc. IEEE 51st Vehicular Technology Conference (VTC 2000-Spring), vol. 1, 2000.

[127] C. Li and S. Roy, “Subspace-based blind channel estimation for OFDM by exploiting virtual carriers,” IEEE Transactions on Wireless Communications, vol. 2, no. 1, pp. 141–150, 2003.

[128] S. Roy and C. Li, “A subspace blind channel estimation method for OFDM systems without cyclic prefix,” IEEE Transactions on Wireless Communications, vol. 1, no. 4, pp. 572–579, 2002.

[129] X. Cai and A. Akansu, “A subspace method for blind channel identification in OFDM systems,” vol. 2, pp. 929–933, 2000.

[130] N. Chen and G. Zhou, “A superimposed periodic pilot scheme for semi-blind channel estimation of OFDM systems,” pp. 362–365, 2002.

[131] A. Burr, Modulation and Coding for Wireless Communications. Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA, 2001.

[132] P. Elias, “Coding for noisy channels,” IRE Convention Record, vol. 4, pp. 37–46, 1955.

[133] A. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.

[134] J. Hagenauer and P. Hoeher, “A Viterbi algorithm with soft-decision outputs and its applications,” in Proc. IEEE Global Telecommunications Conference (GLOBECOM), pp. 1680–1686, 1989.

[135] E. Zehavi, “8-PSK trellis codes for a Rayleigh channel,” IEEE Transactions on Communications, vol. 40, no. 5, pp. 873–884, 1992.

[136] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes,” in Proc. IEEE International Conference on Communications (ICC), Geneva, vol. 2, pp. 1064–1070, 1993.

[137] S. ten Brink, J. Speidel, and R. Yan, “Iterative demapping and decoding for multilevel modulation,” vol. 1, pp. 579–584, 1998.

[138] A. Glavieux, C. Laot, and J. Labat, “Turbo equalization over a frequency selective channel,” Proc. Int. Symp. Turbo Codes, pp. 96–102, 1997.

[139] X. Wang and H. Poor, “Iterative (turbo) soft interference cancellation and decoding for coded CDMA,” IEEE Transactions on Communications, vol. 47, no. 7, pp. 1046–1061, 1999.

[140] F. Kschischang, B. Frey, and H. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001.

[141] E. van der Meulen, “Three-terminal communication channels,” Advances in Applied Probability, vol. 3, no. 1, pp. 120–154, 1971.

[142] D. Chen and J. Laneman, “Modulation and Demodulation for Cooperative Diversity in Wireless Systems,” IEEE Transactions on Wireless Communications, vol. 5, no. 7, pp. 1785–1794, 2006.

[143] J. Laneman, D. Tse, and G. Wornell, “Cooperative diversity in wireless networks: Efficient protocols and outage behavior,” IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 3062–3080, 2004.

[144] G. Kramer, M. Gastpar, and P. Gupta, “Cooperative Strategies and Capacity Theorems for Relay Channels,” IEEE Transactions on Information Theory, vol. 51, no. 9, pp. 3037–3063, 2005.

[145] M. Khojastepour, A. Sabharwal, and B. Aazhang, “On the capacity of ‘cheap’ relay networks,” pp. 12–14, 2003.

[146] K. Gomadam and S. Jafar, “Optimal relay functionality for SNR maximization in memo- ryless relay networks,” IEEE Journal on Selected Areas in Communications, vol. 25, no. 2, pp. 390–401, 2007.

[147] A. Ben-Tal and M. Teboulle, “Hidden convexity in some nonconvex quadratically constrained quadratic programming,” Mathematical Programming, vol. 72, no. 1, pp. 51–63, 1996.

[148] D. Micciancio, “The shortest vector in a lattice is hard to approximate to within some constant,” in Proc. 39th Annual Symposium on Foundations of Computer Science, pp. 92–98, 1998.

[149] A. Wiesel, Y. Eldar, and S. Shitz, “Semidefinite Relaxation for Detection of 16-QAM Signaling in MIMO Channels,” IEEE Signal Processing Letters, vol. 12, no. 9, pp. 653–656, 2005.

[150] N. Sidiropoulos and Z. Luo, “A Semidefinite Relaxation Approach to MIMO Detection for High-Order QAM Constellations,” IEEE Signal Processing Letters, vol. 13, no. 9, pp. 525–528, 2006.

[151] Y. Eldar and A. Beck, “Hidden convexity based near maximum-likelihood CDMA detection,” in Proc. IEEE 6th Workshop on Signal Processing Advances in Wireless Communications, 2005, pp. 61–65.

[152] P. Tan, L. Rasmussen, and T. Lim, “Constrained maximum-likelihood detection in CDMA,” IEEE Transactions on Communications, vol. 49, no. 1, pp. 142–153, 2001.

[153] H. Zhao, H. Long, and W. Wang, “Tabu Search Detection for MIMO Systems,” in Proc. IEEE 18th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), pp. 1–5, 2007.

[154] P. Tan and L. Rasmussen, “Multiuser detection in CDMA: a comparison of relaxations, exact, and heuristic search methods,” IEEE Transactions on Wireless Communications, vol. 3, no. 5, pp. 1802–1809, 2004.

[155] A. Paulraj, D. Gore, R. Nabar, and H. Bolcskei, “An Overview of MIMO Communications - A Key to Gigabit Wireless,” Proceedings of the IEEE, vol. 92, no. 2, pp. 198–218, 2004.

[156] M. O. Damen, H. El Gamal, and G. Caire, “On maximum-likelihood detection and the search for the closest lattice point,” IEEE Transactions on Information Theory, vol. 49, pp. 2389–2402, 2003.

[157] S. Verdu, Multiuser Detection. New York, NY, USA: Cambridge University Press, 1998.

[158] C. Fortin and H. Wolkowicz, “The trust region subproblem and semidefinite programming,” Optimization Methods and Software, vol. 19, no. 1, pp. 41–67, 2004.

[159] R. Stern and H. Wolkowicz, “Indefinite Trust Region Subproblems and Nonsymmetric Eigenvalue Perturbations,” SIAM Journal on Optimization, vol. 5, pp. 286–313, 1995.

[160] W. Press and W. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.

[161] V. Rayward-Smith, I. Osman, C. Reeves, and G. Smith, Modern Heuristic Search Methods. Wiley, New York, 1996.

[162] J. Benesty, Y. Huang, and J. Chen, “A fast recursive algorithm for optimum sequential signal detection in a BLAST system,” IEEE Transactions on Signal Processing, vol. 51, no. 7, pp. 1722–1730, 2003.

[163] M. Biguesh and A. Gershman, “Training-Based MIMO Channel Estimation: A Study of Estimator Tradeoffs and Optimal Training Signals,” IEEE Transactions on Signal Processing, vol. 54, no. 3, pp. 884–893, 2006.

[164] G. Forney Jr and G. Ungerboeck, “Modulation and coding for linear Gaussian channels,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2384–2415, 1998.

[165] U. Wachsmann, R. Fischer, and J. Huber, “Multilevel codes: theoretical concepts and practical design rules,” IEEE Transactions on Information Theory, vol. 45, no. 5, pp. 1361–1391, 1999.

[166] G. Forney Jr, “Trellis shaping,” IEEE Transactions on Information Theory, vol. 38, no. 2, pt. 2, pp. 281–300, 1992.

[167] P. Fortier, A. Ruiz, and J. Cioffi, “Multidimensional signal sets through the shell construction for parallel channels,” IEEE Transactions on Communications, vol. 40, no. 3, pp. 500–512, 1992.

[168] F. Kschischang and S. Pasupathy, “Optimal nonuniform signaling for Gaussian channels,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 913–929, 1993.

[169] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in lattices,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2201–2214, 2002.

[170] G. Foschini, “Wireless Communication in a Fading Environment When Using Multi-Element Antennas,” Bell Labs Technical Journal, vol. 1, no. 2, pp. 41–59, 1996.

[171] D. Raphaeli and A. Gurevitz, “Constellation shaping for pragmatic turbo-coded modulation with high spectral efficiency,” IEEE Transactions on Communications, vol. 52, no. 3, pp. 341–345, 2004.

[172] F. Sun and H. van Tilborg, “Approaching capacity by equiprobable signaling on the Gaussian channel,” IEEE Transactions on Information Theory, vol. 39, no. 5, pp. 1714–1716, 1993.

[173] A. Calderbank and L. Ozarow, “Nonequiprobable signaling on the Gaussian channel,” IEEE Transactions on Information Theory, vol. 36, no. 4, pp. 726–740, 1990.

[174] A. Wiesel, Y. Eldar, and A. Beck, “Maximum Likelihood Estimation in Linear Models With a Gaussian Model Matrix,” IEEE Signal Processing Letters, vol. 13, no. 5, pp. 292–295, 2006.

[175] A. Wiesel, Y. Eldar, and A. Yeredor, “Linear Regression With Gaussian Model Uncertainty: Algorithms and Bounds,” IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2194–2205, 2008.

[176] G. Forsythe and G. Golub, “On the Stationary Values of a Second-Degree Polynomial on the Unit Sphere,” SIAM Journal on Applied Mathematics, vol. 13, pp. 1050–1068, 1965.

[177] B. Flannery, W. Press, S. Teukolsky, and W. Vetterling, Numerical Recipes in C. Press Syndicate of the University of Cambridge, New York, 1992.

[178] A. Logothetis, V. Krishnamurthy, and J. Holst, “A Bayesian EM algorithm for optimal tracking of a maneuvering target in clutter,” Signal Processing, vol. 82, no. 3, pp. 473–490, 2002.

[179] M. Beal, Z. Ghahramani, J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, “The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures,” Bayesian Statistics 7, 2003.

[180] N. Friedman, “The Bayesian structural EM algorithm,” vol. 98, 1998.

[181] A. Gallo, G. Vitetta, and E. Chiavaccini, “A BEM-based algorithm for soft-in soft-output detection of co-channel signals,” IEEE Transactions on Wireless Communications, vol. 3, no. 5, pp. 1533–1542, 2004.

[182] S. Kay, Fundamentals of Statistical Signal Processing, Volume 2: Detection Theory. Prentice Hall PTR, 1998.

[183] S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by Simulated Annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.

[184] S. Boyd and L. Vandenberghe, Introduction to Convex Optimization with Engineering Applications, 2000.

[185] A. Kharab and R. Guenther, An Introduction to Numerical Methods: A MATLAB Approach. Chapman & Hall/CRC, 2006.

[186] H. Liu and G. Li, OFDM-Based Broadband Wireless Networks: Design and Optimization. Wiley-Interscience, 2005.

[187] E. Akay and E. Ayanoglu, “Full frequency diversity codes for single input single output systems,” in Proc. IEEE 60th Vehicular Technology Conference (VTC2004-Fall), vol. 3, pp. 1870–1874, 2004.

[188] N. Seshadri, “Joint data and channel estimation using blind trellis search techniques,” IEEE Transactions on Communications, vol. 42, no. 2/3/4, pp. 1000–1011, 1994.

[189] A. Anastasopoulos and K. Chugg, “Iterative equalization/decoding of TCM for frequency-selective fading channels,” in Conference Record of the Thirty-First Asilomar Conference on Signals, Systems & Computers, vol. 1, pp. 178–181, 1997.

[190] T. Roman, M. Enescu, and V. Koivunen, “Time-domain method for tracking dispersive channels in MIMO OFDM systems,” vol. 4, pp. 393–396, 2003.

[191] W. Chen and R. Zhang, “Kalman-filter channel estimator for OFDM systems in time and frequency-selective fading environment,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 377–380, 2004.

[192] S. Haykin, Adaptive Filter Theory. Prentice-Hall, Upper Saddle River, NJ, USA, 1996.

[193] W. Fuller, Measurement Error Models. Wiley, New York, 1987.

[194] M. Dong, L. Tong, and B. Sadler, “Optimal Pilot Placement for Channel Tracking in OFDM,” in Proc. MILCOM, vol. 1, pp. 602–606, 2002.

[195] S. Song, A. Singer, and K. Sung, “Soft input channel estimation for turbo equalization,” IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 2885–2894, 2004.

[196] H. Wymeersch, F. Simoens, and M. Moeneclaey, “Code-aided channel tracking for OFDM,” in Proc. IEEE International Symposium on Turbo Codes & Related Topics, Munich, April 2006.

[197] G. Al-Rawi, T. Al-Naffouri, A. Bahai, and J. Cioffi, “An iterative receiver for coded OFDM systems over time-varying wireless channels,” in Proc. IEEE International Conference on Communications (ICC), vol. 5, pp. 3371–3376, 2003.

[198] M. Kobayashi, J. Boutros, and G. Caire, “Successive interference cancellation with SISO decoding and EM channel estimation,” IEEE Journal on Selected Areas in Communications, vol. 19, no. 8, pp. 1450–1460, 2001.

[199] F. Hampel, “Contributions to the theory of robust estimation,” Ph.D. dissertation, University of California, Berkeley, 1968.

[200] ——, “A general qualitative definition of robustness,” The Annals of Mathematical Statistics, vol. 42, no. 6, pp. 1887–1896, 1971.

[201] P. Papantoni-Kazakos and R. Gray, “Robustness of estimators on stationary observations,” The Annals of Probability, vol. 7, no. 6, pp. 989–1002, 1979.

[202] D. Donoho and P. Huber, “The notion of breakdown point,” A Festschrift for Erich L. Lehmann in Honor of His Sixty-fifth Birthday, pp. 157–184, 1982.

[203] P. Huber, Robust Statistics. Wiley, New York, 1981.

[204] P. J. Huber, “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.

[205] A. Beaton and J. Tukey, “The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data,” Technometrics, vol. 16, no. 2, pp. 147–185, 1974.

[206] R. Zamar, “Robust estimation in the errors-in-variables model,” Biometrika, vol. 76, no. 1, pp. 149–160, 1989.

[207] F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel, Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986.

[208] S. Chan, Z. Zhang, and K. Tse, “A new robust Kalman filter algorithm under outliers and system uncertainties,” in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), pp. 4317–4320, 2005.

[209] A. Ben-Tal and A. Nemirovskii, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2001.

[210] T. Abe and T. Matsumoto, “Space-time turbo equalization in frequency-selective MIMO channels,” IEEE Transactions on Vehicular Technology, vol. 52, no. 3, pp. 469–475, 2003.

[211] Y. Bar-Shalom, X. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation. Wiley, New York, 2001.

[212] S. Kalyani and K. Giridhar, “Mitigation of Error Propagation in Decision Directed OFDM Channel Tracking Using Generalized M Estimators,” IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 1659–1672, 2007.

[213] M. Tsatsanis, G. Giannakis, and G. Zhou, “Estimation and equalization of fading channels with random coefficients,” Signal Processing, vol. 53, no. 2-3, pp. 211–229, 1996.

[214] D. Simon, Optimal State Estimation: Kalman, H Infinity, and Nonlinear Approaches. Wiley-Interscience, 2006.

[215] G. Welch and G. Bishop, “An Introduction to the Kalman Filter,” University of North Carolina at Chapel Hill, Chapel Hill, NC, 1995.

[216] S. Kalyani and K. Giridhar, “Leverage Weighted Decision Directed Channel Tracking for OFDM Systems,” in Proc. IEEE International Conference on Communications (ICC), vol. 6, pp. 2899–2904, 2006.

[217] Y. Fan, G. Peters, and S. Sisson, “Automating and evaluating reversible jump MCMC proposal distributions,” Statistics and Computing, pp. 1–13.

[218] J. Choi and Y. Lee, “Optimum pilot pattern for channel estimation in OFDM systems,” IEEE Transactions on Wireless Communications, vol. 4, no. 5, pp. 2083–2088, 2005.

[219] Z. Ben-Haim and Y. Eldar, “Minimax Estimators Dominating the Least-Squares Estimator,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 53–56, 2005.

[220] P. Bello, “Characterization of Randomly Time-Variant Linear Channels,” IEEE Transactions on Communications Systems, vol. 11, no. 4, pp. 360–393, 1963.

[221] H. Minn and V. Bhargava, “An investigation into time-domain approach for OFDM channel estimation,” IEEE Transactions on Broadcasting, vol. 46, no. 4, pp. 240–248, 2000.

[222] M. Raghavendra and K. Giridhar, “Improving channel estimation in OFDM systems for sparse multipath channels,” IEEE Signal Processing Letters, vol. 12, no. 1, pp. 52–55, 2005.

[223] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974.

[224] V. D. Nguyen and M. Pätzold, “Estimation of the Channel Impulse Response Length and the Noise Variance for OFDM Systems,” in Proc. IEEE Vehicular Technology Conference, vol. 61, no. 1, pp. 429–433, 2005.

[225] I. Nevat, G. Peters, and J. Yuan, “OFDM CIR Estimation with Unknown Length via Bayesian Model Selection and Averaging,” IEEE Vehicular Technology Conference, pp. 1413–1417, 2008.

[226] S. Brooks, P. Giudici, and G. Roberts, “Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 65, no. 1, pp. 3–39, 2003.

[227] B. Carlin and S. Chib, “Bayesian model choice via Markov chain Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 3, pp. 473–484, 1995.

[228] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan, “An Introduction to MCMC for Machine Learning,” Machine Learning, vol. 50, no. 1-2, pp. 5–43, 2003.

[229] A. Doucet and X. Wang, “Monte Carlo methods for signal processing: a review in the statistical signal processing context,” IEEE Signal Processing Magazine, vol. 22, no. 6, pp. 152–170, 2005.

[230] P. Green, “Trans-dimensional Markov chain Monte Carlo,” Highly Structured Stochastic Systems, vol. 27, pp. 179–198, 2003.

[231] S. Sisson, “Transdimensional Markov Chains: A Decade of Progress and Future Perspectives,” Journal of the American Statistical Association, vol. 100, no. 471, pp. 1077–1090, 2005.

[232] F. Liang, C. Liu, and R. Carroll, “Stochastic Approximation in Monte Carlo Computation,” Journal of the American Statistical Association, vol. 102, no. 477, pp. 305–320, 2007.

[233] M. Bédard and J. Rosenthal, “Optimal scaling of Metropolis algorithms: Heading toward general target distributions,” The Canadian Journal of Statistics, vol. 36, no. 4, pp. 483–503, 2008.

[234] G. Roberts, A. Gelman, and W. Gilks, “Weak convergence and optimal scaling of random walk Metropolis algorithms,” Annals of Applied Probability, vol. 7, pp. 110–120, 1997.

[235] Y. Atchade and J. Rosenthal, “On adaptive Markov chain Monte Carlo algorithms,” Bernoulli, vol. 11, no. 5, pp. 815–828, 2005.

[236] C. Andrieu and E. Moulines, “On the ergodicity properties of some adaptive MCMC algorithms,” Annals of Applied Probability, vol. 16, no. 3, pp. 1462–1505, 2006.

[237] C. Andrieu, E. Moulines, and P. Priouret, “Stability of Stochastic Approximation under Verifiable Conditions,” in Proc. 44th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC’05), pp. 6656–6661, 2005.

[238] J. Zhang and F. Liang, “Convergence of stochastic approximation algorithms under irregular conditions,” Statistica Neerlandica, vol. 62, no. 3, pp. 393–403, 2008.

[239] Y. Fan, S. Brooks, and A. Gelman, “Output Assessment for Monte Carlo Simulations via the Score Statistic,” Journal of Computational and Graphical Statistics, vol. 15, no. 1, pp. 178–206, 2006.

[240] H. Van Trees, Detection, Estimation, and Modulation Theory, Part I: Detection, Estimation, and Linear Modulation Theory. Wiley, New York, 1968.

[241] A. Nosratinia, T. Hunter, and A. Hedayat, “Cooperative communication in wireless networks,” IEEE Communications Magazine, vol. 42, no. 10, pp. 74–80, 2004.

[242] J. Laneman and G. Wornell, “Distributed space-time-coded protocols for exploiting cooperative diversity in wireless networks,” IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2415–2425, 2003.

[243] S. Lin and D. Costello, Error control coding: fundamentals and applications. Prentice Hall, 1983.

[244] P. Anghel and M. Kaveh, “Exact symbol error probability of a cooperative network in a Rayleigh-fading environment,” IEEE Transactions on Wireless Communications, vol. 3, no. 5, pp. 1416–1421, 2004.

[245] R. W. Reeves and A. N. Pettitt, “A theoretical framework for approximate Bayesian computation,” Proceedings of the International Workshop for Statistical Modelling, pp. 393–396, 2005.

[246] H. Rauch, F. Tung, and C. Striebel, “Maximum likelihood estimates of linear dynamic systems,” AIAA Journal, vol. 3, no. 8, pp. 1445–1450, 1965.

[247] Z. Chen, “Bayesian filtering: From Kalman filters to particle filters, and beyond,” Technical report, Adaptive Systems Laboratory, McMaster University, Hamilton, ON, Canada.

“Education is the best provision for the journey to old age.”

Aristotle