Topics in Statistical Signal Processing for Estimation and Detection in Wireless Communication Systems
by Ido Nevat B.Sc. (Electrical Engineering), Technion - Israel Institute of Technology, Israel, 1998.
A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy
in the Faculty of Engineering School of Electrical Engineering and Telecommunications The University of New South Wales
December 2009
Originality Statement
I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.
Signature:
Date:
Copyright Statement
I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International. I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.
Signature:
Date:
Authenticity Statement
I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.
Signature:
Date:
“What is a scientist after all? It is a curious man looking through a keyhole, the keyhole of nature, trying to know what’s going on.”
Jacques-Yves Cousteau

THE UNIVERSITY OF NEW SOUTH WALES
Abstract
Faculty of Engineering School of Electrical Engineering and Telecommunications
Doctor of Philosophy
by Ido Nevat
During the last decade there has been a steady increase in the demand for high data rates and strong reliability in wireless communication applications. Among the solutions proposed to meet this demand, the use of multiple antennas stands out as one of the best candidates, since it increases both reliability and information transmission rate. A Multiple Input Multiple Output (MIMO) structure usually assumes a frequency non-selective characteristic for each channel. However, when the transmission rate is high, the whole channel can become frequency selective. Therefore, Orthogonal Frequency Division Multiplexing (OFDM), which transforms a frequency selective channel into a large set of individual frequency non-selective narrowband channels, is well suited for use in conjunction with MIMO systems.
A MIMO system employing OFDM, denoted MIMO-OFDM, is able to achieve high spectral efficiency. However, the adoption of multiple antenna elements at the transmitter for spatial transmission results in a superposition of multiple transmitted signals at the receiver, weighted by their corresponding multipath channels. This in turn results in difficulties with reception, and imposes a real challenge on how to design a practical system that can offer a true spectral efficiency improvement.
In addition, as wireless networks continue to expand in geographical size, the distance between the source and the destination may preclude direct communication between them. In such scenarios, a repeater is placed between the source and the destination to achieve end-to-end communication. Advances in electronics and semiconductor technologies have made relay based systems feasible, and as a result these systems have become a very active research topic in the wireless community in recent years. Potential application areas of cooperative diversity include next generation cellular networks, mobile wireless ad-hoc networks, and mesh networks for wireless broadband access. Besides increasing network coverage, relays can provide additional diversity to combat the effects of the wireless fading channel. This thesis is concerned with methods to facilitate the use of MIMO, OFDM and relay based systems.
In the first part of this thesis, we concentrate on low complexity algorithms for the detection of symbols in MIMO systems, under various degrees of quality of channel state information. First, we design algorithms for the case where perfect Channel State Information (CSI) is available at the receiver. Next, we design algorithms for the detection of non-uniform symbol constellations where only partial CSI is available at the receiver. These are based on non-convex and stochastic optimisation techniques.
The second part of this thesis addresses key issues in OFDM systems. We first concentrate on the design of an OFDM receiver: an iterative receiver that performs detection, decoding and channel tracking, and aims at minimising the error propagation effect caused by erroneous detection of data symbols. Next we turn our attention to channel estimation in OFDM systems where the number of channel taps and the power delay profile are both unknown a priori. Using Trans Dimensional Markov Chain Monte Carlo (TDMCMC) methodology, we design algorithms that perform joint model order selection and channel estimation.
The third part of this thesis is dedicated to the detection of data symbols in relay systems with non-linear relay functions, where only partial CSI is available at the receiver. In order to design the optimal data detector, the likelihood function needs to be evaluated at the receiver. Since in this case the likelihood function cannot be obtained analytically, nor even in closed form, we utilise a “likelihood free” inference methodology, based on Approximate Bayesian Computation (ABC) theory, to design novel data sequence detectors.

Acknowledgements
During the course of this thesis I have met and interacted with many interesting people, who influenced the path of my research.

I would first like to thank my academic supervisor Dr. Jinhong Yuan, for supporting me over the years, and for giving me the freedom to explore my ideas. His constant friendly attitude, encouragement, technical insights and constructive criticism have been invaluable. I would also like to thank him for his financial support, which enabled me to give my research work full attention and travel to international conferences.

I have been very fortunate to work with Dr. Gareth Peters of the School of Mathematics at the University of NSW. His patience and willingness to share his vast knowledge are inestimable. I will cherish our fruitful discussions and his helpful comments, both on the whiteboard and during our numerous coffee breaks.

Many thanks go to Dr. Miquel Payaró of CTTC, Barcelona, who helped me while I was very confused at the beginning of my studies.

I am most grateful to Dr. Ami Wiesel and Dr. Yonina Eldar of Electrical Engineering at the Technion for helping me gain a better understanding of Bayesian inference and optimisation techniques.

Thanks also go to Dr. Scott Sisson and Dr. Yanan Fan of the School of Mathematics at the University of NSW for sharing their knowledge of Bayesian statistics.

I would also like to thank all my friends and colleagues at the Wireless Communication Lab at UNSW with whom I have had the pleasure of working over the years: Imitiaz, Tom, Adeel, Marwan, Anisul, Jonas, David, Giovanni, Tuyen and Nam Tram.

Thanks also go to the UNSW staff, especially to Joseph Yiu, who was always willing to lend a hand with any technical matter, and to Gordon Petzer and May Park for taking care of all administrative issues.

Many thanks also go to the people of ACoRN, especially Dr. Lars Rasmussen and Christine Thursby, for arranging academic activities and making sure we had sufficient funding.
I would like to thank my family for their unwavering support and love throughout my life; without them, none of this would have been possible.
Finally I would like to thank my partner Dr. Karin Avnit, for too many reasons to list here.
Dedicated to my Parents
“And since you know you cannot see yourself,
so well as by reflection, I, your glass,
will modestly discover to yourself,
that of yourself which you yet know not of.”
William Shakespeare
Contents
Originality Statement iii
Copyright Statement iv
Authenticity Statement iv
Abstract vi
Acknowledgements viii
List of Figures xix
List of Tables xxi
List of Algorithms xxiii
Acronyms xxiv
Notations and Symbols xxx
1 Introduction 1 1.1 Motivation ...... 1 1.2 Outline of the dissertation ...... 4 1.3 Research contributions ...... 6
2 Bayesian Inference and Analysis 9 2.1 Introduction ...... 9 2.2 Background ...... 9 2.3 Bayesian Inference ...... 10 2.3.1 Prior Distributions ...... 11 2.3.2 Point Estimates ...... 13 2.3.3 Interval Estimation ...... 16 2.4 Bayesian Model Selection ...... 16
2.4.1 Introduction ...... 16 2.5 Bayesian Estimation Under Unknown Model Order ...... 18 2.5.1 Bayesian Model Averaging ...... 18 2.5.2 Bayesian Model Order Selection ...... 19 2.6 Bayesian Filtering ...... 19 2.6.1 State Space Models ...... 20 2.6.2 Sequential Bayesian Inference ...... 21 2.6.3 Filtering Objectives ...... 22 2.6.4 Sequential Scheme ...... 22 2.6.5 Linear State-Space Models - the Kalman Filter ...... 23 2.7 Consistency Tests ...... 25 2.8 The EM and Bayesian EM Methods ...... 26 2.9 Lower Bounds on the MSE ...... 28 2.10 Bayesian Methodology Summary ...... 29 2.11 Monte Carlo Methods ...... 29 2.11.1 Motivation ...... 29 2.11.2 Monte Carlo Techniques ...... 30 2.11.3 Sampling From Distributions ...... 31 2.11.4 Inversion Sampling ...... 32 2.11.5 Accept-Reject Sampling ...... 33 2.11.6 Importance Sampling ...... 34 2.12 Markov Chain Monte Carlo Methods ...... 35 2.12.0.1 Introduction ...... 35 2.12.1 Basics of Markov Chains ...... 36 2.12.2 Metropolis Hastings Sampler ...... 37 2.12.3 Gibbs Sampler ...... 38 2.12.4 Simulated Annealing ...... 40 2.12.4.1 Introduction ...... 40 2.12.4.2 Methodology ...... 40 2.12.5 Convergence Diagnostics of MCMC ...... 41 2.12.5.1 Burn in Period ...... 41 2.12.5.2 Autocorrelation Time Series ...... 41 2.13 Trans Dimensional Markov Chain Monte Carlo ...... 42 2.13.1 Introduction ...... 42 2.13.1.1 Posterior densities as proposal densities ...... 44 2.13.1.2 Independent sampler ...... 44 2.13.1.3 Standard Metropolis-Hastings ...... 44 2.14 Stochastic Approximation Markov Chain Monte Carlo ...... 45 2.14.1 Stochastic Approximation ...... 45 2.14.2 Stochastic Approximation Markov Chain Monte Carlo ...... 46 2.15 Approximate Bayesian Computation ...... 47 2.15.1 Introduction ...... 48 2.15.2 Basic ABC Algorithm ...... 48 2.15.3 Data Summaries ...... 50 2.15.4 MCMC-ABC Samplers ...... 50 2.15.5 Distance Metrics ...... 52 2.15.6 ABC methodology summary ...... 53 2.16 Concluding Remarks ...... 54
3 Introduction to Wireless Communication 55 3.1 Introduction ...... 55 3.2 Modeling of Fading Channels ...... 55 3.2.1 Tapped Delay-line Channel Model ...... 57 3.2.2 Doppler Offset ...... 57 3.2.3 Power Delay Profile ...... 58 3.2.4 Coherence Bandwidth ...... 58 3.2.5 Coherence Time ...... 58 3.2.6 Time Selective and Fast Fading Channels ...... 58 3.2.7 Slow Fading Channels ...... 59 3.2.8 Frequency Selective Channels ...... 59 3.2.9 Flat Fading Channels ...... 60 3.3 Channel Models ...... 60 3.3.1 Rayleigh Fading Channels ...... 60 3.3.2 Clarke’s / Jake’s Model ...... 61 3.3.3 Approximation of Jake’s Model ...... 62 3.4 Overview of Multi Antenna Communication Systems ...... 63 3.4.1 The Linear MIMO Channel ...... 63 3.4.2 Channel Model ...... 63 3.4.3 Uncertainty Models for the Channel State Information ...... 64 3.5 Detection Techniques in MIMO Systems ...... 65 3.5.1 Linear Detectors ...... 66 3.5.2 VBLAST Detector ...... 66 3.5.3 Sphere Decoder ...... 67 3.6 Overview of OFDM Systems ...... 69 3.6.1 OFDM Signals and Orthogonality ...... 69 3.6.2 OFDM Symbols Transmission ...... 70 3.6.3 Multi Carrier versus Single Carrier Modulation Schemes ...... 74 3.7 Channel Estimation in OFDM Systems ...... 75 3.7.1 Pilot Aided Channel Estimation ...... 76 3.7.2 Blind Channel Estimation ...... 77 3.7.3 Semi Blind Channel Estimation ...... 78 3.7.4 MIMO-OFDM System Model ...... 79 3.8 Channel Coding ...... 79 3.8.1 Linear Codes ...... 79 3.8.2 Convolutional Coding ...... 80 3.8.3 BICM Technique ...... 80 3.9 Iterative Processing Techniques ...... 81 3.9.1 The Turbo Principle ...... 81 3.9.2 Iterative Detection, Decoding and Estimation ...... 82 3.9.3 Iterative Detector and Decoder Components ...... 82 3.10 Relay Based Communication Systems ...... 84 3.10.1 Introduction ...... 84 3.10.2 Relay System Model ...... 84 3.10.3 MAP Detection in Memoryless Relay Functions ...... 86 3.11 Concluding Remarks ...... 87
4 Detection in MIMO Systems using Power Equality Constraints 89 4.1 Introduction ...... 89 4.2 Background ...... 90 4.3 System Description ...... 91 4.4 Power Equality Constraint Least Square Detection ...... 93 4.4.1 Basic Definitions and Problem Settings ...... 93 4.4.2 Constraint LS Detection for a Specific Power Group ...... 96 4.4.3 PEC-LS Detection with QAM Modulation ...... 97 4.4.4 Ordered PEC-LS Detection with Reduced Number of Power Groups . . . 99 4.5 Improved Ordered Power Equality Constraint Detection ...... 102 4.6 Efficient Implementation and Complexity Analysis ...... 104 4.6.1 Efficient Implementation of Constrained LS Detector ...... 105 4.6.2 Overall Complexity ...... 108 4.7 Simulation Results ...... 109 4.7.1 System Configuration ...... 110 4.7.2 Discussion of Simulation Results ...... 110 4.7.3 Complexity Assessment ...... 113 4.8 Chapter Summary and Conclusions ...... 115
5 Detection of Gaussian Constellations in MIMO Systems under Imperfect CSI 117 5.1 Introduction ...... 117 5.2 Background ...... 118 5.3 System Description ...... 119 5.3.1 Pilot Aided Maximum Likelihood Channel Estimation ...... 121 5.3.2 Detection in a Mismatched Receiver ...... 121 5.4 Bayesian Detection under Channel Uncertainty ...... 123 5.4.1 Optimal MAP Detection ...... 123 5.4.2 Linear MMSE Detection ...... 124 5.4.3 Hidden Convexity Based MAP Detector ...... 124 5.4.4 Bayesian EM based MAP Detector ...... 126 5.4.4.1 Initial guess of x0 ...... 128 5.4.5 MAP Detection with Unknown Noise Variance ...... 128 5.4.6 Efficient Implementation and Complexity Analysis ...... 131 5.5 Simulation Results ...... 132 5.5.1 System Configuration ...... 133 5.5.2 Constellation Design ...... 133 5.5.3 Comparison of Detection Techniques ...... 134 5.6 Chapter Summary and Conclusions ...... 139 5.7 Appendix I ...... 140 5.8 Appendix II ...... 141
6 Iterative Receiver for Joint Channel Tracking and Decoding in BICM-OFDM Systems 143 6.1 Introduction ...... 143 6.2 Background ...... 144 6.3 System Description ...... 147 6.3.1 BICM-OFDM Transmitter ...... 147 6.3.2 Channel Model ...... 148 6.3.3 Autoregressive Modeling ...... 149 6.4 Receiver Structure ...... 150 6.4.1 Soft Demodulator ...... 150 6.4.2 MAP Decoder ...... 152 6.4.3 Channel Tracking with Known Symbols ...... 153 6.4.4 Decision-Directed Based Channel Tracking ...... 153 6.4.5 Robust Estimation ...... 155 6.4.6 Channel Tracking using Adaptive Detection Selection ...... 156 6.4.7 Tracking Quality Indicator using Consistency Test ...... 158 6.4.8 Adaptive Detection Selection Algorithm ...... 159 6.4.9 Discussion - Soft versus Hard Kalman Filter ...... 159 6.5 Estimation Error Analysis ...... 160 6.6 Simulation Results ...... 163 6.6.1 System Configuration ...... 163 6.6.2 Kalman Filter Consistency Tests ...... 163 6.6.3 BER and Channel Estimation Error Results ...... 164 6.7 Chapter Summary and Conclusions ...... 169 6.8 Appendix - LLR Values of the A posteriori Probabilities ...... 170
7 Channel Estimation in OFDM Systems using Trans Dimensional MCMC 173 7.1 Introduction ...... 173 7.2 Background ...... 174 7.3 System Description ...... 175 7.3.1 Channel Model ...... 176 7.4 Channel Estimation with Unknown Channel Length ...... 177 7.4.1 Channel Estimation using Bayesian Model Averaging ...... 177 7.4.2 Channel Estimation using Bayesian Model Order Selection ...... 178 7.4.3 Complexity Issues ...... 179 7.4.4 Simulation Results ...... 179 7.5 Channel Estimation with Unknown Channel Length and Unknown PDP ...... 182 7.6 Trans Dimensional Markov chain Monte Carlo ...... 184 7.6.1 Specification of Within-Model Moves: Metropolis-Hastings within Gibbs ...... 188 7.6.1.1 Specification of Transition Kernel T(h^(t-1)_{1:L^(t-1)} → h*_i) ...... 188 7.6.1.2 Specification of Transition Kernel T(β^(t-1) → β*) ...... 188 7.6.2 Specification of the Between-Model Moves Transition Kernel ...... 189 7.7 Design of Between-Model Birth and Death Proposal Moves ...... 190 7.7.1 Algorithm 1: Basic Birth Death Moves ...... 190 7.7.2 Algorithm 2: Stochastic Approximation TDMCMC ...... 190 7.7.3 Algorithm 3: Conditional Path Sampling TDMCMC ...... 194 7.7.3.1 Generic Construction of the CPS proposal ...... 194 7.8 Complexity Analysis ...... 198 7.9 Estimator Efficiency via Bayesian Cramér-Rao Type Bounds ...... 198 7.10 Simulation Results ...... 203 7.10.1 System Configuration and Algorithms Initialization ...... 203 7.10.2 Model Sensitivity Analysis ...... 204 7.10.2.1 Sensitivity of Model Order to Prior Choice Pr(L) ...... 204 7.10.2.2 Sensitivity of Model Order to the True Decay Rate β ...... 204 7.10.2.3 Analysis of Posterior Precision for Marginals ...... 205 7.10.2.4 Estimated Pairwise Marginal Posterior Distributions ...... 206 7.10.3 Comparative Performance of Algorithms ...... 207 7.10.4 Algorithm Performance ...... 209 7.11 Chapter Summary and Conclusions ...... 213 7.12 Appendix ...... 214
8 Bayesian Symbol Detection in Wireless Relay Networks Using “Likelihood Free” Inference 217 8.1 Introduction ...... 217 8.1.1 Relay Communications ...... 218 8.1.2 Model and Assumptions ...... 222 8.1.2.1 Prior Specification and Posterior ...... 223 8.1.2.2 Evaluation of the Likelihood Function ...... 224 8.1.3 Inference and MAP Sequence Detection ...... 224 8.2 Likelihood-Free Methodology ...... 225 8.3 Algorithm 1 - MAP Sequence Detection via MCMC-ABC ...... 227 8.3.1 Observations and Synthetic Data ...... 227 8.3.2 Summary Statistics ...... 228 8.3.3 Distance Metric ...... 229 8.3.4 Weighting Function ...... 229 8.3.5 Tolerance Schedule ...... 229 8.3.6 Performance Diagnostic ...... 231 8.4 Algorithm 2 - MAP Sequence Detection via Auxiliary MCMC ...... 231 8.5 Alternative MAP Detectors and Lower Bound Performance ...... 233 8.5.1 Sub-optimal Exhaustive Search Zero Forcing Approach ...... 233 8.5.2 Lower Bound MAP Detector Performance ...... 234 8.6 Simulation Results ...... 234 8.6.1 Analysis of Mixing and Convergence of MCMC-ABC Methodology . . . . 235 8.6.2 Analysis of ABC Model Specifications ...... 237 8.6.3 Comparisons of Detector Performance ...... 237 8.7 Chapter Summary and Conclusions ...... 240
9 Conclusions and Future Work 241 9.1 Conclusions ...... 241 9.2 Future Work ...... 243
10 Appendix 245 10.1 Properties of Gaussian Distribution ...... 245 10.2 Definition of circularity ...... 246 10.3 Bayesian Derivation of the Kalman Filter ...... 246 10.3.1 Introduction ...... 246 10.3.2 MMSE Derivation of Kalman Filter ...... 247 10.3.3 MAP Derivation of Kalman Filter ...... 249 10.4 The EM algorithm ...... 252 10.4.1 Introduction ...... 252 10.4.2 Derivation of EM ...... 252 10.4.3 BEM algorithm ...... 254
List of Figures
2.1 Parameter estimation criteria based on the marginal posterior distribution . . . . 15 2.2 The inverse transform method to obtain samples ...... 32 2.3 Sample path of a bivariate distribution using the Gibbs sampler ...... 39
3.1 Types of fading channels ...... 56 3.2 MIMO system model ...... 64 3.3 Idea behind the sphere decoder ...... 68 3.4 Block diagram of OFDM transceiver ...... 70 3.5 Frequency domain representation of three OFDM subcarriers ...... 71 3.6 Rate 1/2 convolutional encoder ...... 80 3.7 Block diagram of a BICM encoder ...... 81 3.8 Block diagram of a BICM decoder ...... 81 3.9 Iterative receiver for coded systems ...... 83 3.10 Parallel Relay Channels with one source, L relay nodes and one destination . . . 85
4.1 Quantum Power Level for 16QAM modulation ...... 94 4.2 Decision Boundaries for Φ1 and Φ3 ...... 103 4.3 Decision Boundaries for Φ2 ...... 104 4.4 BER performance of a MIMO system with 16QAM M = N = 2 ...... 111 4.5 BER performance of a MIMO system with 16QAM M = N = 4 ...... 112 4.6 BER performance of a MIMO system with 64QAM M = N = 2 ...... 112 4.7 BER performance of a MIMO system with 64QAM M = 2,N = 4 ...... 113 4.8 Average number of groups ...... 114
5.1 MIMO system model ...... 120 5.2 (Discrete) finite Gaussian constellation of 16-PAM ...... 120 5.3 Near-Gaussian constellation of 16-PAM with λ = 1/40 ...... 134 5.4 BER performance of MIMO systems with N = 2,M = 1 ...... 135 5.5 BER performance of MIMO systems with N = 4,M = 2 ...... 136 5.6 BER performance of MIMO systems with M = N = 4 ...... 137 5.7 BER performance of MIMO systems with M = N = 8 ...... 137 5.8 BER performance comparison between BEM and BEM-Gibbs detectors . . . . . 138 5.9 Average number of iterations of the BEM algorithm for different starting points . 138
6.1 Block diagram of the BICM-OFDM transmitter ...... 148 6.2 Block diagram of the BICM-OFDM iterative receiver ...... 151
6.3 Number of subcarriers used for different values of confidence interval ...... 164 6.4 BER performance comparison of ADS versus USE-ALL for v = 90 km/hr ...... 165 6.5 Channel estimation error performance comparison of ADS versus USE-ALL ...... 166 6.6 BER performance comparison of ADS versus GENIE-AIDED for ...... 167 6.7 BER performance comparison of ADS versus EM-KALMAN ...... 168 6.8 BER performance comparison of ADS versus IRLS-HM ...... 168
7.1 Channel model with unknown number of taps and power delay profile ...... 176 7.2 Channel estimation MSE performance of BMA and BMOS estimators ...... 180 7.3 BER performance comparison of BMA and BMOS estimators ...... 180 7.4 CIR length estimation, L = 8, SNR = 10 dB ...... 181 7.5 MAP channel order estimation, K ∈ {32, 64, 128}, L = 8 ...... 181 7.6 Sensitivity of MAP estimate from Pr(L|y) to prior mean λ ...... 205 7.7 Sensitivity of MAP estimate from Pr(L|y) to β ...... 206 7.8 Sensitivity of MAP estimate from Pr(L|y) to β ...... 207 7.9 Marginal distribution of L, Pr(L|y) versus SNR ...... 208 7.10 Pairwise marginal posterior distributions for p(h_i, h_j|y) ...... 209 7.11 Average MSE of the marginal posterior model probability, Pr(L = 8|y) ...... 210 7.12 MSE performance for the OFDM system using CPS-TDMCMC algorithm ...... 211 7.13 BER performance of CPS-TDMCMC algorithm ...... 212
8.1 Two hop relay system with L relay nodes ...... 220 8.2 Comparison of performance for MCMC-ABC with different distance metrics ...... 236 8.3 Maximum distance between the edf and the baseline “true” edf ...... 238 8.4 SER performance of the proposed detector schemes ...... 239

List of Tables
2.1 Summary of point estimators ...... 16 2.2 Bayesian estimation under model uncertainty ...... 20
3.1 OFDM system models summary ...... 74
4.1 Relation between Ω(2,S), Φ(S) and G(2,S) ...... 95 4.2 Decision Boundaries for Φ1 and Φ3 ...... 104 4.3 Decision Boundaries for Φ2 ...... 105 4.4 Complexity of PEC-LS detector components ...... 110 4.5 Number of iterations in the line search as a function of power groups ...... 114
7.1 Computational complexity of within-model moves of the TDMCMC algorithms . 199 7.2 Computational complexity of the between-model moves of BD-TDMCMC algorithm199 7.3 Computational complexity of the CPS-TDMCMC algorithm ...... 200
List of Algorithms
1 Inversion sampling algorithm ...... 32 2 Accept-Reject sampling algorithm ...... 33 3 Importance sampling algorithm ...... 34 4 Metropolis-Hastings algorithm ...... 37 5 Gibbs sampling algorithm ...... 39 6 Generic TDMCMC algorithm ...... 44 7 Rejection algorithm 1 ...... 48 8 Rejection algorithm 2 ...... 49 9 ABC algorithm 1 : ǫ-tolerance rejection ...... 49 10 ABC algorithm 2 ...... 50 11 ABC-MCMC sampler ...... 52 12 PEC-LS detector ...... 100 13 OPEC based detector for MIMO systems ...... 101 14 Constrained LS soft detection of the ω-th power group ...... 109 15 Consistency Test and Adaptive Subcarriers Selection ...... 159 16 Generic TDMCMC Algorithm ...... 187 17 SA-TDMCMC: Between-Model Moves Transition Kernel ...... 193 18 CPS-TDMCMC: Between-Model Moves Transition Kernel ...... 196
19 Sampling (h*_{1:L*}, β*, L*) via CPS method ...... 197 20 MAP sequence detection algorithm using MCMC-ABC ...... 230 21 MAP sequence detection algorithm using AV-MCMC ...... 232
Acronyms
“There are three kinds of lies: lies, damned lies, and statistics.”
Benjamin Disraeli
TDMA Time Division Multiple Access
ABC Approximate Bayesian Computation
ACF Autocorrelation Function
ADS Adaptive Detection Selection
AF Amplify and Forward
APP A Posteriori Probabilities
a.s. almost surely
AR Auto Regressive
ARMA Auto Regressive Moving Average
AV Auxiliary Variable
AWGN Additive White Gaussian Noise
BCRLB Bayesian Cramér-Rao Lower Bound
BD Birth-Death
BEM Bayesian Expectation Maximisation
BER Bit Error Rate
BICM Bit Interleaved Coded Modulation
BIM Bayesian Information Matrix
BMA Bayesian Model Averaging
BMOS Bayesian Model Order Selection
BPSK Binary Phase Shift Keying
cdf cumulative distribution function
CF Compress and Forward
CFO Carrier Frequency Offset
CFR Channel Frequency Response
CIR Channel Impulse Response
CP Cyclic Prefix
CPS Conditional Path Sampling
CRLB Cramér-Rao Lower Bound
CSI Channel State Information
DD Decision Directed
DF Decode and Forward
DFT Discrete Fourier Transform
ECC Error Control Coding
edf empirical distribution function
EF Estimate and Forward
EIV Errors In Variables
EM Expectation Maximisation
FD Frequency Domain
FEC Forward Error Correction
FIR Finite Impulse Response
FFT Fast Fourier Transform
HD Hard Decision
IBI Inter Block Interference
ICI Inter Carrier Interference
IDFT Inverse Discrete Fourier Transform
IFFT Inverse Fast Fourier Transform
IG Inverse Gamma
i.i.d. independent and identically distributed
IOPEC Improved Ordered Power Equality Constraint
IRLS Iteratively Reweighted Least Squares
ISI Inter Symbol Interference
KF Kalman filter
LDSSM Linear Dynamic State Space Model
LLR Log Likelihood Ratio
LMMSE Linear Minimum Mean Squared Error
LOS Line Of Sight
LS Least Squares
LU Lower Upper
MA Moving Average
MMAP Marginal Maximum A Posteriori
MAP Maximum A Posteriori
MB Maxwell Boltzmann
MC Multi Carrier
MCMC Markov Chain Monte Carlo
MF Matched Filter
MH Metropolis Hastings
MIMO Multiple Input Multiple Output
MISO Multiple Input Single Output
ML Maximum Likelihood
MLSE Maximum Likelihood Sequence Estimator
MSE Mean Squared Error
MST Most Significant Taps
MMSE Minimum Mean Squared Error
MV Minimum Variance
NIS Normalized Innovation Squared
NLOS Non Line Of Sight
NSE Normalized Squared Error
OFDM Orthogonal Frequency Division Multiplexing
OPEC Ordered Power Equality Constraint
PAPR Peak to Average Power Ratio
PAM Pulse Amplitude Modulation
PASM Pilot Symbol Aided Modulation
PEC Power Equality Constraint
pdf probability density function
PDP Power Delay Profile
pmf probability mass function
PN Phase Noise
PS Parallel to Serial
PSD Power Spectral Density
PSP Per Survivor Processing
PTM Probability Transition Matrix
QAM Quadrature Amplitude Modulation
QPL Quantum Power Level
QPSK Quadrature Phase Shift Keying
RF Radio Frequency
RJMCMC Reversible Jump Markov Chain Monte Carlo
RMS Root Mean Square
r.v. random variable
SA Stochastic Approximation
SAMC Stochastic Approximation Monte Carlo
SC Single Carrier
SD Soft Decision
SDR Semi-Definite Relaxation
SER Symbol Error Rate
SES Sub-Optimal Exhaustive Search
SES-ZF Sub-Optimal Exhaustive Search - Zero Forcing
SIMO Single Input Multiple Output
SISO Single Input Single Output
SMC Sequential Monte Carlo
SNR Signal to Noise Ratio
SP Serial to Parallel
s.t. such that
SVD Singular Value Decomposition
TD Time Domain
TDMCMC Trans Dimensional Markov Chain Monte Carlo
TRS Trust Region Subproblem
VBLAST Vertical Bell Labs Layered Space Time
VCO Voltage Controlled Oscillator
w.p. with probability
WSSUS Wide Sense Stationary Uncorrelated Scattering
ZF Zero Forcing
Notations and Symbols
“We could, of course, use any notation we want; do not laugh at notations; invent them, they are powerful. In fact, mathematics is, to a large extent, invention of better notations.”
Richard P. Feynman
It shall be assumed that a random variable x can be defined on a probability space of the form (E, ǫ, P), where E represents the space of all outcomes, which may be either discrete or continuous and may be of multiple dimensions; ǫ represents σ(E), the sigma algebra generated by the space E; and P is a probability measure on E. The notation p(dx) shall be used to represent the law or distribution of the random variable x, which is a probability measure given by the image measure on the space in question. In this thesis, all models considered will be defined on either discrete or continuous, open or compact subsets of Euclidean space. Furthermore, it shall be assumed that all distributions of interest admit densities with respect to either the counting measure or the Lebesgue measure.
We now introduce some notation that will be used throughout the thesis. Boldface upper-case letters denote matrices, boldface lower-case letters denote column vectors, and lower-case italics denote scalars.
N, Z, R, C  The sets of all natural, integer, real and complex numbers, respectively.
R+  The set of all strictly positive real numbers.
Z^{n×m}, R^{n×m}, C^{n×m}  The sets of n×m matrices with integer-, real- and complex-valued entries, respectively. If m = 1, the index can be dropped.
X^T  Transpose of the matrix X.
X^*  Conjugate of the matrix X.
X^H  Complex conjugate and transpose (Hermitian) of the matrix X.
X^{-1}  Inverse of the matrix X.
Tr{A}  Trace of the matrix A.
|X|, or Det(X)  Determinant of the matrix X.
vec(X)  Vector constructed with the elements in the diagonal of matrix X.
diag(x)  Matrix constructed with the elements of x on its main diagonal.
(X)†  Moore-Penrose pseudo-inverse of the matrix X.
I or I_n  Identity matrix and identity matrix of dimension n×n, respectively.
[H]_{ij}  (i,j)-th component of the matrix H.
|x|  Magnitude of the complex scalar x.
||x||^2  Squared Euclidean norm of the vector x: ||x||^2 = x^H x.
⌊x⌋  Largest integer smaller than or equal to x.
⌈x⌉  Smallest integer larger than or equal to x.
max, min  Maximum and minimum.
∝  Equal up to a scaling factor (proportional).
≜  Defined as.
CN(m, C)  Complex circularly symmetric Gaussian vector distribution with mean m and covariance matrix C.
N(m, C)  Gaussian vector distribution with mean m and covariance matrix C.
IG(α, β)  Inverse Gamma distribution with shape parameter α and scale parameter β.
χ²_K  Chi-squared distribution with K degrees of freedom.
Poi(·; λ)  Poisson distribution with mean λ.
U[α, β]  Uniform distribution on the support [α, β].
p_X(x), p(x)  Probability density function of the random variable x.
Pr(x)  Probability mass function of the random variable x.
p(x|y)  Conditional distribution of x given y.
p(x, y)  Joint distribution of x and y.
z ∼ p(z)  z is distributed according to p(z).
|A|  Cardinality of the set A, i.e., the number of elements in A.
x_{0:n}  Sequence of vectors x_0, ..., x_n.
x_{-k}  Vector with the k-th component missing: x_{-k} ≜ [x_1, x_2, ..., x_{k-1}, x_{k+1}, ..., x_K].
E{·}  Mathematical expectation.
arg  Argument.
inf  Infimum (greatest lower bound).
lim  Limit.
log(·)  Natural logarithm.
log_a(·)  Base-a logarithm.
Re{·}  Real part.
Im{·}  Imaginary part.
→_{a.s.}  Almost sure convergence.
→_d  Convergence in distribution.
⊗  Matrix Kronecker product.
δ(·)  Dirac delta function.
I(·)  Indicator function.
T(y)  Summary statistics of y.
ρ(x, y)  Distance metric between x and y.
O(N)  Computational complexity of order N operations.
Chapter 1
Introduction
“In theory, theory and practice are the same. In practice, they are not.”
Lawrence Peter ”Yogi” Berra
1.1 Motivation
Claude Shannon, an engineer at Bell Labs, is possibly the most important figure in the field of communication theory. Among the many fundamental results in his well-known paper “A Mathematical Theory of Communication” [1], of special importance is his discovery of the capacity formula. Shannon proved that reliable communication between a transmitter and a receiver is possible even in the presence of noise, and laid the foundation of a new field of research, known as information theory. For the particular case of a unit-gain band-limited continuous channel corrupted by Additive White Gaussian Noise (AWGN), Shannon obtained his celebrated formula for the capacity
C = W log((P_T + P_N) / P_N),  (1.1)

where W, P_T and P_N represent the bandwidth, the average transmitted power, and the noise power, respectively.
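As a quick numerical illustration of (1.1) (a sketch of my own, not from the thesis; the function name and example numbers are illustrative), the logarithmic dependence means that doubling the transmit power buys less than one extra bit/s/Hz of spectral efficiency at high SNR:

```python
import math

def shannon_capacity(bandwidth_hz, p_transmit, p_noise):
    """AWGN capacity C = W * log2((P_T + P_N) / P_N), using a base-2
    logarithm so that the result is measured in bits per second."""
    return bandwidth_hz * math.log2((p_transmit + p_noise) / p_noise)

# 1 MHz of bandwidth at 20 dB SNR:
c1 = shannon_capacity(1e6, 100.0, 1.0)
# Doubling the transmit power gains well under 1 bit/s/Hz:
c2 = shannon_capacity(1e6, 200.0, 1.0)
```

Doubling the bandwidth, in contrast, doubles the pre-log factor directly, which is part of why the linear (rather than logarithmic) capacity growth of MIMO systems discussed below is so attractive.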
The available electromagnetic bandwidth, W, and the maximum radiated power, P_T, are subject to fundamental physical constraints as well as regulations and practical constraints, and are therefore limited. One approach to increasing capacity is to emit more power. However, due to the logarithmic dependence of the spectral efficiency on the transmitted power, this would be extremely expensive. It may also violate regulatory power masks and cause nonlinearity in the power amplifier. Moreover, the effects of electromagnetic radiation on people’s well-being should also be taken into consideration. A second approach to increasing capacity would be to utilise a wider electromagnetic band. However, as mentioned before, the radio spectrum is a scarce, and therefore very expensive, resource. Consequently, designers face the challenge of designing wireless systems that are capable of providing increased data rates and improved performance while utilising existing frequency bands and channel conditions.
Foschini and Telatar showed in [2] and [3] that by using multiple antennas at the transmitter and the receiver, Multiple Input Multiple Output (MIMO) systems can increase the capacity without increasing the transmission power or using a wider band, provided the channel exhibits rich scattering and its variations are accurately tracked by the receiver. More specifically, the capacity of a MIMO system can grow, in principle, linearly with the minimum of the number of inputs and outputs.
MIMO systems are therefore a key technology for fulfilling the requirements of future communication networks, and over the last decade there has been substantial growth of research activity in this area.
A fundamental task of any receiver is the detection of the data transmitted by the transmitter. The optimal data detection in MIMO systems, i.e. the estimation of the transmitted data, can be carried out using a brute force search over all possible codewords. However, such a search results in an exponential explosion in the search space, prohibiting its use. Many suboptimal but less complex schemes have been suggested. Still, data detection in MIMO systems is an ongoing field of research, especially under channel uncertainty.
The OFDM transmission technique [4], [5], [6], [7] has been widely investigated over the last 40 years. OFDM is an attractive multi-carrier modulation technique because of its high spectral efficiency and simple single-tap equaliser structure, as it splits the entire bandwidth into a number of overlapping narrowband subchannels requiring lower symbol rates. Furthermore, Inter Symbol Interference (ISI) and Inter Carrier Interference (ICI) can be easily eliminated by inserting a Cyclic Prefix (CP) in front of each transmitted OFDM block. However, despite these attractive features, some problems remain to be solved. Reliable estimation of time-varying parameters, such as channel fluctuations, frequency offset and synchronisation, while reducing the overhead of pilot symbols, is a challenging task. Another challenge is to perform channel estimation in the absence of a priori knowledge of physical properties such as the number of channel taps and the Power Delay Profile (PDP).
Another technique which has recently gained attention is cooperative, or relay-assisted, communication, where several distributed terminals cooperate to transmit/receive their intended signals [8]. This technique is based on the seminal works of the 1970s by van der Meulen [8] and by Cover and El Gamal [9], which introduced a new component, the relay terminal.
In this scenario, the source wishes to transmit a message to the destination, but obstacles degrade the source-destination link quality, or the source-destination distance is too large for robust communication. The message sent by the source is also received by the relay terminal, which can re-transmit that message to the destination. The destination may combine the transmissions from the source and the relay in order to decode the message. This architecture exhibits some properties of MIMO systems, and is therefore known as a virtual MIMO system. In contrast to conventional MIMO systems, relay-assisted transmission is able to combat the channel impairments due to shadowing and path loss, since these effects in the source-destination and relay-destination links are statistically independent. The detection of the transmitted data is a challenging task in relay systems. This is because in many cases the relay performs non-linear processing on the received signal before re-transmitting it to the destination node. In many cases, this prohibits the use of classical estimation techniques, since the likelihood function cannot be obtained analytically.
In light of the above, this dissertation is devoted to tackling practical problems in communication systems under realistic conditions, such as data detection in MIMO and relay systems, and channel estimation and receiver design for OFDM systems.
In the first part of this dissertation, the focus is on the design of low complexity data detection schemes for MIMO systems. In the first instance, we develop low complexity symbol detection algorithms when full Channel State Information (CSI) is provided at the receiver. By relaxing the non-convex discrete constellation constraint, and replacing it with a non-convex continuous constraint that exploits the specific structure of the symbol constellation, we are able to design algorithms with different levels of complexity and performance.
Secondly, we design low complexity algorithms for data detection of non-uniform symbol constellations in MIMO systems when only partial (noisy estimate) CSI is available at the receiver. The resulting non-convex optimisation problem is solved efficiently using the hidden convexity methodology. An alternative solution based on the Bayesian Expectation Maximisation (BEM) methodology is then presented and compared to the hidden convexity based solution in terms of Bit Error Rate (BER) and computational complexity. We then extend this methodology to the case where the noise variance is unknown a priori and needs to be estimated jointly. This is achieved using the concept of annealed Gibbs sampling coupled with the BEM approach. The algorithms presented are suitable for both uniform and non-uniform symbol constellations.
In the second part of this dissertation, we focus on receiver design for OFDM systems. First, we design an iterative receiver for OFDM systems under high mobility. We propose an algorithm for joint demodulation, data decoding and channel tracking that aims at minimising the error propagation effect due to erroneous detection of data symbols. By monitoring the health parameters of the tracking unit, an algorithm is devised for subcarrier subset selection, leading to a decreased error propagation effect and, as a result, improved performance.
In the second instance, we design a channel estimation scheme for OFDM systems in the absence of a priori knowledge of the Channel Impulse Response (CIR) length and with an unknown PDP decay rate. This is achieved using the Trans-Dimensional Markov Chain Monte Carlo (TDMCMC) methodology to obtain samples from the intractable posterior distribution and perform numerical integration. We develop three novel algorithms that are based on different sampling methodologies.
The third part of this dissertation is dedicated to data detection in relay systems with non-linear relay functions. This problem is challenging since in most cases there is no analytical expression for the likelihood function. We utilise a “Likelihood Free” inference methodology named Approximate Bayesian Computation (ABC) in order to circumvent this problem. We present three novel algorithms: the first is based on Markov Chain Monte Carlo (MCMC)-ABC; the second is based on an auxiliary MCMC methodology, in which we consider the estimation problem in an augmented space. While this enables the use of a traditional MCMC approach, it comes at the expense of a larger state space. The third algorithm is based on a suboptimal exhaustive search Zero Forcing (ZF) detector to perform the detection of the transmitted symbols. This detector is based on known summary statistics of the channel model, conditional on the mean of the noise at the relay nodes, and allows an explicit exhaustive search over the parameter space of code words. We compare these three approaches and discuss the scenarios in which each should be used.
1.2 Outline of the dissertation
In general terms, the scope of this dissertation is the design of wireless communication systems. The main attention is given to two fundamental problems in wireless communication systems, namely detection of the transmitted symbols and channel estimation. In particular, we consider three system configurations throughout this dissertation. These are OFDM, MIMO and relay systems.
The outline of each of the chapters is as follows:
Chapter 1 presents the motivation of this dissertation. It also presents the outline and lists the contributions of this dissertation.
Chapter 2 - This chapter provides an overview of some basic concepts in Bayesian inference that will be used extensively throughout this dissertation. The aim is to give the reader a thorough understanding of the Bayesian methodology, not only to make later chapters comprehensible, but also to present the method for general scientific inference.
Chapter 3 - The basics of modern digital communication systems employing channel estimation, channel coding and modulation/detection are described in this chapter. In particular, we provide an overview of different channel models, MIMO and OFDM systems, as well as channel estimation and data detection algorithms. General mathematical models representing typical characteristics of these main components, as well as the multipath fading transmission channel, are presented. The estimation objectives for several digital communication problems are also stated.
Chapter 4 - This chapter deals with the ubiquitous problem of data detection in MIMO systems. We present low complexity algorithms that are based on tightening the constraints of the Zero Forcing (ZF) based detector. This leads to a non-convex optimisation problem that can be solved efficiently. We present several algorithms with different levels of complexity and performance.
Chapter 5 - Here we present algorithms for data detection in MIMO systems with only partial CSI at the receiver. The detector is formulated as a non-convex problem that can be solved efficiently using the hidden convexity methodology. We also present a competitive approach based on the BEM methodology. An extension to the case where the noise variance is not known a priori is also presented, using a stochastic optimisation methodology based on the concept of the annealed Gibbs sampler.
Chapter 6 - In this chapter we develop a complete receiver design for OFDM systems, suitable for high mobility. Based on the Turbo principle, we present an algorithm to perform demodulation, decoding and channel tracking. Special attention is given to the problem of error propagation, which can lead to poor performance due to divergence of the tracking module. We present an approach to deal with this problem by monitoring the health parameters and adapting the number of subcarriers used accordingly. This leads to reduced misspecification of the state-space model, enabling overall performance superior to that of other methods.
Chapter 7 - This chapter deals with the problem of channel estimation in OFDM systems where the number of channel taps and the PDP decay rate are unknown a priori. We formulate the problem under the Bayesian framework and show that the solution involves solving intractable integrals. In order to circumvent this problem we use the TDMCMC methodology to sample from the intractable posterior density and perform numerical integration. In particular, we develop three novel algorithms and analyse quantities of interest, such as computational complexity, convergence, sensitivity to different choices of priors, and BER.
Chapter 8 - In this chapter we consider the problem of detection of transmitted data in relay systems, where only partial CSI is available at the destination node. We formulate the problem under the Bayesian framework and show that a direct solution is infeasible because no analytical expression for the likelihood function is available. We therefore use a “Likelihood Free” methodology to design a novel sampling scheme based on the MCMC-ABC algorithm. We also develop two alternative solutions: an auxiliary variable MCMC approach, in which the addition of auxiliary variables results in closed form expressions for the full conditional posterior distributions; and a detector that involves an approximation based on known summary statistics of the channel model and an explicit exhaustive search over the parameter space of code words. We study the performance of each algorithm under different settings of our relay system model, and make recommendations regarding the choice of algorithms and various parameters.
Chapter 9: Here we conclude the dissertation and give some topics for future research.
Note that the chapters with original contributions are Chapters 4, 5, 6, 7 and 8.
1.3 Research contributions
To a certain extent, the chapters in this thesis are self contained and can be read independently. The main contributions of this dissertation are:
• Low complexity data detection algorithms for MIMO systems with different degrees of CSI quality.
• Design of an iterative receiver for OFDM systems under high mobility conditions.
• Channel estimation for OFDM, with no a priori knowledge of the number of channel taps and the PDP decay rate.
• Detection of transmitted data in relay systems, with non-linear relay functions and partial CSI at the receiver.
In the following, a detailed list of the research contributions in each chapter is presented.
Chapter 4 The main results of this chapter deal with the design of low complexity algorithms for data detection in MIMO systems. These results have been accepted for publication in a journal.
• I. Nevat, T. Yang, K. Avnit, and J. Yuan, “Detection for MIMO Systems with High-level Modulations using Power Equality Constraints,” accepted for publication in IEEE Transactions on Vehicular Technology, January 2010.
Chapter 5 The main results of this chapter deal with the design of low complexity data detection algorithms for MIMO systems, with only partial CSI available at the receiver. These results have been published in three conference papers and one journal.
• I. Nevat, A. Wiesel, J. Yuan, and Y.C. Eldar, “Maximum a-posteriori Estimation in Linear Models With a Gaussian Model Matrix,” in Proc. IEEE Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2007.
• I. Nevat, G.W. Peters, and J. Yuan, “Maximum A-Posteriori Estimation in Linear Models With a Random Gaussian Model Matrix: a Bayesian-EM Approach,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP08), Las Vegas, Nevada, USA, April 2008.
• I. Nevat, G.W. Peters, and J. Yuan, “MAP Estimation in Linear Models With a Random Gaussian Model Matrix: Algorithms and Complexity,” in Proc. IEEE International Symposium on Personal, Indoor and Mobile Radio Communications 2008 (PIMRC08), Cannes, France, October 2008.
• I. Nevat, G.W. Peters, and J. Yuan, “Detection of Gaussian Constellations in MIMO Systems under Imperfect CSI,” IEEE Transactions on Communications, vol. 58, no. 3, March 2010.
Chapter 6 The main results of this chapter deal with the design of an iterative receiver for OFDM systems. These results have been published in three conference papers and one journal.
• I. Nevat and J. Yuan, “Channel Tracking For OFDM Systems Using Measurements Pruning,” in Proc. NEWCOM-ACoRN Workshop, Vienna, Austria, September 2006.
• I. Nevat and J. Yuan, “Channel Tracking Using Pruning for MIMO-OFDM Systems Over Gauss-Markov Channels,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP07), Hawaii, USA, April 2007.
• I. Nevat and J. Yuan, “Error Propagation Mitigation for Iterative Channel Tracking, Detection and Decoding of BICM-OFDM Systems,” in Proc. IEEE International Symposium on Wireless Communication Systems 2007 (ISWCS07), Trondheim, Norway, October 2007. This paper won the best paper award.
• I. Nevat and J. Yuan, “Joint Channel Tracking and Decoding for BICM-OFDM Systems using Consistency Tests and Adaptive Detection Selection,” IEEE Transactions on Vehicular Technology, vol. 58, no. 8, pp. 4316-4328, October 2009.
Chapter 7 The main results of this chapter deal with the design of algorithms for channel estimation in OFDM systems, with no a priori knowledge of the number of taps of the channel and its power delay decay rate. These results have been published in two conference papers and one journal.
• I. Nevat, G.W. Peters, and J. Yuan, “OFDM CIR Estimation with Unknown Length via Bayesian Model Selection and Averaging,” in Proc. IEEE Vehicular Technology Conference (VTC08), Singapore, May 2008.
• I. Nevat, G.W. Peters, and J. Yuan, “Channel Estimation in OFDM Systems with Unknown Power Delay Profile using Trans-dimensional MCMC via Stochastic Approximation,” in Proc. IEEE Vehicular Technology Conference (VTC09), Barcelona, Spain, April 2009.
• I. Nevat, G.W. Peters, and J. Yuan, “Channel Estimation in OFDM Systems with Unknown Power Delay Profile using Trans-dimensional MCMC,” IEEE Transactions on Signal Processing, vol. 57, no. 9, pp. 3545-3561, September 2009.
Chapter 8 The main results of this chapter deal with the design of algorithms for the detection of the transmitted data in relay systems with non-linear functions and partial CSI at the receiver. These results have been published in one conference paper and one journal.
• I. Nevat, G.W. Peters and J. Yuan, “Coherent Detection for Cooperative Networks with Arbitrary Relay Functions using “Likelihood Free” Inference,” in Proc. NEWCOM-ACoRN Workshop, Barcelona, Spain, March 2009.
• G.W. Peters, I. Nevat, S. Sisson, Y. Fan and J. Yuan, “Bayesian symbol detection in wireless relay networks via likelihood-free inference,” accepted for publication in IEEE Transactions on Signal Processing, February 2010.
Other contributions not presented in this dissertation
• I. Nevat, G.W. Peters, S. Sisson, Y. Fan and J. Yuan, “Model Selection, Detection and Channel Estimation for Relay Systems,” to be submitted to IEEE Transactions on Signal Processing.
• C. Kim, T. Lehmann, S. Nooshabadi and I. Nevat, “An Ultra-Wideband Transceiver Architecture for Wireless Endoscopes,” in Proc. IEEE International Symposium on Communications and Information Technologies (ISCIT’07), Sydney, Australia, October 2007.

Chapter 2
Bayesian Inference and Analysis
“Making predictions is hard, especially about the future.”
Niels Bohr
2.1 Introduction
This chapter sets out the essential statistical theory required to understand the material presented in the subsequent chapters. An emphasis is placed on a particular interpretation of general statistical theory, known as Bayesian statistics.
2.2 Background
Our everyday experiences can be summarised as a series of decisions to take actions which manipulate our environment in some way or another. We base our decisions on the results of predictions or inferences of quantities that have some bearing on our quality of life, and we come to arrive at these inferences based on models of what we expect to observe.
Models are designed to capture salient trends or regularities in the observed data with a view to predicting future events. For the majority of real applications the data are far too complex or the underlying processes not nearly well enough understood for the modeller to design a perfectly accurate model. If this is the case, we can hope only to design models that are simplifying approximations of the true processes that generated the data.
The purpose of Bayesian inference is to provide a mathematical machinery that can be used for modeling systems, where the uncertainties of the system are taken into account and the decisions are made according to rational principles. The tools of this machinery are probability distributions and the rules of probability calculus.
In probabilistic inference there are two well known approaches, namely the frequentist approach and the Bayesian approach. In the frequentist approach (a.k.a. the classical approach), the parameters of interest are handled as deterministic with unknown values and the estimation is based solely on the observations. This approach is often associated with the work of R.A. Fisher, R. von Mises, J. Neyman and E. Pearson.
The Bayesian approach takes a different view. In a Bayesian analysis, all the quantities have a probability distribution associated with them. The performance of different estimators is assessed based on the average performance over different realizations of the parameters of interest. Bayes’ rule provides a means of updating the distribution over parameters from the prior to the posterior distribution in light of observed data. In theory, the posterior distribution captures all information inferred from the data about the parameters. This posterior is then used to make optimal decisions or predictions, or to select between models.
The Bayesian framework depends on the existence of a priori distributions. These priors reflect the user’s knowledge about the quantities of interest before any data have been considered. Many researchers advocating the so-called classical methods of inference oppose the use of priors. They prefer to regard these quantities as having unknown but deterministic values to which it is meaningless to assign a distribution as if it were random. Furthermore, it is often argued that assigning a prior to a proposition means that the resulting inference will be affected by the subjective prior decided by the user. While all these arguments are of great importance, we will not pursue this debate here. We simply note that both the classical and the Bayesian frameworks of inference have their respective merits, and the user should try to be pragmatic and choose the framework most appropriate for their application.
2.3 Bayesian Inference
The Bayesian paradigm is used extensively in estimation theory. Bayesian inference [10] is so named as it centres around Bayes’ rule
p(x|y) = p(y|x) p(x) / p(y)
       = p(y|x) p(x) / ∫_Ω p(y|x) p(x) dx,  (2.1)

where Ω denotes the parameter space of x. The basic building blocks of a Bayesian model are the prior model, p(x), containing our original beliefs about x before observing the data, and the likelihood model, p(y|x), determining the stochastic mapping from the parameter x to the measurement y. Using Bayes’ rule, it is possible to infer an estimate of the parameter from the measurement. The distribution of the parameter x conditioned on the observed measurement y, p(x|y), is called the posterior distribution, and it represents the state of knowledge about the parameter when all the information in the observed measurement y and the model is used.
The distribution p (y) is the evidence or marginal likelihood and is not directly a function of the model parameter x because it is integrated out:
p(y) = ∫_Ω p(y|x) p(x) dx.  (2.2)

One of the major problems within Bayesian inference is the calculation of the marginal likelihood (2.2). The calculation of this normalising constant usually complicates matters since this integral is analytically intractable in most practically useful scenarios, possibly due to its high dimensional nature. One thus has to resort to approximation techniques.
A second, and related, problem is that of marginalisation of variables. Consider a random vector that can be partitioned as x = (x_1, x_2). The marginal density p(x_1|y) can be obtained through:

p(x_1|y) = ∫ p(x_1, x_2|y) dx_2.  (2.3)

In general, this integral is analytically intractable.
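The mechanics of (2.1)-(2.3) can be made concrete with a brute-force grid approximation (an illustrative sketch with a Gaussian prior and likelihood of my own choosing, not a construction from the thesis): the evidence is the sum that normalises the posterior, and marginal quantities are obtained by summing the posterior over the grid.

```python
import math

# Scalar example: prior x ~ N(0, 1), one observation y ~ N(x, 1).
dx = 0.01
xs = [i * dx for i in range(-500, 501)]            # grid over [-5, 5]
prior = [math.exp(-0.5 * x * x) for x in xs]        # unnormalised prior p(x)

y = 1.2
lik = [math.exp(-0.5 * (y - x) ** 2) for x in xs]   # likelihood p(y|x)

unnorm = [l * p for l, p in zip(lik, prior)]
evidence = sum(unnorm) * dx                         # numerical version of (2.2)
posterior = [u / evidence for u in unnorm]          # Bayes' rule (2.1)

# Posterior mean; for this conjugate Gaussian pair it equals y/2 = 0.6.
post_mean = sum(x * p for x, p in zip(xs, posterior)) * dx
```

Such exhaustive gridding only works in one or two dimensions; the sampling methods used later in the thesis exist precisely because this normalisation step is infeasible in high dimensions.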
There is a multitude of references available on this topic, such as [11], [12], [13] and [14].
2.3.1 Prior Distributions
In order to use Bayesian inference, one needs knowledge of the prior distribution p (x) in (2.1). Hence there is the need for principles that can produce the initial distributions required. These are solely based on whatever information we have beforehand. We briefly present the most popular approaches for the choice of a prior distribution.
Non-informative priors - In many practical situations no reliable prior information concerning x exists, or inference based solely on the data is desirable. In this case we typically wish to define a prior distribution p(x) that contains no information about x, in the sense that it does not favor one x value over another. We may refer to a distribution of this kind as a noninformative prior for x and argue that the information contained in the posterior about x stems from the data only. In the case that the parameter space is x ∈ {x_1, . . . , x_n}, i.e., discrete and finite, the distribution

p(x_i) = 1/n,  i ∈ {1, . . . , n}  (2.4)

places the same prior probability on any candidate x value. Likewise, in the case of a bounded continuous parameter space, say Φ = [a, b], −∞ < a < b < ∞, the uniform distribution

p(x_i) = 1/(b − a),  a < x_i < b  (2.5)

appears to be noninformative. The uniform prior, however, is not invariant under reparameterisation. Thus, an uninformative prior can become, under a different parameterisation of the model, an informative one. One approach that overcomes this difficulty is Jeffreys’ prior [15], [16], given by
p(x) ∝ √I(x),  (2.6)

where I(x) is the expected Fisher information matrix, whose ij-th element is

[I]_ij = −E_{y|x} { ∂² log p(y|x) / ∂x_i ∂x_j }.  (2.7)

This rule has the property that the prior is invariant under any transformation that may be performed on x. Although this invariance property is desirable, there are certain situations where Jeffreys’ prior cannot be applied [17].
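As a standard textbook instance of (2.6)-(2.7) (not an example from the thesis), for a Bernoulli(θ) likelihood the Fisher information is I(θ) = 1/(θ(1 − θ)), so Jeffreys’ prior is p(θ) ∝ θ^(−1/2)(1 − θ)^(−1/2), a Beta(1/2, 1/2) density that places extra mass near the endpoints:

```python
import math

def fisher_info_bernoulli(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_unnormalised(theta):
    """Jeffreys' prior (2.6): proportional to sqrt(I(theta))."""
    return math.sqrt(fisher_info_bernoulli(theta))

# The prior is smallest at theta = 0.5 and grows toward the edges:
centre = jeffreys_unnormalised(0.5)   # sqrt(1 / 0.25) = 2
edge = jeffreys_unnormalised(0.05)
```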
Conjugate priors - Conjugate prior families were first formalised by Raiffa and Schlaifer [18]. They showed that when choosing a prior from a parametric family, some choices may be more computationally convenient than others. The advantage of using a conjugate family of distributions is that it is generally straightforward to calculate the posterior distribution, which makes it useful in many situations. In particular, it may be possible to select a prior distribution p(θ) which is conjugate to the likelihood p(y|θ), that is, one that leads to a posterior p(θ|y) belonging to the same family as the prior.
It is shown in [19] that exponential families, to which likelihood functions often belong, do in fact have conjugate priors, so that this approach will typically be available in practice.
A conjugate prior is constructed by first factoring the likelihood function into two parts. One factor must be independent of the parameter(s) of interest but may be dependent on the data. The second factor is a function dependent on the parameter(s) of interest and dependent on the data only through the sufficient statistics. The conjugate prior family is defined to be proportional to this second factor. It can be shown [18] that the posterior distribution arising from the conjugate prior is itself a member of the same family as the conjugate prior.
As an example that will be useful in Chapter 5, consider the following posterior:
p(θ|y) ∝ p(y|θ) p(θ),  (2.8)

where the prior p(θ) follows an Inverse Gamma (IG) distribution:

p(θ) = IG(θ; α, β) = (β^α / Γ(α)) θ^(−α−1) exp(−β/θ),  (2.9)

where α and β are the shape parameter and scale parameter, respectively. The Gamma function, Γ(z), is defined by

Γ(z) ≜ ∫_0^∞ t^(z−1) exp{−t} dt,  (2.10)

for a complex number z with positive real part Re[z] > 0.
The posterior in (2.8) follows a conjugacy structure. This is because the prior, p(θ), follows the IG distribution and the likelihood, p(y|θ), follows a Normal distribution. Due to the conjugacy property of the Normal-Inverse Gamma model [20], the posterior follows an IG distribution:
p(θ|y) = IG(θ; ᾱ, β̄),  (2.11)

with posterior hyperparameters

ᾱ = α + N/2,  (2.12a)
β̄ = β + (1/2) Σ_{i=1}^N |y[i] − μ_i|²,  (2.12b)

where N is the number of elements in y. The mode of p(θ|y), which is the location where the probability density function attains its maximum value, can be written as

mode[p(θ|y)] = β̄ / (ᾱ + 1).  (2.13)
This will be useful in finding the Maximum A Posteriori (MAP) estimate which will be defined in the next Section.
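The update (2.12a)-(2.12b) and the mode (2.13) are simple enough to code directly. The following sketch uses hypothetical numbers and a function name of my own:

```python
def ig_posterior(alpha, beta, y, mu):
    """Normal-Inverse-Gamma conjugate update for a variance parameter theta,
    following (2.12a)-(2.12b); the mode (2.13) gives the MAP estimate."""
    n = len(y)
    alpha_bar = alpha + n / 2.0
    beta_bar = beta + sum(abs(yi - mi) ** 2 for yi, mi in zip(y, mu)) / 2.0
    mode = beta_bar / (alpha_bar + 1.0)
    return alpha_bar, beta_bar, mode

# Prior IG(2, 1), three observations with zero means:
a_bar, b_bar, theta_map = ig_posterior(2.0, 1.0, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

Because conjugacy keeps the posterior in the IG family, no numerical integration is needed here; the entire inference reduces to updating two scalars.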
2.3.2 Point Estimates
As stated before, Bayesian analysis revolves around the posterior distribution p(x|y). However, there is more than one way to find the value of the parameter of interest. Point estimates for parameters of interest are obtained via the specification of a loss function. The construction of a point estimate, which is itself a random variable, is done by defining a suitable loss function that penalises erroneous estimates, i.e., we specify a loss function which defines the quality of an estimate. The loss function C(x, x̂(y)) : E × E → R⁺ is used to define the optimal Bayesian estimator by minimising the expected loss, as follows:
R ≜ E{C(x, x̂)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} C(x, x̂) p_xy(x, y) dx dy = ∫_{−∞}^{∞} I(x̂) p_y(y) dy,  (2.14)

where

I(x̂) = ∫_{−∞}^{∞} C(x, x̂) p_{x|y}(x|y) dx.  (2.15)

Since (2.14) is the expectation of a positive quantity I(x̂), it is sufficient to minimise (2.15).
From a probabilistic point of view, a rather appealing point estimate is provided by choosing the value that minimises the variance of the estimation error, referred to as the Minimum Variance (MV), or more commonly as the Minimum Mean Squared Error (MMSE) estimate. This estimator has a loss function defined as C(x, x̂) ≜ ‖x − x̂‖². The MMSE estimator can be derived as follows:

x̂_MV ≜ arg min_{x̂} E{ ‖x − x̂‖² | y }
     = arg min_{x̂} ( E{x^T x | y} − 2 x̂^T E{x | y} + x̂^T x̂ )  (2.16)
     = arg min_{x̂} ( ‖x̂ − E{x|y}‖² + E{‖x‖² | y} − ‖E{x|y}‖² ).

The last two terms in (2.16) are independent of x̂, which leads to the well known MMSE estimate

x̂_MV = E{x | y} = ∫ x p(x|y) dx.  (2.17)
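In practice the integral in (2.17) is often evaluated by Monte Carlo: given samples from the posterior (for instance from an MCMC scheme of the kind used later in this thesis), the MMSE estimate is simply their average. A minimal sketch, with an illustrative Gaussian of my own choosing standing in for the posterior:

```python
import random

random.seed(0)  # reproducible run

# Pretend these samples come from p(x|y); here they are drawn from N(0.6, 0.7^2).
samples = [random.gauss(0.6, 0.7) for _ in range(200_000)]

# MMSE estimate (2.17) = posterior mean, approximated by the sample average.
x_mmse = sum(samples) / len(samples)
```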
Alternatively, one may choose to minimise the hit-or-miss cost function given by

C(x, x̂(y)) = { 0 if ‖x − x̂(y)‖ ≤ ε;  1 otherwise },  (2.18)
Figure 2.1: Parameter estimation criteria based on the marginal posterior distribution where ǫ 0 is a positive scalar. Optimising this cost function yields the MAP estimator: →
\hat{x}_{MAP}(y) = \arg\max_x p(x|y)
                 = \arg\max_x \log p(x|y)   (2.19)
                 = \arg\max_x [\log p(y|x) + \log p(x)],
where we used the fact that the log function is monotonically increasing. The MAP estimate chooses the model with the highest posterior probability density (the mode).
The most frequently used location measures are the mean, the median and the mode of the posterior distribution, since they all have appealing properties. In the case of a flat prior, the mode is equal to the maximum likelihood estimate. For symmetric posterior densities the mean and the median are identical. Moreover, for unimodal symmetric posteriors all three measures coincide. These estimators are illustrated in Figure 2.1.
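As a small numerical sketch of these location measures (not taken from the thesis), consider a hypothetical skewed posterior, here a Gamma(2, 1) density evaluated on a grid; for this density the mode is 1, the mean is 2, and the median lies between them:

```python
import numpy as np

# Hypothetical skewed "posterior": a Gamma(2, 1) density on a grid.
x = np.linspace(0.0, 20.0, 200001)
dx = x[1] - x[0]
pdf = x * np.exp(-x)                # unnormalised Gamma(2,1) density
pdf /= pdf.sum() * dx               # normalise numerically

post_mode = x[np.argmax(pdf)]       # MAP estimate (posterior mode)
post_mean = (x * pdf).sum() * dx    # MMSE estimate (posterior mean)
cdf = np.cumsum(pdf) * dx
post_median = x[np.searchsorted(cdf, 0.5)]   # posterior median

print(post_mode, post_median, post_mean)     # mode < median < mean here
```

For this skewed density the three point estimates differ, illustrating why the choice of loss function matters.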
In many cases, the computational complexity involved in solving the MMSE estimate in (2.17) is too high for practical applications. Instead, a common approach is to consider the Linear Minimum Mean Squared Error (LMMSE) estimator, which achieves the smallest Mean Squared Error (MSE) among all linear estimators, i.e., estimators of the form \hat{x} = Ay + b. The LMMSE estimator satisfies the following closed form solution
\hat{x}_{LMMSE}(y) = E\{x\} + E\{xy^T\} E^{-1}\{yy^T\} (y - E\{y\}).   (2.20)
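As an illustrative sketch of the closed-form LMMSE solution (all matrix values below are made up for the example), consider a small linear-Gaussian model y = Hx + w:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian setup: x ~ N(mx, Cx), y = H x + w, w ~ N(0, R)
mx = np.array([1.0, -1.0])
Cx = np.array([[2.0, 0.5], [0.5, 1.0]])
H = np.array([[1.0, 0.3], [0.0, 1.0], [0.7, 0.7]])
R = 0.1 * np.eye(3)

# Moments entering the closed-form solution (2.20)
my = H @ mx                     # E{y}
Cxy = Cx @ H.T                  # cross-covariance of x and y
Cyy = H @ Cx @ H.T + R          # covariance of y

def lmmse(y):
    """x_hat = E{x} + Cxy Cyy^{-1} (y - E{y})."""
    return mx + Cxy @ np.linalg.solve(Cyy, y - my)

# Empirical MSE of the LMMSE estimate over many simulated draws
x = rng.multivariate_normal(mx, Cx, size=5000)
y = x @ H.T + rng.multivariate_normal(np.zeros(3), R, size=5000)
x_hat = np.array([lmmse(yi) for yi in y])
mse = np.mean(np.sum((x_hat - x) ** 2, axis=1))
print(mse)                      # far below the prior uncertainty trace(Cx) = 3
```

Since the model here is jointly Gaussian, the LMMSE estimate coincides with the MMSE estimate; in non-Gaussian settings it is only the best linear approximation.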
MMSE estimator:   \hat{x}_{MV} = E\{x|y\}.
MAP estimator:    \hat{x}_{MAP}(y) = \arg\max_x [\log p(y|x) + \log p(x)].
LMMSE estimator:  \hat{x}_{LMMSE}(y) = E\{x\} + E\{xy^T\} E^{-1}\{yy^T\} (y - E\{y\}).

Table 2.1: Summary of point estimators
Another, less common estimator yields the posterior median. Its cost function is
C(x, \hat{x}(y)) = a |x - \hat{x}(y)|,   (2.21)
where a > 0. The point estimators are summarised in Table 2.1.
2.3.3 Interval Estimation
In some cases we are not only interested in the actual value of the estimated parameter \hat{x}; we would also like to assign confidence regions on x, i.e., subsets C of the parameter space \Psi where x should lie with high probability [11]. A 100(1 - \alpha)\% credibility set for x is a subset C of \Psi such that
1 - \alpha \le p(C|y) = \int_C p(x|y) \, dx.   (2.22)
This definition enables appealing statements like "The probability that x lies in C given the observed data y is at least 1 - \alpha". This contrasts with the usual interpretation of confidence intervals, which is based on the frequency of a repeated experiment.
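As a minimal numerical sketch (the standard-normal posterior below is made up for illustration), a credibility interval can be read directly off the grid cdf of p(x|y):

```python
import numpy as np

# Hypothetical grid posterior p(x|y): a standard normal density, whose central
# 95% credibility interval is approximately (-1.96, 1.96).
x = np.linspace(-6.0, 6.0, 120001)
dx = x[1] - x[0]
post = np.exp(-0.5 * x ** 2)
post /= post.sum() * dx                 # normalise p(x|y) on the grid

cdf = np.cumsum(post) * dx
lo = x[np.searchsorted(cdf, 0.025)]     # 2.5% quantile
hi = x[np.searchsorted(cdf, 0.975)]     # 97.5% quantile
print(lo, hi)                           # roughly (-1.96, 1.96)
```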
2.4 Bayesian Model Selection
2.4.1 Introduction
Model selection can have different meanings, depending on the specific problem at hand. There are three broad approaches to understanding model selection, labeled the M_open, M_completed and M_closed modeling perspectives [13]. They can be summarised as follows:

• M_closed: takes the view that the class of models under consideration contains the true model.

• M_completed: takes the view that although a formulated belief model is known, other models are considered due to intractability of analysis.

• M_open: takes the view that none of the models under consideration completely captures the intricate relationship between the inputs and the outputs.
Throughout this thesis the problem of model selection refers to the inference of which model in an Mclosed set is the most appropriate one for describing a given observation signal. We shall also restrict ourselves to nested models. A set of models is nested when each model in the set can be described as a special case of the models in the set with higher complexity. The problem can be described as follows: suppose we have a finite set of K possible models, where M ,...,M { 1 K } are model indicators, such that:
M_1 : x = [x_1],
M_2 : x = [x_1, x_2],
  \vdots   (2.23)
M_K : x = [x_1, \ldots, x_K].
Each model is assigned a prior probability Pr(M_k). We would like to find the posterior model probabilities Pr(M_k|y), M_k \in \{1, \ldots, K\}. There are two common approaches to model selection/estimation: the Bayesian Model Order Selection (BMOS) and Bayesian Model Averaging (BMA) methods. BMOS involves the selection of the model, denoted M_i, which most accurately represents the observations, according to some criterion of interest. The BMOS posterior is expressed as
Pr(M_i|y) = \frac{p(y|M_i) Pr(M_i)}{p(y)}
          = \frac{\int p(y|x, M_i) p(x|M_i) Pr(M_i) \, dx}{\sum_{k=1}^{K} \int p(y|x, M_k) p(x|M_k) Pr(M_k) \, dx}.   (2.24)
The above analysis reflects the probability of each possible model. One can therefore choose to perform a point estimate, such as the MAP estimate of the model
\hat{M} = \arg\max_{M_i} p(y|M_i) Pr(M_i).   (2.25)
The BMOS approach establishes all inferences on one of the K possible models. However, it ignores all other models, which means that the information we have on the model uncertainty is discarded. An alternative approach is the BMA method. BMA considers several or all candidate models in a weighted manner; thus, more information is exploited and better performance can be expected. The BMA is formulated as follows:
\hat{M} = \sum_{k=1}^{K} k \, Pr(M_k|y)
        = \sum_{k=1}^{K} k \, \frac{p(y|M_k) Pr(M_k)}{p(y)}.   (2.26)
A thorough discussion on these methods can be found in [21], [22], [23].
2.5 Bayesian Estimation Under Unknown Model Order
In some cases the actual model structure is not of primary interest. Instead, one is interested in estimating a quantity x, where the number of elements of x is unknown. Examples are polynomials of unknown degree, a CIR with an unknown number of taps, Auto Regressive Moving Average (ARMA) models of different orders, Finite Impulse Response (FIR) filters of different lengths, many types of mixture models, etc. In these cases, a single integer-valued parameter, known as the model order, is sufficient to describe the model complexity. In the following sections we shall discuss two different approaches to Bayesian model selection.
2.5.1 Bayesian Model Averaging
One popular approach to estimation with unknown model order is BMA [21], [22]. In this approach, the inference is based on an average over all possible models in the model space \mathcal{M}, instead of a single "best" model. Suppose M \in \mathcal{M} = \{M_1, \ldots, M_K\}, and let x be the quantity of interest. The posterior distribution of x is given by
p(x|y) = \sum_{k=1}^{K} p(x|y, M_k) Pr(M_k|y).   (2.27)
The MMSE estimate is given by
\hat{x}_{MMSE} \triangleq E\{x|y\} = \int x \, p(x|y) \, dx.   (2.28)
In order to obtain the posterior p(x|y), we need to marginalise over the latent model order k, as follows:
p(x|y) = \sum_{k=1}^{K} p(x|y, k) Pr(k|y).   (2.29)
Therefore, the MMSE estimate in (2.28) can be expressed as
\hat{x}_{MMSE} = \int x \left( \sum_{k=1}^{K} p(x|y, k) Pr(k|y) \right) dx
              = \sum_{k=1}^{K} Pr(k|y) \int x \, p(x|y, k) \, dx   (2.30)
              = \sum_{k=1}^{K} Pr(k|y) E\{x|y, k\}.
The BMA approach performs a weighted sum of the MMSE estimates of all K possible models. Hoeting et al. showed in [21] that averaging over all the possible models k = 1, \ldots, K provides a better estimate than any single model under the logarithmic scoring criterion:
E\left\{ -\log\left( \sum_{k=1}^{K} Pr(k|y) p(x|y, k) \right) \right\} \le E\{ -\log p(x|y, j) \}, \quad j = 1, \ldots, K.   (2.31)
2.5.2 Bayesian Model Order Selection
In contrast to the BMA approach, in BMOS one first finds the most probable model using the MAP estimate of the model order, \hat{k}_{MAP}, as in (2.25), and then conditions on this estimate to obtain the MMSE estimate of x. This procedure is composed of the following two steps:
1. \hat{k}_{MAP} = \arg\max_{k \in \{1, \ldots, K\}} Pr(k|y) = \arg\max_{k \in \{1, \ldots, K\}} p(y|k) Pr(k).
2. \hat{x}_{MMSE|\hat{k}_{MAP}} \triangleq E\{x|y, \hat{k}_{MAP}\}.

The BMA and BMOS approaches are summarised in Table 2.2.
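As an illustrative sketch of Bayesian model order selection over nested models (the polynomial setup, priors and noise level below are all made up for the example), the posterior model probabilities of (2.24) can be computed in closed form for linear-Gaussian models:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_gauss(y, C):
    """log N(y; 0, C) via a slogdet/solve pair."""
    sign, logdet = np.linalg.slogdet(2.0 * np.pi * C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y))

# Hypothetical nested polynomial models M_k: y = sum_{j<k} x_j t^j + w, with
# Gaussian priors x_j ~ N(0, 1) and known noise w ~ N(0, 0.1^2); true order is 2.
t = np.linspace(-1.0, 1.0, 30)
y = 0.5 - 1.2 * t + 0.1 * rng.standard_normal(30)

K = 4
log_ev = np.empty(K)
for k in range(1, K + 1):
    H = np.vander(t, k, increasing=True)          # design matrix of model M_k
    # Marginal likelihood of a linear-Gaussian model: y|M_k ~ N(0, H H^T + sigma^2 I)
    log_ev[k - 1] = log_gauss(y, H @ H.T + 0.01 * np.eye(30))

# BMOS posterior Pr(M_k|y) under a uniform model prior, as in (2.24)-(2.25)
post = np.exp(log_ev - log_ev.max())
post /= post.sum()
print(post.round(3))          # mass should concentrate near the true order k = 2
```

The marginal likelihood automatically penalises over-complex models (the Occam factor), which is why the posterior does not simply favour the largest model.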
2.6 Bayesian Filtering
In many applications, one needs to extract the signal from data corrupted by additive random noise and interference of various kinds in order to recover the unknown quantities of interest. The data often arrive sequentially in time and, therefore, require on-line decision-making responses. Examples of such applications are speech enhancement [24]; visual tracking
Bayesian model averaging and MMSE estimator:
\hat{x}_{MMSE} = \sum_{k=1}^{K} Pr(k|y) E\{x|y, k\}.

Bayesian model order selection and MMSE estimator:
1. \hat{k}_{MAP} = \arg\max_{k \in \{1, \ldots, K\}} Pr(k|y) = \arg\max_{k \in \{1, \ldots, K\}} p(y|k) Pr(k).
2. \hat{x}_{MMSE|\hat{k}_{MAP}} \triangleq E\{x|y, \hat{k}_{MAP}\}.

Table 2.2: Bayesian estimation under model uncertainty
[25]; target tracking [26]; stochastic volatility models in economics [27]; and many more. In this thesis we shall concentrate on a particular instance of sequential filtering, namely state space models.
2.6.1 State Space Models
State-space models have been widely studied within the areas of signal processing, systems and control theory [28], [29], [30]. A state-space model is a model where the relationship between the input signal, the output signal and the noises is provided by a system of first-order difference equations. The state vector x_n contains the quantities of interest of the underlying system up to and including time n, which are needed to determine the future behavior of the system, given the input. The system is described by (2.32a)-(2.32b).
General state-space model with Gaussian noise
x_n = f(x_{n-1}, v_n),   (2.32a)
y_n = h(x_n, w_n),   (2.32b)
where
v_n \sim N(0, Q),
w_n \sim N(0, R),
E\{v_n w_n^T\} = 0.
The functions f(\cdot) and h(\cdot) can in general be non-linear. We will, however, restrict ourselves to the special case where f(\cdot) and h(\cdot) are linear functions, giving rise to Linear Dynamic State Space Model (LDSSM) systems.
A fundamental property ascribed to the system model is the Markov property.
Definition 2.1. (Markov property). A discrete-time stochastic process \{x_n\} is said to possess the Markov property if
p(x_{n+1}|x_1, \ldots, x_n) = p(x_{n+1}|x_n).
This means that the realization of the process at time n contains all information about the past, which is necessary in order to calculate the future behavior of the process. Hence, if the present realization of the process is known, the future is independent of the past.
The state process xn is an unobserved (hidden) Markov process. Information about this process is indirectly obtained from measurements (observations) yn according to the measurement model,
y_n \sim p(y_n|x_n).   (2.33)
The observations \{y_n\} are assumed to be conditionally independent given the state process \{x_n\}, i.e.,
p(y_n|x_0, \ldots, x_N) = p(y_n|x_n), \quad \forall \, 1 \le n \le N.   (2.34)
2.6.2 Sequential Bayesian Inference
The Bayesian posterior, p(x_{0:n}|y_{1:n}), reflects all the information we have about the state of the system x_{0:n}, contained in the measurements y_{1:n} and the prior p(x_{0:n}), and gives a direct and easily applicable means of combining these two densities via Bayes' theorem:
p(x_{0:n}|y_{1:n}) = \frac{p(y_{1:n}|x_{0:n}) \, p(x_{0:n})}{p(y_{1:n})}.   (2.35)
Taking into account that the observations up to time n are independent given x_{0:n}, the likelihood p(y_{1:n}|x_{0:n}) in the above equation can be factorized as follows:
p(y_{1:n}|x_{0:n}) = \prod_{i=1}^{n} p(y_i|x_{0:n}),   (2.36)
and, since, conditional on x_i, the measurement y_i is independent of the states at all other times, it is given by:
p(y_{1:n}|x_{0:n}) = \prod_{i=1}^{n} p(y_i|x_i).   (2.37)
In addition, as a result of the Markov structure of the system in (2.32a), the prior p (x0:n) takes the following form:
p(x_{0:n}) = p(x_0) \prod_{i=1}^{n} p(x_i|x_{i-1}),   (2.38)
resulting in the posterior probability density being equal to
p(x_{0:n}|y_{1:n}) = \frac{p(x_0) \prod_{i=1}^{n} p(y_i|x_i) \, p(x_i|x_{i-1})}{p(y_{1:n})}.   (2.39)
2.6.3 Filtering Objectives
Our objective is to obtain estimates of the state at time n, conditional upon the measurements up to time n, such as, for example, the MMSE estimate of x_n:
\hat{x}_n^{MMSE} = E_{p(x_n|y_{1:n})}\{x_n\} = \int x_n \, p(x_n|y_{1:n}) \, dx_n,   (2.40)
or the Marginal Maximum A Posteriori (MMAP) estimate, given by:
\hat{x}_n^{MAP} = \arg\max_{x_n} p(x_n|y_{1:n}).   (2.41)

2.6.4 Sequential Scheme
The probability density of interest, p(x_n|y_{1:n}), can be obtained by marginalization of (2.39); however, the dimension of the integration in this case grows as n increases. This can be avoided by using a sequential scheme. A recursive formula for the joint probability density can be obtained straightforwardly from (2.39):
p(x_{0:n}|y_{1:n}) = p(x_{0:n-1}|y_{1:n-1}) \frac{p(y_n|x_n) \, p(x_n|x_{n-1})}{p(y_n|y_{1:n-1})},   (2.42)
with the marginal p(x_n|y_{1:n}) also satisfying the recursion [31]
p(x_n|y_{1:n-1}) = \int p(x_n|x_{n-1}) \, p(x_{n-1}|y_{1:n-1}) \, dx_{n-1},   (2.43)
p(x_n|y_{1:n}) = \frac{p(y_n|x_n) \, p(x_n|y_{1:n-1})}{p(y_n|y_{1:n-1})},   (2.44)
where
p(y_n|y_{1:n-1}) = \int p(y_n|x_n) \, p(x_n|y_{1:n-1}) \, dx_n.   (2.45)
Equations (2.43) and (2.44) are called the prediction and updating equations, respectively. Although the above expressions appear simple, the integrations involved are usually intractable. One cannot typically compute the normalizing constant p(y_{1:n}) and the marginals of p(x_{0:n}|y_{1:n}), in particular p(x_n|y_{1:n}), except for several special cases when the integration can be performed exactly. The problem is of great importance, which is why a great number of different approaches and filters have been proposed.
The most important special case, and the one we will be interested in this thesis, occurs when all equations are linear and the noise terms are Gaussian in (2.32a)-(2.32b). The optimal solution in terms of MSE is provided by the celebrated Kalman filter [32]. In the nonlinear, non-Gaussian case, there is a class of methods, referred to as Sequential Monte Carlo (SMC) methods, available for approximating the optimal solution [33].
An important property of the linear model (2.32a)-(2.32b) is that all density functions involved are Gaussian. This is due to the fact that a linear transformation of a Gaussian random variable will result in a new Gaussian random variable. Furthermore, a Gaussian density function is completely parameterized by two parameters, the first and second order moments, i.e., the mean and the covariance.
2.6.5 Linear State-Space Models - the Kalman Filter
When Rudolf E. Kalman developed the framework for the Kalman filter, he did not require the underlying state-space model to be linear, nor all the probability densities to be Gaussian. The only assumptions he made were the following:
1. Consistent minimum variance estimates of the system random variables (i.e. posterior state distribution) can be calculated, by recursively propagating and updating only their first and second order moments.
2. The estimator itself is a linear function of the prior knowledge of the system, summarized by p(x_n|y_{1:n-1}), and the new observed information, summarized by p(y_n|x_n).

3. Accurate predictions of the state and of the system observations can be calculated.
Based on these assumptions, Kalman derived in [32] a recursive form of the conditional mean of the state, E\{x_n|y_{1:n}\}. In the case of a linear state-space model and Gaussian noise terms, the state-space model can be expressed as (2.46a)-(2.46b).
Linear state-space model with Gaussian noise
x_n = A_n x_{n-1} + v_n,   (2.46a)
y_n = B_n x_n + w_n,   (2.46b)
where
v_n \sim N(0, Q),
w_n \sim N(0, R),
E\{v_n w_n^T\} = 0.
In that case the Kalman filter procedure for estimating xn given the system model in (2.46a)- (2.46b) is given below:
Kalman filter equations
Step 1: a priori estimation (prediction)
\hat{x}_n^- = A_n \hat{x}_{n-1},   (2.47a)
P_n^- = A_n P_{n-1} A_n^H + Q.   (2.47b)
Step 2: a posteriori estimation (update)
K_n = P_n^- B_n^H (B_n P_n^- B_n^H + R)^{-1},   (2.48a)
\hat{x}_n = \hat{x}_n^- + K_n (y_n - B_n \hat{x}_n^-),   (2.48b)
P_n = P_n^- - K_n B_n P_n^-.   (2.48c)
First, the filter calculates the a priori estimate of the channel, \hat{x}_n^-, and its error covariance matrix P_n^-, based on the history prior to the current observation y_n, where
P_n^- = E\{(\hat{x}_n^- - x_n)(\hat{x}_n^- - x_n)^H\}.   (2.49)
Next, the a posteriori estimate of the channel, \hat{x}_n, and its error covariance matrix P_n, after the observation y_n is available at the receiver, are evaluated, where
P_n = E\{(\hat{x}_n - x_n)(\hat{x}_n - x_n)^H\}.   (2.50)
The Kalman gain K_n in (2.48a) balances the weight between the predicted estimates and the innovation process. In Appendix 10.3 we derive the Kalman filter from a Bayesian point of view. The Kalman filter shall be utilised in Chapter 6 for the purpose of channel tracking in OFDM systems.
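As a minimal sketch of the recursions (2.47)-(2.48) (the scalar random-walk model and all numerical values below are made up for illustration, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative scalar random-walk state observed in noise
A = np.array([[1.0]])           # state transition A_n
B = np.array([[1.0]])           # observation matrix B_n
Q = np.array([[0.01]])          # process-noise covariance
R = np.array([[1.0]])           # observation-noise covariance

def kalman_step(x_hat, P, y):
    # Step 1: a priori estimation (prediction)
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    # Step 2: a posteriori estimation (update)
    K = P_pred @ B.T @ np.linalg.inv(B @ P_pred @ B.T + R)
    x_new = x_pred + K @ (y - B @ x_pred)
    P_new = P_pred - K @ B @ P_pred
    return x_new, P_new

x_true, x_hat, P = np.zeros(1), np.zeros(1), np.eye(1)
errs = []
for n in range(500):
    x_true = A @ x_true + rng.normal(0.0, 0.1, 1)   # state evolution, as in (2.46a)
    y = B @ x_true + rng.normal(0.0, 1.0, 1)        # noisy observation, as in (2.46b)
    x_hat, P = kalman_step(x_hat, P, y)
    errs.append(float((x_hat - x_true) ** 2))
print(np.mean(errs[100:]))      # steady-state MSE far below the raw noise R = 1
```

Because the process noise is much smaller than the observation noise, the gain K_n settles to a small value and the filter averages out most of the measurement noise.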
2.7 Consistency Tests
We now discuss the issue of verifying that the Kalman filter is performing correctly. As mentioned in the previous sections, the Kalman filter is optimal in the sense that it provides the MMSE estimate. However, this relies on perfect knowledge of An, Bn, Q and R. There are instances where some of those quantities may be different from what we have assumed. For that reason it is important to obtain a measure of reliability regarding the performance of the filter. This gives us a quantifiable confidence in the accuracy of our filter estimates. A consistency test indicates whether the state-space model is consistent with the data y. We begin by expressing the distribution of y1:n as
p(y_{1:n}) = p(y_n, y_{1:n-1}) = p(y_n|y_{1:n-1}) \, p(y_{1:n-1}) = \prod_{i=1}^{n} p(y_i|y_{1:i-1}).   (2.51)
Further, we assume that the elements in the product of (2.51) are Gaussian distributed, that is:
p(y_i|y_{1:i-1}) = N(\hat{y}_i^-, S_i),   (2.52)
where \hat{y}_i^- = B_i \hat{x}_i^- is the predicted observation, and S_i = B_i P_i^- B_i^T + R is the associated observation covariance matrix. It follows that
p(\epsilon_i|y_{1:i-1}) = N(0, S_i),   (2.53)
where \epsilon_i \triangleq y_i - \hat{y}_i^-. Using (2.53) in (2.51), together with the explicit expression for the Gaussian probability density function (pdf), yields
p(y_{1:n}) = \prod_{i=1}^{n} p(\epsilon_i) = \left( \prod_{i=1}^{n} |2\pi S_i|^{-1/2} \right) \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} \epsilon_i^T S_i^{-1} \epsilon_i \right\}.   (2.54)
The distribution in (2.54) is known as the likelihood function of the filter, and the exponent:
\lambda_n = \sum_{i=1}^{n} \epsilon_i^T S_i^{-1} \epsilon_i = \lambda_{n-1} + \epsilon_n^T S_n^{-1} \epsilon_n   (2.55)
is called the modified log-likelihood function. The individual terms \epsilon_n^T S_n^{-1} \epsilon_n are \chi_K^2 distributed with K degrees of freedom, where K is the dimension of y_n. It follows that \lambda_n is \chi_{Kn}^2 distributed with Kn degrees of freedom. Using these results, a quality threshold can be established using the \chi_{Kn}^2 distribution, and \lambda_n can serve as a track quality indicator. When \lambda_n exceeds the quality threshold, it indicates that a model mismatch has occurred. We shall use these results in Chapter 6 to reduce model mismatch in the iterative receiver.
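As a sketch of the chi-square consistency check built on (2.55) (the covariance values and the "3x mismatch" scenario below are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)

# Under a correct model each normalised innovation eps_n^T S_n^{-1} eps_n is
# chi^2 with K = dim(y_n) degrees of freedom, so the sum over n steps is chi^2
# with K*n degrees of freedom.
K_dim, n_steps = 2, 400
S = np.array([[2.0, 0.3], [0.3, 1.0]])           # assumed innovation covariance
S_inv = np.linalg.inv(S)

def track_quality(eps):
    """lambda_n = sum_i eps_i^T S^{-1} eps_i."""
    return np.sum(np.einsum('ni,ij,nj->n', eps, S_inv, eps))

threshold = chi2.ppf(0.99, df=K_dim * n_steps)   # 99% quality threshold

# Matched case: innovations really drawn from N(0, S); lambda stays below the
# threshold with probability 0.99
lam = track_quality(rng.multivariate_normal(np.zeros(K_dim), S, size=n_steps))

# Mismatched case: the true innovation covariance is 3x the assumed one, so
# lambda grows roughly 3x faster and the mismatch is flagged
lam_bad = track_quality(rng.multivariate_normal(np.zeros(K_dim), 3.0 * S, size=n_steps))

print(lam < threshold, lam_bad > threshold)
```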
2.8 The EM and Bayesian EM Methods
The Expectation Maximisation (EM) method, introduced by Dempster, Laird and Rubin [34], [35], presents an iterative approach for obtaining the mode of the likelihood function. The strategy underlying the EM algorithm is to separate a difficult maximum likelihood problem into two linked problems, each of which is easier to solve than the original problem. The main property of the EM algorithm is that it ensures an increase in the likelihood function at each iteration. Suppose we wish to find
\hat{x} = \arg\max_x p(y|x),   (2.56)
where y is the observation vector and x is the unknown vector to be estimated. We assume that p(y|x) is the marginal of some real-valued function p(y, \theta|x). It is convenient to think of \theta either as missing observations or as latent variables. That is, in situations where it is hard to maximise p(y|x), EM will allow us to accomplish this by working with p(y, \theta|x). The likelihood function p(y|x) can be expressed as
p(y|x) = \sum_{\theta} p(y, \theta|x),   (2.57)
where \sum_{\theta} g(\theta) denotes either integration or summation of g(\theta) over the whole range of \theta. The function p(y, \theta|x) is assumed to be non-negative for all x and \theta. The Maximum Likelihood
(ML) estimate in (2.56) can be cast as
\hat{x} = \arg\max_x p(y|x) = \arg\max_x \log p(y|x) = \arg\max_x \log\left( \sum_{\theta} p(y, \theta|x) \right).   (2.58)
The pivotal problem encountered while maximising (2.58) is that the logarithm of the summation cannot be decoupled. The EM algorithm attempts to compute (2.58) iteratively as follows:
1. Make some initial guess x0.
2. Expectation step: evaluate
Q(x, x_k) \triangleq E_{\theta|y; x_k}\{\log p(y, \theta|x)\}.   (2.59)
3. Maximisation step: compute
x_{k+1} = \arg\max_x Q(x, x_k).   (2.60)
4. Repeat steps 2-3 until convergence.
In the above formulation, the subscript k in x_k denotes the k-th iteration. The full details regarding the derivation of the EM method can be found in Appendix 10.4.
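As a concrete sketch of steps 1-4 (this mixture example is illustrative, not from the thesis), consider a two-component Gaussian mixture with known unit variances and equal weights, where \theta is the latent component label of each sample and x = (\mu_1, \mu_2) are the unknown means:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from two Gaussian components with true means -2 and 3
data = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-0.5, 0.5])                       # step 1: initial guess x_0
for _ in range(100):
    # E-step: responsibilities p(theta_i = j | y_i; x_k)
    ll = -0.5 * (data[:, None] - mu[None, :]) ** 2
    r = np.exp(ll - ll.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximising Q(x, x_k) gives responsibility-weighted means
    mu = (r * data[:, None]).sum(axis=0) / r.sum(axis=0)

print(np.sort(mu))      # close to the true means (-2, 3)
```

Each iteration replaces the intractable log-of-sum in (2.58) with the tractable expected complete-data log-likelihood Q(x, x_k), and the estimated means improve monotonically in likelihood.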
The main property of the EM algorithm is
p(y|x_{k+1}) \ge p(y|x_k),   (2.61)
which states that the likelihood of the current estimate x_k is non-decreasing at every iteration.
An important aspect of the EM algorithm is the choice of the unobserved data vector \theta. This choice should be made such that the maximization step is easy, or at least easier than the maximization of the likelihood function. In general, doing so is not an easy task. In addition, the evaluation of the conditional expectation may also be rather challenging.
Despite the fact that the algorithm can display slow numerical convergence in some cases, the EM algorithm has become a very popular computational method in statistics. One of the main reasons motivating its use is that the implementation of the E-step and M-step is easy for many statistical problems.
The main disadvantages of the EM method are:
1. Convergence to global optimum is not guaranteed.
2. May have slow convergence rate, i.e. has a linear order of convergence.
3. In some problems, the E-step or M-step may be analytically intractable.
The EM can be extended to the Bayesian setting in order to find the MAP estimate. In that case the expectation step in (2.59) is replaced with
Q(x, x_k) \triangleq E_{\theta|y; x_k}\{\log p(y, x, \theta)\}.   (2.62)
The BEM methodology shall be used in Chapter 5 to perform inference in the presence of nuisance parameters.
2.9 Lower Bounds on the MSE
As shown in Section 2.3.2, the conditional mean E\{x|y\} is the estimator minimizing the MSE. Thus, from a theoretical perspective, finding the MMSE estimator in any given problem is a mechanical task. In practice, however, the complexity involved in computing the conditional mean is often prohibitive. As a result, various alternatives, such as the LMMSE and MAP techniques, can be used, as these methods can often be computed efficiently. An important goal is to quantify the performance degradation resulting from the use of these suboptimal techniques. One way to do this is to compare the MSE of the method used in practice with the MMSE. Unfortunately, computation of the MMSE is itself infeasible in many cases. Therefore, we are interested in finding simple lower bounds for the MMSE in various settings. As mentioned before, in probabilistic inference there are two well known approaches, namely the frequentist approach and the Bayesian approach. Although the deterministic and Bayesian settings stem from different points of view, a connection between them can be made. Here we concentrate on the Bayesian version of lower bounds on the MSE.
Suppose \hat{x} = \hat{x}(y) is an estimator of x \in \mathbb{C}^m given an observation vector y \in \mathbb{C}^n. Its Bayesian MSE matrix, \Sigma, is given by
\Sigma = E_{y,x}\{(\hat{x} - x)(\hat{x} - x)^H\} = \iint (\hat{x} - x)(\hat{x} - x)^H \, p(y, x) \, dy \, dx,   (2.63)
where E_{y,x} denotes expectation with respect to p(y, x). The Bayesian Cramér-Rao Lower Bound (BCRLB) provides a lower bound on the MSE matrix for random parameters [36], [37].
It is the inverse of the Bayesian Information Matrix (BIM) J
\Sigma \ge C \triangleq J^{-1},   (2.64)
where the matrix inequality indicates that \Sigma - C is a positive semi-definite matrix. The BIM for x, J, is defined as
J = E_{y,x}\{ -\Delta_x^x \ln p(y, x) \},   (2.65)
where \Delta_\beta^\alpha is defined as the m \times n matrix of second-order partial derivatives with respect to the m \times 1 parameter vector \beta and the n \times 1 parameter vector \alpha,
\Delta_\beta^\alpha = \begin{bmatrix} \frac{\partial^2}{\partial\beta_1 \partial\alpha_1} & \frac{\partial^2}{\partial\beta_1 \partial\alpha_2} & \cdots & \frac{\partial^2}{\partial\beta_1 \partial\alpha_n} \\ \frac{\partial^2}{\partial\beta_2 \partial\alpha_1} & \frac{\partial^2}{\partial\beta_2 \partial\alpha_2} & \cdots & \frac{\partial^2}{\partial\beta_2 \partial\alpha_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial\beta_m \partial\alpha_1} & \frac{\partial^2}{\partial\beta_m \partial\alpha_2} & \cdots & \frac{\partial^2}{\partial\beta_m \partial\alpha_n} \end{bmatrix}.   (2.66)
The BCRLB will be relevant to the results obtained in Chapter 7, where we derive BCRLB-type bounds.
2.10 Bayesian Methodology Summary
The Bayesian methodology provides a mathematically rigorous quantification of uncertainty through the application of probability theory. The subjective prior probabilities are updated in light of data. Within Bayesian inference there exist many cases where intractable calculations can prevent its application. Whilst in the past such issues would have been avoided through the use of conjugate prior distributions, these problems can now be overcome using Monte Carlo techniques, which are a class of asymptotically optimal approximate algorithms. We now review a class of Monte Carlo techniques which can be applied to these problems.
2.11 Monte Carlo Methods
2.11.1 Motivation
In Bayesian inference, whenever we attempt to carry out normalisation, marginalisation or the computation of expectations, high-dimensional integrals need to be evaluated. Apart from that, many applications involve elements of non-Gaussianity, nonlinearity and nonstationarity. All these reasons preclude the use of analytical integration. In order to obviate this problem, we can resort to approximation methods, numerical integration or Monte Carlo simulation. Approximation methods, such as Gaussian approximation [38] and variational methods [39], are easy to implement. Yet, they do not take into account all the salient statistical features of the processes under consideration, thereby often leading to poor results. Numerical integration in high dimensions is far too computationally expensive to be of any practical use. Monte Carlo methods provide the middle ground. They lead to better estimates than the approximate methods. This comes at the expense of extra computing requirements, but the advent of cheap and massive computational power, in conjunction with some recent developments in applied statistics, means that many of these requirements can now be met. Monte Carlo methods are very flexible in that they do not require any assumptions about the probability distributions of the data. From a Bayesian perspective, Monte Carlo methods allow one to compute the full posterior probability distribution.
2.11.2 Monte Carlo Techniques
Monte Carlo methods are commonly used for the approximation of intractable integrals and rely on the ability to draw a random sample from the required probability distribution. The idea is to simulate N independent and identically distributed (i.i.d.) samples \{x^{(i)}\}_{i=1}^{N} from the distribution of interest, which in the Bayesian framework is usually the posterior p(x|y), and use them to obtain an empirical estimate of the distribution:
\hat{p}_N(x|y) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)}).   (2.67)
Using this empirical density, the expected value of x,
E\{x|y\} = \int x \, p(x|y) \, dx,   (2.68)
or, in general,
E\{\vartheta(x)|y\} = \int \vartheta(x) \, p(x|y) \, dx,   (2.69)
can consequently be obtained by approximating the corresponding integrals by the sums:
\hat{E}\{x|y\} = \int x \, \hat{p}_N(x|y) \, dx = \frac{1}{N} \sum_{i=1}^{N} x^{(i)},   (2.70)
\hat{E}\{\vartheta(x)|y\} = \int \vartheta(x) \, \hat{p}_N(x|y) \, dx = \frac{1}{N} \sum_{i=1}^{N} \vartheta(x^{(i)}).   (2.71)
The estimate (2.71) is unbiased with variance proportional to 1/N, and according to the strong law of large numbers we have that
\lim_{N \to \infty} \hat{p}_N(x|y) \xrightarrow{a.s.} p(x|y),   (2.72)
where \xrightarrow{a.s.} denotes almost sure (a.s.) convergence [33]. If we assume that \sigma^2 = E\{\vartheta^2(x)|y\} - E^2\{\vartheta(x)|y\} < \infty, the Central Limit Theorem can be applied, which gives
\lim_{N \to \infty} \sqrt{N} \left( \hat{E}\{\vartheta(x)|y\} - E\{\vartheta(x)|y\} \right) \xrightarrow{d} N(0, \sigma^2),   (2.73)
where \xrightarrow{d} denotes convergence in distribution [33]. The aforementioned procedure can easily be applied provided one can sample from p(x|y). This is usually not the case, however, with p(x|y) being multivariate, non-standard and typically known only up to a normalizing constant.
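As a minimal numerical sketch of the estimator (2.71) (the target density and test function below are chosen so the true answer is known):

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo approximation (2.71) for a case with a known answer:
# x ~ N(0, 1) and theta(x) = x^2, so E{theta(x)} = Var(x) = 1.
N = 200_000
samples = rng.standard_normal(N)       # i.i.d. draws from the target density
estimate = np.mean(samples ** 2)       # (1/N) * sum_i theta(x^(i))
print(estimate)                        # ~1, with error of order 1/sqrt(N)
```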
2.11.3 Sampling From Distributions
The problem under consideration in this section is to generate samples from some known probability density function, referred to as the target density p(x). However, since we cannot generate samples from p(x) directly, the idea is to employ an alternative density that is simple to draw samples from, referred to as the sampling density s(x). The only restriction imposed on s(x) is that its support should include the support of p(x). When a sample x \sim s(x) is drawn, the probability that it was in fact generated from the target density can be calculated. This probability can then be used to decide whether x should be considered a sample from p(x) or not. This probability is referred to as the acceptance probability, and it is typically expressed as a function of q(x), defined by the following relationship
p(x) \propto q(x) s(x).   (2.74)
Depending on the exact details of how the acceptance probability is computed, different methods are obtained. The three most common methods are briefly explained in the next sections.
Figure 2.2: The inverse transform method to obtain samples
2.11.4 Inversion Sampling
A simple yet elegant approach for sampling from distributions was considered by Ulam prior to 1947. If p(x) is a distribution and it is possible to invert its cumulative distribution function (cdf) F, then it is possible to transform a sample u from a uniform distribution over [0, 1] into a sample x from p(x) by making use of the following transformation:
x = F^{-1}(u).   (2.75)
In fact, it suffices to obtain a generalised inverse of F, a function with the property that F^{-1}(u) = \inf_x \{F(x) \ge u\}, as the image of the set of points over which the true inverse is multi-valued is \pi-null. This algorithm is summarized in Algorithm 1, and illustrated in Figure 2.2. Except for simple cases, such as the uniform and normal distributions, it is difficult or impossible to find the inverse of the cdf, that is, to solve
F(x) = \int_{-\infty}^{x} p(t) \, dt = u.   (2.76)
For this reason, other approaches to obtain samples from difficult distributions have been devised.
Algorithm 1 Inversion sampling algorithm
Input: F(x)
Output: a sample from p(x)
1: Generate u \sim U[0, 1]
2: Set F(x) = u
3: Invert the cdf and solve for x: x = F^{-1}(u)
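As a minimal sketch of Algorithm 1 for a target whose cdf does invert in closed form (the exponential target and rate below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# Inversion sampling for an exponential target with rate lam:
# p(x) = lam * exp(-lam * x), F(x) = 1 - exp(-lam * x),
# so F^{-1}(u) = -ln(1 - u) / lam in closed form.
lam = 2.0
u = rng.uniform(0.0, 1.0, 100_000)     # step 1: u ~ U[0, 1]
x = -np.log1p(-u) / lam                # step 3: x = F^{-1}(u)

print(x.mean())                        # close to the exponential mean 1/lam = 0.5
```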
2.11.5 Accept-Reject Sampling
The Accept-Reject technique is based on generating a proposal and then accepting or rejecting it with a certain acceptance probability. As a result, a true, independent sample from the required distribution is obtained. Let us assume that we have a distribution p(x) \propto f(x) that is easy to evaluate at discrete points, but from which samples are difficult to draw, and our goal is to obtain realizations of p(x). In such cases we use the Accept-Reject algorithm [40]. According to the algorithm, a sample is drawn from another distribution q(x) \propto g(x) and, based on the outcome, a decision is made on whether to keep the sample or reject it. This algorithm is summarized in Algorithm 2. The distribution g(x) must be chosen such that f(x) \le c g(x) for some constant c. Thus, if the proposal density is quite different from the target one, the method naturally compensates by sampling more points from the required distribution. This results, however, in an unpredictable number of iterations required to complete each step, and proves to be extremely computationally expensive in high-dimensional spaces. The Accept-Reject method is widely studied, and the interested reader is referred to [11], [41]. The proof of this procedure is easy to obtain:
P(x^{(i)} < z) = P\left( x < z \,\Big|\, u \le \frac{f(x)}{c g(x)} \right)
              = \frac{P\left( x < z, \, u \le \frac{f(x)}{c g(x)} \right)}{P\left( u \le \frac{f(x)}{c g(x)} \right)}
              = \frac{\int_{-\infty}^{z} \int_{0}^{f(x)/(c g(x))} du \; g(x) \, dx}{\int_{-\infty}^{\infty} \int_{0}^{f(x)/(c g(x))} du \; g(x) \, dx}   (2.77)
              = \frac{\frac{1}{c} \int_{-\infty}^{z} f(x) \, dx}{\frac{1}{c} \int_{-\infty}^{\infty} f(x) \, dx}
              = \int_{-\infty}^{z} p(x) \, dx,
which proves the required result.
Algorithm 2 Accept-Reject sampling algorithm
Input: p(x) \propto f(x), q(x) \propto g(x) and number of attempts N
Output: x^{(i)}, samples from p(x)
1: Find a constant c such that f(x) \le c g(x) for all x
2: for i = 1, \ldots, N do
3:   Generate x from g(x)
4:   Generate u \sim U[0, 1]
5:   if u \le f(x)/(c g(x)) then
6:     x^{(i)} = x (accept)
7:   else
8:     go to step 3
9:   end if
10: end for
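As a minimal sketch of Algorithm 2 (the Beta(2,2) target and uniform proposal below are chosen for illustration because the bounding constant c is available in closed form):

```python
import numpy as np

rng = np.random.default_rng(7)

# Accept-Reject for the Beta(2,2) target p(x) proportional to f(x) = x(1-x)
# on [0, 1], with uniform proposal g(x) = 1 and constant c = max f = 1/4.
c = 0.25
samples = []
while len(samples) < 50_000:
    x = rng.uniform()                  # step 3: draw a proposal from g
    u = rng.uniform()                  # step 4: u ~ U[0, 1]
    if u <= x * (1.0 - x) / c:         # step 5: accept with probability f/(c g)
        samples.append(x)              # step 6: keep as a sample from p(x)
samples = np.asarray(samples)
print(samples.mean())                  # the Beta(2,2) mean is 1/2
```

The expected acceptance rate is the ratio of the areas under f and c g, here 2/3, so roughly one in three proposals is wasted; a tighter proposal would reduce this waste.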
2.11.6 Importance Sampling
In the Accept-Reject method, computing f(x) and q(x) and then throwing x away along with all the associated computations seems wasteful. Importance sampling avoids the trouble of trying to sample directly from the target distribution by instead sampling from an importance distribution q(x). The distribution q(x) is selected to have the property that it is simpler to obtain samples from than the target distribution. A correction is then made, since these samples were not taken from the distribution of interest p(x), but instead from the importance distribution q(x). This correction step is known as importance weighting. Integrals of some bounded, integrable function ϑ with respect to the target distribution can be expressed as
E_{p(x)}{ϑ(x)} = ∫ ϑ(x) p(x) dx
= ∫ ϑ(x) [p(x)/q(x)] q(x) dx    (2.78)
= E_{q(x)}{ϑ(x) p(x)/q(x)},
and may be approximated as
E_{p(x)}{ϑ(x)} ≈ (1/N) Σ_{i=1}^{N} ϑ(x^(i)) W(x^(i)),    (2.79)
where W(x) = p(x)/q(x) is the importance weight correction. The particles, x^(i), are samples from the importance distribution, q(x). This produces an unbiased estimate, with the variance of the estimate inversely proportional to the number of particles N. The importance sampling algorithm is summarised in Algorithm 3.
Algorithm 3 Importance sampling algorithm
Input: p(x) ∝ f(x), which is difficult to sample from, yet can be evaluated analytically up to a normalisation constant, and number of samples N
Output: {x^(i)}_{i=1}^{N}, samples from p(x)
1: for i = 1 ... N do
2:   x^(i) ∼ q(x)
3:   w^(i) = f(x^(i)) / q(x^(i))
4: end for
5: for i = 1 ... N do
6:   W^(i) = w^(i) / Σ_{j=1}^{N} w^(j)
7: end for
8: p̂(x) = Σ_{i=1}^{N} W^(i) δ_{x^(i)}(x)
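A sketch of Algorithm 3 in Python (the target, proposal and test function are illustrative choices of ours): the target is a standard normal known only up to its constant, f(x) = exp(-x²/2), the importance distribution is q = N(0, 2²), and the self-normalised estimate (2.79) is used to approximate E_p{x²} = 1:

```python
import math
import random

def f(x):
    """Unnormalised target: a standard normal up to its constant."""
    return math.exp(-0.5 * x * x)

def q_pdf(x, s=2.0):
    """Proposal density: N(0, s^2)."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def importance_estimate(phi, n, s=2.0, seed=0):
    """Algorithm 3: sample from q, weight by f/q, self-normalise the weights."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, s) for _ in range(n)]
    ws = [f(x) / q_pdf(x, s) for x in xs]
    total = sum(ws)                      # normalisation of the weights
    return sum(phi(x) * w for x, w in zip(xs, ws)) / total

second_moment = importance_estimate(lambda x: x * x, 100_000)
```

Self-normalisation (steps 5-7 of the algorithm) removes the need to know the normalising constant of either density.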
In some situations the described techniques fail or become too computationally intensive. In order to overcome these hurdles, techniques which utilise the Monte Carlo framework have been developed. One of these techniques, which we now discuss, is the MCMC technique.
2.12 Markov Chain Monte Carlo Methods
2.12.0.1 Introduction
Monte Carlo methods provide a general statistical approach for simulating multivariate dis- tributions by generating an efficient discretized representation of the needed posterior density. MCMC methods can be used to sample from distributions that are complex and have unknown normalization. This is achieved by relaxing the requirement that the samples should be inde- pendent.
A Markov chain generates a correlated sequence of states. Each step in the sequence is drawn from a transition operator T(x′ ← x), which gives the probability of moving from state x to state x′. According to the Markov property, the transition probabilities depend only on the current state, x. In particular, any free parameters σ (e.g. step sizes) in a family of transition operators, T(x′ ← x; σ), cannot be chosen based on the history of the chain. A basic requirement for T is that, given a sample from p(x), the marginal distribution over the next state in the chain is also the target distribution of interest p:
p(x′) = Σ_x T(x′ ← x) p(x), for all x′.    (2.80)

By induction, all subsequent steps of the chain will have the same marginal distribution. The transition operator is said to leave the target distribution p stationary. MCMC algorithms often require operators that ensure the marginal distribution over a state of the chain tends to p(x) regardless of the starting state. This requires irreducibility: the ability to reach any x where p(x) > 0 in a finite number of steps, and aperiodicity: no states are only accessible at certain regularly spaced times. For now we note that as long as T satisfies equation (2.80) it can be useful, as the other conditions can be met through combinations with other operators. Since usually p(x) is a complicated distribution, it might seem unreasonable to expect that we could find a transition operator T leaving it stationary. However, it is often easy to construct a transition operator satisfying detailed balance:
T(x′ ← x) p(x) = T(x ← x′) p(x′), for all x, x′.    (2.81)
Detailed balance states that a step starting at equilibrium and transitioning under T has the same probability "forwards", x′ ← x, as "backwards", x ← x′.
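Both conditions are easy to verify numerically. The following sketch (our own toy example) builds a small reversible transition matrix by construction, using the convention T[i][j] = T(j ← i), and checks the stationarity condition (2.80) and detailed balance (2.81):

```python
# Stationary distribution we want the operator to leave invariant.
p = [0.2, 0.3, 0.5]

# A birth-death (tridiagonal) transition matrix built to satisfy
# detailed balance p[i] * T[i][j] = p[j] * T[j][i]; rows sum to one.
T = [[0.0] * 3 for _ in range(3)]
T[0][1] = 0.3
T[1][0] = p[0] * T[0][1] / p[1]
T[1][2] = 0.4
T[2][1] = p[1] * T[1][2] / p[2]
for i in range(3):
    T[i][i] = 1.0 - sum(T[i])     # remaining probability mass stays put

# Stationarity (2.80): p'(j) = sum_i T(j <- i) p(i) must equal p(j).
p_next = [sum(T[i][j] * p[i] for i in range(3)) for j in range(3)]

# Detailed balance (2.81) for every pair of states.
balanced = all(
    abs(p[i] * T[i][j] - p[j] * T[j][i]) < 1e-12
    for i in range(3) for j in range(3)
)
```

As the section notes, detailed balance is the stronger, but much easier to engineer, property: stationarity follows from it by summing (2.81) over x.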
To summarise, an MCMC algorithm explores the parameter space by a random walk, but spends more time in regions of higher probability, such that, in the long run, the amount of time spent in any region is proportional to the amount of probability in that region.
2.12.1 Basics of Markov Chains
Let (Ω, σ, P) be a probability space, let X = {1, 2, . . . , N} be a finite set and let x_n(ω) be a stochastic process in discrete time such that x_n(ω) ∈ X for all n = 0, 1, 2, . . . and all ω ∈ Ω. Suppose there is an N × N matrix P (called the Probability Transition Matrix (PTM) of the process) such that for all i, j ∈ X we have
p(x_{n+1} = i | x_n = j) = [P]_{i,j}.    (2.82)
Then we call the process x_n a time-homogeneous Markov chain with finite state space in discrete time. The matrix P has non-negative entries and its columns sum to 1.
Definition 2.2. Let P be a PTM. It is said to be irreducible if for every pair i, j ∈ X there is an n ≥ 1 such that [P^n]_{i,j} > 0. Loosely speaking, a Markov chain is irreducible if (almost) all states communicate; the property corresponds to the existence of a path of positive probability from (almost) any point in the space to (almost) any measurable set. In a Bayesian context, irreducibility guarantees that we can visit all the sets of parameter values in the posterior's support.
Definition 2.3. Let P be an N × N PTM and let 1 ≤ i ≤ N. The set of return times for state i is defined via

R(i) = {n > 0 : [P^n]_{i,i} > 0}.

Definition 2.4. Let P be an N × N PTM, let 1 ≤ i ≤ N and let R(i) be the set of return times for state i. Then the period of state i is defined via

p(i) = g.c.d. R(i),

where g.c.d. stands for "greatest common divisor".
Definition 2.5. An N × N PTM P is called aperiodic if p(i) = 1 for each i = 1, . . . , N. In the discrete state space case, a Markov chain is aperiodic if no state has a period greater than one, i.e. if the greatest common divisor of the lengths of all return routes of positive probability is one for every state.
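Definitions 2.2-2.5 can be checked mechanically for a small PTM by examining powers of P. The helper functions below are our own sketch; for convenience they use the row convention P[i][j] = p(x_{n+1} = j | x_n = i), the transpose of the column convention in (2.82), which leaves irreducibility and periods unchanged:

```python
from math import gcd

# Small two-state chain: state 0 always jumps to 1; state 1 is "lazy".
P = [[0.0, 1.0],
     [0.5, 0.5]]

def mat_mul(A, B):
    """Plain matrix product, used to form the powers P^n."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_irreducible(P, max_n=10):
    """Definition 2.2: every (i, j) entry of some power P^n is positive."""
    n = len(P)
    Pn = P
    reached = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    for _ in range(max_n):
        Pn = mat_mul(Pn, P)
        for i in range(n):
            for j in range(n):
                reached[i][j] |= Pn[i][j] > 0
    return all(all(row) for row in reached)

def period(P, i, max_n=20):
    """Definitions 2.3-2.4: g.c.d. of the return times R(i)."""
    Pn, g = P, 0
    for n in range(1, max_n + 1):
        if Pn[i][i] > 0:     # n is a return time for state i
            g = gcd(g, n)
        Pn = mat_mul(Pn, P)
    return g
```

For the chain above, both states have period one (returns of length 2 and 3 exist for state 0, and gcd(2, 3) = 1), so P is aperiodic as well as irreducible.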
2.12.2 Metropolis Hastings Sampler
The Metropolis Hastings (MH) algorithm is the most ubiquitous class of MCMC algorithms. The MH is a general algorithm for computing estimates using the MCMC method. It was introduced by W. Keith Hastings in 1970 [42], as a generalization of the algorithm proposed by Nicholas Metropolis et al. in 1953 [43].
An introduction to the MH algorithm is provided by Chib and Greenberg [44]. The idea of the algorithm which is depicted in Algorithm 4 is borrowed from Accept-Reject sampling, in that the generated samples are either accepted or rejected. However, when a sample is rejected the current value is used as a sample from the target density.
Algorithm 4 Metropolis-Hastings algorithm
Input: initial x and number of samples N
Output: {x^(i)}_{i=1}^{N}, samples from p(x)
1: Initialise x^(0) to an arbitrary starting point
2: for i = 1 ... N do
3:   Propose x* ∼ q(x* ← x^(i-1))
4:   Compute α(x*, x^(i-1)) = min{1, [p(x*)/p(x^(i-1))] × [q(x^(i-1) ← x*)/q(x* ← x^(i-1))]}
5:   Generate u ∼ U[0, 1]
6:   if u ≤ α(x*, x^(i-1)) then
7:     x^(i) = x* (accept)
8:   else
9:     x^(i) = x^(i-1) (reject)
10:  end if
11: end for
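As an illustrative sketch (our own example, not from the thesis), Algorithm 4 can be implemented in Python for a standard normal target known only up to its normalising constant, using a symmetric Gaussian random-walk proposal so that the proposal ratio in step 4 cancels:

```python
import math
import random

def target(x):
    """Unnormalised target p(x): a standard normal up to its constant."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n, step=1.0, seed=0):
    """Algorithm 4 with a symmetric random-walk proposal, so the
    ratio q(x | x*) / q(x* | x) cancels in the acceptance probability."""
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n):
        x_star = x + rng.gauss(0.0, step)             # step 3: propose
        alpha = min(1.0, target(x_star) / target(x))  # step 4: acceptance prob.
        if rng.random() <= alpha:                     # steps 5-6
            x = x_star                                # step 7: accept
        chain.append(x)                               # step 9: reject keeps old x
    return chain

chain = metropolis_hastings(200_000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

Note that a rejected proposal still contributes a sample (the repeated current value), exactly as the text describes.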
The Markov chain created by this algorithm is reversible and has the required target distribution, p(x). The choice of the proposal distribution q(·) is very general; however, an arbitrary selection can lead to slow mixing of the chain and long burn-in periods. This will be reflected in the acceptance probability ratio. It is straightforward to show that the Metropolis updating rule satisfies detailed balance, and is therefore a valid MCMC algorithm. This means that long Markov chains simulated under the acceptance rule of the MH algorithm will explore the parameter space with the fraction of time spent in any volume being proportional to the total amount of probability contained in that volume. There have been many studies of optimal acceptance rates using different types of proposal distribution in different dimensions [45]. These will ensure that:
• the chain is not proposing steps which are too large, hence rejecting many of the moves;
• the chain is not proposing steps which are too small, hence accepting most moves, but exploring the state space very slowly.
Many variants of the MH algorithm exist: algorithms using symmetric proposal distributions [43], the Independence sampler [45], Random walk Metropolis [45], Configurational Bias Monte Carlo [46], Multiple Try Metropolis [47] and the single component MH algorithm. A very interesting special case of the MH algorithm can be obtained when we adopt the full conditional distributions p(x_i | x_{j≠i}) = p(x_i | x_1, . . . , x_{i-1}, x_{i+1}, . . . , x_m) as proposal distributions. This algorithm, known as the Gibbs sampler, has been very popular since its development [48]. The following Section describes it in more detail.
2.12.3 Gibbs Sampler
The Gibbs sampler [48] is the most widely used form of the single component MH algorithm. The Gibbs sampler produces a Markov chain by updating one component of the state vector during each iteration. The value of each element at the i-th iteration is sampled from the distribution of that element conditional upon the values of all the other parameters at the (i-1)-th iteration and those parameters which have already been updated at the i-th iteration. Suppose that at the i-th iteration the current state is x^(i) = [x_1^(i), x_2^(i), . . . , x_K^(i)]. If we also know the full conditionals p(x_k | x_1^(i), . . . , x_{k-1}^(i), x_{k+1}^(i), . . . , x_K^(i)), k ∈ {1, . . . , K}, the following proposal distribution for k ∈ {1, . . . , K} can be used
q(x* ← x^(i)) = { p(x_k* | x_{-k}^(i)),  if x_{-k}* = x_{-k}^(i);  0, otherwise. }    (2.83)
The corresponding acceptance probability is:
α(x*, x^(i)) = min{1, [p(x*) p(x_k^(i) | x_{-k}*)] / [p(x^(i)) p(x_k* | x_{-k}^(i))]}
= min{1, [p(x*) p(x_k^(i) | x_{-k}^(i))] / [p(x^(i)) p(x_k* | x_{-k}*)]}    (2.84)
= min{1, p(x_{-k}*) / p(x_{-k}^(i))}
= 1.
As depicted in (2.84), the acceptance probability is equal to one. The Gibbs sampler algorithm is presented in Algorithm 5.
An example of a sample path of bivariate distribution is given in Figure 2.3. It is easy to see that in every step only one component is updated while the other remains at the same location. 2.12 Markov Chain Monte Carlo Methods 39
Algorithm 5 Gibbs sampling algorithm
Input: initial x and number of samples N
Output: {x^(i)}_{i=1}^{N}, samples from p(x)
1: Initialise x^(1) to an arbitrary starting point
2: for i = 1, . . . , N do
3:   Sample x_1^(i+1) ∼ p(x_1 | x_2^(i), x_3^(i), . . . , x_K^(i)).
4:   Sample x_2^(i+1) ∼ p(x_2 | x_1^(i+1), x_3^(i), . . . , x_K^(i)).
     ...
8:   Sample x_k^(i+1) ∼ p(x_k | x_1^(i+1), . . . , x_{k-1}^(i+1), x_{k+1}^(i), . . . , x_K^(i)).
     ...
12:  Sample x_K^(i+1) ∼ p(x_K | x_1^(i+1), x_2^(i+1), . . . , x_{K-1}^(i+1)).
13: end for
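A minimal Python sketch of Algorithm 5 (our own illustrative example) for a bivariate normal target with unit marginal variances and correlation ρ, whose full conditionals are exactly N(ρ x_other, 1 - ρ²):

```python
import math
import random

def gibbs_bivariate_normal(n, rho=0.8, seed=0):
    """Algorithm 5 for a bivariate normal with unit marginals and
    correlation rho; each full conditional is N(rho * other, 1 - rho^2)."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0
    chain = []
    for _ in range(n):
        x1 = rng.gauss(rho * x2, s)   # step 3: sample x1 | x2
        x2 = rng.gauss(rho * x1, s)   # step 4: sample x2 | x1
        chain.append((x1, x2))
    return chain

chain = gibbs_bivariate_normal(200_000)
m1 = sum(x1 for x1, _ in chain) / len(chain)
corr = sum(x1 * x2 for x1, x2 in chain) / len(chain)   # ~ rho, since means are 0
```

Each iteration moves along one coordinate axis at a time, exactly the staircase behaviour seen in Figure 2.3.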
Gibbs sampling is often viewed as an algorithm in its own right. It is, however, simply a special case of the MH algorithm in which a single component is updated during each step, and the proposal distribution which is used is the true conditional distribution of that parameter given the present values of the others. That is, a single iteration of the Gibbs sampler corresponds to
the application of K successive MH steps, with the relevant conditional distribution used as the proposal kernel for each step. The consequence of this special choice of proposal distribution is that the MH acceptance probability is always one, and rejection never occurs. We shall use the Gibbs sampler throughout this Thesis, due to its simplicity and efficiency. In particular, it shall be used in Chapters 5, 7 and 8.

Figure 2.3: Sample path of a bivariate distribution using the Gibbs sampler (components x_1 and x_2, with the starting point marked).
2.12.4 Simulated Annealing
2.12.4.1 Introduction
In many cases we are interested in finding the maximum of a distribution, and not in the distribution itself. For example, if one wishes to find the MAP estimate, one only needs to find the location at which the posterior distribution is maximised, and not the actual value of the distribution. By contrast, if one wishes to find the MMSE estimate, one will need to obtain exact samples from the posterior. Simulated annealing is a Monte Carlo optimization method that attempts to find the global optimum; it improves on local search, which can get trapped in local optima [49].
Simulated annealing has its roots in statistical physics. For example, when a metal is in a high temperature environment, its molecules can move freely as random walkers. When the temperature is decreased slowly, the moving range of these molecules becomes restricted. Finally, at the lowest temperature, the molecules become localized and fluctuate in a constrained way. Our goal is to reach the ground state of the metal. If the temperature is decreased too fast, the system will be trapped in a glassy state, which is neither stable nor uniform. Annealing uses a temperature parameter so that at high temperatures, with some non-zero probability, the algorithm makes unfavorable moves that allow it to move out of local optima. The annealing starts at a high temperature, and gradually the temperature is lowered so that unfavorable moves become less and less likely.
2.12.4.2 Methodology
Let T^(i) denote the temperature at the i-th iteration. The sequence {T^(i)}_{i=1}^{∞} is a cooling schedule if lim_{i→∞} T^(i) = 0. Let β^(i) = 1/T^(i) be an inverse temperature parameter. The simulated annealing method involves simulating a non-homogeneous Markov chain whose invariant distribution at iteration i is no longer equal to p(x), but to:
p^(i)(x) ∝ (p(x))^{β^(i)}.    (2.85)
The reason for doing this is that, under weak regularity assumptions on p(x), p^(∞)(x) is a probability density that concentrates itself on the set of global maxima of p(x). Similarly to the
MH algorithm, the simulated annealing method with target distribution p(x) and a given proposal distribution involves sampling a candidate value x* given the current value x^(i) according to q(x* ← x^(i)). Simulated annealing is the same as Algorithm 4, with the only difference being the modified acceptance probability, which can be expressed as:
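A minimal sketch of the method in Python, using the annealed acceptance rule (2.86) with a symmetric random-walk proposal so that the proposal ratio cancels; the bimodal target, the linear cooling schedule β^(i) = 0.01 i and all function names are illustrative choices of ours:

```python
import math
import random

def p(x):
    """Bimodal target: global maximum near x = 1, a lower local one near x = -1."""
    return 0.5 * math.exp(-4.0 * (x + 1.0) ** 2) + math.exp(-4.0 * (x - 1.0) ** 2)

def simulated_annealing(n, seed=0):
    """Annealed MH: at small beta unfavorable moves are often accepted,
    letting the chain escape the local mode it starts in."""
    rng = random.Random(seed)
    x = -1.0                      # start in the lower (local) mode
    for i in range(1, n + 1):
        beta = 0.01 * i           # cooling schedule: T^(i) = 100 / i
        x_star = x + rng.gauss(0.0, 1.0)
        ratio = p(x_star) / p(x)
        # Annealed acceptance (2.86); the symmetric proposal ratio cancels.
        alpha = 1.0 if ratio >= 1.0 else ratio ** beta
        if rng.random() <= alpha:
            x = x_star
    return x

# Several independent runs; the best final point should sit near x = 1.
best = max((simulated_annealing(5_000, seed=s) for s in range(5)), key=p)
```

Restarting from several seeds and keeping the best end point is a common practical safeguard against a run freezing in the wrong mode.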
α_SA(x^(i), x*) = min{1, [(p(x*))^{β^(i)} q(x^(i) ← x*)] / [(p(x^(i)))^{β^(i)} q(x* ← x^(i))]}.    (2.86)

2.12.5 Convergence Diagnostics of MCMC
A critical issue in MCMC methods is how to determine when it is safe to stop sampling and use the samples to estimate characteristics of the distribution of interest. The theoretical convergence of MCMC algorithms is an area of active research. In this Section we shall briefly outline a few related aspects. For further reading, see [50], [51].
2.12.5.1 Burn in Period
Since the Markov chain created in MCMC methods is not initialised from the stationary distribution, a burn-in period is used. This means that the first L values of the chain are discarded. It is assumed that after the chain has run for L steps, it will have converged to its equilibrium, so that the resulting samples will be from the target distribution. How to choose the burn-in period is a hard question to answer. Some convergence diagnostics have been suggested, for example [45] and [52].
2.12.5.2 Autocorrelation Time Series
The autocorrelation time series is a diagnostic tool for MCMC algorithms. The level of correlation in the final samples will affect the accuracy of the Monte Carlo estimate. This is quantified using the autocorrelation time [53]. Assuming the chain is in equilibrium, let x^(i) be the value of the chain at the i-th iteration. The autocorrelation, ρ_Γ(k), at lag k for some function Γ(·) is defined as

ρ_Γ(k) = cov{Γ(x^(i)), Γ(x^(i+k))} / var{Γ(x^(i))}.    (2.87)

The expectation is taken over the values of x^(i), whose density is p(x^(i)). The autocorrelation time, τ_Γ, for the function Γ is defined as

τ_Γ = Σ_{k=-∞}^{∞} ρ_Γ(k).    (2.88)
If N ≫ τ_Γ, then the variance of the estimator of the expected value of Γ(x),

(1/N) Σ_{i=1}^{N} Γ(x^(i)),    (2.89)

is approximately var{Γ(x)} τ_Γ / N. This is a factor of τ_Γ larger than that of an estimator based on i.i.d. samples of size N. Therefore, the number of effectively independent samples in a run of length N is roughly N/τ_Γ.
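The truncated-sum estimate of τ_Γ and the resulting effective sample size N/τ_Γ can be sketched as follows (our own helper functions, with Γ the identity; for the AR(1) test chain with coefficient a, the true autocorrelation time is (1 + a)/(1 - a), i.e. 3 for a = 0.5):

```python
import random

def autocorrelation(chain, k):
    """Empirical lag-k autocorrelation rho(k) of a scalar chain, as in (2.87)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + k] - mean)
              for i in range(n - k)) / (n - k)
    return cov / var

def autocorrelation_time(chain, max_lag=100):
    """Truncated estimate of tau = sum_k rho(k) in (2.88); by the symmetry
    rho(-k) = rho(k), tau ~ 1 + 2 * sum_{k >= 1} rho(k)."""
    tau = 1.0
    for k in range(1, max_lag + 1):
        r = autocorrelation(chain, k)
        if r < 0.0:              # simple truncation once noise dominates
            break
        tau += 2.0 * r
    return tau

# AR(1) test chain x_{i+1} = a x_i + noise, with known tau = (1 + a)/(1 - a).
rng = random.Random(0)
a, x, chain = 0.5, 0.0, []
for _ in range(100_000):
    x = a * x + rng.gauss(0.0, 1.0)
    chain.append(x)

tau = autocorrelation_time(chain)
ess = len(chain) / tau           # effectively independent samples, N / tau
```

Truncating the sum at the first negative estimated autocorrelation is a common heuristic; more refined windowing rules exist but are beyond this sketch.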
The autocorrelation is an estimate of the efficiency of the Markov chain once in equilibrium, (i.e. when stationarity has been achieved), and not an estimate of how long to run the chain until the chain approaches stationarity.
2.13 Trans Dimensional Markov Chain Monte Carlo
2.13.1 Introduction
In many cases, when trying to carry out Bayesian analysis in a situation where there is a range of models which have parameter spaces of different dimensionality, a common approach is to assign a prior distribution over the collection of competing models. In such situations the posterior distribution over the unknown models and model parameters cannot be analysed using the standard MH framework. In the MH algorithm, any transition and its reverse transition are enabled by the same move type. With dimension changing moves this is not possible: whenever one move increases the number of parameters, the reverse move has to decrease it. One obvious solution would consist of upper bounding the number of possible models by, say, K and running K independent MCMC samplers, each being associated with a fixed model order k = 1, . . . , K. However, this approach suffers from severe drawbacks. Firstly, it is computationally very expensive, since K can be large. Secondly, the same computational effort is attributed to each value of k. In fact, some of these values are of no interest in practice because they have a very weak posterior model probability Pr(k | y). However, in practical settings we do not know a priori on which values of k we should focus our computational effort.
The Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm, proposed by Green [54], [55], solves this problem by extending the MH algorithm to problems where the dimension of the parameter space is variable. The RJMCMC methodology is designed to create a Markov chain whose invariant distribution takes its support on such general state spaces. Up to this Section, we have been comparing densities in the acceptance ratio. However, if we are carrying out model selection, then comparing the densities of objects in different dimensions has no meaning; it is like trying to compare spheres with circles. Instead, we have to be more formal and compare distributions P(dx) = Pr(x ∈ dx) under a common measure of volume. The distribution P(dx) will be assumed to admit a density p(x) with respect to a measure of interest, e.g. the Lebesgue measure (see [56]) in the continuous case: P(dx) = p(x)dx. Hence, the acceptance ratio will now include the ratio of the densities and the ratio of the measures (a Radon-Nikodym derivative [56]). The latter gives rise to a Jacobian term. To compare densities pointwise, we need, therefore, to map the two models to a common dimension.
Green [54] realised that instead of doing the model search in the full product space, one could focus on disjoint union spaces of the form X = ∪_{k=1}^{K} ({k} × X_k), and the target distribution defined on such a space is then given by
p(k, dx) = Σ_{m=1}^{K} p(m, dx_m) I_{{m}×X_m}(k, x),    (2.90)

where K is the number of models and x_m ∈ X_m are the model dependent parameters. To summarise, under Green's formulation, the RJMCMC allows the Markov chain to explore within the sub-spaces and also jump between sub-spaces, say from X_m to X_n.
It is important to mention that to allow this behavior one must extend each pair of communicating spaces, X_m and X_n, to X_{m,n} = X_m × U_{m,n} and X_{n,m} = X_n × U_{n,m}, and also define a deterministic diffeomorphism (dimension matching function) between these extended spaces, labeled h_{nm}. This means that the user must define the proposal distributions q_{mn}(· | m, x_m) and q_{nm}(· | n, x_n), which go from (n, x_n) to (m, x_m) and back again, the extended state spaces X_{m,n} and X_{n,m}, and the deterministic transform between the spaces, h_{nm}. As shown in [33], in a move which goes from (n, x_n) to (m, x_m) one must first generate u_{n,m} ∼ q_{nm}(· | n, x_n) and then evaluate (x_m*, u_{m,n}) = h_{nm}(x_n, u_{n,m}), where the notation x_m* is used for the x_m component of the function h_{nm}. This move will then be accepted according to the following acceptance probability of a dimension changing move, as shown below
min{1, [p(m, x_m*) / p(n, x_n)] × [q(n | m) / q(m | n)] × [q_{mn}(u_{m,n} | m, x_m*) / q_{nm}(u_{n,m} | n, x_n)] × |det ∂h_{nm}(x_n, u_{n,m}) / ∂(x_n, u_{n,m})|},    (2.91)

where the term det ∂h_{nm}(x_n, u_{n,m}) / ∂(x_n, u_{n,m}) is the Jacobian of the function h_{nm}. The generic TDMCMC algorithm is presented in Algorithm 6. The algorithm generates samples for the model indicators, and the model posterior Pr(k | y) can be estimated by
P̂r(k | y) = (1/N) Σ_{l=1}^{N} I_k(k_l).    (2.92)

Some care has to be taken when switching between models with different model indicators. In accordance with the definition of the model indicator, such moves will be called dimension
Algorithm 6 Generic TDMCMC algorithm
Input: Initial state of the Markov chain, x^(0), k^(0)
Output: N samples from the joint distribution p(x, k)
1: for i = 1 ... N do
2:   Propose a move from model n to model m with probability q(m | n).
3:   Sample u_{n,m} from a proposal density q_{nm}(u_{n,m} | n, x_n).
4:   Set (x_m*, u_{m,n}) = h_{nm}(x_n, u_{n,m}), where h_{nm}(·) is a bijection between (x_n, u_{n,m}) and (x_m*, u_{m,n}), and u_{n,m} and u_{m,n} play the role of matching the dimensions of both vectors.
5:   Compute the acceptance probability of the new model, (x_m*, m):
     α((x_m*, m), (x_n, n)) = min{1, [p(m, x_m*) / p(n, x_n)] × [q(n | m) / q(m | n)] × [q_{mn}(u_{m,n} | m, x_m*) / q_{nm}(u_{n,m} | n, x_n)] × |det ∂h_{nm}(x_n, u_{n,m}) / ∂(x_n, u_{n,m})|}.
6:   Generate u ∼ U[0, 1]
7:   if u ≤ α then
8:     (x^(i), k^(i)) = (x_m*, m) (accept)
9:   else
10:    (x^(i), k^(i)) = (x_n, n) (reject)
11:  end if
12: end for

changing moves. Therefore, any TDMCMC algorithm is based on matched pairs of dimension changing moves. We now list a few options for the proposal kernels of the TDMCMC algorithm.
2.13.1.1 Posterior densities as proposal densities
If p_k(x | y, k) is available in closed form for each model k, then the acceptance probability (2.91) reduces to
α_{mn}(x, x*) = min{1, [p_n(x*) q_{nm}(x*, x)] / [p_m(x) q_{mn}(x, x*)]}.    (2.93)
2.13.1.2 Independent sampler
If all parameters of the proposed model are generated from the proposal distribution, then (x_m*, u_{m,n}) = (x_n, u_{n,m}) and the Jacobian in (2.91) is one.
2.13.1.3 Standard Metropolis-Hastings
When the proposed model m equals the current model n, Algorithm 6 corresponds to the traditional MH algorithm. The TDMCMC methodology will be used in conjunction with other methodologies in Chapter 7 to solve the problem of channel estimation with an unknown number of channel taps.
2.14 Stochastic Approximation Markov Chain Monte Carlo
It is well known that the MH algorithm is prone to getting trapped in local energy minima when simulating from a system whose energy landscape is rugged. This means that it may explore the underlying space in a very slow fashion; hence, the mixing rate may be very slow. To overcome the local trap problem, advanced Monte Carlo algorithms have been proposed, such as parallel tempering [57], simulated tempering [58], evolutionary Monte Carlo [59] and dynamic weighting [60], among others. We concentrate on a class of adaptive MCMC processes which aim at behaving as an "optimal" target process via a learning procedure. The special case of adaptive MCMC algorithms governed by Stochastic Approximation (SA) is considered next.
2.14.1 Stochastic Approximation
Consider the problem of finding the unique root ζ of a function h(x). If h(x) can be evaluated exactly for each x and if h is sufficiently smooth, then various numerical methods can be employed to locate ζ. A majority of these numerical procedures, including the popular Newton-Raphson method, are iterative by nature, starting with an initial guess x^(0) of ζ and iteratively defining a sequence x^(n) that converges to ζ as n → ∞. The update step of Newton-Raphson for finding ζ has the form:
x^(n) = x^(n-1) - [∇²h(x^(n-1))]^{-1} ∇h(x^(n-1)).    (2.94)

Here ∇h(·) is the gradient and ∇²h(·) is the Hessian matrix. Now consider the situation where only noisy observations of h(x) are available; that is, for any input x one observes y = h(x) + ǫ, where ǫ is a zero mean random error. Unfortunately, standard deterministic methods cannot be used in this problem. In their seminal paper, Robbins and Monro [61] proposed a stochastic approximation algorithm for defining a sequence of design points x^(n) targeting the root ζ of h in this noisy case. Start with an initial guess x^(0). At stage n ≥ 1, use the state x^(n-1) as the input, observe y^(n) = h(x^(n-1)) + ǫ^(n), and update the guess: (x^(n-1), y^(n)) → x^(n). More precisely, the Robbins-Monro algorithm defines the sequence x^(n) as follows: start with x^(0) and, for n ≥ 1, set
x^(n) = x^(n-1) + w^(n) y^(n)
= x^(n-1) + w^(n) [h(x^(n-1)) + ǫ^(n)],    (2.95)

where ǫ^(n) is a sequence of i.i.d. random variables with mean zero, and the weight sequence w^(n) satisfies

w^(n) > 0,   Σ_n w^(n) = ∞,   Σ_n (w^(n))² < ∞.    (2.96)

While the stochastic approximation algorithm works in more general situations, we can develop our intuition by looking at the special case considered in [61], namely, when h is bounded, continuous and monotone decreasing. If x^(n) < ζ, then h(x^(n)) > 0 and we have
E{x^(n+1) | x^(n)} = x^(n) + w^(n+1) [h(x^(n)) + E{ǫ^(n+1)}]
= x^(n) + w^(n+1) h(x^(n))    (2.97)
> x^(n).
Likewise, if x^(n) > ζ, then E{x^(n+1) | x^(n)} < x^(n). This shows that the move x^(n) → x^(n+1) will be in the correct direction on average. Some remarks on the conditions in (2.96) are in order. While Σ_n (w^(n))² < ∞ is necessary to prove convergence, an immediate consequence of this condition is that w^(n) → 0. Clearly, w^(n) → 0 implies that the effect of the noise vanishes as n → ∞. This, in turn, has an averaging effect on the iterates y^(n). On the other hand, the condition Σ_n w^(n) = ∞ washes out the effect of the initial guess x^(0).
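A sketch of the Robbins-Monro recursion (2.95) in Python, for an illustrative h that is bounded, continuous and monotone decreasing with root ζ = 2; the choice h(x) = -tanh(x - 2), the noise level, and the weights w^(n) = 1/n (which satisfy (2.96)) are all our own:

```python
import math
import random

def h(x):
    """Bounded, continuous, monotone decreasing; unique root at x = 2."""
    return -math.tanh(x - 2.0)

def robbins_monro(n, seed=0):
    """Robbins-Monro recursion (2.95) with weights w^(n) = 1/n."""
    rng = random.Random(seed)
    x = 0.0                                # initial guess x^(0)
    for i in range(1, n + 1):
        y = h(x) + rng.gauss(0.0, 0.5)     # noisy observation of h(x)
        x = x + (1.0 / i) * y              # x^(n) = x^(n-1) + w^(n) y^(n)
    return x

root = robbins_monro(100_000)
```

Note that only noisy evaluations of h are ever used; the recursion averages the noise away as the weights decay.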
2.14.2 Stochastic Approximation Markov Chain Monte Carlo
Let x be a random variable taking support on some finite or compact space X with a dominating measure µ. Let p(x) = κ p_0(x) be a probability density on X with respect to µ, with a possibly unknown normalizing constant κ > 0. We wish to estimate ∫ f dµ, where f is some function depending on p or p_0. For example, suppose p(x) is a prior and g(y | x) is the conditional density of y given x. Then f(x) = g(y | x) p(x) is the unnormalized posterior density of x and its integral, the marginal density of y, is needed to compute a Bayes factor. The following
Stochastic Approximation Monte Carlo (SAMC) method is introduced in [62]. Let A_1, . . . , A_m be a partition of X and let η_i = ∫_{A_i} f dµ, i = 1, . . . , m. Take η̂_i^(0) as an initial guess, and let η̂_i^(n) be the estimate of η_i at iteration n ≥ 1. For notational convenience, write

θ_i^(n) = log η̂_i^(n),
θ^(n) = (θ_1^(n), . . . , θ_m^(n))^T.    (2.98)
The probability vector π = (π_1, . . . , π_m)^T will denote the desired sampling frequency of the A_i s; that is, π_i is the proportion of time we would like the chain to spend in A_i. The choice of π is flexible and does not depend on the particular partition {A_1, . . . , A_m}. The generic SA-MCMC algorithm of [62] is presented in the following:
Stochastic Approximation Markov Chain Monte Carlo Algorithm
1. Initialisation: start with an initial estimate θ^(0) and for n ≥ 0 do:

2. Sampling: draw a sample z^(n+1), using any MCMC method, from the working density
p(z | θ^(n)) ∝ Σ_{i=1}^{m} f(z) exp{-θ_i^(n)} I_{A_i}(z), z ∈ X.

3. Weight update: update the working estimate of θ^(n) recursively by setting

θ^(n+1) = θ^(n) + w^(n+1) (ζ^(n+1) - π),

where w^(n) is as in (2.96) and ζ^(n+1) = (I_{A_1}(z^(n+1)), . . . , I_{A_m}(z^(n+1)))^T.
It turns out that, in the case where no A_i are empty, the observed sampling frequency π̂_i of A_i converges to π_i. This shows that π̂_i is independent of the probability ∫_{A_i} p dµ. Consequently, the resulting chain will not get stuck in regions of high probability, as a standard Metropolis chain might. The SA-MCMC method will be used in Chapter 7 to design a TDMCMC based algorithm.
2.15 Approximate Bayesian Computation
This section deals with the class of Bayesian statistical models which involve intractability in the likelihood model. These classes of models are typically referred to as either likelihood-free or Approximate Bayesian computation (ABC), and these terms will be used interchangeably throughout. The term “intractability” will be used with a slight abuse of notation, in particular it can be used to refer to settings in which the likelihood: can not be expressed in a closed analytic form; can only be written down analytically as a function, operation or an integral expression, which can not be solved analytically; can not be directly evaluated point-wise; or evaluation point-wise involves a prohibitive computational cost. We shall demonstrate the ABC model approximation to the true posterior in its most general form, and describe how evaluation of the intractable likelihood is circumvented. 2.15 Approximate Bayesian Computation 48
2.15.1 Introduction
There are a large number of models for which computation of likelihoods is either impossible or very time consuming. The likelihood function, p(y | x), is of fundamental importance to both the frequentist and the Bayesian schools of statistical inference. The frequentist approach is based on ML estimation, in which we aim to find x̂ = arg max_x p(y | x), whereas the Bayesian approach is centered around finding the posterior distribution p(x | y) ∝ p(y | x) p(x). Clearly, if the likelihood is unknown, both of these approaches may be impossible, and we will need to perform likelihood-free inference.
Likelihood free inference techniques that can be applied in this situation have been developed over the previous decade and are often known as ABC methods. The first use of ABC ideas was in the field of genetics, as developed by Tavare et al. [63] and Pritchard et al. [64].
The development of the ABC methodology has led to intractable likelihood models being considered in several different research disciplines: finance [65], [66]; statistics [67]; ecology [68]; extreme value theory [69]; protein networks [70]; and operational risk [65]. A detailed overview of likelihood-free techniques is found in [71], and the theoretical properties of sampling algorithms working with such ABC posterior models are studied in [72].
The basic algorithm is based upon the rejection method, but has since been extended in various ways. Marjoram et al. [73] suggested an approximate MCMC algorithm, Sisson et al. [74] a sequential importance sampling approach, and Beaumont, Zhang and Balding [75] improved the accuracy by performing local-linear regression on the output.
2.15.2 Basic ABC Algorithm
The simplest ABC algorithm is based upon the rejection algorithm. This was first given by von Neumann in 1951 [76] and is described in Algorithm 7.
Algorithm 7 Rejection algorithm 1
Input: p(x)
Output: A sample from p(x | y)
1: Draw a sample x* from p(x)
2: Accept x* with probability p(y | x*)
It is clear that this algorithm cannot be implemented as p(y|x) is unknown. However, there is an alternative version of this algorithm which does not depend on explicit knowledge of the likelihood, but only requires that we can simulate from the model, as depicted in Algorithm 8.
However, difficulties arise in the case that the observations stem from a continuous space, since it would be impossible to get any samples as the probability of y∗ = y is virtually 0.
Algorithm 8 Rejection algorithm 2
Input: p(x)
Output: a sample from p(x|y)
1: Draw a sample x∗ from p(x)
2: Simulate data y∗ from p(y|x∗)
3: Accept x∗ if y∗ = y
A solution to this problem is to relax the requirement of equality, and to accept x∗ values when the simulated data is close to the real data. To define that, we require a metric ρ on the state space X, with ρ(·,·): X × X → R+, and a tolerance ǫ. The resulting algorithm, our first ABC algorithm, is depicted in Algorithm 9.
Algorithm 9 ABC algorithm 1: ǫ-tolerance rejection
Input: p(x)
Output: a sample from p(x | ρ(y, y∗) ≤ ǫ)
1: Draw a sample x∗ from p(x)
2: Simulate data y∗ from p(y|x∗)
3: Accept x∗ if ρ(y, y∗) ≤ ǫ
Unlike Algorithms 7 and 8, Algorithm 9 only yields an approximation of the posterior density. Accepted x∗ values do not form a sample from the posterior distribution, but from some distribution that is an approximation to it. The accuracy of the algorithm (measured by some suitable distance measure) depends on ǫ in a non-trivial manner. Algorithm 9 is obviously approximative when ǫ ≠ 0. The output from the ǫ-tolerance rejection algorithm is thus associated with the distribution
p(x | ρ(y, y∗) ≤ ǫ) ∝ p(x) p_x(ρ(y, y∗) ≤ ǫ) (2.99)

with
p_x(ρ(y, y∗) ≤ ǫ) = ∫ I[ρ(y, y∗) ≤ ǫ] p(y∗|x) dy∗. (2.100)
The choice of ǫ is therefore paramount for good performance of the method. If ǫ is too large, the approximation is poor; when ǫ → ∞ it amounts to simulating from the prior, since all simulations are accepted (as p_x(ρ(y, y∗) ≤ ǫ) → 1 when ǫ → ∞). If ǫ is sufficiently small, p(x | ρ(y, y∗) ≤ ǫ) is a good approximation of p(x|y). There is no approximation when ǫ = 0, since the ǫ-tolerance rejection algorithm then corresponds to the exact rejection algorithm, but the acceptance probability may be too low to be practical. Selecting the “right” ǫ is thus crucial. It is customary to pick ǫ such that the acceptance probability is around 20%.
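As a concrete illustration, the ǫ-tolerance rejection scheme of Algorithm 9 can be sketched in a few lines of Python. The toy model (Gaussian prior, sample-mean pseudo-data) and all function names are illustrative assumptions, not part of the thesis:

```python
import random
import statistics

def abc_rejection(prior_sampler, simulator, distance, y_obs, eps, n_samples):
    """epsilon-tolerance ABC rejection (a sketch of Algorithm 9)."""
    accepted = []
    while len(accepted) < n_samples:
        x_star = prior_sampler()              # 1: draw x* from the prior p(x)
        y_star = simulator(x_star)            # 2: simulate y* from p(y | x*)
        if distance(y_obs, y_star) <= eps:    # 3: accept x* if within tolerance
            accepted.append(x_star)
    return accepted

# Toy model: x ~ N(0, 1) prior; the "data" is the mean of 50 draws from N(x, 1).
random.seed(1)
samples = abc_rejection(
    prior_sampler=lambda: random.gauss(0.0, 1.0),
    simulator=lambda x: statistics.fmean(random.gauss(x, 1.0) for _ in range(50)),
    distance=lambda a, b: abs(a - b),
    y_obs=1.0, eps=0.1, n_samples=200)
```

With ǫ = 0.1 the accepted values concentrate near the true posterior mean; shrinking ǫ sharpens the approximation at the cost of a lower acceptance rate, exactly the trade-off discussed above.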
2.15.3 Data Summaries
For problems with large amounts of high-dimensional data, Algorithm 9 will be impractical, as the simulated data will never closely match the observed data. The standard approach to reducing the number of dimensions is to use a summary statistic, T (y), which should summarise the important parts of the data. We then adapt Algorithm 9 so that parameter values are accepted if the summary of the simulated data is close to the summary of the real data. This approach also lends itself naturally to application in cases where y has continuous components. This Algorithm is presented in Algorithm 10.
Algorithm 10 ABC algorithm 2
Input: p(x)
Output: a sample from p(x | ρ(T(y), T(y∗)) ≤ ǫ)
1: Draw a sample x∗ from p(x)
2: Simulate data y∗ from p(y|x∗)
3: Accept x∗ if ρ(T(y), T(y∗)) ≤ ǫ
The summary statistic T (y) is called a sufficient statistic for x if and only if
p(x|y) = p(x|T(y)), (2.101)

i.e., if the conditional distribution of x given the summary equals the posterior distribution. The idea is that if T(y) is known, then the full data set, y, cannot provide any extra information about x. If a sufficient statistic is available, then Algorithm 10 is essentially the same as Algorithm 9. However, for problems where both the posterior distribution and the likelihood function are unknown, it will not in general be possible to determine whether a statistic is sufficient.
The ABC presented before still suffers from two serious problems:
1. The rejection based algorithms are inefficient. Only a fraction of the proposed samples are actually accepted, while most of the samples are discarded.
2. In the majority of cases, directly sampling from the target ABC posterior via inversion of the cdf is not achievable.
For these two reasons, a class of sampling algorithms known as MCMC-ABC has been developed. We present it in the following section.
2.15.4 MCMC-ABC Samplers
As mentioned before, the ABC rejection algorithm is inefficient because it will continue to draw samples in regions of parameter space that are clearly not useful. One approach to this problem is to derive an MCMC-ABC algorithm. The hope is that the Markov chain will spend more time in interesting regions of high probability compared to ABC rejection.
It is shown in [71] that the ABC method embeds an “intractable” target posterior distribution, denoted by p(x|y), into an augmented model,
p(x, z, y) = p(y|z, x) p(z|x) p(x), (2.102)

where z ∈ X is an auxiliary vector on the same space as y. In this augmented Bayesian model, the density p(y|z, x) weights the intractable posterior. A popular example that we shall utilize is given by
p(y|z, x) ∝ { 1, if ρ(T(y), T(z)) ≤ ǫ,
            { 0, otherwise. (2.103)
This makes a Hard Decision (HD) to reward a summary statistic of the augmented auxiliary variables, T(z), within an ǫ-tolerance of the summary statistic of the actual observed data, T(y), as measured by a distance metric ρ. In cases where the data samples are small in size, a direct comparison of the actual data with the auxiliary variables is feasible, so summary statistics are not required and the Euclidean distance can be used as the distance metric. Other more sophisticated choices of weighting function and distance metric are considered in [72], [71] and [66]. Hence, in the ABC context the intractable target posterior marginal distribution, p(x|y), that we are interested in is given by:
p_ABC(x|y) ∝ ∫_X p(y|z, x) p(z|x) p(x) dz
           = p(x) E_{p(z|x)}{p(y|z, x)}
           ≈ (p(x)/N) Σ_{n=1}^{N} p(y|z^(n), x), (2.104)

where z^(1), ..., z^(N) are sampled realizations of data from the (intractable) likelihood. The ABC-MCMC sampler is presented in Algorithm 11.
Additionally, we note that the tolerance ǫ typically should be set as low as possible for a given computational budget. Typically, this will depend in the ABC context on the choice of algorithm used to sample from the ABC posterior distribution.
Algorithm 11 ABC-MCMC sampler
Input: p(x)
Output: {x^(i)}_{i=1}^{N}, samples from p(x | ρ(T(y), T(y∗)) ≤ ǫ)
1: Initialise x^(0) to an arbitrary starting point
2: for i = 1, ..., N do
3:   Propose x∗ ∼ q(x∗ ← x^(i−1))
4:   Simulate data y∗ from p(y|x∗)
5:   if ρ(T(y), T(y∗)) ≤ ǫ then
6:     Compute α(x∗, x^(i−1)) = min{1, [p(x∗)/p(x^(i−1))] × [q(x^(i−1) ← x∗)/q(x∗ ← x^(i−1))]}
7:     Generate u ∼ U[0, 1]
8:     if u ≤ α(x∗, x^(i−1)) then
9:       x^(i) = x∗ (accept)
10:    else
11:      x^(i) = x^(i−1) (reject)
12:    end if
13:  else
14:    x^(i) = x^(i−1) (reject)
15:  end if
16: end for

In Chapter 8 we shall introduce a novel weighting function, which will serve as an alternative to the HD rule. Instead of using the HD weighting function, which rewards summary statistics of the augmented auxiliary variables, we shall introduce a Soft Decision (SD) weighting function that penalizes summary statistics as a non-linear function of the distance between summary statistics. That is, even though the weighting may be small, it will remain non-zero, unlike the HD rule. The intention of the new rule is to improve the mixing rate of the Markov chain.
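Under the same toy-model assumptions as before (Gaussian prior, sample mean as the summary statistic; all names are ours), Algorithm 11 can be sketched with a symmetric random-walk proposal, so that the proposal ratio q(x^(i−1) ← x∗)/q(x∗ ← x^(i−1)) in step 6 cancels:

```python
import math
import random

def abc_mcmc(log_prior, simulator, distance, summary, y_obs, eps,
             x0, n_iter, prop_std=0.5):
    """ABC-MCMC sampler (a sketch of Algorithm 11)."""
    t_obs = summary(y_obs)
    x, chain = x0, []
    for _ in range(n_iter):
        x_star = random.gauss(x, prop_std)            # 3: propose x* ~ q(x* <- x)
        y_star = simulator(x_star)                    # 4: simulate pseudo-data
        if distance(t_obs, summary(y_star)) <= eps:   # 5: hard-decision check
            alpha = min(1.0, math.exp(log_prior(x_star) - log_prior(x)))
            if random.random() <= alpha:              # 7-9: accept with prob. alpha
                x = x_star
        chain.append(x)                               # 11/14: otherwise keep x
    return chain

random.seed(2)
chain = abc_mcmc(
    log_prior=lambda x: -0.5 * x * x,                 # N(0, 1) prior, up to a constant
    simulator=lambda x: [random.gauss(x, 1.0) for _ in range(50)],
    distance=lambda a, b: abs(a - b),
    summary=lambda y: sum(y) / len(y),                # sample mean (sufficient here)
    y_obs=[1.0] * 50, eps=0.1, x0=0.0, n_iter=3000)
```

Note how rejection at step 5 still records the current state, which is exactly why a chain stuck with small ǫ mixes slowly, motivating the SD weighting above.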
2.15.5 Distance Metrics
Having obtained summary statistic vectors T (y) and T (z), likelihood-free methodology then measures the distance between these vectors using a distance metric, denoted generically by ρ (T (y) , T (z)). The most popular example involves the basic Euclidean distance metric which sums up the squared error between each summary statistic as follows:
ρ(T(y), T(z)) = Σ_{i}^{dim(T)} (T_i(y) − T_i(z))². (2.105)

Recently, more advanced choices have been proposed and their impact on the methodology has been assessed. These include the scaled Euclidean distance, given by
ρ(T(y), T(z)) = Σ_{i}^{dim(T)} W_i (T_i(y) − T_i(z))²; (2.106)

the Mahalanobis distance:
ρ(T(y), T(z)) = (T(y) − T(z))^T Σ^{−1} (T(y) − T(z)); (2.107)
Lp norm:
ρ(T(y), T(z)) = [Σ_{i}^{dim(T)} |T_i(y) − T_i(z)|^p]^{1/p}; (2.108)

and the city block distance:
ρ(T(y), T(z)) = Σ_{i}^{dim(T)} |T_i(y) − T_i(z)|. (2.109)

In particular, we note that distance metrics which include information regarding correlation between the summary statistics produce estimates of the marginal posterior which are, for a finite computational budget, typically more accurate and involve greater efficiency in the simulation algorithms utilized. The ABC theory will be useful in Chapter 8, where we develop algorithms for systems with intractable likelihood.
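The metrics in (2.105)-(2.109) translate directly into code. A plain-Python sketch (function names are ours):

```python
def euclidean_sq(t_y, t_z):
    """Summed squared error of (2.105)."""
    return sum((a - b) ** 2 for a, b in zip(t_y, t_z))

def scaled_euclidean_sq(t_y, t_z, w):
    """Scaled (weighted) version of (2.106)."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, t_y, t_z))

def lp_norm(t_y, t_z, p):
    """Lp norm of (2.108); p = 1 recovers the city block distance."""
    return sum(abs(a - b) ** p for a, b in zip(t_y, t_z)) ** (1.0 / p)

def city_block(t_y, t_z):
    """City block (L1) distance of (2.109)."""
    return sum(abs(a - b) for a, b in zip(t_y, t_z))
```

A Mahalanobis version (2.107) would additionally require the inverse covariance Σ⁻¹ of the summary statistics, which is exactly the correlation information noted above.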
2.15.6 ABC methodology summary
The ABC methodology is still relatively new and in its infancy, and there are many open questions to be answered. While ABC methods are only approximate, as their name suggests, they do have many advantages. Firstly, they are almost trivial to use and code once one is able to simulate from the model. Secondly, changes to the model or data structure can easily be incorporated without any changes to the inference mechanism. Finally, if the likelihood function is unavailable or prohibitively expensive to compute, then standard MCMC methods cannot be used, which is the main reason to use and develop ABC methods.
There are many technical questions which remain to be answered. For example, it is currently not known how accurate the approximation p_ABC(x|y) is, or how the accuracy depends on the choice of metric and ǫ. Another problem is that if data summaries need to be used, the intuition of the practitioner is relied upon when choosing a good summary statistic. A notion of approximate sufficiency is still needed, along with a methodical way of finding summaries which are nearly sufficient and which therefore capture the pertinent parts of the data.
2.16 Concluding Remarks
In this chapter we have presented background material on Bayesian inference. The main points presented in this chapter are:
• We formulated estimation objectives under the Bayesian framework and derived a few common estimators.
• We discussed the model selection problem and parameter estimation under model uncertainty.
• We provided an overview of Bayesian sequential filtering and the Kalman filter.
• We presented the EM and BEM methods to obtain the ML and MAP estimates, respectively.
• We presented an overview of Monte Carlo methods, in particular, the MCMC methodology for Bayesian inference and the TDMCMC methodology for Bayesian inference under model uncertainty.
• We formulated the SA and SA-MCMC methodologies to overcome the local trap problem.
• We presented an overview of the ABC methodology that enables inference in models for which the likelihood does not exist or is intractable.

Chapter 3
Introduction to Wireless Communication
“Essentially, all models are wrong, but some are useful.”
George Edward Pelham Box
3.1 Introduction
This chapter provides a brief overview of wireless communications. The presentation is not intended to be exhaustive and does not provide new results; rather, it is intended to provide the necessary background for understanding the following chapters.
3.2 Modeling of Fading Channels
The importance of understanding radio propagation channels for the successful design of communication systems cannot be overstated. Earlier, the wireless medium was viewed as an obstacle or a limiting factor in designing reliable communication links. However, decades of research and subsequent insights have changed this paradigm. Modern communication systems tend to exploit the channel behaviour for increased performance.
Wireless channels operate through electromagnetic radiation from the transmitter to the receiver. The transmitted signal propagates through the physical medium, which contains obstacles and surfaces for reflection, causing multiple reflected versions of the same source signal to arrive at the receiver at different times. The reflected signals are directly influenced by the material properties of the surfaces they reflect off or permeate, such as dielectric constants, permeability, conductivity, thickness, etc. The aforementioned effects due to the propagation medium are represented as an abstract entity called the channel.
Figure 3.1: Types of fading channels. Fading divides into large-scale and small-scale fading; small-scale fading is classified by signal dispersion (flat fading vs. frequency-selective fading) and by the time variance of the channel (fast fading vs. slow fading).
The effect of multiple wavefronts is represented as multiple paths in a channel. If the transmitter or the receiver is mobile, the channel is said to be time-varying. In order to recover the transmitted signal at the receiver, it is essential to know some information about the channel; acquiring this information is referred to as channel estimation. The cancellation of channel effects is referred to as equalization.
In principle, one could solve the electromagnetic field equations, in conjunction with the transmitted signal, to find the electromagnetic field impinging on the receiver antenna. However, this task is not practical, since in order to do so one must take into account the physical properties (e.g., location, material type) of the obstructions, such as ground, buildings, vehicles, etc. Since solving the field equations is too complex a task, a simpler model which is mathematically tractable is used. Although only an approximation of the real physical environment, it gives good performance. The general term fading is used to describe fluctuations in the envelope of a transmitted radio signal. However, when speaking of such fluctuations, one must consider whether the observation has been made over short distances or long distances. For a wireless channel, the former case will show rapid fluctuations in the signal’s envelope, while the latter will give a more slowly varying, averaged view. For this reason, the first scenario is formally called small-scale fading or multi-path, while the second scenario is referred to as large-scale fading or path loss.
Large-scale fading affects only the strength of the received signal, and will not be considered in this thesis. We will look, however, at different types of small-scale fading, both in terms of the signal dispersion and the time variance of the channel (see Figure 3.1). We shall now briefly outline a few important properties of wireless channels.
3.2.1 Tapped Delay-line Channel Model
We represent the Channel Impulse Response (CIR) of a time-varying channel as an input-output relation between the transmit and receive antennas whose impulse response is modeled by L propagation paths [77],
h(τ, t) = Σ_{l=0}^{L−1} a_l(t) δ(τ − τ_l(t)), (3.1)

where h(τ, t) is the response at time t to an impulse transmitted at time t − τ, and {a_l(t)}_{l=0}^{L−1} are the channel coefficients at time t. The corresponding Channel Frequency Response (CFR) can be expressed as
H(f, t) = ∫_{−∞}^{∞} h(τ, t) exp{−j2πfτ} dτ
        = Σ_{l=0}^{L−1} a_l(t) exp{−j2πfτ_l(t)}. (3.2)

The expressions in (3.1)-(3.2) are elegant and easy to work with. Next we discuss different aspects and properties of wireless channels.
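For a fixed time t, (3.1)-(3.2) reduce to a static set of taps, and the CFR is a finite sum of complex exponentials. A small sketch (names and numbers are ours) that also illustrates how multipath produces frequency selectivity:

```python
import cmath

def cfr(a, tau, f):
    """Channel frequency response H(f) = sum_l a_l * exp(-j 2 pi f tau_l),
    i.e. (3.2) evaluated for a snapshot of the tapped delay-line CIR (3.1)."""
    return sum(al * cmath.exp(-2j * cmath.pi * f * tl)
               for al, tl in zip(a, tau))

# Two equal taps spaced 1 microsecond apart: at f = 500 kHz the second tap
# arrives with phase exp(-j*pi) = -1, producing a deep fade (|H| ~ 0).
fade = abs(cfr([0.5, 0.5], [0.0, 1e-6], 5e5))
```

A single tap, by contrast, gives |H(f)| constant over all f, which is exactly the flat-fading case discussed later in this section.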
3.2.2 Doppler Offset
Due to the relative motion between the transmitter and the receiver, each multipath wave experiences a frequency shift. The phenomenon is termed the Doppler shift, and is directly proportional to the velocity and direction of motion of the mobile with respect to the direction of arrival of the received multipath wave. Thus, the Doppler shift can be written as,
f_d = (v/λ) cos(α) = (v f_c / c) cos(α), (3.3)

where v is the speed of the mobile, α is the direction of motion of the mobile with respect to the direction of arrival of the multipath, c is the speed of light, and λ and f_c are the wavelength and carrier frequency of the radio signal, respectively. In general, if we introduce any form of acceleration or change of direction between the transmitter and receiver (e.g., driving in a curve), the Doppler shift will become time dependent.
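As a quick numerical check of (3.3) (the function name and the example figures are ours):

```python
import math

def doppler_shift(v, fc, alpha, c=3.0e8):
    """Doppler shift f_d = (v * f_c / c) * cos(alpha) from (3.3)."""
    return v * fc / c * math.cos(alpha)

# A mobile at 30 m/s (108 km/h) with a 2 GHz carrier, moving straight towards
# the incoming wave (alpha = 0), sees a 200 Hz shift; motion perpendicular to
# the wave (alpha = pi/2) gives essentially no shift.
fd_max = doppler_shift(30.0, 2.0e9, 0.0)
```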
3.2.3 Power Delay Profile
The Power Delay Profile (PDP) is defined as the squared value of the CIR.
P(t) = Σ_{l=0}^{L−1} a_l²(t) δ(τ − τ_l(t)). (3.4)

The PDP represents the relative received power in excess delay with respect to the first path. PDPs are usually found by averaging instantaneous PDP measurements. For indoor Radio Frequency (RF) channels, the PDP usually has an exponentially decaying shape.
3.2.4 Coherence Bandwidth
The coherence bandwidth is a statistical measure of the range of frequencies over which the channel can be considered flat (i.e., a channel which passes all spectral components with approximately equal gain and linear phase) [78]. The coherence bandwidth is an important parameter in the design of many wireless systems. In particular, in OFDM systems, if the subcarrier spacing is set to be less than the coherence bandwidth of the channel, each subcarrier is affected by a flat channel and thus simple one-tap equalization can be used.
3.2.5 Coherence Time
The coherence time, Tc, is the time-domain dual of the Doppler spread and is used to characterize the time-varying nature of the frequency dispersiveness of the channel in the time domain. Tc is the time over which the correlation of the channel responses decreases by 3 dB. For example, in OFDM systems, to avoid fast fading effects, the OFDM symbol length needs to be shorter than the coherence time of the channel.
3.2.6 Time Selective and Fast Fading Channels
If the CIR changes rapidly within the symbol duration, then the channel is said to be a fast fading channel. Under such conditions, the coherence time of the channel, Tc, is smaller than the symbol period of the transmitted signal, Ts. This causes frequency dispersion (also called time-selective fading) due to Doppler spreading, which leads to signal distortion. In the frequency domain, mobility results in a frequency spread of the signal which depends on the operating frequency and the relative speed between the transmitter and receiver, also known as Doppler spread [79]. Therefore, a signal undergoes fast fading or time-selective fading if [78]
Ts > Tc, (3.5)

and

Bs < Bd, (3.6)

where Bs is the bandwidth of the transmitted signal and Bd is the Doppler spread of the channel.
3.2.7 Slow Fading Channels
In a slow fading channel, the CIR changes at a rate much slower than the transmitted baseband signal. In this case, the channel may be assumed to be static over one or several reciprocal bandwidth intervals. In the frequency domain, this implies that the Doppler spread of the channel is much less than the bandwidth of the baseband signal. Therefore, a signal undergoes slow fading if [78]
Ts << Tc, (3.7) and
Bs >> Bd. (3.8)
3.2.8 Frequency Selective Channels
If the channel possesses a constant-gain and linear-phase response over a bandwidth that is smaller than the bandwidth of the transmitted signal, then the channel creates frequency-selective fading on the received signal. Under such conditions, the CIR has a multipath delay spread which is greater than the reciprocal bandwidth of the transmitted message waveform. When this occurs, the received signal includes multiple versions of the transmitted waveform which are attenuated (faded) and delayed in time, and hence the received signal is distorted. As a result, the channel induces Inter Symbol Interference (ISI) [78]. For frequency-selective fading, the spectrum of the transmitted signal has a bandwidth which is greater than the coherence bandwidth Bc of the channel. Thus, a signal undergoes frequency selective fading if
Bs >> Bc, (3.9) and
Ts < στ, (3.10)

where στ is the Root Mean Square (RMS) value of the delay spread.
3.2.9 Flat Fading Channels
If the mobile radio channel has a constant gain and linear phase response over a bandwidth which is greater than the bandwidth of the transmitted signal, then the received signal will undergo frequency-flat fading or, simply, flat fading. Flat fading channels are sometimes referred to as narrowband channels, since the bandwidth of the applied signal, Bs, is narrow compared to the channel bandwidth, or coherence bandwidth, Bc. To summarize, a signal undergoes flat fading if [78]
Bs << Bc, (3.11) and
Ts >> στ . (3.12)
3.3 Channel Models
Many channel models have been suggested for wireless communication; see [80], [81], [82], to cite just a few. We now review two probabilistic channel models which are widely used, mainly due to their simplicity and their ability to form a good approximation of the real physical channels.
3.3.1 Rayleigh Fading Channels
The Rayleigh fading channel model is the simplest model for channel filter taps. It is based on the assumption that there are a large number of statistically independent reflected and scattered paths with random amplitudes in the delay window corresponding to a single tap. Each tap hl is composed of the sum of many independent random variables, so that by invoking the Central Limit Theorem it can be modeled as a zero-mean Complex Gaussian random variable:
h_l ∼ CN(0, σ_l²). (3.13)

The magnitude |h_l| of the l-th tap is a Rayleigh random variable with density

p(x) = (x/σ_l²) exp(−x²/(2σ_l²)), x ≥ 0, (3.14)

and the squared magnitude |h_l|² is exponentially distributed with density

p(x) = (1/σ_l²) exp(−x/σ_l²), x ≥ 0. (3.15)
Rayleigh fading corresponds to a Non Line Of Sight (NLOS) situation. The other common fading model is the Rician fading model which corresponds to a Line Of Sight (LOS) situation. Further information about fading models can be found in [83].
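A Rayleigh tap as in (3.13) is easy to sample by drawing the real and imaginary parts independently; the empirical tap power then matches σ_l². A sketch under our own naming, with unit tap power assumed:

```python
import math
import random

def rayleigh_tap(sigma2, rng=random):
    """Draw h_l ~ CN(0, sigma_l^2) as in (3.13): the real and imaginary
    parts are i.i.d. N(0, sigma_l^2 / 2)."""
    s = math.sqrt(sigma2 / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

random.seed(3)
taps = [rayleigh_tap(1.0) for _ in range(20000)]
mean_power = sum(abs(h) ** 2 for h in taps) / len(taps)  # E|h_l|^2 ~ sigma_l^2
```

The magnitudes |h_l| of these draws follow the Rayleigh density and |h_l|² the exponential density, as stated above.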
3.3.2 Clarke’s / Jake’s Model
The Rayleigh fading channel model provides a statistical view of each channel component hl. This, however, is only part of the full behavior of the channel, as it does not provide information about the statistical behavior of the channel over time. A statistical quantity that models the relationship of the channel taps over time is the tap gain autocorrelation function Rl[m], which is defined as
R_l[m] ≜ E{h_l[n] h_l∗[n + m]}. (3.16)
In Jake’s model, the transmitter is fixed and the mobile receiver is moving at speed v. The transmitted signal is assumed to be scattered by stationary objects around the mobile. There are K paths, the k-th path arriving at angle θ_k ≜ 2πk/K, k ∈ {0, ..., K − 1}, with respect to the direction of motion. The scattered path arriving at the mobile at angle θ has a delay τ_θ(t) and a time-invariant gain a_θ. The input/output relationship is given by
y(t) = Σ_{k=0}^{K−1} a_{θ_k} x(t − τ_{θ_k}(t)). (3.17)

The most common scenario assumes a uniform power distribution and an isotropic antenna gain pattern. This models the situation where the scatterers are located in a ring around the mobile. Making the assumption that the phase is uniformly distributed in [0, 2π] and i.i.d. across all angles θ, the tap gain is a sum of many small independent contributions, one from each angle. By the Central Limit Theorem, the process can be approximated as Gaussian. It can be shown that the process is stationary with an autocorrelation function R_l[m] given by
R_l[m] = 2a²π J_0(2πm f_d / W), (3.18)

where W is the bandwidth and J_0(·) is the zeroth-order Bessel function of the first kind:

J_0(x) ≜ (1/π) ∫_0^π exp{jx cos θ} dθ. (3.19)
The power spectral density S_l(f), defined on [−1/2, +1/2], is given by [84]

S_l(f) = { (1/(π f_d T_s)) · (1 − (f/(f_d T_s))²)^(−1/2), |f| ≤ f_d T_s,
         { 0, else, (3.20)
where f_d is the Doppler offset and T_s = 1/W is the symbol duration. It can be verified that taking the inverse Fourier transform of (3.20) yields (3.18).
3.3.3 Approximation of Jake’s Model
Since (3.20) is a non-rational function, a precise fit of the theoretical statistics is impossible with an Auto Regressive Moving Average (ARMA) model of any order. However, our goal is to accurately capture the dynamics of the wireless channel while remaining mathematically tractable for implementation.
To simulate the structured variations of time-selective wireless fading channel, three types of linear models are usually considered [85], [77], which are:
1. Auto Regressive (AR) or ‘All-pole’ model.
2. Moving Average (MA) or ‘All-zero’ model.
3. ARMA model.
Out of these three, the AR model is the most frequently used because of its simplicity and ease of design (i.e., the equations that determine its parameters are linear). In [85], it was demonstrated that it is possible to capture most of the channel tap dynamics using a low-order AR model.
The Yule-Walker equations [77] can be used to determine the parameters of the model. The Levinson-Durbin recurrent algorithm, proposed by Norman Levinson in 1947 and improved by James Durbin in 1960, can be used to solve these equations. The key of the algorithm is the recursive computation of the filter coefficients, beginning with the first order and increasing the order recursively, using the lower order solutions to obtain the solution to the next higher order.
Selecting the model order is another difficult problem in developing a linear model. In [86], information-theoretic results show that the first-order AR model provides a sufficiently accurate model for time-selective fading channels. As shown in [87], a simple Gauss-Markov model can capture most of the channel tap dynamics and is suitable for channel tracking; therefore, we adopt it henceforth. Thus, using discrete-time notation, h_l varies according to
h_l[n] = α h_l[n − 1] + v[n], (3.21)

where α is the AR coefficient which accounts for the variations in the channel due to Doppler shift, and v[n] is zero-mean complex Gaussian noise with variance σ_v², statistically independent of h_l[n − 1]. Using the Yule-Walker equations [77], α and σ_v² can be calculated as

α = R_l[1] = 2a_l²π J_0(2π f_c v / (W c)), (3.22a)
σ_v² = 1 − α², (3.22b)

where a_l is the variance of h_l.
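The first-order model (3.21)-(3.22) is straightforward to simulate. In the sketch below (our names; unit tap power assumed), the driving-noise variance is set to 1 − α² so the tap stays wide-sense stationary with unit power, and the empirical lag-1 correlation recovers α:

```python
import math
import random

def simulate_ar1_tap(alpha, n, seed=0):
    """Gauss-Markov tap evolution h[n] = alpha * h[n-1] + v[n]  (3.21),
    with v[n] ~ CN(0, 1 - alpha^2) so the stationary tap power is 1."""
    rng = random.Random(seed)
    s_h = math.sqrt(0.5)                        # per-component std of h[0]
    s_v = math.sqrt((1.0 - alpha ** 2) / 2.0)   # per-component std of v[n]
    h = complex(rng.gauss(0.0, s_h), rng.gauss(0.0, s_h))
    out = []
    for _ in range(n):
        v = complex(rng.gauss(0.0, s_v), rng.gauss(0.0, s_v))
        h = alpha * h + v
        out.append(h)
    return out

hs = simulate_ar1_tap(alpha=0.99, n=50000)
power = sum(abs(h) ** 2 for h in hs) / len(hs)                    # ~ 1
lag1 = (sum((hs[n] * hs[n + 1].conjugate()).real for n in range(len(hs) - 1))
        / sum(abs(h) ** 2 for h in hs))                           # ~ alpha
```

An α close to 1 corresponds to a slowly fading channel; smaller α gives faster channel variation.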
Now that we have introduced the basic elements of the wireless channel, we shall discuss a few transmission techniques that take advantage of its properties.
3.4 Overview of Multi Antenna Communication Systems
A large suite of techniques, known collectively as Multiple Input Multiple Output (MIMO) communications, has been developed in the past several years to effectively exploit the spatial domain when multiple antennas are used at both ends of the wireless communication link. Following Telatar’s paper [3], extraordinary capacity gains have been demonstrated by MIMO systems over conventional Single Input Single Output (SISO) systems. In addition to diversity gain and array gain, MIMO systems can offer the so-called multiplexing gain by using parallel data streams (usually called spatial modes or eigen subchannels) within the same frequency band at no additional power expense. In the presence of rich scattering, MIMO links offer capacity gains that are proportional to the minimum of the number of channel inputs and outputs.
3.4.1 The Linear MIMO Channel
Although MIMO channels arise in many different communication scenarios, such as wire-line systems or frequency-selective systems [88], [89], [90], in this thesis we concentrate on flat-fading uncorrelated MIMO channels. The MIMO system model with nT transmit antennas and nR receive antennas is depicted in Figure 3.2.
3.4.2 Channel Model
The MIMO channel is characterized by its transition probability density function, p(y|x), which describes the probability of receiving the vector y conditioned on the fact that the vector x was actually transmitted. Common to MIMO communication systems is that, under some assumptions, the input-output relation can be described rather accurately by the following
Figure 3.2: MIMO system model with nT transmit antennas and nR receive antennas.

linear model:
y = Hx + w, (3.23)

where x ∈ A^{nT} is the transmitted symbol, y is the received vector, and H ∈ C^{nR×nT} represents the linear response of the channel, such that its element [H]_{ij} denotes the channel path gain between the j-th transmitter and the i-th receiver. In the vast majority of cases (and in this dissertation in particular), the noise terms are modeled as i.i.d. circularly symmetric complex Gaussian random variables with zero mean and covariance matrix R_w = σ_w² I. Or, formally:
E{w} = 0, (3.24)
E{ww^H} = R_w. (3.25)

3.4.3 Uncertainty Models for the Channel State Information
In many practical cases, we do not have full knowledge of the CSI. This may be attributed to a noisy channel estimate, quantized values, etc. In the most typical communication setup, channel estimation is performed at the receiver during a training period in which the transmitter sends a pilot or training sequence. Consequently, the quality of the CSI can be categorized into three different situations:
• No CSI: the receiver does not have any knowledge about the values of the CSI, or its statistics. Under these conditions, the detection of symbols is termed non-coherent detection.
• Perfect CSI: the receiver has full knowledge of the instantaneous channel realization. Under these conditions, the detection of symbols is termed coherent detection.
• Imperfect CSI: the receiver has inaccurate knowledge about the parameters describing the channel. For example, the receiver may be informed of an estimated channel matrix Ĥ ≠ H, with corresponding error covariance matrix C. In that case, a channel model that can be used is one in which the channel is a random matrix H ∼ CN(Ĥ, C).

3.5 Detection Techniques in MIMO Systems
MIMO systems transmit parallel data sequences simultaneously, in the same frequency band, at the same time. As a result, the receiver faces the difficult task of distinguishing the multiple data sequences. To solve this problem, several MIMO detection algorithms have been stated in the literature. In general, these algorithms differ in their performance merits, such as BER and expected complexity. The Maximum A Posteriori (MAP) detector minimizes the error probability, and can be expressed in the case of perfect CSI as
x̂ = arg max_{x ∈ A^{nT}} p(x|H, y)
  = arg max_{x ∈ A^{nT}} p(y|H, x) Pr(x), (3.26)

where A is the constellation used to transmit the symbols. The likelihood function in this case can be expressed as
p(y|H, x) = 1/(πσ_w²)^{nR/2} exp{−‖y − Hx‖²/σ_w²}. (3.27)
If there is no a priori information at the input, or if the input symbols are equiprobable, the Maximum Likelihood (ML) detector is equivalent to the MAP detector and can be expressed as
x̂ = arg max_{x ∈ A^{nT}} p(y|H, x)
  = arg max_{x ∈ A^{nT}} 1/(πσ_w²)^{nR/2} exp{−‖y − Hx‖²/σ_w²} (3.28)
  = arg min_{x ∈ A^{nT}} ‖y − Hx‖².

In both cases, the exponential growth of the search space, |A|^{nT}, prohibits the use of brute-force ML detection, i.e., simply evaluating ‖y − Hx′‖² for all possible x′ ∈ A^{nT}. Therefore, more efficient but possibly suboptimal detectors need to be studied. We now provide a brief overview of some of these detection methods.
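The brute-force search of (3.28) is trivial to write down, which makes the exponential complexity concrete: |A|^{nT} candidates must be scored. A plain-Python sketch for a small real-valued system (names and numbers are ours):

```python
import itertools

def ml_detect(y, H, constellation):
    """Brute-force ML detection (3.28): minimise ||y - Hx||^2 over all
    x in A^{nT}. Exponential in nT, so only feasible for tiny systems."""
    n_t = len(H[0])
    best, best_cost = None, float("inf")
    for x in itertools.product(constellation, repeat=n_t):
        cost = 0.0
        for row, yi in zip(H, y):                 # residual r_i = y_i - (Hx)_i
            r = yi - sum(h * xj for h, xj in zip(row, x))
            cost += abs(r) ** 2
        if cost < best_cost:
            best, best_cost = x, cost
    return best

# 2x2 real channel with BPSK constellation A = {-1, +1}, noiseless reception.
H = [[1.0, 0.5], [0.2, 1.0]]
x_true = (1, -1)
y = [sum(h * x for h, x in zip(row, x_true)) for row in H]
x_hat = ml_detect(y, H, (-1, 1))
```

For nT = 2 and BPSK only 4 candidates are scored, but with, e.g., 64-QAM and 4 antennas the search already covers 64⁴ ≈ 1.7 · 10⁷ vectors, which motivates the suboptimal detectors below.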
3.5.1 Linear Detectors
In order to reduce the complexity of the optimal ML detector, many suboptimal schemes have been devised. The two most basic ones are the Zero Forcing (ZF) and the Minimum Mean Squared Error (MMSE) detectors. The ZF approach solves the set of linear equations by forcing the noise component w to zero, or equivalently, solves the unconstrained ML estimate, which gives
\hat{x} = H^{\dagger} y = x + H^{\dagger} w.   (3.29)
The solution in (3.29) needs to be quantized to the closest lattice point in the constellation A:
\hat{x}_{ZF} = \mathcal{Q}_A[\hat{x}].   (3.30)
Since the signal components are fully decoupled, this quantization can be performed separately for each symbol. The main disadvantage of the ZF detector is that if some column vectors of H are close to parallel, the corresponding components of w will be significantly amplified by the multiplication with H^{\dagger}. This noise enhancement can become infinite for singular matrices. To alleviate this problem, the MMSE detector has been suggested. This approach balances residual interference against noise enhancement by finding the matrix G_{MMSE} such that
G_{MMSE} = \arg\min_{G} E\left\{ \|Gy - x\|^2 \right\}.   (3.31)

Defining the estimation error e = Gy - x, we obtain from the principle of orthogonality
E\{e y^H\} = 0,   (3.32)

and finally
G_{MMSE} = \left( H^H H + \frac{\sigma_w^2}{\sigma_x^2} I \right)^{-1} H^H.   (3.33)

In the limit of high Signal to Noise Ratio (SNR) the MMSE based detector approaches the ZF one.
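The two linear detectors above can be sketched in a few lines. The following NumPy fragment (the variable names, the 4x4 dimensions, and the unit-energy QPSK constellation with \sigma_x^2 = 1 are illustrative assumptions, not prescribed by the text) applies (3.29)-(3.30) and (3.33) to a random Rayleigh channel:

```python
import numpy as np

rng = np.random.default_rng(0)
nT, nR, sigma_w = 4, 4, 0.1

# Random Rayleigh channel, unit-energy QPSK transmit vector, AWGN
H = (rng.standard_normal((nR, nT)) + 1j * rng.standard_normal((nR, nT))) / np.sqrt(2)
x = (rng.choice([-1.0, 1.0], nT) + 1j * rng.choice([-1.0, 1.0], nT)) / np.sqrt(2)
w = sigma_w * (rng.standard_normal(nR) + 1j * rng.standard_normal(nR)) / np.sqrt(2)
y = H @ x + w

def quantize_qpsk(v):
    """Map each entry to the closest QPSK constellation point, as in (3.30)."""
    return (np.sign(v.real) + 1j * np.sign(v.imag)) / np.sqrt(2)

# ZF: unconstrained estimate (3.29), then per-symbol quantization (3.30)
x_zf = quantize_qpsk(np.linalg.pinv(H) @ y)

# MMSE: G = (H^H H + (sigma_w^2 / sigma_x^2) I)^{-1} H^H, eq. (3.33), sigma_x^2 = 1
G = np.linalg.inv(H.conj().T @ H + sigma_w**2 * np.eye(nT)) @ H.conj().T
x_mmse = quantize_qpsk(G @ y)
print(x_zf, x_mmse)
```

The per-symbol quantization is legitimate precisely because H^{\dagger}y decouples the streams; the MMSE branch only changes the matrix applied before the slicer.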
3.5.2 VBLAST Detector
The Vertical Bell Labs Layered Space Time (VBLAST) detector is an improved variant of the decision-feedback equalization strategy and was introduced by Foschini in [91]. In the VBLAST algorithm, rather than jointly decoding all the transmit signals, we first decode the "strongest" signal, then subtract this strongest signal from the received signal, proceed to decode the strongest of the remaining transmit signals, and so on. The optimum detection order in such a nulling and cancelling strategy is from the strongest to the weakest signal. Assuming that the channel is known at the receiver, the main steps of the VBLAST algorithm can be summarized as follows:
1. Nulling: an estimate of the strongest transmit signal is obtained by nulling out all the weaker transmit signals (say using zero forcing criterion).
2. Slicing: the estimated signal is detected to obtain the data bit.
3. Cancellation: These data bits are remodulated and the channel is applied to estimate its vector signal contribution at the receiver. The resulting vector is then subtracted from the received signal vector and the algorithm returns to the nulling step until all transmit signals are decoded.
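The three steps above can be sketched as follows. This is a minimal ZF-based variant; the minimum-row-norm ordering of the nulling matrix and the unit-energy QPSK slicer are common choices assumed here for illustration:

```python
import numpy as np

def slice_qpsk(v):
    """Hard decision onto the unit-energy QPSK constellation."""
    return (np.sign(v.real) + 1j * np.sign(v.imag)) / np.sqrt(2)

def vblast_zf(H, y):
    """ZF-based VBLAST: nulling, slicing and cancellation in SNR order."""
    H, y = H.copy(), y.copy()
    nT = H.shape[1]
    x_hat = np.zeros(nT, dtype=complex)
    remaining = list(range(nT))
    while remaining:
        G = np.linalg.pinv(H)                          # nulling matrix
        k = np.argmin(np.sum(np.abs(G) ** 2, axis=1))  # strongest stream
        x_k = slice_qpsk(G[k] @ y)                     # nulling + slicing
        x_hat[remaining[k]] = x_k
        y = y - H[:, k] * x_k                          # cancellation
        H = np.delete(H, k, axis=1)                    # remove decoded stream
        remaining.pop(k)
    return x_hat

rng = np.random.default_rng(1)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
x = (rng.choice([-1.0, 1.0], 4) + 1j * rng.choice([-1.0, 1.0], 4)) / np.sqrt(2)
print(np.allclose(vblast_zf(H, H @ x), x))  # noiseless case: exact recovery
```

The row with the smallest norm in the pseudo-inverse corresponds to the highest post-detection SNR, which realizes the strongest-first ordering described above.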
3.5.3 Sphere Decoder
As mentioned before, ML detection involves an exhaustive search, so the computational complexity is exponential in the length of the codeword. The sphere decoding algorithm [92] was proposed to lower this computational complexity. Considerable research has gone into sphere decoding in the last decade [93], [94], [95]. This has resulted in the emergence of quite a few sphere decoders with various variants to facilitate the decoding process. The conventional sphere decoders have been complemented by sphere decoders in which the search proceeds independently of the initial radius [96]. There are also list sphere decoders, where more than one solution can be found [97]. The size of the list can be as large as the constellation size, in which case the complexity is the same as that of ML decoding, or it can be much smaller, in which case the number of points scanned is reduced. The principle of the sphere decoding algorithm is to search for the closest lattice point to the received signal within a sphere of a given radius, where each codeword is represented by a lattice point [98]. In the two-dimensional problem illustrated in Figure 3.3, one can easily restrict the search by drawing a circle around the received signal just large enough to enclose one lattice point, and eliminate the search over all points outside the circle.
In order to derive the sphere decoder, we rewrite the ML detection rule in (3.28) as
\hat{x} = \arg\min_{x \in A^{n_T}} \|y - Hx\|^2
        = \arg\min_{x \in A^{n_T}} (x - \hat{x})^H H^H H (x - \hat{x}),   (3.34)

where \hat{x} is the unconstrained ML estimate of x, defined in (3.29). Based on the Fincke-Pohst method [99], a lattice point which lies inside the sphere with radius d has to fulfill the condition

d^2 \geq \|y - Hx\|^2 = (x - \hat{x})^H H^H H (x - \hat{x}) + \|y\|^2 - \|H\hat{x}\|^2.   (3.35)
Figure 3.3: Idea behind the sphere decoder
By defining d'^2 = d^2 - \|y\|^2 + \|H\hat{x}\|^2, (3.35) can be rewritten as
d'^2 \geq (x - \hat{x})^H H^H H (x - \hat{x}).   (3.36)
The matrix H^H H can be decomposed into triangular matrices with the Cholesky decomposition, so that H^H H = U^H U, where U is an upper triangular matrix.
Further simplification of (3.36) yields
d'^2 \geq (x - \hat{x})^H H^H H (x - \hat{x})
      = (x - \hat{x})^H U^H U (x - \hat{x})
      = \sum_{i=1}^{n_R} U_{i,i}^2 \left( (x_i - \hat{x}_i) + \sum_{j=i+1}^{n_R} \frac{U_{i,j}}{U_{i,i}} (x_j - \hat{x}_j) \right)^2   (3.37)
      = U_{n_R,n_R}^2 (x_{n_R} - \hat{x}_{n_R})^2 + U_{n_R-1,n_R-1}^2 \left( x_{n_R-1} - \hat{x}_{n_R-1} + \frac{U_{n_R-1,n_R}}{U_{n_R-1,n_R-1}} (x_{n_R} - \hat{x}_{n_R}) \right)^2 + \ldots

Because of the upper triangular structure of U, one can begin the evaluation with the last element of x:
U_{n_R,n_R}^2 (x_{n_R} - \hat{x}_{n_R})^2 \leq d'^2,   (3.38)

which leads to
\hat{x}_{n_R} - \frac{d'}{U_{n_R,n_R}} \leq x_{n_R} \leq \hat{x}_{n_R} + \frac{d'}{U_{n_R,n_R}}.   (3.39)
The method employs an iterative search: for every x_{n_R} satisfying (3.39), d'^2_{n_R-1} = d'^2 - U_{n_R,n_R}^2 (x_{n_R} - \hat{x}_{n_R})^2 can be defined, and a new condition can be written as
U_{n_R-1,n_R-1}^2 \left( x_{n_R-1} - \hat{x}_{n_R-1} + \frac{U_{n_R-1,n_R}}{U_{n_R-1,n_R-1}} (x_{n_R} - \hat{x}_{n_R}) \right)^2 \leq d'^2_{n_R-1},   (3.40)

where the term in brackets defines x_{n_R-1} - \hat{x}_{n_R-1|n_R}, which is equivalent to
\hat{x}_{n_R-1|n_R} - \frac{d'_{n_R-1}}{U_{n_R-1,n_R-1}} \leq x_{n_R-1} \leq \hat{x}_{n_R-1|n_R} + \frac{d'_{n_R-1}}{U_{n_R-1,n_R-1}}.   (3.41)

In a similar fashion, one proceeds for x_{n_R-2}, and so on, stating nested necessary conditions for all elements of x.
To ensure that the lattice point is inside the sphere, the initial radius must be big enough to enclose at least one lattice point.
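A compact depth-first implementation of the nested conditions (3.38)-(3.41) might look as follows. The real-valued model, the BPSK alphabet, the fixed initial radius, and all names are simplifying assumptions for illustration; the result is checked against brute-force ML:

```python
import itertools
import numpy as np

def sphere_decode(H, y, alphabet, radius):
    """Depth-first sphere decoder for the real model y = Hx + w."""
    n = H.shape[1]
    U = np.linalg.qr(H, mode="r")                # H^T H = U^T U, U upper triangular
    x_ls = np.linalg.lstsq(H, y, rcond=None)[0]  # unconstrained estimate, cf. (3.29)
    best = {"x": None, "d2": radius ** 2}

    def search(level, partial, dist2):
        # partial holds the symbols already fixed for layers level+1 .. n-1
        for s in alphabet:
            x_trial = np.array([s] + partial)
            inc = (U[level, level:] @ (x_trial - x_ls[level:])) ** 2
            if dist2 + inc > best["d2"]:
                continue                          # outside the sphere: prune
            if level == 0:
                best["x"], best["d2"] = x_trial, dist2 + inc
            else:
                search(level - 1, [s] + partial, dist2 + inc)

    search(n - 1, [], 0.0)
    return best["x"]

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 4))
x = rng.choice([-1.0, 1.0], 4)
y = H @ x + 0.05 * rng.standard_normal(4)

x_sd = sphere_decode(H, y, [-1.0, 1.0], radius=10.0)
x_ml = min(itertools.product([-1.0, 1.0], repeat=4),
           key=lambda c: np.sum((y - H @ np.array(c)) ** 2))
print(np.allclose(x_sd, np.array(x_ml)))
```

Since \|y - Hx\|^2 and (x - \hat{x})^T H^T H (x - \hat{x}) differ only by a constant, minimizing the accumulated layer metric inside the sphere returns the ML solution whenever the initial radius encloses at least one lattice point.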
A comprehensive overview of different detection schemes can be found in [100], [101].
3.6 Overview of OFDM Systems
Orthogonal Frequency Division Multiplexing (OFDM) is nowadays ubiquitous and is used for achieving high data rates as well as combating multipath fading in wireless communications. In this multi-carrier modulation scheme, data is transmitted by dividing a single wideband stream into several narrowband parallel bit streams. Each narrowband stream is modulated onto an individual carrier. The narrowband channels are orthogonal to each other and transmitted simultaneously. In doing so, the symbol duration is increased proportionately, which reduces the effects of ISI induced by multipath Rayleigh-faded environments. The spectra of the subcarriers overlap each other, making OFDM more spectrally efficient than conventional multicarrier communication schemes.
3.6.1 OFDM Signals and Orthogonality
In OFDM systems, subchannels (subcarriers) are obtained via an orthogonal transformation on each block of data (OFDM symbol) comprising N subcarriers. Orthogonal transformations are used so that at the receiver, the inverse transformation can be used to demodulate the data without error in the absence of noise. Weinstein [6] proposed the Discrete Fourier Transform (DFT) for multicarrier modulation. The DFT exhibits the desired orthogonality and can be implemented efficiently through the Fast Fourier Transform (FFT) algorithm.

Figure 3.4: Block diagram of an OFDM transceiver (transmitter: modulation, serial-to-parallel conversion, IFFT, cyclic extension, D/A lowpass filtering and RF up-conversion; receiver: the reverse chain)

OFDM schemes use rectangular pulses for data modulation, hence a given subchannel has significant spectral overlap with a large number of adjacent subchannels (see Figure 3.5). When the channel distortion is mild relative to the channel bandwidth, data can be demodulated with a very small amount of interference from the other subchannels, due to the orthogonality of the transformation. However, in order to completely remove the ISI a Cyclic Prefix (CP) is inserted in front of every OFDM symbol. The CP is a copy of the OFDM symbol tail [102]. For complete ISI removal the length G of the CP must be no shorter than the essential support L of the CIR. The length of the OFDM symbol after insertion of the CP is denoted by P = N + G. In the following section we provide a detailed overview of data transmission in OFDM systems.
3.6.2 OFDM Symbols Transmission
OFDM maps a symbol vector containing N symbols (corresponding to N subcarriers) at time n, d[n] = [d_1[n], \ldots, d_N[n]]^T \in \mathbb{C}^N, where the subscript i \in \{1, \ldots, N\} is the carrier index, according to

s[n] = T_{CP} W_N^H d[n],   (3.42)
Figure 3.5: Frequency domain representation of three OFDM subcarriers

where the CP insertion is described via the matrix
T_{CP} \triangleq \begin{bmatrix} I_{CP} \\ I_N \end{bmatrix} \in \mathbb{R}^{P \times N},   (3.43)

where the matrix I_{CP} \in \mathbb{R}^{G \times N} denotes the last G rows of the identity matrix I_N \in \mathbb{R}^{N \times N}. The unitary DFT matrix W_N \in \mathbb{C}^{N \times N} has elements

[W_N]_{i,k} \triangleq \frac{1}{\sqrt{N}} \exp\left( \frac{-j 2\pi i k}{N} \right), \quad i, k \in \{0, \ldots, N-1\}.   (3.44)

After parallel-to-serial conversion, s[n] is transmitted over the multipath channel. We express the CIR in vector notation as
h = [h_0, h_1, \ldots, h_{L-1}]^T,   (3.45)

where we assume that L < G. Let
H_{ISI} \triangleq
\begin{bmatrix}
h_0     & 0       & \cdots  & \cdots & 0 \\
\vdots  & \ddots  & \ddots  &        & \vdots \\
h_{L-1} & \ddots  & \ddots  & \ddots & \vdots \\
0       & \ddots  & \ddots  & \ddots & 0 \\
0       & \cdots  & h_{L-1} & \cdots & h_0
\end{bmatrix} \in \mathbb{C}^{P \times P},   (3.46)

be the lower triangular Toeplitz channel matrix and let
H_{IBI} \triangleq
\begin{bmatrix}
0      & \cdots & 0      & h_{L-1} & \cdots & h_1 \\
\vdots &        &        & \ddots  & \ddots & \vdots \\
\vdots &        &        &         & \ddots & h_{L-1} \\
\vdots &        &        &         &        & 0 \\
\vdots &        &        &         &        & \vdots \\
0      & \cdots & \cdots & \cdots  & \cdots & 0
\end{bmatrix} \in \mathbb{C}^{P \times P},   (3.47)

be the upper triangular Toeplitz channel matrix. We can express the received signal as
r[m] = H_{ISI} s[m] + H_{IBI} s[m-1] + w[m],   (3.48)

where the first term represents the ISI within the current OFDM symbol, the second term corresponds to Inter Block Interference (IBI) between two consecutive OFDM block transmissions at times m and (m-1), and w[m] \in \mathbb{C}^P is the AWGN, assumed to be i.i.d. complex Gaussian with zero mean and variance \sigma_w^2.
At the receiver the CP of length G is removed, and a DFT is performed on the remaining N samples. The CP removal can be represented by the matrix
R_{CP} \triangleq [0_{N \times G} \; I_N] \in \mathbb{R}^{N \times P},   (3.49)

which removes the first G entries from the vector r[m] \in \mathbb{C}^P when the product R_{CP} r[m] is formed. As long as G \geq L,
R_{CP} H_{IBI} = 0_{N \times P},   (3.50)

which indicates that the IBI between two consecutive OFDM symbols is completely eliminated. Finally, the received signal can be written as:
y[m] = W_N R_{CP} r[m]
     = W_N R_{CP} H_{ISI} s[m] + W_N R_{CP} w[m]
     = W_N R_{CP} H_{ISI} T_{CP} W_N^H d[m] + W_N R_{CP} w[m]   (3.51)
     = W_N H W_N^H d[m] + W_N R_{CP} w[m],

where
H \triangleq R_{CP} H_{ISI} T_{CP} = W_N^H \mathrm{diag}(g) W_N,   (3.52)

where the CFR g \in \mathbb{C}^N is defined as the DFT of the CIR
g \triangleq W_{N \times L} h,   (3.53)

where W_{N \times L} is the partial DFT matrix containing the first L columns of W_N. Using (3.52) we rewrite (3.51) as
y[m] = \mathrm{diag}(g) d[m] + z[m],   (3.54)
where the elements of z[m] \triangleq W_N R_{CP} w[m] are white with variance \sigma_w^2. Hence, the covariance matrix of z[m] has a diagonal structure with identical elements:
R_{z[m]} = E\left\{ W_N R_{CP} w[m] (W_N R_{CP} w[m])^H \right\}
         = \sigma_w^2 W_N R_{CP} I_P R_{CP}^H W_N^H   (3.55)
         = \sigma_w^2 I_N.
In an OFDM system, according to (3.54), every element of the symbol vector d[m] is transmitted over an individual frequency-flat subcarrier. Note that (3.54) can also be expressed as
y[m] = \mathrm{diag}(d[m]) g + z[m]
     = \mathrm{diag}(d[m]) W_{N \times L} h + z[m].   (3.56)
For the detection of symbols, (3.54) is the more convenient form, while for the purpose of channel estimation, (3.56) is more convenient. These two equivalent system models are summarized in Table 3.1.
OFDM system model for detection of symbols:
    y[m] = \mathrm{diag}(g) d[m] + z[m].

OFDM system model for channel estimation:
    y[m] = \mathrm{diag}(d[m]) W_{N \times L} h + z[m].

Table 3.1: OFDM system models summary
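The chain (3.42)-(3.54) can be verified numerically. The sketch below (parameter values are illustrative; NumPy's non-unitary FFT convention is used for g, which absorbs the scaling of the unitary-DFT notation above) transmits one noiseless OFDM symbol over a multipath channel and confirms the per-subcarrier model of Table 3.1:

```python
import numpy as np

rng = np.random.default_rng(3)
N, G, L = 64, 16, 6                      # subcarriers, CP length, CIR taps

d = rng.choice([-1.0, 1.0], N) + 1j * rng.choice([-1.0, 1.0], N)  # QPSK data
h = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)

# Transmitter: unitary IDFT (W_N^H d), then CP insertion (T_CP), eq. (3.42)
s = np.fft.ifft(d) * np.sqrt(N)
s_cp = np.concatenate([s[-G:], s])       # CP is a copy of the symbol tail

# Channel: linear convolution with the CIR (single symbol, so no IBI term)
r = np.convolve(s_cp, h)[: N + G]

# Receiver: CP removal (R_CP) and unitary DFT (W_N), eq. (3.51)
y = np.fft.fft(r[G:]) / np.sqrt(N)

# Frequency-domain model (3.54): per-subcarrier gains g (NumPy FFT convention)
g = np.fft.fft(h, N)
print(np.allclose(y, g * d))             # the channel is flat per subcarrier
```

Because G >= L, removing the CP turns the linear convolution into a circular one, which the DFT diagonalizes; that is exactly the statement H = W_N^H diag(g) W_N in (3.52).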
3.6.3 Multi Carrier versus Single Carrier Modulation Schemes
We now list a few pros and cons of using Multi Carrier (MC) techniques. The main advantages of MC over Single Carrier (SC) modulation schemes are:
1. Narrowband interference - MC systems are robust against narrowband interference, because such interference affects only a small number of sub-carriers.
2. Equalization - in MC systems, equalization is very simple. This is because the time dispersive channel is transformed into parallel channel gains on each subcarrier. Therefore, equalization can be implemented using a one-tap equalizer. In SC systems, in contrast, a careful design of the equalizer needs to be implemented in order to mitigate the ISI effect.
3. Adaptive schemes - in MC systems, different modulation, power allocation and coding schemes can be assigned to each subcarrier in a natural fashion [103], [104]. For example, a subcarrier in a deep fade can be assigned Binary Phase Shift Keying (BPSK) modulation and low power, while a subcarrier on a "good" channel can be assigned a high-order modulation scheme (e.g. 64 Quadrature Amplitude Modulation (QAM)). Thus, the channel can be used more efficiently.
The main disadvantages of MC are:
1. Frequency offsets and phase noise - the sensitivity of MC systems to frequency offsets and phase noise is well known. When the receiver's Voltage Controlled Oscillator (VCO) is not oscillating with exactly the same carrier frequency as the transmitter's VCO, both Carrier Frequency Offset (CFO) and Phase Noise (PN) may occur. Both CFO and PN result in ISI, as the subcarriers are no longer orthogonal and interfere with each other. Because OFDM divides the spectral allotment into many narrow subcarriers, each with small carrier spacing, it may be very sensitive to CFO and PN errors [105], [106]. The characteristics of this interference are similar to additive white Gaussian noise, and it leads to a degradation of the overall SNR.
2. High Peak to Average Power Ratio (PAPR) - a large PAPR occurs when the symbol phases on the subcarriers line up so as to constructively form a peak in the time-domain signal. The signal transmitted by the OFDM system is the superposition of all signals transmitted in the narrowband subchannels. According to the Central Limit Theorem, the transmitted signal approximately follows a Gaussian distribution, leading to high peak values compared to the average power. A system design not taking this into account will have a high clip rate: each signal sample that is beyond the saturation limit of the power amplifier suffers either clipping to this limit value or other non-linear distortion, both creating additional bit errors in the receiver [107], [108]. The PAPR is defined as
\mathrm{PAPR}\{s[n]\} \triangleq \frac{\max |s[n]|^2}{E\{|s[n]|^2\}}.   (3.57)

Throughout this thesis we shall not deal with the ICI, CFO, PN and PAPR phenomena, but instead assume that the aforementioned effects are compensated for.
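Definition (3.57) is easy to evaluate numerically. The following sketch (the QPSK constellation and N = 256 are illustrative assumptions) contrasts the PAPR of an OFDM time-domain symbol with that of a constant-envelope single-carrier sequence:

```python
import numpy as np

def papr_db(s):
    """PAPR of a discrete-time signal s, in dB, per (3.57)."""
    p = np.abs(s) ** 2
    return 10 * np.log10(p.max() / p.mean())

rng = np.random.default_rng(4)
N = 256
d = rng.choice([-1.0, 1.0], N) + 1j * rng.choice([-1.0, 1.0], N)  # QPSK

s_ofdm = np.fft.ifft(d)                  # superposition of N subcarriers
print(f"OFDM PAPR: {papr_db(s_ofdm):.1f} dB")
print(f"Single-carrier QPSK PAPR: {papr_db(d):.1f} dB")  # 0 dB, constant envelope
```

The single-carrier QPSK sequence has constant envelope (0 dB PAPR), while the OFDM superposition typically exhibits a PAPR of several dB, reflecting the near-Gaussian amplitude distribution discussed above.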
3.7 Channel Estimation in OFDM Systems
In mobile wireless communications systems, the channel is time-varying because of the relative motion between the transmitter and the receiver, which results in variation of the propagation path. Most modern digital receivers rely on coherent detection, which requires knowledge of the fading amplitude and phase. Channel estimation is therefore a vital task for the receiver in order to obtain satisfactory performance. It can be carried out in both the time and frequency domains. In the time domain, the CIR h in (3.56) is estimated, while in the frequency domain, the CFR g in (3.53) is estimated.
Channel estimation methods can be classified as blind, semi-blind and pilot-aided. Blind algorithms do not require any training data and exploit statistical or structural properties of communication signals. Semi-blind methods combine blind criteria with a limited amount of pilot data. Pilot-aided methods, on the other hand, rely on a set of known symbols interleaved with data in order to acquire the channel estimate.
The advantage of estimating h instead of g lies in the fact that the number of elements in h (that is, the number of channel taps) is usually much smaller than the number of elements in g (that is, the number of subcarriers). In a typical OFDM system the number of subcarriers can reach hundreds or even thousands, while the number of channel taps is usually less than ten. This means that for the same number of pilot symbols, a smaller MSE can be achieved using time domain channel estimation techniques. We shall therefore concentrate in the sequel on the estimation of the CIR h.
We now provide a short review of these methods.
3.7.1 Pilot Aided Channel Estimation
Pilot Symbol Aided Modulation (PSAM) based schemes obtain the estimate on the basis of known pilot symbols that are interleaved among the transmitted data symbols; see [109], [110], [111], [112], [113].
• Least squares channel estimation
The simplest approach to PSAM channel estimation in OFDM systems is the Least Squares (LS) approach. In that case, no a priori information is assumed to be known about the statistics of the channel taps. Based on (3.56), the LS estimate of h (which is also the ML solution in the case of additive Gaussian noise) can be expressed as
\hat{h} = (A^H A)^{-1} A^H y,   (3.58)

where A \triangleq \mathrm{diag}(d[m]) W_{N \times L}. The MSE of the LS estimator, \epsilon, can be written as
\epsilon = \sigma_z^2 \mathrm{Tr}\left\{ (A^H A)^{-1} \right\}.   (3.59)

It is important to note that (3.58) minimises the quantity \|y - A\hat{h}\|^2 and not \|h - \hat{h}\|^2.

• MMSE channel estimation
If the channel is known to be Rayleigh fading, then we can use the Bayesian framework to find quantities such as the MAP or MMSE estimates. The MMSE estimate can be written as
H 2 1 H h = A A + σwRh − A y, (3.60) b where Rh is the covariance matrix of h. The MSE of the MMSE estimator can be written as
\epsilon = \mathrm{Tr}\left\{ \mathrm{cov}(h) - \left( A^H A + \sigma_w^2 R_h^{-1} \right)^{-1} A^H \mathrm{cov}(y, h) \right\}.   (3.61)

Unlike (3.58), the MMSE estimator does minimise the cost function \|h - \hat{h}\|^2.
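Estimators (3.58) and (3.60) can be compared directly. In the sketch below, the exponential tap-power profile for R_h and all parameter values are assumptions made for illustration, not a model from the text:

```python
import numpy as np

rng = np.random.default_rng(5)
N, L, sigma_w = 64, 6, 0.3

# Channel prior (diagonal R_h with exponential power profile) and a realization
p = np.exp(-np.arange(L) / 2.0)
Rh = np.diag(p)
h = np.sqrt(p / 2) * (rng.standard_normal(L) + 1j * rng.standard_normal(L))

# Known pilot symbols on all subcarriers: A = diag(d) W_{NxL}, cf. (3.56)
d = rng.choice([-1.0, 1.0], N) + 1j * rng.choice([-1.0, 1.0], N)
W = np.fft.fft(np.eye(N), axis=0)[:, :L] / np.sqrt(N)   # first L DFT columns
A = np.diag(d) @ W
z = sigma_w * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
y = A @ h + z

AH = A.conj().T
h_ls = np.linalg.solve(AH @ A, AH @ y)                                     # (3.58)
h_mmse = np.linalg.solve(AH @ A + sigma_w**2 * np.linalg.inv(Rh), AH @ y)  # (3.60)
print(np.linalg.norm(h_ls - h), np.linalg.norm(h_mmse - h))
```

The LS solution ignores the prior entirely, while the MMSE solution shrinks the weak late taps toward zero, which is where its MSE advantage over LS comes from.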
3.7.2 Blind Channel Estimation
The term blindness means that the receiver has no knowledge of the transmitted sequence or the CIR. Channel identification, equalization or demodulation is then performed using only some statistical or structural properties of communication signals, e.g. cyclostationarity, higher-order statistics, or the finite-alphabet property. The need for higher data rates motivates the search for blind channel estimation methods. In OFDM systems, the CP typically occupies up to 25% of the transmitted data. Furthermore, if pilot symbols are used for channel estimation and synchronization purposes, those may require another 15%-20% of the remaining data. Therefore, blind estimators are of interest, especially in the case of slowly time-varying channels. Blind methods may be classified as follows: correlation-based methods, subspace methods, methods exploiting the finite alphabet property, and maximum likelihood estimation.
• Decision Directed (DD) methods: in this approach, the detection of the symbols at time n, d[n], is carried out conditional on the channel estimate at time n-1, \hat{h}[n-1]; see for example [114] and [115]:

\hat{d}[n] = \arg\max_{d} p\left( y[n] \mid d, \hat{h}[n-1] \right).   (3.62)

Next, the channel estimate at time n is updated conditional on the currently detected symbols, \hat{d}[n], used as virtual pilots:

\hat{h}[n] = \arg\max_{h[n]} p\left( h[n] \mid \hat{d}[n], y[n] \right).   (3.63)

DD methods are effective at moderate to high SNRs and for slowly varying channels, when reliable symbol decisions are available to the receiver. However, this method is prone to error propagation. We will discuss the DD approach in detail in Chapter 6.
• Precoding based approach: these methods employ non-redundant precoding at the transmitter side in SISO [116], [117] or MIMO systems [118]. With non-redundant precoding, the block length remains unchanged, but a specific correlation structure is induced at the transmitter, e.g. by correlating each carrier with a reference carrier. A proper balance must be found between the level of transmitter-induced correlation, which leads to ICI, and a small channel estimation variance, which may in turn improve the system performance.
• Correlation based approach: this approach takes advantage of the specific structure of the CP. Since the CP is periodic, it introduces redundancy as well as cyclostationarity. Cyclostationary signals have the property that their statistics, such as the mean or autocorrelation function, are periodic [119]. Linear time-invariant filtering does not affect cyclostationarity. Consequently, periodicity is expected in the time-varying correlation at the output of the channel. Cyclostationary statistics carry information on the channel amplitude and phase and allow blind channel estimation [120], [121], [122].
• Joint channel estimation and detection methods: it is possible to perform joint estimation of the channel and the symbols [123], [124] or, alternatively, perform symbol detection by marginalisation over the channel parameters.
If we are interested only in the detection of the transmitted symbols, then the channel h can be viewed as a nuisance parameter and can be integrated out. Thus, the MAP detector can be expressed as
\hat{d} = \arg\max_{d} \int \Pr(d \mid y, h) \, p(h) \, dh.   (3.64)
p(h, d \mid y) \propto p(y \mid h, d) \Pr(d) \, p(h).   (3.65)
Based on (3.65), one can obtain various quantities, such as the ML, MMSE or MAP estimates. For example, the joint MAP estimate can be obtained by solving
(\hat{h}, \hat{d}) = \arg\max_{h, d} p(y \mid h, d) \Pr(d) \, p(h).   (3.66)

The complexity of such approaches can be prohibitive in practice, and other approaches have been suggested, such as Turbo receivers, which will be discussed in Section 3.9.
Other blind approaches include the finite-alphabet property based methods [125], [126] and subspace based methods [127], [128], [129], which are not discussed in this thesis.
3.7.3 Semi Blind Channel Estimation
Semi-blind methods are based on limited training data used in conjunction with blind algorithms [125], [130]. Semi-blind methods possess three benefits over blind methods: the ambiguities inherent to blind methods may be resolved, convergence speeds are improved, and more effective and robust tracking of time-varying channels is achieved.
3.7.4 MIMO-OFDM System Model
On the one hand, OFDM is an effective technique to combat multipath fading in wireless communications systems. On the other hand, the capacity of wireless communications systems can be improved by using MIMO techniques. By combining these two techniques, OFDM can transform a frequency-selective MIMO channel into a set of parallel frequency-flat MIMO channels, as long as the channel length is smaller than the CP length. As a result, the receiver complexity decreases and the advantages of each technique are retained.
3.8 Channel Coding
During data transmission, the original signal is likely to be corrupted by the channel and the noise at the receiver. This causes the signal to be received with errors, which makes the reconstruction of the original information from the received data less reliable. In order to alleviate this problem, Error Control Coding (ECC) can be used. By adding redundant information to the transmitted data, it helps to correct the received errors and reconstruct the original data. Using ECC, the same BER can be achieved at a lower SNR in a coded system than in a comparable uncoded system [131]; this SNR difference is known as the coding gain.
3.8.1 Linear Codes
A code is linear if the sum c + c' of any two length-N code words c, c' \in C is again a code word in C. It follows that the code C is a K-dimensional subspace of the vector space of all 2^N binary length-N vectors. K linearly independent code words in C form a basis of the subspace C, i.e. any code word c \in C can be uniquely expressed as a linear combination of these K linearly independent vectors. These K basis vectors entirely define the code and are commonly arranged as the rows of a K \times N generator matrix G. This offers a convenient linear encoding rule from the set of information words to the set of code words:
c = uG.   (3.67)
The columns of G correspond to the code word positions, the rows to the information word positions. The encoding mapping is systematic if the K information bits, u, are contained in the code word c.
Alternatively, the code C may be defined as the null space of an (N-K) \times N parity check matrix H:

c H^T = 0_{1 \times (N-K)},   (3.68)
where 0_{1 \times (N-K)} is the all-zero vector of length N-K. The columns in H correspond to the code word positions, the rows to the parity check equations fulfilled by a valid code word.

Figure 3.6: Rate 1/2 convolutional encoder
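Equations (3.67) and (3.68) can be illustrated with the (7,4) Hamming code; the systematic G and H below are a standard textbook choice, not matrices taken from this thesis:

```python
import numpy as np

G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])   # K x N systematic generator, G = [I | P]
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])   # (N-K) x N parity check, H = [P^T | I]

u = np.array([1, 0, 1, 1])              # information word
c = u @ G % 2                           # encoding, eq. (3.67)
print("code word:", c)
print("syndrome:", c @ H.T % 2)         # all-zero for a valid code word, eq. (3.68)
```

Since the encoding is systematic, the first K bits of c are the information bits themselves, and a non-zero syndrome flags a transmission error.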
3.8.2 Convolutional Coding
Convolutional codes were introduced by Elias [132] and are now broadly utilized in different communication fields. These codes are highly structured, allowing a simple implementation and good performance with short block lengths. The Viterbi algorithm [133] is an efficient implementation of the optimum ML word-based decoding for convolutional codes. The basic concept is the sequential computation of the path metric and the tracking of survivor paths in the trellis. The algorithm was extended in [134] to produce soft outputs (the SOVA algorithm). Convolutional codes are usually linear codes. An example of a rate 1/2 convolutional code is shown in Figure 3.6. The code words of a convolutional code are the output sequences of a linear encoder circuit fed by the information bits. This code construction sets additional constraints on the characteristics of the corresponding G and H matrices.
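A rate-1/2 encoder in the spirit of Figure 3.6 can be sketched as follows. Since the figure's exact tap connections are not specified in the text, the generator polynomials (7, 5) in octal are an assumption:

```python
def conv_encode(bits, g1=(1, 1, 1), g2=(1, 0, 1)):
    """Rate-1/2 convolutional encoder with two delay elements: each input
    bit produces two coded bits (c_{i,1}, c_{i,2})."""
    state = [0, 0]                       # contents of the two delay elements D, D
    out = []
    for u in bits:
        reg = [u] + state                # current input followed by the register
        c1 = sum(a * b for a, b in zip(g1, reg)) % 2   # upper XOR chain
        c2 = sum(a * b for a, b in zip(g2, reg)) % 2   # lower XOR chain
        out += [c1, c2]
        state = reg[:2]                  # shift the register by one position
    return out

print(conv_encode([1, 0, 1, 1]))  # → [1, 1, 1, 0, 0, 0, 0, 1]
```

The encoder is a linear time-invariant circuit over GF(2), so the XOR of two input sequences encodes to the XOR of their code words, consistent with the linearity discussion of Section 3.8.1.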
3.8.3 BICM Technique
Bit Interleaved Coded Modulation (BICM), first suggested by Zehavi [135], is the serial concatenation of a code, an interleaver and a mapper, as depicted in Figure 3.7. The information bits are processed by a single encoder and a random interleaver Π. The coded and interleaved bit sequence c is partitioned into N_s subsequences c_n of length M:

c = (c_1, \ldots, c_n, \ldots, c_{N_s}), \quad \text{with} \quad c_n = (c_{n,1}, \ldots, c_{n,m}, \ldots, c_{n,M}).   (3.69)
The bits (c_{n,1}, \ldots, c_{n,M}) are mapped at time index n to a symbol x_n chosen from the 2^M-ary signal constellation X according to the binary labeling map \mu : \{0, 1\}^M \to X. The optimum
Figure 3.7: Block diagram of a BICM encoder
BICM receiver, depicted in Figure 3.8, would perform overall ML decoding. However, the complexity of a joint ML demapper and decoder is not manageable. Therefore, we separate the demapping and decoding tasks and consider a BICM receiver without and with iterative demapping and decoding.
Figure 3.8: Block diagram of a BICM decoder
3.9 Iterative Processing Techniques
3.9.1 The Turbo Principle
Practical coding structures that perform close to the capacity limit are of great interest. Indeed, although Shannon's theory proved the existence of such codes, it did not provide a mechanism to design them. In fact, Shannon's theory did not prove the existence of capacity-approaching codes with tractable decoding complexity. In 1993, Claude Berrou's research group proposed a coding structure, referred to as the turbo code [136], operating close to Shannon's bound while exhibiting reasonable decoding complexity. Perhaps more than the proposed coding structure, the most groundbreaking contribution of Berrou's team lies in the iterative processing used for decoding the received observations. The so-called turbo decoder consists of two low-complexity decoders iteratively exchanging soft information about the transmitted bits. Due to its outstanding performance when applied to the decoding of turbo codes, the so-named turbo principle has since been applied to a variety of other receiver tasks: demodulation [137], equalization [138], and multi-user joint reception [139]. Note that the turbo principle was originally developed in a rather ad-hoc way for turbo codes. More recently, a mathematical framework under the name of factor graphs [140] has provided insightful ways to develop iterative algorithms, based on a graphical representation of a problem or system.
3.9.2 Iterative Detection, Decoding and Estimation
Iterative ("turbo") processing techniques have received increasing attention following the discovery of the powerful turbo codes. The turbo principle can be applied not only to channel decoders, but also to a wide variety of combinations of detectors, decoders, equalizers, multiuser detectors, coded modulators, joint source/channel coders, etc.
Communications systems usually consist of a collection of cascaded system blocks. For example, consider a receiver consisting of a symbol detector and a decoder. In a conventional system, the detector makes a hard decision about the symbol based on the received signal. Its decision is then passed to the decoder, which decides what the transmitted data bits were. This solution, though simple, comes at a price, since significant information is lost when the information about a symbol is truncated to a hard decision. If the confidence level of the detector is passed along with its symbol decision, approximately 2-3 dB of performance can be gained at high SNRs [77]. However, this performance is still far from optimal, because earlier stages do not receive any of the information collected by the later stages in the chain. The optimal solution is a Maximum Likelihood Sequence Estimator (MLSE) technique, requiring the construction and evaluation of a super-trellis that includes the channel and code effects. This way, the estimation process considers the joint effects of both the channel and the coder. Although this approach is optimal, it is computationally prohibitive. Iterative processing methods provide an alternative that passes information from later stages back to earlier stages. For iterative processing to work, the individual sub-blocks must produce MAP, or soft output, estimates of the quantities that they estimate. Namely, both the detector and the decoder must produce MAP or soft output estimates of the transmitted bits.
3.9.3 Iterative Detector and Decoder Components
An example of an iterative detector and decoder is shown in Figure 3.9. At each iteration of the loop, the detector makes a decision about the coded bits c[n] by considering the received signal y[n], a priori information about the coded bits from the previous iteration, and knowledge of the system structure, which includes the channel structure, modulation type, noise statistics, etc. The superscript j indicates the j-th iteration of the turbo processing algorithm. The system blocks indicated by Π and Π^{-1} are called the interleaver and deinterleaver, respectively. Their purpose and function will be described later. By applying Bayes' rule we get the following factorization:
\Pr^j(c[n] = 1 \mid y[n]) = \frac{p^j(y[n] \mid c[n] = 1) \Pr^{j-1}(c[n] = 1)}{p(y[n])}   (3.70a)

\Pr^j(c[n] = 0 \mid y[n]) = \frac{p^j(y[n] \mid c[n] = 0) \Pr^{j-1}(c[n] = 0)}{p(y[n])}   (3.70b)
Figure 3.9: Iterative receiver for coded systems
Instead of posterior distributions, usually Log Likelihood Ratio (LLR) values are used.
\Lambda_1^j(c[n]) = \log \frac{\Pr^j(c[n] = 1 \mid y[n])}{\Pr^j(c[n] = 0 \mid y[n])}
                 = \underbrace{\log \frac{p(y[n] \mid c[n] = 1)}{p(y[n] \mid c[n] = 0)}}_{\lambda_1^j(c[n])} + \underbrace{\log \frac{\Pr^{j-1}(c[n] = 1)}{\Pr^{j-1}(c[n] = 0)}}_{\lambda_2^{j-1}(c[n])}.   (3.71)

The quantity \lambda_1^j(c[n]) is called the extrinsic information produced by the detector: it is the information about the coded bit extracted from the received signal and from the a priori information about the other coded bits, but not from the a priori probability of c[n] itself. The quantity \lambda_2^{j-1}(c[n]) is the a priori LLR of c[n], which can be generated from the decoder's output of the previous iteration. The extrinsic information \lambda_1^j(c[n]) is sent to the channel decoder, which uses it as a priori information. The channel decoder uses the information passed by the detector and information about the code structure to calculate an a posteriori LLR.
The decoder's a posteriori LLR is again expressed as a sum: the extrinsic likelihood λ_2^j(c[n]), gleaned from the decoder input, the code structure and all coded bits except c[n], plus the a priori likelihood λ_1^j(c[n]). The extrinsic information λ_2^j(c[n]) is fed back to the first block as a priori information about the coded bits.
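The additive split in (3.71) can be checked numerically with a toy single-observation BPSK detector; the symbol mapping, noise variance and prior below are illustrative assumptions, not taken from the text:

```python
import math

# Toy setup (assumed for illustration): c[n]=1 -> +1, c[n]=0 -> -1,
# AWGN with variance sigma2, and a non-uniform prior from a previous iteration.
y, sigma2, prior1 = 0.4, 1.0, 0.7

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Posterior LLR computed directly from Bayes' rule, (3.70a)/(3.70b)
p1 = gauss(y, +1.0, sigma2) * prior1
p0 = gauss(y, -1.0, sigma2) * (1.0 - prior1)
Lambda = math.log(p1 / p0)

# The two terms of the decomposition (3.71)
lambda1 = math.log(gauss(y, +1.0, sigma2) / gauss(y, -1.0, sigma2))  # extrinsic
lambda2 = math.log(prior1 / (1.0 - prior1))                          # a priori

print(Lambda, lambda1 + lambda2)  # the two values agree
```

The posterior LLR equals the extrinsic term plus the a priori term exactly, which is what allows the two blocks of Figure 3.9 to exchange only the extrinsic parts.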
It is important to note that the above equations hold only if the inputs to the individual sub-blocks are independent. Obviously, the sequence of coded bits is not independent, because the parity bits are generated from the data bits and, hence, there is some correlation among them. To remedy this, an interleaver (with a corresponding deinterleaver) that shuffles the bits is inserted to make the bit sequence appear random. The block on which it operates must be large (on the order of 1,000 bits or more). For simplicity, here we employ a random interleaver. Another advantage of the interleaver is that it disperses burst errors evenly throughout the frame. Bits that happen to be transmitted during a long-lasting fade of the channel are shuffled, so that more confident decisions about their new neighbors help reconstruct them. Remarkably, after a few iterations the decisions become refined and the estimates become significantly more confident. The overall performance of the receiver approaches the performance of the MLSE receiver.
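A random interleaver of the kind mentioned above amounts to a fixed pseudo-random permutation Π together with its inverse Π^{-1}; the block length and seed below are arbitrary illustrative choices:

```python
import random

class RandomInterleaver:
    """Fixed pseudo-random permutation Pi and its inverse Pi^{-1}."""
    def __init__(self, block_len, seed=0):
        rng = random.Random(seed)
        self.perm = list(range(block_len))
        rng.shuffle(self.perm)
        # inverse permutation: inv[perm[k]] = k
        self.inv = [0] * block_len
        for k, p in enumerate(self.perm):
            self.inv[p] = k

    def interleave(self, bits):
        return [bits[p] for p in self.perm]

    def deinterleave(self, bits):
        return [bits[p] for p in self.inv]

bits = [1, 0, 0, 1, 1, 0, 1, 0]
pi = RandomInterleaver(len(bits), seed=42)
shuffled = pi.interleave(bits)
assert pi.deinterleave(shuffled) == bits  # Pi^{-1}(Pi(x)) = x
```

In a practical receiver the same fixed permutation is used for every frame, so burst errors and faded bits are spread out deterministically at both ends of the link.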
3.10 Relay Based Communication Systems
3.10.1 Introduction
In a relay-based communication system, transmission between the source and the destination is achieved through an intermediate transceiver unit called a relay. The main property of relay channels is that certain terminals, called relays, receive, process, and re-transmit some information bearing signal(s) of interest in order to improve the performance of the system. The relay channel, first introduced by van der Meulen in 1971 [141], has recently received considerable attention due to its potential in wireless applications. In [9], Cover and El-Gamal introduced two relaying strategies commonly referred to as Decode and Forward (DF) and Estimate and Forward (EF). Relaying techniques have the potential to provide spatial diversity, improve energy efficiency, and reduce the interference level of wireless channels. A number of relay strategies have been studied in the literature. These strategies include Amplify and Forward (AF) [142], where the relay sends a scaled version of its received signal to the destination. The AF scheme is attractive because of its simple operation at the relay nodes. Other strategies include demodulate-and-forward [142], in which the relay demodulates individual symbols and retransmits; DF [143], in which the relay decodes the entire message, re-encodes it and re-transmits it to the destination; and Compress and Forward (CF) [9], where the relay sends a quantized version of its received signal. With cooperative communications, the design of the encoder and decoder at the source and destination is accompanied by the design of the functionality of the relay nodes. The choice of the relay function affects different aspects of the system, such as potential capacity [144], [145], or SNR optimality [146]. Clearly, the most desirable schemes are those that achieve the optimality criteria with minimal processing complexity at the relays. Memoryless relay functions are highly relevant for this objective, due to their simplicity.
3.10.2 Relay System Model
We consider a relay network model as illustrated in Figure 3.10, which consists of a single source node, a destination node, and L relay nodes {r^(i)}_{i=1}^{L}. In this model, the relays facilitate the ultimate transmission from the source to the destination by cooperating with the source. All the relays work in half-duplex mode, in which they cannot transmit and receive at the same time on the same frequency band. In the first time slot, the source broadcasts to all
Figure 3.10: Parallel Relay Channels with one source, L relay nodes and one destination

the relays. In the second time slot, the relays transmit to the destination in L orthogonal subchannels. The L orthogonal subchannels can be realized in time division, frequency division, or code division. The source broadcasts the signal (symbol or codeword) s to all the relays. The received signal at the i-th relay, r^(i), is
    r^(i) = h^(i) s + w^(i),    i ∈ {1, ..., L},        (3.72)

where h^(i) is the channel coefficient between the source and the i-th relay and w^(i) is the additive noise at the i-th relay.
The memoryless relay processing function of the i-th relay node (possibly different at each relay) is denoted by f^(i), i ∈ {1, ..., L}, and can be either linear or non-linear. Next, each relay transmits its signal on an orthogonal subchannel. The corresponding received signals at the destination, y^(i), can be expressed as
    y^(i) = f^(i)(r^(i)) g^(i) + v^(i)
          = f^(i)(h^(i) s + w^(i)) g^(i) + v^(i),    i ∈ {1, ..., L},        (3.73)

where g^(i) is the channel coefficient between the i-th relay and the destination and v^(i) is the additive noise at the destination during the i-th slot.
3.10.3 MAP Detection in Memoryless Relay Functions
Detection of the transmitted symbols is a fundamental task of the receiver node. Since the channel gains are mutually independent, the received signals {y^(i)}_{i=1}^{L} are conditionally independent given s. The MAP decision rule is given by
    ŝ = arg max_{s ∈ S} Pr(s | y^(1), ..., y^(L))
      = arg max_{s ∈ S} Pr(s) ∏_{i=1}^{L} p(y^(i) | s).        (3.74)

In the simple case of a linear relay function of the form f(β) = αβ, where α is a constant, the likelihood can be obtained analytically as
    p(y^(i) | s) = CN( α s h^(i) g^(i), (α² |g^(i)|² σ_w² + σ_v²) I ).        (3.75)
In general, the relay function may not be linear. Conditional on s, we know the distribution at the relay, r^(i) | s,

    p^(i)(r^(i) | s) = p(s h^(i) + w^(i) | s) = CN(s h^(i), σ_w²).        (3.76)

However, finding the distribution of the random variable after the non-linear function is applied, i.e. the distribution of f̃(r^(i)) ≜ f(r^(i)) g^(i) given s, involves the following change of variables formula

    p(f̃(r^(i)) | s) = p_{r^(i)}( f̃^{-1}(r^(i)) | s ) |∂f̃^{-1}(r^(i)) / ∂r^(i)|,        (3.77)

which cannot always be written down analytically for arbitrary f. The second, more serious complication is that even in cases where the density of the transmitted signal is known, one must then solve a convolution with the destination noise density to obtain the likelihood:
    p(y^(i) | s) = p(f̃(r^(i)) | s) * p_{v^(i)}
                 = ∫_{-∞}^{∞} p(f̃(z) | s) p_{v^(i)}(y^(i) - z) dz.        (3.78)

Typically this will be intractable to evaluate pointwise. In Chapter 8 we shall develop algorithms to overcome these problems using ABC theory.
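When (3.78) cannot be evaluated in closed form, the likelihood can still be approximated by simulation, which is the spirit of the likelihood-free methods developed later; the sketch below uses plain Monte Carlo averaging under an assumed hard-limiter relay function and unit channel gains, and is not the ABC algorithm of Chapter 8:

```python
import math, random

random.seed(1)
sigma_w, sigma_v = 0.5, 0.5   # assumed relay / destination noise std
h, g = 1.0, 1.0               # assumed (real) channel gains

def relay(r):
    """Assumed non-linear memoryless relay: a hard limiter."""
    return max(-1.0, min(1.0, r))

def likelihood_mc(y, s, n=20000):
    """Monte Carlo estimate of p(y | s) for the relay channel (3.73):
    average the destination-noise density over draws of the relay noise,
    i.e. a sample approximation of the convolution (3.78)."""
    total = 0.0
    for _ in range(n):
        z = relay(h * s + random.gauss(0.0, sigma_w)) * g
        total += math.exp(-(y - z) ** 2 / (2 * sigma_v ** 2)) / (
            math.sqrt(2 * math.pi) * sigma_v)
    return total / n

# BPSK detection: pick the symbol with the larger estimated likelihood
y = 0.9
print(likelihood_mc(y, +1) > likelihood_mc(y, -1))
```

Each relay-noise draw pushed through the non-linearity samples the density of f̃(r^(i)), so no change-of-variables formula (3.77) is needed; the price is the Monte Carlo variance of the estimate.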
3.11 Concluding Remarks
In this chapter we have presented an overview of fundamental concepts of wireless communications. The main points presented in this chapter are:
• We gave an overview of different aspects and properties of wireless channels.
• We presented a few common statistical fading channel models.
• We provided an overview of multiple antenna systems and data detection techniques.
• We discussed the transmission and reception of OFDM systems.
• We presented different families of channel estimation techniques for OFDM systems.
• Error correction codes and iterative receiver principles were presented.
• We presented an overview of wireless relay networks and discussed the problem of data detection due to non-linear relay functions.
Chapter 4
Detection in MIMO Systems using Power Equality Constraints
“The Americans have need of the telephone, but we do not. We have plenty of messenger boys.”
Sir William Preece, chief engineer of the British Post Office, 1876
4.1 Introduction
In this chapter we present novel algorithms for detection of data symbols in MIMO systems with perfect Channel State Information (CSI) at the receiver. In the proposed approach, each transmitted symbol vector in the multi-dimensional constellation is classified into one group from a finite set of groups, each containing symbol vectors of equal power. For each of these groups, we relax the non-convex discrete constraint and replace it with a non-convex continuous Power Equality Constraint (PEC). This results in a series of non-convex optimization problems, one for each power group. Using the hidden convexity methodology [147], these optimization problems can be solved efficiently, and a list of "soft candidates" is produced. Once the list is assembled, the soft candidates are quantized (rounded) to yield hard candidates, out of which the one that best fits the received signal vector is chosen as the final detection solution. Although the number of power groups under consideration can be considerable, we will show that only a small number of power groups is relevant in selecting the final detection solution, so a significant complexity reduction can be attained. An appealing property of the proposed detection scheme is that the algorithm does not require knowledge of the noise variance, which is in many cases unknown a priori.
In addition to the proposed detection approach, an improved detection algorithm with a heuristic search is also presented. Based on the list of soft candidates, a local search is performed in order to improve the detection performance. Numerical results show that the proposed detection algorithms significantly outperform the Minimum Mean Squared Error (MMSE) detector. For a MIMO system with four transmit antennas, four receive antennas and 16 QAM, the performance improvement is between 2 and 8 dB over the MMSE detector, depending on whether the heuristic search is used. The main contributions presented in this chapter are as follows:
• Algorithm 1 - Power Equality Constraint Least Squares detector (PEC-LS): We present a novel detection scheme for MIMO systems with high-level modulation constellations. The proposed detection approach is based on a relaxation of the ML optimization problem in a multidimensional constellation. Each symbol vector in the multidimensional constellation can be classified into a finite set of equi-power groups, leading to a set of non-convex optimization problems that can be solved efficiently using the hidden convexity methodology.
• Algorithm 2 - Ordered Power Equality Constraint (OPEC) detector: We present a novel algorithm which significantly reduces the complexity of Algorithm 1 without any performance degradation. This is achieved by sorting the set of soft solutions according to their MSE performance and producing only a subset of hard-decision candidates.
• Algorithm 3 - Improved Ordered Power Equality Constraint (IOPEC) detector: The purpose of this algorithm is to enhance the BER performance of the previous two algorithms. This is achieved by incorporating a local search in the neighborhood of the soft solutions.
• Performance analysis and complexity reduction: In order to reduce the complexity of the proposed detectors, we present an efficient implementation of the proposed algorithms. We separate the computational operations into two categories:
  1. Pre-Processing Phase - contains operations that are common to all the power groups and depend only on the current observation vector.
  2. Processing Phase - contains operations that are group specific.
We show that it is possible to design our algorithms in such a way that most of the complexity burden lies in the Pre-Processing Phase, leading to a low overall complexity. We present a complexity analysis of the proposed algorithms and show that they have the same order of complexity as the MMSE detector.
4.2 Background
MIMO systems arise in many modern communication channels, such as multiple access and multiple antenna channels. It is well known that MIMO systems can yield vast capacity and error performance gains over traditional single antenna systems if a rich-scattering environment is properly exploited [2]. In order to exploit these gains, the system must be able to efficiently detect the transmitted symbols at the receiver. The optimal method for detecting the transmitted symbols is the ML detector, which minimizes the error probability. Unfortunately, the ML detector requires solving a combinatorial optimization problem whose complexity is exponential in the number of transmit antennas and the modulation rate, which makes it impractical for many applications [148]. Therefore, a few suboptimal detectors have been proposed as alternatives to the ML detector. The most common suboptimal detectors are linear detectors, such as the Matched Filter (MF), the de-correlator or Zero Forcing (ZF) detector, and the Minimum Mean Squared Error (MMSE) detector. However, the performance of these linear detectors is far from the ML performance. Moreover, their performance gap from the ML detector becomes larger as the constellation size increases.
Apart from the ML detector and the linear detectors, a different approach that can be taken is to implement a two-step detector:
• In the first step, an approximate solution is found based on a relaxation of the discrete ML problem by a tractable continuous optimization problem. For example, a Semi-Definite Relaxation (SDR) detector has been proposed, where the discrete constellation constraint was replaced by a polynomial constraint [149], [150]. Other examples are given in [151], where the algorithm is based on a relaxation of the ML by using a quadratic non-convex constraint, and in [152] by using a convex continuous constraint.
• In the second step, heuristic search methods can be used to further improve the initial detection solution [153], [154].
4.3 System Description
Consider a MIMO system consisting of M transmit antennas and N receive antennas. The relationship between the transmitted symbol vector s and the received vector y is determined by

    y = Hs + w,        (4.1)

where H ∈ C^{N×M} denotes the flat fading channel matrix consisting of independent complex Gaussian entries, which is assumed to be known by the receiver [155]. In (4.1), s ∈ C^{M×1} stands for the transmitted symbol vector, where the elements of s belong to some known complex constellation S with cardinality D. The additive noise vector w is of size N×1 with i.i.d. complex random elements, every element w_i ~ CN(0, σ_w²). For clarity of presentation, throughout this chapter we use as an example a 16 QAM constellation and M = 2, so that S = 16 QAM and D = 16, although our approach is general and can be used with any constellation size and any number of antennas.
First, we provide a short overview of some well-known detectors. If the input has a flat prior, the ML detector minimizes the error probability and can be written as
    ŝ_ML = arg min_s ||y - Hs||²,    s.t. s ∈ S^M.        (4.2)

To find the solution of (4.2), which is a combinatorial problem, a brute-force search over all of the D^M candidates (or lattice points) must be used [156]. However, this is impractical as M and D become large.
Now, we review two linear detectors, the ZF and the MMSE detector. In both ZF and MMSE detectors, as a simple relaxation, it is assumed that the elements of s belong to an M-dimensional continuous complex plane, denoted by s ∈ C^M. This relaxation results in the following optimization problem
    ŝ = arg min_s ||y - Hs||²,    s.t. s ∈ C^M.        (4.3)

The solution of (4.3) leads to the well-known ZF detector, given as [157]
    ŝ_ZF = Q[ (H^H H)^{-1} H^H y ],        (4.4)

where Q[x] denotes component-wise quantization of x according to the symbol alphabet used. This is, however, suboptimal in general, because the multiplication by (H^H H)^{-1} H^H introduces correlations among the noise components, and small eigenvalues of H^H H will lead to large errors due to noise amplification. The problem of noise enhancement in ZF has been addressed by MMSE detection. By combining the same relaxation as in the ZF detector with minimization of the MSE, the MMSE detector provides a trade-off between noise amplification and interference suppression and achieves improved performance relative to the ZF detector. The MMSE solution is given by [157]
    ŝ_MMSE = Q[ (H^H H + (σ_w²/E_s) I)^{-1} H^H y ],        (4.5)
where E_s is the average power of the constellation and the noise variance σ_w² needs to be known by the receiver. Although a significant improvement over the ZF detector is achieved, the performance of the MMSE detector is still far from that of the ML detector. Therefore, there has been considerable effort on nonlinear approximations of the ML detector.
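For concreteness, the two linear detectors (4.4) and (4.5) can be sketched in a few lines of NumPy; the dimensions, the QPSK alphabet and the noise level below are illustrative assumptions rather than the chapter's 16 QAM setting:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sigma2_w, Es = 2, 4, 0.1, 1.0
alphabet = np.array([1+1j, 1-1j, -1+1j, -1-1j]) / np.sqrt(2)  # QPSK, unit power

def quantize(x):
    """Component-wise rounding Q[.] to the nearest constellation symbol."""
    return alphabet[np.argmin(np.abs(x[:, None] - alphabet[None, :]), axis=1)]

# Random flat-fading channel, transmitted symbols, and noisy observation (4.1)
H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
s = rng.choice(alphabet, size=M)
w = np.sqrt(sigma2_w / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = H @ s + w

# ZF detector (4.4): pseudo-inverse, then quantize
s_zf = quantize(np.linalg.solve(H.conj().T @ H, H.conj().T @ y))
# MMSE detector (4.5): regularized inverse using the known noise variance
A = H.conj().T @ H + (sigma2_w / Es) * np.eye(M)
s_mmse = quantize(np.linalg.solve(A, H.conj().T @ y))
```

The only structural difference between the two is the regularization term σ_w²/E_s · I in the matrix being inverted, which is exactly the term the PEC approach below replaces with a symbol-dependent multiplier.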
In the rest of the chapter, we propose a detector which is based on the solution of an optimization problem over a relaxed continuous non-convex constraint set. Since the constraint is significantly tighter than the one used in the ZF and MMSE detectors, a performance gain in terms of BER is expected.
4.4 Power Equality Constraint Least Square Detection
In this section, we propose a PEC-LS detector for MIMO systems with high-level QAM modulations. In the proposed detector, the set of D^M possible symbol combinations is replaced by a relaxed constraint set. This relaxation leads to a non-convex optimization problem that can be solved efficiently. First, we introduce the background for our approach and provide some useful definitions.
4.4.1 Basic Definitions and Problem Settings
Definition 4.1. Quantum Power Level (QPL): Let Φ (S) be a set of all possible power levels that are associated with a system of a single transmit antenna and a constellation S, that is
    Φ(S) ≜ { P ∈ R | ∃ s ∈ S : |s|² = P }.        (4.6)

Then Φ(S) is called the quantum power level set of the constellation S. In the case of S = 16 QAM, as depicted in Figure 4.1, we have
    Φ(16 QAM) = {2, 10, 18},    |Φ(16 QAM)| = 3.        (4.7)
Definition 4.2. Symbol Vector Power Group: Let Ω (M,S) be a set of all possible powers of a transmitted signal vector that are associated with a system of M transmit antennas and constellation S. Then the symbol vector power group set is defined as
    Ω(M,S) ≜ { P ∈ R | ∃ s ∈ S^M : ||s||² = Σ_{m=1}^{M} |s[m]|² = P },        (4.8)

where s = [s[1], ..., s[m], ..., s[M]]^T and s[m] is the symbol from the m-th transmit antenna.
[Figure: the 16 QAM constellation points ±1±i, ±1±3i, ±3±i, ±3±3i, shown with the three concentric QPL circles Φ_1 = 2, Φ_2 = 10 and Φ_3 = 18.]
Figure 4.1: QPL for 16QAM modulation
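Definitions 4.1 and 4.2 can be reproduced programmatically; the sketch below enumerates Φ(S) and Ω(2,S) for 16 QAM (powers are rounded to integers, which is exact for this constellation):

```python
from itertools import product

# 16 QAM constellation: real and imaginary parts in {-3, -1, 1, 3}
S = [complex(re, im) for re in (-3, -1, 1, 3) for im in (-3, -1, 1, 3)]

def qpl(S):
    """Quantum power level set Phi(S), eq. (4.6)."""
    return sorted({round(abs(s) ** 2) for s in S})

def power_groups(M, S):
    """Symbol vector power group set Omega(M,S), eq. (4.8)."""
    return sorted({round(sum(abs(x) ** 2 for x in v)) for v in product(S, repeat=M)})

print(qpl(S))              # [2, 10, 18], matching (4.7)
print(power_groups(2, S))  # [4, 12, 20, 28, 36]
```

Enumerating over S^M is only practical for small M; the point here is simply to verify the sets used in the running example.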
The cardinality of Ω(M,S) depends on the constellation structure S as well as on the number of transmit antennas M. Note that

    Ω(1,S) = Φ(S),        (4.9)

and also that the elements in Ω(M,S) are composed of the sums of M elements from Φ(S). Generally, the cardinality of the set Ω(M,S) can be upper bounded by |Ψ(M,S)|, where Ψ(M,S) is the set of permutations choosing M values out of the |Ω(1,S)| possible different values with repetitions, that is

    |Ω(M,S)| ≤ |Ω(1,S)|^M ≜ |Ψ(M,S)|.        (4.10)

For example, consider the setting M = 2 and S = 16 QAM. In this case, the possible power combinations are

    Ψ(2,S) = {(2+2), (2+10), (2+18), (10+2), (10+10), (10+18), (18+2), (18+10), (18+18)},        (4.11)

with corresponding powers 4, 12, 20, 12, 20, 28, 20, 28, 36, and the cardinality is equal to nine. The number of unique elements in the power group set Ω(2,S) is only five, that is

    Ω(2,S) = {4, 12, 20, 28, 36},    |Ω(2,S)| = 5.        (4.12)
The ω-th element in Ω(M,S) is denoted by Ω_ω(M,S), ω ∈ {1, ..., |Ω(M,S)|}, and it is referred to as a power group, as it represents a power level of a group of transmitted symbol vectors.
Groups representation for M = 2, S = 16 QAM:

    ω    Ω_ω(2,S)    G_ω(2,S)                       |G_ω(2,S)|
    1    4           (2+2)                           1
    2    12          (2+10), (10+2)                  2
    3    20          (2+18), (10+10), (18+2)         3
    4    28          (10+18), (18+10)                2
    5    36          (18+18)                         1

Table 4.1: Relation between Ω(2,S), Φ(S) and G(2,S)
Definition 4.3. The combinations of QPL elements that compose a power group Ω_ω(M,S) are denoted by G_ω(M,S), and each possible element in G_ω(M,S) is denoted by G_ω^μ(M,S), where μ ∈ {1, ..., |G_ω(M,S)|}.
The relation between Ω(2,S), Φ(S) and G(2,S) in our example is depicted in Table 4.1, where every row represents a power group. From Table 4.1, it is clear that the number of power groups in Ω(2,S) is five. For example, the element ω = 3 and the power group Ω_3(2,S) = 20 correspond to symbol vectors s such that ||s||² = 20. For this power group, there are |G_3(2,S)| = 3 combinations, which are G_3(2,S) = {(2+18), (10+10), (18+2)}. It is clear that |Ψ(M,S)| = Σ_{ω=1}^{|Ω(M,S)|} |G_ω(M,S)|.

Now, using the above definitions, we introduce an approximated solution of the ML detection based on replacing the discrete constraint in (4.2) by a continuous non-convex constraint set. Unlike the constraints used in the detectors reviewed in Section 4.3, we use a power equality constraint on the transmitted symbol vector in our proposed detector. This relaxation of the ML problem is significantly tighter than the one in (4.4) and (4.5), where the relaxation is s ∈ C^M. However, since there are |Ω(M,S)| possible power groups, there will be |Ω(M,S)| constrained detectors, one for each power group. Therefore, we obtain the following |Ω(M,S)| independent optimization problems:
    ŝ_ω = arg min_s ||y - Hs||²,    s.t. ||s||² = Ω_ω(M,S),        (4.13)

where Ω_ω(M,S) is the ω-th element in Ω(M,S) and ω ∈ {1, ..., |Ω(M,S)|}. By solving (4.13) once for each power group, a set of |Ω(M,S)| soft candidates (SC) is assembled as:
    ŝ^SC ≜ { ŝ_ω^SC },    ω ∈ {1, ..., |Ω(M,S)|},        (4.14)

and their corresponding squared Euclidean distances to the received signal y are defined as:
    Δ^SC ≜ { ε_ω = ||y - H ŝ_ω^SC||² },    ω ∈ {1, ..., |Ω(M,S)|}.        (4.15)

The sets ŝ^SC, Δ^SC will serve as the starting point for our proposed algorithms in the following sections.
4.4.2 Constraint LS Detection for a Specific Power Group
We now discuss the solution of (4.13) for a specific power group. Problem (4.13) constitutes LS estimation with the additional constraint that the candidate vectors lie on the hypersphere ||s||² = Ω_ω(M,S). This problem is non-convex, since the relaxed set ||s||² = Ω_ω(M,S) does not define a convex set. However, this seemingly non-convex problem has been studied extensively [147], [158], [159] as part of the Trust Region Subproblem (TRS), and can be solved efficiently. In this work we use the hidden convexity methodology [147] to solve problem (4.13), as it caters for an efficient solution to the MIMO detection problem.
In general, a non-convex minimization problem is called a hidden convex minimization problem [147] if there exists an equivalent transformation such that the transformed minimization problem is convex. In [151], problem (4.13) was solved using the above-mentioned methodology. We now present this solution.
Theorem 4.4. ([151]) The solution to
    ŝ = arg min_s ||y - Hs||²,    s.t. ||s||² = P,        (4.16)

is

    ŝ = (H^H H + ηI)^{-1} H^H y,        (4.17)

where η ≥ -λ_min(H^H H) is the unique root of

    ||ŝ||² = P.        (4.18)

It is interesting to see that the solution (4.17) has a similar form to the linear MMSE receiver in (4.5), where the constant positive regularization term σ_w²/E_s in the MMSE receiver is replaced by the symbol-dependent term η. We note that η can be both negative (de-regularization) and positive (regularization). An appealing property of Theorem 4.4 is that this estimator does not require knowledge of the noise variance σ_w², which may not be known in practice.
We now discuss the implementation of Theorem 4.4. The only issue is the evaluation of the single parameter η. This can be done by trying different values of η in (4.17) until the one that satisfies (4.18) is found. Due to the fact that ||ŝ_η||² = ||(H^H H + ηI)^{-1} H^H y||² is monotonically decreasing in η ≥ -λ_min(H^H H), we can find a value of η that satisfies (4.18) using a simple line search, such as bisection [160]. The search range is η_left ≤ η ≤ η_right, where η_left can be chosen such that

    ||ŝ_{η_left}||² ≥ P and        (4.19a)
    η_left ≥ -λ_min(H^H H),        (4.19b)

and η_right can be chosen such that

    ||ŝ_{η_right}||² ≤ P.        (4.20)
In Section 4.6, we derive η_left and η_right values that enable an efficient line search on a reduced range.
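The line search described above can be sketched as follows; the bracket-growing rule, the random test channel and the tolerances are illustrative assumptions (the refined bounds of Section 4.6 are not used here):

```python
import numpy as np

def pec_soft_candidate(H, y, P, tol=1e-10):
    """Solve (4.16) via Theorem 4.4: bisection on eta in
    s(eta) = (H^H H + eta I)^{-1} H^H y until ||s(eta)||^2 = P (4.18)."""
    HH = H.conj().T @ H
    Hy = H.conj().T @ y
    I = np.eye(HH.shape[0])

    def norm2(eta):
        s = np.linalg.solve(HH + eta * I, Hy)
        return s, float(np.linalg.norm(s) ** 2)

    # ||s(eta)||^2 is monotonically decreasing for eta > -lambda_min(H^H H)
    left = -np.linalg.eigvalsh(HH)[0] + 1e-9
    step, right = 1.0, left + 1.0
    while norm2(right)[1] > P:        # grow the bracket until ||s||^2 <= P
        step *= 2.0
        right = left + step
    while right - left > tol:
        mid = 0.5 * (left + right)
        left, right = (mid, right) if norm2(mid)[1] > P else (left, mid)
    return norm2(right)[0]

rng = np.random.default_rng(3)
H = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)
s_soft = pec_soft_candidate(H, y, P=20.0)
```

Each bisection step costs one small linear solve, and the monotonicity of ||ŝ_η||² guarantees convergence to the unique root of (4.18).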
Since the constraint in (4.16) is related to power levels of hyper-spheres, rather than the signal constellation points, we call the solution in (4.17) a soft solution (or a soft candidate). After the soft solution is obtained, it should be properly rounded to the M-dimensional signal constella- tions of the corresponding power level, yielding a hard candidate. The details will be given in the next Section.
4.4.3 PEC-LS Detection with QAM Modulation
As stated previously, the soft solution for a specific power group can be computed efficiently. In a MIMO system with QAM modulation, there are a number of power groups, each of which yields a soft solution. Using the set of soft candidates ŝ^SC found in (4.14), a set of |Ω(M,S)| hard-decision candidates can be produced by:

    ŝ_ω^HD = Q_ω[ ŝ_ω^SC ],    ω ∈ {1, ..., |Ω(M,S)|},        (4.21)

where Q_ω[·] refers to the hard-decision operation for the ω-th element in Ω(M,S). We now elaborate on the Q_ω[·] operation. Unlike the MMSE and ZF detectors, for which the quantization operation can be carried out element-wise, the quantization in the proposed detector needs to be carried out vector-wise. This is due to the fact that we have the additional constraint on the power of the transmitted vectors; hence the hard-decision solution based on a soft solution must satisfy the power constraint of the group.
A direct implementation of (4.21) is cumbersome, since the quantization operation for a given power group involves choosing the combination of M (hard-decision) symbols that has the minimum Euclidean distance to the soft solution, among all the possible combinations of symbols that compose the power group. This can be formulated as:

    ŝ_ω^HD = arg min_{s: ||s||² = Ω_ω(M,S)} ||ŝ_ω^SC - s||²
           = arg min_{s: ||s||² = Ω_ω(M,S)} Σ_{m=1}^{M} |ŝ_ω^SC[m] - s[m]|²,        (4.22)

where s[m] refers to the m-th element of s. For example, for M = 2, S = 16 QAM, ω = 3, the constraint Ω_3(M,S) = 20 can be satisfied by 16 + 64 + 16 = 96 combinations. A direct implementation of (4.22) involves computing the squared Euclidean distances between 96 candidate vectors and the soft-decision vector of the power group, then comparing their distances and choosing the candidate vector which is closest to the soft candidate, yielding a hard-decision vector for that power group.
In order to reduce the complexity of the above hard-decision operation, we now present a method which performs the quantization operation on each group efficiently. In the proposed method, instead of generating a single hard-decision candidate for each power group, we produce a set of |G_ω(M,S)| hard-decision candidates for the ω-th power group. For each combination of QPL elements given by G_ω(M,S), the transmitted signal power from every antenna is specified; therefore, the quantization operation in (4.21) can be implemented element-wise and we have

    min ||ŝ_ω^SC - s||² = min Σ_{m=1}^{M} |ŝ_ω^SC[m] - s[m]|²
                        = Σ_{m=1}^{M} min |ŝ_ω^SC[m] - s[m]|².        (4.23)

Given a QPL, the hard decisions for the signal from each transmit antenna lie on a circle. As a result, the element-wise quantization can be carried out via a simple phase rounding, which is denoted by Q_{G_ω^μ}[·].

For each power group indexed by ω, |G_ω(M,S)| hard-decision candidates are generated by using the phase rounding, and they form a hard-decision candidate set for this power group. This set of hard-decision candidates is denoted by {ŝ_{ω,1}^HD, ŝ_{ω,2}^HD, ..., ŝ_{ω,|G_ω(M,S)|}^HD}. Then, we compute the squared Euclidean distances of the hard-decision candidates to the received signal y, which is denoted by
    ε_{ω,i}^HD = ||y - H ŝ_{ω,i}^HD||²,    i ∈ {1, ..., |G_ω(M,S)|}.        (4.24)
Then, the hard-decision candidate with the smallest squared Euclidean distance to the received signal is chosen as the hard decision of the power group. That is,
    ŝ_ω^HD = arg min_{ŝ_{ω,i}^HD} ||y - H ŝ_{ω,i}^HD||²,    i ∈ {1, ..., |G_ω(M,S)|},        (4.25)

and the corresponding hard distance is
    ε_ω^HD = min_i ||y - H ŝ_{ω,i}^HD||²,    i ∈ {1, ..., |G_ω(M,S)|}.        (4.26)

For example, in the case of M = 2, S = 16 QAM, ω = 3, Ω_3(M,S) = 20, there are three possible combinations of QPL symbols (G_3(2,S) = {(2+18), (10+10), (18+2)}). Accordingly, a phase rounding is implemented on the soft candidate ŝ_3^SC = [ŝ_3^SC[1], ŝ_3^SC[2]]^T for three cases. In the first instance, ŝ_3^SC[1] is rounded to a symbol with |s|² = 2 and ŝ_3^SC[2] is rounded to a symbol with |s|² = 18. In the second instance, ŝ_3^SC[1] is rounded to a symbol with |s|² = 10 and ŝ_3^SC[2] is rounded to a symbol with |s|² = 10. In the third instance, ŝ_3^SC[1] is rounded to a symbol with |s|² = 18 and ŝ_3^SC[2] is rounded to a symbol with |s|² = 2. For this power group, the proposed method needs to compute three squared Euclidean distances and three phase rounding operations, as compared to 96 squared Euclidean distance calculations in the direct implementation of (4.22). Therefore, the proposed method is less complex.
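The phase rounding onto a QPL circle can be sketched as a nearest-symbol search restricted to the symbols of one power level; for clarity this sketch compares distances exhaustively within the level rather than rounding the phase angle directly:

```python
# 16 QAM symbols grouped by their quantum power level (Definition 4.1)
QAM16 = [complex(re, im) for re in (-3, -1, 1, 3) for im in (-3, -1, 1, 3)]
BY_QPL = {2: [], 10: [], 18: []}
for sym in QAM16:
    BY_QPL[round(abs(sym) ** 2)].append(sym)

def phase_round(soft, qpl):
    """Round a soft value to the closest 16 QAM symbol on the circle |s|^2 = qpl."""
    return min(BY_QPL[qpl], key=lambda sym: abs(soft - sym))

# Quantize a soft candidate under the combination (2+18) of power group 20:
soft = [0.8 + 1.2j, 2.6 - 2.7j]
hard = [phase_round(soft[0], 2), phase_round(soft[1], 18)]
print(hard)  # [(1+1j), (3-3j)]
```

Since each circle of 16 QAM contains at most eight symbols, the per-antenna rounding is a constant-time operation, which is what makes (4.23) cheap compared to the joint search of (4.22).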
Repeating the phase rounding and hard-decision operation for all power groups, we get a set of |Ω(M,S)| hard-decision candidates {ŝ_1^HD, ŝ_2^HD, ..., ŝ_{|Ω(M,S)|}^HD} and their corresponding squared Euclidean distances to the received vector y, which are denoted by {ε_1^HD, ε_2^HD, ..., ε_{|Ω(M,S)|}^HD}. Finally, the candidate with the minimal squared Euclidean distance over all power groups is chosen as the detection output, that is
    ŝ_HD = arg min_{ŝ_ω^HD} ||y - H ŝ_ω^HD||²,    ω ∈ {1, ..., |Ω(M,S)|}.        (4.27)

We refer to the detector proposed above as the PEC-LS detector, and the algorithm is depicted in Algorithm 12.
4.4.4 Ordered PEC-LS Detection with Reduced Number of Power Groups
In the previous Section we have shown that the solution to the PEC-LS detection is found based on solving a series of optimization problems. For a system with a large number of transmit antennas and a high modulation level, the number of phase rounding and Euclidean distance 4.4 Power Equality Constraint Least Square Detection 100
computation grows exponentially with the number of antennas. In this subsection, we present an inherent property of the PEC-LS algorithm. Using this property, we can significantly reduce the detection complexity by deleting a number of irrelevant power groups when finding the detection output. We use the following Lemma to explain this inherent property of the PEC-LS algorithm. After that, we propose the Ordered PEC-LS (OPEC) algorithm.

Algorithm 12 PEC-LS detector
Input: y, H, S, M
Output: ŝ_HD
 1: Compose the Quantum Power Level set Φ(S) according to (4.6).
 2: Compose the Symbol Vector Power Group set Ω(M,S) according to (4.8).
 3: for ω = 1 to |Ω(M,S)| do
 4:     ŝ_ω^SC = arg min_s ||y - Hs||², s.t. ||s||² = Ω_ω(M,S)
 5:     for i = 1 to |G_ω(M,S)| do
 6:         ŝ_{ω,i}^HD = Q_{G_ω^i}[ ŝ_ω^SC ]
 7:     end for
 8:     ŝ_ω^HD = arg min_i ||y - H ŝ_{ω,i}^HD||², i ∈ {1, ..., |G_ω(M,S)|}
 9: end for
10: ŝ_HD = arg min_ω ||y - H ŝ_ω^HD||², ω ∈ {1, ..., |Ω(M,S)|}
Lemma 4.5. Let ε_j^HD ≜ ||y - H ŝ_j^HD||² be the squared distance of the hard-decision solution of the j-th power group, j ∈ {1, ..., |Ω(M,S)|}. Given another power group indexed by k, k ∈ {1, ..., |Ω(M,S)|}, if the soft distance satisfies

    ||y - H ŝ_k^SC||² ≥ ε_j^HD,        (4.28)

then the k-th group cannot produce a hard-decision candidate such that ε_k^HD ≤ ε_j^HD.
2 Proof. By virtue of Theorem 4.4, the soft value sSC minimizes the expression y H sSC . k − k HD SC Therefore, any hard solution sk , which is obtained from sk , has a greater (or the same) b b squared distance. Therefore, we have b b 2 2 2 ǫHD = y HsHD y HsSC y H sHD = ǫHD (4.29) k − k ≥ − k ≥ − j j and the Lemma is proved. b b b
Based on the result of Lemma 4.5, we can now develop the OPEC algorithm, which yields the same solution as the PEC-LS algorithm but with a reduced complexity. We first recall the set of $|\Omega(M,S)|$ soft detections in (4.14):
\hat{s}^{SC} = \left\{ \hat{s}^{SC}_{\omega} \right\}, \quad \omega \in \{1,\ldots,|\Omega(M,S)|\},   (4.30)

associated with the $|\Omega(M,S)|$ soft Euclidean distances
\Delta^{SC} = \left\{ \epsilon_{\omega} = \left\| y - H\hat{s}^{SC}_{\omega} \right\|^2 \right\}, \quad \omega \in \{1,\ldots,|\Omega(M,S)|\}.   (4.31)

In the detection, we first sort the $|\Omega(M,S)|$ power groups in ascending order of their corresponding soft Euclidean distances. Then we find the hard-decision solution according to the sorted groups. Suppose that quantization and selection of the best hard-decision candidate, as discussed in Section 4.4.3, has been performed on the $i$-th power group, $i \in \{1,\ldots,|\Omega(M,S)|\}$. Its hard squared distance $\epsilon_i$ is compared with the soft squared distances of the remaining groups ($\epsilon_j$, $i < j \leq |\Omega(M,S)|$). If any of the remaining power groups' soft error is greater than the current hard-decision squared distance, that power group and all those after it can be omitted from the candidacy list, since they cannot generate a hard decision which is closer to the received vector than the current hard decision. Note that this algorithm is deterministically equivalent to the PEC-LS detector in terms of detected symbols and bears no degradation in terms of BER. We refer to this algorithm as the OPEC detector; it is depicted in Algorithm 13. In Section 4.7, we will assess the complexity reduction achieved by using this algorithm, relative to the previously presented PEC-LS algorithm.
Algorithm 13 OPEC based detector for MIMO systems
Input: y, H, S, M
Output: \hat{s}_{HD}
1: Initialize: \sigma = 1, \epsilon_{HD} = \infty.
2: Compose the Quantum Power Levels set QPL(M,S) according to (4.6).
3: Compose the Symbol Vector Power Groups \Omega(M,S) according to (4.8).
4: for \omega = 1 to |\Omega(M,S)| do
5:    \hat{s}^{SC}_{\omega} = \arg\min_{s} \|y - Hs\|^2, s.t. \|s\|^2 = \Omega_{\omega}(M,S)
6:    \epsilon_{\omega} = \|y - H\hat{s}^{SC}_{\omega}\|^2
7: end for
8: \Delta^{sorted}, index = sort(\Delta^{SC}) in an ascending order.
9: \hat{s}^{sorted}_{\omega} = \hat{s}^{SC}_{index(\omega)}, \omega \in \{1,\ldots,|\Omega(M,S)|\}
10: for \omega = 1 to |\Omega(M,S)| do
11:    if \epsilon_{HD} \leq \epsilon_{index(\omega)} then
12:       return
13:    else
14:       for \mu = 1 to |\mathcal{G}_{index(\omega)}(M,S)| do
15:          \hat{s}^{HD}_{\mu} = Q_{\mathcal{G}^{\mu}_{index(\omega)}}(\hat{s}^{sorted}_{\omega})
16:          if \|y - H\hat{s}^{HD}_{\mu}\|^2 \leq \epsilon_{HD} then
17:             \hat{s}_{HD} = \hat{s}^{HD}_{\mu}
18:             \epsilon_{HD} = \|y - H\hat{s}^{HD}_{\mu}\|^2
19:          end if
20:       end for
21:    end if
22: end for
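The ordering and early-exit logic of the OPEC detector can be sketched as follows. This is a minimal sketch under the same assumptions as before: `phase_round` is a hypothetical placeholder for the hard-decision quantizer, and the soft solutions are assumed to have been computed already.

```python
import numpy as np

def opec_detect(y, H, soft_solutions, phase_round):
    """Sketch of OPEC ordering and early termination.

    soft_solutions : list of (s_soft, group_id) pairs, one per power group
    phase_round    : yields the hard candidates for a soft vector
                     (placeholder for the procedure defined in the text)
    """
    # Soft squared distances (4.31), then ascending sort.
    soft_dist = [np.linalg.norm(y - H @ s) ** 2 for s, _ in soft_solutions]
    order = np.argsort(soft_dist)

    best, best_dist = None, np.inf
    for idx in order:
        # Lemma 4.5: once a soft distance exceeds the current hard
        # distance, no remaining group can improve the solution.
        if best_dist <= soft_dist[idx]:
            break
        s_soft, gid = soft_solutions[idx]
        for s_hard in phase_round(s_soft, gid):
            d = np.linalg.norm(y - H @ s_hard) ** 2
            if d < best_dist:
                best, best_dist = s_hard, d
    return best
```

Because the groups are visited in ascending soft-distance order, the first time the test of Lemma 4.5 fires, all remaining groups can be skipped at once, which is exactly the early return in step 12 of Algorithm 13.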
4.5 Improved Ordered Power Equality Constraint Detection
In this Section we propose an improved detection scheme, based on the OPEC detector. From the previous Section, we see that there are primarily two steps: in the first step we find the soft solution; in the second step, based on the soft solution, we find the hard-decision candidate for each power group, according to (4.22), such that $\|\hat{s}^{SC}_{\omega} - \hat{s}^{HD}_{\omega}\|^2$ is minimized. However, this hard-decision solution may not minimize $\|y - H\hat{s}^{HD}_{\omega}\|^2$ in that power group, which is the hard squared distance to the received signal.
Motivated by this observation, we present an approach to improve our proposed OPEC algorithm, where a local search around each set of soft candidates $\hat{s}^{SC}_{\omega}$ is incorporated. The basic idea of the method is that, in performing the hard decision by phase rounding each element, we choose the two constellation points that have the smallest phase differences to the soft solution. This differs from the previously proposed algorithm, where we only choose the one point that has the smallest phase difference. This can be seen as a local search algorithm [161] around the soft decision $\hat{s}^{SC}_{\omega}$.
The starting point of the improved detector is the same as that of the OPEC detector, which obtains $|\Omega(M,S)|$ soft solutions as in (4.14). To find the two points with the smallest phase differences from the soft solution, we define the neighborhood from which the two points will be chosen. Since we are interested in a simple phase rounding operation, the neighborhood is naturally defined under a phase criterion. We define the neighborhood $N(d)$ of $d$, where $d$ is an element of a soft candidate, as
N(d) = \left\{ (s_i, s_k) \in S', \; i \neq k \;\middle|\; \max\left\{ |\angle d - \angle s_i|, |\angle d - \angle s_k| \right\} < |\angle d - \angle s_j| \;\; \forall j \in \{1,\ldots,D\}, j \neq i,k \right\},   (4.32)

where $S'$ is a subgroup of $S$ with size $D$ which corresponds to a QPL group.
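Numerically, the selection in (4.32) amounts to picking the two subgroup points with the smallest circular phase distance to $d$. A minimal sketch (the function name and array layout are illustrative, not from the thesis):

```python
import numpy as np

def phase_neighborhood(d, subgroup):
    """Return the two points of `subgroup` (one QPL group of the
    constellation) with the smallest absolute phase difference to the
    soft element d, i.e. the neighborhood N(d) of (4.32)."""
    subgroup = np.asarray(subgroup)
    # Circular phase distance, wrapped into [0, pi]
    diff = np.angle(subgroup) - np.angle(d)
    dist = np.abs((diff + np.pi) % (2 * np.pi) - np.pi)
    i, k = np.argsort(dist)[:2]
    return subgroup[i], subgroup[k]
```

For instance, for the 16-QAM $\Phi_2$ points, a soft element with phase $0.25\pi$ returns the pair $(3+i)$, $(1+3i)$, in agreement with Region 2 of Table 4.3.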
In summary, the procedure for the improved detection is similar to the OPEC algorithm. The distinction is that instead of creating one hard-decision candidate for each power group as in the OPEC algorithm, $2^M$ hard-decision candidates are created in the improved OPEC algorithm. For example, in the case of $M = 2$, $S = 16$-QAM, we now have up to $4 \times |\Psi(2,S)| = 36$ candidates instead of $|\Psi(2,S)| = 9$, as in the PEC-LS algorithm. In the ordered algorithm, the number of hard-decision candidates is much smaller than in the one without ordering.
In order to create the neighborhood $N(d)$ for the Improved Ordered PEC-LS, modified decision boundaries are required. By investigating the structure of the 16-QAM signal, we observe two different types of decision boundaries. The first type of decision boundary serves the elements in $\Omega(M,S)$ that are composed of power levels $\Phi_1(S)$ and $\Phi_3(S)$, as they contain only four constellation points with the same arguments. The second type of decision boundary is used for the elements in $\Omega(M,S)$ that are composed of power level $\Phi_2(S)$, as there are eight constellation points with the same arguments (see Figure 4.1).
Now, we describe a general method for generating the decision boundaries for the neighborhood $N(d)$ in (4.32). The decision rule for each region is based on phase distance. Each decision region contains two adjacent constellation points and is fully described by a start phase $\alpha$ and an end phase $\beta$. In order to construct the decision region boundaries, the following optimization problem needs to be solved:
\alpha = \frac{1}{2}\left\{ \angle(s_2) + \angle\left( \arg\min_{x \in S,\, x \neq s_1} \left[ \left( \angle(s_1) - \angle(x) \right) \,\mathrm{Mod}\, (2\pi) \right] \right) \right\},   (4.33a)

\beta = \frac{1}{2}\left\{ \angle(s_1) + \angle\left( \arg\min_{x \in S,\, x \neq s_2} \left[ \left( \angle(x) - \angle(s_2) \right) \,\mathrm{Mod}\, (2\pi) \right] \right) \right\},   (4.33b)

where $(x)\,\mathrm{Mod}\,(2\pi)$ denotes $x$ modulo $2\pi$, and the two points are chosen such that $\angle s_1 < \angle s_2$. The resulting values $\alpha$ and $\beta$ represent the right and left region boundaries for the combination of $s_1$ and $s_2$, respectively. The two types of decision boundaries are explained in Figures 4.2 and 4.3, and are given in Tables 4.2 and 4.3.
Once the hard candidates for each power combination, as specified in (4.11), have been chosen, the one with the smallest squared distance is selected as the hard candidate for that power combination. Next, by using (4.27), a hard candidate is selected as the final solution. We refer to this detector as the IOPEC detector.
Figure 4.2: Decision Boundaries for Φ1 and Φ3
Figure 4.3: Decision Boundaries for Φ2
Decision boundaries for Φ1 and Φ3

Region   | Phase of s                   | Candidates
Region 1 | π/4  ≤ ∠(s) ≤ 3π/4           | (1+i), (−1+i)
Region 2 | 3π/4 ≤ ∠(s) ≤ 5π/4           | (−1+i), (−1−i)
Region 3 | 5π/4 ≤ ∠(s) ≤ 7π/4           | (−1−i), (1−i)
Region 4 | 7π/4 ≤ ∠(s) ≤ π/4            | (1−i), (1+i)

Table 4.2: Decision Boundaries for Φ1 and Φ3 as depicted in Figure 4.2
4.6 Efficient Implementation and Complexity Analysis
In this Section we discuss efficient implementation of the above-mentioned algorithms and analyze their complexity. In the implementation of the proposed detectors we separate the necessary computations into two groups:

• Pre-Processing phase: contains operations that are common to all power groups Ω(M,S) and depend only on the current observation vector y.

• Processing phase: contains operations that are specific to each power group Ω_ω(M,S).
Decision boundaries for Φ2

Region   | Phase of s                    | Candidates
Region 1 | 1.8524π ≤ ∠(s) ≤ 0.1476π      | (3−i), (3+i)
Region 2 | 0.1476π ≤ ∠(s) ≤ 0.3524π      | (3+i), (1+3i)
Region 3 | 0.3524π ≤ ∠(s) ≤ 0.6476π      | (1+3i), (−1+3i)
Region 4 | 0.6476π ≤ ∠(s) ≤ 0.8524π      | (−1+3i), (−3+i)
Region 5 | 0.8524π ≤ ∠(s) ≤ 1.1476π      | (−3+i), (−3−i)
Region 6 | 1.1476π ≤ ∠(s) ≤ 1.3524π      | (−3−i), (−1−3i)
Region 7 | 1.3524π ≤ ∠(s) ≤ 1.6476π      | (−1−3i), (1−3i)
Region 8 | 1.6476π ≤ ∠(s) ≤ 1.8524π      | (1−3i), (3−i)

Table 4.3: Decision Boundaries for Φ2 as depicted in Figure 4.3
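The Φ2 boundary values in Table 4.3 can be sanity-checked numerically: each boundary lies at the phase bisector of the two constellation points that become equidistant (in phase) there, i.e. the outer neighbors of the point shared by the two adjacent regions. A short check (an independent verification sketch, not the construction used in the thesis):

```python
import numpy as np

# The eight 16-QAM points of power level Phi2.
points = np.array([3+1j, 1+3j, -1+3j, -3+1j, -3-1j, -1-3j, 1-3j, 3-1j])
phases = np.sort(np.angle(points) % (2 * np.pi))   # ascending phases

# Boundary i bisects the phases of the neighbors of point i.
boundaries = []
n = len(phases)
for i in range(n):
    a, b = phases[(i - 1) % n], phases[(i + 1) % n]
    if b < a:                      # unwrap across 2*pi
        b += 2 * np.pi
    boundaries.append(((a + b) / 2) % (2 * np.pi) / np.pi)
# boundaries (in units of pi) reproduce the tabulated values
# 0.1476, 0.3524, 0.6476, 0.8524, 1.1476, 1.3524, 1.6476, 1.8524.
```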
The overall complexity can be expressed as

\mathcal{C} = \mathcal{C}_{PP} + \mathcal{C}_{P},   (4.34)

where $\mathcal{C}_{PP}$ and $\mathcal{C}_{P}$ are the complexities of the Pre-Processing and Processing phases, respectively.
4.6.1 Efficient Implementation of Constrained LS Detector
In order to implement the proposed detectors, the set of soft candidates (equation (4.13)) needs to be assembled first. We begin by presenting an efficient way to obtain the solutions of the set of optimization problems in (4.13).
As mentioned in Section 4.4, the constrained least squares problem in (4.13) for every power group can be solved using a simple line search, such as a bisection search. At each iteration of the line search, a new value
\|s_{\eta}\|^2 = \left\| \left( H^H H + \eta I \right)^{-1} H^H y \right\|^2,   (4.35)
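A minimal numerical sketch of this line search (not the thesis implementation) is given below. It assumes the target power does not exceed the power of the unconstrained least-squares solution, so that $\eta \geq 0$ suffices and $\|s_\eta\|^2$ decreases monotonically in $\eta$; bisection then finds the $\eta$ that meets the power constraint.

```python
import numpy as np

def power_constrained_ls(y, H, target_power, iters=60):
    """Bisection sketch for  min ||y - H s||^2  s.t.  ||s||^2 = target_power,
    using s_eta = (H^H H + eta I)^{-1} H^H y as in (4.35).
    Assumes target_power <= power of the unconstrained LS solution."""
    G = H.conj().T @ H
    b = H.conj().T @ y
    n = G.shape[0]

    def s_of(eta):
        return np.linalg.solve(G + eta * np.eye(n), b)

    # Bracket eta: ||s_eta||^2 shrinks toward 0 as eta grows.
    lo, hi = 0.0, 1.0
    while np.linalg.norm(s_of(hi)) ** 2 > target_power:
        hi *= 2.0
    # Bisect on eta until the power constraint is met.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(s_of(mid)) ** 2 > target_power:
            lo = mid
        else:
            hi = mid
    return s_of(0.5 * (lo + hi))
```

The monotonicity of $\|s_\eta\|^2$ in $\eta$ is what makes a one-dimensional bisection sufficient here, in place of a full constrained optimization.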