Machine Learning for Link Adaptation

Supervised and Reinforcement Learning Theory and Algorithms

VIDIT SAXENA

Doctoral Thesis in Electrical Engineering
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Division of Information Science and Engineering
SE-10044 Stockholm, Sweden, 2021

TRITA-EECS-AVL-2021:35
ISBN 978-91-7873-886-1

Academic thesis which, with permission of the KTH Royal Institute of Technology, is submitted for public scrutiny for the completion of the Ph.D. in Electrical Engineering on Thursday, May 20, 2021 at 13.00 in lecture hall F3, Lindstedtsvägen 26, KTH Royal Institute of Technology, Stockholm.

© Vidit Saxena, April 28, 2021
Print: Universitetsservice US AB

Abstract

Wireless data communication is a complex phenomenon. Wireless links encounter random, time-varying channel effects that are challenging to predict and compensate for. Hence, to make optimal use of the channel, wireless links adapt the data transmission parameters in real time. This process, known as wireless link adaptation, can lead to large gains in link performance. Link adaptation is hence an integral part of state-of-the-art wireless deployments.

Existing link adaptation schemes use simple heuristics that match the data transmission rate to the estimated channel. These schemes have proven useful for the ubiquitous wireless services of voice telephony and mobile broadband. However, as wireless networks increase in complexity and evolve to support new service types, these link adaptation schemes are rapidly becoming inadequate. The reason for this change is threefold. First, in several operating scenarios, simple heuristics-based link adaptation does not fully exploit the available channel. Second, the heuristics are typically tuned empirically for good performance, which incurs additional expense and can be error-prone. Finally, traditional link adaptation does not naturally extend to applications beyond the traditional wireless services, for example to industrial control or vehicular communications.

In this thesis, we address wireless link adaptation through machine learning. Our proposed solutions efficiently navigate the link parameter space by learning from the available information. These solutions thus improve the link performance compared to the state of the art, for example by doubling the link throughput. Further, we advance link adaptation support for new wireless services by optimizing the link for complex performance objectives. Finally, we also introduce mechanisms that autonomously tune the link adaptation parameters with respect to the operating environment. Our schemes hence mitigate the dependence on empirical configurations adopted in current wireless networks.

This thesis is composed of six technical papers. Based on these papers, there are three key contributions of this thesis: a neural link adaptation model (Paper I, Paper II, and Paper III), link adaptation under packet error rate constraints (Paper IV and Paper V), and efficient model-based link adaptation (Paper VI).

In this thesis, we emphasise the theoretical underpinnings of our proposed machine learning schemes for link adaptation. We approach this goal in three ways: First, we make theoretically reasoned choices for machine learning models and learning algorithms for link adaptation. Second, we extend these models for the specific problem formulations encountered in link adaptation. For this, we develop rigorous problem formulations that are analyzed using classical techniques. Third, we develop theoretical results for the real-time behaviour of the proposed schemes. These results extend the machine learning state of the art in terms of performance bounds for stochastic online optimization. The contributions of this thesis hence go beyond the realm of wireless optimization, and extend to new developments applicable to broader machine learning problems.

Keywords: Wireless Communications, Reinforcement Learning, Multi-Armed Bandits, Thompson Sampling, Convex Optimization, Deep Learning.

Sammanfattning (Summary in Swedish)

Wireless data communication is a complex phenomenon. Wireless links encounter random and time-varying channel effects that are challenging to predict and compensate for. To make optimal use of the wireless channel, communication systems therefore adapt the data transmission parameters in real time. This process, also called wireless link adaptation, can lead to large gains in link performance. Link adaptation is therefore an integral part of all modern communication systems.

Existing methods for link adaptation use simple heuristics that match the data transmission rate to the estimated wireless channel. These systems have proven useful for the widely used wireless services of voice telephony and mobile broadband. However, as wireless networks grow in complexity and also evolve to support new service types, these link adaptation methods are quickly becoming inadequate. The reason for this is threefold. First, in several new services, heuristic-based link adaptation simply does not fully exploit the available channel. Second, the heuristics are usually tuned empirically for good performance, which can be error-prone in new scenarios and incurs extra costs. Finally, traditional link adaptation does not generalize naturally to applications beyond the traditional wireless services, for example to industrial control systems or vehicular communication.

In this thesis, we address link adaptation through machine learning. Our proposed systems efficiently explore the link parameter space by learning from available information. The proposed methods thus improve link performance compared to the state of the art, for example by doubling the link throughput. Further, we also develop link adaptation support for new wireless services by optimizing the link for more complex performance objectives. Finally, we also introduce mechanisms that autonomously adjust the link adaptation parameters based on the operating environment. Our systems thereby mitigate the dependence on empirical configurations used in current wireless networks.

This thesis consists of six technical papers. Based on these papers, there are three key contributions of this thesis: a neural link adaptation model (Paper I, Paper II, and Paper III), link adaptation under constraints on the packet error rate (Paper IV and Paper V), and efficient model-based link adaptation (Paper VI).

In this thesis, we emphasize the theoretical foundation of our proposed machine learning methods for link adaptation. We approach this goal in three ways: First, we make theoretically motivated choices of machine learning models and learning algorithms for link adaptation. Second, we extend these models for the specific problem formulations encountered in link adaptation. For this, we develop rigorous problem formulations that are analyzed with classical techniques. Third, we develop theoretical results for the real-time behaviour of the proposed systems. These bounds extend the field of machine learning with regard to performance bounds for stochastic online optimization. The contributions of this thesis thus go beyond the area of wireless communication and extend to new application areas.

सारांश (Summary in Hindi)

Wireless data communication is a complex process. Wireless links encounter irregular and unordered channel effects that are challenging to compensate for. Hence, to make the best use of the channel, wireless links adapt the data transmission parameters in real time. This process is known as wireless link adaptation, and it is an integral part of state-of-the-art wireless deployments.

Existing link adaptation schemes use simple, experience-based heuristics. Typically, these schemes match the data transmission rate to the estimated wireless channel. In the past, these schemes have proven useful for the ubiquitous wireless services of telephony and mobile broadband. However, as wireless networks grow more complex and new types of communication systems emerge, the existing link adaptation schemes are rapidly becoming inadequate. There are three main reasons for this change. First, in several scenarios, simple link adaptation cannot fully utilize the available channel. Second, the transmission parameters are usually chosen empirically, which adds overhead and is more prone to errors. Finally, traditional link adaptation does not extend naturally to new service applications, for example industrial control or vehicle-based communication.

In this thesis, we investigate wireless link adaptation through machine learning. Our proposed solutions learn from commonly available transmission information to automatically and efficiently steer the link transmission parameters. Compared to state-of-the-art adaptation methods, our solutions improve link performance, for example by doubling the link throughput. In addition, our solutions benefit new wireless services by optimizing the link for complex performance objectives. Finally, we also present techniques that automatically tune the link adaptation parameters based on the wireless environment. In this way, our proposed schemes reduce the reliance of today's wireless networks on empirical configuration.

This thesis comprises six technical papers. Based on these papers, the thesis contributes in three key areas: a neural link adaptation model (Paper I, Paper II, and Paper III), link adaptation under packet error rate constraints (Paper IV and Paper V), and efficient model-based link adaptation (Paper VI).

In this thesis, we emphasize the theoretical soundness of our proposed machine learning schemes for link adaptation. To reach this goal, we adopt the following three principles. First, we use theoretical machine learning models and algorithms that are appropriate from the perspective of link adaptation. Second, we extend machine learning techniques for the specific problems that arise in link adaptation. Third, we develop theoretical results for the real-time behaviour of the proposed schemes. These results also advance state-of-the-art machine learning in terms of performance bounds. The contributions of this thesis therefore go beyond wireless optimization and encourage new developments on problems prevalent in machine learning.

Acknowledgement

The journey of academic research is full of unforeseen paths and uncertain outcomes. However, regardless of its conclusion, research is a rewarding endeavor in and of itself. My doctoral project has been structured as an industry-academia collaboration between Ericsson AB and the KTH Royal Institute of Technology, Sweden. During the course of my doctoral work, I have been extremely fortunate to have received the support and guidance of uncountably many people at both these organizations and beyond. Their presence has made these past few years the most fulfilling and productive time of my life.

My first token of gratitude is for my principal supervisor, Prof. Joakim Jaldén, for his keen and insightful guidance throughout the period of my doctoral work. I am deeply inspired by Joakim's calm and focused attitude toward research, which has shaped my own approach towards addressing new challenges. I am also grateful for the unwavering support of my co-supervisors, Prof. Mats Bengtsson and Dr. Hugo Tullberg. Mats drew on his immense bank of knowledge to help guide my theoretical ideas towards a broader application context. Hugo, who is with Ericsson Research, inspired me to think beyond incremental gains and instead explore the vast unknown, which has shaped some of the more impactful parts of my work.

During my doctoral work, I have also had the privilege of working at the University of California, Berkeley (UCB), USA, as a visiting scholar. I will forever be grateful to Prof. Ion Stoica for inviting me to his lab, and to Dr. Joseph E. Gonzalez for his support and guidance during my visit. The scale and ambition of the projects at UCB are truly awe-inspiring, and have motivated me to identify and address challenging problems in my research domain.

At Ericsson, I am deeply grateful for the support that I have received from my manager, Markus Ringström.
Markus was responsible both for bringing the doctoral position to my knowledge, and for suggesting that I reach out to Joakim. Over these years, Markus has steadfastly ensured that I have access to the best of resources and opportunities at all times. I am also thankful to Dr. Anders Caspár for facilitating my collaboration with KTH in a smooth manner.

I am grateful to the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, for their financial support. I am also thankful to WASP for organizing myriad courses, study trips, and summer schools that have contributed greatly to my development as a researcher. I express my gratitude to the administrative staff who have, often behind the scenes, ironed out the operations at Ericsson, KTH, UCB and WASP.

I am thankful for the joyous company of my KTH colleagues, who filled this time with bright ideas and cheerful conversation. In particular, I have learnt a lot from the collaborative work with Pol del Aguila Pla, Lissy Pellaco, and Baptiste Cavarec. I am also thankful for the company of Håkan Carlsson, Xuechun Xu, and the rest of my peers at KTH. I have gained immensely from the discussions with the faculty at KTH, and express my deepest gratitude for their kindness and insightful comments. Apart from KTH, I am thankful for the useful collaborations with Dr. Henrik Klessig and Simon Lindståhl that have made valuable contributions to my work.

My time at Ericsson has been enriched by the presence of knowledgeable colleagues, which has significantly improved the quality of my work. I especially extend my thanks to Dr. David Astely, Dr. Euhanna Ghadimi, and Dr. Rohit Chandra at Ericsson in Stockholm, Sweden, for their involved discussions in the context of my work. Dr. Ali Khayrallah and Per Karlsson were kind enough to host me at their Ericsson Research group in Santa Clara, USA. I will forever be thankful to Nimish Radia for believing in me and introducing me to his academic network at UCB. I am also thankful to many other colleagues at Ericsson Research, both in Stockholm, Sweden and Santa Clara, USA, who have shaped my research with their valuable contributions.

My academic journey as a doctoral candidate is nestled firmly in the warmth and love of my friends and family. The support from Rohit and Soni, and Vivek, during crucial moments has made the timely conclusion of my work possible. The time spent with Emmanuel and Neha in Lund, and the deep meaningful conversations with Akhila, will always be cherished.
I have been extremely lucky to have my parents' firm support and belief in every endeavour. As is perhaps typical of parents, they have believed in me far more than I can honestly claim to be justified. However, they have also sacrificed immensely to support me as I pursued my aspiration in far-away, foreign lands. In this, I am forever indebted to my Dada and Bhabhi, for being near to our parents and taking care of the bulk of their needs. Their lovely Joy brings smiles to all our faces every single day and keeps us rooted to what is most important: a cohesive, happy family. My deepest regards are reserved for my grandparents, who continue to inspire me every day with their strength, wisdom and infinite love.

My final words of love and gratitude are dedicated to my wife, Sameeksha, who has added such color and meaning to life as I had never imagined possible. We have made innumerable memories during this time, and look forward to incredible years ahead with our two little treasures, Agastya and Divit.

Acronyms

List of commonly used acronyms:

3GPP   Third Generation Partnership Project
4G     Fourth Generation
5G     Fifth Generation
ACK    Acknowledgement
AMC    Adaptive Modulation and Coding
ANN    Artificial Neural Network
BICM   Bit-Interleaved Coded Modulation
CMAB   Contextual Multi-Armed Bandit
CQI    Channel Quality Indicator
DRL    Deep Reinforcement Learning
FEP    Frame Error Probability
IEEE   Institute of Electrical and Electronics Engineers
ILLA   Inner Loop Link Adaptation
LTS    Latent Thompson Sampling
MAB    Multi-Armed Bandit
MCS    Modulation and Coding Scheme
NACK   Negative Acknowledgement
OFDM   Orthogonal Frequency Division Multiplexing
OLLA   Outer Loop Link Adaptation
OLM    Offline Link Model
RL     Reinforcement Learning
SINR   Signal to Interference and Noise Ratio
TS     Thompson Sampling
UCB    Upper Confidence Bound
WiFi   Wireless Fidelity (IEEE 802.11)


List of Papers

I Deep learning for frame error probability prediction in BICM-OFDM systems.
Vidit Saxena, Joakim Jaldén, Mats Bengtsson, Hugo Tullberg
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018)

II A learning approach for optimal codebook selection in spatial modulation systems.
Vidit Saxena, Baptiste Cavarec, Joakim Jaldén, Mats Bengtsson, Hugo Tullberg
52nd Asilomar Conference on Signals, Systems, and Computers (2018)

III Contextual multi-armed bandits for link adaptation in cellular networks.
Vidit Saxena, Joakim Jaldén, Joseph E. Gonzalez, Mats Bengtsson, Hugo Tullberg, Ion Stoica
Proceedings of the 2019 Workshop on Network Meets AI & ML (NetAI'19) (2019)

IV Bayesian link adaptation under a BLER target.
Vidit Saxena, Joakim Jaldén
21st IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) (2020)

V Thompson sampling for linearly constrained bandits.
Vidit Saxena, Joseph E. Gonzalez, Joakim Jaldén
23rd International Conference on Artificial Intelligence and Statistics (AISTATS) (2020)

VI Reinforcement learning for efficient and tuning-free link adaptation.
Vidit Saxena, Hugo Tullberg, Joakim Jaldén
IEEE Transactions on Wireless Communications (Under Review)


Other Papers

During the timeframe of my doctoral work, I collaborated on a few additional projects, which are not reflected in this thesis. These projects led to peer-reviewed papers that are nevertheless listed below for completeness.

I Optimal UAV base station trajectories using flow-level models for reinforcement learning.
Vidit Saxena, Joakim Jaldén, Henrik Klessig
IEEE Transactions on Cognitive Communications and Networking (2019)

II Wireless link adaptation with outdated CSI - a hybrid data-driven and model-based approach.
Lissy Pellaco, Vidit Saxena, Mats Bengtsson, Joakim Jaldén
21st IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) (2020)

III Spotnet – Learned iterations for cell detection in image-based immunoassays.
Pol del Aguila Pla, Vidit Saxena, Joakim Jaldén
16th International Symposium on Biomedical Imaging (ISBI) (2019)

To my family.

Contents

Acknowledgement

Acronyms

List of Papers

Contents

1 Introduction
1.1 Contributions
1.2 Discussion
1.3 Organization

2 Thesis Overview
2.1 Wireless Link Adaptation
2.2 Neural Probability Estimation
2.3 Multi-armed Bandits
2.4 Thompson Sampling

3 Summary of the Included Papers
3.1 Paper I
3.2 Paper II
3.3 Paper III
3.4 Paper IV
3.5 Paper V
3.6 Paper VI

4 Conclusion
4.1 Key Takeaways
4.2 Research Directions

References

Fulltext of the Included Papers

Chapter 1

Introduction

The wireless physical layer is highly configurable. Prominent wireless access protocols, for example the cellular fifth-generation new radio (5G NR) and the IEEE 802.11 WiFi standards, provide hundreds of link parameter configurations that are tunable in real time. The wireless link can hence be adapted with a fine granularity to optimize performance in a variety of application scenarios. This extreme configurability allows wireless networks to provide robust and efficient access across diverse geographies, service demands, and usage patterns.

The online tuning of wireless link parameters is called link adaptation. Link adaptation algorithms adjust the transmission parameters, for example the data transfer rate, to maximally utilize the channel resources and to combat stochastic impairments such as channel noise and interference. Most state-of-the-art link adaptation algorithms use simple, empirically configured heuristics to dynamically adjust the data transmission parameters. While the existing heuristics can be beneficial in terms of fast, real-time execution, they often do not fully exploit the available channel. Hence, as wireless networks grow in size and complexity, current link adaptation implementations can become increasingly suboptimal. Further, these implementations need to be tuned for good performance. Current wireless networks employ manual tuning based on empirical evidence, which can be both expensive and error-prone. Yet another drawback of the existing link adaptation schemes is that they do not adequately handle applications beyond the traditional voice telephony and mobile broadband services. As such, there is a need for powerful link adaptation schemes that address new services, for example real-time industrial control and vehicular communications.

In this thesis, we address wireless link adaptation through machine learning. Our proposed solutions learn from the available information to efficiently navigate the link parameter space in dynamic wireless environments. Further, we advance link adaptation for complex performance objectives. Our approach thus provides better support for new wireless services. Finally, we also introduce mechanisms to autonomously tune link adaptation performance in diverse operating conditions, which reduces the dependence on manual adjustments.

A well-known disadvantage of machine learning is its dearth of interpretable models and learning algorithms. Several machine learning solutions are implemented in a black-box fashion with an incomplete understanding of their inner workings. While the need for interpretability is often neglected in the light of superior empirical performance, this neglect nevertheless leads to solutions that are difficult to extend to new use cases. Further, black-box machine learning solutions do not provide the performance guarantees that may be critical for system and resource provisioning.

In this thesis, we therefore develop the theoretical foundations for our proposed machine learning solutions. We approach this objective in three ways: First, we make reasoned choices for our machine learning models and learning algorithms. These choices are guided by in-depth knowledge of the specific problem dynamics, as well as the interpretability of the candidate models. Second, we formulate certain link adaptation use cases in the form of rigorous, mathematically tractable problems. As an example, we characterize link reliability as an online convex optimization problem under linear constraints. Third, we develop theoretical bounds for the real-time performance of our proposed algorithms. In addition to link adaptation, these bounds are equally applicable to more general online optimization problems. The contributions of this thesis hence go beyond the realm of wireless optimization to broader machine learning applications.

Figure 1.1: Wireless systems operate in complex and time-varying environments. Today, these systems enable a broad range of applications beyond the traditional voice telephony and mobile broadband: for example, smart manufacturing, vehicular communications, and internet-of-sensors.

1.1 Contributions

This thesis addresses link adaptation in terms of finding the optimal data transmission rate over a wireless channel.¹ This thesis is composed of six technical papers that are summarized and reproduced in the subsequent chapters. Based on these papers, there are three key contributions of this thesis: a neural link adaptation model (Paper I, Paper II, and Paper III), link adaptation under error rate constraints (Paper IV and Paper V), and sample-efficient link adaptation with subspace learning (Paper VI). Each of these three contributions is summarized next.

Neural Link Adaptation Model

We pose link adaptation as the problem of predicting the channel-conditioned success probability for each available data transmission rate. As a consequence, we desire parameterized models that can learn the mapping from an arbitrary wireless channel state to the respective packet success probabilities. For this problem, we employ an artificial neural network (ANN) model, which is known to be a powerful general-purpose function approximator. Our choice of an ANN model, however, is inspired by the following lesser-exploited property of ANNs: for a suitable choice of training loss, the ANN outputs estimate the true conditional probability for the target classes. As a consequence, the ANN outputs for a given channel state can be rigorously interpreted as the respective packet success probabilities for each available data transmission rate. We then use the trained ANN model for optimal data rate selection in wireless links.

Paper I and Paper II introduce supervised learning algorithms for our proposed neural link adaptation model. While Paper I addresses modulation and coding scheme (MCS) selection in cellular networks, Paper II deals with selecting the optimal codebook in spatial modulation systems. Paper III extends the approach of Paper I to a reinforcement learning (RL) setting, where the optimal MCS is learnt online by sequentially exploring and exploiting the space of all available MCSs.
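
As an illustration of how per-rate success probabilities can drive rate selection, the following minimal sketch picks the rate with the highest expected throughput. The selection rule, the function name, and all numbers are assumptions made for illustration; they are not taken from Paper I, Paper II, or Paper III, and the probabilities are hard-coded rather than produced by a trained ANN.

```python
import numpy as np

def select_rate(success_prob, rates):
    """Return the index of the rate with the highest expected throughput.

    success_prob[i] is the predicted probability that a packet sent at
    rates[i] is received correctly for the current channel state, e.g. the
    output of a trained probability estimator (assumed here, not trained).
    """
    success_prob = np.asarray(success_prob, dtype=float)
    rates = np.asarray(rates, dtype=float)
    if success_prob.shape != rates.shape:
        raise ValueError("one success probability is needed per rate")
    # Expected throughput of each rate: probability of success times rate.
    expected_throughput = success_prob * rates
    return int(np.argmax(expected_throughput))

# Hypothetical example: four rates (bits per symbol) and made-up predictions.
rates = [1.0, 2.0, 4.0, 6.0]
success_prob = [0.99, 0.95, 0.60, 0.10]
best = select_rate(success_prob, rates)
```

Because the outputs are interpreted as probabilities, other objectives (for example, a cap on the error probability) could be evaluated on the same predictions without retraining.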

¹ This problem is referenced in the literature through several terms, viz. adaptive modulation and coding (AMC), modulation and coding scheme (MCS) selection, and rate sampling.

Link Adaptation under Error Rate Constraints

Several real-time wireless applications require reliable channel access to maintain an acceptable quality of service (QoS). Examples of these applications include video streaming and real-time industrial control. This link reliability metric is commonly expressed in terms of an upper limit on the packet error rate experienced by the link. For this problem, we formulate link adaptation in terms of maximizing the link throughput under a linear constraint on the packet error rate. We model this problem as a multi-armed bandit (MAB), where the optimal rates are learnt by choosing the arm to be played in every time interval. Further, we propose a Bayesian learning algorithm to optimize the link adaptation MAB for the link reliability objective. Our learning algorithm thus extends the state-of-the-art Thompson sampling heuristic to the constrained optimization setting.

Paper IV proposes a constrained Thompson sampling algorithm for link adaptation under packet error rate constraints. Paper V develops theoretical upper bounds on the performance of constrained Thompson sampling, in terms of both the reward maximization and the constraint satisfaction metrics.
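
To make the idea of Thompson sampling under an error rate limit concrete, the sketch below keeps a Beta posterior per rate and, in each round, plays the highest-throughput rate among those whose sampled error rate meets the target. This is a deliberately simplified per-round heuristic written for illustration; it is not the algorithm proposed in Paper IV or analyzed in Paper V, and the function names, the fallback rule, and all numbers are assumptions.

```python
import numpy as np

def constrained_ts_step(successes, failures, rates, max_error_rate, rng):
    """Pick one arm (rate index) for the next transmission.

    A Beta(1 + successes, 1 + failures) posterior is kept per arm. One
    success probability is sampled per arm; among the arms whose sampled
    error rate meets the target, the one with the highest sampled
    throughput is chosen. If no arm looks feasible, the arm with the
    lowest sampled error rate is used as a fallback.
    """
    successes = np.asarray(successes, dtype=float)
    failures = np.asarray(failures, dtype=float)
    rates = np.asarray(rates, dtype=float)
    sampled_success = rng.beta(1.0 + successes, 1.0 + failures)
    feasible = (1.0 - sampled_success) <= max_error_rate
    if not feasible.any():
        return int(np.argmax(sampled_success))
    throughput = np.where(feasible, sampled_success * rates, -np.inf)
    return int(np.argmax(throughput))

def update(successes, failures, arm, ack):
    """Record the ACK (True) or NACK (False) observed for the played arm."""
    if ack:
        successes[arm] += 1
    else:
        failures[arm] += 1

# Hypothetical simulation: three rates with made-up success probabilities
# that are unknown to the learner, and a 10% packet error rate target.
rng = np.random.default_rng(0)
true_success = np.array([0.99, 0.93, 0.60])
rates = np.array([1.0, 2.0, 4.0])
successes, failures = np.zeros(3), np.zeros(3)
for _ in range(1000):
    arm = constrained_ts_step(successes, failures, rates, 0.1, rng)
    update(successes, failures, arm, rng.random() < true_success[arm])
```

The posterior sampling step supplies the exploration: an arm that has rarely been played has a wide posterior, so it is occasionally sampled as feasible and tried.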

Link Adaptation with Latent Models

Link parameter configurations exhibit a high degree of structure. For example, packet failure at a certain transmission rate indicates that higher rates are also likely to fail for the given wireless channel conditions; conversely, packet success at that rate indicates that lower rates are also likely to succeed. This structural property can be exploited for sample-efficient RL, by reducing the number of exploratory transmissions required to learn the optimal rate. In particular, the dependence between rates can be modeled parsimoniously in a latent subspace. We first identify one such low-dimensional subspace suitable for RL-based link adaptation. Subsequently, we propose an extension to Thompson sampling that exploits this latent subspace for sample-efficient learning.

Paper VI proposes a reinforcement learning link adaptation algorithm that learns a latent signal-to-interference-and-noise ratio (SINR) model to predict the optimal data transmission rate.
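
The structural argument above can be sketched with a one-dimensional latent variable: if every rate's success probability is a known increasing function of a single unknown SINR, then one ACK or NACK at any rate updates the belief about all rates at once. The code below is a minimal sketch under that assumption. The logistic link model, the SINR grid, the thresholds, and the function names are hypothetical and are not taken from Paper VI.

```python
import numpy as np

# Hypothetical link model: rate k succeeds with probability
# sigmoid(SLOPE * (sinr_db - THRESHOLD_DB[k])). All values are illustrative.
SINR_GRID_DB = np.linspace(-5.0, 25.0, 61)
THRESHOLD_DB = np.array([0.0, 5.0, 10.0, 15.0])
RATES = np.array([1.0, 2.0, 4.0, 6.0])
SLOPE = 1.5

def success_prob(sinr_db, k=None):
    """Success probability of rate k (or of all rates) at the given SINR."""
    thresholds = THRESHOLD_DB if k is None else THRESHOLD_DB[k]
    return 1.0 / (1.0 + np.exp(-SLOPE * (sinr_db - thresholds)))

def choose_rate(posterior, rng):
    """Thompson step: sample an SINR, then act greedily for that sample."""
    sinr = rng.choice(SINR_GRID_DB, p=posterior)
    return int(np.argmax(success_prob(sinr) * RATES))

def update_posterior(posterior, k, ack):
    """Bayes update of the SINR posterior after an ACK/NACK at rate k."""
    p = success_prob(SINR_GRID_DB, k)
    likelihood = p if ack else 1.0 - p
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# Hypothetical simulation with a fixed true SINR unknown to the learner.
rng = np.random.default_rng(0)
posterior = np.full(SINR_GRID_DB.size, 1.0 / SINR_GRID_DB.size)
true_sinr_db = 12.0
for _ in range(200):
    k = choose_rate(posterior, rng)
    ack = rng.random() < success_prob(true_sinr_db, k)
    posterior = update_posterior(posterior, k, ack)
```

Compared with keeping an independent posterior per rate, the learner here tracks a single distribution over the SINR grid, which is the sense in which the latent model reduces the number of exploratory transmissions.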

In terms of impact, this thesis advances the state of the art in link adaptation in three key ways: improved link spectral efficiency, support for complex QoS metrics, and autonomous calibration.

First, improving the link spectral efficiency would allow better utilization of the scarce and expensive wireless resource. This would lead to higher traffic volumes being served with the bandwidth slice assigned to a wireless access provider. While all the constituent papers make better use of the available spectral resources, Paper I, Paper II, and Paper VI explicitly optimize for the average link throughput.

Second, link optimization for complex QoS metrics would make it possible to extend wireless access to new application domains. An example would be to constrain the link error rate below a maximum allowed level, which is essential for robust real-time control in industrial and vehicular applications. These and many other application domains are expected to fuel the next generation of wireless growth. Paper III, Paper IV, and Paper V specifically address this use case.

Finally, current link adaptation parameters are selected empirically and seldom updated in response to changes in the ambient wireless environment or traffic patterns. The dependence on manual, on-field network maintenance is hence expensive and prone to errors. Autonomous calibration would improve network operations by self-tuning link adaptation algorithms across diverse deployments. Paper III and Paper IV serve this autonomy goal by exploiting contextual information commonly available in existing networks. Contextual information allows faster, link-agnostic learning from multiple parallel data flows. Paper VI additionally proposes a scheme that autonomously tunes the link adaptation parameters to optimally track the channel variations.

1.2 Discussion

Link adaptation has been deployed in live networks for the past several decades. However, despite their widespread adoption, link adaptation schemes have not been subject to significant updates. The reason for this is twofold: hardware constraints and legacy issues. Wireless hardware has been severely resource- constrained to minimize the capital and operating costs. However, this is quickly changing with the expanding potential of wireless services and the emergence of edge computing. Access to upgraded computing resources will allow sophisti- cated physical layer algorithms, of which link adaptation is one example, to be implemented in wireless networks. Secondly, despite their known shortcomings, legacy link adaptation schemes are typically “carried over” to the next generation of wireless deployments owing to their familiarity. This is also likely to change in the near future as wireless access expands to new service areas where legacy schemes might be insucient and where alternative schemes promise large gains. The topic of link adaptation is hence open for innovation, and promises valuable gains from its advancement in future wireless networks. Cellular versus WiFi. Link adaptation has been implemented in the con- text of cellular as well as IEEE 802.11 (WiFi) networks. However, the termi- nology and the general approach adopted for implementation di↵ers between the two protocols. While cellular link adaptation makes substantial use of channel measurements, such techniques have found limited application in WiFi. Further, both cellular and WiFi links adapt the data transmission rate based on the out- come for previous packetized transmissions. However, while cellular links adopt a model-based approach for iterating towards the optimal rate, WiFi searches over the rates by sampling the available rates sequentially. In terms of configurability, cellular links typically provide more parameter choices than WiFi, including a higher number of available data rates. 
In this thesis, we evaluate our proposed schemes primarily in the context of downlink data transmission in cellular networks. However, these techniques are equally applicable to WiFi links. In most of the included papers, we hence also benchmark our results against state-of-the-art WiFi link adaptation algorithms.

1.3 Organization

The rest of this thesis is structured as follows. Chapter 2 presents an overview of this thesis. First, in this chapter, the wireless link adaptation problem is introduced. Subsequently, this chapter summarizes some of the key concepts that form the basis of the technical contributions of this thesis. Next, Chapter 3 sequentially summarizes each of the papers included in this thesis. Chapter 4 concludes the thesis with a summary of the key takeaways and a discussion of some future research directions. Links to camera-ready versions of the included papers are provided towards the end of this document.

Chapter 2

Thesis Overview

Radio-based wireless communications can be traced back to the late nineteenth century, when the first successful demonstrations of this technique were made [1]. However, for more than a century, wireless was limited to a fairly small set of applications: public broadcasts via radio, short-distance links and point-to-point telegraph, and specialized military equipment. The reason for this limited use was that radio hardware was bulky, expensive, and required a large amount of energy to operate. Hence, these devices could hardly be made mobile for general-purpose communication. Until the late twentieth century, the only truly mobile communication systems were car-based radios, and even those served only low-fidelity voice telephony.

The digital explosion towards the end of the twentieth century, when general-purpose computing devices started becoming commonly available, sparked a concurrent revolution in wireless communication technologies. During this period, handheld wireless devices became feasible owing to the small form factor and energy efficiency of high-performance chipsets. In addition, wirelessly delivered services expanded beyond voice telephony to include mobile broadband that allowed access to a rapidly growing internet ecosystem. Today, over 90% of the global population enjoys a subscription to cellular mobile services [2]. The latest generation of wireless technologies seeks to extend connectivity to tens of billions of devices and serve a multitude of new applications in the transport, manufacturing, and allied sectors [2, 3].

The rest of this chapter is organized as follows: Section 2.1 discusses the link adaptation problem and provides an overview of the existing approaches as well as some related problems encountered in other domains. Next, Section 2.2 highlights an important property of ANNs that allows robust and interpretable models for link adaptation.
Section 2.3 serves as an introduction to the powerful MAB framework, which is used for RL-based link adaptation. Finally, Section 2.4 discusses a Bayesian heuristic for MAB optimization, Thompson sampling, which has been adopted and extended for link adaptation in this thesis.


2.1 Wireless Link Adaptation

Wireless data communication is a complex phenomenon. The bulk of this complexity is attributed to the stochastic and time-varying nature of the wireless channel, which stems from complex interactions between the data-carrying radio waves and physical objects in the signal path [1]. Since the instantaneous wireless channel state is not ordinarily available at the transmitter, wireless systems must devise mechanisms to efficiently navigate the channel. Practical wireless networks address this challenge by making the wireless links configurable, where the data transmission parameters can be selected from a set of pre-defined discrete values for optimal predicted performance. Discretizing the parameter space in this manner serves two goals: first, the link can search through the parameters in a sufficiently small time before the channel state changes appreciably and second, the selected parameters can, with a relatively small overhead, be communicated to the receiver for decoding. In modern cellular networks, the typical link selects from a few hundred possible parameter configurations once every few milliseconds. Hence, the key challenge in these networks is to quickly and efficiently navigate the link parameter configuration space for optimal performance.

Wireless link adaptation deals with the problem of tuning the data transmission parameters to maximize the utility of a wireless channel [4–7]. Link adaptation techniques can be classified into two categories: inner loop (or closed loop) and outer loop (or open loop). Inner loop link adaptation (ILLA) makes use of explicit channel estimates [8]. These channel estimates are measured at the receiver using known pilot signals, and are subsequently fed back to the transmitter for rate selection. ILLA hence incurs significant overhead for pilot signaling, receiver-side measurements, and channel state feedback.
In contrast to the inner loop, outer loop link adaptation (OLLA) does not involve any channel measurement. Instead, this loop adjusts the transmission parameters based solely on the observed outcome of previous transmissions. If one or more previous packets were decoded successfully at the receiver (indicated by an acknowledgement (ACK) feedback signal), OLLA moves up its estimate of the wireless channel utility. On the other hand, if too many packets fail to be decoded, the outer loop falls back to more conservative data transmission rates [7].

The inner and outer link adaptation loops are complementary to each other. While the inner loop is more responsive, this responsiveness comes at the cost of high channel reporting overheads. Conversely, the outer loop does not accrue any signaling overhead, but it is slow to respond to large channel variations. Wireless systems hence deploy both loops, albeit at different timescales: the inner loop, which compensates for substantial channel movements, is only triggered infrequently to minimize overhead. On the other hand, the outer loop adjustments are smaller in magnitude but also more frequent: an outer loop update is typically executed after every packet transmission. A suitably configured pair of inner and outer loops allows the wireless network to optimize the link in diverse operating environments.

Paper I and Paper II included in this thesis address ILLA for throughput maximization. Further, Paper III, Paper IV, and Paper V optimize OLLA for the more complex link objective of throughput maximization under an error rate constraint. Finally, Paper VI revisits OLLA for throughput maximization, where an updated learning model is exploited for fast and efficient link adaptation.

Link Adaptation Schemes

Interest in link adaptation schemes goes back several decades, coinciding with the inception of wide-area and cellular wireless networks [9]. Since then, several link adaptation schemes have been proposed in the literature. In the early years of link adaptation research, the focus was on ILLA in terms of accurately and compactly characterizing the wireless channel state. Owing to its explicit signaling requirements, ILLA is strongly regulated by the respective wireless standard. In contrast, the outer loop flexibly uses one or more of the existing control signals for adaptation. Subsequently, starting from the third-generation (3G) cellular networks, OLLA has also been studied extensively. Concurrently with cellular networks, link adaptation schemes have also been proposed in the context of wireless local area networks such as the IEEE 802.11 (WiFi) standard. In the rest of this section, we will highlight some prominent developments on this topic.

ILLA: A robust mechanism to compress the high-dimensional wireless channel state to a scalar metric was proposed in [10]. This metric, known as the effective signal-to-interference-and-noise-ratio (SINR), improves signaling efficiency and hence has been enthusiastically adopted and extended by later cellular standards [11, 12]. However, compressing the channel state is inevitably lossy. Hence, the effective SINR approach suffers from link performance loss owing to suboptimal parameter configuration. In [13, 14], a supervised learning approach based on K-nearest neighbors was proposed, which directly maps from the high-dimensional wireless channel state to the optimal transmission parameters. This scheme was shown to outperform an effective SINR scheme in terms of the average link throughput. Despite its empirical gains, the model in [13] was difficult to scale and not amenable to a theoretical interpretation.
Paper I included in this thesis provides an enhanced, ANN-based, supervised learning model that both scales well with the channel dimensionality and where the model outputs are rigorously interpreted as the respective packet success probabilities. Paper II extends this approach to the related problem of selecting an optimal codebook in spatial modulation systems.

Legacy OLLA: One of the earliest OLLA schemes, which relies on ACK feedback, was proposed in [15]. This scheme proposes maintaining an offset to the ILLA effective SINR, which is adjusted on a per-packet basis. If the previous transmission was successful (i.e., an ACK was received), OLLA increases its SINR estimate by a configurable amount. Otherwise, if a negative ACK (NACK) is received, OLLA decreases the SINR estimate proportionally. Several drawbacks of this simple scheme are known, which is why ad-hoc fixes have been proposed to address one or more of its shortcomings [15–18]. However, somewhat surprisingly, the basic OLLA heuristic in [15] has remained in operation in cellular networks for the past two decades with minimal changes. In contrast to cellular implementations, WiFi OLLA schemes generally do not involve the SINR metric. Instead, WiFi OLLA schemes heuristically switch between data transmission rates based on the statistical ACK/NACK behaviour over a moving time window [19, 20].
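The per-packet offset update of legacy OLLA can be sketched as follows. This is a minimal illustration rather than the exact scheme of [15]: the names `olla_update` and `adjusted_sinr`, the default step size, and the convention of tying the upward step to a target error rate are assumptions introduced here for exposition.

```python
def olla_update(offset_db, ack, step_down_db=1.0, target_error_rate=0.1):
    """Return the updated SINR offset (in dB) after one ACK/NACK.

    Assumed convention (hypothetical, not taken from [15]): the upward step is
    scaled so that the offset stays put on average when the long-run NACK
    fraction equals target_error_rate.
    """
    step_up_db = step_down_db * target_error_rate / (1.0 - target_error_rate)
    if ack:
        return offset_db + step_up_db  # ACK: raise the SINR estimate slightly
    return offset_db - step_down_db    # NACK: fall back to a conservative estimate


def adjusted_sinr(illa_sinr_db, offset_db):
    """SINR used for rate selection: ILLA effective SINR plus the OLLA offset."""
    return illa_sinr_db + offset_db
```

With these defaults, nine ACKs followed by one NACK return the offset to its starting value, which corresponds to a 10% error rate operating point.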

Reinforcement Learning (RL) for OLLA: A key characteristic of OLLA is that the ACK feedback corresponds only to the selected data transmission rate, and does not provide direct information about other rates. Hence, to find the optimal rate, OLLA needs an efficient mechanism for exploring the available rates. Previous schemes handle exploration by sequentially probing the available rates in the order of their spectral capacities. However, this approach can be suboptimal when the number of rates is large, or when the channel variations are frequent. RL is an alternative, principled, approach that deals with online exploration [21]. RL has recently been proposed for OLLA in the context of cellular as well as WiFi deployments. Many of these RL OLLA schemes employ ANNs to model the link behavior in real time. A few ways in which ANN-based RL OLLA schemes improve link adaptation performance are: by optimizing legacy OLLA tuning parameters [22], learning dependencies between the available rates [23], and exploiting high-dimensional channel contexts [24]. Paper III included in this thesis proposes and evaluates an ANN-based RL OLLA scheme for cellular networks, where the ANN model of Paper I has been extended to online link optimization.

Multi-armed Bandits (MAB): MABs encode a powerful RL framework to balance between exploration and exploitation within stochastic environments. MABs were first proposed for link adaptation in [25]. The algorithm in [25] quickly maximizes the link throughput by exploiting structural properties inherent to the link adaptation problem. However, this scheme does not naturally extend to the more complex link performance objectives encountered in several wireless applications. In Paper IV, we propose a MAB optimization algorithm that incorporates an average packet error rate constraint. In contrast to [25], which adopted a frequentist learning heuristic based on upper confidence bounds (UCB), we use a Bayesian heuristic based on Thompson sampling that typically provides better learning performance. We theoretically analyze our proposed constrained optimization algorithm in Paper V, where we obtain new results on its finite-time performance. In Paper VI included in this thesis, we revert to the problem of unconstrained throughput maximization. Our proposed MAB optimization algorithm in Paper VI learns in a lower-dimensional channel subspace to substantially improve the link throughput compared to the previous schemes.

Connection to Other Domains

The basic OLLA problem is formulated as learning the optimal transmission rates based on observed ACK/NACK feedback. This simple formulation is echoed by several RL problems in otherwise unrelated domains. Interestingly, although these related problems may have been studied for several decades, their connection to link adaptation has largely been overlooked. We highlight some of these related problems below, which are commonly modeled as MAB instances:

• Weblink selection: A webpage publisher seeks to place one or more affiliate weblinks to attract ad revenues [26, 27]. In each round, a user either clicks one of the displayed weblinks or does not click any weblink. The goal is to select weblinks that maximize the cumulative revenue generated from user clicks. In the context of link adaptation, a displayed weblink is analogous to a selected rate, a user's click corresponds to an ACK/NACK, and the set of available rates corresponds to the set of weblinks available for display.

• Dynamic pricing: A seller aims to maximize the cumulative revenue by optimally pricing the available goods. In the absence of any contextual information, the only feedback available to the seller is the successful sale of an individual item. This problem can be formulated as provisioning a set of discrete selling prices that are efficiently probed to determine the optimal, revenue-maximizing, price. Recently, latent models of the demand behavior have been employed to substantially speed up the learning of optimal pricing strategies.

• Inventory management: Constrained MAB problems have recently been studied in the context of revenue maximization under a finite inventory setting, termed bandits with knapsacks (BwK). In [28], an upper confidence bound (UCB)-based approach was introduced that was shown to be optimal for the stochastic BwK problem. Further, in [26], a Thompson sampling algorithm for budgeted MABs was proposed that outperforms the UCB BwK algorithm. Subsequently, in [29], Thompson sampling was studied for revenue optimization for a finite inventory that contains multiple non-identical products.

2.2 Neural Probability Estimation

In this section, we take a slight detour to highlight an important property of ANNs, which makes them particularly interesting for link adaptation modeling. Recall that link adaptation can be formulated as the problem of learning the ACK probabilities for an arbitrary channel state. Denoting the packet success feedback with $e_k \in E_k = \{0, 1\}$, where $e_k = 1$ denotes an ACK event, $e_k = 0$ denotes a NACK event, and $k \in \{1, \dots, K\}$ denotes the rate index, and denoting the channel state vector by $\gamma \in \Gamma$, we wish to model

$$P_{E_k \mid \Gamma}(e_k \mid \gamma; \theta), \quad \forall k \in \{1, \dots, K\}, \tag{2.1}$$

where $\theta$ is the set of a priori unknown parameters of the true ACK probability model. In other words, $\theta$ denotes the parameters of an oracle model that accurately encodes the conditional ACK probabilities as a function of the wireless channel state. Our goal is to learn, based on observed channel states and their corresponding ACK/NACK events, an approximate model for the conditional ACK probability, $\hat{P}_{E_k \mid \Gamma}(e_k \mid \gamma; \hat{\theta})$, where $\hat{\theta}$ denotes the parameters of our learnt model.

An ANN model for link adaptation, $f(\gamma): \gamma \mapsto \rho^{NN}$, maps the channel state $\gamma$ to the ANN output $\rho^{NN} = [\rho_1^{NN}, \dots, \rho_K^{NN}]$ through a nonlinear transformation. For the sake of exposition, we assume that the $k$th ANN output corresponds to the $k$th data transmission rate. The ANN is trained with a training dataset of (channel state, ACK) tuples, $(\gamma^{(1)}, e_k^{(1)}), \dots, (\gamma^{(N)}, e_k^{(N)})$, collected using an arbitrary rate selection scheme. We repeat the following key result for ANNs, first identified in [30]:

The ANN outputs, when trained to minimize the cross entropy loss or the mean squared loss with respect to the observed ACK events, provide maximum-likelihood estimates of the true channel-conditional ACK probabilities in the limit of infinitely many training samples.

The significance of this result is that for a sufficiently large training dataset, the ANN outputs can be rigorously interpreted as ACK probability estimates. As a consequence, ANN models can be used to optimize not only for the link throughput, but also for more complex link performance objectives, for example ones that take packet error rates into account. Paper I and Paper II use this ANN property for link adaptation, where supervised training of the ANN is performed offline. Paper III extends this ANN model to a reinforcement learning setting, where the training data is collected online through an epsilon-greedy policy.
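The probability-estimation property can be illustrated with a deliberately small stand-in for the ANN. The sketch below assumes a discrete set of channel states and a single rate, and replaces the ANN with one sigmoid output per state; the function names, learning rate, and step count are illustrative choices and do not come from the included papers. Minimizing the cross entropy by gradient descent drives each output to the empirical ACK frequency of its state, which approaches the true ACK probability as the number of samples grows.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_acks(true_probs, n_samples, seed=0):
    """Draw (state_index, ack) pairs; states are chosen uniformly at random."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        state = rng.randrange(len(true_probs))
        samples.append((state, 1 if rng.random() < true_probs[state] else 0))
    return samples


def fit_ack_probabilities(samples, n_states, lr=2.0, n_steps=3000):
    """Minimize the mean cross entropy of per-state sigmoid outputs.

    samples: list of (state_index, ack) pairs with ack in {0, 1}.
    Returns the fitted ACK probability for each discrete channel state.
    """
    counts = [0] * n_states
    acks = [0] * n_states
    for state, ack in samples:
        counts[state] += 1
        acks[state] += ack
    logits = [0.0] * n_states
    n = len(samples)
    for _ in range(n_steps):
        for s in range(n_states):
            # Gradient of the mean cross entropy with respect to logit s.
            grad = (counts[s] * sigmoid(logits[s]) - acks[s]) / n
            logits[s] -= lr * grad
    return [sigmoid(z) for z in logits]
```

For example, `fit_ack_probabilities(simulate_acks([0.9, 0.6, 0.3, 0.05], 20000), 4)` returns values close to the four true ACK probabilities.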

2.3 Multi-armed Bandits

In the typical MAB setting, a decision-maker (that is, the agent) has access to a set of discrete or continuous-valued actions (arms) within an environment. The experiment is divided into sequential rounds, where the agent is allowed to pull one or more arms in every round. Pulling an arm corresponds to executing the respective action in the environment. The environment generates a reward for each pull of an arm, where the reward distribution is not explicitly made available to the agent at any stage. However, the agent may estimate reward characteristics by exploring the available arms over successive rounds. Subsequently, by exploiting these estimates, the agent can predict the arm (or set of arms) that optimizes a target performance objective. The central challenge in MAB optimization relates to balancing between exploration and exploitation, that is, identifying techniques that optimally explore the available arms to quickly find the best exploitative arms for reward maximization.

In the context of RL-based link adaptation, MABs are an attractive modeling framework. Here, the transmitter acts as an agent to optimize the link performance within the wireless environment. The wireless environment induces an a priori unknown distribution over the ACK probability for each available rate. The transmitter has access to a finite set of discrete-valued data transmission rates, $r_1, \dots, r_K$, exactly one of which is selected for packet transmission in every transmission interval. The environment responds with an ACK or a NACK, which encodes the reward collected by the agent in that round. The transmitter hence needs to efficiently explore the available rates and predict the optimal rate, or set of rates, that can be exploited to optimize the link performance. A discussion on MABs in general, and their application to various domains, is available in [31]. In the rest of this section, we describe MABs as applied to link adaptation problems in terms of, respectively, the specific reward structure, wireless channel dynamics, and MAB optimization algorithms, and conclude with a note on general performance bounds for MAB optimization.

Reward Structure

The wireless transmitter sends packetized data, which, if decoded successfully, delivers the entire packet contents to the intended receiver. On the other hand, if the decoding was unsuccessful, zero data bits are delivered. The decoding outcome gets encoded as the binary ACK/NACK signal $c_{k[t]}[t]$, where $k[t]$ is the rate index selected at the discrete transmission time intervals $t = 1, 2, \dots$. We first assume a stationary wireless channel such that the channel-conditional ACK probability is only a function of the data transmission rate, $r_{k[t]}$. Later in this section, we will extend the analysis to address non-stationary channels. For the stationary channel, the binary random variable $c_{k[t]}[t]$ is independent and identically distributed according to a Bernoulli distribution with mean $\rho_{k[t]}[t] = \mathbb{E}[c_{k[t]}[t]]$, where the expectation is taken over multiple packet transmissions for an arbitrary channel state.

Figure 2.1: The multi-armed bandit formulation is classically attributed to a slot machine with multiple arms that generate rewards with an unknown reward distribution. The agent plays the arms with the goal of efficiently maximizing his cumulative returns over several, sequential, rounds.

The wireless transmitter uses the historical ACK/NACK rewards to estimate ACK probabilities for the next time interval, denoted by $[\hat{\rho}_1[t+1], \dots, \hat{\rho}_K[t+1]]$. Subsequently, for the common goal of link throughput maximization, the transmitter selects the rate index by post-processing the predicted ACK probabilities,

$$k^*[t+1] = \operatorname*{argmax}_{k \in \{1, \dots, K\}} r_k \hat{\rho}_k[t+1], \tag{2.2}$$

that is, the predicted rate index that maximizes the expected link throughput. This thesis also considers more complex link performance objectives, for example where the average packet error rate is constrained below a certain threshold. For such objectives, an appropriate post-processing step is executed for rate selection in every time interval.
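The post-processing step in (2.2) amounts to a single pass over the available rates. A minimal sketch, where the function name `select_rate` and the example numbers are illustrative only:

```python
def select_rate(rates, ack_prob_estimates):
    """Return the index k that maximizes rates[k] * ack_prob_estimates[k]."""
    expected_throughput = [r * p for r, p in zip(rates, ack_prob_estimates)]
    return max(range(len(rates)), key=lambda k: expected_throughput[k])
```

For rates `[1, 2, 4]` and estimated ACK probabilities `[0.95, 0.8, 0.2]`, the expected throughputs are 0.95, 1.6 and 0.8, so the middle rate is selected even though it is neither the most reliable nor the fastest.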

Environment Dynamics

Wireless environments typically evolve over time, owing to the physical motion of the objects within the environment. However, in certain scenarios, for example indoor WiFi deployments, the wireless channel may evolve slowly such that it can be considered approximately stationary over the duration of a typical data session. For these scenarios, the MAB formulation above adequately optimizes for an arbitrary channel state. In other scenarios, where the channel varies appreciably within the timeframe of a single data session, the MAB model needs to be adapted for channel dynamicity. This thesis proposes two ways of addressing channel dynamicity: channel state conditioning and forgetting factors. With channel state conditioning, the MAB model learns the ACK probabilities conditioned on periodically reported channel quality indicators (CQI). The channel is assumed to be quasi-stationary between successive CQI reports. Further, for a given CQI, the channel state is assumed to be approximately constant. The channel-conditional ACK probability can hence be learnt over successive transmissions in a dynamic channel. Alternatively, a forgetting factor can also be employed for addressing channel variations, where the historical ACK signal is weighted inversely with the time gap from the current transmission time. A specific instance of a forgetting factor, which uses a sliding time window heuristic, has been proposed in [25] for optimization under nonstationary wireless channels.
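A sliding-window forgetting heuristic can be sketched as below. This is an illustration under stated assumptions and not the algorithm of [25]: the class name is hypothetical, the window is counted in transmissions rather than in wall-clock time, and rates with no observation inside the window fall back to an arbitrary default estimate.

```python
from collections import deque


class SlidingWindowAckEstimator:
    """Per-rate ACK probability estimates from the most recent transmissions only."""

    def __init__(self, n_rates, window):
        self.n_rates = n_rates
        # Oldest entries are discarded automatically once the window is full.
        self.history = deque(maxlen=window)

    def observe(self, rate_index, ack):
        self.history.append((rate_index, int(ack)))

    def estimates(self, default=0.5):
        counts = [0] * self.n_rates
        acks = [0] * self.n_rates
        for rate_index, ack in self.history:
            counts[rate_index] += 1
            acks[rate_index] += ack
        # Rates that were not used inside the window get the default estimate.
        return [acks[k] / counts[k] if counts[k] else default
                for k in range(self.n_rates)]
```

Because outcomes older than the window are dropped, the estimates track a channel whose ACK probabilities drift, at the price of a higher variance than an estimate that uses the full history.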

Optimization Algorithms

MAB optimization algorithms fall under two broad categories: frequentist and Bayesian. Frequentist MAB algorithms are rooted in empirical statistics that are employed to select an optimal arm in every round. One of the first efficient and provably optimal frequentist algorithms, UCB1, predicts the arm with the highest upper confidence bound (UCB) for reward maximization. In contrast to frequentist optimization schemes, Bayesian MAB optimization schemes have been proposed that assign a degree of belief to the true reward parameters. This belief is updated at the beginning of every round based on the previously observed rewards. The most common Bayesian heuristic, Thompson sampling, assigns an initial prior distribution over the reward parameters. In every round, Thompson sampling updates its belief by computing a new prior from the rewards collected in all previous rounds. This updated prior is then used to predict the optimal arm that maximizes the expected reward.

There are several reasons that motivate Thompson sampling for MAB-based link adaptation. First, several models of the wireless system are available, which are grounded in expert knowledge and verified through decades of experimental studies. This motivates the use of a Thompson sampling approach, which can incorporate model-guided knowledge into the prior beliefs of the reward distribution. Additionally, Thompson sampling has recently been shown to be asymptotically optimal for reward maximization, and additional performance results are also being made available at a fast pace [29, 32]. Finally, Thompson sampling is known to outperform UCB-based approaches for a large number of empirical benchmarks [33]. Hence, this thesis adopts Thompson sampling as the optimization algorithm of choice for the link adaptation problem. In the next section, we will introduce Thompson sampling, and describe its extensions proposed in this thesis for various link adaptation formulations.
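For reference, the frequentist baseline mentioned above can be sketched in a few lines. The function below follows the standard UCB1 index, that is, the empirical mean reward plus an exploration bonus of $\sqrt{2 \ln t / n_k}$, and assumes rewards normalized to $[0, 1]$ (for link adaptation, for instance, the delivered throughput divided by the highest rate). The function name and argument layout are illustrative.

```python
import math


def ucb1_select(reward_sums, pull_counts, t):
    """Return the arm with the highest UCB1 index in round t (t >= 1).

    reward_sums[k]: sum of rewards in [0, 1] collected from arm k so far.
    pull_counts[k]: number of times arm k has been pulled so far.
    """
    # Every arm is pulled once before the index is defined for all arms.
    for k, n in enumerate(pull_counts):
        if n == 0:
            return k
    indices = [
        reward_sums[k] / pull_counts[k]
        + math.sqrt(2.0 * math.log(t) / pull_counts[k])
        for k in range(len(pull_counts))
    ]
    return max(range(len(indices)), key=lambda k: indices[k])
```

The bonus term shrinks as an arm is pulled more often, so rarely tried arms are revisited until their upper confidence bound falls below that of the empirically best arm.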

2.4 Thompson Sampling

Thompson sampling, also known as posterior sampling, is the state-of-the-art Bayesian heuristic for MAB optimization. First proposed in 1933 [34], Thompson sampling has recently found a resurgence in interest owing to the empirical evidence of its superior performance [33] and rigorous theoretical bounds on its finite-time performance [32, 35]. Thompson sampling maintains a prior belief over the reward parameters, which is updated based on the rewards collected in the previous rounds. Subsequently, in every round, Thompson sampling estimates the per-arm reward by sampling from the associated prior. This sampling step naturally balances between exploration, by sometimes sampling actions that have uncertain rewards, and exploitation, by choosing the actions with the highest predicted reward at other times.

In the context of link adaptation, the ACK events are Bernoulli-distributed with a mean given by the true ACK probability, $\rho_k$, for each rate index $k \in \{1, \dots, K\}$. Since the ACK probabilities are a priori unknown, Thompson sampling models them by assigning a Beta prior distribution over the ACK probability, $B(\alpha_k, \beta_k)$, where $\alpha_k, \beta_k$ are the distribution parameters. The choice of a Beta distribution is motivated by its conjugacy to the Bernoulli distribution. This conjugacy property greatly simplifies the calculation of the prior in every round. Hence, at every time step $t = 1, 2, \dots$, Thompson sampling computes the prior distribution parameters based on the ACK/NACKs obtained in previous transmission intervals,

$$\alpha_k[t] = 1 + \sum_{i < t:\, k[i] = k} e_k[i], \qquad \beta_k[t] = 1 + \sum_{i < t:\, k[i] = k} \left(1 - e_k[i]\right). \tag{2.3}$$

Subsequently, Thompson sampling obtains the latest sampled ACK probabilities, $\tilde{\rho}_k[t] \sim B(\alpha_k[t], \beta_k[t])\ \forall k \in \{1, \dots, K\}$, which are used to select the optimal rate for data transmission.

In the rest of this section, we summarize two key contributions to Thompson sampling based link adaptation made in this thesis: link adaptation under error rate constraints, and an efficient learning scheme that exploits a latent model of the wireless channel state.
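A minimal sketch of Beta-Bernoulli Thompson sampling for throughput maximization follows. It assumes a stationary channel and a uniform $B(1, 1)$ initial prior for every rate; the class and function names, the simulated ACK probabilities, and the seeds are illustrative and do not reproduce the evaluation setup of the included papers.

```python
import random


class BetaThompsonSampler:
    """Thompson sampling over K rates with one Beta belief per rate."""

    def __init__(self, rates, seed=None):
        self.rates = list(rates)
        self.alpha = [1.0] * len(self.rates)  # 1 + number of ACKs per rate
        self.beta = [1.0] * len(self.rates)   # 1 + number of NACKs per rate
        self.rng = random.Random(seed)

    def select_rate(self):
        # Sample one ACK probability per rate, then maximize sampled throughput.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(self.rates)),
                   key=lambda k: self.rates[k] * samples[k])

    def update(self, rate_index, ack):
        if ack:
            self.alpha[rate_index] += 1.0
        else:
            self.beta[rate_index] += 1.0


def run_link(sampler, true_ack_probs, n_rounds, seed=0):
    """Simulate a stationary link and return how often each rate was selected."""
    env = random.Random(seed)
    picks = [0] * len(true_ack_probs)
    for _ in range(n_rounds):
        k = sampler.select_rate()
        sampler.update(k, env.random() < true_ack_probs[k])
        picks[k] += 1
    return picks
```

With rates `[1, 2, 4]` and true ACK probabilities `[0.95, 0.8, 0.2]`, the sampler initially spreads its choices across the rates and then concentrates on the middle rate, which has the highest expected throughput.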

Constrained Thompson Sampling

Several wireless applications need link reliability to maintain an acceptable QoS. Link reliability can be expressed in terms of a packet error rate threshold $\eta$, that is, the maximum allowed fraction of packets that may fail without disrupting the service. The link reliability objective can hence be formulated as

$$\text{maximize} \quad \frac{1}{T} \sum_{t=1}^{T} r_{k[t]}\, \rho_{k[t]} \qquad \text{subject to} \quad \frac{1}{T} \sum_{t=1}^{T} \rho_{k[t]} \geq 1 - \eta, \tag{2.4}$$

that is, maximize the time-averaged link throughput subject to the satisfaction of the error rate constraint.

The problem formulation above is NP-hard in general. As a consequence, we relax this problem such that the constraint needs to be satisfied only in expectation over multiple data flows. This relaxed problem is expressed as the linear program,

$$\mathrm{LP}(\tilde{\rho}[t]) = \operatorname*{argmax}_{p_k[t],\, k \in \{1, \dots, K\}} \sum_{k=1}^{K} p_k[t]\, \tilde{\rho}_k[t]\, r_k \qquad \text{subject to} \quad \sum_{k=1}^{K} p_k[t]\, \tilde{\rho}_k[t] \geq 1 - \eta, \tag{2.5}$$

where $[p_1, \dots, p_K]$ lies on the probability simplex and assigns a selection probability to each candidate rate. Intuitively, the solution to $\mathrm{LP}(\tilde{\rho}[t])$ is a probabilistic mixture of rates for the constrained link performance objective. The instantaneous rate for transmission is then obtained by sampling from the solution to $\mathrm{LP}(\tilde{\rho}[t])$.

We extend Thompson sampling to address linearly constrained optimization of the form above. Here, the Thompson samples in every round parameterize the linear program. This sample-parameterized linear program is optimized to compute a probabilistic arm selection policy, which is then executed in the environment. Paper IV included in this thesis introduces a constrained Thompson sampling algorithm based on the linear program above, which is employed for link adaptation in cellular networks. Subsequently, Paper V obtains theoretical upper bounds for the finite-time performance of the proposed scheme in terms of two metrics: the cumulative loss in average throughput, and the cumulative number of average constraint violations.
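The relaxed problem can be solved without a general-purpose LP solver, because a linear program over the probability simplex with one additional linear constraint attains its optimum at a vertex with at most two nonzero selection probabilities. The sketch below enumerates these vertices directly. It is an illustration of the linear program only, not the implementation used in Paper IV; the function names are hypothetical, and the ACK probabilities passed in stand for the Thompson samples of the current round.

```python
import random
from itertools import combinations


def solve_rate_mixture(rates, ack_samples, eta):
    """Maximize sum_k p_k * rho_k * r_k subject to sum_k p_k * rho_k >= 1 - eta,
    with p on the probability simplex. Returns p, or None if infeasible."""
    n = len(rates)
    target = 1.0 - eta
    best_value, best_p = None, None

    def consider(p):
        nonlocal best_value, best_p
        value = sum(pk * rho * r for pk, rho, r in zip(p, ack_samples, rates))
        if best_value is None or value > best_value:
            best_value, best_p = value, p

    # Vertices that put all probability on a single feasible rate.
    for k in range(n):
        if ack_samples[k] >= target:
            p = [0.0] * n
            p[k] = 1.0
            consider(p)

    # Vertices that mix two rates so that the constraint holds with equality.
    for i, j in combinations(range(n), 2):
        low, high = sorted((i, j), key=lambda k: ack_samples[k])
        if ack_samples[low] < target < ack_samples[high]:
            w = (target - ack_samples[low]) / (ack_samples[high] - ack_samples[low])
            p = [0.0] * n
            p[high] = w
            p[low] = 1.0 - w
            consider(p)
    return best_p


def sample_rate(p, rng=random):
    """Draw the rate index for the next transmission from the mixture p."""
    return rng.choices(range(len(p)), weights=p)[0]
```

For rates `[1, 2, 4]`, sampled ACK probabilities `[0.99, 0.8, 0.5]` and $\eta = 0.1$, the solution mixes the first two rates: the first rate alone satisfies the constraint with slack, and shifting probability to the second rate raises the throughput until the constraint becomes tight.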

Latent Thompson Sampling

Data transmission rates exhibit a strong correlation in terms of their throughput performance. In cellular networks, this dependence has previously been encoded through an offline link model (OLM), which models the ACK probability for each rate over a range of channel signal-to-interference-and-noise-ratios (SINR). OLMs are used in state-of-the-art cellular networks for concisely reporting channel state measurements over a feedback link. However, these OLMs can also be used for exploiting the dependence between data rates. In particular, the ACK probability estimate for a given data rate can be mapped to channel SINR, which is a latent subspace common to all available rates. Hence, the OLM, together with the channel SINR, also predicts the ACK probability for all available rates.

We exploit OLMs for sample-efficient link adaptation in Paper VI. Here, we introduce a latent Thompson sampling algorithm that models the channel SINR. This algorithm assigns a prior distribution over the channel SINR. In every round, a new SINR prior is computed based on the ACKs observed in previous rounds. This updated prior is used to sample the latest SINR estimate, which is mapped to respective ACK probabilities with the help of the OLM. Since the choice of any rate provides evidence of the channel SINR, this algorithm is able to quickly converge to the true channel SINR, and hence efficiently predicts the optimal throughput-maximizing rate.
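The idea of learning in the SINR subspace can be sketched with a discretized belief. Everything specific in the block below is an assumption made for illustration and is not taken from Paper VI: the OLM is replaced by a hypothetical logistic curve per rate, the SINR belief is a categorical distribution over a fixed grid in dB, and the class and function names are invented here.

```python
import math
import random


def olm_ack_prob(sinr_db, threshold_db, slope=1.0):
    """Hypothetical OLM curve: logistic ACK probability versus SINR for one rate."""
    return 1.0 / (1.0 + math.exp(-slope * (sinr_db - threshold_db)))


class LatentThompsonSampler:
    """Thompson sampling over a discretized channel SINR instead of per-rate beliefs."""

    def __init__(self, rates, thresholds_db, sinr_grid_db, seed=None):
        self.rates = list(rates)
        self.grid = list(sinr_grid_db)
        # ack_table[s][k]: OLM ACK probability of rate k at grid point s.
        self.ack_table = [[olm_ack_prob(s, th) for th in thresholds_db]
                          for s in self.grid]
        self.belief = [1.0 / len(self.grid)] * len(self.grid)  # uniform prior
        self.rng = random.Random(seed)

    def select_rate(self):
        # Sample an SINR hypothesis, then pick its throughput-maximizing rate.
        s = self.rng.choices(range(len(self.grid)), weights=self.belief)[0]
        probs = self.ack_table[s]
        return max(range(len(self.rates)),
                   key=lambda k: self.rates[k] * probs[k])

    def update(self, rate_index, ack):
        # Bayes rule: any rate's ACK/NACK reweights every SINR hypothesis.
        post = [b * (row[rate_index] if ack else 1.0 - row[rate_index])
                for b, row in zip(self.belief, self.ack_table)]
        total = sum(post)
        self.belief = [p / total for p in post]

    def map_sinr(self):
        """Grid SINR with the highest belief."""
        return self.grid[max(range(len(self.grid)),
                             key=lambda s: self.belief[s])]
```

Each ACK or NACK updates a single belief that is shared by all rates, which is why far fewer transmissions are needed than when every rate is learnt independently.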

Chapter 3

Summary of the Included Papers

In this chapter, we summarize the papers included in this thesis. The papers are presented in an order suited to the discussion in the previous chapters, and do not follow any specific chronology. For each paper, we summarize the contents under five headings: Background, System Model, Method, Evaluation, and Division of Work, respectively. The Background section gives a short introduction of the target problem as well as the motivation for the paper. The next section summarizes the System Model considered in the paper, which is used both for developing the proposed method and for subsequent numerical evaluations. An outline of the proposed approach is subsequently provided in the Method section. Next, the Evaluation section discusses the results of the numerical evaluations in terms of the performance compared to benchmark schemes. Finally, the Division of Work section serves to highlight the contribution made by the doctoral candidate towards the paper.

3.1 Paper I: Deep Learning for Frame Error Probability Prediction in BICM-OFDM Systems

Background. The wireless channel typically exhibits a non-uniform response over the transmission bandwidth, known as the frequency selectivity of the chan- nel. Frequency selectivity makes it dicult to accurately predict the link perfor- mance for the available transmission parameter configuration. Wireless systems hence employ approximate, empirically tuned, models that map the channel re- sponse to predicted frame error probabilities (FEP). In practice, these models can be expensive to tune and inaccurate in terms of the predicted FEP for a given channel state. System Model. We consider wireless systems that employ bit-interleaved coded modulation (BICM) for generating transmission symbols, and orthogonal frequency division multiplexing (OFDM) for propagation. We further consider a finite number of modulation and coding schemes (MCS) available for packing the


Figure 3.1: Abstract model of a BICM-OFDM link that maps the observed channel state to the FEP for each available MCS.

data bits into a transmission frame. Subsequently, we propose an abstract model that maps the channel state, expressed as a vector of per-subcarrier signal-to-interference-and-noise ratio (SINR) values, to the FEP for each available MCS. This model is parameterized by the set of learnable parameters θ, and is illustrated in Fig. 3.1.

Method. We propose a neural probability estimation model, where an ANN models the FEP for each available MCS given an arbitrary channel state. We use a fully-connected ANN with multiple dense layers and rectified linear unit (ReLU) activations. There are K ANN outputs, where K denotes the number of available MCSs. Each output is compressed through a sigmoid function, and trained through supervised learning with ACK/NACK events collected within the environment. The training proceeds by iteratively minimizing the cross entropy between the ANN outputs and the ACK/NACK events through batch gradient descent. We show that, for this model, the ANN outputs estimate the maximum-likelihood parameters of the true FEP model. Hence, in the limit of infinite training samples, the ANN outputs can be interpreted as the FEPs for the respective MCSs. During the inference phase, these ANN outputs are used to determine the optimal MCS that maximizes the expected link throughput.
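The principle behind this training scheme can be sketched in a few lines: a sigmoid output trained with the cross-entropy loss on binary ACK/NACK events approaches the underlying FEP. In the sketch below, a single sigmoid unit per MCS stands in for the multi-layer ANN, the channel state is reduced to one scalar SINR value, and the ground-truth FEP curve, rates, and thresholds are hypothetical values chosen only for illustration.

```python
import math
import random

RATES = [1.0, 2.0]            # illustrative rates for two hypothetical MCSs
THRESHOLDS_DB = [6.0, 12.0]   # illustrative SINR thresholds of the true FEP


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feature(sinr_db):
    """Normalize the SINR so that gradient descent is well conditioned."""
    return (sinr_db - 10.0) / 5.0


def true_fep(sinr_db, k):
    """Assumed ground-truth FEP, used only to generate NACK events."""
    return sigmoid(-(sinr_db - THRESHOLDS_DB[k]) / 3.0)


def make_samples(n, rng):
    """Draw (feature, MCS index, NACK flag) triples from the assumed link."""
    samples = []
    for _ in range(n):
        sinr = rng.uniform(0.0, 20.0)
        k = rng.randrange(len(RATES))
        nack = 1 if rng.random() < true_fep(sinr, k) else 0
        samples.append((feature(sinr), k, nack))
    return samples


def train(samples, epochs=200, lr=2.0):
    """Batch gradient descent on the cross entropy, one sigmoid unit per MCS."""
    num_mcs = len(RATES)
    w, b = [0.0] * num_mcs, [0.0] * num_mcs
    counts = [0] * num_mcs
    for _, k, _ in samples:
        counts[k] += 1
    for _ in range(epochs):
        gw, gb = [0.0] * num_mcs, [0.0] * num_mcs
        for x, k, nack in samples:
            # Gradient of the cross entropy with respect to the logit.
            err = sigmoid(w[k] * x + b[k]) - nack
            gw[k] += err * x
            gb[k] += err
        for j in range(num_mcs):
            w[j] -= lr * gw[j] / max(1, counts[j])
            b[j] -= lr * gb[j] / max(1, counts[j])
    return w, b


def predict_fep(w, b, sinr_db, k):
    return sigmoid(w[k] * feature(sinr_db) + b[k])


def select_mcs(w, b, sinr_db):
    """Pick the MCS with the largest predicted throughput, rate * (1 - FEP)."""
    return max(range(len(RATES)),
               key=lambda k: RATES[k] * (1.0 - predict_fep(w, b, sinr_db, k)))
```

Only the output of the MCS that was actually transmitted contributes to the loss for a given sample, since an ACK/NACK is observed for that MCS alone.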

Evaluation. We numerically compare our proposed ANN-based FEP prediction scheme with the state-of-the-art approach that uses an effective SINR mapping (ESM) heuristic. We evaluate both schemes in terms of their respective channel-conditional FEP estimation accuracy and the realized average link throughput for an industry standard model of the wireless channel. Our approach reduces the FEP prediction error by up to 50% and thereby increases the link throughput by up to 20%.

Division of Work. The problem formulation and the proposal to use an ANN model are primarily attributed to the main supervisor and the collaborators. The doctoral candidate developed the theoretical proof related to ANN model training, and performed the numerical evaluations.

3.2 Paper II: A Learning Approach for Optimal Codebook Selection in Spatial Modulation Systems

Background. Spatial modulation is an instance of the more general index-based modulation schemes, where the data bits encode the choice of an antenna pattern. These spatial modulation bits are then recovered at the receiver by exploiting the channel diversity across antenna patterns. Spatial modulation allows even low-complexity devices with hardware constraints to partially exploit the benefits associated with multi-antenna transmissions. However, any performance gains critically depend on an optimal configuration of the codebook used for indexing the available antenna patterns.

System Model. We consider a spatial modulation system with access to multiple antenna patterns but only a single radio front-end. Both the transmitter and the receiver have access to a common set of candidate codebooks, one of which needs to be selected to optimize the data transmission performance. The receiver first determines the optimal throughput-maximizing codebook by measuring the channel response for each antenna pattern, and feeds back the selected codebook index. Subsequently, the transmitter uses this codebook for data transmissions that are recovered at the receiver, for example by minimizing the Euclidean distance between the received signal and the codebook symbols. This spatial modulation scheme is also illustrated in Fig. 3.2.

Figure 3.2: Spatial modulation based on the choice of a single transmit antenna, where one codebook symbol is transmitted in each time interval.
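The minimum Euclidean distance detection mentioned in the system model can be sketched as follows. The sketch assumes a single receive antenna, a flat channel with one complex gain per antenna pattern, and a codebook represented as a list of hypothetical (antenna pattern index, modulation symbol) pairs; none of these specifics are taken from Paper II.

```python
def detect_codebook_symbol(received, codebook, channel_gains):
    """Return the index of the codebook entry closest to the received sample.

    codebook: list of (antenna_pattern_index, complex_symbol) pairs.
    channel_gains: one complex channel gain per antenna pattern.
    """
    def squared_distance(entry):
        pattern, symbol = entry
        return abs(received - channel_gains[pattern] * symbol) ** 2

    return min(range(len(codebook)),
               key=lambda idx: squared_distance(codebook[idx]))
```

For example, with two antenna patterns and binary symbols, the four codebook entries carry two bits per time interval, one of which is conveyed by the choice of antenna pattern alone.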

Method. We propose an ANN-based supervised learning approach that predicts the channel-conditional average symbol error rate (SER) for each candidate codebook. We employ a fully-connected neural network with ReLU activations at the inner layers and a sigmoid-activated output layer. Each ANN output corresponds to the SER for a candidate within the set of available codebooks. We train the ANN to minimize the mean squared loss between the ANN outputs and the observed ACK/NACKs for a given channel state. For this setup, the ANN also approximates the true posterior SER for the available training data. The ANN outputs are subsequently employed to maximize the average link throughput.

Evaluation. We evaluate our proposed approach for a Rayleigh channel, for uncorrelated as well as correlated antenna patterns at different operating SNRs. Our approach uniformly increases the realized link throughput compared to a non-adaptive scheme that utilizes a fixed spatial modulation codebook. Further, our approach is robust against channel impairments and antenna correlations.

Division of Work. The idea to employ an ANN-based spatial modulation model and the central proof in this paper are attributed to the doctoral candidate. The collaborators contributed with the problem formulation, the investigated scenarios, and the numerical evaluations.

3.3 Paper III: Contextual Multi-armed Bandits for Link Adaptation in Cellular Networks

Background. Cellular deployments address the evolution of wireless channel state through link adaptation. With link adaptation, the instantaneous channel state is estimated through a combination of periodic channel quality information (CQI) reports and ACK/NACK feedback. State-of-the-art cellular networks employ relatively simple heuristics, known as outer loop link adaptation (OLLA), to refine the channel state estimates. However, OLLA suffers from convergence issues and is not robust to impairments such as CQI reporting delays.

System Model. We consider downlink data transmission within a cellular link modeled on the fourth-generation (4G) standard. The transmitter has access to periodic CQI reports, ACK/NACK feedback, and estimates of the relative user speed for each served link. In every time interval, the transmitter needs to select an MCS from a finite set of available MCSs. The predicted optimal MCS in every transmission interval maximizes the expected link throughput.

Figure 3.3: Contextual MAB that determines the per-link optimal MCS for link throughput maximization based on the available link channel context.

Method. We propose an online learning approach that uses a contextual multi-armed bandit (CMAB) for link adaptation. The channel information serves as the context for the MAB, which learns the ACK probability for each available MCS with an ANN model. The ANN is trained online with the ACK/NACK feedback, where we show that this training scheme models the true posterior ACK probability. The training scheme uses an epsilon-greedy exploration strategy with batch gradient descent. Further, by conditioning the MCS selection on the channel context, our approach is able to learn from multiple, parallel, links for time-efficient learning. We also propose to accelerate the learning with a pre-training phase that learns approximate ACK probabilities using offline lookup tables. In every time step, the CMAB is used to select the predicted throughput-maximizing MCS to optimize the link performance.

Evaluation. We numerically evaluate our proposed CMAB approach in a simulated 4G environment and for a standard vehicular channel model. Our results demonstrate that compared to OLLA, our approach improves the link throughput by 15–20% for different CQI reporting periodicities and delays.

Division of Work. The doctoral candidate proposed using an online learning approach based on an ANN model. The collaborators came up with the MAB formulation for this problem, and contributed to the implementation details. Further, the candidate developed the central theoretical proof with the help of the main supervisor, and carried out the numerical evaluations.
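The online selection loop of the CMAB approach in Paper III can be sketched as below. For brevity, a table of empirical ACK frequencies per (context, MCS) pair replaces the ANN, and the context is a hypothetical discrete index such as a quantized CQI value; the environment ACK probabilities are supplied by the caller and are illustrative only.

```python
import random


def run_epsilon_greedy_cmab(ack_prob, rates, rounds, epsilon, rng):
    """Epsilon-greedy contextual bandit; returns the greedy MCS per context.

    ack_prob[c][k] is the environment ACK probability of MCS k in context c,
    which is unknown to the learner and used only to generate feedback.
    """
    num_contexts, num_mcs = len(ack_prob), len(rates)
    acks = [[0] * num_mcs for _ in range(num_contexts)]
    tries = [[0] * num_mcs for _ in range(num_contexts)]

    def estimate(c, k):
        # Optimistic default so that every MCS is tried at least once.
        return acks[c][k] / tries[c][k] if tries[c][k] else 1.0

    def greedy(c):
        return max(range(num_mcs), key=lambda j: rates[j] * estimate(c, j))

    for _ in range(rounds):
        c = rng.randrange(num_contexts)       # observed channel context
        if rng.random() < epsilon:
            k = rng.randrange(num_mcs)        # explore a random MCS
        else:
            k = greedy(c)                     # exploit current estimates
        ack = rng.random() < ack_prob[c][k]   # ACK/NACK feedback
        tries[c][k] += 1
        acks[c][k] += int(ack)

    return [greedy(c) for c in range(num_contexts)]
```

Since the estimates are indexed by context rather than by link, feedback from any link that observes the same context contributes to the same estimate, which mirrors the learning across parallel links described in the Method paragraph.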

3.4 Paper IV: Bayesian Link Adaptation under a BLER Target

Background. Link adaptation optimizes the choice of transmission parameters in a dynamic wireless channel. Several wireless applications require a minimum level of link reliability, expressed in terms of a maximum block error rate (BLER), to maintain their respective quality of service (QoS). For such services, the link adaptation objective is to maximize the data throughput subject to an upper BLER constraint. However, the state-of-the-art link adaptation scheme, OLLA, only optimizes for the BLER, which may lead to a poor link throughput.

System Model. We consider a cellular link model, where an MCS is selected from a finite set for data block transmission in every time interval. This MCS selection is aided by periodic CQI reports, which provide approximate channel measurements encoded using an offline link model, and the ACK/NACK feedback for previous block transmissions. The link adaptation objective is formulated as a linear program, which computes a probability distribution over the available MCSs, that is, a point on the probability simplex. The support of this distribution is the set of optimal MCSs that satisfy the link performance objective. Three possible MCS selection strategies are illustrated in Fig. 3.4, where the variable µ denotes the ACK probability.

Method. We propose BayesLA, a Bayesian link adaptation scheme that extends the classical Thompson sampling scheme to constrained optimization.

Figure 3.4: Optimal MCS selection policy for throughput maximization under a BLER constraint (left to right): No feasible solution, single optimal MCS, probabilistic mixture of two MCSs.

BayesLA rigorously formulates the actual link objective of throughput maximization under a BLER constraint. In every transmission interval, BayesLA solves a linear program parameterized by the latest ACK probability estimates to compute an optimal MCS selection policy. For time-efficient learning, BayesLA exploits CQI reports to initialize the ACK probability for each available MCS. Further, BayesLA refines these probabilities based on the ACK/NACK feedback.

Evaluation. We conduct numerical experiments for a Rayleigh fading wireless channel with two different BLER targets of 10% and 30% respectively. Compared to OLLA, our proposed BayesLA scheme improves the link throughput while simultaneously also lowering the realized link BLER. These dual improvements are the result of the rigorous problem formulation adopted by BayesLA, which overcomes the issues inherent in the classical OLLA approach.

Division of Work. The doctoral candidate proposed the Thompson sampling scheme for the studied problems and carried out the numerical evaluations. The main supervisor helped formulate the convex optimization problem in the context of Thompson sampling.
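The linear program that is solved in every transmission interval of Paper IV can be illustrated with a small sketch. It assumes the program takes the form: maximize the expected throughput, sum of p_k r_k µ_k, over MCS selection probabilities p_k, subject to the expected ACK probability, sum of p_k µ_k, being at least one minus the BLER target. With one linear constraint in addition to the simplex constraint, an optimal vertex mixes at most two MCSs, so the sketch enumerates these vertices directly instead of calling a solver. The function name and the example numbers are hypothetical.

```python
def solve_constrained_policy(rates, ack_probs, bler_target):
    """Return {mcs_index: probability}, or None if no MCS meets the target."""
    min_ack = 1.0 - bler_target
    num_mcs = len(rates)
    tput = [r * mu for r, mu in zip(rates, ack_probs)]
    best_value, best_policy = None, None

    # Vertices that use a single MCS which satisfies the constraint by itself.
    for i in range(num_mcs):
        if ack_probs[i] >= min_ack:
            if best_value is None or tput[i] > best_value:
                best_value, best_policy = tput[i], {i: 1.0}

    # Vertices that mix a feasible MCS i with an infeasible MCS j such that
    # the BLER constraint holds with equality.
    for i in range(num_mcs):
        for j in range(num_mcs):
            if ack_probs[i] >= min_ack > ack_probs[j]:
                p = (min_ack - ack_probs[j]) / (ack_probs[i] - ack_probs[j])
                value = p * tput[i] + (1.0 - p) * tput[j]
                if best_value is None or value > best_value:
                    best_value, best_policy = value, {i: p, j: 1.0 - p}

    return best_policy
```

The three possible outcomes, namely no feasible solution, a single MCS, and a mixture of two MCSs, correspond to the three cases illustrated in Fig. 3.4. In a Thompson sampling loop, the ACK probabilities passed to this function would be samples from the current posterior rather than fixed values.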

3.5 Paper V: Thompson Sampling for Linearly Constrained Bandits

Background. In this paper, we address MAB problems where the reward-maximization objective is subject to a probabilistic linear constraint. Real-world instances of this problem appear in several domains, for example in online advertising, digital inventory management, and wireless link adaptation. Recently, constrained extensions of the Thompson sampling heuristic have been proposed to address this problem. However, the existing works only provide O(√T)-type bounds on the performance of these schemes.

System Model. The system model considered in this paper comprises a MAB with a finite number of discrete arms operating in a stochastic environment. The rewards for each arm are assumed to be Bernoulli-distributed, and independent and identically distributed across the rounds. Further, the arms are

assumed to be independent of each other. In this setting, we propose LinConTS, a linearly constrained Thompson sampling algorithm that optimizes the cumulative reward earned over multiple rounds under a probabilistic linear constraint.

Method. We analyze the finite-time performance of LinConTS. Our analysis makes use of convex optimization theory, where we employ Lagrangian duals to track the evolution of the parameters associated with each MAB arm. We combine this analysis with previous work on the theoretical performance of Thompson sampling to arrive at the regret contribution for each arm. We are able to show that for the suboptimal arms, that is, the arms that are not supported by the stationary optimal arm-selection policy, the regret contribution is upper bounded by a term that scales logarithmically with time. Similarly, the contribution to cumulative violations by these arms is also upper bounded logarithmically in time. However, the regret and violation contributions of the optimal set of arms scale only with the square root of time.

Evaluation. We evaluate LinConTS on two real-world datasets, and benchmark the results against an analogous frequentist scheme for constrained online optimization. These results demonstrate that LinConTS has significantly superior performance in terms of both the regret and the violation metrics.

Division of Work. The main supervisor proposed the formulation of the studied convex optimization problem, while the collaborators helped address the optimization through Thompson sampling. The doctoral candidate developed the proof that is central to this paper with the help of the main supervisor, and carried out the numerical evaluations.

3.6 Paper VI: Reinforcement Learning for Efficient and Tuning-free Link Adaptation

Background. The channel-conditional ACK probability as a function of data transmission rates exhibits structure. In particular, it has been shown that under a few simplifying assumptions, the ACK probability is a nonlinear function of the scalar channel SINR and the data rate. State-of-the-art cellular networks exploit this structure for encoding a parsimonious model of the ACK probability as a function of the channel SINR. This so-called offline link model (OLM) assists in the reporting of channel state measurements for optimal rate selection.

System Model. We consider a cellular link model, where both the base station and the user equipment have access to their own OLM. Further, we consider link adaptation in the cellular downlink, with the optimization objective of maximizing the average link throughput. For data transmissions, several modulation and coding schemes (MCS) are available, and a single MCS is selected in every transmission instance to encode the data bits. The receiver feeds back an ACK/NACK to indicate the decoding outcome, and additionally reports discretized channel measurements periodically to the base station.

Method. We propose latent Thompson sampling (LTS) for link adaptation, which maintains a probabilistic model of the channel SINR. In every round, LTS updates the SINR distribution based on the ACK/NACK received in previous rounds. The latest SINR distribution is sampled to obtain an estimate of the latest channel state. Subsequently, this sample is mapped to the ACK probability for each available rate through the OLM. Since a latent SINR variable encodes the ACK probability for all the available MCSs, LTS fully exploits the interdependence between the available rates. LTS hence quickly converges to the true instantaneous channel SINR. We extend LTS to nonstationary channels by evolving the SINR distribution in every round. This evolution is implemented through Gaussian smoothing, which is motivated by the well-known Rayleigh model for channel fading. The magnitude of smoothing in every round depends on the rate of channel variations. To automate this smoothing parameter, we rely on Doppler estimates of the user, which are commonly available at the base station. LTS hence efficiently optimizes the link throughput without the need for any manual parameter tuning.

Evaluation. We numerically evaluate LTS for several simulated cellular deployments. Further, we compare the performance of LTS with state-of-the-art link adaptation schemes in cellular as well as WiFi networks. Compared to these schemes, LTS increases the link throughput by up to 100%. In addition, LTS mitigates the need for channel reporting, which reduces the signaling overhead in cellular networks.

Division of Work. The doctoral candidate proposed the use of OLMs in the context of reinforcement learning, developed the LTS algorithm, and carried out the numerical evaluations. The main supervisor helped formulate the algorithm, and the collaborators contributed to extending the basic algorithm to nonstationary wireless environments.

Chapter 4

Conclusion

This thesis has addressed link adaptation, which deals with selecting optimal parameters for wireless data transmission. Although link adaptation is a well-studied topic, the exponential growth of wireless services motivates the need for more powerful techniques suitable for the next-generation networks. This thesis adopts a machine learning approach and introduces several techniques that improve link adaptation in terms of its flexibility, performance, and autonomous operation. Still, the contributions made in this thesis are only an initial foray into the vast possibilities of machine learning-driven wireless optimization. The ideas presented here will hopefully serve as instructive tools to address several other crucial challenges in wireless networks and other application areas. This thesis concludes with a summary of the key takeaways from this work, and an outline of possible future research directions.

4.1 Key Takeaways

• Machine learning techniques, which substitute expert knowledge with purely data-driven optimization, have been achieving disruptive performance results across application domains. This thesis investigates machine learning for link adaptation, where the proposed solutions indeed outperform the state-of-the-art wireless algorithms. However, to achieve this result, it is crucial to understand the dynamics and objective of the underlying problem, and use this knowledge to select appropriate data-driven models and learning schemes.

• Artificial neural networks (ANN) are well-known to be powerful, general-purpose, nonlinear models. However, one drawback with ANNs is that these are generally black-box, in the sense that the ANN outputs seldom have a physical interpretation. This thesis highlights a particular configuration of the ANN model that can be rigorously analyzed in terms of classical maximum-likelihood optimization. In the context of link adaptation, the output of this


model directly relates to the transmission success probability, which is a key performance metric in wireless optimization.

• For state-of-the-art wireless services, performance metrics go beyond the simple, classical objective of maximizing the link throughput. An example is the need for reliable link behaviour to minimize service-level disruptions. This thesis provides a general technique for incorporating complex link adaptation objectives. The specific problem of link reliability can hence be rigorously formulated as a linearly constrained convex optimization problem.

• Theoretical performance guarantees are crucial for the provisioning of wireless resources. However, for online learning problems such as link adaptation, such guarantees have traditionally been elusive. This thesis hence adopts the elegant multi-armed bandit (MAB) framework for link adaptation, which is amenable to theoretical analysis. In this setting, performance guarantees are provided for the proposed algorithms, which will help in configuring adequately provisioned wireless services.

• Wireless networks are deployed in extremely diverse environments, ranging from rural areas to urban high-rises, and low-mobility scenarios to high-speed operation. Hence, it is advantageous to use optimization techniques that automatically tune for the particular environment. This thesis proposes techniques that mitigate the need for manual tuning of the network parameters. Instead, the proposed solutions learn the environment dynamics to self-tune the algorithm parameters for optimal performance.

4.2 Research Directions

This thesis chiefly addresses the problem of selecting the optimal data transmission parameters for a wireless link. The ideas developed here, and their performance results, motivate several interesting research directions, some of which are outlined below:

• In addition to data transmission parameters, wireless networks configure a multitude of parameters in real time. These include multi-antenna related parameters, for example the number of spatial transmission layers, the choice of transmission beam, transmit power levels, etc. Machine learning, and in particular reinforcement learning, is an attractive technique that can address these problems with a data-driven approach.

• Some wireless links, for example in high-density urban areas, operate in environments that are severely impaired due to interference. The extension of the solutions provided here to such scenarios, and other wireless impairments encountered in practice, is non-trivial and the subject of future research.

• Modern wireless services encapsulate complex link performance requirements, an example being the link reliability addressed in this thesis. More research is required to extend this work for other service objectives that arise from the specific applications that they serve.

• This thesis develops performance bounds for algorithms that address certain optimization objectives, for example linearly constrained convex optimization. However, several online optimization problems in wireless communications and other domains encounter other forms of constraints and optimization objectives. For such problems, the techniques proposed in this thesis can serve as a useful starting point for obtaining relevant performance bounds.

References

[1] A. Molisch, Wireless Communications. Wiley, 2010.

[2] Ericsson, “Ericsson mobility report,” Stockholm, Sweden, November 2020.

[3] V. Saxena, J. Bergman, Y. Blankenship, A. Wallén, and H. S. Razaghi, “Reducing the modem complexity and achieving deep coverage in LTE for machine-type communications,” in 2016 IEEE Global Communications Conference (GLOBECOM), pp. 1–7, IEEE, 2016.

[4] E. Esteves, P. J. Black, and M. I. Gurelli, “Link adaptation techniques for high-speed packet data in third generation cellular systems,” in European Wireless Conference, Citeseer, 2002.

[5] G. Ku and J. M. Walsh, “Resource allocation and link adaptation in LTE and LTE advanced: A tutorial,” IEEE Communications Surveys & Tutorials, vol. 17, no. 3, pp. 1605–1633, 2014.

[6] K. L. Baum, T. A. Kostas, P. J. Sartori, and B. K. Classon, “Performance characteristics of cellular systems with different link adaptation strategies,” IEEE Transactions on Vehicular Technology, vol. 52, no. 6, pp. 1497–1507, 2003.

[7] P. Bertrand, J. Jiang, and A. Ekpenyong, “Link adaptation control in LTE uplink,” in 2012 IEEE Vehicular Technology Conference (VTC Fall), pp. 1–5, IEEE, 2012.

[8] S. Zarei, “Channel coding and link adaptation,” Ausgewählte Kapitel der Nachrichtentechnik, WS, vol. 2010, pp. 1–14, 2009.

[9] F. Watanabe, “IMT-2000 and beyond IMT,” Evolution, vol. 1990, p. 2000s, 1980.

[10] S. Nanda and K. M. Rege, “Frame error rates for convolutional codes on fading channels and the concept of effective Eb/N0,” IEEE Transactions on Vehicular Technology, vol. 47, pp. 1245–1250, Nov 1998.

[11] Ericsson, “Effective SNR mapping for modelling frame error rates in multiple-state channels,” Tech. Rep. 3GPP2-C30-20030429-010, April 2003.

[12] K. Brueninghaus, D. Astely, T. Salzer, S. Visuri, A. Alexiou, S. Karger, and G. A. Seraji, “Link performance models for system level simulations of broadband radio access systems,” in 2005 IEEE 16th International Symposium on Personal, Indoor and Mobile Radio Communications, vol. 4, pp. 2306–2311, Sept 2005.

[13] R. C. Daniels, C. M. Caramanis, and R. W. Heath, “Adaptation in Convolutionally Coded MIMO-OFDM Wireless Systems Through Supervised Learning and SNR Ordering,” IEEE Transactions on Vehicular Technology, vol. 59, pp. 114–126, Jan 2010.

[14] R. C. Daniels, Machine learning for link adaptation in wireless networks. PhD thesis, 2011.

[15] T. M. NEC, “Selection of MCS levels in HSDPA,” Tech. Rep. 3GPP R1-01-0589, May 2001.

[16] V. Buenestado, J. M. Ruiz-Avilés, M. Toril, S. Luna-Ramírez, and A. Mendo, “Analysis of throughput performance statistics for benchmarking LTE networks,” IEEE Communications Letters, vol. 18, pp. 1607–1610, Sept 2014.

[17] R. A. Delgado, K. Lau, R. Middleton, R. S. Karlsson, T. Wigren, and Y. Sun, “Fast convergence outer loop link adaptation with infrequent updates in steady state,” in 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), pp. 1–5, Sept 2017.

[18] P. Wu and N. Jindal, “Coding versus ARQ in fading channels: How reliable should the PHY be?,” in GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference, pp. 1–6, Nov 2009.

[19] J. C. Bicket, Bit-rate selection in wireless networks. PhD thesis, Massachusetts Institute of Technology, 2005.

[20] D. Xia, J. Hart, and Q. Fu, “Evaluation of the minstrel rate adaptation algorithm in IEEE 802.11g WLANs,” in 2013 IEEE International Conference on Communications (ICC), pp. 2223–2228, IEEE, 2013.

[21] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.

[22] S. K. Pulliyakode and S. Kalyani, “Reinforcement learning techniques for outer loop link adaptation in 4G/5G systems,” arXiv preprint arXiv:1708.00994, 2017.

[23] S. Khastoo, T. Brecht, and A. Abedi, “Neura: Using neural networks to improve wifi rate adaptation,” in Proceedings of the 23rd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 161–170, 2020.

[24] E. Makridis, “Reinforcement learning for link adaptation in 5G-NR networks,” 2020.

[25] R. Combes, J. Ok, A. Proutiere, D. Yun, and Y. Yi, “Optimal rate sampling in 802.11 systems: Theory, design, and implementation,” IEEE Transactions on Mobile Computing, vol. 18, no. 5, pp. 1145–1158, 2018.

[26] Y. Xia, T. Qin, W. Ma, N. Yu, and T.-Y. Liu, “Budgeted multi-armed bandits with multiple plays.,” in IJCAI, pp. 2210–2216, 2016.

[27] K. Chen, K. Cai, L. Huang, and J. Lui, “Beyond the click-through rate: Web link selection with multi-level feedback,” arXiv preprint arXiv:1805.01702, 2018.

[28] A. Badanidiyuru, R. Kleinberg, and A. Slivkins, “Bandits with knapsacks,” in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pp. 207–216, IEEE, 2013.

[29] K. J. Ferreira, D. Simchi-Levi, and H. Wang, “Online network revenue management using thompson sampling,” Operations research, 2018.

[30] M. D. Richard and R. P. Lippmann, “Neural Network Classifiers Estimate Bayesian a Posteriori Probabilities,” Neural Computation, vol. 3, no. 4, pp. 461–483, 1991.

[31] A. Slivkins, “Introduction to multi-armed bandits,” arXiv preprint arXiv:1904.07272, 2019.

[32] S. Agrawal and N. Goyal, “Further optimal regret bounds for thompson sampling,” in Artificial Intelligence and Statistics, pp. 99–107, 2013.

[33] O. Chapelle and L. Li, “An empirical evaluation of thompson sampling,” in Advances in neural information processing systems, pp. 2249–2257, 2011.

[34] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.

[35] E. Kaufmann, N. Korda, and R. Munos, “Thompson sampling: An asymptotically optimal finite-time analysis,” in International Conference on Algorithmic Learning Theory, pp. 199–213, Springer, 2012.

Paper I


Deep Learning for Frame Error Probability Prediction in BICM-OFDM Systems

In the context of wireless communications, we propose a deep learning approach to learn the mapping from the instantaneous state of a frequency selective fading channel to the corresponding frame error probability (FEP) for an arbitrary set of transmission parameters. We propose an abstract model of a bit interleaved coded modulation (BICM) orthogonal frequency division multiplexing (OFDM) link chain and show that the maximum likelihood (ML) estimator of the model parameters estimates the true FEP distribution. Further, we exploit deep neural networks as a general purpose tool to implement our model and propose a training scheme for which, even while training with the binary frame error events (i.e., ACKs / NACKs), the network outputs converge to the FEP conditioned on the input channel state. We provide simulation results that demonstrate gains in the FEP prediction accuracy with our approach as compared to the traditional exponential effective SIR metric (EESM) approach for a range of channel code rates, and show that these gains can be exploited to increase the link throughput.

© 2018 IEEE. Reprinted, with permission, from V. Saxena, J. Jaldén, M. Bengtsson, and H. Tullberg, “Deep learning for frame error probability prediction in BICM-OFDM systems,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018.

1 Introduction

The efficiency of a radio link depends on its ability to adapt to the stochastic radio channel conditions that typically vary over time (i.e., fading) as well as over the signal bandwidth (i.e., frequency selectivity). Practical radio systems perform this so-called “link adaptation” by selecting the optimal transmission parameters in each frame that fulfil some criteria related to, e.g., target error rates, throughput, or latency [1]. In this paper, we investigate the problem of predicting the frame error probability (FEP) for an estimated channel state in bit-interleaved coded modulation (BICM) orthogonal frequency division multiplexing (OFDM) systems [2] [3]. Owing to their flexibility and performance, BICM-OFDM systems have been widely adopted by most of the modern radio air interfaces including those for wireless local area networks (e.g., WiFi) and for cellular communication such as Long Term Evolution for 4G, and recently, New Radio for 5G [4].

In general for frequency selective channels, it is intractable to compute the FEP conditioned on the frame channel state characterized by the received per-subcarrier signal to interference and noise ratios (SINRs). Therefore, several approximate techniques for FEP prediction have been developed that compress the per-subcarrier SINRs to an approximate effective scalar metric, which is mapped to pre-computed FEP values stored as lookup tables [5] [6] [7]. However, these techniques assume ideal channel coding performance and do not take into account practical system impairments. Further, the choice of a suitable compression function can be somewhat arbitrary and has been empirically shown to have a significant impact on the FEP prediction performance [8].

As an alternative to the effective SINR approach, supervised learning techniques that map the per-subcarrier SINR vector for each frame to the corresponding FEP have been proposed [9] [10]. During the training phase of these techniques, the model parameters are selected to minimize the mean cost between the model output for several frame channel state vectors and their Monte Carlo simulated FEPs. With sufficient training, these techniques have been shown to improve the realized link throughput of BICM-OFDM systems compared to the effective SINR approach. However, the proposed supervised learning techniques are limited by the accuracy of the simulated training datasets and additionally do not provide any insight into the optimality of the trained models.

In this paper, we cast the FEP prediction problem as a probabilistic binary classification task, where the classes correspond to frame error and success events (i.e., NACKs and ACKs) respectively, and make the following three main contributions: (i) We propose an abstract model of the BICM-OFDM link chain where the observations are the frame channel states and their binary frame error events and show that, in the limit of infinite training samples, the maximum likelihood (ML) estimator of the model parameters estimates the true FEP distribution, (ii) We use this model to develop a supervised learning approach for FEP prediction based on deep neural networks, where the training phase requires only the observed channel states and binary frame error events, thus mitigating the need for

Figure 2.1: Block diagram of the BICM-OFDM link chain considered in this paper. The parameter selection module exploits knowledge of the instantaneous channel state to select one out of several possible transmission parameter configurations in each frame.

simulations to measure the FEP, and (iii) We provide simulation results to show that our approach improves the FEP prediction accuracy, and consequently the link throughput, compared to the well-studied exponential effective SIR metric (EESM) approach.

The rest of this paper is organized as follows: In Sec. 2, we describe a typical BICM-OFDM wireless communication link and propose an abstract system model along with an ML estimator of the model parameters. Next, in Sec. 3, we summarize the EESM approach for FEP prediction and introduce our deep learning approach. In Sec. 4, we present simulation results for the FEP prediction performance as well as the realized throughput for the two considered approaches, and finally, in Sec. 5, we conclude the paper.

2 BICM-OFDM System and ML Estimation

System Model

We consider a BICM-OFDM link chain similar to the LTE downlink and illustrate its block diagram in Fig. 2.1 [11]. Here, a “transport block” $\mathbf{b} = (b_1, \ldots, b_T)$ of information bits is first encoded by a channel encoder and subsequently bit-interleaved by a random interleaver to generate the bit sequence $\mathbf{l} = (l_1, \ldots, l_L)$. The interleaved bits are then used to generate the “rate-matched” bit sequence $\mathbf{r} = (r_1, \ldots, r_{MSJ})$ according to $r_i = l_{i \bmod L}$, $i = 1, \ldots, MSJ$, where $M$ is the number of transmission subcarriers, $S$ is the number of frame OFDM symbols, and $J$ is the modulation order. The channel code rate is therefore $\mathcal{R} = T/MSJ$. The rate-matched bits are mapped onto $MS$ modulated symbols by a labeling function that assigns one out of $2^J$ complex-valued constellation symbols to each $J$-tuple of bits. Finally, a length-$M$ IFFT operation is applied on each group of $M$ modulated symbols to generate the frame OFDM symbols $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_S)$, which are mapped onto physical resources.
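The rate-matching rule and the resulting code rate can be sketched in a few lines of Python. This is only an illustration of the cyclic rule $r_i = l_{i \bmod L}$, not the LTE rate matcher; the function names `rate_match` and `code_rate` are hypothetical, and the indices are 0-based, so the rule reads `r[i] = l[i % L]`.

```python
def rate_match(interleaved_bits, num_subcarriers, num_symbols, bits_per_symbol):
    """Cyclically repeat (or truncate) the interleaved bits to fill M*S*J slots.

    Assumption: 0-based indexing, i.e. r[i] = l[i % L] for i = 0, ..., M*S*J - 1.
    """
    length = len(interleaved_bits)
    num_out = num_subcarriers * num_symbols * bits_per_symbol
    return [interleaved_bits[i % length] for i in range(num_out)]


def code_rate(num_info_bits, num_subcarriers, num_symbols, bits_per_symbol):
    """Channel code rate R = T / (M * S * J)."""
    return num_info_bits / (num_subcarriers * num_symbols * bits_per_symbol)


# Toy example: L = 5 interleaved bits spread over M*S*J = 4*2*1 = 8 slots.
l = [1, 0, 1, 1, 0]
r = rate_match(l, num_subcarriers=4, num_symbols=2, bits_per_symbol=1)
rate = code_rate(num_info_bits=2, num_subcarriers=4, num_symbols=2, bits_per_symbol=1)
```

When $MSJ > L$ the bits are repeated, which lowers the effective code rate; when $MSJ < L$ the sequence is simply truncated.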

The frame OFDM symbols are transmitted over the physical channel, resulting in the received signal $\mathbf{y}_s = \mathbf{h} \circ \mathbf{x}_s + \mathbf{g}_s$, $s = 1, \ldots, S$, where $\circ$ denotes the Hadamard product, $\mathbf{h}$ is the complex-valued vector of channel coefficients in the frequency domain, and $\mathbf{g}_s \in \mathbb{C}^{M \times 1} \sim \mathcal{N}(0, \sigma^2)$ is i.i.d. noise. We assume that the channel vector remains constant for the frame OFDM symbols (i.e., the channel is block fading). At the receiver, each received OFDM symbol $\mathbf{y}_s$ is multiplied by the elementwise inverse of the estimated frequency-domain channel vector, followed by a length-$M$ FFT operation for OFDM de-multiplexing. The de-multiplexed symbols are then mapped onto soft values through an inverse labeling operation, de-interleaved, and decoded by the channel decoder to generate the reconstructed bit sequence $\hat{\mathbf{b}} = (\hat{b}_1, \ldots, \hat{b}_T)$. We define the binary frame error event at the receiver as

$$e = \begin{cases} 0 & \text{if } \hat{\mathbf{b}} = \mathbf{b} \\ 1 & \text{if } \hat{\mathbf{b}} \neq \mathbf{b} \end{cases}. \tag{2.1}$$

ML Estimation

The BICM-OFDM link chain described above can be approximated as a stochastic non-linear function that generates a frame error event with an unknown probability distribution for the frame channel state and a particular choice of transmission parameters. In Fig. 2.2, we illustrate an abstract model of the BICM-OFDM link chain, which is parameterized by the model parameters $\theta$ and maps the observed channel state characterized by the received per-subcarrier channel SINRs, $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_M)$, to the observed frame error event, i.e.,

$$P_{E_k|\boldsymbol{\gamma}}(e_k | \boldsymbol{\gamma}; \theta) = \rho_k^{e_k} (1 - \rho_k)^{1 - e_k}, \tag{2.2}$$

where $k \in \{1, \ldots, K\}$ denotes the $k$th transmission parameter configuration. Here, $\rho_k = \rho_k(\boldsymbol{\gamma}; \theta) = P_{E_k|\boldsymbol{\gamma}}(E_k = 1 | \boldsymbol{\gamma}; \theta)$ is the conditional frame error probability (FEP). In the rest of this section, we show that the ML estimator of the model parameters asymptotically estimates the true conditional FEPs.

The ML estimator [12] of the model parameters for $n = 1, \ldots, N$ frame realizations is defined as

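$$\begin{align}
\hat{\theta}^{\mathrm{ML}} &= \arg\max_{\hat{\theta}} \sum_{k=1}^{K} \prod_{n=1}^{N} P_{E_k, \boldsymbol{\gamma}}(e_k^n, \boldsymbol{\gamma}^n; \hat{\theta}) \notag \\
&= \arg\max_{\hat{\theta}} \int \sum_{k=1}^{K} \sum_{n=1}^{N} \ln P_{E_k|\boldsymbol{\gamma}}(e_k^n | \boldsymbol{\gamma}^n; \hat{\theta}) \, P(\boldsymbol{\gamma}) \, d\boldsymbol{\gamma} \notag \\
&\triangleq \arg\max_{\hat{\theta}} \mathcal{C}(\hat{\theta}), \quad \text{where} \tag{2.3} \\
\mathcal{C}(\hat{\theta}) &= \sum_{k=1}^{K} \left( \frac{1}{N} \sum_{n=1}^{N} \ln P_{E_k|\boldsymbol{\gamma}}(e_k^n | \boldsymbol{\gamma}^n; \hat{\theta}) \right) \tag{2.4}
\end{align}$$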

is the cost function to be maximized, and we have used the fact that the channel state is independent of the model. In the limit of infinite training samples, it follows by the law of large numbers that

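$$\mathcal{C}(\hat{\theta}) \xrightarrow{N \to \infty} \sum_{k=1}^{K} E\left\{ \ln P(E_k | \boldsymbol{\gamma}; \hat{\theta}) \right\}, \tag{2.5}$$

where $P(E_k | \boldsymbol{\gamma}; \hat{\theta}) \triangleq P_{E_k|\boldsymbol{\gamma}}(E_k | \boldsymbol{\gamma}; \hat{\theta})$ for brevity, and where the expectation is taken over $P(E_k, \boldsymbol{\gamma}; \theta)$. We now subtract and add the true pmf to the r.h.s. of Eq. (2.5) to obtain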

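$$\begin{align}
\mathcal{C}(\hat{\theta}) &= \sum_{k=1}^{K} E\left\{ \ln P(E_k | \boldsymbol{\gamma}; \hat{\theta}) - \ln P(E_k | \boldsymbol{\gamma}; \theta) + \ln P(E_k | \boldsymbol{\gamma}; \theta) \right\} \notag \\
&= -\sum_{k=1}^{K} E\left\{ \ln \frac{P(E_k | \boldsymbol{\gamma}; \theta)}{P(E_k | \boldsymbol{\gamma}; \hat{\theta})} \right\} + \sum_{k=1}^{K} E\left\{ \ln P(E_k | \boldsymbol{\gamma}; \theta) \right\}. \tag{2.6}
\end{align}$$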

Figure 2.2: Abstract model of a BICM-OFDM link chain that maps the observed channel state to the frame error events for k =1,...,K transmission parameter configurations.

The second term in (2.6) is independent of the argument to be maximized. Further, we observe that by multiplying and dividing the first term by the probability distribution $P(\boldsymbol{\gamma})$, we obtain

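$$\begin{align}
\sum_{k=1}^{K} E\left\{ \ln \frac{P(E_k | \boldsymbol{\gamma}; \theta) P(\boldsymbol{\gamma})}{P(E_k | \boldsymbol{\gamma}; \hat{\theta}) P(\boldsymbol{\gamma})} \right\}
&= \sum_{k=1}^{K} E\left\{ \ln \frac{P(E_k, \boldsymbol{\gamma}; \theta)}{P(E_k, \boldsymbol{\gamma}; \hat{\theta})} \right\} \notag \\
&= \sum_{k=1}^{K} \mathrm{KL}\left\{ P(E_k | \boldsymbol{\gamma}; \theta) \,\|\, P(E_k | \boldsymbol{\gamma}; \hat{\theta}) \right\}, \tag{2.7}
\end{align}$$

where $\mathrm{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence (KLD) between the true and estimated pmfs. Given that the KLD is non-negative, and equal to zero if and only if $P(E_k | \boldsymbol{\gamma}; \theta) = P(E_k | \boldsymbol{\gamma}; \hat{\theta})$, it follows that the ML estimator converges to the true FEP distribution in the limit of large $N$. Note however that $P(E_k | \boldsymbol{\gamma}; \theta) = P(E_k | \boldsymbol{\gamma}; \hat{\theta})$ does not necessarily imply that $\hat{\theta}^{\mathrm{ML}} = \theta$, as the ML estimate of $\theta$ may be non-unique.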
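The argument above can be checked numerically in the simplest possible setting. The sketch below assumes a single transmission configuration and a single fixed channel state, so that the model reduces to one Bernoulli parameter; the function names and the value 0.2 for the true FEP are illustrative assumptions. It verifies that the expected log-likelihood equals the negative KLD plus a term that does not depend on the estimate, and that a grid search over the estimate is maximized at the true FEP.

```python
import math


def expected_log_lik(rho_true, rho_hat):
    """E{ln P(E; rho_hat)} with E ~ Bernoulli(rho_true), cf. Eq. (2.5)."""
    return rho_true * math.log(rho_hat) + (1 - rho_true) * math.log(1 - rho_hat)


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


rho_true = 0.2  # assumed true FEP for this toy example
grid = [i / 100 for i in range(1, 100)]  # candidate estimates in (0, 1)

# The maximizer of the asymptotic cost is the true FEP.
rho_best = max(grid, key=lambda q: expected_log_lik(rho_true, q))

# Decomposition of Eq. (2.6): cost = -KL(true || estimate) + constant term.
rho_hat = 0.35
lhs = expected_log_lik(rho_true, rho_hat)
rhs = -kl_bernoulli(rho_true, rho_hat) + expected_log_lik(rho_true, rho_true)
```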

3 FEP Prediction Techniques

Effective SINR Approach

In this subsection we outline the EESM approach, where the channel state characterized by the per-subcarrier SINRs is compressed to a scalar “effective” SINR for an equivalent AWGN channel. The FEP for the $k$th transmission parameter configuration is then predicted to be

$$\hat{\rho}_k^{\mathrm{EESM}}(\boldsymbol{\gamma}) = \rho_k^{\mathrm{AWGN}}(g_k(\boldsymbol{\gamma})), \tag{3.1}$$

$$\text{where} \quad g_k(\boldsymbol{\gamma}) = -\beta_k \log\left( \frac{1}{M} \sum_{m=1}^{M} \exp\left( -\frac{\gamma_m}{\beta_k} \right) \right) \tag{3.2}$$

is the EESM for channel state $\boldsymbol{\gamma}$, and $\beta_k$ is a tunable parameter. For the frequency selective channels commonly observed in practical systems, $g_k(\boldsymbol{\gamma})$ amounts to a lossy compression of the channel state vector, since the original channel state can no longer be recovered. The FEP for the equivalent AWGN channel, $\rho_k^{\mathrm{AWGN}}(g_k(\boldsymbol{\gamma}))$, is obtained by interpolating between several Monte Carlo simulated FEP values for the AWGN channel. The optimal $\beta_k$ minimizes the Euclidean distance between the predicted FEP and observed frame errors for $n = 1, \ldots, N$ training frames, i.e.,

$$\beta_k^{\mathrm{opt}} = \arg\min_{\beta_k} \sum_{n=1}^{N} \left| \hat{\rho}_k^{\mathrm{EESM}, n} - e_k^n \right|^2. \tag{3.3}$$

For large $N$, estimating $\beta_k^{\mathrm{opt}}$ in this manner is equivalent to the traditional approach that minimizes the mean squared cost between the predicted FEPs and the measured FEPs obtained through Monte Carlo simulations for the training frames [8].
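A minimal sketch of Eqs. (3.1) to (3.3) is given below, assuming linear-scale SINRs. The function `awgn_fep` is a hypothetical logistic stand-in for the interpolated Monte Carlo lookup table, and the toy frames, error events, and candidate values for the tunable parameter are invented for illustration; only `eesm` follows the formula in Eq. (3.2) directly.

```python
import math


def eesm(sinrs, beta):
    """Effective SINR of Eq. (3.2) for linear-scale per-subcarrier SINRs."""
    mean_exp = sum(math.exp(-g / beta) for g in sinrs) / len(sinrs)
    return -beta * math.log(mean_exp)


def awgn_fep(snr, threshold=1.0, slope=4.0):
    """HYPOTHETICAL AWGN FEP curve; stands in for the simulated lookup table."""
    return 1.0 / (1.0 + math.exp(slope * (snr - threshold)))


def calibrate_beta(frames, errors, betas):
    """Grid search for the beta that minimizes the squared cost of Eq. (3.3)."""
    def cost(beta):
        return sum((awgn_fep(eesm(g, beta)) - e) ** 2 for g, e in zip(frames, errors))
    return min(betas, key=cost)


# Toy training data: per-frame SINR vectors and observed frame error events.
frames = [[0.5, 4.0], [2.0, 2.0], [0.2, 0.3], [3.0, 3.5]]
errors = [1, 0, 1, 0]
betas = [0.5, 1.0, 2.0, 4.0]
beta_opt = calibrate_beta(frames, errors, betas)

flat = eesm([2.0, 2.0, 2.0], beta=1.5)   # flat channel: effective SINR equals the SINR
selective = eesm([0.5, 4.0], beta=1.0)   # lies between the minimum and the mean SINR
```

For a flat channel the compression is lossless, whereas for a frequency selective channel the effective SINR falls between the worst subcarrier and the arithmetic mean, which is where the choice of the tunable parameter matters.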

Deep Learning Approach

The EESM approach described earlier relies on a scalar approximation of the channel state, which is obtained through a lossy compression and therefore does not guarantee optimality of the corresponding FEP prediction. In this subsection we describe our approach for FEP prediction based on deep neural networks, which discriminatively learns the mapping between the (uncompressed) channel state vector and the corresponding FEPs for multiple transmission parameter configurations.

Neural networks have long been known as a powerful tool for approximating a wide range of highly non-linear functions. However, their acceptance for implementation in practical systems has been limited by an insufficient understanding of the models that they learn from training data. Although a complete understanding of neural networks is still a topic of active research, several recent breakthroughs related to deep neural networks coupled with cheap computational power have led to drastic performance improvements for several challenging problems [13]. In this paper, we consider the fully connected $L$-layered feedforward neural network illustrated in Fig. 3.1, for which the output of the $l$th “hidden layer” with dimension $d_l$ can be described as

$$\boldsymbol{\eta}^{(l)} = \sigma^{(l)}\left( \mathbf{W}^{(l)} \boldsymbol{\eta}^{(l-1)} + \mathbf{b}^{(l)} \right), \tag{3.4}$$

where $\mathbf{W}^{(l)}$ is the trainable $d_{l-1} \times d_l$ weight matrix, $\mathbf{b}^{(l)}$ is the trainable $d_l \times 1$ bias vector, and $\sigma^{(l)}$ is a fixed non-linear “activation” function. By simply substituting the system parameters $\theta$ with neural network weights and biases, we allow the neural network to learn the set of mappings $\rho_k(\boldsymbol{\gamma}; \hat{\theta})$ for $k = 1, \ldots, K$ from data. It has been shown previously that an ordering of the per-subcarrier SINRs can be used to sufficiently parameterize the frame error rate while reducing the training requirements [9]. Therefore in this paper, we use the sorted per-subcarrier SINR vector, $\tilde{\boldsymbol{\gamma}}$, as the input to the network, i.e., $\boldsymbol{\eta}^{(0)} \triangleq \tilde{\boldsymbol{\gamma}} = (\tilde{\gamma}_1, \ldots, \tilde{\gamma}_M)$. The activation function for each of the non-output layers can be any continuously differentiable non-linear function within some practical constraints [13]. For the output layer, we choose the sigmoid activation function $\sigma^{(L)}(x) = 1/(1 + e^{-x})$ and interpret the network outputs as the predicted FEPs, i.e., $\boldsymbol{\eta}^{(L)} \triangleq \hat{\boldsymbol{\rho}}^{\mathrm{NN}} = (\hat{\rho}_1^{\mathrm{NN}}, \ldots, \hat{\rho}_K^{\mathrm{NN}})$. We choose the cross-entropy loss between the network outputs and the frame error events as the cost function, i.e.,

$$\begin{align}
\mathcal{C}(\hat{\theta}) &= -\frac{1}{K} \sum_{k=1}^{K} \left[ e_k \ln \hat{\rho}_k^{\mathrm{NN}} + (1 - e_k) \ln(1 - \hat{\rho}_k^{\mathrm{NN}}) \right] \tag{3.5} \\
&= -\frac{1}{K} \sum_{k=1}^{K} \ln P_{E_k|\boldsymbol{\gamma}}(e_k | \boldsymbol{\gamma}; \hat{\theta}), \tag{3.6}
\end{align}$$

where the latter expression is obtained using Eq. (2.2). From the direct correspondence between Eqs. (2.4) and (3.6), we observe that minimizing the neural network cost function over $n = 1, \ldots, N$ training frames is equivalent to ML estimation of the neural network parameters, which we have shown to estimate the true FEPs. It is crucial to point out that there is no guarantee that the neural network trained using stochastic gradient descent will converge to the ML estimate of the network parameters; however, our simulation results indicate that the neural network does indeed provide good estimates of $\rho_k(\boldsymbol{\gamma}; \hat{\theta})$. Further, the result obtained above is equivalent to the conclusions of a previous analysis of neural

Figure 3.1: Neural network layout that maps the channel state characterized by the sorted per-subcarrier SINR values to the FEP for each of the K transmission parameter configurations.

networks [14]. However, we believe that it is instructive to demonstrate this result in the context of ML estimation as well.
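The forward pass of Eq. (3.4) and the cross-entropy cost of Eq. (3.5) can be sketched without any deep learning library. The code below is an illustration under stated assumptions, not the TensorFlow implementation used for the results: weights are randomly initialized rather than trained, each weight matrix is stored as a list of rows with one row per output unit, and the helper names (`dense`, `predict_fep`, `cross_entropy`, `random_layers`) and the toy dimensions are hypothetical.

```python
import math
import random


def dense(weights, bias, x):
    """Affine map W x + b; each row of `weights` produces one output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(z):
    return [max(0.0, v) for v in z]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def predict_fep(sinrs, layers):
    """Forward pass of Eq. (3.4): sorted SINRs in, K predicted FEPs out."""
    eta = sorted(sinrs)  # the sorted SINR vector is the network input
    for idx, (weights, bias) in enumerate(layers):
        z = dense(weights, bias, eta)
        eta = sigmoid(z) if idx == len(layers) - 1 else relu(z)
    return eta


def cross_entropy(errors, fep_pred, eps=1e-12):
    """Cost of Eq. (3.5); `eps` guards the logarithm (an implementation assumption)."""
    terms = [e * math.log(max(p, eps)) + (1 - e) * math.log(max(1.0 - p, eps))
             for e, p in zip(errors, fep_pred)]
    return -sum(terms) / len(terms)


def random_layers(dims, seed=0):
    """Untrained Gaussian weights and zero biases for layer sizes `dims`."""
    rng = random.Random(seed)
    return [([[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)],
             [0.0] * d_out)
            for d_in, d_out in zip(dims[:-1], dims[1:])]


# Toy setup: M = 8 subcarriers, hidden layers [60, 10, 60], K = 3 code rates.
layers = random_layers([8, 60, 10, 60, 3])
sinrs = [3.0, 0.5, 1.2, 4.1, 2.2, 0.9, 1.7, 2.8]
fep = predict_fep(sinrs, layers)
loss = cross_entropy([0, 0, 1], fep)
```

Because the input is sorted before the first layer, the prediction does not depend on the order in which the subcarrier SINRs are presented.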

FEP Prediction for Throughput Maximization

In this paper, we study the selection of the optimal channel code rate $\mathcal{R}_k$, $k \in \{1, \ldots, K\}$, that maximizes the link throughput over a fading channel. Therefore, for each FEP prediction approach, we select the channel code rate that maximizes the predicted expected throughput in that frame, i.e., $k^{\mathrm{EESM}} = \arg\max_k T_k (1 - \hat{\rho}_k^{\mathrm{EESM}})$ and $k^{\mathrm{NN}} = \arg\max_k T_k (1 - \hat{\rho}_k^{\mathrm{NN}})$, where $T_k = MSJ\mathcal{R}_k$ is the number of transmitted information bits when the $k$th channel code rate is selected. The realized throughput over $N$ evaluated frames is therefore $\mathcal{T}^{\mathrm{EESM}} = \frac{1}{N} \sum_n T^n_{k^{\mathrm{EESM}}} (1 - e^n_{k^{\mathrm{EESM}}})$ and $\mathcal{T}^{\mathrm{NN}} = \frac{1}{N} \sum_n T^n_{k^{\mathrm{NN}}} (1 - e^n_{k^{\mathrm{NN}}})$ respectively, where $e_k^n$ denotes the actual frame error event for the $k$th channel code rate in the $n$th frame.
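The selection rule and the realized throughput can be sketched as follows. The function names, transport block sizes, predicted FEPs, and error events are made-up illustrative values; any FEP predictor (EESM or neural network) could supply `fep_pred`.

```python
def select_rate(fep_pred, tb_sizes):
    """Index k that maximizes the predicted expected throughput T_k * (1 - FEP_k)."""
    return max(range(len(tb_sizes)), key=lambda k: tb_sizes[k] * (1.0 - fep_pred[k]))


def realized_throughput(selections, errors, tb_sizes):
    """Average number of delivered bits per frame.

    selections[n] is the index chosen in frame n and errors[n][k] is the actual
    frame error event for code rate k in frame n.
    """
    delivered = sum(tb_sizes[k] * (1 - errors[n][k]) for n, k in enumerate(selections))
    return delivered / len(selections)


tb_sizes = [100, 200, 300]                    # T_k in bits (illustrative)
fep_per_frame = [[0.0, 0.1, 0.6],             # predicted FEPs, frame 0
                 [0.0, 0.0, 0.1]]             # predicted FEPs, frame 1
errors = [[0, 0, 1],                          # actual error events, frame 0
          [0, 0, 0]]                          # actual error events, frame 1

selections = [select_rate(fep, tb_sizes) for fep in fep_per_frame]
throughput = realized_throughput(selections, errors, tb_sizes)
```

In frame 0 the expected throughputs are 100, 180, and 120 bits, so the middle code rate is chosen even though the largest transport block carries more bits when it succeeds.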

4 Numerical Results

In this section we provide simulation results for the FEP prediction accuracy and the achieved link throughput for our proposed deep learning approach and contrast it with the EESM approach performance. The simulation parameters are listed in Table 4.1. We use open source Python signal processing and communication libraries to foster reproducibility of the demonstrated results described in

this section, and utilize TensorFlow for the neural network implementation and evaluations [15] [16]. We assume perfect knowledge of the channel at the transmitter as well as the receiver, i.e., $\boldsymbol{\gamma} = |\mathbf{h}|^2/\sigma^2$. The neural network comprises 3 hidden layers with dimensions [60, 10, 60] respectively, and each hidden layer employs a Rectified Linear Unit (ReLU) activation function [13]. The training datasets for the EESM and neural network approaches are generated using $10^4$ frames for each channel code rate. The test dataset for throughput maximization comprises $10^3$ realizations for 10 evenly spaced long-term average SINR values in the range $[-10, 20]$ dB.

Table 4.1: Simulation Parameters

| Simulation Parameter | Value |
| --- | --- |
| Channel Model | Extended Pedestrian A [17] |
| Max. Doppler Spread | 3 Hz |
| Number of OFDM Symbols ($S$) | 12 |
| Number of Subcarriers ($M$) | 600 |
| Channel Coding | Turbo + Repetition |
| Channel Code Rates ($\mathcal{R}_k$) | [0.01, 0.02, ..., 0.30] |

We train the neural network parameters iteratively by employing an Adam optimizer [13]. The Root Mean Square Error (RMSE) of the FEP prediction performance versus the number of neural network training steps is shown in Fig. 4.1. We observe that the neural network learns to improve the FEP prediction by iteratively training its parameters, and outperforms the EESM approach after a few iterations. The throughput performance with the EESM and our deep learning approach is shown in Fig. 4.2. We observe that our approach increases the throughput compared to the EESM approach.

5 Conclusions

In this paper, we have proposed a deep learning approach for FEP prediction that learns the mapping between the frame channel state characterized by the per-subcarrier SINRs and the FEPs for arbitrary transmission parameter configurations. Further, by utilizing a training scheme that relies only on the observed channel state and the binary frame error events, our approach is shown to improve the FEP prediction accuracy and consequently increase the link throughput.

Figure 4.1: RMSE of the FEP prediction error for EESM (blue, solid) and deep learning (red, solid) approaches for the simulation setup in Table 4.1. The neural network iteratively learns model parameters that improve its FEP prediction accuracy compared to the EESM approach.

Figure 4.2: Link throughput for EESM (blue, solid) and deep learning (red, solid) based rate selection that maximizes the expected throughput in each frame. The upper bound “Genie” curve (black, solid) is obtained by simulating each channel code rate for each frame and picking the largest transport block that is successful. The lower bound “Fixed Code Rate” curves (gray, dashed) are obtained by fixing the channel code rate over the entire test dataset.

References

[1] A. Molisch, Wireless Communications. Wiley, 2010.

[2] G. Caire, G. Taricco, and E. Biglieri, “Bit-interleaved coded modulation,” IEEE Transactions on Information Theory, vol. 44, pp. 927–946, May 1998.

[3] D. Agrawal, V. Tarokh, A. Naguib, and N. Seshadri, “Space-time coded OFDM for high data-rate wireless communication over wideband channels,” in Vehicular Technology Conference, 1998. VTC 98. 48th IEEE, vol. 3, pp. 2232–2236, May 1998.

[4] A. Zaidi, R. Baldemair, M. Andersson, S. Faxér, V.-M. Casés, and Z. Wang, “Designing for the future: the 5G NR physical layer,” Ericsson Technology Review, July 2017.

[5] S. Nanda and K. M. Rege, “Frame error rates for convolutional codes on fading channels and the concept of effective Eb/N0,” IEEE Transactions on Vehicular Technology, vol. 47, pp. 1245–1250, Nov 1998.

[6] Y. W. Blankenship, P. J. Sartori, B. K. Classon, V. Desai, and K. L. Baum, “Link error prediction methods for multicarrier systems,” in IEEE 60th Vehicular Technology Conference, 2004. VTC2004-Fall. 2004, vol. 6, pp. 4175–4179, Sept 2004.

[7] Ericsson, “Effective SNR mapping for modelling frame error rates in multiple-state channels,” Tech. Rep. 3GPP2-C30-20030429-010, April 2003.

[8] K. Brueninghaus, D. Astely, T. Salzer, S. Visuri, A. Alexiou, S. Karger, and G. A. Seraji, “Link performance models for system level simulations of broadband radio access systems,” in 2005 IEEE 16th International Symposium on Personal, Indoor and Mobile Radio Communications, vol. 4, pp. 2306–2311, Sept 2005.

[9] R. C. Daniels, C. M. Caramanis, and R. W. Heath, “Adaptation in Convolutionally Coded MIMO-OFDM Wireless Systems Through Supervised Learning and SNR Ordering,” IEEE Transactions on Vehicular Technology, vol. 59, pp. 114–126, Jan 2010.

[10] S. Yun and C. Caramanis, “Multiclass support vector machines for adaptation in MIMO-OFDM wireless systems,” in 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1145–1152, Sept 2009.

[11] 3rd Generation Partnership Project, “Evolved Universal Terrestrial Radio Access (E-UTRA); Physical layer procedures,” Tech. Rep. 36.213 v12.3.0, Sept. 2016.

[12] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall PTR, 1993.

[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[14] M. D. Richard and R. P. Lippmann, “Neural Network Classifiers Estimate Bayesian a Posteriori Probabilities,” Neural Computation, vol. 3, no. 4, pp. 461–483, 1991.

[15] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.

[16] V. Saxena, “py-itpp.” https://github.com/vidits-kth/py-itpp, 2020.

[17] 3rd Generation Partnership Project, “Evolved Universal Terrestrial Radio Access (E-UTRA); Base Station (BS) radio transmission and reception,” Tech. Rep. 36.104 v12.3.0, Sept. 2016.

Paper II


A Learning Approach for Optimal Codebook Selection in Spatial Modulation Systems

In spatial modulation (SM) systems that utilize multiple transmit antennas/patterns with a single radio front-end, we propose a learning approach to predict the average symbol error rate (SER) conditioned on the instantaneous channel state. We show that the predicted SER can be used to lower the average SER over Rayleigh fading channels by selecting the optimal codebook in each transmission instance. Further, by exploiting that feedforward artificial neural networks (ANNs) trained with a mean squared error (MSE) criterion estimate the conditional a posteriori probabilities, we maximize the expected rate for each transmission instance and thereby improve the link spectral efficiency.

© 2018 IEEE. Reprinted, with permission, from V. Saxena, B. Cavarec, J. Jaldén, M. Bengtsson, and H. Tullberg, “A learning approach for optimal codebook selection in spatial modulation systems,” 52nd Asilomar Conference on Signals, Systems, and Computers, 2018.

1 Introduction

Multi-antenna wireless communication systems provide significant capacity gains by exploiting the available spatial degrees of freedom in rich scattering environments. Spatial modulation (SM) [1], where information is encoded onto the choice of transmitting antenna in addition to the transmitted symbols, provides a practical means for multi-antenna gains in transmitters constrained by a single radio front-end, e.g., low-complexity devices in an Internet of Things (IoT) framework. However, since the choice of transmitting antenna in SM depends on the information bit stream, it is difficult to predict the optimal transmission parameters in stochastic channels [2]. Recently, an SM codebook selection scheme that relies on full or quantized channel state fed back to the transmitter has been proposed [3], where a heuristic-based approach is used to approximate the optimal SM codebook containing the (antenna, symbol) pairs. On the other hand, machine learning approaches where the receiver utilizes the channel estimate to select one out of a finite set of pre-defined codebooks based on a k-nearest neighbor scheme have also been proposed [4,5]. Although these schemes limit the feedback to a few bits indicating the choice of codebook, they suffer from a high computational complexity unsuitable for real-time execution and do not provide an estimate of the codebook error probability. In this paper, we propose a deep learning approach that predicts the average symbol error probability for each codebook conditioned on the channel state, relying on the fact that sufficiently parameterized feedforward artificial neural networks (ANNs) are good Bayesian a posteriori estimators [6], and show that sufficiently trained ANNs lower the symbol error rate for single- as well as multi-antenna SM receivers.
Further, we exploit the predicted symbol error probability to select the codebook that maximizes the expected rate in each transmission instance, and demonstrate that our approach improves the spectral efficiency of the link.

2 System Model

We assume $N_t$ antennas at the transmitter and $N_r$ antennas at the receiver in a frequency flat communication channel, and that the transmitter is equipped with a single radio front-end as shown in Fig. 2.1. The $N_r \times 1$ received signal is given by

$$\mathbf{y} = \sqrt{\rho}\, \alpha \mathbf{H} \mathbf{x}_i + \mathbf{w}, \tag{2.1}$$

where $\rho$ is the average received signal-to-noise ratio (SNR), $\alpha \in \Omega_i$ is the transmitted symbol and $i$ is the transmitting antenna, $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_P]$ is the $N_t \times N_r$ channel matrix, $\mathbf{x}_i = [\delta(i, 1), \ldots, \delta(i, N_t)]^T$ is the antenna selection vector, and $\mathbf{w} \in \mathbb{C}^{N_r}$ is the additive white Gaussian noise (AWGN) with variance $\sigma^2$. The SM receiver estimates the (antenna, symbol) combination, $(\tilde{i}, \tilde{\alpha})$, that minimizes the Euclidean distance $\|\mathbf{y} - \tilde{\alpha} \mathbf{H} \mathbf{x}_{\tilde{i}}\|$. A successful symbol transmission occurs when $(\tilde{i}, \tilde{\alpha}) = (i, \alpha)$ and is denoted by $s = 1$. The problem of codebook selection is to find the codebook $\mathcal{C} : \{\Omega_1, \ldots, \Omega_{N_t}\}$ that satisfies the given optimality criteria, often related to the symbol error rate (SER). In this paper, we restrict $\Omega_i$ to $M$-PSK symbols, where $M \in \{0, 2, 4, \cdots\}$ and $M = 0$ implies that the antenna will not transmit.
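The minimum-distance detection rule can be sketched with plain Python complex numbers. The layout is an assumption made for the example: the channel is stored as `H[r][i]`, the gain from transmit antenna `i` to receive antenna `r`, and the SNR scaling is taken to be absorbed into the channel gains. The names `psk` and `sm_detect` and the numerical channel values are hypothetical.

```python
import cmath
import math


def psk(order):
    """Unit-energy PSK constellation; order 0 means the antenna is silent."""
    return [cmath.exp(2j * math.pi * m / order) for m in range(order)]


def sm_detect(y, H, codebook):
    """Exhaustive search for the (antenna, symbol) pair closest to y.

    Assumptions: H[r][i] is the gain from transmit antenna i to receive
    antenna r, and codebook[i] lists the symbols allowed on antenna i.
    """
    best = None
    for i, symbols in enumerate(codebook):
        for alpha in symbols:
            dist = sum(abs(y[r] - alpha * H[r][i]) ** 2 for r in range(len(y)))
            if best is None or dist < best[0]:
                best = (dist, i, alpha)
    return best[1], best[2]


# Toy 2x2 link: antenna 0 uses QPSK and antenna 1 uses BPSK.
H = [[1.0 + 0.5j, 0.3 - 0.2j],
     [-0.4j, 0.8 + 0.1j]]
codebook = [psk(4), psk(2)]

tx_antenna, tx_symbol = 1, codebook[1][1]
y = [tx_symbol * H[r][tx_antenna] for r in range(2)]  # noise-free reception
rx_antenna, rx_symbol = sm_detect(y, H, codebook)
```

The search cost grows with the total number of (antenna, symbol) pairs in the codebook, i.e. with the codebook cardinality.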

Figure 2.1: SM based on the choice of one transmit antenna and one symbol from a codebook. The receiver estimates the optimal codebook based on the instantaneous channel state and feeds it back to the transmitter.

3 Conditional Probability Estimation Using Feedforward ANNs

ANNs are parameterized models for approximating non-linear mappings in a high- dimensional space. A fully connected L-layer feedforward ANN has the activation

$$\boldsymbol{\eta}^{(l)} = \sigma^{(l)}\left( \mathbf{W}^{(l)} \boldsymbol{\eta}^{(l-1)} + \mathbf{b}^{(l)} \right) \tag{3.1}$$

at the $l$th layer, where $\mathbf{W}$ and $\mathbf{b}$ are the trainable weight and bias parameters at the $l$th hidden layer, $l \in \{1, \ldots, L\}$, and $\sigma$ is an element-wise nonlinear function. We denote the ANN parameters by $\theta$ and its output by $f(\boldsymbol{\gamma}; \theta)$, where $\boldsymbol{\gamma}$ is the observed channel state. The optimal ANN parameters that minimize the MSE loss between the outputs and the observed binary success variable for several symbol transmissions using an arbitrary codebook are

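$$\begin{align}
\theta^{\mathrm{opt}} &= \arg\min_{\theta} \mathcal{L}(\theta) \tag{3.2} \\
&= \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \left( s_n - f(\boldsymbol{\gamma}_n; \theta) \right)^2 \tag{3.3} \\
\theta^{\mathrm{opt}} &\xrightarrow{N \to \infty} \arg\min_{\theta} E\left\{ (s - f(\boldsymbol{\gamma}; \theta))^2 \right\}, \tag{3.4}
\end{align}$$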

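[Figure: feedforward ANN with $N_t \cdot N_r$ inputs and outputs $P_1(s | \tilde{\boldsymbol{\gamma}}), P_2(s | \tilde{\boldsymbol{\gamma}}), \ldots, P_J(s | \tilde{\boldsymbol{\gamma}})$.]

Figure 3.1: NN for estimation of the error probability induced by SNRs $\{\tilde{\gamma}_i\}$ for each transmission mode $m \in \{1, \cdots, M\}$.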

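where the expectation is taken over $P_{S, \boldsymbol{\gamma}}(s, \boldsymbol{\gamma})$. Following the general approach suggested in [6], we expand the expectation to obtain

$$\begin{align}
\mathcal{L}(\theta) &\approx \int \sum_{s \in \{0,1\}} (s - f(\boldsymbol{\gamma}; \theta))^2 P_{S, \boldsymbol{\gamma}}(s, \boldsymbol{\gamma}) \, d\boldsymbol{\gamma} \tag{3.5} \\
&= \int \sum_{s' \in \{0,1\}} \sum_{s \in \{0,1\}} (s - f(\boldsymbol{\gamma}; \theta))^2 P_{S|\boldsymbol{\gamma}}(s | \boldsymbol{\gamma}) P_{S', \boldsymbol{\gamma}}(s', \boldsymbol{\gamma}) \, d\boldsymbol{\gamma} \tag{3.6} \\
&= E\left\{ f^2(\boldsymbol{\gamma}; \theta) - 2 f(\boldsymbol{\gamma}; \theta) E\{s | \boldsymbol{\gamma}\} + E\{s^2 | \boldsymbol{\gamma}\} \right\} \tag{3.7} \\
&= E\left\{ (f(\boldsymbol{\gamma}; \theta) - E\{s | \boldsymbol{\gamma}\})^2 \right\} + E\left\{ \mathrm{var}\{s | \boldsymbol{\gamma}\} \right\}, \tag{3.8}
\end{align}$$

where we obtain Eq. 3.7 by noting that $f(\boldsymbol{\gamma}; \theta)$ is independent of the channel. We observe from Eq. 3.8 that the loss is minimized when the ANN outputs estimate the conditional expectation of successful symbol transmission.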
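The decomposition in Eq. 3.8 can be checked numerically for a single, fixed channel state. In the sketch below, the success probability 0.7 and the function names are illustrative assumptions: for a binary success variable with that conditional mean, the expected squared error of a constant output splits into a squared bias term plus the conditional variance, and a grid search over the output is minimized at the conditional mean.

```python
def expected_mse(p_success, output):
    """E{(s - f)^2} for s ~ Bernoulli(p_success) and a fixed network output f."""
    return p_success * (1.0 - output) ** 2 + (1.0 - p_success) * output ** 2


def bias_plus_variance(p_success, output):
    """Right-hand side of Eq. 3.8 for one channel state: (f - E{s})^2 + var{s}."""
    return (output - p_success) ** 2 + p_success * (1.0 - p_success)


p_success = 0.7  # assumed conditional success probability E{s | channel state}
grid = [i / 100 for i in range(101)]  # candidate network outputs in [0, 1]

# The MSE-optimal output is the conditional expectation itself.
f_best = min(grid, key=lambda f: expected_mse(p_success, f))

# Both sides of the decomposition agree for an arbitrary output.
lhs = expected_mse(p_success, 0.4)
rhs = bias_plus_variance(p_success, 0.4)
```

The variance term is the irreducible part of the loss: even a perfectly trained network cannot drive the MSE below it.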

4 Optimal Codebook Selection

In this paper, we assume that there exists a finite set of available candidate codebooks, $\{\mathcal{C}_j\}_{j \in \{1, \ldots, J\}}$, where the cardinality $|\mathcal{C}_j|$ of each codebook corresponds to a maximum spectral efficiency of $\log_2 |\mathcal{C}_j|$ bits/s/Hz. We utilize the ANN structure described in the previous section to predict the conditional expectation of symbol success for each candidate codebook, conditioned on the instantaneous channel state. When the optimality criterion is defined as minimizing the SER at a fixed rate (e.g., in low-latency applications), we select the codebook index with the maximum predicted conditional expectation of success,

$$\tilde{j} = \arg\max_j f_j(\boldsymbol{\gamma}; \theta). \tag{4.1}$$

Further, when the optimality criterion is defined as maximizing the spectral efficiency of the link, we select the codebook index that maximizes the expected

Table 5.1: Simulation Parameters

| Parameter | Value |
| --- | --- |
| $N_t$ | 4 |
| $N_r$ | 1 (Fig. 5.1), 4 (Fig. 5.2-5.3) |
| Channel fading | i.i.d. $\mathcal{CN}(0, \mathbf{I}_8)$ |
| Possible constellation sizes | $\{0, 2, 4, 8, 16, 32\}$ |
| Bits per channel use | 4 |

Table 5.2: Candidate Codebooks

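| Codebook Index | $[\lvert\Omega_0\rvert, \lvert\Omega_1\rvert, \lvert\Omega_2\rvert, \lvert\Omega_3\rvert]$ |
| --- | --- |
| 0 | [16, 0, 0, 0] |
| 1 | [8, 8, 0, 0] |
| 2 | [8, 4, 4, 0] |
| 3 | [8, 4, 2, 2] |
| 4 | [4, 4, 4, 4] |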

rate,

$$\tilde{j} = \arg\max_j f_j(\boldsymbol{\gamma}; \theta) \log_2 |\mathcal{C}_j|. \tag{4.2}$$
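The two selection rules in Eqs. (4.1) and (4.2) differ only in whether the predicted success probability is weighted by the codebook's maximum spectral efficiency. A minimal sketch, with hypothetical function names and made-up ANN outputs and cardinalities:

```python
import math


def select_min_ser(success_probs):
    """Eq. (4.1): codebook index with the largest predicted success probability."""
    return max(range(len(success_probs)), key=lambda j: success_probs[j])


def select_max_rate(success_probs, cardinalities):
    """Eq. (4.2): codebook index with the largest expected rate in bits/s/Hz."""
    return max(range(len(success_probs)),
               key=lambda j: success_probs[j] * math.log2(cardinalities[j]))


# Illustrative ANN outputs f_j for three candidate codebooks of sizes 2, 16, 64.
success_probs = [0.99, 0.90, 0.50]
cardinalities = [2, 16, 64]

j_ser = select_min_ser(success_probs)
j_rate = select_max_rate(success_probs, cardinalities)
expected_rates = [p * math.log2(c) for p, c in zip(success_probs, cardinalities)]
```

Here the smallest codebook is the most reliable, but the 16-entry codebook offers the highest expected rate (3.6 bits/s/Hz against 0.99 and 3.0), so the two criteria pick different indices.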

5 Results

The SM link parameters are provided in Table 5.1. We characterize the channel state by a vector containing the received SNR for each transmit-receive antenna pair, $\|h_{ij}\|^2/\sigma^2$. We assume perfect knowledge of the channel at the receiver, and assume that the correlation coefficient for transmit as well as receive antennas is $r = 0.9$. The SERs for the $4 \times 1$ and $4 \times 4$ SM links are shown in Fig. 5.1 and Fig. 5.2 respectively, where we observe that the ANN lowers the SER in both cases compared to a fixed codebook selection. Candidate codebooks for this simulation are given in Table 5.2, which have a fixed cardinality $|\mathcal{C}_j| = 16$. In Fig. 5.2, we assume that the channel between each transmit and receive antenna suffers from i.i.d. Rayleigh fading and a deterministic antenna correlation $r^{|i' - j'|}$, where $i', j' \in [N_t]$ and $i', j' \in [N_r]$ at the transmitter and the receiver respectively.

In Fig. 5.3, we plot the achieved rates for fixed codebooks as well as ANN based codebook selection. The space of candidate codebooks in this case consists of codebooks with cardinalities $|\mathcal{C}_j| \in \{2, 4, 8, 16, 32, 64\}$. The codebooks with smaller cardinalities achieve a lower symbol error rate, while codebooks with higher cardinalities provide a higher maximum spectral efficiency. We observe that the ANN based approach is able to exploit these complementary gains by predicting the expected rate for each codebook at each transmission instance, resulting in significant spectral efficiency gains over fixed codebooks.

Figure 5.1: SER for SM over a $4 \times 1$ channel, with the codebooks defined in Table 5.2. The ANN approach predicts the optimal codebook in each transmission instance to provide up to 4 dB gain in the average SER.

Figure 5.2: SER for SM over a $4 \times 4$ channel, $r = 0.9$, with the codebooks defined in Table 5.2. The ANN approach predicts the optimal codebook in each transmission instance to provide up to 2 dB gain in the average SER.

Figure 5.3: Spectral efficiency for $N_t \times N_r = \{4 \times 4\}$, $r = 0.9$, with ANN (black, solid) based rate selection that maximizes the expected rate for each transmission. The “Fixed Codebook” curves (multiple colors, dashed) are obtained by fixing the codebook over the entire test dataset.

References

[1] M. Di Renzo, H. Haas, A. Ghrayeb, S. Sugiura, and L. Hanzo, “Spatial Modulation for Generalized MIMO: Challenges, Opportunities and Implementation,” 2013.

[2] P. Henarejos, A. Perez-Neira, A. Tato, and C. Mosquera, “Channel dependent mutual information in index ,” International Conference on Acoustics Speech and Signal Processing ICASSP 2018.

[3] B. Cavarec and M. Bengtsson, “Channel dependent codebook design in spatial modulation,” International Conference on Acoustics Speech and Signal Processing ICASSP 2018.

[4] P. Yang, Y. Xiao, L. Li, Q. Tang, Y. Yu, and S. Li, “Link adaptation for spatial modulation with limited feedback,” IEEE Transactions on Vehicular Technology, vol. 61, pp. 3808–3813, Oct 2012.

[5] P. Yang, M. D. Renzo, Y. Xiao, S. Li, and L. Hanzo, “Design guidelines for spatial modulation,” IEEE Communications Surveys Tutorials, vol. 17, pp. 6–26, Firstquarter 2015.

[6] M. D. Richard and R. P. Lippmann, “Neural network classifiers estimate bayesian a posteriori probabilities,” Neural Computation, vol. 3, no. 4, pp. 461–483, 1991.

Paper III


Contextual Multi-Armed Bandits for Link Adaptation in Wireless Communication Systems

Wireless communication systems dynamically adapt the transmission parameters for each communication link to optimize link and system throughputs and error rates. This so-called link adaptation occurs in every transmission time interval (TTI) and relies on explicit estimates of the wireless channel state as well as the channel-dependent acknowledgements (ACKs) observed for transmissions in the previous few TTIs. The state-of-the-art outer loop link adaptation (OLLA) techniques rely on an offline wireless link model whose parameters are adjusted online to meet a heuristically-defined target error rate for each link. However, OLLA is known to suffer from slow convergence for low target error rates and additionally does not strive to directly optimize for the link throughput. In this paper, we propose a contextual multi-armed bandit (CMAB) approach to learn the link characteristics online, where the context is provided by the channel state observations and each arm of the CMAB models a unique transmitter parameter configuration. We show that, by learning from multiple wireless links, our approach converges faster than OLLA for a low target error rate of 0.1 and consequently provides a rate gain of up to 40% over OLLA in the initial few TTIs. Further, we show that by exploiting the ACK probability predicted by each arm of the CMAB for a given channel state, our approach increases the average link throughput by up to 25% for a high BLER target of 0.5, and lowers the realized error rate (and therefore the number of retransmissions) by 15% for a moderate BLER target of 0.3.

© 2019 Association for Computing Machinery. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

1 Introduction

Transmitting information over a wireless channel is a complex phenomenon. The channel capacity is determined by the signal attenuation and baseband phase rotation, which in turn depend on a variety of parameters, including the signal’s frequency, the distance between the transmitter and the receiver, and the nearby objects (scatterers) that interact with the electromagnetic waves in the intervening period between transmission and reception. Furthermore, any of the transmitter, the receiver, or the scatterers might be mobile. As a result, the response of the wireless channel will vary over time [1], which further increases the complexity.

To efficiently utilize the wireless channel, a wireless communication system needs to navigate this complex configuration space, and do it quickly. In particular, it must adapt the transmission parameters for each communication link in real time based on the instantaneous or the predicted future state of the channel. This process is called link adaptation. Link adaptation is an area of active research, with substantial work in cellular networks [2, 3], wireless local area networks [4], as well as personal area networks [5].

A typical link adaptation scheme is composed of two control loops: an inner loop and an outer loop. With the inner loop [6], the adaptation takes place over short time intervals, called transmission time intervals (TTIs). The receiver estimates the instantaneous channel response from pilot signals that the transmitter sends periodically in the current TTI, and these estimates are used to set the transmission parameters for the next TTI. The optimal transmission parameters are defined by application-specific optimality criteria and are typically defined in terms of performance metrics such as error rates and throughput.
Unfortunately, inner loop link adaptation is not always accurate, as it is negatively affected by the channel state quantization (which is done to reduce the communication overhead), inaccurate model-based predictions given the estimated channel state, and the randomly varying interference. Therefore, to improve the link adaptation performance, wireless cellular systems employ an additional outer loop link adaptation (OLLA) that attempts to compensate for the differences between the predicted and the actual link performance. OLLA observes the link behavior over a longer time scale, i.e., over several TTIs, and dynamically adjusts the inner loop link adaptation parameters [7].

While OLLA improves link adaptation performance, it has two significant shortcomings. First, existing OLLA techniques aim to achieve a target block error rate (BLER), defined as the ratio of unsuccessful transmissions to the total number of transmissions over a time window, so as to maximize throughput or other link performance metrics. The problem is that the optimal target BLER for future TTIs is not known in advance. As such, OLLA techniques employ heuristics to approximate the target BLER [8]. One such heuristic estimates the BLER value from the short-term average signal-to-interference-and-noise ratio (SINR) [9]. Second, OLLA has been shown to suffer from slow convergence, which significantly degrades the performance of 4G cellular networks [10].

To improve OLLA's convergence rate, previous solutions have proposed various heuristics that leverage sequential hypothesis testing [11, 12]. Furthermore, a recent solution employs a contextual multi-armed bandit (CMAB) approach, where the CMAB arms correspond to a discrete set of parameter adjustments applied to the inner loop link adaptation [13]. Although this may improve the performance by converging faster to a heuristically defined target BLER, it does not account for the sub-optimality of the target BLER value.

In this paper, we propose a sequential approach to learn the transmission parameter configuration for each TTI, given the observed channel state, so as to maximize the link throughput under the target BLER constraint. The conditional success probability for a given transmission parameter configuration is not known a priori; our approach therefore estimates these probabilities by randomly exploring the space of available configurations under time-varying wireless channel responses. We formulate our approach as a contextual multi-armed bandit (CMAB), where the context is provided by the channel state observations and each arm of the CMAB models a unique transmission parameter configuration. By learning the error probability for each arm of the CMAB through sequential exploration and exploitation over a time-varying wireless channel, our solution is able to find the optimal BLER for each TTI, thereby reducing the realized BLER by up to 15% for a target commonly used in mobile broadband communication and increasing the throughput by 25% for high target BLERs. Additionally, our approach provides rate gains as high as 40% by exploiting faster convergence for low BLER targets.
We note that in the context of cognitive wireless networks, multi-armed bandits have been proposed to select one out of several available (channel, rate) combinations during each TTI to maximize the throughput [14]. Our approach differs from this earlier work in two important aspects. First, we select transmission parameters in the context of a single, time-correlated wireless channel, in contrast to the context-free selection of independent channels in the earlier work. Second, our approach is constrained by a target BLER that places restrictions on the space of valid actions in each TTI, whereas the earlier work aims to maximize the throughput without any error rate constraints.

The rest of this paper is organized as follows. In Sec. 2, we describe the wireless link model studied in this paper and discuss link adaptation in the downlink of wireless cellular networks, including OLLA. In Sec. 3, we describe our proposed CMAB approach. In Sec. 4, we provide numerical evaluation results for the OLLA and CMAB approaches and discuss the relative merits of our approach. Finally, in Sec. 5, we summarize the paper and outline the key takeaways.

Notation: We denote vectors with bold lowercase letters and matrices with uppercase letters. Further, we use calligraphic uppercase letters to denote probability distributions as well as sets, where the usage will be clear from the context. Finally, we denote estimated values with a hat, $\hat{(\cdot)}$.

2 Wireless Link Model

In this section we describe the wireless cellular link model, where a variable number of data bits are processed in each TTI by first encoding them for reliable reception and subsequently modulating them onto complex symbols for transmission over the air. We further discuss the classical approach for optimizing the link performance by adapting the modulation and coding scheme (MCS) in each TTI. This inner loop link adaptation approach, also known as adaptive modulation and coding (AMC), forms an essential part of most modern wireless networks.

In each TTI the choice of MCS, $m \in \mathcal{M}$, determines the number of data bits $D(m)$ that are packed into a transport block. The sequence of $D(m)$ data bits is mapped into a sequence of $N(m)$ coded bits by an error-protecting channel code (coding). The coded bits are then mapped onto data symbols from a symbol alphabet, and transmitted over the air. Assuming that $\Delta f$ and $\Delta t$ are the bandwidth and TTI duration, respectively, the (instantaneous) rate of the transmission is given by

$$R(m) \triangleq \frac{1}{\Delta t \, \Delta f} D(m) \quad \text{[bits / s / Hz]}, \tag{2.1}$$

i.e., the number of information bits transmitted in the TTI normalized by the resources utilized to do so. The specific channel code used (type and rate) and the symbol alphabet (e.g., QPSK, 8-PSK, or 16-QAM) are determined by the MCS $m$, thus allowing for adaptive control of the rate and error protection of the transmission. Known, deterministic pilot symbols are periodically inserted into the transmitted signal to allow the receiver to estimate the frequency- and time-dependent quality of the wireless channel. The receiver attempts to recover the transmitted bits, and reports back to the transmitter a positive acknowledgement (ACK), $c = 1$, in the event the transport block is received correctly (usually validated by a cyclic redundancy check (CRC) embedded in the transport block), and a negative ACK (NACK), $c = 0$, otherwise. Viewing $c = c(m)$ as a family of stochastic variables, indexed by $m$, with distributions induced by the random effects of the channel attenuation, noise, and interference, yields the block error probability $P[c(m) = 0]$ and the expected throughput $R(m) \, P[c(m) = 1]$ for a chosen MCS $m$.

Link Adaptation

The optimal MCS depends on the instantaneous state of the wireless channel, where optimality is typically defined in terms of maximizing the throughput under a target BLER. In the downlink of wireless cellular networks, the user equipment (UE) estimates the channel state, which needs to be fed back to the base station (BS) for selecting the downlink MCS. In practical networks that need to limit control signaling overhead, the UE reports a suitably quantized representation of the observed channel state, known as the channel quality index (CQI) [15], that is exploited by the BS to select transmission parameters in future TTIs. This scheme for downlink link adaptation is illustrated in Fig. 2.1.

Algorithm 2.1 Outer Loop Link Adaptation.
    Initialize: $\Delta_{\mathrm{OLLA}}$, $\Delta_{\mathrm{up}}$, $\Delta_{\mathrm{down}}$
    while True do
        Input: ACK $c$
        if $c = 1$ then
            $\Delta_{\mathrm{OLLA}} \leftarrow \Delta_{\mathrm{OLLA}} + \Delta_{\mathrm{up}}$
        else
            $\Delta_{\mathrm{OLLA}} \leftarrow \Delta_{\mathrm{OLLA}} - \Delta_{\mathrm{down}}$
        end if
    end while

Based on the CQI value $q \in \{1, \ldots, Q\}$, the BS estimates the signal-to-interference-and-noise ratio (SINR) $\hat{\gamma}(q)$ at the UE using a lookup table generated offline through simulations and/or historical network data. Next, the BS estimates the conditional error probability for each MCS given the estimated SINR through various modeling techniques [16]. In the case of AMC, the BS then selects the optimal MCS that is predicted to maximize the throughput under an application-specific target BLER constraint.
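The inner-loop (AMC) selection step, in which the BS maps a CQI to an SINR estimate, predicts the BLER of each MCS, and picks the largest MCS that meets the target, can be sketched as follows. This is only an illustration under stated assumptions: the CQI-to-SINR table, the logistic BLER model, the per-MCS thresholds, and the transport block sizes are hypothetical placeholder values and are not taken from the paper or from any standard.

```python
import math

# Hypothetical lookup table: CQI index -> estimated SINR in dB (placeholder values).
CQI_TO_SINR_DB = {1: -4.0, 2: 0.0, 3: 4.0, 4: 8.0, 5: 12.0}

# Hypothetical MCS table: (data bits D(m), SINR in dB at which the BLER is 50%).
MCS_TABLE = [(100, -5.0), (200, -1.0), (400, 3.0), (800, 7.0), (1600, 11.0)]


def predicted_bler(sinr_db, threshold_db, slope=2.0):
    """Assumed logistic link model: BLER falls from 1 to 0 around the threshold."""
    return 1.0 / (1.0 + math.exp(slope * (sinr_db - threshold_db)))


def select_mcs(cqi, target_bler, offset_db=0.0):
    """Return the MCS index with the most data bits whose predicted BLER is
    below the target; fall back to the most robust MCS (index 0) otherwise."""
    sinr_db = CQI_TO_SINR_DB[cqi] + offset_db  # offset_db plays the role of an outer-loop offset
    feasible = [m for m, (_, threshold_db) in enumerate(MCS_TABLE)
                if predicted_bler(sinr_db, threshold_db) < target_bler]
    if not feasible:
        return 0
    return max(feasible, key=lambda m: MCS_TABLE[m][0])
```

With these placeholder tables, relaxing the target BLER or adding a positive SINR offset makes the selection more aggressive, which is the lever that the outer loop uses.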

Figure 2.1: Each active UE transmits periodic CQI reports containing quantized estimates of the downlink channel state to the BS. These CQI reports are used for selecting the MCS for future downlink transmissions.

The SINR estimates based on the CQI reports are often inaccurate on account of channel state quantization at the UE and systematic modeling errors in the CQI-SINR and SINR-MCS lookup tables at the BS. Further, the time delay between generating the CQI and the eventual downlink transmission leads to additional errors in the predicted BLER for each MCS. To compensate for these errors, an outer loop SINR offset $\Delta_{\mathrm{OLLA}}$ is applied to select the optimal MCS in each TTI,

$$m_{\mathrm{opt}} = \arg\max_{m \in \mathcal{M}} D(m), \quad \text{s.t. } P[c(m) = 0 \mid \gamma] < \tau, \tag{2.2}$$

where $\gamma = \hat{\gamma}(q) + \Delta_{\mathrm{OLLA}}$ is the adjusted SINR estimate, $\tau$ is the target BLER, $c(m)$ denotes the transmission ACK event for the MCS $m$, and optimality is defined as the largest MCS that satisfies the target BLER constraint. Additionally, to mitigate inaccuracies associated with inner loop link adaptation, supervised learning based approaches that map the channel state to the optimal MCS have been proposed in the past [17, 18].

In the classical approach for OLLA (outlined in Alg. 2.1), the SINR offset $\Delta_{\mathrm{OLLA}}$ is adapted by incrementally increasing and decreasing the offset with fixed step sizes depending on the received ACK or NACK for the transmission in the $i$th TTI,

$$\Delta_{\mathrm{OLLA}}^{(i+1)} = \begin{cases} \Delta_{\mathrm{OLLA}}^{(i)} + \Delta_{\mathrm{up}}, & c_i = 1 \\ \Delta_{\mathrm{OLLA}}^{(i)} - \Delta_{\mathrm{down}}, & c_i = 0 \end{cases}. \tag{2.3}$$

The ratio of OLLA step sizes is configured to approximate the target BLER over several transmissions according to [19]

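$$\frac{\Delta_{\mathrm{up}}}{\Delta_{\mathrm{down}}} = \frac{\tau}{1 - \tau}, \tag{2.4}$$

so that, for a target BLER $\tau$ and the sign convention of Alg. 2.1, the expected upward drift $(1 - \tau)\Delta_{\mathrm{up}}$ balances the expected downward drift $\tau \Delta_{\mathrm{down}}$ when the realized BLER equals the target. The performance of OLLA critically depends on the absolute values of the step sizes. For relatively large step sizes, OLLA converges quickly towards the target BLER; however, there are larger fluctuations around the steady state BLER value. On the other hand, for small step sizes, OLLA has stable steady state performance at the cost of slower convergence [12]. Further, from Eq. 2.4, it is clear that OLLA performance is also a function of the target BLER. Especially for small target BLER values, e.g., for latency-sensitive applications that need to avoid retransmission delays, OLLA convergence is slow on account of a small $\Delta_{\mathrm{up}}$ value.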
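The OLLA update can be illustrated with a short simulation. The sketch below follows the sign convention of Alg. 2.1 (raise the offset on ACK, lower it on NACK) and picks the ACK step so that the expected drift vanishes at the target BLER, i.e., (1 - target) * step_up = target * step_down. The toy link model, in which the BLER grows linearly with the offset, and all numeric values are assumptions made only for this example.

```python
import random


def olla_step_up(target_bler, step_down):
    """Step applied on ACK so that the expected drift is zero at the target BLER:
    (1 - target_bler) * step_up == target_bler * step_down."""
    return step_down * target_bler / (1.0 - target_bler)


def olla_update(offset_db, ack, step_up, step_down):
    """One OLLA iteration: raise the SINR offset on ACK, lower it on NACK."""
    return offset_db + step_up if ack else offset_db - step_down


def simulate_olla(target_bler, step_down=1.0, num_ttis=20000, seed=0):
    """Run OLLA against a toy link and return the realized BLER."""
    rng = random.Random(seed)
    step_up = olla_step_up(target_bler, step_down)
    offset_db, nacks = 0.0, 0
    for _ in range(num_ttis):
        # Hypothetical link: BLER rises linearly from 0 to 1 as the
        # offset goes from -5 dB to +5 dB.
        bler = min(1.0, max(0.0, (offset_db + 5.0) / 10.0))
        ack = rng.random() >= bler
        nacks += not ack
        offset_db = olla_update(offset_db, ack, step_up, step_down)
    return nacks / num_ttis
```

Because the offset stays bounded in this toy model, the long-run NACK fraction is pinned close to the target regardless of the random seed. For a target of 0.1 the ACK step is nine times smaller than the NACK step, which is why upward convergence is slow for low targets.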

3 Contextual Multi-armed Bandit Approach

In this section, we develop a sequential approach to learn the error probabilities for a finite set of transmission parameter configurations given the observed channel state. We first formulate the problem of selecting the optimal transmission parameters given the channel state observations as a constrained optimization problem, where the constraint arises from the need to satisfy a target BLER. Next, we demonstrate that this optimization problem can be reformulated as a stochastic linear programming problem that lends itself to a contextual multi-armed bandit (CMAB) approach with additional constraints on the space of actions in each exploitation step. Finally, we refer to state-of-the-art CMAB approaches that have been developed to solve related problems, albeit in different contexts [20, 21].

We consider the link adaptation scheme $\pi \in \Pi$ that selects the MCS $m^{\pi(\mathbf{q})}(t) \in \mathcal{M}$ in the $t$th TTI given the channel state observation vector $\mathbf{q}(t) = [q(t), \ldots, q(t - W)]$, where $W$ is the number of TTIs over which the channel state is observed. Then, the average rate and BLER over a finite number of TTIs $T$ are given by

$$r^{\pi}(T) = \frac{1}{T} \sum_{m, \mathbf{q}} \sum_{i=1}^{n_m^{\pi(\mathbf{q})}} D(m) \, c(i) \tag{3.1}$$

and

$$\theta^{\pi}(T) = 1 - \frac{1}{T} \sum_{m, \mathbf{q}} \sum_{i=1}^{n_m^{\pi(\mathbf{q})}} c(i), \tag{3.2}$$

respectively, where $n_m^{\pi(\mathbf{q})}$ is a random variable denoting the number of times the MCS $m$ is chosen by the policy $\pi(\mathbf{q})$ over the $T$ TTIs. We can now formulate the task of finding the optimal scheme that maximizes the throughput under the target BLER constraint as the constrained optimization problem

$$\arg\max_{\pi \in \Pi} \; r^{\pi}(T), \quad \text{s.t. } \theta^{\pi}(T) \leq \tau, \tag{3.3}$$

where $\tau$ denotes the target BLER. An unconstrained version of this optimization problem has been studied in [14] to find the optimal (channel, rate) combination in the context of cognitive wireless systems using unstructured as well as structured non-contextual multi-armed bandits. Here we extend the problem by assigning the context $\mathbf{q}(t)$ to the bandit and introducing a constraint that restricts the action space of the bandit in each TTI. Using Wald's lemma for Eq. 3.1, we find that the expected rate up to time $T$ can be written as

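$$\mathbb{E}\bigl[r^{\pi}(T)\bigr] = \frac{1}{T} \sum_{m, \mathbf{q}} \mathbb{E}\bigl[n_m^{\pi(\mathbf{q})}\bigr] \, D(m) \, \theta_{m,\mathbf{q}}, \tag{3.4}$$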

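where $\theta_{m,\mathbf{q}}$ is the ACK probability for MCS $m$ given the observed channel state $\mathbf{q}$. Similarly, from Eq. 3.2, the expected BLER is

$$\mathbb{E}\bigl[\theta^{\pi}(T)\bigr] = 1 - \frac{1}{T} \sum_{m, \mathbf{q}} \mathbb{E}\bigl[n_m^{\pi(\mathbf{q})}\bigr] \, \theta_{m,\mathbf{q}}. \tag{3.5}$$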
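For reference, the step from Eq. 3.1 to Eq. 3.4 can be written out as below. This sketch assumes that, for a fixed pair of MCS and context, the ACK events are independent and identically distributed with mean equal to the ACK probability, and that the number of selections has finite mean and does not depend on future ACK events; these are the conditions under which Wald's identity applies.

```latex
\begin{align}
\mathbb{E}\bigl[r^{\pi}(T)\bigr]
  &= \frac{1}{T} \sum_{m,\mathbf{q}} D(m)\,
     \mathbb{E}\Biggl[\,\sum_{i=1}^{n_m^{\pi(\mathbf{q})}} c(i)\Biggr]
     && \text{(linearity of expectation)} \\
  &= \frac{1}{T} \sum_{m,\mathbf{q}} D(m)\,
     \mathbb{E}\bigl[n_m^{\pi(\mathbf{q})}\bigr]\,\mathbb{E}\bigl[c(1)\bigr]
     && \text{(Wald's identity)} \\
  &= \frac{1}{T} \sum_{m,\mathbf{q}}
     \mathbb{E}\bigl[n_m^{\pi(\mathbf{q})}\bigr]\, D(m)\, \theta_{m,\mathbf{q}} .
\end{align}
```

The expected BLER follows in the same way by dropping the factor $D(m)$ and subtracting the result from one.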

Based on this formulation, the problem of finding the link adaptation scheme $\pi$ that solves the constrained optimization problem stated in Eq. 3.3 can be stated as the equivalent problem of designing a stochastic linear program that solves

$$\arg\max_{\pi \in \Pi} \; \frac{1}{T} \sum_{m, \mathbf{q}} \mathbb{E}\bigl[n_m^{\pi(\mathbf{q})}\bigr] \, D(m) \, \theta_{m,\mathbf{q}}, \quad \text{s.t. } \frac{1}{T} \sum_{m, \mathbf{q}} \mathbb{E}\bigl[n_m^{\pi(\mathbf{q})}\bigr] \, \theta_{m,\mathbf{q}} \geq 1 - \tau, \tag{3.6}$$

where $\tau$ is the target BLER. This problem can be solved in an online fashion through a constrained version of the contextual multi-armed bandit (CMAB) approach [20], where the context is the observed channel state vector $\mathbf{q}$, the arms correspond to the available MCSs, and the reward is the observed ACK / NACK for each transmission weighted by the number of transmitted bits $D(m)$.

In the context of linear payoff functions, efficient algorithms that learn through sequential exploration and exploitation have been developed for CMABs [22, 23]. Additionally, artificial neural networks suited for modeling non-stationary rewards have been incorporated within CMABs that rely on a multi-expert approach to determine the optimal action in each exploitation step [21]. In the case of wireless links, the mapping from the observed channel state $\mathbf{q}(t)$ to the probability of ACK for each MCS is a nonlinear function on account of the processing steps during the encoding, transmission over the physical wireless channel, and decoding of the bits, some of which have been discussed in Sec. 2. Therefore, we utilize a fully connected artificial neural network to model the arms of the CMAB, where the input to the network is the observed channel state vector $\mathbf{q}(t)$, and the output of the network is the vector of predicted success probabilities for each MCS. In each TTI, given $\mathbf{q}(t)$, the CMAB selects the MCS $m$ that is predicted to maximize the expected throughput in that TTI while also satisfying the target BLER constraint. For this selected MCS, the observed true reward is $D(m)$ with probability $\theta_{m,\mathbf{q}}$ and $0$ with probability $1 - \theta_{m,\mathbf{q}}$. Therefore, to approximate the true values of $\theta_{m,\mathbf{q}}$, the CMAB needs to explore the space of available MCSs given the context vectors $\mathbf{q}(t)$.
We implement an $\epsilon$-greedy algorithm that learns by sequential exploration and exploitation, where the exploration is controlled by a step-decaying $\epsilon$ value. The complete algorithm is outlined in Alg. 3.1, where we introduce the additional index $l \in \{1, \ldots, L\}$ to denote multiple wireless links that utilize a common CMAB for link adaptation.
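A constrained epsilon-greedy bandit of this kind can be sketched in a few lines. The example below is a simplification, not the implementation evaluated in this paper: the neural network is replaced by a tabular estimate of the ACK probability per (context, MCS) pair, the context is any hashable value (for instance the latest CQI), the decay is taken to be a subtractive step of 0.1, and the step counter advances once per call rather than once per TTI across all links. The class and method names are hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyCmab:
    """Constrained epsilon-greedy bandit with a tabular ACK-probability estimate."""

    def __init__(self, data_bits, target_bler, epsilon=1.0, decay=0.1,
                 decay_interval=100, seed=0):
        self.data_bits = list(data_bits)      # D(m) for each MCS index m
        self.target_bler = target_bler
        self.epsilon = epsilon                # probability of exploring
        self.decay = decay                    # assumed subtractive decay step
        self.decay_interval = decay_interval  # number of calls between decays
        self.acks = defaultdict(int)          # (context, m) -> number of ACKs
        self.tries = defaultdict(int)         # (context, m) -> number of transmissions
        self.rng = random.Random(seed)
        self.t = 0

    def ack_prob(self, context, m):
        """Empirical ACK probability; 0.0 for an arm that was never tried."""
        n = self.tries[(context, m)]
        return self.acks[(context, m)] / n if n else 0.0

    def exploit(self, context):
        """Highest expected throughput among MCSs meeting the BLER target, else MCS 0."""
        feasible = [m for m in range(len(self.data_bits))
                    if 1.0 - self.ack_prob(context, m) <= self.target_bler]
        if not feasible:
            return 0
        return max(feasible,
                   key=lambda m: self.data_bits[m] * self.ack_prob(context, m))

    def select(self, context):
        """Advance one step and return an MCS index (explore or exploit)."""
        self.t += 1
        if self.t % self.decay_interval == 0:
            self.epsilon = max(0.0, self.epsilon - self.decay)
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.data_bits))
        return self.exploit(context)

    def update(self, context, m, ack):
        """Record the ACK (True) or NACK (False) observed for MCS m."""
        self.tries[(context, m)] += 1
        self.acks[(context, m)] += int(ack)
```

Sharing one such object across all links is what lets every link benefit from the ACK / NACK feedback of the others. A tabular estimate cannot generalize across contexts, which is one reason a function approximator such as a neural network is attractive when the context is a vector of several CQI values.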

4 Experimental Results

In this section, we provide numerical results for throughput and BLER performance of the link adaptation techniques discussed in this paper. We simulate a wireless link modeled on the 4G and 5G link chains with the simulation parameters listed in Table 4.1. The wireless channel is modeled as a time- and frequency-varying channel that is correlated in time according to the UE Doppler frequency, and correlated in frequency according to a multipath propagation environment model [1]. In particular, our simulator uses the parameters and configurations specified by the 4G standard [24] with very few approximations, together with a well-studied model for the wireless channel [25] that is available in the industry-standard package ITPP (accessed through its Python wrapper, py-itpp).

Algorithm 3.1 $\epsilon$-greedy CMAB with step decay.
    Initialize: Exploration rate $\epsilon = 1.0$, Target BLER $\tau$
    while True do
        Input: TTI $t$
        Input: Contexts $\mathbf{q}_l(t) = [q_l(t), \ldots, q_l(t - W)]$, $l = 1, \ldots, L$
        if $(t \bmod T_0) = 0$ then
            $\epsilon \leftarrow \epsilon - 0.1$
        end if
        if $\mathrm{rand}(0, 1) \geq \epsilon$ then
            for link $l = 1, \ldots, L$ do
                $m_l \leftarrow \arg\max_{m} D(m) \, \hat{\theta}_{m,\mathbf{q}_l(t)}$  s.t.  $1 - \hat{\theta}_{m,\mathbf{q}_l(t)} \leq \tau$
            end for
        else
            for link $l = 1, \ldots, L$ do
                $m_l \leftarrow \mathrm{rand}[\mathcal{M}]$
            end for
        end if
    end while

Table 4.1: Simulation Parameters

    Simulation Parameter           Value
    Channel model                  ITU Vehicular B [25]
    Number of links                $L = 1000$
    UE speed                       $\mathcal{N}(60, 10)$ kmph
    UE long-term average SINR      $\mathcal{U}(10, 30)$ dB
    CQIs                           $q \in \{0, \ldots, 21\}$
    MCSs                           $m \in \{0, \ldots, 21\}$
    Channel coding                 Rate-1/3 Turbo
    CQI observation window         $W = 10$
    $\epsilon$ decay interval      $T_0 = 100$
    Neural network layer sizes     [20, 20, 20]

We generate results for $L$ independent vehicular links with UE speeds distributed normally around a mean relative speed of 60 kmph, which is commonly observed in urban environments. In addition, to simulate UEs at varying distances from the BS, we use a uniform distribution to generate long-term average UE SINRs. We assume that each link $l \in \{1, \ldots, L\}$ reports the instantaneous CQI value $q_l(t)$ in each TTI $t$. Further, since the links are served by a common BS, we can assume that the simulated links experience similar systematic errors for mapping the CQI values to the predicted BLER for each MCS [11]. Finally, we simulate the channel code rates and modulation orders associated with each MCS according to the standardized values for CQI generation in 4G networks [24].

As discussed in the previous sections, the error probability for transmission over a wireless link depends on the instantaneous channel state, which is known only approximately at the BS. Further, given the channel state, AMC link adaptation models the wireless link through offline lookup tables to select the MCS in each TTI. To compensate for systematic errors, an OLLA step is applied in each TTI to better approximate the observed link characteristics. However, the classical OLLA approach depends heavily on the heuristically defined target BLER, suffers from slow convergence, and does not directly optimize the link throughput. In contrast, our CMAB approach sequentially learns the mapping from the observed channel state to the error probability for each transmission parameter configuration. We then exploit the CMAB to select the optimal MCS that maximizes the throughput under the target BLER constraint.

Next, we illustrate and discuss the performance of three link adaptation approaches for the simulation scenario described above:

• AMC: The inner-loop link adaptation technique that selects the MCS based on the CQI-SINR and the SINR-MCS lookup tables described in Sec. 2.

• OLLA + AMC (or OLLA for short): Uses OLLA to adjust the AMC parameters with the SINR offset so as to meet the target BLER, as described in Sec. 2. Note that using OLLA in conjunction with AMC is expected to improve AMC performance, as it compensates for any systematic differences between the BLER predicted by AMC and the realized BLER.

• CMAB: Our approach, as described in Sec. 3. We use $\epsilon$-greedy exploration with step decay to train the CMAB, as shown in Alg. 3.1. CMAB directly maps the observed channel state to the predicted BLER for each MCS, and simultaneously aims to maximize the throughput given the target BLER constraint.

For the experiments reported in the remainder of this section, we set the target BLER to 0.1, 0.5, and 0.3, respectively. These target BLER values are consistent with the commonly observed values in real-world wireless cellular deployments. We define the throughput in the $i$th TTI as the average rate (measured in bits per second per Hertz), as defined in Eq. 2.1. We estimate the realized BLER in the $i$th TTI as $1 - \frac{1}{L} \sum_{l} c_l(i)$ for the first transmission of a transport block; retransmissions are not simulated.

In the figures below, we plot the link throughput and realized BLER for AMC (blue curves), AMC + OLLA (orange curves), and CMAB (green curves). Note that with CMAB, we use exploitation, i.e., the CMAB selects the MCS based on the policy learnt over all the previous TTIs across all simulated links.

Target BLER of 0.1: As discussed in previous sections, latency-sensitive applications rely on a low target BLER to reduce the retransmission overhead. Fig. 3.1 shows the plots for the average link rate (in bits per second per Hertz) and the realized BLER, corresponding to a target BLER value of 0.1. CMAB performs the best, as it is successful in sequentially learning a policy that maximizes the throughput, subject to maintaining the realized BLER below the target value. In contrast, while OLLA converges to the maximum throughput (by adjusting the SINR offset in each TTI), it suffers from slow convergence for a low target BLER. This behavior is consistent with our discussion in Sec. 2. The reason CMAB is able to converge faster to a better average rate is that it exploits the rewards from all the $L$ links in parallel. In contrast, OLLA operates only on a per-link basis, which leads to slower convergence. We plot the advantage from faster convergence of CMAB with the black curve, where we observe up to 40% gain in rate, which drops over the next few TTIs as OLLA converges to the optimal setting.

Target BLER of 0.5: Fig. 3.2 plots the OLLA and CMAB performance for a somewhat artificially high target BLER of 0.5. This regime aims to evaluate the two approaches in environments where the channel is rapidly changing, or where applications can tolerate high jitter caused by retransmissions. In this regime, OLLA achieves a significantly lower rate than CMAB. The reason for this is apparent from the BLER plot. Recall that OLLA aims to achieve the heuristically set target BLER, and indeed here it is successful in doing so, matching the target more closely than CMAB does. However, this performance comes at the expense of ignoring the throughput. In contrast, CMAB only aims to keep the realized BLER within the target bound. As a result, CMAB is able to exploit the "room" within the BLER bound to maximize the throughput. We observe that in this setting, CMAB is able to achieve as much as 25% rate gain compared to OLLA (black curve).

Target BLER of 0.3: Finally, in Fig. 3.3, we evaluate the OLLA and CMAB performance for a target BLER of 0.3. This value is often used to evaluate mobile broadband applications, as these applications are relatively tolerant to retransmission delays, while striving to maximize the throughput. In terms of average rate, CMAB performs similarly to OLLA after the initial exploration phase. However, CMAB achieves this rate at a lower realized BLER. In practice, a lower BLER improves the wireless link performance by triggering fewer retransmissions. We plot this advantage in terms of the retransmission gain for CMAB over OLLA (black curve), where we observe that CMAB can reduce the BLER, and therefore the number of retransmissions, by 15% while achieving similar throughput.
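The per-TTI metrics used in this section are simple to compute from the ACK indicators of the simulated links. The helpers below are a sketch: the function names are hypothetical, and the relative-difference form of the gain is an assumption, since the text does not spell out how the gain curves are normalized.

```python
def realized_bler(acks):
    """Realized BLER in one TTI: 1 - (1/L) * sum of ACK indicators over L links."""
    return 1.0 - sum(acks) / len(acks)


def average_rate(data_bits, acks, tti_s, bandwidth_hz):
    """Average delivered rate per link in bits/s/Hz; NACKed blocks count as zero."""
    delivered = sum(d * c for d, c in zip(data_bits, acks))
    return delivered / (len(acks) * tti_s * bandwidth_hz)


def relative_gain(value, baseline):
    """Relative gain of `value` over `baseline`, e.g. 0.25 for a 25% improvement."""
    return (value - baseline) / baseline
```

For a BLER comparison, where lower is better, the reduction can be obtained as `relative_gain(bler_olla, bler_cmab)` with the roles swapped, or equivalently reported as a negative gain.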

5 Conclusions

We have proposed an online learning approach to solve the problem of selecting the optimal transmission parameters in each TTI of a wireless communication system. We formulate this problem as a stochastic linear program and show that this can be reduced to a standard CMAB problem. Further, we develop an $\epsilon$-greedy CMAB with step decay to sequentially explore the space of available transmitter parameter configurations given the observed channel state. Finally, we apply our approach for selecting the optimal MCS that maximizes the throughput under the constraint of a target BLER and, through numerical evaluations, demonstrate that our approach outperforms the classical OLLA technique for link adaptation.

References

[1] A. Molisch, Wireless Communications. Wiley, 2010.

[2] G. Ku and J. M. Walsh, “Resource allocation and link adaptation in LTE and LTE Advanced: A tutorial,” IEEE Communications Surveys & Tutorials, vol. 17, pp. 1605–1633, third quarter 2015.

[3] C. Yu, L. Yu, Y. Wu, Y. He, and Q. Lu, “Uplink scheduling and link adaptation for narrowband Internet of Things systems,” IEEE Access, vol. 5, pp. 1724–1734, 2017.

[4] L. Deek, E. Garcia-Villegas, E. Belding, S.-J. Lee, and K. Almeroth, “A practical framework for 802.11 MIMO rate adaptation,” Computer Networks, vol. 83, pp. 332–348, 2015.

[5] L.-J. Chen, R. Kapoor, M. Y. Sanadidi, and M. Gerla, “Enhancing Bluetooth TCP throughput via link layer packet adaptation,” in 2004 IEEE International Conference on Communications (IEEE Cat. No.04CH37577), vol. 7, pp. 4012–4016, June 2004.

[6] T. L. Jensen, S. Kant, J. Wehinger, and B. H. Fleury, “Fast link adaptation for MIMO OFDM,” IEEE Transactions on Vehicular Technology, vol. 59, pp. 3766–3778, Oct 2010.

[7] M. Shikh-Bahaei, “Joint optimization of transmission rate and outer-loop SNR target adaptation over fading channels,” IEEE Transactions on Communications, vol. 55, pp. 398–403, March 2007.

[8] P. Wu and N. Jindal, “Coding versus ARQ in fading channels: How reliable should the PHY be?,” in GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference, pp. 1–6, Nov 2009.

[9] S. Park, R. C. Daniels, and R. W. Heath, “Optimizing the target error rate for link adaptation,” in 2015 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, Dec 2015.

[10] V. Buenestado, J. M. Ruiz-Avilés, M. Toril, S. Luna-Ramírez, and A. Mendo, “Analysis of throughput performance statistics for benchmarking LTE networks,” IEEE Communications Letters, vol. 18, pp. 1607–1610, Sept 2014.

[11] A. Durán, M. Toril, F. Ruiz, and A. Mendo, “Self-optimization algorithm for outer loop link adaptation in LTE,” IEEE Communications Letters, vol. 19, pp. 2005–2008, Nov 2015.

[12] R. A. Delgado, K. Lau, R. Middleton, R. S. Karlsson, T. Wigren, and Y. Sun, “Fast convergence outer loop link adaptation with infrequent updates in steady state,” in 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), pp. 1–5, Sept 2017.

[13] S. Katri Pulliyakode and S. Kalyani, “Reinforcement learning techniques for Outer Loop Link Adaptation in 4G/5G systems,” ArXiv e-prints, Aug. 2017.

[14] R. Combes and A. Proutiere, “Dynamic rate and channel selection in cognitive radio systems,” IEEE Journal on Selected Areas in Communications, vol. 33, pp. 910–921, May 2015.

[15] E. Dahlman, S. Parkvall, and J. Sköld, 4G LTE/LTE-Advanced for Mobile Broadband. Elsevier, 2011.

[16] K. Brueninghaus, D. Astely, T. Salzer, S. Visuri, A. Alexiou, S. Karger, and G. A. Seraji, “Link performance models for system level simulations of broadband radio access systems,” in 2005 IEEE 16th International Symposium on Personal, Indoor and Mobile Radio Communications, vol. 4, pp. 2306–2311, Sept 2005.

[17] V. Saxena, J. Jaldén, M. Bengtsson, and H. Tullberg, “Deep learning for frame error probability prediction in BICM-OFDM systems,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6658–6662, April 2018.

[18] R. C. Daniels, C. M. Caramanis, and R. W. Heath, “Adaptation in Convolutionally Coded MIMO-OFDM Wireless Systems Through Supervised Learning and SNR Ordering,” IEEE Transactions on Vehicular Technology, vol. 59, pp. 114–126, Jan 2010.

[19] F. Blanquez-Casado, G. Gomez, M. d. C. Aguayo-Torres, and J. T. Entrambasaguas, “eOLLA: an enhanced outer loop link adaptation for cellular networks,” EURASIP Journal on Wireless Communications and Networking, vol. 2016, p. 20, Jan 2016.

[20] J. Langford and T. Zhang, “The epoch-greedy algorithm for multi-armed bandits with side information,” in Advances in Neural Information Processing Systems, pp. 817–824, 2008.

[21] R. Allesiardo, R. Féraud, and D. Bouneffouf, “A neural networks committee for the contextual bandit problem,” in Neural Information Processing (C. K. Loo, K. S. Yap, K. W. Wong, A. Teoh, and K. Huang, eds.), (Cham), pp. 374–381, Springer International Publishing, 2014.

[22] W. Chu, L. Li, L. Reyzin, and R. Schapire, “Contextual bandits with linear payoff functions,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.

[23] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, ACM, 2010.

[24] 3rd Generation Partnership Project, “Evolved Universal Terrestrial Radio Access (E-UTRA); Physical layer procedures,” Tech. Rep. 36.213 v12.3.0, Sept. 2016.

[25] ITU-R, “Guidelines for evaluation of radio transmission technologies for IMT-2000,” tech. rep., 1997.

(a) Average Throughput per Link

(b) Block Error Rate

(c) Throughput Gain (CMAB vs OLLA)

Figure 3.1: Average link rate (a) and realized BLER (b) for a target BLER of 0.1, shown with a horizontal red line in the right figure. Compared to OLLA, our CMAB approach converges faster to a higher average link rate (c).

(a) Average Throughput per Link

(b) Block Error Rate

(c) Throughput Gain (CMAB vs OLLA)

Figure 3.2: Average link rate (a) and realized BLER (b) for a target BLER of 0.5, shown with a horizontal red line in the right figure. For this target BLER, our CMAB approach provides significant gains in average throughput over OLLA while simultaneously lowering the realized BLER (c).

(a) Average Throughput per Link

(b) Block Error Rate

(c) Retransmission Gain (CMAB vs OLLA)

Figure 3.3: Average link rate (a) and realized BLER (b) for a target BLER of 0.3, shown with a horizontal red line in the right figure. Both CMAB and OLLA have similar average rate performance; however, CMAB achieves this rate at a lower BLER (c).

Paper IV

Bayesian Link Adaptation under a BLER Target

The optimal modulation and coding scheme (MCS) for wireless transmission depends on the dynamic wireless channel state. Hence, wireless link adaptation relies on periodically reported channel quality index (CQI) values to select the optimal MCS for each transmission instance. However, to optimize link performance for a given wireless environment, current link adaptation techniques rely on tuning parameters such as a block error rate (BLER) target and algorithm adjustments that are difficult to optimize heuristically. Here, we propose BayesLA, a Bayesian link adaptation scheme that does not require careful parameter tuning for optimal link performance in diverse wireless environments. BayesLA, which is inspired by the Thompson sampling approach widely used for online learning, efficiently learns the MCS success probabilities conditioned on the reported CQI values. Through numerical simulations for a Rayleigh fading wireless channel and a typical cellular link configuration, we demonstrate that BayesLA outperforms the state-of-the-art outer loop link adaptation (OLLA) approach in terms of the realized link throughput for a given BLER target.

© 2020 IEEE. Reprinted, with permission, from V. Saxena and J. Jaldén, “Bayesian Link Adaptation under a BLER Target,” IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2020.

1 Introduction

Link adaptation is a critical component in the efficient operation of wireless networks. By optimally selecting the link parameters in response to the observed wireless channel state, link adaptation can significantly improve the link performance in terms of its throughput and average packet error rate. Current link adaptation techniques rely on periodically reported channel quality index (CQI) values to dynamically adjust the modulation and coding scheme (MCS) used for transmitting physical-layer transport blocks over the air [1]. Link adaptation is applicable to uplink as well as downlink transmission in wireless cellular networks. However, downlink link adaptation is substantially more challenging on account of quantized, delayed, and noisy CQI feedback from the user equipment. Hence, we address downlink link adaptation in the rest of this paper.

Link adaptation in wireless cellular networks comprises two inter-dependent control loops: an inner loop, also known as adaptive modulation and coding (AMC), and outer loop link adaptation (OLLA). For a reported CQI value, AMC estimates the optimal MCS using lookup tables that are generated offline and stored at the base station. The lookup tables provide an approximate transmission success probability for each MCS conditioned on the CQI. However, owing to modeling errors in the lookup tables, the realized link performance deviates from the optimal performance predicted by AMC. To compensate for the modeling errors that affect AMC performance, OLLA estimates an offset value for the AMC parameters iteratively over several transmissions. The AMC-OLLA scheme strives to maximize the link throughput under an externally defined block error rate (BLER) target.

The AMC-OLLA approach (OLLA for brevity hereon) for link adaptation suffers from several issues. First, the lookup tables occupy substantial memory space in the resource-constrained base station hardware. Second, the optimal BLER target, in terms of maximizing the link throughput, does not have a closed-form solution and typically varies between 10% and 30% [2, 3]. Another drawback of the OLLA approach is that the step size for AMC offset adjustments is defined heuristically. Using a small step size leads to slow OLLA convergence. Conversely, using large step sizes allows OLLA to converge quicker at the cost of large oscillations around the optimum. In a field study of live 4G networks [4], OLLA was found to have a strong impact on the network performance.

Several extensions have been proposed for OLLA which alleviate some of the drawbacks mentioned above. In [5], a supervised offline scheme was proposed to replace the AMC lookup tables with compact artificial neural network representations. In [6], a hypothesis testing framework was proposed to optimize the OLLA step size for faster convergence. The optimality of the BLER target was investigated in [3], which also derived an approximate analytical expression that maximizes the throughput. A theoretically motivated online learning scheme for link adaptation was proposed in [7], which speeds up OLLA by learning from multiple parallel links.

In this paper, we propose BayesLA, a Bayesian approach for link adaptation. BayesLA models the MCS success probabilities conditioned on the CQI through a suitable probability distribution. Over successive transmissions, BayesLA updates these probability distributions based on the observed transmission success (i.e., ACK) and failure (i.e., NACK) events. BayesLA reduces the memory footprint of the lookup tables used by OLLA by encoding the MCS success probabilities with a small number of parameters. Further, in contrast to OLLA, BayesLA does not strive to match the BLER target. Instead, BayesLA maximizes the expected link throughput subject to the BLER target by automatically converging to the optimal, throughput-maximizing BLER allowed by the target value.

BayesLA is inspired by the Thompson sampling approach, which has been widely studied in machine learning and has been applied in recommender systems as well as operations research [8]. Recently, Thompson sampling has also been shown to improve the throughput of WiFi links in [9]. In contrast to the scheme in [9], which optimizes the link throughput alone, BayesLA optimizes the link throughput subject to the BLER target. Our recent work in [10] proposed the first constrained Thompson sampling scheme for this problem that provides theoretical performance guarantees compared to an oracle policy. Compared to [10], BayesLA conditions the MCS success probabilities on the CQI. Further, we incorporate the offline AMC lookup tables to significantly speed up BayesLA learning during the initial few transmissions. We conduct numerical simulations for cellular links over a Rayleigh fading channel to show that BayesLA outperforms OLLA while alleviating the dependence on heuristically tuned parameters.

2 Model and Objectives

Wireless Link Model

In each frame indexed by $t = 1,\dots,T$ scheduled for downlink transmission, the base station packs information bits into variable-sized transport blocks to be transmitted over the physical air interface. The transport block size $D_{k(t)}$ is dictated by the choice of the MCS index $k(t) \in \{1,\dots,K\}$ in frame $t$. The transport block bits are first encoded using an error-protection channel code and subsequently interleaved with a pseudo-random sequence in order to randomize the effect of noise and interference. The channel-encoded bits are then mapped onto complex-valued symbols defined by the modulation order that corresponds to $k(t)$. In 4G and 5G systems, the modulated symbols are loaded onto frequency-domain subcarriers that constitute an orthogonal frequency division multiplexing (OFDM) symbol. Several contiguous OFDM symbols are packed into a frame along with known reference symbols for channel estimation and transmitted over the air. The effective data transmission rate is

$$r_k = \frac{1}{\Delta f\,\Delta t}\, D_k \quad \text{[bits / s / Hz]}, \tag{2.1}$$

for MCS index $k(t) = k$, where $\Delta f$ and $\Delta t$ denote the bandwidth and time spanned by the frame respectively.

The complex-valued and time-varying wireless channel, denoted here by $h_t$, attenuates the transmitted signal and adds stochastic impairments such as phase rotation, noise, and interference. The receiver estimates the channel, $\hat{h}_t$, from the reference symbols embedded in the frame and uses it to compensate for channel effects on the information-containing symbols. The receiver then inverts the data processing steps applied prior to transmission and reconstructs the received transport block. The successful reception of the transport block is validated using a cyclic redundancy check of the recovered bits, and an ACK ($x_{k(t)} = 1$) or NACK ($x_{k(t)} = 0$) signal indicating success or failure respectively is fed back to the base station. Additionally, the receiver periodically estimates a discrete CQI, $q_t = f(\hat{h}_t) \in \mathcal{Q}$, which is fed back to the base station. For a choice of MCS index $k(t) = k \in \{1,\dots,K\}$ and observed CQI $q(t) = q \in \mathcal{Q}$, the binary transmission success events $x_{k,q} = x(r_k, q)$ can be viewed as a family of random variables, which yields the conditional transmission success probability $\mu_{k,q} = P(x_{k,q} = 1 \mid q)$. We model the success events as independent and identically distributed Bernoulli random variables, i.e., $x_{k,q} \sim \mathrm{Bern}(\mu_{k,q})$.

Optimal MCS Selection under a BLER Target

Consider a wireless link with true transmission success probabilities $\mu_k$, $k \in \{1,\dots,K\}$, for the corresponding set of available rates $r_k$ and for a given CQI. In each frame indexed by the time interval $t = 1,\dots,T$, selecting the MCS index $k(t) \in \{1,\dots,K\}$ either delivers $r_{k(t)}$ bits to the receiver with probability $\mu_{k(t)}$, or zero bits with probability $1 - \mu_{k(t)}$. Under the assumption that the CQI captures the wireless channel state accurately, the packet success event is a Bernoulli random variable with mean $\mu_k$, i.e., $x_k \sim \mathrm{Bern}(\mu_k)$. The goal of link adaptation is to maximize the average throughput over $T$ frames subject to the BLER target $\eta$,

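$$\begin{aligned}
\text{maximize} \quad & \frac{1}{T} \sum_{t=1}^{T} r_{k(t)}\, x_{k(t)} \\
\text{subject to} \quad & \frac{1}{T} \sum_{t=1}^{T} x_{k(t)} \ge 1 - \eta.
\end{aligned} \tag{2.2}$$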
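As a minimal sketch of how the objective and the constraint in (2.2) could be evaluated on a logged link, the snippet below computes the realized average throughput and BLER from per-frame rates and ACK/NACK values. The function names and the list-based log format are assumptions made for illustration; they are not taken from the paper's simulation code.

```python
def average_throughput_and_bler(rates, acks):
    """Empirical objective and constraint terms of (2.2) for one logged link.

    rates[t] is the rate r_{k(t)} of the MCS chosen in frame t;
    acks[t] is x_{k(t)}: 1 for an ACK, 0 for a NACK.
    """
    if len(rates) != len(acks) or not rates:
        raise ValueError("need two equally long, non-empty sequences")
    num_frames = len(rates)
    # A frame contributes its rate only when the transport block is received.
    throughput = sum(r * x for r, x in zip(rates, acks)) / num_frames
    success_rate = sum(acks) / num_frames
    return throughput, 1.0 - success_rate


def meets_bler_target(acks, eta):
    """Constraint of (2.2): the average success rate is at least 1 - eta."""
    return sum(acks) / len(acks) >= 1.0 - eta
```

For a hypothetical log with rates `[2.0, 2.0, 1.0, 1.0]` and feedback `[1, 0, 1, 1]`, three of the four frames succeed, so the link would satisfy a BLER target of 0.3 but not one of 0.1.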

Denote by $n_k(T)$ and $s_k(T)$ the number of times that MCS index $k(t) = k$ is selected and the number of observed successes respectively until frame $T$. Assuming that the transmission success events are independent and identically distributed, we have the optimal MCS selection problem:

$$\begin{aligned}
\operatorname*{argmax}_{n_1(T),\dots,n_K(T)} \quad & \frac{1}{T} \sum_{k=1}^{K} r_k\, n_k(T)\, s_k(T) \\
\text{subject to} \quad & \frac{1}{T} \sum_{k=1}^{K} n_k(T)\, s_k(T) \ge 1 - \eta.
\end{aligned} \tag{2.3}$$

Solving (2.3) optimally is NP-hard and requires computational time that is exponential in $T$. However, wireless applications are designed to tolerate occasional transmission failures due to unpredictable radio channel conditions. This implies that the BLER target constraint in (2.3) can be transformed into a soft constraint that needs to be satisfied only in expectation. Recalling that the packet success probability for rate $r_k$ is denoted by $\mu_k$, we can write the relaxation of (2.3) as the linear program

$$\begin{aligned}
LP(p) := \quad \text{maximize} \quad & \sum_{k=1}^{K} p_k r_k \mu_k \\
\text{subject to} \quad & \sum_{k=1}^{K} p_k \mu_k \ge 1 - \eta,
\end{aligned} \tag{2.4}$$

where $p_k \in [0, 1]$ is the probability of selecting MCS index $k$, and $p = [p_1,\dots,p_K]$ is the probabilistic MCS selection vector such that $\sum_{k=1}^{K} p_k = 1$. We denote the stationary optimal solution to (2.4) by $p^* = [p_1^*,\dots,p_K^*]$.

The problem in (2.4) is an optimization problem with a linear objective and a linear constraint. The space of the feasible objective values, $\sum_{k=1}^{K} p_k r_k \mu_k$, is the convex hull of the expected throughputs, $\{r_k \mu_k,\ k \in \{1,\dots,K\}\}$. There are three possible scenarios: First, no feasible solution to (2.4) exists. Second, an optimal solution exists and is equivalent to always selecting a certain MCS for transmission. Third, an optimal solution exists that is composed of a linear combination of more than one of the available MCSs. We describe these three scenarios below.
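The linear program (2.4) is small enough to solve exactly without a solver library. The sketch below is one possible way to do so, not the method used in the paper: it relies on the fact that an optimum of a bounded LP is attained at a vertex of the feasible set, and that the vertices here are either a single MCS that meets the constraint or a mixture of two MCSs for which the constraint holds with equality. The function name `solve_mcs_lp` and the example numbers are hypothetical.

```python
from itertools import combinations


def solve_mcs_lp(rates, mus, eta):
    """Solve LP (2.4) exactly by enumerating the vertices of its feasible set.

    Returns the selection probabilities p, or None if no MCS mixture can
    meet the success-probability floor 1 - eta.
    """
    num_mcs = len(rates)
    floor = 1.0 - eta
    best_value, best_p = None, None

    def consider(p):
        nonlocal best_value, best_p
        value = sum(pk * rk * mk for pk, rk, mk in zip(p, rates, mus))
        if best_value is None or value > best_value:
            best_value, best_p = value, p

    # Vertices of the first kind: a single MCS that satisfies the constraint.
    for k in range(num_mcs):
        if mus[k] >= floor:
            p = [0.0] * num_mcs
            p[k] = 1.0
            consider(p)

    # Vertices of the second kind: two MCSs mixed so the constraint is tight.
    for j, k in combinations(range(num_mcs), 2):
        hi, lo = (j, k) if mus[j] >= mus[k] else (k, j)
        if mus[hi] > floor > mus[lo]:
            weight_hi = (floor - mus[lo]) / (mus[hi] - mus[lo])
            p = [0.0] * num_mcs
            p[hi], p[lo] = weight_hi, 1.0 - weight_hi
            consider(p)

    return best_p
```

Calling `solve_mcs_lp([1.0, 2.0], [0.95, 0.6], 0.1)` on this made-up two-MCS link returns a mixture of both indices, while success probabilities that all lie below $1 - \eta$ yield `None`. The enumeration costs $O(K^2)$ per call, which is modest for the MCS table sizes considered here.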

Proposition 1. For a set of transmission success probabilities $\{\mu_1,\dots,\mu_K\}$ such that $\mu_k < 1 - \eta\ \forall\, k \in \{1,\dots,K\}$, there does not exist any feasible solution to (2.4).

Proof. In this case, we have that $\sum_{k=1}^{K} \mu_k p_k \le \max_k \mu_k \sum_{k=1}^{K} p_k \le \max_k \mu_k < 1 - \eta$. Therefore, the BLER target constraint cannot be satisfied using the available MCSs. Consequently, no feasible solution to the problem exists.

Next, we describe the scenarios under which a feasible solution exists. In the following, we assume that the rate corresponding to the MCS index $i = \operatorname{argmax}_{k \in \{1,\dots,K\}} r_k \mu_k$ is the throughput-maximizing rate. We then consider the two non-overlapping scenarios where the MCS index $i$ satisfies and fails the BLER target, i.e., $\mu_i \ge 1 - \eta$ and $\mu_i < 1 - \eta$ respectively.

Proposition 2. If $\mu_i \ge 1 - \eta$, the optimal rate selection probabilities are degenerate to $p_k^* = 1,\ k = i$ and $p_k^* = 0\ \forall\, k \ne i$.

Proof. We have that $\sum_{k=1}^{K} r_k \mu_k p_k \le r_i \mu_i \sum_{k=1}^{K} p_k \le r_i \mu_i$, with equality iff $p_i = 1$. Thus, always selecting the MCS index $i$ maximizes the objective. Further, since $\sum_{k=1}^{K} \mu_k p_k = \mu_i \ge 1 - \eta$, the BLER constraint is satisfied for this MCS selection scheme. Therefore, the optimal MCS selection probabilities are degenerate to $p_k = 1,\ k = i$ and $p_k = 0\ \forall\, k \ne i$.

Proposition 3. If $\mu_i < 1 - \eta$ and a feasible solution exists, then the probabilistic MCS selection vector is supported by at least two MCS indices.

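Proof. Consider the set of arms $\mathcal{U}_0 := \{\, i \mid r_i \mu_i > \max_{k \in \mathcal{U}} r_k \mu_k,\ \mu_k < 1 - \eta \,\}$. Then for any $k_1 \in \mathcal{U}_0$ and $k_2 \in \mathcal{U}$, and $\epsilon > 0$, there exists another constant $\delta > 0$ such that $\epsilon \mu_{k_1} + \delta \mu_{k_2} = (\epsilon + \delta)(1 - \eta)$. Since $\mu_{k_1} r_{k_1} > \mu_{k_2} r_{k_2}$ by definition, we have that $\epsilon \mu_{k_1} r_{k_1} + \delta \mu_{k_2} r_{k_2} > (\epsilon + \delta) \mu_{k_2} r_{k_2}$. Setting $\epsilon + \delta = p_{k_2}$, where $p_{k_2}$ is the contribution of the MCS index $k_2$ towards the success probability, we obtain $\epsilon$ and $\delta$ such that $\epsilon \mu_{k_1} + \delta \mu_{k_2} = p_{k_2}(1 - \eta)$ and $\epsilon \mu_{k_1} r_{k_1} + \delta \mu_{k_2} r_{k_2} > p_{k_2} \mu_{k_2} r_{k_2}$. Repeating this process for all $k \in \mathcal{U}$, we obtain the optimal MCS selection vector that satisfies this proposition. Consequently, the optimal link adaptation policy is a probabilistic mixture of more than one MCS index.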
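To make Proposition 3 concrete, the following worked example uses a hypothetical two-MCS link. The rates and success probabilities are chosen purely for illustration and do not come from the simulations. The throughput-maximizing index fails the BLER target, so the best feasible policy mixes it with the more reliable index until the constraint in (2.4) holds with equality.

```latex
\begin{align*}
&K = 2,\quad \eta = 0.1,\quad (r_1,\mu_1) = (1,\ 0.95),\quad (r_2,\mu_2) = (2,\ 0.6), \\
&r_2\mu_2 = 1.2 > r_1\mu_1 = 0.95 \;\Rightarrow\; i = 2,
  \qquad \mu_2 = 0.6 < 1-\eta = 0.9, \\
&p_1\mu_1 + p_2\mu_2 = 0.9,\quad p_1 + p_2 = 1
  \;\Rightarrow\; p_1 = \frac{0.9-0.6}{0.95-0.6} = \frac{6}{7},\quad p_2 = \frac{1}{7}, \\
&p_1 r_1\mu_1 + p_2 r_2\mu_2 = \frac{6}{7}(0.95) + \frac{1}{7}(1.2)
  = \frac{6.9}{7} \approx 0.986 > 0.95 .
\end{align*}
```

In this instance the mixture earns more expected throughput than always selecting the first MCS, which is the only single index that meets the target.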

Hence, for any given CQI, the optimal MCS selection strategy corresponding to (2.4) has three possible outcomes: (i) no feasible solution, (ii) the throughput-maximizing MCS also satisfies the constraint and therefore is the single optimal MCS, and (iii) a probabilistic mixture of two MCSs optimizes the throughput subject to the BLER target. In the rest of this paper, we propose BayesLA, a Bayesian scheme that efficiently learns the optimal MCS selection strategy online, and present numerical performance results.

3 Link Adaptation Schemes

Outer Loop Link Adaptation

The base station obtains periodic, discretized CQI values, $q \in \mathcal{Q}$, from the receiver. In OLLA, the transmitter maps the CQI to approximate transmission success probabilities, $\bar{\mu}_{1,\dots,K}$, using a parameterized, offline link-to-system (L2S) lookup table $\mathcal{M}_\theta(q) : q \mapsto \bar{\mu}_{1,\dots,K}$ [11]. These transmission success probabilities are corrupted by modeling errors, quantization noise, and feedback delays. The transmitter compensates for these inaccuracies through a dynamic offset applied to the parameters of the L2S model through OLLA. OLLA adds a fixed offset, $\Delta^{\mathrm{up}}$, to the model parameters for every observed transmission success and subtracts another fixed offset, $\Delta^{\mathrm{down}}$, for every observed transmission failure. The ratio of the offset values is chosen to be $\frac{\Delta^{\mathrm{up}}}{\Delta^{\mathrm{down}}} = \frac{1-\eta}{\eta}$ such that OLLA converges asymptotically to the BLER target $\eta$. In every frame $t$, OLLA uses the latest parameter offset, $\Delta_t^{\mathrm{OLLA}}$, to estimate the transmission success probabilities for each MCS through a mapping function $f$ over the local offline model. The OLLA algorithm steps are described in Alg. 3.1.

There are three major shortcomings of OLLA: (i) OLLA converges to the BLER target $\eta$. However, as discussed in Sec. 2, the optimal BLER target is not known analytically. As a result, the MCS selected by OLLA after convergence may provide a sub-optimal throughput. (ii) The steady-state performance of OLLA depends on the scaling factor for the step adjustments. For relatively large step sizes, OLLA converges quickly towards the optimal rate selection policy, but suffers from larger fluctuations around the steady state. For small step sizes, on the other hand, OLLA has stable steady-state performance at the cost of slower convergence. (iii) Finally, from the definition of the step sizes, we observe that OLLA convergence is a function of $\eta$. For relatively large values of $\eta$, the upward adjustment step is small, and conversely for smaller values of $\eta$, the downward adjustment step is small.

Algorithm 3.1 Outer Loop Link Adaptation

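1: Initialize: $\bar{\mu}_{\{1,\dots,K\},0}$, $\Delta$, $\eta$, $q$
2: Initialize: $\Delta^{\mathrm{up}} = \frac{1-\eta}{\eta}\Delta$, $\Delta^{\mathrm{down}} = \Delta$
3: Initialize: $\Delta_0^{\mathrm{OLLA}} = 0$
4: for Time index $t = 1$ to $T$ do
5:   Set $\bar{\mu}_{\{1,\dots,K\},q(t)} \leftarrow f(\Delta_t^{\mathrm{OLLA}}, \mathcal{M}_\theta, q_t)$
6:   Select $k(t) = \operatorname{argmax}_k\ \bar{\mu}_{k,q(t)}\, r_k$ subject to $\bar{\mu}_{k(t),q(t)} \ge 1 - \eta$
7:   Observe the success event $x_{k(t)}$.
8:   if $x_{k(t)} = 1$ then
9:     Update: $\Delta_t^{\mathrm{OLLA}} \leftarrow \Delta_{t-1}^{\mathrm{OLLA}} + \Delta^{\mathrm{up}}$
10:  else
11:    Update: $\Delta_t^{\mathrm{OLLA}} \leftarrow \Delta_{t-1}^{\mathrm{OLLA}} - \Delta^{\mathrm{down}}$
12:  end if
13: end for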
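The feedback-driven part of Alg. 3.1 (steps 8 to 12) can be sketched in a few lines. The snippet below is only an illustration of the offset update: it takes the two step sizes as inputs and does not model the L2S table or the mapping function $f$, both of which depend on the offline model in [11]. The function name and argument names are hypothetical.

```python
def olla_offset_trajectory(acks, step_up, step_down, initial_offset=0.0):
    """Apply the ACK/NACK offset update of Alg. 3.1 to a feedback log.

    acks is a sequence of 1 (ACK) / 0 (NACK) values; step_up and step_down
    are the fixed offsets added on success and subtracted on failure.
    Returns the offset after every frame.
    """
    offset = initial_offset
    trajectory = []
    for x in acks:
        if x == 1:
            offset += step_up      # success: raise the offset
        else:
            offset -= step_down    # failure: lower the offset
        trajectory.append(offset)
    return trajectory
```

Because the offset moves by a fixed amount in every frame, it keeps moving even in steady state, which is the fluctuation that shortcoming (ii) refers to.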

BayesLA

The shortcomings of OLLA arise from the fact that it depends on a careful choice of tuning parameters for optimality. In contrast, we propose BayesLA, a Bayesian link adaptation approach, where the algorithm parameters are learnt automatically from the observed ACK / NACKs for prior transmissions. BayesLA first assigns uniform prior distributions over the a priori unknown conditional transmission success probabilities, $\mu_{k,q}$. In each frame, BayesLA samples these distributions to estimate the MCS that maximizes the expected throughput under the BLER target. Subsequently, BayesLA refines the posterior distributions with the observed ACK / NACK value for the selected MCS. This sampling-based approach, also known as Thompson sampling, efficiently balances the exploration-exploitation trade-off in learning the optimal action from a finite set of available actions [8].

BayesLA assigns independent Beta priors, $\mathrm{Beta}(\alpha_{k,q}(1), \beta_{k,q}(1))$, to the mean transmission success probability for each combination of MCS and CQI values using the offline L2S model $\mathcal{M}_\theta(q)$. These priors are updated with the observed ACK / NACK for the transmitted MCS, and the updated distributions are used for MCS selection in the next frame. The choice of Beta priors follows naturally from the conjugacy property of the Beta and Bernoulli distributions. In particular, since the observed ACK, $x_{k(t),q} = 1$, is Bernoulli-distributed (Sec. 2), the posterior distribution over $\mu_{k(t),q}$ is simply another Beta distribution with parameters $(\alpha_{k(t),q}(t) + 1,\ \beta_{k(t),q}(t))$. Conversely, for an observed NACK, the posterior Beta distribution has the parameters $(\alpha_{k(t),q}(t),\ \beta_{k(t),q}(t) + 1)$.
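The conjugate update described above amounts to incrementing one of two counters. The following minimal sketch, with hypothetical function names, shows the update for a single (MCS, CQI) pair together with the posterior mean, which is a convenient point estimate of the success probability.

```python
def update_beta(alpha, beta, ack):
    """Conjugate Beta-Bernoulli update for one ACK (1) or NACK (0)."""
    if ack not in (0, 1):
        raise ValueError("ack must be 0 or 1")
    # An ACK increments alpha; a NACK increments beta.
    return alpha + ack, beta + (1 - ack)


def posterior_mean(alpha, beta):
    """Mean of Beta(alpha, beta), i.e. the estimated success probability."""
    return alpha / (alpha + beta)
```

Starting from a uniform prior Beta(1, 1) and observing three ACKs and one NACK gives Beta(4, 2), whose mean is 2/3.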

In each frame indexed by the time interval $t$, BayesLA samples the success probabilities, $\tilde{\mu}_{k,q}(t),\ k \in \{1,\dots,K\}$, from the latest posterior for each MCS and for the observed CQI value. The sampled success probabilities are used to solve the LP in (3.1) and obtain the probabilistic MCS selection vector $[p_1^*(t),\dots,p_K^*(t)]$. The MCS index $k(t)$ is obtained by sampling this vector. For the selected MCS, the transmission success or failure event, ACK / NACK, is observed through the binary variable $x_{k(t)}$, which is used to update the Beta distribution of the selected MCS according to the update rule in (3.2).

Compared to the empirically selected OLLA parameters, the BayesLA algorithm parameters $\alpha_1,\dots,\alpha_K, \beta_1,\dots,\beta_K$ are initialized using the L2S model and refined online from the observed ACK / NACK for the transmitted MCSs. In [7], it was shown that the theoretical convergence of a closely related algorithm is guaranteed to be sub-linear in time. No comparable results are available for OLLA. Further, since BayesLA solves the sampled $LP(p(t))$ in every frame, it obtains the optimal probabilistic MCS selection strategy. Hence, BayesLA does not suffer from the problem of a sub-optimally defined BLER target as in the case of OLLA.

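Algorithm 3.2 BayesLA

1: Input: MCS indices $k = 1,\dots,K$, BLER target $\eta$
2: Initialize: $\{\alpha_{1,\dots,K}(1)\}_q \leftarrow \mathcal{M}_\theta(q)\ \forall\, q \in \mathcal{Q}$
3: Initialize: $\{\beta_{1,\dots,K}(1)\}_q \leftarrow \mathcal{M}_\theta(q)\ \forall\, q \in \mathcal{Q}$
4: for Time index $t = 1$ to $T$ do
5:   Sample $\tilde{\mu}_{k,q}(t) \sim \mathrm{Beta}(\alpha_{k,q}(t), \beta_{k,q}(t))$ for each MCS index $k = 1,\dots,K$ and with CQI $q = q_t$.
6:   Solve the linear program:

$$\begin{aligned}
LP(p(t)) : \quad \text{maximize} \quad & \sum_{k=1}^{K} \big( r_k\, \tilde{\mu}_{k,q}(t) \big)\, p_k(t) \\
\text{subject to} \quad & \sum_{k=1}^{K} \tilde{\mu}_{k,q}(t)\, p_k(t) \ge 1 - \eta, \\
& \sum_{k=1}^{K} p_k(t) = 1, \ \text{and} \\
& p_k(t) \ge 0 \quad \forall\, k \in \{1,\dots,K\}.
\end{aligned} \tag{3.1}$$

7:   if Eq. 3.1 has a feasible solution, then
8:     Sample $k(t) \sim [p_1^*(t),\dots,p_K^*(t)]$
9:   else
10:    Sample $k(t)$ uniformly from $\{1,\dots,K\}$.
11:  end if
12:  Observe: Transmission success $x_{k(t)}$.
13:  Update:

$$\begin{aligned}
\alpha_{k(t),q}(t+1) &= \alpha_{k(t),q}(t) + x_{k(t)} \\
\beta_{k(t),q}(t+1) &= \beta_{k(t),q}(t) + (1 - x_{k(t)}).
\end{aligned} \tag{3.2}$$

14: end for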
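The per-frame logic of Alg. 3.2 can be sketched with the Python standard library alone. This is a hedged illustration rather than the released simulation code [13]: it handles a single CQI value, stores the Beta parameters in plain lists, and solves (3.1) by enumerating the vertices of the feasible set (a single feasible MCS, or two MCSs with a tight constraint) instead of calling an LP solver. The names `solve_lp`, `bayesla_select`, and `bayesla_update` are hypothetical.

```python
import random
from itertools import combinations


def solve_lp(rates, mus, eta):
    """Vertex enumeration for LP (3.1); returns p, or None if infeasible."""
    num_mcs, floor = len(rates), 1.0 - eta
    candidates = []
    for k in range(num_mcs):                      # one MCS that meets the target
        if mus[k] >= floor:
            candidates.append({k: 1.0})
    for j, k in combinations(range(num_mcs), 2):  # two MCSs, constraint tight
        hi, lo = (j, k) if mus[j] >= mus[k] else (k, j)
        if mus[hi] > floor > mus[lo]:
            w = (floor - mus[lo]) / (mus[hi] - mus[lo])
            candidates.append({hi: w, lo: 1.0 - w})
    if not candidates:
        return None
    best = max(candidates,
               key=lambda c: sum(w * rates[k] * mus[k] for k, w in c.items()))
    return [best.get(k, 0.0) for k in range(num_mcs)]


def bayesla_select(alpha, beta, rates, eta, rng):
    """Steps 5-11 of Alg. 3.2 for the Beta parameters at the current CQI."""
    num_mcs = len(rates)
    # Step 5: draw one success probability per MCS from its posterior.
    sampled = [rng.betavariate(alpha[k], beta[k]) for k in range(num_mcs)]
    # Step 6: solve the LP for the sampled probabilities.
    p = solve_lp(rates, sampled, eta)
    if p is None:
        # Step 10: the sampled LP is infeasible, so pick an MCS uniformly.
        return rng.randrange(num_mcs)
    # Step 8: draw the MCS index from the probabilistic selection vector.
    return rng.choices(range(num_mcs), weights=p)[0]


def bayesla_update(alpha, beta, k, ack):
    """Step 13 / (3.2): update only the transmitted MCS, in place."""
    alpha[k] += ack
    beta[k] += 1 - ack
```

A simulation loop would call `bayesla_select`, draw an ACK or NACK from the true success probability of the chosen MCS, and pass the outcome to `bayesla_update`. Passing a seeded `random.Random` instance as `rng` keeps such runs reproducible.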

Figure 4.1: Average throughput and BLER performance for outer loop link adaptation (OLLA) and Bayesian link adaptation (BayesLA) for a Rayleigh fading channel. Panels: (a) throughput for BLER target 0.1, (b) BLER for BLER target 0.1, (c) throughput for BLER target 0.3, (d) BLER for BLER target 0.3. BayesLA achieves a higher throughput and lower BLER compared to OLLA, thereby improving the link performance.

4 Numerical Evaluation

We conduct numerical simulations of a 4G wireless link model in Python. The transmission parameters are obtained from the 4G physical-layer standard [12] and are listed in Tab. 4.1. To obtain the offline L2S model, $\mathcal{M}_\theta(q)$, we use the approach in [11]. The offline model is used to aid link adaptation over a single-tap Rayleigh fading channel with a normalized Doppler of 0.01. For each experiment, we simulate a downlink data transmission channel with 5000 contiguous frame transmissions. The throughput and block error rates are obtained by averaging the realized throughput and the NACKs respectively over 1000 independent links with randomly generated channel and noise realizations. We simulate two algorithms: OLLA and BayesLA. The OLLA step size is chosen to be 0.1 dB based on the analysis in [2]. We initialize the priors for BayesLA in the following manner: for the $k$th MCS, the prior is $\mathrm{Beta}(1 + \lfloor N \bar{\mu}_k(q) \rfloor,\ 1 + \lfloor N (1 - \bar{\mu}_k(q)) \rfloor)$, where $\bar{\mu}_k(q)$ is the success probability provided by the L2S model $\mathcal{M}_\theta(q)$ for the channel quality index $q$, and $N = 10$ controls the confidence in the L2S model. The complete simulation code is available at [13].
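The prior initialization above is a direct formula, sketched below for a single MCS and CQI value. The function name `l2s_prior` is hypothetical, and the success probability passed in stands in for the L2S model output $\bar{\mu}_k(q)$, which is not reproduced here.

```python
import math


def l2s_prior(mu_bar, n=10):
    """Beta prior parameters 1 + floor(N*mu_bar) and 1 + floor(N*(1 - mu_bar)).

    n plays the role of N, the confidence placed in the offline L2S model.
    Floating-point rounding can shift a product that lands exactly on an
    integer boundary, so inputs near such boundaries deserve care.
    """
    if not 0.0 <= mu_bar <= 1.0:
        raise ValueError("mu_bar must be a probability")
    alpha = 1 + math.floor(n * mu_bar)
    beta = 1 + math.floor(n * (1.0 - mu_bar))
    return alpha, beta
```

With $N = 10$, a model success probability of 0.75 yields Beta(8, 3), so the prior behaves like roughly ten pseudo-observations drawn from the L2S prediction on top of a uniform Beta(1, 1).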

Table 4.1: Simulation Parameters

Parameter                Value
K                        28
T                        1000
η                        0.1, 0.3
SNR                      15 dB
Channel fading type      Rayleigh
Normalized Doppler       0.01
Number of subcarriers    72 (i.e., 6 resource blocks) [12]
Transport Block Sizes    From [12, Tab. 7.1.7.2.1-1]
Modulation orders        From [12, Tab. 7.1.7.1-1]

In Fig. 4.1(a) and Fig. 4.1(b), we plot the realized throughput and the realized BLER for the considered algorithms when the BLER target is set to 0.1. The BayesLA and OLLA algorithms have similar throughput at the beginning of the experiment, along with a BLER close to 0.1. However, after a few transmissions, BayesLA achieves a slightly better throughput than OLLA. In addition, BayesLA converges to a lower BLER compared to OLLA. Hence, by initializing the BayesLA priors from the offline L2S model, it is possible to achieve a good initial performance, and to subsequently refine it through online parameter updates to improve the performance compared to state-of-the-art OLLA.

In Fig. 4.1(c) and Fig. 4.1(d), we plot the realized throughput and the realized BLER for a BLER target of 0.3. For this BLER target, we find that OLLA suffers a significant performance drop. Since OLLA is designed to converge to the BLER target, it adjusts its offset values such that it converges to a BLER close to 0.3. However, since a BLER target of 0.3 is sub-optimal for this particular wireless configuration, the throughput performance of OLLA is sub-optimal. In contrast, BayesLA optimizes the throughput under the BLER target of 0.3, thereby achieving a high throughput along with a lower BLER value, which is optimal for the simulated wireless scenario. Hence, BayesLA mitigates the effect of poorly defined link adaptation parameters.

5 Conclusions

In this paper, we address link adaptation in terms of MCS selection under a BLER target. We formulate a rigorous model of this constrained optimization problem and develop a linear programming relaxation suitable for the wireless setting. We propose a Bayesian online learning algorithm, BayesLA, which optimizes the link throughput by learning the optimal probabilistic MCS selection policy. Further, compared to the state-of-the-art OLLA approach, BayesLA mitigates the reliance on heuristically configured parameters that are often difficult to tune for optimal link performance. Through numerical experiments, we demonstrate that BayesLA improves the link performance compared to OLLA both in terms of the realized throughput and BLER.

Acknowledgements

This work was partially supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and by the European Research Council project AGNOSTIC (742648).

References

[1] A. Molisch, Wireless Communications. Wiley, 2010.

[2] P. Wu and N. Jindal, “Coding versus ARQ in fading channels: How reliable should the PHY be?,” IEEE Transactions on Communications, vol. 59, no. 12, pp. 3363–3374, 2011.

[3] S. Park, R. C. Daniels, and R. W. Heath, “Optimizing the target error rate for link adaptation,” in 2015 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, IEEE, 2015.

[4] V. Buenestado, J. M. Ruiz-Avilés, M. Toril, S. Luna-Ramírez, and A. Mendo, “Analysis of throughput performance statistics for benchmarking LTE networks,” IEEE Communications Letters, vol. 18, pp. 1607–1610, Sept. 2014.

[5] V. Saxena, J. Jaldén, M. Bengtsson, and H. Tullberg, “Deep learning for frame error probability prediction in BICM-OFDM systems,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6658–6662, IEEE, 2018.

[6] R. A. Delgado, K. Lau, R. Middleton, R. S. Karlsson, T. Wigren, and Y. Sun, “Fast convergence outer loop link adaptation with infrequent updates in steady state,” in 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), pp. 1–5, Sept. 2017.

[7] V. Saxena, J. Jaldén, J. E. Gonzalez, M. Bengtsson, H. Tullberg, and I. Stoica, “Contextual multi-armed bandits for link adaptation in cellular networks,” in Proceedings of the 2019 Workshop on Network Meets AI & ML, pp. 44–49, 2019.

[8] O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,” in Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.

[9] H. Qi, Z. Hu, X. Wen, and Z. Lu, “Rate adaptation with Thompson sampling in 802.11ac WLAN,” IEEE Communications Letters, vol. 23, no. 10, pp. 1888–1892, 2019.

[10] V. Saxena, J. Jaldén, J. E. Gonzalez, I. Stoica, and H. Tullberg, “Constrained Thompson sampling for wireless link optimization,” arXiv preprint arXiv:1902.11102, 2019.

[11] K. Brueninghaus, D. Astely, T. Salzer, S. Visuri, A. Alexiou, S. Karger, and G.-A. Seraji, “Link performance models for system level simulations of broadband radio access systems,” in Personal, Indoor and Mobile Radio Communications, 2005. PIMRC 2005. IEEE 16th International Symposium on, vol. 4, pp. 2306–2311, IEEE, 2005.

[12] 3rd Generation Partnership Project, “Evolved Universal Terrestrial Radio Access (E-UTRA); Physical layer procedures,” Tech. Rep. 36.213 v12.3.0, Sept. 2016.

[13] V. Saxena, “bayesla-link-adaptation.” https://github.com/vidits-kth/bayesla-link-adaptation, 2020.

Paper V


Thompson Sampling for Linearly Constrained Bandits

We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only $O(\sqrt{T})$ bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by $O(\log T)$ for the suboptimal arms. We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

1 Introduction

Multi-armed bandits (MAB) are a systematic way of modeling sequential decision problems. In MAB problems, an agent plays a sequence of arms with the aim of optimizing some cumulative objective over $T$ discrete time intervals (“rounds”). The agent strives to achieve this goal by sequentially exploring the available arms and exploiting historical rewards from previously selected arms. MABs have been successfully applied to problems in dynamic pricing, online procurement, and digital advertising, where the goal is to minimize the regret (i.e., the cumulative reward loss) over a finite time horizon [1].

Despite the popularity of MABs, the unconstrained reward maximization goal does not extend to several commonly encountered sequential decision problems. In this paper, we consider a constrained MAB problem where the objective function is subject to a specific type of linear constraint, namely, that the probability of receiving a reward in any round exceeds a fixed, pre-defined threshold. A few recently studied applications of this type of constraint are listed below.

Weblink selection: Several online content providers earn revenue by advertising affiliate weblinks on their page. In this context, the goal is to select a subset of links from a pool of available links, which collectively maximize the provider's ad revenue. At the same time, the provider typically wants to avoid displaying some high-revenue links that have a low click-through rate (CTR), since such links may be perceived as clutter and drive down user satisfaction. In [2], this problem is formulated as a constrained MAB where the probability of clicking on one of the displayed weblinks must exceed a minimum CTR threshold.

Wireless rate selection: Wireless communication networks strive to optimize the data transmission rate for packetized information bits. Aggressive data transmission rates carry more per-packet information but also suffer from more frequent packet failures compared to conservative rates. For latency-sensitive applications such as video streaming and online gaming, the goal of wireless rate selection is to maximize the average data throughput while simultaneously maintaining a minimum packet success frequency. Recently, [3] proposed a constrained MAB approach for this rate selection problem that was shown to outperform state-of-the-art techniques.

Thompson sampling (TS) is an efficient Bayesian heuristic for MAB optimization [1, 4]. TS operates by conditioning the probability of selecting an optimal arm on the historical rewards observed in previous rounds. The regret for TS has been shown to scale as $O(\log T)$ in [5, 6], where the constant factors depend on the distribution of the underlying problem parameters. These bounds asymptotically achieve the optimal regret established in [7]. Further, [8] proposed a TS-based algorithm for a stochastic packing problem commonly referred to as Bandits with Knapsacks (BwK), for which the distribution-independent regret upper bounds were shown to scale as $O(\sqrt{T})$. For BwK, $O(\log T)$ distribution-dependent lower bounds were developed in [9].

In this paper, we propose LinConTS, a TS-based algorithm for the specific type of linear constraint described above. Our algorithm encodes a probabilistic arm selection policy, where the arm selection weights are obtained by solving a linear program (LP) subroutine in every round. We denote the arms supported by the stationary optimal policy as optimal arms and the rest as suboptimal arms. We show that for the rounds where the suboptimal arms are played, the LinConTS regret is upper bounded by $O(\log T)$. Further, we show that for these arms, the violation metric, i.e., the cumulative number of constraint violations, is also upper bounded by $O(\log T)$. For the rounds where the optimal arms are played, the regret and violation metrics scale only as fast as $O(\sqrt{T})$.

Proof technique: The recipe for MABs with linearly constrained objectives is to invoke an LP subroutine in each round. The central problem in analyzing constrained MABs then arises from the fact that the arm parameters cannot be compared directly; instead, they interact with each other through the LP. To handle this issue, our proof technique relies on careful examination of the dual LP problem. In particular, to bound the probability of selecting a particular arm, we analyze stochastic perturbations of the convex polytope enclosing the LP feasible region. We use these perturbation results in conjunction with the theoretical analysis of TS developed in [6].

2 Problem Formulation

Consider a MAB with $N$ arms, where at each time step $t = 1, 2,\dots,T$, one of the arms must be played. Playing the arm $i(t) \in \{1,\dots,N\}$ results in a Bernoulli-distributed reward event with mean $\mu_{i(t)}$. Reward events are independent across the arms and across successive plays of an arm. In case of a successful reward event, a reward value $r_{i(t)} \in (0, 1]$ is immediately collected by the player. We assume that the $\mu_i$s are a priori unknown while the $r_i$s are deterministic and known in advance.

In the context of weblink selection, the arms $(\mu_i, r_i)$ correspond to the pool of available weblinks. For a selected weblink that is displayed on a target webpage, the reward event denotes a Bernoulli-distributed user click with an unknown mean CTR $\mu_i$. The reward value $r_i$ captures the revenue generated when a user clicks on the displayed link. If a user clicks on the displayed link, the reward event is deemed to be true and the corresponding reward value is collected. If, on the other hand, a user fails to click on the link, the reward event is false and no reward is collected in that round.

The problem objective is to maximize the cumulative expected reward value, $\sum_{t=1}^{T} \mu_{i(t)} r_{i(t)}$, subject to the constraint $(1/T) \sum_{t=1}^{T} \mu_{i(t)} \ge \eta$, where $\eta \in [0, 1]$. The constraint, which ensures that at least a fraction $\eta$ of rounds result in a reward event, must be satisfied with a high probability. This problem can be formulated using the LP relaxation

$$\begin{aligned}
LP(\mu) : \quad \text{maximize} \quad & \sum_{i} x_i \mu_i r_i \\
\text{subject to} \quad & \sum_{i} x_i \mu_i \ge \eta, \\
& \sum_{i} x_i = 1, \\
& x_i \ge 0, \quad i = 1,\dots,N,
\end{aligned} \tag{2.1}$$

where $x_i$ denotes the probability of selecting arm $i$ and therefore $x = [x_1,\dots,x_N]$ is the probabilistic arm selection vector. The arms assigned probability mass $x_i > 0$ by the LP solution constitute the set of optimal arms that are played with a non-zero probability.
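As a small worked instance of $LP(\mu)$, consider a hypothetical two-arm problem whose numbers are chosen only for illustration. The first arm has the larger expected reward value but too low a reward probability, so the LP solution mixes it with the second arm until the constraint holds with equality.

```latex
\begin{align*}
&N = 2,\quad \eta = 0.6,\quad (\mu_1, r_1) = (0.4,\ 1.0),\quad (\mu_2, r_2) = (0.9,\ 0.3), \\
&\mu_1 r_1 = 0.4 > \mu_2 r_2 = 0.27,\qquad \mu_1 = 0.4 < \eta, \\
&0.4\,x_1 + 0.9\,x_2 = 0.6,\quad x_1 + x_2 = 1
  \;\Rightarrow\; x_1 = 0.6,\quad x_2 = 0.4, \\
&x_1\mu_1 r_1 + x_2\mu_2 r_2 = 0.6\,(0.4) + 0.4\,(0.27) = 0.348 .
\end{align*}
```

Playing only the second arm satisfies the constraint but earns 0.27 per round, whereas the mixture earns 0.348 while still meeting the threshold $\eta$.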

Stationary optimal policy: If the $\mu_i$s were known in advance, the stationary optimal policy would be found by solving $LP(\mu)$. We denote the solution of this LP with $x^* = [x_1^*, \ldots, x_N^*]$, which is the optimal stationary probabilistic arm selection vector. The expected reward of the stationary optimal policy is therefore $r^* = \sum_i x_i^* \mu_i r_i$. The arms assigned a probability mass $x_i^* > 0$ by the optimal policy constitute the set of stationary optimal arms and the rest of the arms are stationary suboptimal arms.

Assumptions: We assume that the arm with the highest expected reward value, $i_{\max} = \arg\max_i \mu_i r_i$, does not also satisfy the constraint, i.e., $\mu_{i_{\max}} < \eta$ (otherwise the problem reduces to an unconstrained MAB problem). Consequently, to simultaneously satisfy the reward maximization and the constraint satisfaction objectives, a mixture of high-reward-value and high-reward-event-probability arms is required. Further, we assume that a strictly feasible solution exists for $LP(\mu)$ and that there are no degenerate cases (i.e., the stationary optimal policy is a unique mixture of the arms).

Regret and violation: The performance of any MAB algorithm is typically measured in terms of its regret, $\mathcal{R}(T)$, which is the cumulative reward loss compared to selecting the reward-maximizing arm in hindsight. However, in a constrained MAB setting, the optimal arms balance the need for reward maximization with constraint fulfilment. As a consequence, it is possible for a suboptimal policy to exceed the cumulative reward achieved by the optimal policy, e.g., by frequently selecting a high-reward arm that violates the constraint. Therefore, we additionally define a violation metric $\mathcal{V}(T)$ to measure the cumulative constraint violations until time $T$. For the estimated reward event probabilities in the $t$-th round, $\tilde{\mu}_t = [\tilde{\mu}_{1,t}, \ldots, \tilde{\mu}_{N,t}]$, let the instantaneous, probabilistic arm selection vector be $x_t = [x_{1,t}, \ldots, x_{N,t}]$. Then the expected regret and violation until time $T$ are given by

\begin{align}
\mathbb{E}[\mathcal{R}(T)] &= \mathbb{E}\Big[ T r^* - \sum_{t=1}^{T} \sum_i x_{i,t} \mu_i r_i \Big]_+ \tag{2.2} \\
&= \Big[ \sum_i \Delta_i \, \mathbb{E}[k_i(T+1)] \Big]_+, \tag{2.3} \\
\mathbb{E}[\mathcal{V}(T)] &= \mathbb{E}\Big[ T \eta - \sum_{t=1}^{T} \sum_i x_{i,t} \mu_i \Big]_+ \tag{2.4} \\
&= \Big[ \sum_i (\eta - \mu_i) \, \mathbb{E}[k_i(T+1)] \Big]_+, \tag{2.5}
\end{align}

where $k_i(t)$ denotes the number of times that the arm $i$ is played until time $t - 1$, $\Delta_i = r^* - \mu_i r_i$ is the expected loss in reward for arm $i$, and $[x]_+ = \max\{x, 0\}$. Intuitively, the regret metric measures the amount of reward lost due to selecting suboptimal arms, or due to selecting optimal arms with a frequency different from the stationary optimal policy. In order to minimize regret, a policy could simply choose to search for the reward-maximizing arm without optimizing for the constraint. To account for this behaviour, the violation metric keeps track of the constraint violations of the policy. Any useful policy must keep both the regret and the violation small at the same time. In the next sections, we develop finite-time upper bounds on the regret and violation metrics for LinConTS.
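To make the interplay between the two metrics concrete, the sketch below evaluates the play-count forms of the regret and violation expressions for given expected play counts. It is only an illustration: the function name and the three-arm numbers (with $\eta = 0.5$ and $r^* = 0.26$) are hypothetical.

```python
def regret_and_violation(mu, r, eta, r_star, plays):
    """Evaluate [sum_i Delta_i*k_i]_+ and [sum_i (eta - mu_i)*k_i]_+.

    plays[i] stands for the expected number of plays k_i of arm i.
    """
    regret = sum((r_star - m * ri) * k for m, ri, k in zip(mu, r, plays))
    violation = sum((eta - m) * k for m, k in zip(mu, plays))
    return max(regret, 0.0), max(violation, 0.0)


mu, r, eta, r_star = [0.3, 0.8, 0.4], [1.0, 0.25, 0.5], 0.5, 0.26

# Always playing the high-reward arm: no regret, but the constraint is violated.
greedy = regret_and_violation(mu, r, eta, r_star, [1000, 0, 0])

# Always playing the high-success-probability arm: no violation, but regret accrues.
safe = regret_and_violation(mu, r, eta, r_star, [0, 1000, 0])

# Playing the two optimal arms in the 0.6 / 0.4 proportion: both metrics vanish.
mixed = regret_and_violation(mu, r, eta, r_star, [600, 400, 0])
```

The first case shows why regret alone is not a sufficient metric in the constrained setting: a policy can have zero regret while violating the constraint in expectation.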

3 Algorithm and Finite-Time Analysis

LinConTS

TS assigns prior distributions over the unknown MAB parameters. Subsequently, for the arm played in every round, the arm posterior is updated using the observed reward. For the problem studied in this paper, the unknown parameters are the reward event means $\mu_1, \ldots, \mu_N$. Since the reward events are Bernoulli-distributed, a suitable choice of prior is the Beta distribution [1].

The LinConTS algorithm is described in Alg. 3.1. At the beginning of the experiment, the reward values $r_i$, $i \in \{1, \ldots, N\}$, and the constraint $\eta$ are provided. For each arm, LinConTS assigns independent uniform priors distributed as $\mathrm{Beta}(\alpha_{i,0} = 1, \beta_{i,0} = 1)$. At every time step $t = 1, \ldots, T$, LinConTS obtains Thompson samples $\tilde{\mu}_{i,t} \sim \mathrm{Beta}(\alpha_{i,t-1}, \beta_{i,t-1})$ and subsequently solves an LP parameterized by these samples to obtain the instantaneous, probabilistic arm selection vector $x_t = [x_{1,t}, \ldots, x_{N,t}]$. If the LP is feasible, this vector is sampled to obtain the playing arm, $i(t) \sim x_t$. Subsequently, the reward event $c_{i(t),t} \sim \mathrm{Bern}(\mu_{i(t)})$ is observed and the reward $c_{i(t),t} r_{i(t)}$ is collected. The parameter distribution for the played arm is updated according to the rule $(\alpha_{i(t),t}, \beta_{i(t),t}) = (\alpha_{i(t),t-1}, \beta_{i(t),t-1}) + (c_{i(t),t}, 1 - c_{i(t),t})$.

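Algorithm 3.1 LinConTS

 1: Input: Reward values $r_{\{1,\ldots,N\}}$, constraint $\eta$
 2: Initialize: $\alpha_{\{1,\ldots,N\},0} = 1$, $\beta_{\{1,\ldots,N\},0} = 1$.
 3: for Time index $t = 1$ to $T$ do
 4:   if $t$ [...] (the condition and steps 5 to 7 are illegible in the source)
 8:     Solve $LP(\tilde{\mu}_t)$ for the sampled means, subject to $x_{i,t} \geq 0 \;\; \forall i \in \{1, \ldots, N\}$
 9:     if a (feasible) optimal solution exists then
10:       Sample $i(t) \sim [x_{1,t}, \ldots, x_{N,t}]$
11:     else
12:       Sample $i(t)$ uniformly from $\{1, \ldots, N\}$.
13:     end if
14:   end if
15:   Observe: Reward event $c_{i(t),t} \in \{0, 1\}$.
16:   Update: $\alpha_{i(t),t} = \alpha_{i(t),t-1} + c_{i(t),t}$, $\beta_{i(t),t} = \beta_{i(t),t-1} + (1 - c_{i(t),t})$.
17: end for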
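The loop described above can be sketched in a few lines of Python. This is a hedged illustration and not the authors' released code: it follows the prose description (Beta priors, Thompson samples, an LP solve, sampling an arm, a Beta update), it omits the initial branch of the algorithm listing, and it solves the LP by enumerating one-arm and two-arm vertices, which is an assumption made to keep the example dependency-free. The names `solve_lp` and `lincon_ts` are hypothetical.

```python
import random
from itertools import combinations


def solve_lp(mu, r, eta):
    """Vertex enumeration for the LP in (2.1); returns x, or None if infeasible."""
    n, best_x, best_val = len(mu), None, -1.0
    candidates = [((i,), (1.0,)) for i in range(n) if mu[i] >= eta]
    for i, j in combinations(range(n), 2):
        lo, hi = (i, j) if mu[i] < mu[j] else (j, i)
        if mu[lo] < eta < mu[hi]:
            w = (mu[hi] - eta) / (mu[hi] - mu[lo])
            candidates.append(((lo, hi), (w, 1.0 - w)))
    for arms, weights in candidates:
        val = sum(w * mu[a] * r[a] for a, w in zip(arms, weights))
        if val > best_val:
            best_val = val
            best_x = [0.0] * n
            for a, w in zip(arms, weights):
                best_x[a] = w
    return best_x


def lincon_ts(true_mu, r, eta, horizon, seed=0):
    """Run the sketched loop; true_mu is used only to simulate reward events."""
    rng = random.Random(seed)
    n = len(r)
    alpha, beta = [1.0] * n, [1.0] * n  # uniform Beta(1, 1) priors
    plays = [0] * n
    for _ in range(horizon):
        # Thompson samples of the reward event means.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        x = solve_lp(theta, r, eta)
        if x is not None:
            arm = rng.choices(range(n), weights=x)[0]
        else:
            arm = rng.randrange(n)  # infeasible LP: pick an arm uniformly
        c = 1 if rng.random() < true_mu[arm] else 0  # simulated Bernoulli event
        alpha[arm] += c
        beta[arm] += 1 - c
        plays[arm] += 1
    return plays, alpha, beta


plays, alpha, beta = lincon_ts([0.3, 0.8, 0.4], [1.0, 0.25, 0.5], eta=0.5, horizon=500, seed=1)
```

Each round adds exactly one pseudo-count to the played arm, so the Beta parameters summed over all arms grow by one per round.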

Regret and Violation Upper Bounds

The theoretical analysis of TS is challenging, and as such only an empirical evaluation of its performance was available until recently [10]. The chief reason for TS' theoretical intractability is that, owing to its randomized nature, novel techniques are required to bound the number of draws for suboptimal arms. These techniques were first introduced in [11] and extended in [5, 6]. At a high level, they compare the Thompson samples for each arm with carefully selected thresholds related to empirical estimates of the arm parameters and their true values. Subsequently, by using known bounds on the sums of Bernoulli random variables, these techniques bound the expected number of times that any suboptimal arm is played.

In case of LinConTS, the selected arm in round $t$ depends on the solution of $LP(\tilde{\mu}_t)$. Therefore, to bound the probability of selecting a suboptimal arm, we need to analyze the LP in (2.1) with perturbed reward event means. We address this challenge by formulating the Lagrangian dual of the LP, which can be solved to obtain specific thresholds for each suboptimal arm. When these thresholds are breached under certain well-specified conditions, the suboptimal arm is selected with a nonzero probability. We use these thresholds in combination with the proof technique of [6] to show that for the suboptimal arms, the expected number of plays is bounded by $O(\log T)$.

Under the assumption of no degenerate cases and that the reward-maximizing arm does not simultaneously satisfy the constraint, exactly two arms support (2.1). This well-known result for LPs agrees with the following intuition: the optimal solution to (2.1) frequently selects a high-reward-value arm, which comes at the cost of frequent failures (i.e., no-reward events). Hence, to satisfy the constraint, the optimal solution sometimes picks another, low-reward-value arm that has a high success probability. Without loss of generality, we denote the optimal arm indices with 1 and 2 respectively, with parameters that satisfy $0 < \mu_1 < \eta < \mu_2 \leq 1$ and $r_1 \mu_1 > r_2 \mu_2$. Subsequently, we obtain the following regret and violation bounds:

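Theorem 4. The expected regret for LinConTS,

\[
\mathbb{E}[\mathcal{R}(T)] \leq \sum_{i \notin \{1,2\}} \frac{(1+\varepsilon)^2}{d(\mu_i, \xi_i)} \Delta_i^+ \log T + \Delta_{\max}^+ \cdot 18\sqrt{2T \log 2} + O\Big(\frac{N}{\varepsilon^2}\Big),
\]

where $\xi_i = \frac{(r_1 - r_2)\mu_1 \mu_2}{(r_i - r_2)\mu_2 - (r_i - r_1)\mu_1} > \mu_i$, $\varepsilon \in (0, 1]$, $d(\mu_i, \xi_i) = \mu_i \log\frac{\mu_i}{\xi_i} + (1 - \mu_i) \log\frac{1 - \mu_i}{1 - \xi_i}$, $\Delta_i^+ = \max\{0, r^* - \mu_i r_i\}$, and $\Delta_{\max}^+ = \max_{i \in [N]} \Delta_i^+$.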

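Theorem 5. The expected violation for LinConTS,

\[
\mathbb{E}[\mathcal{V}(T)] \leq \sum_{i \notin \{1,2\}} \frac{(1+\varepsilon)^2}{d(\mu_i, \xi_i)} \delta_i^+ \log T + \delta_{\max}^+ \cdot 18\sqrt{2T \log 2} + O\Big(\frac{N}{\varepsilon^2}\Big),
\]

where $\delta_i^+ = \max\{0, \eta - \mu_i\}$ and $\delta_{\max}^+ = \max_{i \in [N]} \delta_i^+$.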
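As a rough numerical illustration of the logarithmic terms in the two theorems, the sketch below computes the Bernoulli divergence $d(\mu_i, \xi_i)$ and the coefficient multiplying $\log T$ for one suboptimal arm. The function names, the symbol `eps` for the free parameter in $(0, 1]$, and the example instance are assumptions made for this example.

```python
import math


def bernoulli_kl(p, q):
    """d(p, q) = p*log(p/q) + (1-p)*log((1-p)/(1-q)), for p and q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def xi_threshold(r_i, mu1, r1, mu2, r2):
    """Threshold xi_i for a suboptimal arm with reward value r_i."""
    return (r1 - r2) * mu1 * mu2 / ((r_i - r2) * mu2 - (r_i - r1) * mu1)


def log_coefficient(gap, mu_i, xi_i, eps):
    """Coefficient of log T: (1 + eps)^2 * max(gap, 0) / d(mu_i, xi_i)."""
    return (1.0 + eps) ** 2 * max(gap, 0.0) / bernoulli_kl(mu_i, xi_i)


# Hypothetical instance: optimal arms (mu1, r1) = (0.3, 1.0) and (mu2, r2) = (0.8, 0.25),
# suboptimal arm (mu3, r3) = (0.4, 0.5), eta = 0.5, r_star = 0.26.
xi3 = xi_threshold(0.5, 0.3, 1.0, 0.8, 0.25)
regret_coeff = log_coefficient(0.26 - 0.4 * 0.5, 0.4, xi3, eps=0.5)
violation_coeff = log_coefficient(0.5 - 0.4, 0.4, xi3, eps=0.5)
```

The closer the mean of a suboptimal arm is to its threshold, the smaller the divergence and the larger the coefficient, which matches the intuition that such arms are harder to rule out.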

LP Perturbation Analysis

The Lagrangian $\mathcal{L}(x, \lambda, \nu, \gamma)$ of the primal LP in (2.1) can be formulated as

\begin{align}
\mathcal{L}(x, \lambda, \nu, \gamma) &= -\sum_i x_i \mu_i r_i + \lambda\Big(\eta - \sum_i x_i \mu_i\Big) + \nu\Big(\sum_i x_i - 1\Big) - \sum_i \gamma_i x_i \notag \\
&= \lambda\eta - \nu - \sum_i (r_i \mu_i + \lambda \mu_i - \nu + \gamma_i) x_i, \tag{3.2}
\end{align}

where $x \in \mathbb{R}^N$ is the optimization variable, and $\lambda \in \mathbb{R}$, $\nu \in \mathbb{R}$, and $\gamma \in \mathbb{R}^N$ are the Lagrange dual variables. The corresponding dual function is

\begin{align}
g(\lambda, \nu, \gamma) &= \inf_x \mathcal{L}(x, \lambda, \nu, \gamma) \tag{3.3} \\
&= \begin{cases}
\lambda\eta - \nu, & r_i \mu_i + \lambda \mu_i - \nu + \gamma_i = 0, \; \lambda \geq 0, \; \gamma_i \geq 0, \\
-\infty, & \text{otherwise}.
\end{cases} \tag{3.4}
\end{align}

We assume that a strictly feasible solution exists to $LP(\mu)$, otherwise there can be no policy that achieves the objective. Then, strong duality holds from Slater's condition and the primal optimal is equal to the dual optimal, i.e., there is no duality gap. Hence, from the KKT optimality conditions,

\begin{align}
\lambda^* \Big(\eta - \sum_i x_i^* \mu_i\Big) &= 0, \tag{3.5} \\
\nu^* \Big(\sum_i x_i^* - 1\Big) &= 0, \text{ and} \tag{3.6} \\
\sum_i \gamma_i^* x_i^* &= 0, \tag{3.7}
\end{align}

where $\lambda^*, \nu^*, \gamma^*$ are the optimal duals. Since $\gamma_i^* \geq 0$ and $x_i^* \geq 0$, each term of $\sum_i \gamma_i^* x_i^*$ is non-negative. Consequently, for $\sum_i \gamma_i^* x_i^* = 0$ to hold, $x_i^* > 0$ implies that $\gamma_i^* = 0$ and $\gamma_i^* > 0$ implies that $x_i^* = 0$. Under the assumption of no degenerate cases, strict complementarity holds, i.e., either $x_i^* > 0$ or $\gamma_i^* > 0$ [12, Chap. 5].

Optimal arms: The arms for which $\gamma_i^* = 0$ are assigned positive probabilities $x_i^* > 0$ by the solution to $LP(\mu)$. We can then obtain the optimal dual variables

by solving the system of linear equations

\[
r_i \mu_i + \lambda^* \mu_i - \nu^* = 0, \quad i \in \{ j : \gamma_j^* = 0, \; j \in \{1, \ldots, N\} \},
\]

which gives $\lambda^* = \frac{r_1 \mu_1 - r_2 \mu_2}{\mu_2 - \mu_1} \geq 0$ and $\nu^* = \frac{(r_1 - r_2)\mu_1 \mu_2}{\mu_2 - \mu_1}$. Note that $-\lambda^*$ is the slope of the hyperplane that optimizes $LP(\mu)$. Further, by solving $\sum_{i=1,2} x_i^* \mu_i = \eta$, we get the optimal selection probabilities $x_1^* = (\mu_2 - \eta)/(\mu_2 - \mu_1)$ and $x_2^* = (\eta - \mu_1)/(\mu_2 - \mu_1)$.

Suboptimal arms and complementary slackness: In contrast to the optimal arms, the arms where $x_i^* = 0$, $\gamma_i^* > 0$ constitute suboptimal arms that are assigned zero probability mass by the optimal solution. For these arms, the constraints are slack so that sufficiently small perturbations of their parameters do not alter the optimal solution to $LP(\mu)$. We now quantify the slack for each suboptimal arm $i \notin \{1, 2\}$. For this, we observe that the arms stay suboptimal as long as $\gamma_i^* > 0$. Rearranging (3.4) to obtain $r_i \mu_i + \lambda^* \mu_i - \nu^* = -\gamma_i^* < 0$, we get $\mu_i < \frac{\nu^*}{r_i + \lambda^*}$. Hence, we define

\[
\xi_i := \frac{\nu^*}{r_i + \lambda^*} = \frac{(r_1 - r_2)\mu_1 \mu_2}{(r_i - r_2)\mu_2 - (r_i - r_1)\mu_1} > \mu_i. \tag{3.8}
\]

The value $\xi_i - \mu_i > 0$ is the complementary slackness for arm $i$. As long as any (perturbed) mean value, $\mu_i'$, for arm $i$ satisfies $\mu_i' < \xi_i$, the arm $i$ stays suboptimal, i.e., the arm is assigned zero selection probability mass.
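The closed-form expressions above can be checked numerically on a small instance. The following sketch is illustrative only; the function names and the three-arm numbers are hypothetical, and the sign conventions follow the reconstruction of the Lagrangian given in this section.

```python
def optimal_duals(mu1, r1, mu2, r2, eta):
    """Closed-form duals and selection probabilities for the two optimal arms."""
    lam = (r1 * mu1 - r2 * mu2) / (mu2 - mu1)
    nu = (r1 - r2) * mu1 * mu2 / (mu2 - mu1)
    x1 = (mu2 - eta) / (mu2 - mu1)
    x2 = (eta - mu1) / (mu2 - mu1)
    return lam, nu, x1, x2


def slack_threshold(r_i, lam, nu):
    """xi_i = nu / (r_i + lam): arm i stays suboptimal while its mean is below it."""
    return nu / (r_i + lam)


# Hypothetical instance: arm 1 = (0.3, 1.0), arm 2 = (0.8, 0.25), eta = 0.5.
mu1, r1, mu2, r2, eta = 0.3, 1.0, 0.8, 0.25, 0.5
lam, nu, x1, x2 = optimal_duals(mu1, r1, mu2, r2, eta)

# Stationarity holds with zero slack for both optimal arms.
residual_1 = r1 * mu1 + lam * mu1 - nu
residual_2 = r2 * mu2 + lam * mu2 - nu

# Strong duality: the primal reward equals nu - lam * eta.
primal = x1 * mu1 * r1 + x2 * mu2 * r2
dual = nu - lam * eta

# A third arm with reward value 0.5 and mean 0.4 lies below its threshold.
xi3 = slack_threshold(0.5, lam, nu)
```

For these numbers the third arm's mean of 0.4 sits below its threshold of roughly 0.514, so a small perturbation of that mean leaves the optimal mixture unchanged.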

Arm Selection Bounds

To bound the regret and violation for LinConTS, we first bound $\mathbb{E}[k_i(T+1)]$, the expected number of times any suboptimal arm $i \notin \{1, 2\}$ is played until time $T + 1$. Our proof technique is inspired by [6]: First, we show that the probability of playing a suboptimal arm is a linear function of the probability of playing one of the optimal arms. Next, we show that the coefficient of this linear function decreases exponentially with successive plays of the arm. For this, we use the concentration results from [6] to upper bound the number of plays for each suboptimal arm. We begin by defining the following:

Definition 1: The number of successes observed for arm $i = 1, \ldots, N$ until time step $t - 1$ is denoted by $S_i(t)$. The empirical mean at time $t$ is $\hat{\mu}_i(t) = S_i(t)/k_i(t)$, with $\hat{\mu}_i(t) = 1$ when $k_i(t) = 0$.

Definition 2: The history of plays until time $t - 1$ is denoted by the filtration $\mathcal{F}_{t-1}$, i.e.,

\[
\mathcal{F}_{t-1} = \{ i(w), c_{i(w),w}, \; w = 1, \ldots, t - 1 \}.
\]

Definition 3: For each suboptimal arm $i \notin \{1, 2\}$, we choose two thresholds $y_i$ and $z_i$ such that $\mu_i < y_i < z_i < \xi_i$.

Definition 4: We define the variables $\kappa_1, \ldots, \kappa_N$ such that

\[
\frac{\kappa_j r_j - z_i r_i}{\kappa_j - z_i} = -\lambda^*,
\]

where $\kappa_i := z_i$. The hyperplane supported by any two points $(z_i, z_i r_i)$ and $(\kappa_j, \kappa_j r_j)$, $j \neq i$, runs parallel to the optimal hyperplane for $LP(\mu)$.

Definition 5: We define the variable

\[
\epsilon_{1,i} = \frac{\kappa_2 - \eta}{\kappa_2 - \kappa_1} > 0,
\]

which lower bounds the selection probability mass assigned to arm 1, $x_{1,t}$, under certain conditions described later in the proof.

Definition 6: We define two events: $E_i^{\mu}(t): \{\hat{\mu}_i(t) \leq y_i\}$ and $E_i^{\theta}(t): \{\theta_{i,t} \leq z_i\}$, where $\theta_{i,t} := \tilde{\mu}_{i,t}$ denotes the Thompson sample at time $t$. Intuitively, $E_i^{\mu}(t)$ and $E_i^{\theta}(t)$ denote the events that the empirical and sampled means for arm $i$ at time $t$ do not exceed the true arm mean by a large amount.

Definition 7: We define the probability

\[
p_{i,t} = \Pr\{\theta_{1,t} > \kappa_1 \mid \mathcal{F}_{t-1}\}. \tag{3.9}
\]

Next, we prove the following lemma that establishes the relationship between the number of plays of any suboptimal arm $i \notin \{1, 2\}$ and the optimal arm $i = 1$.

Lemma 1. For all $t \in \{1, \ldots, T\}$, and $i \notin \{1, 2\}$,

\[
\Pr\{i(t) = i, E_i^{\mu}(t), E_i^{\theta}(t) \mid \mathcal{F}_{t-1}\} \leq \frac{1}{\epsilon_{1,i}} \cdot \frac{1 - p_{i,t}}{p_{i,t}} \cdot \Pr\{i(t) = 1, E_i^{\mu}(t), E_i^{\theta}(t) \mid \mathcal{F}_{t-1}\}. \tag{3.10}
\]

Proof. Similar to [6, Lemma 1], we note that $E_i^{\mu}(t)$ is determined by $\mathcal{F}_{t-1}$. Further, when $E_i^{\mu}(t)$ is false, the left hand side of (3.10) is zero and the inequality is trivially satisfied. Hence, we assume that $\mathcal{F}_{t-1}$ is such that $E_i^{\mu}(t)$ is true. Subsequently, to prove (3.10), we show that under a suitably chosen set of conditions $M_i(t)$ that hold with a non-zero probability, the following relationships also hold:

\begin{align*}
\Pr\{i(t) = i \mid E_i^{\theta}(t), \mathcal{F}_{t-1}\} &\leq (1 - p_{i,t}) \Pr\{M_i(t) \mid E_i^{\theta}(t), \mathcal{F}_{t-1}\}, \\
\Pr\{i(t) = 1 \mid E_i^{\theta}(t), \mathcal{F}_{t-1}\} &\geq (\epsilon_{1,i} \cdot p_{i,t}) \Pr\{M_i(t) \mid E_i^{\theta}(t), \mathcal{F}_{t-1}\},
\end{align*}

which immediately gives the Lemma above. The proof details are provided in Appendix A.

Subsequently, to bound the expected number of plays for any suboptimal arm $i$, we use Lemmas 2, 3, and 4 obtained in [6],

\[
\sum_{t=1}^{T} \Pr\big(i(t) = i, \overline{E_i^{\mu}(t)}\big) \leq \frac{1}{d(y_i, \mu_i)} + 1, \tag{3.11}
\]

which relies on Chernoff-Hoeffding bounds on the concentration of the empirical mean around the true mean,

\[
\sum_{t=1}^{T} \Pr\big(i(t) = i, E_i^{\mu}(t), \overline{E_i^{\theta}(t)}\big) \leq L_i(T) + 1, \tag{3.12}
\]

which is based on the fact that after the arm has been played a sufficient number of times, $L_i(T)$, the Thompson sample will be close to its mean value, and

\[
\mathbb{E}\left[\frac{1}{p_{i,\tau_j + 1}}\right] \leq
\begin{cases}
1 + \dfrac{3}{\Delta_i'}, & j < \dfrac{8}{\Delta_i'}, \\[2ex]
1 + O\left( e^{-\Delta_i'^2 j / 2} + \dfrac{1}{(j+1)\Delta_i'^2} e^{-D_i j} + \dfrac{1}{e^{\Delta_i'^2 j / 4} - 1} \right), & j \geq \dfrac{8}{\Delta_i'},
\end{cases}
\tag{3.13}
\]

respectively, where $\tau_j$ denotes the time step for the $j$-th trial of arm 1, $\Delta_i' = \mu_1 - \kappa_1$, and $D_i = z_i \log\frac{z_i}{\mu_1} + (1 - z_i) \log\frac{1 - z_i}{1 - \mu_1}$. This last result uses novel algebraic analysis developed from the concentration of Binomial sums. Based on the above, we obtain the following bound on the expected number of plays of the suboptimal arm $i$ until time $T$:

Lemma 2. The expected number of plays for any stationary suboptimal arm, $i \in \{1, \ldots, N\} \setminus \{1, 2\}$, is upper bounded by

\[
\mathbb{E}[k_i(T)] \leq 2 + L_i(T) + \frac{1}{d(y_i, \xi_i)} + \frac{24}{\epsilon_{1,i} \Delta_i'^2} + \frac{1}{\epsilon_{1,i}} \sum_{j=0}^{T-1} O\left( e^{-\Delta_i'^2 j / 2} + \frac{1}{(j+1)\Delta_i'^2} e^{-D_i j} + \frac{1}{e^{\Delta_i'^2 j / 4} - 1} \right). \tag{3.14}
\]

Proof. This proof essentially sums over the expressions obtained in (3.11), (3.12), and (3.13) until time $T$. The details are provided in Appendix B.

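Lemma 3. For the stationary optimal arms $i \in \{1, 2\}$,

\begin{align}
\Big[ \sum_{i=1,2} \Delta_i \, \mathbb{E}[k_i(T)] \Big]_+ &\leq \Delta_{\max}^+ \cdot 18\sqrt{2T \log 2}, \notag \\
\Big[ \sum_{i=1,2} (\eta - \mu_i) \, \mathbb{E}[k_i(T)] \Big]_+ &\leq \delta_{\max}^+ \cdot 18\sqrt{2T \log 2}, \tag{3.15}
\end{align}

where $\Delta_i = r^* - \mu_i r_i$, $\Delta_{\max}^+ = \max_{i \in [N]} \Delta_i^+$ and $\delta_{\max}^+ = \max_{i \in [N]} \delta_i^+$.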

Proof. From Lemma 2, the rate of playing any suboptimal arm is a logarithmic function of $T$. Hence, the set of optimal arms is played at a linear rate. Viewing the play of optimal arms as a two-armed bandit, we can use existing results for constrained TS to bound the regret and violation contribution for the rounds when only the optimal arms are assigned nonzero arm selection probabilities by $LP(\tilde{\mu}_t)$. First, we decompose the regret and violation expressions separately using the decomposition in [8, EC.4], which relies on frequentist upper and lower bounds that hold with high probability for every $t \in [T]$. Subsequently, we apply [8, Lemma EC.2], which provides an upper bound on the expected regret and violation with respect to the frequentist upper and lower bounds. This directly gives us the bounds in (3.15), where we set $K = 2$ since only the rounds where the optimal arms are played are taken into account.

Proof of Regret and Violation Bounds

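For some $\varepsilon \in (0, 1]$ we choose the thresholds $y_i \in (\mu_i, \xi_i)$ and $z_i \in (y_i, \xi_i)$ such that $d(y_i, \xi_i) = d(\mu_i, \xi_i)/(1 + \varepsilon)$ and $d(y_i, z_i) = d(y_i, \xi_i)/(1 + \varepsilon) = d(\mu_i, \xi_i)/(1 + \varepsilon)^2$. This leads to

\[
L_i(T) = \frac{\log T}{d(y_i, z_i)} = (1 + \varepsilon)^2 \frac{\log T}{d(\mu_i, \xi_i)}.
\]

Following the ideas in [6], we obtain $y_i - \mu_i \geq \frac{\varepsilon}{1 + \varepsilon} \cdot d(\mu_i, \xi_i) / \log\frac{\xi_i (1 - \mu_i)}{\mu_i (1 - \xi_i)}$, which gives $\frac{1}{d(y_i, \mu_i)} \leq \frac{2}{(y_i - \mu_i)^2} = O(1/\varepsilon^2)$.

Proof of regret bound: From Lemma 2 and Lemma 3, we obtain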

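\begin{align*}
\mathbb{E}[\mathcal{R}(T)] &= \Big[ \sum_i \Delta_i \, \mathbb{E}[k_i(T)] \Big]_+ \\
&\leq \Big[ \sum_{i \notin \{1,2\}} \Delta_i \, \mathbb{E}[k_i(T)] \Big]_+ + \Big[ \sum_{i=1,2} \Delta_i \, \mathbb{E}[k_i(T)] \Big]_+ \\
&\leq \sum_{i \notin \{1,2\}} \Delta_i^+ \, \mathbb{E}[k_i(T)] + \Delta_{\max}^+ \cdot 18\sqrt{2T \log 2} \\
&\leq \sum_{i \notin \{1,2\}} (1 + \varepsilon)^2 \frac{\log T}{d(\mu_i, \xi_i)} \Delta_i^+ + O\Big(\frac{N}{\varepsilon^2}\Big) + \Delta_{\max}^+ \cdot 18\sqrt{2T \log 2},
\end{align*}

where the first and second inequalities use $[a + b]_+ \leq [a]_+ + [b]_+$. This completes the regret bound.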

Proof of violation bound: From Lemma 2 and Lemma 3, we obtain

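\begin{align*}
\mathbb{E}[\mathcal{V}(T)] &= \Big[ \sum_i (\eta - \mu_i) \, \mathbb{E}[k_i(T)] \Big]_+ \\
&\leq \Big[ \sum_{i \notin \{1,2\}} \delta_i \, \mathbb{E}[k_i(T)] \Big]_+ + \Big[ \sum_{i \in \{1,2\}} \delta_i \, \mathbb{E}[k_i(T)] \Big]_+ \\
&\leq \sum_{i \notin \{1,2\}} \frac{(1 + \varepsilon)^2 \log T}{d(\mu_i, \xi_i)} \delta_i^+ + O\Big(\frac{N}{\varepsilon^2}\Big) + \delta_{\max}^+ \cdot 18\sqrt{2T \log 2} \\
&= \sum_{i \notin \{1,2\}} \frac{(1 + \varepsilon)^2}{d(\mu_i, \xi_i)} \delta_i^+ \log T + O\Big(\frac{N}{\varepsilon^2}\Big) + \delta_{\max}^+ \cdot 18\sqrt{2T \log 2},
\end{align*}

where $\delta_i = \eta - \mu_i$, and the first and second inequalities use $[a + b]_+ \leq [a]_+ + [b]_+$. This completes the violation bound.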

4 Related Work

Sequential learning under probabilistic constraints has been studied in [13], and Thompson sampling under general problem settings was studied in [14]. In [15], distribution-free regret bounds for convex optimization with bandit feedback were shown to scale as $O(\sqrt{T})$. Bandits with concave rewards and convex knapsacks were studied within a very general framework in [16], which subsumed the previous bandits with knapsacks (BwK) formulation in [17]. In [18], a Thompson sampling algorithm for budgeted MABs was proposed that outperforms the UCB BwK algorithm. Subsequently, TS was applied to network revenue management in [8], where distribution-free bounds that scale as $O(\sqrt{T})$ were shown to hold. In [2], a horizon-dependent UCB approach was proposed whose regret and violation performance was shown to scale with $O(\sqrt{T})$. Recently, applications of linearly constrained MABs have been proposed for advertiser portfolio optimization [19], wireless communications [3, 20], and real-time electricity pricing [21].

5 Numerical Experiments

We evaluate the regret and violation performance of LinConTS on two real-world datasets, Coupon-Purchase [22] and edX-Course [23]. The Coupon-Purchase and edX-Course datasets have been explored previously for constrained bandit problems, albeit in a multi-play setting that allowed multiple arms to be played in each time step and assumed that the reward values were stochastic [2, 24]. The experiments are implemented in Python using Jupyter notebooks and have been made publicly available at [25].

(a) Regret (b) Violation

(c) Cumulative Reward (d) Cum. reward / Violation

Figure 3.1: Experimental results for the Coupon-Purchase dataset for $N = 142$ coupons with $\eta = 0.25$. The LinConTS approach outperforms the competing LinCon-KL-UCB approach by achieving a lower regret and violation, higher cumulative rewards and a higher ratio of cumulative rewards to violation.

For the Coupon-Purchase dataset, which contains discount coupons applied to online purchases, we extract all the $N = 142$ coupons for products priced equal to or below 200 price units that have been purchased by at least one customer. For these coupons, we obtain the purchase rate and the discounted selling price from the dataset. The edX-Course dataset contains enrolment information for $N = 290$ Harvard and MIT courses. We process this dataset according to previous experiments [2]: the course participation rates are estimated by max-min normalization of the number of participants in each course, and the course certification rates are obtained by dividing the number of certified participants in each course by the number of course participants.

We model each coupon in the Coupon-Purchase dataset and each course in the edX-Course dataset using independent bandit arms. Further, for the Coupon-Purchase dataset, we generate independent Bernoulli-distributed reward events with mean values obtained from the purchase rates and a deterministic reward value defined as the final selling price normalized by 200. For the processed

(a) Regret (b) Violation

(c) Cumulative Reward (d) Cum. reward / Violation

Figure 3.2: Experimental results for the edX-Course dataset for $N = 290$ courses with $\eta = 0.50$. The LinConTS approach outperforms the competing LinCon-KL-UCB approach by achieving a lower regret and violation, higher cumulative rewards and a higher ratio of cumulative rewards to violation.

Coupon-Purchase dataset, the reward event means and the reward values are found to lie in $[0, 0.30]$ and $(0, 1]$ respectively. Analogously for the edX-Course dataset, the reward event means are obtained from the course participation rates and the corresponding reward values are the course certification rates that are assumed to be known in advance, e.g., from historical course data. For this dataset, the reward event means and the reward values are found to span $[0, 1]$ and $(0, 0.40]$ respectively.

We implement two linearly constrained bandit algorithms, LinConTS and LinCon-KL-UCB. The LinConTS algorithm is described in Sec. 3 and the pseudocode is available in Alg. 3.1. The LinCon-KL-UCB algorithm, described in Appendix C, is inspired by the ConTS algorithm proposed for multi-play linearly constrained MABs in [2]. However, compared to ConTS that relies on an index-based UCB, LinCon-KL-UCB estimates the UCB for each arm using the Kullback-Leibler (KL) divergence metric. For the Bernoulli-distributed rewards considered here, KL-based UCB has been shown to achieve optimal regret and to significantly outperform index-based UCB schemes [26].

The performance of LinCon-KL-UCB and LinConTS for the Coupon-Purchase dataset and for $\eta = 0.25$ is shown in Fig. 3.1, where the results have been averaged over 16 test runs. LinConTS achieves a lower regret compared to LinCon-KL-UCB in Fig. 3.1(a). This demonstrates that, compared to LinCon-KL-UCB, LinConTS is closer to the cumulative rewards achieved by the stationary optimal policy. We study the cumulative violations for each approach in Fig. 3.1(b), where LinConTS demonstrates fewer constraint violations than LinCon-KL-UCB. Interestingly, in Fig. 3.1(c), LinConTS achieves higher cumulative rewards as well compared to LinCon-KL-UCB. We combine the related effects of regret and violation minimization by calculating the ratio between the cumulative rewards and the cumulative violations at each time step. This quantity, depicted in Fig. 3.1(d), can be interpreted as the additional reward earned for every constraint violation. We observe that LinConTS achieves a higher ratio, which demonstrates that LinConTS is more efficient in exploiting the infrequent constraint violations.

For the edX-Course dataset and $\eta = 0.50$, the experimental results for the LinConTS and LinCon-KL-UCB schemes are averaged over 16 test runs and shown in Fig. 3.2. Similar to the results for the Coupon-Purchase dataset, here also LinConTS has lower regret and violation than LinCon-KL-UCB in Fig. 3.2(a) and Fig. 3.2(b) respectively. Also, from Fig. 3.2(c) and Fig. 3.2(d), LinConTS increases the cumulative reward and the ratio of cumulative reward to violations, respectively.
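For readers unfamiliar with KL-based upper confidence bounds, the sketch below computes a Bernoulli KL-UCB index by bisection, in the spirit of [26]. It is not the LinCon-KL-UCB algorithm of Appendix C: the exploration budget of $\log t$ (with no additional $\log\log t$ term), the tolerance, and the function names are assumptions made for this illustration.

```python
import math


def bernoulli_kl(p, q):
    """Bernoulli KL divergence d(p, q), with inputs clipped away from 0 and 1."""
    tiny = 1e-12
    p = min(max(p, tiny), 1.0 - tiny)
    q = min(max(q, tiny), 1.0 - tiny)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ucb_index(successes, pulls, t, tol=1e-6):
    """Approximate the largest q in [p_hat, 1] with pulls * d(p_hat, q) <= log(t)."""
    if pulls == 0:
        return 1.0  # an unplayed arm gets the most optimistic index
    p_hat = successes / pulls
    budget = math.log(t) / pulls
    lo, hi = p_hat, 1.0
    # d(p_hat, q) is increasing in q for q >= p_hat, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


few_pulls = kl_ucb_index(3, 10, t=100)
many_pulls = kl_ucb_index(30, 100, t=100)
```

With the same empirical mean, the arm with more pulls receives a tighter index, which is the behaviour an upper confidence bound is meant to capture.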

6 Conclusions and Further Work

For constrained bandit problems, LP subroutines enable powerful sequential arm selection techniques. Combined with an underlying Thompson Sampling approach, these techniques promise efficient and robust solutions that are optimal in terms of their learning rate. We have addressed the MAB problem of maximizing the cumulative reward when the reward event probability is constrained above a fixed threshold in every round. For this problem, we described LinConTS, which incorporates an LP subroutine in every step of the Thompson Sampling heuristic. We have provided the first instance-dependent asymptotic analysis of our algorithm. Through numerical results for two real-world datasets, we have shown that LinConTS outperforms an optimal UCB-based algorithm in terms of the regret and violation metrics. The proof technique developed in this paper can inspire solutions to other constrained MAB problems, for example to problems that deal with more general types of constraints than the one considered here.

Acknowledgements

This work was partially supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and by the European Research Council project AGNOSTIC (742648). We thank Simon Lindståhl for noting a flaw in the original formulation of Lemma 3, which has been updated in this version of the paper.

References

[1] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen, “A tutorial on Thompson sampling,” Found. Trends Mach. Learn., vol. 11, pp. 1–96, July 2018.

[2] K. Chen, K. Cai, L. Huang, and J. C. Lui, “Beyond the click-through rate: Web link selection with multi-level feedback,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3308–3314, International Joint Conferences on Artificial Intelligence Organization, 7 2018.

[3] V. Saxena, J. Jaldén, J. E. Gonzalez, I. Stoica, and H. Tullberg, “Constrained Thompson sampling for wireless link optimization,” arXiv preprint arXiv:1902.11102, 2019.

[4] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.

[5] E. Kaufmann, N. Korda, and R. Munos, “Thompson sampling: An asymptotically optimal finite-time analysis,” in International Conference on Algorithmic Learning Theory, pp. 199–213, Springer, 2012.

[6] S. Agrawal and N. Goyal, “Further optimal regret bounds for Thompson sampling,” in Artificial Intelligence and Statistics, pp. 99–107, 2013.

[7] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.

[8] K. J. Ferreira, D. Simchi-Levi, and H. Wang, “Online network revenue management using Thompson sampling,” Operations Research, vol. 66, no. 6, pp. 1586–1602, 2018.

[9] A. Flajolet and P. Jaillet, “Logarithmic regret bounds for bandits with knapsacks,” arXiv preprint arXiv:1510.01800, 2015.

[10] O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,” in Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.

[11] S. Agrawal and N. Goyal, “Analysis of Thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, pp. 39–1, 2012.

[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[13] A. Meisami, H. Lam, C. Dong, and A. Pani, “Sequential learning under probabilistic constraints,” in Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018 (A. Globerson and R. Silva, eds.), pp. 621–631, AUAI Press, 2018.

[14] A. Gopalan, S. Mannor, and Y. Mansour, “Thompson sampling for complex online problems,” in International Conference on Machine Learning, pp. 100–108, 2014.

[15] A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin, “Stochastic convex optimization with bandit feedback,” in Advances in Neural Information Processing Systems, pp. 1035–1043, 2011.

[16] S. Agrawal and N. R. Devanur, “Bandits with concave rewards and convex knapsacks,” in Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC ’14, (New York, NY, USA), pp. 989–1006, ACM, 2014.

[17] A. Badanidiyuru, R. Kleinberg, and A. Slivkins, “Bandits with knapsacks,” in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pp. 207–216, IEEE, 2013.

[18] Y. Xia, H. Li, T. Qin, N. Yu, and T.-Y. Liu, “Thompson sampling for budgeted multi-armed bandits,” in Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 3960–3966, AAAI Press, 2015.

[19] A. Pani, S. Raghavan, and M. Sahin, “Large-scale advertising portfolio optimization in online marketing,” tech. rep., Working Paper, 2017.

[20] V. Saxena and J. Jaldén, “Bayesian link adaptation under a BLER target,” in IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2020.

[21] N. Tucker, A. Moradipari, and M. Alizadeh, “Constrained Thompson sampling for real-time electricity pricing with grid reliability constraints,” arXiv preprint arXiv:1908.07964, 2019.

[22] Kaggle, “Coupon Purchase Data.” https://www.kaggle.com/c/coupon-purchase-prediction/data, 2016.

[23] I. Chuang and A. Ho, “HarvardX and MITx: Four years of open online courses – fall 2012-summer 2016,” Available at SSRN 2889436, 2016.

[24] K. Cai, K. Chen, L. Huang, and J. C. Lui, “Multi-level feedback web links selection problem: Learning and optimization,” in 2017 IEEE International Conference on Data Mining (ICDM), pp. 763–768, IEEE, 2017.

[25] V. Saxena, “LinConTS.” https://github.com/vidits-kth/LinConTS, 2019.

[26] A. Garivier and O. Cappé, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Proceedings of the 24th Annual Conference on Learning Theory, pp. 359–376, 2011.

Paper VI


Reinforcement Learning for Efficient and Tuning-Free Link Adaptation

Wireless links adapt the data transmission parameters to the dynamic channel state, a process called link adaptation. Classical link adaptation relies on tuning parameters that are challenging to configure for optimal link performance. Recently, reinforcement learning has been proposed to automate link adaptation, where the transmission parameters are modeled as discrete arms of a multi-armed bandit. In this context, we propose a latent learning model for link adaptation that exploits the correlation between data transmission parameters. Further, motivated by the recent success of Thompson sampling for multi-armed bandit problems, we propose a latent Thompson sampling (LTS) algorithm that quickly learns the optimal parameters for a given channel state. We extend LTS to fading wireless channels through a tuning-free mechanism that automatically tracks the channel dynamics. In numerical evaluations with fading wireless channels, LTS improves the link throughput by up to 100% compared to the state-of-the-art link adaptation algorithms.

Paper under review for the IEEE Transactions on Wireless Communications (IEEE). Please do not distribute.

1 Introduction

Link adaptation is a key feature of the wireless physical layer. In a fading wireless channel, link adaptation adapts the data transmission parameters, for example the modulation and coding scheme (MCS), to optimize the link performance in real time. Efficient and robust link adaptation is central to achieving the extremely high data rates supported by state-of-the-art wireless networks [1].

Wireless channels introduce stochastic impairments to the transmitted data symbols due to, for example, channel fading, phase rotation, and additive noise and interference. The wireless channel state is therefore a complex and time-varying property. Wireless systems estimate the instantaneous channel state from the outcome of previous transmissions as well as infrequent, measurement-based, channel reports. With link adaptation, these channel state estimates determine the MCS used for encoding data packets in subsequent transmission intervals.

In modern wireless networks, the physical layer also employs hybrid automatic repeat request (HARQ) for reliable communication. With HARQ, each received packet generates either an acknowledgement (HARQ ACK, or simply, “ACK”) or a negative ACK (NACK) indicating whether the receiver succeeded or failed in decoding the incoming packet. The chief purpose of the ACK/NACK feedback is to select the next set of data bits for transmission. Further, this feedback also provides a low-resolution indication of the underlying, or latent, channel state. Several link adaptation schemes hence use the ACK/NACK feedback to refine their estimate of the channel state. In addition to the ACK/NACK feedback, the wireless transmitter may configure measurement-based channel feedback for a more fine-grained channel update. However, these reports can be expensive to generate and additionally consume precious wireless resources for signaling. Therefore, in this paper, we address link adaptation based only on the ACK/NACK feedback.

We exemplify link adaptation in the context of cellular networks, although the general techniques are broadly applicable to other wireless protocols such as the IEEE 802.11 WiFi standard.

Link adaptation based on the ACK/NACK feedback falls under two categories: the classical outer loop link adaptation (OLLA) approach, and the recently proposed reinforcement learning link adaptation (RLLA). With OLLA, the channel state is modeled in terms of an effective signal-to-interference-and-noise ratio (SINR). The effective SINR is well known to be a robust, low-dimensional representation of the channel state [2]. OLLA hence maintains a dynamic, real-time estimate of the SINR. For data transmission, OLLA maps the latest SINR estimate to the performance of each candidate MCS using a pre-generated offline link model (OLM)¹ [3–6]. In contrast to the channel state model adopted by OLLA, RLLA operates by learning the statistical behaviour for each MCS from the ACK/NACK feedback. Based on these statistical estimates, RLLA

¹ Common OLLA implementations often short-circuit the OLM by pre-computing discrete switching thresholds [3–5]. However, in this paper, we show that exploiting the full OLM output improves the link throughput.

predicts the optimal MCS for data transmission in every transmission instance. In [7], RLLA was rigorously modeled as a multi-armed bandit (MAB) optimization problem, which is the canonical framework for reinforcement learning².

The central shortcoming of OLLA is that its parameters are challenging to tune for optimal performance. As an example, OLLA employs step-based SINR adjustments in response to the ACK/NACK feedback. The step sizes must be chosen carefully, since they affect how quickly OLLA converges to the optimal transmission configuration, and also the stability of this configuration. The most common step size selection mechanism uses a proxy configuration parameter known as the target block error rate (BLER), which itself is difficult to tune for good link performance [8]. In practice, the step size and BLER target parameters are tuned based on empirical evidence, which often leads to suboptimal link performance [9, 10].

RLLA mitigates some of the drawbacks associated with OLLA. In particular, RLLA automatically learns the statistical behavior of the candidate MCSs and thus reduces the dependence on tuning parameters. RLLA is hence more robust to arbitrary channel fading profiles than the previous techniques [7, 11, 12]. However, compared to the low-dimensional SINR model adopted by OLLA, RLLA uses a model that encodes each candidate MCS as a discrete arm of a multi-armed bandit. The RLLA modeling complexity hence scales linearly with the number of candidate MCSs. This model is both resource-intensive and inefficient to train in real time. As a result, current RLLA algorithms can be slow to react to frequent channel variations. Additionally, current RLLA algorithms address channel fading through a sliding window protocol that needs to be tuned empirically for optimal performance.

Contributions

In this paper, we propose a new reinforcement learning algorithm for link adaptation. Our algorithm, latent Thompson sampling (LTS), extends the well-known Thompson sampling approach for online optimization. Compared to classical Thompson sampling, LTS encodes the environment through a low-dimensional latent state model. In the context of link adaptation, LTS adopts a probabilistic model of the channel SINR, whose parameters are learnt from the ACK/NACK feedback observed for previous transmissions. Since the SINR is a low-dimensional channel metric, LTS is able to quickly and reliably estimate the channel state from only a few ACK/NACKs. We make the following contributions in this paper:

² Reinforcement learning is a vast field with numerous optimization techniques such as Q-learning, actor-critic algorithms, policy gradients, and others. These techniques learn a trajectory of actions that optimize for a desired long-term objective. However, as shown in [7], link adaptation is adequately modeled through the elegant MAB framework, where the action in any step only influences the immediate reward. MABs are optimized using statistical techniques, the most prominent among which are upper confidence bound (UCB) and Thompson sampling based algorithms.

• We model the latent channel state, measured in terms of the instantaneous channel SINR, through a probabilistic model. LTS initializes this model based on the available channel information and subsequently refines it with the inferred channel response.

• We propose an efficient, OLM-based Bayesian update scheme to refine the SINR model with the decoding outcome obtained at the end of every transmission event.

• We develop an upper bound on the throughput performance of LTS, where we show that the worst-case cumulative throughput loss scales with the square root of time.

• We extend the SINR estimation scheme to fading channels through a tuning-free mechanism. For this, we propose relaxing the SINR probability distribution in every time step with an appropriately chosen smoothing function. The variance of this smoothing function is automatically configured using the channel Doppler estimates.

• We numerically evaluate the performance of LTS in terms of the average link throughput for frequency selective fading channels. Our results demonstrate that LTS significantly outperforms state-of-the-art OLLA and RLLA schemes for cellular networks.

Organization

The rest of this paper is structured as follows: In Section 2, we review the existing link adaptation schemes. Next, in Section 3, we describe the cellular link model and the link performance objectives considered in this paper. In Section 4, we introduce Thompson sampling in the context of link adaptation. Subsequently, in Section 5, we introduce and describe our proposed LTS algorithm for link adaptation. We evaluate LTS numerically in Section 6 and compare it with the state-of-the-art link adaptation algorithms. Finally, we conclude the paper in Section 7 and discuss future research directions.

2 Related Work

Link adaptation has been an area of extensive research for the past two decades. Link adaptation has been studied in the context of both cellular networks and WiFi networks, although the terminology varies slightly across the two domains. Further, while cellular link adaptation generally favors measurement-based techniques, link adaptation schemes for WiFi tend to use sampling-based approaches. In the rest of this section, we review cellular link adaptation techniques, which are commonly referred to as OLLA, and the recently proposed RLLA schemes.

Outer Loop Link Adaptation

OLLA was first introduced in the context of third-generation cellular networks [3]. Since then, OLLA has been adopted by fourth-generation (4G) and fifth-generation (5G) networks as well [13], and has also been proposed for satellite communication systems [14]. The effect of the OLLA target BLER parameter on link throughput was studied analytically in [8] and [15] under a number of simplifying assumptions. At a high level, it was found that the optimal target BLER is a decreasing function of the SINR. While [8] argued for a broadly applicable target BLER of 10%, [15] derived an analytical expression that maps a long-term average channel SINR to the optimal target BLER. However, obtaining optimal target BLERs with realistic channel impairments remains an open problem.

In addition to the difficulty of configuring the target BLER, OLLA also requires careful calibration of the step sizes. Large step sizes cause OLLA to oscillate around its steady state. On the other hand, small step sizes have better stability but slow the OLLA convergence towards a steady state. To address these issues, [16] proposed jump-starting the SINR estimate using step size statistics from previous data flows. In [10], an enhanced OLLA (eOLLA) scheme was proposed to dynamically adjust the step sizes based on the history of ACK/NACK feedback. The eOLLA scheme was shown to achieve better stability and modest throughput gains compared to classical OLLA. In [17], the step size was initialized with a large value that was then decayed exponentially with the time that a data flow was active. However, the channel variation profiles and data flow characteristics vary with time and across cellular deployments. As a result, it is challenging to configure OLLA step sizes that perform well in most scenarios.

Recently, reinforcement learning has also been proposed to automate OLLA configuration.
In [18], adaptive kernel regression was employed to automatically tune OLLA parameters for optimal link throughput. In [19], reinforcement learning was employed to select from a set of optimal SINR adjustments modeled as independent arms of a multi-armed bandit. Further, [11, 20–24] have proposed neural network models that learn the optimal data transmission parameters in real time.
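
The step-based OLLA mechanism reviewed in this subsection can be sketched in a few lines of Python. This is a minimal illustration and not an implementation taken from the cited works: it assumes the commonly used rule that ties the up and down steps to the target BLER through step_up / step_down = BLER / (1 - BLER), and the function names and numeric values are hypothetical.

```python
def olla_steps(target_bler, step_down_db):
    """Return (step_up, step_down) in dB for a given target BLER.

    Assumption: step_up / step_down = BLER / (1 - BLER), so the offset is
    stationary on average when the NACK fraction equals the target BLER.
    """
    step_up_db = step_down_db * target_bler / (1.0 - target_bler)
    return step_up_db, step_down_db


def olla_update(offset_db, ack, step_up_db, step_down_db):
    """Raise the SINR offset after an ACK, lower it after a NACK."""
    if ack:
        return offset_db + step_up_db
    return offset_db - step_down_db


# Hypothetical configuration: 10% target BLER, 0.9 dB down step.
step_up, step_down = olla_steps(target_bler=0.1, step_down_db=0.9)

offset = 0.0
for ack in [True] * 9 + [False]:  # nine ACKs, one NACK: 10% observed BLER
    offset = olla_update(offset, ack, step_up, step_down)
```

With an observed NACK fraction equal to the target BLER, the offset returns to its starting value, which illustrates why the target BLER acts as a proxy for the step size ratio and why a poorly chosen value biases the SINR estimate.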

Reinforcement Learning based Link Adaptation

In contrast to reinforcement learning-based OLLA enhancements, reinforcement learning has also been proposed for the statistical learning of MCS performance. These RLLA techniques rigorously model the MCSs as discrete arms of a MAB [7]. MABs elegantly encode decision-making in stochastic environments, where an agent should efficiently explore the available actions to maximize some environment-dependent reward function [25]. The key advantage of MAB-based link adaptation is that it directly optimizes for the desired link performance objective, for example the link throughput. These RLLA schemes hence do not rely on proxy tuning parameters such as the target BLER, nor on step-based heuristics.

The first RLLA scheme was proposed in [7], in the form of an upper-confidence bound (UCB) scheme called graphical optimal rate sampling (G-ORS). For efficient learning, G-ORS exploited the fact that the expected throughput as a function of the MCS exhibits a unimodal structure. Further, [7] provided theoretical performance bounds that are also asymptotically optimal in terms of convergence to the optimal MCS.

Compared to UCB, Thompson sampling is a Bayesian MAB optimization technique that adopts probabilistic models of the unknown problem parameters. In empirical tests, Thompson sampling has been shown to outperform UCB-based algorithms [26]. Hence, several RLLA schemes build upon Thompson sampling for good link performance. [27] proposed a Thompson sampling algorithm for link adaptation, where the correlation assumption between MCSs was dropped in favor of a simpler MAB model. However, neglecting the correlation between MCSs also hurts the learning rate of their proposed modified Thompson sampling (MTS) algorithm. Subsequently, [12] proposed a constrained Thompson sampling algorithm that preserves the unimodality across the modeled MCSs. In [28], the G-ORS approach of [7] was translated to an analogous unimodal Thompson sampling (UTS) algorithm. While UTS has not been evaluated for link adaptation, it nevertheless outperforms G-ORS for several other scenarios. The problem of link optimization under constrained performance objectives was studied in [29, 30], where suitable, analytically bounded extensions of Thompson sampling were proposed. Further, [31] proposed a Thompson sampling-based approach for link adaptation that compacts the search space for fast convergence. In this paper, we compare our proposed LTS approach with the MTS and UTS algorithms discussed above.

The basic RLLA schemes optimize for a stationary environment, that is, where the channel response does not vary over time.
However, in most practical deployments, the channel response is known to be nonstationary. To address these channel variations, [7] proposed a sliding window heuristic, which conditions the MCS selection only on the ACK/NACK feedback for the preceding few transmissions. The size of the sliding window is an algorithm tuning parameter that is chosen empirically. In contrast, [29] uses relatively frequent channel reporting to respond to channel fading. In this paper, we propose a Doppler-based mechanism for LTS that obviates the need for algorithm parameter tuning.

3 Model and Objectives

We consider packetized data transmission over a wireless link. In every transmission instance indexed by $t = 1, 2, \dots, T$, the wireless transmitter selects an MCS $m[t] \in \{1, \dots, M\}$. With MCS index $m[t]$, $D_{m[t]}$ bits are packed into a transport block that is first encoded with a forward error-correcting code and bit-interleaved using a pseudorandom sequence to protect against stochastic noise and channel fading. The encoded and interleaved bits are mapped onto modulation symbols from a complex-valued alphabet of size $|F_{m[t]}|$ prescribed by the MCS. The sequence of modulated symbols is either truncated or zero-padded to completely fill the time-frequency resources allocated for transmission. The effective channel code rate is then given by

$$L_{m[t]} = \frac{1}{\Delta t \, \Delta f} \times \frac{D_{m[t]}}{\log_2 |F_{m[t]}|}, \tag{3.1}$$

where $\Delta t$ and $\Delta f$ denote the scheduled transmission duration and bandwidth respectively. The transmission symbols are subsequently multiplexed with known reference symbols for receiver-side channel compensation and loaded onto a discrete time-frequency grid of time-domain orthogonal frequency division multiplexing (OFDM) symbols and frequency-domain subcarriers [1]. The data rate for MCS $m[t]$, i.e., the normalized amount of data carried, is given by

$$r_{m[t]} \triangleq \frac{1}{\Delta t \, \Delta f} D_{m[t]} = L_{m[t]} \times \log_2 |F_{m[t]}|. \tag{3.2}$$

The receiver attempts to decode the incoming signal and feeds back either a single-bit ACK, $c_{m[t]}[t] = 1$, or a NACK, $c_{m[t]}[t] = 0$, that signals a successful or a failed transport block reception respectively (typically determined at the receiver using a cyclic redundancy check (CRC) appended to the transport block). The ACK probability for a received transport block depends on the MCS $m[t]$ and the underlying channel state. Denoting the channel state for the $t$th transmission instance with $\theta[t] \in \Theta$, the ACK probability for MCS $m[t] \in [M]$ is given by

$$\mu(m[t], \theta[t]) = P\big[c_{m[t]}[t] = 1 \mid m[t], \theta[t]\big]. \tag{3.3}$$

Problem Objective

For the transmission at time $t$, the realized data throughput for a link with MCS $m[t]$ is

$$\mathcal{T}[t] = r_{m[t]} \times c_{m[t]}[t], \tag{3.4}$$

that is, the normalized number of data bits delivered successfully to the receiver. We consider the link adaptation goal of maximizing the expected link throughput,

$$\text{maximize} \quad E\big[r_{m[t]} \times c_{m[t]}[t]\big]. \tag{3.5}$$

To optimize the expected throughput, link adaptation predicts the optimal MCS at time $t$,

$$m^\star[t] = \arg\max_{m \in \{1, \dots, M\}} r_m \times \hat{\mu}_m[t], \tag{3.6}$$

where $\hat{\mu}_m[t]$ is the predicted ACK probability of MCS $m$ at the transmitter.
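
As a small worked illustration of (3.2) and (3.6), the following Python sketch computes a normalized data rate and then selects the MCS with the largest predicted throughput. The function names, data rates, and ACK probabilities are hypothetical values chosen for illustration and are not taken from the paper's evaluation.

```python
def data_rate(bits, duration, bandwidth):
    """Normalized data rate of (3.2): bits per unit time and bandwidth."""
    return bits / (duration * bandwidth)


def select_mcs(rates, ack_probs):
    """Return the MCS index that maximizes rates[m] * ack_probs[m], as in (3.6)."""
    expected = [r * p for r, p in zip(rates, ack_probs)]
    return max(range(len(expected)), key=expected.__getitem__)


# Hypothetical example with three candidate MCSs.
rates = [1.0, 2.0, 4.0]          # data rates r_m
ack_probs = [0.99, 0.90, 0.30]   # predicted ACK probabilities

# Expected throughputs are 0.99, 1.80 and 1.20, so the middle MCS wins.
best = select_mcs(rates, ack_probs)
```

The example shows the trade-off that link adaptation resolves: the highest-rate MCS is not selected because its low ACK probability outweighs its rate advantage.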

Effective SINR and OLM

The ACK probability is a priori unknown at the transmitter. With link adaptation, the transmitter needs to estimate the ACK probability associated with each MCS to select the optimal MCS that maximizes the expected throughput in every transmission instance. State-of-the-art OLLA approaches use the channel SINR as a robust latent channel state metric to estimate ACK probabilities. In [2], an effective SINR metric (ESM) was proposed to compress the vector of per-subcarrier SINRs to a scalar value. Further, the ESM was shown to accurately parameterize the ACK probability for convolutional channel codes. Since then, several ESMs have been proposed that model a broad range of MCSs, where the compression parameters are learnt from a training dataset [32]. Recently, [33] also proposed an artificial neural network model that improves the ACK probability prediction compared to ESM-based techniques. Cellular base stations and nodes maintain separate offline models that map an input ESM and MCS to an output ACK probability,

$$\mathcal{G}(m, \theta) : (m, \theta) \mapsto \mu_m(\theta). \tag{3.7}$$

Closed-form OLMs that take the form of parameterized sigmoid-like functions have been proposed in [34] and [35]. Further, table-based OLM representations that discretize the feasible SINR range into a finite number of bins are commonly used in current cellular deployments [32]. Here, each bin maps to a corresponding ACK probability, which is typically stored in the form of a lookup table for fast access. The parameters for both the traditional table-based OLM and the analytical OLM are learnt by fitting the model to a training dataset either generated numerically or collected in the field.
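
To make the two OLM representations concrete, the sketch below builds a table-based OLM by sampling a closed-form sigmoid at the centre of each SINR bin. The logistic form, the per-MCS thresholds, the slope, and all function names are assumptions made for illustration; the parameterizations in [34] and [35] may differ.

```python
import math


def sigmoid_olm(sinr_db, threshold_db, slope):
    """Hypothetical closed-form OLM: ACK probability increasing in the SINR."""
    return 1.0 / (1.0 + math.exp(-slope * (sinr_db - threshold_db)))


def build_table(thresholds_db, slope, sinr_min_db, sinr_max_db, num_bins):
    """Tabulate the ACK probability of every MCS at each SINR bin centre."""
    width = (sinr_max_db - sinr_min_db) / num_bins
    centres = [sinr_min_db + (i + 0.5) * width for i in range(num_bins)]
    table = [[sigmoid_olm(s, th, slope) for s in centres] for th in thresholds_db]
    return centres, table


def lookup(table, mcs, sinr_db, sinr_min_db, sinr_max_db):
    """Return the tabulated ACK probability of the bin that contains sinr_db."""
    num_bins = len(table[mcs])
    width = (sinr_max_db - sinr_min_db) / num_bins
    index = int((sinr_db - sinr_min_db) / width)
    index = min(max(index, 0), num_bins - 1)  # clamp to the feasible range
    return table[mcs][index]


thresholds = [0.0, 5.0, 10.0]  # hypothetical per-MCS SINR thresholds in dB
centres, table = build_table(thresholds, slope=2.0,
                             sinr_min_db=-10.0, sinr_max_db=20.0, num_bins=60)
p = lookup(table, mcs=1, sinr_db=5.1, sinr_min_db=-10.0, sinr_max_db=20.0)
```

The lookup replaces one exponential per query with an index computation, at the cost of a quantization error that shrinks with the bin width.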

MAB Link Adaptation Model

MABs are a special class of reinforcement learning models. A MAB models the classical problem of picking the optimal action from a set of candidate actions, where executing an action in the environment generates a random-valued reward with an unknown distribution. The choice of action in any round hence balances between exploring the actions with uncertain reward distributions, and exploiting the knowledge of rewards gained in the previous rounds. In contrast to the general reinforcement learning problem where the actions in any round influence future rewards, the MAB rewards are assumed to be independent and identically distributed across the rounds. The goal of the MAB agent is to maximize the cumulative reward over several rounds. The performance of any MAB optimization algorithm is then evaluated in terms of its regret, which is the cumulative difference between the reward attained by the evaluated policy and the reward of an oracle policy that picks the globally optimal action in every round. In the context of link adaptation, the transmitter serves as the MAB agent that operates in the wireless channel environment. The space of available MCSs constitutes the candidate actions. In every round, which corresponds to a transmission interval, transmission using a selected MCS results in an ACK or a NACK, which is a random variable parameterized by the unknown channel state. The link adaptation goal in (3.5) is equivalent to maximizing the cumulative reward until time $T$.

4 Thompson Sampling for Link Adaptation

Thompson sampling is a Bayesian technique for MAB optimization, which was first proposed in 1933 in the context of clinical trials [36]. During the last decade, Thompson sampling has attracted renewed interest for optimizing MAB problems, for example in the context of recommender systems, online auctions, etc. [25]. This interest in Thompson sampling is also driven by the evidence of its state-of-the-art empirical performance [26], and analytical breakthroughs related to its finite-time performance [37, 38]. Next, we introduce the general Thompson sampling principle, and summarize a couple of link adaptation algorithms based on this principle.

In the context of link adaptation, Thompson sampling models the ACK probability associated with each candidate MCS. These ACK probability estimates are used to predict the optimal, throughput-maximizing MCS. In every time step, Thompson sampling obtains sampled point estimates of the ACK probabilities. This sampling operation controls the exploration-exploitation tradeoff of the MCS selection policy, by sometimes picking MCSs that have uncertain reward performance. Over multiple transmission instances, Thompson sampling sequentially learns from the ACK/NACK feedback to refine the ACK probability model and select the optimal MCS.

MTS and UTS: The MTS algorithm was proposed in [27], which employed the classical Thompson sampling algorithm for link adaptation. MTS uses the Beta distribution to model the ACK probability for each candidate MCS. Further, MTS reduces the modeling complexity by assuming that the MCSs are uncorrelated in terms of the observed ACK/NACK events. In contrast, the UTS algorithm was proposed in [28] based on the prior link adaptation formulation in [7]. The UTS algorithm exploits the correlation across the candidate MCSs by assuming that the throughput is a unimodal function of the MCS. In this manner, the UTS algorithm substantially reduces the algorithm complexity for link adaptation problems.
In Section 6, we numerically compare the proposed LTS algorithm with MTS and UTS algorithms for link adaptation.
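
The per-MCS Beta model described above can be sketched as a textbook Beta-Bernoulli Thompson sampler. The code below is an illustration of that general principle under stated assumptions, not the exact MTS algorithm of [27]: the uniform Beta(1, 1) priors, the data rates, and the ACK probabilities are hypothetical, and the names are illustrative.

```python
import random


def thompson_select(rates, alphas, betas, rng):
    """Sample an ACK probability per MCS from its Beta posterior and return
    the MCS with the largest sampled throughput."""
    samples = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    scores = [r * s for r, s in zip(rates, samples)]
    return max(range(len(rates)), key=scores.__getitem__)


def thompson_update(alphas, betas, mcs, ack):
    """Conjugate Beta update for the MCS that was just used."""
    if ack:
        alphas[mcs] += 1.0
    else:
        betas[mcs] += 1.0


rng = random.Random(0)
rates = [1.0, 2.0, 4.0]
true_ack = [0.95, 0.80, 0.10]   # hypothetical; unknown to the learner
alphas = [1.0, 1.0, 1.0]        # Beta(1, 1) priors, one per MCS
betas = [1.0, 1.0, 1.0]
counts = [0, 0, 0]

for _ in range(2000):
    m = thompson_select(rates, alphas, betas, rng)
    ack = rng.random() < true_ack[m]
    thompson_update(alphas, betas, m, ack)
    counts[m] += 1
```

Each MCS carries two distribution parameters and requires its own sample in every round, so the memory and sampling cost grow with the number of candidate MCSs.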

5 Latent Thompson Sampling for Link Adaptation

We propose an RLLA scheme, LTS, which models the channel SINR in terms of its probability distribution over a range of SINR values. In every time interval, LTS assigns a probability distribution to a range of feasible SINR values. Next, LTS obtains an SINR point estimate by sampling the SINR distribution and maps the point estimate to the ACK probabilities for each MCS using the OLM. The ACK probabilities are then used to predict the optimal MCS for transmission. The receiver sends ACK/NACK feedback to signal the transmission outcome, which LTS uses to update the SINR probability distribution for the next time step. The SINR distribution iteratively concentrates around the true channel SINR, i.e., assigns higher probability density to the SINRs close to the true channel SINR. Further, LTS addresses channel fading by smoothing the SINR distribution to account for channel variations from one time interval to the next. The complete LTS algorithm is listed in Algorithm 5.1, and described subsequently.

Probabilistic SINR Model

The channel SINR is a real-valued quantity. Given that the channel is a priori unknown at the transmitter, it is described by the SINR probability density function (PDF) $P_{\Theta[t]}[\theta]$ at the transmission time interval indexed by $t$. The initial SINR PDF, $P_{\Theta[0]}$, is uniformly distributed over a feasible SINR range if no channel information is available. On the other hand, if some knowledge of the channel SINR is available, for example from a recent channel quality indicator (CQI) report, the initial SINR PDF $P_{\Theta[0]}$ also reflects this channel knowledge.

SINR Point Estimates

LTS calculates an estimated SINR in every transmission instance for predicting the optimal MCS. At the $t$th transmission instance, LTS generates the SINR sample

$$\tilde{\theta}[t] \sim P_{\Theta[t]}. \tag{5.1}$$

In the initial few transmission intervals, the SINR sample has a relatively high variance owing to the limited (or absent) prior knowledge of the channel. The initial SINR samples hence explore the space of feasible SINRs. Over time, the SINR PDF is refined through Bayesian updates that are described in the next section. Subsequent draws are then more likely to be close to the peak of the SINR PDF, which concentrates around the true channel SINR.

Sampling from the SINR PDF is possible through several sampling techniques [39]. Here we describe one such scheme, inverse transform sampling (ITS), which has commonly available and computationally efficient implementations [40]. ITS first calculates the cumulative distribution function (CDF),

$$F_{\Theta[t]}[\theta] = \int_{x \le \theta} P_{\Theta[t]}[x] \, dx. \tag{5.2}$$

ITS then samples a uniformly distributed random variable, $u[t] \sim \mathcal{U}(0, 1)$. Finally, ITS maps $u[t]$ to an SINR sample through the inverse SINR CDF, $\tilde{\theta}[t] = F^{-1}_{\Theta[t]}[u[t]]$.

Drawing an SINR sample $\tilde{\theta}[t]$ that underestimates the true channel SINR is likely to lower the instantaneous throughput. On the other hand, overestimating the channel SINR is even more detrimental to link performance: sampling a too-high SINR leads to aggressive MCS selection. Since most error rate curves have a sharp drop-off (also called waterfall curves [41]), aggressive MCS selection sharply reduces the instantaneous throughput. To mitigate this possibility, LTS adopts a pessimistic sampling approach, where the SINR estimate is chosen to be the minimum of the mean and sampled SINR values, i.e.,

$$\hat{\theta}[t] = \min\{\tilde{\theta}[t], E[\Theta[t]]\}. \tag{5.3}$$

Similar one-sided sampling approaches have earlier been studied in the context of optimistic Thompson sampling [42].
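
The sampling steps in (5.1) to (5.3) can be sketched on a discretized SINR grid, where the PDF becomes a probability mass function and the inverse CDF becomes a search over cumulative sums. The grid, the uniform prior, and the function names are assumptions made for this illustration; a continuous-valued implementation would differ.

```python
import bisect
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def inverse_transform_sample(grid, pmf, u):
    """Map a uniform draw u in (0, 1) to a grid point via the inverse CDF."""
    cdf = []
    running = 0.0
    for p in pmf:
        running += p
        cdf.append(running)
    index = bisect.bisect_left(cdf, u)      # first index with cdf >= u
    return grid[min(index, len(grid) - 1)]  # guard against rounding at the top


def pessimistic_estimate(grid, pmf, u):
    """Return min(sampled SINR, mean SINR), mirroring (5.3)."""
    sample = inverse_transform_sample(grid, pmf, u)
    mean = sum(s * p for s, p in zip(grid, pmf))
    return min(sample, mean)


grid = [float(s) for s in range(-10, 21)]  # hypothetical SINR grid in dB
pmf = normalize([1.0] * len(grid))         # uniform prior over the grid
rng = random.Random(1)
estimate = pessimistic_estimate(grid, pmf, rng.random())
```

Draws below the mean are kept as they are, so exploration is retained on the conservative side, while draws above the mean are clipped to the mean.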

Optimal MCS Selection

LTS uses the SINR point estimate $\hat{\theta}[t]$ to predict the optimal MCS $m^\star[t]$ for the $t$th transmission interval. Similar to the OLLA MCS prediction step, LTS uses the OLM to map $\hat{\theta}[t]$ to the MCS ACK probabilities $\hat{\mu}_m[t] = \mathcal{G}^{\mathrm{Tx}}(m, \hat{\theta}[t]) \;\; \forall m \in [M]$. Consequently, to maximize the expected link throughput, the optimal MCS is obtained from (3.6). LTS can also easily address other link optimization goals, for example throughput maximization under a BLER constraint as considered in [29]. For these alternate performance goals, a suitable objective function is defined in place of (3.6).

The transmitter processes the next set of data bits using $m^\star[t]$ and sends them over the air. Subsequently, the receiver attempts to recover the data bits through received signal processing and the control signaling that indicates that MCS $m^\star[t]$ was used. The receiver then feeds back the binary ACK/NACK signal $c[t] \in \{0, 1\}$ to the data transmitter.

Posterior SINR Distribution

LTS calculates the posterior SINR PDF using Bayesian updates to the prior SINR PDF. The posterior SINR PDF is $P_{\Theta}[\theta[t+1]] := P_{\Theta}[\theta[t] \mid c[t]]$, i.e., the prior SINR PDF conditioned on the observed ACK/NACK feedback. Further, the likelihood of observing $c[t]$ is given by

$$P[c[t] \mid \theta] = \begin{cases} P[c[t] = 1 \mid \theta], & c[t] = 1 \\ 1 - P[c[t] = 1 \mid \theta], & c[t] = 0 \end{cases} \;=\; c[t] \times \mu(m^\star[t], \theta) + (1 - c[t]) \times (1 - \mu(m^\star[t], \theta)), \tag{5.4}$$

Algorithm 5.1 LTS for Link Adaptation

1: Input: Data rates $r_m \;\; \forall m \in [M]$, variance parameter $\sigma^2$.
2: Initialize: SINR PDF $P_{\Theta[1]}$
3: for time index $t = 1$ to $T$ do
4:    Determine an SINR estimate, $\hat{\theta}[t]$, using (5.3).
5:    Predict the optimal MCS, $m^\star[t]$, for $\hat{\theta}[t]$ using (3.6).
6:    Transmit data with MCS $m^\star[t]$.
7:    Observe ACK/NACK feedback $c[t]$.
8:    Calculate the posterior SINR PDF, $P_{\Theta[t+1]}$.
9: end for

where $\mu(m^\star[t], \theta)$ is defined in (3.3) as the ACK probability for MCS $m^\star[t]$ at SINR $\theta$. LTS estimates the ACK probability from the OLM defined in (3.7) and available at the transmitter, i.e., $\hat{\mu}_{m^\star[t]}[t] = \mathcal{G}^{\mathrm{Tx}}(m^\star[t], \theta)$. The likelihood function can then be written as

$$P[c[t] \mid \theta[t]] \approx c[t] \times \mathcal{G}^{\mathrm{Tx}}(m^\star[t], \theta) + (1 - c[t]) \times (1 - \mathcal{G}^{\mathrm{Tx}}(m^\star[t], \theta)). \tag{5.5}$$

The posterior SINR PDF is then obtained from Bayes' rule as

$$P_{\Theta[t+1]}[\theta] = \frac{P[c[t] \mid \Theta[t] = \theta] \times P_{\Theta[t]}[\theta]}{P[c[t]]}. \tag{5.6}$$

The ACK probability $P[c[t]]$ is easily computed by marginalizing the corresponding $\theta$-conditional distribution, i.e., $P[c[t]] = \int P[c[t] \mid \theta] \, P_{\Theta[t]}[\theta] \, d\theta$. The posterior PDF can then be estimated through the self-normalized expression

$$P_{\Theta[t+1]}[\theta] = \frac{P[c[t] \mid \Theta[t] = \theta] \times P_{\Theta[t]}[\theta]}{\int P[c[t] \mid \theta] \times P_{\Theta[t]}[\theta] \, d\theta}. \tag{5.7}$$

Convergence Analysis

In this section, we bound the expected link adaptation performance of the LTS algorithm. We characterize the finite-time regret of LTS, which measures the cumulative loss in throughput compared to an oracle policy that always selects the optimal MCS. We denote the true, unknown channel SINR with $\theta^*$, which corresponds to the optimal MCS index $m^* = \arg\max_{m \in \{1, \dots, K\}} r_m \mu(m, \theta^*)$. Then, the finite-time regret for LTS is given by

$$\mathcal{R}(T; \theta^*) = E\left[\sum_{t=1}^{T} \Big( r_{m^*} \mu(m^*, \theta^*) - r_{m[t]} \mu(m[t], \theta^*) \Big)\right]. \tag{5.8}$$

For Thompson sampling based optimization, a common regret analysis approach is to bound the Bayesian regret, which is the expected regret over all parameter configurations, given by [25]

$$\mathcal{BR}(T) = E_{\theta^* \in \Theta}\big[\mathcal{R}(T; \theta^*)\big]. \tag{5.9}$$

In the context of link adaptation, the Bayesian regret metric amounts to the expected LTS regret over the range of feasible channel SINRs. Hence, a bound on the Bayesian regret quantifies the maximum throughput loss experienced by an average wireless link.

Theorem 4. The Bayesian regret for LTS until time $T$ is upper bounded by

$$\mathcal{BR}(T) \le r_{\max} \Big( 3M + \sqrt{6M \cdot T \log T} \Big), \tag{5.10}$$

where $r_{\max} = \max\{r_1, \dots, r_K\}$.

Theorem 4 shows that the Bayesian regret for LTS increases at most as fast as $O(\sqrt{T \log T})$, where $O(\cdot)$ is the big-O notation. Equivalently, the relative loss in throughput until time $T$ satisfies $O(\sqrt{\log T / T}) \to 0$ as $T \to \infty$, and hence the algorithm is asymptotically optimal. In Section 6, we empirically evaluate LTS, where we demonstrate that this algorithm rapidly converges to the true channel SINR in only a few transmission time intervals. Next, we develop the proof for Theorem 4.

The convergence analysis for latent bandits is a topic of active research. The recent work in [43] provides a timely update on this topic. The analysis in [43] assumes that the reward distribution is sub-Gaussian, that is, the reward $R \sim P(\cdot \mid a, s)$ for action $a$ in latent state $s \in \mathcal{S}$ satisfies $\log E[\exp(\lambda(R - E(R)))] \le \sigma^2 \lambda^2 / 2$ for all $\lambda \in \mathbb{R}$, where $\sigma^2$ is the variance proxy. Under this sub-Gaussianity assumption, the Bayesian regret for the general latent Thompson sampling algorithm is shown to be upper bounded as follows.

Lemma 1.

$$\mathcal{BR}(T) \le 3|\mathcal{S}| + 2\sigma\sqrt{6|\mathcal{S}| \cdot T \log T}, \tag{5.11}$$

where $|\mathcal{S}|$ denotes the cardinality of $\mathcal{S}$.

The bound in Lemma 1 holds for discrete state spaces. Since the SINR is a real-valued quantity, this result is not directly applicable to link adaptation. Further, the LTS rewards correspond to the ACK/NACK feedback, which is Bernoulli-distributed in $\{0, 1\}$. To bound the Bayesian regret for LTS in this context, we first show that the SINRs can be discretized without a loss in throughput performance. Next, we establish the sub-Gaussianity of ACK/NACK rewards and obtain its variance proxy. We use these two results to derive the stated bound.

Proposition 1. There exists a discretization $\Phi$ of the range of feasible SINRs, with $|\Phi| \le M$, that uniquely encodes the throughput-maximizing MCS for an arbitrary SINR.

Proof. Consider the range of feasible SINRs $\Theta = [\theta_{\min}, \theta_{\max}]$. We construct the discrete set of SINRs, $\Phi = \{\phi_1, \dots, \phi_M\}$, through the following procedure: $\phi_m = \{\theta : m^*(\theta) = m\}$ with $m^*(\theta) = \arg\max_{m' \in \{1, \dots, M\}} r_{m'} \mu(m', \theta)$, that is, the set of SINRs for which MCS $m$ maximizes the link throughput. Ties are broken arbitrarily. Clearly, each element in $\Phi$ corresponds to a unique, throughput-maximizing MCS. Further, the cardinality of this discretization satisfies $|\Phi| \le M$, since there can be MCSs that do not maximize the link throughput for any SINR value.

As an example, the discretization can be constructed by querying the OLM with the range of feasible SINRs to obtain the corresponding expected throughputs for each MCS and constructing the discrete encoding above.
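
The construction described above can be sketched by sweeping an SINR grid, querying a stand-in OLM, and grouping the grid points by their throughput-maximizing MCS. The sigmoid OLM, its thresholds, and the data rates are hypothetical and chosen so that one MCS is never optimal, which illustrates why the discretization can have fewer than $M$ elements.

```python
import math


def ack_prob(sinr_db, threshold_db, slope=2.0):
    """Hypothetical sigmoid OLM standing in for the offline link model."""
    return 1.0 / (1.0 + math.exp(-slope * (sinr_db - threshold_db)))


def discretize(rates, thresholds_db, sinr_grid_db):
    """Group SINR grid points by the MCS that maximizes expected throughput."""
    cells = {}
    for sinr in sinr_grid_db:
        throughputs = [r * ack_prob(sinr, th)
                       for r, th in zip(rates, thresholds_db)]
        best = max(range(len(rates)), key=throughputs.__getitem__)
        cells.setdefault(best, []).append(sinr)
    return cells


rates = [1.0, 1.05, 4.0]        # hypothetical data rates
thresholds = [0.0, 9.0, 10.0]   # hypothetical SINR thresholds in dB
grid = [-10.0 + 0.5 * i for i in range(61)]  # -10 dB to 20 dB in 0.5 dB steps

# The middle MCS is dominated at every grid point, so only two cells remain.
cells = discretize(rates, thresholds, grid)
```

Each key of `cells` identifies one element of the discretization, and its value lists the grid SINRs that map to that MCS.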

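Lemma 2. The Bernoulli distributed random variable $X \sim \mathcal{B}(\mu)$ is sub-Gaussian with variance proxy $\sigma^2 = 1/4$.

Proof. From the definition of the Bernoulli distribution, we have that

$$\log E\big(e^{\lambda(X - \mu)}\big) = \log\big(\mu e^{\lambda(1 - \mu)} + (1 - \mu) e^{-\lambda\mu}\big) = -\lambda\mu + \log\big(\mu e^{\lambda} + (1 - \mu)\big). \tag{5.12}$$

The expression in (5.12) can be maximized with respect to $\mu$ to obtain

$$\max_{\mu} \log E\big(e^{\lambda(X - \mu)}\big) = \frac{\lambda}{e^{\lambda} - 1} - 1 + \log\frac{e^{\lambda} - 1}{\lambda} \le \frac{\lambda^2}{8},$$

where the last step uses Pinsker's inequality [44]. Comparing with the definition, $X$ is sub-Gaussian with $\sigma^2 = 1/4$.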
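
As a numerical sanity check of Lemma 2, and not a substitute for the proof, the sketch below evaluates the right-hand side of (5.12) on a grid of $\lambda$ and $\mu$ values and compares it against $\lambda^2/8$, which equals $\sigma^2\lambda^2/2$ for $\sigma^2 = 1/4$. The grid ranges and function names are arbitrary choices made for illustration.

```python
import math


def log_mgf(lam, mu):
    """log E[exp(lam * (X - mu))] for X ~ Bernoulli(mu), as in (5.12)."""
    return -lam * mu + math.log(mu * math.exp(lam) + (1.0 - mu))


def worst_gap(lams, mus):
    """Largest value of log_mgf(lam, mu) - lam^2 / 8 over the two grids."""
    return max(log_mgf(lam, mu) - lam * lam / 8.0
               for lam in lams for mu in mus)


lams = [-10.0 + 0.1 * i for i in range(201)]  # lambda from -10 to 10
mus = [0.01 * i for i in range(1, 100)]       # mu from 0.01 to 0.99

# A non-positive gap (up to rounding) is consistent with sigma^2 = 1/4.
gap = worst_gap(lams, mus)
```

The gap is tightest for $\mu = 1/2$ and small $|\lambda|$, where the Bernoulli variance $\mu(1 - \mu)$ reaches its maximum of $1/4$.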

Next, we define the following set of consistent SINRs at time $t$:

$$C_t = \Big\{ \phi \in \Phi : \mu_\phi - \hat{\mu}_\phi(t) \le \sqrt{6 N_t^{-1}(\phi) \log t} \Big\}, \tag{5.13}$$

where $\Phi$ is an arbitrary discretization of the SINR space, $\mu_\phi$ is the expected reward for SINR $\phi \in \Phi$ obtained using the OLM, $\hat{\mu}_\phi(t)$ is the mean reward for all previous time steps where SINR $\phi$ was selected, and $N_t(\phi)$ is the number of times that $\phi$ was selected until time $t$. The set $C_t$ contains the true channel SINR with a high probability. Further, we define an upper bound on the expected reward for each MCS, $U_t(m) = \max_{\phi \in C_t} \mu(m, \phi)$, $m = 1, \dots, M$.

Following the approach in [43, 45], we decompose the regret at time step $t$ in the following manner,

$$\begin{aligned} r_{m^*} \mu(m^*; \theta^*) - r_{m[t]} \mu(m[t]; \theta^*) &\le r_{\max} \big[ \mu(m^*; \theta^*) - \mu(m[t]; \theta^*) \big] \\ &= r_{\max} \big[ \mu(m^*; \theta^*) - U_t(m[t]) + U_t(m[t]) - \mu(m[t]; \theta^*) \big] \\ &\le r_{\max} \big[ \mu(m^*; \theta^*) - U_t(m^*) + U_t(m[t]) - \mu(m[t]; \theta^*) \big], \end{aligned} \tag{5.14}$$

where $r_{\max} = \max_{m \in \{1, \dots, M\}} r_m$ is the largest data rate among the candidate MCSs. This decomposition is of the same form as obtained in [43, Sec. 4.1], and directly leads to the bound in Theorem 4. We omit a full description owing to space constraints, and refer the interested reader to [43].

Extension to Fading Channels

The posterior PDF encodes the knowledge of the channel SINRs at the end of the $t$th transmission round. For a stationary channel, the SINR does not vary over time. Successive ACK/NACK feedbacks for an arbitrary MCS are hence generated from a stationary ACK probability distribution induced by the channel SINR. However, for fading channels typically observed in practical wireless deployments, the channel SINR is non-stationary owing to the dynamics of the propagation environment. When the propagation environment features sufficiently many propagation paths, these dynamics are closely approximated through independent Normal probability distributions on the real and imaginary signal components [41]. Further, the variance of the Normal distribution is proportional to the relative speed between the transmitter and the receiver [46]. Here, we use this insight to allow probabilistic SINR tracking in fading channels.

LTS addresses SINR nonstationarity by relaxing the posterior SINR PDF through convolution with a Normal distribution in every time step. The updated SINR PDF is then given by

$$P^{(\mathrm{up})}_{\Theta[t+1]} = P_{\Theta[t+1]} * \mathcal{N}(0, \sigma^2), \tag{5.15}$$

where $*$ denotes the convolution operator and $\mathcal{N}(0, \sigma^2)$ is the Normal distribution with zero mean and variance $\sigma^2$. The operation in (5.15) corresponds to a priori modeling the channel fading as an autoregressive process with Gaussian innovation. The magnitude of SINR variations is a function of the relative speed between the transmitter and the receiver. Cellular deployments measure this relative speed in terms of the Doppler shift experienced by each link [46, 47]. Hence, LTS automatically configures the variance parameter $\sigma^2$ proportionally to the normalized Doppler estimate, where the proportionality factor needs to be chosen empirically.
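
One posterior update followed by the smoothing step can be sketched on a discretized SINR grid as follows. This is an illustrative approximation under stated assumptions: the sigmoid OLM, the grid, and the smoothing width are hypothetical, and the Gaussian kernel is truncated at the grid edges and renormalized, which only approximates the convolution in (5.15).

```python
import math


def ack_prob(sinr_db, threshold_db, slope=2.0):
    """Hypothetical sigmoid OLM for the MCS that was transmitted."""
    return 1.0 / (1.0 + math.exp(-slope * (sinr_db - threshold_db)))


def bayes_update(grid, pmf, threshold_db, ack):
    """Self-normalized posterior over the grid after one ACK/NACK, as in (5.7)."""
    weights = []
    for s, p in zip(grid, pmf):
        mu = ack_prob(s, threshold_db)
        likelihood = mu if ack else 1.0 - mu
        weights.append(likelihood * p)
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(grid, pmf, sigma_db):
    """Discrete counterpart of (5.15): blur the pmf with a Normal kernel."""
    smoothed = []
    for s in grid:
        smoothed.append(sum(p * math.exp(-0.5 * ((s - x) / sigma_db) ** 2)
                            for x, p in zip(grid, pmf)))
    total = sum(smoothed)
    return [w / total for w in smoothed]


def mean(grid, pmf):
    """Mean SINR under the given probability mass function."""
    return sum(s * p for s, p in zip(grid, pmf))


grid = [-10.0 + 0.5 * i for i in range(61)]  # hypothetical SINR grid in dB
prior = [1.0 / len(grid)] * len(grid)        # uniform prior
after_ack = bayes_update(grid, prior, threshold_db=5.0, ack=True)
after_nack = bayes_update(grid, prior, threshold_db=5.0, ack=False)
relaxed = gaussian_smooth(grid, after_ack, sigma_db=1.0)
```

An ACK shifts the probability mass above the threshold of the transmitted MCS and a NACK shifts it below, while the smoothing step spreads the mass again so that later feedback can track a drifting SINR.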

Algorithm Complexity

As discussed earlier, OLLA has a low implementation complexity. In terms of memory, OLLA maintains a running estimate of the scalar SINR for each link. In every time interval, OLLA computes a new SINR estimate by adding (or subtracting) the step size to (or from) the previous SINR estimate based on the ACK (or NACK) feedback. Subsequently, in most OLLA implementations, the updated SINR estimate is mapped to the optimal MCS through a simple thresholding of the SINR space [3].

In contrast to OLLA, the Thompson sampling based link adaptation schemes maintain a prior distribution over each individual MCS. This entails two distribution parameters for each MCS, that is, a total of 2K distribution parameters need to be stored in memory. In every time interval, K Monte Carlo sampling steps are executed, one for each of the K MCSs. The computational complexity of these sampling operations depends on several factors that have been analyzed in [48]. Subsequent to the sampling, the optimal MCS is computed by searching over the predicted throughputs for the candidate MCSs. The complexity of these algorithms is hence significantly higher than that of OLLA.

Our proposed LTS algorithm needs to maintain only a single prior distribution, that is, over the scalar SINR metric. Further, in every time interval, LTS executes only a single Monte Carlo sampling step to obtain an SINR point estimate. This SINR is mapped to the ACK probabilities through a fast OLM-based table lookup. Compared to the existing RLLA schemes, LTS hence has a lower memory footprint (to store the SINR model) and a lower computational complexity for computing the optimal MCS.
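The OLLA update described at the start of this subsection can be sketched in a few lines. The sketch assumes the asymmetric step ratio that is common in the OLLA literature, where the ACK step equals the NACK step scaled by the target BLER over one minus the target BLER; the function names, the clamping bounds, and the threshold values are illustrative placeholders rather than values from this paper.

```python
import bisect


def olla_update(sinr_db, ack, step_db, target_bler, sinr_min_db, sinr_max_db):
    """One OLLA step: nudge the SINR estimate up on ACK and down on NACK.

    Assumption: the up-step is step_db * target_bler / (1 - target_bler),
    so that the estimate settles where the error rate meets the target.
    """
    if ack:
        sinr_db += step_db * target_bler / (1.0 - target_bler)
    else:
        sinr_db -= step_db
    # Keep the estimate inside the feasible SINR region.
    return min(max(sinr_db, sinr_min_db), sinr_max_db)


def select_mcs(sinr_db, thresholds_db):
    """Index of the highest MCS whose (sorted) SINR threshold is not above the estimate."""
    return max(bisect.bisect_right(thresholds_db, sinr_db) - 1, 0)


# Hypothetical thresholds for four MCSs, in dB.
thresholds = [-5.0, 0.0, 4.0, 9.0]
estimate = olla_update(10.0, ack=False, step_db=0.5, target_bler=0.1,
                       sinr_min_db=-10.0, sinr_max_db=20.0)
mcs = select_mcs(estimate, thresholds)
```

The state is a single scalar per link and each step costs one addition and one sorted-list lookup, which is the low complexity that the text attributes to OLLA.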

Discussion

LTS is an efficient link adaptation scheme that automatically adapts to the channel fading profile. Further, cellular LA with LTS fulfils the desired link adaptation characteristics of being fast, stable, and robust. The intuition behind fast convergence is easily developed from the waterfall nature of error rate curves [41]: for a given MCS, the ACK probability equals one for sufficiently high SINRs and drops rapidly to zero as the SINR decreases. Hence, whenever an ACK is observed, LTS assigns zero probability density to the subset of SINRs that correspond to a zero ACK probability for the selected MCS. Conversely, with a NACK, LTS assigns zero probability density to the SINRs that always succeed for that MCS. Owing to the abrupt error rates predicted by the waterfall curves, a single ACK/NACK feedback can substantially narrow down the range of likely SINRs, which leads to fast convergence.

Further, LTS has a stable steady-state performance. After convergence to the most likely SINR, subsequent ACK/NACK feedback signals only serve to concentrate the SINR PDF further. Hence, unlike OLLA, the SINR estimate with LTS does not fluctuate after convergence.

Finally, LTS is robust to deployment- and hardware-specific impairments. Since LTS learns from the observed ACK/NACK feedback, it estimates the OLM-supported SINR that most closely matches the true channel state. Even in the presence of impairments, LTS accurately learns the true channel SINR as long as it is within the range of SINRs encoded by the OLM.
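The narrowing effect of a single ACK or NACK can be illustrated with a Bayes-rule update on a discretized SINR PMF. In this sketch, the waterfall curve is approximated by a logistic function of the SINR around a per-MCS threshold; this logistic form, its slope, and the function names are hypothetical stand-ins for the OLM lookup used by LTS.

```python
import math


def ack_probability(sinr_db, threshold_db, slope=2.0):
    """Hypothetical waterfall curve: ACK probability rises steeply around the threshold."""
    return 1.0 / (1.0 + math.exp(-slope * (sinr_db - threshold_db)))


def posterior_update(pmf, sinr_grid_db, threshold_db, ack):
    """Multiply the prior PMF by the ACK (or NACK) likelihood and renormalize."""
    unnormalized = []
    for prior, sinr_db in zip(pmf, sinr_grid_db):
        p_ack = ack_probability(sinr_db, threshold_db)
        likelihood = p_ack if ack else 1.0 - p_ack
        unnormalized.append(prior * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Uniform prior on a 1 dB grid; one ACK for an MCS with a 5 dB threshold.
grid = [float(s) for s in range(-10, 21)]
prior = [1.0 / len(grid)] * len(grid)
posterior = posterior_update(prior, grid, threshold_db=5.0, ack=True)
```

After the ACK, almost all probability mass lies above the threshold, while a NACK for the same MCS would push the mass below it. Combining feedback for different MCSs therefore brackets the SINR from both sides.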

6 Numerical Results

In this section, we numerically evaluate the link adaptation scheme described in this paper. We simulate four link adaptation algorithms: (i) OLLA, which is state-of-the-art in cellular deployments, (ii) UTS, which exploits the unimodality of the throughput function [28], (iii) MTS, which employs Thompson sampling with uncorrelated arms [27], and (iv) LTS, our proposed link adaptation algorithm that exploits the latent channel SINR state metric. We evaluate each link adaptation algorithm in terms of its expected throughput, $\mathcal{T}_N(t)$, for $t = 1, 2, \dots$, where the expectation is taken over N = 1000 independent runs of each experiment.

The simulation code is written in Python and relies on the PY-ITPP library [49] for the communication and signal processing functionality. The PY-ITPP library also provides access to standardized open-source implementations of several small-scale fading stochastic channel models, two of which are considered here. We use Jupyter notebooks running on a remote server to execute the experiments and collect the results. Further, we use the Ray library [50] to efficiently parallelize the experiments across several compute nodes running separate Linux kernels.

Experimental Setup

We evaluate each link adaptation algorithm in the following radio environments. First, we simulate a channel that is stationary in time and has a flat frequency response, also known as an additive white Gaussian noise (AWGN) channel. For this channel, we also compare the SINR evolution for the two SINR-tracking schemes, OLLA and LTS. Next, we simulate two frequency selective fading channels that respectively model pedestrian and vehicular radio environments. Finally, for the pedestrian environment, we also evaluate the effect of CQI reporting on the performance of the evaluated algorithms.

We consider downlink communication modeled on the LTE standard [51]. The transport block sizes and MCS values are obtained from [51], and Turbo codes are used for encoding the data bits. For the frequency selective and time-varying SINR experiments, we simulate two different channel models: a pedestrian environment where the relative speed between the transmitter and the receiver is 3 km/h, using the ITU PEDESTRIAN A channel model [52], and a vehicular environment with a relative speed of 30 km/h, modeled using the ITU VEHICULAR B channel model [52]. The pedestrian and vehicular environments respectively emulate low frequency selectivity with gradual fading, and high frequency selectivity with rapid fading. The complete simulation parameters are listed in Table 6.1.

We make the following assumptions throughout the numerical evaluations. We assume perfect synchronization between the transmitter and the receiver. Further, we assume full-buffer traffic, that is, the transmitter has sufficient data bits in the buffer to completely fill the chosen transport block in every transmission instance. We assume that the ACK/NACK feedback is available at the transmitter without any signaling delay. We assume that the channel is known perfectly at the receiver, both for decoding the data bits and for estimating the CQI. We assume perfect control channel feedback, so that the CQI and the ACK/NACK feedback are always signaled successfully to the transmitter. For CQI generation, we ignore any systematic bias between the OLMs employed by the transmitter and the receiver. Further, we only consider the first ACK/NACK transmission and ignore any retransmissions. For fading channels, we assume a block fading profile, i.e., the channel response stays constant within a transmission interval but varies across the intervals. Further, we assume that the transmitter has perfect knowledge of the normalized channel Doppler, $\nu = \frac{f_c \cdot v}{c} \cdot T_s$, where $f_c$ is the carrier frequency, $v$ is the relative speed of the receiver with respect to the transmitter, $c$ is the speed of light, and $T_s$ is the sampling interval.
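The normalized Doppler defined above is straightforward to compute. The sketch below uses the carrier frequency and relative speeds of the two fading scenarios; the function name is illustrative, and the sampling interval of 1 ms used in the example is an assumption made only to show the call, since the value of the sampling interval is not stated here.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def normalized_doppler(carrier_hz, speed_kmh, sampling_interval_s):
    """Maximum Doppler shift f_c * v / c, scaled by the sampling interval T_s."""
    speed_m_s = speed_kmh / 3.6
    doppler_hz = carrier_hz * speed_m_s / SPEED_OF_LIGHT_M_S
    return doppler_hz * sampling_interval_s


# 2 GHz carrier; the 1 ms sampling interval is an assumed value.
pedestrian = normalized_doppler(2e9, 3.0, 1e-3)
vehicular = normalized_doppler(2e9, 30.0, 1e-3)
```

At 2 GHz, a relative speed of 3 km/h corresponds to a maximum Doppler shift of roughly 5.6 Hz. Because the expression is linear in the speed, the vehicular value is ten times the pedestrian value for any choice of sampling interval.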

Algorithms

The OLLA parameters are configured based on the previous literature. In particular, the target BLER for OLLA is set to the commonly used value of 0.1 [9, 10, 16]. For the step size configuration, different studies have used values ranging from 0.01 dB up to 1.0 dB. As discussed in previous sections, the OLLA step size strikes a balance between the convergence rate and the stability of the algorithm. We experimented with several different values for OLLA, and found a step size of 0.5 dB to be suitable for link throughput optimization. Finally, the OLLA SINR estimate is constrained within maximum and minimum SINR values obtained from the range of SINRs supported by the OLM. These values effectively constitute the feasible SINR region for the experiments.

MTS and UTS do not require any additional configuration for time-stationary channels. However, with fading channels, these algorithms adapt to the channel dynamics by employing a moving window over the transmissions and their corresponding ACK/NACK outcomes [28]. A moving window retains the historical MCSs and observed ACK/NACK outcomes for the previous few time steps and discards the rest. This historical data is used to predict the optimal MCS for the current time step. The size of the moving window, L, varies inversely with the rate of channel variation, that is, a faster-varying channel calls for a shorter window. However, an exact expression for L is not available and heuristics are applied to choose an appropriate value. We select L by experimenting with several possible values for each fading channel, and pick the values that provide the best link performance.

LTS uses a probabilistic SINR model. We implement LTS through a discretization of the SINR probability distribution. This implementation approach has several advantages. First, a discretization helps overcome the challenge of expressing the SINR PDF in a closed form. Second, the cellular standards limit the available MCSs to a finite set of discrete MCS values. A sufficiently fine SINR discretization hence does not suffer any performance loss compared to the SINR PDF. Finally, a discrete SINR probability mass function (PMF) can be used with computationally efficient sampling schemes. We discretize the feasible SINR range into P SINR bins. The SINR PMF at time interval t is then denoted by

$$P_{\Theta[t]}[\theta] := P_{\Theta[t]}[\theta \in \Theta_p] \quad \forall p \in [P], \qquad (6.1)$$

where $\Theta_p$ is the set of SINRs modeled by the $p$-th bin and $\bigcup_{p \in [P]} \Theta_p$ is the range of feasible SINRs.

As discussed in Section 5, we use ITS to obtain a point SINR estimate from the SINR PMF [39]. To calculate the posterior SINR PMF, the integration in (5.6) is replaced by a summation, and the Normal distribution employed by (5.15) is suitably discretized for convolving with the posterior PMF.

We configure the variance parameter for LTS by linearly scaling the normalized Doppler with a constant scaling factor of $10^4$. This scaling factor was obtained empirically in the context of the pedestrian channel. For this, we simulated various LTS variance configurations and picked the variance that maximized the link throughput in the pedestrian scenario. Subsequently, while configuring LTS for the vehicular channel, we computed a new LTS variance by simply scaling the vehicular Doppler with the same factor of $10^4$, that is, without any additional simulations. We observed that with this configuration approach, LTS retains good performance for the vehicular channel as well. An intuitive reason for this is that the channel dynamics do scale with the relative Dopplers of the respective links. With this approach, the scaling factor needs to be obtained only once, and can then be used to automatically tune the LTS configuration for several operating environments that have different Dopplers.
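Drawing a point SINR estimate from the discretized PMF can be sketched as follows, under the assumption that ITS refers to inverse transform sampling: a uniform random number is mapped through the cumulative distribution to a bin. The function name and the grid are illustrative and not part of the paper's code.

```python
import bisect
import itertools
import random


def sample_sinr(pmf, sinr_grid_db, rng=random):
    """Draw one SINR value by inverse transform sampling from a discrete PMF."""
    cdf = list(itertools.accumulate(pmf))
    # Scale by the final CDF value so that an unnormalized PMF also works.
    u = rng.random() * cdf[-1]
    # First bin whose cumulative mass exceeds u; zero-mass bins are never chosen.
    idx = bisect.bisect_right(cdf, u)
    # Guard against floating-point rounding at the upper edge.
    return sinr_grid_db[min(idx, len(pmf) - 1)]


grid = [float(s) for s in range(-10, 21)]
pmf = [1.0 / len(grid)] * len(grid)
estimate_db = sample_sinr(pmf, grid, rng=random.Random(0))
```

A single draw costs one cumulative sum and one binary search over the P bins, which is in line with the single sampling step per time interval described for LTS.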

AWGN Channel

We first consider an AWGN wireless channel, where the SINR does not vary with time and where the channel response is flat across frequency. The true channel SINR is configured to be 10 dB. We assume that the true channel SINR is unknown at the start of the experiment and that no CQI report is available. The goal of link adaptation is therefore to maximize the link throughput by quickly converging to the true SINR. The initial state of the algorithms is configured in the following manner: for each experiment, the initial OLLA SINR is selected randomly from the feasible SINR range. For the UTS, MTS, and LTS algorithms, the initial probability distributions for the modeled parameters are uniform over the respective parameter range.

Table 6.1: Simulation Parameters

| Scenario Parameter | Value |
| --- | --- |
| Carrier Frequency, $f_c$ | 2 GHz |
| Number of Subcarriers | 72 |
| FFT Size, M | 128 |
| Subcarrier Spacing, $\Delta f$ | 15 kHz |
| Subframe Duration, $\Delta t$ | 1 ms |
| Feasible SINR Range | [-10 dB, 20 dB] |
| Channel Models | AWGN, ITU PEDESTRIAN A [52], ITU VEHICULAR B [52] |
| Relative Speeds, v | 3 km/h (pedestrian), 30 km/h (vehicular) |
| Configuration Parameter | Value |
| OLLA Target BLER, $\eta$ | 0.1 |
| OLLA Step Size Parameter | 0.5 dB |
| OLLA Minimum SINR | -8.5 dB |
| OLLA Maximum SINR | 18.0 dB |
| MTS/UTS Window Size, L | {30, 100} |
| LTS SINR bin size, P | 1.0 dB |

The realized throughput for each link adaptation scheme is illustrated in Fig. 6.1(a), where the throughput in each step is averaged over N independent experiments. The average throughput for OLLA (solid blue curve) increases from the initial value and settles around a steady-state value after a few tens of transmissions. Compared to OLLA, UTS (solid orange curve) has poor performance for approximately the first 20 time intervals, where the UTS link throughput is close to zero. The reason for this behaviour is that UTS initially assigns identical ACK probability distributions to each MCS. UTS hence predicts a high expected throughput for the more aggressive MCSs. These MCSs inevitably fail owing to the given channel state, which results in zero throughput. Subsequently, UTS probes each MCS sequentially until it finds the throughput-maximizing MCS after 20 time intervals. UTS then maintains the optimal link throughput for the rest of the time intervals. The MTS algorithm (solid green curve) achieves an overall poorer performance than UTS, but outperforms OLLA in the latter half of the experiment. The proposed LTS (solid red curve) outperforms each of the competing techniques in terms of the link throughput. LTS converges quickly to the optimal throughput and maintains a stable performance subsequently. For the later time intervals, LTS is observed to achieve a slightly smaller throughput than UTS. This is because the LTS performance is fundamentally constrained by the discretization of the OLM lookup table, which quantizes the MCS ACK probabilities. The OLM model can be improved to overcome this performance gap, for example by using data-driven OLMs [33].

Figure 6.1: Link throughput (top figure) and estimated SINR (bottom figure) averaged over N independent runs of an experiment with the AWGN channel. (a) Link throughput $R_N(t)$. (b) Estimated channel SINR, mean values (solid curve) and one standard deviation (shaded region). LTS converges quickly to the optimal throughput and has a stable steady-state performance. In terms of SINR evolution, the SINR estimate for LTS concentrates around the true channel SINR and has a smaller steady-state variance than OLLA.

The OLLA and LTS algorithms explicitly estimate the channel SINR. In Fig. 6.1(b), we plot the SINR evolution for these algorithms. Solid lines denote the mean SINR across the experimental runs and the dotted lines denote a single standard deviation distance from the mean. Here, OLLA is observed to gradually converge towards the true channel SINR and simultaneously reduce the SINR estimation variance. In the case of LTS (red curves), the SINR estimate is highly uncertain at the beginning of the experiment. However, the SINR estimate quickly concentrates around the true channel SINR.

Figure 6.2: Link throughput averaged for pedestrian channel with a relative speed of 3 km/h. Compared to OLLA, UTS and MTS, LTS adapts faster to the channel variations and is stable across the time intervals.

Frequency Selective Fading Channels

Next, we discuss the numerical results for the pedestrian and vehicular channel models. For the OLLA algorithm, the step size parameter is kept at 0.5 dB. The window sizes for UTS and MTS are chosen to be L = 100 and L = 30 for the pedestrian and vehicular channels, respectively. The LTS variance parameter is configured to be 0.3 and 3.0, respectively, for these two fading channels based on the discussion earlier.
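The role of the window size L can be made concrete with a sketch of Thompson sampling over a moving window of ACK/NACK outcomes. The sketch treats the MCSs as uncorrelated arms with a uniform Beta(1, 1) prior, which is an assumption; it does not include the unimodality constraint of UTS, and the class name and rate values are hypothetical.

```python
import random
from collections import deque


class WindowedThompson:
    """Per-MCS Beta-Bernoulli Thompson sampling over the last `window` outcomes."""

    def __init__(self, rates, window, rng=None):
        self.rates = list(rates)              # throughput of each MCS on success
        self.history = deque(maxlen=window)   # (mcs, ack) pairs; oldest dropped first
        self.rng = rng or random.Random()

    def select(self):
        acks = [0] * len(self.rates)
        nacks = [0] * len(self.rates)
        for mcs, ack in self.history:
            if ack:
                acks[mcs] += 1
            else:
                nacks[mcs] += 1
        # Sample an ACK probability per MCS and weight it by the MCS rate.
        scores = [rate * self.rng.betavariate(1 + a, 1 + n)
                  for rate, a, n in zip(self.rates, acks, nacks)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, mcs, ack):
        self.history.append((mcs, bool(ack)))


# Window of 30 outcomes, as used for the vehicular channel.
agent = WindowedThompson(rates=[1.0, 2.0, 3.0], window=30, rng=random.Random(1))
chosen = agent.select()
agent.update(chosen, ack=True)
```

A shorter window forgets old outcomes sooner and thus tracks a fast-varying channel, at the cost of fewer observations per MCS; this is the trade-off behind choosing L = 100 for the pedestrian channel and L = 30 for the vehicular channel.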

In Fig. 6.2, the average realized throughput for the OLLA, UTS, MTS, and LTS algorithms is illustrated for the pedestrian channel with blue, orange, green, and red curves respectively. We observe that OLLA achieves good throughput performance for the initial few time intervals, but exhibits large fluctuations in later intervals. In contrast, UTS has a relatively more stable throughput performance. However, the peak throughput achieved by UTS is up to 30% lower than that of OLLA. Compared to both OLLA and UTS, MTS achieves a lower link throughput. In contrast, LTS demonstrates the best performance among the evaluated link adaptation algorithms. The throughput with LTS is both higher than that of the competing algorithms and stable across the time intervals.

We illustrate the link adaptation performance for the vehicular channel in Fig. 6.3. Here, the average throughput realized by LTS is up to 100% higher than that achieved by the competing algorithms. Further, MTS is observed to have a zero throughput for a large number of time intervals since it does not adapt fast enough to the channel. As a result, MTS underperforms OLLA for most of the transmission duration.

Figure 6.3: Link throughput averaged for vehicular channel with a relative speed of 30 km/h. Compared to OLLA, UTS, and MTS, LTS increases the average link throughput by up to 100%.

Explicit Channel Reporting

We consider the effect of CQI reports on the link adaptation performance for the pedestrian channel. We assume that the CQI reports are available at the transmitter every 80 ms, with a CQI signaling delay of 4 ms after the channel measurement. The CQI reporting instances are indicated with dashed gray lines in Fig. 6.4. Here, we choose a smaller step size of 0.1 dB for OLLA. This is because with periodic CQI reports, OLLA can quickly adapt to larger channel variations. Further, a smaller step size allows a stable throughput performance. The UTS and MTS algorithms do not have a mechanism to incorporate CQI reports, and are illustrated in the figure for completeness.

The performance results for OLLA, UTS, MTS, and LTS are illustrated in Fig. 6.4 with blue, orange, green, and red curves respectively. We observe that OLLA achieves a good throughput performance by relying on CQI. Compared to Fig. 6.2, OLLA quickly ramps up the SINR estimate when the CQI reports are available, for example at t = 80, 240, 640 ms. OLLA thus overcomes its performance gap compared to LTS with the additional channel knowledge provided by the CQI reports. In contrast, the impact of CQI reports on LTS is rather limited. For a few time intervals (for example, t = 480, 800 ms), the LTS throughput is slightly helped by the availability of CQI reports. However, these gains are insignificant compared to the signaling overhead required to carry the CQI reports. LTS can thus mitigate the need for CQI reports in this scenario and achieve near-optimal throughput based only on the ACK/NACK feedback.
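The CQI timing assumed in this experiment can be expressed as a small helper. The sketch assumes that reports arrive at the transmitter at positive multiples of the reporting period and describe the channel as measured one signaling delay earlier; the function name and the convention that no report arrives at t = 0 are assumptions.

```python
def cqi_measurement_time(t_ms, period_ms=80, delay_ms=4):
    """Return the measurement time of the CQI report arriving at t_ms, or None.

    Assumption: a report arrives at every positive multiple of period_ms
    and reflects the channel delay_ms earlier.
    """
    if t_ms > 0 and t_ms % period_ms == 0:
        return t_ms - delay_ms
    return None


# Arrival times of CQI reports within the first 250 ms.
arrivals = [t for t in range(250) if cqi_measurement_time(t) is not None]
```

With an 80 ms period, the transmitter relies only on ACK/NACK feedback in between reports, and each report is already 4 ms old when it is used.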

Figure 6.4: Link throughput averaged for pedestrian channel when CQI reports are signaled every 80 ms with a signaling delay of 4 ms (indicated by dashed gray lines). At the cost of higher signaling overhead, CQI reports can improve OLLA performance to match the LTS performance.

7 Conclusions and Future Work

We have proposed LTS, a new approach for link adaptation in cellular systems. LTS models an SINR probability distribution to estimate the latent channel SINR based on the ACK/NACK feedback. By optimally exploiting an OLM at the transmitter, the SINR distribution concentrates quickly around the true effective channel state. Further, LTS can be made to adapt to a fading channel by relaxing the SINR PDF proportionally to the channel Doppler. Compared to state-of-the-art OLLA and RLLA techniques, LTS (i) does not require any configuration tuning, (ii) converges faster to the optimal SINR estimate, and (iii) demonstrates a stable steady-state performance. Through numerical evaluations, we demonstrate that LTS significantly outperforms the existing link adaptation algorithms in terms of the average link throughput.

Further work on LTS can involve the selection of an appropriate SINR PDF update function for diverse channel fading profiles. For example, with a dominant line-of-sight component, the assumption of Gaussian innovation for channel fading might not be the optimal choice. In that case, another suitable function could be applied for the posterior SINR PDF update. Similarly, more work is required to address abrupt channel variations owing to large-scale obstructions and interference signals. Finally, the numerical evaluation for LTS could be carried out in the presence of several other impairments observed in practice, such as signaling delays, OLM biases, channel estimation errors, and interference.

The latent modeling approach adopted by LTS can be applied to other domains, and to other problems in wireless communications. For example, with the beam-based transmission commonly used in fifth generation (5G) cellular systems, the beam space may be described by a compact angular subspace. Then, the relative performance of a beam may be learnt efficiently through a similar Bayesian scheme. Similar approaches could also be proposed in the context of frequency sub-band selection, rank adaptation, and spatial multiplexing for multi-user MIMO.

References

[1] E. Dahlman, S. Parkvall, and J. Skold, 5G NR: The next generation wireless access technology. Academic Press, 2018.

[2] S. Nanda and K. M. Rege, “Frame error rates for convolutional codes on fading channels and the concept of effective $E_b/N_0$,” IEEE Transactions on Vehicular Technology, vol. 47, no. 4, pp. 1245–1250, 1998.

[3] M. Nakamura, Y. Awad, and S. Vadgama, “Adaptive control of link adaptation for high speed downlink packet access (HSDPA) in W-CDMA,” in The 5th International Symposium on Wireless Personal Multimedia Communications, vol. 2. IEEE, 2002, pp. 382–386.

[4] D. Morales, J. J. Sanchez, G. Gomez, M. Aguayo-Torres, and J. Entrambasaguas, “Imperfect adaptation in next generation OFDMA cellular systems,” 2009.

[5] J. Saha, S. S. Das, and S. Mukherjee, “Link adaptation using dynamically allocated thresholds and power control,” Wireless Personal Communications, vol. 103, no. 3, pp. 2259–2283, 2018.

[6] S. T. Chung and A. J. Goldsmith, “Degrees of freedom in adaptive modulation: a unified view,” IEEE Transactions on Communications, vol. 49, no. 9, pp. 1561–1571, 2001.

[7] R. Combes, J. Ok, A. Proutiere, D. Yun, and Y. Yi, “Optimal rate sampling in 802.11 systems: Theory, design, and implementation,” IEEE Transactions on Mobile Computing, vol. 18, no. 5, pp. 1145–1158, 2018.

[8] P. Wu and N. Jindal, “Coding versus ARQ in fading channels: How reliable should the PHY be?” in GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference, Nov 2009, pp. 1–6.

[9] V. Buenestado, J. M. Ruiz-Avilés, M. Toril, S. Luna-Ramírez, and A. Mendo, “Analysis of throughput performance statistics for benchmarking LTE networks,” IEEE Communications Letters, vol. 18, no. 9, pp. 1607–1610, Sept 2014.

[10] F. Blanquez-Casado, G. Gomez, M. del Carmen Aguayo-Torres, and J. T. Entrambasaguas, “eOLLA: an enhanced outer loop link adaptation for cellular networks,” EURASIP Journal on Wireless Communications and Networking, vol. 2016, no. 1, p. 20, 2016.

[11] V. Saxena, J. Jaldén, J. E. Gonzalez, M. Bengtsson, H. Tullberg, and I. Stoica, “Contextual multi-armed bandits for link adaptation in cellular networks,” in Proceedings of the 2019 Workshop on Network Meets AI & ML, 2019, pp. 44–49.

[12] H. Gupta, A. Eryilmaz, and R. Srikant, “Link rate selection using constrained Thompson sampling,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 2019, pp. 739–747.

[13] G. Pocovi, A. A. Esswie, and K. I. Pedersen, “Channel quality feedback enhancements for accurate URLLC link adaptation in 5G systems,” in 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring). IEEE, 2020, pp. 1–6.

[14] A. Tato, S. Andrenacci, E. Lagunas, S. Chatzinotas, and C. Mosquera, “Link adaptation and SINR errors in practical multicast multibeam satellite systems with linear precoding,” International Journal of Satellite Communications and Networking, 2020.

[15] S. Park, R. C. Daniels, and R. W. Heath, “Optimizing the target error rate for link adaptation,” in 2015 IEEE Global Communications Conference (GLOBECOM). IEEE, 2015, pp. 1–6.

[16] A. Duran, M. Toril, F. Ruiz, and A. Mendo, “Self-optimization algorithm for outer loop link adaptation in LTE,” IEEE Communications Letters, vol. 19, no. 11, pp. 2005–2008, 2015.

[17] T. Ohseki and Y. Suegara, “Fast outer-loop link adaptation scheme realizing low-latency transmission in LTE-advanced and future wireless networks,” in 2016 IEEE Radio and Wireless Symposium (RWS). IEEE, 2016, pp. 1–3.

[18] S. Wahls and H. V. Poor, “An outer loop link adaptation for BICM-OFDM that learns,” in 2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 2013, pp. 719–723.

[19] S. K. Pulliyakode and S. Kalyani, “Reinforcement learning techniques for outer loop link adaptation in 4G/5G systems,” arXiv preprint arXiv:1708.00994, 2017.

[20] Z. Dong, J. Shi, W. Wang, and X. Gao, “Machine learning based link adaptation method for mimo system,” in 2018 IEEE 29th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE, 2018, pp. 1226–1231.

[21] S. Khastoo, T. Brecht, and A. Abedi, “Neura: Using neural networks to improve wifi rate adaptation,” in Proceedings of the 23rd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 2020, pp. 161–170.

[22] M. Elwekeil, S. Jiang, T. Wang, and S. Zhang, “Deep convolutional neural networks for link adaptations in mimo-ofdm wireless systems,” IEEE Wireless Communications Letters, vol. 8, no. 3, pp. 665–668, 2018.

[23] N. Baldo and M. Zorzi, “Learning and adaptation in cognitive radios using neural networks,” in 2008 5th IEEE Consumer Communications and Networking Conference. IEEE, 2008, pp. 998–1003.

[24] C. Wang, J. Hsu, K. Liang, and T. Tai, “Application of neural networks on rate adaptation in ieee 802.11 wlan with multiples nodes,” in 2010 3rd International Conference on Computer Science and Information Technology, vol. 4. IEEE, 2010, pp. 425–430.

[25] A. Slivkins, “Introduction to multi-armed bandits,” arXiv preprint arXiv:1904.07272, 2019.

[26] O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,” in Advances in neural information processing systems, 2011, pp. 2249–2257.

[27] H. Gupta, A. Eryilmaz, and R. Srikant, “Low-complexity, low-regret link rate selection in rapidly-varying wireless channels,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 2018, pp. 540–548.

[28] S. Paladino, F. Trovo, M. Restelli, and N. Gatti, “Unimodal Thompson sampling for graph-structured arms.” in AAAI, 2017, pp. 2457–2463.

[29] V. Saxena and J. Jaldén, “Bayesian link adaptation under a BLER target,” in 21st IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2020.

[30] V. Saxena, J. Jaldén, J. E. Gonzalez, I. Stoica, and H. Tullberg, “Constrained Thompson sampling for wireless link optimization,” arXiv preprint arXiv:1902.11102, 2019.

[31] H. Qi, Z. Hu, X. Wen, and Z. Lu, “Rate adaptation with Thompson sampling in 802.11 ac WLAN,” IEEE Communications Letters, vol. 23, no. 10, pp. 1888–1892, 2019.

[32] K. Brueninghaus, D. Astely, T. Salzer, S. Visuri, A. Alexiou, S. Karger, and G.-A. Seraji, “Link performance models for system level simulations of broadband radio access systems,” in Personal, Indoor and Mobile Radio Communications, 2005. PIMRC 2005. IEEE 16th International Symposium on, vol. 4. IEEE, 2005, pp. 2306–2311.

[33] V. Saxena, J. Jaldén, M. Bengtsson, and H. Tullberg, “Deep learning for frame error probability prediction in BICM-OFDM systems,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6658–6662.

[34] S. D. Lembo et al., “Modeling BLER performance of punctured turbo codes,” 2011.

[35] A. Carreras Mesa, M. C. Aguayo-Torres, F. J. Martin-Vega, G. Gómez, F. Blanquez-Casado, I. M. Delgado-Luque, and J. Entrambasaguas, “Link abstraction models for multicarrier systems: A logistic regression approach,” International Journal of Communication Systems, vol. 31, no. 1, p. e3436, 2018.

[36] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.

[37] E. Kaufmann, N. Korda, and R. Munos, “Thompson sampling: An asymptotically optimal finite-time analysis,” in International Conference on Algorithmic Learning Theory. Springer, 2012, pp. 199–213.

[38] S. Agrawal and N. Goyal, “Further optimal regret bounds for Thompson sampling,” in Artificial Intelligence and Statistics, 2013, pp. 99–107.

[39] W. G. Cochran, Sampling techniques. John Wiley & Sons, 2007.

[40] S. Olver and A. Townsend, “Fast inverse transform sampling in one and two dimensions,” arXiv preprint arXiv:1307.1223, 2013.

[41] A. Molisch, Wireless Communications. Wiley, 2010.

[42] B. C. May, N. Korda, A. Lee, and D. S. Leslie, “Optimistic bayesian sampling in contextual-bandit problems,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 2069–2106, 2012.

[43] J. Hong, B. Kveton, M. Zaheer, Y. Chow, A. Ahmed, and C. Boutilier, “Latent bandits revisited,” Advances in Neural Information Processing Systems, vol. 33, 2020.

[44] N. Cesa-Bianchi and G. Lugosi, Prediction, learning, and games. Cambridge university press, 2006.

[45] D. Russo and B. Van Roy, “Learning to optimize via posterior sampling,” Mathematics of Operations Research, vol. 39, no. 4, pp. 1221–1243, 2014.

[46] T. Yucek, R. M. Tannious, and H. Arslan, “Doppler spread estimation for wireless OFDM systems,” in IEEE/Sarnoff Symposium on Advances in Wired and Wireless Communication, 2005. IEEE, 2005, pp. 233–236.

[47] Z. Hou, Y. Zhou, L. Tian, J. Shi, Y. Li, and B. Vucetic, “Radio environment map-aided doppler shift estimation in LTE railway,” IEEE Transactions on Vehicular Technology, vol. 66, no. 5, pp. 4462–4467, 2016.

[48] D. F. Anderson, D. J. Higham, and Y. Sun, “Computational complexity analysis for monte carlo approximations of classically scaled population processes,” Multiscale Modeling & Simulation, vol. 16, no. 3, pp. 1206–1226, 2018.

[49] V. Saxena, “py-itpp,” https://github.com/vidits-kth/py-itpp, 2020.

[50] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Goldberg, and I. Stoica, “Ray RLLIB: A composable and scalable reinforcement learning library,” arXiv preprint arXiv:1712.09381, p. 85, 2017.

[51] 3rd Generation Partnership Project, “Evolved Universal Terrestrial Radio Access (E-UTRA); Physical layer procedures,” Tech. Rep. 36.213 v12.3.0, Sep. 2016.

[52] ITU-R M. 1225, “Guidelines for evaluation of radio transmission technology for IMT-2000,” Tech. Rep., 1997.