A Comparison of Artificial Intelligence Algorithms for Dynamic Power Allocation in Flexible High Throughput Satellites

by Juan Jose Garau Luis

Submitted to the Department of Aeronautics and Astronautics in partial fulfillment of the requirements for the degree of Master of Science in Aeronautics and Astronautics at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author...... Department of Aeronautics and Astronautics May 13, 2020

Certified by...... Prof. Edward F. Crawley Professor of Aeronautics and Astronautics Thesis Supervisor

Accepted by ...... Sertac Karaman Associate Professor of Aeronautics and Astronautics Chair, Graduate Program Committee

A Comparison of Artificial Intelligence Algorithms for Dynamic Power Allocation in Flexible High Throughput Satellites by Juan Jose Garau Luis

Submitted to the Department of Aeronautics and Astronautics on May 13, 2020, in partial fulfillment of the requirements for the degree of Master of Science in Aeronautics and Astronautics

Abstract

The Dynamic Resource Management (DRM) problem in the context of multibeam satellite communications is becoming more relevant than ever. The future landscape of the industry will be defined by a substantial increase in demand alongside the introduction of digital and highly flexible payloads able to operate and reconfigure hundreds or even thousands of beams in real time. This increase in complexity and dimensionality puts the spotlight on new resource allocation strategies that use autonomous algorithms at the core of their decision-making systems. These algorithms must be able to find optimal resource allocations in real or near-real time. Traditional optimization approaches no longer meet all these DRM requirements, and the research community is studying the application of Artificial Intelligence (AI) algorithms to the problem as a potential alternative that satisfies the operational constraints. Although multiple AI approaches have been proposed in recent years, most of the analyses have been conducted under assumptions that do not entirely reflect the new operation scenarios' requirements, such as near-real time performance or high dimensionality. Furthermore, little work has been done in thoroughly comparing the performance of different algorithms and characterizing them. This Thesis considers the Dynamic Power Allocation problem, a DRM subproblem, as a use case and compares nine different AI algorithms under the same near-real time operational assumptions, using the same satellite and link budget models, and four different demand datasets. The study focuses on Genetic Algorithms (GA), Simulated Annealing (SA), Particle Swarm Optimization (PSO), Deep Reinforcement Learning (DRL), and hybrid approaches, including a novel DRL-GA hybrid. The comparison considers the following characteristics: time convergence, continuous operability, scalability, and robustness.
After evaluating the algorithms' performance on the different test scenarios, three algorithms are identified as potential candidates to be used during real satellite operations. The novel DRL-GA implementation shows the best overall performance, being also the most robust. When the update frequency is on the order of seconds, DRL is identified as the best option, since it is the fastest. Finally, when the online data substantially diverges from the training dataset of the DRL algorithm, both DRL and the DRL-GA hybrid might not perform adequately and an individual GA might be the best option instead.

Thesis Supervisor: Prof. Edward F. Crawley Title: Professor of Aeronautics and Astronautics

Acknowledgments

This Thesis has not been written under normal circumstances. Right now the world is suffering the social, economic, and structural effects of the COVID-19 global pandemic. My first thoughts of gratitude go to all healthcare workers on the front line and people carrying out essential tasks that keep our society moving. I also appreciate the efforts that my advisor, the department of Aeronautics and Astronautics, and MIT have made to facilitate my work during these months.

I would like to sincerely thank my advisor, Prof. Edward Crawley, and Dr. Bruce Cameron for their support, advice, and guidance throughout these past two years. Ed, Bruce, I really appreciate your honest input and commitment to making this research project succeed. It is truly an honor to work side by side with you.

I would also like to thank everyone else who has been part of this research project. First, I really appreciate the support received from SES, especially the feedback and motivation from Joel Grotz and Valvanera Moreno. Next, I would like to express my gratitude to my labmate Markus Guerster, with whom I have shared many moments of joy during the project. Markus, working together on this project has been a rewarding learning experience; I sincerely wish you all the best in your future endeavors. I also appreciate the valuable inputs and help received from Dr. Kalyan Veeramachaneni. I would like to acknowledge the rest of my labmates who have also been part of this project at some point during these two years. Nils, Damon, Rubén, and Skylar, you have been a source of inspiration and stimulating discussions; best of luck on your next steps.

The rest of my labmates at the Engineering Systems Lab have been a key factor to succeed in this first part of my graduate studies. Sydney, Matt, Alex, Anne, Eric, Beldon, George, Tommy, Katie, Michael, thank you for your constant encouragement and for all the fun times together. I would also like to show my appreciation to my former labmate, Íñigo del Portillo, for his valuable advice and honest guidance during my first year at the lab. I do not want to miss the chance to say thank you for the invaluable administrative support I always get from Amy Jarvis, Beth Marois, and Ping Lee, and the counseling received from Suraiya Baluch. Going through these two years would not have been possible without the support of my roommate Marc de Cea. Marc, I feel very fortunate to share the MIT adventure with you. I am also deeply grateful to my friends for their continuous doses of joy and fun throughout these two years: María, Inés, Helena, Ximo, Alex, Dani, Álvaro, Reus, Íñigo, Ondrej, Lukas, Regina, Faisal. Also, being part of Spain@MIT has been so much fun! I also want to thank the tremendous support received from my family and friends in Spain. Big thanks to my parents, Ana and Simón, for encouraging me to work hard and aim high. Also, thanks to my grandparents, godparents, aunts, uncles, cousins and the rest of my family for motivating me and constantly checking in. I am also deeply grateful to my closest friends – you know who you are – who always show me that distance is nothing when it comes to our friendship. Finally, I would like to dedicate this Thesis to my late grandmother Antònia, who passed away shortly before it was completed. Thank you for everything you have done for me, pradina.

Contents

1 Introduction 19 1.1 Motivation ...... 19 1.2 General Objectives ...... 22 1.3 Literature Review ...... 23 1.4 Specific Objectives ...... 26 1.5 Thesis Overview ...... 28

2 Dynamic Power Allocation in Multibeam Satellites 31 2.1 Introduction ...... 31 2.2 Dynamic Resource Management ...... 32 2.3 Multibeam Satellite Communications Systems ...... 33 2.3.1 Overview ...... 33 2.3.2 Dynamic Resource Management in Multibeam Satellites . . . 35 2.3.3 Artificial Intelligence for the DRM problem in satellite communications ...... 38 2.4 Dynamic Power Allocation Problem ...... 40 2.4.1 Power Allocation and Transmission ...... 41 2.4.2 Problem Statement ...... 43 2.4.3 Objective Metrics ...... 45

3 Algorithm Implementations 47 3.1 Introduction ...... 47 3.2 Metaheuristic Algorithms ...... 47

3.2.1 Genetic Algorithm ...... 48 3.2.2 Simulated Annealing ...... 49 3.2.3 Particle Swarm Optimization ...... 52 3.3 Deep Reinforcement Learning ...... 53 3.4 Hybrid Algorithms ...... 58 3.4.1 SA-GA Hybrid ...... 58 3.4.2 PSO-GA Hybrid ...... 59 3.4.3 DRL-GA Hybrid ...... 60

4 Simulation Models 61 4.1 Introduction ...... 61 4.2 Satellite Model ...... 61 4.3 Demand Models ...... 62 4.4 Link Budget Model ...... 64

5 Results 67 5.1 Introduction ...... 67 5.2 Convergence Analyses ...... 68 5.3 Continuous Operation Performance ...... 71 5.4 Scalability Analyses ...... 77 5.5 Robustness Analyses ...... 81 5.5.1 Sequential Activation ...... 82 5.5.2 Spurious Events ...... 85 5.5.3 Non-stationarity ...... 88 5.5.4 Conclusions on robustness ...... 92

6 Conclusions 95 6.1 Thesis Summary ...... 95 6.2 Main Findings ...... 97 6.3 Future Work ...... 99

A Additional Figures 101 A.1 Convergence Analyses ...... 101 A.2 Continuous Operation ...... 102 A.3 Scalability Analyses ...... 104 A.4 Robustness Analyses ...... 109

B Metric details 113 B.1 Satisfaction-Gap Measure ...... 113 B.2 Power calculation heuristic ...... 114

List of Figures

1-1 Data rate provided by a slightly flexible system (blue) and a highly flexible system (green) with respect to the requested data rate (red). The amount of resource savings corresponds to the area between the green and the blue curves...... 20

2-1 Multibeam satellite with 7 beams...... 34

3-1 DRL Architecture...... 55

4-1 Normalized aggregated demand plot for the four scenarios considered...... 64

5-1 Average aggregated power and 95% confidence interval against computing time available. Power is normalized with respect to the optimal aggregated power. Reference scenario used. . 70

5-2 Average aggregated UD and 95% confidence interval against computing time available. UD is normalized with respect to the aggregated demand. Reference scenario used...... 70

5-3 Aggregated power delivered by every algorithm during the continuous execution simulations. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used...... 72

5-4 Aggregated UD achieved by every algorithm during the continuous execution simulations. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used...... 73

5-5 Average power and UD performance per algorithm on the Reference dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses...... 74

5-6 Average aggregated power against number of beams. Power is normalized with respect to the optimal aggregated power. Reference scenario used...... 79

5-7 Average aggregated UD against number of beams. UD is normalized with respect to the aggregated demand. Reference scenario used...... 79

5-8 Aggregated power delivered in a continuous execution using the Sequential activation dataset. Power is normalized with respect to the optimal aggregated power...... 83

5-9 Aggregated UD achieved in a continuous execution using the Sequential activation dataset. UD is normalized with respect to the aggregated demand...... 84

5-10 Average power and UD performance per algorithm on the Sequential activation dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses. .... 85

5-11 Aggregated power delivered in a continuous execution using the Spurious dataset. Power is normalized with respect to the optimal aggregated power...... 86

5-12 Aggregated UD achieved in a continuous execution using the Spurious dataset. UD is normalized with respect to the aggregated demand...... 87

5-13 Average power and UD performance per algorithm on the Spurious dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses...... 88

5-14 Aggregated power delivered in a continuous execution using the Non-stationary dataset. Power is normalized with respect to the optimal aggregated power...... 89

5-15 Aggregated UD achieved in a continuous execution using the Non-stationary dataset. UD is normalized with respect to the aggregated demand...... 90

5-16 Average power and UD performance per algorithm on the Non-stationary dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses...... 91

A-1 Average aggregated SGM and 95% confidence interval against computing time available for the SA algorithm. Reference scenario used...... 101

A-2 Aggregated power delivered by every algorithm during the continuous execution simulations. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used...... 102

A-3 Aggregated UD achieved by every algorithm during the continuous execution simulations. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used...... 103

A-4 Average power and UD performance per algorithm on the Reference dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses...... 103

A-5 Aggregated power delivered by each algorithm for the scalability test using 200 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used...... 104

A-6 Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 200 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used...... 105

A-7 Aggregated power delivered by each metaheuristic algorithm for the scalability test using 400 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used...... 105

A-8 Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 400 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used...... 106

A-9 Aggregated power delivered by each metaheuristic algorithm for the scalability test using 1000 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used...... 106

A-10 Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 1000 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used...... 107

A-11 Aggregated power delivered by each metaheuristic algorithm for the scalability test using 2000 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used...... 107

A-12 Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 2000 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used...... 108

A-13 Aggregated power delivered in a continuous execution using the Sequential activation dataset. Power is normalized with respect to the optimal aggregated power...... 109

A-14 Aggregated UD achieved in a continuous execution using the Sequential activation dataset. UD is normalized with respect to the aggregated demand...... 110

A-15 Aggregated power delivered in a continuous execution using the Spurious dataset. Power is normalized with respect to the optimal aggregated power...... 110

A-16 Aggregated UD achieved in a continuous execution using the Spurious dataset. UD is normalized with respect to the aggregated demand...... 111

A-17 Aggregated power delivered in a continuous execution using the Non-stationary dataset. Power is normalized with respect to the optimal aggregated power...... 111

A-18 Aggregated UD achieved in a continuous execution using the Non-stationary dataset. UD is normalized with respect to the aggregated demand...... 112

List of Tables

2.1 List of algorithms and their respective optimization metrics. 46

3.1 GA Parameters...... 50 3.2 SA Parameters...... 51 3.3 PSO Parameters...... 53 3.4 DRL Parameters...... 57

4.1 Link Budget Parameters...... 66

5.1 Aggregated Power and UD results for each algorithm. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively...... 74

5.2 Aggregated Power and UD results for each algorithm, using the Sequential Activation dataset. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively...... 83

5.3 Aggregated Power and UD results for each algorithm, using the Spurious dataset. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively...... 87

5.4 Aggregated Power and UD results for each algorithm, using the Non-stationary dataset. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively...... 90

Chapter 1

Introduction

1.1 Motivation

In the coming years, the competitiveness in the satellite communications market will be largely driven by the operators' ability to automate part of their systems' key processes, such as capacity management or telemetry analysis. Companies will rely on autonomous engines to make decisions over their operation policies in order to adapt to faster and larger changes in their customer pools, and to better manage their systems' efficiency [10]. Two trends set up this new scenario: a shift from static communications payloads to highly-flexible payloads [2] and an increasing demand for data [48]. The former responds to the recent improvements in multibeam satellite technology, where the number of beams and individually-configurable parameters in orbit is growing exponentially – the power, frequency, routing, pointing direction, and sizing of each beam will be individually tunable in real time. The latter trend includes the growing necessity for data transmission through satellite links, since services such as guaranteeing connectivity in isolated regions or providing streaming capabilities in planes and ships are becoming more frequent – aeronautical connectivity grew by $400M and passenger aircraft retail revenues reached $1B in 2018; in-flight connectivity is expected to reach $36B in cumulative revenue by 2028 [47].

In this new scenario, being able to exploit these new flexibilities is important in order to operate on tighter margins, as seen in Figure 1-1. When shifting from a nonflexible or slightly flexible system (blue in the figure) to a highly flexible resource allocation strategy (green in the figure), the same amount of data rate demand (red in the figure) can be served using fewer resources. These resource savings constitute an additional capacity that can be used to accommodate new users into the system. This shift will be key in order to be competitive in the new markets.

Figure 1-1: Data rate provided by a slightly flexible system (blue) and a highly flexible system (green) with respect to the requested data rate (red). The amount of resource savings corresponds to the area between the green and the blue curves.
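The savings described as the area between the two supply curves can be made concrete numerically. The sketch below uses made-up sinusoidal demand and capacity profiles (not the thesis datasets) and integrates the gap with the trapezoidal rule:

```python
import numpy as np

# Hypothetical 24-hour profiles in arbitrary capacity units (illustrative only).
t = np.arange(25.0)                                  # hours 0..24
demand = 5 + 3 * np.sin(np.pi * t / 12) ** 2         # requested data rate (red curve)
rigid = np.full_like(demand, 1.05 * demand.max())    # slightly flexible: sized for the peak
flexible = 1.05 * demand                             # highly flexible: tracks demand + 5% margin

# Resource savings = area between the two supply curves (trapezoidal rule).
gap = rigid - flexible
savings = float(np.sum((gap[:-1] + gap[1:]) / 2 * np.diff(t)))
print(f"capacity freed over 24 h: {savings:.1f} unit-hours")
```

The freed capacity is exactly the shaded area of Figure 1-1 for these toy curves; with real link-budget-driven profiles only the integrand changes.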

On top of the flexibility and demand increase trends, an increase in hardware scalability also contributes to the complexity of future satellite systems. Multibeam technology is expected to improve, and constellations are assumed to be able to sustain hundreds to thousands of active spot beams simultaneously. Examples include SpaceX's 4,425-satellite LEO constellation with up to 32 beams per satellite and SES's O3b mPower MEO constellation, consisting of 7 satellites able to power thousands of beams each [32]. In both cases, Terabit-level capacity is expected to be offered [14][59]. This scenario entails that, when it comes to efficiently defining operation policies that exploit the flexibility of communications satellites, a larger number of decisions will be necessary – new flexible parameters and more beams – and these decisions will be reevaluated often – dynamically – due to increasing fluctuations in demand. This is known as the Dynamic Resource Management (DRM) problem.

The DRM problem is a common trend across many different industries. Multiple studies emphasize the importance of finding efficient resource allocation policies in different domains, such as supply chain [72], vehicle-to-vehicle communications [68], car parking management [71], cloud computing resource assignation [67], or air transportation optimization [5]. The system scalability and the introduction of new hardware technologies that allow a better resource control have motivated the study and design of resource optimization algorithms that constitute the core of new decision-making pipelines in these industries. The performance of these algorithms is therefore evaluated on two different levels: optimality and runtime. The former relates to the ability of allocating resources as efficiently as possible given a certain context. The latter directly affects the ability to repeat this resource allocation process at a higher frequency, and therefore adapt quicker to changes in the environment. If a company or institution uses an algorithm that decides a poor allocation strategy given a maximum computing time, this might have a substantial impact on the competitiveness of such company with respect to other companies using faster and/or better algorithms.
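The optimality-versus-runtime trade-off can be framed as anytime optimization: the solver returns the best allocation found before a wall-clock budget expires. A minimal sketch follows, where the random-search inner loop and the quadratic objective are placeholders, not any of the algorithms compared in this Thesis:

```python
import random
import time

def anytime_optimize(objective, sample, budget_s):
    """Keep the best candidate found within a wall-clock budget (random-search placeholder)."""
    deadline = time.monotonic() + budget_s
    best_x, best_f = None, float("inf")
    while time.monotonic() < deadline or best_x is None:   # always try at least once
        x = sample()
        f = objective(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Toy objective: squared distance to a hypothetical ideal allocation.
target = [0.2, 0.5, 0.3]
objective = lambda p: sum((a - b) ** 2 for a, b in zip(p, target))
best_x, best_f = anytime_optimize(objective, lambda: [random.random() for _ in range(3)], 0.05)
print(f"best objective found within the 50 ms budget: {best_f:.4f}")
```

Evaluating algorithms under a fixed `budget_s` rather than to convergence is precisely what separates online from offline comparisons.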

In the specific case of the DRM problem in multibeam satellites, while traditional approaches were based on static, human-controlled policies that relied on conservative operational margins, new systems will make use of dynamic algorithms capable of handling the increasing dimensionality and the rapidly-changing nature of the user demand [31]. Consequently, these systems must be prepared to make quick decisions on multiple parameters per beam across thousands of active spot beams. These decisions will be taken simultaneously and repeatedly, responding to every change in the constellation’s or satellite’s environment. This is especially relevant in the case of High Throughput Satellites (HTS), which are able to provide around 20 times the throughput of classic FSS satellites [45], and therefore usually serve more customers – a higher variation in demand.

Multiple algorithm families have already been proposed as possible alternatives to current human decision-making processes in the satellite communications industry. Specifically, in recent years, the spotlight has been placed on Artificial Intelligence (AI) algorithms given their success in other areas [1]. These algorithms range from evolutionary approaches to learning-based methods. While one can find DRM studies in the communications domain that emphasize the potential of each algorithm, there has been little effort to characterize and compare these methods under the same operationally-realistic premises. It is important to thoroughly characterize the different algorithm alternatives in order to pick the one that offers a potential competitive advantage. In order to study their suitability for the upcoming resource allocation challenges, this Thesis aims to close this research gap by comparing the most recent and popular AI algorithms in solving an instance of the DRM problem in the context of multibeam communications satellites.

1.2 General Objectives

The most important objective of this Thesis is to offer the reader insight on how different AI algorithms perform on a specific but generalizable DRM problem, based on a multibeam communications satellite context, that is high-dimensional and subject to short runtimes. This work aims to provide a complete comparison such that the key conclusions can be extrapolated to other problems in the same or different domains that share similar features such as high dimensionality or near-real time control.

In order to compare a set of methods, it is first necessary to find and rigorously characterize the problem under which they are compared. Consequently, another objective of this Thesis is to identify and formulate a suitable DRM problem to be used as a comparative benchmark and provide a clear problem statement. As part of this process, important test scenarios that will help to quantify performance need to be identified.

Finally, this Thesis also provides details on how the different AI algorithms are implemented and how parameters are selected in order to best adapt to the problem in question. Although parameter tuning is a process intimately linked to each specific problem, this work aims to contrast algorithms with respect to implementations found in the literature and motivate the use of specific parameters or subroutines.

1.3 Literature Review

The problem of allocating resources for multibeam communication satellites, especially in an offline fashion, is a well-studied NP-hard [4] and non-convex [8] problem. Mathematical Programming (MP), which includes well-known subdomains such as Integer Optimization or Convex Optimization, has been a popular method for years to deal with highly-constrained instances of these resource allocation problems. Examples include the use of MP-based algorithms for beam layout optimization [6], for power allocation [66], for spectrum allocation [35], and even joint power and bandwidth allocation [39]. To deal with the complexities of the problem, these studies need to rely on relaxations such as piecewise linearization or coordinate descent optimization that guarantee convexity or solve the problem in an often suboptimal iterative fashion.
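As a generic illustration of the piecewise-linearization relaxation mentioned above (a sketch, not the formulation of any of the cited works), a concave rate curve such as log2(1 + p) can be replaced by the minimum of a few tangent lines, which a linear solver can then handle:

```python
import math

def tangent_cuts(f, df, points):
    """Tangent lines (slope, intercept) at the given points; for a concave f they bound it from above."""
    return [(df(p), f(p) - df(p) * p) for p in points]

# Concave spectral-efficiency-style curve: rate = log2(1 + power).
f = lambda p: math.log2(1 + p)
df = lambda p: 1.0 / ((1 + p) * math.log(2))

cuts = tangent_cuts(f, df, [0.5, 2.0, 8.0])
linearized = lambda p: min(m * p + b for m, b in cuts)   # piecewise-linear surrogate

for p in (1.0, 4.0):
    print(f"p={p}: exact={f(p):.3f}  linearized={linearized(p):.3f}")
```

More tangent points tighten the surrogate at the cost of more linear constraints, which is exactly the accuracy-versus-size trade-off such relaxations face as the number of beams grows.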

The increasing dimensionality present in the industry brings up new problems with more optimization variables. Consequently, in addition to the nature of the problem, scalability adds a new layer of complexity. The availability of computing resources is already critical to MP algorithms, but if the dimensionality of the problems increases, having MP algorithms at the core of these online and fast optimization tools is inefficient and impractical. As a reference, in [6] it is shown that, with a timeout of 600 seconds, an MP-based algorithm is able to find optimal beam layouts for 200 user terminals, but starts failing to do so when the number of user terminals reaches 370. The best layout shows a gap of 150% with respect to the best theoretical outcome for 800 user terminals. It is therefore difficult to envision this kind of algorithm working in near-real time.

AI is a potential solution to overcome issues like exponential computing cost or complex non-linear solution spaces. Metaheuristic algorithms [64] are a popular group of AI algorithms that have been thoroughly studied in the context of DRM subproblems for satellite communications. In [4] a metaheuristic algorithm known as the Genetic Algorithm (GA) is used to dynamically allocate power in a 37-beam satellite and is compared to other metaheuristic approaches. A similar GA formulation is extended to include bandwidth optimization in [50], which demonstrates the advantages of dynamic allocation based on two 37-beam and 61-beam (Viasat-1) scenarios. Joint power and carrier allocation is also proposed in [42], where a two-step heuristic optimization is carried out iteratively for an 84-beam use case. In time-dependent problems, beam task scheduling has been addressed using the GA, with positive results for a 20-beam system [33]; and on beam hopping problems, optimizing the illumination schedule for different time slots in a 100-beam satellite [3].
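A minimal GA for power allocation might look as follows. This is a generic sketch with an illustrative fitness function, random demand, and arbitrary hyperparameters, not the formulation of the cited works or of this Thesis:

```python
import random

N_BEAMS, POP, GENS = 8, 30, 40
demand = [random.uniform(0.5, 1.5) for _ in range(N_BEAMS)]   # per-beam demand (toy values)

def fitness(p):
    # Illustrative objective: heavily penalize unmet demand, lightly penalize power use.
    unmet = sum(max(d - x, 0.0) for d, x in zip(demand, p))
    return 10.0 * unmet + sum(p)

def crossover(a, b):
    cut = random.randrange(1, N_BEAMS)        # single-point crossover
    return a[:cut] + b[cut:]

def mutate(p, rate=0.2, sigma=0.1):
    return [min(2.0, max(0.0, x + random.gauss(0.0, sigma))) if random.random() < rate else x
            for x in p]

population = [[random.uniform(0.0, 2.0) for _ in range(N_BEAMS)] for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=fitness)
    elite = population[:POP // 2]             # truncation selection keeps the best half
    population = elite + [mutate(crossover(*random.sample(elite, 2))) for _ in elite]

best = min(population, key=fitness)
print(f"best fitness after {GENS} generations: {fitness(best):.3f}")
```

The chromosome here is simply the vector of per-beam powers; richer formulations add bandwidth or carrier genes at the cost of a larger search space.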

Other authors have opted for approaches based on the Simulated Annealing (SA) algorithm. In [8] a single-objective and discrete power and bandwidth allocation problem was formulated using SA and applied to a 200-beam case, with a focus on fairness. Fairness is a relevant topic that has also been considered in [66], where an iterative dual algorithm is proposed to optimally and fairly allocate power in a 4-beam setting. SA is very present in hybrid approaches, such as in [65], where it is combined with a GA to optimize channel assignment in a 64-cell use case, or in [56], where it is used in combination with a simple neural network. A prior hybrid optimization stage combining the GA and SA is also considered in [4]. SA is also proposed as one of the components of a constrained beam layout greedy optimization approach [7], being responsible for the reflector allocation decision-making process and validated with a 150-beam use case.
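The core SA mechanic these works share, accepting occasional uphill moves with a temperature-controlled probability, can be sketched on a toy power-allocation cost (illustrative demand values and cooling schedule, not any cited formulation):

```python
import math
import random

def simulated_annealing(cost, x0, neighbor, t0=1.0, cooling=0.95, iters=500):
    """Accept worse moves with probability exp(-delta/T); T decays geometrically."""
    x, c = x0, cost(x0)
    best_x, best_c = x, c
    t = t0
    for _ in range(iters):
        y = neighbor(x)
        cy = cost(y)
        if cy < c or random.random() < math.exp(-(cy - c) / t):
            x, c = y, cy
            if c < best_c:
                best_x, best_c = x, c
        t *= cooling
    return best_x, best_c

# Toy cost: squared gap between allocated power and a fixed per-beam demand.
demand = [1.0, 0.4, 0.8]
cost = lambda p: sum((d - x) ** 2 for d, x in zip(demand, p))
step = lambda p: [max(0.0, x + random.uniform(-0.1, 0.1)) for x in p]

sol, sol_cost = simulated_annealing(cost, [0.5, 0.5, 0.5], step)
print(f"final cost: {sol_cost:.4f}")
```

Tracking `best_x` separately from the current state means the returned solution never regresses, even though the walk itself may.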

Besides population-based and annealing approaches, swarm-based methods have also been applied to the DRM problem in the context of satellite communications. In [18] the Particle Swarm Optimization (PSO) algorithm is applied to a 16-beam satellite in order to optimally allocate power. Then, in [49] a PSO-based power allocation algorithm is implemented and tested on 200-beam use cases. This last work also studies the improvements of adding a subsequent GA stage and forming a PSO-GA hybrid algorithm.
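The velocity-update rule at the heart of PSO, blending inertia with pulls toward each particle's personal best and the swarm's global best, can be sketched as follows (generic hyperparameters and a toy quadratic cost, not the cited formulations):

```python
import random

def pso(cost, dim, n=20, iters=100, w=0.7, c1=1.5, c2=1.5, lo=0.0, hi=2.0):
    """Minimal PSO: inertia w, cognitive pull c1 (personal best), social pull c2 (global best)."""
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    V = [[0.0] * dim for _ in range(n)]
    P = [x[:] for x in X]                        # personal bests
    pc = [cost(x) for x in X]
    g = min(range(n), key=lambda i: pc[i])
    G, gc = P[g][:], pc[g]                       # global best
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                V[i][d] = (w * V[i][d]
                           + c1 * random.random() * (P[i][d] - X[i][d])
                           + c2 * random.random() * (G[d] - X[i][d]))
                X[i][d] = min(hi, max(lo, X[i][d] + V[i][d]))
            ci = cost(X[i])
            if ci < pc[i]:
                P[i], pc[i] = X[i][:], ci
                if ci < gc:
                    G, gc = X[i][:], ci
    return G, gc

demand = [1.2, 0.3, 0.9, 0.6]
G, gc = pso(lambda p: sum((d - x) ** 2 for d, x in zip(demand, p)), dim=4)
print(f"best cost: {gc:.4f}")
```

A subsequent GA refinement stage, as in the PSO-GA hybrid of [49], would take `G` (and optionally the final swarm) as its initial population.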

While the performance of the presented methods is compared to equal-allocation, heuristic-based, or stochastic approaches, authors only consider the offline performance when presenting their results and do not account for runtime thresholds that specific scenarios might impose. In other words, none of these studies show the results of continuously applying the respective methods in a dynamic and time-changing environment where there is limited computing time and resources available, nor do they provide the performance metrics as aggregations over complete operation cycles. Therefore, the adequacy of these algorithms for online scenarios has yet to be fully tested and contrasted.

As a potential alternative for repeated uses of the optimization tool, some studies have recently focused on Machine Learning algorithms, specifically Deep Reinforcement Learning (DRL) architectures, which address the need for fast online performance. In [22], the authors use a DRL architecture to control a communication channel in real time considering multiple objectives. Alternatively, DRL has also been applied to multibeam scenarios, as in [37], where a Deep Q-network was used to carry out channel allocation in a 37-beam scenario. Then, in [25] a continuous DRL architecture is used to allocate power in a 30-beam HTS, showing a 1,300-times speed increase with respect to a comparable GA approach. Finally, in [38] DRL is used to carry out beam-hopping tasks in a 10-beam and 37-cell scenario, showing a stable performance throughout a 24-hour test case.
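The reason DRL is fast online is that, once trained, allocation reduces to a policy evaluation. The toy below illustrates the idea with a one-step REINFORCE update on a single-beam, discrete-power problem; it is a deliberately tiny linear-policy sketch with an invented reward, nothing like the deep architectures or link-budget models of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
levels = np.linspace(0.0, 2.0, 9)            # discrete power levels for a single beam
theta = np.zeros((2, len(levels)))           # linear policy weights over features [1, demand]

def policy(d):
    """Softmax policy over power levels, conditioned on the observed demand."""
    logits = np.array([1.0, d]) @ theta
    z = np.exp(logits - logits.max())
    return z / z.sum()

alpha = 0.05
for _ in range(3000):                        # one-step training episodes (REINFORCE)
    d = rng.uniform(0.2, 1.8)                # observed demand (the "state")
    probs = policy(d)
    a = rng.choice(len(levels), p=probs)
    reward = -abs(d - levels[a]) - 0.1 * levels[a]   # penalize mismatch and power use
    grad_log = -probs                        # gradient of log pi(a) w.r.t. the logits
    grad_log[a] += 1.0
    theta += alpha * reward * np.outer([1.0, d], grad_log)

print("power chosen for demand 1.0:", levels[int(np.argmax(policy(1.0)))])
```

After training, an allocation for any incoming demand is a single forward pass through `policy`, which is what yields the orders-of-magnitude runtime advantage over iterative metaheuristics.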

Prior to the increase of research interest in the DRL field, authors had relied on neural networks as a way of creating hybrid algorithm architectures in combination with heuristic or metaheuristic algorithms. Two examples of this kind of combination include the use of multiple neural networks in combination with convergence heuristics to carry out spectrum allocation in multisatellite systems [24], and the adoption of a neural network plus a GA to improve the same task [55]. Compared to these basic structures, DRL achieves an additional level of abstraction, as it is able to improve the resource allocation performance with respect to its counterparts. It is thus no surprise that it has become an active field of research in the communications community [43].

The cited works show that DRL architectures can reach good solutions and are capable of exploiting the time and spatial dependencies of the DRM problem. However, most of the test cases focus on optimality and leave other important features, such as robustness, out of the studies. A lack of robustness in a DRL architecture might lead to non-desirable allocations in cases in which the input does not match the average behaviour of the environment or the system has not been well-trained. The fact that these algorithms must go through a prior training stage raises the question of how a DRL training process should be designed in order to result in robust DRL architectures ready to be operable.

1.4 Specific Objectives

As seen in the previous section, there are multiple papers that address one or more dimensions of the DRM problem for multibeam satellite communications. In all cases, the studies focus on specific resources to optimize and allocate such as power, bandwidth, or beam placement. Designing a method or algorithm able to handle multiple of these optimization variables simultaneously and repeatedly is still a work in progress in the community. The results obtained so far show that, apart from the lack of comparative analyses in most of the studies presented, two important modeling decisions that reflect the future of satellite communications are not adequately addressed: first, while the majority of papers highlight the adaptive nature of the algorithms, there are no results on their performance on a continuous execution in which a repeated use of the algorithms is necessary and computing time is a limiting factor. Second, a few of the studies explore the scalability of their approach. Those that do, fail to do so for use cases with more than 200 beams, and none explore systems in the range of thousands of beams. While the expected dimensionality of the problem lies in the range of hundreds to thousands of beams, the performance of the algorithms has mostly been assessed, on average, for scenarios with less than 100 beams. This Thesis considers the four AI algorithms introduced in the literature review – GA, SA, PSO, and DRL – as a focus of the comparison. In addition, since hybrid algorithms have proved their usefulness for instances of resource allocation problems, this work also considers the two hybrids introduced in the Literature Review – SA- GA and PSO-GA – alongside a novel implementation of a DRL-GA hybrid. This

Thesis presents an implementation for each of these algorithms and analyzes their performance on the same use case and scenarios under a realistic operational context. The power allocation problem is used as the use case of the comparison, given that it is one of the most studied resource allocation problems for satellite communications. To fully make it an instance of the DRM problem, this Thesis specifically considers the Dynamic Power Allocation (DPA) problem, with a control frequency of 3 minutes. In the context of this problem, the main specific objectives addressed by this Thesis are:

1. To provide a formulation of several of the most well-studied AI algorithms in the literature for solving the DPA problem.

2. To implement a novel DRL-GA hybrid approach as a baseline for the DPA problem.

3. To compare and contrast the performance of the considered algorithms under the same satellite models and on a set of different test scenarios to account for robustness analyses.

4. To characterize the online performance of the algorithms given computing time restrictions.

5. To provide scalability and robustness performance results.

This Thesis focuses on the DPA problem for simplicity and to reduce the amount of uncertainty around the results. Given the current state of research in the field, the complexity of the full-scale DRM problem would add a layer of noise when determining which algorithms work best, since finding a good representation for the complete problem is still a work in progress in the research community. Using the DPA problem as a baseline, the intended objective is to offer the reader a comprehensive and fair comparison of these methods in order to guide the algorithm downselection process for future, possibly more complex, instances of the DRM problem in the context of communication satellites and other fields. If the reader is interested in finding a robust implementation for another DRM problem, the nature or structure of such a problem should be carefully analyzed first.

Other DRM problems in the satellite communications domain, such as dynamic bandwidth allocation, or in other domains, such as CPU allocation in data centers, have a similar nature to that of the DPA problem and therefore might share several features, such as the constraints-to-variables ratio. In such cases, the structure of the algorithms could be mapped from one problem to the other without large modifications, and the algorithm that showed better performance on the DPA problem would likely perform equally well relative to the other algorithms. This would potentially reduce the cost of implementing, tuning, and characterizing multiple algorithms for the second problem. In cases in which the nature of the second DRM problem is not aligned with the DPA problem, the downselection process is not as immediate. For these situations, this Thesis aims to explain why the different algorithms achieve a certain relative performance and why each algorithm performs better or worse given the context and the input data, both of which depend on the optimization procedure of every algorithm. It is the intention of the author that these insights allow the reader to rank the algorithms in terms of their suitability to the second problem, although modifying some parts of the implementations or testing more than one algorithm might be necessary.

1.5 Thesis Overview

The remainder of this document is structured as follows: Chapter 2 explains the generic DRM problem concept, puts it in the context of satellite communications and describes the problem statement and the formulation used throughout the rest of this work; Chapter 3 presents the implementations for each of the nine algorithms studied in this Thesis alongside the main algorithm parameters to allow for reproducibility; Chapter 4 covers the satellite, data rate demand, and link budget models considered in the simulations and used to compute the performance metrics; Chapter 5 discusses the performance of the algorithms in terms of their convergence behavior, online operation results, scalability, and robustness against multiple scenarios; and finally Chapter 6

summarizes the work and outlines the conclusions and future research directions resulting from this Thesis.

Chapter 2

Dynamic Power Allocation in Multibeam Satellites

2.1 Introduction

This chapter addresses the Dynamic Resource Management (DRM) problem in the context of multibeam communications satellites. Specifically, it outlines and focuses on the Dynamic Power Allocation (DPA) problem as a use case for the purpose of this Thesis.

To that end, Section 2.2 starts with a general overview of DRM problems and highlights their main challenges. Then, Section 2.3 puts the spotlight on satellite communications systems and introduces the peculiarities of the DRM problem in this context. To close the chapter, the minutiae of the DPA problem are presented in Section 2.4. These include a synopsis of the power allocation mechanisms in multibeam communications satellites, a detailed problem formulation with the assumptions considered, and an overview of the different objective functions that will be used throughout the rest of the Thesis.

2.2 Dynamic Resource Management

Resource Management problems consist of the division and assignment of a limited supply, known as the resource, to a set of users or systems that make use of it under certain restrictions. A unit or part of this resource can be exclusive, i.e., only one user can utilize it, or non-exclusive, when a finite or infinite number of users are able to exploit it. Exclusive examples include stock management in storage facilities or the cargo-vehicle assignment operations of a transportation company. On the other hand, spectrum allocation in frequency reuse systems and vehicle routing are examples of non-exclusive cases.

Generally, for every resource allocation decision, one incurs a cost and obtains a benefit from it. Then, the goal of the problem is to jointly maximize the total benefit and minimize the total cost, which can be defined in multiple ways. All decisions are taken according to this goal. For instance, a transportation company might want to maximize the revenues per unit of distance travelled and therefore vehicles and cargo would be paired according to this objective.

In all the examples presented, the users' resource demands might actually vary over time due to temporal effects, such as seasonality or contingency scenarios. Resource management problems are usually time-dependent and, consequently, multiple instances of the problem should be considered to account for these time variations. One alternative to be robust against time is to make the resource allocation decisions at one time instant while simultaneously fulfilling the resource demands at all subsequent moments, assuming the future demand or an estimate of the maximum possible demand is perfectly known.

However, when the future demand is uncertain, there is a need to adapt to the time variations and constantly reconsider the resource allocation decisions made in past instances of the problem. This is known as the Dynamic Resource Management (DRM) problem, where time-dependency plays a relevant role. Usually, following these adaptive strategies also leads to a better fulfillment of the goal compared to worst-case estimations, by increasing the benefits and/or reducing the costs (e.g.

there is no need to rent the same storage space every week for a product that is highly seasonal on a yearly basis).

2.3 Multibeam Satellite Communications Systems

This Thesis focuses on the DRM problem in the context of satellite communications, where the goal is to successfully provide communication service to multiple users or customers on Earth by means of satellite coverage. At any given moment, a user has a certain throughput demand, which then changes over time. To satisfy that varying demand, a satellite has a pool of resources such as power and bandwidth. This section addresses that scenario and presents, first, the main features of communication satellites; and second, the specific nuances of the DRM problem in this background.

2.3.1 Overview

A communications satellite is a complex system that is primarily designed to provide communications service to user terminals through space-borne links. Services such as monitoring cargo ships in the middle of the ocean, providing streaming capabilities in airplanes, or internet connectivity in remote areas, rely on satellite communications. Generally, the users serviced by these systems have limited or no access at all to other communications infrastructure, such as terrestrial links. A satellite system is composed of three different segments: the space segment, the ground segment, and the control segment. The first segment comprises all spacecraft and inter-satellite links involved in the communication process. These spacecraft are located in one of a wide range of available LEO, MEO, and GEO orbits. The ground segment includes all user terminals and ground stations that participate in the information transmission process. Finally, the control segment consists of all stations that manage and monitor the satellites. The space and ground segments are connected through two types of links: the uplinks, which support the communication from Earth stations to spacecraft; and the downlinks, which do the same from spacecraft to Earth stations. A satellite can establish multiple uplinks and

downlinks simultaneously, each associated with a radio frequency-modulated carrier. A carrier is the signal transmitted to (downlink) or from (uplink) a specific user terminal. A satellite is composed of the payload and the platform. The former comprises all the hardware and antennas that sustain the carrier transmission, while the latter consists of all the subsystems that allow the payload to operate (e.g. electric power supply, Telemetry, Tracking and Control (TT&C)). The payload incorporates one or multiple antennas, which support the creation of one or multiple beams, each with the capacity to back multiple carriers. A beam represents a coverage area on Earth's surface called the footprint. A multibeam satellite comprises several – tens to thousands – of these beams to provide coverage to multiple different regions, as seen in Figure 2-1. The narrower a beam is, the better it can serve the terminals above its footprint, since it can better concentrate the transmitting power on a specific area. The communication effectiveness of a terminal also depends on the relative position of the user inside the beam's footprint, since the closer to the center of the beam the user is, the less power the satellite needs to serve that user.

Figure 2-1: Multibeam satellite with 7 beams.

In order to ensure an efficient, fair, and economical use of the radio-frequency

spectrum, the International Telecommunication Union (ITU) is responsible for creating regulations that states must adopt in their efforts to exploit space communications technology. Among these regulations, the ITU establishes the allowed frequency bands for the uplinks and downlinks of different radiocommunication services.

2.3.2 Dynamic Resource Management in Multibeam Satellites

A satellite user requires a service from the satellite in the form of a specific data rate or throughput. To successfully serve all of its users, i.e. provide at least the required throughput to each user, a multibeam communications satellite has multiple available resources, which are mostly defined by the hardware architecture of its payload and the regulatory restrictions. In addition to the beams, which constitute individual resources per se, a satellite also possesses spectrum and power resources which have to be shared among all active beams. Before a satellite can transmit information to a user, satellite operators need to allocate enough of these power and spectrum resources to the beam that covers the user.

Given the location of multiple users on Earth, a multibeam satellite generates a certain number of spot beams such that the union of all of their footprints covers all the locations of interest. Operators must carefully decide how many beams are required in the process. In the case that all beams are equal in size and shape, operators deal with a minimum coverage problem, which imposes a lower bound on the number of required beams. An appropriate maximum number of beams can also be estimated this way, since placing too many beams can lead to interference problems during operations. If the payload allows flexibility in the size and shape of the beams, these can be exploited to create better coverage configurations with a wider range in the possible number of beams (e.g. one architecture might have a large number of narrow beams while a different one might completely cover users with a few wide beams).

After choosing an appropriate coverage configuration, operators need to address the frequency plan decisions: how much bandwidth and which part of the spectrum each beam should use. A satellite has an available spectrum pool, limited by ITU restrictions, to split among its beams. While ideally one would want to maximize

bandwidth usage, an excessive allocation of bandwidth might lead to interference and information loss problems. Additionally, some systems incorporate frequency reuse mechanisms in their payloads, thus increasing the ability to exploit and make full use of the available spectrum.

Once a specific beam has been placed at a location with one or more users and a sufficient amount of bandwidth has been assigned to it, the payload needs to provide enough power such that the received data rate at the user's terminal is equal to or greater than the one requested. At the end of this process, one can see that four different satellite resources are involved in the user-beam communication process: the beam's position relative to the user, the shape of the beam, the frequency assignment to that beam, and finally the power allocated to initiate the transmission.

Despite the user demand being time-dependent (e.g. users might not require the same amount of data rate during the morning and the evening), in the past most of these decisions were made only once, generally before launching the satellite, since payloads were fixed-by-design or allowed very little flexibility and therefore the resource management problem was only addressed a single time (the payload was then configured to accommodate the decisions made). User demand was known in advance and operators could account for the worst-case demand requirements (peak demand) to efficiently make the resource decisions. However, in recent years, with the introduction of modern digital payloads, satellites have seen a significant increase in their flexibility. It is expected that in the next generation of multibeam satellites, in addition to an increase in the number of active spot beams being sustained simultaneously (from tens to thousands), the beam placement, frequency, and power resources will allow for reallocation in real-time.
In addition, the satellite communications market is seeing an increase in the number of users, especially those in the data segment, which is expected to double by 2025 [48]. Thanks to the new payload flexibility, operators no longer need to account for the worst-case scenario, which limits their Service Level Agreements (SLA) when designing a payload and its resource management policies. Instead, they are able to reconfigure the payload in orbit and adapt to the changing user pool and user needs, resulting in a

more efficient use of their systems. However, this adaptation comes at the expense of a continuous recomputation of the payload's resource allocation decisions. Operators now face the problem of making and reconsidering their allocation decisions at a chosen time frequency in order to maximize the efficiency of their systems. This is known as the DRM problem for multibeam satellite systems [31]. For instance, one might want to reduce the transmitted power for beams covering users in nighttime longitudes, since the throughput demand is reduced. By minimizing the use of resources to satisfy the demand needs, operators can successfully accommodate new demand requests in their user pools, and thus the increase in the number of users can be well-handled without the need to launch extra satellites.

In the past, making resource allocation decisions was straightforward, since the number of variables was small due to the limited flexibility and the number of active spot beams. With the addition of a large set of tunable variables per beam, the optimization problem involved no longer has an immediate solution. The complexity of the problem has three dimensions:

∙ NP-hardness: Since it is not guaranteed that the DRM problem can be solved in polynomial time [4], exhaustive search methods that explore the whole solution space require massive amounts of computing resources, especially when dimensionality is high.

∙ Non-convexity: Given the relationship between the optimization variables, the problem is not convex [8] and therefore classic convex optimization methods and solvers cannot be used in this context without involving relaxations that lead to suboptimalities. Nevertheless, some specific subproblems of the DRM problem in the context of satellite communications do have convex formulations, although this Thesis focuses on the need to address the non-convex nature of the whole problem.

∙ Multiobjectiveness: The problem consists not only of serving all users satisfactorily, but also of minimizing the amount of resources used when doing so. At the same time, this minimization task can be framed in many ways: minimizing the number of beams used, minimizing spectrum needs, or reducing power consumption.

On top of these challenges, high-dimensionality and real-time operation constraints add an extra computational burden. Being able to solve the problem in near real-time ensures the maximization of efficiency, but involves an additional use of computing resources due to the large number of optimization variables, mainly as a consequence of the large number of beams modern payloads are able to sustain.

2.3.3 Artificial Intelligence for the DRM problem in satellite communications

Given the nature of the DRM problem for satellite communications, it is necessary that the resource allocation decisions are transferred to autonomous or semi-autonomous systems that are capable of handling the task in near-real time using powerful DRM algorithms, as opposed to manual or simple rule-based approaches that need to rely on additional resource usage to guarantee user service and/or are hard to deploy in near-real time systems. Still, finding an appropriate algorithm is a challenging task itself, since most of the well-established optimization techniques can be hard to implement due to the reasons exposed in the previous section. As a consequence, a significant amount of research effort has targeted Artificial Intelligence (AI) methods as a potential solution to overcome these issues. AI is a large field that involves ideas from philosophy, mathematics, neuroscience, decision theory, and probability. It focuses on the design and implementation of systems that can "think" and act rationally, without the need for constant human supervision [54]. AI comprises a wide range of disciplines including logic-based decision-making, Probabilistic Reasoning, and Machine Learning (ML). Throughout the years, AI has proved its usefulness in numerous real-world domains as well as in advanced computer simulations. This supports the potential of AI-based algorithms to solve the DRM problem for satellite communications. AI algorithms could provide a solution to the optimization problem which is then translated into a

specific resource allocation setting. As discussed in Section 1.3, most of the recent work on AI-based solutions for the DRM problem focuses on two algorithm categories:

∙ Metaheuristics: Optimization algorithms that consist of the iterative improvement of a non-optimal solution or set of solutions by means of heuristics. Examples of these include Genetic Algorithms, Ant Colony Optimization, or Simulated Annealing. Given an instance of a DRM problem, these algorithms produce an initial "bad" solution or set of solutions and improve them iteratively, obtaining a close-to-optimal solution after a certain number of iterations.

∙ Machine Learning: Inference-based algorithms that, after a training stage involving the use of known or "seen" data, perform a specific task without the need for human supervision. Neural Networks or Support Vector Machines are examples of algorithms that belong in this category. These algorithms would process several instances of resource allocation optimization problems, alongside their solutions, and "learn" an underlying optimization function. Then, when faced with an unseen problem instance, the algorithms would be able to provide a solution following their learned function.
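As a toy illustration of the iterative-improvement loop that the metaheuristic category shares, the sketch below greedily keeps random perturbations that reduce a cost combining unmet demand and total power on a three-beam example. All names and numeric values are illustrative, not the implementations of Chapter 3; SA, for instance, would also occasionally accept worse candidates, and GA/PSO maintain populations rather than a single solution.

```python
import random

def iterative_improvement(objective, initial, perturb, iterations=1000, seed=0):
    """Skeleton of the iterative-improvement loop shared by metaheuristics:
    start from a (usually poor) solution and keep perturbations that improve
    the objective."""
    rng = random.Random(seed)
    best = initial
    best_cost = objective(best)
    for _ in range(iterations):
        candidate = perturb(best, rng)
        cost = objective(candidate)
        if cost < best_cost:  # greedy acceptance of improving moves
            best, best_cost = candidate, cost
    return best, best_cost

# Toy power-allocation instance: meet per-beam demand proxies at minimum power.
demand = [3.0, 1.0, 2.0]

def cost(p):
    # Unserved demand is penalized much more heavily than power consumption.
    unmet = sum(max(d - x, 0.0) for d, x in zip(demand, p))
    return 100.0 * unmet + sum(p)

def tweak(p, rng):
    # Randomly nudge the power of one beam, keeping it non-negative.
    q = list(p)
    i = rng.randrange(len(q))
    q[i] = max(0.0, q[i] + rng.uniform(-0.5, 0.5))
    return q

solution, value = iterative_improvement(cost, [5.0, 5.0, 5.0], tweak)
```

Starting from a deliberately wasteful allocation, the loop drives each beam's power toward its demand, trading the heavy unmet-demand penalty against total power.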

On one hand, metaheuristic algorithms have been widely applied to DRM problems, mostly involving the optimization of power and/or frequency variables. Examples include the use of Genetic Algorithms [4][50], Particle Swarm Optimization [18][49], and Simulated Annealing [8]. These algorithms have been shown to reach close-to-optimal solutions, but the majority of test cases have not accounted for dimensionality – almost all the studies use satellite simulations with fewer than a hundred beams. The need to achieve near real-time performance has not been addressed by these studies either. The contribution of these algorithms in a real satellite operation scenario, or in a DRM problem from another domain, remains unclear. On the other hand, since most of the algorithms involve the use of function approximators known as neural networks, ML algorithms are able to operate in near real-time. However, there are few dimensionality studies that would open the door to their

applicability in hundred- or thousand-beam cases. In addition, these algorithms are greatly affected by the characteristics of the training data, and therefore run the risk of performing poorly when the input data diverges from the nature of the training dataset. This uncertainty in their robustness capabilities needs to be further addressed in order to characterize and approve them for a real DRM problem in the satellite communications context.

Although AI algorithms are a promising solution to be part of DRM autonomous engines that operate in near real-time, there is still little understanding of the implications each of the methods studied in the literature would have in a real deployment. Furthermore, the results evidence that operators might need to consider different algorithms and make decisions over which algorithm or group of algorithms to use in each context. Without a comprehensive and in-depth comparison of the capabilities of each of these methods, the selection and implementation of an algorithm involves an excessive amount of trial and error, slowing down progress in the field. This problem is shared across multiple domains in which DRM is a central issue.

While the complete DRM problem for multibeam satellite communications has not been solved – there is currently no algorithm implementation that can handle all resource flexibilities in near real-time – this Thesis tries to offer a complete comparison procedure that helps downselect to the set of most promising algorithms. To that end, the Dynamic Power Allocation (DPA) problem is chosen as a supporting example throughout the rest of this Thesis. Since the optimality criterion is clear for this specific problem, i.e. there is a good understanding of the tradeoffs between the different optimization variables, the analyses can focus on the specific benchmarking tasks across the different algorithms considered. In the following section, the specifics of the DPA problem formulation are presented.

2.4 Dynamic Power Allocation Problem

The power allocation problem in multibeam communications satellites consists of, given a data rate demand for every beam and a fixed setting for the rest of the resources,

choosing how much power should be allocated to each beam. The goal of the operator is to carry out this task using a policy that allows serving every user while minimizing the use of the power of the system. This way, the power efficiency is maximized and the operator can serve a larger number of new users when following this policy. The particularity of the Dynamic Power Allocation problem is the need to carry out the power allocation task in a continuous manner, constantly updating the time-dependent demand requirements of the users and accordingly adapting the power allocation mechanisms of the system. This section first describes the equations that govern the power transmission subsystems and then focuses on the specific problem statement that will be followed for the rest of this Thesis.

2.4.1 Power Allocation and Transmission

Given an active spot beam with a fixed position and shape, and with a predefined frequency band to transmit in, the allocation of a certain power level to that beam entails that a certain data rate will be provided to a user located on its footprint. There are multiple elements involved in and affecting this process, all of them defined in what is known as the link budget. This term comprises the set of equations that govern a communication link between a transmitter and a receiver. For the specific case of a downlink, the beam's antenna is the transmitter and the user terminal the receiver. The link budget equations allow one to determine the data rate at the receiver R_b given a certain amount of power P_TX allocated at the transmitter. While this section covers the basics, an in-depth explanation of the link budget elements can be found in [44].

The link's carrier-to-noise ratio, C/N_0, expresses the ratio of the carrier power at the receiver over the noise power spectral density at the receiver and is defined as

\[ \frac{C}{N_0} = P_{TX} - \mathrm{OBO} + G_{TX} + G_{RX} - L - 10 \log_{10}(k \cdot T_{sys}) \quad [\mathrm{dB}] \tag{2.1} \]

where OBO is the power-amplifier output back-off, G_TX and G_RX are the transmitting and receiving antenna gains, respectively, k is the Boltzmann constant, and T_sys is the system temperature. L represents the sum of all the losses involved in the communication process:

\[ L = \mathrm{FSPL} + L_{atm} + L_{RF_{TX}} + L_{RF_{RX}} \quad [\mathrm{dB}] \tag{2.2} \]

where FSPL indicates the free-space path loss, L_atm the atmospheric losses, and L_RF_TX and L_RF_RX the transmitting and receiving radiofrequency chain losses, respectively.

The system temperature T_sys (K) from Eq. (2.1) is computed using the Friis formula

\[ T_{sys} = T_{ant} \cdot 10^{-L_{RF_{RX}}/10} + T_{atm} \cdot 10^{-(L_{RF_{RX}} + L_{atm})/10} + T_w \cdot \left(1 - 10^{-L_{RF_{RX}}/10}\right) \tag{2.3} \]

where T_ant is the receiving antenna temperature, T_atm is the atmospheric temperature, and T_w is the waveguide temperature.

Next, considering interference sources that add to the transmitted signal, the

carrier-to-noise-plus-interference ratio C/(N_0 + I) is computed as

\[ \frac{C}{N_0 + I} = \left( \frac{1}{\mathrm{CABI}} + \frac{1}{\mathrm{CASI}} + \frac{1}{\mathrm{CXPI}} + \frac{1}{\mathrm{C3IM}} + \frac{1}{C/N_0} \right)^{-1} \tag{2.4} \]

where CABI is the Carrier to Adjacent Beam Interference, CASI is the Carrier to Adjacent Satellites Interference, CXPI is the Carrier to Cross-Polarization Interference, and C3IM is the Carrier to third-order Inter-Modulation products interference. Since the use case and scenarios considered in the following chapters assume interference minimization (I ≈ 0), no further details on interference sources are provided in this section.

With the carrier-to-noise-plus-interference ratio, the bit-energy-to-noise-plus-interference ratio, E_b/(N + I), is defined as

\[ \frac{E_b}{N + I} = \frac{C}{N_0 + I} \cdot \frac{BW}{R_b} \tag{2.5} \]

where R_b is the resulting data rate at the receiver and BW is the bandwidth allocated

to the link. At the same time, the data rate can be computed as

\[ R_b = \frac{BW}{1 + \alpha_r} \cdot \Gamma\!\left(\frac{E_b}{N + I}\right) \tag{2.6} \]

where α_r is the roll-off factor and Γ is a parametric function that represents the spectral efficiency of the modulation and coding scheme (MODCOD) in bps/Hz, given the value of E_b/(N + I) itself.
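Under the interference-minimization assumption used later in this Thesis (I ≈ 0), the chain from allocated power to achievable data rate can be sketched as below. All function and parameter names are illustrative (they are not the simulator of Chapter 4), and Γ is treated as a fixed MODCOD efficiency rather than the coupled function of E_b/(N + I); the only non-assumed constant is 10·log10(k) ≈ −228.6 dBW/K/Hz.

```python
import math

K_BOLTZMANN_DB = -228.6  # 10*log10(Boltzmann constant), in dBW/K/Hz

def system_temperature(T_ant, T_atm, T_w, L_rf_rx_db, L_atm_db):
    """System temperature T_sys in K (Eq. 2.3); temperatures in K, losses in dB."""
    att_rf = 10 ** (-L_rf_rx_db / 10)                # RF-chain attenuation factor
    att_total = 10 ** (-(L_rf_rx_db + L_atm_db) / 10)
    return T_ant * att_rf + T_atm * att_total + T_w * (1 - att_rf)

def carrier_to_noise_db(P_tx_dbw, obo_db, G_tx_db, G_rx_db, L_db, T_sys):
    """Carrier-to-noise ratio C/N0 in dB-Hz (Eq. 2.1)."""
    return (P_tx_dbw - obo_db + G_tx_db + G_rx_db - L_db
            - 10 * math.log10(T_sys) - K_BOLTZMANN_DB)

def data_rate(bw_hz, alpha_r, modcod_efficiency):
    """Achieved data rate R_b in bps (Eq. 2.6). modcod_efficiency stands in for
    Γ(E_b/(N+I)) in bps/Hz; Eqs. (2.5)-(2.6) are coupled, so in practice the
    MODCOD is selected from the link quality achievable at the allocated power."""
    return bw_hz / (1 + alpha_r) * modcod_efficiency
```

For instance, with 500 MHz of bandwidth, a roll-off of 0.2, and an assumed MODCOD efficiency of 1.5 bps/Hz, the link would carry 625 Mbps.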

2.4.2 Problem Statement

This Thesis considers a multibeam High Throughput GEO Satellite with N_b non-steerable beams. These beams are already pointed to a location and have a defined shape. In addition, each beam is allocated a certain amount of spectrum beforehand such that interference among beams is minimized and can be ignored. A sequence of timesteps {1, ..., T} represents all instants in which the satellite is requested a certain throughput demand per beam, i.e., there needs to be a power allocation decision per beam. The goal of the problem is to, at every timestep, allocate a sufficient amount of power to each beam in order to satisfy the demand and constraints imposed by the system while minimizing resource consumption.

At a given timestep t, the demand requested at beam b is represented by D_{b,t}. Likewise, the power allocated to beam b at timestep t and the data rate attained when doing so are denoted as P_{b,t} and R_{b,t}, respectively (the notation P_{b,t} is used, instead of the notation from Section 2.4.1, to refer to the transmitting power P_TX of beam b at timestep t). As seen in the previous section, there is an explicit dependency between the data rate achieved and the power allocated to a particular beam. It is assumed that the satellite, at any moment, has a total available power of

P_tot. Similarly, every beam b has a maximum power constraint, denoted by P_b^max. Apart from maximum total and individual beam power constraints, some payloads are further limited by some of their subsystems, such as power amplifiers. This work assumes the satellite is equipped with N_a power amplifiers, with N_a ≤ N_b. Every amplifier is connected to a certain number of beams, but no beam is connected to more than one amplifier. These connections are given and cannot be changed during operation. The amplifiers also impose a maximum power constraint: the sum of the power allocated to the group of beams connected to amplifier a cannot exceed a certain amount P_a^max. Taking all the constraints into account, the problem is formulated as follows

\[
\begin{aligned}
\min_{P_{b,t}} \quad & \sum_{t=1}^{T} f(\mathcal{P}_t, \mathcal{D}_t) && \text{(2.7)} \\
\text{s.t.} \quad & P_{b,t} \leq P_b^{max}, && \forall b \in \mathcal{B}, \; \forall t \in \{1, ..., T\} && \text{(2.8)} \\
& \sum_{b=1}^{N_b} P_{b,t} \leq P_{tot}, && \forall t \in \{1, ..., T\} && \text{(2.9)} \\
& \sum_{b \in a} P_{b,t} \leq P_a^{max}, && \forall a \in \mathcal{A}, \; \forall t \in \{1, ..., T\} && \text{(2.10)} \\
& \gamma_m(R_{b,t}) \geq \gamma_M, && \forall b \in \mathcal{B}, \; \forall t \in \{1, ..., T\} && \text{(2.11)} \\
& P_{b,t} \geq 0, && \forall b \in \mathcal{B}, \; \forall t \in \{1, ..., T\} && \text{(2.12)}
\end{aligned}
\]

where B and A represent the set of beams and amplifiers of the satellite, respectively; and D_t, P_t, and R_t denote the set of throughput demand, power allocated, and data rate attained per beam at timestep t, respectively.

This formulation includes the constraints presented so far: on one hand, constraints (2.8) and (2.12) represent the upper and lower bounds of the power for each beam in B at any given timestep, respectively. On the other hand, constraints (2.9) and (2.10) express the limitations imposed by the satellite's and amplifiers' maximum power, respectively. Constraint (2.11) refers to the need to achieve a link margin per beam greater than γ_M, which depends on the data rate attained, as will be described in Section 4.4. Finally, the objective (2.7) is a function of the requested data rate and the power allocated that reflects the goal of the problem: successfully serving all users while minimizing power resource consumption. Multiple specific objective formulations can be considered to represent this goal; this Thesis focuses on three different metrics extracted from the literature, which constitute the objective functions of the algorithms compared.
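A minimal sketch of checking a candidate allocation against constraints (2.8)-(2.12) at a single timestep might look as follows. The names and data structures are assumptions for illustration, not the thesis implementation, and the link-margin values γ_m(R_{b,t}) are taken as given since their computation is deferred to Section 4.4.

```python
import numpy as np

def is_feasible(P, P_beam_max, P_tot, amp_groups, P_amp_max, link_margins, gamma_M):
    """Check a candidate power allocation P (one value per beam) against
    constraints (2.8)-(2.12).

    amp_groups maps each amplifier index to the list of beam indices it feeds,
    and link_margins holds gamma_m(R_b) for every beam under allocation P."""
    if np.any(P < 0):                       # (2.12) non-negative power
        return False
    if np.any(P > P_beam_max):              # (2.8) per-beam maximum
        return False
    if P.sum() > P_tot:                     # (2.9) total satellite power
        return False
    for a, beams in amp_groups.items():     # (2.10) per-amplifier maximum
        if P[beams].sum() > P_amp_max[a]:
            return False
    if np.any(link_margins < gamma_M):      # (2.11) minimum link margin
        return False
    return True
```

A check of this form is the natural inner loop of the algorithms compared later, since every candidate solution they produce must be mapped back into this feasible region.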

2.4.3 Objective Metrics

Regarding the objective (2.7), f(P_t, D_t), different metrics have been used in different works on power allocation algorithms for multibeam satellites. Three different objective metrics are chosen in this Thesis, all of them based on the power allocation at a given timestep t. First, the Total Power (TP) is introduced as a measure of resource consumption. This metric simply sums the power allocated to every beam; specifically,

$$\mathrm{TP}_t = \sum_{b=1}^{N_b} P_{b,t} \qquad (2.13)$$

The second metric is the Unmet Demand (UD), defined as the fraction of the demand that is not served given the power allocation. This metric has already been used in numerous DRM studies [4][25][50][53] and is formulated as

$$\mathrm{UD}_t = \sum_{b=1}^{N_b} \max[D_{b,t} - R_{b,t}(P_{b,t}),\ 0] \qquad (2.14)$$

Finally, the third metric is the Satisfaction-Gap Measure (SGM), which is used as the objective function in [8]. The SGM is based on the demand and data rate per beam

($D_{b,t}$ and $R_{b,t}$, respectively) and, following a series of transformations, measures the mismatch with respect to an ideal allocation case, ensuring fairness among beams. The SGM takes values in the interval [0, 1], with one indicating the best possible performance. A detailed description of how this metric is computed can be found in Appendix B.1. Following the formulation from Eq. (2.7), the minimization of the negative SGM is required. For each algorithm, the metric or metrics – as objective functions – that show the best convergence towards an optimal solution are used. Table 2.1 summarizes the relationship between the algorithms and the metrics used. While GA and PSO allow for multi-objective implementations, SA and DRL are generally single-objective approaches and therefore only one metric can be used. Specifically, in the case of DRL a linear combination of the TP and UD is considered, as explained in Section 3.3.
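As a concrete reference, the TP and UD metrics in Eqs. (2.13) and (2.14) can be sketched in a few lines of Python. The arrays and values below are illustrative only; the real simulation computes $R_{b,t}(P_{b,t})$ from the link budget model of Chapter 4.

```python
import numpy as np

def total_power(P_t):
    """TP metric (Eq. 2.13): sum of the power allocated to every beam."""
    return float(np.sum(P_t))

def unmet_demand(D_t, R_t):
    """UD metric (Eq. 2.14): demand not served given the achieved data rates."""
    return float(np.sum(np.maximum(D_t - R_t, 0.0)))

# Toy example with 4 beams (arbitrary units).
P = np.array([10.0, 5.0, 8.0, 0.0])   # allocated power per beam
D = np.array([100.0, 50.0, 80.0, 20.0])  # demanded data rate per beam
R = np.array([90.0, 60.0, 80.0, 0.0])    # achieved data rate R_b(P_b)
print(total_power(P))      # 23.0
print(unmet_demand(D, R))  # 30.0 (= 10 + 0 + 0 + 20)
```

Note that overserving a beam (beam 2 above) does not reduce the UD metric; only shortfalls count.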

Table 2.1: List of algorithms and their respective optimization metrics.

Algorithm    Optimization type    Metric(s)
GA           Multi-objective      TP and UD
SA           Single-objective     SGM
PSO          Multi-objective      TP and UD
DRL          Single-objective     $f(\mathrm{TP}, \mathrm{UD})$

Chapter 3

Algorithm Implementations

3.1 Introduction

After presenting the details of the DRM problem for satellite communications and the precise DPA problem formulation that constitutes the test case of this Thesis, this Chapter covers the implementation details of the algorithms compared. First, Section 3.2 presents the three metaheuristic algorithms (GA, SA, and PSO) individually. Then, Section 3.3 introduces the main features of the DRL architecture used and the considerations involving the use of two different types of neural networks, which constitute two separate algorithms for comparison purposes. Finally, Section 3.4 explains the concepts behind each of the hybrid algorithms considered, namely SA-GA, PSO-GA, and the two hybrid versions of DRL-GA (one with each neural network).

3.2 Metaheuristics

Metaheuristic algorithms are a class of optimization algorithms that prove to be useful for optimization problems with non-linear, complex search spaces, especially when time plays a relevant role [29]. While achieving optimality is generally hard for these algorithms, they provide "good enough" solutions in admissible amounts of time. There is a wide range of metaheuristic algorithms (e.g., nature-inspired vs. non-nature-inspired, population-based vs. single-point search [64]).

This Thesis considers three of the most popular metaheuristic algorithms, which have already been applied to the power allocation problem. These are the Genetic Algorithm (GA), the Simulated Annealing (SA) method, and the Particle Swarm Optimization (PSO) approach. The GA and PSO are population-based algorithms while SA is a single-point search method. All three algorithms are nature-inspired.

3.2.1 Genetic Algorithm

A GA [36] is a population-based metaheuristic optimization algorithm inspired by the natural evolution of species. The algorithm operates with a set of solutions to the optimization problem known as the population; each single solution is called an individual. Iteratively, the procedure combines different solutions (crossing) to generate new individuals and then selects the fittest ones (in terms of the quality of the solution with respect to the objective function of the problem) to keep a constant population size. On top of that, some individuals might undergo mutations, thus changing the actual solutions associated with them. This is done to avoid focusing only on one area of the search space and to keep exploring potentially better zones. After a certain number of iterations, when the algorithm execution ends, the fittest individual is chosen as the best solution to the problem.

In the implementation chosen for the comparison, in the context of a GA execution at timestep $t$, an individual $x_{k,t}$ is defined as an array of power allocations $\{P^k_{1,t}, P^k_{2,t}, ..., P^k_{N_b,t}\}$, for $k \ge 1$. To fulfill constraints (2.8) and (2.12), these values are always kept between 0 and $P_b^{max}$ for each beam $b$. The population is composed of $N_p$ individuals. When new individuals are generated by the procedures described below, these are denoted as $x_{N_p+1,t}$, $x_{N_p+2,t}$, and so on. Depending on the case, the population is initialized randomly or the final population from the previous timestep execution is used as the initial population for the subsequent timestep execution, i.e.,

$$\{x_{1,t}, ..., x_{N_p,t}\} = \{x_{f_1,t-1}, ..., x_{f_{N_p},t-1}\} \qquad (3.1)$$

where $f_1, ..., f_{N_p}$ are the indexes of the fittest $N_p$ individuals from the GA execution at timestep $t-1$. The genetic operations are then implemented as follows:

∙ Crossing: Individuals from the original population are randomly paired and have a probability $p_{cx}$ of being crossed. If so, for each beam, the power allocations of both individuals are compared and, with probability $p^i_{cx}$, the BLX-$\alpha$ [20] operator is applied.

∙ Selection: Since a multiobjective implementation with TP and UD as objective metrics is considered, the well-known NSGA-II selection method [13] is used in this Thesis, which prioritizes non-dominated individuals given both objectives.

∙ Mutation: After the offspring are generated, new individuals might undergo a mutation with probability $p_{mut}$. If so, for each beam, the power allocation is randomly changed with probability $p^i_{mut}$.

To accelerate the convergence to a global optimum, in both the crossing and mutation operations, whether each beam is underserving or overserving its users is taken into account. If a certain beam is underserving, the algorithm does not allow allocation changes that reduce the power allocated to that beam, and vice versa. Finally, constraints (2.9) and (2.10) are taken care of after each iteration of the algorithm. If a specific individual does not meet either of the two constraints, its power allocation levels are proportionally reduced until fulfillment. Table 3.1 shows the values of the parameters of the algorithm used in this work. The implementation is supported by the DEAP library [23].
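The proportional repair of constraints (2.9) and (2.10) described above can be sketched as follows. The function name, amplifier grouping, and numeric values are illustrative, not taken from the Thesis implementation.

```python
import numpy as np

def repair_allocation(P, P_tot, amp_groups, P_amp_max):
    """Scale power levels down proportionally until the satellite total-power
    constraint (2.9) and each amplifier constraint (2.10) are fulfilled."""
    P = P.copy()
    # Total satellite power (constraint 2.9).
    if P.sum() > P_tot:
        P *= P_tot / P.sum()
    # Per-amplifier power (constraint 2.10); scaling down a group can only
    # decrease the total, so constraint (2.9) remains satisfied afterwards.
    for beams, p_max in zip(amp_groups, P_amp_max):
        group_sum = P[beams].sum()
        if group_sum > p_max:
            P[beams] *= p_max / group_sum
    return P

# 4 beams, 2 amplifiers serving 2 beams each (toy values).
P = np.array([6.0, 6.0, 4.0, 4.0])
fixed = repair_allocation(P, P_tot=10.0,
                          amp_groups=[[0, 1], [2, 3]], P_amp_max=[4.0, 4.0])
print(fixed)  # [2. 2. 2. 2.]
```

Here the total-power scaling first maps [6, 6, 4, 4] to [3, 3, 2, 2], and the first amplifier's cap then reduces its pair to [2, 2].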

3.2.2 Simulated Annealing

Table 3.1: GA Parameters.

Parameter                   Symbol        Value
Population size             $N_p$         200
Individual crossing prob.   $p_{cx}$      0.75
Individual mutation prob.   $p_{mut}$     0.15
Gene crossing prob.         $p^i_{cx}$    0.6
Gene mutation prob.         $p^i_{mut}$   0.02
BLX-$\alpha$ parameter      $\alpha$      0.2

SA [41] is a metaheuristic algorithm inspired by the cooling processes in metallurgy, where a metal object is more malleable the higher its temperature is. This single-point search method focuses on one solution to the optimization problem and iteratively perturbs the optimization variables one by one, observing the changes in the objective function along the way. The process is governed by a parameter known as the temperature. After changing one of the optimization variables, the algorithm evaluates the objective function and, if the solution improves, the change is kept. However, if the solution is not improved, the algorithm might still keep the change with a certain probability that is larger the higher the temperature is. The temperature is progressively decreased throughout the iterations – the annealing process – thus preventing changes that degrade the solution in later stages of the algorithm. This generally helps the algorithm avoid getting stuck in local optima during early iterations.
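The temperature-dependent acceptance just described follows the classic Metropolis-style rule; a minimal sketch for a minimization problem (the thresholds and counts below are illustrative):

```python
import math
import random

def accept(delta, T):
    """Metropolis-style acceptance: always keep an improving move
    (delta <= 0 for minimization); otherwise keep a worsening move with
    probability exp(-delta / T), which grows with the temperature T."""
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / T)

random.seed(0)
# At a high temperature, worsening moves are accepted often
# (exp(-1/10) ~ 0.9); at a low temperature, almost never (exp(-100) ~ 0).
hot = sum(accept(1.0, T=10.0) for _ in range(1000))
cold = sum(accept(1.0, T=0.01) for _ in range(1000))
print(hot > cold)  # True
```

This is why early iterations explore freely while late, cold iterations behave like greedy descent.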

The implementation used in this Thesis is an adaptation of the parallel SA version in [52], whose core procedure is based on [11]. Similar to the GA case, at timestep $t$ the single solution $x_t$ is defined as $\{P_{1,t}, P_{2,t}, ..., P_{N_b,t}\}$, where $0 \le P_{b,t} \le P_b^{max}$ is always fulfilled for all $b \in \{1, ..., N_b\}$. At the beginning of the procedure this solution is randomly initialized or defined as

$$x_t = x^f_{t-1}, \qquad (3.2)$$

where $x^f_{t-1}$ is the final solution after the SA execution at timestep $t-1$. Once initialized, $N_n$ SA nodes or processes are launched in parallel, all starting from the initial solution. When all these processes finish, the best $N_{best}$ solutions are extracted and the $N_n$ SA nodes start again using these best solutions as initial conditions instead. This process is carried out a fixed number of times or until convergence. After each iteration, the initial and final temperatures are decreased according to the following rule

$$T_{start} \leftarrow T_{start} - T_{red} \cdot k \qquad (3.3)$$

$$T_{end} \leftarrow T_{end} - T_{red} \cdot k, \qquad (3.4)$$

where $T_{red}$ is an input parameter and $k$ is the iteration number.

Regarding each of the parallel SA processes, these follow the six steps described in [11], using the SGM as the objective metric, as in [8]. The cooling rate used in this work is linear, as opposed to the original exponential rate, which for the DPA use case unnecessarily increases runtime. Specifically, the cooling process follows

$$T \leftarrow T - T_{step}, \qquad (3.5)$$

where $T$ is the temperature at a given moment of the execution and $T_{step}$ is an input parameter. All the parameters relevant to the algorithm and the corresponding values used are listed in Table 3.2. For the varying and step vectors, all elements are initialized identically with values $v_i$ and $c_i$, respectively.

Table 3.2: SA Parameters.

Parameter                               Symbol        Value
Number of nodes [52]                    $N_n$         20
Initial start temperature               $T_{start}$   1
Initial final temperature               $T_{end}$     $10^{-5}$
Temperature step                        $T_{step}$    0.25
Temperature reduction                   $T_{red}$     $100 \cdot T_{step}$
Iterations per variation [11]           $N_s$         4
Iterations per temp. reduction [11]     $N_T$         2
Step vector C [11]                      $c_i$         2
Varying vector V [11]                   $v_i$         1
Number of best individuals [52]         $N_{best}$    5
Number of parallel processes [52]       $N_{par}$     20

3.2.3 Particle Swarm Optimization

PSO [19] is a metaheuristic iterative algorithm inspired by the search processes insect swarms and bird flocks carry out in nature. The algorithm focuses on a set of particles known as the swarm. At every iteration, each of these particles has a position and a velocity, these being a specific solution to the optimization problem and the direction in which the particle is moving, respectively. Each particle moves between positions – evaluates different solutions – in order to find the best solution given an objective metric. To that end, every particle takes into account its best position found so far (local best) as well as the best position found by any member of the swarm (global best). At every iteration, the velocity of each particle is determined as a function of its local best and the global best. When the iterations end, the global best is selected as the solution to the optimization problem.

In the context of a PSO execution at timestep $t$, the position of a particle $k$ is denoted by $x_{k,t}$ and defined as an array of power allocations $x_{k,t} = \{P^k_{1,t}, P^k_{2,t}, ..., P^k_{N_b,t}\}$. Each element in the array is kept between 0 and $P_b^{max}$. Similarly, the velocity is denoted as $v_{k,t}$ and defined as the array $v_{k,t} = \{v^k_{1,t}, v^k_{2,t}, ..., v^k_{N_b,t}\}$. Each value fulfills the following condition

$$-v_b^{max} \le v^k_{b,t} \le v_b^{max}, \quad \forall b, k \qquad (3.6)$$

where $v_b^{max}$ is a positive value, the maximum component-wise speed for beam $b$.

The local and global best vectors are then denoted by $x^k_{L,t}$ and $x_{G,t}$, respectively. Note the local best depends on the particle $k$ considered, whereas the global best is the same across the whole swarm. Following these definitions, at every iteration the velocity of every particle $k$ is updated as

$$v_{k,t} \leftarrow \rho_I v_{k,t} + \rho_G \epsilon_G (x_{G,t} - x_{k,t}) + \rho_L \epsilon_L (x^k_{L,t} - x_{k,t}) \qquad (3.7)$$

where $\rho_I$ is the inertia factor influence [61], $\rho_G$ is the global factor influence, $\rho_L$ is the local factor influence, $\epsilon_G$ is the global factor multiplier, and $\epsilon_L$ is the local factor multiplier. The factor multipliers, which are randomly sampled at every iteration, increase the exploration capabilities of the algorithm, preventing the whole swarm from easily getting stuck in a local optimum. Once the velocity is defined, the position is updated as follows

$$x_{k,t} \leftarrow x_{k,t} + v_{k,t} \qquad (3.8)$$

Then, the new position of each particle is evaluated, and the local and global bests are updated if necessary. This implementation is the same as the one used in [49]. Further considerations of the algorithm include common elements with the GA implementation, such as the use of the underserving and overserving heuristics and additional mutation operations to increase exploration. Constraints (2.9) and (2.10) are also taken care of after every iteration, and the multi-objective implementation is based on the work in [9]. Finally, the DEAP library [23] is also used as a support. The full list of parameters considered for this algorithm is shown in Table 3.3.¹

Table 3.3: PSO Parameters.

Parameter                        Symbol          Value
Number of particles              $N_p$           500
Inertia factor influence [61]    $\rho_I$        0.729844
Local factor influence           $\rho_L$        2
Local factor multiplier          $\epsilon_L$    $U(0, 1)$
Global factor influence          $\rho_G$        2
Global factor multiplier         $\epsilon_G$    $U(0, 1)$
Max. component-wise speed        $v_b^{max}$     $0.025 P_b^{max}$
Particle mutation prob.          $p_{mut}$       0.1
Component mutation prob.         $p^i_{mut}$     0.01
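The velocity and position updates in Eqs. (3.7)–(3.8), including the component-wise speed limit of Eq. (3.6), can be sketched as follows. The influence values match Table 3.3; the array sizes, speed cap, and seed are toy values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, x_local, x_global, v_max,
             rho_i=0.729844, rho_g=2.0, rho_l=2.0):
    """One PSO update (Eqs. 3.7-3.8): the new velocity combines inertia,
    attraction to the swarm's global best, and attraction to the particle's
    local best, with multipliers eps_G, eps_L resampled every iteration."""
    eps_g = rng.uniform(0.0, 1.0, size=x.shape)
    eps_l = rng.uniform(0.0, 1.0, size=x.shape)
    v = rho_i * v + rho_g * eps_g * (x_global - x) + rho_l * eps_l * (x_local - x)
    v = np.clip(v, -v_max, v_max)   # component-wise speed limit (Eq. 3.6)
    return x + v, v

# Toy 3-dimensional particle pulled towards bests at the all-ones point.
x = np.zeros(3)
v = np.zeros(3)
x, v = pso_step(x, v, x_local=np.ones(3), x_global=np.ones(3), v_max=0.5)
```

The clipping step is what keeps a strongly attracted particle from overshooting the whole search space in a single iteration.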

3.3 Deep Reinforcement Learning

¹The inertia factor influence is set to the optimal value found in [61].

DRL is defined by the set of techniques and algorithms that enhance the classic Reinforcement Learning (RL) framework using Deep Learning models [30], mostly neural networks. The typical RL setting consists of an agent interacting with an environment

[62]. At a certain timestep $t$, the agent observes the environment's state $s_t$ and takes the action $a_t$ that will maximize the discounted cumulative reward $G_t$, defined as

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \qquad (3.9)$$

where $r_k$ is the reward obtained at timestep $k$, $\gamma$ is the discount factor, and $T$ is the length of the RL episode. An episode is a sequence of states $\{s_0, s_1, ..., s_T\}$ in which the final state $s_T$ is terminal, i.e., no further action can be taken. The goal of the agent is to find a policy that optimally maps states to actions. In the case of DRL, the policy is parametrized and defined by $\pi_\theta(a_t|s_t)$, where $\theta$ represents the weights and biases of the neural network considered.
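The discounted return in Eq. (3.9) can be computed with a single backward pass over the episode's rewards; a small sketch with a toy episode:

```python
def discounted_return(rewards, gamma):
    """G_t from Eq. (3.9) for every timestep t, computed backwards:
    G_t = r_t + gamma * G_{t+1}, with G_T equal to the final reward."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Episode of three rewards with gamma = 0.5:
# G_2 = 3, G_1 = 2 + 0.5*3 = 3.5, G_0 = 1 + 0.5*3.5 = 2.75
print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

In practice PPO works with advantage estimates derived from these returns rather than the raw returns themselves.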

This Thesis considers the same DRL architecture used in [25], which is depicted in Figure 3-1. The environment is composed of the satellite model and the throughput demand, whereas the agent is responsible for allocating the power at every timestep and therefore comprises the neural network (NN) policy and the policy optimization algorithm. The nature of the problem then guides the selection of the policy optimization algorithm. While most communication and network studies focused on DRL use variants of the Deep Q-learning algorithm [46] due to the discrete nature of the problems [43], the non-discrete state and action spaces considered for the DPA problem suggest the use of a Policy Gradient [63] algorithm instead. To better control the evolution of the policy, a Trust Region method [57] is selected for this Thesis. Among the Trust Region methods, in order to avoid large gradients that substantially change the policy parameters between updates, the Proximal Policy Optimization (PPO) algorithm [58] is chosen as the policy optimization algorithm.

Exploiting the temporal correlation of the DRM problem – user demand is a time series – the state of the environment at timestep 푡 is defined as

$$s_t = \{\mathcal{D}_t, \mathcal{D}_{t-1}, \mathcal{D}_{t-2}, \mathcal{P}^*_{t-1}, \mathcal{P}^*_{t-2}\} \qquad (3.10)$$

Figure 3-1: DRL Architecture.

where $\mathcal{D}_t$, $\mathcal{D}_{t-1}$, and $\mathcal{D}_{t-2}$ correspond to the sets of throughput demand at timesteps $t$, $t-1$, and $t-2$, respectively; and $\mathcal{P}^*_{t-1}$ and $\mathcal{P}^*_{t-2}$ are the heuristic optimal power allocations for the two previous timesteps, described in the following paragraph. While including only the two previous timesteps already yields good performance, as will be seen in Chapter 5, considering additional timesteps in the state definition could increase the predictive power at the expense of additional computing resources during training. Given the available resources for this study, looking two timesteps back results in a good compromise between performance and training cost.

The initial state is set to $s_0 = \{\mathcal{D}_0, 0, 0, 0, 0\}$.

In this work it is assumed that the agent has access to a heuristic optimal power allocation procedure for all timesteps prior to the current allocation decision, which is easy to compute by inverting equations (4.1)–(4.5) from the simulation model presented in Chapter 4. This way, for every beam $b$, the minimum power $P^*_{b,t}$ that satisfies $R_{b,t} \ge D_{b,t}$ can be easily determined. If there is a certain beam $b$ in which the demand cannot be met, the optimal power equals its maximum allowed power $P_b^{max}$. This procedure is described in more detail in Appendix B.2.
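The Thesis inverts the link-budget equations directly; assuming only that the achieved rate is monotonically increasing in power, the same minimal power can be found with a generic bisection. The rate model and values below are illustrative stand-ins, not the Thesis's Eqs. (4.1)–(4.5).

```python
def min_power(demand, rate_fn, p_max, tol=1e-6):
    """Minimal power P* with rate_fn(P*) >= demand, assuming rate_fn is
    monotonically increasing; returns p_max when the demand cannot be met
    (mirroring the heuristic rule described in the text)."""
    if rate_fn(p_max) < demand:
        return p_max
    lo, hi = 0.0, p_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate_fn(mid) >= demand:
            hi = mid   # mid already satisfies the demand; tighten from above
        else:
            lo = mid
    return hi

# Toy linear rate model R(P) = 10 * P: demand 25 needs power 2.5.
p = min_power(25.0, lambda P: 10.0 * P, p_max=10.0)
print(round(p, 3))  # 2.5
```

A closed-form inversion, when available, is of course cheaper than bisection; the point here is only the definition of $P^*_{b,t}$.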

The action of the agent is simply allocating the power for each beam, therefore

$$a_t = \{P_{b,t} \mid b \in \{1, ..., N_b\},\ 0 \le P_{b,t} \le P_b^{max}\}. \qquad (3.11)$$

As shown in Table 2.1, the DRL architecture optimizes the policy using a function of the TP and UD metrics as objective. This is reflected in the reward of the architecture, defined as

$$r_t = -\beta \, \frac{\sum_{b=1}^{N_b} \max(D_{b,t} - R_{b,t},\ 0)}{\sum_{b=1}^{N_b} D_{b,t}} - \frac{\sum_{b=1}^{N_b} (P_{b,t} - P^*_{b,t})^2}{\sum_{b=1}^{N_b} P^*_{b,t}} \qquad (3.12)$$

where $\beta$ is a weighting factor, $P_{b,t}$ is the power set by the agent, $P^*_{b,t}$ is the optimal power (the minimum power needed to satisfy 100% of the demand, computed with the suboptimal heuristic), $R_{b,t}$ is the data rate achieved after taking the action, and $D_{b,t}$ is the demand of beam $b$ at timestep $t$. The data rate is computed using equations (4.1)–(4.5) from the simulation model, and the optimal power using the heuristic-based procedure previously introduced. The reward function plays a relevant role in the training or offline stage of the DRL algorithm; the better it represents the goals of the problem, the better the performance in the test or online stage will be. Providing information about the solution during the training stage can therefore substantially increase performance during the online or test stage, when that information is not available. For more complex DRM problems, where the optimal solution is hard to find, adding the result from approximate heuristic procedures can substantially boost the learning process of the network, as they bias the search towards "good" areas of the solution space. In this study, the performance of the DRL approach is exclusively evaluated on the test data.

The reward definition takes into account both objectives of the problem. The first element focuses on demand satisfaction while the second element responds to the need to reduce power consumption. Both elements are normalized by the overall demand and the total optimal power, respectively. The constant $\beta$ is used to define a priority hierarchy between the objectives. Since it is better to prioritize smaller values of UD over TP, the parameter $\beta$ is chosen in the range $(1, \infty)$. According to the reward definition in equation (3.12), $r_t \le 0, \forall t$.

An important element of the DRL architecture is the neural network used as the policy approximator function [62]. Two different neural networks are considered:

∙ Multilayer Perceptron (MLP): An MLP network consists of two or more hidden layers with a parametrized mapping function $\mathbb{R}^m \rightarrow \mathbb{R}^n$ for each pair of consecutive layers, where $m$ and $n$ are the number of hidden units of the first and second layers, respectively [30]. This mapping usually includes a linear function and an activation function in sequence. In this Thesis, four fully-connected layers with $15N_b$ hidden units each and Rectified Linear Unit (ReLU) activations are considered. By expanding the number of dimensions in the hidden layers, the network might learn dependencies and relationships among the different beams that produce better power allocation decisions. This network also uses normalization layers after each hidden layer to reduce training time.

∙ Long Short-Term Memory (LSTM) [27]: An LSTM network is a type of recurrent neural network that exploits the dependencies between sequential data inputs, such as time series, by means of a hidden state – the memory – and three operation gates: the input, forget, and output gates. In this case, the parametrized mapping is established between the input to the network, the hidden state, and the three gates. In this Thesis, a $5N_b$-dimensional array is used to model the hidden state. Normalization layers are also considered.

Finally, the DRL implementation is supported by OpenAI’s baselines [15] and the PPO parameters listed in Table 3.4 are used.

Table 3.4: DRL Parameters.

Parameter                            Symbol          Value
Discount factor                      $\gamma$        0.01
Learning rate                        $\lambda_r$     0.03
Steps per update                     $N_{steps}$     64
Training minibatches per update      $N_{batch}$     8
Training epochs per update           $N_{epoch}$     4
Advantage est. disc. factor [58]     $\lambda$       0.8
Clip coef. [58]                      $\epsilon_1$    0.2
Gradient norm clip coef. [58]        $\epsilon_2$    0.5
Weight factor (Eq. (3.12))           $\beta$         100

3.4 Hybrid Algorithms

The final class of algorithms considered in this Thesis are hybrid algorithms. These algorithms consist of the sequential union of two or more algorithms such that the output of the first algorithm constitutes the input, or partial input, to the second algorithm. It is not uncommon to see these algorithms in the literature, especially when it comes to metaheuristic hybridizations, in which two metaheuristic algorithms of different nature are combined. The rationale behind hybrid implementations is to complement the individual algorithms and mitigate their weaknesses.

Three different hybrids are considered in this work: SA-GA, PSO-GA, and DRL-GA. The idea behind these hybrids is the same: SA, PSO, and DRL are generally known for being fast algorithms but, even though search heuristics are implemented, they have a high risk of getting stuck in local optima or not being robust enough. In contrast, the GA is an algorithm that, despite being slower, guarantees asymptotic optimality if correctly implemented. The objective of the hybrid methods considered is to first obtain a "fast and good" solution and then slowly iterate on that solution to approach the global optimum of the problem.

3.4.1 SA-GA Hybrid

The SA-GA hybrid combines a single-point and a population-based metaheuristic algorithm and is one of the most popular hybridizations in the literature [28][40][69]. For the specific case of power allocation in communications satellites, this algorithm was explored in [4], where it showed an improvement over the individual GA or PSO in an offline scenario. The implementation first starts with a parallel SA stage, as explained in Section 3.2.2, for a predefined number of iterations or until convergence. When it ends, instead of selecting the best solution from the parallel processes, the set of solutions is transformed into the initial population of a GA execution. Then the GA carries out another batch of iterations following the procedures described in Section 3.2.1, and the final solution is extracted from the final GA population.

Specifically, after $k$ iterations of the SA stage, the GA population is initialized as

$$\{x^f_{k_1}, x^f_{k_2}, ..., x^f_{k_{N_n}}\} \qquad (3.13)$$

these being the best solutions for each of the $N_n$ parallel SA processes or nodes after the $k$-th SA iteration. Since the number of parallel SA processes is smaller than the maximum GA population size $N_p$, the initial population is left to grow until it reaches this maximum size. Both the SA and GA stages use the parameters already presented in Table 3.2 and Table 3.1, respectively. For all the simulations that follow in this work, the SA-GA hybrid combines three initial iterations of the SA algorithm followed by the GA execution until the time limit or convergence.

3.4.2 PSO-GA Hybrid

PSO and GA are two population-based metaheuristics that have also been combined as a hybrid [26][60], since PSO's population updates reach better areas of the search space more quickly but, as opposed to GA, asymptotic optimality is difficult to implement. In the literature, this specific hybrid has been applied to the joint power and bandwidth allocation problem [49], which constitutes another subinstance of the DRM problem for satellite communications. In this case, since both individual algorithms are based on multi-solution approaches, after an initial PSO execution the set of final solutions is transitioned to a GA execution and used as initial conditions. The final solution is then extracted from the final GA population. Specifically, after $k$ iterations of the PSO algorithm, the initial population of the GA is defined as

$$\{x_{1,1}, ..., x_{N_p,1}\}_{GA} = \{x_{f_1,k}, ..., x_{f_{N_p},k}\}_{PSO} \qquad (3.14)$$

where $\{x_{f_1,k}, ..., x_{f_{N_p},k}\}_{PSO}$ are the $N_p$ best particles after the PSO execution ends. The parameters shown in Table 3.3 and Table 3.1 are also used in the individual PSO and GA stages of the PSO-GA hybrid, respectively. In all the simulations that follow in this work, 50 initial iterations of the PSO algorithm are first executed and then the GA stage continues the optimization until the time limit or convergence.

3.4.3 DRL-GA Hybrid

The last of the hybrids combines two algorithms from different classes, as opposed to two algorithms from the same class, as is the case for SA-GA and PSO-GA. The DPA-oriented implementation presented in this Thesis is novel. In the past, researchers have combined neural networks with GA, but it is still not common to see a specific DRL model enhanced by a posterior metaheuristic algorithm. In terms of resource management for satellite communications, there is only one study in which a recurrent neural network is combined with GA [55], and it was designed for offline frequency allocation. The hybrid DRL-GA integrates the optimized policy that results from the training process described in Section 3.3 with the GA implementation from Section 3.2.1. Being in part a Machine Learning-based method, this hybrid needs a training stage that must happen before the use of the model in an online scenario. Only the DRL architecture is involved in that learning phase, thus resulting in the same models as those obtained in the solo DRL implementations. Once the policy is trained, the DRL agent works as a fast warm-start generator for the subsequent GA stage, providing a "good enough" solution extremely fast given the low cost of a single forward pass through the neural network. The GA population is then initialized with clones of a single individual, the solution found by the DRL model. Finally, as happens in the other hybrid implementations, the GA is run until the time limit or convergence. The same two neural networks implemented for Section 3.3 are considered in the hybrid case as well: the MLP is used as a general network that can exploit the nonlinearities of the problem, while the LSTM has proven to work well with time-series input data. Consequently, two hybrid implementations, one using each of these two networks, are considered as individual algorithms to be compared, highlighting the role of the network in the whole optimization process.

Chapter 4

Simulation Models

4.1 Introduction

All the analyses conducted in this Thesis are carried out via realistic simulation of communications satellite models. Apart from the algorithm being tested, all simulations rely on three different submodels: a satellite model, a link budget model, and a demand model. This chapter focuses on these three models and presents the details of each one in the following order: first, Section 4.2 introduces the satellite system that is common to all simulations; then Section 4.3 addresses the different demand models or datasets that constitute the test scenarios of this study; and finally Section 4.4 covers the specific link budget equations that are used to compute the performance metrics.

4.2 Satellite Model

The satellite system considered in this Thesis is a single 200-beam (푁푏 = 200) GEO satellite located over America and part of the Atlantic Ocean. As part of this model, in order to fully focus on the power optimization variables, the beam placement is predefined and fixed throughout the simulation. Likewise, the spectrum allocation is also preoptimized and frequency reuse mechanisms are used to mitigate interference between the placed beams. This satellite is set to operate in the Ka band and it is

assumed that interference with other satellites can be safely ignored. Although a specific satellite system is modeled for this Thesis, the algorithms and the optimization framework could work for other single-satellite or multi-satellite systems optimizing for multiple beams. In fact, most of the DRM problems in the satellite communications context can be mapped to multiple satellite architectures if the communication environment is known and the resource allocation decisions are made by a centralized engine. With respect to the ground segment, although multiple users or terminals might be served by a single beam, all the individual data rate demands are aggregated and a single "super user", located precisely at the center of the beam, is considered instead. The data rate demand of this unique user is equal to the sum of the individual ones. Having separate users inside a single beam only affects the number of optimization variables; it does not affect any of the other models or algorithms. To address the specific impact of dimensionality, Section 5.4 focuses on scenarios with an increased number of optimization variables. Finally, it is assumed the control segment requires near-real time control and optimization of the satellite's power allocation decisions. Although Section 5.2 gives an overview of how algorithm performance changes with runtime, a 3-minute update rule is considered as the control frequency of the communications payload. This value represents a realistic control frequency, given the uplink latency of TT&C and the typical demand variations.

4.3 Demand Models

This work uses traffic demand models that represent realistic operation scenarios. Four different datasets are used in the simulations; these provide demand data points for all beams and span a time window longer than one day, which usually marks the period of a repeating demand pattern in the case of GEO satellites. These datasets are provided by SES and are composed of 200 time series (one per beam) containing 1,440 data points each, which correspond to demand samples throughout a 48-hour activity

period (a sample every 2 minutes). This means 288,000 data rate values from different regions and times are available. With these datasets, the aim is to account for, first, the demand behaviour during a typical daily operations cycle; and second, unusual cases that might lead to important service failures if the algorithms do not perform adequately. The aggregated demand plot for each dataset (sum across all beams for a given time instant) is depicted in Figure 4-1 and their respective descriptions are the following:

∙ Reference dataset: represents the throughput demand during a typical opera- tion scenario, in which more data rate is requested during specific time periods in the day. This represents the user demand behaviour that holds for the majority of the operation life of the satellite.

∙ Sequential activation dataset: considers the scenario in which beams activate progressively throughout a 48-hour time window. As can be seen in Figure 4-1, the algorithms have to deal with sparse demand at the beginning of the simulation and adapt to the sequential activation of new beams. During operations, a sequential activation could occur when there is a need to recover from system failures or when new users are added into the system. To create this dataset, the Reference dataset is taken and the activation pattern is created on top of it.

∙ Spurious dataset: accounts for the case in which events that require a relatively high amount of throughput take place in a short amount of time, such as crowded sporting events or some natural disasters. This dataset is constructed by taking a low demand baseline value for all beams and adding random “demand spikes” with a throughput demand between 2 and 5 times the baseline. The frequency of the spikes increases from 2% to 15% over time throughout a 48-hour interval.

∙ Non-stationary dataset: represents a non-stationary scenario in which the trend of each beam’s demand constantly changes. To create this dataset, the Reference dataset is again taken and, every 2 hours, time series among pairs of beams are randomly interchanged. For instance, two beams b and b′

swap their demand time series every 2 hours. This way, the demand correlation between beams is not constant through time, which specifically impacts the Machine Learning-based algorithms. Since this dataset does not change the demand magnitude with respect to the Reference one, the aggregated demand curve looks identical to the Reference dataset.
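As an illustration, the construction of the Spurious dataset described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual generation code used for the Thesis: the baseline value, units, and random seed are arbitrary, and only the spike magnitude (2–5 times the baseline) and the ramping spike frequency (2% to 15% over 48 hours) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

N_BEAMS, N_STEPS = 200, 1440    # 200 beams, 48 h sampled every 2 minutes
BASELINE = 1.0                  # low demand baseline (arbitrary units)

# Spike probability ramps linearly from 2% to 15% over the 48-hour window.
p_spike = np.linspace(0.02, 0.15, N_STEPS)

# Each (beam, timestep) cell spikes independently with probability p_spike[t];
# a spike demands between 2x and 5x the baseline value.
spikes = rng.random((N_BEAMS, N_STEPS)) < p_spike
magnitude = rng.uniform(2.0, 5.0, size=(N_BEAMS, N_STEPS))
demand = np.where(spikes, magnitude * BASELINE, BASELINE)
```

The Non-stationary dataset could be built analogously by permuting whole rows of such a matrix every 60 samples (2 hours).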

Figure 4-1: Normalized aggregated demand plot for the four scenarios considered.

4.4 Link Budget Model

The last of the three models deals with the equations needed to quantify the performance of the different algorithms. According to the problem statement and its implementation, the output of the algorithms is a power value to be set for each of the beams. This is carried out for a given number of timesteps. Once the power for every beam is decided, there is an additional step of computing the TP and UD metrics considered in this work. This is done by the link budget model presented in this section, which is the same model used in [50] with additional simplifying assumptions. In order to compute the metrics, the power, the demand, and the provided data rate per beam are necessary for every timestep considered. Focusing on a single beam

$b$ at timestep $t$, and denoting its allocated power by $P_{b,t}$, the goal is to find the data rate $R_{b,t}$; the other parameters are already known. Following the power transmission mechanisms described in Section 2.4.1, the link's carrier-to-noise-spectral-density ratio of beam $b$ at timestep $t$, $C/N_0|_{b,t}$, is computed as

$$\left.\frac{C}{N_0}\right|_{b,t} = P_{b,t} - \mathrm{OBO} + G_{Tx} + G_{Rx} - \mathrm{FSPL} - 10\log_{10}(k\,T_{sys}) \;\; \text{[dB]} \quad (4.1)$$

where OBO is the power-amplifier output back-off (dB), $G_{Tx}$ and $G_{Rx}$ are the transmitting and receiving antenna gains (dB), respectively, FSPL is the free-space path loss (dB), $k$ is the Boltzmann constant, and $T_{sys}$ is the system temperature (K). It is assumed that the antenna and amplifier architectures do not change over time, and therefore the antenna gains and OBO are constant throughout the entire simulation. Regarding the losses, equations (2.1) and (2.2) considered multiple types of losses $L$, including FSPL, the atmospheric losses ($T_{atm}$), and the transmitting and receiving radiofrequency chain losses ($L_{RF_{TX}}$ and $L_{RF_{RX}}$, respectively). In the model used it is assumed that $\mathrm{FSPL} \gg T_{atm}$, $\mathrm{FSPL} \gg L_{RF_{TX}}$, and $\mathrm{FSPL} \gg L_{RF_{RX}}$, and therefore $L \simeq \mathrm{FSPL}$.

Since the satellite model considers a beam placement and spectrum allocation that mitigates interference, for the purposes of the objective the interference is neglected ($I \simeq 0$) and therefore

$$\left.\frac{C}{N_0 + I}\right|_{b,t} = \left.\frac{C}{N_0}\right|_{b,t} \quad (4.2)$$

Then, equation (2.5) is given by

$$\left.\frac{E_b}{N}\right|_{b,t} = \left.\frac{C}{N_0}\right|_{b,t} \cdot \frac{BW_b}{R_{b,t}} \quad (4.3)$$

where $BW_b$ is the bandwidth allocated to beam $b$ (Hz) and $R_{b,t}$ is its data rate (bps) achieved at timestep $t$, which is in turn computed as

$$R_{b,t} = \frac{BW_b}{1 + \alpha_{r,b}} \cdot \Gamma\!\left(\left.\frac{E_b}{N}\right|_{b,t}\right) \;\; \text{[bps]} \quad (4.4)$$

where $\alpha_{r,b}$ is beam $b$'s roll-off factor and $\Gamma$ is the spectral efficiency of the modulation and coding scheme (MODCOD) (bps/Hz). The spectral efficiency is a function of $E_b/N$. In this model, Adaptive Coding and Modulation (ACM) mechanisms are considered, and therefore the MODCOD used on each link is the one that provides the maximum spectral efficiency while satisfying the following condition

$$\left.\frac{E_b}{N}\right|_{th} + \gamma_M \leq \left.\frac{E_b}{N}\right|_{b,t} \;\; \text{[dB]} \quad (4.5)$$

where $E_b/N|_{th}$ is the MODCOD threshold (dB), $\gamma_M$ is the desired link margin (dB), and $E_b/N|_{b,t}$ is the actual link energy-per-bit-to-noise ratio (dB) computed using equation (4.3). Equation (4.5) represents constraint (2.11) from the problem statement, such that

$$\gamma_m(R_{b,t}) = \left.\frac{E_b}{N}\right|_{b,t} - \left.\frac{E_b}{N}\right|_{th} \geq \gamma_M \quad (4.6)$$

Equations (4.5) and (4.6) validate whether a certain power allocation is feasible (i.e., there must be at least one MODCOD scheme such that these inequalities are satisfied). Finally, this study uses the DVB-S2 and DVB-S2X standards for the MODCOD schemes, and therefore the values for $\Gamma$ and $E_b/N|_{th}$ are those tabulated in the DVB-S2X standard definition [21]. The rest of the parameters of the model can be found in Table 4.1; these were provided and/or validated by SES. Some of these parameters have the same value for all beams; others do not, and for those the range is shown.

Table 4.1: Link Budget Parameters.

Parameter                 Symbol           Value
TX antenna gain           $G_{Tx}$         50.2 - 50.9 dB
RX antenna gain           $G_{Rx}$         39.3 - 40.0 dB
Free-space path losses    FSPL             209.0 - 210.1 dB
Boltzmann constant        $k$              $1.38 \cdot 10^{-23}$ J/K
Roll-off factor           $\alpha_{r,b}$   0.1
Link margin               $\gamma_M$       0.5 dB
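The chain of equations (4.1)–(4.5) can be sketched in code for a single beam and timestep. This is an illustrative sketch, not the Thesis implementation: the two-entry MODCOD table below is made up (real thresholds and spectral efficiencies come from the DVB-S2X tables [21]), and the parameter values are picked from the ranges in Table 4.1.

```python
import math

BOLTZMANN = 1.38e-23  # J/K

def best_data_rate(p_dbw, obo_db, g_tx_db, g_rx_db, fspl_db, t_sys_k,
                   bw_hz, rolloff, modcods, margin_db=0.5):
    """Return the highest feasible data rate (bps) following Eqs. (4.1)-(4.5)."""
    # Eq. (4.1): carrier-to-noise-spectral-density ratio, in dB
    c_n0 = (p_dbw - obo_db + g_tx_db + g_rx_db - fspl_db
            - 10 * math.log10(BOLTZMANN * t_sys_k))
    best = 0.0
    for ebn_th_db, gamma in modcods:  # (threshold dB, efficiency bps/Hz)
        rate = bw_hz / (1 + rolloff) * gamma              # Eq. (4.4)
        ebn_db = c_n0 + 10 * math.log10(bw_hz / rate)     # Eqs. (4.2)-(4.3), dB
        if ebn_th_db + margin_db <= ebn_db:               # Eq. (4.5) feasibility
            best = max(best, rate)
    return best

# Hypothetical two-entry MODCOD table: (Eb/N threshold dB, efficiency bps/Hz)
MODCODS = [(4.0, 1.0), (9.0, 2.5)]
rate = best_data_rate(p_dbw=10.0, obo_db=0.5, g_tx_db=50.5, g_rx_db=39.5,
                      fspl_db=209.5, t_sys_k=300.0, bw_hz=100e6,
                      rolloff=0.1, modcods=MODCODS)
```

If no MODCOD satisfies the margin condition, the allocation is infeasible and the function returns a zero rate, mirroring the feasibility check of equations (4.5)–(4.6).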

Chapter 5

Results

5.1 Introduction

This chapter focuses on the different analyses that account for the performance of each algorithm discussed in Chapter 3, using the models and scenarios presented in Chapter 4. Four specific performance dimensions are addressed: time convergence, continuous execution, scalability, and robustness. First, Section 5.2 covers the relationship between algorithm performance and algorithm runtime for the metaheuristic approaches, where computing time plays a critical role. Then, Section 5.3 presents a comparison across all nine algorithms in the context of the Reference demand model and analyzes the average quality of the solutions attained by each method under time restrictions. Next, Section 5.4 focuses on scalability and presents results on how the quality of the solution degrades when more beams are added into the optimization framework under the same system and time restrictions. Finally, Section 5.5 compares the performance of the algorithms on the additional demand models that represent contingency or non-frequent scenarios. All the simulations included in this Thesis run on a server with 20 cores of an Intel 8160 processor, 192 GB of RAM, and one NVIDIA GTX 1080 Ti. The GPU is exclusively used during the DRL training stage; the rest of the execution runs on the CPUs. Parallelization has been implemented for all algorithms that allow it

(e.g., population evaluation in the case of the GA). For the specific case of DRL, 10 environments are used in parallel during the training stage in order to acquire more experience and increase training speed. All the results that follow are compared in terms of both the power consumed (TP metric) and the UD achieved, assigning a higher weight to the latter metric when obtaining a Pareto front of non-dominated solutions from the population-based and swarm-based algorithms (GA and PSO stages, respectively). To better compare the algorithms, the performance across both metrics is normalized when reporting all results in this Thesis. First, as discussed in Section 3.3, in this work it is assumed that the optimal power (the minimum power needed to satisfy 100% of the demand, computed with the heuristic described in Appendix B.2) can be computed a posteriori. Consequently, the power metric is normalized with respect to this optimal power. On the other hand, the UD metric is normalized with respect to the total demand for the timestep considered. According to these normalization decisions, the best performance that can be attained is TP = 1 and UD = 0, marking the utopia point [12] of this study.
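The extraction of non-dominated solutions with priority on UD can be sketched as below. This is an illustrative helper, not the Thesis implementation; both metrics are assumed to be minimized, with the higher weight on UD expressed as the primary sort key.

```python
def pareto_front(solutions):
    """Keep the non-dominated (power, ud) pairs; both metrics are minimized.

    The surviving solutions are ranked with UD first, mirroring the higher
    weight assigned to the UD metric in this work.
    """
    front = [
        (p, u) for p, u in solutions
        if not any(p2 <= p and u2 <= u and (p2, u2) != (p, u)
                   for p2, u2 in solutions)
    ]
    return sorted(front, key=lambda s: (s[1], s[0]))  # UD first, then power

# Toy candidate set: (normalized power, normalized UD) pairs.
candidates = [(1.2, 0.05), (1.0, 0.10), (1.5, 0.01), (1.6, 0.05)]
front = pareto_front(candidates)
```

Here (1.6, 0.05) is dominated by (1.2, 0.05) and is dropped; the remaining three points form the front.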

5.2 Convergence Analyses

The crux of the DRM problem is the need to continuously optimize the resource allocation decisions, since the resource demand varies with time. Consequently, the time necessary to make new allocation decisions is relevant: the faster this process is, the higher the update frequency can be. An appropriate update frequency is key to adapting to demand changes and being efficient in terms of resource usage. In the case of satellite communications, the ability to obtain good solutions in small intervals of time allows satellite operators to better adapt to their users’ data rate demands. As pointed out in Section 4.2, in this Thesis it is assumed that the control frequency of the satellite system is 3 minutes, which accounts for the time needed to send commands to the satellite and is a reasonable threshold compared to the variations in demand. This means the decision-making process, i.e., the algorithm runtime, cannot take more than 3 minutes. This is especially relevant for the metaheuristic algorithms.

In [25] it is proven that DRL is capable of providing a solution in the range of milliseconds, although it requires a prior offline training phase. By contrast, the three metaheuristic algorithms considered in this work – GA, SA, and PSO – are iteration-based procedures that progressively improve the quality of their solutions [64]. Consequently, the quality of an allocation provided by any of these algorithms is clearly constrained by the computation time available.

The goal of this section is to understand how, in the case of the metaheuristic algorithms, the additional dimension of runtime compares between methods. In the context of power allocation, the three algorithms have been assessed in terms of optimality in different works (GA [4], SA [8], and PSO [18]). However, their time-relative performance still needs to be analyzed. To that end, this study starts by evaluating the convergence rate of the metaheuristic algorithms when applied to the use case of the DPA problem.

Figure 5-1 shows the average normalized power (with respect to the optimal minimum power) and the 95% confidence interval (CI) of the power each of the metaheuristic algorithms allocates, depending on the amount of computing time given. The computing time of the analysis ranges from 1 to 10 minutes and is discretized in intervals of 1 minute. Likewise, Figure 5-2 shows the UD each algorithm is able to reach (relative to the total demand) under the same premises. Both figures have been obtained after executing the algorithms at 12 different data points (timesteps) uniformly spaced within the first 24 h of the Reference dataset. Notice that in both figures the optimal value is denoted for each metric. As explained in the introduction, these values hold for all simulations but will not be shown again to alleviate visual content.
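The averages and 95% confidence intervals reported in these figures follow the usual normal-approximation recipe over the 12 evaluation timesteps. A minimal sketch (not the plotting code of the Thesis; the sample values are made up):

```python
import math

def mean_ci95(samples):
    """Sample mean and half-width of a normal-approximation 95% CI."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # unbiased variance
    half = 1.96 * math.sqrt(var / n)                       # z ~ 1.96 for 95%
    return mean, half

# Example: normalized power of one algorithm over 12 evaluation timesteps.
mean, half = mean_ci95([1.10, 1.08, 1.12, 1.09, 1.11, 1.10,
                        1.07, 1.13, 1.10, 1.09, 1.11, 1.10])
```

The plotted band is then mean ± half at each computing-time budget.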

Two different behaviors can be observed. On one hand, both SA and PSO are capable of finding fairly good solutions in less than one minute of computing time but are not able to reach further significant improvements – they both reach local optima. SA obtains a solution better in power, whereas PSO reaches a lower level of UD. On the other hand, GA shows a continuous improvement that leads it to a dominant position both in terms of power and UD. These behaviours reinforce the motivation behind the use of hybrid approaches, as introduced in Chapter 3. Finally,


Figure 5-1: Average aggregated power and 95% confidence interval against computing time available. Power is normalized with respect to the optimal aggregated power. Reference scenario used.


Figure 5-2: Average aggregated UD and 95% confidence interval against computing time available. UD is normalized with respect to the aggregated demand. Reference scenario used.

it is interesting to point out that the use of a single-objective optimization metric for the SA, with SGM as the objective function, leads it to progressively reduce UD and

increase power. The SGM evolution over time for the SA algorithm can be found in Figure A-1 of Appendix A. These results carry a relevant conclusion to bear in mind throughout the remainder of this Thesis: the performance of the algorithms involving a GA execution is relative to the amount of computing time given. The reader should take this limit into account when extracting conclusions, as allowing additional computing time could improve the results, and vice versa. For the rest of the analyses, a maximum computing time of 3 minutes per algorithm is considered, which is enough for the GA to achieve less than 3% unmet demand relative to the total demand and less than 20% excess power relative to the optimal power.
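The 3-minute cap effectively turns every metaheuristic run into an anytime procedure: iterate and keep the incumbent solution until the wall-clock budget expires. A generic sketch under stated assumptions (the step function and the tiny budget below are illustrative, not the Thesis code):

```python
import time

def run_with_budget(step, initial, budget_s=180.0):
    """Repeatedly apply an improvement step until the wall-clock budget expires.

    `step` takes the incumbent solution and returns one at least as good,
    mimicking one GA/PSO/SA iteration under a 180 s (3 min) budget.
    """
    best = initial
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        best = step(best)
    return best

# Toy usage with a tiny budget: halve the objective at every iteration.
result = run_with_budget(lambda x: x * 0.5, 10.0, budget_s=0.01)
```

Note that `time.monotonic()` is preferred over `time.time()` for budgets, since it is unaffected by system clock adjustments.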

5.3 Continuous Operation Performance

In the context of satellite communications, one of the main gaps found in the literature is the evaluation and comparison of the algorithms’ performance in a dynamic scenario that represents a continuous operation context, where there is limited time to make resource allocation decisions. Prior studies mostly focus on environments in which time is not a restriction and on demand models that do not account for continuous operation. This is especially the case of metaheuristic-based papers, where time might actually be a critical factor, as seen in the previous section. In the case of DRL, works [25], [37], and [38] frame the use of their respective methods in a more realistic context and prove the usefulness of DRL throughout 24-hour operation scenarios. The desired execution frequency alongside the algorithm convergence time might impose hard constraints on the use of specific candidates; therefore, it is important to quantify the algorithm performance in dynamic environments where computing time is a restriction and use these results as comparison baselines. This is the objective of this section. Each of the nine algorithms considered in this Thesis is run on the Reference dataset at 97 different timesteps evenly spaced between 0 and 24 hours (15 minutes between timesteps). A 3-minute computing time constraint is imposed on all algorithms, to be shared between the two stages in the case of the hybrids. In the case of DRL simulations, the models are trained offline for 6 million timesteps – approximately 7

hours and 11 hours of training time for the MLP and LSTM networks, respectively – using the Reference dataset and the same training and testing procedures described in [25]. The DRL solo and DRL-GA hybrids share the same DRL models when the DRL architectures use the same neural network (MLP or LSTM). Figure 5-3 shows the aggregated achieved power for every algorithm at each of the timesteps, applying a normalization factor with respect to the optimal aggregated power. Likewise, the UD results for this analysis are shown in Figure 5-4, considering the aggregated demand per timestep as the normalization factor. These two figures focus on a specific range of the metric space, which prevents the full performance curve from being shown in the case of some algorithms. In Appendix A, Figure A-2 and Figure A-3 show the results on the full metric space for the TP and UD metrics, respectively.


Figure 5-3: Aggregated power delivered by every algorithm during the continuous execution simulations. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used.


Figure 5-4: Aggregated UD achieved by every algorithm during the continuous execution simulations. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used.

While it is interesting to show the temporal evolution of the performance, it is easier to extract conclusions from further aggregated data. To that end, the aggregated performance over time for each algorithm, in terms of both average and standard deviation per metric, can be found in Table 5.1. Both metrics can be integrated in the form of a 2-D space, as shown in Figure 5-5. This figure shows the power metric on the horizontal axis and the UD metric on the vertical axis. The average performance for both metrics is given by a bold dot per algorithm, while the standard deviation of each metric is represented as one of the semiaxes of each ellipse – the horizontal semiaxis corresponds to the power metric standard deviation and the vertical semiaxis corresponds to the UD standard deviation. This figure has also been limited to a specific region of the metric space, leaving the SA performance out of the visual range. In Appendix A, Figure A-4 shows the full range of the metric space.

73 Table 5.1: Aggregated Power and UD results for each algorithm. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively.

Algorithm       Power Avg.   Power Std. dev.   UD Avg.   UD Std. dev.
GA                 1.06          0.09            0.03        0.02
PSO                0.93          0.16            0.05        0.04
PSO-GA             1.20          0.28            0.01        0.01
SA                 2.40          5.77            0.19        0.15
SA-GA              1.25          0.30            0.01        0.01
DRL-MLP            1.32          0.13            0.03        0.01
DRL-LSTM           1.30          0.12            0.01        0.01
DRL-GA-MLP         1.07          0.09            0.01        0.01
DRL-GA-LSTM        1.05          0.03            0.01        0.01


Figure 5-5: Average power and UD performance per algorithm on the Ref- erence dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses.

Based on the results in all these figures, the behavior of each algorithm can be described as follows:

∙ GA: depicted in red, GA shows one of the most stable performances in terms of power, as it has one of the smallest power averages (6% extra power with respect to the optimal value) and the second smallest standard deviation. Regarding UD, despite not being one of the best candidates in terms of this metric, it always serves, on average, at least 97% of the demand.

∙ SA: depicted in green, it is clearly dominated by other algorithms both in terms of power and UD, achieving the worst results in both metrics. Furthermore, at some timesteps it is not able to find the optimal region and produces allocations that consume more than double the power required. This behaviour would not be desirable in a real operation scenario. The main advantage of SA is its speed in finding an initial solution, which clearly has an impact when it is combined with the GA in the SA-GA hybrid (average UD reduction from 19% to 1%).

∙ PSO: depicted in blue, it shows a similar behaviour to SA in terms of local optimality and speed but clearly dominates the annealing method since, especially during the first 12 hours, it achieves a lower UD while consuming less power than SA. Still, the variability of this algorithm produces non-desirable outcomes at several intervals of the simulation that leave more than 10% of the demand unmet (e.g., the time interval from 9 to 15 hours). The PSO algorithm is not as robust as the GA in terms of locating the minimum-UD solutions in the power-UD tradeoff curve, since its multi-objective adaptation does not preserve as many points from the best Pareto front as the GA does.

∙ SA-GA: depicted in orange, the first of the two hybrids shows a positive impact on the UD metric with respect to either of the individual algorithms alone, as it shows an average service rate of 99%. However, in terms of power, it preserves part of the variability induced by the SA stage. This causes a “spiky” behaviour throughout the 24 hours. This amount of variability might not be desirable.

∙ PSO-GA: depicted in purple, this algorithm achieves low UD results similar to other algorithms and, as opposed to SA, does not show excess power use. Its main disadvantage is the presence of power spikes, although in smaller quantity than the SA-GA hybrid. The spikes of the PSO-GA and SA-GA hybrids are corroborated by the fact that they present the largest horizontal semiaxes in the ellipses (not taking the solo SA into account).

∙ DRL: depicted in olive (MLP) and brown (LSTM), this approach shows good UD results, serving, on average, 97% (MLP) and 99% (LSTM) of the demand for all timesteps. However, both networks are too conservative in terms of power, although their behaviour is stable and does not show any spikes. This conservative power setting is a result of the networks learning a high-power bias. The greatest advantage of DRL is its capacity to achieve these results in milliseconds, an important feature for rapidly-changing scenarios. Finally, it is worth pointing out that, although the LSTM is popular among time series prediction applications, it does not necessarily eclipse the MLP if the MLP is given a good representation of the environment state that captures the important information.

∙ DRL-GA: depicted in pink (MLP) and grey (LSTM), both DRL-GA hybrids are clearly the best algorithms in terms of the power-UD tradeoff. As also seen in the other hybrids, the GA stage improves on the solo DRL versions in both average performance and standard deviation, making this hybrid a clear choice among the rest of the algorithms. The conclusion of the comparison between networks repeats: although the LSTM shows better numbers, the MLP performance is close, with 3 times more power variability than the DRL-GA-LSTM. The advantage of using the MLP network over the LSTM one is a reduced training time, although under the assumption that training happens offline this is not an important advantage.

Based on the presented results, we can observe that some algorithms are dominated and should not be considered for real online settings. First, both metaheuristic hybrids dominate their individual counterparts in terms of UD (SA-GA dominates SA and

PSO-GA dominates PSO, respectively). Between these two, PSO-GA shows less variability in its power allocations. Then, GA is one of the most stable algorithms and its asymptotically optimal behaviour makes it outperform the PSO-GA in terms of power. On the Machine Learning side, both DRL-GA hybrids have shown an outstanding performance, with the version using the LSTM showing the smallest UD and the smallest feasible power (the PSO shows a power performance of 0.93; since this is smaller than 1, it means there is no way of achieving zero UD). Finally, both solo DRL approaches cannot be matched in terms of speed, making them the dominant algorithms in that dimension.

5.4 Scalability Analyses

As pointed out in the introductory chapter, there is at least an order of magnitude of difference between the number of beams used in previous power allocation studies and those of the satellite systems expected to be launched in the coming years. While almost every system will be able to operate hundreds of beams, some constellations will scale up to thousands of configurable beams. As a consequence, an autonomous DRM engine must be able to handle this level of scalability, which translates into more variables to optimize and more constraints to take into account under similar time constraints. In that direction, this part of the analyses focuses on the scaling capabilities of the algorithms, given their computing time constraints. However, since scalability has a different impact on different algorithms, it is necessary to make a distinction between the metaheuristic-based algorithms and the Machine Learning-based algorithms. The former work in iterative processes that improve solutions over time. These iterations involve changing some or all of the variables in the case of GA and PSO, and each variable one-by-one in the case of SA. In these cases, scalability involves longer convergence processes, as the iteration runtime is proportional to the number of optimization variables. In the latter case, scalability has a direct impact on the training stage of Machine Learning-based algorithms. Having more variables in the

optimization problem translates into larger state and action spaces, and therefore optimizing the policy via policy gradient takes more time.

Since DRL training matters are left out of the analyses of this Thesis, this section only addresses the scaling capabilities of the complete metaheuristic-based algorithms, i.e., GA, PSO, SA, PSO-GA, and SA-GA. In the case of DRL, although the training time increases, the online operation procedure still relies on the forward pass of a neural network, which, given its fixed structure, would still run in the range of milliseconds to seconds. Nevertheless, although DRL has proven its capacity to deal with high-dimensional inputs in other domains [16][70], scalability on the order of tens of thousands of optimization variables and more is definitely one of the challenges to bear in mind for future applications of DRL [17].

To assess how well each algorithm scales, in this section the implementations are tested on satellites with an increasing number of beams. Given the lack of availability of satellite data corresponding to systems with more than 200 beams, the 200-beam model described in Section 4.2 is replicated instead, including beams, amplifiers, and users. Replicating the 200-beam model once creates a new 400-beam satellite that can be controlled with any of the algorithms considered in this study. Following this idea, the replicating action is taken several times, and three new models with 400, 1000, and 2000 beams, respectively, are used in this part of the analyses.
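On the data side, the replication procedure amounts to stacking copies of the per-beam demand matrix. A minimal sketch, assuming a (beams × timesteps) demand array (the actual model replication also duplicates amplifiers and users, which this fragment does not capture):

```python
import numpy as np

def replicate_demand(demand, factor):
    """Stack `factor` copies of a (beams x timesteps) demand matrix."""
    return np.tile(demand, (factor, 1))

# Stand-in for the 200-beam Reference dataset (values are placeholders).
base = np.arange(200 * 1440, dtype=float).reshape(200, 1440)
model_400 = replicate_demand(base, 2)    # 400-beam scaled model
model_2000 = replicate_demand(base, 10)  # 2000-beam scaled model
```

Each replicated block repeats the original 200 time series, so the scaled scenarios preserve the demand statistics while multiplying the number of optimization variables.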

After replication, the Reference dataset – which is also replicated – is used and each algorithm is run at 25 different timesteps evenly spaced between the 10th and the 16th hour of this dataset (15 minutes between timesteps). This time window is chosen since it corresponds to the interval where the demand peak occurs, as seen in Figure 4-1 from Section 4.3. As a result, Figure 5-6 shows the average aggregated power obtained per algorithm in each of the scaled models. This metric is obtained by first aggregating the power across all beams for each timestep, then normalizing it with respect to the optimal power in that timestep, and finally aggregating the power ratio across all timesteps. Likewise, Figure 5-7 shows the average aggregated UD under the same premises, using, as usual, the total demand as the normalization factor. The time restriction of 3 minutes holds for each individual run.


Figure 5-6: Average aggregated power against number of beams. Power is normalized with respect to the optimal aggregated power. Reference scenario used.


Figure 5-7: Average aggregated UD against number of beams. UD is nor- malized with respect to the aggregated demand. Reference scenario used.

The results show that not all the algorithms can be scaled given the time restrictions imposed. That is the case of SA and the hybrid SA-GA. Since, as opposed to PSO and GA, SA is a sequential algorithm – it only alters one variable between consecutive evaluations of the objective function – its performance is clearly limited by the number of variables in the model. Consequently, for the 1000-beam and 2000-beam cases the algorithm is not able to finish the first iteration on time.

On the contrary, GA, PSO, and their hybrid do obtain solutions for the scaled models. It can be observed that, while maintaining a UD result below 5%, these algorithms incur greater power excesses when increasing the number of controlled beams. Having more variables to tune, and therefore bigger search and exploration spaces, makes these algorithms carry out less efficient iterations, which has an impact on the power required to reach minimum-UD solutions. This is especially the case of GA; as seen in Figure 5-1 and Figure 5-2, given the priority of UD over power, this algorithm first looks for high-power solutions that guarantee low UD and then tries to reduce the amount of power needed while maintaining the UD performance. When the number of beams increases, since each iteration takes more time, the final solution after 3 minutes shows a high power level.

It is interesting to point out that while all algorithms have traded a higher power consumption for a lower UD, SA has done the opposite. This is due to the single-objective fairness metric it uses, the SGM. As opposed to UD as an optimization metric, the SGM does distinguish between different overserving levels (by contrast, UD is zero for any overserving situation), and therefore tries to fix that as well. This is why the algorithm might sacrifice UD if there is a substantial power decrease, as in the 400-beam case. To further understand each algorithm's behavior in each of the simulations, Figures A-5 to A-12 in Appendix A show the performance over time in terms of both power and UD for each of the scenarios (200, 400, 1000, and 2000 beams).

At this point there is a good overview of the strengths and weaknesses of each algorithm when applied to the DPA problem, as a representative of the DRM problem in the context of satellite communications. Nine different algorithms, which correspond

to seven different methods plus the network considerations in the DRL cases, have been assessed in terms of runtime, online performance, and scalability. The final tests remaining in this Thesis correspond to the robustness analyses previously described. Since some algorithms have already shown dominant behavior in the analyses, only these will be considered from now on. This is the case of the DRL-GA hybrid, which has shown the best online performance for both networks. Second, the solo GA algorithm has one of the most stable behaviors, which leads it to obtain a good power-UD ratio in most cases. Similarly, due to its scaling properties, the PSO-GA hybrid is also interesting to further characterize. Finally, the speed of the solo DRL approach cannot be matched by any of the other algorithms; both networks prove their ability to obtain an admissible UD performance on the order of seconds. Consequently, in the following section only these four approaches are evaluated, taking into account both networks in the case of the algorithms involving DRL.

5.5 Robustness Analyses

One of the most important features a DRM algorithm must account for is robustness. While it is crucial that the algorithms perform at their best during an average operation cycle, being able to tackle the not-so-frequent events is also relevant. For instance, transportation companies must reorganize their fleets in case some vehicles need sudden maintenance, and data centers must be able to react when their service request patterns change due to abnormal behaviour from the users. Satellite communications systems are not an exception to these contingency scenarios, since events such as gateway failures or sudden demand changes can happen. If not properly addressed, these events can lead to service failures. When an autonomous algorithm plays such an important role in the DRM decision-making process, it is good practice to characterize how this algorithm behaves in terms of robustness. This section intends to extend the algorithm comparison to the robustness dimension. As opposed to the previous sections in this chapter, the Reference dataset does not constitute the comparison scenario and the other three datasets are used

instead. These are the Sequential activation dataset, the Spurious dataset, and the Non-stationary dataset, all introduced in Section 4.3. As concluded in the previous section, given their dominance in the rest of the analyses, only GA, DRL, and the hybrids PSO-GA and DRL-GA are considered in this part of the study. Regarding the implementations involving a DRL stage, the policies used are those resulting from training the architectures on the Reference dataset, not on the additional datasets, since the goal of the section is to understand the behavior of the algorithms during unprecedented circumstances. Both MLP and LSTM neural networks are considered. All the results presented in this section are based on simulations involving 61 different timesteps evenly spaced in the 48-hour time window of each dataset. The original 200-beam model is used exclusively and the computing time is also limited to 3 minutes. In the following subsections each simulation is discussed in detail, and the results are reported following the same procedure from Section 5.3. All power and UD-related metrics are normalized with respect to the optimal power and the total demand, respectively.

5.5.1 Sequential Activation

The first analysis involves the Sequential activation dataset, which models the situation in which demand grows progressively from zero. This might happen after a gateway outage or after new sets of users start being served by the satellite and are introduced into the system in this progressive manner. Figure 5-8 and Figure 5-9 show the normalized power and UD results for this scenario, respectively. Again, these two figures focus on a specific range of the metric space, which prevents the full performance line from being shown for some algorithms. In Appendix A, Figure A-13 and Figure A-14 show the results on the full metric space for the TP and UD metrics, respectively, when considering the Sequential activation dataset. To ease the visual comparison, the same time-wise aggregation procedure from Section 5.3 is followed, and the average value and standard deviation for each metric can be found in Table 5.2. This information can also be visualized using

the “ellipse plot” shown in Figure 5-10.

Figure 5-8: Aggregated power delivered in a continuous execution using the Sequential activation dataset. Power is normalized with respect to the optimal aggregated power.

Table 5.2: Aggregated Power and UD results for each algorithm, using the Sequential Activation dataset. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively.

                 Power               UD
Algorithm        Avg.   Std. dev.   Avg.   Std. dev.
GA               1.13   0.14        0.03   0.02
PSO-GA           1.20   0.44        0.01   0.01
DRL-MLP          2.51   1.23        0.04   0.04
DRL-LSTM         2.41   1.09        0.03   0.04
DRL-GA-MLP       1.04   0.15        0.00   0.01
DRL-GA-LSTM      1.03   0.02        0.00   0.01
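The aggregation behind this table (and behind the ellipse plots, where the per-metric mean gives the ellipse center and the standard deviation gives a semiaxis) can be sketched with Python's statistics module; the series below is illustrative, not thesis data:

```python
from statistics import mean, stdev

# Hypothetical per-timestep normalized power values for one algorithm.
power_series = [1.0, 1.1, 0.9, 1.3, 1.2]

center = mean(power_series)     # reported "Avg." / ellipse center coordinate
semiaxis = stdev(power_series)  # reported "Std. dev." / ellipse semiaxis
```

The same computation applied to the UD series yields the second coordinate of the ellipse center and the second semiaxis.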

For each of the metrics, the algorithms are clearly clustered around two areas, showing either low or high power performance and either low or high UD performance. The hybrid PSO-GA is located in the low power-low

Figure 5-9: Aggregated UD achieved in a continuous execution using the Sequential activation dataset. UD is normalized with respect to the aggregated demand.

UD cluster and preserves its average performance with respect to the Reference scenario, but clearly shows a higher power variance. This is not the case for the GA, which repeats the variance results of the previous analyses, showing robust behavior due to its low power variance. However, in order not to degrade the UD performance, the GA incurs an increased average power consumption, visible as the spikes in Figure 5-8. Since the GA uses the previous timestep's solution as the initial solution for the subsequent timestep, and in this next step more beams are active, the algorithm suddenly carries out a large power increase to preserve UD performance. The algorithm then shows these spikes whenever there is not enough time to reduce this excess power.
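The warm-start mechanism just described can be sketched as below. The GA is reduced here to a toy (1+1) evolutionary loop with a quadratic surrogate fitness; all names and parameters are illustrative, not the Chapter 3 implementation:

```python
import random

def ga_step(initial, fitness, iters=200, sigma=0.1):
    """Toy (1+1) evolutionary step: mutate the incumbent allocation and
    keep the mutant only when it improves the fitness."""
    best, best_f = list(initial), fitness(initial)
    for _ in range(iters):
        cand = [max(0.0, g + random.gauss(0, sigma)) for g in best]
        cand_f = fitness(cand)
        if cand_f < best_f:
            best, best_f = cand, cand_f
    return best

def run_window(demands_per_timestep):
    """Warm start: the solution at timestep t seeds timestep t+1, so newly
    activated beams force a sudden power increase on top of it."""
    n_beams = len(demands_per_timestep[0])
    alloc, history = [0.0] * n_beams, []
    for demand in demands_per_timestep:
        fit = lambda a, d=demand: sum((a[b] - d[b]) ** 2 for b in range(n_beams))
        alloc = ga_step(alloc, fit)  # warm-started from the previous timestep
        history.append(alloc)
    return history
```

When the computing budget (here, `iters`) is too small to undo the sudden increase, the excess power persists as the spikes seen in Figure 5-8.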

However, the “robustifying” behavior of the GA is apparent, as it clearly marks a difference in performance between DRL and DRL-GA. Since its training process involved the Reference dataset (which has a higher average demand per beam), DRL is biased towards allocating higher amounts of power,

especially during the first 24 hours of the test scenario, when less than half of the beams are active. This is why its power performance is clearly poor, regardless of the neural network considered. Nevertheless, adding a GA stage that uses the DRL solution as a warm start is clearly beneficial for the DRL-GA method, as it achieves the best power and UD performance both in terms of average and standard deviation. Finally, although similar, the LSTM results are better than the MLP results in this case.

Figure 5-10: Average power and UD performance per algorithm on the Sequential activation dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses.

5.5.2 Spurious Events

Next, the Spurious dataset is addressed. As explained in Section 4.3, the idea behind this dataset is to account for events that require a substantially larger amount of data rate compared to the average demand over time. While some of the algorithms take advantage of the final solutions at previous timesteps, others rely on the average behavior to learn biases. In both cases, these kinds of spurious events can negatively impact the performance of the algorithm. The power and UD performance of the

algorithms in this scenario is shown in Figure 5-11 and Figure 5-12, respectively. These figures do not show the entire metric space; however, the complete result range can be found in Figure A-15 and Figure A-16 in Appendix A.

Figure 5-11: Aggregated power delivered in a continuous execution using the Spurious dataset. Power is normalized with respect to the optimal aggregated power.

Likewise, Table 5.3 and Figure 5-13 show the mean and standard deviation results per metric and their representation using ellipses, respectively. The results in this case again show two clusters in terms of UD, but a more spread out performance when it comes to power. The spikes in demand have a clear negative effect on the individual algorithms. On the one hand, GA is in the low UD cluster but consistently fails to provide enough power, since its TP value stays below 1 for almost all timesteps. Another effect is its increased standard deviation. On the other hand, the solo DRL algorithms do not allocate high amounts of power this time, as opposed to the Sequential activation scenario, although they attain a worse UD performance. Since this is a mostly “flat” input that strongly deviates from the training data, the neural networks perform poorly. This result highlights the importance of a varied

Figure 5-12: Aggregated UD achieved in a continuous execution using the Spurious dataset. UD is normalized with respect to the aggregated demand.

Table 5.3: Aggregated Power and UD results for each algorithm, using the Spurious dataset. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively.

                 Power               UD
Algorithm        Avg.   Std. dev.   Avg.   Std. dev.
GA               0.95   0.14        0.02   0.01
PSO-GA           1.99   0.54        0.01   0.00
DRL-MLP          0.99   0.12        0.12   0.02
DRL-LSTM         0.86   0.10        0.12   0.03
DRL-GA-MLP       1.23   0.38        0.00   0.00
DRL-GA-LSTM      1.40   0.34        0.00   0.00

offline training stage for the DRL approaches. In contrast, the hybrid algorithms manage to find robust solutions in this scenario and achieve the best UD results. This reinforces the usefulness of the hybrids and the role of the GA as a way to increase robustness a posteriori. However, they incur larger power consumption, both in terms of average and standard

deviation. Among the hybrids, the DRL-GA approaches perform better than PSO-GA; the GA is able to correct the low UD performance of the solo versions in exchange for a larger power variance. It is interesting to point out that the LSTM works better than the MLP in the solo DRL case, while it is the other way around for the hybrids. This might be due to the fact that time does not actually play a relevant role in predicting power needs in this case, as the input can be regarded as a constant value plus sparse spikes. When it comes to the metaheuristics, these spikes create a much more complex solution space with very steep gradients; as a result, the GA fails to find feasible power regions (TP ≥ 1), while the PSO-GA needs to allocate twice the amount of power to remain in the low UD region.

Figure 5-13: Average power and UD performance per algorithm on the Spurious dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses.
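The “constant value plus sparse spikes” structure attributed to this input can be reproduced as a synthetic series; the base level, spike size, and spike probability below are illustrative assumptions, not the Chapter 4 dataset parameters:

```python
import random

def spurious_demand(n_steps, base=1.0, spike=6.0, p=0.05, seed=0):
    """Flat baseline demand with sparse large spikes, mimicking the
    qualitative shape of the Spurious dataset (illustrative values)."""
    rng = random.Random(seed)
    return [base + (spike if rng.random() < p else 0.0)
            for _ in range(n_steps)]

series = spurious_demand(61)  # one value per evaluation timestep
```

Each spike is several times the baseline, which is what produces the steep, hard-to-search regions of the solution space discussed above.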

5.5.3 Non-stationarity

Finally, this last subsection focuses on the Non-stationary dataset. While this dataset represents a highly unlikely situation, it is important to quantify

the performance of the algorithms in cases in which the intrinsic behavior of the demand changes. This is especially relevant for the Machine Learning-based algorithms; although non-stationarity is a hard challenge to overcome in real-world systems [17], it is important to at least quantify its effects. Figure 5-14 and Figure 5-15 show the power and UD performance for the six algorithms considered, respectively. In Appendix A, Figure A-17 and Figure A-18 show the same data with a wider axis range for the UD metric.

Figure 5-14: Aggregated power delivered in a continuous execution using the Non-stationary dataset. Power is normalized with respect to the optimal aggregated power.

Again, the aggregated results across the time axis can be found in Table 5.4, which shows the average power and UD achieved by each algorithm as well as their respective standard deviations. Figure 5-16 then translates this information into the “ellipse plot”. In terms of UD, it can be clearly seen that non-stationarity degrades the performance of all six algorithms. The standard deviation of the power metric also increases with respect to the online operation analyses.

Figure 5-15: Aggregated UD achieved in a continuous execution using the Non-stationary dataset. UD is normalized with respect to the aggregated demand.

Table 5.4: Aggregated Power and UD results for each algorithm, using the Non-stationary dataset. Power and UD are normalized with respect to the optimal aggregated power and aggregated demand, respectively.

                 Power               UD
Algorithm        Avg.   Std. dev.   Avg.   Std. dev.
GA               1.01   0.17        0.06   0.03
PSO-GA           1.37   0.57        0.02   0.01
DRL-MLP          1.16   0.37        0.12   0.10
DRL-LSTM         1.14   0.39        0.11   0.11
DRL-GA-MLP       1.04   0.06        0.02   0.01
DRL-GA-LSTM      1.05   0.07        0.02   0.01

First, GA achieves the third smallest standard deviation in power consumption but incurs higher UD penalties compared to its performance on the Reference dataset (6% vs. 3%). The PSO-GA trades off these same two metrics, as it achieves a better UD performance than the GA at the cost of a higher average power consumption. This algorithm also shows the worst standard deviation

Figure 5-16: Average power and UD performance per algorithm on the Non-stationary dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses.

for the TP metric, showing a considerable number of power spikes, which is not a sign of robustness in this specific scenario.

In the case of the Machine Learning-based algorithms, given the similarity in demand magnitude between the Reference and Non-stationary datasets, DRL does not show any power bias, but its performance is negatively affected with respect to the previous results, especially in terms of UD (12% vs. 3%). DRL is only able to obtain good UD results at the timesteps where the assignment of each demand time series to a beam matches the one used in the training dataset. However, this behavior is corrected by the DRL-GA hybrid, which shows the best results for both metrics in terms of average and standard deviation. For the DRL-GA hybrids, there is not much difference between the two neural networks considered.

5.5.4 Conclusions on Robustness

This section has focused on the robustness of the six algorithms that have shown the best performance so far. Specifically, they have been tested on three different scenarios: a sequential activation scenario that models a progressive demand increase, which could occur when new users come into the system or after some failure; a scenario modelling the effects of sudden, unpredictable demand spikes; and a scenario where the correlations between beams and timesteps change due to non-stationarity. While each scenario has its own specific considerations, all results presented in this section emphasize the importance of evaluating algorithm performance from different perspectives and trying to quantify robustness.

On the algorithm level, there is significant variety in how different methods react to these contingency scenarios. Approaches that show a strong UD performance in the baseline case, such as solo DRL and the PSO-GA hybrid, turn out to be less robust when the demand behavior differs from the training data (in the case of DRL) or shows a bigger relative change than the algorithm can afford (in the case of PSO-GA). The response of both algorithms to these “new” scenarios is either to allocate excessive amounts of power or to incur higher UD penalties. In the case of the DRL algorithms, regardless of the neural network used, the agents learn specific power biases that have a negative impact when the demand is substantially lower (as in the Sequential activation dataset) or higher (as in the Spurious dataset) than seen in the training stage.

In contrast, the GA shows one of the most robust performances and also turns out to have a “robustifying” effect. When a second GA stage is added to the result given by DRL, the performance greatly improves in both power and UD, since both the average and the standard deviation decrease. This result holds in all scenarios, placing the hybrid DRL-GA in the best position in this downselection process. Finally, the solo GA algorithm, albeit not the best option when tested under the Reference dataset premises, shows an admissibly robust behavior, as it consistently attains low power variability, thus reducing the uncertainty regarding its performance. This result is important, since when a prior DRL solution is not

available, the GA might be the best alternative.

Chapter 6

Conclusions

6.1 Thesis Summary

This Thesis addresses the study and comparison of multiple Artificial Intelligence (AI) algorithms for the Dynamic Resource Management (DRM) problem in the context of Flexible High Throughput Satellites. The first chapter starts with an overview of the current and future satellite communications market scenario and explains why the DRM problem is relevant for the field. After discussing the importance of autonomous engines based on AI algorithms, the relevant literature is introduced, emphasizing the main contributions and gaps of related studies. Then, the research gap addressed by this Thesis is identified and an in-depth comparison of existing and novel algorithms is motivated. Finally, the chapter ends with a description of the specific objectives of this work. Chapter 2 then provides the essential theoretical background on the DRM problem and communications satellites. It starts by defining the DRM problem in a general context, emphasizing its relation to time. Next, the chapter explains the main features and elements of a communications satellite system, and then discusses how the DRM problem is framed in the context of multibeam satellite communications. After discussing the application of AI in this Thesis’s context, the Dynamic Power Allocation (DPA) problem is formally introduced. This problem serves as a specific use case throughout the rest of the Thesis. As part of the final section, the power

allocation mechanisms are reviewed and a precise problem formulation is provided.

Next, Chapter 3 focuses on the specific implementations of each of the algorithms considered. This Thesis studies three algorithm families: metaheuristics, Machine Learning, and hybrid algorithms. The first group includes Genetic Algorithms (GA), Simulated Annealing (SA), and Particle Swarm Optimization (PSO); Deep Reinforcement Learning (DRL) is the only algorithm considered from the Machine Learning-based family; and finally, three hybridizations are studied: SA-GA, PSO-GA, and DRL-GA. In the case of the algorithms involving a DRL stage, two neural networks are considered separately and constitute different algorithms: Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM) networks. This chapter presents the implementation details for each of them as well as the hyperparameters used.

The fourth chapter is centered around the simulation models used to support this Thesis. First, the GEO satellite model is introduced and the details of the user terminals are given. Then, the different demand models considered are presented. This study uses four different demand models in the form of four datasets: a Reference dataset accounting for an average operation cycle, a Sequential activation dataset modelling the recovery after failures or the addition of new users into the system, a Spurious dataset representing random demand spikes, and finally a Non-stationary dataset focusing on the cases in which demand volume is preserved but correlations between beams change over time. Chapter 4 ends by covering the main details of the link budget model used in this Thesis to compute the performance of the algorithms.

Lastly, Chapter 5 presents the results of the four different analyses carried out as part of the in-depth comparison, in terms of total power consumed (TP) and Unmet Demand (UD). Focusing on the metaheuristic algorithms, the first section studies their convergence rate and shows how performance is closely related to time in some cases. The second section then focuses on performance during average online operation cycles, comparing all nine algorithms. Next, the third section analyzes the scalability behavior of the metaheuristic and hybrid metaheuristic algorithms, which plays an important role when larger systems are considered. Finally, this chapter closes with

the robustness analyses, which account for the performance of the algorithms when infrequent events occur to which they must nonetheless react. This section includes three different robustness analyses, one for each of the additional datasets.

6.2 Main Findings

The main objective of this Thesis has been to compare and contrast the performance of nine different algorithms or implementations under a set of realistic models and simulations that account for future trends in the satellite communications industry, including near-real time optimization, scalability, and robustness. The results of this comparison have shown dominant algorithms among the studied set. First, while PSO and SA are fast algorithms in terms of obtaining a good solution, both fail to find the global optimum region and get stuck in local optima. Furthermore, given the sequential behavior of the latter, its use has been shown to be clearly constrained by the dimensionality of the model used. The analyses have shown that SA is able to find solutions, under the 3-minute runtime timeout, for models up to 400 beams but is unsuccessful from the 1000-beam model onward. It is not clear where the performance threshold lies between the 400-beam and the 1000-beam models. This issue also affects the performance of the SA-GA hybrid: although it demonstrated strong UD behavior in the 200-beam online operation analyses, its poor scaling properties make it unfit for near-real time systems with large numbers of configurable variables. On the contrary, the PSO-GA hybrid has also shown one of the best performances in terms of UD and has been able to scale. However, its “bursty” behavior in terms of power needs to be further addressed, which is especially relevant when it comes to robustness. Then, the last of the individual metaheuristic approaches, the GA, has shown an interesting behavior that is clearly affected by the quality of a warm start and the computing time available. Given its asymptotically optimal behavior, this algorithm’s performance improves when more computing time can be afforded. This Thesis has

assumed a 3-minute stop rule; other settings in which the change in demand is not as frequent, and therefore the parameter update does not need to happen as often, might obtain even better solutions from the GA, both as an individual approach and in the form of hybrids. It is in this latter category that the GA has especially excelled, as it has been shown to possess a “robustifying” behavior. A subsequent GA stage has greatly improved the average performance and variance of both metrics for each of the three hybrid algorithms with respect to their individual counterparts (e.g., SA-GA has obtained better results than SA thanks to the GA stage).

Next, the DRL-based methods have proved that their study is relevant and that they are definitely an interesting research direction when it comes to DRM problems. The individual DRL approach, regardless of the neural network used as policy, has proved to be the fastest algorithm and, when the training process is carried out appropriately, its performance in terms of UD matches the SA-GA’s and PSO-GA’s. However, robustness is a critical issue for this algorithm, which motivates further study of the relationship between the training and operation phases in DRL models. The robustness problems are solved by adding a subsequent GA stage after the DRL policy provides a solution; the DRL-GA hybrid has been one of the best outcomes of this Thesis. On the one hand, its performance during the online operation analyses has been clearly dominant both in terms of power and UD; on the other hand, thanks to the GA stage, it has been the most robust algorithm, regardless of the neural network with which it is paired, achieving less variance in its results than the individual GA approach.

In light of these results, three algorithms are identified as potential candidates for use during real satellite operations. In cases in which time is especially critical and there is a strong need to obtain a solution in seconds, implementing a DRL model is definitely the best option, especially as it has shown great service performance (with low UD) on the Reference dataset, which accounts for the average demand scenario. If time is not such an important consideration, the best option is undoubtedly the DRL-GA hybrid, which, on top of the best UD performance across all datasets

compared, has shown better (lower) power results, in addition to being robust. Finally, there might be cases in which the demand behavior of the users changes drastically or the algorithm is deployed on a new system with new users. In these cases the operator cannot guarantee a low uncertainty threshold from the start and the DRL architecture might not be ready to address these changes (it needs a batch of training iterations first). The robustness of the GA then makes it the strongest candidate, as it has shown one of the lowest standard deviation values across all scenarios.

6.3 Future Work

Based on the results and the modelling decisions considered in this Thesis, different directions of future research are identified:

∙ Explore ways to improve the performance of the DRL-GA hybrid. The analyses have shown that this algorithm has a clear advantage in terms of performance and robustness. The implementation considered for this Thesis merges two independent implementations that have each been optimized to their best individual performance. It would be interesting to understand how this performance could be further improved if the implementations were jointly optimized instead. Although there are multiple ideas that follow this direction, a clear path would be to understand how to use the performance metrics from the GA stage to learn better policies in the DRL stage. This idea is similar to the joint optimization procedures followed in the latest drug design studies that use DRL [51].

∙ Understand the impact of the DRL training considerations on the whole performance. The details of the training stage in the DRL architectures have been mostly left out of this Thesis, but it is well-known that the training stage plays a relevant role in the performance during the online or testing stage [34]. Figuring out how scalability or a more diversified training dataset that accounts for robustness affects the requirements of the training stage in order

to maintain the same performance, and vice versa, would be another interesting research direction.

∙ Identify better architectures and representations for the neural network policy. Although the specific neural network used as policy in the DRL architecture has not significantly impacted the performance of the algorithms, both neural networks have been regarded as the same submodular block of the DRL architecture, whereas their benefit might be increased by better tailoring their structure to the problem. For instance, the state representation has been the same for both neural networks, but this does not necessarily mean that it is the best representation for one or both networks. These kinds of relationships between the different submodules of the DRL architecture need to be further addressed.

∙ Consider more complex DRM subproblems in the satellite communications context and DRM problems in other domains. This Thesis has considered the DPA problem as the comparison use case, and it has been representative enough to observe performance differences and downselect to a set of suitable algorithms. A natural follow-up is the study of more complex DRM subproblems (e.g., joint power and frequency optimization) and the adaptation of these algorithms to work under more challenging premises. It would be interesting to examine the performance commonalities across multiple domains in which DRM problems are present and across different subproblems in the same domain.

Appendix A

Additional Figures

A.1 Convergence Analyses

This section presents additional plots of the simulations conducted in Section 5.2.

Figure A-1: Average aggregated SGM and 95% confidence interval against computing time available for the SA algorithm. Reference scenario used.

A.2 Continuous Operation

This section presents additional plots of the simulations conducted in Section 5.3.

Figure A-2: Aggregated power delivered by every algorithm during the continuous execution simulations. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used.


Figure A-3: Aggregated UD achieved by every algorithm during the continuous execution simulations. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used.

Figure A-4: Average power and UD performance per algorithm on the Reference dataset. Standard deviation for each metric is shown as one of the semiaxes of the ellipses.

A.3 Scalability Analyses

This section presents additional plots of the simulations conducted in Section 5.4.

Figure A-5: Aggregated power delivered by each metaheuristic algorithm for the scalability test using 200 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used.


Figure A-6: Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 200 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used.


Figure A-7: Aggregated power delivered by each metaheuristic algorithm for the scalability test using 400 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used.


Figure A-8: Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 400 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used.


Figure A-9: Aggregated power delivered by each metaheuristic algorithm for the scalability test using 1000 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used.


Figure A-10: Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 1000 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used.


Figure A-11: Aggregated power delivered by each metaheuristic algorithm for the scalability test using 2000 beams. Power is normalized with respect to the optimal aggregated power (optimal power is 1). Reference scenario used.


Figure A-12: Aggregated UD achieved by each metaheuristic algorithm for the scalability test using 2000 beams. UD is normalized with respect to aggregated demand (optimal UD is 0). Reference scenario used.

A.4 Robustness Analyses

This section presents additional plots of the simulations conducted in Section 5.5.

Figure A-13: Aggregated power delivered in a continuous execution using the Sequential activation dataset. Power is normalized with respect to the optimal aggregated power.


Figure A-14: Aggregated UD achieved in a continuous execution using the Sequential activation dataset. UD is normalized with respect to the aggregated demand.


2.5

2.0

1.5

Normalized power 1.0

0.5

0.0 0 4 9 14 19 24 28 33 38 43 48 Time (h)

Figure A-15: Aggregated power delivered in a continuous execution using the Spurious dataset. Power is normalized with respect to the optimal aggregated power.

[Plot: normalized UD vs. time (h); curves for GA, DRL-MLP, DRL-GA-MLP, PSOGA, DRL-LSTM, and DRL-GA-LSTM]

Figure A-16: Aggregated UD achieved in a continuous execution using the Spurious dataset. UD is normalized with respect to the aggregated demand.

[Plot: normalized power vs. time (h); curves for GA, DRL-MLP, DRL-GA-MLP, PSOGA, DRL-LSTM, and DRL-GA-LSTM]

Figure A-17: Aggregated power delivered in a continuous execution using the Non-stationary dataset. Power is normalized with respect to the optimal aggregated power.

[Plot: normalized UD vs. time (h); curves for GA, DRL-MLP, DRL-GA-MLP, PSOGA, DRL-LSTM, and DRL-GA-LSTM]

Figure A-18: Aggregated UD achieved in a continuous execution using the Non-stationary dataset. UD is normalized with respect to the aggregated demand.

Appendix B

Metric details

B.1 Satisfaction-Gap Measure

The Satisfaction-Gap Measure (SGM) is an objective metric that quantifies the mismatch between a certain allocation and the ideal allocation, with a focus on fairness. This section covers the basic details of the metric; a complete description of the reasoning behind it can be found in [8].

Following the notation used in Section 2.4, each of the N_b beams of a satellite is requested a certain data rate demand D_b and, after being allocated a power level P_b, serves a total data rate R_b; this applies to every b in the set {1, . . . , N_b}. Based on the requested demand and the provided data rate, the satisfaction index of beam b is defined as

SI_b = R_b / D_b.    (B.1)

Similarly, the data rate difference for each beam is defined as

Δ_b = R_b − D_b.    (B.2)

With these two values, N_b different data points can be located in the satisfaction/gap (SG) complex plane, with the form c_b = SI_b + j · Δ_b. Then, for all beams with SI_b < 1 (i.e., underserved beams) the following transformation is applied:

Re{c_b} ← 1 − 1 / Re{c_b},    ∀c_b : Re{c_b} < 1    (B.3)

If, in contrast, Re{c_b} ≥ 1, this other transformation is applied:

c_b ← c_b − 1,    ∀c_b : Re{c_b} ≥ 1    (B.4)

Then, to compensate for the difference in order of magnitude between SI_b and Δ_b, the following transformation is applied to all beams:

Im{c_b} ← Im{c_b} / β    (B.5)

where β, a positive real number, is a design parameter. For this Thesis, β has been chosen as the average demand across all beams. Next, all points in the SG plane undergo the following transformation:

|c_b| ← 1 − e^(−|c_b|)    (B.6)

Finally, the SGM metric is computed as

SGM = 1 − (1/N_b) · Σ_{b=1}^{N_b} |c_b|^3    (B.7)
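As a sketch under the definitions above, the SGM can be computed directly from the per-beam rates and demands. The function name and array-based interface below are illustrative, not part of [8]; β defaults to the average demand, as chosen in this Thesis:

```python
import numpy as np

def sgm(R, D, beta=None):
    """Sketch of the Satisfaction-Gap Measure, equations (B.1)-(B.7).

    R, D: per-beam served data rates and demands (same units).
    beta: scaling parameter; defaults to the average demand.
    """
    R = np.asarray(R, dtype=float)
    D = np.asarray(D, dtype=float)
    if beta is None:
        beta = D.mean()

    si = R / D                                   # satisfaction index, (B.1)
    delta = R - D                                # data rate gap, (B.2)

    # Points c_b = SI_b + j*delta_b in the SG plane, shifted so a perfect
    # allocation maps to the origin: (B.3) for underserved beams,
    # (B.4) otherwise, and (B.5) to rescale the imaginary part.
    re = np.where(si < 1.0, 1.0 - 1.0 / si, si - 1.0)
    im = delta / beta

    mag = 1.0 - np.exp(-np.abs(re + 1j * im))    # (B.6), magnitudes in [0, 1)
    return 1.0 - np.mean(mag ** 3)               # (B.7)
```

A perfect allocation (R_b = D_b for all beams) yields SGM = 1, and the metric decreases toward 0 as the allocation departs further from the demand.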

B.2 Power calculation heuristic

For a specific beam b, the heuristic that computes the power P_b necessary to satisfy a certain demand D_b follows these steps:

1. A MODCOD is selected from all considered MODCODs; it serves as an optimistic initial guess.

2. The margin γ_b is computed for this MODCOD.

3. Following constraint (2.11) from Section 2.4.2, if γ_b ≤ γ_M, the next MODCOD from the list is selected and the margin is recomputed.

4. Once a valid MODCOD is found, it is used to compute the power necessary to achieve a data rate of at least D_b, using equations (4.1)–(4.6).

After computing the necessary power for every beam, the amplifier constraint (2.10) is checked for each amplifier. If the sum of powers for an amplifier is above the threshold, the powers of these beams are reduced proportionally.
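A minimal sketch of the MODCOD selection (steps 1–3) and the proportional amplifier post-processing follows. It assumes the per-MODCOD link margins and the rate-to-power computation of equations (4.1)–(4.6) are evaluated externally; the function names are illustrative:

```python
def select_modcod(margins, gamma_M):
    """Steps 1-3: walk an ordered list of per-MODCOD link margins
    (most efficient MODCOD first, the optimistic initial guess) and
    return the index of the first MODCOD whose margin exceeds gamma_M,
    per constraint (2.11)."""
    for i, gamma_b in enumerate(margins):
        if gamma_b > gamma_M:
            return i
    return len(margins) - 1      # fall back to the most robust MODCOD

def enforce_amplifier_limit(powers, p_max):
    """Post-processing: if the beams sharing one amplifier exceed its
    power threshold (constraint (2.10)), scale them down
    proportionally."""
    total = sum(powers)
    if total <= p_max:
        return list(powers)
    return [p * p_max / total for p in powers]
```

For example, with margins [0.5, 1.2, 3.0] and gamma_M = 1.0, the second MODCOD is selected; an amplifier carrying beams at powers [3.0, 1.0] with a 2.0 W limit would be scaled to [1.5, 0.5].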

Bibliography

[1] Nadine Abbas, Youssef Nasser, and Karim El Ahmad. Recent advances on artificial intelligence and learning techniques in cognitive radio networks. EURASIP Journal on Wireless Communications and Networking, 2015(1), 2015.

[2] Piero Angeletti, Riccardo De Gaudenzi, and Marco Lisi. From "bent pipes" to "software defined payloads": evolution and trends of satellite communications systems. In 26th International Communications Satellite Systems Conference (ICSSC), 2008.

[3] Piero Angeletti, David Fernandez Prim, and Rita Rinaldo. Beam Hopping in Multi-Beam Broadband Satellite Systems: System Performance and Payload Architecture Analysis. In 24th AIAA International Communications Satellite Systems Conference, Reston, Virginia, jun 2006. American Institute of Aeronautics and Astronautics.

[4] Alexis I. Aravanis, Bhavani Shankar M. R., Pantelis-Daniel Arapoglou, Gregoire Danoy, Panayotis G. Cottis, and Bjorn Ottersten. Power Allocation in Multibeam Satellite Systems: A Two-Stage Multi-Objective Optimization. IEEE Transactions on Wireless Communications, 14(6):3171–3182, jun 2015.

[5] Hamsa Balakrishnan. Control and optimization algorithms for air transportation systems. Annual Reviews in Control, 41:39–46, 2016.

[6] Jean-Thomas Camino, Christian Artigues, Laurent Houssin, and Stephane Mourgues. Mixed-integer linear programming for multibeam satellite systems design: Application to the beam layout optimization. In 2016 Annual IEEE Systems Conference (SysCon), pages 1–6. IEEE, apr 2016.

[7] Jean-Thomas Camino, Stephane Mourgues, Christian Artigues, and Laurent Houssin. A greedy approach combined with graph coloring for non-uniform beam layouts under antenna constraints in multibeam satellite systems. In 2014 7th Advanced Satellite Multimedia Systems Conference and the 13th Signal Processing for Space Communications Workshop (ASMS/SPSC), pages 374–381. IEEE, sep 2014.

[8] Giuseppe Cocco, Tomaso De Cola, Martina Angelone, Zoltan Katona, and Stefan Erl. Radio Resource Management Optimization of Flexible Satellite Payloads for DVB-S2 Systems. IEEE Transactions on Broadcasting, 64(2):266–280, 2018.

[9] C A Coello Coello and Maximino Salazar Lechuga. MOPSO: A proposal for multiple objective particle swarm optimization. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No. 02TH8600), volume 2, pages 1051–1056. IEEE, 2002.

[10] Martin Coleman. Is AI key to the survival of satcoms?, 2019.

[11] A. Corana, M. Marchesi, C. Martini, and S. Ridella. Minimizing Multimodal Functions of Continuous Variables with the "Simulated Annealing" Algorithm. ACM Transactions on Mathematical Software, 15(3):287, 1989.

[12] Edward Crawley, Bruce Cameron, and Daniel Selva. System architecture: Strategy and product development for complex systems. Prentice Hall Press, 2015.

[13] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.

[14] Inigo del Portillo, Bruce G. Cameron, and Edward F. Crawley. A technical comparison of three low earth orbit satellite constellation systems to provide global broadband. Acta Astronautica, 2019.

[15] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

[16] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.

[17] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of Real-World Reinforcement Learning. arXiv preprint arXiv:1904.12901, 2019.

[18] Fabio Renan Durand and Taufik Abrão. Power allocation in multibeam satellites based on particle swarm optimization. AEU - International Journal of Electronics and Communications, 78:124–133, 2017.

[19] Russell Eberhart and James Kennedy. A new optimizer using particle swarm theory. In MHS'95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pages 39–43. IEEE, 1995.

[20] Larry J Eshelman and J David Schaffer. Real-coded genetic algorithms and interval-schemata. In Foundations of genetic algorithms, volume 2, pages 187–202. Elsevier, 1993.

[21] ETSI EN 302 307-2. Digital Video Broadcasting (DVB); Second generation framing structure, channel coding and modulation systems for Broadcasting, Interactive Services, News Gathering and other broadband satellite applications; Part 2: DVB-S2 Extensions (DVB-S2X). Technical report, 2015.

[22] Paulo Victor Rodrigues Ferreira, Randy Paffenroth, Alexander M. Wyglinski, Timothy M. Hackett, Sven G. Bilen, Richard C. Reinhart, and Dale J. Mortensen. Multiobjective Reinforcement Learning for Cognitive Satellite Communications Using Deep Neural Network Ensembles. IEEE Journal on Selected Areas in Communications, 36(5):1030–1041, 2018.

[23] F. A. Fortin, D. Rainville, M. A. G. Gardner, M. Parizeau, and C. Gagné. DEAP: Evolutionary Algorithms Made Easy. Journal of Machine Learning Research, 13:2171–2175, 2012.

[24] N Funabiki and S Nishikawa. A gradual neural-network approach for frequency assignment in satellite communication systems. IEEE Transactions on Neural Networks, 8(6):1359–1370, 1997.

[25] Juan Jose Garau Luis, Markus Guerster, Inigo del Portillo, Edward Crawley, and Bruce Cameron. Deep Reinforcement Learning Architecture for Continuous Power Allocation in High Throughput Satellites. arXiv preprint arXiv:1906.00571, 2019.

[26] Harish Garg. A hybrid PSO-GA algorithm for constrained optimization problems. Applied Mathematics and Computation, 2016.

[27] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000.

[28] S P Ghoshal. Application of GA/GA-SA based fuzzy automatic generation control of a multi-area thermal generating system. Electric Power Systems Research, 70(2):115–127, 2004.

[29] Anupriya Gogna and Akash Tayal. Metaheuristics: review and application. Journal of Experimental & Theoretical Artificial Intelligence, 25(4):503–526, dec 2013.

[30] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[31] Markus Guerster, Juan Jose Garau Luis, Edward F Crawley, and Bruce G Cameron. Problem representation of dynamic resource allocation for flexible high throughput satellites. In 2019 IEEE Aerospace Conference, 2019.

[32] Gunter. O3b 21, ..., 27 (O3b mPower), https://space.skyrocket.de/doc_sdat/o3b-21.htm, april 2020.

[33] YuanZhi He, YiZhen Jia, and XuDong Zhong. A traffic-awareness dynamic resource allocation scheme based on multi-objective optimization in multi-beam mobile satellite communication systems. International Journal of Distributed Sensor Networks, 13(8):155014771772355, aug 2017.

[34] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, 2018.

[35] Heng Wang, Aijun Liu, Xiaofei Pan, and Luliang Jia. Optimal bandwidth allocation for multi-spot-beam satellite communication systems. In Proceedings 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), pages 2794–2798. IEEE, dec 2013.

[36] John Henry Holland and Others. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992.

[37] Xin Hu, Shuaijun Liu, Rong Chen, Weidong Wang, and Chunting Wang. A Deep Reinforcement Learning-Based Framework for Dynamic Resource Allocation in Multibeam Satellite Systems. IEEE Communications Letters, 22(8):1612–1615, 2018.

[38] Xin Hu, Shuaijun Liu, Yipeng Wang, Lexi Xu, Yuchen Zhang, Cheng Wang, and Weidong Wang. Deep reinforcement learning-based beam Hopping algorithm in multibeam satellite systems. IET Communications, 2019.

[39] Zhe Ji, Youzheng Wang, Wei Feng, and Jianhua Lu. Delay-Aware Power and Bandwidth Allocation for Multiuser Satellite Downlinks. IEEE Communications Letters, 18(11):1951–1954, nov 2014.

[40] Hyunchul Kim, Yasuhiro Hayashi, and Koichi Nara. An algorithm for thermal unit maintenance scheduling through combined use of GA SA and TS. IEEE Power Engineering Review, 1997.

[41] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[42] Jiang Lei and Maria Angeles Vazquez-Castro. Joint power and carrier allocation for the multibeam satellite downlink with individual SINR constraints. In 2010 IEEE International Conference on Communications, pages 1–5. IEEE, 2010.

[43] Nguyen Cong Luong, Dinh Thai Hoang, Shimin Gong, Dusit Niyato, Ping Wang, Ying-Chang Liang, and Dong In Kim. Applications of Deep Reinforcement Learning in Communications and Networking: A Survey. IEEE Communications Surveys & Tutorials, 21(4):3133–3174, 2019.

[44] Gérard Maral and Michel Bousquet. Satellite communications systems: systems, techniques and technology. John Wiley & Sons, 2011.

[45] Rajesh Mehrotra. Regulation of Global Broadband Satellite Communications. Technical report, ITU, 2011.

[46] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, and Others. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[47] Northern Sky Research. IFC Market project to reach $36 billion in cumulative revenue by 2028. Technical report, 2019.

[48] Northern Sky Research. VSAT and Broadband Satellite Markets. Technical report, 2019.

[49] Nils Pachler, Juan Jose Garau Luis, Markus Guerster, Edward F Crawley, and Bruce G Cameron. Allocating Power and Bandwidth in Multibeam Satellite Systems using Particle Swarm Optimization. In 2020 IEEE Aerospace Conference, 2020.

[50] Aleix Paris, Inigo del Portillo, Bruce G Cameron, and Edward F Crawley. A Genetic Algorithm for Joint Power and Bandwidth Allocation in Multibeam Satellite Systems. In 2019 IEEE Aerospace Conference. IEEE, 2019.

[51] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. Science Advances, 4(7):eaap7885, jul 2018.

[52] D Janaki Ram, T H Sreenivas, and K Ganapathy Subramaniam. Parallel simulated annealing algorithms. Journal of Parallel and Distributed Computing, 37(2):207–212, 1996.

[53] R Rinaldo, X Maufroid, and R Casaleiz Garcia. Non-uniform bandwidth and power allocation in multi-beam broadband satellite systems. Proceedings of the AIAA, 2005.

[54] Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. 2002.

[55] S Salcedo-Sanz and C Bousoño-Calzón. A Hybrid Neural-Genetic Algorithm for the Frequency Assignment Problem in Satellite Communications. Applied Intelligence, 22(3):207–217, may 2005.

[56] S Salcedo-Sanz, R Santiago-Mozos, and C Bousono-Calzon. A Hybrid Hopfield Network-Simulated Annealing Approach for Frequency Assignment in Satellite Communications Systems. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 34(2):1108–1116, apr 2004.

[57] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[58] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[59] SES. O3b mPower, https://www.ses.com/networks/networks-and-platforms/o3b-mpower, april 2020.

[60] X. H. Shi, Y. C. Liang, H. P. Lee, C. Lu, and L. M. Wang. An improved GA and a novel PSO-GA-based hybrid algorithm. Information Processing Letters, 2005.

[61] Yuhui Shi and Russell Eberhart. A modified particle swarm optimizer. In 1998 IEEE international conference on evolutionary computation proceedings. IEEE world congress on computational intelligence (Cat. No. 98TH8360), pages 69–73. IEEE, 1998.

[62] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

[63] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.

[64] El-Ghazali Talbi. Metaheuristics: from design to implementation, volume 74. John Wiley & Sons, 2009.

[65] Aizaz Tirmizi, Ravi Shankar Mishra, and A S Zadgaonkar. An efficient channel assignment strategy in cellular mobile network using hybrid genetic algorithm. International Journal of Electrical, Electronics and Computer Engineering, 4(2):127, 2015.

[66] Heng Wang, Aijun Liu, Xiaofei Pan, and Jianfei Yang. Optimization of power allocation for multiusers in multi-spot-beam satellite communication systems. Mathematical Problems in Engineering, 2014, 2014.

[67] Zhen Xiao, Weijia Song, and Qi Chen. Dynamic Resource Allocation Using Virtual Machines for Cloud Computing Environment. IEEE Transactions on Parallel and Distributed Systems, 24(6):1107–1117, jun 2013.

[68] Hao Ye and Geoffrey Ye Li. Deep Reinforcement Learning for Resource Allocation in V2V Communications. IEEE International Conference on Communications, 2018-May, 2018.

[69] Hongmei Yu, Haipeng Fang, Pingjing Yao, and Yi Yuan. A combined genetic algorithm/simulated annealing algorithm for large scale system energy integration. Computers & Chemical Engineering, 24(8):2023–2035, 2000.

[70] Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J Mankowitz, and Shie Mannor. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3562–3573, 2018.

[71] Mahdi Zargayouna, Flavien Balbo, and Khadim Ndiaye. Generic model for resource allocation in transportation. Application to urban parking management. Transportation Research Part C: Emerging Technologies, 71:538–554, oct 2016.

[72] Wenyu Zhang, Shuai Zhang, Shanshan Guo, Yushu Yang, and Yong Chen. Concurrent optimal allocation of distributed manufacturing resources using extended Teaching-Learning-Based Optimization. International Journal of Production Research, 55(3):718–735, feb 2017.
