A New Solution for Markov Decision Processes and Its Aerospace Applications
Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations
2020

A new solution for Markov Decision Processes and its aerospace applications

Joshua R. Bertram, Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation:
Bertram, Joshua R., "A new solution for Markov Decision Processes and its aerospace applications" (2020). Graduate Theses and Dissertations. 17832. https://lib.dr.iastate.edu/etd/17832

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

A new solution for Markov Decision Processes and its aerospace applications

by

Joshua Bertram

A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Major: Computer Engineering

Program of Study Committee:
Peng Wei, Co-major Professor
Joseph Zambreno, Co-major Professor
Phillip Jones

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this thesis. The Graduate College will ensure this thesis is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2020

Copyright © Joshua Bertram, 2020. All rights reserved.

DEDICATION

I would like to dedicate this thesis to my wife Gretchen, son Kaiden, and daughter Reyna, without whose support I would not have been able to complete this work.

TABLE OF CONTENTS

LIST OF TABLES ..... v
LIST OF FIGURES ..... vi
ACKNOWLEDGMENTS ..... x
ABSTRACT ..... xi

CHAPTER 1. INTRODUCTION ..... 1
  1.1 Markov Decision Process Background ..... 1
  1.2 Value Iteration ..... 2
  1.3 Markov Decision Process Related Work ..... 3
  1.4 Nature and Structure of Value Function ..... 5

CHAPTER 2. POSITIVE REWARDS: EXACT ALGORITHM ..... 9
  2.1 Introduction ..... 9
  2.2 Methodology ..... 9
    2.2.1 MDP Transition Graphs ..... 10
    2.2.2 Exact Solutions for a Single Reward Source ..... 12
    2.2.3 Exact Solution for Multiple Reward Sources ..... 19
    2.2.4 Algorithm ..... 20
    2.2.5 Proof of Algorithm Correctness ..... 25
    2.2.6 Part 1: Bellman optimality and maximum value ..... 25
    2.2.7 Part 2: Algorithm calculation of maximum value ..... 30
  2.3 Experiments ..... 40
  2.4 Conclusion ..... 42

CHAPTER 3. POSITIVE REWARDS: MEMORYLESS ALGORITHM ..... 43
  3.1 Introduction ..... 43
  3.2 Methodology ..... 43
    3.2.1 Extracting the Optimal Trajectory ..... 50
  3.3 Experiments ..... 52
  3.4 Conclusion ..... 54

CHAPTER 4. EXPLAINABILITY AND PRINCIPLE OF OPPORTUNITY ..... 56
  4.1 Introduction ..... 56
  4.2 Methodology ..... 57
    4.2.1 Dominance ..... 57
    4.2.2 Identifying Collected Rewards ..... 62
    4.2.3 Relative Contribution ..... 64
  4.3 Stability and Sensitivity ..... 65
  4.4 Principle of Opportunity ..... 65
  4.5 Conclusion ..... 66

CHAPTER 5. NEGATIVE REWARDS: FastMDP ALGORITHM ..... 67
  5.1 Introduction ..... 67
  5.2 Methodology ..... 67
    5.2.1 Negative Reward ..... 67
    5.2.2 Standard Positive Form ..... 71
    5.2.3 Reordering of Operations to Improve Efficiency ..... 72
    5.2.4 Algorithm ..... 74
  5.3 Conclusion ..... 74

CHAPTER 6. APPLICATION: COLLISION AVOIDANCE ..... 76
  6.1 Introduction ..... 76
  6.2 Collision Avoidance Related Work ..... 77
  6.3 Methodology ..... 80
    6.3.1 State Space ..... 80
    6.3.2 Action Space ..... 81
    6.3.3 Dynamic Model ..... 81
    6.3.4 Reward Function ..... 81
  6.4 Results ..... 83
  6.5 Conclusion ..... 85
CHAPTER 7. APPLICATION: PURSUIT EVASION ..... 86
  7.1 Introduction ..... 86
  7.2 Pursuit Evasion Related Work ..... 88
  7.3 Methodology ..... 91
    7.3.1 Dynamic Model ..... 91
    7.3.2 Forward Projection ..... 93
    7.3.3 State Space ..... 94
    7.3.4 Action Space ..... 95
    7.3.5 Reward Function ..... 96
    7.3.6 Algorithm ..... 97
  7.4 Experimental Setup ..... 100
  7.5 Results ..... 101
  7.6 Conclusion ..... 107

CHAPTER 8. FUTURE WORK, SUMMARY, AND DISCUSSION ..... 108
  8.1 Algorithm Implementation Improvements ..... 108
  8.2 Stochastic MDPs ..... 109
  8.3 Incorporating Actions ..... 109
  8.4 New Applications ..... 111

BIBLIOGRAPHY ..... 112

LIST OF TABLES

Table 7.1 Limits on aircraft performance for each team ..... 94
Table 7.2 Action choices for each team ..... 96
Table 7.3 Rewards created for each ownship ..... 97
Table 7.4 Probability of win P_win and probability of survival P_s of the blue team as team size increases ..... 106
Table 7.5 Processing time required for each agent on the red or blue team as team size increases ..... 106
Table 7.6 Links to videos ..... 107

LIST OF FIGURES

Figure 1.2 Illustration of the value function over multiple iterations of the value iteration algorithm for an MDP with non-terminal rewards, demonstrating that value grows outward from the states that contain reward. This can be considered a kind of diffusion process that eventually reaches a steady-state equilibrium, which is known as convergence. ..... 6
Figure 2.1 MDP state graph for a small, deterministic MDP ..... 10
Figure 2.2 Illustration of baseline (blue), combined baseline (red), and delta baseline (green) ..... 20
Figure 2.3 Disjoint subsets of S_Z and S_+ that form S ..... 30
Figure 2.4 Optimal sequence of actions through S_Z until a point in S_+ is reached ..... 32
Figure 2.5 Depiction of the relationship between policy, value function, and optimal solution for V^M ..... 37
Figure 2.7 Varying number of reward sources ..... 40
Figure 2.8 Varying number of states ..... 40
Figure 2.9 Varying discount factor ..... 40
Figure 2.10 Experimental results showing the improved performance of the algorithm compared to value iteration when varying the number of reward sources, the number of states, and the discount factor ..... 40
Figure 3.2 Exact algorithm .....