QUASI-RANDOM ACTION SELECTION IN MARKOV DECISION PROCESSES
by
SAMUEL DALTON WALKER
(Under the Direction of Stephen Carden)
ABSTRACT
In Markov decision processes an operator exploits known data regarding the environment it inhabits. The information exploited is learned from random exploration of the state-action space. This paper proposes to optimize exploration through the implementation of quasi-random sequences in both discrete and continuous state-action spaces. For the discrete case a permutation is applied to the indices of the action space to avoid repetitive behavior. In the continuous case sequences of low discrepancy, such as Halton sequences, are utilized to disperse the actions more uniformly.
INDEX WORDS: Statistics, Stochastic Theory, Markov Decision Processes
2009 Mathematics Subject Classification: 90C40, 60J05

QUASI-RANDOM ACTION SELECTION IN MARKOV DECISION PROCESSES
by
SAMUEL DALTON WALKER
B.S., Middle Georgia State University, 2015
A Thesis Submitted to the Graduate Faculty of Georgia Southern University in Partial
Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
STATESBORO, GEORGIA

© 2017
SAMUEL DALTON WALKER
All Rights Reserved
QUASI-RANDOM ACTION SELECTION IN MARKOV DECISION PROCESSES
by
SAMUEL DALTON WALKER
Major Professor: Stephen Carden
Committee: Arpita Chatterjee, Emil Iacob, Scott Kersey
Electronic Version Approved: December 2017
DEDICATION
I dedicate this to my fiancée, mother, and friends whose aid made all this possible.
ACKNOWLEDGMENTS
I wish to acknowledge Dr. Carden for being a good friend and mentor, as well as the wonderful faculty at Georgia Southern's Mathematics Department.

TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
CHAPTER
1 Introduction
2 Discrete MDP
2.1 Two State System
2.2 Discrete System Dynamics
2.3 Hitting Time Distribution
2.4 Empirical Data
3 Continuous MDP
3.1 Halton Sequence
3.2 Continuous System Dynamics
3.3 Empirical Data
4 Conclusion
REFERENCES
A Probability Matrices
B Hypercube
C Collision Mechanics
LIST OF TABLES
2.1 Statistics For Goal Stop Criterion (Discrete)
2.2 Statistics For Explore Stop Criterion (Discrete)
3.1 Statistics For Goal Stop Criterion (Continuous)
3.2 Statistics For Explore Stop Criterion (Continuous)
A.1 Pr(down)
A.2 Pr(left)
A.3 Pr(right)
A.4 Pr(up)
LIST OF FIGURES
2.1 Basic discrete problem for comparing hitting times under different action selection protocols.
2.2 Grid World is a 5x5 grid maze which is simple enough that we can test our theory quickly, but complicated enough that we avoid trivial solutions. A circle denotes the starting location of the operator and a star denotes the goal state.
2.3 Hitting time distribution for state 3 under random action selection.
2.4 Cumulative hitting time probabilities for state 3 from state 23.
2.5 Cumulative hitting time probabilities for state 3. Each line represents the cumulative distribution for a different initial state.
2.6 Histogram of required epochs until state 3 is reached from state 23 under RAS.
2.7 Histogram of required epochs until state 3 is reached from state 23 under LDAS.
2.8 Histogram of required epochs until the state space has been fully explored under RAS.
2.9 Histogram of required epochs until the state space has been fully explored under LDAS.
2.10 A heatmap of the difference of hitting times for each state (LDAS - RAS). The operator initializes in state 23, hence both action selection protocols have identical hitting times for that particular state.
3.1 The modified Halton sequence given in polar coordinates for primes 2 and 7907.
3.2 The Halton sequence with coprimes 2 and 3. https://en.wikipedia.org/wiki/Halton_sequence
3.3 Continuous analog of Grid World.
3.4 The partition Ω := {S1, S2, S3, ..., S25} of the continuous Grid World state space S.
3.5 First 100 elements of the Halton sequence with co-prime bases 2 and 3. Operator i is in red; the velocity he chooses is element hi of the Halton sequence.
3.6 Histogram of required epochs until a state within partition 3 is reached from the initial state under RAS.
3.7 Histogram of required epochs until a state within partition 3 is reached from the initial state under LDAS with primes 2 and 3.
3.8 Histogram of required epochs until a state within partition 3 is reached from the initial state under LDAS with primes 3 and 5.
3.9 Histogram of required epochs until a state within partition 3 is reached from the initial state under LDAS with primes 5 and 13.
3.10 Histogram of required epochs until sufficient exploration is reached under random action selection.
3.11 Histogram of required epochs until sufficient exploration is reached under low discrepancy action selection with primes 2 and 3.
3.12 Histogram of required epochs until sufficient exploration is reached under low discrepancy action selection with primes 3 and 5.
3.13 Histogram of required epochs until sufficient exploration is reached under low discrepancy action selection with primes 5 and 13.
B.1 Deconstructing the hypercube
C.1 Operator's collision with the environment
LIST OF SYMBOLS

$S_t$ : state of a stochastic process at index $t \in I$
$\Pr(a \mid b)$ : probability of $a$ given $b$
$p_{ij}$ : transition probability from state $i$ to state $j$
$P = (p_{ij})$ : transition probability matrix
$S$ : state space of a Markov decision process
$A_s$ : set of permissible actions from state $s \in S$
$p_{s,s';a}$ : transition probability from state $s$ to state $s'$ under action $a$
$\{r(s,a) \mid s \in S, a \in A_s\}$ : set of rewards for each state and action
$\pi$ : function which specifies the actions the operator will perform
$\gamma(k)$ : discount factor used to weight future rewards
$V_\pi(s)$ : the expected value of policy $\pi$ over the horizon
$E[X]$ : expected value of the random variable $X$
$\pi^*$ : optimal policy
$V^*$ : value under the optimal policy
$Q^*(s,a)$ : state-action value function
$\#$ : cardinality operator
$h(s,t)$ : history random vector
$G_d$ : state space of Grid World
$A(E,N,X)$ : counts the number of the first $N$ indices of $X$ belonging to $E$
$\mathbf{1}_E$ : characteristic function of $E$
$\Delta$ : discrepancy function
$\lambda$ : Lebesgue measure
$D^*_N(X)$ : star discrepancy of $X$
$\Psi$ : radical inverse function
$\Phi_n$ : $n$-dimensional Halton sequence
$\Omega$ : partition of the state space $S$
$P_i$ : $i$th generating prime of a Halton sequence
CHAPTER 1
INTRODUCTION
A stochastic process is a collection of random variables, usually denoted by $\{S_t \mid t \in I\}$, where $t$ is from some indexing set $I$ and $S_t$ is the state of the process at index $t$. It is permissible for the indexing set to be either continuous or discrete. A typical example of a discrete index is when $I$ represents the number of iterations or steps in some process.
Let $\{S_t \mid t \in \mathbb{N} \cup \{0\}\}$ be a stochastic process, where $S_t$ assumes a finite or countable number of possible values. The process is said to be in state $i$ at time $t$ if $S_t = i$. Given that the process is in state $i$ at time $t$, we may describe all possible fixed transition probabilities out of that state. If the transitions satisfy

\[ \Pr(S_{t+1} = j \mid S_t = i, S_{t-1}, \ldots, S_1, S_0) = \Pr(S_{t+1} = j \mid S_t = i) = p_{ij}, \]

then the process is called a Markov chain. The property above is known as the Markovian property. We may interpret the Markovian property as saying that the conditional distribution of the next state in a stochastic process depends only on the present state of the process and is conditionally independent of the past states $S_{t-1}, S_{t-2}, \ldots, S_1, S_0$. Supposing there are $N$ possible states, the transition probabilities may be represented by the matrix $P = (p_{ij}) \in [0,1]^{N \times N}$,

\[ P = \begin{pmatrix} p_{11} & p_{12} & p_{13} & \cdots & p_{1N} \\ p_{21} & p_{22} & p_{23} & \cdots & p_{2N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p_{N1} & p_{N2} & p_{N3} & \cdots & p_{NN} \end{pmatrix}, \]

where $p_{ij}$ is the probability of being in state $S_t = i$ and transitioning to state $S_{t+1} = j$. Each transition probability must satisfy:

1. $p_{ij} \ge 0$ for $1 \le i, j \le N$;

2. $\sum_{j=1}^{N} p_{ij} = 1$ for each $i \in \{1, \ldots, N\}$.
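As a concrete illustration of these definitions, the short sketch below simulates a finite Markov chain from a given transition matrix. The three-state matrix, the function name `simulate_chain`, and the use of NumPy are illustrative assumptions, not objects from the thesis.

```python
import numpy as np

def simulate_chain(P, s0, n_steps, rng=None):
    """Simulate a Markov chain with transition matrix P from initial state s0."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P)
    assert np.allclose(P.sum(axis=1), 1.0), "each row of P must sum to 1"
    states = [s0]
    for _ in range(n_steps):
        # The next state depends only on the current state (Markovian property).
        states.append(rng.choice(P.shape[0], p=P[states[-1]]))
    return states

# A small three-state example with made-up transition probabilities.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
print(simulate_chain(P, s0=0, n_steps=10))
```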
A Markov decision process is a sequential decision model which satisfies the Markov property. At each time step, an operator observes the current state and chooses from a set of actions which are permissible to execute from that state. Associated with each state-action pair is a probability distribution on the states, which governs the transition to the next state. We call each time step an epoch, and at each epoch the process transitions into the next state of the state space based on the transition probabilities from the current state. We will denote the probability of transitioning from state $s$ to state $s'$ under action $a$ by $p_{s,s';a}$. For any state and action there exists a reward random variable, $r(s,a)$, which assigns a (possibly random) reward to each state-action pair. The reward distribution depends only on the state-action pair; it is stationary in time.
Let $A_s$ denote the set of actions permissible from state $s$. Formally, a Markov decision process is a collection of objects
\[ \{\, I,\; S,\; \{A_s \mid s \in S\},\; \{p_{s,s';a} \mid s, s' \in S,\ a \in A_s\},\; \{r(s,a) \mid s \in S,\ a \in A_s\} \,\}. \]
If $\#(I) < \infty$, then the process has a finite horizon. A policy is a mapping from the state space to the action space,
\[ \pi : S \to A, \]
which guides the operator deterministically, specifying the actions of the operator. Note that in general it is possible for the decision rule to be non-stationary; that is, the prescribed action for a state may change with time. However, we will restrict our attention to stationary policies. The goal is to determine the policy which maximizes some measure of reward. Let $\gamma(k)$, with $0 \le \gamma(k) \le 1$, be a discount factor used to weight future rewards. The value of a state $s$ under a policy $\pi$, $V_\pi(s)$, is the expected value of the policy over the horizon:
" ∞ # X Vπ(s) = Eπ γ(k)r(sk, π(sk)) | s0 = s (1.1) k=0
X 0 = E[r(s, π(s))] + γ ps,s0;aVπ(s ), ∀s ∈ S. (1.2) s0∈S Determining an optimal policy using equation (1.1) is difficult because the sum is over all paths. By conditioning on the first state that is transitioned to via the Law of Total Probability, and rearranging into equation (1.2), we arrive at the Bellman Equation [13]. If the expected rewards and transition probabilities are known, we can solve the linear system (1.2) to determine the value of all states under a given policy π. We can improve upon π using policy iteration [10], a topic out of the scope of this introduction. Using Bellman’s equation we may solve for the optimal policy, denoted π∗. Then V ∗, the optimal value, can be determined by the linear system of equations
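To make the linear-system view of equation (1.2) concrete, the sketch below evaluates a fixed policy by solving the system with linear algebra. This is a minimal sketch assuming a constant discount factor; the small two-state MDP and the function name `evaluate_policy` are illustrative assumptions rather than objects from the thesis.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9):
    """Solve (I - gamma * P_pi) V = R_pi for the value of a fixed policy.

    P[a, s, s'] : transition probability from s to s' under action a
    R[s, a]     : expected immediate reward for the state-action pair
    policy[s]   : action chosen in state s
    """
    n_states = P.shape[1]
    # Transition matrix and reward vector induced by the policy.
    P_pi = np.array([P[policy[s], s, :] for s in range(n_states)])
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Illustrative two-state, two-action MDP with made-up dynamics and rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.6, 0.4]]])  # action 1
R = np.array([[0.0, 1.0], [2.0, 0.0]])
print(evaluate_policy(P, R, policy=[1, 0]))
```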
Using the Bellman equation we may also solve for the optimal policy, denoted $\pi^*$. The optimal value $V^*$ can be determined from the system of equations
\[ V^*(s) = \max_{a \in A_s} \left\{ E[r(s,a)] + \gamma \sum_{s' \in S} p_{s,s';a}\, V^*(s') \right\}, \]
and the associated optimal values satisfy $V^*(s) = V_{\pi^*}(s)$. In many applications the transition probabilities and reward distribution are unknown quantities, so additional constructs are necessary for creating a solution method. In particular, the state-value function is generalized to the state-action value function,
\[ Q^*(s,a) = E[r(s,a)] + \gamma \sum_{s' \in S} p_{s,s';a}\, V^*(s'). \]
This function is the basis for reinforcement learning, a family of methods for finding approximately optimal policies from observations of the system. These methods can be classified into one of two categories. First, online learning algorithms explore the state-action space and learn the value function simultaneously. The operator must choose between selecting actions which have not been tried, in order to learn their value, and actions which are known to lead to high reward, in order to accumulate as much reward as possible. This choice is known as the exploration-exploitation dilemma. Examples of online reinforcement learning algorithms are Q-learning, presented in Algorithm 1 [17], and SARSA [16].
Algorithm 1: Q-learning
1: procedure MAIN
2:   initialize $Q(s,a)$ for all $s \in S$, $a \in A_s$; choose step size $\alpha$ and discount $\gamma$
3:   for each iteration do
4:     initialize $s$
5:     while $s$ is non-terminal do
6:       $a \leftarrow$ action selected via a policy derived from $Q$, e.g. uniform or $\epsilon$-greedy
7:       take action $a$, observe reward $r$ and next state $s'$
8:       $Q(s,a) \leftarrow Q(s,a) + \alpha\left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$
9:       $s \leftarrow s'$
10:    end while
11:  end for
12: end procedure

One of the conditions for Q-learning and similar algorithms to converge to an optimal policy is that every state-action pair is visited an infinite number of times.
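A minimal tabular implementation of Algorithm 1 is sketched below. The toy environment class `TwoStateEnv`, its `reset()`/`step()` interface, and the hyperparameter values are assumptions made for illustration; they are not part of the thesis.

```python
import numpy as np

class TwoStateEnv:
    """Toy episodic environment: state 0 is the start, state 1 is terminal."""
    def __init__(self, p_success=0.3, rng=None):
        self.p = p_success
        self.rng = np.random.default_rng() if rng is None else rng
    def reset(self):
        return 0
    def step(self, action):
        # Action 1 succeeds with probability p; success ends the episode with reward 1.
        if action == 1 and self.rng.random() < self.p:
            return 1, 1.0, True
        return 0, 0.0, False

def q_learning(env, n_states, n_actions, episodes=200,
               alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """Tabular Q-learning with epsilon-greedy action selection."""
    rng = np.random.default_rng() if rng is None else rng
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # One-step Bellman update toward the sampled target.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q

print(q_learning(TwoStateEnv(rng=np.random.default_rng(0)), n_states=2, n_actions=2))
```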
Second, offline learning algorithms separate exploration of the environment and learning of the state-action value function into separate tasks. The first phase of an offline algorithm is concerned only with exploration. The result of the exploration phase is a set of observations in the form of a collection of four-tuples, $\{(s_i, a_i, u_i, r_i) \mid i = 1, \ldots, m\}$, where $s$ is the state of the system, $a$ is the action chosen by the operator, $u$ is the state the system transitions to, and $r$ is the reward received. An example of an offline algorithm is fitted Q-iteration [15], which takes such a collection as input and outputs an approximately optimal policy.
In practice the success of these algorithms hinges on thorough exploration of the state-action space. The prototypical implementations of the above algorithms employ random action selection, and there is significant literature on how to improve upon uniform action selection. Zhang and Gao consider a Markov decision process for the maintenance of infrastructure, in which the operator selects maintenance actions depending on the state of disrepair of each piece of infrastructure. Given a sub-grouping of infrastructure sharing a common state, the operator selects a maintenance action uniformly at random to enact [19].
The $\epsilon$-greedy strategy is another action selection method. Given an $\epsilon \in [0,1]$, the $\epsilon$-greedy strategy dictates that the action with the highest known expected return be selected on a proportion $1 - \epsilon$ of the trials, and a randomly selected action be used on a proportion $\epsilon$ of the trials. Xia and Zhao utilize an $\epsilon$-greedy action selection strategy in their online reinforcement learning algorithm, Bayesian SARSA, which uses Bayesian inference to update the action value function and address the issue of policy evaluation [18]. In their implementation of the $\epsilon$-greedy strategy the exploratory actions are distributed uniformly. Boltzmann exploration [4], also known as soft-max exploration, seeks to alleviate the exploration-exploitation dilemma by weighing the value of taking action $a$ against the other possible actions via the equation

\[ \Pr(a) = \frac{e^{Q(a)/\tau}}{\sum_{i=1}^{n} e^{Q(i)/\tau}}, \]

where $\tau$ is a parameter governing the degree of exploration.
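A small sketch of Boltzmann (soft-max) action selection is given below; the temperature value and the helper name `boltzmann_action` are illustrative assumptions.

```python
import numpy as np

def boltzmann_action(q_values, tau=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q(a)/tau)."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q_values, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    prefs = np.exp((q - q.max()) / tau)
    probs = prefs / prefs.sum()
    return rng.choice(len(q), p=probs)

# Small tau concentrates on the greedy action; large tau approaches uniform selection.
print(boltzmann_action([0.2, 0.5, 0.1], tau=0.1))
```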
Asadi and Littman demonstrated that implementing Bayesian SARSA with a Boltzmann soft-max exploration policy can lead to problematic behavior, introducing instabilities for certain Markov decision processes [1]. Convergence of SARSA in a tabular setting using $\epsilon$-greedy strategies is guaranteed [12], but modifying this strategy with the Boltzmann method does not guarantee convergence. Asadi and Littman demonstrated a counterexample in the form of a simple two-state Markov decision process.
Another action selection strategy follows the maxim of optimism in the face of uncertainty. Under this maxim the operator acts according to an overly optimistic model of the Markov decision process, assuming all untried actions are better than those used thus far. Optimism encourages the operator to explore the environment in search of higher expected return. While optimism incentivizes exploration, this behavior does not guarantee convergence to an optimal policy [14]. The problem lies in the algorithm's inability to distinguish between a small probability of successfully transitioning to a state with higher expected reward and an impossibility. This problem can be circumvented via algorithms such as R-max [2], which only considers states that have been visited a sufficient number of times.
Hester and Stone's TEXPLORE algorithm [9] utilizes a variant of UCT (upper confidence bounds applied to trees) to select actions using the assignment
\[ a \leftarrow \operatorname*{argmax}_{a'} \left( Q(s,a') + 2\,\frac{r_{\max}}{1-\gamma} \sqrt{\frac{\log c(s)}{c(s,a')}} \right), \]
where $c(s)$ counts the number of visits to state $s$, $c(s,a)$ counts the number of times action $a$ has been attempted from state $s$, $r_{\max}$ is the maximum return, and $\gamma$ is the discount factor.
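The sketch below implements a count-based upper-confidence selection rule of this general form. The bonus constant, the handling of untried actions, and the function name `ucb_action` are illustrative assumptions rather than the exact TEXPLORE implementation.

```python
import math

def ucb_action(q_values, state_visits, action_counts, r_max=1.0, gamma=0.95):
    """Pick the action maximizing Q plus an optimism bonus for rarely tried actions."""
    bonus_scale = 2.0 * r_max / (1.0 - gamma)
    best_a, best_score = None, -math.inf
    for a, q in enumerate(q_values):
        if action_counts[a] == 0:
            return a  # untried actions are treated as maximally optimistic
        score = q + bonus_scale * math.sqrt(math.log(state_visits) / action_counts[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a

# Example: the rarely tried third action wins despite a lower Q estimate.
print(ucb_action([0.8, 0.7, 0.6], state_visits=50, action_counts=[30, 19, 1]))
```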
The goal of this paper is to introduce an alternative to these methods. Our methodology uses low discrepancy action selection to reduce the redundancies encountered under standard random action selection. In the case of a discrete set of actions we do this by recording a history of the state-action pairs and then choosing an action that has been attempted the fewest times from the given state $s$, i.e., selecting from $\operatorname*{argmin}_{a} c(s,a)$.
Markov decision processes with continuous-valued actions are less common, but reinforcement learning algorithms exist to handle them [3].
In the continuous case, we use so-called quasi-random sequences, a technique popular in Monte Carlo integration. The next chapter describes each of these methods in more detail.
CHAPTER 2
DISCRETE MDP
The concept of discrepancy is related to the idea of similarity. When the points selected from a space are dissimilar to one another, spread apart rather than clustered together, we say the selection exhibits low discrepancy.
In this chapter the theory of low discrepancy will be applied to exploration in a discrete state Markov decision process, i.e. #(S) = N, #(A) = M for some N, M < ∞. A proof that the expected hitting time for a two-state system under low discrepancy action selection is less than or equal to the expected hitting time under random action selection is presented.
The theory of selecting actions in finite domains is given, illustrated by an application of the selection theory.
To begin we will build the theory needed to measure the increase in efficiency low discrepancy action selection has over random action selection. The basis of our theory rests on the simple fact that when selecting a sequence of actions randomly from a discrete set, it is inevitable for repetitions to occur. By reducing repetitions we decrease the expected hitting time for each state. We will examine the theory for a simple two state system, then empirically perform experiments on processes with larger state spaces.
A simple methodology for selecting actions from a list of $M$ actions is to take a random permutation of the indices of the $M$ actions and execute each action as its index appears. Once the end of the permutation has been reached, the indices are re-permuted and the process repeats. Under this method each action is picked exactly once per round, and the same action is selected again only after every other action has been attempted. A sketch of this cycling scheme is given below.
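The generator below is a minimal sketch of that permutation-cycling rule; the generator name `permuted_actions` and the seeded example are illustrative choices.

```python
import random

def permuted_actions(n_actions, rng=None):
    """Yield action indices by cycling through fresh random permutations."""
    rng = rng or random.Random()
    while True:
        order = list(range(n_actions))
        rng.shuffle(order)   # a new random permutation for each round
        yield from order     # each action appears exactly once per round

picker = permuted_actions(4, random.Random(0))
print([next(picker) for _ in range(8)])  # two full rounds over the four actions
```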
Let $h : S \times T \to \mathbb{N}^M$ be a history random vector which takes a state and time pair and maps it to a vector $h(s,t) \in \mathbb{N}^M$ whose $i$th coordinate is the number of times action $a_i$ has been attempted from state $s$ by time $t$. Under low discrepancy action selection, when selecting an action to perform from state $s$ at time $t$, we choose uniformly from the set of actions whose coordinate of $h(s, t-1)$ is less than or equal to every other coordinate of $h(s, t-1)$.
For instance, suppose that we have a six-state system $S = \{s_1, s_2, \ldots, s_6\}$ and 10 possible actions $A = \{a_1, a_2, \ldots, a_{10}\}$. Then a possible history for state $s_1$ at time epoch $t = 20$ would be
\[ h(s_1, 20) = (2, 2, 2, 1, 2, 2, 2, 1, 2, 1). \]
Under low discrepancy action selection we would choose one of the actions $a_4$, $a_8$, or $a_{10}$, since their attempt counts from this state are minimal. From those actions which are admissible, we sample uniformly; a sketch of this selection rule follows.
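Below is a minimal sketch of this least-tried selection rule, keeping the history vector as a per-state array of counts; the class name `LowDiscrepancySelector` is an illustrative assumption.

```python
import numpy as np

class LowDiscrepancySelector:
    """Choose uniformly among the actions attempted least often from each state."""

    def __init__(self, n_states, n_actions, rng=None):
        self.h = np.zeros((n_states, n_actions), dtype=int)  # history vector h(s, t)
        self.rng = np.random.default_rng() if rng is None else rng

    def select(self, s):
        counts = self.h[s]
        admissible = np.flatnonzero(counts == counts.min())  # least-tried actions
        a = int(self.rng.choice(admissible))
        self.h[s, a] += 1
        return a

selector = LowDiscrepancySelector(n_states=6, n_actions=10)
print([selector.select(0) for _ in range(10)])  # one full round from state s_1
```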
When selecting actions in this manner we classify all actions into two categories:
• Class 1: members of class 1 have been attempted before in the current round of attempts. Note that when the 1-norm of $h$ equals an integer multiple of the length of $h$, i.e. $|h|_1 = k \cdot M$ for some $k \in \mathbb{Z}$, all actions have been attempted $k$ times and the slate is reset.

• Class 0: members of class 0 have not been attempted in the current round of attempts.
To illustrate this classification scheme, consider the previous history vector
\[ h(s_1, 20) = (2, 2, 2, 1, 2, 2, 2, 1, 2, 1). \]
We write a 1 when an action is a member of class 1 and a 0 when it is a member of class 0. Since $k = 1$ full rounds have been completed, the classification is obtained by subtracting $k$ from each coordinate:
\[ h(s_1, 20) - k = (1, 1, 1, 0, 1, 1, 1, 0, 1, 0). \]
Under this classification scheme every history vector reduces to a binary string. Since members of class 1 have been attempted previously in the current round, we wish to select uniformly from class 0 when choosing an action during the $t$th time epoch. Observe that $t = |h|_1$ under the two-state model. The count of class 1 for state $i$ is
\[ \#(\text{Class 1}) = |h(i,t)|_1 \bmod M, \]
where $M$ is the cardinality of the action space.
The probability of selecting a particular action and the current time epoch are intertwined: with every increase in time within a round, the set of admissible actions shrinks, so the probability of selecting any one admissible action grows. Therefore, when calculating the probability of selecting an action, we introduce $t$ as an independent variable.
2.1 TWO STATE SYSTEM
Consider the simple Markov decision process with finite state space $S = \{1, 2\}$ and finite action space $A = \{a_k \mid k \in \mathbb{N}_M\}$, where $\mathbb{N}_M = \{1, 2, \ldots, M\}$. Let state 1 be the initial state of the system.
Figure 2.1: Basic discrete problem for comparing hitting times under different action selection protocols.
When selecting from the action space uniformly, we denote the probability of transitioning from state 1 to state 2 under action $a_i$ by $p_{1,2;a_i}$. The probability of transitioning from state 1 to state 2 is given by the Law of Total Probability as
\[ \Pr(2 \mid 1) = \sum_{a_i,\, i \in \mathbb{N}_M} p_{1,2;a_i} \Pr(a_i). \tag{2.1} \]
Since the actions are selected uniformly, $\Pr(a_i) = 1/\#(A) = M^{-1}$, and equation (2.1) may then be written as
\[ \bar{p} := \sum_{a_i,\, i \in \mathbb{N}_M} \frac{p_{1,2;a_i}}{M}. \tag{2.2} \]
Observe that $\bar{p}$ is the average of the probabilities of transitioning from state 1 to state 2 under random action selection. We are mainly interested in the number of epochs until the first successful transition. Let $T_{\mathrm{RAS}}$ denote the number of time epochs until the first successful transition under random action selection, i.e. the hitting time for state 2, and let $T_{\mathrm{LDAS}}$ be the low discrepancy analog. We want to formally compare $E[T_{\mathrm{RAS}}]$ to $E[T_{\mathrm{LDAS}}]$. To compute $E[T_{\mathrm{RAS}}]$ we first need to determine the probability distribution of $T_{\mathrm{RAS}}$. For $t$ to be the hitting time, the first $t-1$ attempts must have failed. Thus
\[ \Pr(T_{\mathrm{RAS}} = t) = (1 - \bar{p})^{t-1}\, \bar{p}, \]
which is the probability mass function of a geometric random variable,
\[ T_{\mathrm{RAS}} \sim \mathrm{Geometric}(\bar{p}). \]
The expected value of a geometric random variable gives
\[ E[T_{\mathrm{RAS}}] = 1/\bar{p}. \]
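As a quick sanity check on the geometric model, the sketch below estimates the mean hitting time for the two-state system by simulation and compares it to $1/\bar{p}$; the particular transition probabilities for $M = 4$ actions are made-up, illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative success probabilities p_{1,2;a_i} for M = 4 actions.
p = np.array([0.05, 0.10, 0.20, 0.40])
p_bar = p.mean()

def hitting_time_ras():
    """Epochs until the first transition to state 2 under random action selection."""
    t = 0
    while True:
        t += 1
        a = rng.integers(len(p))      # pick an action uniformly at random
        if rng.random() < p[a]:       # did the transition to state 2 occur?
            return t

samples = [hitting_time_ras() for _ in range(20000)]
print(np.mean(samples), 1 / p_bar)   # the two values should be close
```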
The main result of this section is to show that the average hitting time under low discrepancy action selection, $E[T_{\mathrm{LDAS}}]$, is smaller than the average hitting time under random action selection, $E[T_{\mathrm{RAS}}]$.
Before we can present a proof, we must state the Maclaurin inequality, which is a generalization of a famous mathematical inequality, the arithmetic-geometric mean inequality.

Arithmetic-Geometric Mean Inequality. For every positive integer $n$ and real numbers $x_1, x_2, \ldots, x_n > 0$, the inequality
\[ \frac{x_1 + x_2 + \cdots + x_n}{n} \ge \sqrt[n]{x_1 x_2 \cdots x_n} \]
holds, with strict inequality unless all of the $x_i$ are equal.

Before we can state the Maclaurin inequality we must first define the elementary symmetric polynomials in $x_1, \ldots, x_n$, which are
\[ e_k(x_1, \ldots, x_n) = \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} x_{i_1} x_{i_2} \cdots x_{i_k} = \sum_{\substack{I \subseteq \{1,\ldots,n\} \\ \#(I) = k}} \prod_{i \in I} x_i.
\]