Many Objective Sequential Decision Making

Many Objective Sequential Decision Making by Bentz P. Tozer III B.S. in Computer Engineering, December 2003, University of Pittsburgh M.S. in Electrical and Computer Engineering, May 2007, Johns Hopkins University A Dissertation submitted to The Faculty of The School of Engineering and Applied Science of The George Washington University in partial satisfaction of the requirements for the degree of Doctor of Philosophy May 21, 2017 Dissertation directed by Thomas A. Mazzuchi Professor of Engineering Management and Systems Engineering Shahram Sarkani Professor of Engineering Management and Systems Engineering The School of Engineering and Applied Science of The George Washington University certifies that Bentz P. Tozer III has passed the Final Examination for the degree of Doctor of Philosophy as of November 30, 2016. This is the final and approved form of the dissertation. Many Objective Sequential Decision Making Bentz P. Tozer III Dissertation Research Committee: Thomas A. Mazzuchi, Professor of Engineering Management and Systems Engineering & of Decision Sciences, Dissertation Co-Director Shahram Sarkani, Professor of Engineering Management and Systems Engineering, Dissertation Co-Director Chris Willy, Professorial Lecturer in Engineering Management and Systems Engineering, Committee Member Royce Francis, Assistant Professor of Engineering Management and Systems Engineering, Committee Member E. Lile Murphree, Professor Emeritus of Engineering Management and Systems Engineering, Committee Member ii c Copyright 2017 by Bentz P. Tozer III All rights reserved iii Dedication Dedicated to Audrey, who provided much of the motivation and inspiration necessary to complete this dissertation. iv Acknowledgments This dissertation would not be possible without the support of many people throughout my life, starting with my family. I'd like to acknowledge my parents, Bentz and Cathy, who encouraged me to pursue my interests in life, and instilled the discipline and passion for learning that were necessary to complete this dissertation, and my brother, Colin, for the tolerance and support he has provided throughout our child- hood and beyond. I'd also like to acknowledge my grandparents and the rest of my family for the encouragement and positive influence they've provided over the years. Next, I would like to acknowledge my co-advisors at George Washington Uni- versity, Dr. Thomas Mazzuchi and Dr. Shahram Sarkani. I greatly appreciate the opportunity to pursue a Ph.D. under their tutelage, as well the structure and guidance they provided while working towards this dissertation. I'd also like to acknowledge my classmates at GW, especially Don, Sean, and Shapna. Working together towards our mutual goal has been a pleasure, and I feel fortunate to have shared this experi- ence with all of you. I'd also like to acknowledge the people who initially guided me towards the field of engineering and kindled my interest in academic research. Lawrence Anderson and Joseph Ulrich Sr. acted as role models during my youth, showing me potential career options for an individual with an engineering degree. This led me to major in computer engineering at the University of Pittsburgh as an undergraduate. During my time there, I was fortunate to perform research under the direction of Dr. Ray- mond Hoare, causing me to consider the pursuit of a Ph.D. for the first time. Finally, I have been fortunate to have employers during my graduate studies that encouraged my interest in pursuing advanced degrees. I would like to acknowledge Raytheon for providing financial support for my M.S. degree, and Digital Oper- atives, specifically Nathan Landon, for providing financial and moral support for my Ph.D., as well as moral support. v Abstract Many Objective Sequential Decision Making Many routine tasks require an agent to perform a series of sequential actions, either to move to a desired end state or to perform a task of indefinite duration as efficiently as possible, and almost every task requires the consideration of multiple objectives, which are frequently in conflict with each other. However, methods for determining that series of actions, or policy, when considering multiple objectives have a number of issues. Some are unable to find many elements in the set of optimal policies, some are dependent on existing domain knowledge provided by an expert, and others have difficultly selecting actions as the number of objectives increases. All of these issues are limiting the use of autonomous agents to successfully complete tasks in complex, uncertain environments that are not well understood at the start of the task. This dissertation proposes the use of voting methods developed in the field of social choice theory to determine optimal policies for sequential decision making problems with many objectives, addressing limitations in methods that rely on scalarization functions and Pareto dominance to create policies. Voting methods are evaluated for action selection and policy evaluation within a model-free reinforcement learning algorithm for episodic problems ranging from two to six objectives in deterministic and stochastic environments, and compared to state of the art methods which use Pareto dominance for these tasks. The results of this analysis show that certain voting methods avoid the shortcomings of existing methods, allowing an agent to find multiple optimal policies in an initially unknown environment without any guidance from an external assistant. vi Table of Contents Dedication .................................... iv Acknowledgments ............................... v Abstract ..................................... vi List of Figures .................................. ix List of Tables .................................. xi Chapter 1. Introduction ........................... 1 1.1 Thesis Statement . 3 1.2 Contributions . 3 1.3 Organization . 4 Chapter 2. Related Work .......................... 5 2.1 Multi-objective Optimization . 5 2.1.1 Scalarization Algorithms . 9 2.1.2 Evolutionary Algorithms . 12 2.1.3 Many Objective Optimization . 14 2.2 Reinforcement Learning . 17 2.2.1 Markov Decision Processes . 19 2.2.2 Exploration versus Exploitation . 23 2.2.3 Model-based Learning . 25 2.2.4 Model-free Learning . 27 2.3 Multi-Objective Reinforcement Learning . 34 2.3.1 Single Policy Multi-Objective Reinforcement Learning Algorithms . 35 2.3.2 Multi-policy Multi-objective Reinforcement Learning Algorithms . 40 2.4 Social Choice Theory . 50 2.4.1 Social Choice Theory and Multi Criteria Decision Making . 54 Chapter 3. Many Objective Q-Learning ................. 57 3.1 Structuring Problems as Markov Decision Processes . 57 3.2 Solving Multi-objective Markov Decision Processes with Social Choice Functions . 59 3.3 Voting Based Q-Learning Algorithm . 64 3.4 Example Problem . 66 Chapter 4. Experiments ........................... 69 4.1 Metrics Used for Algorithm Evaluation . 69 4.2 Deep Sea Treasure . 71 4.3 Many Objective Path Finding . 78 4.4 Deterministic Five Objective Problem . 79 vii 4.5 Stochastic Five Objective Problem . 85 4.6 Stochastic Six Objective Problems . 90 4.7 Summary of Results . 95 Chapter 5. Conclusions and Future Work ................ 96 5.1 Summary of Contributions . 96 5.2 Future Research Directions . 96 5.2.1 Partially Observable Environments . 97 5.2.2 Function Approximation . 97 5.2.3 Non-Markovian Problems . 98 5.2.4 Alternative Social Choice Functions . 99 5.2.5 Model-based Learning Algorithms . 99 5.3 Conclusion . 100 References .................................... 101 viii List of Figures 1 The Pareto front for the two objective example problem, where black circles indicate optimal solutions that are part of the Pareto front. ............................... 9 2 A Pareto front with a point at (5, 6) which results in the existence of a non-convex region of the Pareto front. ...... 10 3 The Pareto front for a two objective and three objective problem represented by hypercubes. ................... 17 4 The interactions between components of the reinforcement learning paradigm, which are the agent and the environment. ... 18 5 The multi-objective reinforcement learning paradigm, where the reward received from the environment is a vector value instead of a scalar. ........................... 35 6 A convex hull of a Pareto front, where the convex hull is indi- cated by the black line. ......................... 43 7 An example of an election with a Condorcet cycle ........ 51 8 Example transformation of Q-values associated with each state action pair to a ballot associated with each objective. ..... 59 9 Example multi-objective MDP .................... 67 10 The Deep Sea Treasure environment. ................ 72 11 The Pareto front for the Deep Sea Treasure problem, where the Pareto optimal value for each of the 10 terminal states is represented by a black circle. ..................... 72 12 Hypervolume per episode for the Deep Sea Treasure problem. 74 13 Total reward per episode for the Deep Sea Treasure problem. 76 14 Elapsed time in seconds per episode for the Deep Sea Treasure problem. .................................. 78 15 Rewards for deterministic five objective path finding problem. 80 16 Hypervolume per episode for the five objective deterministic problem. .................................. 82 ix 17 Total reward per episode for the five objective deterministic problem. .................................. 83 18 Elapsed time in seconds per episode for the five objective deterministic problem. .......................... 84 19 Probability of

Many Objective Sequential Decision Making

Mohanoor Ou 0169D 10067.Pdf (1.725Mb)

Acyclic Graphs, Trees and Spanning Trees

Open THESIS PPPP Revjul27.Pdf

Graph Algorithms

On Stability of Widest Path in Network Routing Abstract

Artistic Recoloring of Image Oversegmentations

Smart Resource Allocation in Internet-Of-Things

Fundamental Data Structures Zuyd Hogeschool, ICT Contents

Fahmy Layout

Arxiv:2105.05371V1 [Cs.DM] 11 May 2021 Total Investment Cost, the Return on the Investment)

Fine-Grained Complexity and Algorithms for the Schulze Voting Method*

Comparitive Analysis on Shortest Path Computation in Gps Systems