
Essays on Reinforcement Learning with Decision Trees and Accelerated Boosting of Partially Linear Additive Models

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Operations, Business Analytics and Information Systems of the Carl H. Lindner College of Business

by

Steven Dinger
M.S. Clemson University, 2014
B.S. Clemson University, 2005

Committee:
Uday Rao, PhD (Chair)
Yan Yu, PhD
Dungang Liu, PhD
Raj Bhatnagar, PhD

June 2019

Abstract

Reinforcement learning has become a popular research topic due to the recent successes in combining value function estimation and reinforcement learning. Because of the popularity of these methods, deep learning has become the de facto standard for function approximation in reinforcement learning. However, other function approximation methods offer advantages in speed, ease of use, interpretability and stability. In our first essay, we examine several existing reinforcement learning methods that use decision trees for function approximation. Results of testing on a benchmark reinforcement learning problem show promising results for decision tree based methods. In addition, we propose the use of online random forests for reinforcement learning, which show competitive results.

In the second essay, we discuss accelerated boosting of partially linear additive models. Partially linear additive models are a powerful and flexible technique for modeling complex data. However, automatic selection of linear, nonlinear and uninformative terms can be computationally expensive. We propose using accelerated twin boosting to automatically select these terms and fit a partially linear additive model. Acceleration reduces the computational effort versus non-accelerated methods while maintaining accuracy and ease of use. Twin boosting is adopted to improve the variable selection of accelerated boosting. We demonstrate the results of our proposed method on simulated and real data sets. We show that accelerated twin boosting results in accurate, parsimonious models with substantially less computation than non-accelerated twin boosting.


Acknowledgements

I would like to thank my advisor, Uday Rao, for all his help and support. Without your guidance and encouragement I would not be where I am today. I would also like to thank Yan Yu for her advice and feedback on my research. You helped me turn an idea into a paper in a short amount of time and I am eternally grateful for it. To my committee members, Dungang Liu and Raj Bhatnagar, thank you for your feedback and willingness to help me through this process. To my family, thank you for your support and understanding. To my friends, thanks for making me laugh, letting me complain and taking my mind off work. To Jason Thatcher, thank you for always being available to give me advice about academia. Finally, to my fiancée, Kate, thank you for being there for me. Thank you for helping me through tough times and taking me on amazing adventures. I am excited to spend the rest of my life with you.


Contents

Abstract iii

Acknowledgements v

List of Figures xi

List of Tables xiii

List of Algorithms xv

List of Abbreviations xvii

1 Introduction 1

2 Reinforcement Learning with Decision Trees - Review, Implementation and Computation Comparison 5

1 Introduction ...... 5
2 Literature Review ...... 6
3 Methods ...... 10
3.1 Background ...... 10

Markov Decision Process ...... 11
Bellman Optimality Equation ...... 13
Q-Learning ...... 15
Function Approximation ...... 17
3.2 Batch Methods ...... 18
Fitted Q Iteration ...... 18
Boosted FQI ...... 21
Neural FQI ...... 23
3.3 Online Methods ...... 25
G-Learning ...... 26
Abel's Method ...... 28
Deep Q-Networks ...... 29
3.4 Online Random Forest Methods ...... 30
4 Testing ...... 33
4.1 Environment ...... 33
4.2 Batch Methods ...... 35
4.3 Online Methods ...... 36
4.4 Replications ...... 37
4.5 Parameter Tuning ...... 39
5 Results ...... 39
5.1 Batch Methods ...... 40
Single Tree FQI ...... 40
Random Forest FQI ...... 40
Boosted FQI ...... 41

Neural FQI ...... 45
5.2 Online ...... 45
G-Learning ...... 46
Abel's Method ...... 47
Deep Q-Network ...... 49
Online Random Forest ...... 51
5.3 Comparison ...... 55
6 Limitations ...... 58
7 Future Work ...... 59
8 Discussion ...... 60

3 Accelerated Boosting of Partially Linear Additive Models 61

1 Introduction ...... 61
2 Partially Linear Additive Models and Boosting ...... 65
2.1 Partially Linear Additive Models ...... 65
2.2 Boosting of Partially Linear Additive Models ...... 67
2.3 Twin Boosting of Partially Linear Additive Models ...... 71
3 Accelerated Boosting ...... 72
3.1 Accelerated Boosting of Partially Linear Additive Models ...... 72
3.2 Accelerated Twin Boosting of Partially Linear Additive Models ...... 74
4 Numerical Study ...... 74
4.1 Variable Selection Results ...... 78
4.2 Estimation Results ...... 81

4 Conclusion 89

Bibliography 97

List of Figures

1 RL Agent Interacting with the Environment ...... 11
2 RL Example Problem Environment ...... 12
3 Value Iteration Example ...... 14
4 Q-Learning Example ...... 16
5 Example Neural Network Structure ...... 24
6 CartPole Testing Environment ...... 34
7 Comparison of Raw and Filtered Reward Outputs ...... 37
8 Comparison of All Replications and Averaged Rewards ...... 38
9 Single Tree FQI Results ...... 41
10 Random Forest FQI Results ...... 42
11 Boosted FQI: Uncorrelated and Correlated Comparison ...... 43
12 Boosted FQI: Final Results ...... 44
13 Neural FQI Results ...... 46
14 G-Learning Results ...... 47
15 Abel's Method Results ...... 48
16 Deep Q-Network Results ...... 50
17 DQN Results with Limited Memory ...... 51

18 Online Random Forest FQI with Stopping Results ...... 52
19 Moving Window Random Forest Results ...... 53
20 Moving Window Random Forest Results with Limited Memory ...... 54
21 Comparison of RL Methods ...... 56

1 Twin versus Single Accelerated Boosting Results for Example 1 ...... 79
2 Variable Selection Results Example 1 ...... 79
3 Twin versus Single Accelerated Boosting Results for Example 2 ...... 80
4 Variable Selection Results Example 2 ...... 81
5 Results for Example 1 ...... 83
6 Boxplots for Example 1 vs Learning Rate ...... 84
7 Results for Example 2 ...... 85
8 Boxplots for Example 2 vs Learning Rate ...... 85
9 Boxplots for Taiwan Housing Data vs Learning Rate ...... 86

List of Tables

1 Classification of RL Papers ...... 9
2 Classification of RL Methods ...... 10
3 Execution Time of RL Methods ...... 58

1 Summary of Variable Selection Results ...... 82
2 Summary Results of Non-Accelerated vs Accelerated Twin Boosting ...... 88


List of Algorithms

1 Single Tree FQI ...... 20
2 Random Forest FQI ...... 21
3 Boosted FQI ...... 23
4 Neural FQI (NFQ) ...... 25
5 G-Learning ...... 27
6 Abel's Method ...... 29
7 Deep Q-Networks ...... 30
8 Online Random Forest FQI with Stopping ...... 31
9 Moving Window Random Forest ...... 32
1 Boosting Partially Linear Additive Models ...... 69
2 Twin Boosting of Partially Linear Additive Models ...... 72
3 Accelerated Boosting of Partially Linear Additive Models ...... 74
4 Accelerated Twin Boosting of Partially Linear Additive Models ...... 75


List of Abbreviations

Acc. Accelerated
DQN Deep Q-Network
FQI Fitted Q-Iteration
GAM Generalized Additive Model
MDP Markov Decision Process
NFQ Neural Fitted Q-Iteration
PLAM Partially Linear Additive Model
RL Reinforcement Learning
RMSE Root Mean Square Error
SD Standard Deviation


Dedicated to my brother, Mike. Thanks for always encouraging me.


Introduction

Reinforcement learning (RL) is an artificial intelligence (AI) method to optimize decision processes by rewarding good actions and penalizing bad actions. Although commonly used for games and robotics, RL is not limited to those areas; it can be used anywhere a decision needs to be made that weighs long-term consequences against short-term gain. RL can be used in pricing problems (Tesauro and Kephart 2002), clinical trials (Ernst et al. 2006; Martín-Guerrero et al. 2008), schedule optimization (Wang and Dietterich 1999) and customer marketing (Martín-Guerrero et al. 2008), among many other examples. For RL to see more use in a variety of fields, the techniques should be more broadly publicized, easy to use and computationally cheap. While the recent successes of winning against champion Go players and competitively playing modern video games have helped to publicize RL, the techniques used to achieve these state-of-the-art results can be challenging to replicate. One of the reasons for this is that most modern RL methods use deep neural networks, also known as deep learning. Deep neural networks can be quite challenging to use, as they require massive data sets, specialized hardware and long training times, and they are difficult to troubleshoot (Marcus 2018). Replacing the neural network part of RL with a different technique like random forests should result in an easier, more accessible method that still offers competitive results.


While there are many candidate methods for replacing deep learning in RL, decision trees and tree ensembles show much promise (Caruana et al. 2008). Random forests are a popular method because they are simple to use and need little tuning to achieve good results. Gradient boosting trees are another popular technique due to their flexibility, speed and performance. There is some existing research in using decision trees for RL, but most of it is several decades old and very little of it uses modern decision tree ensembles. Essay one explores these existing methods, proposes a classification for them and computationally compares them to state-of-the-art deep learning RL for a specific environment. Additionally, we explore options for online decision trees and ensemble methods. While examining decision trees and tree ensembles for Essay one, a method of improving the performance of gradient boosting trees was developed. It uses an optimization technique common in training neural networks known as Nesterov's accelerated gradient, which reduces the number of iterations needed to minimize an error function by taking larger gradient steps. As gradient boosting is a gradient-based technique, we can use acceleration to reduce the number of iterations gradient boosting needs. In an effort to make accelerated boosting more interpretable and handle variable selection, we adopt partially linear additive models. In contrast to other partially linear methods, our method does not need to pre-specify which variables are potentially non-linear and potentially linear. Accelerated boosting of partially linear additive models can automatically select variables as linear, non-linear or uninformative. Essay two develops this method in more detail and shows its potential using simulated and real data examples. In the future, we would like to combine these two essays and use accelerated boosting of partially linear additive models for reinforcement learning. As mentioned in Essay one, current methods of using gradient boosting trees for RL need a new tree for each iteration. Old trees are never removed or combined, so there is a potential memory problem if the RL process runs for

many thousands of iterations. The same is not true when using partially linear additive models. As their components are additive in nature, any update can be applied directly to the existing model, without needing to grow an ensemble. This means the model size is fixed no matter how many iterations pass, which makes partially linear additive models a good fit for online RL, where the number of iterations could grow without bound. The other advantage that partially linear additive models provide to RL is explainability. Currently, this is not a priority for most RL research, as most methods use black-box neural networks. However, as more of these methods are used to make decisions and more regulations are instituted about needing to explain decisions to consumers, the use of interpretable methods becomes more important. Research into RL with partially linear additive models or other interpretable methods could be very valuable. However, that is left for future work.


Reinforcement Learning with Decision Trees - Review, Implementation and Computation Comparison

1 Introduction

Recently, several high-profile achievements in AI have been the result of using RL with neural networks that have many hidden layers, also known as deep learning. Deep learning RL methods can beat the world champion at the game of Go, play Atari games at human-level performance and play competitively in modern video games. Because of these successes, the use of deep neural networks has become nearly universal in RL. However, these results are attributed to large teams of researchers who are experienced in using neural networks and have enormous computational resources at their disposal. For those without the experience and resources, neural networks can be challenging to use. Neural networks are often slow to converge, need large amounts of data and can be hard to interpret (Pyeatt and Howe 2001; Marcus 2018). For this reason, we investigate the use of decision trees and decision tree ensembles in RL. Replacing neural networks with a different function approximation method could make RL

easier and faster to use. Many different methods have been adapted for use in RL, including linear approximators (Sutton and Barto 1998), lasso (Kolter and Ng 2009) and support vector machines (Lagoudakis and Parr 2003), but because of their speed, flexibility and ease of use, this essay focuses on using decision trees for reinforcement learning. It is our hope that researching easier, faster methods for RL will lead to more widespread adoption of the technique for a variety of problems. The research question we want to address is: what are the advantages of replacing neural network function approximation with tree-based methods? In what situations are tree-based methods preferable and in what situations are neural network methods preferred? To answer these questions, we compare several tree-based methods with two neural network based methods on a classic RL benchmark problem. In addition, we propose several methods of using random forests for online RL. The rest of the essay is organized as follows: Section 2 is a literature review of tree-based RL methods. Section 3 details how each method is implemented. Section 4 explains our testing environment. Results are presented in Section 5 for individual methods and a comparison between methods. Section 6 looks at the limitations of our work. Section 7 explores future work and Section 8 is the final discussion.

2 Literature Review

Sutton and Barto (1998), a seminal book on reinforcement learning, is a good resource for learning the fundamentals of RL. Kaelbling et al. (1996) covers the central topics of RL in a shorter, paper format. Watkins (1989); Watkins and Dayan (1992) developed the Q-learning algorithm, which is the basis of all of the methods discussed here.


One of the earliest examples of tree-based RL is Chapman and Kaelbling (1991). They develop G-learning as an online method that takes binary input data. They start by assuming all input state bits are irrelevant and use Student's t-test to determine which bits are relevant. They then create a binary decision tree with only the relevant bits of state information. In their tests, G-learning outperforms neural networks. G-learning and the other approaches that we implement are discussed further in the methods, Section 3. Moore and Atkeson (1995) proposes Parti-Game, another online tree-based method that uses game theory to recursively partition the input space. McCallum (1996) develops the U Tree algorithm while Uther and Veloso (1998) extends U Tree to take continuous state input. Pyeatt and Howe (2001) updates Chapman and Kaelbling's G-learning method to use non-binary inputs and, again, compares its performance to neural networks in a high-dimensional setting. They find their tree-based method converges more reliably and does not suffer from the learn/forget cycles that are experienced when using neural networks. These early methods all use a single tree that can be updated in online operation. Later tree-based methods update only in batches. Wang and Dietterich (1999) uses batch data and a unique tree method with linear models at the leaf nodes. Ernst et al. (2005) proposes Fitted Q-Iteration (FQI), a batch learning framework that can use many different machine learning methods for function approximation. Their paper demonstrates FQI with several different single regression tree and random forest methods. In their tests, random forests outperform single trees, and after this point most tree-based methods focus on tree ensembles1. Abel et al. (2016) extends FQI using gradient boosting tree ensembles. Tosatto et al. (2017) shows boosted FQI performing better than standard FQI using both tree ensembles and neural networks, but they never directly compare trees and neural networks.

1Ensembles are groups of individual models, either averaged or added together into a single model. The most well-known tree ensembles are random forests and gradient boosting trees.


Neural networks are used frequently in RL, but we focus on two neural network based methods in particular because of their frequent use in RL and their similarity to the tree-based methods that we test. First, Riedmiller (2005) introduces Neural Fitted Q-Iteration (NFQ), which integrates a neural network into the FQI framework. Second, Mnih et al. (2015) demonstrates deep Q-networks (DQN), an online method that uses deep learning neural networks for function approximation. To help organize the many methods referenced above, we propose a classification scheme, seen in Table 1. The methods are classified by mode of operation, batch or online, and by the type of function approximation used: a single decision tree, random forest, gradient boosting trees or neural network. The single tree methods have several proposed methods for both batch and online modes. The tree ensembles only have one method apiece, and all of them are related to FQI. There are many more neural network based methods, but for simplicity we limit our discussion to one batch and one online method. Given the lack of an existing online random forest RL method, we propose a few different methods of using random forests in an online setting. We examine using a random forest on a simple moving window of the most recent data. As a baseline comparison, we train a batch random forest FQI while gathering data and then stop updating the random forest once enough data has been gathered. We examine the results of these two methods and discuss other potential methods of using decision trees and ensembles for online reinforcement learning. There are few comparisons between neural networks and tree-based methods in the literature. Out of the papers listed above, only two directly compare tree-based RL and neural network based RL: Chapman and Kaelbling (1991) and Pyeatt and Howe (2001). Both find their tree-based methods to be competitive with neural networks, and Pyeatt and Howe find their method less susceptible to learn/forget cycles than neural networks. They attribute this problem to neural networks not having a separation between local and global updates. If one small (local) part of


TABLE 1: Classification of RL Papers. The papers are separated by mode of data input, whether batch or online, and by the type of function approximation method. We found no papers using random forests for online RL, so we have left that cell blank. There are many neural network methods; we have chosen these two as representative examples.

Function Approximator      Batch                        Online

Single Tree                Wang and Dietterich (1999)   Chapman and Kaelbling (1991)
                           Ernst et al. (2005)          Moore and Atkeson (1995)
                                                        McCallum (1996)
                                                        Uther and Veloso (1998)
                                                        Pyeatt and Howe (2001)
Random Forest              Ernst et al. (2005)
Gradient Boosting Trees    Tosatto et al. (2017)        Abel et al. (2016)
Neural Network             Riedmiller (2005)            Mnih et al. (2015)


TABLE 2: Classification of RL Methods. This table contains the RL methods which we will be examining in more depth.

Function Approximator      Batch                    Online

Single Tree                Single Tree FQI          G-Learning
                           (Ernst et al. 2005)      (Pyeatt and Howe 2001)
Random Forest              Random Forest FQI        Windowed Random Forest
                           (Ernst et al. 2005)      (This Essay)
Gradient Boosting Trees    Boosted FQI              Abel's Method
                           (Tosatto et al. 2017)    (Abel et al. 2016)
Neural Network             Neural FQI               Deep Q-Networks
                           (Riedmiller 2005)        (Mnih et al. 2015)

the network needs to be updated, the effects will be felt globally throughout the rest of the network because of its fully-connected nature. Pyeatt and Howe’s G-learning is constructed so that local changes only happen to a single leaf node, leaving the rest of the tree alone.

3 Methods

In this section, we discuss details of how we implemented each of these methods for this research. Table 2 shows which methods are implemented for testing. Since there are many single tree methods, we choose to test the most recently proposed batch and online methods. As the remaining categories only have one or no methods, we test the only method available.

3.1 Background

In order to understand the methods, some background information on RL is presented below. This covers the setup of a Markov decision process, how to optimally solve that process with the


FIG. 1: RL Agent Interacting with the Environment in Discrete Time. The RL agent is given a state, st, and chooses an action, at, based on that state. The action results in some reward, rt, from the environment and the state changes to a new state, st+1. The RL agent uses the new state to pick a new action and the process continues until a stopping criterion is reached.

Bellman optimality equation and how Q-learning can approximate this optimal solution.

Markov Decision Process

The process of an RL agent interacting in a problem environment is modeled in Figure 1. At any point in time t, a decision maker (the agent in RL) is in state st. From there, action at is executed, the state transitions to the next state st+1 and reward rt is given to the agent. The test environment is modeled by a Markov decision process (MDP). MDPs are used to model decisions under uncertainty. An MDP has a set of states, s, actions, a, and rewards, r. The probability that a state st will transition to the next state st+1 depends on the current state and action, P(st+1|st, at). A policy π(s) is a function that maps states to the actions that an agent will take. An optimal policy π∗(s) gives the actions that lead to the highest total reward. Finally, the reward r depends on the current state, action and next state, rt = R(st, at, st+1). Note that the reward and state transitions depend only on the current and next state and not on any other previous states. This is what makes the process "Markovian" or "memoryless."


FIG. 2: RL Example Problem Environment. Peach wants to move from state 1 to state 7, but must avoid state 4. There is a stochastic element that results in Peach ending up one square left of her desired destination 20% of the time.

Figure 2 shows a simple problem environment based on the canonical GridWorld example. Peach starts in state 1, she wants to get to Mario at state 7, but there is an enemy blocking the way and a strong wind blowing to the left. After any action, there is a 20% chance that Peach will be blown to the left one state. If Peach moves right, there is an 80% chance that she will succeed and a 20% chance that she will stay in the current state. If she moves down, she will end up below her current state 80% of the time and she will end below and left 20% of the time. Similarly an up action results in Peach moving up 80% of the time and up and left 20% of the time. A left action will move Peach one space left 80% of the time and two spaces left 20% of the time. At no point can an action or the wind move Peach outside of the problem environment. Assume there is a wall surrounding the entire perimeter of the world. States 4 and 7 are terminal states. Rewards for entering states 4, 5 and 7 are -10, 1 and 10, respectively. Given this setup, what actions will lead to the highest reward? Is going to state 5 and collecting the coin worth the chance of ending up in state 4? The next section will help us calculate the optimal policy.


Bellman Optimality Equation

The Bellman equation is a commonly used method for solving MDPs. The Bellman equation determines the value of being in a state by separating that value into two terms, the immediate reward and the discounted future value of being in the next state. The value of a state depends on the current policy and is determined by the Bellman equation:

V_\pi(s_t) = E_\pi\left[ r_t + \gamma V_\pi(s_{t+1}) \right] \qquad (1)

where γ ∈ [0, 1] is the time discount factor. It is a weighting term that lets us solve MDPs that have loops or an infinite time horizon. It is analogous to the time value of money in that rewards now are considered better than future rewards. A high γ puts more weight on future rewards, while a low γ is more "greedy" and favors the immediate reward. The Bellman optimality equation is similar:

V^*(s_t) = \max_a E\left[ r_t + \gamma V^*(s_{t+1}) \right] \qquad (2)

It states that the optimal value of a state is found by choosing the best action to balance the immediate reward with the discounted optimal value of the next state. Equations 1 and 2 are the simplified forms of the actual equations to aid comprehension. To get the full form, remember that r depends on the current state and the action taken. The next state depends on the current state, the action and some uncertainty. That is why the terms must be averaged over all the possible next states. Equation 3 shows the full Bellman optimality equation.

" # ∗ ∗ V (st) = max P(st+1|st, a) [R(st, a, st+1) + γV (st+1)] (3) a ∑ st+1


FIG. 3: Value Iteration Example. The value of Peach being in a particular state is listed. The reward for reaching state 4 is -10, reaching state 5 is +1 and reaching state 7 is +10. γ = 0.9 for this example. Note that the value of being in a particular state is the expected reward of the next action plus the discounted value of the expected next state.

Special techniques must be used to solve this equation because the optimality equation is recursive in V∗. Here, we discuss value iteration because of its simplicity and relevance to Q-learning. Value iteration is a dynamic programming technique to find the optimal value of every state, V∗. Once the values are known, it is easy to find the actions corresponding to the optimal policy. We start by assuming at iteration m = 0, the estimated value V0(s) = 0 for all s. Then we iterate over the Bellman optimality equation until the values converge and the difference between iterations is less than some convergence criterion. This iterative version is shown in equation 4.

" #

Vm+1(st) = max P(st+1|st, a)[R(st, a, st+1) + γVm(st+1)] (4) a ∑ st+1 Figure 3 shows the optimal values for each of the states. Note that states 4 and 7 are terminal states, so they are not assigned a value. The value of state 1 is 6.1, but it is omitted from the diagram for clarity. While these values appear to show the optimal policy passes through state 5, these are only the values of being in a state. It should not be assumed that moving to the highest value state is always the optimal action. This will be demonstrated in the next section.


The optimal policy, π∗, is actually found by using equation 3 with an argmax term instead of the max. Value iteration requires that all the state transition probabilities and rewards are known. But if the transition probabilities and rewards are not known, then reinforcement learning is necessary.
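To make the backup in equation 4 concrete, the sketch below runs value iteration on a small illustrative MDP. The three-state chain defined here is a placeholder of our own choosing, not the exact Figure 2 environment (whose full grid layout is not reproduced in the text); transitions are stored as lists of (probability, next state, reward) triples.

```python
# Minimal value-iteration sketch (equation 4). The MDP below is an illustrative
# stand-in, not the Figure 2 environment: P[(state, action)] lists
# (probability, next_state, reward) triples, and terminal states have no actions.
GAMMA, TOL = 0.9, 1e-6

P = {
    (0, "left"):  [(1.0, 0, 0.0)],
    (0, "right"): [(0.8, 1, 0.0), (0.2, 0, 0.0)],
    (1, "left"):  [(1.0, 0, 0.0)],
    (1, "right"): [(0.8, 2, 10.0), (0.2, 0, 0.0)],   # state 2 is terminal, pays +10 on entry
}
states, terminal = {0, 1, 2}, {2}

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states - terminal:
        actions = {a for (s_, a) in P if s_ == s}
        # Bellman optimality backup: best expected reward plus discounted next value.
        best = max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < TOL:   # stop once no state value changes by more than the tolerance
        break

print({s: round(V[s], 2) for s in sorted(states)})
```

The optimal policy is then read off by replacing the max over actions with an argmax over the same expression, as described above.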

Q-Learning

Q-learning (Watkins 1989; Watkins and Dayan 1992) is one of the most popular RL methods. In our research, all of the existing tree-based RL methods use Q-learning, so we limit our discussion to this form of RL. Reinforcement learning differs from dynamic programming in that the full MDP does not need to be known. RL can still determine the best actions by taking sample actions and learning over time. Model-based RL methods learn the states, transition probabilities and rewards and then solve the resulting MDP using dynamic programming. Model-free RL methods like Q-learning skip this process. Q-learning directly learns the value of a particular action at a given state, Q(s, a), also known as a state-action value or Q-value. Like the Bellman equations, Q-learning balances immediate reward with long-term value. Its most basic form is shown here:

Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \qquad (5)

for all state-action pairs. However, it is a recursive equation and, just like the state-value Bellman equation, it must be solved iteratively. The iterative update equation is:

Q_{m+1}(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q_m(s_{t+1}, a_{t+1}) \qquad (6)

Returning to our example with Peach and Mario, Q-learning no longer places the values on the states; the values are placed on the actions at every state. Figure 4 shows the most relevant


FIG. 4: Q-Learning Example. In this example, the values are calculated for specific actions, not for being in a specific state. Notice that moving into state 5 has a low value because there is a 20% chance that Peach will end up in state 4 and receive -10 reward.

of these state-action values, not all possible ones. Notice that the value of moving into state 5 from either state 2 or 6 is 5.4. Both state 2 and 6 have actions with a value higher than 5.4, so moving into state 5 cannot be an optimal action. Setting up the state-action values this way makes new policies much easier to calculate. For Q-learning, the best policy is to take the action with the highest Q-value:

\pi(s_t) = \operatorname{argmax}_{a} Q(s_t, a) \qquad (7)

One final advantage of Q-learning is that it is an off-policy method: it does not need to follow a specific policy when training. Note that Q-learning does not depend on the action taken at the next state, at+1; it will always use the best action for the next state. This is an important feature of off-policy RL methods. On-policy methods can only learn from data that was generated by their own policy, since they need the at+1 term. As an example, consider the equation for SARSA (Rummery and Niranjan 1994), another well-known RL method:

Q(s_t, a_t) = r_t + \gamma\, Q(s_{t+1}, a_{t+1}) \qquad (8)


This is similar to the Q-learning equation, but as there is no max term, the action at+1 must be the action actually taken by SARSA's current policy. Because Q-learning has the max term, it picks the best possible action for at+1, which is not necessarily an on-policy action. This is what lets Q-learning and other off-policy methods learn from real-world data, even when the policy that generated the data is unknown.
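The off-policy character of the update can be seen in a few lines of code. The sketch below runs tabular Q-learning using the sampled, incremental form of equation 6 with a learning rate α (the same form that appears later in the G-learning update). The five-state chain environment, ε, α and episode count are hypothetical choices for illustration only.

```python
import random

# Minimal tabular Q-learning sketch on a hypothetical 5-state chain:
# states 0..4, actions 0 (left) and 1 (right); reaching state 4 ends the
# episode with reward +10, every other step gives reward 0.
N_STATES, ACTIONS = 5, (0, 1)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.15

def step(s, a):
    """Toy environment dynamics (an assumption made for this illustration)."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 10.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, reward, done

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # Off-policy target uses the best next action, not the action the
        # behavior policy will actually take next (contrast with SARSA).
        target = r + (0.0 if done else GAMMA * max(Q[(s_next, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s_next

print(max(ACTIONS, key=lambda act: Q[(0, act)]))  # greedy action at the start state
```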

Function Approximation

Given the small size of our example problem, it is not difficult to calculate all the values for every state in the problem. However, real-world problems commonly have much bigger state spaces. They may have many dimensions, which are continuous instead of discrete in nature. This very quickly becomes an intractable problem if all state-action values are stored in a table. This is why most RL problems use a functional approximation of the state-action values to compress the state space. The other advantage of function approximation is generalizability. When using lookup tables, an RL agent that encounters a new state for the first time has no information about it. It does not know the transition probabilities or what possible rewards it might gain. It can only make uninformed decisions about what action to take, because there is no information about this state in the lookup table. When using function approximation, an RL agent that encounters a new state can get an approximation of the state-action values from the function approximator, because it is a model that tries to generalize information over a broad range of inputs. The RL methods discussed in this essay all use Q-learning to calculate their state-action values, but their main differences are the mode of operation and the function approximator used. The two modes of operation are batch operation on a static data set and online operation on a continuously changing data set. The function approximators used are single trees, random forests, gradient

boosting trees and neural networks. We discuss each method in more detail below, starting with the batch methods.

3.2 Batch Methods

Batch RL methods train over a pre-existing set of data. Since the batch methods iterate many times over the same set of data, they are generally more data efficient than online methods. Batch methods do not directly interact with their problem environment, but it is possible to generate a policy from a trained batch method and use this policy on the environment to create new data. This new data could then be used for training a new batch process. This multi-stage batch training has potential, but to limit the scope of this essay, we leave its examination for future work. The following batch methods are all variants of fitted Q iteration (FQI). FQI is a versatile technique that allows any supervised learning method to be used for RL. A supervised learning method is a machine learning method that predicts an outcome based on given input data. Multivariate regression, decision trees, neural networks, support vector machines and naive Bayes can all be classified as supervised learning methods. Since most supervised learning methods are batch processes, FQI is, by necessity, a batch process. There are other batch methods in RL, but they are not as well-known or flexible as FQI.

Fitted Q Iteration

Fitted Q iteration (Ernst et al. 2005) uses Q-learning on batches of data. It finds the Q-value of a given state and action by iterating over a batch of data. All the data is organized into four-tuples of (state, action, reward, next state) or (st, at, rt, st+1) as this is the only information needed for

Q-learning. FQI splits all the four-tuples into input tuples (st, at) and output tuples (rt, st+1). The

iteration number is m. F is the functional approximation of Q and at m = 0, all F = 0. At every iteration, the output tuples are used to approximate the Q-values:

Q_t = r_t + \gamma \max_{a_{t+1}} F_{m-1}(s_{t+1}, a_{t+1}) \qquad (9)

Then a supervised learning method is used to fit all the Qt values to their corresponding input tuples:

F_m(s_t, a_t) \leftarrow Q_t \qquad (10)

Like value iteration, this is an iterative process where the Q-values are initialized at 0 and get closer to their true values every iteration. The iterations continue until m reaches a set limit or the Q-values converge within some tolerance. The effectiveness of FQI depends on what type of supervised learning method is used. Models that can approximate the Q-values well will give better results than a poorly fitting model. The fitting process also takes much of the processing time, so a neural network will take more time to train than a simple regression tree. The differences between supervised learning methods allow us a great deal of flexibility when using FQI. In fact, the only difference between single tree FQI, random forest FQI and neural FQI is the function approximator used. Boosted FQI is similar but has a slight modification; see the next section for more details.
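As a rough illustration of this loop, the sketch below implements FQI with a single regression tree from scikit-learn on the CartPole environment used later in this essay. It assumes the classic OpenAI gym API (env.reset() returning an observation and env.step() returning a four-tuple); the action is appended to the state vector so one regressor can cover all state-action pairs, and the hyperparameter values are illustrative rather than the tuned settings used later in the essay.

```python
import numpy as np
import gym
from sklearn.tree import DecisionTreeRegressor

# Sketch of single-tree FQI. Assumes the classic gym API where env.reset()
# returns an observation and env.step() returns (obs, reward, done, info).
env = gym.make("CartPole-v0")
T, M, GAMMA = 5000, 50, 0.99
actions = list(range(env.action_space.n))

# 1) Gather a batch of transitions with a random policy.
S, A, R, S1, DONE = [], [], [], [], []
s = env.reset()
for _ in range(T):
    a = env.action_space.sample()
    s1, r, done, _ = env.step(a)
    S.append(s); A.append(a); R.append(r); S1.append(s1); DONE.append(done)
    s = env.reset() if done else s1
S, A, R, S1, DONE = map(np.array, (S, A, R, S1, DONE))

def features(states, acts):
    """Stack the action onto the state vector so one regressor covers (s, a)."""
    return np.column_stack([states, acts])

# 2) Iterate: compute Q targets from the previous model, refit the tree from scratch.
model = None
for m in range(M):
    if model is None:
        q_next = np.zeros(T)                   # F_0(s, a) = 0
    else:
        preds = [model.predict(features(S1, np.full(T, a))) for a in actions]
        q_next = np.max(preds, axis=0)
    targets = R + GAMMA * q_next * (~DONE)     # no future value after terminal states
    model = DecisionTreeRegressor(max_depth=8, min_samples_leaf=10)
    model.fit(features(S, A), targets)
```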

Algorithm 1 shows the single tree FQI process. The input data set of size T is (st, at, rt, st+1), for t = 1..T where st is the current state vector, at is the current action, rt is the reward for that state-action and st+1 is the state vector for the following time-step. Fm is the decision tree model that approximates the Q-values for a given state and action. The data set for all training is created by running the test environment for T steps, picking a random action at each step. Record the


Algorithm 1 Single Tree FQI

1: F_0(s, a) = 0 for all s and a
2: Initialize data set by running T time-steps in the test environment, choosing random actions
3: For every time-step record (s_t, a_t, r_t, s_{t+1})
4: for m = 0 to M do
5:     Calculate Q-values Q_{m,t} = r_t + γ max_{a_{t+1}} F_m(s_{t+1}, a_{t+1}) for t = 1 . . . T
6:     Fit single tree to Q-values (Breiman et al. 1984): F_{m+1}(s_t, a_t) ← Q_{m,t}
7: end for

state vector, action, reward and next state vector for each time step. If the environment reaches a terminal state, restart the environment at a new random state. Once the data set is created, iterate over it M times. For every iteration, estimate the Q-values of the data set and fit a single regression tree model to the Q-values. The regression tree is a series of nodes that are either decision nodes or leaf nodes. The decision nodes partition the input state space based on one of the input dimensions. A tree recursively partitions the state space by using multiple decision nodes until a leaf node is reached. The leaf node contains estimated Q-values, one for each action. Some tuning parameters for regression trees include the maximum number of decision nodes before reaching a leaf node, known as the maximum tree depth, and the minimum number of data points that must be in a leaf node. The larger the tree depth, the more precise its predictions can be, but a very high tree depth can lead to overfitting. One way to counter the overfitting is to use a larger number of sample points in each leaf node. More details on decision trees can be found in Breiman et al. (1984). Once the decision tree is fit to the estimated Q-values, the process can repeat for the next iteration. Like equation (6), the process needs many iterations to converge to the true Q-values of the environment. The accuracy of the function approximation model used for this process can affect the convergence of the Q-values. An inaccurate model could converge to the wrong values or not converge at all, leading to poor performance of the RL agent. More recently, random

forest models have shown better accuracy and more robustness to tuning parameters than single decision trees, so we also examined the use of random forests in FQI.

Algorithm 2 Random Forest FQI

1: F_0(s, a) = 0 for all s and a
2: Initialize data set by running T time-steps in the test environment, choosing random actions
3: For every time-step record (s_t, a_t, r_t, s_{t+1})
4: for m = 0 to M do
5:     Calculate Q-values Q_{m,t} = r_t + γ max_{a_{t+1}} F_m(s_{t+1}, a_{t+1}) for t = 1 . . . T
6:     Fit random forest to Q-values (Breiman 2001): F_{m+1}(s_t, a_t) ← Q_{m,t}
7: end for

Random forests use an ensemble of randomly sampled trees as a function approximator. Single trees tend to overfit the data they are applied to. Random forests (Breiman 2001) help negate this by subsampling the entire data set and fitting a tree to the sample. This is repeated many times to create a "forest" of trees that are averaged together to get a final result. The averaging of many trees reduces overfitting. The process of averaging subsampled models together is known as bagging in statistics. Random forests go one step further than tree bagging by also randomizing the dimensions that each split can be made on. This reduces correlation between trees and improves results. Algorithm 2 shows the random forest FQI process. The process is similar to single tree FQI except that a random forest is used to predict the Q-values at each state and action. The tuning parameters for a random forest are similar to single trees, but also include the total number of trees in the ensemble.
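Because FQI treats the regressor as a black box, switching from a single tree to a random forest is essentially a one-line change in the sketch above. The settings below are illustrative defaults, not the tuned values used in our experiments.

```python
from sklearn.ensemble import RandomForestRegressor

# In the FQI sketch above, only the model changes: a bagged, feature-randomized
# ensemble replaces the single tree. Settings here are illustrative.
model = RandomForestRegressor(
    n_estimators=50,      # number of trees in the forest
    max_depth=None,       # let individual trees grow deep; averaging curbs overfitting
    min_samples_leaf=5,
    n_jobs=-1,            # fit trees in parallel
)
# model.fit(features(S, A), targets) and model.predict(...) are used exactly as before.
```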

Boosted FQI

Boosted FQI (Tosatto et al. 2017) varies slightly from the standard FQI (Section 3.2) in how it uses gradient boosting. Boosting is similar to bagging in that it creates an ensemble of models, but

while bagging reduces overfitting by averaging those models together, boosting reduces underfitting by adding the models together. To make the trees underfit, the maximum depth of each tree is limited to between 4 and 8 (Hastie et al. 2009). In addition, a learning rate limits the amount that any single tree can contribute. Generally, learning rates are less than 0.3 to prevent overfitting. Lower learning rates will be less likely to overfit, but need more iterations to achieve an accurate model. This value must be set for the particular data set and is usually chosen through repeated testing and evaluation. Typically, boosting is done on the original data set with no subsampling. Gradient boosting trees (Friedman 2001) is an iterative process of fitting trees to the gradient of the loss function. Because gradient boosting and FQI are both iterative processes, they can be combined so that boosting iterations happen at the same time as Q iterations. The boosted FQI update is:

Q_{m,t} = r_t + \gamma \max_{a_{t+1}} F_m(s_{t+1}, a_{t+1}) \qquad (11)

f_m(s_t, a_t) \leftarrow Q_{m,t} - F_m(s_t, a_t) \qquad (12)

F_{m+1}(s, a) = \sum_{i=0}^{m} \eta\, f_i(s, a) \qquad (13)

The individual trees, fi, in the ensemble, Fm+1, are fit to the residual of the calculated Q and the previous estimate of that Q. Here, η ∈ [0, 1] is the learning rate of the boosting ensemble. This combined updating process limits the number of trees in the boosted ensemble to the number of iterations of FQI. However, by tying the boosting iterations to the Q iterations, we are able to retain all the learning in the boosting ensemble up to this point. This makes it faster, because only one tree needs to be fit at each iteration, and potentially more stable, because it does not need to rebuild the model entirely from scratch every iteration like the other FQI methods do. Algorithm 3 shows this process. In this case, the number of trees in the ensemble is not a

changeable parameter, as it is equal to the number of iterations, but boosted FQI does need to tune the learning rate in addition to the maximum tree depth and minimum samples per leaf of the component trees.

Algorithm 3 Boosted FQI

1: F_0(s, a) = 0 for all s and a
2: Initialize data set by running T time-steps in the test environment, choosing random actions
3: For every time-step record (s_t, a_t, r_t, s_{t+1})
4: for m = 0 to M do
5:     Calculate Q-values Q_{m,t} = r_t + γ max_{a_{t+1}} F_m(s_{t+1}, a_{t+1}) for t = 1 . . . T
6:     Fit single tree to residual: f_m(s_t, a_t) ← Q_{m,t} − F_m(s_t, a_t)
7:     Update ensemble (Friedman 2001): F_{m+1}(s_t, a_t) = F_m(s_t, a_t) + η f_m(s_t, a_t)
8: end for
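A sketch of this incremental ensemble is shown below. Rather than refitting a full model, each iteration fits one shallow tree to the residual of equations 11-13 and appends it with weight η. The transition batch here is random placeholder data, and the depth, learning rate and iteration count are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sketch of the boosted FQI update: each iteration fits one shallow tree to the
# residual Q_{m,t} - F_m(s_t, a_t) and adds it to the ensemble with learning rate eta.
# The transition batch is random placeholder data purely for illustration.
rng = np.random.default_rng(0)
T, M, GAMMA, ETA = 2000, 50, 0.99, 0.1
state_dim, actions = 4, [0, 1]

S = rng.normal(size=(T, state_dim))
A = rng.integers(0, 2, size=T)
R = rng.random(T)
S1 = rng.normal(size=(T, state_dim))

def features(states, acts):
    return np.column_stack([states, acts])

ensemble = []                                     # list of eta-weighted component trees

def F(states, acts):
    """Current ensemble prediction F_m(s, a) = sum_i eta * f_i(s, a)."""
    x = features(states, acts)
    return sum(ETA * f.predict(x) for f in ensemble) if ensemble else np.zeros(len(x))

for m in range(M):
    q_next = np.max([F(S1, np.full(T, a)) for a in actions], axis=0)
    targets = R + GAMMA * q_next                  # Q_{m,t}
    residual = targets - F(S, A)                  # what the new tree must explain
    tree = DecisionTreeRegressor(max_depth=4)     # shallow trees underfit on purpose
    tree.fit(features(S, A), residual)
    ensemble.append(tree)                         # F_{m+1} = F_m + eta * f_m
```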

Neural FQI

Neural FQI (NFQ) (Riedmiller 2005) is FQI (Section 3.2) with a neural network as the function approximator. Neural networks originated in the 1950s, based on the idea of recreating the neurons in our brain. Each neuron in a neural network has some number of inputs that are weighted by a scalar term. The weighted inputs are added together and put through a non-linear "activation function." The output of the activation function is then passed to more neurons. The typical fully-connected neural network is organized into layers of neurons, where each neuron feeds forward into all of the neurons in the next layer. Figure 5 shows an example of this. When Riedmiller introduced NFQ, most networks only had one or two hidden layers with 5-20 neurons each. Recently, neural networks have started adding more layers and more complicated structures. This is the origin of the term "deep learning": it is a "deep" network that has many layers. Updating a network with many weighting terms is a complex process. This process is called backpropagation (Rumelhart et al. 1985) and it uses gradient descent to incrementally change


FIG. 5: Example Neural Network Structure. The ovals represent the neurons and the lines are the weights. The number of hidden layers and the number of neurons per hidden layer are input parameters. The number of neurons in the input layer is equal to the number of input dimensions and the number of output neurons is equal to the number of output dimensions.

all of the weights. Unlike trees, the network structure is not created using the data as input; the structure is fixed, with random weights assigned to all the connections. After creating a randomly-weighted network, backpropagation is used to incrementally change the weights until the network fits the data well. Because of the numerous weight updates needed, much research has gone into optimizing the gradient descent process. Modern gradient descent optimizers, such as Adam, operate stochastically on mini-batches of data. However, Riedmiller recommends the use of a batch-based gradient descent optimizer, like gradient descent with acceleration (Nesterov 1983). Algorithm 4 shows the NFQ process. Tuning the parameters of a neural network can be challenging, as there are many different parameters and some of them are very sensitive to changes. These parameters are the number of hidden layers, the number of neurons per layer, the type of activation function, the gradient optimization method and its subsequent parameters. More layers and neurons can result in a


Algorithm 4 Neural FQI (NFQ)

1: Initialize network with random weights
2: Initialize data set by running T time-steps in the test environment, choosing random actions
3: For every time-step record (s_t, a_t, r_t, s_{t+1})
4: for m = 0 to M do
5:     Calculate Q-values for the entire data set: Q_{m,t} = r_t + γ max_{a_{t+1}} F_m(s_{t+1}, a_{t+1})
6:     Update network weights using backpropagation and batch gradient descent to match Q (Rumelhart et al. 1985): F_{m+1}(s_t, a_t) ← Q_{m,t}
7: end for

more accurate model but may also become more unstable or need more training time. Good values for these parameters are often found through testing, extensive searches and hand-tuning.
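For completeness, the same FQI loop can use a small neural network as the regressor. The fragment below swaps in scikit-learn's MLPRegressor as a convenient stand-in; note that it trains with stochastic optimizers such as Adam by default rather than the batch optimizer mentioned above, and the layer sizes are illustrative assumptions, not Riedmiller's settings.

```python
from sklearn.neural_network import MLPRegressor

# Illustrative stand-in for the NFQ function approximator: a small fully-connected
# network refit to the Q targets at each FQI iteration.
model = MLPRegressor(
    hidden_layer_sizes=(20, 20),   # two small hidden layers (illustrative)
    activation="relu",
    max_iter=500,
)
# As in the FQI sketch: model.fit(features(S, A), targets); model.predict(...)
```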

3.3 Online Methods

Online methods update continuously as data is gathered. They tend to be preferred in RL because they can explore the environment better, can be used with non-stationary processes and can also be run on batch data (Sutton and Barto 1998). Typically, online agents also interact directly with the testing environment. See Section 4.3 for more information on online testing. As the online agent is in charge of exploring the environment, a tradeoff is encountered. In order to test out all available options, exploration is a must. But if an agent is always exploring, it never gets a chance to exploit the best option. Conversely, if the agent always takes the best option it knows of, how does it know it is actually the best option? As an example, restaurant A has a steak you like. Should you continue to eat at restaurant A or try out restaurant B? What if the steak at B is better? Was it a one-time occurrence, or is the food consistently better at B? Should you keep ordering the steak or try the veal? What sauce should you have? What about sides? Drinks? Each decision affects the next and, in order to most efficiently find the best meal, you cannot choose randomly.


This is called the explore-exploit tradeoff and is important for online RL agents. There are a few common methods of handling it, such as upper confidence bounds (Lai and Robbins 1985) and Thompson sampling (Thompson 1933), but we use epsilon-greedy (Sutton and Barto 1998) because it is simple and often returns results competitive with more complicated methods. Epsilon-greedy algorithms always choose the best known action, except for ε percent of the time, when they take a completely random action. Epsilon is usually small, around 15%, and can decay over time to improve convergence.

In order to understand some of the distinctions between the following methods, it is necessary to understand two terms, "time-step" and "episode". They are discussed in detail in Section 4, but briefly, a time-step is the smallest increment of time in the environment's simulation. An episode runs for many time-steps until a final state is reached; then the environment is reset and a new episode can begin. These two terms matter because the following methods update at different intervals, some at every time-step and others at every episode.
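All of the online methods below share the ε-greedy action selection described above, which takes only a few lines. The Q-values, starting ε and decay schedule in this sketch are illustrative placeholders.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    q_values: dict mapping action -> estimated Q-value for the current state.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Example with illustrative numbers: epsilon starts at 0.15 and decays each episode.
epsilon = 0.15
for episode in range(100):
    action = epsilon_greedy({0: 0.2, 1: 0.5}, epsilon)
    epsilon *= 0.99   # slow decay to favor exploitation over time
```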

G-Learning

G-learning (Pyeatt and Howe 2001) is an online RL method that grows a single tree as data is input to it. The tree starts as a single leaf node and is split into two leaf nodes when there is enough data to support the partitioning. As new data comes in, it is added to the information at that node. If there are enough values at a leaf node, G-learning will run a test to see if the leaf should be split into two nodes; otherwise, it will just update the current Q-values2 at that node. The algorithm is described in Algorithm 5. The main loop repeats every time step, when a new (state, action, reward, new state) four-tuple is generated. The ∆Q(s, a) update equation in

2One Q-value per action at each leaf node.

Algorithm 5 is an incremental form of Q-learning. It uses a learning rate, α, to incrementally move the Q-values at the leaf nodes towards their true values, because the tree is not rebuilt after every iteration as in FQI: there are existing values in the leaf nodes and those values should be changed incrementally over time. This learning rate can also be used to filter noisy inputs. The learning rate α, the exploration parameter ε and the history list minimum length are the tuning parameters for G-learning. |mean(∆Q)| is the absolute value of the mean of all ∆Q in the history list of a leaf node, while stddev(∆Q) is the standard deviation of the same.

Algorithm 5 G-Learning

1: Initialize tree with single leaf node, Q(s, a) = 0, t = 0
2: for m = 0 to M do
3:     Start new episode of test environment
4:     while episode not ended do
5:         t = t + 1
6:         Use input to find a leaf node representing state s_t
7:         Generate uniform random variable u = U(0, 1)
8:         if u is less than ε then
9:             Choose random action
10:        else
11:            Choose argmax_{a_t} Q(s_t, a_t)
12:        end if
13:        Execute action a_t and observe the reward r_t and new state s_{t+1}
14:        Calculate ∆Q(s_t, a_t) = α[r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] and update Q(s_t, a_t)
15:        Add ∆Q(s_t, a_t) to the history list of leaf node s_t
16:        if length of history list at s_t > history list min length and |mean(∆Q)| < stddev(∆Q) for all ∆Q at that node then
17:            Split history list into two nodes
18:        end if
19:    end while
20: end for

G-learning was created to limit how much any local change affects the entire (global) tree. Consider, for example, an agent learning to climb a hill. If a batch RL process is trained on random actions,

it may learn how to conquer the first few obstacles, but what if conditions change as the agent climbs? Would a batch RL agent be able to overcome those new conditions? An online agent would be able to learn as it climbs, so it should learn to overcome those new conditions when it encounters them. However, Pyeatt and Howe (2001) saw that neural network based models had problems with online learning in that they would learn to climb over the initial obstacles, hit new obstacles, adapt to them and forget how to get over the first obstacles. They believe that this learn/forget cycle can be countered by using methods that do not change the entire (global) function approximator when trying to make small (local) changes. This local/global updating is what led them to develop G-learning.
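The sketch below illustrates the per-leaf bookkeeping in Algorithm 5: the incremental ∆Q update, the history list and the mean-versus-standard-deviation split test. It is a simplified reading of Pyeatt and Howe's method rather than their implementation; in particular, how a leaf is actually split along a state dimension is omitted, and the class layout and parameter values are our own assumptions.

```python
import statistics

class Leaf:
    """Simplified G-learning leaf: per-action Q-values plus a history of updates."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, min_history=50):
        self.q = [0.0] * n_actions
        self.history = []                 # recorded delta-Q values for the split test
        self.alpha, self.gamma = alpha, gamma
        self.min_history = min_history

    def update(self, action, reward, next_leaf_q):
        # Incremental Q-learning step toward r + gamma * max_a' Q(s', a').
        delta = self.alpha * (reward + self.gamma * max(next_leaf_q) - self.q[action])
        self.q[action] += delta
        self.history.append(delta)

    def should_split(self):
        # Split when the mean delta-Q is small relative to its spread, suggesting
        # the leaf is mixing two regions whose updates cancel out on average.
        if len(self.history) <= self.min_history:
            return False
        return abs(statistics.mean(self.history)) < statistics.stdev(self.history)

leaf = Leaf(n_actions=2)
leaf.update(action=1, reward=1.0, next_leaf_q=[0.0, 0.5])
print(leaf.q, leaf.should_split())
```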

Abel’s Method

Abel's method (Abel et al. 2016) is similar to boosted FQI (Section 3.2) with two exceptions. The first is that Abel's method does not iterate over a static data set. As the agent chooses actions, the state, action, reward and next state are saved into memory. The memory can hold N four-tuples (st, at, rt, st+1), so it acts like a moving window of the previous N tuples. At the end of an episode, Abel's method does a batch update using the previous N points. This update adds another tree to the function approximation ensemble to better estimate Q(s, a). The tuning parameters for Abel's method are the same as for boosted FQI, plus the amount of data to store, N, and the exploration rate, ε. The second change that makes Abel's method different from boosted FQI is an enhanced exploration algorithm. In their testing, the exploration algorithm substantially improved their results. However, our testing environment (Section 4) is relatively simple compared to the 3D game environment of Abel et al. (2016), so we use the same epsilon-greedy exploration that the G-learning and DQN methods use. The algorithm for Abel's method is shown in Algorithm 6.


Algorithm 6 Abel’s Method

1: F_0(s, a) = 0 for all s and a, t = 0
2: for m = 0 to M do
3:     while episode not ended do
4:         t = t + 1
5:         Generate uniform random variable u = U(0, 1)
6:         if u is less than ε then
7:             Choose random action
8:         else
9:             Choose argmax_{a_t} F_m(s_t, a_t)
10:        end if
11:        Execute action a_t and observe the reward r_t and new state s_{t+1}
12:        Store (s_t, a_t, r_t, s_{t+1}) in memory, remove old data if over N tuples stored
13:    end while
14:    Calculate Q-values Q_{m,t} = r_t + γ max_{a_{t+1}} F_{m−1}(s_{t+1}, a_{t+1}) for all t in memory
15:    Fit single tree to residual: f_m(s_t, a_t) ← Q_{m,t} − F_{m−1}(s_t, a_t)
16:    Update ensemble (Friedman 2001): F_{m+1}(s_t, a_t) = F_m(s_t, a_t) + η f_m(s_t, a_t)
17: end for

Deep Q-Networks

Deep Q-networks (Mnih et al. 2015) are a combination of deep learning and Q-learning. DQNs use a technique called "experience replay" (Lin 1992) to fit the deep neural networks to the Q-values. This is similar to the moving window of data in Abel's method (Section 3.3), except that the DQN trains on mini-batches that are sampled from the previous N data points. These training updates occur at every time-step, not every episode like Abel's method. Since every episode can have up to 200 time-steps, this is computationally more demanding, but it can potentially learn faster on a per-episode basis. The update algorithm is described in Algorithm 7. This is a newer method, so it uses the popular Adam (Kingma and Ba 2014) optimization to update the weights of the network. Also, more hidden layers tend to be used than in Neural FQI. The DQN has the same

tuning parameters as NFQ, plus the amount of data stored in the experience replay, N, the mini-batch training size, J, and the exploration rate, ε.

Algorithm 7 Deep Q-Networks

1: Initialize network with random weights, t = 0
2: for m = 0 to M do
3:     Start new episode of test environment
4:     while episode not ended do
5:         t = t + 1
6:         Generate uniform random variable u = U(0, 1)
7:         if u is less than ε then
8:             Choose random action
9:         else
10:            Choose argmax_{a_t} F(s_t, a_t)
11:        end if
12:        Execute action a_t and observe the reward r_t and new state s_{t+1}
13:        Store (s_t, a_t, r_t, s_{t+1}) in memory, remove old data if over N tuples stored
14:        Subsample mini-batch of four-tuples (s_j, a_j, r_j, s_{j+1}) for j = 1 . . . J from memory
15:        Calculate Q_j = r_j + γ max_{a_{j+1}} F(s_{j+1}, a_{j+1}) for the whole mini-batch, j = 1 . . . J
16:        Train neural network on mini-batch of Q-values using backpropagation and Adam (Kingma and Ba 2014): F(s_j, a_j) ← Q_j
17:    end while
18: end for
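The experience replay bookkeeping in Algorithm 7 amounts to a bounded memory plus uniform mini-batch sampling, sketched below. The network itself is omitted, and the memory size, mini-batch size and placeholder transitions are illustrative.

```python
import random
from collections import deque

# Experience replay sketch: a bounded memory of (s, a, r, s_next, done) tuples and
# uniform mini-batch sampling, as used at every time-step. The transitions appended
# here are placeholder values for illustration.
N, J = 10_000, 32                       # memory size and mini-batch size
memory = deque(maxlen=N)                # old tuples fall off automatically

def remember(s, a, r, s_next, done):
    memory.append((s, a, r, s_next, done))

def sample_minibatch():
    batch = random.sample(memory, min(J, len(memory)))
    # The caller computes Q_j = r_j + gamma * max_a' F(s'_j, a') for each tuple
    # and fits the network to those targets.
    return batch

for t in range(100):
    remember(s=[0.0] * 4, a=random.randint(0, 1), r=1.0, s_next=[0.0] * 4, done=False)
batch = sample_minibatch()
```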

3.4 Online Random Forest Methods

Drawing inspiration from several of the methods discussed above, we propose several novel applications of random forests to online RL. The first approach uses random forest FQI until enough data has been gathered and then stops gathering data and updating the random forest. The random forest is still used to estimate Q-values for specific state-actions, but it is no longer updated with new data. This approach gives us a baseline against which to compare other online random forest methods. This approach is shown in Algorithm 8. The tuning parameters will be the same as


random forest FQI, plus the iteration at which data gathering stops, m_stop. For simplicity, our stopping criterion is the number of iterations executed, but other stopping criteria, such as the number of data points gathered or the average reward, could be used.

Algorithm 8 Online Random Forest FQI with Stopping

1: F0(s, a) = 0 for all s and a, t = 0
2: for m = 0 to M do
3:   while episode not ended do
4:     t = t + 1
5:     Generate uniform random variable u = U(0, 1)
6:     if u is less than e then
7:       Choose random action
8:     else
9:       Choose at = argmax_a Fm(st, a)
10:    end if
11:    Execute action at and observe the reward rt and new state st+1
12:    Store (st, at, rt, st+1) in memory
13:  end while
14:  if m < mstop then
15:    Calculate Q-values Qm,t = rt + γ max_a Fm(st+1, a) for all t in memory
16:    Fit random forest to Q-values (Breiman 2001): Fm+1(st, at) ← Qm,t
17:  else
18:    Fm+1(s, a) = Fm(s, a)
19:  end if
20: end for

The second approach uses a moving window of previous data points, similar to Abel's method and DQNs, and rebuilds the entire random forest at every iteration, like random forest FQI. This allows the training data for the random forest to be more current, but it could lose some important information learned from early data points. This method is described in Algorithm 9. The tuning parameters are the same as random forest FQI plus the number of data points to store in the moving window, N.


Algorithm 9 Moving Window Random Forest

1: F0(s, a) = 0 for all s and a, t = 0
2: for m = 0 to M do
3:   while episode not ended do
4:     t = t + 1
5:     Generate uniform random variable u = U(0, 1)
6:     if u is less than e then
7:       Choose random action
8:     else
9:       Choose at = argmax_a Fm(st, a)
10:    end if
11:    Execute action at and observe the reward rt and new state st+1
12:    Store (st, at, rt, st+1) in memory, remove old data points if over N tuples stored
13:  end while
14:  Calculate Q-values Qm,t = rt + γ max_a Fm(st+1, a) for all t in memory
15:  Fit random forest to Q-values (Breiman 2001): Fm+1(st, at) ← Qm,t
16: end for
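The refit step in lines 14-15 of Algorithm 9 can be sketched as follows with scikit-learn. For brevity the action is appended as an input feature; our implementation instead uses scikit-learn's native multi-output support, as discussed in Section 5.1, and the hyperparameter values and function name are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def refit_window_forest(memory, forest, gamma=1.0, actions=(0, 1)):
    """Rebuild the random forest from the transitions currently held in the
    moving window. `forest` is the model from the previous iteration
    (None on the first pass)."""
    X = np.array([np.append(s, a) for s, a, r, s_next in memory])
    rewards = np.array([r for _, _, r, _ in memory])
    if forest is None:
        targets = rewards                      # no bootstrap value available yet
    else:
        next_q = np.array([
            max(forest.predict(np.append(s_next, a2).reshape(1, -1))[0] for a2 in actions)
            for _, _, _, s_next in memory
        ])
        targets = rewards + gamma * next_q     # Q-targets for every stored tuple
    new_forest = RandomForestRegressor(n_estimators=50, max_depth=25, min_samples_leaf=1)
    new_forest.fit(X, targets)
    return new_forest
```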

A third approach would be to use an online random forest such as those proposed by Saffari et al. (2009) or Lakshminarayanan et al. (2014). The advantage of this approach is that these methods and other online tree ensembles are designed to be used in an online fashion, whereas our previous two methods force a naturally batch method into an online setting. There are some challenges with this approach, though. First, many of these methods are designed for classification tasks; regression is either not possible or not optimized. Second, online forests tend to use online trees, and many online trees, such as Hoeffding trees (Domingos and Hulten 2000), maintain multiple potential splits at each leaf node. Maintaining multiple splits at every leaf node for multiple trees can be very memory-intensive and, like G-learning, performance could suffer because of it. The third challenge with this approach is that no mature implementations of these methods exist. While demo implementations do exist, they do not have the widespread use or support of major packages like scikit-learn or Tensorflow. This makes it more likely that issues will appear

when trying to use these implementations and that troubleshooting those issues will be more difficult. For these reasons, we leave exploration of online random forests to future work.

4 Testing

This section describes the setup of the testing environment and some of the details of how the testing was implemented for online and batch methods. We also discuss the use of multiple replications and how we decided to represent the results in the plots.

4.1 Environment

Testing is accomplished using OpenAI's reinforcement learning gym (Brockman et al. 2016). Each RL method is tested on the "CartPole-v0" environment, which is a classic control design problem (Michie and Chambers 1968) where the agent must learn to balance a pole (Figure 6). At each time-step, the learning agent receives the state of the system (cart position and velocity, pole angle and angular velocity) from the environment. The agent must then choose whether to push the cart to the left or to the right. After deciding which action to take, the environment advances one time-step, moves the cart in the desired direction and updates the state information. In addition to the updated state, the learning agent is now also provided with a reward, just as shown in Figure 1. At the end of every time-step, the agent gets a reward of 1 if the pole has not fallen over. If the pole tilts beyond a certain angle or the cart moves beyond its boundaries, then a zero reward is given. The objective is to maximize the cumulative reward per episode. Each episode lasts for a maximum of 200 time-steps but will be stopped early if a zero reward is received (i.e. the pole falls over or the cart moves out of bounds). An agent is considered to have successfully learned how to balance a pole if the average total reward over the previous


FIG. 6: CartPole Testing Environment. The objective is to balance the pole for as long as possible. The pole is attached to the cart at a pivot point. The actions for the environment are to push the cart left or push the cart right. The state is the cart’s position and velocity and pole’s angle and angular velocity.

100 episodes is 195 or better. The 200 time-step maximum and the 195 average over 100 episodes are OpenAI's recommended thresholds.

In RL, it is common to "shape" reward functions so that more granular information is provided to the agent. An example of a shaped reward would be one that provides a reward of 10 if the pole is vertical, 5 if it is slightly off vertical and -10 if it falls over. A better shaped reward would be a linear combination of angle, position and velocities that provides a large reward for a centered, vertical pole and smaller rewards the more the state strays from this ideal. Shaping makes it much easier for an agent to learn than the sparse 0/1 reward, but we believe the more challenging sparse reward makes for a more interesting problem. In addition, both OpenAI's formulation and the original problem formulation by Michie and Chambers (1968) use a sparse reward. However, because most older papers use a shaped reward function, our results are not comparable to the results from those papers.
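The sketch below shows what this sparse-reward interaction looks like in code, using the gym API current at the time of writing (reset returns the state and step returns a four-tuple). The random policy here is only a placeholder for an RL agent.

```python
import gym

env = gym.make("CartPole-v0")

for episode in range(5):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()            # push left (0) or right (1)
        state, reward, done, info = env.step(action)  # reward is 1 until the pole falls
        total_reward += reward                        # episodes are capped at 200 steps
    print(f"episode {episode}: total reward {total_reward}")
```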


More recent research uses sparse rewards because they are easier to program and less susceptible to the errors that a shaped reward function can introduce. Shaped reward functions must be carefully created or the agent will learn to optimize the wrong goal. For instance, if we were programming a team of RL agents to play soccer, the true objective is to win the game. However, each game lasts 90 minutes and it will take a long time before the RL agents see any reward from winning or losing the game. One way to shape the reward function to force the desired behavior is to reward goals made by your team. However, the agents may optimize this shaped reward by leaving no players to defend their own goal and using all the players to shoot for goals against the opposing team. This might lead to a high scoring game, but if the opposing team scores more than your team because your team has left their goal undefended, then you could still miss out on the true objective of winning the game. It is for this reason that recent research has focused on using sparse rewards and why we choose to do so, as well.

4.2 Batch Methods

The batch methods were tested with data generated from random actions. To generate this data, the environment is initialized to a random state close to center. At each time-step, random actions are applied and the (st, at, rt, st+1) four-tuples are recorded. This continues until a terminal state is reached, usually the pole falling over. Using random actions, the pole will balance for an average of 20 time-steps before falling. Once a final state is reached, the environment is reinitialized, more random actions applied and the data is recorded again. This repeats until 1000 time-steps of information is recorded. These 1000 data points are then used to train a batch method.

The batch methods rely on iterating over a static data set. As the iterations increase, the Q-values better approximate the true value of a state-action pair. In order to better compare the

batch and online methods, we need a way to plot how a batch method learns as a function of the iteration number. Generally, batch methods iterate until convergence or for some fixed number of iterations before generating a final policy. Because we want to see how it learns as a function of iterations, we need to generate a test episode after every learning iteration. So, each batch method starts with a randomly generated set of 1000 four-tuples. It then executes one iteration of Q-value updates. A new policy is generated and a single test episode is run. This new test episode does not provide any new information back to the batch process; it is just to see how much the method learned in one iteration. The process of iterating over the data, updating Q-values, updating the policy and running a test episode is completed 1000 times. The initial batch of data is never updated and the batch process only works with that same batch of data. This testing setup allows us to compare results of the two modes of operation.
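For clarity, the data-generation step for the batch methods can be sketched as follows; the function name and structure are illustrative, not our exact code.

```python
import gym

def collect_random_transitions(n_points=1000):
    """Roll out random actions on CartPole-v0 and record (s_t, a_t, r_t, s_{t+1})
    four-tuples until n_points transitions have been stored."""
    env = gym.make("CartPole-v0")
    data = []
    while len(data) < n_points:
        state, done = env.reset(), False
        while not done and len(data) < n_points:
            action = env.action_space.sample()
            next_state, reward, done, _ = env.step(action)
            data.append((state, action, reward, next_state))
            state = next_state
    return data
```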

4.3 Online Methods

Online methods interact directly with the environment. They start with no information, but as they run they store each (st, at, rt, st+1) four-tuple into a memory buffer. At every time-step or every episode, depending on the method, the agent updates its state-action Q-values based on the current data in its memory. The agent then uses the updated Q-values to generate a new policy for what actions it should take. This continuous process of executing actions, storing data and updating the policy happens for 1000 episodes. Whenever an episode finishes, a new episode is started at a random initial state and the learning process continues from where it left off.

When an online method is tested, it starts with no data and no information about the state-action values. Initially, an online method has no experience to draw on, so it can only take random actions. As these actions are played out, the agent learns which state-actions are better than others and can start choosing better actions. As the online agent starts exploiting these better actions, it may


reach a new area of the state space. This is the idea behind why an online agent can explore better than a batch process. However, online learning does have its own potential problems.3

FIG. 7: Comparison of Raw and Filtered Reward Outputs. The left plot is the raw reward per iteration with no filtering. The right plot uses a 100 episode running average to filter the results.

4.4 Replications

A single pass through 1000 iterations is a single replication of the learning process. Each replication records the cumulative reward of every episode. The raw output of the reward per episode for a single replication is shown in Figure 7a. Because the raw rewards are noisy, a 100-episode simple moving average filter is applied to all replications, as in Figure 7b. This corresponds to the goal of a 100-episode average reward above 195. If the filtered plot achieves 195 or above, then the policy at that point will have successfully learned how to balance the pole according to our objectives.
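The filter itself is a simple running mean; a minimal sketch of it is shown below (the function name is illustrative, and any equivalent moving-average implementation would do).

```python
import numpy as np

def filter_rewards(rewards, window=100):
    """100-episode simple moving average used to smooth the per-episode
    rewards before plotting."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode="valid")
```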

3See Sections 3.3 and 5.3 for more information about learn/forget cycles.


FIG. 8: Comparison of All Replications and Averaged Rewards. The left plot shows all 20 replications of the single tree FQI. As this plot is hard to interpret, we instead use an average plot with 95% confidence interval of the average, shown on the right.

Because of the randomness inherent in both the batch and online modes of operation, multiple replications of the learning process must be simulated for each method. The online methods start with no data and learn for 1000 iterations per replication. Batch methods start with 1000 randomly generated data points and iterate over them 1000 times. Each replication of a batch method starts with a new set of 1000 randomly generated four-tuples to learn from. Each replication of an online method starts with zero data and no information in its Q-values. Because of the long processing time of some of the methods, we only ran 20 replications of each method. Presenting all 20 replications on a plot, as in Figure 8a, is not very informative, so all plots are presented as an average of all 20 replications, with a 95% confidence interval of the average, Figure 8b. Notice in Figure 8a that some of the replications were able to achieve the goal of 195 reward but were not able to do so consistently. The width of the confidence interval (CI) shows how consistent the learning method is over the replications. The small CI for the first


200 iterations demonstrates fairly consistent learning early on, but as the CI widens, more of the replications start diverging from the average performance.

4.5 Parameter Tuning

An important part of using supervised learning is having the correct values for the parameters. Some parameters, like the number of trees and the tree depth, have good default values that work well in a variety of situations. Others, like the number of layers in a neural network, have no good heuristic for setting them. The results for each method are highly dependent on having good values for the input parameters, so we needed a systematic way of tuning the parameters that could be applied to each method. It is common to use a combination of grid-search and hand tuning to find the best values for the parameters, but research has shown that random search can provide equal or better results (Bergstra and Bengio 2012). The intuition behind this is that many input parameters are not important for many situations, so spending computational resources changing only one parameter is not worth the effort. Given this, we ran 20 replications of every method with 100 different randomly chosen settings for the input parameters. For the NFQ and DQN results, this process had to be repeated several times choosing different ranges for the input parameters before satisfactory results were found. The results presented below are for the parameter settings that achieved the highest cumulative reward over all 1000 iterations. The final parameter values for each method will be given in the corresponding section.
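A single draw of the random search can be sketched as below. The ranges shown are illustrative rather than the exact ranges used in our experiments, and evaluate() is a hypothetical helper that would run the 20 replications and return the cumulative reward.

```python
import random

def sample_tree_params():
    """One random draw of input parameters for a tree-based method."""
    return {
        "max_depth": random.randint(5, 30),
        "min_samples_leaf": random.randint(1, 100),
        "learning_rate": random.uniform(0.01, 0.3),
    }

# 100 randomly chosen settings, each scored by cumulative reward over 1000 iterations
candidates = [sample_tree_params() for _ in range(100)]
# best = max(candidates, key=evaluate)
```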

5 Results

In this section, we discuss the results for each method. We start with the batch methods, Section 5.1, and then present the results of the online methods, Section 5.2. Finally, Section 5.3 presents

a comparison of all the methods and makes some observations about the performance of trees versus neural networks and of batch versus online learning. The comparison section also compares the execution times of each of the methods.

5.1 Batch Methods

This section contains the results for the batch RL methods. All the methods are variations of FQI. Each method iterates over a fixed batch of 1000 randomly generated four-tuples. The time discount factor γ is 1 for all the batch methods. Because we are testing on a fixed time interval, a discount factor less than 1 is not necessary.

Single Tree FQI

Single tree FQI is relatively easy to set up and runs quickly, but it is somewhat sensitive to the input parameters. Too small a tree depth will lead to underfitting and too large a tree depth will promote overfitting. The number of samples per leaf node can be increased to counter the overfitting, so the best results are found with a combination of a larger tree depth and samples per leaf greater than one. The best settings for the input parameters are a tree depth of 21 and samples per leaf of 17. The results for the single tree FQI can be seen in Figure 9. This method learns quickly over the first few hundred iterations, increasing its total reward per test episode, and then stabilizes below the 195 total reward goal.

Random Forest FQI

Random forest FQI is also easy to use, though it takes more time than single tree FQI because it creates multiple trees per iteration. In contrast to the single tree FQI, the random forest FQI is fairly robust to parameter changes: as long as the tree depth is greater than 15 and the samples per leaf are between 5 and 10, the random forest FQI performs well.


FIG. 9: Single Tree FQI Results. Results are for using a single decision tree with maximum depth 21 and minimum samples per leaf of 17.

The best results in our tests are at a tree depth of 24 and samples per leaf of 6. Results are similar for 50, 100 and 200 trees, so 50 trees is used to reduce computation time. The results for this setting are shown in Figure 10. Again, this method learns quickly, but now is able to reach the goal of 195 total reward per test episode. In addition, the 95% confidence interval is smaller than the single tree FQI confidence interval, indicating that the random forest FQI is more consistent in achieving its results.

Boosted FQI

Getting good results from boosted FQI is more difficult than for single tree or random forest FQI. There are two main reasons for this. First, the package in which we implemented all of our decision trees, scikit-learn (Pedregosa et al. 2011), does not natively handle multiple outputs for gradient boosting regressors.


FIG. 10: Random Forest FQI Results. Results are for using 50 trees with a maximum depth of 24 and samples per leaf of 6.

Scikit-learn's single decision tree and random forest regressors do. Second, gradient boosting must be balanced between overfitting and underfitting, whereas random forests are naturally resistant to overfitting. The first reason is a more difficult and detrimental issue than the second, so we will address it first. Scikit-learn's decision tree regressor and random forest regressor provide an easy way to produce two different output values for a given input. The same is true for neural networks and G-learning. This makes it easier to program these function approximators to return the values of two different actions (left and right) from one input. There is only one model and all Q(s, a) state-action values can be output from that one model. It is not difficult to use two separate models, one for each action, but that causes some unforeseen problems. We used two independent gradient boosting ensembles for Figure 11a. This shows some learning, but its peak performance is substantially less than random forest FQI.


FIG. 11: Boosted FQI: Uncorrelated and Correlated Comparison. The left plot shows the results of using boosted FQI with two independent ensemble models. The right plot shows the results of using a single ensemble and passing the action as an input. Both models use a maximum depth of 7, a learning rate of 0.3 and minimum samples per leaf of 1.

After some investigation, an explanation for this decreased performance arose. All the methods that natively support multiple outputs allow their outputs to correlate. Maintaining two independent boosting ensembles does not allow any correlation between the output action values. As an example of why this matters, imagine a cart-pole system that is perfectly balanced at the exact center of the environment. In this state, the value of moving left or moving right is the same. Now suppose the cart is in a state where the pole is about to fall over to the left. An action to push left and correct the falling pole is going to have a much higher value than pushing to the right, which will likely cause the pole to fall over. This illustration demonstrates the possibility that action values correlate across certain states and that this correlation may be important to the learning process. To test this theory, we created a single gradient boosting ensemble and converted the action into a variable that was input alongside the state. This allows us to calculate a Q-value based on both state and action, but lets the actions correlate across states.


FIG. 12: Boosted FQI: Final Results. The results presented are for boosted FQI after final parameter tuning and using correlated actions. The parameters are maximum depth of 12, learning rate of 0.26 and minimum samples per leaf of 83.

Figure 11 compares the uncorrelated and correlated gradient boosting methods for the same tuning parameter inputs. Using the correlated gradient boosting trees, we then find the optimal parameter settings using random search. Higher learning rates, around 0.2 to 0.3, perform better than small learning rates; however, we limit the learning rate to 0.3 or lower to prevent overfitting. The best settings for gradient boosting came in two groups: one used a low tree depth, less than 10, with few samples per leaf, also less than 10, while the other group found good results using a large tree depth and many samples per leaf. The high samples per leaf greatly reduces the overfitting of the large tree depth, allowing it to perform well in a gradient boosting context. The best results for boosted FQI are shown in Figure 12. The parameters for this example are a learning rate of 0.26, tree depth of 12 and samples per leaf of 83.
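The sketch below illustrates the "correlated" input encoding only: a single ensemble predicts Q(s, a) from the state with the action appended as an input column. The boosted FQI loop itself adds one tree per iteration as described in Section 3.2, so the n_estimators value and function names here are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_correlated_q(states, actions, q_targets):
    """Fit a single ensemble on (state, action) inputs and Q-value targets."""
    X = np.column_stack([states, actions])
    model = GradientBoostingRegressor(max_depth=12, learning_rate=0.26,
                                      min_samples_leaf=83, n_estimators=100)
    return model.fit(X, q_targets)

def greedy_action(model, state, actions=(0, 1)):
    # Evaluate both actions through the one model and pick the larger Q-value
    q = [model.predict(np.append(state, a).reshape(1, -1))[0] for a in actions]
    return int(np.argmax(q))
```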


Neural FQI

Parameter tuning for Neural FQI (NFQ) is more challenging than for the previous methods, as there are more parameters to change and few guidelines to follow when choosing their values. Because the optimization method recommended by Riedmiller (2005) for NFQ is no longer in common use or easily available, we use gradient descent with Nesterov acceleration (Nesterov 1983). We also used the common technique of dropout, randomly dropping some percentage of the neurons during training. This reduces correlation among the neurons, reduces overfitting and improved our results. Given this, the parameters that need to be set are the number of layers, the number of neurons per layer, the learning rate, the momentum factor and the dropout rate. The entire batch of 1000 data points is used to train the network on every iteration. The best results for NFQ were with a fully connected network that has two hidden layers with 13 neurons each. A learning rate of 0.017 and momentum factor of 0.87 are used with accelerated gradient descent. A dropout rate of 0.2 is used. The results for the given parameters can be seen in Figure 13.
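The network configuration can be sketched in Keras as follows. The layer sizes, dropout rate, learning rate and momentum match the settings reported above; the ReLU activation is an assumption, and the surrounding FQI training loop is omitted.

```python
import tensorflow as tf

def build_nfq_network(state_dim=4, n_actions=2):
    """Sketch of the NFQ network: two hidden layers of 13 neurons, dropout 0.2,
    trained with Nesterov-accelerated gradient descent."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(13, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(13, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(n_actions),  # one Q-value output per action
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.017, momentum=0.87,
                                        nesterov=True)
    model.compile(optimizer=optimizer, loss="mse")
    return model
```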

5.2 Online

This section discusses the results for the online methods: G-learning, Abel's method, DQNs and the two proposed online random forest methods. All the online methods here used an epsilon-greedy method for exploration.4 The starting value for the exploration rate e is 1 and it decays by 0.5% every training iteration. The smallest value that e will ever be is 0.01. As with the batch processes, the time discount factor γ is fixed at 1. Because G-learning does not enforce a limit on the total number of data points it stores and DQNs typically use an "experience replay" window of many thousands of data points, we decided not to enforce the same 1000 data point limit that is used in the batch processes.
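For reference, the exploration schedule can be written as a one-line helper. A multiplicative 0.5% decay per iteration is assumed here; a subtractive schedule clipped at the same floor would behave similarly.

```python
def exploration_rate(iteration, start=1.0, decay=0.005, floor=0.01):
    """Epsilon-greedy schedule: start at 1, decay by 0.5% per iteration,
    never below 0.01."""
    return max(floor, start * (1.0 - decay) ** iteration)
```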

4See section 3.3 for more information.


FIG. 13: Neural FQI Results. The configuration of the network is as follows: two hidden layers with 13 neurons each, dropout rate of 0.2. A learning rate of 0.017 and momentum factor 0.87 are used with accelerated gradient descent.

G-Learning

There is no easily available implementation of G-learning, so we implemented it in Python based on Pyeatt and Howe (2001). The method seems to work well initially, but it runs so slowly that it needed to be limited to 200 iterations in order to complete 20 replications in a reasonable time. This slow performance likely has two causes. First, in the written description of the method, there is no clear way for old data points to be removed, which causes the data in each leaf's history list to grow without bound. Second, all the other RL methods have some part of their implementation based on compiled code (scikit-learn's regression trees or Tensorflow's neural networks), whereas our G-learning implementation is written entirely in interpreted Python. As a workaround, we limit the maximum tree depth to 15 and also limit the maximum length of a leaf's history list to 100.


FIG. 14: G-Learning Results. In order to complete the replications in a reasonable amount of time, the maximum depth was limited to 15 and the leaf history list limited to 100.

This improved the execution speed somewhat, but the learning performance was not as good. Figure 14 shows these results. The "growing tree" method of G-learning shows potential, but needs some work to be competitive with the other RL methods.

Abel’s Method

Because Abel's method is an online version of boosted FQI, the same problems encountered in the boosted FQI testing are present in Abel's method. We use the same correlated gradient boosting trees with Abel's method and use boosted FQI's best parameters as the starting point for our random parameter search of Abel's method. The same parameters that are used in boosted FQI are used in Abel's method, namely tree depth, learning rate and samples per leaf, but we also vary the number of data points that the moving window stores. The best parameters we found for Abel's method are a tree depth of 12, samples per leaf of 9, a learning rate of 0.3 and a moving window of the 7500 most recent data points.


FIG. 15: Abel's Method Results. The left figure shows the best results from the random parameter search using a moving window of 7500 data points, learning rate 0.3, max tree depth of 12 and min samples per leaf of 9. The right figure shows the second best results using a moving window of 500 data points, learning rate 0.24, max tree depth of 7 and min samples per leaf of 27.

Figure 15a shows the results. The second best results from the random search yielded a cumulative total reward that is 98.6% of the cumulative reward for the best result. It uses a maximum tree depth of 7, a learning rate of 0.24 and minimum samples per leaf of 27 while only using the 500 most recent data points. The results for this are shown in Figure 15b. The peak learning is not as high as the best result, but it runs dramatically faster. Abel's method with 7500 data points ran in 25 minutes, while Abel's method with 500 data points ran in 2.5 minutes. More details on the timing of each method can be found in Section 5.3.


Deep Q-Network

Parameter tuning of the Deep Q-Network is the most difficult of all the methods, even starting with the best parameters from the neural FQI. Over several random searches and manual manipulation of the search area, we found the best network structure to be 3 hidden layers with 62 neurons each and a dropout rate of 0.33. For the gradient optimization, we used Adam with a learning rate of 0.00018 and default parameters of β1 = 0.9 and β2 = 0.999. We trained mini-batches of 32 points randomly sampled from a window of the last 100,000 data points at every time-step. This gives 8 parameters to search over: number of layers, number of neurons per layer, dropout rate, optimization method, learning rate, momentum factor, batch size and window size. Even if there are good heuristics for a few of these, it is still a much larger search space than the 3 dimensions for random forest or boosted FQI. The results for the DQN are shown in Figure 16. It can be seen that there is a sharp drop in the reward after 800 iterations. We believe this is related to the learn/forget cycles that Pyeatt and Howe (2001) wrote about. Initial training on a mix of good and bad states will cause the network to learn which states to avoid. After avoiding bad states for many iterations, only data from good states will be stored in the window of data points. This causes the network to only train on good data points and it forgets how to avoid the bad states.

To further test this theory of why the performance of the DQN is dropping, we ran a test using the same input parameters, but used only 1000 data points in the experience replay window. The results for a typical replication are shown in Figure 17. The cyclic nature of the learning is more evident in this plot and happens at a higher frequency. A single replication is displayed for this figure as the replications cycle out of phase with one another and averaging them together reduces the cyclic nature of the individual plots. Longer replay windows seem to reduce the learn/forget cycles in our testing; however, the cycles may not be eliminated, they may simply take longer to complete than can be adequately seen in our testing intervals.


FIG. 16: Deep Q-Network Results. The network is trained with the following parameters: 3 hidden layers of 62 neurons each and a dropout rate of 0.33. An experience replay window of 100,000 data points is used and Adam gradient optimization has learning rate 0.00018 and β1 = 0.9 and β2 = 0.999. Mini-batches of size 32 are trained at every time-step.


FIG. 17: DQN Results with Limited Memory. The results for a typical run of DQN with a 1000 data point experience replay window. The other parameters are set the same as the best DQN results, Figure 16.

Online Random Forest

The tuning of our baseline online random forest FQI method was relatively straightforward. We used known good values for the number of trees, 50, the maximum tree depth, 15, and samples per leaf, 5. We adjusted the stopping iteration and found the best results when we stopped training around iteration 100. Assuming better than random performance for those first 100 iterations, we would expect slightly more than 1000 data points to have been gathered. The results for this approach are shown in Figure 18. The moving window random forest is similar to the previous method, but retains in memory only the last N data points for training. In theory, this allows the random forest to adjust to new conditions and new data, but in practice it may hurt results. For this method, we used a random parameter search to determine the number of trees, maximum depth, samples per leaf and the size of the memory window.


FIG. 18: Online Random Forest FQI with Stopping Results. The random forest has 50 trees, maximum depth of 15, minimum samples per leaf of 5 and training stopped at iteration 100. The apparent learning beyond step 100 is due to the lag of the 100 iteration moving average and the e-greedy exploration policy, which takes random actions at a high rate before decaying to its lowest rate around 400 iterations.


FIG. 19: Moving Window Random Forest Results. The random forest has 50 trees, maximum depth of 25, minimum samples per leaf of 1 and moving window of 10,000 data points.

The best results are at 50 trees, depth of 25, 1 sample per leaf and a 10,000 data point memory, the largest allowable in our search range. The results for this setting can be seen in Figure 19. As we did with Abel's method and DQNs, we examined the result of lowering the amount of memory in the moving window. We found the best result in the random parameter search that used 1000 data points and examined its results. This parameter setting shows learn/forget cycles similar to the DQNs with small window sizes. Again, we plot only one replication, Figure 20, as the cycles are out of phase and are reduced in the average plot. The performance here is typical of all the replications. Given the performance of the limited memory DQN and moving window random forests, it would seem that the learn/forget cycles are not solely attributable to neural networks, but to the moving window or experience replay. If an important state or states moves out of the memory window and is not replaced by another important state, then the agent cannot adequately train on how to deal with that situation.


FIG. 20: Moving Window Random Forest Results with Limited Memory. The random forest has 100 trees, maximum depth of 28, minimum samples per leaf of 13 and moving window of 1,000 data points.

A potential solution for this is longer windows, but that may just delay the point at which the agent will forget. However, the problem may not be whether an important state is or is not in the recent memory of an agent, but the concentration of important states relative to unimportant ones. A low concentration of important states could also cause the agent to train on the many unimportant states and not the few important ones. A potential solution for this would be prioritized experience replay (Schaul et al. 2015), which samples important states more frequently, making them harder to "forget". However, this may not prevent the important states from leaving the experience window altogether, making it impossible to train on those important states. G-learning, Abel's method and our baseline online random forest method offer a possible third solution: fixing part of the tree/ensemble/network structure after the initial training so that it cannot be changed by subsequent training. However, fixing some part of the structure while

still leaving the rest of the structure adaptable enough to accommodate subsequent learning is a challenging problem and will be left to future work.

5.3 Comparison

In order to gauge how the different RL methods compare to one another, results for each method have been compiled into Figure 21. All the methods demonstrate the capacity to learn by achieving an average reward over 100, beating random actions, which only average a reward of 20. However, only random forest FQI can consistently average over 195.

It can be seen that tree-based RL methods can provide results that are comparable to neural network based methods. In batch learning, the tree-based methods (Figures 21a, 21c, 21e) show similar performance to the equivalent batch neural network method, Figure 21g. The peak reward is similar for these methods, with random forest providing the best peak reward, but there is a noticeable delay in learning for neural network FQI. The tree-based batch methods start around a reward of 25 and plateau before 200 iterations. Neural FQI takes over 100 iterations to exceed a reward of 25 and plateaus around 300 iterations.

The online methods, Figures 21b, 21d, 21f and 21h, have somewhat different behaviors. G-learning performs the worst of any method, and suffers from implementation and memory problems. The moving window random forest learns well, but stops short of 175 reward. Abel's method is the slowest learning of the methods, but eventually achieves a good reward. The DQN is slow to start, achieves a good reward and then suffers a large drop in the reward, likely due to the learn/forget cycles discussed in Pyeatt and Howe (2001). More research is needed to determine whether the learn/forget cycles are a product of neural networks or of online windowed methods in general. G-learning is not a windowed method and is specifically designed to avoid learn/forget


(A) Single Tree FQI  (B) G-Learning
(C) Random Forest FQI  (D) Moving Window Random Forest
(E) Boosted FQI  (F) Abel's Method
(G) Neural FQI  (H) Deep Q-Network

FIG. 21: Comparison of RL Methods. Figures in left column are batch methods. Figures in right column are online. Since the online random forest FQI with stopping is not a true online process, we choose not to include it among the online results.

cycles. Abel's method is an online windowed method, but like boosted FQI it retains the trees from previous iterations, so it is resistant to forgetting previously learned information. DQNs have no natural mechanism to prevent forgetting previously learned information except for using a large window size and retraining on samples of old data, but for long enough time frames this might not be sufficient. Even though it uses a similar method of moving windows of previous data, the moving window random forest does not have the same dropoff in learning that the DQN does.

One advantage that tree-based RL methods have is the time to execute. Single tree methods are very fast, so even ensembles of trees can be competitive with neural network based methods. The time to execute each method is listed in Table 3. Listed times are the time to complete 20 replications, run in parallel on the Pitzer Cluster of the Ohio Supercomputer Center (Ohio Supercomputer Center 1987). The batch tree methods are faster than the batch neural network method, while Abel's method is faster than DQN. Abel's method, DQN and the moving window random forest are slowed by their use of large moving windows, but DQN is also slowed by updating at every time-step. When the moving window size is reduced to 1000, Abel's method approaches boosted FQI in speed and the moving window random forest approaches random forest FQI's speed, while DQN still takes about 40 minutes. As seen in Figure 15b, the accuracy of Abel's method does not suffer much from reducing the window size, but DQNs and moving window random forests do.

As discussed in the individual method results, the tree-based methods were much easier to tune than the neural network methods. There are fewer parameters and good guidelines for choosing starting values. With little work, it was possible to use a random parameter search to achieve results with the tree methods that met or exceeded the results of hand-tuning. In contrast, the neural network based methods have a large parameter search space and few guidelines for good starting values. Random parameter search was utilized for neural FQI and DQNs, but the


TABLE 3: Execution Time of RL Methods. All values listed are in minutes of wall time needed to execute 20 replications of each method on a 20-processor node of the Pitzer cluster of the Ohio Supercomputer Center (Ohio Supercomputer Center 1987).

Function Approximator       Batch Method        Time (min)    Online Method                  Time (min)
Single Tree                 Single Tree FQI     0.45          G-Learning                     90
Random Forest               Random Forest FQI   6             Moving Window Random Forest    35
Gradient Boosting Trees     Boosted FQI         3             Abel's Method                  25
Neural Network              Neural FQI          17            Deep Q-Networks                82

search needed to be repeated multiple times using different ranges of parameter values before an acceptable result was found. Even using a supercomputer, this equates to several days' work of tuning the neural networks, while tuning the tree-based methods takes several hours. Given these results, we find that tree-based RL, and particularly random forest FQI (Ernst et al. 2005), offers several advantages over neural network based RL methods.

6 Limitations

A limitation of this work is that our results are based on a single test problem. We are planning to do further tests with some of the more promising methods to see if our conclusions continue to hold in other environments. One advantage of neural network based methods is that convolutional neural networks provide an easy interface for dealing with image, video, audio and natural language data. Using a

tree-based method with these data formats would likely require some sort of pre-processing to convert them to a more manageable format. However, this limitation does not apply if the data is already in a tabular format, as in pricing, medical treatment, marketing or many of the other examples referenced in the introduction. For many real-world problems, a tree-based or batch RL method is sufficiently powerful and much easier to work with than neural networks.

Another limitation is that boosted FQI and Abel's method are both limited by how many iterations they can run. Both methods create a new decision tree at every iteration. This new tree is added to the ensemble, but no trees are ever removed. This means that if one of the boosting methods runs for 10,000 iterations, it will create and use 10,000 trees. This is a simple and effective approach, but there is a practical limit to how many iterations are possible, and as such these methods cannot be considered true online algorithms because they cannot run continuously. Online gradient boosting trees (Beygelzimer et al. 2015) could potentially solve this problem, but more research is needed to integrate them with RL.

7 Future Work

Future work on this topic has many possible directions. The first is continuing the research on online random forests. The moving window random forest shows promise, but adapting prioritized experience replay (Schaul et al. 2015) or another method of remembering important states could be useful in improving its results. Using an existing online random forest method, such as Saffari et al. (2009) or Lakshminarayanan et al. (2014), for online RL could be a good fit, but it may suffer from learn/forget cycles like the windowed methods because these forests are designed to be used with non-stationary processes and gradually forget old data points. An online single tree, like Hoeffding trees (Domingos and Hulten 2000), could also be a competitive method for RL function approximation. The complications encountered with G-learning are not necessarily indicative of all online

tree methods. As indicated in the introduction, a combination of this essay and the subsequent essay could lead to interesting results. The additive model removes the iteration limit of boosted FQI and Abel's method, while having an interpretable method of RL could be a means of gaining insight from traditionally inscrutable methods. As for the batch methods, future work could pursue a standardized way of using batch algorithms in online processes. Lange et al. (2012) highlights the efficiency and stability of batch methods, which our research supports. Using these methods effectively in online RL could be a valuable addition to the literature.

8 Discussion

In this essay, we demonstrated the performance of several tree-based RL methods and compared them to two neural network methods. Random forest FQI provides great results with basic settings and even better results with minimal parameter tuning. The performance of the batch tree methods was competitive with the more popular neural network methods. We proposed several methods of using random forests in online RL and they showed promising results. Tree-based online methods were implemented and are viable RL methods, while tree-based batch RL provides a fast, easy to use, effective method of using RL for real-world problems.

Accelerated Boosting of Partially Linear Additive Models

1 Introduction

Modern data analysis techniques must balance a combination of objectives including prediction accuracy, speed, simplicity and interpretability. Simple linear regression trades accuracy for increased speed and ease of use. Neural networks trade speed and interpretability for increased accuracy. Partially linear additive models (PLAMs, Härdle and Liang 2007; Liang et al. 2008, and references therein) combine the interpretability of simple linear regression with flexible functions for nonlinearity. We propose accelerated boosting of partially linear additive models. Boosting is a relatively quick way to build a PLAM, but acceleration makes the process even faster by reducing the computational effort required. In addition, in many applications a large number of covariates are collected but only a small fraction of those variables are actually important. Hence, variable selection is often of vital interest to parsimoniously achieve a sparse model, which may also result in higher prediction accuracy. In this essay, we further propose accelerated twin boosting to speed up the variable selection process for partially linear additive models.


One of the essential pieces of our proposed method is the partially linear additive model. PLAMs extend generalized additive models (GAMs, Hastie and Tibshirani 1990) to allow some terms to enter the model nonlinearly while retaining simplicity by keeping many covariates as partially linear terms. The advantage of using a PLAM is that the model only adds nonlinearity when needed to accurately model the data. For the other informative terms, linear models are used to maintain a simple and interpretable model. In addition to variable selection, one of the main challenges of using PLAMs is choosing whether an input variable is nonlinear or linear (Wang et al. 2011). In particular, among many covariates, we need to tackle two primary model selection challenges: 1) determining which covariates to include in the model; and 2) determining which of these covariates to include nonlinearly as additive terms. We may view this as a three-way variable selection among linear, nonlinear and uninformative covariates. Traditional methods of fitting a PLAM, like backfitting (Stone et al. 1985; Opsomer and Ruppert 1999), usually pre-specify which terms are linear or nonlinear as well as which important covariates to include. Using subset variable selection with an iterative procedure like backfitting can be computationally prohibitive. Lou et al. (2016) attempt to bridge the LASSO (Tibshirani 1996) and sparse additive models in a single convex optimization problem. Using boosting, Tang and Lian (2015) successfully combine variable selection with the model fitting process. However, the boosting algorithm of Tang and Lian (2015) is computationally expensive and can still take many thousands of iterations.

We propose accelerated boosting for the three-way variable selection and model fitting while aiming to substantially reduce computational effort. Boosting is a statistical method for model fitting that is frequently used with both linear and nonlinear models. Generally, boosting uses either a linear (Bühlmann and Yu 2003) or nonlinear (Friedman 2001) model, although a few have found a combination of both to be successful (Kneib et al. 2009; Tang and Lian 2015). Gradient

boosting (Friedman 2001) was originally designed to use nonparametric models. Bühlmann and Yu (2003) proposed component-wise boosting to be used with linear additive models. Boosting for PLAMs can be viewed as taking advantage of both approaches. Specifically, we use nonparametric models for the nonlinear components and a simple linear regression for the linear components. This keeps the model as simple as possible while still being able to introduce nonlinearity when necessary.

To fit a partially linear additive model with boosting, we use a variation of component-wise boosting (Bühlmann and Yu 2003) that uses linear and nonlinear models. Component-wise boosting has a univariate base learner for each component of the input. At every iteration, all the base learners are fit to the residual of the output minus the prediction. Then, the one base learner that best fits the residual is shrunk by the learning rate and added to the final ensemble model. So, on each iteration, only a portion of one component enters the final model. Boosting of PLAMs extends this by using two univariate base learners for each independent variable, one linear and one nonlinear. On each iteration, the single best fitting base learner from all dimensions, linear or nonlinear, is picked to enter the final model.

We examined several techniques designed to reduce iterations and improve the performance of gradient-based methods by changing the learning rate. Momentum (Rumelhart et al. 1985) adds a portion of the previous gradient to the current gradient. The proportion of previous gradient to use is called the momentum factor. Momentum methods act like a ball gaining speed when descending a hill: initial gradient steps are small, but build and become larger over time. Momentum is a simple but effective improvement on standard gradient methods. Newton's method uses the derivative of the gradient to take more accurate gradient steps. However, this derivative can be computationally expensive, especially in high-dimensional cases. Acceleration (Nesterov 1983) improves upon momentum in two important ways. First, while

momentum adds a portion of the previous gradient to the current gradient, acceleration calculates the new gradient after the previous gradient is added. Second, momentum uses a fixed momentum factor to determine the amount of previous gradient to use. Acceleration uses an equation to determine the optimal momentum factor based on the iteration number. These two changes result in an optimal version of momentum for convex, non-stochastic cases (Nesterov 1983). Acceleration is less computationally demanding than Newton's method and has similar complexity to momentum. Momentum introduces a new tuning parameter, the momentum factor, while acceleration does not. In acceleration, the momentum factor is specified for each iteration by a set formula. For these reasons, we adopt acceleration to reduce the computational effort of gradient boosting.

While boosting inherently performs some variable selection, its primary goal is prediction. After many iterations of boosting, the majority of the informative components of the input variable have entered the model, and it becomes easy for a spurious correlation to be found among the uninformative components. At this point, uninformative components will enter the model despite their lack of information. To counter this, several techniques have been proposed to enhance the variable selection of boosting, including stability selection (Hofner et al. 2015) and twin boosting (Bühlmann and Hothorn 2010). Of these methods, twin boosting may be the most promising. Tang and Lian (2015) compared stability selection and twin boosting and found that they achieved similar results but stability selection was much slower than twin boosting. Hence, we further implement accelerated twin boosting in our approach for the three-way variable selection. We find promising results in our simulation studies and real data applications.

This essay builds off the contributions of Tang and Lian (2015), which developed boosting for partially linear additive models and twin boosting for partially linear additive models. The major advantage of these methods over other methods that fit PLAMs is the three-way variable selection.


Few other methods can automatically distinguish between linear, nonlinear and uninformative variables, and those that do tend to be computationally inefficient. Boosting for PLAMs is more efficient than these methods, and we introduce acceleration to make it even more efficient. Acceleration (Nesterov 1983) is a popular method of improving gradient-based methods that has only recently been applied to gradient boosting of trees (Biau et al. 2019), but not to component-wise boosting or boosting of PLAMs. The rest of the essay is organized as follows. Section 2 introduces partially linear additive models and Tang and Lian's method of using boosting to fit them. Section 3 develops our methods of accelerated boosting for partially linear models and accelerated twin boosting of partially linear models. Section 4 demonstrates the performance of our methods using simulated and real-world data.

2 Partially Linear Additive Models and Boosting

2.1 Partially Linear Additive Models

We observe a random sample (yi, xi) for i = 1, . . . , n, with m-dimensional covariates xi and response variable yi. Here y represents the vector [y1, . . . , yn] and x represents [x1, . . . , xn]. We adopt a partially linear additive model (Härdle et al. 2012) with three-way variable selection. The contribution of each covariate j ∈ {1, . . . , m} of the input variable can be decomposed such that

Fj = β0,j + fj + gj,    (1)

where β0,j is a constant, fj is a linear function and gj is a nonlinear function. Additionally, we restrict β0,j, fj and gj to be pair-wise orthogonal, such that there is no linear or nonlinear component to the constant terms, no constant or nonlinear component to the linear terms and no

constant or linear component to the nonlinear terms. The full model takes the form

F(x) = β0 + ∑_{j=1}^{m} fj(x) + ∑_{j=1}^{m} gj(x),    (2)

where F(x) = E(y|x) and β0 = ∑_{j=1}^{m} β0,j. For a simple linear regression, fj(x) can be replaced by βj xj, where βj is the linear coefficient for the jth dimension of the input variable. We adopt univariate penalized splines as a base learner for the nonlinear function, such that gj(x) can be replaced by Bj δj, where Bj are the spline basis functions and δj are the spline coefficients for the jth component of the input variable. See Section 2.2 for more information about the spline functions. The estimated model is

F̂(x) = β̂0 + ∑_{j=1}^{m} f̂j(x) + ∑_{j=1}^{m} ĝj(x),    (3)

where the estimated intercept is β̂0, the estimated linear coefficients are β̂j and the estimated spline coefficients are δ̂j. Commonly, the input variable is sparse, with many of the input components providing no significant information about the response variable. As these components provide no significant output information, we refer to them as "uninformative". We assume there are p informative linear components and q informative nonlinear components. Therefore the number of uninformative components is m − p − q. We can reorder the input components with no loss of generality so that the first q components are nonlinear, the next p are linear and the rest are uninformative.

Given this ordering, the coefficients δ1, . . . , δq and βq+1, . . . , βp+q are non-zero, and the remaining spline coefficients δq+1, . . . , δm and the remaining linear coefficients β are zero. When estimating the model, we can use these criteria to do variable selection. We select component j as nonlinear if δj ≠ 0, as linear if δj = 0 and βj ≠ 0, and as uninformative if both δj = 0 and βj = 0.


It is common in the PLAM literature to pre-specify which input components can be linear and which can be nonlinear. Then variable selection decides if the components are informative or not. However, boosting for PLAMs needs no such pre-specification. Using boosting, we can automatically select variables to be linear, nonlinear or uninformative. The next section provides more details on how variable selection works for boosting of partially linear additive models.

2.2 Boosting of Partially Linear Additive Models

Component-wise boosting (Bühlmann and Yu 2003) provides a natural way to separate the informative components from the uninformative ones. Component-wise boosting differs from gradient boosting (Friedman 2001) in that it operates on the individual components of the input variable. Boosting works by iteratively fitting many small models to the data. These small models are individually less powerful than the final model, so they are referred to as "weak" or "base" learners. In gradient boosting, a multivariate base learner is fit to the error gradient using all the input components. A learning rate, typically less than 0.1, shrinks each base learner to ensure that it is "weak" and does not over-fit the data. This shrunk base learner is then added to the final gradient boosting model. The iterative process of fitting a base learner, shrinking it and adding it to the model continues until the test error is minimized. In component-wise boosting, a univariate base learner is fit to each component of the error gradient. Each base learner only uses one input component. The single base learner that reduces the error the most is shrunk by the learning rate and added to the final model. In this way, informative input components are selected frequently in component-wise boosting, while uninformative ones are selected less frequently. If the base learners' coefficients are linearly additive, the final model can be represented by one term per component by adding all the coefficients together. This is in contrast to traditional gradient boosting

where the final model is an ensemble of all the base learner models. The compactness of the final model aids the portability and interpretability of component-wise boosting over traditional gradient boosting.

To boost partially linear additive models, we follow the general process of component-wise boosting, but use two types of base learners: a simple linear base learner and a nonlinear base learner for each input component. As before, only one base learner is added to the final model at each iteration, but now the base learner is selected from the combined set of all linear and nonlinear base learners. Although many different methods could be used for the nonlinear base learners, we adopt univariate penalized splines. Without loss of generality, all input and output data are standardized to the interval [0, 1] using a linear transform. We use truncated power splines with degree d and

a series of knots {k_l}_{l=1}^{N} on [0, 1]. The basis functions for the given parameters are

B_{-1}(x) ≡ 1, B_0(x) = x, B_1(x) = x^2, ..., B_{d-1}(x) = x^d, B_d(x) = (x − k_1)^d 1[x > k_1], ..., B_{d+N-1}(x) = (x − k_N)^d 1[x > k_N].

As our nonlinear functions are orthogonal to the constant and linear functions, as in equation (1), we remove the constant and linear components from the spline. Since the flexibility of splines can lead to overfitting, we introduce a penalty term to our truncated power splines. We use a second derivative penalty matrix D and a penalty tuning parameter

λ. To find the spline coefficients δ_j of component j, we solve

min_{δ_j} (y − B_j δ_j)'(y − B_j δ_j) + λ_j δ_j' D δ_j.

We choose λ_j such that the degrees of freedom equal one, as suggested in Tang and Lian (2015). This is accomplished by finding λ_j such that trace(B_j (B_j' B_j + λ_j D)^{-1} B_j') = 1. If a larger degree of freedom were used, then boosting would consistently select a nonlinear base learner in place of a linear base learner because of its increased flexibility to fit the data. We prevent this by heavily

penalizing the spline and using a degree of freedom of one. More details on using penalized splines as base learners can be found in Tang and Lian (2015).

Algorithm 1 shows the complete process of boosting partially linear additive models. This is the algorithm developed by Tang and Lian (2015), which we use as a baseline in our testing. The terms used in this algorithm are as follows: n is the number of sample points, m is the number of input components, and F̂_t(x) is the estimated partially linear additive model at iteration t. To simplify the notation, all base learners are represented by h, where h_{1,t}(x), ..., h_{m,t}(x) are the nonlinear base learners and h_{m+1,t}(x), ..., h_{2m,t}(x) are the linear base learners at iteration t. h_t^*(x) is the best-fitting base learner at iteration t and β̂_{0,t} is the estimated intercept term. ỹ_t is the pseudo-residual that the base learners must fit. η is the learning rate, also known as the step size or shrinkage factor.
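Before presenting Algorithm 1, here is a minimal sketch of the penalized spline base learner just described. The degree (3) and the knots at 0.1, ..., 0.9 follow the settings used later in our testing; the discretized second-derivative penalty, the bracketing interval for the root search and all function names are illustrative choices of our own, not the implementation used in this work.

```python
import numpy as np
from scipy.optimize import brentq

def spline_basis(x, degree=3, knots=np.arange(0.1, 1.0, 0.1)):
    """Truncated power basis on [0,1] with the constant and linear parts removed:
    x^2, ..., x^degree and (x - k_l)_+^degree for each knot k_l (knots at 0.1..0.9)."""
    cols = [x ** p for p in range(2, degree + 1)]
    cols += [np.clip(x - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

def second_derivative_penalty(degree=3, knots=np.arange(0.1, 1.0, 0.1), grid=200):
    """Quadrature approximation of D_{kl} = integral of B_k''(x) B_l''(x) dx on [0,1].
    (Our own discretization; the text only states that D is a second-derivative penalty.)"""
    x = np.linspace(0.0, 1.0, grid)
    d2 = [p * (p - 1) * x ** (p - 2) for p in range(2, degree + 1)]
    d2 += [degree * (degree - 1) * np.clip(x - k, 0.0, None) ** (degree - 2) for k in knots]
    D2 = np.column_stack(d2)
    return D2.T @ D2 / grid

def fit_penalized_spline(x, r):
    """Fit one nonlinear base learner to the pseudo-residual r, with lambda_j chosen
    so that trace(B (B'B + lambda D)^{-1} B') = 1, as in Tang and Lian (2015)."""
    B, D = spline_basis(x), second_derivative_penalty()

    def df_minus_one(log_lam):
        lam = np.exp(log_lam)
        return np.trace(B @ np.linalg.solve(B.T @ B + lam * D, B.T)) - 1.0

    lam = np.exp(brentq(df_minus_one, -10.0, 25.0))   # effective degrees of freedom = 1
    delta = np.linalg.solve(B.T @ B + lam * D, B.T @ r)
    return delta, B @ delta
```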

Algorithm 1 Boosting Partially Linear Additive Models

1: F̂_0(x) = 0, β̂_{0,0} = 0, h_{j,0}(x) = 0 for all j = 1, ..., 2m
2: for t = 1 to T do
3:   ỹ_t = η(y − F̂_{t−1}(x))
4:   Fit nonlinear base learners h_{j,t}(x) for j = 1, ..., m, minimizing the penalized squared loss (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x)) + λ_j δ_j' D δ_j
5:   Fit linear base learners h_{j+m,t}(x) for j = 1, ..., m, minimizing the squared loss (ỹ_t − h_{j+m,t}(x))'(ỹ_t − h_{j+m,t}(x))
6:   j_t^* = index of the selected base learner, argmin_{j=1,...,2m} (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x))
7:   h_t^*(x) = h_{j_t^*,t}(x)
8:   F̂_t(x) = β̂_{0,t−1} + ∑_{i=1}^{t} h_i^*(x)
9:   β̂_{0,t} = β̂_{0,t−1} + mean(y − F̂_t(x))
10: end for

The first step sets the initial estimate F̂_0 and all 2m initial base learners h_{j,0} to 0 for all x.

The initial intercept estimate β̂_{0,0} is also set to 0. The process that follows will iterate T times. Line 3 calculates the residual between the observed outcome and the prediction and reduces it by the learning rate. Next, 2m base learners are fit. Base learners 1, ..., m are nonlinear penalized splines,

one for each of the m predictor variables. Base learners m + 1, ..., 2m are simple linear regressions, one for each predictor. Once all 2m base learners have been fit, the single base learner that results in the lowest error is added to the final model. Finally, the intercept estimate β̂_0 is adjusted so that the means of the outcome and the prediction are equal after adding the new base learner.

For regression, we naturally use a squared loss function, L(y, F̂(x)) = (1/2)(y − F̂(x))'(y − F̂(x)). The negative gradient of this loss function is then the residual of the prediction, y − F̂(x). This gradient is used in step 3 of Algorithm 1. For simplicity, we present step 8 of Algorithm 1 as an ensemble model of all the best base learners. However, because the model is additive, the final model can be reduced to equation (3). For this to happen, we substitute the intercept β̂_{0,T} for β̂_0; the sum of all selected nonlinear base learners,

∑_{i=1}^{T} h_{j,i}(x) 1[j = j_i^*],   (4)

is replaced by ĝ_j(x) and, similarly, the sum of all selected linear base learners,

∑_{i=1}^{T} h_{j+m,i}(x) 1[j + m = j_i^*],   (5)

is replaced by f̂_j(x), for all components j = 1, ..., m. That is, the nonlinear function ĝ_j(x) collects all of the nonlinear base learners for component j ∈ {1, ..., m} that were selected over the T iterations, ∑_{i=1}^{T} h_{j,i}(x) 1[j = j_i^*], and the linear function f̂_j(x) collects the linear base learners h_{j+m,i} for component j that were selected over all the iterations, ∑_{i=1}^{T} h_{j+m,i}(x) 1[j + m = j_i^*].
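To make the whole procedure concrete, the following sketch implements the component-wise loop of Algorithm 1 end to end and returns the reduced form of equation (3). To keep it short and self-contained, the nonlinear base learner uses a small fixed ridge penalty on the truncated power basis rather than the degrees-of-freedom calibration described above; the ridge constant and all names are our own, and this is an illustration, not the implementation used for the results in this chapter.

```python
import numpy as np

def nonlinear_basis(x, degree=3, knots=np.arange(0.1, 1.0, 0.1)):
    """Truncated power basis with constant and linear parts removed."""
    cols = [x ** p for p in range(2, degree + 1)]
    cols += [np.clip(x - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

def boost_plam(X, y, T=500, eta=0.01, ridge=1e-2):
    """Component-wise boosting of a partially linear additive model (Algorithm 1).
    Returns the intercept, the aggregated linear coefficients and the aggregated
    spline coefficients, i.e. the reduced form of equation (3)."""
    n, m = X.shape
    bases = [nonlinear_basis(X[:, j]) for j in range(m)]
    beta0 = 0.0
    beta = np.zeros(m)                                   # f_hat_j(x) = beta_j * x_j
    delta = [np.zeros(B.shape[1]) for B in bases]        # g_hat_j(x) = B_j @ delta_j
    F = np.zeros(n)                                      # sum of selected base learners
    for t in range(T):
        r = eta * (y - (beta0 + F))                      # step 3: shrunken residual
        best = None
        for j in range(m):
            B = bases[j]
            d = np.linalg.solve(B.T @ B + ridge * np.eye(B.shape[1]), B.T @ r)
            b = (X[:, j] @ r) / (X[:, j] @ X[:, j])
            for kind, fit, coef in (("nonlinear", B @ d, d), ("linear", b * X[:, j], b)):
                sse = np.sum((r - fit) ** 2)             # step 6: unpenalized loss
                if best is None or sse < best[0]:
                    best = (sse, j, kind, fit, coef)
        _, j, kind, fit, coef = best
        F += fit                                         # step 8: add best base learner
        if kind == "nonlinear":
            delta[j] = delta[j] + coef
        else:
            beta[j] += coef
        beta0 += np.mean(y - (beta0 + F))                # step 9: re-centre intercept
    return beta0, beta, delta
```

The returned linear and spline coefficients can then be thresholded to perform the three-way variable selection described in Section 2.1.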


2.3 Twin Boosting of Partially Linear Additive Models

By itself, boosting of PLAMs performs some variable selection, but it could be more effective. Tang and Lian (2015) use twin boosting to improve the variable selection performance of boosting PLAMs. The estimation accuracy of twin boosting is similar to or slightly better than that of "single" boosting, as fewer uninformative variables are included. Twin boosting (Bühlmann and Hothorn 2010) was developed to increase boosting's inherent variable selection ability. It accomplishes this by running the boosting process twice. Initially, the boosting process runs as normal. Using the final model coefficients, a weight is calculated for each input component; larger weights are assigned to the components that have a higher influence in the final model. The boosting process is then repeated using those weights, so that highly influential components are more likely to be picked at each iteration and less influential components have a reduced chance of entering the model. The twin boosting process makes boosting much more selective when choosing variables to enter the model.

Twin boosting of partially linear additive models is presented in Algorithm 2. First, a normal run of boosting of partially linear additive models is executed, as described in Algorithm 1. Next, in steps 2 and 3 of Algorithm 2, the final linear and spline coefficients from step 1 are used to calculate a weight for each base learner, linear and nonlinear. Then, in steps 4-13, the boosting process is repeated, but the weights are used to pick the best-fitting base learner at every iteration, as seen in step 10 and in the sketch that follows Algorithm 2. This is similar to twin boosting from Bühlmann and Hothorn (2010), except that now the best single base learner must be chosen from two sets of base learners: the linear learners and the nonlinear learners. Refer to Tang and Lian (2015) for more details on using twin boosting with partially linear additive models.


Algorithm 2 Twin Boosting of Partially Linear Additive Models
1: Execute Algorithm 1
2: Calculate nonlinear weights: w_j = (1/n) δ_j' δ_j for j = 1, ..., m
3: Calculate linear weights: w_{j+m} = (1/n) β_j' β_j for j = 1, ..., m
4: Execute a second round of boosting as follows:
5: F̂_0(x) = 0, β̂_{0,0} = 0, h_{j,0}(x) = 0 for j = 1, ..., 2m
6: for t = 1 to T do
7:   ỹ_t = η(y − F̂_{t−1}(x))
8:   Fit nonlinear base learners h_{j,t}(x) for j = 1, ..., m, minimizing the penalized squared loss (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x)) + λ_j δ_j' D δ_j
9:   Fit linear base learners h_{j+m,t}(x) for j = 1, ..., m, minimizing the squared loss (ỹ_t − h_{j+m,t}(x))'(ỹ_t − h_{j+m,t}(x))
10:  j_t^* = index of the selected base learner, argmax_{j=1,...,2m} w_j [(ỹ_t)'(ỹ_t) − (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x))]
11:  h_t^*(x) = h_{j_t^*,t}(x)
12:  F̂_t(x) = β̂_{0,t−1} + ∑_{i=1}^{t} h_i^*(x)
13:  β̂_{0,t} = β̂_{0,t−1} + mean(y − F̂_t(x))
14: end for
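As a small sketch of the weighting scheme (steps 2-3) and the weighted selection rule (step 10), again with our own function and variable names:

```python
import numpy as np

def twin_weights(delta_hat, beta_hat, n):
    """Steps 2-3: one weight per base learner, from the first boosting run."""
    w_nonlinear = np.array([d @ d / n for d in delta_hat])   # w_j     = delta_j' delta_j / n
    w_linear = np.array([b * b / n for b in beta_hat])       # w_{j+m} = beta_j^2 / n
    return np.concatenate([w_nonlinear, w_linear])

def weighted_selection(weights, r, fitted):
    """Step 10: pick the base learner maximizing w_j * (r'r - (r - h_j)'(r - h_j)),
    where fitted[j] is base learner j evaluated after fitting it to the residual r."""
    gains = [w * (r @ r - (r - h) @ (r - h)) for w, h in zip(weights, fitted)]
    return int(np.argmax(gains))
```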

3 Accelerated Boosting

We develop efficient algorithms with accelerated boosting for estimation and simultaneous three-way variable selection for partially linear additive models, aiming to reduce the heavy computational burden. We propose two algorithms. The first accelerates the process of boosting PLAMs to reduce computational effort. The second algorithm builds on the first while adopting twin boosting to increase the performance of three-way variable selection.

3.1 Accelerated Boosting of Partially Linear Additive Models

Acceleration is a technique frequently used to increase the convergence speed for many gradient-based techniques, including gradient descent, gradient boosting and training of neural networks. Other methods have been proposed to improve the convergence properties of gradient-based

methods, but they need additional tuning parameters, are memory-intensive or are computationally intensive. Acceleration provides increased convergence speed with only simple changes to the original boosting algorithm and without an additional memory or CPU burden. We use the form of acceleration presented in Bengio et al. (2013) because it integrates easily into boosting. We simply replace the gradient calculation, step 3 of Algorithm 1, with the accelerated gradient, seen in steps 3 and 4 of Algorithm 3, accelerated boosting of PLAMs. In this new calculation, v_t is the momentum gradient, which carries momentum through iterations, and ỹ_t is the accelerated gradient, which is a slight but important modification of the momentum gradient.

To use acceleration with PLAMs, the momentum factor, μ_t, must be modified to account for the number of input components. The momentum factor is the proportion of the previous gradient to be added to the current gradient. For acceleration to be optimal, Nesterov (1983) defined μ_t to be (α_t − 1)/α_{t+1}, where α_t = (1 + √(4α_{t−1}^2 + 1))/2. Sutskever (2013) points out that α_t ≈ (t + 4)/2 when t is large, and as such μ_t ≈ 1 − 3/(t + 5). This equation for μ_t is too aggressive for accelerated boosting with PLAMs, and the rate at which μ_t increases needs to be reduced. We believe the need for this change originates in the use of component-wise boosting instead of gradient boosting: on each iteration only one of 2m components is updated, whereas a multivariate base learner in a gradient boosting approach could affect all of its input components. Thus, we reduce the change in μ per iteration by a factor of 2m, resulting in

μ_t = 1 − 3/(t/(2m) + 5) = (t + 4m)/(t + 10m).   (6)

If the number of true linear and nonlinear terms were known (p and q, respectively), then the 2m reduction factor could potentially be lowered to p + q, resulting in more acceleration and less computational effort. This might be valuable for very high-dimensional sparse data sets, where μ_t would be reduced drastically by the large number of input components. However, we find that

even though 2m is a conservative value, it still shows improved performance over non-accelerated boosting without prior knowledge of the number of informative components. Even with a large m, the approximation of μ_t means that the momentum factor will never be lower than 0.4. We leave further exploration using smaller momentum reduction factors to future research.

Algorithm 3 Accelerated Boosting of Partially Linear Additive Models

1: F̂_0(x) = 0, β̂_{0,0} = 0, h_{j,0}(x) = 0 for all j = 1, ..., 2m
2: for t = 1 to T do
3:   v_t = μ_{t−1} v_{t−1} + η(y − F̂_{t−1}(x))
4:   ỹ_t = −μ_{t−1} v_{t−1} + μ_t v_t + v_t
5:   Fit nonlinear base learners h_{j,t}(x) for j = 1, ..., m, minimizing the penalized squared loss (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x)) + λ_j δ_j' D δ_j
6:   Fit linear base learners h_{j+m,t}(x) for j = 1, ..., m, minimizing the squared loss (ỹ_t − h_{j+m,t}(x))'(ỹ_t − h_{j+m,t}(x))
7:   j_t^* = argmin_{j=1,...,2m} (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x))
8:   h_t^*(x) = h_{j_t^*,t}(x)
9:   F̂_t(x) = β̂_{0,t−1} + ∑_{i=1}^{t} h_i^*(x)
10:  β̂_{0,t} = β̂_{0,t−1} + mean(y − F̂_t(x))
11: end for
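A minimal sketch of the accelerated residual calculation, covering the momentum factor of equation (6) and steps 3-4 of Algorithm 3; the function and variable names are our own.

```python
import numpy as np

def momentum_factor(t, m):
    """Equation (6): mu_t = 1 - 3/(t/(2m) + 5) = (t + 4m)/(t + 10m), so mu_0 = 0.4."""
    return (t + 4 * m) / (t + 10 * m)

def accelerated_residual(y, F_prev, v_prev, t, m, eta=0.01):
    """Steps 3-4 of Algorithm 3: momentum gradient v_t and accelerated gradient y~_t."""
    mu_prev, mu_t = momentum_factor(t - 1, m), momentum_factor(t, m)
    v_t = mu_prev * v_prev + eta * (y - F_prev)
    y_tilde = -mu_prev * v_prev + mu_t * v_t + v_t
    return v_t, y_tilde
```

The rest of the loop is unchanged from Algorithm 1: the base learners are fit to ỹ_t instead of the plain shrunken residual.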

3.2 Accelerated Twin Boosting of Partially Linear Additive Models

Given the advantages of twin boosting, we also propose accelerated twin boosting of PLAMs. The general process is similar to Algorithm 2, but now uses the accelerated gradient. The full process can be found in Algorithm 4.

4 Numerical Study

To test the performance of our proposed methods, we ran three numerical studies. The first is a simple partially linear simulation with three linear inputs, three nonlinear inputs, twenty-four uninformative inputs and Gaussian noise.


Algorithm 4 Accelerated Twin Boosting of Partially Linear Additive Models
1: Execute Algorithm 3
2: Calculate nonlinear weights: w_j = (1/n) δ_j' δ_j for j = 1, ..., m
3: Calculate linear weights: w_{j+m} = (1/n) β_j' β_j for j = 1, ..., m
4: Execute a second round of boosting as follows:
5: F̂_0(x) = 0, β̂_{0,0} = 0, h_{j,0}(x) = 0 for j = 1, ..., 2m
6: for t = 1 to T do
7:   v_t = μ_{t−1} v_{t−1} + η(y − F̂_{t−1}(x))
8:   ỹ_t = −μ_{t−1} v_{t−1} + μ_t v_t + v_t
9:   Fit nonlinear base learners h_{j,t}(x) for j = 1, ..., m, minimizing the penalized squared loss (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x)) + λ_j δ_j' D δ_j
10:  Fit linear base learners h_{j+m,t}(x) for j = 1, ..., m, minimizing the squared loss (ỹ_t − h_{j+m,t}(x))'(ỹ_t − h_{j+m,t}(x))
11:  j_t^* = argmax_{j=1,...,2m} w_j [(ỹ_t)'(ỹ_t) − (ỹ_t − h_{j,t}(x))'(ỹ_t − h_{j,t}(x))]
12:  h_t^*(x) = h_{j_t^*,t}(x)
13:  F̂_t(x) = β̂_{0,t−1} + ∑_{i=1}^{t} h_i^*(x)
14:  β̂_{0,t} = β̂_{0,t−1} + mean(y − F̂_t(x))
15: end for

The second example is a simulation with four nonlinear inputs, three linear inputs and eight uninformative inputs, where all of the inputs covary with one another. In addition, the noise term is heteroscedastic and is a function of one of the inputs. Example 2 is based on the simulation from Tang and Lian (2015). Example 1 follows a similar partially linear additive setup to Example 2 but uses a different formula, sample size and number of inputs. The third example is based on real housing data from Taiwan. We first introduce the data sets, then discuss variable selection results and, finally, estimation results.

Example 1 is a simulation with 30 input variables: three are nonlinear, three are linear and the rest are uninformative. The data for the first example are generated using the following functions:

g_1(z_1) = 2(2z_1 − 1)^2
g_2(z_2) = sin(2πz_2)
g_3(z_3) = 2 sin(2πz_3)/(2 − sin(2πz_3))
f_1(x_1) = 4x_1
f_2(x_2) = 3x_2/2
f_3(x_3) = −3x_3

where the final outcome y is calculated as

y = ∑_{j=1}^{3} f_j(x_j) + ∑_{j=1}^{3} g_j(z_j) + e.   (7)

All x's and z's are independently generated from a uniform random variable on [0, 1] and e is N(0, 0.5^2). The training sample size is 150 and the test sample size is 100. Each replication of

either simulation generates a new data set using the given parameters.

Example 2 is a simulation with more input covariance. The input variables are no longer independent; they now covary with one another. Any two predictors j_1, j_2 have correlation 0.5^{|j_1 − j_2|}. The error term e ~ N(0, 0.5^2) varies with z_4 as shown in equation (8). A training sample size of 300 is used with a testing set of size 300. There are a total of 15 input variables in this example: four nonlinear, three linear and eight uninformative. The functions for the informative inputs are:

g_1(z_1) = (sin(2πz_1) + 2 cos(2πz_1) + 3 sin^2(2πz_1) + 4 cos^2(2πz_1) + 5 sin^3(2πz_1))/5
g_2(z_2) = −sin(2πz_2)/(2 − sin(2πz_2))
g_3(z_3) = 5z_3(1 − z_3)
g_4(z_4) = 1/(z_4^2 + 0.5)
f_1(x_1) = 2x_1
f_2(x_2) = −2x_2
f_3(x_3) = x_3

so that

y = ∑_{j=1}^{3} f_j(x_j) + ∑_{j=1}^{4} g_j(z_j) + (0.5 + z_4)e.   (8)

In addition to the simulation studies, we demonstrate the performance of our method on a real data example. For this example, we use the Taiwan housing data from Yeh and Hsu (2018). The output of this data set is the price per unit area, in local units (Taiwanese dollar per ping). The

input predictors are transaction date (year, month), house age (years), distance to the nearest transit station (meters), number of nearby convenience stores (count), latitude and longitude (degrees). The data set has 414 entries and each replication uses a randomly sampled 80/20 train-test split.

In our testing, we use truncated power splines with degree 3. There are nine knots spaced equidistantly at [0.1, 0.2, ..., 0.9]. Unless otherwise stated, the learning rate is set to 0.01. The maximum number of iterations was set to 5000. Each test used 100 replications.
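As a concrete illustration, here is a minimal sketch of generating one replication of Example 1 as specified above (independent uniform inputs, Gaussian noise, and the stated sample sizes); the function name and the convention that the informative components occupy the first six columns are our own.

```python
import numpy as np

def generate_example1(n_train=150, n_test=100, n_inputs=30, seed=None):
    """One replication of Example 1: 30 uniform [0,1] inputs, of which the first
    three enter nonlinearly (z_1..z_3), the next three linearly (x_1..x_3) and the
    rest are uninformative; noise is N(0, 0.5^2)."""
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    X = rng.uniform(0.0, 1.0, size=(n, n_inputs))
    z1, z2, z3, x1, x2, x3 = (X[:, j] for j in range(6))
    signal = (2 * (2 * z1 - 1) ** 2
              + np.sin(2 * np.pi * z2)
              + 2 * np.sin(2 * np.pi * z3) / (2 - np.sin(2 * np.pi * z3))
              + 4 * x1 + 1.5 * x2 - 3 * x3)
    y = signal + rng.normal(0.0, 0.5, size=n)
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```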

4.1 Variable Selection Results

As Tang and Lian (2015) found twin boosting of PLAMs to have more promising results than single boosting of PLAMs, we first demonstrate the same to be true for accelerated boosting; we then compare twin boosting of PLAMs to accelerated twin boosting of PLAMs. We start by plotting the estimation results for example 1. Comparing accelerated boosting with accelerated twin boosting in Figure 1, we see that the final test error is similar for the two methods. Next we plot the variable selection results for example 1, as seen in Figure 2. For this example, we expect the three z variables to be selected as nonlinear, the three x variables to be selected as linear and all u variables to be selected as uninformative. In both Figures 2a and 2b, the z variables are all correctly selected as nonlinear. However, the x variables are sometimes incorrectly selected as nonlinear and the u variables are sometimes incorrectly selected as either linear or nonlinear. Accelerated twin boosting shows better accuracy in selecting both the linear and the uninformative variables.

For example 2, we again start with the estimation results, Figure 3, and again see a similar final test error for accelerated boosting and accelerated twin boosting. The increase in test error around iterations 75-85 appears to be the result of incorrectly selecting uninformative variables. Accelerated twin boosting reduces this increase because it is less likely to select uninformative variables, as shown in Figure 4.


FIG. 1: Average root mean square testing error (RMSE) versus iterations for example 1. The solid line corresponds to the base case of accelerated boosting of partially linear additive models, while the dashed line corresponds to accelerated twin boosting.


FIG. 2: Percentage of variables selected for Example 1 using accelerated boosting (left) and accelerated twin boosting (right). The black, grey and white bars correspond to nonlinear, linear and uninformative selection, respectively.


FIG. 3: Average root mean square testing error (RMSE) versus iterations for example 2. The solid line corresponds to the base case of accelerated boosting of partially linear additive models, while the dashed line corresponds to accelerated twin boosting.

Figure 4 presents the variable selection results for example 2. Again, the nonlinear inputs are all correctly selected as nonlinear, but the accuracy of the base accelerated boosting case is much lower. Accelerated twin boosting shows much improvement in correctly selecting variables. Table 1 presents the variable selection results in tabular format. As can be seen in the table, twin boosting improves the accuracy of both nonlinear and linear variable selection. Nonlinear variables are always selected as nonlinear; linear variables are usually selected as linear, occasionally selected as nonlinear, but never selected as uninformative. All of the false negatives in the linear condition of Table 1 were selected as nonlinear, not uninformative. Uninformative variables are usually correctly selected, but occasionally selected as linear or nonlinear. This indicates a tendency for accelerated boosting to select a variable as a more informative type, but not a less informative one.



FIG. 4: Percentage of variables selected for Example 2 using accelerated boosting (left) and accelerated twin boosting (right). The black, grey and white bars correspond to nonlinear, linear and uninformative selection, respectively.

From an estimation standpoint, this is an advantage, as accelerated boosting will not lose information because an informative variable was incorrectly selected as uninformative. Twin boosting reduces this tendency to select variables as more informative than they are, but still errs on the side of retaining information rather than removing it. Given the effectiveness of twin boosting in improving variable selection, and the importance of variable selection in modeling real data sets, all further tests are performed using twin boosting and accelerated twin boosting of PLAMs.

4.2 Estimation Results

To show how effective our method is at model estimation, we compare accelerated twin boosting with non-accelerated twin boosting of partially linear additive models. There are two important factors to examine in the estimation results. Generally, boosting iterations continue until the error on the testing data set is minimized.


TABLE 1: Summary of nonlinear and linear variable selection for Examples 1 and 2. The nonlinear condition compares nonlinear selection versus either linear or uninformative selection. The linear condition compares linear selection versus either nonlinear or uninformative selection. Accuracy corresponds to the percentage of correctly classified variables, calculated as the number of true positives plus true negatives divided by the total number of variables. True negative "TN" corresponds to the percentage of variables that are correctly classified as not being nonlinear or linear, respectively. False negative "FN" corresponds to the percentage of variables that are incorrectly classified as not being nonlinear or linear, respectively.

                                  Nonlinear                  Linear
                            Accuracy   TN     FN       Accuracy   TN     FN
Example 1  Acc. Boosting       90.1   89.0    0.0         95.6   96.5   12.3
           Acc. Twin Boosting  98.4   98.9    0.0         99.0   99.1    2.3
Example 2  Acc. Boosting       63.3   49.9    0.0         88.0   97.3   49.3
           Acc. Twin Boosting  96.5   95.3    0.0         97.7   98.3    4.7

The test error is an indicator of model fit, and minimizing it should give the best model fit. So, when examining the results, we want the minimum test error of our method to be the same as or smaller than that of non-accelerated twin boosting of partially linear additive models. The other important factor is the number of iterations needed to achieve the minimum test error. Since accelerated and non-accelerated twin boosting take roughly the same amount of time per iteration, the number of iterations is a rough proxy for computational time. For example, 1,000 boosting iterations will take roughly 10 times longer than 100 boosting iterations, all other factors being equal. In our results, we show that accelerated twin boosting of PLAMs takes substantially fewer iterations than non-accelerated twin boosting.

We start by plotting the average testing error against the number of iterations for example 1, as seen in Figure 5. This plot shows that both accelerated and non-accelerated twin boosting have a similar minimum test error, but accelerated twin boosting achieves its minimum sooner than non-accelerated twin boosting.


FIG. 5: Average root mean square testing error (RMSE) versus twin boosting iterations for example 1. The solid line corresponds to the base case of non-accelerated twin boosting of partially linear additive models, while the dashed line corresponds to our proposed method of accelerated twin boosting.

The minimum test error of accelerated twin boosting occurs at fewer than 500 iterations, while that of non-accelerated twin boosting occurs at around 4000-5000 iterations. For this example, accelerated twin boosting takes roughly one-tenth of the iterations needed for non-accelerated twin boosting, while providing similar accuracy. As is common in boosting, continuing to iterate beyond the minimum test error causes the test error to slowly increase due to overfitting. While this is apparent in our plot of accelerated twin boosting, due to its faster response, the same phenomenon would also be visible for non-accelerated twin boosting if the plot were extended by another 20,000 iterations.

The learning rate parameter for boosting is frequently used to tune the tradeoff between speed and accuracy. The previous plot uses the commonly recommended value of 0.01 for the learning rate, but it is important to examine these results for a variety of learning rates.



FIG. 6: Boxplots of the root mean square error (RMSE) (left) and iterations needed (right) versus learning rate for Example 1. The grey boxes are the results for non-accelerated twin boosting of partially linear additive models and the white boxes are for accelerated twin boosting.

Figures 6a and 6b show boxplots of the minimum test error and the iterations needed to reach that minimum for example 1. They show that, over a wide range of learning rates, accelerated twin boosting fits a partially linear additive model in far fewer iterations, while achieving a similar or better test error than non-accelerated boosting.

We repeat these tests for example 2 and find the results to be similar to those for example 1. Figure 7 shows the test error versus twin boosting iterations for example 2. The accelerated case achieves its minimum test error and best model fit at fewer than 500 iterations. Non-accelerated boosting does not achieve its minimum test error before the 5000-iteration cutoff. Again, these results hold for a wide range of learning rates, as shown in Figures 8a and 8b. Our method shows promise in achieving a similar or better error with a reduced number of iterations.

To show that the above results hold for a real data example, we fit a partially linear additive model to the Taiwan housing data.


FIG. 7: Average root mean square test error versus boosting iterations for the second example. The solid line represents the non-accelerated twin boosting of partially linear models, while the dashed line represents accelerated twin boosting.


FIG. 8: Boxplots of the root mean square error (RMSE) (left) and iterations needed (right) versus learning rate for Example 2. The grey boxes are the results for non-accelerated twin boosting of partially linear additive models and the white boxes are for accelerated twin boosting.



FIG. 9: Boxplots of the root mean square error (RMSE) (left) and iterations needed (right) versus learning rate for the Taiwan housing data. The grey boxes are the results for non-accelerated twin boosting of partially linear additive models and the white boxes are for accelerated twin boosting.

Each replication uses a different sample from the data for the training and testing sets. The test error for this example, shown in Figure 9a, is similar for all learning rates, but the number of iterations is again lower when using accelerated twin boosting (Figure 9b). Table 2 shows the numerical estimation results for all three of our example studies. In each example, the lowest average test error was achieved by our proposed method, though some of the non-accelerated cases achieved an error that was not statistically different from the lowest. The major advantage of accelerated twin boosting is the lower number of iterations needed to achieve the minimum error. If we compare the 0.1 learning rate cases of non-accelerated twin boosting to the "slower" 0.01 learning rate cases of accelerated twin boosting, we see that for all three studies, the 0.1 non-accelerated cases were 2.5 to 5 times slower than the 0.01 accelerated cases. Although results can vary depending on the data set used, we recommend

using a learning rate of 0.01 with accelerated twin boosting, as it achieves the best error, or an error not statistically different from the best, in every example and does so with far fewer iterations than non-accelerated twin boosting.

TABLE 2: Results of non-accelerated and accelerated twin boosting of partially linear models at learning rates of 0.001, 0.005, 0.01, 0.05 and 0.1. We present the average and standard deviation (SD) of the root mean square test error (RMSE). We also present the average number of iterations needed to achieve the best test error. Bold values are the lowest mean RMSE, or not statistically different from the lowest mean, for that row.

Conclusion

In conclusion, we presented two essays on popular machine learning methods and how these methods can be made easier, faster and potentially more interpretable. As machine learning grows in popularity, these methods will be used in a broader variety of situations and for more types of problems. To give more people access to these techniques, they must be fast and easy to use. If a state-of-the-art process requires thousands of computers and hundreds of man-hours to run, it will not find widespread adoption. For a method to be useful to the average user, it must run on a single computer in a reasonable amount of time. In practice, most machine learning methods need some amount of parameter tuning and cross-validation. This magnifies the amount of time that a process takes and is why at least some attention is paid to computation time in both of the included essays.

The first essay presented the case for using decision trees and tree ensembles in reinforcement learning. We found the tree-based batch methods, particularly random forest FQI (Ernst et al. 2005), to have promising results. More research is warranted for the online tree-based methods, but they have potential, including the method we proposed for RL, a moving-window random forest. Future work in this research stream includes more work on the online tree methods and more efficient ways to apply batch methods to online RL. We are interested in the real-world application of these methods. They are designed to be used quickly and easily by the average user,

and we wish to demonstrate their usefulness with an applied problem.

The second essay develops the use of accelerated boosting for partially linear additive models. Partially linear models are powerful, but it can be challenging to separate the linear terms from the nonlinear ones. Our method can automatically select variables to be linear, nonlinear or uninformative, making partially linear additive models easier to use. Our application of acceleration to this task makes the process take less time than non-accelerated methods. There are several avenues for future research in this stream. First, useful extensions to the method include using it for classification. Second, acceleration has an averaging effect on the input gradients; this makes it potentially useful for stochastic boosting, and the combination of acceleration and stochastic boosting could be quite effective. Third, other methods of optimizing gradient steps, such as Adam (Kingma and Ba 2014) and L-BFGS (Liu and Nocedal 1989), could potentially be used with gradient boosting.

Bibliography

Abel, D., Agarwal, A., Diaz, F., Krishnamurthy, A., and Schapire, R. E. (2016). Exploratory gradient boosting for reinforcement learning in complex domains. arXiv preprint arXiv:1603.04119.

Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. (2013). Advances in optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Beygelzimer, A., Hazan, E., Kale, S., and Luo, H. (2015). Online gradient boosting. In Advances in neural information processing systems, pages 2458–2466.

Biau, G., Cadre, B., and Rouvière, L. (2019). Accelerated gradient boosting. Machine Learning.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and regression trees.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym.


Bühlmann, P. and Hothorn, T. (2010). Twin boosting: improved feature selection and prediction. Statistics and Computing, 20(2):119–138.

Bühlmann, P. and Yu, B. (2003). Boosting with the l2 loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339.

Caruana, R., Karampatziakis, N., and Yessenalina, A. (2008). An empirical evaluation of supervised learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML-08). ACM Press.

Chapman, D. and Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In IJCAI, volume 91, pages 726–731.

Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. In Kdd, volume 2, page 4.

Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556.

Ernst, D., Stan, G.-B., Goncalves, J., and Wehenkel, L. (2006). Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control. IEEE.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Härdle, W. and Liang, H. (2007). Partially linear models. In Statistical Methods for Biostatistics and Related Fields, pages 87–103. Springer.

Härdle, W., Liang, H., and Gao, J. (2012). Partially Linear Models. Springer Science & Business Media.


Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models, volume 43. CRC Press.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York.

Hofner, B., Boccuto, L., and Göker, M. (2015). Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16(1):144.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kneib, T., Hothorn, T., and Tutz, G. (2009). Variable selection and model choice in geoadditive regression models. Biometrics, 65(2):626–634.

Kolter, J. Z. and Ng, A. Y. (2009). Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML-09). ACM Press.

Lagoudakis, M. G. and Parr, R. (2003). Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 424–431.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22.

Lakshminarayanan, B., Roy, D. M., and Teh, Y. W. (2014). Mondrian forests: Efficient online random forests. In Advances in neural information processing systems, pages 3140–3148.


Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer.

Liang, H., Thurston, S. W., Ruppert, D., Apanasovich, T., and Hauser, R. (2008). Additive partial linear models with measurement errors. Biometrika, 95(3):667–678.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321.

Liu, D. C. and Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528.

Lou, Y., Bien, J., Caruana, R., and Gehrke, J. (2016). Sparse partially linear additive models. Journal of Computational and Graphical Statistics, 25(4):1126–1140.

Marcus, G. (2018). Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631.

Martín-Guerrero, J. D., Soria-Olivas, E., Martínez-Sober, M., Serrano-López, A. J., Magdalena-Benedito, R., and Gómez-Sanchis, J. (2008). Use of reinforcement learning in two real applications. In Lecture Notes in Computer Science, pages 191–204. Springer Berlin Heidelberg.

McCallum, A. K. (1996). Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester.

Michie, D. and Chambers, R. A. (1968). Boxes: An experiment in adaptive control. Machine Intelligence, 2(2):137–152.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Moore, A. W. and Atkeson, C. G. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21(3):199–233.

Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2). Doklady Akademii Nauk SSSR, 269:543–547.

Ohio Supercomputer Center (1987). Ohio supercomputer center. Columbus OH: Ohio Supercom- puter Center. http://osc.edu/ark:/19495/f5s1ph73.

Opsomer, J. D. and Ruppert, D. (1999). A root-n consistent backfitting estimator for semiparametric additive modeling. Journal of Computational and Graphical Statistics, 8(4):715.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Pyeatt, L. D. and Howe, A. E. (2001). Decision tree function approximation in reinforcement learning. In Proceedings of the Third International Symposium on Adaptive Systems: Evolutionary Computation and Probabilistic Graphical Models, volume 2, pages 70–77.

Riedmiller, M. (2005). Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer Berlin Heidelberg.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1985). Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.


Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, England.

Saffari, A., Leistner, C., Santner, J., Godec, M., and Bischof, H. (2009). On-line random forests. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1393–1400. IEEE.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

Stone, C. J. et al. (1985). Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705.

Sutskever, I. (2013). Training recurrent neural networks. PhD thesis, University of Toronto.

Sutton, R. S. and Barto, A. G. (1998). Introduction to reinforcement learning, volume 135. MIT press Cambridge.

Tang, X. and Lian, H. (2015). Mean and quantile boosting for partially linear additive models. Statistics and Computing, 26(5):997–1008.

Tesauro, G. and Kephart, J. O. (2002). Pricing in agent economies using multi-agent q-learning. In Multiagent Systems, Artificial Societies, and Simulated Organizations, pages 293–313. Springer US.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.


Tosatto, S., Pirotta, M., D'Eramo, C., and Restelli, M. (2017). Boosted fitted Q-iteration. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3434–3443. JMLR.

Uther, W. T. and Veloso, M. M. (1998). Tree based discretization for continuous state space reinforcement learning. In AAAI-98, pages 769–774.

Wang, L., Liu, X., Liang, H., and Carroll, R. J. (2011). Estimation and variable selection for gener- alized additive partial linear models. Annals of statistics, 39(4):1827.

Wang, X. and Dietterich, T. G. (1999). Efficient value function approximation using regression trees. In Proceedings of the IJCAI-99 Workshop on Statistical Machine Learning for Large-Scale Optimization, Stockholm, Sweden, volume 31.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3/4):279–292.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King’s College.

Yeh, I.-C. and Hsu, T.-K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65:260–271.
