Advances in Reinforcement Learning and Their Implications for Intelligent Control

Steven D. Whitehead*, Richard S. Sutton†, and Dana H. Ballard*

*Department of Computer Science, University of Rochester, Rochester, NY 14627
†GTE Laboratories Incorporated, Waltham, MA 02254

1 Introduction

What is an intelligent control system, and how will we know one when we see it? This question is hard to answer definitively, but intuitively what distinguishes intelligent control from more mundane control is the complexity of the task and the ability of the control system to deal with uncertain and changing conditions. A list of the definitive properties of an intelligent control system might include some of the following:

• Effective: The first requirement of any control system, intelligent or not, is that it generate control actions that adequately control the plant (environment). The adverb adequately is intentionally imprecise, since effective control rarely means optimal control; more often it means control sufficient for the purposes of the task.

• Reactive: Equally important is that a system's control be timely. Effective control is useless if it is executed too late. In general, an intelligent control system cannot require arbitrarily long delays between control decisions. On the contrary, it must be capable of making decisions on a moment's notice, upon demand. If extra time is available that can be exploited to improve the quality of a decision, so much the better, but control decisions must be available at any time [DB88].

• Situated: An intelligent control system should be tightly coupled to its environment, and its control decisions must be based on the immediate situation. A system must be capable of responding to unexpected contingencies and opportunities as they arise [AC87].

• Adaptive: An intelligent control system must use its experience with the environment to improve its performance. Adaptability allows a system to refine an initially suboptimal control policy and to maintain effective control in the face of non-stationary environments.

• Robust under Incomplete and Uncertain Domain Knowledge: An intelligent control system must not depend upon a complete and accurate domain model. Even for relatively simple, narrow domains, it is extremely difficult to build models that are complete and accurate [Sha90]. If control depends upon a domain model, it must suffice for that model to be incomplete and inaccurate. If domain knowledge is available, then the system should be capable of exploiting it, but it should not be a prerequisite for intelligent control. A system that uses a domain model learned incrementally through experience is to be preferred over one that relies upon a complete a priori model.

• Perceptual Feasibility: The information provided directly by a system's sensors is necessarily limited, and an intelligent control architecture must take this limitation into account. Instead of assuming that any and all information about the state of the environment is immediately available, intelligent systems must be designed with limited but flexible sensory-motor systems. Control algorithms must consider actions for collecting necessary state information as well as actions for effecting change in the environment.

• Mathematical Foundations: To facilitate performance analysis, it is important that an intelligent control system be based on a framework that has a solid mathematical foundation.

While the above list is not particularly surprising or new, few if any control systems have been built that satisfy all of the requirements. For example, the problem-solving architectures that have been the dominant approach in AI over the past twenty years fall well short of these goals. These problem-solving architectures are not particularly reactive, situated, adaptive, or robust in the face of incomplete and inaccurate domain models. They also tend to make unrealistic assumptions about the capabilities of the sensory system. While extensions and modifications to these architectures continue to be popular, other approaches are being considered as well.


This paper focuses on control architectures that are based on reinforcement learning. In particular, we survey several recent advances in reinforcement learning that have substantially increased its viability as a general approach to intelligent control.

We begin by considering the relationship between reinforcement learning and dynamic programming, both viewed as methods for solving multi-stage decision problems [Wat89, BSW90]. It is shown that many reinforcement learning algorithms can be viewed as a kind of incremental dynamic programming; this provides a mathematical foundation for the study of reinforcement learning systems. Connecting reinforcement learning to dynamic programming has also led to a strong optimal convergence theorem for one class of reinforcement learning algorithms and opens the door for similar analyses of other algorithms [Wat89].

Next we show how reinforcement learning methods can be used to go beyond simple trial-and-error learning. By augmenting them with a predictive domain model and using the model to perform a kind of incremental planning, their learning performance can be substantially improved. These control architectures learn both by performing experiments in the world and by searching a domain model. Because the domain model need not be complete or accurate, it can be learned incrementally through experience with the world [Sut90a, Sut90b, WB89, Whi89, Lin90].

Finally, we discuss active sensory-motor systems for feasible perception and how these systems interact with reinforcement learning. We find that, with some modification, many of the ideas from reinforcement learning can be successfully combined with active sensory-motor systems. The system then learns not only an overt control strategy, but also where to focus its attention in order to collect necessary sensory information [WB90b].

The results surveyed in this paper have been reported elsewhere (primarily in the machine learning literature). The objective here is to summarize them and to consider their implications for the design of intelligent control architectures.

2 Reinforcement Learning for Intelligent Control

What is reinforcement learning? A reinforcement learning system is any system that, through interaction with its environment, improves its performance by receiving feedback in the form of a scalar reward (or penalty) that is commensurate with the appropriateness of the response. By improves its performance, we mean that the system uses the feedback to adapt its behavior in an effort to maximize some measure of the reward it receives in the future. Intuitively, a reinforcement learning system can be viewed as a hedonistic automaton whose sole objective is to maximize the positive (reward) and minimize the negative (punishment).
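As a minimal sketch of this feedback interface (the Environment and Agent classes and their method names below are illustrative assumptions, not constructs defined in the paper): the system emits an action, the environment returns a new observation and a scalar reward, and the system adapts.

```python
# Minimal sketch of the reinforcement-learning interaction loop described
# above. Environment/Agent and their methods are illustrative placeholders,
# not an interface defined in the paper.

class Environment:
    def step(self, action):
        """Apply an action; return (next_observation, scalar_reward)."""
        raise NotImplementedError

class Agent:
    def act(self, observation):
        """Select an action given the current observation."""
        raise NotImplementedError

    def learn(self, observation, action, reward, next_observation):
        """Use the scalar reward as feedback to improve future behavior."""
        raise NotImplementedError

def run(agent, env, first_observation, num_steps):
    obs = first_observation
    for _ in range(num_steps):
        action = agent.act(obs)
        next_obs, reward = env.step(action)   # feedback is a single scalar
        agent.learn(obs, action, reward, next_obs)
        obs = next_obs
```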
Recent examples of controllers based on reinforcement learning include Barto et al.'s pole balancer [BSA83, Sut84], Grefenstette's simulated flight controller [Gre89], Lin's animats [Lin90], and Franklin's adaptive robot controllers [Fra88], among others.

2.1 Evaluating Reinforcement Learning

Reinforcement learning is emerging as an important alternative to classical problem-solving approaches to intelligent control because it possesses many of the properties for intelligent control that problem-solving approaches lack. In many respects the two approaches are complementary, and it is likely that eventual intelligent control architectures will incorporate aspects of both.¹

[Footnote 1: To some extent this integration has already begun to occur, with the development of reinforcement learning systems that learn and use internal domain models to improve overall performance.]

Following is a discussion of the degree to which current reinforcement learning systems achieve each of the properties that we associate with intelligent control.

• Effective: Reinforcement learning systems are effective in the sense that they eventually learn effective control strategies. Although a system's initial performance may be poor, with enough interaction with the world it will eventually learn an effective strategy for obtaining reward. For the most part, the asymptotic effectiveness of reinforcement learning systems has been validated only empirically; however, recent advances in the theory of reinforcement learning have yielded mathematical results that guarantee optimality in the limit for an important class of reinforcement learning systems [Wat89].

• Reactive: Decision-making in reinforcement learning systems is based on a policy function which maps situations (inputs) directly into actions (outputs) and which can be evaluated quickly. Consequently, reinforcement learning systems are extremely reactive.

• Situated: Reinforcement learning systems are situated because each action is chosen based on the current state of the world.

• Adaptive: Reinforcement learning systems are adaptive because they use feedback to improve their performance.

• Incomplete and Uncertain Domain Knowledge: Reinforcement learning systems do not depend upon internal domain models because they learn through trial-and-error experience with the world. However, when available, they can exploit domain knowledge by 1) using prior knowledge about the control task to determine a good initial policy [Fra88], 2) using an internal domain model to perform mental experiments instead of relying solely upon trial-and-error experiences, and 3) using a domain model to generalize the results of experiments in the world [YSUB90]. Also, because of the incremental nature of reinforcement learning, the models used need not be complete or accurate. This has led to systems that profit from using models acquired through trial-and-error interaction with the world [Sut90a, Sut90b, Whi89, Lin90].


• Perceptual Feasibility: Most reinforcement learning systems have not addressed the issue of perceptual feasibility. However, recent results indicate that many of the ideas from reinforcement learning can be carried over (if indirectly) to systems that must actively control their sensory processes [WB90a, WB90b].

• Mathematical Foundations: Although reinforcement learning has been primarily an empirical science, there is a growing body of theory [BA85, Wil88, Sut88, Wat89]. Advances relating reinforcement learning to dynamic programming are beginning to provide a solid mathematical foundation, as discussed below.

3 Reinforcement Learning as Incremental Dynamic Programming

Like much of artificial intelligence, reinforcement learning is primarily an empirical science, and the systems developed to solve reinforcement learning problems have been validated primarily through extensive simulation studies. However, this trend is changing, and a mathematical theory of reinforcement learning is beginning to emerge. The catalyst for this change has been the identification of the relationship between reinforcement learning and dynamic programming. In particular, Watkins has shown that certain reinforcement learning algorithms can be viewed as Monte Carlo versions of dynamic programming algorithms for solving multi-stage decision problems [Wat89].²

[Footnote 2: To our knowledge, Werbos first made the connection between reinforcement learning and the theory of multi-stage decision problems. However, Watkins solidified the connection and was first to obtain theoretical results.]

This connection has had three important implications for reinforcement learning:

• It has led to a strong optimality theorem concerning the asymptotic performance of an important class of reinforcement learning algorithms.

• It has tied reinforcement learning to a well established mathematical foundation on which further analytical studies can be based.

• It has clarified our intuitive understanding of reinforcement learning and has directly contributed to the development of extended architectures that are more general and outperform previous architectures (cf. Sections 4 and 5).

3.1 Multi-stage Decision Problems

Multi-stage decision problems are modeled as Markov decision processes. A Markov decision process is defined by the tuple (S, A, T, R), where S is the set of possible states the world can occupy; A is the set of possible actions a controller may execute to change the state of the world; T is the state transition function; and R is the reward function. Usually, S and A are discrete and finite.

In a discrete-time Markov decision process, time advances by discrete, unit-length quanta: t = 0, 1, 2, .... At each time step the world occupies one state, s_t ∈ S, and a time step occurs when the controller applies an action, a_t ∈ A. The result of executing an action is a new state s_{t+1} and the receipt of a reward r_{t+1}. The state transition function, T, models the effects of applying different actions in different states and maps state-action pairs into a resulting new state. In general, transitions are probabilistic, so that applying action a in state s yields a new state s' = T(s, a) that is drawn from a probability distribution over S. The probabilities that govern the transition function depend only upon the action selected and the state in which it was applied. These probabilities are assumed to be known and are denoted by P_{x,y}(a), where

    P_{x,y}(a) = \Pr(T(x, a) = y).                                        (1)

As with transitions, rewards are generated probabilistically, and R is a probabilistic function of the state-action pair executed. The distributions governing R depend only upon the state-action pair executed, and are assumed to be known.

Given a description of a Markov decision process, the objective is to find a control policy (i.e. a mapping from states to actions) that, when executed by the controller, maximizes some measure of the cumulative reward received over time.

There are numerous measures of cumulative reward. One of the most common is a measure based on a discounted sum of the reward received over time. This sum is called the return, and for time t it is defined as

    \sum_{n=0}^{\infty} \gamma^n r_{t+n}                                   (2)

where γ is a discount factor between 0 and 1. Because the process is stochastic, the objective is to find a decision policy that maximizes the expected return. For a fixed policy π, define V_π(x) to be the expected return given that the process begins in state x and follows policy π thereafter. V_π is called the utility function for policy π and can be used to define optimality criteria. The objective is to find a policy whose utility function is uniformly maximal for every state. That is, to find an optimal policy π* such that

    V_{\pi^*}(x) \ge V_{\pi}(x)    for all states x \in S and all policies \pi.    (3)
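As a concrete illustration of the return in Equation 2 and of the utility V_π(x), the sketch below estimates both by Monte Carlo rollouts on a tiny, made-up Markov decision process; the transition and reward tables and the sampling helpers are illustrative assumptions, not part of the paper.

```python
import random

# Illustrative sketch: estimate the discounted return of Equation 2 and the
# utility V_pi(x) of a fixed policy by sampling trajectories of a small MDP.
# The transition/reward tables below are made-up examples.

GAMMA = 0.9

# P[(state, action)] -> list of (probability, next_state); R[(state, action)] -> mean reward
P = {("s0", "a"): [(1.0, "s1")], ("s1", "a"): [(0.5, "s0"), (0.5, "s1")]}
R = {("s0", "a"): 0.0, ("s1", "a"): 1.0}
policy = {"s0": "a", "s1": "a"}

def sample_next(state, action):
    u, acc = random.random(), 0.0
    for prob, nxt in P[(state, action)]:
        acc += prob
        if u <= acc:
            return nxt
    return P[(state, action)][-1][1]

def rollout_return(state, horizon=200):
    """Discounted return of Equation 2, truncated at a finite horizon."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]
        total += discount * R[(state, action)]
        discount *= GAMMA
        state = sample_next(state, action)
    return total

def estimate_utility(state, trials=1000):
    """Monte Carlo estimate of V_pi(state)."""
    return sum(rollout_return(state) for _ in range(trials)) / trials

print(estimate_utility("s0"))
```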


The optimality theorem from the theory of multi-stage decision problems guarantees that for a stationary, discrete-time, discrete-state Markov decision process there exists a deterministic decision policy that is optimal. Furthermore, a policy π is optimal if and only if

    Q_{\pi}(x, \pi(x)) \ge \max_{b \in A} Q_{\pi}(x, b)    for all x \in S    (4)

where Q_π(x, a), called the action-value for the state-action pair (x, a), is defined as the return the system expects to receive given that it starts in state x, applies action a next, and then follows policy π thereafter [Bel57, Ber87].

3.2 Policy Optimization by Dynamic Programming

Two of the most important dynamic programming methods for computing the optimal policy for a given Markov decision process are policy iteration and value iteration. Policy iteration begins with an arbitrary policy and monotonically improves it until it converges on an optimal policy. The principal idea is simply to choose a policy, π; compute V_π, the expected return associated with that policy; and then improve the policy by replacing actions whose local counterparts outperform them. In value iteration, the optimal utility function is directly computed without going through a series of suboptimal policies. Once the optimal utility function has been obtained, it is then straightforward to compute the optimal policy via Equation 4.

3.3 The Relationship Between Reinforcement Learning and Multi-Stage Decision Problems

There is a close relationship between reinforcement learning and using dynamic programming to solve multi-stage decision problems. In both, the world is characterized by a set of states, a set of possible actions, and a reward function. In both, the objective is to find a decision policy that maximizes the cumulative reward received over time. There is an important difference, though. When solving a multi-stage decision problem, the analyst (presumably the designer of the eventual control system) has a complete (albeit stochastic) model of the environment's behavior. Given this information, the analyst can compute the optimal control policy with respect to the model, as outlined above. In reinforcement learning, the set of states and the set of possible actions are known a priori, but the effects of actions on the environment and on the production of reward are not. Thus, the designer cannot compute an optimal policy a priori. Instead the control system must learn an optimal policy by experimenting in the environment.

3.4 Q-learning

In addition to recognizing the intrinsic relationship between reinforcement learning and dynamic programming, Watkins has made an important contribution to reinforcement learning by suggesting a new learning algorithm called Q-learning. The significance of Q-learning is that one version of it, 1-step Q-learning, when applied to a Markov decision process, can be shown to converge to the optimal policy under appropriate conditions. 1-step Q-learning is the first reinforcement learning algorithm to be shown convergent to the optimal policy for decision problems involving delayed reward.

The connections between Q-learning and dynamic programming are strong: 1-step Q-learning is motivated directly by value-iteration, and its convergence proof is based on a generalization of the convergence proof for value-iteration [Wat89].

In value-iteration, the optimal policy is obtained in the limit by solving a series of finite-horizon tasks. The equations for computing each cycle in the iteration (i.e. V^n from V^{n-1}) are [Wat89]:

    Q^n(x, a) = E[R(x, a)] + \gamma E[V^{n-1}(T(x, a))]                    (5)

    V^n(x) = \max_{a \in A} Q^n(x, a)                                      (6)

    \pi^n(x) = a  such that  Q^n(x, a) = V^n(x)                            (7)

where π^n, V^n, and Q^n are the optimal policy, value, and action-value functions for the n-stage task, respectively.
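A small sketch of the value-iteration cycle of Equations 5-7, assuming the model (transition probabilities and expected rewards) is known; the two-state example tables are invented for illustration and are not taken from the paper.

```python
# Illustrative value-iteration sketch for Equations 5-7, assuming the model
# (transition probabilities P and expected rewards R) is known. The tiny
# two-state, two-action MDP below is a made-up example.

GAMMA = 0.9
STATES = ["x0", "x1"]
ACTIONS = ["a0", "a1"]

# P[(x, a)] -> {next_state: probability}; R[(x, a)] -> expected reward E[R(x, a)]
P = {("x0", "a0"): {"x0": 1.0},            ("x0", "a1"): {"x1": 1.0},
     ("x1", "a0"): {"x0": 0.3, "x1": 0.7}, ("x1", "a1"): {"x1": 1.0}}
R = {("x0", "a0"): 0.0, ("x0", "a1"): 1.0, ("x1", "a0"): 2.0, ("x1", "a1"): 0.5}

V = {x: 0.0 for x in STATES}             # V^0
for n in range(100):                     # compute V^n from V^(n-1)
    Q = {(x, a): R[(x, a)] + GAMMA * sum(p * V[y] for y, p in P[(x, a)].items())
         for x in STATES for a in ACTIONS}                       # Equation 5
    V = {x: max(Q[(x, a)] for a in ACTIONS) for x in STATES}     # Equation 6

pi = {x: max(ACTIONS, key=lambda a: Q[(x, a)]) for x in STATES}  # Equation 7
print(V, pi)
```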
In reinforcement learning these equations cannot be solved iteratively, since the statistics required to compute the expectations are unavailable. That is, E[R(x, a)] and P_{x,y}(a) (needed to compute E[V^{n-1}(T(x, a))]) are unknown. The principle of 1-step Q-learning is to solve these equations incrementally by using experience gained through interactions with the actual environment to estimate the expectations in Equation 5. Replacing the expectations in Equation 5 with the values observed during a trial leads to the following updating rules:

    Q_{t+1}(x_t, a_t) = (1 - \beta) Q_t(x_t, a_t) + \beta [ r_t + \gamma \hat{V}_t(x_{t+1}) ]    (8)

    \hat{V}_{t+1}(x) = \max_{a \in A} Q_{t+1}(x, a)                        (9)

    \hat{\pi}_{t+1}(x) = a  such that  Q_{t+1}(x, a) = \hat{V}_{t+1}(x)    (10)

where \hat{V}_t and Q_t are the system's estimates for the optimal utility and action-value functions at time t. Notice that the bracketed term on the right-hand side of Equation 8 is an estimate of the system's future return that is based on the actual result of executing one step, a_t in state x_t at time t (i.e., it estimates the term on the right-hand side of Equation 5). For n-step Q-learning, this 1-step estimate is replaced by its n-step counterpart, based on the actual results of executing n actions. In this case, the bracketed term becomes:

    [ r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \hat{V}(x_{t+n}) ]    (11)

Watkins has shown that any 1-step Q-learning system that 1) decreases its learning rate at an appropriate rate and 2) tests each state-action pair infinitely often over the course of its lifetime is guaranteed to converge to an optimal policy [Wat89].
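The corresponding 1-step Q-learning update of Equations 8-10 can be sketched as follows; the environment interface env_step(x, a) -> (r, x_next), the epsilon-greedy exploration, and the particular learning-rate schedule are assumptions added for illustration rather than details fixed by the text.

```python
import random
from collections import defaultdict

# Illustrative 1-step Q-learning sketch (Equations 8-10). The environment
# interface env_step(x, a) -> (r, x_next) and the epsilon-greedy exploration
# are assumptions; the text only requires that every state-action pair be
# tried sufficiently often and that the learning rate decay appropriately.

GAMMA, EPSILON = 0.9, 0.1
ACTIONS = ["a0", "a1"]
Q = defaultdict(float)                        # Q_t(x, a), initially zero
visits = defaultdict(int)

def V_hat(x):
    return max(Q[(x, a)] for a in ACTIONS)    # Equation 9

def pi_hat(x):
    return max(ACTIONS, key=lambda a: Q[(x, a)])   # Equation 10

def q_learning_step(x, env_step):
    a = random.choice(ACTIONS) if random.random() < EPSILON else pi_hat(x)
    r, x_next = env_step(x, a)                # one real experience
    visits[(x, a)] += 1
    beta = 1.0 / visits[(x, a)]               # a decaying learning rate
    Q[(x, a)] = (1 - beta) * Q[(x, a)] + beta * (r + GAMMA * V_hat(x_next))  # Eq. 8
    return x_next
```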


4 Beyond Trial and Error Learning: Incremental Planning

Reinforcement learning and dynamic programming lie at opposite ends of a spectrum. At one end, reinforcement learning systems learn control policies based solely on trial-and-error interactions with the environment. No dynamical model of the world is needed or used. At the other end, dynamic programming is used to compute control policies based solely on a complete dynamical model of the Markov decision process. Dynamic programming does not rely on actual experience with the world.

The principal advantage of dynamic programming is that, if a problem can be specified in terms of a Markov decision process, then it can be analyzed and an optimal policy obtained a priori. Other than computational complexity, the two principal disadvantages of dynamic programming are 1) for many tasks, it is difficult to specify the dynamical model, and 2) because dynamic programming determines a fixed control policy a priori, it does not provide a mechanism for adapting the policy to compensate for non-stationary dynamics or modeling errors.

Reinforcement learning has the complementary advantages: 1) it does not require a prior dynamical model of any kind, but learns based on experience gained directly from the world, and 2) to some degree, it can track the dynamics of non-stationary systems. The principal disadvantage of reinforcement learning is that in general many trials (repeated experiences) are required to learn an optimal control strategy, especially if the system starts with a poor initial policy.

This suggests that the respective weaknesses of these two approaches may be overcome by integrating them. That is, if a complete, possibly inaccurate, model of the task is available a priori, dynamic programming can be used to develop the initial policy for a reinforcement learning system. A reasonable initial policy can substantially improve the system's initial performance and reduce the time required to reach an acceptable level of performance [Fra88]. Conversely, adding an adaptive, reinforcement learning component to an otherwise fixed controller whose policy is determined by dynamic programming can compensate for an inaccurate model.
Although this simple approach helps to mitigate the poor initial performance of naive reinforcement learning systems, it does not in itself improve their overall learning rate. It has been proposed by Sutton and Whitehead that learning rate can be improved and other advantages obtained if reinforcement learning systems are, in addition, augmented with a predictive model of the environment [Sut90a, Sut90b, WB89, Whi89]. These Dyna architectures are based on the idea that planning is like trial-and-error learning from hypothetical experience. That is, the model is used to construct hypothetical experiences, and then these are learned from just as if they had actually happened.

An outline of a Dyna architecture is shown in Figure 1. The two main components are a reinforcement learning subsystem, which can employ any number of reinforcement learning algorithms (e.g., Q-learning, AHC algorithms, the Bucket Brigade, etc.), and a predictive model, which mimics the one-step input-output behavior of the world. The significance of the internal model in Dyna architectures is that it provides a fast and inexpensive mechanism for propagating the effects of actual experience throughout the system.

To see how an internal model can help, consider a 1-step Q-learning system that at time t applies action a_t in state x_t and as a result obtains the new state x_{t+1} and receives the reward r_{t+1}. In 1-step Q-learning, the only immediate effect of this experience is to change the functional values associated with state x_t. That is, Q(x_t, a_t), V(x_t), and π(x_t) stand to change, but no other values will. Eventually the ramifications of the experience will be propagated to states other than x_t. For example, changes in V(x_t) affect the value estimates of other neighboring states that are causally connected to x_t (i.e., that can immediately precede x_t). These states, in turn, will affect the utility estimates of their causally connected neighbors, and so on. In general, a single experience can have ramifications that affect the policy and utility value of every state. What makes 1-step Q-learning slow is that propagation occurs only one step at a time. The transitions in a sequence must be traversed O(n) times in order for the effects of an event to be propagated back n stages.³

[Footnote 3: Other reinforcement learning algorithms exist that use memory (or "eligibility") traces to propagate back the effects of an action to the states that precede it. Most notable are the algorithms based on Temporal Difference Methods [Sut88]. These algorithms mitigate the propagation problem to some degree. However, they do not do a complete job because they only affect the sequence of states that immediately preceded the event. Other trial-and-error experiments are necessary to propagate the changes to other causally related states not found in the experienced sequence.]

Using an internal model to perform hypothetical experiments is a fast and inexpensive mechanism for propagating utility information throughout the system. Hypothetical reasoning is fast because the effects of actions can be simulated faster than they can be performed in the real world, and because hypothetical experiments need not be tied to the current state, but can be performed in any state that can be imagined. Hypothetical reasoning is also inexpensive because the reward and punishment received by the system is imaginary. As a result, the amount of time actually exposed to dangerous situations and the cumulative punishment received for performing badly is reduced.
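A compact sketch of the Dyna idea just described: each real experience updates the action-values and trains a one-step predictive model, and a handful of hypothetical experiences drawn from that model are then learned from exactly as if they were real. The interface, constants, and the deterministic model are illustrative assumptions, not the paper's specification.

```python
import random
from collections import defaultdict

# Illustrative Dyna-style sketch: a Q-learning subsystem plus a learned
# one-step predictive model used to generate hypothetical experiences.
# The environment interface env_step(x, a) -> (r, x_next), the constants,
# and the deterministic model are simplifying assumptions for illustration.

GAMMA, BETA, EPSILON, K_HYPOTHETICAL = 0.9, 0.5, 0.1, 10
ACTIONS = ["a0", "a1"]
Q = defaultdict(float)
model = {}                                   # (x, a) -> (r, x_next), learned from experience

def update(x, a, r, x_next):
    """Apply the 1-step Q-learning update (Equation 8) to one experience,
    whether that experience is real or hypothetical."""
    target = r + GAMMA * max(Q[(x_next, b)] for b in ACTIONS)
    Q[(x, a)] = (1 - BETA) * Q[(x, a)] + BETA * target

def dyna_step(x, env_step):
    greedy = max(ACTIONS, key=lambda a: Q[(x, a)])
    a = random.choice(ACTIONS) if random.random() < EPSILON else greedy
    r, x_next = env_step(x, a)               # one real experience
    update(x, a, r, x_next)
    model[(x, a)] = (r, x_next)              # train the predictive model
    for _ in range(K_HYPOTHETICAL):          # cheap hypothetical experiences
        hx, ha = random.choice(list(model))  # need not be the current state
        hr, hx_next = model[(hx, ha)]
        update(hx, ha, hr, hx_next)
    return x_next
```

Because each hypothetical update costs only a model lookup, the effects of a single real experience can be propagated back many stages without re-traversing the corresponding transitions in the world.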


In addition to improving learning rate, Dyna architectures exhibit a number of other important properties. First, hypothetical experiences in Dyna architectures are incremental. A hypothetical experience can be as short as the simulation of one action and yet be completely effective. Thus, Dyna architectures provide a means for integrating reactive control with search-based planning. Also, Dyna architectures do not depend upon complete and accurate domain models. Instead they can use partial models that are learned as experience is gained in the world.

5 Beyond Perfect Perception: Active Representations

For the most part, reinforcement learning research has avoided issues of perception. In most reinforcement learning systems the sensory system is either trivially simple or abstracted out of the model altogether. The usual assumption, as in a Markov decision process, is that after each action the system observes the state of the world. The "state" of the world is defined by the values of the system's sensory inputs, and usually these inputs are carefully chosen by the system designer.

In more realistic learning tasks, complete knowledge of the task cannot be exploited in the design of the sensory system. In this case, the system must learn which aspects of the world are relevant to the task on its own. Before learning to solve the problem, the system must learn to represent it. One approach to this problem is to build a sensory system that is as complete as possible. A more practical approach is to consider systems that can flexibly sense different aspects of the world, but that on a moment-by-moment basis only register a limited amount of information. For example, one can imagine an autonomous robot that possesses a large repertoire of sensory routines that it can use to analyze the world (say 100 or so), but because of time, space, and processing constraints, it can only afford to apply a few at a time (say less than 10) [Ull84, TS90]. Deictic (or functional indexical) representations offer another example of this active approach to perception [AC87, Agr88, Cha90]. In a deictic representation, the system is capable of attending to only a limited number of objects at a time, and the properties of those objects form the basis of the system's sensory inputs. By changing its focus of attention, the system can represent different parts of the world and obtain a variety of representations for the same situation.

Whitehead and Ballard have investigated control architectures that integrate active perception and reinforcement learning [WB90b]. In the following subsections we present their three main contributions: 1) an abstract model that formalizes the functional relationships that exist between the world, the sensory-motor system, and the embedded decision system; 2) analyses and demonstrations that the integration of active perception and reinforcement learning is non-trivial, due to aliasing in the representation of world states; and 3) a new reinforcement learning algorithm that overcomes the difficulties caused by active perception for a restricted class of tasks.
5.1 The Formal Model

The formal model describing a learning agent and the world in which it is embedded is shown in Figure 2. The external world is modeled as a Markov decision process (whose statistical parameters are unknown to the agent). The decision process is characterized by the tuple (S_E, A_E, T, R), where S_E is the set of external world states, A_E is the set of external (overt) actions the agent can perform on the world, T is the transition function, and R is the reward function.

The agent has two major subsystems: an active sensory-motor subsystem and a decision subsystem. The sensory-motor system implements three functions: 1) a perceptual function P; 2) an internal configuration function I; and 3) a motor function M. The purpose of the sensory-motor subsystem is to ground internal perceptions and actions in the external world. On the sensory side, the system translates external world states into the agent's internal representation. Since perception is active, this mapping is dynamic and dependent upon the configuration of the sensory-motor apparatus. Formally, the relationship between external world states and the agent's internal representation is modeled by the perceptual function P, which maps the set of possible world states S_E and the set of possible sensory-motor configurations C onto the set of possible internal representations S_I. On the motor side, the agent has a set of internal motor commands, A_I, that affect the model in two ways: they can either change the state of the external world (by being translated into external actions, A_E), or they can change the configuration of the sensory-motor subsystem. Internal commands that change the state of the external world are called overt actions, and commands that change the configuration of the sensory-motor system are called perceptual actions. As with perception, the configuration of the sensory-motor system relativizes the effects of internal commands. This dependence is modeled by the functions M and I, which map internal commands and sensory-motor configurations into actions in the external world and into new sensory-motor configurations, respectively.

The remaining component of the agent is the decision subsystem. This subsystem is like a homunculus that sits inside the agent's head and controls its actions. The decision subsystem corresponds to the reinforcement learning systems discussed previously, except now it is embedded inside the agent and buffered from the external world by the sensory-motor system. On the sensory side, the decision subsystem has access only to the agent's internal representation, not to the state of the external world. Similarly, on the motor side, the decision subsystem generates internal action commands that are interpreted by the sensory-motor system. Formally, the decision subsystem implements a behavior function B that maps sequences of internal states and rewards, (S_I × ℝ)*, into internal actions, A_I.
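The functional decomposition of Figure 2 can be rendered roughly as follows; the class and method names are invented for illustration, and the paper defines P, I, M, and B only as abstract functions.

```python
# Illustrative sketch of the formal model of Figure 2: an active sensory-motor
# subsystem (perceptual function P, configuration function I, motor function M)
# standing between the external world and an embedded decision subsystem B.
# All class names, signatures, and the world interface are assumptions.

class SensoryMotorSystem:
    def __init__(self, P, I, M, initial_config):
        # P: (world_state, config) -> internal_state        (perception)
        # I: (command, config)     -> new_config            (reconfiguration)
        # M: (command, config)     -> external_action/None  (overt translation)
        self.P, self.I, self.M = P, I, M
        self.config = initial_config

    def perceive(self, world_state):
        """Ground the external state in an internal representation (S_E x C -> S_I)."""
        return self.P(world_state, self.config)

    def execute(self, command):
        """Interpret an internal command in A_I: overt actions map to external
        actions; perceptual actions only reconfigure the sensors."""
        external_action = self.M(command, self.config)   # None for perceptual actions
        self.config = self.I(command, self.config)
        return external_action

class DecisionSubsystem:
    """B: sequences of (internal state, reward) -> internal commands in A_I."""
    def choose(self, internal_state, reward):
        raise NotImplementedError

def agent_step(world, sensors, decider, reward):
    # The decision subsystem sees only S_I, never S_E.
    s_internal = sensors.perceive(world.state)
    command = decider.choose(s_internal, reward)
    external_action = sensors.execute(command)
    if external_action is not None:                       # overt action changes the world
        world.apply(external_action)
```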


5.2 Perceptual Aliasing

Interposing an active sensory-motor system between the world and the decision system can lead to decision problems which cannot be learned using standard reinforcement learning techniques. The problem arises from the many-to-many mapping between states in the external world and states in the internal representation. That is, a state s_e ∈ S_E in the world may map to several internal states, depending upon the configuration of the sensory-motor system. More importantly, a single internal state, s_i ∈ S_I, may represent multiple world states. This overlapping between the world and the agent's internal representation is called perceptual aliasing [WB90b].

Perceptual aliasing can transform a problem that is Markovian into one that is not. Intuitively, perceptual aliasing interferes with reinforcement learning by allowing the decision subsystem to confound (perceive as the same) external world states that may have different utility values. For example, suppose that depending upon the configuration of the sensory-motor system, an internal state s_a can represent one of two world states, s_E1 or s_E2. In reinforcement learning, utility values are estimated by averaging the rewards accumulated over time. Assuming the system performs optimally, the utility estimate learned for the internal state s_a (denoted V_I(s_a)) will correspond to a sampled average of the utilities for s_E1 and s_E2. If both states are encountered equally often, then the utility estimate for state s_a will be approximately equal to their arithmetic mean,

    V_I(s_a) \approx [ V^*(s_{E1}) + V^*(s_{E2}) ] / 2.

If the utility values for s_E1 and s_E2 are the same, then the learned estimate for s_a will reflect the actual return the system can expect to receive whenever it finds itself in state s_a. However, when the utility values for s_E1 and s_E2 differ (say V^*(s_E1) << V^*(s_E2)), then V_I(s_a) will fail to accurately estimate the expected utility associated with the current world state. If the external world is in state s_E1 then V_I(s_a) will overestimate the expected return; if the world is in state s_E2 then V_I(s_a) will underestimate the expected return. This potential for a mismatch in estimating the utility of world states, caused by perceptual aliasing, interferes with the decision system's ability to learn the optimal policy.
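The effect described above is easy to reproduce numerically. In the hypothetical simulation below, two world states with different true returns are presented to the learner under a single internal label, and the sampled estimate drifts toward their average, overestimating one state and underestimating the other.

```python
import random

# Illustrative simulation of perceptual aliasing: world states sE1 and sE2
# (true returns 0.0 and 10.0 in this made-up example) are both perceived as
# the single internal state s_a, so the sampled utility estimate converges
# toward their average rather than toward either true value.

TRUE_RETURN = {"sE1": 0.0, "sE2": 10.0}
V_internal = {"s_a": 0.0}
count = 0

for _ in range(10000):
    world_state = random.choice(["sE1", "sE2"])   # both encountered equally often
    observed_return = TRUE_RETURN[world_state]
    count += 1
    # incremental sample average of the returns credited to internal state s_a
    V_internal["s_a"] += (observed_return - V_internal["s_a"]) / count

print(V_internal["s_a"])   # close to 5.0: too high for sE1, too low for sE2
```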
Call then the learned estimate for s, will reflect the actual this state the lion. return the system can expect to receive whenever it finds itself in state s,. However, when the utility val- 8. Choose the next oved action to execute based on ues for SE1 and SEZdiffer (say v,;:(S,yl) << vr;:(s~2)) the policy value for the lion. then V(S,)will fail to accurately estimate the expected 4. Use the lion state, along with the return received utility associated with the current world state. If the on the last time step io update the utility and external world is in state SE1 then C'(Sa) will over- action-value estimates for the previous cycle. estimate the expected return; if the world is in state The key operation in the lion algorithm is to select, SE^ then V1(sa) will underestimate the expected re- for each situation, one internal state that is believed turn. This potential for a mismatch in estimating the to be consistent (i.e., Step 2 above). This operation is utility of world states, caused by perceptual aliasing, achieved by differentiating between consistent and in- interferes with the decision system's ability to learn consistent states based on the observation that, for de- the optimal policy. terministic tasks, the variance in the utility estimates for inconsistent states will always be non-zero, whereas 5.3 The Lion Algorithm the variance for consistent states will tend to zero over Whitehead and Ballard have recently proposed a new time. For details see [WBSOb]. learning algorithm, called the lion algorithm that over- The lion algorithm has been demonstrated in a sys- comes the difficulties caused by perceptual aliasing for tem that learns to solve a simple class of block ma- deterministic tasks. The lion algorithm is based on nipulation tasks. The significance of the demonstra- the notion of consistent internal states. Intuitively, tion is not the difficulty of the task per se, but that an internal state is consistent if all the external world the system employed an active (deictic) sensory-motor states it represents are the same in the following sense: system, and learned not only how to solve the prob- 1) they all have the same optimal utility values, and 2) lem but also how to control its sensory apparatus to


The lion algorithm has been demonstrated in a system that learns to solve a simple class of block manipulation tasks. The significance of the demonstration is not the difficulty of the task per se, but that the system employed an active (deictic) sensory-motor system and learned not only how to solve the problem but also how to control its sensory apparatus to attend to the objects relevant to the task. Although the results are preliminary, the idea of learning task-dependent representations using reinforcement learning, active sensory-motor systems, and the notion of consistency is an important step towards the development of ever more autonomous learning agents.

6 Summary

In this paper we have surveyed a number of recent advances that have contributed to the viability of reinforcement learning approaches to intelligent control. These advances include the formalization of the relationship between reinforcement learning and dynamic programming, the use of internal predictive models to improve learning rate, and the integration of reinforcement learning with active perception. Based on these and other results, we conclude that control architectures based on reinforcement learning are now in a position to satisfy many of the criteria associated with intelligent control.

References

[AC87] Philip E. Agre and David Chapman. Pengi: An implementation of a theory of activity. In Proceedings AAAI-87, pages 268-272, 1987.

[Agr88] Philip E. Agre. The Dynamic Structure of Everyday Life. PhD thesis, MIT Artificial Intelligence Lab., 1988. (Tech Report No. 1085).

[BA85] A. G. Barto and P. Anandan. Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15:360-375, 1985.

[Bel57] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[Ber87] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.

[BSA83] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuron-like elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics, SMC-13(5):834-846, 1983.

[BSW90] Andrew G. Barto, Richard S. Sutton, and Chris J. C. H. Watkins. Learning and sequential decision making. In M. Gabriel and J. W. Moore, editors, Learning and Computational Neuroscience. MIT Press, Cambridge, MA, 1990. (Also COINS Tech Report 89-95, Dept. of Computer and Information Sciences, University of Massachusetts, Amherst, MA 01003).

[Cha90] David Chapman. Vision, Instruction, and Action. PhD thesis, MIT Artificial Intelligence Laboratory, 1990. (Technical Report 1204).

[DB88] Thomas Dean and Mark Boddy. An analysis of time-dependent planning. In Proceedings AAAI-88, pages 49-54, 1988.

[Fra88] Judy A. Franklin. Refinement of robot motor skills through reinforcement learning. In Proceedings of the 27th IEEE Conference on Decision and Control, Austin, TX, December 1988.

[Gre89] John J. Grefenstette. Incremental learning of control strategies with genetic algorithms. In Proceedings of the Sixth International Workshop on Machine Learning. Morgan Kaufmann, June 1989.

[Lin90] Long-Ji Lin. Self-improving reactive agents: Case studies of reinforcement learning frameworks. In Proceedings of the First International Conference on the Simulation of Adaptive Behavior, September 1990.

[Sha90] Steve Shafer. Why we can't model the physical world. (For the 25th anniversary of the CMU CS Dept.), September 1990.

[Sut84] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts at Amherst, 1984. (Also COINS Tech Report 84-02).

[Sut88] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

[Sut90a] Richard S. Sutton. First results with DYNA, an integrated architecture for learning, planning, and reacting. In Proceedings of the AAAI Spring Symposium on Planning in Uncertain, Unpredictable, or Changing Environments, 1990.

[Sut90b] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, 1990. Morgan Kaufmann.

[TS90] Ming Tan and Jeffrey C. Schlimmer. Two case studies in cost-sensitive concept acquisition. In Proceedings of AAAI-90, 1990.

[Ull84] Shimon Ullman. Visual routines. Cognition, 18:97-159, 1984. (Also in: Visual Cognition, S. Pinker, ed., 1985).

[Wat89] Chris Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.


[WB89] Steven D. Whitehead and Dana H. Ballard. A role for anticipation in reactive systems that learn. In Proceedings of the Sixth International Workshop on Machine Learning, Ithaca, NY, 1989. Morgan Kaufmann.

[WB90a] Steven D. Whitehead and Dana H. Ballard. Active perception and reinforcement learning. In Proceedings of the Seventh International Conference on Machine Learning. Morgan Kaufmann, June 1990. (Also to appear in Neural Computation).

[WB90b] Steven D. Whitehead and Dana H. Ballard. Learning to perceive and act. Technical Report TR 331 (revised), Computer Science Dept., University of Rochester, 1990. (Also submitted to Machine Learning).

[Whi89] Steven D. Whitehead. Scaling in reinforcement learning. Technical Report TR 304, Computer Science Dept., University of Rochester, 1989.

[Wil88] R. J. Williams. Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, College of Computer Science, Northeastern University, Boston, MA, 1988.

[YSUB90] Richard C. Yee, Sharad Saxena, Paul E. Utgoff, and Andrew G. Barto. Explaining temporal differences to create useful concepts for evaluating states. In Proceedings of AAAI-90, 1990.

Figure 1: Dyna architectures are organized around two basic components: an adaptive decision system (used for control) and an adaptive internal model (used to predict the input-output behavior of the world). The model is used to perform hypothetical experiments. A switch is used to modulate the decision system's interaction with the world and the model.



Figure 2: A formal model for an agent with an embedded learning subsystem and an active sensory-motor subsystem.

