
From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Towards a Learning Architecture

Joseph O'Sullivan*
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
email: [email protected]

Abstract

I summarize research toward a robot learning architecture intended to enable a robot to learn a wide range of find-and-fetch tasks. In particular, this paper summarizes recent research in the Learning Robots Laboratory at Carnegie Mellon University on aspects of robot learning, and our current work toward integrating and extending this within a single architecture. In previous work we developed systems that learn action models for robot manipulation, learn cost-effective strategies for using sensors to approach and classify objects, learn models of sonar sensors for map building, learn reactive control strategies via reinforcement learning and compilation of action models, and explore effectively. Our current efforts aim to coalesce these disjoint approaches into a single robot learning agent that learns to construct action models in a real-world environment, learns models of visual and sonar sensors for object recognition, and learns efficient reactive control strategies via reinforcement learning techniques utilizing these models.

1 Introduction

The Learning Robots Laboratory at Carnegie Mellon University focuses on combining perceptual, reasoning and learning abilities in autonomous mobile robots. Successful systems have to cope with incorrect state descriptions caused by poor sensors, converge with a limited number of examples since a robot cannot execute millions of trials, and interact with the environment both for efficient exploration towards new knowledge and exploitation of learned facts, while remaining robust.

The ultimate goal of the research undertaken in the laboratory can be stated as to achieve a robot that continuously improves its performance through learning. Such a robot must be capable of an autonomous existence during which its world knowledge is continuously refined from experience as well as from teachings.

It is all too easy to assume that a learning agent has unrealistic initial capabilities such as a prepared environment map or perfect knowledge of its actions. To ground our research, we use a Heath/Zenith Hero 2000 robot (named simply "Hero"), a commercial wheeled mobile manipulator with a two finger hand, as a testbed on which success or failure is judged.

In this paper, I describe the steps being taken to design and implement a learning robot agent. In Design Principles, an outline is given of the believed requirements for a successful agent. Thereafter, in Learning Robot Results, I summarize previous bodies of work in the laboratory, each of which investigates a subset of our convictions. Our current work drawing together these various threads into a single learning robot architecture is presented in A Cohesive Robot Architecture. In Approach and Fetch - a Demonstration, I present a prototype system which demonstrates that performance can improve through learning. Finally, I summarize the current research, pointing out limitations and the work that remains.

*This research was sponsored in part by the Avionics Lab, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, OH 45433-6543 under Contract F33616-90-C-1466, Arpa Order No. 7697. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

2 Design Principles

What is required of a learning agent? Mitchell has argued [Mitchell, 1990] that a learning agent can be understood in terms of several types of performance metrics:

• Correctness The prediction of the effects of its actions in the world must become increasingly better for completing a task.

• Perceptiveness Increasingly relevant features which impact its success must be constructed.

• Reactivity The time required to choose "correct" actions must become increasingly quick.

• Optimality The sequence of actions chosen must be increasingly efficient for task completion.

In addition, an agent must be effective, that is, have appropriate sensors and actions to be capable of performing tasks.

3 Learning Robot Results

Individual systems have been developed that explore various facets of the creation of a learning robot agent.

Learning, planning, even simply reacting to events, is difficult without reliable knowledge of the outcomes of actions. The usual human-programming method of defining "what actions do" is both inefficient and often ineffectual. Christiansen analysed conditions under which robotic systems can learn action models allowing for automated planning and successful execution of strategies [Christiansen, 1992]. He developed systems that generated action models consisting of sets of funnels, where each funnel mapped a region of task action space to a reduced region of the state space. He demonstrated that such funnels can be acquired for continuous tasks using negligible prior knowledge, and that a simple planner is sufficient for then generating plans, provided the learning mechanism is robust to noise and non-determinism and the planner is capable of reasoning about the reliabilities associated with each action model.

Once action models have been discovered, sensing to decide which action to take can have varying costs. The time it takes a physical sensor to obtain information varies widely from sensor to sensor. Hero's camera, a passive device, is an order of magnitude faster than active sensing using a wrist mounted sonar. Yet sonar information is more appropriate than vision when ambiguity exists about the distance to an object. This led Tan to investigate learning cost-effective strategies for using sensors to approach and classify objects [Tan, 1992]. He developed a cost-sensitive learning system called CSL for Hero which, given a set of unknown objects and models of both sensors and actions, learns where to sense, which sensor to use, and which action to apply.
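CSL itself is not reproduced here; the following is only a minimal sketch of the trade-off it manages, in which each candidate sensor is scored by the uncertainty it is expected to remove about an object's identity divided by the time it costs to use. The sensor names, relative costs and confusion models below are illustrative assumptions rather than measurements from Hero.

```python
import math

# Minimal sketch (not Tan's CSL itself): pick the sensing action whose expected
# reduction in class uncertainty, per unit of sensing cost, is largest.
# Sensors, costs and confusion models below are illustrative assumptions.

SENSOR_COST = {"camera": 0.1, "wrist_sonar": 1.0}   # relative time costs

# p(reading | object class) for each sensor: toy models of discrimination power.
SENSOR_MODEL = {
    "camera":      {"cup": {"tall": 0.7, "flat": 0.3}, "book": {"tall": 0.2, "flat": 0.8}},
    "wrist_sonar": {"cup": {"near": 0.9, "far": 0.1}, "book": {"near": 0.5, "far": 0.5}},
}

def entropy(belief):
    return -sum(p * math.log(p, 2) for p in belief.values() if p > 0)

def expected_posterior_entropy(belief, sensor):
    """Average entropy of the Bayes-updated belief over all possible readings."""
    model = SENSOR_MODEL[sensor]
    readings = next(iter(model.values())).keys()
    expected = 0.0
    for r in readings:
        p_r = sum(belief[c] * model[c][r] for c in belief)    # p(reading)
        if p_r == 0:
            continue
        posterior = {c: belief[c] * model[c][r] / p_r for c in belief}
        expected += p_r * entropy(posterior)
    return expected

def choose_sensor(belief):
    """Cost-sensitive choice: expected information gain divided by sensing cost."""
    def utility(sensor):
        gain = entropy(belief) - expected_posterior_entropy(belief, sensor)
        return gain / SENSOR_COST[sensor]
    return max(SENSOR_MODEL, key=utility)

if __name__ == "__main__":
    belief = {"cup": 0.5, "book": 0.5}        # uncertain which object lies ahead
    print("sense with:", choose_sensor(belief))
```

With these toy numbers the fast camera wins despite the sonar's better range discrimination, illustrating why sensing cost has to enter the decision at all.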
Learning to model sensors involves capturing knowledge independent of any particular environment that a robot might face, while learning the typical environments in which the robot is known to operate. Thrun investigated learning such models by combining artificial neural networks and local, instance-based learning techniques [Thrun, 1993]. He demonstrated that learning these models provides an efficient means of knowledge transfer from previously explored environments to new environments.

A robot acting in a real world situation must respond quickly to changes in its environment. Two disparate approaches have been investigated. Blythe & Mitchell developed an autonomous robot agent that initially constructed explicit plans to solve problems in its domain, using prior knowledge of action preconditions and postconditions [Blythe and Mitchell, 1989]. This "Theo-agent" converges to a reactive control strategy by compiling previous plans into stimulus-response rules using explanation based learning [Mitchell, 1990]. The agent can then respond directly to features in the environment with an appropriate action by querying this rule set.

Conversely, Lin applied artificial neural network based reinforcement learning techniques to create reactive control strategies without any prior knowledge of the effects of robot actions [Lin, 1993]. The agent receives from its environment a scalar performance feedback, constructed so that maximum reward occurs when the task is completed successfully and typically some form of punishment is presented when the agent fails. The agent must then maximize the cumulative reinforcements, which corresponds to developing successful strategies for the task. By using artificial neural networks, Lin demonstrated that the agent was able to generalize to unforeseen events and to survive in moderately complex dynamic environments. However, although reinforcement learning was more successful than action compilation at self-improvement in a real-world domain, convergence of learning was typically longer and the plans produced at early stages of learning were dangerous to the robot.

Another issue in robot learning is the question of when to exploit current plans and when to explore further in the hope of discovering hidden shortcuts. Thrun evaluated the impact of exploration knowledge in tabula rasa environments, where no a priori knowledge, such as action models, is provided, demonstrating the superiority of one particular directed exploration rule, counter-based exploration [Thrun, 1992].
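Lin's agent used artificial neural networks over a large perception space and Thrun's exploration experiments used his own simulated worlds; neither is reproduced here. The sketch below shows only the two underlying ideas on a toy one-dimensional world: a scalar reinforcement that is maximal at task completion and is accumulated (with discounting) by Q-learning, and a simplified counter-based bonus that directs exploration toward rarely tried actions. The states, rewards and constants are assumptions chosen for illustration.

```python
import random

# Illustrative sketch only: tabular Q-learning with a simplified counter-based
# exploration bonus on a tiny 1-D world. Constants below are assumptions.

N_STATES = 6             # agent starts at state 0; the task is completed at state 5
ACTIONS = (-1, +1)       # move left / move right
GAMMA, ALPHA, BONUS = 0.9, 0.5, 1.0

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
visits = {(s, a): 0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment returns the next state and a scalar reinforcement."""
    nxt = max(0, min(N_STATES - 1, state + action))
    if nxt == N_STATES - 1:
        return nxt, 1.0, True          # maximum reward on task completion
    return nxt, -0.01, False           # small cost per step otherwise

def choose(state):
    # Counter-based exploration: an action's score is boosted while it has been
    # tried rarely from this state, so unexplored alternatives get examined first.
    return max(ACTIONS, key=lambda a: Q[(state, a)] + BONUS / (1 + visits[(state, a)]))

for episode in range(200):
    state, done = 0, False
    while not done:
        action = choose(state)
        visits[(state, action)] += 1
        nxt, reward, done = step(state, action)
        # One-step Q-learning backup toward reward plus discounted best future value,
        # i.e. toward maximizing the cumulative reinforcement.
        best_next = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# Greedy action in each non-terminal state after training (all +1 once converged).
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```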

4 A Cohesive Robot Architecture

Following the individual lessons learned from each of these approaches, it remains to scale up and coalesce each of these systems into a single robot architecture. In addition, whereas previously the laboratory robot relied upon sonar for perception, we require an effective robot agent to include a visual sensor, both for speed of sensing (reaction times approaching half a second are possible) and for an increase in discrimination capabilities. An architecture is created based upon the following components (see figure 1):

[Figure 1: Schema of Robot Architecture. Within the learning agent, a Sensor Model Learner, an Action Model Learner and a Task Completion Recognizer feed a Rational Planner and a Reinforcement Learner, whose outputs are arbitrated by a Selection Mechanism. The bold lines have been implemented in the prototype "approach and fetch" system.]

• An action model learning mechanism which compiles domain independent knowledge relating to the expected effect of each action on the sensation of the world.

• A sensor model learning mechanism which develops awareness of domain independent characteristics of each sensor.

• A reinforcement learning mechanism which performs the action compilation and strategy generation utilizing action models, sensor models, raw perception and a notion of history.

• A deliberative planning mechanism which utilizes action models, sensor models, raw perception and a notion of history to plan effective strategies.

• A selection mechanism to arbitrate between the reactive and rational strategies. Since the reactive strategy will respond faster, the function of this mechanism will be to act on the reactive strategy unless uncertainty exists, in which case more costly planning using the deliberative component will be required. (A minimal sketch of such an arbitration loop is given at the end of this section.)

• A task completion recognizer which functions as a reward provider to the strategy mechanisms.

Both sensor and action models rely on the same form of input, namely an internal history of previous sensations and actions over which learning occurs. The rational planner can actively interrogate these models to predict the effects of various action sequences, whereas the reinforcement learner only operates upon the immediately expected sensations. The task completion recognizer can be visualized as a type of action model that predicts when the completion strategy can be executed blindly.
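The selection mechanism described above can be pictured as a small arbitration loop: the cheap reactive strategy is always consulted first, and the costly deliberative planner is invoked only when the reactive recommendation is uncertain. The sketch below is illustrative only; the confidence estimate, threshold and stand-in policies are assumptions, not components of the implemented system.

```python
import random

# Minimal sketch of the arbitration idea, under assumed stand-in policies.

UNCERTAINTY_THRESHOLD = 0.2          # assumed tolerance for reactive uncertainty

def reactive_strategy(perception):
    """Stand-in for the compiled reactive policy: returns (action, confidence)."""
    action = "rotate_left" if perception["object_bearing"] < 0 else "rotate_right"
    confidence = 1.0 - perception["sensor_noise"]      # assumed confidence estimate
    return action, confidence

def deliberative_planner(perception, action_models):
    """Stand-in for the rational planner: interrogates action models for the best step."""
    return max(action_models, key=lambda a: action_models[a](perception))

def select_action(perception, action_models):
    action, confidence = reactive_strategy(perception)     # always cheap to evaluate
    if confidence >= 1.0 - UNCERTAINTY_THRESHOLD:
        return action                                      # act reactively
    return deliberative_planner(perception, action_models) # plan when uncertain

if __name__ == "__main__":
    # Toy action models: predicted progress toward the goal for each action.
    models = {
        "rotate_left":  lambda p: -p["object_bearing"],
        "rotate_right": lambda p: p["object_bearing"],
        "translate":    lambda p: 1.0 - abs(p["object_bearing"]),
    }
    perception = {"object_bearing": 0.1, "sensor_noise": random.uniform(0.0, 0.5)}
    print(select_action(perception, models))
```

The design choice reflected here is simply that reactivity is the default and deliberation the exception, so average response time stays close to that of the compiled strategy.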

5 Approach and Fetch - a Demonstration

A prototypical system has been developed that encompasses a subset of the architectural requirements. "Hero" has been enhanced with a fixed head mounted wide angle monochrome camera, while still retaining its head and base sonar as described previously [Lin et al., 1989]. No control exists over the type of sensing action taken, and so the cost-sensitive learning approach has been sidestepped at the drawback of more expensive learning strategies. In addition, no deliberative planning component has been included in the system. Since this restriction severely affects initial exploration and thus time to convergence, our promise of autonomous operation has been relaxed slightly, with a teacher being allowed to provide such examples as could be garnered from a deliberative planner.

In order to reduce the dimensionality of the input space, a foveated vision preprocessor filters each picture from the camera through a segmented retina which has high information content near the center of the field of view, smoothly decaying to coarse information at the edges of the scene. This has the double advantage of allowing the learning agent to learn in a lower dimensional space (approximately 300 multi-valued inputs, vision and sonar combined, are now used to represent each point in the perception space) and of speeding up the execution time of the learning algorithm by an order of magnitude, an important factor in a real-world system.
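The laboratory's actual foveated preprocessor is not reproduced here, but its effect can be sketched: an image is summarized by averaged cells that are small near the center of the field of view and grow toward the periphery, so a frame collapses to a few tens or hundreds of values. The ring layout, cell counts and image size below are assumptions chosen only to illustrate the idea.

```python
import numpy as np

# Illustrative sketch of a segmented-retina preprocessor (not the lab's filter):
# fine block averages at the fovea, coarser border cells in each surrounding ring.

def foveate(image, fovea=16, levels=3):
    """Return a 1-D feature vector: fine cells at the centre, coarser rings outside."""
    h, w = image.shape
    cy, cx = h // 2, w // 2
    features = []
    half = fovea // 2
    # Fovea: a 4x4 grid of small averaged cells over the central window.
    centre = image[cy - half:cy + half, cx - half:cx + half]
    for rows in np.array_split(centre, 4, axis=0):
        for cell in np.array_split(rows, 4, axis=1):
            features.append(cell.mean())
    # Periphery: successively larger rings, each summarized by the 8 border cells
    # of a 3x3 split, so resolution decays smoothly toward the edges.
    size = fovea
    for _ in range(levels):
        size *= 2
        half = min(size // 2, cy, cx)
        ring = image[cy - half:cy + half, cx - half:cx + half]
        cells = [c for i, r in enumerate(np.array_split(ring, 3, axis=0))
                   for j, c in enumerate(np.array_split(r, 3, axis=1))
                   if not (i == 1 and j == 1)]          # drop the already-covered centre
        features.extend(c.mean() for c in cells)
    return np.array(features)

if __name__ == "__main__":
    frame = np.random.rand(128, 128)          # stand-in for a camera image
    print(foveate(frame).shape)               # compact vector: 16 + 3*8 = 40 values
```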

As an initial goal, the robot agent is required to perform a simple find and fetch task. A cup is placed on the floor in the laboratory with the open end facing upwards. The robot must then locate and move towards the cup and execute a blind grasp that is successful at picking up cups placed within a narrow region.

A control policy that performs a similar task has been shown to be learnable using just a sonar sweep for perceptual input and limited discrete actions [Tan, 1992]. The same research also showed this to be a difficult task to learn, principally due to the narrow reward region. Exploration costs to find the narrow reward band are prohibitive in our domain, and thus a teacher is introduced which provides to the agent a limited number of task completion sequences from several different initial states.

An artificial neural network is used to teach the agent when the goal (task completion) has been reached. Prior to autonomous operation, the object to be approached is placed in front of the robot in a number of varied positions from which grasping would be successful. It is also placed in a number of "near miss" positions, and this total set of instances is used to train the task completion detector. Note that since the agent is mobile, a perfect model of the object is not required in all positions. All that is necessary is that a subset of possible task completion states are successfully recognizable. The task of the learning scheme then becomes to develop an active sensing strategy that will correctly enter this subset.

[Figure 2: Object Generalization. Generalization curves (vision only versus vision and sonar) for the task completion recognizer, averaged from three trainings and plotted against training epochs. At peak generalization the models learned are 85% successful. The graph was produced by evaluating each model on a hold-out set of objects in novel positions. In one series the task completion learner used just visual data as input; in the other series the visual data is combined with sonar data in the hope of synergistic sensor fusion. Note that when the less reliable sonar sensor was added to the vision data, generalization performance decreased. Such results were typical across various network configurations.]
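The training protocol for the task completion recognizer can be sketched as follows: perceptions from graspable and "near miss" placements are hand labelled, a hold-out set is reserved for cross-validation, and a classifier is trained to signal completion. In the sketch below a single logistic unit and two synthetic features stand in for the network and the foveated perception vector; both are assumptions for illustration, with the 300/50 train and hold-out split taken from the prototype described in the text.

```python
import numpy as np

# Sketch of the task-completion-recognizer training protocol, with made-up data.

rng = np.random.default_rng(0)

def synthetic_perception(graspable, n):
    """Toy 2-D stand-in for foveated features: bearing and range offsets to the cup."""
    centre = np.array([0.0, 0.0]) if graspable else np.array([1.5, 1.0])
    return centre + 0.4 * rng.standard_normal((n, 2))

# 300 hand-labelled training instances and a 50-example hold-out set, as in the prototype.
X = np.vstack([synthetic_perception(True, 150), synthetic_perception(False, 150)])
y = np.concatenate([np.ones(150), np.zeros(150)])
Xh = np.vstack([synthetic_perception(True, 25), synthetic_perception(False, 25)])
yh = np.concatenate([np.ones(25), np.zeros(25)])

w, b = np.zeros(2), 0.0
for _ in range(2000):                       # simple gradient descent on log-loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

# Hold-out accuracy is the quantity plotted against training epochs in figure 2.
holdout_acc = (((1.0 / (1.0 + np.exp(-(Xh @ w + b)))) > 0.5) == yh).mean()
print(f"hold-out accuracy: {holdout_acc:.2f}")
```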
Due to the high dimensionality of the world feature space, generality is essential in the reinforcement learning mechanism. Even though the task has a two-dimensional substructure, 256^300 possible states exist, straining both the representational powers of traditional learning schemes and the practicality of waiting for these states to be explored. An adaptation of Lin's connectionist reinforcement learning strategy [Lin, 1993] is being used to solve this problem. By providing the system with lessons of successful strategies from the start, the agent can be expected to act in a purposeful manner.

The prototype system has been trained in a supervised fashion by placing a "fast-food" cup in various positions in the robot's field of view. Each of these perceptions was then hand classified to specify whether or not that particular position denoted task completion, namely the cup being centered in the field of view. In total, 300 different positions were used for training and a further hold-out test set of 50 examples was used for cross-validation (see figure 2 for results).

This reward provider is currently being utilized in the reinforcement system of our learning agent. The agent is expected to learn, in a realistic time frame, this approach and fetch task. This will demonstrate that autonomous systems may be developed which improve performance by learning and generalization in a real world robotic domain. Current preliminary indications, when the actions of the robot are restricted to just rotations so as to allow for ease of autonomous operation, indicate that our hopes for the more complex system may be justified.

6 Summary, Limitations and Future Work

A cohesive robot architecture has been presented, the goal of which is to illustrate that continuously improving performance through learning is possible. A prototype system is being developed which will:

• create task-dependent feature representations for use in learning and strategy matching. The features produced are represented in the hidden units of the reinforcement learning mechanism and are directly related to the reward as provided by the task description.

• discover robust active strategies for completion of the task despite errors and uncertainty invalidating portions of the policy.

• represent the reinforcement function with a fixed computation cost, leading to a constant response time. Quantitative statements can therefore be made about the reactivity of the system.

• most importantly, learn to produce increasingly correct decisions without assumptions as to the correctness of previous strategies (although having nearly correct strategies speeds up convergence time).

There are a number of extensions to the prototype system that remain to be incorporated.

• The agent suffers from perceptual incompleteness [Chrisman et al., 1991]. Consider a case where two similar objects are in the robot's field of view. One object is examined and found not to be the object of the find and fetch task. Then, as attention is switched to the other object, the reactive control strategy will forget that the first object was visited, because no state information is stored, and so will demand that it be re-examined. Lin has addressed these problems using memory-based reactive learners [Lin and Mitchell, 1992], but these have not yet been added to the basic architecture.

• Neither action nor sensor models are currently being utilized in the prototype system, yet they offer assistance with tabula rasa learning. Currently the reinforcement learner encompasses the learning of environment specific action and sensor models along with the learning of control strategies. This approach is initially being taken due to difficulties in scaling up these models to high dimensions. Traditional action models attempt to anticipate the effect on a sensor of performing an action, for example, to predict the expected sonar readings following a translation. Learning to perform an equivalent prediction with vision could require knowledge about such things as lighting, perspective and lens properties. A proposed solution is to perform predictions only for a reduced set of directly relevant features. For a sonar sensor and our task, suitable features would be the distance and angle to the object. Techniques utilizing explanation based neural networks are being used to address this problem [Mitchell and Thrun, 1993].

• When weak action and sensor models are present, but before a complete reactive strategy has been produced, deliberative planning is appropriate. The current incarnation of the learning agent utilizes a teacher to produce approximate strategies during early stages of learning.

Beyond this future work there are a number of limitations not addressed by the architecture in its present form.

• When weak action models are present, the architecture must explicitly learn these models. The embedding of knowledge in the system without learning has not been explored. It would require a mechanism for arbitrating between a weak model and a "more accurate" learned model. Since such learning may be performed quite well using supervised techniques, as with the task completion recognizer, it was felt that maintaining such separate competing model representations was unnecessary.

• Extending the framework to maintaining a number of concurrent goals may be approximated either by having multiple competing learning agents or by extending the task completion recognizer to control the sequence in which the goals may be satisfied. Both these extensions require capabilities to analyse strategies so that conflicting policies may be correctly blended.

• Collaboration with a human operator for purposes of task specification and demonstration has been poorly defined. The prototype system allows this to occur by using "Hero" in a tele-operated fashion. Alternative, more natural, methods would include demonstrating the task to the agent [Ikeuchi and Suehiro, 1991], or verbally dictating the task requirements.

• Explanation of the reactive strategies which the agent has learned is not a simple task. It is often crucial for plans to be accountable, that is, for a reason to be present for each control decision. Since the representation of the control strategy is internal to the reinforcement learning algorithm and in terms of task dependent features, relating these features back to the real world and the actions taken is a real issue. This is especially important when it is wished to apply the compiled reactive strategies to novel task domains.

Acknowledgements

I thank the various members of the Learning Robots Lab, both past and present, whose ground work has allowed the ideas presented here to be developed. I'm specifically grateful to Tom Mitchell for his advice on this work, to Rich Caruana and Justin Boyan for being part of the development of the prototype system and, along with Sven Koenig, for commenting on an earlier draft.

References

[Blythe and Mitchell, 1989] Jim Blythe and Tom M. Mitchell. On becoming reactive. In Proceedings of the Sixth International Machine Learning Workshop, pages 255-259. Morgan Kaufmann, June 1989.

[Chrisman et al., 1991] Lonnie Chrisman, Rich Caruana, and Wayne Carriker. Intelligent agent design issues: internal agent state and incomplete perception. In Proceedings of the AAAI Fall Symposium on Sensory Aspects of Robotic Intelligence. AAAI Press/MIT Press, 1991.

[Christiansen, 1992] Alan D. Christiansen. Automatic Acquisition of Task Theories for Robotic Manipulation. PhD thesis, Carnegie Mellon University, March 1992.

[Ikeuchi and Suehiro, 1991] Katsushi Ikeuchi and Takashi Suehiro. Towards an assembly plan from observation. Technical Report CMU-CS-91-167, School of Computer Science, Carnegie Mellon University, 1991.

[Lin and Mitchell, 1992] Long-Ji Lin and Tom Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Technical Report CMU-CS-92-138, School of Computer Science, Carnegie Mellon University, 1992.

[Lin et al., 1989] Long-Ji Lin, Tom M. Mitchell, Andrew Philips, and Reid Simmons. A case study in autonomous robot behavior. Technical Report CMU-RI-89-1, Robotics Institute, Carnegie Mellon University, 1989.

[Lin, 1993] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, January 1993.

[Mitchell and Thrun, 1993] Tom M. Mitchell and Sebastian B. Thrun. Explanation-based neural network learning for robot control. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, December 1993.

[Mitchell, 1990] Tom M. Mitchell. Becoming increasingly reactive. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 1051-1059. AAAI Press/MIT Press, 1990.

[Tan, 1992] Ming Tan. Cost-Sensitive Robot Learning. PhD thesis, Carnegie Mellon University, 1992.

[Thrun, 1992] Sebastian B. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, School of Computer Science, Carnegie Mellon University, 1992.

[Thrun, 1993] Sebastian B. Thrun. Exploration and model building in mobile robot domains. In Proceedings of the IEEE International Conference on Neural Networks. IEEE Press, March 1993.
