
A system architecture for learning robots

Adam Leigh Haber

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy, October 2014.

School of Computer Science and Engineering
Faculty of Engineering

THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet

Surname or Family name: Haber

First name: Adam Other name/s: Leigh

Abbreviation for degree as given in the University calendar: PhD

School: Computer Science & Engineering (CSE) Faculty: Engineering

Title: A system architecture for learning robots

Abstract (350 words maximum):

Control systems for intelligent, autonomous robots require the synthesis of a varied set of components in a highly constrained manner. Individual algorithmic solutions for many necessary components are now firmly in place. However, the mechanisms and dynamics of integration into complete systems have received less attention. Robot architectures provide principled frameworks to guide this process. This thesis presents a novel robot architecture, called MALA, constructed specifically to incorporate the diversity of specialised components necessary for autonomous robots, and integrate them with planning systems necessary for flexible reasoning. The extent to which robots can function autonomously over long timescales is dependent on their ability to learn from experience. Therefore, MALA is also focused on embedding learning into all aspects of robot software.

Previous research has typically focused on robot learning in an isolated setting, rather than interleaved with extended autonomous operation. Integrated approaches such as MALA enable robot learning to be analysed in this embedded context. To demonstrate this, MALA is used to control a simulated rescue robot that explores a disaster site, constructs plans to achieve its goals, and simultaneously learns to drive under difficult conditions. It is shown that the learning components enable the robot’s performance to consistently improve.

There is significant potential to improve robot performance by enabling learning processes operating at different levels of abstraction to influence and guide each other, which is referred to as hybrid learning. A particularly novel aspect of MALA is the emphasis placed on hybrid learning in its design. Using MALA, a hybrid learning system capable of simultaneously acquiring task- and control-level knowledge from demonstrations is introduced. A robot initially acquires relational models of novel actions through Explanation-Based Learning (EBL), and control-level knowledge of how to perform these actions using Dynamic Motion Primitives (DMPs). It then generalises these action models using an active learning algorithm based on Inductive Logic Programming (ILP). The system is shown to be capable of learning complex manipulation actions involving both dynamic and logical characteristics, such as hitting with a hammer, and using these actions to make plans to achieve difficult goals.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature ……………………    Witness ……………………    Date ……………………

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.



Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signed Date


COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

Signed ……………………………………………......

Date ……………………………………………......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………......

Date ……………………………………………......

INCLUSION OF PUBLICATIONS STATEMENT

UNSW is supportive of candidates publishing their research results during their candidature as detailed in the UNSW Thesis Examination Procedure.

Publications can be used in their thesis in lieu of a Chapter if:
• The student contributed greater than 50% of the content in the publication and is the “primary author”, i.e. the student was responsible primarily for the planning, execution and preparation of the work for publication
• The student has approval to include the publication in their thesis in lieu of a Chapter from their supervisor and Postgraduate Coordinator.
• The publication is not subject to any obligations or contractual agreements with a third party that would constrain its inclusion in the thesis

Please indicate whether this thesis contains published material or not:
☐ This thesis contains no publications, either published or submitted for publication
☒ Some of the work described in this thesis has been published and it has been documented in the relevant Chapters with acknowledgement
☐ This thesis has publications (either published or submitted for publication) incorporated into it in lieu of a chapter and the details are presented below

CANDIDATE’S DECLARATION
I declare that:
• I have complied with the Thesis Examination Procedure
• where I have used a publication in lieu of a Chapter, the listed publication(s) below meet(s) the requirements to be included in the thesis.
Name: Adam Leigh Haber    Signature:    Date (dd/mm/yy): 24/09/2019

Postgraduate Coordinator’s Declaration (to be filled in where publications are used in lieu of Chapters)
I declare that:
• the information below is accurate
• where listed publication(s) have been used in lieu of Chapter(s), their use complies with the Thesis Examination Procedure
• the minimum requirements for the format of the thesis have been met.
PGC’s Name:    PGC’s Signature:    Date (dd/mm/yy):


Abstract

Control systems for intelligent, autonomous robots require the synthesis of a diverse set of components in a highly constrained manner. Algorithmic solutions for many necessary individual modules are now firmly in place. However, the mechanisms and dynamics of integration into complete systems have received less attention. Robot architectures provide principled frameworks to guide this process. This thesis presents a novel robot architecture, called MALA, constructed specifically to incorporate the diversity of specialised components necessary for autonomous robots, and integrate them with planning systems necessary for flexible reasoning. The extent to which robots can function autonomously over long timescales is dependent on their ability to learn from experience. Therefore, MALA is also focused on embedding learning into all aspects of robot software.

Previous research has typically focused on robot learning in an isolated setting, rather than interleaved with extended autonomous operation. Integrated approaches such as MALA enable robot learning to be analysed in this embedded context. To demonstrate this, MALA is used to control a simulated rescue robot that explores a disaster site, constructs plans to achieve its goals, and simultaneously learns to drive under difficult conditions. It is shown that the learning components enable the robot’s performance to consistently improve.

There is significant potential to improve robot performance by enabling learning processes operating at different levels of abstraction to influence and guide each other, which is referred to as hybrid learning. A particularly novel aspect of MALA is the emphasis placed on hybrid learning in its design. Using MALA, a hybrid learning system capable of simultaneously acquiring task- and control-level knowledge from demonstrations is introduced. A robot initially acquires relational models of novel actions through Explanation-Based Learning (EBL), and control-level knowledge of how to perform these actions using Dynamic Motion Primitives (DMPs). It then generalises these action models using an active learning algorithm based on Inductive Logic Programming (ILP). The system is shown to be capable of learning complex manipulation actions involving both dynamic and logical characteristics, such as hitting with a hammer, and using these actions to make plans to achieve difficult goals.

Acknowledgements

This research has been supported financially through an Australian Postgraduate Award (APA), Engineering Supplemental Award, and a UNSW Research Excellence Award. The support of the Australian and NSW governments is gratefully acknowledged. I am deeply grateful to those people who have been part of my life while I completed the research in this thesis, most of whom hopefully already know how thankful I am. If there is anyone whom I forget to mention, thank you also. First to my sisters, Anna and Els, thanks for listening to a lot of complaining about research. To my parents, Paul and Michelle, thank you for tireless proofreading of chapters and papers and heaps of coffee. Thank you Laura for love and support. Thank you Jonathan B, you provided much wisdom and sanity. To the other students and staff who made the Level 3 lab a decent place to work for the past four years: Adrian, Bradford, Jenny, Mark, Mike G, Nawid, Tim C, Tim W. To those on other levels who occasionally wandered down and offered discussion and insight: Abdallah, Bernhard, Dave, Mike B, and Michael T. To Matthew McGill, for keeping nocturnal hours and helping me above and beyond the call of duty, it probably would have taken me another year without your help. Finally, I am grateful to Claude, my supervisor, for never not telling me when I was wrong, and always coming through in a pinch.

Contents

List of Figures xiii

List of Tables xvi

List of Algorithms xvii

List of Abbreviations xviii

1 Introduction 1
1.1 Intelligent Robots in Antiquity ...... 1
1.2 Motivations of Intelligent Robotics Research ...... 3
1.2.1 State of the Field ...... 5
1.3 Principles of Robot Architecture Design ...... 6
1.3.1 Specialised Modular Components ...... 7
1.3.2 Integration of Symbolic and Sub-symbolic Processing ...... 11
1.3.3 Incremental Learning ...... 13
1.3.4 Hybrid Learning ...... 13
1.4 Overview of Multi Agent Learning Architecture (MALA) ...... 15
1.5 Experimental Domain ...... 18
1.5.1 Rescue ...... 18
1.5.2 Complex Manipulation and Tool-use ...... 19
1.5.3 Simulation Environment: jmeSim ...... 20
1.6 Principal Contributions ...... 22
1.7 Overview ...... 23

2 Background 24
2.1 Chapter Overview ...... 24
2.2 Cognitive Architectures ...... 25

2.2.1 Functions and Capabilities of Cognitive Architectures ...... 25
2.3 Notable Architectures ...... 27
2.3.1 Soar ...... 28
2.3.2 ACT-R ...... 29
2.3.3 ICARUS ...... 30
2.3.4 Global Workspace ...... 31
2.3.5 Open Issues in Cognitive Architectures ...... 33
2.4 Multi-agent Cognitive Systems ...... 34
2.4.1 Blackboard Architectures ...... 35
2.4.2 Society of Mind ...... 38
2.4.3 Real-time Control System (RCS) ...... 39
2.5 Robot Software Architectures ...... 41
2.5.1 Shakey and STRIPS ...... 41
2.5.2 Reactive Robots ...... 42
2.5.3 Hybrid Architectures ...... 43
2.5.4 Component-Based Architectures ...... 43
2.6 Robot Learning ...... 47
2.6.1 Active Learning ...... 49
2.6.2 Learning from Demonstration ...... 54
2.6.3 Relational Learning ...... 58
2.7 Hybrid Learning Architectures for Robots ...... 61
2.7.1 Hybrid Learning in MALA ...... 63
2.7.2 Related Hybrid Learning Architectures ...... 64
2.8 Autonomous Robots for Rescue ...... 68
2.8.1 The DARPA Robotics Challenge (2012-15) ...... 69
2.8.2 Need For Increased Autonomy ...... 71

3 Multi Agent Learning Architecture (MALA) 73
3.1 Chapter Overview ...... 73
3.2 Robots In Unstructured Environments ...... 74
3.3 MALA: Agents and Blackboards ...... 77
3.3.1 Architectural Components ...... 77
3.3.2 Communication via FRS ...... 79
3.3.3 Agent Interactions ...... 83
3.4 Software Infrastructure ...... 83
3.4.1 The MALA Agent ...... 84
3.5 Comparison of MALA to Other Architectures ...... 86

3.6 Domain Specific Configuration: Rescue Robot ...... 87
3.6.1 Robot and Sensors ...... 87
3.6.2 Agents for Rescue ...... 88
3.6.3 Frames for Rescue ...... 95
3.6.4 System Operation ...... 98
3.7 Experimental Demonstration and Evaluation ...... 101
3.8 Chapter Summary ...... 105

4 Control-level Learning in MALA 107
4.1 Chapter Overview ...... 107
4.2 Learning in a Robot for Urban Search and Rescue (USAR) ...... 108
4.3 System Architecture ...... 110
4.3.1 Agents For Rescue ...... 111
4.3.2 Perception ...... 112
4.3.3 Goal Generation ...... 114
4.3.4 Planning and Execution ...... 116
4.3.5 Motion Planning and Control ...... 119
4.4 Learning ...... 120
4.4.1 Learning to Drive Over Difficult Terrain: The Terrain Driver Agent ...... 120
4.4.2 Learning a Concept of Traversability: The Terrain Recogniser Agent ...... 127
4.5 Experimental Setting: Urban Search and Rescue (USAR) ...... 131
4.6 Results and Discussion ...... 136
4.6.1 Experiment 1: Learning to Traverse the Rescue Environment ...... 136
4.6.2 Experiment 2: Learning to Cope with Extreme Terrain ...... 140
4.6.3 Conclusions ...... 144

5 Integrated Task and Control-level Learning from Demonstration 146
5.1 Chapter Overview ...... 146
5.2 Introduction ...... 147
5.2.1 Learning from Demonstration ...... 147
5.3 Theory ...... 149
5.3.1 Hybrid Action Models ...... 150
5.3.2 Control-level Learning ...... 151
5.3.3 Task-level Learning ...... 153
5.4 Application Domain: Manipulation and Tool-use ...... 154

5.5 Hybrid Learning Implementation ...... 154
5.5.1 Perception and State Representation ...... 156
5.5.2 Constraint Solving ...... 157
5.5.3 Level-coupling for Robot Manipulation ...... 162
5.5.4 Execution ...... 166
5.5.5 Learning ...... 167
5.6 Experiments ...... 179
5.6.1 Experimental Domain ...... 179
5.6.2 Experiment 1: Learning New Actions ...... 181
5.6.3 Experiment 2: Planning Using Learned Actions ...... 191
5.6.4 Conclusion ...... 195

6 Discussion and Conclusion 196
6.1 Lessons and Limitations ...... 198
6.1.1 Evaluation of Robot Architectures Remains a Challenge ...... 198
6.1.2 Symbolic AI and Knowledge Engineering for Robots ...... 199
6.2 Future Work ...... 199
6.2.1 Richer Control-level Representation ...... 199
6.2.2 Extended Hybrid Learning ...... 200
6.2.3 Learning From Unstructured Demonstrations ...... 202
6.2.4 Improving Active Learning ...... 203
6.2.5 Real Robot Implementation ...... 204
6.2.6 Robot Competitions ...... 205
6.2.7 Control of Multiple Robots ...... 205
6.2.8 Autonomous Tool Building ...... 206

References 207

Glossary 237

Appendix A The jmeSim Environment 239
A.1 Environment Generation ...... 240
A.2 Robot Model and Control ...... 240
A.2.1 Sensor Simulation ...... 241
A.3 Integration with ROS ...... 243
A.4 Running jmeSim ...... 244

Appendix B MALA Implementation 246
B.1 User Guide ...... 246
B.1.1 Installation and Dependencies ...... 246
B.1.2 Running MALA ...... 247
B.2 Agent Implementation ...... 248

Appendix C Supplementary Analysis and Results 252
C.1 Inverse Dynamics Controller ...... 252
C.2 Action Models Induced by Explanation-Based Learning (EBL) ...... 255

List of Figures

1.1 Ancient precursors of intelligent robots ...... 2
1.2 Controversial applications of autonomous robots ...... 4
1.3 Information flow within Multi Agent Learning Architecture (MALA) ...... 16
1.4 The Emu autonomous rescue robot ...... 19
1.5 Simulated Emu robot with JACO arm mounted at the front ...... 20

2.1 Summary of capabilities of both cognitive and robot architectures surveyed in Chapter 2 based on the principles outlined in Section 1.3 ...... 46
2.2 Conceptual view of a layered hybrid learning architecture showing relationships between previous work in hybrid robot learning ...... 64
2.3 Hybrid learning architecture implemented in Chapter 5 ...... 68
2.4 The winner of the DARPA Robotics Challenge trial phase competition, the S-One robot from team SCHAFT, in 2013 ...... 71

3.1 FRS definition of a generic frame ...... 80
3.2 Information flow within MALA ...... 82
3.3 Simulation of the ‘Emu’ autonomous rescue robot ...... 89
3.4 Rescue robot simulation produced using jmeSim (Haber et al., 2012) ...... 90
3.5 Abridged inheritance hierarchy of generic frames used for MALA rescue architecture ...... 96
3.6 FRS definitions of sensory frames in the MALA rescue robot control architecture ...... 97
3.7 FRS definitions of PDDL plan frames in the MALA rescue robot control architecture ...... 98
3.8 FRS definitions of motor frames in the MALA rescue robot control architecture ...... 99
3.9 Agents communicate by posting frames to a blackboard ...... 100

3.10 Trace of a demonstration run of the system on Map 1, with the robot beginning in the bottom left hand corner ...... 102
3.11 Abridged trace of frames on the blackboard during the run shown in Figure 3.10, only relevant slot values are shown ...... 104

4.1 Urban Search and Rescue (USAR) with the Emu autonomous rescue robot ...... 111
4.2 Rescue robot architecture incorporating control-level learning ...... 112
4.3 Planning Domain Definition Language (PDDL) action model for put_between(x,y,z,w) action, in which the robot attempts to place an object between two other objects ...... 118
4.4 Extracting features from difficult terrain using Planes and Deviations ...... 124
4.5 Learning to drive over difficult terrain by constructing a decision tree ...... 125
4.6 Selecting a navigation corridor and goal point for learning to drive over difficult terrain ...... 130
4.7 Avoiding danger by planning a path around a region recognised as impassable ...... 131
4.8 Shoring of damaged ceilings ...... 133
4.9 Aerial view of one of the simulated rescue arena maps used in our experiments ...... 135
4.10 Performance of the complete system on the rescue task ...... 137
4.11 Performance of the learned driving controller over the course of ten missions in the rescue arena ...... 139
4.12 Varying stepfield difficulty to investigate MALA’s capability to recognise difficult terrain ...... 141
4.13 Rescue performance in environments of varying difficulty for several variants of the control architecture ...... 143

5.1 Information flow between components of the hybrid learning system ...... 151
5.2 Simulated Emu robot with six degree-of-freedom JACO manipulator arm mounted at the front ...... 155
5.3 Software components for hybrid learning architecture introduced in Chapter 5 ...... 156
5.4 Hybrid action model for the hit_with action, in which the robot attempts to hit a target object using a tool object as a hammer ...... 164
5.5 Visualisation of a version space of hypotheses, bounded by the most-specific (hS) and most-general (hG) boundaries ...... 170

5.6 The winner of the trial phase of the DARPA Robotics Challenge, the S-One robot ...... 180
5.7 Visualisation of the four learned manipulation actions ...... 182
5.8 Learning curves for the four manipulation actions ...... 183
5.9 Comparison of the number of trials required to learn various actions ...... 188
5.10 Learning which tools can be used to execute the hit_with action ...... 190
5.11 Execution of planning tasks using learned action models ...... 192

6.1 Hybrid learning capabilities not yet implemented in MALA . . . . 201

A.1 Environment generation in jmeSim ...... 242
A.2 Sensor data in jmeSim ...... 243
A.3 ROS-jmeSim connectivity ...... 244

List of Tables

3.1 Performance on the autonomous rescue task for five randomly generated environments, including the one shown in Figure 3.10. . 103

4.1 Discrete drive actions used by the learned drive controller ...... 121
4.2 Components of the feature vector constructed by the Planes and Deviations feature extractor ...... 126
4.3 Varying stepfield difficulty to investigate Multi Agent Learning Architecture (MALA)’s capability to recognise difficult terrain ...... 141

5.1 Definitions of several spatial predicates in FRS ...... 158
5.2 Results of learning action model preconditions for the two actions grasp and place_on ...... 185
5.3 Results of learning action model preconditions for the two actions open_door and hit_with ...... 186
5.4 Results of execution of planning problems using the hybrid action models learned in Experiment 1 ...... 193

B.1 Agents written in five languages are currently supported in MALA ...... 248
B.2 Details of perception agents implementation ...... 249
B.3 Details of learning agents implementation ...... 250
B.4 Details of action agents implementation ...... 251

C.1 Initial action models of grasp and place_on ...... 255
C.2 Initial action models of open_door and hit_with ...... 256

List of Algorithms

1 Event-driven processing of a Multi Agent Learning Architecture (MALA) agent ...... 84
2 Definition of spatial constraints in ECLiPSe Prolog ...... 94
3 Object detection agent update ...... 113
4 Terrain driver agent learning update ...... 122
5 Solving a constraint satisfaction problem (CSP) using JaCoP ...... 160
6 Induction of new action model ...... 169
7 Generalisation of new action model by active learning ...... 171
8 Execute a learned action to perform an experiment ...... 173
9 Induction of most general hypothesis using GOLEM’s negative-based reduction. Adapted from (Muggleton and Feng, 1992) ...... 174
10 Generate a set of possible experiments ...... 174
11 Select a tool for the next experiment. From Brown (2009) ...... 175
12 Test if an experiment is feasible ...... 176
13 Induction of most specific hypothesis using Progolem’s BEST-Asymmetric Relative Minimal Generalisation (ARMG). Reproduced from (Muggleton et al., 2010) ...... 177
14 Random generation of rescue arena mazes ...... 241

List of Abbreviations

ACT-R Adaptive Control of Thought – Rational. 27, 29

ADE Agent Development Environment. 44

AI Artificial Intelligence. 1, 3, 5, 6, 8, 13, 52

AOI area of interest. 123, 124, 126, 128

ARMG Asymmetric Relative Minimal Generalisation. xvii, 172, 177

CAS Cognitive systems Architecture Schema. 44, 81, 86, 87

CSP constraint satisfaction problem. xvii, 93, 154, 157–162

DAI Distributed Artificial Intelligence. 35

DARPA Defense Advanced Research Projects Agency. 5, 12, 70, 149, 154, 204

DIARC Distributed Integrated Affect, Reflection, Cognition. 44, 86, 87

DMP Dynamic Motion Primitive, Glossary: DMP. 56–58, 66, 68, 147–153, 155, 162, 164–167, 169, 173, 188, 191, 199–202, 250

DVMT Distributed Vehicle Monitoring Testbed. 37

EBL Explanation-Based Learning. xii, 29, 31, 60, 61, 64, 151, 153, 155, 164, 165, 168, 169, 181, 255, 256

G-RAM Generalised Random Access Memory. 32

GILPS Generalised Inductive Logic Programming System. 250

HRL Hierarchical Reinforcement Learning. 65, 66

ICP Iterative Closest Point. 90, 249

ILP Inductive Logic Programming. 23, 52, 60, 64, 65, 149–151, 153, 155, 170, 189, 192, 250

KS Knowledge Source. 36

LfD Learning from Demonstration. 48, 54–58, 68, 147, 150, 201

MACSi Motor Affective Cognitive Scaffolding for the iCub Robot. 44, 46, 47, 86, 87

MALA Multi Agent Learning Architecture. viii, ix, xiii, xiv, xvi, xvii, 6, 7, 11, 14–18, 22–24, 37, 41, 43–47, 53, 62–65, 67, 72–74, 76–87, 89–91, 93, 95–99, 101, 103, 106–110, 114, 116, 132, 134, 141, 144–146, 157, 158, 196–202, 204, 205

MBICP Metric-based Iterative Closest Point. 90, 156, 249

MICA Multimodel Interagent Communication Architecture. 37, 78, 81

MINDS Multiple Intelligent Node Document Servers. 37

NIST National Institute of Standards and Technology. 135

PD Proportional-Derivative. 120

PDDL Planning Domain Definition Language, Glossary: PDDL. xiv, 41, 45, 47, 58, 63, 92, 96, 112, 116–118, 149–151, 153, 155, 163, 164, 166, 168, 170, 193

RCS Real-time Control System. 39, 40

RL Reinforcement Learning. 29, 50, 51, 54, 58, 65

RLGG Relative Least General Generalisation. 172

ROS Robot Operating System. 21, 75, 81, 85, 86

SLAM Simultaneous Localisation and Mapping. 74, 88, 90, 96, 99, 101, 112, 114, 115, 119, 129–131, 136, 156

STRIPS Stanford Research Institute Problem Solver. 41, 42, 60

T-REX Teleo-Reactive Executive. 44, 45, 47, 86, 87

TCP/IP Transmission Control Protocol/Internet Protocol. 82, 85

UAV Unmanned Aerial Vehicle. 4

USAR Urban Search and Rescue. x, xiv, 18, 24, 87, 106–109, 111

Chapter 1

Introduction

The shared goal of research in Artificial Intelligence (AI) and Robotics is to produce robots capable of doing the work of a human. Although this concept may appear uniquely modern, it is not at all new.

1.1 Intelligent Robots in Antiquity

The earliest known descriptions of mechanical animated helper objects are found in the writings of ancient Greek historians (McCorduck, 1979). A translation of a passage from Homer’s Iliad describes artificial assistants fashioned by the gods:

These are golden, and in appearance like living young women. There is intelligence in their hearts, and there is speech in them and strength, and from the immortal gods they have learned how to do things.

But the Ancient Greek conception of intelligent animated machines was not limited to poetic license. The great Greek engineer Hero of Alexandria described several automata in his treatises On Automatic Theaters, On Pneumatics, and On Mechanics (c. 85 A.D.). Hero designed many automata, including an automated musical instrument powered by wind. This wind-powered organ is shown in the left panel of Figure 1.1.¹ Particularly striking amongst these automata was a programmable cart capable of carrying automatic puppets to the stage to begin their performance (Rosheim, 1994). Hero is thought to have physically constructed this cart, making this automaton the earliest known example of a programmable robot (Sharkey, 2007). Some hundreds of years later (c. 1205 A.D.), the legacy of the ancient Greeks was collated by al-Jazari, a much-celebrated Muslim engineer. Al-Jazari’s designs built upon the constructions of Ancient Greece, and he proposed over a hundred designs for mechanical assistants, capable of pouring liquids (Figure 1.1, right), and even providing fresh towels for guests (Sharkey, 2007). The designs of al-Jazari are thought to have inspired more modern engineers, most famously Leonardo da Vinci (Moon, 2007).

Figure 1.1: Ancient precursors of intelligent robots. Left: Modern reconstruction of the wind organ and wind wheel of Hero of Alexandria (1st century AD) according to W. Schmidt: Herons von Alexandria Druckwerke und Automatentheater, Greek and German, 1899 (Heronis Alexandrini opera I, Reprint 1971). Image from Wikimedia Commons (1899). Right: Photograph of a page of al-Jazari’s Machine Pouring Wine, showing a mechanical humanoid pouring wine. Image credit: (Samling, 2014).

¹ In the online version of the thesis, blue text denotes hyperlinks.


Although the development of intelligent robots has been a part of human imagination for a very long time, it is only in the last few years that advances in computer science, AI and mechanical engineering have enabled the construction of autonomous robotic systems that are genuinely part of everyday life. Accordingly, research in AI and robotics has received ever-increasing attention in both art and academia. However, imbuing artefacts with intelligence is also somewhat eerie and has remained controversial since its conception, perhaps not least because animate machines challenge many conceptions of the human mind as unrivalled and ethereal. Turing (1950) enumerated several such fears in the groundbreaking paper in which he proposed the imitation game now known as the Turing test. There are several motivations for the study of intelligent systems, some of which are indeed unsettling.

1.2 Motivations of Intelligent Robotics Research

Most obviously, designing schemas for intelligent computer programs is an exercise in reverse engineering. Although AI has a vast amount of ground to cover before claims can be made about simulating the full breadth of human intellectual capacity, results from AI research have certainly influenced thinking in the neurosciences and the general scientific understanding of the mind and its internal capabilities. A notable example of this phenomenon is the work of Marr (1982), whose computational theory of vision formed one of the seeds of research into computational neuroscience. In this spirit, AI research has an exploratory character, and attempts to discover engineering principles that eventually lead to testable hypotheses about the human brain.

Another driver of research in intelligent systems is the utility of creatures that are as capable as humans, but not as valuable. Unfortunately, a natural application for a simultaneously capable and expendable automaton is in the field of battle. Much research into autonomous robotics has been driven by this goal, and there are several systems, such as the Reaper autonomous hunter-killer unmanned aerial vehicle (Figure 1.2), currently operating in military theatres around the world in somewhat controversial circumstances. Advances in this area raise troubling questions about the future of intelligent robotics research.

Figure 1.2: Controversial applications of autonomous robots: the General Atomics MQ-9 Reaper ‘hunter-killer’ Unmanned Aerial Vehicle (UAV), or ‘drone’, is a military aircraft designed for long-endurance, high-altitude surveillance. Image from Wikimedia Commons (2014).

Despite this, there are several worthwhile applications for benevolent artificial assistants, particularly to work on tasks that are boring, repetitive or undesirable, or that require mechanical strength or precision beyond what humans are capable of. Examples include robots for surgery, moving stock around warehouses, and manufacturing. However, the most pressing applications for such an assistant are to aid or even replace human workers in scenarios that are particularly dangerous, such as bomb disposal, minesweeping, or disaster rescue. Several recent catastrophes, notably the March 2011 meltdown of the Fukushima Daiichi nuclear power station in northern Japan, have highlighted the need for intelligent robots that can navigate human-built environments and interact with the associated tools and control systems. The work in this thesis is directly motivated by this goal. The proposed system architecture is designed with autonomous rescue in mind, and evaluated using tasks that the rescue robots of the future will solve during real missions.

1.2.1 State of the Field

Algorithms for navigation, mapping and motion planning have reached a state of maturity such that many challenging tasks are now feasible for real robots. In addition, the amount of computational power available to robots has radically increased, both due to faster and lighter onboard processors, and the rise of widely available distributed or ‘cloud’ computing infrastructure. These two factors are at the core of many recent success stories in autonomous robotics, notably the United States Defense Advanced Research Projects Agency (DARPA) Grand Challenges of 2006 and 2007, the DARPA Robotics Challenge of 2013, and Google’s self-driving cars. However, two important shortcomings have not yet been addressed by the current generation of autonomous robot systems.

First, current robots are often capable of reliably performing a task in a limited set of circumstances. However, typically such systems are ‘brittle’, meaning that they may fail if sufficiently large changes to the problem domain are introduced. Further, if the task description or execution context changes, reprogramming is cumbersome and must be done manually. Systems with such rigidity abound in the robotics literature. One illustrative example is provided by the requirement for Google engineers to manually record the locations of traffic signal lights that are much dimmer than usual, so that self-driving cars do not ignore them (Fairfield and Urmson, 2011).

Second, there is significant consensus that improvements in robot functionality can only be achieved by adding capabilities such as task planning, knowledge representation, and learning (Kaelbling and Lozano-Pérez, 2013; Tenorth and Beetz, 2013; Liu and Nejat, 2013). These capabilities were once at the core of research in AI robotics; however, deployed robot systems typically do not include them (Sammut, 2012). Many impressive applications of each of the above capabilities have demonstrated how a robot’s behavioural repertoire can be expanded, but

systems containing all of the above are far rarer. Thus, integrated frameworks are needed to produce intelligent robot systems capable of improved autonomy, particularly when they are operating in unpredictable environments.

1.3 Principles of Robot Architecture Design

This thesis contends that robot architectures capable of supporting intelligent behaviour in realistic application domains are governed by several design principles. These principles, inspired both by the development of physical robot systems and results in the neurosciences, have been appreciated to greater or lesser degrees by previous researchers. Because of this, such principles are given varying weight in existing cognitive robot software systems, such as MACSi (Ivaldi et al., 2014) or Dora the Explorer (Hawes et al., 2010). This section explores the four principles that we claim are necessary to cope with the challenges of long-term autonomous robotics. In the next chapter, we discuss the extent to which these principles have been implemented in previous work. They are:

• Specialised modular components (1.3.1)
• Integration of symbolic and sub-symbolic processing (1.3.2)
• Incremental learning (1.3.3)
• Hybrid learning (1.3.4)

The discussion of these principles motivates the construction of the Multi Agent Learning Architecture (MALA), the central contribution of this thesis, which is introduced in Section 1.4. Within the integrative AI literature, the term robot architecture often refers to software systems that handle only the low-level aspects of robots, while cognitive architectures are those capable of higher cognitive functions such as planning, learning and memory (Langley et al., 2009). However, since a central tenet of this thesis is that the two functions are inextricably entwined, we use both terms interchangeably to refer to MALA.


Like other cognitive architectures, MALA simultaneously represents a blueprint for construction and an informal theory of the operation of integrated software systems for intelligent robots. Thus, the research in this thesis alternates between theoretical descriptions of architectural properties, and analysis of concrete implementations of robot software constructed using MALA as a framework.

1.3.1 Specialised Modular Components

In his seminal work, The Architecture of Complexity, Simon (1962) advocated that hierarchy and modularity are fundamental characteristics of all complex systems. Simon claimed that biological evolution required modularity of genetic information to proceed, by analogy with the construction of complex machines from many parts. He showed that any interruption to construction slowed development times far less if the assembly proceeded through a series of stable intermediate states (subassemblies), than if the complete assembly was continuously unstable. Through a straightforward probabilistic analysis, Simon showed that a wristwatch was far more likely to be completed on time, and similarly, that an organism was far more likely to evolve through random mutation, if its structure was modular (an illustrative version of this calculation is given at the end of this section). This led Simon (1962) to propose that these subassemblies or modules were fundamental to an evolvable system. More recent work in evolutionary biology and evolutionary computer science supported Simon’s hypothesis that modularity of genetic information bestows improved evolvability on a genome, since mutation within one module is less likely to adversely affect the rest of the organism (Callebaut and Rasskin-Gutman, 2005; Wagner et al., 2007; Wagner and Zhang, 2011). Investigation of this idea through simulation of artificial evolution shows that simulated organisms with a modular genetic structure quickly reach a higher fitness, and displace those that do not (Frenken and Marengo, 1999). Indeed, the consensus within the recent literature is that genetic modularity is closely linked to the ability of biological systems to generate robustness, in the sense that genetic mutations are likely to lead to viable and complex adaptations that lead to improvement in the environmental fitness of

the organism (Pigliucci, 2008; Wagner and Zhang, 2011).

An important parallel between the evolution of a species and the evolution of knowledge in an autonomous robot is the necessity of continuous operation. Were evolution to have a chance to stop and redesign life from scratch, it is not at all clear that current organisms are optimal in any sense, and would recur. Rather, once the commitment to any given subsystem has been made, it is dangerous to tinker with it, since any interruption in operation is likely to be catastrophic for a species. Instead, it must be ‘frozen’ into a module, to ensure that any subsequent evolution is additive, and not destructive. This is particularly evident when considering the human brain: rather than a different, more advanced, type of brain to those in our more primitive relatives, almost every component is the same. The spinal cord, hindbrain, midbrain, thalamus, cerebellum, and cerebrum are all present in every other vertebrate brain (Jarvis et al., 2005). Thus, as new structures are added and then frozen they create the brain’s modular structure, at least at the gross anatomical level. Whether this incrementally acquired modularity exists at the functional, computational level, in which we have more interest, remains a subject of controversy (Barrett and Kurzban, 2006), but there are certainly strong arguments that it does (Pinker, 1997; Minsky, 1988; Meunier et al., 2009). This type of evolution, in which a single module of the brain is evolved while the rest is kept static, is termed mosaic evolution, and evidence of it is abundant (Barton and Harvey, 2000; Iwaniuk and Dean, 2004). By analogy with mosaic evolution, it is clear that as a persistent system acquires more and more specific knowledge, it becomes increasingly difficult to add new knowledge, as the cost of completely revising the existing base becomes prohibitive. At this point it can be easier to consolidate knowledge into a ‘black box’, and stop refining it. This can allow the learning process to continue to higher levels of abstraction and complexity without stalling.

In AI, experience with developing complex cognitive and robot systems led several researchers to posit that the incremental development of a system was greatly facilitated by a modular architecture, and hampered to the point of failure by

excessive interdependencies between components, such as Marr (1982):

Any large computation should be split up and implemented as a collection of small sub-parts that are nearly independent of one another . . . If a process is not designed in this way, a small change in one place will have consequences in many other places. This means that the process as a whole becomes extremely difficult to debug or to improve

and later Minsky (1988):

When a system evolves to become more complex, this always involves a compromise: if its parts become too separate, then the system’s abilities will be limited – but if there are too many interconnections, then each change in one part will disrupt many others.

Indeed, several recent cognitive architectures have advocated modularity as a design principle, citing the importance of extendability and improvability for their systems (Wyatt and Hawes, 2008). In this spirit, extensive modularity and encapsulation serve as a fundamental principle of robot software architectures: firstly in pragmatic terms, since it is desirable for such systems to be improvable and extendable, but more importantly, to aid the incorporation of learning in the architecture (this idea is made concrete in Section 1.3.3). For the same reasons that evolution requires a modular structure to produce complex solutions to biological problems, a cognitive architecture that seeks to learn to improve itself must not run the risk that improving one component degrades several others through excessive connectivity.

A specialised procedure can, in general, be made more efficient than one that works in a wider context, since specialisation can allow simplifying assumptions to expedite processing. Biological systems often take advantage of this principle to improve efficiency. A famous example of this is the visual food-detection system of the frog, which simply searches for fast-moving, bug-sized things (Lettvin et al., 1959). This system is certainly not a general method for detecting food under all conditions, but undoubtedly an efficient one for the frog. Analogously in

robotics, developers are forced to use specialised representations and algorithms to improve performance: for example, in robot soccer, the rules currently dictate that the ball is orange. Accordingly, the vision system quickly detects the ball by scanning the camera image for orange pixels, scanning progressively fewer pixels closer to the top of the image, since the ball will appear smaller if further away (Sammut, 2010) – a highly efficient, specialised method. The necessity of real-time operation using limited on-board computing resources provides the dual pressures on efficiency shared in both cases: under such restrictions, processing speed is paramount. Thus, a cognitive architecture able to effectively control a robot is required to accommodate a variety of knowledge representation and reasoning mechanisms that are specialised for different tasks.

This section argued that specialisation is necessary for efficient operation under resource constraints, and that the difficulty of solving robotics problems often forces a specialised approach. Another hard-learned lesson from robotics is that no method works well all the time. Effective robotics often requires a collection of preprogrammed behaviours, none of which is sufficiently general to be useful all the time, but which, in concert, when switched between appropriately, can produce effective behaviour under a wide range of conditions (Sammut, 2012). Minsky (1988) described his experience in constructing some of the earliest cognitive robot systems:

We tried a dozen different ways ... but no single method worked well by itself. Eventually we got better results by finding ways to combine them. We had the same experience at every level: no single method ever sufficed, but it helped to combine several different ones.

Minsky’s (2006) work also notes that biological cognition takes a similar approach: for example, the brain uses an array of algorithms to perceive depth – binocularity, occlusion, linear perspective, distribution of shadows and illumination, and others. Cassimatis (2010) also noted that many computational methods achieve their successes in one domain at the expense of performance in other domains or attributes, and proposed that hybrids or combinations can achieve more robust performance.


To adhere to these principles, an architecture must be composed of a disparate and heterogeneous collection of encapsulated computational entities, each relatively specialised to a particular function. Minsky (2006) suggested how an architecture with these properties might behave, and in his later work, how much of the communication and control might flow, but not in sufficient detail that the Society of Mind could be constructed directly.² Thus, the pursuit of generality through collections of specialisations, each of which is embedded within a discrete computational entity, is adopted as a fundamental principle of MALA.
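The ‘straightforward probabilistic analysis’ referred to at the start of this section can be illustrated with a small back-of-the-envelope calculation. The specific numbers below are illustrative assumptions, not Simon’s original figures: suppose an assembly requires 1000 parts, each assembly step is interrupted independently with probability p = 0.01, and an interruption destroys any unstable partial assembly. Then

\[
P_{\text{monolithic}} = (1-p)^{1000} \approx e^{-1000p} \approx 4\times10^{-5},
\qquad
P_{\text{10-part subassembly}} = (1-p)^{10} \approx e^{-10p} \approx 0.9 .
\]

A builder who works through stable ten-part subassemblies completes almost every intermediate unit and loses at most nine parts’ work per interruption, whereas the monolithic builder almost never finishes an uninterrupted run; this asymmetry is the essence of Simon’s argument for why modular structures are evolvable.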

1.3.2 Integration of Symbolic and Sub-symbolic Processing

In the robotics research community, there has been a pronounced lack of complex cognitive systems or advanced reasoning implemented on robots. This has been driven by several factors. Historically, the inability of robot systems to perform fundamental tasks, such as recognising and interacting with objects, has left robots either successfully performing cognitively simple tasks, like Brooks’ (1986) behaviour-based robots (see Section 2.5.2), or paralysed by their ungainly and slow deliberation cycles, like Shakey (Nilsson, 1984) (see Section 2.5.1).

Another important reason why robotics researchers have not embraced cognitive architectures in their software design is the very small proportion of robotic behaviours that actually require complex cognition. Especially in the limited domains in which current robotic systems are capable, deliberation, consideration of future states, or search tasks are rarely required (Sammut, 2012).

However, recent years have seen advances in the theoretical foundations of robotics, most notably in probabilistic robotics (Thrun et al., 2006). These developments have been accompanied by advances in sensor technology, in particular, the advent of dense range sensors based on LIDAR and time-of-flight cameras such as the Microsoft Kinect. The combination of these developments has been at the heart of many recent successes in the robotics community, the most visible of which has

² Singh (2005) did make an attempt to do this; however, the implementation was driven by his own design rather than explicitly following Minsky’s abstract descriptions.

been the fully autonomous ‘self-driving’ cars produced first in pursuit of DARPA’s grand challenges, and later by Google. These successes demonstrate that many of the hurdles which have prevented even the consideration of robots with true long-term autonomy are now being overcome. Future applications of assistive robotics may require robots to be deployed in environments as complex and changeable as human-engineered environments like homes, offices, and cities, and to continually operate and achieve goals for a long period of time. We refer to this as long-term autonomy.³ To achieve this level of autonomy, a robot must constantly acquire new skills and incorporate them into its existing knowledge base, adapt to changes in the environment and in itself, and make plans that consider past, present and future world states. Clearly, robots deployed in such long-term domains require the complex capabilities embodied in symbolic planners and cognitive architectures, and current, lower-level robotics approaches will not scale to this type of problem.

In addition, many types of robot assistants, particularly domestic robots, must be able to interact with humans. This entails understanding commands and having conversations. To achieve this the robot must have a model of the home and models of what the human occupants expect the robot to do for them. This is so that robot and human can confirm that they are referring to the same objects and so the robot can execute or even ask questions about commands given to it. A particularly important point is that language-like models are needed when the robot is asked to explain its reasoning. Thus, symbolic representations are particularly important when a robot must work with humans, most strongly in the home but also in any application where human-robot interaction is required, which is almost all domains. Accordingly, it is essential that candidate architectures for intelligent robots possess the ability to integrate such symbolic processing structures with the low-level (sub-symbolic) components that interact more closely with their sensors and actuators.

³ This is a much more demanding view of the requirements for long-term autonomy than that provided by other authors such as Meyer-Delius et al. (2012).


1.3.3 Incremental Learning

Earlier sections pointed out that many successful autonomous robots have operated in restricted problem domains. When operating outside such restrictions, a particular challenge is the dynamic nature of uncontrolled environments. Static models of the world, or of the robot’s behaviours, lack the ability to adapt to changes in the world, and must be augmented by online learning; that is, by embedding learning systems that incrementally adjust the robot’s knowledge as aspects of the environment change.

This challenge has, to some extent, been met by significant progress in the application of statistical techniques to autonomous robotics, enabling robots to learn fundamental skills such as locomotion and manipulation of objects. However, in isolation, such statistical techniques do not extend to providing the ability to specify robot behaviour at an abstract level, referred to as task-level autonomous operation. For example, statistical learning of robot behaviours can acquire the skill to pick up an object, but cannot direct the robot as to when it is a good idea to do so, or what can be done with the object once it is picked up.

Through integration of statistical machine learning algorithms that have proven successful in robotic contexts with automated planning techniques and logical learning algorithms from the AI literature, robots can learn models that enable them to achieve a much larger degree of autonomy. In particular, robots can acquire logical models of action that enable them to learn not only how to perform a particular behaviour, but also how to use the behaviour to implement plans of action to achieve goals, which is necessary for task-level autonomy.

1.3.4 Hybrid Learning

The previous two sections claimed that the inclusion of both symbolic and sub-symbolic processing and incremental learning is necessary for robot architectures. To implement learning using a hybrid representation, the learning system must

have a hybrid character – it must be capable of acquiring and refining knowledge at multiple levels of abstraction. We refer to a learning system with this capability as a hybrid learning system. In the process of learning one type of knowledge, constraints are often imposed upon other forms of knowledge. Hybrid learning systems are capable of enabling the coupling of learning processes at different abstraction levels so that the complete multi-level representation is consistent.

To illustrate this point, consider an example of a student learning to play tennis under instruction from a coach. The coach’s guidance is usually in the form of verbal instructions describing, for example, strategic-level knowledge of how best to play the game – where to stand, which is the best type of shot to hit, and so on. As the student absorbs this information they acquire a model of tennis playing at one level of representation; this may be considered a high-level learning process. However, in order to actually move around the court and hit the ball, the student must learn the fine motor control necessary to execute these dynamic movements – low-level learning. If the coach is paying attention, he or she will adjust the advice as the strengths of the student’s physical game become apparent; thus the high-level learning is, to some extent, constrained by the result of the lower. The high-level learner adjusts the state of high-level knowledge based on what is physically possible at the lower level. Simultaneously, the goal of low-level learning shifts as the required high-level actions change. If the student has been instructed to adopt a serve-and-volley game, the finer points of these actions must be learned while other shots and movements may be neglected. The high-level learner has constrained the search space of possible physical movements that the low-level process is required to explore (a toy sketch of this two-way coupling is given at the end of this section).

In like fashion, intelligent robots must incorporate hybrid learning systems capable of acquiring and refining knowledge at multiple levels of abstraction, if they are to cope with the many types of changes that occur in the physical environment, the task that is required, or even in the robot itself. As far as we are aware, the MALA architecture introduced in the next section represents the first robot software architecture explicitly designed with the above

four principles in mind. The claim of this thesis is that MALA is capable of supporting more complex intelligent behaviour in more realistic settings than previous robot software architectures.
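The following toy sketch illustrates the two-way coupling described in the tennis example. It is purely illustrative: all class, method and numeric choices are invented for the example and are not part of MALA; the actual task-level and control-level learners are described in Chapters 4 and 5.

```python
# Illustrative sketch of coupled task- and control-level learning.
# All names and numbers are hypothetical; the real systems are described
# in Chapters 4 and 5 of the thesis.
import random

class TaskLearner:
    """Task level: keeps a crude reliability model for each abstract action."""
    def __init__(self, actions):
        self.model = {a: 0.0 for a in actions}

    def select_action_to_improve(self):
        # Practise the action currently believed to be least reliable,
        # constraining what the control level will work on (the 'coach').
        return min(self.model, key=self.model.get)

    def update_action_model(self, action, success_rate):
        # Revise task-level knowledge in light of what is physically achievable.
        self.model[action] = success_rate

class SkillLearner:
    """Control level: 'practising' an action improves its execution skill."""
    def __init__(self):
        self.skill = {}

    def practise(self, action):
        improved = self.skill.get(action, 0.2) + random.uniform(0.1, 0.3)
        self.skill[action] = min(1.0, improved)
        return self.skill[action]

task, control = TaskLearner(["grasp", "hit_with"]), SkillLearner()
for _ in range(6):
    a = task.select_action_to_improve()               # high level constrains low level
    task.update_action_model(a, control.practise(a))  # low level informs high level
print(task.model)
```

Each pass around the loop mirrors one coaching cycle: the task level picks what to practise, the control level reports what is achievable, and the two models move together.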

1.4 Overview of MALA

MALA (Haber and Sammut, 2013) is a distributed architecture, composed of a set of agents⁴ and a blackboard that serves as a central communication mechanism. This section gives a brief overview of MALA and its operation; details of the implementation are given in later chapters.

The central tenet of MALA is that every computational process required for the robot’s cognitive function should be encapsulated within an individual agent. Agents are asynchronous, independent processes that cannot directly communicate with each other. Each agent performs a single dedicated function, such as task planning or object detection. Internal representation within an agent is specialised to the task that the agent must perform.

All communication between agents is mediated by one or more blackboards. Like agents, each blackboard is an independent process. Blackboards provide a communication mechanism to distribute data amongst the set of agents, and a memory mechanism to store traces of the data flow through the system. The function of each blackboard is to provide communication between agents, while also acting as a long-term repository of the robot’s knowledge. Agents communicate indirectly by posting data to the blackboard, which may be forwarded on to other agents for further processing. On receipt of data from the blackboard, an agent processes it, and posts the results back to the blackboard, which may in turn be forwarded to other agents.

4The term ‘agent’ has become widely used, and typically refers to an autonomous entity which interacts with the environment (Russell and Norvig, 2010). In MALA, and for the rest of this thesis, the term is used to describe a component of a computer program, such as a subroutine, with a specialised function, which does not display independent autonomy. This terminology is used following Minsky’s (1988) Society of Mind theory.


Figure 1.3: Information flow within MALA. Smaller boxes indicate agents or groups of agents, while arrows indicate posting and listening to frames, mediated by the blackboard. Abstraction increases as information flows up the stack of perceptual agents on the left, and decreases again as task-level action plans are made operational by the action generation agents on the right. Figure from (Haber and Sammut, 2013).

In this way, information propagates around the architecture. Thus, the blackboard operates in much the same way as working memory in a production rule system. To describe the conceptual organisation of agents within MALA, agents are grouped into several categories (Figure 1.3).
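To make this posting-and-forwarding cycle concrete, the following minimal sketch shows, in Python, a blackboard that relays posted frames to agents that have registered an interest in a frame type. The class names, frame types, and toy agents are hypothetical, and the processing here is synchronous; the actual MALA implementation, in which agents are independent asynchronous processes, is described in Chapter 3.

    from collections import defaultdict

    class Blackboard:
        """Relays posted frames to agents listening for that frame type, and
        keeps a trace of everything posted (a simple long-term memory)."""
        def __init__(self):
            self.listeners = defaultdict(list)
            self.memory = []

        def listen(self, frame_type, agent):
            self.listeners[frame_type].append(agent)

        def post(self, frame_type, data):
            self.memory.append((frame_type, data))
            for agent in self.listeners[frame_type]:
                agent.receive(frame_type, data, self)

    class ObjectDetector:
        """Toy perception agent: turns a raw 'image' into object percepts."""
        def receive(self, frame_type, data, blackboard):
            objects = [pixel for pixel in data if pixel > 0]   # stand-in for detection
            blackboard.post("objects", objects)

    class GoalGenerator:
        """Toy goal-generation agent: reacts to object percepts."""
        def receive(self, frame_type, data, blackboard):
            if data:
                blackboard.post("goal", ("inspect", data[0]))

    bb = Blackboard()
    bb.listen("image", ObjectDetector())
    bb.listen("objects", GoalGenerator())
    bb.post("image", [0, 3, 0, 7])   # sensor data enters; a goal frame results
    print(bb.memory)

Posting the 'image' frame triggers the detector, whose output in turn triggers the goal generator, mirroring the indirect, data-driven communication described above.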

Perception Agents

In the visual cortex of the human brain, perception is a constructive process. That is, discrete aspects of images are processed in parallel and gradually fused together at increasing levels of abstraction to produce perception. Low-level features such as local contrast, orientation, and colour are extracted by specialised columns of neurons. Intermediate-level analysis involves parsing an entire scene into surfaces and global contours. Features from lower levels of processing are fused together to form high-level object percepts using cues from global context, foreground and background, and other properties of the scene (Kandel et al., 2013).

Perception in MALA is an analogous process in which low-level perception agents operate directly on raw sensor data to extract features. Higher-level perception agents operate on these features, and fuse them into more abstract representations. These representations then form the input to further high-level perception agents that update the system’s current knowledge of the world. The highest level of knowledge in MALA typically identifies a set of objects, their relationships to each other, and their components. Much of an autonomous robot’s computational resources must be dedicated to processing large volumes of multi-modal sensor data, to localise itself, to perceive hazards, and to perceive objects in its environment. Processing must be able to keep up with data rates from sensors or risk losing important information. In MALA, perception agents implement specialised algorithms for this purpose, one agent for each perceptual task.
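The increase in abstraction as information flows up the perception stack can be illustrated with a small sketch. The feature, object, and relational representations below are illustrative stand-ins rather than the representations actually used in MALA: a raw range scan is reduced to features, the features are fused into object hypotheses, and the objects are finally described relationally.

    def extract_features(scan):
        """Low-level perception: raw range readings -> 'blob' features
        (contiguous runs of short ranges)."""
        blobs, current = [], []
        for index, reading in enumerate(scan):
            if reading < 2.0:
                current.append(index)
            elif current:
                blobs.append(current)
                current = []
        if current:
            blobs.append(current)
        return blobs

    def fuse_objects(blobs):
        """Higher-level perception: features -> object hypotheses."""
        return [{"id": n, "centre": sum(b) / len(b), "size": len(b)}
                for n, b in enumerate(blobs)]

    def describe_relations(objects):
        """Highest level: objects -> relational world description."""
        return [("left_of", a["id"], b["id"])
                for a in objects for b in objects
                if a["id"] != b["id"] and a["centre"] < b["centre"]]

    scan = [5.0, 1.2, 1.1, 5.0, 5.0, 0.9, 1.0, 1.1, 5.0]
    print(describe_relations(fuse_objects(extract_features(scan))))

In MALA each of these stages would be a separate perception agent posting its output to the blackboard for the next agent to consume.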

Goal Generation and Execution

Goal generation agents create the drives and objectives of the system, and operate upon the description of the world state produced by perception agents. Action generation is a top-down cascade of agents that specify actions at a high level, such as task planners, and agents that translate the abstract actions into concrete implementations that can be performed by the robot. The agents involved in this process are constraint solvers, motion planners, and motion controllers. Together, these agents provide the reasoning required to generate plans, execute the behaviours that implement these plans, and achieve goals. An important characteristic of MALA is that the cascade of information from sensors through sets of perception agents and toward actuators through sets of execution agents is not ‘single-channel’. Each stage in the process may – and usually does – contain several agents. This is especially important because robots typically have multiple actuators and sensors. For example, in the system described in Chapter 5, there are three motion planner agents: one to generate paths for the robot to drive to locations, one to generate constant-velocity trajectories to position the robot’s arm, and finally one to generate dynamic trajectories for the arm to implement complex manipulation actions.
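A minimal sketch of this dispatch from task-level actions to specialised motion planners is given below. The action names and planner stubs are hypothetical; the point is simply that each abstract action in a plan is made operational by a different agent.

    def plan_drive_path(goal):
        return f"waypoint path to {goal}"             # stand-in for a path planner

    def plan_arm_trajectory(pose):
        return f"constant-velocity arm trajectory to {pose}"

    def plan_dynamic_trajectory(action):
        return f"dynamic arm trajectory for {action}"

    # Hypothetical mapping from abstract task-level actions to the motion
    # planner agent that makes them operational.
    DISPATCH = {
        "drive_to": plan_drive_path,
        "move_arm": plan_arm_trajectory,
        "hit_with": plan_dynamic_trajectory,
    }

    task_plan = [("drive_to", "doorway"),
                 ("move_arm", "pre-grasp pose"),
                 ("hit_with", "hammer")]

    for action, argument in task_plan:
        print(DISPATCH[action](argument))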

Learning

Since each agent is a self-contained process, every agent is capable of supporting learning. Additionally, dedicated learning agents may observe the inputs and outputs of the agents described above and may modify them to improve their performance. Further details on the novel aspects of MALA and its relationship to other architectures are given in Chapter 2, and details of its structure and implementation in Chapter 3. Learning at the control level is the focus of Chapter 4, and hybrid learning of Chapter 5.
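As a rough illustration of this arrangement, the sketch below shows a hypothetical learning agent that records the input-output behaviour of another agent together with an error signal and derives a simple correction. The averaging ‘learner’ is a deliberately trivial stand-in for the learning algorithms described in Chapters 4 and 5.

    class LearningAgent:
        """Hypothetical learning agent: observes (input, output, error) triples
        produced by another agent and suggests a correction."""
        def __init__(self):
            self.examples = []

        def observe(self, agent_input, agent_output, outcome_error):
            self.examples.append((agent_input, agent_output, outcome_error))

        def suggest_correction(self):
            # Trivial stand-in for a learner: average the observed error.
            if not self.examples:
                return 0.0
            return sum(error for _, _, error in self.examples) / len(self.examples)

    learner = LearningAgent()
    learner.observe(agent_input=0.4, agent_output=0.9, outcome_error=0.1)
    learner.observe(agent_input=0.6, agent_output=1.1, outcome_error=0.3)
    print(learner.suggest_correction())   # bias used to adjust the observed agent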

1.5 Experimental Domain

1.5.1 Autonomous Robot Rescue

The first test domain for the proposed system is provided by robot Urban Search and Rescue (USAR). As discussed in the introduction, the rescue domain is motivated by the need to develop working autonomous robot rescue systems to limit the danger to which rescue and repair workers are exposed during or following disasters. This task requires exploration within a complex environment, where the system must pursue a multi-objective search while faced with dynamic and unpredictable obstacles. Widely accepted standards and metrics for USAR have been developed for the international RoboCup Rescue competition, and we perform our experiments in a test area closely modelled on the competition arena. Experiments were conducted in a high-fidelity, 3D, physics-based simulation written for this purpose using the open source jMonkeyEngine3 game engine (Vainer, 2013). The simulator is introduced in more detail in Section 1.5.3. Experiments are performed in simulation, rather than on a real robot, so that many repeated experiments can be conducted to obtain statistically significant results. The simulated robot closely matches a real rescue robot called Emu (Figure 1.4, left panels), which has a differential drive and a suite of sensors. The simulated robot can be seen exploring a rescue environment in the right panel of Figure 1.4.

Figure 1.4: The Emu autonomous rescue robot. Top Left: Real Emu during its best-in-class Autonomy run at the 2011 RoboCup Rescue Competition. Image Credit: Matthew McGill. Right: Simulated Emu used in autonomous rescue experiments (Chapter 4) with parallel-jaw gripper. Bottom Left: Real Emu about to drive over difficult ‘stepfield’ terrain using a learned drive controller. Image credit: (Sheh et al., 2011).

1.5.2 Complex Manipulation and Tool-use

As the majority of people now live in urban environments, disaster rescue generally occurs in a setting filled with functional devices designed for use by humans. Examples of such objects are hammers, valves, door handles, or light switches.


Figure 1.5: Simulated Emu robot with JACO arm mounted at the front. This is the robot used for complex manipulation experiments (Chapter 5).

The vast majority of useful actions that a robot may take in this type of setting involve the use of a robot arm capable of complex motion and manipulation. Accordingly, the second problem domain investigated is that of manipulation and tool-use tasks in a rescue context. To operate in this domain, the system introduced in Chapter 5 requires a robot arm mounted on a mobile robot base (Figure 1.5).

1.5.3 Simulation Environment: jmeSim

As mentioned in the previous section, a significant challenge faced when applying machine learning to autonomous robotics is that the repeated trials necessary for the robot to refine its learned models place strain on robotic actuators. This can wear out motors, causing them to fail. To avoid this, it is often necessary to perform experiments within a high-fidelity computer simulation of both the robot

and the task at hand, if necessary simulating the outputs from the robot’s sensors, such as cameras and laser scanners. Recently, the free and open source Robot Operating System (ROS) (Quigley et al., 2009) has been widely adopted by the robotics community (Kamei et al., 2012). ROS is a middleware framework that provides message-passing communication between distributed processing nodes and aids in the development of robotic software systems. Integration with the ROS framework allows access to a large codebase of robotic software for perception, most notably the Point Cloud Library (PCL) (Rusu and Cousins, 2011b), as well as for navigation, reasoning, and many other aspects of robotics. ROS is described further in Chapters 3 and 4, and details of its integration with jmeSim are given in Section A.3. In light of this, it is desirable for robot simulation software to integrate with ROS. The simulation package Gazebo (Koenig and Howard, 2004) is distributed with ROS; however, at the time the research reported in this thesis was being conducted, this package was limited to the Linux operating system5. An ideal simulation package would not only be easily integrated into the ROS framework, but would in addition be multi-platform, high fidelity in terms of graphics and physics rendering, and open source. High graphical fidelity in a robot simulation is not merely a matter of aesthetics. In robot simulation, the sensory input to the robot’s perceptual algorithms comes from whichever virtual sensors are provided by the simulation. For example, virtual cameras use the rendering engine of the simulator to provide their images. If the simulated camera image bears little resemblance to a real camera feed, robotic vision algorithms cannot be tested in the simulation framework, rendering it incomplete. Similarly, high-fidelity physics is essential, particularly in machine learning applications, where the robot refines a policy based on the response of the simulated world to its actions. If the physics is not sufficiently realistic, there is little chance of transferring such a policy to a real robot. To provide a simulation environment fulfilling these criteria for the validation

5Gazebo is now multi-platform as of the time of writing.

of machine learning systems developed in this thesis, the jmeSim simulation environment was constructed, as no other suitable software package was available. A comprehensive comparison6 between jmeSim and other simulation packages is given by Haber et al. (2012). jmeSim is a high-quality simulation of an autonomous robot, and a rescue scenario based upon the RoboCup Rescue competition arena, which is an unstructured environment designed to simulate a disaster site. The jmeSim visualisation of ‘Emu’, the autonomous robot used in our experiments, is shown in Figure 1.4 along with its physical counterpart. The modified simulation of Emu, with its JACO robot arm used for complex manipulation, is shown in Figure 1.5. Further details on jmeSim are given in Appendix A.
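For readers unfamiliar with ROS, the sketch below shows the style of node that a ROS-integrated simulator supports: a rospy node that subscribes to a simulated laser scan and publishes velocity commands. The topic names are assumptions made for illustration only; the topics actually exposed by the jmeSim integration are a matter of Appendix A.

    #!/usr/bin/env python
    # Minimal rospy node sketch: read a (simulated) laser scan, publish a
    # velocity command.  Topic names here are illustrative assumptions.
    import rospy
    from sensor_msgs.msg import LaserScan
    from geometry_msgs.msg import Twist

    def on_scan(scan):
        cmd = Twist()
        # Drive forward only if the closest obstacle is more than one metre away.
        cmd.linear.x = 0.3 if min(scan.ranges) > 1.0 else 0.0
        cmd_pub.publish(cmd)

    rospy.init_node('simple_driver')
    cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
    rospy.Subscriber('/scan', LaserScan, on_scan)
    rospy.spin()

Because the node only sees ROS messages, the same code can be run against a simulated or a real robot, which is part of what motivates integrating a simulator with ROS.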

1.6 Principal Contributions

The main contributions of this thesis can be summarised as follows:

• The design and implementation of MALA, a novel control architecture for autonomous robots. MALA is based on the idea of a collection of individual, concurrent software components called agents that encapsulate specialised representations and algorithms, and a central blackboard that facilitates inter-agent communication.

• Demonstration of a mechanism to integrate a number of distinct learning modules into a robot control system through the use of the MALA framework. In particular, the integration of task planning and control-level learning systems within MALA enabled an autonomous robot to simultaneously learn how to drive over rough terrain and recognise when its ability to learn was limited by extremely challenging terrain. Robot learning occurred not in an isolated setting, but embedded in the context of a relatively long-term

6Although jmeSim provided a significant advance over available simulation packages when it was developed, Gazebo has since undergone a great deal of development and subsequently superseded its capabilities. See http://gazebosim.org for further details.


autonomous mission, and showed that performance of difficult tasks was improved significantly by embedding multiple learning modules within the integrated architecture.

• A hybrid learning algorithm for simultaneously learning complex manipulation actions at control and task levels, and refining the task-level description using an Inductive Logic Programming (ILP) algorithm. This system was able to successfully learn to solve manipulation problems such as opening doors and using objects as hammers.

1.7 Overview

The remainder of this thesis is organised as follows:

Chapter 2 presents a discussion of previous research relevant to the construction of architectures for intelligent robots, both with and without embedded learning capabilities, to describe how this research differs from earlier work and to further motivate the chosen approach.

Chapter 3 introduces the novel software architecture MALA, and explains the motivation and functionality behind its design principles.

Chapter 4 describes the first implementation of learning within MALA. This is a two-level learning system which was used to incrementally learn a policy to drive over difficult terrain during the course of a rescue mission, and simultaneously to learn to avoid terrain types that are too difficult to be attempted at all.

Chapter 5 gives details of task-level learning within MALA, and describes its application to solving difficult manipulation problems involving both task and motion planning.

Finally, Chapter 6 summarises the main lessons of this research and presents suggestions for future work.

Chapter 2

Background

2.1 Chapter Overview

This chapter provides a discussion of relevant background and includes comparisons to related research. Since this thesis presents work that draws from a wide range of fields, the discussion of related work is correspondingly broad. Multi Agent Learning Architecture (MALA) is a cognitive architecture for learning robots. As such, the discussion begins by describing the functions provided by cognitive architectures (Section 2.2), and several exemplar architectures (Section 2.3). Cognitive architectures based on a component model are described in Section 2.4, and Section 2.5 introduces robot architectures and considers their relationship to cognitive architectures. As MALA is based on a ‘blackboard’ architecture, Section 2.4.1 gives a brief summary of blackboard architectures. An overview of relevant results in robot learning is then provided in Section 2.6, with a particular emphasis on active learning, learning by demonstration, and relational learning. These are the types of learning that are most closely related to the learning systems implemented within MALA in Chapters 4 and particularly Chapter 5. In Section 2.7 the concept of hybrid learning is introduced in detail, and relevant research is discussed. Finally, Urban Search and Rescue (USAR), the problem domain that motivates the experiments and research in this thesis, is described in Section 2.8.


2.2 Cognitive Architectures

A cognitive architecture specifies the underlying structure for an intelligent system: those aspects of cognition that are relatively constant over time and across different problem domains. To illustrate this concept with an example, the algorithms and data structures that specify how an agent’s short term memory is categorised, stored, and accessed are part of the cognitive architecture. In contrast, the contents of those memories are not part of the architecture, since the memories of the agent constantly change as new memories are formed.

2.2.1 Functions and Capabilities of Cognitive Architectures

Clearly, the capabilities of a complete intelligent system, such as a human, are many. Cognitive architectures must therefore possess a daunting set of functionalities in order to support the full range of intelligent behaviour. The following sections describe the most fundamental of these abilities.

Perception, Categorisation and Recognition

Any intelligent agent must have the ability to sense objects and changes in its environment in order to act appropriately to achieve its goals. Sensory data must be linked to the knowledge that the agent has about the world. The converse of this process links the agent’s knowledge about the world to the objects, patterns, and processes which it observes, and leads to the recognition of these phenomena as instances of previously encountered stimuli. Cognitive architectures must include processes that detect matches between observations and memory (or possibly the degree to which a match is found), together with further processes that learn new memories and, importantly, new categories that generalise observations of the world.


Reasoning and Belief Maintenance

Reasoning is the process by which agents recombine their knowledge to obtain new pieces of potentially more useful information (Russell and Norvig, 2010). It is closely related to problem solving. An example of this process might be when a fighter pilot observes a radar contact change course to intercept his own aircraft, and concludes that it is probably an enemy. In a more everyday scenario, when having invited a friend for dinner, a person hears a knock at the door and infers that his guest has arrived. To reason in this manner, a cognitive architecture must represent relationships between known objects and processes. There is a variety of means to approach this problem; the most common are first-order logic, Bayesian networks, and semantic nets (Albus, 2005; Langley et al., 2009). An agent must also continuously update its representation of the current state of the world. This process of maintaining a consistent set of beliefs in a changing world is called belief maintenance. Selection of which facts to discard and which to update may also require reasoning. Architectures typically make use of dependency structures to track the evolution of related facts and propagate belief changes through the agent’s knowledge base (Langley et al., 2009).
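The dinner-guest example can be cast as a tiny forward-chaining inference, sketched below in Python. The rules and facts are invented for illustration; a real architecture would use one of the representations mentioned above, together with a dependency structure so that conclusions can be retracted when the beliefs that support them change.

    # Toy forward-chaining inference over the dinner-guest example.
    rules = [
        ({"expecting_guest", "knock_at_door"}, "guest_has_arrived"),
        ({"guest_has_arrived"}, "open_the_door"),
    ]

    beliefs = {"expecting_guest", "knock_at_door"}

    changed = True
    while changed:                      # chain forward until a fixed point
        changed = False
        for conditions, conclusion in rules:
            if conditions <= beliefs and conclusion not in beliefs:
                beliefs.add(conclusion)
                changed = True

    print(beliefs)   # now includes guest_has_arrived and open_the_door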

Planning, Decision Making and Problem Solving

Intelligent behaviour is often characterised as goal-directed (Newell, 1990). In attempting to achieve a goal, an agent will be faced with the choice of which action to take in order to make the best possible progress. To make such decisions, a cognitive architecture must represent actions and engage in some process to select from a set of possible actions. Actions may be physical procedures that have effects in the environment, or internal cognitive processes that change the agent’s internal state. It is typical for most architectures to separate this selection into two steps. The first determines whether an action is possible or legal, depending on context, and the second determines whether the action is optimal. Optimality of actions is often assessed through the calculation of a numerical score known as a utility (Anderson, 2007).
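The two-step selection process can be sketched as follows; the actions, probabilities, and costs are invented for illustration. Legality filtering comes first, and a simple utility (expected value weighted by the probability of success, minus expected cost) decides among the remaining candidates.

    # Two-step action selection: filter to legal actions, then choose the one
    # with the highest utility.
    actions = [
        {"name": "grasp_cup", "legal": True,  "value": 5.0, "p_success": 0.9, "cost": 1.0},
        {"name": "push_cup",  "legal": True,  "value": 5.0, "p_success": 0.5, "cost": 0.5},
        {"name": "throw_cup", "legal": False, "value": 5.0, "p_success": 0.2, "cost": 0.1},
    ]

    def utility(action):
        return action["value"] * action["p_success"] - action["cost"]

    legal = [a for a in actions if a["legal"]]
    best = max(legal, key=utility)
    print(best["name"], utility(best))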


To solve problems within its environment, an intelligent agent must first understand the state of the world, which involves using the recognition and perception functions described above. If the current state does not satisfy the agent’s goal, the agent must make a plan of action to try to achieve it. The generation of such plans may be a complex process, and may require search of a problem space, the generation of a hierarchy of subgoals, learning, anticipation, and prediction of the environment.

2.3 Notable Architectures

There are many cognitive architectures in the literature. Rather than attempt to provide an exhaustive discussion, this section describes several exemplar architectures in some detail. Each system discussed is representative of a broad category of architecture. The discussion begins with perhaps the two most prominent cognitive architectures, Soar and Adaptive Control of Thought – Rational (ACT-R). These are members of a class of architectures that are concerned with modelling and producing high-level cognition. Langley (2012) defines this as the capacity to engage in abstract thought that goes beyond immediate perceptions and actions. The history and construction of these architectures is described, with a view to explaining how these systems address the functions identified above as essential for cognitive architectures. ICARUS is an architecture more focused on providing control for a physical agent. Finally, an architecture known as Global Workspace is discussed. This architecture is intended to be more faithful to the structures and processes involved in cognition in the human brain. It is representative of the class of ‘connectionist’ architectures that attempt to construct cognitive systems through the collective activity of many parallel processing entities modelled on the neurons of the human nervous system.


2.3.1 Soar

Soar (Laird, 2012) has been particularly influential in research on cognitive systems. It is based upon the idea of a production system, a knowledge representation paradigm built around a set of rules called productions (Klahr et al., 1987). Each rule is composed of a precondition and an action, whereby if the precondition is activated then the rule ‘fires’ and the action may be executed. Soar emphasises the role of a shared workspace or memory and serial models of processing. The Soar cognitive architecture has been under continuous development since the 1980s (Laird et al., 1987; Langley et al., 2009), and was heavily influenced by Newell’s (1990) Unified Theories of Cognition.

Long-term knowledge in Soar is organised in the form of production rules. This database is further organised in terms of operators and problem spaces. An operator comprises one or more production rules, while a problem space is the collection of different states that can be reached by the system, at a particular time, by applying operators. Following the seminal General Problem Solver (Newell et al., 1959), Soar treats all intelligent behaviour as problem solving, that is, iteratively selecting and applying operators to transform an initial state into a goal state. Soar determines its next action by deciding which of the operators associated with the current problem space will have the effect most beneficial to achieving the goal at hand. Operators are successively selected in order to achieve the current goal. When knowledge is insufficient to determine the correct operator to apply, a new subgoal, to determine the correct operator, is generated in a process known as ‘backward chaining’. Soar uses a variety of methods to implement subgoal creation such as hill climbing, means-ends analysis and heuristic search.

Soar includes multiple mechanisms to enable it to learn as it attempts to achieve a goal, refining both its procedural and semantic knowledge. Chunking is a process by which the Soar architecture learns from the results of its problem solving behaviour. Soar examines the traces of previously achieved goals and subgoals. If any particular set of operators successfully produced a useful result, it creates a new operator (chunk) that combines the set into a single rule. This process is closely related to Explanation-Based Learning (EBL) (Mitchell et al., 1986). Episodic learning in Soar stores the contents of short term memory in snapshots, while semantic learning stores individual elements of working memory as facts about the world. A recent version of Soar termed Soar-RL incorporates Reinforcement Learning (RL) (Nason and Laird, 2005). RL is used to adjust numeric values associated with rules that govern the selection of operators for optimal behaviour. Soar has been adapted to develop AI for pilots in air combat simulations. Its most dramatic use was within an extended military training exercise that ran for 48 continuous hours (Jones et al., 1999). Soar has also been used to produce the behaviour for agents in commercial video games such as Quake II (Laird, 2001), and for the control of mobile robots (Hanford et al., 2009). Soar has also been used to model aspects of human cognition including emotions (Gratch and Marsella, 2009), group psychology (Swartout et al., 2006) and natural language dialogue (Gratch and Marsella, 2001).
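A highly simplified sketch of operator application and chunking is given below. It is not Soar code: real chunking builds a new production from the dependency trace of the processing within a subgoal, whereas here a successful operator sequence is simply cached as a macro-operator that bypasses search the next time the same goal arises. The operators and goal are invented.

    # Hypothetical sketch of operators and chunking.
    operators = {
        "stack(A,B)": lambda state: state | {"on(A,B)"},
        "stack(B,C)": lambda state: state | {"on(B,C)"},
    }
    chunks = {}

    def solve(goal, state):
        if goal in chunks:                       # reuse a learned chunk
            return chunks[goal](state), "solved by chunk"
        trace = []                               # stand-in for problem-space search
        for name, operator in operators.items():
            state = operator(state)
            trace.append(name)
            if goal <= state:
                break
        def macro(s, _trace=tuple(trace)):       # compile the trace into one operator
            for step in _trace:
                s = operators[step](s)
            return s
        chunks[goal] = macro
        return state, trace

    goal = frozenset({"on(A,B)", "on(B,C)"})
    print(solve(goal, set()))    # first call: search, and a chunk is learned
    print(solve(goal, set()))    # second call: the chunk is applied directly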

2.3.2 ACT-R

ACT-R is the latest in a series of cognitive architectures primarily concerned with modelling human behaviour. It has been under continuous development, and the subject of many studies, since the 1970s. Like Soar, ACT-R is based on production rules that dictate the actions performed by the cognitive architecture. The architecture was originally intended as a theory of human brain function more than a framework for building intelligent systems (Anderson, 2007). ACT-R is organised into a set of modules, each of which processes a different type of information, i.e. visual, intentional (goals), motor (action) and others. Each module has an associated buffer that serves as short term memory. Knowledge in ACT-R is divided into declarative and procedural. Declarative knowledge is explicitly represented as facts about the world, while procedural knowledge describes processes of action execution. This division is based on models of the character of human knowledge from the field of cognitive psychology (Anderson, 2007). Declarative knowledge is represented within ACT-R’s buffers as vectors of properties, each of which is called a ‘chunk’. Procedural knowledge is represented by a set of production rules. Computation in ACT-R occurs in cycles. At each cycle, the state of the ACT-R buffers is compared with the set of productions by the pattern matcher, and rules are executed based on the result of the comparison. The ACT-R system stores its short term memory in the buffers, but long term memory in the set of production rules. In every computational cycle, a subset of the production rules will have preconditions which are satisfied. As the execution of some actions may conflict, ACT-R must decide which is the most appropriate action to execute, according to its utility. Utility of productions is based on expected benefit (goal desirability × probability of success) weighed against the expected cost of the action. Learning in ACT-R attempts to improve the performance of action selection by updating cost and success probabilities based on experience. Productions that are executed frequently gain utility, as the system becomes ‘familiar’ with a new action series. The architecture can also learn entirely new rules through a generalisation process, analysing multiple concurrent firings, and replacing constants with variables. This process constructs a new rule with more general conditions and actions (Taatgen, 2005). ACT-R has generated a significant body of research, studying a variety of topics including memory (Lovett et al., 1999), attention (Sohn et al., 2000), skill acquisition (Taatgen, 2005), and decision making (Fincham et al., 2002). Recent work has related ACT-R modules to brain areas, and models have been developed to match data from brain imaging studies (Anderson et al., 2005; Anderson and Fincham, 2013; Anderson and Borst, 2014).
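This utility computation can be sketched as follows, assuming the commonly cited form of ACT-R utility: expected benefit (success probability P multiplied by the value G of the goal) weighed against expected cost C, with P and C estimated from experience. The production names and numbers are invented.

    G = 10.0   # value of the current goal

    class Production:
        def __init__(self, name):
            self.name = name
            self.successes, self.failures, self.total_cost = 1, 1, 1.0

        def P(self):
            return self.successes / (self.successes + self.failures)

        def C(self):
            return self.total_cost / (self.successes + self.failures)

        def utility(self):
            return self.P() * G - self.C()

        def record(self, succeeded, cost):
            if succeeded:
                self.successes += 1
            else:
                self.failures += 1
            self.total_cost += cost

    rules = [Production("reach-directly"), Production("reposition-then-reach")]
    rules[0].record(succeeded=True, cost=0.5)
    rules[1].record(succeeded=False, cost=2.0)
    chosen = max(rules, key=lambda rule: rule.utility())
    print(chosen.name, chosen.utility())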

2.3.3 ICARUS

ICARUS is a more recent cognitive architecture that, like Soar and ACT-R, commits to a single representation7 for all its knowledge (Langley and Choi, 2006). However, unlike Soar and ACT-R, whose focus is on high level, abstract cognition, ICARUS

7Recent versions of Soar such as (Nason and Laird, 2005) abandon this commitment.

is explicitly constructed as an architecture for simulated agents. Accordingly, ICARUS supports reactive execution, meaning that actions may be tightly coupled to sensors to provide responsive feedback. Within ICARUS, knowledge is subdivided into concepts and skills, both represented in a formalism related to first order logic but having the capacity for reactive execution, known as teleo-reactive logic programs. Like the teleo-reactive programs introduced by Nilsson (1994), teleo-reactive logic programs in ICARUS enable execution to proceed in a goal-directed manner while maintaining reactive response to percepts. Bottom-up, perceptually driven inference constructs a hierarchical long term memory, and reactive skills are executed in a top-down, goal driven manner. ICARUS incorporates learning new skills through a combination of means-ends analysis and a form of EBL. This process compiles the results of problem solving into new skills, which allows the architecture to solve a previously unknown problem in a reactive manner the next time it occurs. As in Soar, learning in ICARUS is analytical in nature, in the sense that it takes advantage of domain knowledge, in this case the definitions of existing concepts and skills, to constrain learning.

2.3.4 Global Workspace

One criticism of cognitive architectures such as ACT-R and Soar is that their representation of knowledge and overall structure bears little resemblance to what is known about the human brain. It has been claimed that the inherently local representation of information, generally serial nature of processing, and accessibility of information to human programmers within these architectures both divorce their operation from that of the brain and limit their potential for intelligent behaviour (O’Reilly, 1998; Vernon et al., 2007). To counter these criticisms, the connectionist paradigm for architecture design attempts to produce architectures with more biological realism. An example of this type of design philosophy is the Global Workspace cognitive architecture proposed by Shanahan (2006; 2008).


Global Workspace includes parallelism at a fundamental level by using a topologically organised network of artificial neurons. Each neuron is a Generalised Random Access Memory (G-RAM) (Braga, 1994). A G-RAM takes many boolean inputs, and produces a boolean output. The architecture consists of a group of specialist modules (composed of collections of G-RAMs), which communicate indirectly, through a central module called the global workspace. This arrangement is designed to simulate conscious and unconscious processing of information. That is, the architecture is only ‘conscious’ of the processing that appears on the central workspace, but not of the details of implementation within the specialist modules. This is intended to model theoretical work by Baars (1986; 2003) on explaining consciousness within the brain. The organisation of the neurons within the network of this architecture is modelled on structures in the brain, such as the sensory cortex and basal ganglia, involved in perception and decision making respectively. The architecture controls agent behaviour using two sensory-motor loops, one of which is directly coupled to the environment. The direct coupling of this loop enables the system to produce rapid reflex-like reactions to sensory input. The second, higher-order loop selects the actions of the lower-order loop, preventing reflexive action if necessary (similar to the function of the amygdala and basal ganglia in the brain (Kandel et al., 2013)). The higher-order loop is able to traverse a simulated trajectory on the basis of inputs, which allows it to predict or ‘imagine’ the effects of actions, and plan accordingly. The system has been implemented on several simulated and real robots and has been able to perform a number of tasks including vision-based packing of irregular shapes (Shanahan, 2007), control of a robotic arm (Bouganis and Shanahan, 2010), and object detection (Shanahan and Bouganis, 2008).

Global Workspace is representative of most connectionist approaches to cognition in that it attempts to remain faithful to what is known about brain function. This produces a form of architecture in which processing is parallel, distributed, and generally non-symbolic. Although this approach is capable of some degree of intelligent behaviour or robot control, such connectionist approaches cannot generate or simulate the range of behaviour that has been achieved by symbolic architectures such as Soar and ACT-R (Vernon et al., 2007; Langley et al., 2009). More recently, architectures that combine symbolic and connectionist approaches have begun to appear; these are generally known as hybrid architectures8. Some examples of such architectures are Decaf (Graham et al., 2003), Sal (Jilk et al., 2008), CLARION (Sun, 2005), Cognitive Map Architecture (Ng-Thow-Hing et al., 2009), and RETSINA (Sycara et al., 2003).

2.3.5 Open Issues in Cognitive Architectures

The majority of cognitive architectures, in particular Soar and ACT-R, emphasise the achievement of goals or correct behaviour, but often have a limited ability to perceive, categorise, and understand their environment (Langley et al., 2009). The inclusion of more powerful perceptual machinery represents an ongoing challenge for cognitive architectures research. It is worth reiterating that when architectures are required to control physical systems, the necessity of handling perception becomes of paramount importance. The same is true for generating motor commands such as motion plans. During the development of a cognitive architecture, researchers must select from the arsenal of computational techniques available to solve various problems. Many architectures use representations like formal logic for representing reasoning processes (Langley and Choi, 2006; Ferrein and Lakemeyer, 2008), Bayesian Networks for representing probabilistic knowledge (Cassimatis et al., 2004; Vernon et al., 2007; McGee and Abraham, 2010), and semantic nets or graphs for encoding relational declarative knowledge (Sycara et al., 2003; Swartout et al., 2006). This delegation of particular functions to such techniques is often because, at the time of development, each technique is considered to be the most robust or accurate for a particular task. In Chapter 1, it was argued that this specialisation is fundamental to cognitive processes, particularly when the system is embodied within a robot.

8This class of symbolic-connectionist hybrid cognitive architecture is distinct from hybrid robot architectures (Section 2.5.3) that combine deliberative and reactive components.


As such, a crucial issue in developing more flexible cognitive architectures is the ability to integrate a variety of data structures or algorithms into hybrid cognitive processes (Anderson et al., 2004; Gray, 2007) which make use of this type of multiplicity of techniques, just as the human mind is thought to do (Kandel et al., 2013). A natural step toward integrating ensembles of algorithmic processes is to construct an architecture that has many independent, modular components.

2.4 Multi-agent Cognitive Systems

Typically, the types of tasks that are routinely solved by humans require the solution of many subtasks. For example, making a cup of coffee requires grinding the beans, obtaining and heating the milk, brewing the coffee, and then mixing them and adding sugar as desired. This example reveals that most tasks are really composed of many subtasks. As the overall task becomes more complex, the subtasks increase in complexity accordingly, to the point where solving each one requires considerable expertise. Section 1.3.1 argued that the ‘intelligent’ properties of the human mind may arise not as a result of a single master algorithm, but rather out of complex interactions between a group of computational modules. Interestingly, this modular character is paralleled in attempts to build intelligent machines. Initial attempts to build a machine that could solve problems requiring human level intelligence revealed that there is a vast number of specialised skills required to solve even ‘simple’ tasks like building towers of blocks (Singh and Sycara, 2004). This resulted in the creation of a multitude of specialised subfields of AI research to address all these various computational problems individually (Russell and Norvig, 2010). Marvin Minsky has repeatedly suggested that this sub-specialisation is a natural consequence of the character of intelligence:

What magical trick makes us intelligent? The trick is that there is no trick. The power of intelligence stems from our vast diversity, not from any single, perfect principle. (Minsky, 1988)


A natural method of incorporating such a diverse set of processes or algorithms within an intelligent system is to make the system modular, and encapsulate each process in a discrete processing unit. An intelligent system may then be constructed using a collection of such processes, forming a multi-agent system (MAS). Initial research into what is now termed multi-agent systems was in the field of Distributed Artificial Intelligence (DAI). The first DAI workshop was held at Massachusetts Institute of Technology (MIT) in 1980. The focus of DAI was the issue of how intelligent problem solvers could coordinate effectively to solve problems (Jennings and Sycara, 1998). Specifying the conditions under which agents interact to produce the correct overall behaviour continued to be a challenge. One approach to solving this problem led to the development of blackboard architectures.

2.4.1 Blackboard Architectures

A blackboard architecture (Hayes-Roth, 1985; Craig, 1988) is a structure that enables communication and information-sharing between a collection of agents. The blackboard architecture is based on a metaphor of a group of specialists gathered around a blackboard inside a conference room. Each specialist, although highly expert within their own field, is unable to decipher what most of the others are talking about, and patiently waits for something understandable to appear on the blackboard. In the event that this happens, the specialist is able to contribute useful knowledge to the group’s solution to a problem, by writing on the blackboard. This new information may, in turn, trigger a second specialist to add to the blackboard. In this way, the group of experts is able to communicate and utilise knowledge from multiple sources.

Hearsay-II

An early system incorporating a blackboard architecture was the Hearsay-II speech recognition system (Erman et al., 1980). This incorporated a number of independent components, each of which was called a Knowledge Source (KS). Each KS ran different algorithms for signal processing on audio signals. The authors note that:

The necessity for diverse KSs derives from the diversity of transformations used by the speaker in creating the acoustic signal and the corresponding inverse transformations needed by the listener for interpreting it.

This insight prompted the Hearsay-II authors to develop a multiplicity of processing entities to solve the problem. As such, Hearsay-II was an early demonstration of a principle that has proven far-reaching in AI research: that for a majority of complex problems, a combination of several strategies often copes better than a single one (Minsky, 1988; Minsky, 2006). The application of each KS is dictated by a production rule, which specifies the preconditions for the particular transformation that the KS can apply. The collective application of the set of KSs produces hypotheses about language at multiple levels, that is, phrases, words, and syllables. The satisfaction of preconditions of KSs is driven by data on the blackboard at all levels, leading to recursive splitting of high level structures, such as sentences and words, into smaller structures in a top-down manner, and also bottom-up, to classify acoustic segments into phonetic categories and so on up the hierarchy. This combination of bottom-up and top-down methods greatly reduces the search space of possible linguistic meanings (Russell and Norvig, 2010). These hypotheses are then published on the blackboard and may satisfy the conditions for other KSs, leading to further processing. This type of data-directed control eliminates the need to explicitly call KSs to perform their tasks, since they simply wait for appropriate data to appear on the blackboard, and then begin processing it. Clearly, a crucial concern is whether the successive invocation of KSs leads to the correct interpretation, rather than a combinatorial explosion of possible KS applications. Hearsay-II implements a selective attention mechanism, which tries to ensure that only the most promising actions are executed, rather than blindly executing all KSs that are invoked. This is achieved using a heuristic scheduler that calculates a priority for each action and executes, at each time, the waiting action with the highest priority, much like the utility calculation implemented in ACT-R (Anderson, 2007). With this system in place, Hearsay-II proved to be a successful speech recognition system (Lesser et al., 1975). Hearsay-II was highly influential due to its performance in the very challenging domain of speech recognition, and many subsequent blackboard systems adopted aspects of its design. Two such implementations were the Distributed Vehicle Monitoring Testbed (DVMT) and the Multiple Intelligent Node Document Servers (MINDS) project. DVMT (Lesser and Corkill, 1983) constructed a problem solver using a set of agents distributed throughout a spatial region. The task of the agents was to use audio sensing to monitor and map vehicle movements. The MINDS project (Huhns et al., 1987) is a distributed information retrieval system in which agents cooperate to locate a requested document from a large database.
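Returning to the scheduling mechanism, the idea can be sketched as a simple priority queue of pending knowledge-source activations, as below. The knowledge sources, hypotheses, and ratings are invented stand-ins for the heuristic ratings that Hearsay-II computed from the state of the blackboard.

    import heapq

    # Data-directed control: KS activations are triggered by blackboard data,
    # but only the highest-priority pending activation is executed at each step.
    pending = []   # heap of (negated rating, knowledge source, hypothesis)

    def trigger(ks_name, hypothesis, rating):
        heapq.heappush(pending, (-rating, ks_name, hypothesis))

    trigger("syllable->word", "phone sequence /t eh n/", rating=0.8)
    trigger("word->phrase",   "word 'ten'",              rating=0.6)
    trigger("segment->phone", "acoustic segment #41",    rating=0.9)

    while pending:
        negated_rating, ks, hypothesis = heapq.heappop(pending)
        print(f"run {ks} on {hypothesis} (rating {-negated_rating})")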

MICA

Multimodal Interagent Communication Architecture (MICA) is a more recent blackboard implementation described by Kadous and Sammut (2004b), designed for pervasive computing. Because of the integrative nature of blackboard processing, blackboards provide a particularly attractive method for integrating information from multiple sensory modalities. MICA allows the sharing of data between multiple agents, in particular the use of multiple input and output modalities, such as speech, gesture, etc. Each sensory modality is represented by a knowledge source, as in the Hearsay-II implementation. Much of the blackboard implementation used to construct the MALA architecture is based upon the MICA environment. This relationship is described in more detail in the next chapter where MALA is introduced.


2.4.2 Society of Mind

The central principle of a multi-agent system is that intelligent behaviour may be achieved through the collective capabilities of an interacting group of simpler entities. In the Society of Mind, Minsky (1988) constructed a comprehensive theory of the human mind with this principle as a guiding framework. Society of Mind attempts to explain the full scope of the human mind in terms of a multitude of specialized agents. As mentioned in Chapter 1, each agent is not an independent autonomous decision maker, like a human, but is rather a component of a computer program, such as a subroutine. Each agent has a specialised function, and typically does not display independent autonomy. The central tenet of Minsky’s theory is the proposal that the power and versatility of the human mind results not from a single cognitive process, but rather from a multitude of specialised skills. This premise is summarised by Minsky (1991):

The functions performed by the brain are the products of the work of thousands of different, specialised sub-systems, the intricate product of hundreds of millions of years of biological evolution. We cannot hope to understand such an organisation by emulating the techniques of those particle physicists who search for the simplest possible unifying conceptions. Constructing a mind is simply a different kind of problem - of how to synthesise organisational systems that can support a large enough diversity of different schemes, yet enable them to work together to exploit one another’s abilities.

In his (1988) proposal of the theory, Minsky addresses the complete range of cognitive abilities, from language, to perception, to reasoning and decision making, and even emotion, in terms of the society of agents. As was stated in Chapter 1, while the concept of the Society of Mind is extremely powerful, the statement of the theory was at the level of an abstract description, rather than a detailed specification, and so direct implementation of Society of Mind has not been possible. Despite this, the theory has been influential in many aspects of AI, particularly in the


field of multi-agent intelligent systems (Singh, 2004). Two architectures directly influenced by the Society of Mind are PolyScheme and EM-One. PolyScheme (Cassimatis, 2001) focuses on using a variety of representations, and combining their advantages for problem solving. Like Minsky, Cassimatis advocates encapsulating each representation in a specialist module. PolyScheme integrates the outputs of the various specialists by enforcing a focus of attention on a single predicate, and combining their respective outputs to produce forward inference and subgoaling. PolyScheme has been embodied in several robotic implementations (Trafton et al., 2005), and represents an active and promising research area for cognitive architecture in robotics. EM-One (Singh, 2005) encapsulates processing in components called critics, organised in a layered configuration, from reactive, deliberative, to reflective. This arrangement allows for complex introspection, as critics in higher layers can debug and activate those below. Singh’s work showed that this type of architecture has the potential to integrate higher level cognitive processes with those that are more reactive, which has made it influential in the study of cognition in robotic systems.

2.4.3 Real-time Control System (RCS)

The final multi-agent intelligent system to be considered is the Real-time Control System (RCS). Unlike the other architectures so far described, RCS is a cognitive architecture designed for the control of intelligent robots. Albus (2005) defines RCS as a ‘multi-layered, multi-resolutional hierarchy of computational agents’. To make this description concrete, consider that intelligent behaviour spans many scales of time and space, from rapid, fine motor control, such as picking up a glass, to long-range, long-term, multi-stage plans, such as taking the glass to a required location. Several early theories about robot software architecture claimed that planning, action, and knowledge must be represented at multiple levels of resolution to produce a range of behaviours (Albus, 1991; Gat, 1997). In RCS, agents are organised in several ‘layers’, each of which operates at a different timescale and level of spatial resolution. High level agents make plans and solve problems in the time range of hours and consider large spatial distances, while the lowest level agents operate at a millisecond level and execute fine motor control on a millimetre scale. RCS has been developed since the late 1970s (Albus, 1996), and has been used in control systems in a variety of settings, such as industrial robots, automated mining equipment, unmanned underwater vehicles (Albus and Meystel, 2001), and unmanned ground vehicles (Albus, 2005). The structure of RCS is composed of a collection of agents, each of which contains a set of five elements of cognitive function, which Albus splits into sensory processing (SP), world modelling (WM), value judgement (VJ), behaviour generation (BG), and knowledge database (KD), following his earlier work (Albus, 1991). As mentioned above, these agents are arranged into a hierarchy, allowing them to form plans and objectives at a range of spatiotemporal scales.

Although there are similarities between the construction of RCS and the Society of Mind, two important differences are immediately obvious. The first is the level of heterogeneity amongst the agents. In Society of Mind agents are specialised, and thus highly varied, whilst RCS agents all share the same sense-plan-execute structure, albeit at different resolutions. The second difference is the rigid hierarchical overall structure linking the agents. The rigid, preset hierarchy of RCS cannot change via learning, while Minsky suggests that agents are connected in a manner that can change as the society gains experience.

The preceding discussion has attempted to describe the many functions required of a complete intelligent system. This multiplicity of function is reflected in the complex integrative nature of the architectures that have so far been discussed. However, with the exception of RCS, discussed in the last section (and ICARUS, to some extent), these architectures are only intended to model high-level cognition, that is, the abstract mental functions of an intelligent system. Control of physical actuators and processing sensory data to perform perception is either ignored or delegated to external subroutines that are not explicitly included in the architecture itself. When intelligent systems are applied to the control of autonomous robots, sensory-motor functions rise to a position of paramount importance. Thus, cognitive architectures omit crucial elements of a physical robot’s programming. The remainder of this chapter is dedicated to a discussion of architectures for intelligent robots, their relationship to the cognitive architectures we have already discussed, and, importantly, their relationship to MALA, the novel system to be presented in the next chapter.

2.5 Robot Software Architectures

Architectures for robots are often classified according to their inclusion of deliberative (planning) components, or their ability to respond quickly in a reactive manner. This section discusses the historical roots and prominent examples of both deliberative and reactive architectures, and then introduces hybrid architectures, which integrate both types of algorithms. Finally, more recent types of robot architectures and their relationship to MALA are discussed.

2.5.1 Shakey and STRIPS

The best known early example of a robotic architecture was developed as part of the Shakey project (Nilsson, 1984). Shakey, an autonomous mobile robot, constructed a symbolic model of its environment, which was used as the basis for planning to achieve goals. Shakey used the Stanford Research Institute Problem Solver (STRIPS) planner to represent actions and achieve goals (Fikes and Nilsson, 1971). In STRIPS, operators called action models are the basic elements from which a plan is built. For robot problems, each operator corresponds to an action whose execution causes the robot to change the state of the world. For example, an action model may cause it to go through a doorway or push a box. An action model consists of a description of the conditions under which it is applicable, and of the results of the action; these are referred to as the preconditions and effects, respectively. Later descriptions of action models in this thesis refer to the Planning Domain Definition Language (PDDL) (McDermott et al., 1998), which evolved from the STRIPS representation, and has become a standard representation for planning tasks (Fox and Long, 2003). Shakey proved an early success in autonomous robotics, as it was able to use its planner to generate plans to solve novel tasks, such as climbing a ramp to reach a goal object. Shakey’s architecture was an example of a deliberative architecture9. Such architectures generate robot behaviour through a process of searching a state space to construct a plan of action.
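A STRIPS-style action model can be written down compactly as a set of preconditions, an add list, and a delete list. The sketch below uses an invented ‘go through doorway’ operator represented as a plain Python structure; the predicate names are illustrative and are not taken from Shakey.

    # A STRIPS-style action model for going through a doorway.
    gothru = {
        "name":   "gothru(robot, door, room_a, room_b)",
        "pre":    {"in(robot, room_a)", "connects(door, room_a, room_b)"},
        "add":    {"in(robot, room_b)"},
        "delete": {"in(robot, room_a)"},
    }

    def applicable(operator, state):
        return operator["pre"] <= state

    def apply_operator(operator, state):
        return (state - operator["delete"]) | operator["add"]

    state = {"in(robot, room_a)", "connects(door, room_a, room_b)"}
    if applicable(gothru, state):
        state = apply_operator(gothru, state)
    print(state)   # the robot is now in room_b

A planner searches for a sequence of such operators whose cumulative effects transform the initial state into one that satisfies the goal.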

2.5.2 Reactive Robots

Brooks (1991) criticised Shakey’s architecture for its slow operation and lack of responsiveness. He argued that robots should not explicitly represent their environment at all, and proposed the Subsumption architecture that he claimed could provide ‘intelligence without representation’. This architecture is composed of a set of layered circuit controllers that tightly couple sensors to actuators, producing highly responsive robots that can perform simple tasks very well. Following Subsumption, many other behaviour-based robotic architectures were proposed, based on interacting reactive controllers with little representation of internal state (Arkin, 1998). Without symbolic processing, behaviour-based robots faced several shortcomings. First, without representing goals and problems, the range of behaviours that such robots could perform was limited. Second, behaviour-based robots must be painstakingly reprogrammed to perform different tasks (Langley et al., 2009). This is in contrast to symbolic planners that can generate behaviour to solve a variety of problems that can be described in their formalism. However, the improved responsiveness of behaviour-based systems (Brooks, 1986) demonstrated the need for tight sensor-actuator control loops to provide reactivity.

9Despite being often cited (e.g. (Arkin, 1998) or (Bekey, 2005)) as an exemplar of deliberative robot software architectures, Shakey also had a reactive execution layer called PLANEX (Fikes et al., 1972) that implemented each STRIPS action. Thus Shakey is actually an early hybrid architecture.


2.5.3 Hybrid Architectures

Hybrid architectures include both reactive and deliberative processing in a layered configuration. One example is the three-layer architecture (Gat, 1997). In this arrangement, processes that operate on a slow timescale populate the highest, deliberative layer, while fast and responsive processes are relegated to the reactive layer. Typically, such architectures have three layers of abstraction, including a sequencing layer that interfaces between deliberative and reactive subsystems. Notable hybrid architectures are TouringMachines (Ferguson, 1992), InteRRaP (Müller and Pischel, 1994), 3T (Bonasso et al., 1997) and Saphira (Konolige and Myers, 1998).

2.5.4 Component-Based Architectures

Two concerns that shape robotic architectures are the need to incorporate specialised processing components, and the need to process data from multiple, high-throughput sensor data streams simultaneously. This enforces the inclusion of multiple, concurrent processing pathways for sensory input. The same applies to the need to provide concurrent motor control of multiple actuators, for example, of a soccer-playing robot walking around the field whilst looking for the ball. A model that simultaneously addresses both of these needs is a highly modular, component-based computation graph of independent computational processes. Albus’s RCS may be seen as a forerunner of such a design paradigm. Recently, this approach has become more popular and has led to a new category of architectures. Architectures that adopt this processing model are referred to as component-based architectures10. As MALA is based on a network of agents communicating via blackboards, it

10It is important to distinguish between such component-based architectures, and component- based middleware, such as ROS. Robot Architectures provide a specification of how information is processed and flows through a robot control system, whereas middleware systems like ROS provide a framework to implement any control architecture and are agnostic to the type of computation performed.

may be considered a component-based architecture. MALA has much in common with these systems; therefore, the following section explores its relationship to the most similar architectures. These are the Cognitive Systems Architecture Schema (CAS) (Hawes and Wyatt, 2010), Distributed Integrated Affect, Reflection, Cognition (DIARC) (Scheutz et al., 2007), Teleo-Reactive Executive (T-REX) (McGann et al., 2008), and most recently, Motor Affective Cognitive Scaffolding for the iCub Robot (MACSi) (Ivaldi et al., 2014).

CAS shares many features with MALA, such as the presence of concurrent processing components that utilise shared working memories for communication. Working memories are similar to blackboards, but enforce a closer coupling between components, as both read and write operations can be performed. This is in contrast to the blackboard architecture, where agents simply receive information when the blackboard relays it, and have no ability to explicitly read the blackboard. While MALA prescribes specialised representations for individual agents, CAS implements specialised ontologies at the level of subarchitectures. Each subarchitecture is a small group of components, all of which perform related functions on a local working memory. CAS adopts a communication paradigm in which each component specifies the particular changes to working memory elements about which it should be notified. This is equivalent to listen requests in MALA, in which an agent specifies the type of blackboard object it requires. More recent iterations of CAS have integrated a novel mechanism for component execution management using planning. This allows a continual planner to treat each subarchitecture as a separate agent in a multi-agent planning problem (Hawes et al., 2009).

DIARC, an early example of modular robotic architectures, incorporates concurrent, autonomous processing modules, each using specialised representations. The perceptual subsystems perform low-level processing and abstraction into relational representations in a manner similar to MALA. However, communication within DIARC is point-to-point, with direct links between computation nodes, implemented using the Agent Development Environment (ADE) middleware (Scheutz, 2006). This is potentially more efficient than communication in MALA, as it relaxes the restriction that communication objects must pass through the blackboard. A consequence of this tighter coupling is that the potential for online reconfiguration is greatly reduced, since components must be explicitly aware of each other’s existence at run time. Further, constructing introspective monitoring processes is made more difficult, as each node-node pair must be monitored independently. To generate behaviour, DIARC relies on an action interpreter that executes action scripts analogous to task-level plans, such as the PDDL plans used in MALA. For each goal that a DIARC agent maintains, a corresponding script interpreter manages the execution of events, analogous to task-level actions, to achieve a goal. However, if a DIARC agent encounters a goal for which it does not have a premade script, it has no mechanism to generate a script by concatenating action primitives.

A third architecture, T-REX (McGann et al., 2008), is designed to integrate automated planning with adaptive execution in the tradition of deliberative-reactive hybrids. T-REX is composed of a set of encapsulated control programs known as teleo-reactors. Although this decomposition lends the architecture a modular character, it does not exhibit the extensive modularity of systems like CAS that are based on networks of computation nodes. In addition, there is more consistency about the internal structure of T-REX reactors, as they adopt a common representation. Unlike heterogeneous architectures such as MALA, CAS, and DIARC, all T-REX reactors run the same core algorithm to synchronise with the rest of the architecture. This reflects the T-REX claim that behaviour should be generated by similar deliberative processes operating on multiple time scales. Communication between nodes is highly structured, consisting of only two types of objects, goals and observations, that are expressed in a constraint-based logical representation. This formalism provides a more expressive framework than the PDDL representation used in MALA, with the ability to reason about temporal resources and conditional effects. This lets T-REX produce effective plans for time-critical missions. Reactors are organised hierarchically according to their time scale in a


Figure 2.1: Summary of the capabilities of the cognitive and robot architectures surveyed in this chapter (ICARUS, ACT-R, Soar, Society of Mind, MALA, MACSi, DIARC, RCS, CAS, Subsumption, Global Workspace, Shakey, T-REX, and 3T), based on the principles outlined in Section 1.3: deliberative reasoning, incremental learning, hybrid learning, specialised modular components, and reactive processing. Green cells indicate that the architecture provides this capability, and red cells indicate that it does not. Broad categories of architectures delineated in earlier sections (component, connectionist, hybrid, reactive, unified) are also shown. Early cognitive architectures discussed in Section 2.3 are referred to as unified, in reference to their focus on unified representations and their relationship to Newell's (1990) Unified Theories of Cognition. Finally, whether or not a system based on each architecture has been successfully demonstrated in either real or simulated robot systems is also indicated.

Reactors are organised hierarchically according to their time scale, in a manner reminiscent of hybrid three-layer architectures (Bonasso et al., 1997). Most recently, the MACSi architecture proposed by Ivaldi et al. (2013; 2014) shares several properties with MALA. MACSi is focused on active learning and exploration.

Implementations of the architecture have demonstrated exploration and discovery of object properties by a small humanoid robot (Nguyen et al., 2013), and learning of motor skills through demonstration (Stulp et al., 2013). Like MALA, MACSi emphasises multi-modality and modularity in its design, and advocates the blending of symbolic representations with motor control. However, the architecture does not include a facility for planning, instead using an ad hoc definition of individual actions that can be learned by demonstration. These actions may be sequenced, but not in novel configurations as in the case of PDDL planners (Ivaldi et al., 2014). A second important difference between MACSi and MALA is that, while both architectures advocate the inclusion of multiple learning modules to enable robots to acquire knowledge, MACSi considers each learning module to be an isolated system. This is in direct contrast to the idea of hybrid learning, in which learning occurs at multiple levels of abstraction that are coupled together and influence each other.

Each of the above architectures has demonstrated successful operation in situated robotic systems: CAS in domestic robotics (Hawes et al., 2012), DIARC (Krause et al., 2012), and T-REX in Autonomous Underwater Vehicle (AUV) exploration and underwater mapping (McGann et al., 2008). Typically, however, the motivation of these architectures lies more in the integration of planning and execution capabilities with robotic platforms, and less in the integration of multiple learning modules, which is the focus of this work. The relationship of all surveyed architectures to MALA is summarised in Figure 2.1. To introduce the learning mechanisms incorporated into MALA, we begin with a review of the relevant forms of machine learning applied to robotics.

2.6 Robot Learning

Autonomous robots are usually manually programmed (Peters et al., 2011). This requires a programmer to anticipate the sensory conditions under which robots should perform particular actions. As the environment and the robot become more complex, this anticipation becomes more difficult.

Even worse, if environmental conditions shift during the robot's operation, the robot's pre-programmed responses may no longer be appropriate. Because many important application domains have this dynamic character, the abilities to adapt and learn are important. As argued in Chapter 1, this point is particularly relevant in long-term deployment settings. To embed a learning capacity, robot learning systems must provide mechanisms to gather training examples from the environment, and to generalise existing knowledge to incorporate this new information. Unlike machine learning programs that are not physically embodied, robots are able to influence their own environment to create training data from which to learn. Thus, for example, robots are capable of trial-and-error learning by performing physical experiments in the world. Common forms of robot learning can be broadly categorised as trial-and-error learning or learning from demonstration. In Learning from Demonstration (LfD), a human trainer provides training examples directly. This demonstration seeds learning with a bias towards an operational policy. Thus, learning from demonstration is, in principle, a faster method of learning than trial-and-error exploration, because the learner is immediately directed to relevant regions of the search space by the trainer's demonstration. Section 2.6.2 provides a discussion of LfD. During trial-and-error learning, a robot has a choice of which experiment to perform. One mechanism for making this choice is to use existing knowledge to select the experiment that the robot expects to be the most informative. This idea is the basis of active learning. Section 2.6.1 describes active learning in more detail and describes its application in robot systems. In addition to possessing a mechanism for gathering training data, a robot learner must also represent knowledge about whatever is being learned. Learning is often framed as constructing a hypothesis that fits the observed data and any prior knowledge held by the learner (Mitchell, 1997). A particularly relevant class of representation for hypotheses and training data is relational, meaning it explicitly represents relationships between variables.

This may be contrasted with attribute-based approaches that represent training data and hypotheses using a vector of attributes, which are often real-valued. Relational approaches are particularly appropriate for integration with symbolic planners. Section 1.3.2 argued that this integration is essential to improve robot autonomy in complex domains; thus relational approaches to robot learning are particularly relevant. Accordingly, Section 2.6.3 reviews results in relational robot learning and provides the necessary background for the learning system presented in Chapter 5, which uses a relational approach to acquire task-level knowledge that can be used to make and execute plans.

2.6.1 Active Learning

Machine learning systems may acquire training data in a number of ways. In a typical data mining setting, the learner is presented with all available training data at the outset of learning and has no opportunity to acquire new data. This paradigm is known as batch learning. An alternative to batch learning is incremental learning, in which training examples become available progressively and the learner continually revises its knowledge. As learning progresses, it may be possible to deduce that certain areas of the search space are likely to provide more informative training data. An active learning system is one that preferentially selects training examples in this manner (Cohn et al., 1994; Settles, 2012). Active learners are able to exploit the knowledge gained from early training data to potentially learn concepts from fewer training examples. Once the training examples are obtained, a variety of machine learning algorithms are capable of performing generalisation. Active learning is particularly appropriate for robots for two principal reasons. First, robots typically require the use of motors, usually electrical, that fatigue, wear, and may eventually 'burn out' over time. Machine learning algorithms typically require a substantial training corpus.

Thus, it is desirable to minimise the number of actions that the robot must take in the world to reduce wear on the robot's components. Since active learners are able to exploit the early results of their learning to select informative training data, an active learner can acquire concepts using fewer actions by the robot. Second, robots are physically embodied and able to move around, enabling them to influence their surroundings and observe the results. This characteristic means that, like human learners, robots are able to learn by experimentation in the physical world. The application of active learning to robotics is relatively recent but, for these reasons, it is becoming more popular (Cakmak et al., 2010).
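The selection step described above can be made concrete with a small example. The following is a minimal, illustrative sketch of pool-based active learning by uncertainty sampling, not taken from the thesis: a classifier is trained on a small labelled seed set and, in each round, queries the unlabelled candidate it is least certain about. The data, the `oracle` stand-in for a physical experiment, and all parameter values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def most_uncertain(model, pool):
    """Index of the unlabelled candidate the classifier is least sure about."""
    probs = model.predict_proba(pool)
    return int(np.argmin(np.abs(probs[:, 1] - 0.5)))   # closest to the decision boundary

def oracle(x):
    """Stand-in for labelling by a physical experiment (purely hypothetical rule)."""
    return int(x.mean() > 0)

rng = np.random.default_rng(0)
# Small labelled seed set (both classes present) and a pool of unlabelled candidates.
X_seed = np.vstack([rng.normal(-1.0, 1.0, (5, 4)), rng.normal(1.0, 1.0, (5, 4))])
y_seed = np.array([0] * 5 + [1] * 5)
X_pool = rng.normal(0.0, 1.0, (200, 4))

model = LogisticRegression().fit(X_seed, y_seed)
for _ in range(20):                        # active learning loop: query, label, retrain
    i = most_uncertain(model, X_pool)
    x, y = X_pool[i], oracle(X_pool[i])
    X_seed, y_seed = np.vstack([X_seed, x]), np.append(y_seed, y)
    X_pool = np.delete(X_pool, i, axis=0)
    model = LogisticRegression().fit(X_seed, y_seed)
```

In this sketch, each query is the example the current model finds most ambiguous, so each physical action is spent where it is expected to be most informative.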

Reinforcement Learning (RL)

RL is a well-known form of active learning. A reinforcement learner performs actions in some problem domain and, as it does so, receives feedback in the form of a reward function. This feedback provides information about the utility of the action in the current context. A reinforcement learner acquires training data as a result of its own actions. Thus, like many active learners, a reinforcement learner must strike a balance between the information gained from an action and the success of the action. That is, even if taking a poor action would be more informative, the learner may avoid it, since it is also attempting to achieve a goal and therefore immediately profitable actions may be preferable. This balance is termed the "exploration-exploitation" trade-off (Sutton and Barto, 1998). RL has proven very successful in robotics research, and a large amount of the robot learning literature is devoted to it. One common problem in RL is the curse of dimensionality. This refers to the fact that RL algorithms may be unable to converge to an operational policy sufficiently quickly when applied to learning problems that have a high-dimensional search space. Thus, a particular goal of the hybrid learning architectures to be introduced in Section 2.7 is to provide constraints on the search space of a reinforcement learner to improve convergence speed. RL in isolation¹¹ is not directly relevant to the work presented in this thesis; it is covered in a recent survey by Kober et al. (2013).

¹¹ RL in combination with Learning from Demonstration is addressed in Section 2.6.2.
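As a concrete illustration of the exploration-exploitation trade-off described above (and not part of the thesis), the following sketch shows an ε-greedy learner on a simple multi-armed bandit: with probability ε it explores a random action, otherwise it exploits its current value estimates. The reward values and ε are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.2, 0.5, 0.8])    # hypothetical expected reward of each action
q = np.zeros(3)                           # current value estimates
counts = np.zeros(3)
epsilon = 0.1                             # fraction of trials spent exploring

for _ in range(1000):
    if rng.random() < epsilon:
        a = int(rng.integers(3))          # explore: possibly a "bad" but informative action
    else:
        a = int(np.argmax(q))             # exploit: the action currently believed best
    reward = rng.normal(true_means[a], 0.1)
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]   # incremental estimate of the action's value

print(np.round(q, 2))   # estimates approach true_means; most trials go to the best action
```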


Active Learning of Robot Actions

A classical problem in robot learning is to learn models of actions. Much of the learning described in later chapters of this thesis falls into this category, and there have been several applications of active learning to this problem. One early implementation, reported by Salganicoff et al. (1996), used active experimentation with different grasps to learn an optimal approach direction for successfully grasping objects. The robot constructed shape approximations of objects and was able to rapidly learn appropriate approach directions to suit their shape. Generalisation over this training set using a version of ID3 decision trees (Quinlan, 1986) allowed the gripper to pick up novel objects. A more recent example of active learning in robotics is given by Kroemer et al. (2009), whose work also concerned robot grasping. In this work, the robot attempted to learn to grasp a ping-pong paddle, a difficult task because of the topological complexity of the shape. Learning was bootstrapped from several good demonstration grasps. The system initially used Gaussian process regression to learn a value function for these model grasps, then iteratively generated and carried out experimental grasps that were optimal with respect to the value function. In this manner, the value function converged to a description of good grasps for the ping-pong paddle with high confidence. The work of Daniel et al. (2014) provides a further example of active learning for grasping. Interestingly, the authors integrate active trial-and-error learning with RL, where the goal of experimentation is to construct the reward function with the aid of occasional feedback from an expert. The active learner employs Bayesian optimisation to iteratively construct a model of the likely reward feedback based on the reward information provided by the expert. The aim of this model is to predict the likely reward for a given action in the absence of the expert, which minimises the need for expert domain knowledge and hand-programming of a reward function. The authors show that the combined RL-active learning framework enabled a physical robot to learn grasping and reaching tasks more efficiently than a comparable system that used a hand-programmed reward function.


The above examples of robot action learning describe the learning of behaviours at a 'control' level of implementation. That is, the robot is learning how to perform the action, not when, or in what context, it should be performed. As the work described in this thesis focuses on integration with symbolic planners capable of sequencing actions, learners that can acquire 'task-level' knowledge are particularly relevant. An interesting example of an active robot learner capable of acquiring relational models of action is the work of Kulick et al. (2013), who used a relational reinforcement learner combined with demonstration from a trainer to acquire relational models of 'pick and place' behaviours. Importantly, the work was not limited to learning with a pre-defined set of relations, as is often the case in Artificial Intelligence (AI) robotics. Each relation was composed of a first-order predicate and a function f describing the probability of the predicate being true in a given state. The function f was learned through Gaussian process classification, enabling the mapping between low-level state information and relations to be learned; this is a task that is usually performed by hand at significant cost in programmer time. The learned relations were then used by a probabilistic planner to solve a table-clearing task using a robot arm. Brown (2009; 2011) also used relational learning for active exploration by a robot, demonstrating that his system was able to acquire models of complex tool-use behaviours, such as learning to use a hook-shaped stick to access objects otherwise out of reach. Brown's work is introduced in more detail in Section 2.6.3. Although not precisely learning of robot behaviours, the Robot Scientist developed by King et al. (2004) was a particularly impressive example of active learning involving a robot. The Robot Scientist began with logical models of scientific data. An active learning algorithm was used to propose maximally informative experiments, which were then executed. An Inductive Logic Programming (ILP) approach was used to generalise the model to incorporate new results, some of which proved to be completely novel discoveries in cell biology.


Active Learning of Object Properties

A natural application for active learning by a robot involves experimenting with objects in its environment to discover their properties, and much research in active learning by robots has been in this area. Although the main target of robot learning in this thesis is how a robot should act, in Chapter 5 the MALA system learns properties of several tools that enable them to be used to solve tasks in a manipulation domain. Thus, this section provides a brief overview of relevant research in the field of active learning to discover object properties. When learning about the properties of objects, a robot learner may wish to interact with an object, or to use it to perform a particular task. The set of actions that can usefully be applied to the object, or the extent to which the object is useful for achieving a task, are referred to as its affordances. A number of authors, such as Stoychev (2007), Ugur et al. (2011), and most recently Hermans et al. (2013), have used goal-free exploration to learn object and tool affordances. In this setting, the robot attempts to perform a set of possible actions involving the object and observes the extent to which they have desirable effects. One shortcoming of this approach is that any context-dependencies of discovered affordances are not represented explicitly. That is, the learner may discover that an object provides a useful property in one context, but be unable to reason about how that property interacts with any other affordances it has discovered. For example, Ugur et al. (2011) point out that a rolling effect created on a ball by a push-left behaviour cannot be generalised to other push directions by their system. A related study by Sen (2011) addresses this limitation by representing interactions between object affordances using a probabilistic model; these dependencies were modelled using a Hidden Markov Model. This enabled a robot to discover an interaction between the affordances of 'pulling' and 'grasping'. Specifically, the robot inferred that 'pulling' changes an aspect of the object in a manner that supports the goal of grasping. The two-armed Dexter robot used in these experiments was able to successfully learn this concept and demonstrate the action of pulling an object closer to itself in order to grasp it.


Sushkov and Sammut (2012) demonstrated an active learning system in which a robot conducted an experimental loop to discover physical properties of nearby objects. The robot maintained a set of possible hypotheses for each property, such as centre of mass. The most informative experiment to perform was then selected by maximising expected information gain. Bayesian inference was then used to update the probability distribution over the possible models. This approach enabled a two-armed robot to quickly discover hidden properties of several objects.
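A minimal sketch of the experiment-selection loop just described (maintain a belief over candidate models, choose the experiment with the highest expected information gain, then update the belief by Bayes' rule) is given below. It is illustrative only: the candidate models, experiments, and likelihood table are invented, and a real system such as Sushkov and Sammut's operates over continuous physical parameters.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical setting: 4 candidate models of a hidden property (e.g. centre of mass),
# 3 possible experiments, 2 discrete outcomes.
# likelihood[e, m, o] = P(outcome o | model m, experiment e)
likelihood = np.array([
    [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]],
    [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],
    [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]],
])
belief = np.full(4, 0.25)                              # uniform prior over the candidate models

def expected_information_gain(e, belief):
    gain = 0.0
    for o in range(likelihood.shape[2]):
        p_o = float(belief @ likelihood[e, :, o])      # predictive probability of outcome o
        if p_o > 0:
            posterior = belief * likelihood[e, :, o] / p_o   # Bayes' rule
            gain += p_o * (entropy(belief) - entropy(posterior))
    return gain

best = max(range(likelihood.shape[0]), key=lambda e: expected_information_gain(e, belief))
print("most informative experiment:", best)
# After performing the chosen experiment and observing its outcome, the belief is replaced
# by the corresponding posterior and the loop repeats until one model dominates.
```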

2.6.2 Learning from Demonstration

An advantage of autonomous forms of robot learning, such as active learning, is that they do not require the guidance of an external trainer. Further, as discussed above, active learning is an elegant approach to minimising the necessary quantity of training data. However, in general it may take many experiments to learn useful knowledge from scratch using purely trial-and-error approaches, particularly in the large state spaces often needed for robot learning (Sammut and Yik, 2010). To avoid this problem, additional domain information may be provided to the learner to facilitate learning. In the case of RL, additional background knowledge is provided by the reward function, but constructing a reward function requires detailed knowledge of domain dynamics and often considerable programmer effort. Learning from Demonstration (LfD) provides a method of imparting useful domain knowledge to the learner, using the motions or actions of a human demonstrator to guide learning. There are many advantages to this learning paradigm. First, teaching only requires expertise in the domain of interest, not programming expertise. More importantly, the demonstration immediately directs the learner's attention to profitable areas within its search space, often enabling learning from demonstration to succeed with far fewer training examples than trial-and-error learning (Peters et al., 2011). One limitation of LfD, however, is that it is not possible to obtain natural human demonstrations of all robot actions; for example, it is not possible to learn the gait of a quadrupedal robot in this manner.

However, in the case of actions for which human demonstrations are natural, such as batting, hitting, or grasping, LfD has proved very successful (Argall et al., 2009). LfD has a relatively long history in robot learning (Michie et al., 1990; Schaal et al., 2003), and has been successfully applied to many functions of autonomous robots. To our knowledge, the earliest example of learning from demonstration for robot control was the demonstration by Chambers and Michie (1969), who applied the seminal BOXES algorithm (Michie and Chambers, 1968) to the control of a mobile cart that must move in order to balance a pole. More modern work in robot learning from demonstration was spurred by the need to simplify the programming of robot manipulators for assembly-line tasks (Dufay and Latombe, 1984). Later, Pomerleau (1988) applied imitation learning to driving a car, and Sammut et al. (1992) to learning to fly in a fixed-wing flight simulator. To provide a demonstration, a human trainer can either teleoperate (remote-control) a robot via a joystick or some other interface. As teleoperation is the most widely used mode of robot control, a majority of robot LfD studies (including the one described in Chapter 5) take this approach. Notable examples of learning from teleoperated demonstrations include learning to fly a helicopter upside-down (Ng et al., 2004) and playing robot soccer (Aler et al., 2005). Alternatively, a human trainer may provide guidance through kinesthetic teaching, in which the trainer physically holds the robot's actuators and takes control of them in order to demonstrate an action. Kinesthetic teaching is a useful paradigm as it enables a human trainer to use their natural expertise in moving their own arms and body in order to impart useful knowledge to the robot. Kinesthetic demonstration has been particularly useful in demonstrating actions to robots that have a morphology similar to that of humans (Calinon et al., 2007). Not all implementations of robot LfD have involved direct physical or teleoperated demonstration. For example, Duvallet et al. (2013) demonstrated an elegant system in which a robot must learn to follow natural language directions to navigate a partially known environment. In this case, the demonstration took the form of a set of traces of people successfully following directions.

Based on this, the LfD system aimed to predict the trainer's action and, by imitating it, to successfully navigate the environment. Learning was performed using a 'no-regret' online learner (Ross et al., 2011) that iteratively corrected decisions made by the current policy using the decisions provided by the expert's demonstration. This enabled a robot to autonomously follow natural language directions through partially unknown office environments. LfD is a rapidly growing field and space does not permit an exhaustive discussion of its current state; such a discussion can be found in the surveys by Argall et al. (2009) and, more recently, by Chernova and Thomaz (2014).

Dynamic Motion Primitives (DMPs)

In recent years, a set of algorithms for LfD of robot actions has become particularly influential. The framework was inspired by observations suggesting that much of human movement is generated by an underlying set of motion primitives that can be sequenced and generalised (Schaal et al., 2003). The formalism was introduced by Ijspeert et al. (2002), and represents each motion primitive using a dynamical system governed by a set of differential equations called the canonical system; each such motion primitive is termed a DMP. The canonical system is constructed to ensure that it has certain properties. Most important is that the dynamics of the system must converge to a final equilibrium, known as an attractor state. The time taken to converge to the attractor state, τ, is a parameter of the motion primitive, and enables variation of the speed with which the motion is executed. DMPs can be either discrete, representing point-to-point movements, or rhythmic, in which motion is oscillatory, such as waving or walking. For a discrete DMP, the attractor state is a point in space, the goal point, that specifies the end point of the motion. In the case of a rhythmic DMP, the attractor state is a closed trajectory through space that is repeatedly traversed, known as a limit cycle. The parameters of each DMP specify the shape, speed, and character of the motion.

In order to learn a movement, these parameters must be fitted to a demonstration trajectory. Once learning is completed, the DMP can generate trajectories that repeat and generalise the demonstrator's motion. Trajectories generated by the DMP are phase-space trajectories of the canonical system, and are obtained by integrating the differential equations. A formal derivation of the DMP framework is given in Section 5.3.2.

DMPs are attractive for LfD in robotics for several reasons. First, DMPs are parameterised in a straightforward manner, as a linear combination of basis functions. Fitting a DMP to a trainer's demonstration requires learning the weights of these functions, which is a well-studied problem in machine learning; thus, the DMP framework lends itself naturally to supervised learning from demonstration. Second, the differential equations of a DMP are invariant with respect to time, scale, and translation, making generalisation of movements a simple matter. Finally, trajectories generated by DMPs are constrained to be smooth movements that provide acceptable control policies for a robot.

DMPs have received significant attention in the robot LfD literature in recent years, and have been applied to several problems, such as 'pick and place' (Pastor et al., 2009), compliant manipulation and movement adaptation (Kalakrishnan et al., 2011; Pastor et al., 2011), grasping under uncertainty (Stulp et al., 2011), autonomous assembly of furniture (Niekum et al., 2013) and, most spectacularly, playing ping-pong in real time (Kober et al., 2010). Niekum et al. (2012; 2013) demonstrate several applications of the DMP method to learning from demonstration. The teacher provides a set of unstructured demonstrations of a multi-step task. The learner uses a nonparametric Bayesian classifier adapted from Fox (2009) to segment statistically similar actions occurring in all demonstrations. Individual actions are then learned by fitting DMPs to each of the episodes segmented from the complete kinesthetically demonstrated trajectories. These primitives can then be sequenced using a finite state machine (FSM) to replay the entire task. The hybrid learning system in Chapter 5 is related to Niekum's system in that discrete motion components are learned from demonstration using DMPs.

These components are then used to produce sequences of robot actions. An important difference between the two systems is that the more expressive PDDL formalism with which actions are expressed in Chapter 5 enables them to be sequenced in novel ways to solve problems that have not yet been demonstrated.
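To make the DMP formalism sketched above concrete, the following is a minimal one-dimensional discrete DMP in the style of Ijspeert et al.: a forcing term represented as a weighted sum of radial basis functions is fitted to a demonstrated trajectory by weighted least squares, and the learned primitive is then integrated to reproduce the motion. The demonstration, gains, and basis parameters are illustrative assumptions; the thesis's own derivation is given in Section 5.3.2.

```python
import numpy as np

# Demonstration: a smooth point-to-point trajectory standing in for a recorded one.
dt, T = 0.01, 1.0
t = np.arange(0, T, dt)
y_demo = 0.5 * (1 - np.cos(np.pi * t / T))            # moves from 0 towards 1
yd_demo = np.gradient(y_demo, dt)
ydd_demo = np.gradient(yd_demo, dt)

# DMP parameters (illustrative values).
alpha_z, beta_z, alpha_x, tau = 25.0, 6.25, 3.0, T
y0, g = y_demo[0], y_demo[-1]
n_basis = 20
c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))      # basis centres in phase space
h = n_basis / (c ** 2)

x = np.exp(-alpha_x * t / tau)                         # canonical system (closed form)
psi = np.exp(-h * (x[:, None] - c) ** 2)               # radial basis activations

# Target forcing term that would reproduce the demonstration, then per-basis weighted regression.
f_target = tau**2 * ydd_demo - alpha_z * (beta_z * (g - y_demo) - tau * yd_demo)
xi = x * (g - y0)
w = (np.sum(psi * (xi * f_target)[:, None], axis=0)
     / (np.sum(psi * (xi ** 2)[:, None], axis=0) + 1e-10))

# Integrate the learned DMP to reproduce (or, with a new goal g, generalise) the motion.
y, z, traj = y0, 0.0, []
for xt, ps in zip(x, psi):
    f = (ps @ w) / (ps.sum() + 1e-10) * xt * (g - y0)  # learned forcing term
    zd = (alpha_z * (beta_z * (g - y) - z) + f) / tau
    y += dt * z / tau
    z += dt * zd
    traj.append(y)                                     # traj converges towards the goal g
```

Because the goal g and the time constant tau enter the equations explicitly, the same weights can regenerate the motion at a different speed or end point, which is the sense in which DMPs generalise a single demonstration.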

Reinforcement Learning and Learning from Demonstration

One shortcoming of LfD is that the final quality of the policy produced by the learner is limited by the quality of the demonstration. To address this, it is possible to combine the benefits of LfD and RL by using a trainer's demonstration as an initial policy that can be improved by a reinforcement learner. Using RL to refine the robot's policy allows it to potentially exceed the performance of the demonstrator. Of course, a drawback of this approach is that it reintroduces the need to define a suitable reward function, increasing the level of domain expertise required. This dual approach has been applied in many challenging robotics settings; one example is the learning of complex flight sequences with small 'quadcopter' flying robots (Michini et al., 2013). Stulp and Schaal (2011) use the properties of the DMP formalism described in the previous section to produce a system combining RL and learning from demonstration. RL is achieved using the path integral approach proposed by Theodorou et al. (2010), and is shown to significantly improve the performance of robotic manipulation actions over and above that achieved by the teacher's demonstrated action.

2.6.3 Relational Learning

Most machine learning algorithms assume that training examples can be expressed as feature vectors. However, if the target concept and background knowledge are specified using a relational representation such as first-order logic, it is possible to acquire more complex concepts.

Although relational learning has not been the most common approach in robotics, as the level of task-level autonomy expected of robotic systems increases, it becomes necessary for robots to refine their high-level understanding of the world. This is an area in which relational learning is particularly effective.

Inductive Logic Programming

Inductive logic programming (ILP) (Muggleton, 1991) is a machine learning framework that uses a subset of first-order logic to represent both background knowledge and training examples. The result of learning is a logic program. As in other machine learning methods, the logic program is treated as a hypothesis that covers as many of the positive examples and as few of the negatives as possible. In ILP, a hypothesis covers a positive example if the example can be proved from the conjunction of the hypothesis and the background knowledge.

Formally, given a hypothesis language L_h, a background theory B, and a set of positive and negative examples E = P ∪ N, the goal of an ILP system is to find a hypothesis h ∈ L_h such that covers(B, h, p) = true for all p ∈ P, and covers(B, h, n) = false for all n ∈ N. The covers relation defines when an example is covered by a hypothesis h, given the background theory B.

The task of an ILP learner is to search the space L_h of possible logic programs for the best possible hypothesis h. As is typical in machine learning, the space of possible hypotheses is partially ordered by a generality relation, and search can proceed from general-to-specific or from specific-to-general. Detailed introductions to ILP can be found in (Mitchell, 1997; Bratko, 2001; Russell and Norvig, 2010; De Raedt, 2011).
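The covers relation and the general-to-specific search described above can be illustrated with a deliberately simplified, propositional sketch (ground facts stand in for first-order literals, so this is not a real ILP system): starting from the empty, most general hypothesis body, literals are added until no negative example is covered. The examples and the hypothesis language are invented for illustration.

```python
# Hypothetical training examples: each example is the set of ground facts describing one object.
positives = [{"colour(red)", "shape(round)", "size(small)"},
             {"colour(red)", "shape(round)", "size(large)"}]
negatives = [{"colour(red)", "shape(square)", "size(small)"},
             {"colour(blue)", "shape(round)", "size(large)"}]
background = set()                               # shared background facts B (empty in this toy example)
language = set().union(*positives, *negatives)   # hypothesis language L_h: candidate body literals

def covers(B, h, example):
    """An example is covered if every literal in the hypothesis body holds,
    here simply: is present among the background and example facts."""
    return h <= (B | example)

def learn(positives, negatives):
    """General-to-specific search: start from the empty (most general) body and add
    the literal that keeps the most positives while excluding the most negatives."""
    h = set()
    while any(covers(background, h, n) for n in negatives):
        def score(lit):
            h2 = h | {lit}
            pos = sum(covers(background, h2, p) for p in positives)
            neg = sum(covers(background, h2, n) for n in negatives)
            return (pos, -neg)
        h |= {max(language - h, key=score)}
    return h

print(learn(positives, negatives))   # -> {'colour(red)', 'shape(round)'}
```

A real ILP learner such as Progol searches over clauses containing variables and uses resolution to test coverage, but the generality ordering being traversed is the same idea as in this sketch.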

Relational Learning for Robots

A pioneering work in the application of ILP to robot learning was that of Benson (1996). Benson's TRAIL learner, like many other relational learning systems, extended earlier action model learning (such as that of Gil, 1993) by applying a noise-tolerant ILP algorithm. The algorithm used in TRAIL was an adaptation of the DINUS learner (Džeroski et al., 1992).

TRAIL was able to acquire relational action models in a flight simulation and a two-dimensional robotic manipulation simulation. PRODIGY (Carbonell et al., 1991) provides an early and influential example of embedding multiple learning systems into an integrated system. Learning in PRODIGY focuses on improving the performance of planning, with multiple learning modules that improve plan generation speed through search control rules, acquire and correct domain knowledge (Carbonell and Gil, 1996), and improve plan quality (Veloso et al., 1995). Demonstrations of PRODIGY on real-world robot systems, such as that of Haigh and Veloso (1996), were early examples of integrated learning systems on real robots. Ryan (2000) and Lorenzo and Otero (2000) both used variations of the ILP algorithm Progol, developed by Muggleton (1995), to learn action models. In Ryan's work, ILP was used to learn the side-effects of STRIPS actions, whilst Lorenzo used it for learning action models in the situation calculus. Pasula et al. (2007) developed an algorithm that learnt probabilistic relational rules in a three-dimensional physics simulation. One of the distinguishing features of their approach was that the learning algorithm used a different search heuristic to the greedy hill-climbing algorithms used in previous methods. The XPERO project demonstrated the discovery of properties termed 'insights' about robotic domains by predicate invention within an ILP framework (Leban et al., 2008). Predicate invention involves the construction of new relations to expand the learner's vocabulary. The robot was able to invent a concept of 'movability' that described its ability to move certain objects by pushing them, while others were immovable. Later research in XPERO drew comparisons between several machine learning methodologies for autonomous robotic discovery (Bratko, 2010). Most closely related to the learning systems in this thesis is the work of Brown (2009), who constructed a learning system that enabled an autonomous robot to learn to use tools to solve planning problems. Brown's work combined EBL, a form of relational learning from demonstration, with an active learning system.


The EBL component constructs an initial relational model of a trainer's demonstration. This initial hypothesis is overly specific, loaded with unnecessary details from that particular use case. The active learning system then performs experiments in order to generalise the model. As will be described in detail in Chapter 5, the work in that chapter extends Brown's work in several ways, most importantly by combining it with a motor-level learning from demonstration system. This enables it to learn and execute more complex manipulation actions (such as hitting) that Brown's system could not.

2.7 Hybrid Learning Architectures for Robots

The previous section discussed multiple types of robot learning and the benefits of active, demonstration-based, and relational learning methods. However, these learning approaches were generally implemented in an isolated setting: the robot was set the problem of learning a particular task in isolation, rather than interleaving learning with the pursuit of long-term goals in an ongoing framework. Moreover, the robot learning systems discussed so far are typically not embedded within a complete software architecture that performs all the functions outlined in Section 2.2.1, such as perception, reasoning, and motor control. When learning systems are embedded within robot architectures, it becomes clear that each component can benefit from incorporating a learning capability (Sammut and Yik, 2010). This section distinguishes between two forms of learning, and argues that integrated architectures for long-term autonomous robots must possess both. First, an integrated architecture should make use of a variety of learning algorithms, each tailored to the software component in which it is embedded; Sammut and Yik (2010) termed this a multistrategy approach to robot learning, following Michalski (1993). Second, since, as discussed previously, robot representation and processing must occur at multiple levels of abstraction, learning processes must also have this character.

As was pointed out in Section 1.3.4, learning processes at different levels are not separate but are coupled together, constraining and influencing each other. For example, task-level learning systems must constrain the search of control-level learners. This type of system can be regarded as a hybrid learning system.

Multistrategy Learning

It is well known that different machine learning algorithms have different strengths, and some may be particularly well suited to particular domains (Michalski, 1993). Thus, it is appropriate to include diverse machine learning algorithms in a complete robot architecture, matching the strength of each learning algorithm to the task performed by the component. In some situations, training examples can be gathered in advance so that batch learning of a classifier may be used, such as when collecting traces of opponent game play. For other tasks, trial-and-error learning may be more appropriate, particularly where the supervision of a trainer is not available. The problem of noisy sensor data is ubiquitous in robotics, and in this case Bayesian approaches have been used with great success. This variety of representations and tasks required for integrated robotic systems makes them an ideal setting for the application of Michalski's (1993) concept of task-adaptive learning. Michalski suggested that the strengths of particular learning algorithms should be matched to problems in which they are known to perform well; ideally, this selection process can occur dynamically. Accommodating such variety in robot learning suggests a loosely coupled architecture, such as MALA, which incorporates not only specialised processes and representations, but also the specialised learning settings and algorithms they require. At present, there is no robot that incorporates learning throughout its architecture. Furthermore, there are few systems that have adopted a multistrategy approach, although this would be beneficial.


2.7.1 Hybrid Learning in MALA

Autonomous robots may be required to perform planning, particularly for long-term missions. To support planning, specifications of actions at multiple levels of abstraction are needed, because representations of action suitable for planning, such as PDDL, are usually insufficiently detailed to permit direct execution of actions by a robot. This section introduces the multi-level conception of robot action used in MALA, geared towards the hybrid learning system to be introduced in Chapter 5; the next section discusses its relationship to earlier work in hybrid learning systems. This conception rests on the following commitments:

1. Actions must be represented at (at least) two levels, referred to as task and control levels. The task level enables reasoning through use of a planner-friendly formalism such as PDDL. Control-level actions specify directly executable motor commands.

2. Task-level actions may specify constraints on the control-level state space.

3. A task-level plan consists of a set of transitions between constrained regions of the control-level state space.

4. Execution of plans consists of finding trajectories that link these regions. These trajectories can be constructed either through learning or planning.

In this conception of a multi-level representation of action, the function of symbolic knowledge is to allow reasoning about and sequencing of control-level behaviours, and to provide bounds on the parameters that generate those behaviours. Life-long learning robots require learning at each level in this representation: both logical models and control-level parameters must be continuously refined. The capabilities and information flow required of such a system are depicted in Figure 2.2.
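A minimal, hypothetical sketch of this two-level representation is given below: a task-level action carries PDDL-style preconditions and effects together with bounds on control-level parameters, and execution asks a separate routine (learned or planned, for example a DMP) for a trajectory that respects those bounds. All names and numbers here are invented for illustration and are not the thesis's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TaskLevelAction:
    """PDDL-style action schema augmented with bounds on control-level parameters."""
    name: str
    preconditions: frozenset              # relational facts that must hold before execution
    effects: frozenset                    # relational facts that hold afterwards
    control_bounds: dict = field(default_factory=dict)   # constrained region of the control-level state space

# Hypothetical manipulation action: the symbolic level says when a hit is applicable
# and what it achieves; the bounds constrain the motion that realises it.
hit_with_hammer = TaskLevelAction(
    name="hit",
    preconditions=frozenset({"holding(hammer)", "reachable(peg)"}),
    effects=frozenset({"driven_in(peg)"}),
    control_bounds={"approach_angle_deg": (-15.0, 15.0), "impact_speed_m_s": (0.5, 2.0)},
)

def applicable(action, state):
    return action.preconditions <= state

def execute(action, state, find_trajectory):
    """A plan step is a transition between constrained regions of the control-level state
    space; find_trajectory (learned or planned) must respect the action's bounds."""
    assert applicable(action, state)
    trajectory = find_trajectory(action.control_bounds)   # e.g. a DMP satisfying the bounds
    return (state | action.effects), trajectory

state = {"holding(hammer)", "reachable(peg)"}
new_state, _ = execute(hit_with_hammer, state,
                       find_trajectory=lambda bounds: f"trajectory within {bounds}")
```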


Figure 2.2: Conceptual view of a layered hybrid learning architecture, showing relationships between previous work in hybrid robot learning.

Although it is somewhat unusual to construct robot learning systems capable of a high degree of integration between learning at multiple abstraction levels, several systems have shown how many of these capabilities can be achieved. The remainder of this section reviews the contributions of these systems and their relationship to each other and to hybrid learning in MALA.

2.7.2 Related Hybrid Learning Architectures

Previous work by Sammut et al. (2010; 2013) and Ryan (1998; 2002b) has demonstrated several hybrid systems for difficult robot learning tasks. As discussed in Section 2.6.3, Brown (2009) combined ILP and EBL to induce action models for robots to use tools, demonstrating the induction and refinement of task-level knowledge from demonstration. However, Brown's work was unable to acquire or refine control-level knowledge. Sammut and Yik (2010) used a symbolic planner to generate constraints for a trial-and-error learner to induce control policies for a robot walk, demonstrating that task-level knowledge could be used to restrict the search space for a low-level learner.

This restriction can enable control-level knowledge to be successfully acquired in high-dimensional search spaces. Ryan (1998) used a planner to automatically construct task hierarchies for hierarchical RL, based on abstract models of each behaviour's purpose, combined with RL to produce concrete implementations of the abstractly defined behaviours. This provided a demonstration of feedback from the task level being used to construct control-level knowledge. Ryan and Reid (2000) demonstrated that the post-conditions of action models could be refined using an ILP algorithm, demonstrating an integration of the acquisition of task-level knowledge with the control-level learning in his earlier work. The relationship between these architectures is depicted in Figure 2.2.

The hybrid learning framework outlined here has much in common with the layered learning systems described by Stone and Veloso (2000). Like our paradigm, the layered learning system suggests a set of learning algorithms at different levels of representation. In addition, the layers interact with each other, with higher-level learners constraining the search of more finely grained representations at lower levels. Another similarity is that Stone and Veloso also advocate a heterogeneous collection of learners. That is, any learning algorithm and representation can be used at any layer of computational units, enabling the learning algorithm used to be matched to the task learned at that particular layer. The authors demonstrate an implementation of a three-layer learning system to control a team of soccer-playing robots (Stone, 2000). Unlike abstraction layers in MALA, each layer in the system is physically distinct: the lowest layer is at the level of an individual robot, while higher layers involve multiple robots and eventually the whole team. A second difference is that the layered learning system does not include a formalism suitable for integration with a planner. The type of knowledge termed task-level (Section 2.7.1) is not represented, and knowledge is sub-symbolic at all layers. Hierarchical Reinforcement Learning (HRL) (Dietterich, 2000) is a related attempt to use temporal abstraction and hierarchical decomposition to enable learning to succeed in larger search spaces.

Like hybrid learning, HRL has produced many successful learning results in real robotic systems (Barto and Mahadevan, 2003). Recent approaches to HRL for robots based on DMPs, such as that of Stulp and Schaal (2011), have been particularly successful. The central difference is that the representation of knowledge is the same at every layer of an HRL framework. While this provides the simplicity of uniform representations and learning algorithms at each level of the hierarchy, it does not allow specialised representations and algorithms to be matched to the task.

A closely related system of learning through kinesthetic demonstration is that proposed by Abdo et al. (2013). Their work has much in common with the hybrid learning system proposed in Chapter 5. As in that system, action preconditions and postconditions are learned from teacher demonstrations, enabling actions to be acquired for planning. Preconditions and effects are identified by clustering object poses and features in traces of several teacher demonstrations of each action. Unlike the work described in Chapter 5, however, which assumes that the action model inferred from the teacher demonstration is correct (albeit overly specific), Abdo et al. allow the teacher to provide feedback in order to remove incorrectly inferred or irrelevant predicates from the description of each action. A significant difference between their system and the one described in Chapter 5 is that, in their system, once the inference of action models has been corrected using teacher feedback, no further generalisation is performed. It is assumed that the features present in all demonstrations provide a sufficiently general model of the action that it may be used in practice. In contrast, the hybrid system described in this thesis performs a set of experiments with the action to discover the most general preconditions under which it may be used. Abdo et al. demonstrate the ability of their method to successfully solve planning problems, such as stacking blocks and opening doors, with experiments performed on a PR2 robot.


Hybrid task and motion planning methods, such as the hierarchical planning in the now (HPN) framework introduced by Kaelbling and Lozano-Perez (2011; 2013), are somewhat related to the hybrid learning architecture presented in Chapter 5, even though they typically do not involve learning. These systems integrate symbolic reasoning with metric action planning, enabling robots to both reason about, and then physically perform, logical sequences of actions. Such architectures, though limited to action execution, provide an elegant method to achieve such integration. The advantage of these hybrid planning systems is that they do not require a demonstration or any learning to achieve their goal; however, they typically do not extend to the generation of complex dynamic movements such as hitting or twisting. Other work in hybrid planning, such as that of Stulp and Beetz (2008), Cambon et al. (2009), and Guitton and Farges (2008), is similarly related.

This section has described many implementations of various aspects of machine learning for robotics. However, these learning systems are usually not embedded within a larger system capable of operating autonomously and solving tasks over an extended period. There is a need for an architecture capable of incorporating multistrategy, hybrid machine learning algorithms for robots. The next chapter presents MALA, which is a step toward this goal. MALA is designed to be versatile enough to incorporate all of these learning settings and algorithms. Chapter 4 presents a two-level learning system: the lower level acquires control-level knowledge of how to drive a wheeled robot over rough terrain, while the higher level learns to recognise extremely difficult terrain based on the performance of low-level learning. The hybrid learning implementation presented in Chapter 5 learns both task-level and control-level knowledge simultaneously from demonstration. It then refines this task-level knowledge by experimentation to generalise and improve over the knowledge gained from the trainer's demonstration. Figure 2.3 shows how these aspects of hybrid learning relate to each other, and how the system to be introduced in Chapter 5 relates to the general hybrid learning framework outlined above. Also shown in Figure 2.3 are the aspects of hybrid learning that are not yet implemented in MALA; these are the subjects of further research and are discussed in Chapter 6.


As was discussed in the introduction, an application domain for autonomous robots is urban search and rescue. This is the domain that has been the key focus of the research described in this thesis. The next section gives a brief overview of the history and state of the field of rescue robots, to provide context and motivation for the experiments detailed in later chapters.

2.8 Autonomous Robots for Rescue

There is a large range of potential applications for robotic systems to assist emergency services.

Figure 2.3: Hybrid learning architecture implemented in Chapter 5. This system implements several of the learning functionalities of the hypothetical, complete architecture shown in Figure 2.2. Task-level knowledge is acquired by ‘explaining the effects’ of the trainer’s demonstration, and control knowledge is acquired through learning from demonstration via dynamic motion primitives (DMP LfD). Task-level knowledge is then refined via active relational learning.

Much research is focused on searching for victims at the sites of disasters such as earthquakes or tsunamis (Murphy, 2014). Rescue robots have also proven useful in the aftermath of hurricanes, fires, nuclear accidents (Yoshida et al., 2014), and other disasters (Micire, 2008). The motivation for such robots is twofold. First, robots can withstand greater temperatures and environmental toxicity than living systems, and can provide rescue teams with greater access to dangerous environments. Second, and more importantly, using robots can improve the safety of the rescuers. Disaster rescue is particularly dangerous for rescue workers, as unstable structures can collapse with tragic consequences. An example of this danger was the aftermath of the Mexico City earthquake of 1985, in which 135 workers were killed during the rescue operation (Casper and Murphy, 2003). In a highly delicate and dynamic situation such as rescuing a victim from the site of a disaster, an ideal rescue team combines the expertise of human operators with the robustness and expendability of a robotic system. This can be achieved through teleoperation of a robot by a human operator. However, in a disaster, the unstructured and chaotic nature of the environment and terrain means that robots must frequently operate inside enclosed spaces, within voids, at the bottom of pits, and underneath dense rubble, which can render radio-based communication unreliable. Under these circumstances, robots with little or no autonomous capability may be lost as soon as contact with their operators fails. In several recent disasters where robots were used, they were simply abandoned when they entered areas with no radio reception (Davids, 2002; Casper and Murphy, 2003). To avoid this, a robot with a greater level of autonomy might navigate to regions of better communication if it finds itself cut off from its human operators, or even make some progress towards finding victims itself.

2.8.1 The DARPA Robotics Challenge (2012-15)

In recent years, the absence of robot rescue workers at disasters has become conspicuous (Guizzo, 2014).

The disaster at Fukushima Daiichi was particularly notable in this respect: the tsunami caused large-scale damage and, due to the hazards of radiation exposure, it proved extremely difficult to organise a response using human workers. There is now a need for robots capable of negotiating human-built environments with a high degree of autonomy, to provide assistance to, or even replace, human rescue and disaster response workers. In response to Fukushima Daiichi in particular, and the needs of future USAR operations in general, the United States military, through the Defense Advanced Research Projects Agency (DARPA), organised a large-scale contest to accelerate the development of autonomous robots to achieve these aims. Announced in 2012, with trials during 2013 and finals ongoing in 2014-2015, the DARPA Robotics Challenge (2014) is a large-scale simulation of a disaster rescue scenario occurring in a human-built environment. Thus the requirements for the challenge are largely driven by the need to negotiate situations designed for use by humans; such scenarios typically involve the use of tools, devices, valve handles, and doors. To guide the development of autonomous robots towards this aim, DARPA specified a set of concrete tasks. The complete set of tasks is:

1. Drive a utility vehicle at the site.
2. Travel dismounted across rubble.
3. Climb an industrial ladder and traverse an industrial walkway.
4. Clear a doorway by removing debris.
5. Open a door and enter a building.
6. Use a tool to break through a concrete panel.
7. Locate and close a valve.
8. Attach fire hose couplings.

One of the humanoid robots participating in the contest is shown in Figure 2.4, clearing a pile of debris from a doorway (task number four). These tasks are a challenging set of milestones for rescue robots. As successful execution of these or similar tasks is likely to be required of autonomous rescue robots in future deployments, they provide a representative set of goals for research in the area.


Figure 2.4: The winner of the DARPA Robotics Challenge trial phase competition in 2013, the S-One robot from team SCHAFT. S-One is shown performing task number four, clearing a pile of debris from a doorway. (Photo credit: DARPA.)

In Chapter 5, a set of manipulation experiments is used to evaluate a novel hybrid robot learning system, with the experimental domain closely modelled on the tasks specified for the Robotics Challenge.

2.8.2 Need For Increased Autonomy

One particularly important insight has come out of analysis of the performance of human-robot rescue teams during large-scale rescue operations, most notably the World Trade Center attack of September 11, 2001. In large-scale disasters, rescue efforts may go on over an extended period. Typically, rescuers will work very long shifts, often going without sleep for several days (Casper and Murphy, 2003). In this case, human cognitive resources are stretched to the limit, and it becomes very difficult for a robot operator to stare at a video screen, searching what is typically a low-quality image for survivors.

Improved robot autonomy, perception, and mobility are needed to enable robots to operate under less stringent supervision, reducing the need for human operators to work extreme hours (Balakirsky et al., 2007; Liu and Nejat, 2013). This would spare valuable human hours and improve the chances of success for the rescue team. To achieve increased autonomy, improvements are needed both in sensors and in hardware and software components, and in the integrated system software that controls the robot's behaviour. Research into robot architectures is aimed at the latter goal. The next chapter introduces MALA, a novel robot architecture intended to improve the performance of autonomous robots, particularly in the context of urban search and rescue.

Chapter 3

Multi Agent Learning Architecture (MALA)

3.1 Chapter Overview

This chapter introduces the Multi Agent Learning Architecture (MALA), the cognitive architecture that is one of the contributions of this thesis. Beginning with a brief discussion of the requirements imposed on architecture design by robot hardware, the components of MALA are enumerated and discussed, and their global structure is described. In particular, the relationship between the features of the system and the principles of architecture design enumerated in Chapter 1 is analysed. The general, domain-independent features of the architecture are detailed in Section 3.3, and their software implementation in Section 3.4. The structural commitments of the architecture are compared to related architectures in Section 3.5. To make the presentation of the architecture as concrete as possible, a domain-specific instance of a MALA system that controls an autonomous rescue robot is discussed in detail in Section 3.6 and evaluated in Section 3.7.


3.2 Robots In Unstructured Environments

Cognitive architectures suitable for autonomous robots must operate, over a long period of time, in an unstructured and dynamic environment. This is the aim of MALA. MALA’s design attempts to minimise the assumptions made about the nature of the robot’s environment. In general, it is not possible to assume the robot will have a map of the surroundings in advance. Similarly, one cannot assume the presence of any easily identifiable landmarks known before the start of the mission. Objects are not conveniently coloured or shaped to make recognition and manipulation easier, and they may move, sometimes quickly. The terrain is not assumed to be flat, but may be rough and uneven, necessitating complex locomotion. These are the kinds of conditions that are encountered in real applications of robots, such as mine automation, underwater monitoring, or health care.

In these unstructured environments, most of the robot’s hardware and software systems must be devoted to perception and actuation, since the associated problems are usually very challenging and expensive, in terms of both hardware costs and computation. The algorithms that drive these computations are necessarily specialised. For example, Simultaneous Localisation and Mapping (SLAM) is handled, almost always, by creating a probabilistic estimate of the robot’s pose in relation to its surroundings and stitching together successive estimates to create a global map (Thrun et al., 2006). Perception is often handled by statistical estimation of the occurrences of certain high-dimensional features to identify objects in scenes. Locomotion, particularly for a legged or segmented robot traversing rough terrain, also requires substantial, specialised computational resources to calculate feasible trajectories for the robot’s joints. The same holds for manipulation.

Langley (2012) has asserted that integrated cognitive systems need not include such specialised machinery, since the intelligent abilities of humans or machines are decoupled from their ability to perceive and manipulate the environment. This amounts to the assumption that perceptual and motor mechanisms are not involved in higher-level reasoning. In humans, at least two processes belie this assertion.


These are the process of mental imagery, in which high-level cognition makes use of perceptual and motor systems to determine feasibility of actions before performing them, and of top-down perception, in which abstract concepts and contexts influence which aspects of the world are perceived. These processes are equally crucial for robots, which frequently utilise metric representations and path planning algorithms to ‘visualise’ whether a path to a goal is possible, or which select the type of low-level perceptual processing to perform in a manner sensitive to high-level context. This thesis reiterates the argument made by Sammut (2012) that to pursue human-level intelligence in robots, perception and motor skills cannot be left out of architecture design.

Although it may be possible, in principle, to implement all the required algorithms for perception and actuation using a uniform representation, in practice this will introduce many inefficiencies. For example, it would be very inefficient, as well as inconvenient, to perform large-scale matrix manipulation, directly, in a rule-based system. It does, however, make sense to call specialised modules from a high-level system that serves as glue to join many components and performs tasks at a cognitive level.

At present there is a gap between robot software architectures and cognitive architectures. Most robot architectures, such as Robot Operating System (ROS) (Quigley et al., 2009), follow a component model, providing a communication mechanism between multiple nodes in a potentially complex and ad hoc network of software components. Cognitive architectures (Langley and Choi, 2006; Anderson, 2007; Laird, 2012) usually emphasise uniformity in representation and reasoning mechanisms. Almost all fielded robotics applications take an ad hoc, modular approach to controlling the robot. The drawback of this approach is that a custom-designed system is less retaskable, and less flexible, than a structured framework capable of adapting its own control systems through introspection. However, the modular, loosely coupled approach taken by these architectures is ideally suited for integrating diverse representations in a flexible manner. Further, the asynchronous nature of such systems helps to process incoming sensory information as rapidly

as possible, facilitating the highest possible degree of reactivity. Thus, such a processing model is well suited for integrated robot software frameworks, and we adopt it here.

Although asynchronous subsymbolic processing is necessary for an autonomous robot, it is not sufficient, in general, to solve complex problems, reason introspectively, or explain the robot’s actions to human operators. Further, systems capable of abstract representation and planning can be adapted to new domains much more easily, which is essential to produce a retaskable robot system. Thus, symbolic, relational reasoning is an essential capability for integrated robot architectures. In particular, architectures must provide mechanisms to translate processing content between relational reasoning and subsymbolic processing.

This discussion began with the statement that it is desirable for robots to be able to operate in an unstructured and dynamic environment over a long period of time. This is sometimes called long-term autonomy. In its current usage (Meyer-Delius et al., 2012), this term generally means continuing to perform a task, usually navigation, despite changing conditions. MALA is designed to enforce a minimal set of assumptions. The architecture of the robot should not constrain its capabilities to perform only one kind of task, but should enable it to be capable of performing a variety of tasks, under changing environmental conditions, and in the face of possible changes to the robot itself. Adaptation to changing conditions necessarily entails learning, either from failures, experimentation, or demonstration. Because we adopt a heterogeneous structure, with different modules using different representations and algorithms, different learning methods are also required. For example, a low-level skill such as locomotion requires adjusting motor parameters, so numerical optimisation may be appropriate, whereas learning action models for high-level task planning is more likely to be accomplished by symbolic learning. Thus, the cognitive architecture should be able to support different kinds of learning in its different components.

An architecture that can provide this type of long-term autonomy should possess several qualities, particularly, specialised representations, loosely coupled

asynchronous processing components, relational reasoning, mechanisms to translate between task-level and primitive representations, and integrated multi-strategy learning. This chapter describes MALA, an architecture that possesses these characteristics. First, the components of MALA and how information flows between them are described. To make this description more concrete, an application of the architecture to control an autonomous robot for urban search and rescue is described in detail. This implementation demonstrates all these capabilities with the exception of learning. Learning capabilities within MALA are the subjects of Chapters 4 and 5.

3.3 MALA: Agents and Blackboards

Discussion of the operation of MALA begins, in the next section, with its principal components: agents and blackboards. Communication mechanisms and global representations are then discussed in Section 3.3.2. At a finer level of detail, the internal representations used within individual components, and the information flow between them, are described. In particular, the mechanisms by which multiple software modules asynchronously process raw sensory information to create a symbolic world model are outlined. This knowledge can then be used to generate task-level actions using a classical planner. Finally, Section 3.3.3 shows how these task-level actions can then be implemented by a physical robot through the use of an example.

3.3.1 Architectural Components

MALA consists of two types of modules: a collection of agents that perform all necessary computational tasks, and one or more blackboards which provide storage and communication between agents.


Blackboards

Blackboards perform several important functions in the architecture. They serve as a communication mechanism between the agents that sense, plan, and act. They are also the primary storage mechanism for short-term and long-term memory. Objects on a blackboard are represented as frames (Roberts and Goldstein, 1977). Agents interact with each other indirectly by posting frames to and receiving them from a blackboard. The system may consist of more than one blackboard and associated agents. Typically, there are many more agents than blackboards. The blackboard system described here is an adaptation of the Multimodel Interagent Communication Architecture (MICA) blackboard system developed by Kadous and Sammut (2004b). MICA has since been adapted to provide a high-level architecture for a rescue robot (Sheh et al., 2009) and the knowledge-level processing in an assistive robot for a home environment (Kirchner et al., 2010).

Agents

Connected to the blackboard(s) are the components that perform all computational functions of the system. Following Minsky’s (1988) terminology, we call these agents. In contrast to other cognitive architectures that often make commitments to particular representations of knowledge and reasoning algorithms, there is no restriction on the internal workings of an agent. Thus, each agent uses the most appropriate method for its particular task. Agents are grouped into several categories. Low-level perception agents operate directly on raw sensor data to extract useful features. Higher-level perception agents operate on these features and fuse them into more abstract representations describing objects and their poses. These representations then form the input to further high-level perception agents that update the architecture’s current knowledge of the world. Goal generation agents encode the drives and objectives of the system and operate on the description of the world state produced by perception agents. Action generation agents include action planners, constraint solvers, motion planners, and motion controllers. Together, these agents provide the necessary processing

to generate task-level plans to achieve goals and to execute the behaviours that implement these plans.

3.3.2 Communication via FRS

Frames

The common data structure through which agents communicate with each other (via the blackboard) is the frame (Roberts and Goldstein, 1977). Frames may be designated generic, describing classes of objects, or instance, representing specific objects. Properties and their values are inherited from superclasses but may be overridden if redefined in a subclass or instance frame. Each generic frame contains a set of slots, which specify the datatypes that the frame contains. When frames are instances, the slots may be populated with data. Slots may have small procedures called daemons associated with them that are triggered when the slots of the frame are accessed, including when a new instance frame is created or destroyed, or when a property value is added, removed, or replaced.
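To make the frame idea concrete, the following is a minimal Java sketch of generic and instance frames with slot inheritance and an if_needed-style daemon. It is an illustration only; the class and method names (Frame, onNeeded, and so on) are invented here and are not the FRS implementation described below.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Minimal sketch of a frame: named slots, single inheritance from a parent
// frame, and an "if_needed"-style daemon that supplies a value for an empty slot.
class Frame {
    final String name;
    final Frame parent;  // the generic this frame inherits from (ako/isa)
    final Map<String, Object> slots = new HashMap<>();
    final Map<String, Supplier<Object>> ifNeeded = new HashMap<>();

    Frame(String name, Frame parent) {
        this.name = name;
        this.parent = parent;
    }

    void set(String slot, Object value) { slots.put(slot, value); }

    void onNeeded(String slot, Supplier<Object> daemon) { ifNeeded.put(slot, daemon); }

    // Slot lookup: local value first, then a local daemon, then the parent frame.
    Object get(String slot) {
        if (slots.containsKey(slot)) return slots.get(slot);
        Supplier<Object> daemon = ifNeeded.get(slot);
        if (daemon != null) return daemon.get();
        return parent != null ? parent.get(slot) : null;
    }
}

public class FrameDemo {
    public static void main(String[] args) {
        Frame person = new Frame("Person", null);     // generic frame
        person.onNeeded("name", () -> "Anonymous");   // daemon, cf. Figure 3.1
        person.set("age", 0);                         // default value

        Frame p54 = new Frame("Person54", person);    // instance frame
        p54.set("age", 30);                           // overrides the inherited default

        System.out.println(p54.get("name"));          // prints: Anonymous
        System.out.println(p54.get("age"));           // prints: 30
    }
}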

FRS

Frames in MALA are defined in FRS, a scripting language that combines a rule-based system with a frame implementation for data storage and retrieval, both of which are built upon an expression language. FRS is based on FrameScript (McGill et al., 2008), which was originally designed and used to implement a smart personal assistant in a ubiquitous computing environment (Kadous and Sammut, 2004a). FRS programs are called scripts and are collections of FRS expressions, which provide FRS with rule-based processing for controlling interactions with users and devices. The basic unit of an FRS program is the pattern-response rule:

pattern ==> response


Person ako Object with
    name: if_needed "Anonymous"
    age: if_new 0 ;;

Figure 3.1: FRS definition of a generic frame. This generic frame has two slots which can store data about specific instance frames. Each of these slots has an associated daemon which specifies a procedure to execute if the slot is accessed. The if_needed keyword ensures that a value is available for the slot if it is accessed while empty, while if_new provides a default value for new instances. Example reproduced from McGill et al. (2008).

Much like ‘If-Then’ semantics, if the pattern on the left-hand side evaluates to ‘True’ then the right-hand side is executed. The inheritance property of frames provides FRS with an object-oriented programming paradigm. Frames were originally developed as a knowledge representation system in Artificial Intelligence (Minsky, 1974). Simple scripts are simply collections of rules like the one above. Interactions that require more complex behaviour may use the full power of the frame representation system. The FRS definition of a simple generic frame is shown in Figure 3.1. Further examples of FRS code are shown in later sections (see Section 3.6.3).

The blackboard stores a database of frames. When instance frames are written to the blackboard, they may be marked as transient, which means that they are only kept in the in-memory database and not written to permanent storage, as are the persistent frames. All instance frames are time-stamped on creation, allowing queries on temporal relations. Transient frames are forgotten between invocations of the system and may be thought of as constituting the system’s short-term memory. Persistent frames remain and can be used in further processing, including learning.
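The distinction between transient and persistent frames, together with creation time-stamps, can be sketched as a simple store. This is again an illustrative sketch, under the assumption that frames are carried as text; the class names are invented and do not correspond to the blackboard implementation.

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Sketch of a blackboard frame store: every frame is time-stamped on creation;
// transient frames are kept only in memory (short-term memory), while persistent
// frames would also be written to permanent storage (long-term memory).
class PostedFrame {
    final String frameText;        // the FRS frame, as text
    final boolean transientFrame;
    final Instant createdAt = Instant.now();

    PostedFrame(String frameText, boolean transientFrame) {
        this.frameText = frameText;
        this.transientFrame = transientFrame;
    }
}

class FrameStore {
    private final List<PostedFrame> frames = new ArrayList<>();

    void post(PostedFrame f) {
        frames.add(f);
        if (!f.transientFrame) {
            // a real store would also append persistent frames to disk here
        }
    }

    // Temporal query: all frames created after a given instant.
    List<PostedFrame> postedSince(Instant t) {
        List<PostedFrame> result = new ArrayList<>();
        for (PostedFrame f : frames) {
            if (f.createdAt.isAfter(t)) result.add(f);
        }
        return result;
    }
}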

Listening for Frames on the Blackboard

An agent may connect to a blackboard in one of two ways. First, it may submit a listen request, which is a request to be notified about certain types of frames.


Specifically, it will be notified of instance frames that inherit from a particular generic. The blackboard then activates the agent when objects of that type are written to the blackboard. The active agent performs its task and posts its results to the blackboard. A new object written to the blackboard may trigger another agent into action, and this process continues. Other agents that only post to the blackboard but do not receive frames may simply connect without providing any frames to listen to. The cascade of frames posted to the blackboard triggering further processing by other agents is somewhat similar to the way a rule-based system operates when rules update working memory, triggering further rule activation. However, agents are larger computational units than rules and often contain significant algorithmic processing. Another difference is that there is no explicit conflict resolution mechanism, as all agents that are enabled by the current state of the blackboard may activate concurrently. Agents communicating indirectly through the blackboard do not have to be aware of each other. The isolation of agents enforced by the blackboard makes it possible to introduce new agents or replace existing ones without affecting other agents. Furthermore, indirect communication allows the information going through the blackboard to be observed by other agents, including ones that learn from the succession of blackboard events.

The communication paradigm in MALA is a form of publish-subscribe protocol. This provides a loose coupling between agents. There is no facility for a direct request-reply interaction between agents in MALA12. All communication is through the blackboard, with one important exception. As discussed in Chapter 2, other modular architectures such as the Cognitive systems Architecture Schema (CAS) (Hawes and Wyatt, 2006; Hawes and Wyatt, 2010) and ROS (Quigley et al., 2009) commit to a peer-to-peer topology for all types of communication in their architecture13. Although this has the advantage of maintaining a uniform communication paradigm across the architecture, there are two cases where low-latency, rapid-update communication has proven essential in robotics software. Most importantly, when information is flowing from high-resolution sensors such as cameras or depth cameras, Transmission Control Protocol/Internet Protocol (TCP/IP) transport often cannot provide the necessary data rates. The same is true, although to a lesser degree, in the case of data flow between controllers and robot actuators. Thus, in MALA, all communication is via TCP/IP except for that between low-level perception agents and sensors, and between low-level control agents and actuators. In this case, communication occurs in-memory. This aspect of information flow in MALA is depicted in Figure 3.2, in which data, depicted by arrows, flows directly from sensors to low-level perception agents, and directly from low-level motion controllers to robot actuators.

12The original MICA blackboard does provide a remote function call (Kadous and Sammut, 2004b); however, to enforce loose coupling between agents this functionality has been removed from MALA.
13More recent versions of ROS support in-memory communication in the form of nodelets.

Figure 3.2: Information flow within MALA. Smaller boxes indicate agents or groups of agents, while arrows indicate posting and listening to frames, mediated by the blackboard. Abstraction increases as information flows up the stack of perceptual agents on the left, and decreases again as task-level action plans are made operational by the action generation agents on the right. Figure from (Haber and Sammut, 2013).
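To make the listen/post cycle described above concrete, the following Java sketch shows agents registering listen requests for a frame type and being activated when a matching frame is posted, with the result of one agent triggering the next. The Agent and Blackboard interfaces here are invented for illustration and only stand in for the blackboard mechanism; they are not MALA code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the publish/subscribe pattern between agents and a blackboard.
// An agent submits a listen request for a frame type; when an instance of that
// type is posted, the blackboard activates the agent, and the agent's result is
// posted back, possibly activating further agents (a processing cascade).
interface Agent {
    // Returns a new frame to post, or null if nothing is produced.
    String onFrame(String frameType, String frameText);
}

class Blackboard {
    private final Map<String, List<Agent>> listeners = new HashMap<>();

    void listen(String frameType, Agent agent) {
        listeners.computeIfAbsent(frameType, k -> new ArrayList<>()).add(agent);
    }

    void post(String frameType, String frameText) {
        System.out.println("posted " + frameType + ": " + frameText);
        for (Agent a : listeners.getOrDefault(frameType, List.of())) {
            String result = a.onFrame(frameType, frameText);
            if (result != null) {
                // The result frame may in turn trigger other listeners: the cascade.
                post(resultTypeOf(result), result);
            }
        }
    }

    private String resultTypeOf(String frameText) {
        // Illustrative convention: the frame type is the first token of the text.
        return frameText.split(" ")[0];
    }
}

public class CascadeDemo {
    public static void main(String[] args) {
        Blackboard bb = new Blackboard();
        // A goal generator listens for unexplored locations and posts a goal frame.
        bb.listen("UnexploredLocation", (type, text) -> "Goal near(robot, loc1)");
        // A planner listens for goals; in this sketch it produces no further frames.
        bb.listen("Goal", (type, text) -> null);
        bb.post("UnexploredLocation", "UnexploredLocation loc1");
    }
}

Note that, as in MALA, the two agents in this sketch never refer to each other directly; replacing the planner with a different listener for Goal frames would not affect the goal generator.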


3.3.3 Agent Interactions

MALA not only specifies a mechanism for encapsulating processing into agents, but also provides a structure in which agents may work together to control an autonomous robot. This section describes this functional arrangement of agents within MALA in more detail. Agents in a MALA system may be broadly classified as low- or high-level perception, goal-generation, and low- or high-level action agents. The memory functions of the architecture are fulfilled by both the store of frames on the blackboard and the internal contents of individual agents. The collective action of the perception agents from low to high level results in increasingly abstract representations on the blackboard. Similarly, action generation agents convert task-level plans into motor commands. It is also worth pointing out that the class of motion controller agents is intended to be broad enough to support any form of robot actuation, including, for example, the production of speech via a ‘text-to-speech’ module. The agent hierarchy shown in Figure 3.2, constructed by the interactions between the various types of agents, represents one of the theoretical commitments of the architecture. This hierarchy is related to Nilsson’s (2001) triple-tower architecture. Although the types of agents and their interactions are constant, the specific agents that populate each group depend on the domain to which the system is being applied. Another application-specific concern is the number of blackboards, which may increase with the complexity of the system, to reduce unnecessary communication between unrelated agents, and to provide the possibility of specialised representations being used amongst groups of agents.

3.4 Software Infrastructure

MALA is designed to provide a language-neutral infrastructure to enable the diverse set of formalisms and algorithms needed for intelligent robots to work together.


Algorithm 1 Event-driven processing of a MALA agent.

AGENT_MAIN
Input: Agent script s and queue incoming_msgs
1: while incoming_msgs ≠ {} do
2:    RosjavaMessage msg = incoming_msgs.poll()
3:    String fstring = msg.toString()
4:    Frame f = eval(fstring)    ▷ eval causes the FRS interpreter to evaluate the FRS code in the string argument
5:    Frame out = s.apply(f)
6:    postToBlackboard(out)
7: end while

3.4.1 The MALA Agent

The processing of an agent runs in FRS in an event-driven manner, triggered by incoming frames. Pseudocode for the main algorithm of a MALA agent is given in Algorithm 1. Each agent is given a script, which is an FRS program that it applies to all incoming frames, producing a result that is then posted to the blackboard. All algorithms required by the agent are called from FRS during execution of the script. As FRS is written in Java, calling pure Java code is a simple matter. If the agent requires functions from other languages, MALA provides several interfaces (see Table B.1 in Appendix B.2) to call them. Agents can run and register with blackboards either remotely or locally, using the communication infrastructure described in the next section. Importantly, communication in MALA can easily be changed at runtime. This facilitates the online introduction of new agents, or even the ability to optimise the communication graph structure between the set of agents in a MALA instance. All MALA agents run as independent Java processes (within their own JVM), with the exception of any low-level agents that interact directly with sensors or actuators. These agents run in their own thread (rather than process), within the same process as the jmeSim simulator, providing them with in-memory access to the sensors and actuators of the simulated robot.


Communication via rosjava

Low-level communication capability is provided by the Robot Operating System (ROS) (Quigley et al., 2009). ROS is based on publish/subscribe message passing between a distributed network of nodes. Each node is an independent thread of computation. This model naturally supports the event-driven communication paradigm in MALA. ROS itself is language neutral, which allows agents to be written in a variety of languages. However, MALA requires that all agents run as Java processes, to enable the use of FRS as a global communication mechanism. MALA is built on top of the ROS ecosystem using the rosjava client library (Stonier, 2014). Rosjava provides mechanisms to interact with the ROS distributed computing environment in pure Java. Each agent or blackboard launches a rosjava node that connects with the rosmaster process. The rosmaster provides a lookup mechanism to allow processes to find each other at runtime. To facilitate message passing, MALA uses the standard ROS mechanism for sending messages containing various data types over TCP/IP. MALA uses ROS messages to transmit frames expressed as FRS strings. Frames are transported from agents to blackboards as strings, and then evaluated by the FRS interpreter on receipt.
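As an illustration of how frame text might be carried over ROS topics, the sketch below uses the rosjava node API to publish and receive FRS strings as std_msgs/String messages. The topic name, node name, and frame content are invented for the example; the actual transport details of the MALA implementation may differ.

import org.ros.namespace.GraphName;
import org.ros.node.AbstractNodeMain;
import org.ros.node.ConnectedNode;
import org.ros.node.topic.Publisher;
import org.ros.node.topic.Subscriber;

// Sketch of a rosjava node that ships FRS frames as plain strings.
// One side publishes frame text to a topic; the other subscribes to the same
// topic and would hand each received string to the FRS interpreter.
public class FrameTransportNode extends AbstractNodeMain {

    @Override
    public GraphName getDefaultNodeName() {
        return GraphName.of("mala/frame_transport_example");  // illustrative name
    }

    @Override
    public void onStart(ConnectedNode connectedNode) {
        // Publisher side: an agent posting a frame.
        final Publisher<std_msgs.String> publisher =
                connectedNode.newPublisher("mala/frames", std_msgs.String._TYPE);

        // Subscriber side: a blackboard receiving frames.
        Subscriber<std_msgs.String> subscriber =
                connectedNode.newSubscriber("mala/frames", std_msgs.String._TYPE);
        subscriber.addMessageListener(message -> {
            String frameText = message.getData();
            // In MALA this string would be passed to the FRS interpreter (eval).
            System.out.println("received frame: " + frameText);
        });

        // Publish one example frame string (in practice, publishing would wait
        // until the connection is established).
        std_msgs.String msg = publisher.newMessage();
        msg.setData("VictimDetection54 isa VictimDetection with location: [11.55, 1.20, 15.11];");
        publisher.publish(msg);
    }
}

Such a node would be launched with a rosjava NodeMainExecutor, which registers it with the rosmaster so that publishers and subscribers can find each other at runtime.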

MALA Global Node

A running MALA system also includes a global monitor process, called the global node, that performs several functions required to coordinate agent processing, such as start-up and shutdown of the architecture. The global node is the start process that spawns all the agent and blackboard processes. The set of agents and the communication structure required for a given instance of the architecture are specified in an FRS script. When the architecture is started, the global node executes the script. Script execution calls functions that initialise and start agent and blackboard processes, providing each with the required connection parameters to correctly register with each other and subscribe to the appropriate frames. In addition to publish/subscribe connectivity via FRS frames packaged in ROS

messages, ROS provides a shared dictionary, accessible via network APIs, called the Parameter Server. MALA uses the parameter server to give all agents access to the complete set of frames in use. On startup, the global node posts the definitions and inheritance hierarchy of frames. This enables the complete set of agents and blackboards to refer to a common frame database.

3.5 Comparison of MALA to Other Architectures

We can summarise MALA’s relationship to previous architectures in four main points:

• Like other node-based architectures, MALA differs from traditional cognitive architectures by structuring processing into autonomous computational agents. This lets it support specialised representations and, in particular, specialised forms of learning, such as for perceptual recognition or low-level motor control.

• Unlike Distributed Integrated Affect, Reflection, Cognition (DIARC), Teleo-Reactive Executive (T-REX) and Motor Affective Cognitive Scaffolding for the iCub Robot (MACSi), but like CAS, MALA restricts information to flow through blackboards. This facilitates the inclusion of architecture-wide introspective learning mechanisms, because monitoring the blackboard provides access to the system-wide flow of information.

• In contrast to other node-based architectures, including CAS, DIARC, and T-REX, which use ad hoc methods for generating robot behaviour from task-level action models, the combination of constraint solving and motion planning for action generation in MALA provides a method of generating behaviour online that is general enough to let a robot execute new task-level actions as they are acquired.


• MALA emphasises the idea of hybrid learning, in which learning systems at various levels of abstraction interact with and constrain each other. This enables the complete system to iteratively refine both control- and task-level knowledge. In contrast, other architectures either do not include learning explicitly at all, such as T-REX, or include learning modules as separate, distinct components that do not explicitly interact with each other. This is the case for learning in MACSi, CAS and DIARC.

Through its node-based design, loosely coupled communication structure, novel action generation pipeline, and commitment to hybrid learning, MALA is able to support a richer repertoire of learning behaviour than other integrated robot architectures.

3.6 Domain Specific Configuration: Rescue Robot

The operation of MALA can best be described with reference to an example. The example implements a control system for a simulated autonomous mobile robot performing Urban Search and Rescue (USAR). This task requires exploration within a complex environment, multi-objective search, challenging perceptual tasks, and obstacle avoidance. We perform experiments in a test arena closely modelled on the RoboCup Rescue competition. As is the case throughout this thesis, experiments are conducted using a simulation environment called jmeSim, introduced by Haber et al. (2012) and detailed in Appendix A, which is built upon the open source jMonkeyEngine3 game engine (Vainer, 2013).

3.6.1 Robot and Sensors

As introduced in Section 1.5.1, ‘Emu’ is a four-wheel-drive robot that was built for autonomous operation in RoboCup Rescue (Sheh et al., 2009). It is capable of traversing a variety of terrains, from flat flooring to low rubble. The simulated


Emu, along with its physical counterpart, is shown in Figure 3.3. Emu’s motors are arranged in a skid-steering configuration, which allows the robot to pivot on the spot by driving the left and right pairs of wheels in opposite directions, and provides a relatively manoeuvrable mobile platform for exploration. The physical Emu robot is equipped with an XSens MTx heading-attitude sensor and a Hokuyo UTM-30LX laser range finder on a self-levelling mount. The laser is a particularly important sensor, as its output is used for position tracking and SLAM. A Flir ThermoVision A10 thermal camera provides a 160 × 128 pixel heat image, which is used for victim identification. Colour and depth data are produced by an Asus Xtion RGB-D camera. The colour, thermal, and depth cameras are arranged vertically on a movable neck with both a “shoulder” pan-tilt and a “neck” pan-tilt joint, for a total of four degrees of freedom. Emu has competed in several of the most recent RoboCup Rescue competitions (Milstein et al., 2011; Sheh et al., 2009). The outputs from Emu’s simulated colour, depth, and thermal cameras can be seen in the right panels of Figure 3.4. Each of these sensors is simulated in the jmeSim robot software package, described in detail by Haber et al. (2012). The robot control architecture uses only the data from the simulated sensors to perceive the simulated environment.

3.6.2 Agents for Rescue

The robot is tasked with exploring a complex, relatively unstructured, but largely static, disaster site (Figure 3.4, left). The environment consists of a maze-like arrangement of corridors. The robot must contend with various obstacles in order to find several human-shaped “victims” scattered through the arena. The goal of the robot is to navigate through the arena, exploring the maximum amount of territory, finding the maximum number of victims, and returning to the start point within a fixed time limit. We now detail the representations, inputs, and outputs of each type of agent in the architecture, first describing the function that the architecture commits it to, and then using the specific agents required for the autonomous rescue control architecture as concrete examples.


Figure 3.3: Simulation of the ‘Emu’ autonomous rescue robot. Left: The physical Emu, with three sensors (colour, depth, and thermal cameras) arranged vertically on the movable ‘neck’, and the laser scanner visible on the smaller pan-tilt unit below and to the right. Right: The simulation of Emu produced using jmeSim (Haber et al., 2012), used for the experiments described in Section 3.7. The locations of the simulated sensors are marked with coloured circles. Magenta: depth camera; cyan: colour camera; yellow: thermal camera; red: laser scanner.

Low-Level Perception Agents

Each sensor provides rapidly updating frames containing raw sensor data. As described earlier, low-level agents have in-memory access to these frames to facilitate rapid transfer of large sensor frames such as point-clouds. As discussed in the introduction, an essential aspect of intelligent robotics is to process multi-modal sensor data asynchronously to obtain information about the environment. Thus, in MALA, each low-level perceptual agent performs a specialised computation. In the implemented system, there are several such agents.


Figure 3.4: Rescue robot simulation produced using jmeSim (Haber et al., 2012). Left: The simulated Emu robot during exploration of the rescue arena, at the moment a victim has just come into view. Right: Descending from top: simulated colour, depth, and thermal camera images showing the robot’s view of the victim.

A SLAM agent operates on laser data to produce a map of the environment as the robot explores. This agent incorporates a modified Iterative Closest Point (ICP) algorithm, called Metric-based Iterative Closest Point (MBICP), for position tracking and an implementation of the FastSLAM algorithm for mapping (Milstein et al., 2011). The output of the FastSLAM algorithm is a probabilistic occupancy grid map of the environment. The SLAM agent posts frames containing the latest map as it becomes available14.

14For extended autonomous missions, SLAM maps can become very large. Although MALA commits to communicating all information (with the exception of raw sensor data) through the blackboard, for problem domains where the map is likely to become extremely large it may be more suitable to provide in-memory access to the SLAM map where possible. While enabling more performative access to this large data structure, this limits the extent to which the system may be distributed and modular. This tradeoff is non-trivial, and MALA falls on the side of more modular system design.
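The occupancy grid representation that the SLAM agent posts can be illustrated with a short sketch. This is not the MBICP/FastSLAM implementation described above; it only shows a standard log-odds cell update for a single range measurement, with illustrative parameter values.

// Sketch of a probabilistic occupancy grid: cells along a laser beam are made
// more likely to be free, and the cell at the beam endpoint more likely to be
// occupied, using additive log-odds updates.
public class OccupancyGrid {
    private final double[][] logOdds;           // log-odds of occupancy per cell
    private final double cellSize;              // metres per cell

    private static final double L_OCC = 0.85;   // update for the cell hit by a beam
    private static final double L_FREE = -0.4;  // update for cells the beam passed through

    public OccupancyGrid(int width, int height, double cellSize) {
        this.logOdds = new double[width][height];
        this.cellSize = cellSize;
    }

    // Update the grid with a single beam from (rx, ry) that hit an obstacle at (hx, hy).
    public void addBeam(double rx, double ry, double hx, double hy) {
        int steps = (int) Math.ceil(Math.hypot(hx - rx, hy - ry) / cellSize);
        for (int i = 0; i < steps; i++) {
            double t = (double) i / steps;
            update(rx + t * (hx - rx), ry + t * (hy - ry), L_FREE);  // free space along the ray
        }
        update(hx, hy, L_OCC);                                       // occupied endpoint
    }

    private void update(double x, double y, double delta) {
        int i = (int) (x / cellSize);
        int j = (int) (y / cellSize);
        if (i >= 0 && i < logOdds.length && j >= 0 && j < logOdds[0].length) {
            logOdds[i][j] += delta;
        }
    }

    // Probability that a cell is occupied.
    public double occupancy(int i, int j) {
        return 1.0 - 1.0 / (1.0 + Math.exp(logOdds[i][j]));
    }
}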


The main object of the robot’s search is to locate several disaster victims and mark them in its map. In the rescue arena, victims are represented by simple manikins with skin-coloured heads and arms. Since the robot’s task is to identify and visit victim locations, the implemented architecture incorporates an agent for victim finding that detects the colour and warmth of human skin. The victim detection agent scans the colour camera image for skin coloured pixels based on an adaptive Hue thresholding algorithm (Bradski and Kaehler, 2008), and matches them to pixels that have approximately human skin temperature in the thermal camera image. For each pixel in the heat image, a simple absolute difference from a reference temperature determines skin temperature15. Connected-components blob extraction (Samet and Tamminen, 1988) is then used to identify possible detections. The agent then uses information from the depth camera to determine the location of the detected victim, and posts a frame containing this information to the blackboard.
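The following is a simplified Java sketch of this detection step: pixels that are both skin-coloured (hue within a band) and near human skin temperature are marked as candidates, and candidates are grouped into blobs by a flood fill. The thresholds and method names are illustrative assumptions; the implemented agent uses adaptive hue thresholding (OpenCV) and the depth camera to localise each detection, as described above.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of victim detection: combine a colour cue (hue band) with a thermal cue
// (temperature near skin temperature), then extract connected blobs of candidates.
public class VictimDetector {
    static final double SKIN_TEMP = 34.0;            // reference skin temperature, deg C (illustrative)
    static final double TEMP_TOL = 3.0;
    static final float HUE_MIN = 0f, HUE_MAX = 35f;  // rough skin hue band (illustrative)

    // hue and temperature are per-pixel images of equal size.
    public static boolean[][] candidateMask(float[][] hue, double[][] temperature) {
        int h = hue.length, w = hue[0].length;
        boolean[][] mask = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                mask[y][x] = hue[y][x] >= HUE_MIN && hue[y][x] <= HUE_MAX
                        && Math.abs(temperature[y][x] - SKIN_TEMP) <= TEMP_TOL;
        return mask;
    }

    // Size of the blob containing (sy, sx); the blob is cleared from the mask
    // (4-connected flood fill), so repeated calls enumerate the detections.
    public static int extractBlob(boolean[][] mask, int sy, int sx) {
        if (!mask[sy][sx]) return 0;
        int size = 0;
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[]{sy, sx});
        mask[sy][sx] = false;
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            size++;
            int[][] neighbours = {{p[0] + 1, p[1]}, {p[0] - 1, p[1]}, {p[0], p[1] + 1}, {p[0], p[1] - 1}};
            for (int[] n : neighbours) {
                if (n[0] >= 0 && n[0] < mask.length && n[1] >= 0 && n[1] < mask[0].length
                        && mask[n[0]][n[1]]) {
                    mask[n[0]][n[1]] = false;
                    stack.push(n);
                }
            }
        }
        return size;
    }
}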

High-Level Perception Agents

When the results of low-level perception computations are posted to the blackboard, they may trigger further processing by higher-level perceptual agents, such as classification, recognition, or abstraction into symbolic representations. One high-level perception agent in the implemented system is a victim assessment agent that stores a database of known victim locations and compares each new victim detection frame with this database to determine if it is a previously seen victim. The other high-level perception agent is the obstacle recognition agent. In the present implementation, this agent retrieves the identity and location of obstacles directly from the simulation whenever obstacles are in view. This is not a significant simplification of the problem, since software components to recognise objects from point cloud data have been developed independently of this work (Farid and Sammut, 2012).

The highest-level perception agents build an abstract state description of complex properties and relationships that exist between objects in the world, expressed in first-order logic. The exact poses of individual objects are not represented explicitly in this manner. For example, an abstract state defined by on(book,table) describes many different primitive states in which the book is in different positions on the table. In the rescue system, a simple topological mapper defines regions of space and labels them as explored or unexplored. This produces an abstract description of the robot’s spatial surroundings.

15Note that on the real robot, a more sophisticated victim identification method, described by Wiley et al. (2012), is used.

Goal Generation

Goal generator agents produce desired world states that, if achieved, would result in progress towards the system’s global objective. The goal generation agent uses a simple set of rules for generating first-order world states that the agent should seek to make true, and places frames containing these predicates on the blackboard. The world state is expressed in the Planning Domain Definition Language (PDDL) (McDermott et al., 1998); thus, the relational representation of the world state is not as expressive as full first-order logic, but is a restricted subset based, for the most part, on sentences in function-free first-order predicate logic. Further details on the expressiveness of PDDL are given by McDermott et al. (1998). More complex planning problems are attempted in Chapters 4 and 5, and more detail is given on PDDL descriptions of the robot’s tasks (see Section 4.3.3).
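A goal generation rule of this kind can be sketched very simply: when an unexplored region is reported and no goal is pending, produce a first-order goal literal for the planner. The class and method names are invented for illustration; the predicate string follows the examples used elsewhere in this chapter.

import java.util.Optional;

// Sketch of a goal generation rule: an unexplored location produces a goal
// literal, provided no exploration goal is currently pending.
public class ExploreGoalGenerator {
    private boolean goalPending = false;

    // location is a symbolic name such as "explore_goal3".
    public Optional<String> onUnexploredLocation(String location) {
        if (goalPending) {
            return Optional.empty();    // one exploration goal at a time
        }
        goalPending = true;
        return Optional.of("near(robot, " + location + ")");
    }

    public void onGoalAchievedOrAbandoned() {
        goalPending = false;
    }
}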

Action Planner

The first-order world state constructed by the perception agents, together with the current goal constructed by the goal generation agents, provides the input to an Action Planner agent. This agent is equipped with a collection of task-level action models, which may either be provided a priori or learned through the robot’s experience. The planner attempts to produce a sequence of actions that the robot can then attempt to

execute. The action planner in the rescue control system uses an implementation of the Fast Forward planner (Hoffmann and Nebel, 2001), which places action frames on the blackboard when it has constructed a plan.

Constraint Solving and Motion Planning Agents

This thesis argues that the integration of symbolic and sub-symbolic processing is an important feature to have in a robot architecture. This functionality requires a mechanism to map logical to metric representations. MALA achieves this by adapting a method introduced by Brown (2011), who showed that actions can be made operational using a combination of constraint solving and motion planning. A constraint solver agent enables each task-level action to be executed by the robot. This is achieved by treating the effects of the action as constraints for a constraint satisfaction problem (CSP). Each literal in the effects of an action provides a spatial constraint on the pose of the robot and other objects. Finding a pose that achieves the action is equivalent to solving the CSP specified by the conjunction of constraints. Motion planner agents then search for a trajectory through metric space that can achieve this pose. This trajectory is then followed by motion controller agents, which implement reactive control of actuation. It is worth noting that, because of the multiplicity of sensors and actuators that most robots possess, MALA specifies that there should be multiple agents, each encoding a specialised algorithm to process the data for each sensor or actuator. In later chapters, this is made concrete as MALA is applied to robots with multiple modes of actuation.

The constraint solver agent, implemented using the constraint logic programming language ECLiPSe (Apt and Wallace, 2006), accepts an action frame as input. The constraints are extracted from the spatial subgoals in the effects slot of the action frame. Each logical literal in the PDDL effects of an action adds a constraint to a CSP in ECLiPSe. For example, the move(A,B) action has the effect near(A,B). Algorithm 2, below, shows the definition of the near(A,B) constraint in ECLiPSe Prolog.


Algorithm 2 Definition of spatial constraints in ECLiPSe Prolog.

NEAR CONSTRAINT
Input: World state State, objects ObjA and ObjB.
1: near(ObjA, ObjB, State) :-
2:    pose(ObjA, PoseA, State),
3:    pose(ObjB, PoseB, State),
4:    distance(PoseA, PoseB, Dist),
5:    Dist $< 3,
6:    Dist $> 0.75.

DISTANCE BETWEEN TWO POSES
Input: Poses Pose1 and Pose2
1: distance(Pose1, Pose2, Dist) :-
2:    nth1(1, Pose1, X1),
3:    nth1(2, Pose1, Y1),
4:    nth1(1, Pose2, X2),
5:    nth1(2, Pose2, Y2),
6:    Dist $= sqrt((X2 − X1) ∗ (X2 − X1) + (Y2 − Y1) ∗ (Y2 − Y1)).

EXTRACT OBJECT POSE FROM STATE
Input: Object Obj and state State    ▷ each object is a Prolog struct with two fields, and State is a list of objects
1: pose(Obj, Pose, State) :-
2:    for each Elem ∈ State do
3:       param(Obj, Pose)
4:       arg(name of object, Elem, Name),
5:       arg(pose of object, Elem, Pose_temp),
6:       (Name = Obj → Pose = Pose_temp).
7:    end for

Of particular note are the constraint operators, notated using $, that enforce constraints on various floating-point values. For example, Dist $< 3 enforces a constraint that the variable Dist must be bound to a value less than 3. In general, the arithmetic comparisons {<, ≤, =, ≥, >} are replaced by the corresponding arithmetic constraints {$<, $≤, $=, $≥, $>}. The logical conjunction of constraints is denoted by $&. When ECLiPSe encounters these primitive constraints as it evaluates a predicate, it applies the relevant constraint to the numeric variables involved. In our predicates the variables represent, or are directly linked to, the poses of objects. ECLiPSe propagates these constraints so that constraints applied to one object are also applied to related objects.

Thus, this definition of near(A,B) constrains the locations of objects A and B to be within a maximum distance of 3 m and a minimum distance of 0.75 m. In the next section, an example of the use of this constraint to construct motion goals for the robot to explore unknown territory is given. Each literal in the effects of an action specifies a constraint on the final poses of the objects involved. The constraint solver produces a metric pose that satisfies the spatial constraints. This is placed in a MotionGoal frame that contains a metric goal pose for the robot or its actuators, and that can be input to a motion planner. The motion planner implemented in the current system uses an A* search to find drivable trajectories that reach the goals specified by metric goal frames; these are posted to the blackboard as trajectory frames that contain a list of waypoints. The drive controller agent provides reactive obstacle avoidance using the sensor data from the laser range finder. It is an implementation of the “Almost-Voronoi” controller described by Sheh et al. (2009).
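The meaning of the near(A,B) constraint can also be illustrated procedurally: the sketch below samples candidate robot positions around a target and keeps one whose distance falls in the 0.75–3 m band, which would then serve as the metric goal handed to the motion planner. This is only an illustration of what the constraint admits, not the ECLiPSe solver, which finds such poses by constraint propagation rather than sampling.

import java.util.Optional;
import java.util.Random;

// Sketch: find a metric goal pose satisfying near(robot, target), i.e. a
// distance between 0.75 m and 3 m from the target, by rejection sampling.
public class NearConstraintSampler {
    private static final double MIN_DIST = 0.75;
    private static final double MAX_DIST = 3.0;
    private final Random rng = new Random();

    public Optional<double[]> sampleGoal(double targetX, double targetY, int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            // Sample a position uniformly in a square around the target.
            double x = targetX + (rng.nextDouble() * 2 - 1) * MAX_DIST;
            double y = targetY + (rng.nextDouble() * 2 - 1) * MAX_DIST;
            double dist = Math.hypot(x - targetX, y - targetY);
            if (dist > MIN_DIST && dist < MAX_DIST) {
                return Optional.of(new double[]{x, y});  // candidate metric goal
            }
        }
        return Optional.empty();  // no satisfying pose found within the budget
    }
}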

3.6.3 Frames for Rescue

Section 3.3.2 introduced the communication packets used in MALA, and specified that all information sent between agents takes the form of frames. This section examines the types of frames present in the rescue robot implementation of MALA, with the aim of providing a more concrete understanding of their use. The inheritance hierarchy of generic frames, shown in Figure 3.5, is defined by the ako (a kind of) relation. If a generic frame inherits from another, then all of its slots, their associated daemons, and default values are inherited as well. The base generic frame from which all others inherit is the Object frame. The ako relation is implemented in FRS using the ako keyword, indicating that the generic is ‘a-kind-of’ its parent frame. Slots are defined using the with keyword, instructing the FRS interpreter to expect a list of slots. Instance frames are built using the isa keyword, indicating that the frame ‘is-an’ instance of a particular generic. Recall that generic frames define slots, which may have default

values. These values are implemented in FRS using the default keyword. The multivalued keyword indicates that the slot must contain a list of values.

Frames in the architecture fall into three broad categories. First, perception frames, which inherit from the Percept generic, carry sensory information. An example of a perception frame encoding data about the detection of a victim is shown in Figure 3.6. The frame contains all relevant properties of the detection event, such as the spatial location and the size of the detection in the camera sensor image. As mentioned earlier, frames such as this are listened to by other agents (in this case, the Victim Assess agent) that use the data to construct further frames. A second type of frame is the plan frame, which encodes symbolic action models to enable the robot to sequence its actions to achieve goals. MALA uses PDDL to encode these action models. The contents of PDDL frames are shown in Figure 3.7, and again in Section 4.3.3 in the next chapter, where more complex planning problems are attempted.

Figure 3.5: Abridged inheritance hierarchy of generic frames used for the MALA rescue robot control architecture. Arrows depict the ako – ‘a-kind-of’ – inheritance relation. Due to space constraints, several frames, such as SLAM map and PDDL frames, are not shown.


Message ako Object with
    name:
    ID:
    Age: default 0;

/*A VictimDetection generic frame*/
VictimDetection ako ObjectDetection with
    location: default [] multivalued true
    screenLocation: default [] multivalued true
    ...
    nVictims: ;

/*A VictimDetection instance frame*/
VictimDetection54 isa VictimDetection with
    location: [11.55, 1.20, 15.11]
    screenLocation: [128, 228]
    width: 98
    height: 98
    index: 3
    nVictims: 5
    name: VictimDetection54
    ID: VD54
    Age: 19.43;

Figure 3.6: FRS definitions of sensory frames in the MALA rescue robot control architecture. Top: The generic Message frame that provides global slots to identify all frames. Centre: The generic VictimDetection frame. In the interest of brevity, not all empty slots are shown. Bottom: A VictimDetection instance frame that inherits from ObjectDetection (Figure 3.5 on page 96). Note that the name, ID, and age slots are inherited from Message. The VictimDetection frame provides details of where the victim was detected, and how large the detection appeared in the robot’s camera (in pixels).

PDDL frames provide a mechanism for communicating either world states or plans of action between agents. The definition of each PDDL action model may either be provided a priori or learned through the robot’s experience. Systems of both these types are described in Chapters 4 and 5 respectively. The final type of frame is the motor frame. These specify the commands that the robot must send to its actuators in order to execute the desired action. In the current implementation, actions consist only of driving to metric goals; however, in later chapters, actions involve manipulation and active sensing. A MotionPlan frame, which provides an example of a motor frame, is shown in Figure 3.8.


/*A PDDL generic frame*/
PDDL ako Message with
    objects: default [] multivalued true
    init: default [] multivalued true
    goal: default [] multivalued true;

/*A PDDL instance frame*/
PDDL1803 isa PDDL with
    objects: [robot, explore_goal3, victim1]
    init: [near(robot, explore_goal3),
           location(explore_goal3, [5.31, 0, 12.62]),
           location(robot, [4.92, 0.41, 13.39])]
    goal: [near(robot, victim1),
           location(victim1, [11.55, 1.22, 15.11])];

Figure 3.7: FRS definitions of PDDL plan frames in the MALA rescue robot control architecture. Top: The generic PDDL frame that inherits from Message. Bottom: A PDDL instance frame. Not all slots are shown, for brevity. The PDDL frame provides the definition of a planning problem to be solved by the Action Planner agent.

This frame contains a trajectory produced by the Motion Planner agent. The frame inherits from the MotionPlan generic and is listened for by the Motion Controller agent, which uses reactive control to follow the series of poses given in the waypoints slot. This has the effect of driving the robot from the start pose to the goal.

3.6.4 System Operation

It is now possible to illustrate how these agents interact to perceive the world and produce goal-directed behaviour. Consider the flow of blackboard frames involved in trying to reach a location for exploration. Figure 3.9 shows the succession of objects posted to the blackboard by the agents during this scenario.

1. Placed at the arena entrance, the robot begins its exploration of an unknown environment. An agent that controls a scanning laser range finder writes a frame on the blackboard that contains range data.


/*A MotionPlan generic frame*/
MotionPlan ako LowLevelAction with
    waypoints: default [] multivalued true
    start: default [] multivalued true
    goal: default [] multivalued true;

/*A MotionPlan instance frame*/
MotionPlan22 isa LowLevelAction with
    waypoints: [[5.78, 0, 13.10], [6.75, 0, 13.25], [7.00, 0, 13.25],
                [7.25, 0, 13.50], [7.50, 0, 13.50], [7.75, 0, 13.50],
                [8.00, 0, 13.50], ... [10.25, 0, 14.25], [10.79, 0, 14.43]]
    start: [5.78, 0, 13.10]
    goal: [10.79, 0, 14.43]
    name: MotionPlan22
    ID: MP22
    Age: 4.21;

Figure 3.8: FRS definitions of motor frames in the MALA rescue robot control architecture. Top: The generic MotionPlan frame that inherits from LowLevelAction (Figure 3.5 on page 96). Bottom: A MotionPlan instance frame. Note that the name, ID, and age slots are inherited from Message. The MotionPlan frame provides a series of waypoints for the Motion Controller agent to follow to move the robot from a start location to a goal location.

2. This frame triggers the SLAM agent, a listener for laser scans, to update its occupancy grid map of the corridor, which it posts to the blackboard.

3. A topological mapper listens for the occupancy grid map and splits the metric spatial data into regions, each of which it labels as unexplored, such as region loc1 in the figure. The topological map agent posts an unexplored location frame to the blackboard.

4. The explore goal generation agent listens for unexplored location frames and, in response, generates a goal frame at(robot,loc1).

5. The motion planner agent listens for goal frames and occupancy grid maps; when both appear on the blackboard, it determines that a path to the goal is possible.

Figure 3.9: Agents communicate by posting frames to a blackboard. The process builds an abstract spatial model from sensor data, generates a goal, and constructs a plan to achieve it. Task-level actions are translated to motor actions via a constraint solver that provides parameters to a path planner and reactive controller. This process is described in detail in the text.

6. The task planner agent listens for goal frames, determines that the preconditions of move(robot, loc1) are satisfied (path_to_goal = true), and that its postconditions will achieve the current goal at(robot,loc1). As a result, it places move(robot, loc1) on the blackboard.

7. The constraint solver agent listens for action frames and converts the action’s effects predicates into constraints. After the constraints have been enforced by the solving process, it selects a pose within the resultant region of space and posts the pose as a metric goal frame to the blackboard.


8. Concurrently, an object recognition agent analyses depth-camera frames and posts the locations of any obstacles to the blackboard. The SLAM agent listens for obstacles and incorporates them into the map.

9. The motion planner listens for metric goal frames and, using the SLAM map, plans a path that avoids the obstacle, posting a trajectory frame containing a list of primitive waypoints.

10. Finally, the reactive drive controller agent accepts the trajectory and operates the robot’s motors to follow it.

Thus, through the asynchronous operation of the components of MALA, goal-directed autonomous robot control may be achieved. To demonstrate this fact in more concrete terms, the next section examines the performance of this implementation of MALA on an urban search and rescue mission.

3.7 Experimental Demonstration and Evaluation

The implemented system was tasked with exploring five environments, one of which is shown in Figure 3.10 (right). The five environments are randomly generated using a maze generation algorithm based on Prim’s algorithm (Wiener, 2011). Full details of environment generation are given in Algorithm 14 in Appendix A.1. Within each environment there are five victims in random locations. As mentioned earlier, these victims are the objects of the robot’s search. Along with the victims, there are also three obstacles that can trap the robot if it does not avoid them. These are grids of blocks of random heights, known as stepfields. These obstacles are intended to simulate the difficulty of driving over rubble or complex terrain. Each environment is 256 m² in area, and the robot has 15 minutes to explore as much as possible before it must return to the start or stop moving, to model the finite battery life of real robots. The goal of the robot is to produce an accurate map of the environment, labelling the location of victims and physically visiting as many of their locations as possible.
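To indicate the flavour of maze-based environment generation, a generic randomised-Prim maze generator is sketched below. It is an assumption-laden illustration only: the thesis’s actual procedure is Algorithm 14 in Appendix A.1, and the grid conventions, seed handling, and cell semantics here are invented. The generated corridors would still need victims and stepfield obstacles placed in them.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of maze generation with randomised Prim's algorithm on a grid where
// odd coordinates are rooms and the cells between them are walls that may be opened.
public class PrimMaze {
    public static boolean[][] generate(int width, int height, long seed) {
        boolean[][] open = new boolean[height][width];   // true = free space
        Random rng = new Random(seed);
        List<int[]> frontier = new ArrayList<>();        // entries: {y, x, wallY, wallX}

        int sy = 1, sx = 1;
        open[sy][sx] = true;
        addFrontier(sy, sx, width, height, frontier);

        while (!frontier.isEmpty()) {
            int[] f = frontier.remove(rng.nextInt(frontier.size()));
            int y = f[0], x = f[1];
            if (!open[y][x]) {
                open[y][x] = true;        // open the new room
                open[f[2]][f[3]] = true;  // and the wall joining it to the maze
                addFrontier(y, x, width, height, frontier);
            }
        }
        return open;
    }

    // Add the (up to four) rooms two cells away, remembering the wall in between.
    private static void addFrontier(int y, int x, int w, int h, List<int[]> frontier) {
        int[][] dirs = {{-2, 0}, {2, 0}, {0, -2}, {0, 2}};
        for (int[] d : dirs) {
            int ny = y + d[0], nx = x + d[1];
            if (ny > 0 && ny < h - 1 && nx > 0 && nx < w - 1) {
                frontier.add(new int[]{ny, nx, y + d[0] / 2, x + d[1] / 2});
            }
        }
    }
}

Calling generate(17, 17, seed) with an odd grid size yields a connected, loop-free arrangement of one-cell-wide corridors; varying the seed produces the different random environments.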


Figure 3.10: Trace of a demonstration run of the system on Map 1, with the robot beginning in the bottom left-hand corner. Left: The map generated and the robot’s path during the run. Right: Overhead view of the run that provides ground truth, with victims visible as blue and green figures, and with stepfield obstacles as lighter areas of the floor.

To assess the performance of the system, we record the number of victims visited and the fraction of the environment successfully explored and mapped. Table 3.1 shows the performance of the system on the five environments, while Figure 3.10 (left) shows the generated map and path of the robot during a single run. This demonstration of the architecture in operation highlights several of the characteristics essential for robot architectures identified in the introduction. First, asynchronous processing of agents enables a high degree of reactivity by allowing interrupts of different types to propagate to various levels of the architecture. At the lowest level of reactivity, in the event that laser data indicate that the robot is too close to a wall, the drive controller agent, which listens for laser data directly, can react rapidly to keep the robot away from obstacles. An example of higher-level interrupts occurs when the robot detects obstacles while navigating towards a goal and the system responds by replanning its route around the obstacle, as described in steps 8 and 9 of the process in Section 3.6.4. The highest level of interrupt is apparent on inspection of the trace of blackboard frames in Figure 3.11.


Map        Victims Found    Environment Mapped
1          5.0              94 %
2          5.0              86 %
3          4.0              81 %
4          3.0              84 %
5          5.0              70 %
Average    4.4              85 %

Table 3.1: Performance on the autonomous rescue task for five randomly generated environments, including the one shown in Figure 3.10.

While navigating towards an exploration goal, a victim is detected and a detection frame is placed on the blackboard. After the victim assessment agent determines that the victim is previously unknown, the goal generation agent places a goal frame on the blackboard, which causes the architecture to switch goals and attempt to reach the victim’s location. As noted earlier, the low-level perception agents in this implementation are specialised for interpreting camera frames and laser scans. The ability of the architecture to recognise victims and obstacles, as well as to explore and map the environment, demonstrates how these components can be integrated in a robot architecture. Autonomous robots require the ability to generate plans that solve novel and complex problems. This capability is particularly important for retaskability and introspection (Sammut, 2012). The operation of the action planner in constructing plans to achieve goals in the rescue environment demonstrates this ability, but, for robots to use symbolic reasoning, abstract plans must be made operational by translating high-level actions into motor primitives. In MALA, this is done by a novel combination of constraint solving and motion planning. This is particularly important if action models are acquired online, which is necessary for robots that learn to perform new tasks. The translation process is explicitly demonstrated in the trace of frames from a task-level action model to a metric goal representation shown in Figure 3.11. The results of this demonstration, summarised in Table 3.1,


/*VictimDetect executes, and writes to blackboard:*/
VictimDetection1204 isa [VictimDetection] with
    width: 98,
    height: 98,
    screen_y: 128,
    screen_x: 228,
    location: [11.5, 1.2, 15.1];

/*VictimAssess is a listener for [VictimDetection] frames. VictimAssess executes, and writes to blackboard:*/
VictimAssessment1430 isa [VictimAssessment] with
    ID: 1,
    VisitedAlready: false,
    Location: [11.5, 1.2, 15.1];

/*GoalGenerator is a listener for [VictimAssessment] frames. GoalGenerator executes, and writes to blackboard:*/ PDDL1803 isa [PDDL] with goal: [near(robot,victim1), location(victim1, [11.5, 1.2, 15.1])], init_state: [near(robot,explore_goal3), location(explore_goal3, [5.3, 0, 12.6]), location(robot, [4.9, 0.4, 13.3])];

/*Planner is a listener for [PDDL] frames. Planner executes, and writes to blackboard:*/ action8072 isa [Action] with actions: [move(robot,ExploreGoal3, victim1)], objects: [[explore_goal3, [5.3, 0, 12.6]], [victim1, [11.5, 1.2, 15.1]], [robot, [4.9, 0.4, 13.3]]], effects: near(robot, victim1);

/*Solver is a listener for [Action] frames. Solver executes, and writes to the blackboard:*/ MetricGoal5650 isa [MetricGoal] with goal: [10.8, 0, 14.4];

/*MotionPlanner is a listener for [MetricGoal] frames. MotionPlanner executes, and writes to the blackboard:*/ MotionPlan2251 isa [MotionPlan] with waypoints: [[5.6, 0, 13.1], [5.7, 0, 13.0], ..., [10.8, 0, 14.4]];

/* MotionController is a listener for [MotionPlan] frames.
   MotionController executes */ ...

Figure 3.11: Abridged trace of frames on the blackboard during the run shown in Figure 3.10; only relevant slot values are shown. The frames illustrate one mechanism for interrupts in the architecture: as the system navigates towards an exploration goal, a victim is detected and determined to be previously unknown. A goal is generated to reach the victim's location, and the system attempts to reach it.

illustrate that the control architecture constructed by these agents is able to explore the majority of an unknown rescue arena, and to identify and reach almost all victims, while avoiding obstacle hazards that can trap the robot.

However, during these demonstration runs, several limitations emerged. In particular, when an interrupt caused the switching of one goal for another, as described in the text above and detailed in Figure 3.11, the current system simply discards the older goal. This results in a somewhat greedy exploration strategy in which some areas of the map may be neglected, especially when all victims are located in a single region. This was the case for Map 5, which caused a lower than expected coverage of the rescue environment. This can be corrected by pushing the previous goal onto a stack if it remains consistent with the world model state after the interruption (a possible implementation is sketched below). A second problem the system faced was that, once a victim was reached, it was then marked as an obstacle to prevent the robot from driving over it. If the victim lies in the centre of a corridor, this can make areas of the map inaccessible to the robot. Although this meant that the robot was often unable to map the complete environment, it is a problem that is common to real rescue scenarios and does not indicate a flaw in the architecture's performance.
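As a rough illustration of the goal-stack fix suggested above, the following Python sketch shows one way an interrupted goal could be retained and resumed only while it remains consistent with the world model. The class names and the still_valid check are hypothetical assumptions for illustration; they are not part of the MALA implementation described in this chapter.

from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Goal:
    description: str                              # e.g. "near(robot, victim1)"
    urgency: int                                  # higher values interrupt lower ones
    # Hypothetical predicate: does this goal still make sense given the
    # current world model (victim not yet visited, location still valid)?
    still_valid: Callable[[dict], bool] = lambda world: True


@dataclass
class GoalStack:
    stack: List[Goal] = field(default_factory=list)

    def interrupt_with(self, new_goal: Goal) -> None:
        # Keep the interrupted goal on the stack instead of discarding it,
        # so that it can be resumed after the interrupting goal completes.
        self.stack.append(new_goal)
        self.stack.sort(key=lambda g: g.urgency)  # most urgent goal at the top

    def current(self) -> Optional[Goal]:
        return self.stack[-1] if self.stack else None

    def complete_current(self, world: dict) -> Optional[Goal]:
        # Pop the finished goal, then drop any stale goals that are no longer
        # consistent with the world model before resuming the remainder.
        if self.stack:
            self.stack.pop()
        while self.stack and not self.stack[-1].still_valid(world):
            self.stack.pop()
        return self.current()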

3.8 Chapter Summary

A central goal of cognitive systems research is to study systems that are the products of multiple interacting components. This stems from the recognition that the phenomenon of intelligence is not restricted to any one of its component abilities, but results from their interactions and the framework in which they are embedded (Langley, 2012). An autonomous mobile robot is a classic example of an agent composed of many interacting parts, and thus provides an excellent analogue for the study of integrative cognitive systems. However, despite many impressive successes and considerable progress, robot systems tend to be brittle and task specific. This has historically been caused by the difficulty of perception and actuation. However, it has also resulted from a lack of general reasoning

capabilities in robot architectures, which would enable more flexible retasking and greater possibilities for introspective monitoring and failure recovery. Although cognitive architectures can provide the integrated frameworks and reasoning capabilities to address the latter need, they are often difficult to implement directly on real robots, not least because they do not model control of actuators or perception. To achieve greater levels of autonomy, architectures for robots must merge reasoning and planning capabilities with specialised perception and actuation so they can obtain information about the world without convenient labelling of objects, manipulate nontrivial objects, and drive over difficult terrain.

We have outlined the characteristics that an architecture should possess to address the above issues. We claim that these requirements cannot be achieved without specialised representations, modularity, and loosely coupled, concurrent processing elements. This chapter presented the MALA architecture, which fuses symbolic processing at the level of abstract tasks, using a classical planner, with the sensing and actuation of a robot, and we demonstrated its successful operation in a challenging robotics domain. The system was successful in the USAR domain, able to perceive and detect victims and map the maze-like environment.

Although this implementation of MALA demonstrated successful operation in a challenging domain, the eventual goal of this research is to operate an autonomous robot in more realistic and complex environments over a long time. To achieve this, the system described in this chapter requires extension in two ways: by embedding multiple types of learning algorithms and by augmenting the system's capabilities through the addition of more agents. As previously argued, learning should be embedded within all components of the architecture to provide a robot with robust long-term autonomy. The features of the MALA framework enable it to accommodate several types of learning. The next chapter introduces learning in MALA by describing an application of control-level learning, and examines how this enables robot performance to improve.

Chapter 4

Control-level Learning in MALA

4.1 Chapter Overview

This chapter describes an implementation of a Multi Agent Learning Architecture (MALA) system that incorporates several components for control-level learning. As in the previous chapter, agents for perception, goal generation, task planning and execution enable an autonomous robot to negotiate its environment. In this case, however, the problem domain is made significantly richer, requiring the robot to generate and execute more complex plans, and in particular, to adapt its behaviour in order to cope with a much more challenging setting. The learning system consists of a high-level learner that uses the results of a low-level learner as training data. This allows the system to monitor its own learning behaviour with the aim of selecting learning episodes that are as safe as possible. Because the higher-level learner depends on what the low-level learner has learned, the high-level learner is a meta-learner.

The system was evaluated over the course of a simulated urban search and rescue mission incorporating several additional planning tasks based on the manipulation challenges that are being incorporated into the RoboCup Rescue Urban Search and Rescue (USAR) simulation for the 2015 competition. The system demonstrates successful completion of several difficult tasks in a realistic

environment, and shows a consistent improvement over the course of its missions. In a second experiment, comparison of the system's performance with and without learning demonstrates the benefit of the higher-level learner in providing circumspection in more challenging terrain. Finally, the features of MALA that allow it to incorporate these components to simultaneously explore, execute plans, and incrementally improve its performance through learning are discussed.

4.2 Learning in a Robot for USAR

In the simulated disaster environment described in Chapter 3, stepfields were present and, if not successfully avoided, could trap the wheeled robot. However, it was assumed that these never blocked any important access corridors. Thus, the entire arena could be traversed simply by recognising and avoiding these hazards. In this chapter, we relax this assumption, requiring the robot to negotiate rough terrain to fully explore the rescue arena.

Stepfields are difficult for wheeled robots to negotiate. One possible approach to this locomotion problem is to derive control equations from mathematical models of interactions between the robot and the terrain. This approach works well for simple terrain, but as complexity increases these models become intractably difficult to construct. In particular, if the terrain is unpredictable, such strong control knowledge cannot be given to the robot in advance. Clearly this is the case in real-world rescue situations, and it is also the case when traversing random stepfields.

To solve tasks in this problem domain, the presence of stepfields requires the robot to adapt its driving behaviour to the contours of the terrain so that it does not get stuck or tip over. Thus, the system must learn a policy to negotiate these difficult areas. However, a result of this attempt at learning may be that the robot either fails to cross the terrain, or in the worst case, becomes irrevocably stuck, terminating the attempt at rescue. Ideally, a flexible system would learn from experience that some terrain is not traversable at all and should be avoided. The robot must learn to drive and, at the same time, learn to recognise terrain that is too difficult to even attempt, and to use this information to plan routes that are as safe as possible. The system described in this chapter achieves this functionality through a meta-level learning agent called the Terrain Recogniser. This agent observes the results from the low-level learner to acquire a concept of traversability, which a motion planner can then use to avoid very challenging terrain.

The rescue scenario described in the previous chapter required the robot to explore a disaster site and search for victims. In this chapter, MALA is applied to a much richer version of the rescue task. In addition to exploration and mapping, the robot must also deliver water bottles to the victims when they are found, and it may also be required to shore up damaged ceilings if there is danger of a collapse. Repeated experiments are performed in a set of randomly generated rescue arenas. The robot's performance is scored to assess its ability to simultaneously explore the arena, locate victims, deliver water bottles to them, and perform a shoring task in which wooden blocks are placed beneath a damaged ceiling in a specific spatial configuration. The scoring metric used for these experiments is similar to that used in the RoboCup Rescue competition. Experimental results demonstrate that the robot is able to successfully complete these difficult planning and manipulation tasks. It is important to point out that these tasks are executed in the context of a relatively long-term autonomous exploration mission, and not in an isolated setting. As the learning agents accumulate training data, the performance of the robot steadily improves over the course of multiple missions in the environment.
This demonstrates the ability of the learning modules to aid the robot in adapting to very challenging conditions during mission execution. In a second experiment, both the mean and variance of the height of the blocks comprising the stepfields are altered, providing a set of environments of variable difficulty. When the robot attempts the rescue task in the presence of extremely difficult obstacles, the benefit of the Terrain Recogniser agent is most pronounced. That is, in the case of extremely difficult terrain, the robot is able to recognise

and avoid areas that it had previously been unable to traverse. This experiment demonstrates the benefit of the hybrid learning approach taken here.

The richer rescue task designed here requires the combined presence of classical planning and control learning. The successful performance of this MALA implementation demonstrates the utility of providing mechanisms to integrate sub-symbolic and symbolic reasoning. Further, the improving performance over the course of the runs shows the benefit of embedding multiple online learning components into robot software architectures. Most importantly, it demonstrates how the MALA framework is able to achieve both these goals in a unified manner.

Section 4.3 describes the robot's control architecture and gives detail on the execution of planning tasks through the interaction of the system components. Section 4.4 explains the learning algorithms and their integration with the rest of the architecture. Section 4.5 outlines the details of the exploration task, based on RoboCup Rescue, along with the additional planning challenges, while Section 4.6 describes the performance of the complete system on the extended rescue task.

4.3 System Architecture

The same simulated robot and environment are used throughout to facilitate the repeated experiments needed to validate the machine learning methods. The simulated 'Emu' robot from the previous chapter is again utilised. As mentioned previously, Emu is a model of a real robot used in RoboCup Rescue competitions, and the simulated environment models real elements of the competition arena. The robot's motors and sensing equipment were described in detail in Section 3.6.1, but, briefly, the sensors present are: a colour video camera, a depth camera, and a thermal camera, all of which have a resolution of 640 × 480 pixels and frame rates of approximately 15 Hz; a 180° laser range finder with 0.3° angular resolution and 30 m maximum range; and wheel encoders. To enable the robot to perform the manipulation tasks required to achieve the additional planning tasks, a 'parallel-jaw' gripper is mounted on the front of the robot. This modified version of Emu is


shown in the right panel of Figure 4.1.

Figure 4.1: USAR with the Emu autonomous rescue robot. Top Left: Real Emu during its best-in-class Autonomy run at the 2011 RoboCup Rescue Competition. Image Credit: Matthew McGill. Right: Simulated Emu with parallel-jaw gripper used in autonomous rescue experiments in this chapter. Bottom Left: Real Emu about to drive over difficult 'stepfield' terrain using a learned drive controller. Image credit: (Sheh et al., 2011).

4.3.1 Agents For Rescue

In the following sections, each of the agents in the system is described, together with a discussion of how it interacts with the others. The complete set of agents implemented for the current rescue robot control system, along with the flow of frames between them, is shown in Figure 4.2.


Figure 4.2: Rescue robot architecture incorporating control-level learning. Agents are shown posting frames to the blackboard and listening for further frames to process. Note that not all frame-listening relationships are shown to aid visibility, and the Emu robot’s sensor neck control agent is also not shown. Perception agents, on the left, process sensor information and post percepts to the blackboard, which trigger the generation of goals. This, in turn, triggers the generation of Planning Domain Definition Language (PDDL) action plans, which are executed by converting their effects to spatial constraints, solving for acceptable target poses, and using a motion planner to find a trajectory that executes the action. The learning agents for terrain recognition and drive control are highlighted (yellow online).

4.3.2 Perception

For the present rescue robot task, there are five perception agents. Of these, the Victim Detection and Simultaneous Localisation and Mapping (SLAM) agents were included in the implementation in the previous chapter, and details of their implementation were given in Section 3.6.2, so they are not described again here.


Object Detection

Detection of shoring blocks and water bottles is performed by an Object Detection agent. This agent searches for distinct objects within the point cloud produced by the robot's depth camera. The point cloud data are downsampled using a voxel grid filter, which treats the point cloud as a set of 3-D cubes called voxels and approximates all points within a given voxel by their centroid. The size of the voxels is referred to as the 'leaf size'. Points representing the walls and floor are removed using RANSAC segmentation (Fischler and Bolles, 1981) with a planar model. Finally, regions of the point cloud representing sufficiently distinct objects are detected using a clustering algorithm based on Euclidean distance (Rusu, 2009).

Algorithm 3 Object detection agent update

OBJECT_DETECTION
Input: Point cloud C, plane segment removal ratio r
 1: Remove points in C with y < 0.5 or y > 10          ▷ y is distance from the robot
 2: Downsample further using a voxel grid filter with leaf size of 0.75 cm
 3: n_filtered = |C|
 4: while |C| > n_filtered * r do                       ▷ r = 0.17 was used
 5:     Extract the largest planar component p from C using RANSAC segmentation
 6:     Remove all points in p from C
 7: end while
 8: Remove points from C classified as outliers         ▷ based on k-nearest neighbours
                                                          using Euclidean distance;
                                                          K = 50, σ_thresh = 1
 9: Construct T, a k-d tree representation of C
10: Construct the set of possible objects O using Euclidean cluster extraction on T
11: Let V be an empty list of 6-D vectors
12: for each cluster o ∈ O do
13:     Compute the 3-D centroid c and bounding box b
14:     V_o = {c_x, c_y, c_z, b_x, b_y, b_z}
15:     Add V_o to V
16: end for
17: return V

These algorithms are implemented using the Point Cloud Library (Rusu and Cousins, 2011a).


Each detected object is tracked using a Kalman filter (Kalman, 1960) with a constant-velocity model to iteratively update an object's position and to avoid detecting the same object many times. False positive detections on the walls are removed by comparing the coordinates of each detection with the areas designated as walls by the SLAM map. Full details of the object detection algorithm are given in Algorithm 3.

To provide a proxy for the detection of damaged or dangerous ceiling areas, several of the ceiling tiles in the world map are coloured red. These are detected by the Damaged Ceiling Detector agent, using the same combination of adaptive Hue thresholding (Bradski and Kaehler, 2008) and connected-components blob extraction (Samet and Tamminen, 1988) used by the Victim Detection agent to detect skin colour (Section 3.6.2). The height of the ceiling is measured using information from the depth camera.

The final perception agent is the Terrain Recogniser agent, which analyses the point cloud data from the robot's depth camera to extract descriptive features of terrain. This agent is described in more detail in later sections, when its function in learning to recognise difficult terrain is presented.
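The constant-velocity tracking step used by the Object Detection agent, described earlier in this section, can be illustrated with a short sketch. The following Python code is a minimal, generic constant-velocity Kalman filter for a single object's 3-D position, assuming numpy is available; the noise parameters, time step, and association gate are illustrative assumptions, not the values used in the actual agent.

import numpy as np


class ConstantVelocityKalmanTracker:
    """Minimal 3-D constant-velocity Kalman filter for one tracked object."""

    def __init__(self, initial_position, dt=1.0 / 15.0):
        # State: [x, y, z, vx, vy, vz]; velocity initialised to zero.
        self.x = np.hstack([initial_position, np.zeros(3)])
        self.P = np.eye(6)                                  # state covariance
        self.F = np.eye(6)                                  # constant-velocity transition
        self.F[:3, 3:] = np.eye(3) * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])   # position is observed
        self.Q = np.eye(6) * 1e-3                           # process noise (assumed)
        self.R = np.eye(3) * 1e-2                           # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, measured_position):
        z = np.asarray(measured_position, dtype=float)
        y = z - self.H @ self.x                             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]


# Example: associate a new detection with an existing track if it lies close
# to the track's predicted position; otherwise it would start a new track.
tracker = ConstantVelocityKalmanTracker([11.5, 1.2, 15.1])
predicted = tracker.predict()
detection = np.array([11.6, 1.2, 15.0])
if np.linalg.norm(detection - predicted) < 0.5:             # 0.5 m gate (assumed)
    tracker.update(detection)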

4.3.3 Goal Generation

To direct MALA's behaviour, one possibility would be to create a set of global goals explicitly and continually replan to achieve these goals. This is not desirable for several reasons. First, MALA is intended for long-term autonomous operation in open-ended domains, which means that fixed global goals may become outdated. Second, computational resources are limited, so it is desirable to minimise the time spent replanning. Thus, MALA progressively generates goals in response to percepts, allowing planning to proceed in a restricted search space. The goals constructed by the generator become the input to a planner, which can construct plans to achieve them. This approach performs better in domains in which the long-term goal of an agent is difficult to formalise as a conjunction of logical statements, or in which continuous replanning towards such a goal is intractable (Wilensky, 1983; Beaudoin, 1995; Hanheide et al., 2010).


Crucially, only some of these goals require planning. For many goals, a known routine, which may be the result of a low-level learning process, can be applied without the need for search. It is only when no known behaviour can achieve a given goal that planning must proceed. Generated goals are annotated with importance and urgency measures to allow prioritisation, as in Beaudoin's (1995) work, and are maintained on a goal stack until their completion, to allow resuming execution of interrupted goals. The goal generation agent uses a simple set of rules for generating first-order predicates that describe world states the agent should try to achieve. It places frames containing these predicates on the blackboard. Four types of goals are generated: three invoke the task planner, while the fourth is listened for by the motion planner agent. In decreasing order of urgency, the goals generated are:

1. Visit a nearby victim
2. Take water to an observed victim
3. Shore up a damaged ceiling
4. Explore

The list of active goals is maintained on a stack so that more urgent goals may interrupt the execution of those with lower urgency. Goals are removed from the stack when achieved, or when the Execution Monitor agent (introduced in the next section) reports that they have failed. The first three types of goals are generated in response to percepts being placed on the blackboard, while the fourth, exploring, is the default behaviour, generated when there are no percepts on the blackboard and no active goals to pursue. Explore goals do not trigger planning; they simply trigger the motion planner agent to perform frontier-based exploration (Yamauchi, 1997), by iteratively selecting the closest point in the SLAM map that is not yet explored and attempting to drive the robot to that point.

As an example of goal generation, if the Damaged Ceiling Detector agent has detected the presence of a ceiling which needs to be shored, it will post a frame

containing the name of the damaged ceiling detected, and its location. In the event that the Object Detection agent has also detected the location of the required shoring materials, the Goal Generator creates a description of the world state and posts it to the blackboard. As in Chapter 3, descriptions of goals are in PDDL. The PDDL description of a goal to shore up a damaged ceiling, generated by the Goal Generator, is shown below:

(define (problem maladomain311587068)
  (:domain maladomain)
  (:objects Robot damagedCeiling0 ShoringBlock0 ShoringBlock1 ShoringBlock2)
  (:init (hasgripperempty Robot)
         (object ShoringBlock0)
         (robot Robot)
         (object ShoringBlock1)
         (object ShoringBlock2)
         (hasdamagedceiling damagedCeiling0)
         (place damagedCeiling0)
         (object damagedCeiling0)
         (isshoringblock ShoringBlock0)
         (isbracingblock ShoringBlock1)
         (isbracingblock ShoringBlock2))
  (:goal (isshored damagedCeiling0)))

This PDDL frame is used by the Action Planner agent to construct a task-level plan. Executing such a plan requires a cascade of execution agents.
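As a rough sketch of the frontier-based exploration behaviour triggered by Explore goals (Yamauchi, 1997), described earlier in this section, the following Python code selects the closest frontier cell in a simple occupancy grid. The grid encoding and the breadth-first search are illustrative assumptions; they are not the SLAM map representation used by the actual motion planner agent.

from collections import deque

import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, 2   # assumed cell labels


def closest_frontier(grid, start):
    """Breadth-first search from the robot cell to the nearest frontier cell.

    A frontier cell is a FREE cell with at least one UNKNOWN neighbour.
    Returns its (row, col) index, or None if the map is fully explored.
    """
    rows, cols = grid.shape
    visited = np.zeros_like(grid, dtype=bool)
    queue = deque([start])
    visited[start] = True
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    while queue:
        r, c = queue.popleft()
        # Is this free cell adjacent to unexplored space?
        for dr, dc in neighbours:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN:
                return (r, c)
        # Otherwise keep expanding outwards through free space.
        for dr, dc in neighbours:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not visited[nr, nc] and grid[nr, nc] == FREE):
                visited[nr, nc] = True
                queue.append((nr, nc))
    return None


# Example: a small map with unexplored space on the right-hand side.
grid = np.full((5, 5), FREE)
grid[:, 3:] = UNKNOWN
grid[2, 2] = OCCUPIED
print(closest_frontier(grid, (2, 0)))   # prints the nearest frontier cell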

4.3.4 Planning and Execution

A central principle of this thesis is that robot architectures must contain both symbolic reasoning components, and sub-symbolic execution components capable of controlling robot actuators. Thus, a mechanism must be provided to translate from symbolic to metric representations. An advantageous aspect of MALA is that it effects this translation in a principled manner using an extension of the method

introduced by Brown (2009). In this approach, actions are made operational using a combination of constraint solving and motion planning. As discussed in Chapter 2, this is in contrast to most architectures, which utilise ad hoc translation methods, limiting the extent to which representations can change online through learning. The Constraint Solver agent enables each task-level action to be implemented by the robot by treating the effects of the action as a constraint satisfaction problem and solving for a pose of the robot that satisfies this problem. Motion planner agents then search for a trajectory through metric space that can achieve this pose. This trajectory is then followed by motion controller agents, which implement reactive control of actuation.

The PDDL frame, which describes the current world state and goal, is the input to the Action Planner agent, along with a collection of task-level action models. The action models describe the actions that the robot is capable of executing, such as moving to a location or picking up an object. The Action Planner agent is then invoked to find a set of actions to get from the current state to the goal state. The Action Planner agent in the rescue control system is built around an off-the-shelf implementation of the Fast Forward planner (Hoffmann and Nebel, 2001). If it is successful, it posts a plan frame to the blackboard. The plan frame contains a sequence of task-level actions the robot can then attempt to perform in order to achieve the goal.

The Execution Monitor agent listens for plans posted to the blackboard. When it receives a plan, it executes each action in sequence, monitoring for action failure or timeout. Failures are detected when the lower-level agents report that execution is impossible, and timeouts occur after 3 minutes per action, or 6 minutes for the whole plan. An action or plan failure is posted to the blackboard and the goal generator generates a new goal. Each action is executed by extracting the PDDL effects from the action model, and constructing a constraint satisfaction problem to find a metric pose that will implement the action.

For example, one of the actions in the plan produced by Fast Forward in response to the PDDL shown above is the action put_between, in which the robot

attempts to place an object between two other objects. This is necessary for the shoring task. The action model for put_between is given in Figure 4.3. The most important part of the model for the robot to execute the action is the effects. Each literal in the effects of an action specifies a constraint on the final poses of the objects involved. Thus, the Execution Monitor extracts the effects of each action, in this case between(W,Y,Z), and places it in a frame to post to the constraint solver agent.

(:action put_between
  :parameters (?robot ?obA ?obB ?target)
  :precondition (and (isRobot ?robot)
                     (near ?robot ?obA)
                     (isholding ?robot ?target)
                     (near ?obA ?obB))
  :effect (and (between ?target ?obA ?obB)
               (not (isholding ?robot ?target))))

Figure 4.3: PDDL action model for put_between(x,y,z,w) action, in which the robot attempts to place an object between two other objects. This action is an essential component of the ceiling shoring task, in which the tall shoring block must be placed between the two shorter blocks.

The constraint solver agent uses the constraint logic programming language ECLiPSe (Apt and Wallace, 2006). It receives the effects of an action as a conjunction of predicates, such as those in Figure 4.3, and converts them to ECLiPSe Prolog. Along with the constraints, the constraint solver must also receive the poses of all objects involved. To provide this information, the Execution Monitor listens for the poses of all known objects and posts the most up-to-date estimate to the constraint solver. The ECLiPSe Prolog query produced when solving for a pose that satisfies put_between is:

X = [[shoringblock0, [-6.47, -6.45, 0.05]],
     [shoringblock1, [2.74, -3.69, 0.40]],
     [shoringblock2, [-0.87, -0.36, 0.70]]],
build_state(X, State),
solve([between(robot, shoringblock0, shoringblock1)], State, Soln).

where the names and positions of the objects are specified in the form [name, [x,y,z]]. The object positions shown above were obtained from the object detection agent, which placed the object names16 and positions on the blackboard. The build_state predicate constructs the necessary data structures for ECLiPSe. The actual constraint solving is invoked by the solve predicate, which calls the real-valued interval arithmetic constraint solver (the ic interval constraint library in ECLiPSe) to search for a specific solution to the constraints specified by the given predicates. The variable Soln is the output parameter that stores the solution found by the constraint solving process. Each predicate used in our planning system has a corresponding Prolog predicate that defines the spatial constraints it imposes. The output of the constraint solver is a metric pose that satisfies the spatial constraints. This pose is placed in a metric goal frame, containing a goal pose for the robot or its actuators, which forms the goal for the motion planner.
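To make the translation step more concrete, the following Python sketch shows one simple way an effect such as between(Target, A, B) could be converted into a metric goal pose by numerical optimisation, assuming numpy and scipy are available. The cost function, the midpoint reading of between/3, and the use of scipy.optimize.minimize are illustrative assumptions; the actual system uses the ic interval constraint library in ECLiPSe Prolog as described above.

import numpy as np
from scipy.optimize import minimize

# Known object positions, as they would arrive from the Object Detection agent.
positions = {
    "shoringblock0": np.array([-6.47, -6.45, 0.05]),
    "shoringblock1": np.array([2.74, -3.69, 0.40]),
}


def between_cost(pose, a, b):
    """Penalty that is zero when `pose` lies midway between the two endpoints
    (a deliberately simple reading of the between/3 constraint)."""
    midpoint = (a + b) / 2.0
    return float(np.sum((pose - midpoint) ** 2))


def solve_between(a_name, b_name):
    """Search for a metric goal pose satisfying between(Target, A, B)."""
    a, b = positions[a_name], positions[b_name]
    start = (a + b) / 2.0 + np.array([0.1, -0.1, 0.0])   # arbitrary initial guess
    result = minimize(between_cost, start, args=(a, b))
    return result.x


goal_pose = solve_between("shoringblock0", "shoringblock1")
print("MetricGoal:", np.round(goal_pose, 2))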

4.3.5 Motion Planning and Control

The motion planner implemented in the current system uses an A* search (Hart et al., 1968) through the SLAM map to find drivable trajectories that reach the goals specified by metric goal frames, and posts them to the blackboard as trajectory frames containing a list of waypoints (Section 3.6.3). The Emu robot has three methods of actuation, namely, driving with its wheels, grasping and lifting with its gripper, and moving its "neck" to change the field of view of its sensor array. Each of the actuators is controlled by a motion controller agent. To operate the Emu's motors, the Drive Controller agent provides reactive obstacle avoidance using the sensor data from the laser range finder. As in Chapter 3, the driving controller is an implementation of the "Almost-Voronoi" controller described by Sheh et al. (2009).

16The non-meaningful names given by the object detection agent have been replaced for readability.

The Gripper Control agent is a straightforward purpose-built Proportional-Derivative (PD) controller which operates the Emu's parallel-jaw gripper. It is capable of closing the gripper to pick up an object, and opening it in order to lower the object to the ground. These behaviours are used to manipulate objects. In particular, they are used to carry water to victims and to move shoring blocks around to complete the shoring task, which is described in more detail in Section 4.5. Like the gripper controller, the Neck Controller agent is based on a purpose-built PD controller. The Neck Controller can execute simple scanning search behaviours, looking up to check for damaged ceilings, and looking down at the ground while driving over difficult terrain.
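The PD controllers mentioned above follow the standard proportional-derivative form. The following Python sketch is a generic single-axis PD controller of the kind that could drive a gripper or neck joint towards a target position; the gain values, time step, and the crude plant model in the example are illustrative assumptions, not the tuned parameters of the actual agents.

class PDController:
    """Generic proportional-derivative controller for a single joint."""

    def __init__(self, kp=2.0, kd=0.2):
        self.kp = kp              # proportional gain (assumed value)
        self.kd = kd              # derivative gain (assumed value)
        self.prev_error = 0.0

    def command(self, target, measured, dt=0.1):
        """Return a control effort pushing `measured` towards `target`."""
        error = target - measured
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.kd * derivative


# Example: close the gripper from fully open (1.0) towards closed (0.0).
controller = PDController()
position = 1.0
for _ in range(10):
    effort = controller.command(target=0.0, measured=position)
    position += 0.05 * effort     # crude plant model, purely for illustration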

4.4 Learning

The two agents responsible for improving the robot's behaviour are the Terrain Driver agent, and the Terrain Recogniser agent. The Terrain Driver agent attempts to learn a driving policy to guide the robot over difficult terrain, while the Recogniser observes the results of the driving agent's efforts, and attempts to learn a model of the types of terrain that are likely to cause the robot to become irrecoverably stuck.

4.4.1 Learning to Drive Over Difficult Terrain: The Terrain Driver Agent

Sheh’s (2010; 2011) work demonstrated that controllers capable of driving wheeled robots over very challenging terrain can be learned using a form of learning from demonstration (Argall et al., 2009) called behavioural cloning (Michie et al., 1990; Sammut et al., 1992). Behavioural cloning, which was introduced in Section 2.6.2, involves a learner observing the actions of a human expert, and inducing a model of the expert’s behaviour to construct a controller that can perform the task. Sheh’s system uses a decision tree learner to induce a mapping from terrain features to


Index   Action Name     Symbol   Motor Command (% throttle to left and right pair of wheels)
0       Forward-Left    ↖        [0, 100]
1       Forward         ↑        [100, 100]
2       Forward-Right   ↗        [100, 0]
3       Spin-Left       ↶        [-100, 100]
4       Stop            ×        [0, 0]
5       Spin-Right      ↷        [100, -100]
6       Back-Left       ↙        [0, -100]
7       Backward        ↓        [-100, -100]
8       Back-Right      ↘        [-100, 0]

Table 4.1: Discrete drive actions used by the learned drive controller. Each action involves operating the Emu robot’s motors for one second using the given throttle.

motor control decisions. Training examples were collected offline, by observing the actions of either a human expert or what Sheh termed an autonomous demonstrator. The autonomous demonstrator performed an A* search over the simulated terrain, producing a sequence of actions that leads the robot to the goal. These state-action pairs are then used as training data to build a controller that can drive the physical robot. The nodes in the search graph represent poses of the robot in the simulator, while the edges represent actions that move the robot from one pose to another. The demonstrator selects from a set of nine discrete actions, a ∈ A = {0, 1, ..., 8}. Each action operates Emu's motors for one second, moving the robot in a range of directions. The definitions, symbols, and motor commands for each action are shown in Table 4.1. On flat ground, each action moves the robot by approximately the same amount; however, on rough terrain the actions may have widely varying effects.

Sheh (2010) referred to the autonomous demonstrator's process of searching through simulated poses of the robot and the effects of each drive action from each pose as envisaging, as it is reminiscent of imagining the effects of possible actions that a human might perform before physically performing them. To expand a node within the search, the simulated robot is teleported to the location and

rotation specified by the node, and every possible action is tried, enabling the heuristic values to be calculated. The heuristic function used for the search is Euclidean distance to the goal. This process enabled the automatic generation of training examples without requiring many hours of driving by a human expert.

Sheh's system collected training data in batches before beginning to learn. The present work is an adaptation of Sheh's system to work online, as a component within an integrated system that is executing plans and exploring an unknown environment. For online operation, the robot must collect training examples during its experience in the rescue arena, as shown in Algorithm 4, to incrementally construct a drive controller for unseen terrain.

Algorithm 4 Terrain driver agent learning update

TERRAIN_DRIVER Input: Navigation corridor N, goal box Gbox, and goal point G

if Terrain Driver agent receives N, Gbox and G then
    Begin the A* search towards G
    if A* search finds a path then
        Add each action taken by the demonstrator and the corresponding features
            to the current training set
        Rebuild the decision tree model using the new training set
        Attempt to drive towards G using the current version of the controller's
            decision tree model
        if Robot is inside Gbox then
            Post 'Success' to the Blackboard
        else if Robot exits N or makes no progress for t > 1 min then
            Post 'Fail' to the Blackboard
        end if
    else
        Post 'Fail' to the Blackboard
    end if
end if
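The online behavioural-cloning loop in Algorithm 4 can be sketched in a few lines of Python, assuming numpy and scikit-learn are available. scikit-learn's CART-style DecisionTreeClassifier is used here purely as a stand-in for the Weka J48 (C4.5) learner employed by the actual system, and the synthetic data in the example are invented for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Accumulated (terrain features, demonstrator action) pairs; the feature
# vectors would come from the Planes and Deviations extractor, and the
# action labels are the indices 0-8 from Table 4.1.
training_features = []
training_actions = []
controller = None


def add_demonstration(path_features, path_actions):
    """Store the state-action pairs produced by the A* demonstrator and
    rebuild the controller from the enlarged training set."""
    global controller
    training_features.extend(path_features)
    training_actions.extend(path_actions)
    controller = DecisionTreeClassifier(min_samples_leaf=2)
    controller.fit(np.vstack(training_features), np.array(training_actions))


def choose_drive_action(features):
    """Use the cloned controller to pick a drive action for new terrain."""
    return int(controller.predict(features.reshape(1, -1))[0])


# Example with synthetic data: 20 demonstrated steps, 12 terrain features each.
rng = np.random.default_rng(0)
add_demonstration(list(rng.normal(size=(20, 12))), list(rng.integers(0, 9, 20)))
print(choose_drive_action(rng.normal(size=12)))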

Feature Extraction

The Terrain Driver agent must consider the terrain around the robot in order to select the appropriate drive action. This requires a memory of past sensor data

so that it can reason about what is under and around the robot. To provide this memory, past and present data from the range imager are fused with robot pose data from the position tracker. The result of this process is a robot-relative grid-based height map of the terrain around the robot. This map has a small (5 mm-10 mm) cell size to capture the high spatial frequency in the terrain. Because of the large amount of data in such a map, feature extraction is necessary to make learning tractable.

To perform the necessary feature extraction, we use the same feature extractor as used by Sheh (2011), initially proposed by Kadous et al. (2006). The extractor is called Planes and Deviations. This involves dividing the height map into several areas of interest (AOIs) around the robot's wheels and, for each AOI, calculating the plane of best fit and the coordinates of the greatest deviation from this plane. Figure 4.4(a) shows a visualisation of the planes extracted from the complete set17 of AOIs. The attributes of this plane, and the greatest deviation from it, that are used as a feature vector are shown in Table 4.2, and the map of AOIs around the robot's base is shown in the bottom panel of Figure 4.4(c-e). Further details on feature extraction are given by Sheh (2011).
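The plane-fitting step at the heart of Planes and Deviations can be sketched as follows. The Python code below fits a least-squares plane z = ax + by + c to the heightmap points in one AOI and reports the plane's slope, direction, and the point of greatest deviation. It is a simplified illustration under assumed conventions (numpy available, z as the height axis), not the implementation used by Sheh (2011).

import numpy as np


def plane_and_deviation(points):
    """Fit a least-squares plane z = ax + by + c to the AOI heightmap points
    (an N x 3 array), and return the plane slope and direction plus the
    point that deviates most from the fitted plane."""
    points = np.asarray(points, dtype=float)
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    a, b, c = coeffs

    slope = np.degrees(np.arctan(np.hypot(a, b)))      # steepness of the plane
    direction = np.degrees(np.arctan2(b, a))            # orientation of the slope

    residuals = points[:, 2] - (A @ coeffs)              # signed height deviations
    worst = int(np.argmax(np.abs(residuals)))
    return {
        "slope_deg": slope,
        "direction_deg": direction,
        "deviation_point": points[worst],
        "deviation_height": residuals[worst],
    }


# Example: a gently sloping AOI with one protruding block-like feature.
rng = np.random.default_rng(1)
xy = rng.uniform(-0.1, 0.1, size=(200, 2))
z = 0.05 * xy[:, 0] + 0.01 * rng.normal(size=200)
z[17] += 0.15                                            # a step-like obstruction
features = plane_and_deviation(np.column_stack([xy, z]))
print(features["slope_deg"], features["deviation_height"])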

Learning from Terrain Features

As detailed in Algorithm 4, when the Terrain Driver is invoked, it initiates the A* autonomous demonstrator to attempt to find a sequence of actions that enables the robot to cross the terrain. If a sequence is found, then each drive action taken by the demonstrator is marked as a new positive drive action. Note that learning progresses without negative examples. As is typical in behavioural cloning, these training examples are then used to construct a decision tree that models the behaviour of the demonstrator. Since the decision tree maps each class of terrain features to the most appropriate drive action, it can be used as a controller. This controller is the product of the Terrain Driver’s learning process. Induction of the decision tree is achieved using the C4.5 algorithm (Quinlan, 1993). It is implemented using the

17The larger AOIs 8, 19, and 25, shown in Figure 4.4(e), are not visualised for clarity.


Figure 4.4: Extracting features from difficult terrain using Planes and Deviations. (a): Extracted plane of best fit for each AOI, showing a difficult feature beneath Emu's front-left wheel. Planes are coloured according to slope; steeper planes are red, flatter are green. Heightmap points are visible in grey beneath the planes, and the position of the robot is shown in wireframe. (b): The simulated Emu robot that generated the point cloud data for (a), poised in a precarious position close to the difficult feature. (c): Position of the robot's chassis relative to the AOIs. (d): Locations of the AOIs. (e): Larger AOIs covering the area in front of, underneath, and behind the robot. Panels (c-e) are reproduced from (Sheh, 2010).

J48 algorithm from the Weka machine learning toolkit (Witten et al., 2011). Figure 4.5 shows an example18 of a decision tree constructed by the Terrain Driver over the course of one learning episode (10 rescue missions of 30 minutes each).

18As mentioned in the caption of Figure 4.5, the tree shown is heavily pruned, and is not the actual decision tree used by the Terrain Driver agent.

Figure 4.5: Learning to drive over difficult terrain by constructing a decision tree. The decision tree maps terrain features constructed by the Planes and Deviations feature extractor (Section 4.4.1) to drive decisions. The tree shown is a heavily pruned (confidence factor = 0.001) version of the complete tree constructed by the Terrain Driver agent, which is too large to show. Meanings of the feature names are shown in Table 4.2. The leaf nodes marked A-C are discussed in the text.


Feature Name      Description
yaw               Heading of the robot relative to the corridor
corridorOffset    Distance from the corridor centreline, normalised to half the corridor width
pitch             Pitch of the robot
roll              Roll of the robot
aoinph            Height of the plane through AOI n
aoinps            Slope θ of the plane through AOI n
aoinpd            Direction ϕ of the plane through AOI n
aoindx            x position of the greatest deviation through AOI n
aoindy            y position of the greatest deviation through AOI n
aoindz            Height z of the greatest deviation through AOI n

Table 4.2: Components of the feature vector constructed by the Planes and Deviations feature extractor. x and y coordinates of the deviation are in normalised coordinates within the AOI, and the deviation is the greatest deviation from the plane of best fit. Reproduced from (Sheh, 2010).

Each leaf node is labelled with the index of the most likely drive action (see Table 4.1) and the number of training examples that arrived at that leaf that matched, and did not match, that action. Several of the decision nodes can be interpreted as intuitive driving choices, for example:

A If the robot is facing forward (heading between 49◦ left and 33◦ right), on the left-hand side of the navigation corridor (offset by more than 0.45w from the centreline), and the terrain in front of the left wheel of the robot is sufficiently flat (slope of AOI 5 is less than 6), then drive forward and to the right.

B If the robot is facing forward (heading between 49◦ left and 33◦ right), near the centre of the navigation corridor (between 0.45w on the left and 0.91w on the right), and the terrain in front of the robot is sufficiently flat (the greatest deviation from the plane of best fit is on the far side of AOI 6 from the robot) then drive forward.

C If the robot is facing away from the goal (heading more than 133◦ right), back up.


Here, w is the width of the navigation corridor. These three decision nodes are labelled in Figure 4.5. From the prominence of the 'yaw' and 'corridorOffset' attributes in the tree, it can be inferred that the most important aspect of the learned drive policy is to keep the robot facing forward, in the centre of the navigation corridor. This accords well with Sheh's (2010) analysis. Experiments with drive learning are described further in Section 4.5.

Learning performed by the Terrain Driver is not truly incremental, as the tree is completely rebuilt whenever new training data are acquired. However, for the purposes of this work, the induction of the decision tree is much faster than the time taken to acquire a new training example. Thus, the tree can be rebuilt online as new training examples become available. Although C4.5 is a relatively simple learning algorithm, Sheh's (2010) comparisons showed that it performs as well as or better than the other induction algorithms implemented in Weka at this task.

Whenever difficult terrain is detected, the Terrain Driver agent can be invoked to take control of the robot and attempt to cross the difficult terrain, and in the process, improve its driving skills. However, detecting that difficult terrain is imminent is beyond its responsibility; that is the task of the Terrain Recogniser agent.
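The decision nodes A-C listed above can also be written down directly as a small Python sketch. The sign conventions (negative yaw to the left, negative offsets to the left of the centreline) and the boolean summary of the AOI 6 deviation are assumptions made purely for illustration; the real controller is the induced decision tree itself.

def drive_rule(yaw_deg, corridor_offset, aoi5_slope, aoi6_deviation_far):
    """Plain-Python reading of decision nodes A-C from Figure 4.5.
    corridor_offset is expressed in units of half the corridor width."""
    facing_forward = -49.0 <= yaw_deg <= 33.0
    if facing_forward and corridor_offset < -0.45 and aoi5_slope < 6.0:
        return "Forward-Right"                 # node A
    if facing_forward and -0.45 <= corridor_offset <= 0.91 and aoi6_deviation_far:
        return "Forward"                       # node B
    if yaw_deg > 133.0:
        return "Backward"                      # node C
    return "Stop"                              # outside the pruned rules shown


print(drive_rule(yaw_deg=5.0, corridor_offset=-0.6, aoi5_slope=3.0,
                 aoi6_deviation_far=True))     # Forward-Right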

4.4.2 Learning a Concept of Traversability: The Terrain Recogniser Agent

The discussion in Section 2.6.1 pointed out that robots can act as active learners by performing actions to generate their own training data, ideally chosen in such a way as to optimally inform the learner's model of the world. However, in the course of experimentation, their actions may fail so badly that the robot cannot recover, terminating the learning process prematurely. This is particularly problematic in the case of a mobile robot learning to drive or walk. A natural method to address this difficulty is to have another process dictate which experiments are safe, and which are too risky to attempt. This is the task of the


Terrain Recogniser agent. It is an introspective process that observes the results of the Terrain Driver's attempts and builds a model of the type of terrain that is particularly dangerous. It can then use this model to predict which driving routes are too difficult and are likely to cause catastrophic failure. This is achieved by using each driving episode performed by the Terrain Driver agent as a training example. If the terrain drive learner successfully crosses the region, this provides a positive example of terrain that is drivable by the learned controller. If the robot becomes stuck, takes too long to reach the goal, or exits the navigation corridor during driving, then it provides a negative example. Like the learning system embedded within the Terrain Driver agent, the Terrain Recogniser also uses the Weka J48 decision tree learner to build a model of the data set generated by the Terrain Driver agent's performance. Thus, the recogniser works as a monitoring process: it learns from the Terrain Driver's successes and failures. Since the driver itself is learning to drive, the Terrain Recogniser is a meta-level learner.

Another function of the Terrain Recogniser is to analyse the point cloud of the area in front of the robot, as seen by the depth camera, and warn the other agents when rough terrain is imminent. This information is used to determine whether the regular drive controller can handle it, whether the Terrain Driver agent should be activated to attempt to drive over it, or whether it is too difficult to attempt at all and should be avoided. However, this process of selecting whether to engage the Terrain Driver should only be invoked if it is actually deemed necessary to cross the region that is currently being inspected; that is, when the robot is moving forwards, directly following the path towards a goal (and not, for example, when it is rotating in place). To determine when these conditions are met, the Terrain Recogniser listens for the latest path that has been published by the motion planner as the robot drives towards a goal. If the robot is facing the direction in which the path leads, i.e., if it is about to drive forwards, then the Terrain Recogniser is activated and begins to analyse the point cloud.

To analyse the terrain the robot is about to drive over, the Terrain Recogniser, like the Terrain Driver, builds a grid of AOIs and, for each AOI, finds the plane of

best fit. The same Planes and Deviations feature extractor as the Terrain Driver (Section 4.4.1) is used to describe the terrain using a feature vector. In the early stages of learning, the Terrain Recogniser has not had access to sufficient training examples to be confident in its decisions, so it adopts a simpler decision strategy: if any of the planes extracted from the terrain have non-zero slope, i.e., if the ground is not completely flat, it activates the Terrain Driver agent.

The Terrain Driver agent has a model of how to drive over difficult terrain; however, it has no concept of navigation or obstacles. Thus, the Terrain Recogniser must provide a goal point for the Terrain Driver to attempt to reach. A set of possible goal points is provided by the path towards the goal that the motion planner has published and that the regular drive controller is currently following as the robot moves. However, although the reactive Drive Controller agent will follow the given path (avoiding walls), the Terrain Driver simply follows the drive choices made by its learned decision tree, which causes it to drive more or less straight towards the goal location. Choosing a point too far from the robot's current position may mean that there is a wall blocking the robot, and the Terrain Driver will fail as it has no concept of walls. Thus, the goal point is selected as the furthest point on the path to which a straight line can be drawn in the SLAM map from the robot's current position. Along with the goal point, the Terrain Recogniser constructs a navigation corridor, which enables the Terrain Driver to determine if it has strayed off course, and a goal box, which determines if the goal has been reached. These provide the parameters for the Terrain Driver to begin driving. Figure 4.6 shows the relationship between the global path produced by the motion planner, nearby obstacles, and the navigation corridor and goal box constructed by the Terrain Recogniser. The Terrain Driver has succeeded if the robot is inside the goal box, and has failed if it has exited the navigation corridor.

When the Terrain Recogniser has acquired a sufficient number n19 of examples of the Terrain Driver's success and failure, it begins to use the decision tree that it

19n = 25 was used for these experiments.


has built to classify terrain. The decision tree is a binary classifier whose class values are 'traversable' and 'impassable'. If an area of terrain is classified as impassable, the Terrain Recogniser posts a frame to the blackboard, which the SLAM agent listens for, causing the area to be marked as occupied in the SLAM map. This causes the motion planner to plan paths around the terrain, avoiding the hazard. This process is shown in Figure 4.7. If the terrain is classified as traversable, then the Terrain Driver agent is activated to attempt to cross the terrain.

Figure 4.6: Selecting a navigation corridor and goal point for learning to drive over difficult terrain. The Terrain Recogniser must select a goal for the Terrain Driver that enables a clear straight-line path to the goal, avoiding walls. Thus, the chosen point on the global path is as far from the robot as possible, but every point between it and the robot is unoccupied. A wall is shown in brown, with the dangerous region near the wall shown in red.
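The three-way decision described above, between leaving the regular drive controller in charge, engaging the learned Terrain Driver, and marking the terrain as impassable, can be sketched in Python as follows. The feature names, thresholds, and classifier interface are illustrative assumptions; the actual Terrain Recogniser uses the Weka J48 learner and the Planes and Deviations features described earlier.

MIN_EXAMPLES = 25          # matches the n used in these experiments


def terrain_decision(aoi_slopes, features, classifier, num_examples):
    """Decide how to handle the terrain ahead.

    aoi_slopes   -- list of plane slopes (degrees) from the AOIs ahead
    features     -- the full Planes and Deviations feature vector
    classifier   -- object with a predict(features) method returning
                    'traversable' or 'impassable' (assumed interface)
    num_examples -- how many labelled driving episodes have been collected
    """
    if all(abs(s) < 1e-6 for s in aoi_slopes):
        return "regular_controller"          # flat ground: nothing special needed
    if num_examples < MIN_EXAMPLES:
        return "terrain_driver"              # early phase: always attempt rough terrain
    if classifier.predict(features) == "impassable":
        return "mark_occupied_and_replan"    # too risky: let the motion planner avoid it
    return "terrain_driver"                  # learned driver attempts the crossing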


Figure 4.7: Avoiding danger by planning a path around a region recognised as impassable. Top Left: A stepfield appears in the area analysed by the terrain recogniser (marked in red). Top Right: The heightmap and planes of best fit extracted from the depth camera’s point cloud. The planes are coloured according to their slope. The large step in the stepfield (red planes) is recognised as a very difficult feature. Bottom: The location of the stepfield (S) has been added to the SLAM map used by the motion planner to plan routes, and the planner has found a new path around the terrain. Obstacles are shown in white, regions dangerously close to walls in red, free space in green, and the new path found by the planner (P) is in purple.

4.5 Experimental Setting: Urban Search and Rescue (USAR)

In the RoboCup Rescue competition, several regions of the arena are littered with very rough terrain, designated the ‘Red’ areas, and are designed for teleoperated

robots with high degrees of mobility. The present goal is to investigate the autonomous navigation of difficult terrain in combination with the performance of planning tasks, so instead of designating specific areas of the map for difficult terrain, 'stepfield' hazards are scattered throughout the entire rescue arena. These stepfields provide the principal difficulty with which the robot must cope in order to successfully explore the rescue environment. Stepfields are driving hazards, intended to mimic the difficulty of driving over rubble in a disaster setting. They consist of pallets of blocks that are reconfigured between experiments to produce a wide variety of novel terrains. Stepfield hazards are visible in Figures 4.1, 4.8 and 4.9. Simulated victims are scattered throughout the arena, represented by humanoid figures in a variety of poses, which have skin-coloured arm and head areas, and are also warmer than the surrounding area. A large part of the task of the robot is to detect the location of victims, store their location, and plan a path to visit them.

Two planning tasks have been added, inspired by the mobile manipulation challenges that are being phased into the RoboCup Rescue competition for future contests. The aim of this more difficult experimental domain is to construct a task that requires the integration of several robot capabilities and cannot be solved by a single method in isolation: simultaneously acquiring control-level knowledge to negotiate an unpredictable environment, integrating diverse, specialised capabilities to perceive important environmental features, and reasoning about, and constructing, multi-step plans of action to achieve complex goals. Demanding tasks that require this combination of robot abilities require an integrated approach. In particular, they require an integrated architecture providing the combination of continuous learning, hybrid symbolic and sub-symbolic representations, and specialised components to perform perception and actuation. Recall from Section 1.3 that MALA is constructed to provide exactly this functionality. Thus, by demonstrating successful completion of this extended autonomous rescue task, the principles upon which MALA is designed are validated, and a step toward improved autonomy for rescue robots is taken.


Figure 4.8: Shoring of damaged ceilings. Damaged ceilings are coloured red to facilitate detection by the robot. A ceiling is considered shored when the tall 'shoring' block is placed between two shorter 'bracing' blocks underneath the damaged area. Left: Simulated Emu robot shoring a damaged ceiling. Right inset: Emu's colour and depth camera images respectively.

The first planning task involves the delivery of water bottles to the victims. For each victim in the environment, there is a nearby bottle of water. If the robot is able to successfully detect the victim and also the water bottle, it may attempt to grasp the water bottle and deliver it near to the victim for added points. The water bottle is considered delivered if it is placed within a 1 m radius of the victim's location.

The second and more difficult planning task involves shoring damaged ceilings. Shoring is the task of supporting a partly collapsed structure with temporary materials, such as wooden beams. It is an important technique for USAR responders to protect both the victims of a disaster and themselves during a rescue. The current challenge task for the RoboCup Rescue competition is to build a supporting structure around a given pole by laying shoring blocks in a cross-hatched pattern. A simplified version of this task has been adopted here.

The shoring task, shown in Figure 4.8, is a particularly important component of this experiment, because it requires the sequencing of many actions to achieve a goal. Shoring therefore represents a task that is difficult for a robot to achieve

without the use of symbolic representations, in this case, classical planning. Several authors, most notably Brooks (1991), have argued that such representations are unnecessary for robots. However, as robots are required to perform tasks, such as shoring, that require the construction and execution of a relatively long sequence of actions that cannot be predicted in advance, it becomes more difficult to make this argument. Thus, the shoring task is an important demonstration of MALA's ability to integrate symbolic representations with low-level robot control.

In the event that the robot detects a damaged ceiling area, it may request shoring materials. Shoring materials consist of a long 'shoring' block, which is the same height as the damaged ceiling, and two smaller 'bracing' blocks. The task is to place the shoring block underneath the damaged ceiling, so that it is positioned between the two bracing blocks. The structure is not required to have any mechanical strength; the component blocks simply need to be arranged in the correct relative spatial positions.

The simulation environment consists of an 8 × 8 grid of 1.44 m² cells, each of which may be free, occupied by a stepfield, a water bottle, or a victim. Each map contains a single cell that has a damaged ceiling, which is always located close to the entrance. This is to minimise the need for the robot to carry shoring blocks over stepfields, which is extremely difficult. An aerial view of the complete simulated rescue arena is shown in Figure 4.9. Maps are generated randomly using a maze generation algorithm (Algorithm 14 in Appendix A), and are populated by placing victims in dead-end cells to minimise the effect of them blocking the path of the robot and limiting the area available for exploration. Following the physical RoboCup Rescue, the stepfield areas are placed in such a way that there is often a victim on the far side of a stepfield, to provide a scoring incentive to cross the difficult terrain. These stepfields may be considered 'necessary', since the goal generation rules specify that the robot should always attempt to reach the locations of victims, or to cross stepfields to reach new, unexplored regions of the map. In addition to these necessary stepfields, there are several stepfields that do not block anything in


particular, and it is not difficult to avoid them, provided that they can be recognised as dangerous. These stepfields are referred to as 'isolated hazards'. Each stepfield is an unstructured grid of 87 mm × 87 mm blocks of randomly varying heights, with maximum height 20 cm and minimum height 5 cm, following the work of Sheh (2010), originally based on the 2010 specification provided by the National Institute of Standards and Technology (NIST). There is no pattern in the variation of the blocks in these stepfields. Thus, they are known as unstructured random stepfields, and are amongst the most difficult for wheeled robots to traverse (Sheh et al., 2011).

Figure 4.9: Aerial view of one of the simulated rescue arena maps used in the current experiments. The 'damaged ceiling' is the only map cell with a ceiling, in the top left of the image (marked red online). Multiple stepfield hazards are visible near the centre of the map. Four of the five victims are visible scattered around the edges of the arena, with blue water bottles visible near each victim.


4.6 Results and Discussion

For each experimental run, the robot has 30 minutes, measured in real time, to achieve the maximum score possible. To give the robot an opportunity to improve its score through learning, 10 trials were run on the same map. The robot's memory of terrain experience is persistent for each map; that is, training examples accumulated during runs on the same map are accessible to the robot during later runs, allowing it to learn from mistakes and successes on earlier runs. To remove the influence of any particular map configuration on the results, this procedure is repeated on each of five randomly generated maps (see Algorithm 14 in Appendix A), and the results are averaged.

4.6.1 Experiment 1: Learning to Traverse the Rescue Environment

To assess the performance of the complete system, a slightly modified version of the scoring metric used in the RoboCup Rescue competition was adopted. In particular, aspects of the RoboCup metric that evaluate hardware-related elements of the robot's performance were removed. The success of the robot's exploration is measured by the extent of the simulated arena that the robot is able to traverse and map during each 30 minute rescue run. The SLAM map is used to extract how much of the environment was physically traversed by the robot. Another difference from the RoboCup Rescue metric is that the robot is not awarded points for detecting victims using different sensor modalities. The scoring metric used is:

Score = 100M + 20W + 20V + 20S_p + 30S_c        (4.1)

where M is the fraction of the rescue arena traversed by the robot, W is the number of water bottles delivered to victims, V is the number of victims visited, S_p is the number of ceilings partially shored, and S_c is the number of ceilings completely shored.

A damaged ceiling is considered partially shored when one of the three shoring blocks is placed beneath it, and completely shored when all three are placed beneath it, in the configuration discussed earlier, with the tall shoring block placed between the shorter bracing blocks. There are five victims placed in each map, and so the task has a theoretical maximum score of 350, but in practice a 30 minute run is not sufficiently long for the robot to achieve a perfect score.
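To make the scoring concrete, the following is a minimal Python sketch of Equation 4.1. The function name is illustrative, and the assumption that a completely shored ceiling also counts towards the partially shored term (so that the maximum of 350 is reached) is a reading of the text rather than a detail taken from the implementation.

def rescue_score(m_traversed, bottles, victims, partial_shorings, complete_shorings):
    # Score for a single rescue run, as defined in Equation 4.1. m_traversed is
    # the fraction of the arena traversed (0..1); the other arguments are counts.
    return (100 * m_traversed
            + 20 * bottles
            + 20 * victims
            + 20 * partial_shorings
            + 30 * complete_shorings)

# Covering the whole arena, delivering 5 bottles, visiting 5 victims, and
# completely shoring the single damaged ceiling (counted here as partially
# shored as well) gives the theoretical maximum of 350.
assert rescue_score(1.0, 5, 5, 1, 1) == 350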

Figure 4.10: Performance of the complete system on the rescue task. Success on the task incorporates the number of victims found, extent of exploration of the arena, and success on the two planning tasks. The scoring metric is defined in Equation 4.1.


For comparison, two control experiments are included. The first control (shown in blue in Figure 4.10) is run with the stepfield hazards removed, and the robot proceeds with the same control architecture. The second control (shown in red in Figure 4.10) runs with all stepfields present; however, the learning capability of the architecture is removed. That is, the two learning agents, the Terrain Recogniser and the Terrain Driver, are not included in the architecture. Figure 4.10 shows the score of the robot, averaged over all 5 maps. The results are averaged over 10 learning episodes (10 runs each), on each of 5 maps. Thus, the robot performs a total of 500 runs in the rescue arena20. Due to the difficulty of the tasks, and the limited time available, the robot is not able to achieve all the goals of the rescue arena, and never achieves a perfect score of 350. Predictably, the highest scoring configuration is the one in which the stepfields are removed. Clearly it is much easier for the robot to navigate the arena in this instance. The other two configurations, which do have stepfields, score much lower. In the initial runs, the configurations with and without the two learning agents achieve a similar score, achieving around a third of the objectives for the rescue, a score of around 80. However, an upward trend is visible in the robot's score over the course of the ten training runs, indicating that the learning system is able to improve the robot's ability to explore the environment and find victims. The improvement is not sufficient to produce a score comparable with that achieved without stepfields. That is, the learning system is not able to navigate the stepfields perfectly. However, by the end of the learning epochs, after 10 runs, the learning system is consistently scoring significantly higher than the configuration without learning. Although the environment is simulated, there is still a large degree of randomness, which is reflected in the robot's fluctuating score. This is mainly caused by the difficulty of stepfields as an obstacle for wheeled robots. Even if the robot's learner converges to a good drive policy, drive actions may still cause the robot to become stuck.

20Only the experiments using the complete architecture in the presence of stepfields are repeated this many times. The two control configurations are repeated 10 times on each of 5 maps, 50 runs total.


Figure 4.11: Performance of the learned driving controller over the course of ten missions in the rescue arena. During each mission, the robot invokes the driving controller several times (driving attempts), each of which generates several training examples for the driving learner. The vertical axis shows the fraction of the distance to the desired goal that the robot is able to reach.

This is evident in the erratic curve of the robot's score, where there is an early peak, presumably produced by several runs in which the robot is somewhat lucky, and scrapes through a stepfield that is too difficult for it to cross consistently. To further investigate the effect of the learned controller's driving on the robot's performance, the progress that the robot made towards its local goal was measured each time the learned controller was activated. Figure 4.11 shows the fraction of the distance towards the local goal achieved by each invocation of the drive

controller during the course of the ten runs in the rescue arena, again averaged over all maps. Again, a high degree of noise is visible in the Terrain Driver's learning curve (Figure 4.11), caused by the complex interaction between the robot's wheels and the random stepfields. Despite this, once again an upward trend is visible, with the average distance to goal reaching approximately 80% by the end of the learning epoch. These results demonstrate that the system is able to achieve many of the objectives of the complex rescue task posed to it, and moreover, that the embedded learning systems enabled the system to significantly improve its performance over the course of 10 runs in the rescue arena. However, the interaction between the two learning agents is not clear, as there is no mechanism to attribute the robot's improvement to either one based on the results of this experiment. Therefore, a further experiment was performed to investigate the extent to which the two-layer learning system is successful in its aims.

4.6.2 Experiment 2: Learning to Cope with Extreme Terrain

A novel feature of the system described here is the coupling of the two learning agents, to provide an introspective capability to the robot’s driving strategy. By using the performance of the Drive Learner agent as training data for the Terrain Recogniser agent, the complete system attempts to classify kinds of terrain upon which attempts to learn how to drive have proven unsuccessful. That is, terrain that is too difficult for any sort of driving policy to successfully negotiate, and constitutes a hazard even for an expert driver. To explore this aspect of the integrated architecture, the performance of the system in environments populated with increasingly difficult stepfield hazards was investigated. As mentioned earlier, the experiments described in Section 4.6.1 were conducted with the maximum stepfield block height of 20cm and minimum height of 5cm. This terrain is very challenging for wheeled robots. To test the ability of the system to recognise terrain that is beyond its ability to learn, the rescue

arena is populated with a set of stepfields of varying difficulty, some of which are even more difficult. Both the maximum height and the variance of the blocks in a stepfield were varied, multiplying both by a scaling parameter d, which can be considered a difficulty factor. The effect of varying d on the difficulty of stepfields is shown in Table 4.3 and visualised in Figure 4.12.

Difficulty Parameter d    Maximum Height (m)    Variation (m)
0.2                       0.10                  0.05
0.4                       0.15                  0.10
0.6                       0.20                  0.15
0.8                       0.25                  0.20
1.0                       0.30                  0.25

Table 4.3: Varying stepfield difficulty to investigate MALA’s capability to recognise difficult terrain. The effect of the difficulty parameter d on the maximum height and variation of stepfields.
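To illustrate how the difficulty parameter might translate into concrete stepfield geometry, the sketch below samples block heights from the values in Table 4.3. The sampling distribution and the number of blocks per cell side are assumptions; the thesis specifies only the maximum height and variation for each value of d.

import random

# Maximum block height and variation (metres) for each difficulty d (Table 4.3).
DIFFICULTY_TABLE = {
    0.2: (0.10, 0.05),
    0.4: (0.15, 0.10),
    0.6: (0.20, 0.15),
    0.8: (0.25, 0.20),
    1.0: (0.30, 0.25),
}

def sample_stepfield(d, blocks_per_side=14):
    # Sample a grid of 87 mm x 87 mm block heights for difficulty d. Heights are
    # drawn uniformly from [max_height - variation, max_height]; this sampling
    # rule is an assumption, as is the 14-block side (roughly 1.2 m / 87 mm).
    max_height, variation = DIFFICULTY_TABLE[d]
    return [[random.uniform(max_height - variation, max_height)
             for _ in range(blocks_per_side)]
            for _ in range(blocks_per_side)]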

Figure 4.12: Varying stepfield difficulty to investigate MALA’s capability to recognise difficult terrain. From left to right, difficulty parameter d varies from 0.2 to 1.0. Heights and variations for these d values are shown in Table 4.3. The Emu robot is shown for scale.

The robot explored the rescue arena, performing the same tasks as those described in Section 4.6.1. The experiment was run using three configurations of the robot's software architecture: first, the complete configuration, using both terrain learning agents, and second, a 'Driving Learning Only' configuration. In

this configuration the Terrain Recogniser agent is deactivated, and thus, the robot makes no attempt to recognise terrain that is too difficult. The performance of the control architecture with both learning agents deactivated was also measured as a control. In this configuration the robot has no awareness of rough terrain at all and simply drives to each goal using the flat-terrain motion planner and controller described in Section 4.3.5. As before, each of the three architecture configurations is run for 10 training episodes, each 30 minutes, averaged over 5 maps. This is performed for 5 different values of the difficulty parameter d. The results of this experiment are shown in Figure 4.13. It is immediately obvious, in all three variations of the architecture, that the score of the robot decreases substantially as the difficulty of the terrain increases. In particular, for d > 0.8 it is difficult, if not impossible, for the robot to traverse the stepfields, and as a result the robot is unable to obtain a score much above 50, regardless of configuration. The score of 50 is achieved by shoring the damaged ceiling, which is always very close to the arena entrance and does not require crossing stepfields. It is clear from the data in Figure 4.13 that although for the easier stepfields (d < 0.4) all three configurations achieve similar scores, when the difficulty increases, the complete system incorporating both learning agents (bottom graph) achieves the best performance. Thus, there is a steady upward trend for all levels of difficulty, except for the most difficult (d = 1.0) which, as mentioned earlier, is so difficult that no configuration can achieve a reasonable score. These stepfields, with a maximum height of 30 cm, frequently contain abrupt changes of height of over 20 cm, which the Emu's wheels cannot climb, regardless of drive policy. Interestingly, the configuration that does not include the Terrain Recogniser does not display a particularly prominent increase in performance over the course of the missions, for any difficulty. Recall that without the Terrain Recogniser, the Terrain Driver is activated to begin learning whenever a stepfield is detected. It may be that the benefit obtained from increasing driving proficiency is somewhat cancelled out by the fact that the robot spends all its time learning to


Figure 4.13: Rescue performance in environments of varying difficulty for several variants of the control architecture. Performance data for each variant is from 10 training runs on a single map. Difficulty of terrain is varied by altering the maximum height and variance of the blocks in the stepfield obstacles. The benefit of the higher-level learning component, terrain recognition, is most pronounced in the more (but not most) difficult environments. Top Left: The system without learning. Top Right: The Drive Learner agent only. Bottom: The complete system with Drive Learner and Terrain Recogniser agents.

drive, and insufficient time pursuing the objectives of the mission. In contrast, the complete configuration is able to continue to improve its score even in the case where the stepfields are very difficult (d = 0.8). These results demonstrate that the hybrid learning system of the two learning agents is able, at least to some extent, to capture a concept of traversability that enables the robot to recognise the limits of its ability to learn to cross stepfields. In particular, as the difficulty increases, and there is less benefit from attempting to drive the robot over stepfields, the added circumspection of the Terrain Recogniser enables the robot to improve its chances of achieving the mission objectives by avoiding stepfields where possible.

4.6.3 Conclusions

This chapter described an implementation of MALA in an urban search and rescue domain. Several of the task objectives required the generation and execution of logical plans. Because of the difficulty of the terrain in the environment, the extent to which the robot was able to execute these plans depended on control-level learning, that is, the acquisition of the control-level knowledge required to safely traverse the terrain. The successful completion of this challenging task requires many integrated components, including agents that are capable of learning and adapting. These results simultaneously demonstrate several phenomena that are core principles of this thesis. First, as has been reiterated throughout this thesis, online learning is an essential capability for autonomous robots. The ability of the learning system to enable the robot to improve its performance and cross the stepfields demonstrates this. In the case of environmental difficulties, such as terrain, that cannot be anticipated in advance, embedding learning within a robot's architecture is the only option. Second, learning occurs at multiple levels of abstraction. In this case, as the robot learns to drive over rough terrain, a meta-learner observes the quality of the driving behaviour to determine when it is safe to attempt to drive over a stepfield

or avoid it. The experimental results obtained indicate that this learning scheme is capable of improving the performance of the robot, aiding it in coping with a very challenging and unconstrained environment. Third, the integration of sub-symbolic and symbolic formalisms is necessary to solve tasks involving intelligent robots. The rescue task described in this chapter is an example of a problem domain that requires the robot to reason about sequences of actions to achieve complex goals, while simultaneously processing high-throughput data from multi-modal sensors to make sense of its environment. Many cognitive architectures that advocate a unified formalism would struggle to solve a task with this character. Finally, and most importantly, the MALA architecture enables all of the above capabilities to be integrated within a unified framework. This implementation represents a step forward in the construction of robotic systems that integrate planning and learning during long-term execution of an open-ended and challenging task. The remaining aspect of learning within MALA that is yet to be demonstrated is the implementation of online learning at all levels within the architecture, and in particular, the learning of task-level action models to facilitate the learning of new behaviours and skills.

Chapter 5

Integrated Task and Control-level Learning from Demonstration

5.1 Chapter Overview

In the previous two chapters, the Multi Agent Learning Architecture (MALA) was introduced, and its ability to successfully integrate symbolic planning with low-level robot control software was demonstrated. Chapter 4 described the addition of learning to MALA. However, the learning was limited to control-level knowledge. In particular, a control policy enabling a robot to drive over difficult terrain, and a feature-level description of rough terrain that is traversable by this controller, were learnt. MALA is intended to facilitate the inclusion of learning at all levels of abstraction. This chapter demonstrates how task-level learning can be incorporated into the framework. At the same time, it describes how the task-level learning system is coupled to a control-level learner, forming a hybrid learning system capable of solving difficult manipulation problems.


5.2 Introduction

Robots continue to migrate from highly structured and repetitive environments, such as industrial assembly lines, towards more open-ended scenarios. As environmental restrictions and assumptions are relaxed, autonomous systems must be capable of dealing with frequently varying conditions. Thus, the ability to continuously refine both high-level knowledge and behaviour implementation details is crucial. Increasingly, intelligent robot systems use machine learning, such as Learning from Demonstration (LfD), to cope with dynamic scenarios and to solve complex problems. This section briefly reviews LfD in general and Dynamic Motion Primitives (DMPs), a particularly influential formalism for LfD, in particular. It describes how the DMP framework is extended and integrated with classical planning to form a novel hybrid learning architecture.

5.2.1 Learning from Demonstration

Robots are dynamical systems with many degrees of freedom. Therefore, machine learning for robot control must often cope with a large search space. As discussed in Section 2.6.2, a natural method of alleviating this complexity is to instruct a robot to imitate the actions of a human trainer. This enables the robot learner to begin its search in a useful region of the space, ideally, already reasonably close to an optimal policy. This type of robot learning is referred to as LfD.

Dynamic Motion Primitives (DMPs)

To learn from a demonstration, a robot must have some mechanism for representing the actions of the trainer. It is also crucial that the representation can be generalised to related motions, to enable the robot to perform a similar behaviour in different situations. Beginning with the work of Ijspeert et al. (2002), DMPs have emerged as a formalism that has proven well suited to this task. A DMP represents the trainer’s motions using a system of nonlinear differential equations. An introduction to


DMPs and related background was given in Section 2.6.2. The use of the DMP framework to acquire novel robot behaviours has proved very successful, but the approach, on its own, is limited in several ways. First, once DMPs are constructed and refined, there is no mechanism in the framework to sequence primitives to execute robot tasks requiring multiple movements. Although ad hoc methods such as that proposed by Niekum (2013) have been proposed to learn sequences of primitives from longer demonstrations, without the use of an automated planner, the application of the learned primitives is limited to tasks that are essentially the same as the demonstrated task. Second, DMPs use differential equations as their representation language. This has proven a powerful mechanism to represent movements, partly because the continuous, smooth trajectories generated by both humans and good robot control policies are easily represented by a set of differential equations. However, a dynamical systems representation has a limited ability to capture relational information, and many complex concepts are very difficult to acquire without the ability to explicitly represent relationships. In particular, a purely numerical representation of actions such as DMPs limits the extent to which the logical consequences of actions can be represented. For complex robot manipulation tasks, numerical attribute representations of actions may not be sufficient. For example, Brown (2009) points out that in the case of tool use:

Tools are structured objects where the relationship between the con- stituent parts – and between the user, tool and target object – is critical to the effective functioning of the tool. A hammer, for example, should have a flat head which lies at a right-angle to the handle; the handle should be held near the base, and the flat surface of the hammer should be brought into contact with the blunt end of the nail, which in turn is held at right-angles to the piece of wood. Learning this sort of information in propositional (attribute-value) form is difficult. The learner would need to represent every possible relationship between each part of every object by an attribute, leading to a combinatorial


explosion.

Thus, the use of relational action models enables richer concepts to be captured with greater ease. In this chapter, a hybrid system is presented that integrates the DMP learning formalism with two additional components that address these two shortcomings. First, a task planner is integrated with the DMP representation, enabling the sequencing of multiple robot actions, each of which is implemented using a DMP at the control level. This enables the system to assemble complex plans of action to perform challenging tasks that may be quite different from the initial trainer demonstration. Second, the DMP learner and task planner are coupled to an Inductive Logic Programming (ILP) learner that continuously refines a relational description of action preconditions based on the robot's experience as it attempts to perform actions. This enables the hybrid framework to reason about relational properties of objects and motion. The learning system is evaluated on complex manipulation tasks that involve logical reasoning, performed by an autonomous robot and inspired by the Defense Advanced Research Projects Agency (DARPA) Robotics Challenge (Section 2.8.1). Representative tasks include opening doors and using objects as tools. Autonomous robots require the ability to integrate statistical learning with symbolic planning and reasoning, and the proposed system represents a step towards that goal.

5.3 Theory

As is typical when implementing symbolic models for robots (see Section 2.5.1), each action that the robot can perform is represented by an action model. The action model expresses the action's preconditions and effects in first-order logic. We extend the Planning Domain Definition Language (PDDL) action model by combining each action model with a DMP describing the physical characteristics of the motion involved in executing the action. The combined representation is termed a hybrid action model. This section introduces these action models in more

detail, and then describes how both components of the model are acquired from the trainer's demonstration. In Section 5.5, more detail is given on how the two representations are coupled together.

5.3.1 Hybrid Action Models

The robot’s representation of actions has two components. The low-level rep- resentation is provided by the DMPs, while the higher level is a PDDL action model. PDDL action models are logical representations of actions that define an action in terms of its preconditions (what must be established before the action is performed) and postconditions (what is established after the action is performed). Each action is assigned a PDDL action model to represent the task-level properties of the action and a DMP to represent its control-level properties. Hybrid action models are explored in detail and a specific example is given in Section 5.5.3. As outlined in the previous section, this hybrid representation provides two important improvements over current LfD approaches. First, since each robot behaviour is represented in PDDL, a language designed for planning, the automated planner can find a sequence of actions acquired by the learning system, enabling the robot to use the actions that it learns in isolation as part of novel, possibly complex, behaviours. Second, as the robot executes plans using the behaviours that it has learned, each successful or failed execution of a plan may be used as a training example to refine the pre- and post-conditions of each PDDL action model. Thus, each successful plan is used as a positive example of the pre- and post-conditions, and each failed plan a negative example, and an Inductive Logic Programming (ILP) learner is used to infer a logical description based on this data, allowing the system to refine its logical models of actions. High-level information flow between the various components in our system is shown in Figure 5.1. In the next section (5.3.2), the theoretical framework for control-level LfD is detailed, and Section 5.3.3 outlines the task-level learning components.


Figure 5.1: Information flow between components of the hybrid learning system. Task-level knowledge is represented by PDDL action models and learned using Explanation-Based Learning (EBL)/ILP. Control-level knowledge is represented by Dynamic Movement Primitives and learned from demonstration.

5.3.2 Control-level Learning

Dynamic Motion Primitives (DMPs) (Ijspeert et al., 2002; Ijspeert et al., 2013) provide a framework for encoding dynamical properties of motions demonstrated by a trainer in a generalisable fashion. DMPs model trajectories using coupled systems of nonlinear differential equations with well understood attractor dynamics (see Section 2.6.2 for further details). In this work, we use the initial formalism introduced by Ijspeert et al. (2002). As was described in Section 2.6.2, this framework specifies that attractor states of the dynamical system must be either points or limit cycles. Thus, both discrete actions with a specified goal point and rhythmic movements may be represented. Trajectories generated by integrating

a DMP are trajectories in the phase space of the canonical dynamical system and are bound to converge to the specified attractor point or limit cycle. Following Ijspeert et al. (2002), a DMP for a single degree of freedom, discrete movement y is described by

τż = α_z(β_z(g − y) − z) + f        (5.1)

τẏ = z        (5.2)

where g is the goal position, τ is a time constant, and α_z and β_z are positive constants. Choosing τ > 0 and α_z = 4β_z for critical damping ensures that y and z monotonically converge to the attractor point at y = g, z = 0. The nonlinear forcing function f enables the system to represent any smooth trajectory from the initial position y_0 to the goal point g. The forcing term is represented as a linear combination of N basis functions Ψ_i with weights w_i as

f(x) = ( ∑_{i=1}^{N} Ψ_i(x) w_i / ∑_{i=1}^{N} Ψ_i(x) ) · x(g − y_0)        (5.3)

where the phase variable x removes the time dependence from the system and is defined by

τẋ = −αx        (5.4)

where α is a constant. The nonlinear radial basis functions are defined as

Ψ_i(x) = exp(−h_i(x − c_i)²)        (5.5)

Since our aim is to model robot motions in Cartesian space, the variables y, ẏ, ÿ are interpreted as desired position, velocity, and acceleration. Thus an n-dimensional DMP can be fitted to a demonstrated motion by using linear regression to find the weights w_i. Once a DMP is fitted to the motion, the equations of motion can be integrated to enable the system to reproduce the motion, terminating at an arbitrary goal point and at an arbitrary speed. As will be shown in more detail in Section


5.5.3, a DMP is stored and embedded into a hybrid action as a vector of the weights w_1 ... w_n, where n is the number of basis functions used to fit the DMP.
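As an informal illustration of the fitting step described above, the sketch below fits the weights w_i of a one-dimensional discrete DMP to a demonstrated trajectory by rearranging Equations 5.1 and 5.2 to obtain a target forcing term and regressing it onto the basis functions of Equation 5.5. The gain values, basis placement, and regression details are illustrative choices; the thesis itself uses Schaal's MATLAB implementation (Section 5.5.5).

import numpy as np

def fit_dmp(y_demo, dt, n_basis=20, alpha_z=25.0, alpha_x=8.0):
    # Fit the weights w_i of a one-dimensional discrete DMP (Equations 5.1-5.5)
    # to a demonstrated trajectory y_demo sampled at interval dt.
    y_demo = np.asarray(y_demo, dtype=float)
    beta_z = alpha_z / 4.0                  # critical damping: alpha_z = 4 * beta_z
    tau = dt * (len(y_demo) - 1)            # movement duration used as the time constant
    y0, g = y_demo[0], y_demo[-1]

    yd = np.gradient(y_demo, dt)            # demonstrated velocity
    ydd = np.gradient(yd, dt)               # demonstrated acceleration

    # Canonical system (Equation 5.4): tau * xdot = -alpha_x * x, so x(t) = exp(-alpha_x t / tau).
    t = np.arange(len(y_demo)) * dt
    x = np.exp(-alpha_x * t / tau)

    # Target forcing term obtained by rearranging Equations 5.1 and 5.2.
    f_target = tau**2 * ydd - alpha_z * (beta_z * (g - y_demo) - tau * yd)

    # Gaussian basis functions (Equation 5.5), centres spaced evenly in phase.
    c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))
    h = n_basis / c**2
    psi = np.exp(-h[None, :] * (x[:, None] - c[None, :])**2)

    # Weighted linear regression for the weights of Equation 5.3.
    xi = x * (g - y0)
    w = np.array([np.sum(psi[:, i] * xi * f_target) /
                  (np.sum(psi[:, i] * xi**2) + 1e-10)
                  for i in range(n_basis)])
    return w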

5.3.3 Task-level Learning

As discussed earlier, a demonstration from the trainer provides not only information about how to perform the action, but also when the action is applicable and what its effects are. We use a form of EBL (Mitchell et al., 1986) to infer a new PDDL action model for every previously unseen action that the trainer performs. Following Brown (2013), the explanation-based heuristic we use is based on the assumption that the teacher is acting rationally to achieve a set of desired effects. Details of the algorithm are given in Section 5.5.5. Relational models explicitly capture the structure and relationships between entities in the environment. A single relational model is able to refer to a wide range of instances by encoding important relationships either between objects, or between components of a single object. Using the earlier example, a hammer can be described as a flat, heavy hitting surface attached to a handle at right angles. The application of a relational learning algorithm provides the ability to acquire relational descriptions of the contexts in which actions may be executed. These descriptions, the preconditions of each PDDL action model, are represented in Horn clause logic (Horn, 1951). To gather training examples that the ILP algorithm can generalise over, an active learning algorithm autonomously generates experiments for each action model. Each experiment consists of attempting to execute the action from a different set of preconditions. If the experiment succeeds, the example is a positive example for the preconditions of the action; otherwise it is negative. The active learning algorithm is detailed in Section 5.5.5, and the shape of the example-gathering loop is sketched below.
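The following sketch shows that example-gathering loop. The function and argument names are placeholders: feasibility checking corresponds to the constraint solving of Section 5.5.2, execution to the plan-execution machinery, and the resulting positive and negative examples would be handed to the ILP learner.

def gather_examples(action, candidate_states, is_feasible, execute):
    # Collect labelled training examples for the preconditions of `action`.
    # `candidate_states` is an iterable of relational states to experiment from,
    # `is_feasible` stands in for the constraint-solving check of Section 5.5.2,
    # and `execute` attempts the action and reports success or failure.
    positives, negatives = [], []
    for state in candidate_states:
        if not is_feasible(state):        # skip experiments that cannot be set up
            continue
        if execute(action, state):
            positives.append(state)       # successful execution: positive example
        else:
            negatives.append(state)       # failed execution: negative example
    return positives, negatives           # handed to the ILP learner to generalise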


5.4 Application Domain: Manipulation and Tool-use

As was pointed out in Section 1.5.2, many types of disaster rescue require robots to negotiate and engage with dynamic environments that are designed for humans. This has been the motivation for the recent DARPA Robotics Challenge (2014) described in Section 2.8.1. To successfully navigate such an environment, an autonomous robot must be able to manipulate many types of objects. To exit and enter rooms it is necessary to grasp and pull or twist a door handle. To engage or disengage hydraulic systems it may be necessary to rotate valves. In other circumstances the use of tools may be required to achieve necessary objectives. Motivated by the goal of rescue robots that can perform this type of task, the domain of manipulation and tool-use has been selected here to evaluate the hybrid learning system introduced in this chapter. The architecture provides control for a robot equipped with a six degree-of-freedom manipulator arm mounted on the Emu mobile robot used in previous Chapters. This modified Emu robot is shown in Figure 5.2. The system is given several difficult manipulation tasks involving opening doors and using tools. Section 5.6.2 describes the experimental domain and specific tasks given to the robot in more detail, and provides an analysis of the robot’s performance.

5.5 Hybrid Learning Implementation

This section details the implementation of the integrated learning system that acquires hybrid action models and enables sequencing and execution of these models to achieve goals. Section 5.5.1 begins by describing the perception components in the robot's architecture, and how a relational world state description is constructed. Section 5.5.2 details how these relations, that describe the state of the world, can be used to construct and solve constraint satisfaction problems (CSPs) in a manner similar to Section 3.6.2. Solution of these CSPs is used to infer whether the robot is able to move to a pose that will satisfy particular clauses, composed of conjunctions of relations, which is necessary to enable the robot to experiment using the initial action model learned from the trainer's demonstration.


Figure 5.2: Simulated Emu robot with six degree-of-freedom JACO manipulator arm mounted at the front.

Hybrid action models are introduced in more detail in Section 5.5.3, and the coupling between task- and control-level representations is explained. This enables the task-level specification in PDDL to bind to control-level motion goals so that the robot can execute the actions in any context. The control-level policy to execute a hybrid action model is a trajectory produced by the DMP motion planner, and is followed using an inverse dynamics controller, which is described in Section 5.5.4. Lastly, the learning components of the system are detailed in Section 5.5.5, namely, the EBL and DMP learners that acquire task- and control-level knowledge of new actions respectively from trainer demonstrations, and the active learner based on the Progolem (Muggleton et al., 2010) ILP algorithm that generalises new action models by experimenting with them. The complete set of software components and the information flow between them is shown in Figure 5.3.


Figure 5.3: Software components for the hybrid learning architecture introduced in this chapter. Agents are shown posting messages and listening for further frames to process. Perception agents, on the left, process sensor information and post percepts to the blackboard, which triggers the generation of goals. Learning agents that acquire and refine hybrid action models are highlighted (yellow online). Execution components are shown on the right.

5.5.1 Perception and State Representation

The central component of a mobile robot's perception system is a Simultaneous Localisation and Mapping (SLAM) module that produces a map that the robot can use to navigate. As in Chapter 4, Metric-based Iterative Closest Point (MBICP) is used for position tracking, and an implementation of the FastSLAM algorithm for mapping (Milstein et al., 2011).


Relational World State

The robot is provided with a complete low-level description of the world state; that is, the world is fully observable. This consists of both dynamic information specifying the positions and orientations of all objects in the world at each time step, and static information about the unchanging geometry and structure of each object in the world. These data consist of the shape, dimensions, and attachment points for each object and its components. However, this primitive state information does not enable the robot to reason about the logical state of the world. To do this, the robot constructs a relational state description consisting of first-order predicates. The state description is generated from a set of predicate definitions that are provided to the robot as background knowledge. Predicates specify relations between object poses, such as on(A,B), or static properties of objects, such as has_attached_side(A,B). This task is performed by the World Modeler agent. For every known predicate, and for every possible combination of world objects, the World Modeler evaluates whether the spatial relationship specified by the predicate holds for that particular combination of objects. If so, the predicate is added to the relational world state. For example, the definition of the on(A,B) predicate, given in Table 5.1, evaluates to true in situations where object A is both touching and above B. It is worth pointing out that the relational state description built in this manner does not explicitly represent exact object poses. For example, the predicate on(mug,book) specifies a very large number of possible primitive states in which the mug is on top of the book.
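The following sketch mirrors the World Modeler's loop described above: every known predicate is evaluated for every ordered pair of objects, and the literals that hold are collected into the relational state. The predicate definitions and thresholds are simplified stand-ins for the FRS definitions in Table 5.1.

import math
from itertools import permutations

def dist(a, b):
    # Euclidean distance between two 3-D points.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two predicates loosely mirroring Table 5.1; the thresholds are assumptions.
predicates = {
    "near":  lambda A, B: dist(A, B) < 2.0,
    "above": lambda A, B: A[2] - B[2] > 0.2,
}

def build_relational_state(objects, predicates):
    # For every known predicate and every ordered pair of objects, test whether
    # the relation holds and, if so, add the literal to the relational state.
    state = set()
    for name, holds in predicates.items():
        for a, b in permutations(objects, 2):
            if holds(objects[a], objects[b]):
                state.add(f"{name}({a},{b})")
    return state

objects = {"mug": (5.0, 0.0, 2.0), "book": (5.0, 0.5, 1.5)}
print(build_relational_state(objects, predicates))
# e.g. {'near(mug,book)', 'near(book,mug)', 'above(mug,book)'}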

5.5.2 Constraint Solving

As described in previous chapters, first-order predicates represent spatial information and can be interpreted as constraints on primitive state variables, and, in MALA, finding poses that satisfy sets of relationships is formulated as solving a CSP. For the current hybrid learning system, the constraint solver must be called in order to determine the feasibility of various spatial relationships. As mentioned


Predicate Name    FRS Definition
on(A,B)           def on(A,B) = colliding(A,B) $& getZ(loc(A)) $> getZ(loc(B));
near(A,B)         def near(A,B) = dist(loc(A),loc(B)) $< 2;
above(A,B)        def above(A,B) = (getZ(loc(A)) $- getZ(loc(B))) $> 0.2;
facing(A,B)       def facing(A,B) = diff(heading_of(A,B) - yaw(A)) $< 0.2;

Table 5.1: Definitions of several spatial predicates in FRS. These definitions are used by the World Modeler to determine whether a relation or set of relations given by specific predicates holds (Section 5.5.1), and by the Constraint Solver to determine whether a spatial pose for the robot specified by a conjunction of predicates is feasible (Section 5.5.2). The definitions are specified in terms of the constraint operators, notated using $'s, that were introduced in Section 3.6.2. As mentioned in the text, the constraints are also defined using several 'built-in' functions. These are getZ, which extracts the (vertical) z-component of a vector, loc, which extracts the location of an object, dist, which computes the Euclidean distance between two vectors, heading_of, the relative heading of one object from another, diff, which computes the absolute difference between two values, and finally, yaw, the angle that the robot's gripper is facing.

in Section 5.3.3, the active learning system, to be introduced in detail in Section 5.5.5, generates experiments to test the current hypotheses about the preconditions of the action model being learnt. However, to select an experiment to perform, the robot must first consider whether a particular experiment is feasible. That is, whether or not the conditions specified by the experiment are physically possible. In Chapters 3 and 4, a constraint solver written using ECLiPSe Prolog was used to search for solutions to CSPs; however, in this chapter a solver built using the JaCoP (Kuchcinski and Szymanek, 2014) constraint solving library is used, as it was more convenient to define all constraints solely in Java. To integrate the JaCoP solver with the MALA ecosystem, the frame language FRS (Section 3.3.2) was extended to include constraint operators (Section 3.6.2) with which constraints may be defined. The FRS definitions of each of the predicates used in the example clause below are given in Table 5.1. The constraints are defined not only in terms of the constraint operators, notated using $ symbols, but also using several additional 'built-in' functions that enable the constraints to refer to other properties of objects. For example, the function colliding(A,B) returns true if and only if objects A

and B are currently touching each other in the simulator. As is clear from the definitions in Table 5.1, each predicate specifies one or more constraints on the relative poses of the objects that it refers to. Thus, to determine whether the robot can feasibly move to a pose that can satisfy these constraints, the pose of the robot is treated as a variable, and the set of constraints, relative to the fixed poses of the other objects, in this case objA and objB, forms a CSP. The algorithm used by the JaCoP constraint solver to solve a CSP provided by a general clause C is given in Algorithm 5. To give a specific example of such constraint solving, consider the following simplified clause, given in Prolog notation, regarding a robot and two objects:

simple(robot,objA,objB) :-
    near(robot,objA),
    above(robot,objB),
    facing(robot,objA).

In this example, the clause specifies a situation in which the robot is close to one object, facing it, and simultaneously above the other. Constraint solving can be used to check whether such a pose is feasible. The FRS interpreter may then be called to satisfy the clause, which will cause it to construct the complete set of variables and respective constraints, specifying a CSP. If the locations of objA and objB are (5,0,2) and (3,0,−2) respectively, calling the FRS interpreter to satisfy the set of constraints imposed by the current example constructs a set of 23 variables, four of which are the solution variables, the (x, y, z, θ) pose21 of the robot, linked by a set of 22 constraints. These variables either refer to the pose of the robot, or are directly linked to the pose of the robot:

FloatVar: theta::{-1.00e+150..1.00e+150}, has 2 constraints.
FloatVar: xsqrd::{-1.00e+150..1.00e+150}, has 2 constraints.
FloatVar: suma::{-1.00e+150..1.00e+150}, has 2 constraints.
...
FloatVar: z::{-1.00e+150..1.00e+150}, has 2 constraints.
FloatVar: heading::{-3.14..3.14}, has 2 constraints.
FloatVar: y::{-1.00e+150..1.00e+150}, has 2 constraints.

21In general, the pose of the robot is specified using a spatial vector (x,y,z) and an orientation quaternion (rx,ry,rz,rw) as in Algorithm 5. In the example given, the angular representation is simplified for readability.


Algorithm 5 Solving a CSP using JaCoP

SOLVE_CONSTRAINTS
Input: A clause C specifying a set of relations. A world model W containing poses of the objects.
 1: Let S be a constraint store                 ▷ The 'store' is a JaCoP construct that specifies a CSP
 2: Set the components of the pose of the robot (x, y, z, rx, ry, rz, rw) as the set of solution variables for S.
 3: satisfy(C)                                  ▷ Call the FRS interpreter to satisfy the clause. This will cause it to execute the following for loops to construct the CSP.
 4: for each variable v in C do
 5:     Add a FloatVar to S                     ▷ A FloatVar specifies a floating point variable domain.
 6: end for
 7: for each constraint c in C do
 8:     Impose c on the FloatVars in S
 9: end for
10: Let boolean b = checkConsistency(S)         ▷ Once a set of constraints is imposed, JaCoP can check if they are consistent.
11: if b then
12:     Run depth-first search to solve S       ▷ Depth-first search in JaCoP is performed on intervals of floating point variables by iteratively bisecting floating domains and testing for solutions in either half.
13:     return the set Solns found by the search
14: else return null                            ▷ The constraints in C are unsatisfiable.
15: end if

FloatVar: x::{-1.00e+150..1.00e+150}, has 2 constraints.

where the variables xsqrd, suma, and heading are intermediate variables constructed to express the constraints between related quantities, such as the distance between the robot and the objects. Finally, it is possible to call the JaCoP solver to directly search for specific solutions to the given CSP, which produces:

Solution : [x=4.43, y=0.20, z=0.09, theta=0.80]

As mentioned in the comments within Algorithm 5, FloatVar is a JaCoP construct that specifies the domain of a floating point variable. To check whether the CSP specified by this set of constraints has any solutions, it is possible to call


JaCoP to check consistency (line 10 of Algorithm 5). This produces:

After imposing consistency..
Store:
x::{4.38..6.99}
y::{0.20..2.00}
z::{0.01..3.83}
theta::{0.80..1.20}

The existence of this solution is an existence proof that the pose is feasible, and thus the robot can attempt to reach it. Typically, the specific solution is not relevant; it is simply important to know that solutions do exist. This is important when testing whether or not an experiment can be performed, as will be seen in later sections. The final function of the constraint solver is to provide a bounded region that contains all the solutions to the CSP. This is important to enable the robot to find a pose that not only satisfies the constraints enforced by a clause, but is also reachable by the robot arm. To produce such a bounded region, the bounds on the solution variables produced by the consistency check can be used to build a constrained region22. For example, for the example considered so far, the constrained region is:

R = { x ∈ [4.38, 6.99], y ∈ [0.20, 2.00], z ∈ [0.01, 3.83], θ ∈ [0.80, 1.20] }

which can then be used in combination with inverse kinematics (Section 5.5.2) to search for a pose that satisfies a given clause and is reachable by the robot. In later sections (particularly Section 5.5.5), the clauses that are used to construct CSPs refer specifically to preconditions of the hybrid action model being learned, and the solution of the relevant CSP enables the robot to determine whether a spatial pose exists that enables it to satisfy the constraints in the preconditions, and thus,

22In general there is no guarantee that the solution space of a CSP will be convex, and so a region bounded in this manner may contain points that do not satisfy the constraints. In practice, this situation was never observed and this simplification was sufficient for our purposes.

whether it is possible to perform that experiment. If solutions to the CSP exist, the experiment is feasible and can be attempted.
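One possible way to combine the bounded region R with a reachability test is sketched below: candidate poses are sampled from the region and the first pose for which an inverse kinematics solution exists is kept. The sampling strategy and the ik_solution_exists placeholder are assumptions, not the JaCoP-based procedure used in MALA.

import random

def find_reachable_pose(region, ik_solution_exists, samples=100):
    # `region` maps each solution variable to a (low, high) interval, as returned
    # by the consistency check; `ik_solution_exists` is a placeholder for the
    # inverse kinematics reachability test of Section 5.5.2.
    for _ in range(samples):
        pose = {var: random.uniform(lo, hi) for var, (lo, hi) in region.items()}
        if ik_solution_exists(pose):
            return pose          # feasible and reachable: the experiment can proceed
    return None                  # no reachable pose found within the region

# The region R from the worked example above.
R = {"x": (4.38, 6.99), "y": (0.20, 2.00), "z": (0.01, 3.83), "theta": (0.80, 1.20)}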

Trainer Demonstration

As mentioned earlier, the robot’s learning process begins with a demonstration of an action by a human trainer. The demonstration is performed by teleoperating the robot using a joystick. The state of the robot’s arm is specified by j, a vector of joint angles, and the gripper state:

j(t) = (j_1, j_2, j_3, j_4, j_5, j_6, g)        (5.6)

The gripper state is a real value between 0 and 1, with 0 describing a completely closed hand and 1 a completely open hand. During a demonstration, this arm state vector is recorded, producing a time series of arm pose data describing the action performed by the trainer. These data are then downsampled to 10 Hz and converted to a list of end effector poses using forward kinematics, following Niekum et al. (2012) and Stulp and Schaal (2011). To perform both forward and inverse kinematics calculations for the JACO arm, an analytical kinematics solver built using the IKFast module of OpenRave (Diankov, 2010) was used. The end effector pose is specified by a position vector and an orientation quaternion. The state of the robot's gripper is also specified. Thus, the data produced by the trainer's demonstration are provided to the robot in the form of an 8-variable time series.

x(t) = (x, y, z, r_x, r_y, r_z, r_w, g)        (5.7)
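A small sketch of this preprocessing step is given below: the recorded joint-space demonstration is downsampled to 10 Hz and each sample is converted to the 8-D end-effector representation of Equation 5.7. The forward_kinematics argument stands in for the IKFast-based solver, and the nearest-sample downsampling is an assumption.

import numpy as np

def preprocess_demo(joint_states, timestamps, forward_kinematics, rate_hz=10.0):
    # Downsample a recorded demonstration to 10 Hz and convert each joint-space
    # sample (Equation 5.6) into the 8-D end-effector representation of
    # Equation 5.7. `forward_kinematics` stands in for the IKFast-based solver
    # and is assumed to map six joint angles to (x, y, z, rx, ry, rz, rw).
    timestamps = np.asarray(timestamps, dtype=float)
    joint_states = np.asarray(joint_states, dtype=float)   # shape (T, 7): j1..j6, g
    sample_times = np.arange(timestamps[0], timestamps[-1], 1.0 / rate_hz)
    demo = []
    for ts in sample_times:
        j = joint_states[np.searchsorted(timestamps, ts)]   # nearest recorded sample at or after ts
        pose = forward_kinematics(j[:6])                    # (x, y, z, rx, ry, rz, rw)
        demo.append(np.append(pose, j[6]))                  # append the gripper state g
    return np.array(demo)                                   # shape (len(sample_times), 8)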

5.5.3 Level-coupling for Robot Manipulation

Interaction between the task-level and control-level is required to ensure proper coupling, i.e., that the logical conception of an action can be operationalised. Specifically, a goal point must be given to the DMP motion planner so that the end-point of the control-level trajectory is appropriately grounded by the parameters mentioned

in the PDDL model. To effect this coupling in a manner general enough to enable a robot to execute complex manipulation actions involving the use of objects as tools, two assumptions are made. The first is the existence of a target parameter p_target, and the second the existence of a tool parameter p_tool. First, it is assumed that one of the parameters mentioned in the effects of the action represents the identity of an object that is the target of the action. The target parameter refers to an object that the robot's arm is intended to move towards to execute the action. For example, the hybrid action model for hit_with, hitting a target object with a hammer, is shown in Figure 5.4. The target object is inferred from the trainer's demonstration by selecting the object in the world that is closest to the robot's hand, discounting the object the robot is holding, if any, at the conclusion of the demonstration. For the hit_with action model, the parameter in the PDDL model that has been identified as referring to the target object p_target is ?target. This assumption amounts to the restriction that only actions that involve the execution of a movement relative to a target object can be represented. Not all actions satisfy this requirement (for example, 'waving' its arm does not); however, many useful actions such as kicking a ball, grasping an object, or toggling a switch do have this character. To execute the hit_with action, the robot must, whilst holding a hammer object, move its gripper, with some speed, towards the location of the target object. However, the goal point for the trajectory of the robot's gripper is not the location of the target itself, but a point near it that ensures the hammer will make suitable contact with the target object. This point is offset from the location of the target object by a fixed vector O_trainer. The offset vector is different for every action, but it is assumed that the same offset vector enables a particular action to be executed using many different target objects. For the hit_with action (Figure 5.4), the offset vector has been inferred from the trainer's demonstration to be O_trainer = (−0.37, −0.44, 0.43), by taking the offset between the robot's gripper and the location of the target object at the conclusion of the trainer's demonstration.


task-level:
(:action hit_with
  :parameters (?robot ?hammer ?target ?gripper)
  :precondition (isRobot ?robot)
  :effect (and (is_hit_by ?target ?hammer)
               (is_not_hit_by ?target ?gripper)
               (facing ?robot ?target)
               (near ?robot ?target)
               (above ?robot ?target)))

coupling parameters:
O_trainer = (−0.37, −0.44, 0.43), T_trainer = (0.22, 0.11, 0.22), R_trainer = (0.84, −0.54, −0.03, −0.07)
p_target = ?target, p_tool = ?hammer

control-level:
DMP-x: {−2.604, −1.698, ..., 104.272}
DMP-y: {1.136, −1.336, ..., −0.143}
...
DMP-g: {3.275, −5.756, ..., 0.366}

Figure 5.4: Hybrid action model for the hit_with action, in which the robot attempts to hit a target object using a tool object as a hammer. This model was inferred from a trainer demonstration, enabling the fitting of DMPs to describe the motion involved. For each dimension, a single DMP is fitted using n = 20 basis functions; thus each DMP is specified using the weights w_i of the basis functions Ψ_i. Coupling parameters O_trainer, the offset of the goal point for the robot's hand from the location of the target object p_target = ?target; R_trainer, the rotation of the robot's gripper during execution of the trainer's action; and T_trainer, the dimensions of the tool object p_tool = ?tool, are also inferred from the trainer's demonstration. An initial PDDL model of the task-level action has been constructed via EBL; however, the preconditions remain to be refined using active learning.

A second assumption is necessary to enable the robot to experiment with tools of different sizes and, to some extent, of different shapes. This is the existence of a tool parameter p_tool for actions that involve tool use. If the preconditions of an action, constructed by explaining the trainer's demonstration, specify that the robot is holding an object, the parameter referring to the held object is designated the tool parameter for the action. For example, when the trainer began the demonstration

of hit_with, the robot was holding the hammer object. Thus, when the constants in the trainer's demonstration are converted to variables as part of EBL (Section 5.5.5), the variable that referred to the hammer is identified as the tool parameter. In this case, ?hammer is identified as p_tool, as shown in Figure 5.4, and the dimensions T_trainer are the dimensions of the minimum-volume axis-aligned bounding box for the hammer tool. Once the tool object is identified in this manner, it is possible to compensate for the use of a tool of a different size to that used by the trainer, by scaling the relevant dimension of the offset vector. This is achieved by the scaling matrix S in Equation 5.8. This enables the robot to adjust, for example, the control-level goal of a hitting action to compensate for hammers of different lengths. Thus, the coupling parameters enable the hybrid action model to be applied to arbitrary objects, which is necessary for the model to be used in planning experiments (Section 5.6.3). For example, if it is necessary to use a new hammer with dimensions T = (T_x, T_y, T_z) to hit a nail which is at location L_nail, then the control-level goal point G for the DMP motion planner23 can be calculated using the coupling parameters:

G = L_nail + R · (R_trainer⁻¹ · O_trainer) × S        (5.8)

where R is the current rotation quaternion of the robot's hand, the symbol '·' denotes quaternion multiplication, × denotes matrix multiplication, and the index i is given by

i = argmax{O_x^trainer, O_y^trainer, O_z^trainer}        (5.9)

23Since movements are represented by 8-D trajectories (Section 5.5.2), an 8-D goal point must be specified to the DMP motion planner. The cartesian (x,y,z) component is given by Equation 5.8, and the orientation components (rx,ry,rz,rw) are given by the current orientation of the hand. As will be detailed in Section 5.5.5, the rotation of the robot’s hand remains fixed during execution of an action. The gripper component g is given by the final value of the trainer’s demonstration trajectory.


is the index of the largest component of O_trainer, and the scaling matrix S is given by

S = I + [T_i / T_i^trainer − 1] e_i        (5.10)

where I is the identity matrix, and e_i is the ith Euclidean basis vector. The quaternion multiplication for a particular manipulation action ensures that the initial offset O_trainer is transformed into the current frame of the robot's hand to provide the correct control-level goal point for the action. The DMP motion planner can then generate a trajectory to reach G, for the robot's arm to follow, to implement the action.
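A sketch of Equations 5.8 to 5.10 is given below. It follows Equation 5.9 as written (taking the argmax over the raw offset components) and interprets the scaling matrix S as scaling the corresponding component of the offset; the quaternion convention (x, y, z, w) and these interpretations are assumptions rather than details taken from the thesis implementation.

import numpy as np

def rotate(q, v):
    # Rotate vector v by the unit quaternion q = (x, y, z, w).
    x, y, z, w = q
    u = np.array([x, y, z])
    v = np.asarray(v, dtype=float)
    return 2.0 * np.dot(u, v) * u + (w * w - np.dot(u, u)) * v + 2.0 * w * np.cross(u, v)

def conjugate(q):
    x, y, z, w = q
    return (-x, -y, -z, w)

def goal_point(L_target, R_hand, O_trainer, R_trainer, T_tool, T_trainer):
    # Equation 5.8: express the trainer's offset in the current hand frame.
    offset = rotate(R_hand, rotate(conjugate(R_trainer), O_trainer))
    # Equation 5.9 (as written, without absolute values): dominant offset component.
    i = int(np.argmax(O_trainer))
    # Equation 5.10, read as scaling that component by the ratio of tool sizes.
    scale = np.ones(3)
    scale[i] = T_tool[i] / T_trainer[i]
    return np.asarray(L_target, dtype=float) + offset * scale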

5.5.4 Execution

As shown in Figure 5.4, the task-level components of the hybrid action models are PDDL actions. Thus, it is straightforward to use an automated PDDL planner to find action sequences to achieve specified goals. As in earlier chapters, an off-the-shelf implementation of the Fast-Forward planning system (Hoffmann and Nebel, 2001) was used as the automated planner. This can, however, be replaced by any equivalent PDDL planner. When the robot is presented with a task-level goal (Section 5.6.3), the task planner attempts to produce a sequence of hybrid action models that will achieve it. These must then be executed, one by one, by implementing their control-level policies.

Inverse Dynamics Controller

For the robot to execute an action described by a hybrid action model, it must determine the goal point of the action (Equation 5.8). Once given this goal point, the DMP motion planner integrates the set of DMPs encoded by the model, which constructs a trajectory that specifies a series of waypoints for the robot's end effector, so that the robot can perform the action. Since the DMPs for each action are fitted one per dimension of the trainer's demonstration, the trajectory produced by the DMP motion planner is in terms of

the pose of the end effector. Trajectories of this type are often referred to as 'task space' or 'operational space' trajectories. That is, the output of the DMP motion planner is a time series of waypoints with the same 8-D structure as the trainer's demonstration (Equation 5.7). Typically, the robot cannot directly execute commands in task space, and requires the commands to be given in 'joint space', that is, specified in terms of a set of desired joint angles for the robot's arm. It is possible to convert trajectories to joint space using inverse kinematics; however, there is no guarantee that the resulting joint-space trajectory will be smooth. Naive inverse kinematics solving of points that are close in task space often produces 'joint flips', configurations of the arm that are not close in joint space, resulting in large discontinuous movements, which are undesirable and can be dangerous on a physical robot. To smoothly follow these task space trajectories, an inverse dynamics controller is needed. Inverse dynamics control refers to calculation of the appropriate joint positions and velocities so that the end-effector reaches a desired position and velocity. To provide this control, the desired joint-space velocity is calculated from the task space velocity using an inverse velocity kinematics algorithm based on the weighted pseudo-inverse with damped least squares (Wampler, 1986). This is implemented using the Orocos Kinematics and Dynamics Library (KDL) (Smits, 2014). Desired task space positions and accelerations are calculated by numerical integration and differentiation respectively. Further details on the formalism and implementation of this controller can be found in Appendix C.
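As a generic illustration of the damped least-squares scheme cited above, the sketch below resolves a desired task-space velocity into joint velocities using the weighted pseudo-inverse. The damping value and weighting matrix are illustrative, and the thesis's actual controller is the Orocos KDL implementation described in Appendix C.

import numpy as np

def dls_joint_velocities(J, xdot_desired, damping=0.05, W=None):
    # Resolve a desired task-space velocity into joint velocities using the
    # weighted pseudo-inverse with damped least squares:
    #   qdot = W^-1 J^T (J W^-1 J^T + damping^2 I)^-1 xdot
    J = np.asarray(J, dtype=float)                   # task Jacobian, shape (m, n)
    m, n = J.shape
    W_inv = np.linalg.inv(W) if W is not None else np.eye(n)
    A = J @ W_inv @ J.T + (damping ** 2) * np.eye(m)
    return W_inv @ J.T @ np.linalg.solve(A, np.asarray(xdot_desired, dtype=float))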

5.5.5 Learning

The components of the system that enable it to acquire and refine task-level descriptions of new actions, and also acquire control-level details of their implementation, are now described.


Explanation-based Learner

Through observation of the abstract world state, both before and after a trainer takes control of the robot arm and performs an action, it is possible for the learner to construct an abstract description for a new action model based on the setup and effects of the trainer's action. Following the work of Brown (2009), a form of EBL is used to produce this novel action model. The aim is to explain what the new action does, and to build an initial abstract model of this action. In this work, the trainer's demonstration consists of a single action only, and thus segmentation is not necessary. However, when the demonstration is less structured, for example, when the teacher performs a complete sequence of actions to achieve a goal, segmentation must be performed to identify the individual actions in the plan. Heuristic (Brown, 2009) or Bayesian (Niekum et al., 2012) approaches have been used to perform this segmentation. A PDDL model for the novel action can be constructed by identifying the subset of literals in the novel action's start and end states which are relevant to explaining how the teacher achieved the goal. The preconditions for the action are simply the logical state of the world when the trainer's demonstration begins, and the effects of the PDDL model are the observed state changes which occur during the execution of the teacher's action. Specifically, the effects are defined as the set of literals that change between the logical world state at the beginning and end of the trainer's action. The algorithm for the induction of a new PDDL action using EBL is given in Algorithm 6. One term of note is (S_end^target ∩ S_start^target), which ensures that true literals regarding the target object are not destroyed by the action. This is important in order to learn correct action models, as will be seen in more detail during the discussion of the learning experiments, particularly when learning to use a tool as a hammer.
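Since the induction step reduces to set operations over the observed literals, it can be sketched compactly, as below. The string encoding of literals, the function name, and the simple "mentions the target's name" filter are illustrative assumptions; the actual system operates on relational state descriptions produced by MALA's perception components.

def induce_action_model(s_start, s_end, target):
    # Build a PDDL-style action model from the states observed before and after a
    # demonstration (Algorithm 6): preconditions are the start state, effects are the
    # state changes plus preserved literals about the target object.
    s_start, s_end = set(s_start), set(s_end)
    mentions_target = lambda lit: target in lit
    s_start_t = {l for l in s_start if mentions_target(l)}
    s_end_t = {l for l in s_end if mentions_target(l)}
    effects = (s_end - s_start) | (s_end_t & s_start_t)
    preconditions = set(s_start)
    return {"preconditions": preconditions, "effects": effects}

# Example: a grasp demonstration on a box (literals illustrative)
before = {"near(robot,box)", "open_gripper(robot)", "on(box,table)"}
after = {"near(robot,box)", "holding(robot,box)", "on(box,table)"}
print(induce_action_model(before, after, "box"))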

Dynamic Motion Primitive Learner

As described in Section 5.5.1, the demonstration data vector x(t) produced by the trainer's demonstration is given to the robot in the form of an 8-variable time series.


Algorithm 6 Induction of new action model

1: Identify the relational state description Sstart at the start of the trainer's demonstration.
2: Identify Send at the end of the demonstration.
3: Identify the sets S_start^target and S_end^target of relations that involve the target object.
4: Construct the set of effects E = (Send − Sstart) ∪ (S_end^target ∩ S_start^target)
5: Construct the set of preconditions P = Sstart
6: Construct a new action model M. The preconditions are set to P and the effects to E.

The tuple giving the components of the vector is reproduced here for convenience.

x(t) = (x,y,z,rx,ry,rz,rw,g) (5.7)

where (x,y,z) is the cartesian position of the robot's end-effector, (rx,ry,rz,rw) is a unit quaternion describing its orientation in space, and g is the state of the robot's gripper. To implement control-level learning of the trainer's movements, a DMP is fitted to each dimension of the demonstration trajectory using the formalism given in Section 5.3.2. The MATLAB implementation of DMPs provided by Schaal (2010) was used to perform fitting and integration of DMPs. Since the four components of a unit quaternion are not independent, quaternions produced by directly integrating each component in isolation are not guaranteed to correspond to valid rotations (Ude et al., 2014). Thus it is not possible to construct rotational trajectories by integrating each rotational DMP in isolation and recombining them, as is done for the cartesian dimensions (x,y,z). To address this, like Stulp et al. (2011), it is assumed that the orientation of the robot's hand remains fixed during execution of a DMP trajectory.
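For illustration, the per-dimension fit can be sketched as follows, using one common discrete-DMP formulation (a spring–damper transformation system driven by a phase-dependent forcing term approximated with Gaussian basis functions). The gains, the basis-function placement heuristic, and the function name are illustrative assumptions; the thesis itself uses Schaal's MATLAB implementation.

import numpy as np

def fit_dmp_1d(x, dt, n_basis=20, alpha_s=4.0, K=100.0, D=None):
    # Fit the forcing-term weights of a discrete DMP to one dimension of a demonstration.
    # Transformation system (tau taken as the demonstration duration T):
    #     tau^2 x_ddot = K (g - x) - D tau x_dot + (g - x0) f(s)
    # so f_target is recovered from the demonstration derivatives and then approximated
    # with Gaussian basis functions of the phase s via locally weighted regression.
    D = 2.0 * np.sqrt(K) if D is None else D          # critically damped by default
    x = np.asarray(x, dtype=float)
    T = len(x) * dt
    x_dot = np.gradient(x, dt)
    x_ddot = np.gradient(x_dot, dt)
    x0, g = x[0], x[-1]

    t = np.arange(len(x)) * dt
    s = np.exp(-alpha_s * t / T)                      # canonical system phase, decays from 1 towards 0
    f_target = (x_ddot * T**2 - K * (g - x) + D * T * x_dot) / (g - x0 + 1e-10)

    centres = np.exp(-alpha_s * np.linspace(0.0, 1.0, n_basis))   # centres spread evenly in time
    widths = n_basis ** 1.5 / (centres * alpha_s)                 # common width heuristic
    psi = np.exp(-widths * (s[:, None] - centres) ** 2)           # (timesteps, n_basis)

    xi = s                                            # the forcing term is weighted by the phase
    w = np.array([np.sum(psi[:, i] * xi * f_target) / (np.sum(psi[:, i] * xi ** 2) + 1e-10)
                  for i in range(n_basis)])
    return {"weights": w, "K": K, "D": D, "x0": x0, "g": g, "alpha_s": alpha_s}

# Example: fit one Cartesian dimension of a synthetic minimum-jerk-like reach
t = np.linspace(0.0, 1.0, 200)
demo = 0.3 * (10 * t**3 - 15 * t**4 + 6 * t**5)
dmp = fit_dmp_1d(demo, dt=t[1] - t[0])
print(dmp["weights"][:5])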

Progolem Active Learner

An important contribution of this research is that the hybrid action model, as defined here, captures what an action is for as well as how to perform it. However, the definition of an action’s purpose acquired through EBL (described in Section 5.5.5) is too specific. Thus, it is essential for the learner to generalise its model of


Figure 5.5: Visualisation of a version space of hypotheses, bounded by the most-specific (hS) and most-general (hG) boundaries. The + and – symbols in the figure represent positive and negative examples respectively. Figure from (Brown, 2009).

each action to maximise the set of situations in which the action can be applied with a reasonable expectation of success. LEX (1983) demonstrated that a form of learning by experimentation is applicable to this type of problem. The learner performs a loop, in which it iteratively constructs experiments designed to test its current hypotheses in the most informative manner possible. The learning algorithm presented here builds on that developed by Brown (2009), which conducts experiments to generalise knowledge in a manner reminiscent of LEX. In this section, this learning method, and how it is used to generalise action models, is described. The extensions made to Brown's initial active learning algorithm are also outlined. The generalisation of the action model is formulated as an ILP (see Section 2.6.3) learning task. The target for the ILP learner is the precondition of the PDDL action model. Training examples are generated by experimentation with the action, such that if the effects of the action are achieved by the experiment, it is labelled a positive training example, otherwise it is a negative example. Instead of maintaining a single best hypothesis for the current representation of the target

concept, the learner constructs a form of version space (Mitchell, 1982), in which the space of possible hypotheses is bounded by a most-general hypothesis hG and a most-specific hypothesis hS. This construction (shown in Figure 5.5) enables the learner to generate new experiments by picking clauses that lie within the version space, that is, the region bounded by hG and hS, and testing them. The main steps of the learning algorithm are shown in Algorithm 7, and each of the subtasks, denoted by function names in capitals, are given in Algorithms 8-13.

Algorithm 7 Generalisation of new action model by active learning

LEARN_FROM_EXPERIMENTATION
Input: New Action Model M, Ntrials
1: hS = preconditions of M.
2: hG = true.
3: while index < Ntrials do
4:   e = GENERATE_EXPERIMENTS(hS, hG) (Algorithm 10)
5:   if M has a tool parameter (Section 5.5.3) then
6:     tool = SELECT_TOOL(hS, hG) (Algorithm 11)
7:   end if
8:   for each ei in e do
9:     eground = FIND_FEASIBLE(ei) (Algorithm 12)
10:    if eground ≠ null then
11:      break
12:    end if
13:  end for
14:  if FoundExperiment then
15:    success = EXECUTE_EXPERIMENT(tool, eground) (Algorithm 8)
16:    if success then
17:      Label eground pos.
18:      hS = GENERATE_MOST_SPECIFIC (Algorithm 13)
19:    else
20:      Label eground neg.
21:      hG = GENERATE_MOST_GENERAL (Algorithm 9)
22:    end if
23:    Add ei to training data.
24:  end if
25:  Increment index
26: end while


Hypothesis Generation

At each iteration during the learning process, the first step is to construct the boundary hypotheses hG and hS. Two components of the Progolem learner described by Muggleton et al. (2010) are used for this process. In Brown's work, the construction of hS was performed by randomly selecting pairs of positive examples and computing a generalisation operator, resulting in a minimally general clause that covers both examples. This process is repeated until all examples are covered or coverage stops improving. This algorithm was first used in the GOLEM learner described by Muggleton and Feng (1992). The generalisation operator used in GOLEM, and also in Brown's work, is based on Plotkin's (1972) Relative Least General Generalisation (RLGG). The length of a clause constructed using the RLGG grows exponentially in the number of examples, requiring systems using the RLGG to restrict background knowledge clauses to satisfy an ij-determinacy property. That is, only clauses in which all terms are determinate, in the sense that there is only one binding possible for existentially quantified variables, are acceptable. Muggleton (2010) points out that this restriction is equivalent to requiring that predicates in the background knowledge must represent functions. In this work, the RLGG is replaced by the Asymmetric Relative Minimal Generalisation (ARMG) proposed by Muggleton et al. (2010), and the random selection of pairs of examples to generalise is replaced by a greedy beam search to select the best ARMG with respect to coverage. This algorithm, referred to by Muggleton et al. as the Best-ARMG algorithm, and shown in Algorithm 13, is used to construct the most-specific boundary clause hS. Clauses produced by the ARMG are bounded by the length of the initial bottom clause, so no restriction of the background knowledge is necessary. Details of the algorithm are found in (Muggleton et al., 2010).

Generation of the most-general hypothesis hG is achieved using negative-based reduction. This algorithm aims to generalise a clause, in this case hS, by greedily removing any literal that is not needed to prevent a negative example from being proved.

Thus hG is always a subset of hS. The negative-based reduction algorithm that is used here follows Santos (2010), and is shown in Algorithm 9.


Algorithm 8 Execute a learned action to perform an experiment

EXECUTE_EXPERIMENT
Input: Experiment clause e, tool object tool, primitive world state P, number N of iterations to find a suitable start pose
1: if tool ≠ {} then
2:   PICK_UP(tool)   ▷ PICK_UP is a hand-coded behaviour to pick up an object of interest.
3: end if
4: Construct the constraint satisfaction problem C, specified by the conjunction of constraints (Section 5.5.2) imposed by all literals in eground.
5: Query the constraint solver for the bounding region R that contains all solutions to C.
6: while startPose = {} & i < N do
7:   r = random point inside R
8:   Solve the inverse kinematics (Section 5.5.2) for r
9:   if a solution exists then
10:    startPose = r
11:  end if
12:  increment i
13: end while
14: if i ≥ N then
15:   return false
16: end if
17: Set the arm's pose to startPose
18: o = M's offset vector (Section 5.5.3)
19: ocurrent = transform o using the transformation of the current pose of the target object, relative to the pose that existed in the trainer's frame.
20: t = the location of the target object identified by the target parameter (Section 5.5.3) of M and queried from the primitive state P.
21: goalPose = t + ocurrent
22: motionPlan = use the DMP to plan a trajectory from startPose to goalPose
23: Execute motionPlan using the inverse dynamics controller (Section 5.5.4).
24: success = check the current relational state to determine if the effects of the action hold
25: return success


Algorithm 9 Induction of most general hypothesis using GOLEM’s negative-based reduction. Adapted from (Muggleton and Feng, 1992).

GENERATE_MOST_GENERAL Input: Most specific clause hS

1: hG = hS
2: n = length(hG)
3: while true do
4:   Let Bi be the first literal in hG = B1, B2, ..., Bn such that A = B1, ..., Bi covers no negative examples
5:   hG = Bi, B1, ..., Bi−1
6:   if length(hG) < n then
7:     n = length(hG)
8:   else
9:     return hG
10:  end if
11: end while
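A compact rendering of this reduction is sketched below, assuming a clause is represented as an ordered list of body literals and that a covers_negative callback (a placeholder name, supplied in practice by the ILP engine) reports whether a candidate body proves any negative example.

def negative_based_reduction(h_s, covers_negative):
    # Generalise the most-specific clause hS by negative-based reduction (Algorithm 9):
    # repeatedly keep only the shortest prefix of body literals needed to exclude all
    # negative examples, moving its last literal to the front, until the clause stops shrinking.
    h_g = list(h_s)
    n = len(h_g)
    while True:
        # find the first i such that the prefix B1..Bi covers no negative examples
        for i in range(1, len(h_g) + 1):
            if not covers_negative(h_g[:i]):
                break
        h_g = [h_g[i - 1]] + h_g[:i - 1]     # reorder to Bi, B1, ..., B(i-1)
        if len(h_g) < n:
            n = len(h_g)
        else:
            return h_g

In practice the callback would be a query against the background knowledge and the stored negative examples, for example a coverage test run by the Prolog engine that hosts the learned clauses.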

Algorithm 10 Generate a set of possible experiments

GENERATE_EXPERIMENTS Input: Boundary clauses hS and hG, enum constant R

1: e = {}, experiment = hG
2: for each literal L ∈ hS do
3:   Append L to experiment
4:   Add experiment to e
5: end for
6: if R = SpecificFirst then
7:   Reverse e
8: else
9:   if R = Random then
10:    Shuffle e
11:  end if
12: end if
13: return e


Algorithm 11 Select a tool for the next experiment. From Brown (2009).

TOOL_SELECTION
Input: Boundary clauses hS and hG, relational world state W
1: Retrieve the set T of possible tools from W
2: Scoremax = 0, toolchosen = {}
3: for each tool ∈ T do
4:   Score = 0
5:   for each literal LG ∈ hG do
6:     if tool satisfies LG then
7:       Increment Score
8:     else
9:       Score = 0
10:      break
11:    end if
12:  end for
13:  if Score ≠ 0 then
14:    for each literal LS ∈ hS do
15:      if tool satisfies LS then
16:        Increment Score
17:      end if
18:    end for
19:  end if
20:  if Score > Scoremax then
21:    Scoremax = Score
22:    toolchosen = tool
23:  end if
24: end for
25: return toolchosen


Algorithm 12 Test if an experiment is feasible

FIND_FEASIBLE
Input: Experiment clause e, relational world state W
1: O = list of all objects mentioned in W
2: P = set of free parameters mentioned in e
3: A = set of possible assignments of parameters to objects in O
4: for each assignment a ∈ A do
5:   Assign the free parameters as specified by a, producing a ground clause eground
6:   Construct the constraint satisfaction problem C, specified by the conjunction of constraints imposed by all literals in eground.
7:   Query the constraint solver (Section 5.5.1) if solutions exist to C
8:   if a solution exists then
9:     return eground
10:  end if
11: end for


Algorithm 13 Induction of most specific hypothesis using Progolem’s BEST- ARMG. Reproduced from (Muggleton et al., 2010).

GENERATE_MOST_SPECIFIC
Input: Examples E, sample size K, beam width N
1: Select a random positive example e
2: Calculate ⊥e, the bottom clause relative to the background knowledge
3: best_armgs = {⊥e}
4: while new_armgs ≠ {} do
5:   Sbest = highest SCORE from best_armgs
6:   Erandom = K randomly selected positive examples from E
7:   new_armgs = {}
8:   for each C ∈ best_armgs do
9:     for each e′ ∈ Erandom do
10:      C′ = ARMG(C, e′)
11:      if SCORE(C′) > Sbest then
12:        new_armgs = new_armgs ∪ C′
13:      end if
14:    end for
15:  end for
16:  if new_armgs ≠ {} then
17:    best_armgs = highest scoring N clauses from new_armgs
18:  end if
19: end while
20: return highest SCORE from best_armgs

SCORE
Input: Examples E, Clause c
1: P = number of positive examples covered by c
2: N = number of negative examples covered by c
3: return P − N − length(c)
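The beam search and the SCORE function can be sketched as follows, treating the ARMG operator and clause coverage as callbacks supplied by the ILP engine and taking the bottom clause as an input; all of the names below are placeholders for illustration.

import random

def score(clause, positives, negatives, covers):
    # SCORE from Algorithm 13: positives covered minus negatives covered minus clause length.
    p = sum(covers(clause, e) for e in positives)
    n = sum(covers(clause, e) for e in negatives)
    return p - n - len(clause)

def best_armg(positives, negatives, bottom_clause, armg, covers, k=5, beam=3):
    # Greedy beam search over asymmetric relative minimal generalisations (Best-ARMG).
    # armg(clause, example) and covers(clause, example) are assumed to be provided
    # by the underlying ILP implementation.
    best = [bottom_clause]
    while True:
        s_best = max(score(c, positives, negatives, covers) for c in best)
        sample = random.sample(positives, min(k, len(positives)))
        new = []
        for c in best:
            for e in sample:
                c2 = armg(c, e)
                if score(c2, positives, negatives, covers) > s_best:
                    new.append(c2)
        if not new:
            break
        best = sorted(new, key=lambda c: score(c, positives, negatives, covers), reverse=True)[:beam]
    return max(best, key=lambda c: score(c, positives, negatives, covers))

In the thesis these callbacks correspond to Progolem's ARMG operator and to Prolog coverage tests against the background knowledge.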



Experiment Selection

The next step is to use the boundary hypotheses hG and hS to generate experimental hypotheses that can be tested, expanding the set of training data available to the learner. If the action model being learned contains a tool parameter, then part of the experiment selection involves selecting a tool, to test current hypotheses about the required structural properties of the tool. To perform tool selection, the same algorithm developed by Brown, given in Algorithm 11, is used here. Two algorithms are used to generate experiments: the method developed by Brown (2009), denoted Specific-First, and a second variant, termed Random. The Specific-First algorithm generates experiments which are maximally specific, that is, as close to hS as possible. The algorithm, detailed in Algorithm 10, successively adds literals from hS to hG until no more can feasibly be added. As described in earlier sections, each predicate describes either a structural property (e.g. has_num_attachments(hammer,1)) or a spatial relationship that enforces a constraint on object poses (such as near(hammer,nail)). The feasibility of experiments is evaluated by searching exhaustively for a set of parameter values to ground the experiment clause such that all structural literals are true, and all spatial constraints have solutions. The Random experiment generation algorithm is the same in all respects, except that it shuffles the candidate experiments and only adds literals up to a random maximum clause length 0 < n < |hS|. The effect is that the experiments chosen vary in their generality, causing the learner to perform less constrained experiments, since there are fewer literals in the experiment hypothesis. The effect of the choice of experiment selection algorithm on the learning process is investigated in the experiments in the next section.
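Schematically, the two strategies differ only in how candidate clauses lying between hG and hS are ordered and truncated. The sketch below treats clauses as lists of literal strings and combines the shuffle of Algorithm 10 with the random length cap described above; the function name and example literals are illustrative.

import random

def generate_experiments(h_s, h_g, strategy="SpecificFirst"):
    # Build candidate experiment clauses inside the version space by extending hG with
    # literals from hS. Specific-First returns the most specific (longest) candidates
    # first; Random caps the clause length at a random value and shuffles the result.
    extra = [lit for lit in h_s if lit not in h_g]
    candidates = []
    clause = list(h_g)
    for lit in extra:
        clause = clause + [lit]
        candidates.append(clause)
    if strategy == "SpecificFirst":
        candidates.reverse()                      # most specific experiments first
    elif strategy == "Random":
        max_len = random.randint(len(h_g) + 1, len(h_g) + max(1, len(extra)))
        candidates = [c for c in candidates if len(c) <= max_len]
        random.shuffle(candidates)
    return candidates

# Example with literals of the kind that appear in the learned grasp model
h_g = ["facing(A,B)"]
h_s = ["facing(A,B)", "near(A,B)", "open_gripper(A)", "above(A,C)", "level_with(A,B)"]
print(generate_experiments(h_s=h_s, h_g=h_g, strategy="Random"))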


5.6 Experiments

The hybrid learning system was evaluated using a set of manipulation tasks, inspired by the recent DARPA Robotics Challenge (2014). This challenge is a large scale disaster rescue scenario, loosely based on the aftermath of the 2011 Great East Japan Earthquake at the Fukushima Daiichi nuclear power station. As was detailed in Section 2.8.1, the central focus of the challenge required robots to engage with human-engineered environments to enable their use in rescue situations. DARPA specified a set of very challenging tasks to encourage the development of such robots (2013). The tasks are general examples of the types of tasks autonomous rescue robots need to accomplish in such disaster scenarios. For example, robots require the ability to open doors (Figure 5.6). The complete set of tasks required for the challenge was listed in Section 2.8.1. The experiments described in this section are focused on learning to perform actions that are crucial in these settings. In particular, the system is used to learn a representative set of four actions: [grasp, place_on, hit_with, open_door]. These actions are essential when attempting to solve manipulation problems in indoor environments, and could be used as building blocks to execute the types of tasks required for the DARPA Robotics Challenge. Section 5.6.2 describes experiments that evaluate the ability of the system to learn each action from demonstration and discusses the results. Section 5.6.3 investigates the ability of the system to use the learned action models to formulate plans of action to achieve goals requiring reasoning and use of multiple actions in sequence.

5.6.1 Experimental Domain

As in other chapters, experiments are conducted using the open source jMonkeyEngine3 game engine (see Appendix A), which simulates physical interactions using the Bullet physics engine (Coumans, 2010).


Figure 5.6: The winner of the trial phase of the DARPA Robotics Challenge, the S-One robot. S-One, from team SCHAFT, is shown performing the ‘clear-debris’ task during trials in 2013. (Photo Credit: DARPA).

Simulated Robot Arm

The simple parallel jaw gripper used for manipulation in the previous chapter is insufficiently dextrous to perform many types of complex manipulation that are of interest. To explore the learning of manipulation behaviours, a more sophisticated robot manipulator was modelled. The manipulator arm used is a model of the JACO robot arm produced by Kinova Robotics (2013), mounted on the Emu autonomous rescue robot used for experiments in earlier chapters. The JACO arm is a 6-DOF arm that is controlled by independent torque controllers for each joint. The arm and robot can be seen in Figures 5.7, 5.10, and 5.11. To demonstrate the performance of an action, the control of the robot is switched to teleoperation and the trainer controls the movement of the robot arm using a joystick.

Tools and Objects

Experiments are conducted in a relatively sparse simulated world, with the robot arm either mounted to a fixed pedestal or mounted to the mobile Emu robot for tasks requiring movement. Since grasping of unknown or complex objects is not the focus of this research, the objects that the robot manipulates are either simple shapes such as boxes and cylinders, or in the case of tools such as hammers, they are compound shapes built from several such simple shapes. Some of the tools and objects used in the experiments can be seen in Figure 5.7.

5.6.2 Experiment 1: Learning New Actions

The action learning architecture in this research is general enough to acquire many kinds of manipulation actions. To demonstrate this, four different actions are learned. The first action, grasp, is simply reaching for and grasping a target object. The place_on action involves carrying an (already grasped) object to a point above another object and releasing it so as to place it upon the target. The open_door action requires the robot to grasp the lever-style handle of a door, pull down until the door unlocks (at an angle of 30◦) and then pull backwards to open the door. The final action, hit_with, involves swinging a tool object so that it collides with a target object in a hammering motion. For each of the four actions, the trainer performs the action successfully to bootstrap the learning process. The initial model of the action24 is constructed using the EBL module (Section 5.5.5). The robot then begins experimentation to generalise the action model further. This experimentation requires the robot to execute the action beginning from a variety of different preconditions and observe the results. The results of learning are shown in Figure 5.8, and the clauses constructed by learning are shown in Tables 5.2 and 5.3.

24 The initial models of the four actions constructed in this manner are shown in Table C.2 in Appendix C.


Figure 5.7: Visualisation of the four learned manipulation actions. (a) grasp = Grasping a target object. (b) place_on = Placing an object upon a target object. (c) open_door = Opening a door. The robot is seen grasping and pulling downwards on the handle of a door, required in order to unlock it. (d) hit_with = Hitting a target object using another object as a hammer. The robot is shown holding the hammer tool, poised to hit the cylindrical target object.

Figure 5.8 shows the mean success rate improving as the active learner experiments with each action. As already described, each experiment consists of attempting to perform the action with a different set of preconditions and recording whether the action succeeds.


Figure 5.8: Learning curves for the four manipulation actions. Clockwise from top left: grasp, place_on, hit_with, open_door. Success rates are averaged over 20 learning episodes. Error bars depict the 95% confidence intervals.

As the learner incrementally generalises the clause representing the action's preconditions using the training data provided by the results of previous experiments, its success rate improves. All learning trials were performed using both the Random and Specific-First experiment generation algorithms (Section 5.5.5). Inspection of the learning curves in Figure 5.8 shows that the robot's performance converges towards perfect execution significantly faster when the learner generates its experiments using the Specific-First strategy. This is not surprising,

since more specific experiments are more similar to the demonstration and less likely to generate negative examples. However, the disadvantage of this learning strategy is that it is inherently less adventurous – its lower failure rate leads to fewer negative examples, which in turn means its ability to generalise is limited (recall that construction of the most general clause is via negative-based reduction). This difference is clear when inspecting the preconditions learned using the two algorithms (Tables 5.2 and 5.3). The most-general clauses learned using Specific-First experiment generation are very short – only one or two literals – leaving most of the description to the most-specific clause. This indicates that although the robot is able to perform the task successfully, the learning process is not near convergence, since at convergence we would expect the hypotheses hG and hS to be the same. In the case of the learner that generates its experiments in a more random manner, the two clauses are much more similar. Although the two clauses are only identical for the case of the place_on action (Table 5.2), in all cases the most-general clause is significantly longer, indicating improved inference of the constraints necessary to execute the various actions. One might expect that since the action models learned using Specific-First experiment generation are overly general due to the lack of negative examples, they would be insufficiently constrained to succeed when applied in a wider context. We investigate this possibility in Section 5.6.3, and the data are found to support this proposition.


Most Action Most specific clause Most general Most specific clause general Name (s) clause (r) (r) clause (s) above(A,C), above(A,C), facing(A,B), level_with level_with level_with(A,B), (A,B), (A,B), grasp facing open_gripper(A), facing(A,B), facing(A,B), (A,B) (A,B). parallel(A,C), open_gripper on(B,C), near(A,B), (A), near(A, open_gripper near(B,D), B). (A), near(A,B). on(D,E).

facing(A,C), holding(A,B), near(A,C), facing(A,C), near(A,C), facing facing(A,C), holding(A,B), place_on above(A,C), (A,C), holding(A,B), above(A,D), (A,B,C) above(A,D), on(C,D). above(A,D), near(A,C), on(C,D), on(C,D). on(C,D). perpendicular (B,D).

Table 5.2: Results of learning action model preconditions for the two actions grasp and place_on. First two columns learned with 'Specific-First' (s) experiment generation and second two columns learned with 'Random' (r) experiment generation. Both experiment generation algorithms are detailed in Section 5.5.5.

185 5.6. Experiments near(A,C), facing(A,C), closed), specific clause (r) longer(B,E), Most perpendicular(B,D). . First two columns learned has_shape(C,sticklike), above(A,C), above(A,D), has_shape(B,sticklike), has_attached_side(B,G), holding(A,C), facing(A,B),longer(C,D), handle_state(C,unlocked), has_num_attachments(C,0), longer(D,B), longer(B,C), has_num_attachments(B,F), holding(A,B), has_attached_side(C,none), longer(C,B), door_state(B, facing(D,B), longer(D,B). hit_with C), and general (B,D), closed), clause (r) near(A, sticklike), above(A,C), Most has_shape(B, facing(A,C), longer(B,E). holding(A,C), door_state(B, holding(A,B), perpendicular facing(A,B). open_door facing(A,B), holding(A,B), 5.5.5 . specific clause (s) near(D,B), above(A,D), longer(B,E). parallel(C,B), has_handle(B,C), Most perpendicular(B,D), perpendicular(D,B), perpendicular(B,D), door_state(B,closed), has_shape(C,sticklike), has_shape(B,sticklike), has_shape(B,sticklike), longer(C,B), longer(C,D, handle_state(C,unlocked), longer(D,B), longer(B,C), has_num_attachments(C,0), has_num_attachments(B,0), has_num_attachments(B,1), above(D,B), longer(D,B). holding(A,C), has_attached_side(C,none), has_attached_side(B,none), facing(A,C), has_attached_side(B,both), , . general clause (s) door_state Most (B,closed) facing(A,B) facing(A,C). holding(A,B), experiment generation algorithms are detailed in Name Results of learning action model preconditions for the two actions with ‘Specific-First’ (s) experiment generation and second two columns learned with ‘Random’ experiment generation. Both (A,B,C) (A,B,C) hit_with open_door Action able 5.3: T


As described in Section 5.5.5, the active learner represents the space of hypotheses consistent with the current set of training examples using the most-specific and most-general boundary hypotheses hS and hG. This is related to the version space as defined by Mitchell (1982). Given enough additional training instances, hS and hG should eventually converge to sets containing the same description. At this point the system has converged to the only consistent generalisation within the given hypothesis language. However, Mitchell pointed out (1982; 1997) that there are two cases in which this version space convergence will not occur even after many training examples. The first is noise in the training examples. In the case of execution of tasks by a robot, there is clearly significant noise within the training data set, as actions may unexpectedly fail, even if executed in the correct context, or unexpectedly succeed despite being executed in a nominally incorrect manner. This is the case even in a simulated robotics domain like the one in which the current experiments are performed, as the simulation of physics produces interactions which are highly complex and often not reproducible. Second, the version space will not converge if there is no hypothesis within the description language that correctly expresses the target concept. Since the target concept in the present situation is the preconditions for an action, a correct expression would be a set of preconditions that always produced a successful execution of the action. The existence of noise within the system means that such an action model does not exist, even in principle, thus the complete convergence of the version space cannot be used as a marker of successful learning. Therefore, the robot is allowed to perform n trials, and the mean success rate of the robot's actions is measured as it performs experiments. To demonstrate the robustness of the system, results are averaged over N = 20 learning episodes of n trials each. Experimentation with the learning system showed that 20 trials was sufficient for the robot to achieve good performance on the manipulation actions that did not involve the use of a tool, while the hit_with action required somewhat longer. Thus, the value of n was set to 20 for the simpler

actions and 30 for hit_with. To provide a comparison of the number of trials taken to learn each manipulation action, the number of trials taken to reach a mean success rate of 90% was measured. The results of this comparison are shown in Figure 5.9.

Figure 5.9: Comparison of the number of trials required to learn various actions. For the purpose of this comparison, convergence is defined by a mean success rate of 90% for a given action.

Learning to Adapt to a Control-level Limitation

One literal that is conspicuously present in the most-general clause for all four of the actions is facing(A,B) . Since A always refers to the robot, this literal indicates that the robot’s hand must be oriented with the fingers pointing towards the target object. In the current implementation, one restriction placed upon the trajectories that can be generated by the DMP motion planner is that the orientation of the hand is fixed (see Section 5.5.5). Thus, if the action begins with the hand facing

away from the target, it will be unable to perform any action requiring its fingers, which, of course, destroys its ability to manipulate objects. The appearance of the facing literal in all the most-general clauses indicates that the ILP system has discovered, and adapted to, this restriction on the control-level implementation of robotic actions. The larger number of trials required to learn the hit_with action (Figures 5.8 and 5.9) can be explained by the need to learn a description of a suitable tool with which to hit the target object. The learner experiments with the tools provided, gradually refining its hypothesis to describe those tools that are used successfully. In the given scenario, there are five possible hammers (Figure 5.10), three of which are useful, and two are not. The two bad hammers are problematic for different reasons: one – the 'shortstick' – is too short, so that attempting to use it ensures that the robot's hand collides with the target object. Comparison with the required effects for 'hit_with'25 shows that the robot's hand must not hit the target object for the action to be deemed a successful attempt at hammering. The second unusable tool, termed the 'zigzagstick', is complex in shape, and attempting to use it to hit is difficult or impossible. The extent to which the learner has successfully discovered these problems with two of the tools can be seen in the inferred most-general clause (Table 5.3). Although properties of the tool are not mentioned in the clause constructed by the learner using the Specific-First algorithm, the Random selection learner does include literals describing the tool's structure. First, the has_shape(B,sticklike) literal indicates that the tool parameter B must have a simple shape, ruling out the 'zigzagstick'. Second, the presence of the longer(B,E) literal indicates that the chosen tool must be longer than a free parameter E. In practice, this eliminates the 'shortstick' tool since E matches to the robot's hand. Thus, the robot has successfully learned a description of the structural properties that distinguish acceptable hammers. The results of this experiment can thus be summarised as follows:

25 See Table C.2 in Appendix C.


Figure 5.10: Learning which tools can be used to execute the hit_with action. During all learning episodes, the robot is provided with a set of objects (in this case, potential hammers) to experiment with. From left to right, the tools are referred to in the text as ‘zigzagstick’, ‘hammer’, ‘hookstick’, ‘stick’, and ‘shortstick’. The green cylinder is the target object.

• The convergence of the mean success rates indicates that the system has successfully learned to execute all four manipulation actions.

• The task-level learning was able to adapt to and overcome a limitation of the control-level system, namely, learning that for all actions, it was necessary for the robot’s hand to begin facing the target object.

• Of the two experiment generation algorithms used, Specific-First produced faster learning convergence; however, the action models acquired using the Random method produced a more accurate description of the action, a point that is investigated further in the next set of experiments.


5.6.3 Experiment 2: Planning Using Learned Actions

A strength of the current system is its ability to use classical planning to sequence DMP-based actions to achieve complex goals. This capability is demonstrated through several reasoning tasks, each requiring approximately eight or nine actions to be executed in sequence to achieve the goal. To enable the robot to use the actions learned during the learning episodes described in the previous section, the robot was given two additional actions, which are manually programmed. The first, named position_relative_to, is a preparatory action that makes the pose of the robot's hand satisfy the spatial constraints specified in preconditions, for example, the facing(A,B) constraint that proved essential to execute all four actions. The second additional action, called move_near_to, allows the robot to drive to a target location, enabling it to be near a target object. Both of these actions are executed using a combination of the constraint solver and motion planner following the algorithm described in previous chapters – the constraint solver is used to find a goal point satisfying the conjunction of constraints, and the motion planner to solve for a trajectory taking the robot's hand from its current pose to the target pose. For position_relative_to, the MoveIt! motion planner (Sucan and Chitta, 2014) is used to plan trajectories for the robotic arm, and for move_near_to, the A∗ planner described in Chapter 4 is used to find paths for the robot's mobile base. To demonstrate the planning capabilities of the system and the applicability of the hybrid action models learned during Experiment 1, three planning tasks were given to the robot. First, and simplest, is the 'Build_Tower' task, in which the robot must stack several objects of varying shapes and sizes in a variation of the classic 'blocks world' AI problem. The second task, 'traverse_rooms', requires the robot to obtain a target object that is in an adjacent room joined by a closed door. Finally, the 'attach' task is a simplified version of using a hammer and nail to attach two objects. The robot must stack two target objects one upon the other, place the nail on top, and hit the nail, using another object as a hammer. Figure 5.11 shows


Figure 5.11: Execution of planning tasks using learned action models. (a) The 'Build-Tower' task: discs or blocks are stacked to form a tower. The robot is shown about to complete a tower by placing the red disc upon the blue block. (b) 'Traverse-Room': the robot must open a door and pick up a target object in an adjacent room. The robot is shown grasping the handle of the door and, having 'unlocked' the handle, pulling the door open. (c) 'Attach': a simplified version of using a hammer to hit a nail to attach two objects. The nail is placed upon the two objects in a stack and then the nail is hit with another object. The nail is visible as the small grey object on top of the stacked red and green objects.

the simulated robot performing these tasks. Each trial consists of giving the robot the goal that will lead it to construct a plan to achieve each task. The experiment is repeated 20 times, and performance data averaged over all trials. The Prolog clauses learned by the ILP system during Experiment 1 are


Task             Learning strategy   Plan length   Success rate (%)   Completion time (s)   Successful actions
Build Tower      Specific First      6             18.9 ± 11.1        92.1                  2.65
Build Tower      Random              8             85.7 ± 9.2         105.3                 7.53
Attach           Specific First      10            18.0 ± 10.7        84.5                  3.62
Attach           Random              12            80.3 ± 11.1        166.1                 9.57
Traverse Rooms   Specific First      6             -                  -                     0
Traverse Rooms   Random              9             84.3 ± 10.4        115.8                 8.11

Table 5.4: Results of execution of planning problems using the hybrid action models learned in Experiment 1. Each goal was attempted 20 times and data shown is the average over all trials. Success rate is the mean rate of achievement of the goal, shown with 95% confidence interval.

converted into PDDL action models so that they can be used by the Fast-Forward planner to assemble plans (see Section 5.5.4). However, as was discussed in Section 5.6.2, although the learning of each action has converged to the point where the mean success rate is very high, the version space for each action has not collapsed into a single clause26. That is, the boundary hypotheses hG and hS are not the same at the end of learning. For this reason there is a choice as to whether to use the learned most-specific or most-general clause to construct the PDDL preconditions. Since the most-general clause is constructed to be the widest set of circumstances under which an action is feasible, this clause is chosen. The focus of the present experiment is the applicability of learned actions for solving planning tasks, so if an action fails, no replanning is done. Completion time is measured from the beginning of the execution trial until the experiment ends, either in action failure or successful achievement of the goal. Results of this experiment are shown in Table 5.4. Inspection of the results shows a differing degree of success between the two systems, which act using the action models learned with the two different experiment generation strategies discussed in the previous section. The system using the Random actions is able to achieve the goal with a high success rate for all three goals. The failures that are

26 With one exception: for place_on the version space has converged completely (Table 5.2).

present are a result of the small failure rate of each action chained together in a sequence, for example the robot's hand occasionally slips off the door's handle, or the hammer will occasionally miss the nail if the robot has grasped the hammer at an angle sufficiently different from that in the demonstration. A far more dramatic source of execution failure is obvious when it comes to execution using Specific-First actions – the robot is unable to succeed more than ∼18% of the time. It is clear that when planning using these actions, the system constructs shorter plans, as shown by both the plan length and the completion time data. The reason for this is that the overly general preconditions constructed by the learner when using Specific-First experiment generation do not contain spatial constraints on the robot's hand such as facing, parallel, or above, and thus the preparatory action position_relative_to is rarely selected by the planner. This results in the robot attempting to execute grasping actions from initial poses which make it particularly difficult for action execution to succeed. This situation is made significantly worse by the fact that in the planning problems, objects are located both in front of and behind the robot (in the learning trials all objects were in front of the robot; compare Figures 5.10 and 5.11a). Thus, if the robot picks up an object that is behind it, and attempts to place it on an object in front of it without executing the position_relative_to action to change the orientation of the hand, it cannot succeed. Although the learner discovered this using the Random experiments, it did not when Specific-First experiments were used, and thus the actions learned do not generalise to the more challenging setting in this experiment. A further problem with these learned actions is that during the learning of the grasp action, the near predicate was not acquired, thus the robot never recognises the need to physically move close to an object before it can be grasped. This is manifested in the planner not using move_near_to to construct plans. In the case of the 'Traverse-Rooms' task, in which it is necessary to move the robot close to the door in order to open it (Fig 5.11b), the plans generated are infeasible, and the robot never succeeds.


5.6.4 Conclusion

The results of the experiments described in this chapter demonstrate that the system presented is capable of sequencing robotic actions requiring relatively complex dynamic movements into logically valid plans to solve goals. It is also clear that it is essential to perform sufficient experimentation with each action during learning for the system to appreciate the limitations of each action – the Random experiment generation was capable of this, but the Specific-First algorithm used in earlier work (Brown, 2009) was not. The system presented in this chapter contains several novel features. First, the ability to learn from a demonstration at two distinct levels of representation enables the system to acquire hybrid models that are simultaneously useful for logical planning and directly executable by a robot. Second, the refinement of these models by active experimentation allows the system to autonomously generalise the knowledge acquired from the demonstration to produce models of action that are useful in much wider contexts. Of particular note was the fact that the task-level learning system was able to discover a limitation in the control-level representation and compensate for it. This hybrid learning system demonstrates several of the types of capabilities that, as argued in earlier chapters, are necessary for robots to operate in a life-long learning or long-term autonomy context, and represents a step towards architectures capable of reaching that goal in the future.

Chapter 6

Discussion and Conclusion

Autonomous robots capable of long-term operation in human-built environments must, like humans, be capable of many types of learning. This is a direct consequence of the dynamic character of the world. The Multi Agent Learning Architecture (MALA) described in this thesis attempts to demonstrate one type of structure that can support this range of learning. The hybrid nature of the learning is emphasised throughout. In a complex entity, learning does not occur in isolation. Learning at high levels of abstraction constrains the search space of lower level learners, and the results of low-level learning must simultaneously influence that at higher levels. To construct MALA, several principles were identified. They are:

• Specialised modular components
• Integration of symbolic and sub-symbolic processing
• Continuous online learning
• Hybrid learning

This combination of features separates MALA from other architectures for robots. MALA was used to implement control systems for a simulated autonomous rescue robot. Inclusion of specialised components enabled the system to perform various aspects of robot sensing and control often ignored by research into cognitive

architectures. Integration of classical planning with sub-symbolic formalisms enabled the robot to reason about its actions and construct plans to achieve goals (Section 4.3.4). Continuous online learning enables the robot to gradually improve its performance in a difficult and unpredictable environment, to the extent that it is able to achieve goals successfully (Section 4.6). Finally, inclusion of hybrid learning enabled the robot to simultaneously acquire task- and control-level models of manipulation actions from trainer demonstrations (Section 5.6), and use this knowledge to solve difficult tasks in a complex manipulation domain. Much research in robot learning is conducted without consideration of the complete scope of robot software architecture. Often, a learning module is coupled to the smallest possible set of software components, and improvement in robot performance is demonstrated. In contrast, the learning demonstrated using MALA is embedded within a framework that provides a more complete set of the capabilities required by an autonomous robot. This enables learning to be interleaved with execution of complex tasks. It is somewhat unusual to demonstrate improvement of robot performance on complex tasks whilst the robot is embedded in a relatively long-term problem-solving episode, and using MALA, it has been possible to investigate this in some detail. Since the nature of human intelligence is clearly dependent on the interaction of many capabilities and structures, research restricted to individual components cannot hope to emulate or explain human performance (Langley, 2012). Thus, much of the work in this thesis is systems-level research that considers the operation of many interacting components, rather than the level of individual algorithms. Future work that attempts to drive robot performance towards the level of human performance in unstructured, realistic, long-term settings should preserve this integrated character. Drawing conclusions about systems-level research has been notoriously difficult (Kortenkamp and Simmons, 2008; Langley et al., 2009; Christensen et al., 2010). However, it is argued here that the variety of robot functionality, in particular, learning and planning, that has been demonstrated using MALA is difficult to

implement either without a specific robot architecture, or using an architecture that does not adhere to the principles listed above. Thus, the experiments described in this thesis are, to some degree, a validation of these guiding principles for robot architectures. The remainder of this chapter discusses several insights gained during this research and notable limitations of the framework as presented, and describes avenues for future research.

6.1 Lessons and Limitations

6.1.1 Evaluation of Robot Architectures Remains a Challenge

A particular challenge associated with research in integrated architectures is how best to evaluate and make comparisons amongst different frameworks. Since integrative research occurs at the systems level, evaluating it poses different challenges than the evaluation of component-level methods. Several authors, including Lampe (2006) and Jones et al. (2012), suggest the construction of a performance metric to evaluate the performance of robot architectures in realistic environments, one that can measure reactivity, robustness, information gain, and other aspects of task success. A related approach was used to evaluate the performance of MALA in the extended autonomous rescue domain in Chapter 4, as the extent to which the robot was persistently able to explore, and gain knowledge of the layout of the rescue arena and the locations of victims within it, was encapsulated in the rescue score. However, this type of demonstration does not explicitly compare the performance of MALA with other architectures. To some extent, it would be possible to implement the same control systems in other architectural frameworks, and compare them directly. This type of comparative study is often missing from research in architectures (Langley et al., 2009), at least partially because of the significant development time required to implement robot control systems in multiple frameworks. It remains a challenge to provide unbiased mechanisms for analysing the strengths and weaknesses of different architectures. In this regard, Langley

and Messina (2004) give a discussion of difficulties involved in systematically evaluating cognitive architectures.

6.1.2 Symbolic AI and Knowledge Engineering for Robots

The application of symbolic planning techniques to robots requires a set of relations to be given to the robot. For example, during implementation of the system demonstrated in Chapter 5, although task-level models of manipulation actions necessary for rescue missions were learned from experimentation, a large vocabulary of relations and their definitions as spatial constraints had to be given to the robot as background knowledge. The construction of such models by hand is generally a very time-consuming process, and limits the extent to which robots using symbolic representations are actually deployed. An alternative is to learn the complete set of relations and their definitions in underlying state variables from interaction with a human teacher. Recent work by Kulick et al. (2013) and Misra et al. (2014) has begun to show that this is a promising avenue to fully enable the use of AI planning in autonomous robotics, and to begin a more natural human-robot dialogue through the use of shared symbols.

6.2 Future Work

The learning systems demonstrated in MALA could be extended in many ways. This section outlines several particularly promising directions for extension.

6.2.1 Richer Control-level Representation

The work in this thesis that relied on using Dynamic Motion Primitives (DMPs) for learning from demonstration (Section 5.3.2) was based on the initial formulation of DMPs by Ijspeert et al. (2002). This formulation requires several assumptions about the nature of the movement being modelled, which restricts the types of

actions that are learnable from demonstration. First, as described in Section 5.3.2, an n-dimensional demonstration trajectory is used to train n separate DMPs. In the present case, there are three spatial dimensions (x, y, z), four rotational values specifying a rotation quaternion (rx, ry, rz, rw), and g, the state of the robot's gripper. Multi-dimensional trajectories for robot motion are generated by integrating each DMP independently, and then combining the results. However, the components of a quaternion are not independent. Therefore, unlike the linear spatial dimensions, angular trajectories cannot be constructed by combining the 1-D trajectories, so the orientation of the end effector remains fixed during action execution. To relax this assumption, the current framework could be replaced by an extended formulation of DMPs, such as that proposed by Ude et al. (2014), capable of improved representation of orientation. This would enable the learning of manipulation behaviours that depend on rotation of a robot's hand, such as operating circular handles, driving a vehicle using a conventional steering wheel, or using doorknobs instead of handles.

6.2.2 Extended Hybrid Learning

A complete hybrid learning architecture such as that introduced in Section 2.7.1 implements learning at several levels of abstraction and, rather than learning at each level occurring in isolation, supports a high degree of interaction between the levels. As discussed in Sections 1.3.4 and 2.7.1, higher levels of abstraction provide constraints on the search spaces of lower-level learners, while low-level learners in turn influence higher levels based on their interaction with the environment. The hybrid architecture demonstrated in Chapter 5 demonstrated some degree of interaction between task- and control-level learning; however, two important capabilities are not yet implemented. These missing capabilities, namely refinement of control-level knowledge and the use of task-level knowledge to constrain the low-level search space, are depicted in Figure 6.1, reproduced from Chapter 2. The following two sections describe how these added capabilities could be integrated within MALA.


Figure 6.1: Hybrid learning capabilities not yet implemented in MALA. In Chapter 5, task-level knowledge is acquired by ‘explaining the effects’ of the trainer’s demonstration, and control knowledge is acquired through Learning from Demonstration (LfD) via DMPs. Task-level knowledge is then refined via active learning. However, the control-level learning is not constrained by task-level knowledge, nor is control-level knowledge further refined from experience.

Continuous Refinement of Control-level Knowledge

Representation of the trainer’s actions using DMPs enables the robot to reproduce and generalise demonstrated movements. However, the performance of the learned movements is limited by the quality of the initial demonstration, and further, cannot adapt to changes in the robot or in the environment, without an additional control- level learning mechanism. To provide this functionality, the learned DMP can be used as a seed hypothesis for a learning algorithm that attempts to continually refine the robot’s execution of the task by optimising the parameters of the DMP. Such a learner must use feedback from the environment to determine how to improve the movement. This setting is ideal for reinforcement learning, where

the robot iteratively refines a policy, in this case the parameters of the DMP, to maximise a reward function. Many reinforcement learning algorithms could be applied to achieve this end. Policy Improvement with Path Integrals (PI2) (Theodorou et al., 2010) has proven particularly successful in such settings (Stulp and Schaal, 2011). PI2 is a probabilistic reinforcement learning method that directly learns the parameters of a parameterised policy.

Task-level Knowledge Constrains Control-level Learning

Because robots typically have many degrees of freedom, learning control policies for robots often involves searching through a large space of possible policies. Although the problem is alleviated to some degree by representing motion using DMPs, which restrict the search to smooth, mechanically feasible motions, the difficulty of convergence remains a problem. To address this complexity, it is desirable to use task-level knowledge to constrain the size of the search space to the largest extent possible. One method to achieve this is to use a symbolic planner to construct an approximate solution to the control task, and a low-level learner to refine the qualitative plan into an operational policy. This follows the work of Ryan (2002a) and later Sammut and Yik (2010), that demonstrated that hybrid representations enable reinforcement learning to perform better in very large search domains. Addition of this capability would enable a MALA system to learn robot actions for which a demonstration is impossible, such as multi-legged gaits.

6.2.3 Learning From Unstructured Demonstrations

The learning system in Chapter 5 showed that an initial relational action model could be inferred from a trainer’s demonstration. It was assumed that the trainer’s demonstration consisted only of the target action executed in isolation. Less restricted demonstrations require the learner to segment discrete actions from longer, more unstructured, traces of trainer behaviour.


Niekum (2013) demonstrated that this segmentation can be achieved using statistical analysis of the time series data produced by trainer actions. A non-parametric Bayesian approach called the Beta-Process Auto-Regressive Hidden Markov Model (BP-AR-HMM) (Fox et al., 2008; Fox, 2009) was used to infer the number of actions in each unstructured demonstration, and also to recognise actions performed multiple times or from previous demonstrations. Integration with the BP-AR-HMM would enable the robot learning system presented here to learn from less structured demonstrations.

6.2.4 Improving Active Learning

Chapter 5 demonstrated that relational models of manipulation actions could be successfully generalised using active experimentation. Results showed that random selection of hypotheses to test from the learner’s version space produced a better action model than sampling near the most-specific boundary as Brown (2009) did (Section 5.6.3). However, the random experiment generation strategy invariably took longer to converge to a successful action model, since it was more likely to generate experiments that failed. A compromise between these strategies would be to select experiments from the version space according to a criterion that maximises the information gained from each experiment, as is the case for many active learners (Settles, 2012). Mitchell (1982; 1997) points out that an optimal strategy for a version space learner is to select experiments that satisfy exactly half of the hypotheses in the current version space. If this is possible, the correct target concept (if one exists) can be found within log2 |V| trials, where |V| is the number of hypotheses in the version space V. Alternatively, an experiment could be chosen by maximising expected information gain, following Sushkov and Sammut (2012). Either way, the use of an optimality criterion to select training examples for the active learner would be expected to reduce the number of trials needed to induce a robust action model, and this represents a promising area for extension of the system in Chapter 5.


6.2.5 Real Robot Implementation

Although simulation provides a useful test of MALA and its learning capabilities, MALA is an architecture for robots, and the best test of its performance would be on a physical robot platform. While many of the component algorithms used in the robot control systems described in Chapters 3-5 are identical to those used on real robots, additional work would be required to implement MALA on a real robot. The largest obstacle would be adding the perception components needed to determine the poses and internal structure of objects. This challenge is discussed further in the next section.

Perception of Object Structure for Tool Use

The hybrid learning system presented in Chapter 5 was evaluated in a manipulation and tool-use domain loosely based on the tasks set out in the Defense Advanced Research Projects Agency (DARPA) Robotics Challenge (Section 2.8.1). The experiments made two strong assumptions that could be relaxed. First, it was assumed that the global poses, and internal structure, of all objects could be obtained directly from the simulator. To implement detection of object poses, standard methods for object detection, such as detection of clusters in depth camera point clouds (Section 4.3.2), could be used. However, these methods cannot detect the internal structure of objects. If the robot is to use objects as tools, this information is essential. To infer the structure of an object, the robot must construct a model of the relationships between the object's parts. One method to achieve this is to approximate the shape of perceived objects using a set of primitive shapes, and then represent internal structure using the relationships between the primitives. Huebner et al. (2008) describe a method to approximate the shape of complex objects using a hierarchy of minimum volume bounding boxes (MVBBs). This method could be applied27 to the perception of the internal structure of tool objects. The hierarchy of MVBBs could be used to construct a relational description of the object's structure, as sketched below. This would provide an important step towards enabling the tool-use learning architecture presented here to be implemented on a real robot.

27The BADGr software toolkit for MVBB shape approximation is open source (Huebner, 2012).
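
The sketch below illustrates, under assumed data structures, how such a hierarchy could be flattened into relational facts. The Box class is a hypothetical stand-in for the output of a decomposition tool such as BADGr, and the generated predicates merely mimic the style of the relational representations used in Chapter 5.

```java
import java.util.ArrayList;
import java.util.List;

public class BoxHierarchyToRelations {

    static class Box {
        String name;            // e.g. "hammer_head", "hammer_handle"
        double dx, dy, dz;      // box dimensions in metres
        List<Box> children = new ArrayList<>();
    }

    public static List<String> describe(Box root) {
        List<String> facts = new ArrayList<>();
        collect(root, facts);
        return facts;
    }

    private static void collect(Box box, List<String> facts) {
        facts.add(String.format("box(%s, %.3f, %.3f, %.3f).",
                                box.name, box.dx, box.dy, box.dz));
        for (Box child : box.children) {
            // Parent-child adjacency in the MVBB hierarchy becomes an
            // attachment relation between object parts.
            facts.add(String.format("attached(%s, %s).",
                                    box.name, child.name));
            collect(child, facts);
        }
    }
}
```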

6.2.6 Robot Competitions

One possible avenue to address the challenges of evaluating integrated systems is via organised competitions involving different aspects of autonomous robotics. Two such competitions already mentioned are RoboCup Rescue and the DARPA Robotics Challenge (2014). Both of these contests focus on the ability of robots to autonomously navigate through disaster scenarios. Other such competitions include RoboCup@Home28, which requires robots to negotiate normal indoor environments such as offices and homes and perform daily tasks, and the RoCKIn contest29, which offers a similar ‘@Home’ challenge and an ‘@Work’ contest focussed on the capabilities of industrial robots. Such competitions typically provide principled scoring metrics that enable the relative strengths and weaknesses of various robot architectures to be assessed. As such, they represent possibly the strongest currently available method for assessing integrated robot systems. Entering a robot controlled using the MALA framework in any such contest (particularly RoboCup Rescue) would provide a rigorous assessment of the claims presented in this thesis.

6.2.7 Control of Multiple Robots

The distributed nature of MALA makes it possible that the architecture could be generalised to provide a control system for a team of robots, rather than restricting all agents to be embedded within a single robot. To some extent, the components to achieve this are already in place: each robot would run a system like the one described in this thesis, with its own blackboard.

28http://www.robocupathome.org/
29http://rockinrobotchallenge.eu/


A centralised multi-robot MALA system could be implemented by providing an additional, global set of agents to generate goals for the entire ensemble of robots, and to make plans to achieve these goals by allocating tasks to each robot. Each goal could then be posted to the appropriate robot's internal blackboard. Alternatively, a more distributed, and potentially more robust, approach would remove the need for centralised goal generation by enabling robots to post to each other's blackboards, allowing them to share and communicate about pertinent goals generated by their local observations. One possibility is that access to the blackboards of other robots could operate in a restricted manner, so that only an appropriate set of robots (perhaps those spatially near each other, engaged in related tasks, or possessing a particular capability) could post goals and/or percepts to each other's blackboards.
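
The sketch below illustrates the restricted-access variant. Blackboard, Goal and RobotProxy are hypothetical stand-ins rather than existing MALA classes; the relevance test is shown here as a simple distance-and-capability predicate.

```java
import java.util.List;
import java.util.function.Predicate;

public class SharedGoalPosting {

    interface Goal {}
    interface Blackboard {
        void post(Goal goal);                  // add a frame to the blackboard
    }
    interface RobotProxy {
        Blackboard blackboard();               // the remote robot's blackboard
        double distanceTo(RobotProxy other);
        boolean hasCapability(String capability);
    }

    public static void shareGoal(Goal goal, RobotProxy self,
                                 List<RobotProxy> team,
                                 Predicate<RobotProxy> mayReceive) {
        for (RobotProxy robot : team) {
            // Only post to robots that pass the relevance test.
            if (robot != self && mayReceive.test(robot)) {
                robot.blackboard().post(goal);
            }
        }
    }

    // Example restriction: only nearby robots that can manipulate objects.
    public static Predicate<RobotProxy> nearbyManipulators(RobotProxy self) {
        return r -> self.distanceTo(r) < 10.0 && r.hasCapability("manipulation");
    }
}
```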

6.2.8 Autonomous Tool Building

3D printers are devices that are able to manufacture almost any three-dimensional physical object, relatively quickly and cheaply. An interesting extension to the learning architecture presented here would be to translate the relational model of the tool required for a given problem into a 3D model of a specific tool, and then build the tool using a 3D printer. This would enable robots to learn models of a number of tool-use actions from demonstrations, possibly using a range of tools, and then to autonomously construct new tools to solve novel problems when appropriate tools are not available.

References

Abdo, N., H. Kretzschmar, L. Spinello, and C. Stachniss (2013). “Learning manip- ulation actions from a few demonstrations.” Robotics and Automation, IEEE International Conference on (ICRA 2013). IEEE, pp. 1268–1275 (cit. on p. 66). Albus, J. (1991). “Outline for a theory of intelligence.” IEEE Transactions on Systems, Man, and Cybernetics 21 (3), pp. 473–509 (cit. on pp. 39, 40). Albus, J. (1996). “A reference model architecture for design and implementation of intelligent control in large and complex systems.” International Journal of Intelligent Control and Systems 1 (1), pp. 15–30 (cit. on p. 40). Albus, J. (2005). “RCS: A cognitive architecture for intelligent multi-agent sys- tems.” Annual Reviews in Control 29 (1), pp. 87–99 (cit. on pp. 26, 39, 40). Albus, J. and A. M. Meystel (2001). Engineering of mind: An introduction to the science of intelligent systems. New York, NY: Wiley (cit. on p. 40). Aler, R., O. Garcia, and J. Valls (2005). “Correcting and improving imitation models of humans for robosoccer agents.” Evolutionary Computation 3 (2-5), pp. 2402–2409 (cit. on p. 55). Anderson, J. R., M. Albert, and J. Fincham (2005). “Tracing problem solving in real time: fMRI analysis of the subject-paced Tower of Hanoi.” Journal of Cognitive Neuroscience 17 (8), pp. 1261–1274 (cit. on p. 30). Anderson, J. R. and J. P. Borst (2014). “Using the ACT-R cognitive architecture in combination with fMRI data.” An Introduction to Model-Based Cognitive Neuroscience. Ed. by B. U. Forstmann and E.-J. Wagenmakers. New York, NY: Springer (cit. on p. 30).


Anderson, J. R., D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y. Qin (2004). “An Integrated Theory of the Mind.” Psychological Review 111 (4), pp. 1036–1060 (cit. on p. 34). Anderson, J. R. and J. Fincham (2013). “Discovering the sequential structure of thought.” Cognitive Science 37 (6), pp. 1–31 (cit. on p. 30). Anderson, J. R. (2007). How can the human mind occur in the physical universe? New York, NY: Oxford University Press (cit. on pp. 26, 29, 37, 75). Apt, K. R. and M. Wallace (2006). Constraint logic programming using Eclipse. Cambridge, UK: Cambridge University Press (cit. on pp. 93, 118). Argall, B., S. Chernova, and M. Veloso (2009). “A survey of robot learning from demonstration.” Robotics and Autonomous Systems 57 (5), pp. 469–483 (cit. on pp. 55, 56, 120). Arkin, R. C. (1998). Behavior-based robotics. Cambridge, MA: The MIT Press (cit. on p. 42). Audet, S. (2014). JavaCPP: The missing bridge between Java and native C++. URL: https://github.com/bytedeco/javacpp (cit. on p. 248). Baars, B. (1986). The cognitive revolution in psychology. New York, NY: The Guilford Press (cit. on p. 32). Baars, B. (2003). “How conscious experience and working memory interact.” Trends in Cognitive Science 7 (4), pp. 166–172 (cit. on p. 32). Balakirsky, S., S. Carpin, and A. Kleiner (2007). “Towards heterogeneous robot teams for disaster mitigation: Results and performance metrics from RoboCup Rescue.” Journal of Field Robotics 24 (11–12), pp. 943–967 (cit. on p. 72). Barrett, H. C. and R. Kurzban (2006). “Modularity in cognition: framing the debate.” Psychological Review 113 (3), pp. 628–647 (cit. on p.8). Barto, A. and S. Mahadevan (2003). “Recent advances in hierarchical reinforcement learning.” Discrete Event Dynamic Systems 13 (4), pp. 341–379 (cit. on p. 66). Barton, R. and P. Harvey (2000). “Mosaic evolution of brain structure in mammals.” Nature 405 (6790), pp. 1055–1058 (cit. on p.8).


Beaudoin, L. P. (1995). “Goal processing in autonomous agents.” PhD thesis. School of Computer Science, University of Birmingham (cit. on pp. 114, 115). Bekey, G. A. (2005). Autonomous robots: From biological inspiration to imple- mentation and control. Cambridge, MA: The MIT Press (cit. on p. 42). Benson, S. (1996). “Learning action models for reactive autonomous agents.” PhD thesis. Stanford Artificial Intelligence Laboratory (SAIL), Stanford University (cit. on p. 59). Boeing, A. and T. Bräunl (2007). “Evaluation of real-time physics simulation systems.” Proceedings of the 5th International Conference on Computer Graph- ics and Interactive Techniques in Australia and Southeast Asia (GRAPHITE ’07). Ed. by A. Rohl. Perth, Australia: Association for Computing Machinary (ACM), pp. 281–288 (cit. on p. 239). Bonasso, P. R., J. R. Firby, E. Gat, D. Kortenkamp, D. P. Miller, and M. G. Slack (1997). “Experiences with an architecture for intelligent, reactive agents.” Journal Of Experimental & Theoretical Artificial Intelligence 9 (2-3), pp. 237– 256 (cit. on pp. 43, 46). Bouganis, A. and M. Shanahan (2010). “Training a spiking neural network to con- trol a 4-DoF robotic arm based on spike timing-dependent plasticity.” Neural Networks (IJCNN), The 2010 International Joint Conference on. Barcelona, Spain: IEEE, pp. 1–8 (cit. on p. 32). Bradski, G. and A. Kaehler (2008). Learning OpenCV: with the OpenCV library. Sebastopol, CA: O’Reilly Media (cit. on pp. 91, 114, 249). Braga, A. d. P. (1994). “Attractor Properties of Recurrent Networks with Generalis- ing Boolean Nodes.” Proceedings of the International Conference on Artificial Neural Networks (ICANN ’94), Sorrento, Italy: Springer Verlag, pp. 421–424 (cit. on p. 32). Bratko, I. (2010). “Comparison of machine learning for autonomous robot discov- ery.” Advances in Machine Learning I. Ed. by J. Koronacki, Z. W. Ras, S. T. Wierzchon, and J. Kacprzyk. Vol. 262. Studies in Computational Intelligence. Heidelberg, Germany: Springer, pp. 441–456 (cit. on p. 60).


Bratko, I. (2001). Prolog programming for artificial intelligence. 3rd ed. Boston, MA: Addison-Wesley (cit. on p. 59). Brooks, R. (1986). “A robust layered control system for a mobile robot.” IEEE Journal on Robotics and Automation 2 (1), pp. 14–23 (cit. on pp. 11, 42). Brooks, R. (1991). “Intelligence without representation.” Artificial Intelligence 47, pp. 139–159 (cit. on pp. 42, 134). Brown, S. (2009). “A relational approach to tool-use learning in robots.” PhD thesis. School of Computer Science and Engineering, University of New South Wales (cit. on pp. xvii, 52, 60, 64, 117, 148, 168, 170, 175, 178, 195, 203, 250). Brown, S. and C. Sammut (2011). “Tool Use Learning in Robots.” Proceedings of the 2011 AAAI Fall Symposium Series on Advances in Cognitive Systems. Arlington, VA: AAAI Press (cit. on pp. 52, 93). Brown, S. and C. Sammut (2013). “A relational approach to tool-use learning in robots.” Inductive Logic Programming: 22nd International Conference (ILP 2012), Revised Selected Papers. Ed. by F. Riguzzi and F. Železný. Vol. 7842. Lecture Notes in Computer Science. Dubrovnik, Croatia: Springer, pp. 1–15 (cit. on pp. 64, 153). Buss, S. (2009). Introduction to inverse kinematics with jacobian transpose, pseu- doinverse and damped least squares methods. Tech. rep. Department of Math- emetics, University of California San Diego (cit. on p. 252). Cakmak, M., C. Chao, and A. L. Thomaz (2010). “Designing interactions for robot active learners.” Autonomous Mental Development, IEEE Transactions on 2 (2), pp. 108–118 (cit. on p. 50). Calinon, S., F. Guenter, and A. Billard (2007). “On learning, representing, and generalizing a task in a humanoid robot.” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 37 (2), pp. 286–298 (cit. on p. 55). Callebaut, W. and D. Rasskin-Gutman (2005). Modularity: Understanding the development and evolution of natural complex systems. The Vienna Series in Theoretical Biology. Cambridge, MA: The MIT Press (cit. on p. 7).


Cambon, S., R. Alami, and F. Gravot (2009). “A hybrid approach to intricate motion, manipulation and task planning.” The International Journal of Robotics Research 28, pp. 104–126 (cit. on p. 67). Carbonell, J. G. and Y. Gil (1996). “Learning by experimentation: The operator refinement method.” Machine learning: An artificial intelligence approach 3, pp. 191–213 (cit. on p. 60). Carbonell, J., O. Etzioni, Y. Gil, R. Joseph, C. Knoblock, S. Minton, and M. Veloso (1991). “PRODIGY: an integrated architecture for planning and learning.” ACM SIGART Bulletin 2 (4), pp. 51–55 (cit. on p. 60). Casper, J. and R. R. Murphy (2003). “Human-robot interactions during the robot- assisted urban search and rescue response at the world trade center.” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 33 (3), pp. 367–385 (cit. on pp. 69, 71). Cassimatis, N. L. (2001). “Polyscheme: a cognitive architecture for intergrating multiple representation and inference schemes.” PhD thesis. Department of Electrical Engineering and Computer Science (EECS), Massachusetts Institute of Technology (cit. on p. 39). Cassimatis, N. L., P. Bignoli, M. Bugajska, S. Dugas, U. Kurup, M. A., and P. Bello (2010). “An architecture for adaptive algorithmic hybrids.” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 40 (3), pp. 903–914 (cit. on p. 10). Cassimatis, N. L., J. Trafton, M. Bugajska, and A. Schultz (2004). “Integrating cognition, perception and action through mental simulation in robots.” Robotics and Autonomous Systems 49 (1-2), pp. 13–23 (cit. on p. 33). Chambers, R. A. and D. Michie (1969). “Man-machine co-operation on a learning task.” Computer Graphics: Techniques and Applications. Ed. by R. Parslow, R. Prowse, and E.-G. R. London, UK: Plenum (cit. on p. 55). Chernova, S. and A. L. Thomaz (2014). Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning. San Rafael, CA: Morgan & Claypool (cit. on p. 56).


Christensen, H. I., G.-J. M. Kruijff, and J. Wyatt, eds. (2010). Cognitive systems. Vol. 8. Cognitive Systems Monographs. Heidelberg, Germany: Springer-Verlag (cit. on p. 197). Cohn, D., L. Atlas, and R. Ladner (1994). “Improving generalization with active learning.” Machine Learning 15, pp. 201–221 (cit. on p. 49). Corke, P. (2011). Robotics, vision and control: Fundamental algorithms in MAT- LAB. Vol. 73. Springer Tracts in Advanced Robotics. Heidelberg, Germany: Springer (cit. on p. 252). Coumans, E. (2010). Bullet User Manual (cit. on pp. 179, 239). Craig, I. (1988). “Blackboard systems.” Artificial Intelligence Review 2 (2), pp. 103– 118 (cit. on p. 35). Daniel, C., M. Viering, J. Metz, O. Kroemer, and J. Peters (2014). “Active reward learning.” Robotics: Science and Systems X. Ed. by L. E. K. Dieter Fox and H. Kurniawati. Berkeley, USA (cit. on p. 51). DARPA (2013). DARPA Robotics Challenge trials task description. URL: http: //archive.darpa.mil/roboticschallengetrialsarchive/sites/default/files/DRC%20Trials% 20Task%20Description%20Release%2011%20DISTAR%2022197.pdf (cit. on p. 179). DARPA (2014). DARPA Robotics Challenge (DRC). URL: http://www.darpa.mil/ Our_Work/TTO/Programs/DARPA_Robotics_Challenge.aspx (cit. on pp. 70, 154, 179, 205). Davids, A. (2002). “Urban search and rescue robots: From tragedy to technology.” IEEE Intelligent Systems 17 (2), pp. 81–83 (cit. on p. 69). De Raedt, L. (2011). “Inductive Logic Programming.” Encyclopedia of Machine Learning. Ed. by C. Sammut and G. I. Webb. Heidelberg, Germany: Springer (cit. on p. 59). Diankov, R. (2008). Openrave: A planning architecture for autonomous robotics. Tech. rep. CMU-RI-TR-08-34. Robotics Institute, Carnegie Mellon University (cit. on p. 251).


Diankov, R. (2010). “Automated construction of robotic manipulation programs.” PhD thesis. Robotics Institute, Carnegie Mellon University (cit. on p. 162). Dietterich, T. G. (2000). “Hierarchical reinforcement learning with the MAXQ value function decomposition.” Journal of Artificial Intelligence Research 13, pp. 227–303 (cit. on p. 65). Dufay, B. and J. C. Latombe (1984). “An approach to automatic robot programming based on inductive learning.” The International Journal of Robotics Research 3 (4), pp. 3–20 (cit. on p. 55). Duvallet, F., T. Kollar, and A. Stentz (2013). “Imitation learning for natural lan- guage direction following through unknown environments.” Robotics and Au- tomation, IEEE International Conference on (ICRA 2013). IEEE, pp. 1047– 1053 (cit. on p. 55). Džeroski, S., S. Muggleton, and S. Russell (1992). “PAC learnability of determinate logic programs.” Proceedings of the Fifth ACM Workshop on Computational Learning Theory, pp. 128–135 (cit. on p. 59). Erman, L., F. Hayes-Roth, and V.Lesser (1980). “The Hearsay-II speech-understanding system: Integrating knowledge to resolve uncertainty.” ACM Computing Sur- veys 12, pp. 213–253 (cit. on p. 35). Fairfield, N. and C. Urmson (2011). “Traffic light mapping and detection.” Robotics and Automation, IEEE International Conference on (ICRA 2011). Shanghai, China: IEEE, pp. 5421–5426 (cit. on p. 5). Farid, R. and C. Sammut (2012). “A relational approach to plane-based object categorization.” Proceedings of the Robotics Science and Systems (RSS-VIII) Workshop on RGB-D Cameras. Sydney, Australia (cit. on p. 92). Ferguson, I. A. (1992). TouringMachines: An architecture for dynamic, rational, mobile agents. Tech. rep. 273. University of Cambridge (cit. on p. 43). Ferrein, A. and G. Lakemeyer (2008). “Logic-based robot control in highly dynamic domains.” Robotics and Autonomous Systems 56 (11), pp. 980–991 (cit. on p. 33).


Fikes, R. E., P. E. Hart, and N. J. Nilsson (1972). “Learning and executing general- ized robot plans.” Artificial Intelligence 3, pp. 251–288 (cit. on p. 42). Fikes, R. and N. Nilsson (1971). “STRIPS: A new approach to the application of theorem proving to problem solving.” Artificial Intelligence 2, pp. 189–208 (cit. on p. 41). Fincham, J., C. Carter, V. van Veen, V. A. Stenger, and J. R. Anderson (2002). “Neural mechanisms of planning: A computational analysis using event-related fMRI.” Proceedings of the National Academy of Sciences 5 (99), pp. 3346–51 (cit. on p. 30). Fischler, M. A. and R. C. Bolles (1981). “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartogra- phy.” Communications of the ACM 24 (6), pp. 381–395 (cit. on p. 113). Fox, E. B., E. B. Sudderth, M. I. Jordan, and A. S. Willsky (2008). “Nonparametric Bayesian learning of switching linear dynamical systems.” Advances in Neural Information Processing Systems (NIPS), pp. 457–464 (cit. on p. 203). Fox, E. B. (2009). “Bayesian nonparametric learning of complex dynamical phe- nomena.” PhD thesis. Cambridge, MA: Department of Electrical Engineering and Computer Science (EECS), Massachusetts Institute of Technology (cit. on pp. 57, 203). Fox, M. and D. Long (2003). “PDDL2.1 : An extension to PDDL for expressing temporal planning domains.” Journal of Artificial Intelligence Research 20, pp. 61–124 (cit. on p. 42). Frenken, K. and L. Marengo (1999). “Interdependencies, nearly-decomposability and adaptation.” Computational Technqiues for Modelling Learning in Eco- nomics. London, UK: Kluwer Academics, pp. 146–165 (cit. on p. 7). Gat, E. (1997). “On three-layer architectures.” Artificial Intelligence and Mobile Robots. Artificial intelligence and mobile robots (cit. on pp. 39, 43). Gil, Y. (1993). “Learning new planning operators by exploration and experimenta- tion.” Proceedings of the AAAI Workshop on Learning Action Models. Wash- ington, DC, pp. 1–4 (cit. on p. 59).


Graham, J., K. Decker, and M. Mersic (2003). “Decaf: A flexible multi agent system architecture.” Autonomous Agents and Multi-Agent Systems 7 (1), pp. 7– 27 (cit. on p. 33). Gratch, J. and S. Marsella (2001). “Tears and fears: Modeling emotions and emo- tional behaviors in synthetic agents.” Proceedings of the 5th International Conference on Autonomous Agents. Montreal, Canada: Association for Com- puting Machinery (ACM), pp. 278–285 (cit. on p. 29). Gratch, J. and S. Marsella (2009). “Modeling the cognitive antecedents and con- sequences of emotion.” Cognitive Systems Research 10 (1), pp. 1–5 (cit. on p. 29). Gray, W. D., ed. (2007). Integrated models of cognitive systems. Series on Cognitive Models and Architectures. New York, NY: Oxford University Press (cit. on p. 34). Guitton, J. and J. Farges (2008). “Geometric and symbolic reasoning for mobile robotics.” Proceedings of the 3rd National Conference on Control Architectures of Robots. Bourges, France (cit. on p. 67). Guizzo, E. (2014). “Rescue-robot show-down.” IEEE Spectrum 51 (1), pp. 52–55 (cit. on p. 69). Haber, A. L., M. McGill, and C. Sammut (2012). “jmeSim: An open source, multi- platform .” Proceedings of the Australasian Conference on Robotics and Automation (ACRA 2012). Wellington, New Zealand (cit. on pp. 22, 87–90, 243). Haber, A. L. and C. Sammut (2013). “A cognitive architecture for autonomous robots.” Advances in Cognitive Systems 2, pp. 257–275 (cit. on pp. 15, 16, 82). Haigh, K. Z. and M. M. Veloso (1996). “Using Perception Information for Robot Planning and Execution.” Proceedings of 96 AAAI Workshop: Intelligent Adap- tive Agents. Portland, OR (cit. on p. 60). Hall, M., E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten (2009). “The WEKA data mining software: an update.” ACM SIGKDD Explorations Newsletter 11 (1), pp. 10–18 (cit. on pp. 238, 250).


Hanford, S. D., O. Janrathitikarn, and L. N. Long (2009). “Control of mobile robots using the Soar cognitive architecture.” Journal Of Aerospace Computing Information And Communication 6 (2), pp. 69–91 (cit. on p. 29). Hanheide, M., N. Hawes, J. Wyatt, M. Göbelbecker, M. Brenner, K. Sjöö, A. Aydemir, P. Jensfelt, H. Zender, and G. J. Kruijff (2010). “A framework for goal generation and management.” Proceedings of the AAAI Workshop on Goal-Directed Autonomy. Atlanta, GA: AAAI Press (cit. on p. 114). Hart, P. E., N. J. Nilsson, and B. Raphael (1968). “A formal basis for the heuristic determination of minimum cost paths.” IEEE Transactions on Systems Science and Cybernetics SSC4 4 (2), pp. 100–107 (cit. on p. 119). Hawes, N. and J. Wyatt (2006). An architecture schema for embodied cognitive systems. Tech. rep. CSR-06-12. School of Computer Science, University of Birmingham (cit. on p. 81). Hawes, N. and J. Wyatt (2010). “Engineering intelligent information-processing systems with CAST.” Advanced Engineering Informatics 24 (1), pp. 27–39 (cit. on pp. 44, 81). Hawes, N., M. Brenner, and K. Sjöö (2009). “Planning as an architectural control mechanism.” Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction. La Jolla, CA, pp. 229–230 (cit. on p. 44). Hawes, N., M. Hanheide, K. Sjöö, A. Aydemir, P. Jensfelt, M. Göbelbecker, M. Brenner, H. Zender, P. Lison, and I. Kruijff-Korbayová (2010). “Dora the explorer: A motivated robot.” Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems. Toronto, Canada, pp. 1617– 1618 (cit. on p. 6). Hawes, N., M. Klenk, K. Lockwood, J. D. Kelleher, and G. S. Horn (2012). “Towards a Cognitive System That Can Recognize Spatial Regions Based on Context.” Proceedings of the Twenty-Sixth National Conference on Artificial Intelligence (AAAI 2012). Toronto, Canada: AAAI Press, pp. 200–206 (cit. on p. 47).


Hayes-Roth, B. (1985). “A blackboard architecture for control.” Artificial Intelli- gence 26 (3), pp. 251–321 (cit. on p. 35). Hermans, T., J. M. Rehg, and A. F. Bobick (2013). “Decoupling behavior, per- ception, and control for autonomous learning of affordances.” Robotics and Automation, IEEE International Conference on (ICRA 2013). IEEE, pp. 4989– 4996 (cit. on p. 53). Hoffmann, J. and B. Nebel (2001). “The FF planning system: Fast plan generation through heuristic search.” Journal of Artificial Intelligence Research (14), pp. 253–302 (cit. on pp. 93, 117, 166, 251). Horn, A. (1951). “On Sentences Which Are True of Direct Unions of Algebras.” Journal of Symbolic Logic 16 (1), pp. 14–21 (cit. on p. 153). Huebner, K., S. Ruthotto, and D. Kragic (2008). “Minimum volume bounding box decomposition for shape approximation in robot grasping.” Robotics and Automation, IEEE International Conference on (ICRA 2008). IEEE, pp. 1628– 1633 (cit. on p. 204). Huebner, K. (2012). “BADGr–A toolbox for box-based approximation, decompo- sition and grasping.” Robotics and Autonomous Systems 60 (3), pp. 367–376 (cit. on p. 204). Huhns, M. N., U. Mukhopadhyay, L. M. Stephens, and R. D. Bonnell (1987). “DAI for Document Retrieval: The MINDS Project.” Ed. by M. N. Huhns. London, England: Pittman. Chap. 9 (cit. on p. 37). Ijspeert, A. J., J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal (2013). “Dy- namical movement primitives: learning attractor models for motor behaviors.” Neural Computation 25 (2), pp. 328–373 (cit. on p. 151). Ijspeert, A. J., J. Nakanishi, and S. Schaal (2002). “Learning attractor landscapes for learning motor primitives.” Advances in Neural Information Processing Systems (NIPS ’02) 15, pp. 1547–1554 (cit. on pp. 56, 147, 151, 152, 199). Ivaldi, S., S. M. Nguyen, N. Lyubova, A. Droniou, V. Padois, D. Filliat, P.-Y. Oudeyer, and O. Sigaud (2014). “Object Learning Through Active Exploration.”


Autonomous Mental Development, IEEE Transactions on 6 (1), pp. 56–72 (cit. on pp. 6, 44, 46, 47). Iwaniuk, A. and K. Dean (2004). “A mosaic pattern characterizes the evolution of the avian brain.” Proceedings of The Royal Society B: Biological Sciences 271 (Suppl 4), S148–S151 (cit. on p. 8). JACO: Lightweight robot arm by Kinova Robotics (2013). URL: http://kinovarobotics. com/products/jaco-research-edition/ (cit. on p. 180). Jarvis, E., O. Güntürkün, L. Bruce, and A. Csillag (2005). “Avian brains and a new understanding of vertebrate brain evolution.” Nature Reviews Neuroscience 6 (2), pp. 151–159 (cit. on p. 8). Jennings, N. and K. Sycara (1998). “A roadmap of agent research and develop- ment.” International Journal of Autonomous Agents and Multi-Agent Systems 1 (1), pp. 7–38 (cit. on p. 35). Jilk, D. J., C. Lebiere, R. C. O’Reilly, and J. R. Anderson (2008). “SAL: an explic- itly pluralistic cognitive architecture.” Journal Of Experimental & Theoretical Artificial Intelligence 20 (3), pp. 197–218 (cit. on p. 33). Jones, R. M., R. E. Wray III, and M. van Lent (2012). “Practical Evaluation of Integrated Cognitive Systems.” Advances in Cognitive Systems 1 (1), pp. 83–92 (cit. on p. 198). Jones, R., J. E. Laird, P. Nielsen, K. Coulter, P. Kenny, and F. Koss (1999). “Au- tomated intelligent pilots for combat flight simulation.” AI Magazine 20 (1), p. 27 (cit. on p. 29). Kadous, M. W. and C. Sammut (2004a). “InCA: A Mobile Conversational Agent.” Proceedings of the Eighth Pacific Rim International Conference on Artificial Intelligence. Ed. by C. Zhang, H. W. Guesgen, and W. K. Yeap. Auckland, New Zealand: Springer, pp. 644–653 (cit. on p. 79). Kadous, M. W., C. Sammut, and R. K.-M. Sheh (2006). “Autonomous Traversal of Rough Terrain Using Behavioural Cloning.” Autonomous Robots and Agents (ICARA ’06), 3rd International Conference on. Ed. by S. Mukhopadhyay and


G. S. Gupta. Palmerston North, New Zealand: IEEE, pp. 219–225 (cit. on p. 123). Kadous, M. and C. Sammut (2004b). “MICA: Pervasive middleware for learning, sharing and talking.” Pervasive Computing and Communications Workshops, 2004. Proceedings of the Second IEEE Annual Conference on. Orlando, Florida: IEEE, pp. 176–180 (cit. on pp. 37, 78, 81). Kaelbling, L. P. and T. Lozano-Pérez (2011). “Hierarchical task and motion plan- ning in the now.” Robotics and Automation, IEEE International Conference on (ICRA 2011). Shanghai, China: IEEE, pp. 1470–1477 (cit. on p. 66). Kaelbling, L. P. and T. Lozano-Pérez (2013). “Integrated task and motion planning in belief space.” The International Journal of Robotics Research 32 (9-10), pp. 1194–1227 (cit. on pp. 5, 67). Kalakrishnan, M., L. Righetti, and P. Pastor (2011). “Learning force control policies for compliant manipulation.” Intelligent Robots and Systems (IROS 2011), IEEE/RSJ International Conference on. San Francisco, CA: IEEE, pp. 4639– 4644 (cit. on p. 57). Kalman, R. E. (1960). “A New Approach to Linear Filtering and Prediction Prob- lems.” Journal of Fluids Engineering 82 (1), pp. 35–45 (cit. on p. 113). Kamei, K., S. Nishio, N. Hagita, and M. Sato (2012). “Cloud networked robotics.” Network (cit. on p. 21). Kandel, E. R., J. H. Schwartz, T. M. Jessell, S. A. Siegelbaum, and A. J. Hudspeth (2013). Principles of Neural Science. 5th ed. New York, NY: McGraw-Hill (cit. on pp. 17, 32, 34). King, R., K. E. Whelan, F. M. Jones, P. G. K. Reiser, C. H. Bryant, S. Muggleton, D. B. Kell, and S. G. Oliver (2004). “Functional genomic hypothesis generation and experimentation by a robot scientist.” Nature 427 (6971), pp. 247–252 (cit. on p. 52). Kirchner, N., A. Alempijevic, S. Caraian, R. Fitch, D. Hordern, G. Hu, G. Paul, D. Richards, S. P. N. Singh, and S. Webb (2010). “RobotAssist - a platform for human robot interaction research.” Proceedings of the Australasian Conference


on Robotics and Automation (ACRA 2010). Brisbane, Australia, pp. 1–10 (cit. on p. 78). Klahr, D., P. Langley, and R. Neches (1987). Production system models of learning and development. Cambridge, MA: The MIT Press (cit. on p. 28). Kober, J., K. Mulling, O. Krömer, C. H. Lampert, B. Scholkopf, and J. Peters (2010). “Movement templates for learning of hitting and batting.” Robotics and Automation, IEEE International Conference on (ICRA 2010). Anchorage, AL: IEEE, pp. 853–858 (cit. on p. 57). Kober, J., J. A. Bagnell, and J. Peters (2013). “Reinforcement learning in robotics: A survey.” The International Journal of Robotics Research 32 (11), pp. 1238–1274 (cit. on p. 50). Koenig, N. and A. Howard (2004). “Design and use paradigms for Gazebo, an open-source multi-robot simulator.” Intelligent Robots and Systems (IROS 2004), IEEE/RSJ International Conference on, pp. 2149–2154 (cit. on p. 21). Konolige, K. and K. Myers (1998). “The Saphira Architecture for Autonomous Mobile Robots.” Artificial Intelligence and Mobile Robots. Ed. by D. Kortenkamp, P. R. Bonasso, and R. R. Murphy. Cambridge, MA: MIT Press (cit. on p. 43). Kortenkamp, D. and R. Simmons (2008). “Robotic Systems Architectures and Programming.” Springer Handbook of Robotics. Ed. by B. Siciliano and O. Khatib. Heidelberg, Germany: Springer, pp. 187–204 (cit. on p. 197). Krause, E., P. Schermerhorn, and M. Scheutz (2012). “Crossing boundaries: Multi-level introspection in a complex robotic architecture for automatic performance improvements.” Proceedings of the Twenty-Sixth National Conference on Artificial Intelligence (AAAI 2012). Toronto, Canada: AAAI Press, pp. 214–220 (cit. on p. 47). Kroemer, O., R. Detry, J. Piater, and J. Peters (2009). “Active learning using mean shift optimization for robot grasping.” Intelligent Robots and Systems (IROS 2009), IEEE/RSJ International Conference on. IEEE, pp. 2610–2615 (cit. on p. 51).


Kuchcinski, K. and R. Szymanek (2014). JaCoP Library User’s Guide. URL: http://jacopguide.osolpro.com/guideJaCoP.html (cit. on pp. 158, 251). Kulick, J., M. Toussaint, T. Lang, and M. Lopes (2013). “Active learning for teaching a robot grounded relational symbols.” Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI-13). Beijing, China, pp. 1451–1457 (cit. on pp. 52, 199). Laird, J. E., A. Newell, and P. S. Rosenbloom (1987). “Soar: An architecture for general intelligence.” Artificial Intelligence 33 (1), pp. 1–64 (cit. on p. 28). Laird, J. E. (2001). “It knows what you’re going to do: Adding anticipation to a Quakebot.” Proceedings of the 5th International Conference on Autonomous Agents. Montreal, Canada: Association for Computing Machinery (ACM), pp. 385–392 (cit. on p. 29). Laird, J. E. (2012). The Soar Cognitive Architecture. The MIT Press (cit. on pp. 28, 75). Lampe, A. (2006). “Performance measure for the evaluation of mobile robot au- tonomy.” Proceedings of the 2006 IEEE International Conference on Robotics and Automation. Orlando, FL: IEEE, pp. 4057–4062 (cit. on p. 198). Langley, P. (2012). “The Cognitive Systems Paradigm.” Advances in Cognitive Systems 1 (1), pp. 3–13 (cit. on pp. 27, 74, 105, 197). Langley, P. and D. Choi (2006). “A unified cognitive architecture for physical agents.” Proceedings of the Twentieth National Conference on Artificial Intel- ligence (AAAI 2006). Vol. 21. 2. Boston, MA: AAAI Press, pp. 1469–1474 (cit. on pp. 30, 33, 75). Langley, P., J. E. Laird, and S. Rogers (2009). “Cognitive architectures: Research issues and challenges.” Cognitive Systems Research 10 (2), pp. 141–160 (cit. on pp. 6, 26, 28, 33, 42, 197, 198). Langley, P. and E. Messina (2004). “Experimental studies of integrated cognitive systems.” Proceedings of the Performance Metrics for Intelligent Systems Workshop. Gaithersburg, MD (cit. on p. 199).


Leban, G., J. Žabkar, and I. Bratko (2008). “An experiment in robot discovery with ilp.” Inductive Logic Programming, 18th International Conference, ILP 2008. Ed. by F. Zelený and N. Lavrac. Prague, Czech Republic: Springer, pp. 77–90 (cit. on p. 60). Lesser, V. R. and D. G. Corkill (1983). “The Distributed Vehicle Monitoring Testbed: A Tool for Investigating Distributed Problem Solving Networks.” AI Magazine 4 (3), pp. 15–33 (cit. on p. 37). Lesser, V., R. Fennell, L. Erman, and D. Reddy (1975). “Organization of the Hearsay-II speech understanding system.” IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-23 (1), pp. 11–24 (cit. on p. 37). Lettvin, J., H. Maturana, W. McCulloch, and W. Pitts (1959). “What the frog’s eye tells the frog’s brain.” Proceedings of the Institute of Radio Engineers 47 (11), pp. 1940–1951 (cit. on p. 9). Liu, Y. and G. Nejat (2013). “Robotic Urban Search and Rescue: A Survey from the Control Perspective.” Journal Of Intelligent & Robotic Systems 72 (2), pp. 147–165 (cit. on pp. 5, 72). Lorenzo, D. and R. P. Otero (2000). “Learning to reason about actions.” Proceed- ings of the 14th European Conference on Artificial Intelligence (ECAI 2000). Amsterdam, The Netherlands.: IOS Press (cit. on p. 60). Lovett, M., L. Reder, and C. Lebiere (1999). “Modeling working memory in a unified architecture: An ACT-R perspective.” Models of working memory. Ed. by A. Miyake and P. Shah. Cambridge, UK: Cambridge University Press, pp. 135–182 (cit. on p. 30). Marr, D. (1982). Vision: A Computational Investigation into the Human Represen- tation and Processing of Visual Information. New York, NY: W. H. Freeman (cit. on pp. 3, 9). matlabcontrol: A Java API to interact with MATLAB (2014). URL: https://code. google.com/p/matlabcontrol/ (cit. on p. 248). McCorduck, P. (1979). Machines Who Think. San Francisco, CA: W.H. Freeman & Company (cit. on p. 1).


McDermott, D., M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins (1998). PDDL: The planning domain definition language. Tech. rep. CVC TR98003/DCS TR1165. Yale Center for Computational Vision and Control (cit. on pp. 41, 92, 238). McGann, C., F. Py, K. Rajan, H. Thomas, R. Henthorn, and R. McEwen (2008). “A deliberative architecture for AUV control.” Robotics and Automation, IEEE International Conference on (ICRA 2008). Pasadena, CA: IEEE, pp. 1049–1054 (cit. on pp. 44, 45, 47). McGee, K. and A. T. Abraham (2010). “Real-time team-mate AI in games: a definition, survey, & critique.” Proceedings of the 5th International Conference on Foundations of Digital Games (FDG ’10). Ed. by I. Horswill and Y. Pisan. Pacific Grove, CA, pp. 124–131 (cit. on p. 33). McGill, M., C. Sammut, J. Westendorp, and M. W. Kadous (2008). FrameScript: A Multi-modal Scripting Language. Tech. rep. School of Computer Science, University of New South Wales (cit. on pp. 79, 80). Meunier, D., R. Lambiotte, and A. Fornito (2009). “Hierarchical modularity in human brain functional networks.” Frontiers in Neuroinformatics 3 (37) (cit. on p. 8). Meyer-Delius, D., P. Pfaff, and G. D. Tipaldi, eds. (2012). Proceedings of the RSS Workshop on Long-term Operation of Autonomous Robotic Systems in Changing Environments. Sydney, Australia. URL: http://ais.informatik.uni- freiburg.de/longtermoperation_ws12rss/ (cit. on pp. 12, 76). Michalski, R. S. (1993). “Toward a unified theory of learning: Multistrategy task- adaptive learning.” Readings in knowledge acquisition and learning: Automat- ing the construction and improvement of Expert Systems. Ed. by B. Buchanan and D. Wikins. San Mateo, CA: Morgan Kaufmann (cit. on pp. 61, 62). Michie, D., M. Bain, and J. Hayes-Michie (1990). “Cognitive Models from Subcog- nitive Skills.” Knowledge-base systems in industrial control. Ed. by M. Grimble, S. McGhee, and P. Mowforth. Stevenage, UK, Institution Of Engineering and Technology (cit. on pp. 55, 120).


Michie, D. and R. A. Chambers (1968). “BOXES: An experiment in adaptive control.” Machine Intelligence 2. Ed. by E. Dale and D. Michie. Edinburgh, Scotland: Oliver & Boyd, pp. 137–152 (cit. on p. 55). Michini, B., M. Cutler, and J. P. How (2013). “Scalable reward learning from demonstration.” Robotics and Automation, IEEE International Conference on (ICRA 2013). Karlsruhe, Germany: IEEE, pp. 303–308 (cit. on p. 58). Micire, M. (2008). “Evolution and field performance of a rescue robot.” Journal of Field Robotics 25 (1), pp. 17–30 (cit. on p. 69). Milstein, A., M. McGill, T. Wiley, R. Salleh, and C. Sammut (2011). “A Method for Fast Encoder-Free Mapping in Unstructured Environments.” Journal of Field Robotics 28 (6), pp. 817–831 (cit. on pp. 88, 90, 156, 249). Minsky, M. (1974). A Framework for Representing Knowledge. Tech. rep. Cam- bridge, MA: Massachusetts Insitute of Technology Artificial Intelligence Labo- ratory (cit. on p. 80). Minsky, M. (1988). The society of mind. New York, NY: Simon & Schuster (cit. on pp. 8–10, 15, 34, 36, 38, 78). Minsky, M. (1991). “Logical versus analogical or symbolic versus connectionist or neat versus scruffy.” AI Magazine 12 (2), pp. 34–51 (cit. on p. 38). Minsky, M. (2006). The Emotion Machine: Commensense thinking, Artificial Intelligence, and the future of the human mind. New York, NY: Simon & Schuster (cit. on pp. 10, 11, 36). Misra, D. K., J. Sung, K. Lee, and A. Saxena (2014). “Tell me dave: Context- sensitive grounding of natural language to mobile manipulation instructions.” Robotics: Science and Systems X. Ed. by L. E. K. Dieter Fox and H. Kurniawati. Berkeley, USA (cit. on p. 199). Mitchell, T. M. (1982). “Generalization as search.” Artificial Intelligence 18 (2), pp. 203–226 (cit. on pp. 171, 187, 203, 238). Mitchell, T. (1997). Machine learning. New York, NY: McGraw-Hill (cit. on pp. 48, 59, 187, 203).


Mitchell, T., R. Banerji, and P. E. Utgoff (1983). “Learning by Experimentation: Acquiring and Refining Problem-Solving Heuristics.” Machine Learning: An Artificial Intelligence Approach. New York: Springer Verlag (cit. on p. 170). Mitchell, T., R. Keller, and S. T. Kedar-Cabelli (1986). “Explanation-based gener- alization: A unifying view.” Machine Learning 1 (1), pp. 47–80 (cit. on pp. 29, 153). Moon, F. C. (2007). The machines of Leonardo Da Vinci and Franz Reuleaux. Heidelberg, Germany: Springer (cit. on p. 2). Muggleton, S. (1991). “Inductive Logic Programming.” New Generation Comput- ing 8 (4), pp. 295–318 (cit. on p. 59). Muggleton, S. (1995). “Inverse Entailment and Progol.” New Generation Comput- ing 13 (3-4), pp. 245–286 (cit. on p. 60). Muggleton, S. and C. Feng (1992). “Efficient Induction of Logic Programs.” Inductive Logic Programming. Ed. by S. Muggleton. Academic Press, pp. 281– 298 (cit. on pp. xvii, 172, 174). Muggleton, S., J. Santos, and A. Tamaddoni-Nezhad (2010). “ProGolem: A System Based on Relative Minimal Generalisation.” Inductive Logic Programming. Berlin, Heidelberg: Springer, pp. 131–148 (cit. on pp. xvii, 155, 172, 177, 250). Müller, J. P. and M. Pischel (1994). “Modelling interacting agents in dynamic environments.” Proceedings of the Eleventh European Conference on Artificial Intelligence (ECAI-94). Amsterdam, The Netherlands, pp. 709–713 (cit. on p. 43). Murphy, R. (2014). “Disaster Robotics.” Cambridge, MA: MIT Press (cit. on p. 69). Nakanishi, J., R. Cory, M. Mistry, J. Peters, and S. Schaal (2008). “Operational space control: A theoretical and empirical comparison.” The International Journal of Robotics Research 27 (6), pp. 737–757 (cit. on pp. 238, 254). Nason, S. and J. E. Laird (2005). “Soar-RL: Integrating reinforcement learning with Soar.” Cognitive Systems Research 6 (1), pp. 51–59 (cit. on pp. 29, 30).


National Institute of Standards and Technology (NIST) (2010). Department of Homeland Security Urban Search and Rescue Robot Performance Standards. URL: http://www.nist.gov/el/isd/ks/ (cit. on p. 135). Newell, A. (1990). Unified Theories of Cognition. Cambridge, MA: Harvard Uni- versity Press (cit. on pp. 26, 28, 46). Newell, A., J. C. Shaw, and H. A. Simon (1959). “Report on a General Problem- solving Program.” Proceedings of the International Conference on Information Processing. Paris, France, pp. 256–264 (cit. on p. 28). Ng, A. Y., A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang (2004). “Autonomous inverted helicopter flight via reinforcement learning.” Experimental Robotics IX, The 9th International Symposium on Experimental Robotics (ISER 2004). Ed. by M. H. Ang JR. and O. Khatib. Singapore, pp. 363–372 (cit. on p. 55). Nguyen, S. M., S. Ivaldi, N. Lyubova, A. Droniou, D. Gerardeaux-Viret, D. Filliat, V. Padois, O. Sigaud, and P. Y. Oudeyer (2013). “Learning to recognize objects through curiosity-driven manipulation with the iCub humanoid robot.” Devel- opment and Learning and Epigenetic Robotics (ICDL), 2013 IEEE Third Joint International Conference on. Osaka, Japan: IEEE, pp. 1–8 (cit. on pp. 46, 47). Niekum, S., S. Chitta, B. Marthi, S. Osentoski, and A. G. Barto (2013). “Incremen- tal semantically grounded learning from demonstration.” Robotics: Science and Systems IX. Ed. by D. F. Paul Newman and D. Hsu. Vol. 9. Berlin, Germany (cit. on pp. 57, 148, 203). Niekum, S., S. Osentoski, G. Konidaris, and A. G. Barto (2012). “Learning and generalization of complex tasks from unstructured demonstrations.” Vilamoura, Portugal: IEEE, pp. 5239–5246 (cit. on pp. 57, 162, 168). Nilsson, N. (1994). “Teleo-reactive programs for agent control.” Journal of Artifi- cial Intelligence Research 1, pp. 139–158 (cit. on p. 31). Nilsson, N. (2001). “Teleo-reactive programs and the triple-tower architecture.” Electronic Transactions on Artificial Intelligence 5, pp. 99–110 (cit. on p. 83).


Nilsson, N. (1984). Shakey the robot. Tech. rep. 323. Artificial Intelligence Center, Computer Science and Technology Division, Stanford University (cit. on pp. 11, 41). Nino, G. L. (2014). Wikimedia Commons: MQ-9 Reaper CBP. URL: http://upload. wikimedia.org/wikipedia/commons/f/f5/MQ-9_Reaper_CBP.jpg (cit. on p.4). O’Reilly, R. C. (1998). “Six principles for biologically based computational models of cortical cognition.” Trends in Cognitive Science 2 (11), pp. 455–462 (cit. on p. 31). Pastor, P., L. Righetti, and M. Kalakrishnan (2011). “Online movement adaptation based on previous sensor experiences.” Intelligent Robots and Systems (IROS 2011), IEEE/RSJ International Conference on. San Francisco, CA: IEEE (cit. on p. 57). Pastor, P., S. Schaal, H. Hoffmann, and T. Asfour (2009). “Learning and generaliza- tion of motor skills by learning from demonstration.” Robotics and Automation, IEEE International Conference on (ICRA 2009). Kobe, Japan: IEEE, pp. 763– 768 (cit. on p. 57). Pasula, H. M., L. S. Zettlemoyer, and L. P. Kaelbling (2007). “Learning symbolic models of stochastic domains.” Journal of Artificial Intelligence Research 29 (1), pp. 309–352 (cit. on p. 60). Peters, J., R. Tedrake, N. Roy, and J. Morimoto (2011). “Robot learning.” En- cyclopedia of Machine Learning. Ed. by C. Sammut and G. I. Webb. Berlin, Germany: Springer (cit. on pp. 47, 54). Pigliucci, M. (2008). “Is evolvability evolvable?” Nature Reviews Genetics 9, pp. 75–82 (cit. on p. 8). Pinker, S. (1997). How the mind works. New York, NY: W. W. Norton & Company (cit. on p. 8). Plotkin, G. (1972). “Automatic methods of inductive inference.” PhD thesis. Edin- burgh University (cit. on p. 172).


Pommerleau, D. A. (1988). “ALVINN: An autonomous land vehicle in a neural net.” Advances in Neural Information Processing Systems (NIPS). Vol. 1. San Francisco, CA: Morgan Kaufmann, pp. 305–313 (cit. on p. 55). Quigley, M., K. Conley, and B. Gerkey (2009). “ROS: an open-source robot operating system.” Proceedings of the 2009 ICRA Workshop on Open Source Software. Kobe, Japan.: IEEE (cit. on pp. 21, 75, 81, 85). Quinlan, J. R. (1986). “Induction of Decision Trees.” Machine Learning 1 (1), pp. 81–106 (cit. on p. 51). Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann (cit. on p. 123). Reid, M. and M. Ryan (2000). “Using ILP to improve planning in Hierarchical Reinforcement Learning.” ILP ’00: Proceedings of the 10th International Conference on Inductive Logic Programming. London, UK: Springer-Verlag, pp. 174–190 (cit. on p. 60). Roberts, R. B. and I. P. Goldstein (1977). The FRL primer. Tech. rep. AI-Memo 408. Artificial Intelligence Laboratory, Massachusetts Insitute of Technology (cit. on pp. 78, 79). Rosheim, M. (1994). Robot evolution: The development of anthrobotics. New York, New York: John Wiley & Sons (cit. on p. 2). Ross, S., G. J. Gordon, and J. A. Bagnell (2011). “A reduction of imitation learning and structured prediction to no-regret online learning.” Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011). Fort Lauderdale, FL, USA, pp. 627–635 (cit. on p. 56). Russell, S. and P. Norvig (2010). Artificial Intelligence: A modern approach. 3rd ed. Upper Saddle River, NJ: Pearson Education (cit. on pp. 15, 26, 34, 36, 59). Rusu, R. B. and S. Cousins (2011a). “3D is here: Point cloud library (PCL).” Robotics and Automation, IEEE International Conference on (ICRA 2011). Shanghai, China: IEEE, pp. 1–4 (cit. on pp. 113, 249). Rusu, R. and S. Cousins (2011b). “3D is here: Point Cloud Library (PCL).” Pro- ceedings of the 2011 IEEE . . . (Cit. on p. 21).


Rusu, R. B. (2009). “Semantic 3D object maps for everyday manipulation in human living environments.” PhD thesis. Computer Science Department, Technische Universitaet Muenchen (cit. on p. 113). Ryan, M. (2002a). “Hierarchical Reinforcement Learning: A hybrid approach.” PhD thesis. School of Computer Science and Engineering, University of New South Wales (cit. on p. 202). Ryan, M. (2002b). “Using abstract models of behaviours to automatically generate reinforcement learning hierarchies.” In Proceedings of The 19th International Conference on Machine Learning (ICML ’02), pp. 522–529 (cit. on p. 64). Ryan, M. and M. Pendrith (1998). “RL-TOPs: an architecture for modularity and re-use in reinforcement learning.” Proc. 15th International Conf. on Machine Learning, pp. 481–487 (cit. on pp. 64, 65). Ryan, M. and M. Reid (2000). “Learning to fly: An application of Hierarchical Reinforcement Learning.” Proceedings of the 17th International Conference on Machine Learning (ICML ’00). Ed. by P. Langley. Stanford, CA: Morgan Kaufmann, pp. 807–814 (cit. on p. 65). Salganicoff, M., L. Ungar, and R. Bajcsy (1996). “Active learning for vision-based robot learning.” Machine Learning 23, pp. 251–278 (cit. on p. 51). Samet, H. and M. Tamminen (1988). “Efficient component labeling of images of arbitrary dimension represented by linear bintrees.” IEEE Transactions on Pattern Analysis and Machine Intelligence (TIEEE Trans. Pattern Anal. Mach. Intell.) 10 (4), pp. 579–586 (cit. on pp. 91, 114). Samling, D. (2014). The David Collection. Islamic art: Miniature painting. URL: http://www.davidmus.dk/en/collections/islamic/materials/miniatures/art/20-1988 (cit. on p. 2). Sammut, C. (2010). “Robot soccer.” WIREs Cognitive Science 1 (6), pp. 824–833 (cit. on p. 10). Sammut, C. (2012). “When Do Robots Have to Think?” Advances in Cognitive Systems 1 (1), 73–81 (cit. on pp.5, 10, 11, 75, 103).


Sammut, C., S. Hurst, D. Kedzier, and D. Michie (1992). “Learning to fly.” Pro- ceedings of the Ninth International Conference on Machine Learning (ICML ’92). Ed. by D. Sleeman and P. Edwards. Aberdeen, Scotland, UK: Morgan Kaufmann, pp. 385–393 (cit. on pp. 55, 120). Sammut, C. and T. F. Yik (2010). “Multistrategy learning for robot behaviours.” Ad- vances in Machine Learning I. Ed. by J. Koronacki, Z. W. Ras, S. T. Wierzchon, and J. Kacprzyk. Studies in Computational Intelligence. Heidelberg, Germany: Springer, pp. 457–476 (cit. on pp. 54, 61, 64, 202). Santos Costa, V., R. Rocha, and L. Damas (2012). “The YAP Prolog system.” Theory and Practice of Logic Programming 12 (Special Issue 1–2), pp. 5–34 (cit. on p. 248). Santos, J. (2010). “Efficient learning and evaluation of complex concepts in In- ductive Logic Programming.” PhD thesis. Department of Computing, Imperial College London (cit. on pp. 178, 250). Sariyildiz, E. and H. Temeltas (2011). “Performance analysis of numerical integra- tion methods in the trajectory tracking application of redundant robot manipu- lators.” International Journal of Advanced Robotic Systems 8 (5), pp. 25–38 (cit. on p. 253). Schaal, S. (2010). Dynamic Movement Primitives (DMP) software. URL: http: //www-clmc.usc.edu/Resources/Software (cit. on pp. 169, 250). Schaal, S., A. Ijspeert, and A. Billard (2003). “Computational approaches to motor learning by imitation.” Philosophical Transactions of the Royal Society B: Biological Sciences 358 (1431), pp. 537–547 (cit. on pp. 55, 56). Scheutz, M. (2006). “ADE: Steps toward a distributed development and runtime environment for complex robotic agent architectures.” Applied Artificial Intelli- gence 20 (2-4), pp. 275–304 (cit. on p. 45). Scheutz, M., P. Schermerhorn, J. Kramer, and D. Anderson (2007). “First steps toward natural human-like HRI.” Autonomous Robots 22 (4), pp. 411–423 (cit. on p. 44).


Schmidt, W. (1899). Wikimedia Commons: Heron’s Windwheel. Herons von Alexan- dria Druckwerke und Automatentheater. URL: http://commons.wikimedia.org/ wiki/File:Heron%27s_Windwheel.jpg (cit. on p.2). Sen, S., G. Sherrick, D. Ruiken, and R. Grupen (2011). “Choosing informative ac- tions for manipulation tasks.” Humanoid Robots, 11th IEEE-RAS International Conference on (Humanoids 2011). Bled, Slovenia: IEEE, pp. 721–726 (cit. on p. 53). Settles, B. (2012). Active Learning. Synthesis lectures on Artificial Intelligence and Machine Learning. San Rafael, CA: Morgan & Claypool (cit. on pp. 49, 203). Shanahan, M. (2006). “A cognitive architecture that combines internal simulation with a global workspace.” Consciousness and Cognition 15, pp. 433–449 (cit. on p. 31). Shanahan, M. (2007). “A vision-based intelligent system for packing 2-D irregular shapes.” IEEE Transactions on Automation Science and Engineering 4, pp. 382– 394 (cit. on p. 32). Shanahan, M. and A. Bouganis (2008). “Object Detection for a Humanoid Robot Using a Probabilistic Global Workspace.” 4th International Cognitive Vision Workshop (ICVW 2008). Ed. by B. Caputo and M. Vincze. Santorini, Greece: Springer (cit. on pp. 31, 32). Sharkey, N. (2007). “The Programmable Robot of Ancient Greece.” New Scientist 2611, pp. 32–35 (cit. on p. 2). Sheh, R. K.-M. (2010). “Learning Robot Behaviours by Observing and Envisag- ing.” PhD thesis. School of Computer Science and Engineering, University of New South Wales (cit. on pp. 120, 121, 124, 126, 127, 135, 242, 250). Sheh, R. K.-M., B. Hengst, and C. Sammut (2011). “Behavioural cloning for driving robots over rough terrain.” Intelligent Robots and Systems (IROS 2011), IEEE/RSJ International Conference on. San Francisco, CA: IEEE (cit. on pp. 19, 111, 120, 123, 135, 249).


Sheh, R. K.-M., A. Milstein, M. McGill, R. Salleh, and B. Hengst (2009). “Semi- Autonomous Robots for RoboCup Rescue.” Proceedings of the Australasian Conference on Robotics and Automation (ACRA 2009). Sydney, Australia, pp. 1–10 (cit. on pp. 78, 87, 88, 95, 119, 251). Siciliano, B. and O. Khatib, eds. (2008). Springer handbook of robotics. Heidelberg, Germany: Springer-Verlag (cit. on p. 253). Simon, H. (1962). “The architecture of complexity.” Proceedings of the American Philosophical Society 106 (6), pp. 467–482 (cit. on p.7). Singh, P. (2004). “Examining the society of mind.” Computing and Informatics 22 (6), pp. 521–544 (cit. on p. 39). Singh, P. (2005). “Em-one: an architecture for reflective commonsense think- ing.” PhD thesis. Department of Electrical Engineering and Computer Science (EECS), Massachusetts Institute of Technology (cit. on pp. 11, 39). Singh, R. and K. Sycara (2004). Securing multi agent societies. Tech. rep. CMU- RI-TR-04-02. Robotics Institute, Carnegie Mellon University (cit. on p. 34). Smits, R. (2014). KDL: Kinematics and Dynamics Library. URL: http://www.orocos. org/kdl (cit. on pp. 167, 251, 253). Sohn, M., S. Ursu, J. R. Anderson, V. A. Stenger, and C. Carter (2000). “The role of prefrontal cortex and posterior parietal cortex in task switching.” Proceedings of the National Academy of Sciences 97 (24), pp. 13448–13453 (cit. on p. 30). Stone, P. (2000). Layered learning in multiagent systems: A winning approach to robotic soccer. Cambridge, MA: MIT Press (cit. on p. 65). Stone, P. and M. Veloso (2000). “Layered Learning.” 11th European Conference on Machine Learning (ECML 2000). Ed. by R. López de Mántars and E. Plaza. Barcelona, Spain: Springer (cit. on p. 65). Stonier, D. (2014). RosJava: the first pure Java implementation of ROS. URL: http://wiki.ros.org/rosjava (cit. on pp. 85, 243). Stoytchev, A. (2007). “Robot behavior: A developmental approach to autonomous tool use.” PhD thesis. Institute of Robotics and Intelligent Machines, George Institute of Technology (cit. on p. 53).


Stulp, F. and M. Beetz (2008). “Refining the execution of abstract actions with learned action models.” Journal of Artificial Intelligence Research 32, pp. 487– 523 (cit. on p. 67). Stulp, F., G. Raiola, A. Hoarau, S. Ivaldi, and O. Sigaud (2013). “Learning compact parameterized skills with a single regression.” Humanoid Robots, 13th IEEE- RAS International Conference on (Humanoids 2013). Atlanta, GA: IEEE, pp. 1– 7 (cit. on p. 47). Stulp, F. and S. Schaal (2011). “Hierarchical reinforcement learning with movement primitives.” Humanoid Robots, 11th IEEE-RAS International Conference on (Humanoids 2011). Bled, Slovenia: IEEE, pp. 231–238 (cit. on pp. 58, 66, 162, 169, 202). Stulp, F., S. Schaal, E. Theodorou, and J. Buchli (2011). “Learning to grasp under uncertainty.” Robotics and Automation, IEEE International Conference on (ICRA 2011). Shanghai, China: IEEE, pp. 5703–5708 (cit. on p. 57). Sucan, I. A. and S. Chitta (2014). MoveIt! State of the art software for mobile manipulation. URL: http://moveit.ros.org (cit. on pp. 191, 251). Sun, R. (2005). “The CLARION cognitive architecture: Extending cognitive model- ing to social simulation.” Cognition and multi-agent interaction: From cognitive modeling to social simulation. Ed. by R. Sun. Cambridge University Press. Chap. 4 (cit. on p. 33). Sushkov, O. O. and C. Sammut (2012). “Active robot learning of object proper- ties.” Intelligent Robots and Systems (IROS 2012), IEEE/RSJ International Conference on. Vilamoura, Algarve, Portugal: IEEE (cit. on pp. 54, 203). Sutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An introduction. Cambridge, MA: MIT Press (cit. on p. 50). Swartout, W., J. Gratch, W. R. Hill, E. Hovy, S. Marsella, J. Rickel, and D. Traum (2006). “Toward virtual humans.” AI Magazine 27 (2), pp. 96–108 (cit. on pp. 29, 33).


Sycara, K., J. Giampapa, B. Langley, and M. Paolucci (2003). “The RETSINA MAS, a case study.” Software engineering for large-scale multi-agent systems. Springer-Verlag, pp. 232–250 (cit. on p. 33).
Taatgen, N. (2005). “Modeling Parallelization and Flexibility Improvements in Skill Acquisition: From Dual Tasks to Complex Dynamic Skills.” Cognitive Science 29, pp. 421–455 (cit. on p. 30).
Tenorth, M. and M. Beetz (2013). “KnowRob: A knowledge processing infrastructure for cognition-enabled robots.” The International Journal of Robotics Research 32 (5), pp. 566–590 (cit. on p. 5).
Theodorou, E., J. Buchli, and S. Schaal (2010). “A generalized path integral control approach to reinforcement learning.” The Journal of Machine Learning Research 11, pp. 3137–3181 (cit. on pp. 58, 202).
Ng-Thow-Hing, V., K. Thórisson, R. Sarvadevabhatla, J. Wormer, and T. List (2009). “Cognitive map architecture.” Robotics & Automation Magazine, IEEE 16 (1), pp. 55–66 (cit. on p. 33).
Thrun, S., W. Burgard, and D. Fox (2006). Probabilistic robotics. Intelligent Robots and Autonomous Agents Series. The MIT Press (cit. on pp. 11, 74).
Trafton, J., N. L. Cassimatis, M. D. Bugajska, D. P. Brock, F. E. Mintz, and A. C. Schultz (2005). “Enabling effective human-robot interaction using perspective-taking in robots.” Systems, Man and Cybernetics, IEEE Transactions on 35, pp. 460–470 (cit. on p. 39).
Turing, A. M. (1950). “Computing Machinery and Intelligence.” Mind 59 (236), pp. 433–460 (cit. on p. 3).
Ude, A., B. Nemec, T. Petric, and J. Morimoto (2014). “Orientation in Cartesian Space Dynamic Movement Primitives.” Robotics and Automation, IEEE International Conference on (ICRA 2014). Hong Kong, China: IEEE (cit. on pp. 169, 200).
Ugur, E., E. Oztop, and E. Sahin (2011). “Goal emulation and planning in perceptual space using learned affordances.” Robotics and Autonomous Systems 59 (7-8), pp. 580–595 (cit. on p. 53).


Vainer, K. (2013). jMonkeyEngine 3.0 | Java OpenGL game engine. URL: http://jmonkeyengine.com/ (cit. on pp. 18, 87, 239).
Veloso, M., J. G. Carbonell, A. Perez, D. Borrajo, E. Fink, and J. Blythe (1995). “Integrating planning and learning: The PRODIGY architecture.” Journal Of Experimental & Theoretical Artificial Intelligence 7 (1), pp. 81–120 (cit. on p. 60).
Vernon, D., G. Metta, and G. Sandini (2007). “A survey of artificial cognitive systems: Implications for the autonomous development of mental capabilities in computational agents.” Evolutionary Computation, IEEE Transactions on 11 (2), pp. 151–180 (cit. on pp. 31, 33).
Wagner, G., M. Pavlicev, and M. Cheverud (2007). “The road to modularity.” Nature Reviews Genetics 8, pp. 921–931 (cit. on p. 7).
Wagner, G. and J. Zhang (2011). “The pleiotropic structure of the genotype–phenotype map: the evolvability of complex organisms.” Nature Reviews Genetics 12, pp. 204–213 (cit. on pp. 7, 8).
Wampler, C. W. (1986). “Manipulator Inverse Kinematic Solutions Based on Vector Formulations and Damped Least-Squares Methods.” Systems, Man and Cybernetics, IEEE Transactions on 16 (1), pp. 93–101 (cit. on pp. 167, 253).
Wiener, E. J. (2011). Maze Generators - Prim’s Algorithm. URL: http://www.ccs.neu.edu/home/snuffy/maze/prim/ (cit. on pp. 101, 240).
Wilensky, R. (1983). Planning and understanding: A computational approach to human reasoning. Reading, MA: Addison-Wesley (cit. on p. 114).
Wiley, T., M. McGill, A. Milstein, R. Salleh, and C. Sammut (2012). “Spatial Correlation of Multi-sensor Features for Autonomous Victim Identification.” RoboCup 2011: Robot Soccer World Cup XV. Vol. 7416. Lecture Notes in Computer Science. Berlin / Heidelberg: Springer, pp. 538–549 (cit. on p. 91).
Witten, I. H., E. Frank, and M. A. Hall (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Amsterdam, The Netherlands: Morgan Kaufmann (cit. on p. 124).


Wyatt, J. and N. Hawes (2008). “Multiple workspaces as an architecture for cognition.” Biologically Inspired Cognitive Architectures, Papers from the 2008 AAAI Fall Symposium. Arlington, Virginia: AAAI Press, pp. 201–206 (cit. on p. 9).
Yamauchi, B. (1997). “A frontier-based approach for autonomous exploration.” Proceedings of the 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA’97). Monterey, California: IEEE, pp. 146–151 (cit. on pp. 115, 251).
Yoshida, T., K. Nagatani, S. Tadokoro, T. Nishimura, and E. Koyanagi (2014). “Improvements to the rescue robot Quince toward future indoor surveillance missions in the Fukushima Daiichi nuclear power plant.” Field and Service Robotics. Ed. by K. Yoshida and S. Tadokoro. Springer Tracts in Advanced Robotics 92. Heidelberg, Germany: Springer-Verlag, pp. 19–32 (cit. on p. 69).

Glossary

‘pick and place’ Behaviours in which a robot grasps and lifts one or more objects and places it or them in a desired location. 52, 57

A* A best-first graph search algorithm which aims to find the lowest cost path between an initial node and one of a set of goal nodes. A* follows a path of lowest expected total cost or distance, based on a heuristic function. 119, 121–123

DMP A mathematical formalism that encodes physical movements as systems of coupled differential equations, using methods from dynamical systems theory. DMPs are introduced in Section 2.6.2 and formally defined in Section 5.3.2. xviii, 147, 250

ECLiPSe ECLiPSe Constraint Logic Programming System, a software system for specification and solution of constraint satisfaction problems. It contains constraint solver libraries, and forms a high-level constraint modelling language which is a superset of Prolog. See http://eclipseclp.org. xvii, 93, 94, 118, 119, 158, 248, 251

inverse kinematics The transformation from the coordinates of the end-effector of a mobile manipulator, usually a robot arm, to the set of joint angles that achieve that pose. The opposite transformation is known as forward kinematics. 162, 167, 173


LIDAR A remote sensing technology that measures distance by illuminating a target with a laser and analyzing the reflected light. 11

PDDL Planning Domain Definition Language (PDDL) is a standard formalism for the specification of automated planning problems, introduced by McDermott et al. (1998). xiv, xix, 41, 92, 112, 149

PR2 Personal Robot 2, a two-armed mobile manipulator robot built by Willow Garage robotics. See https://www.willowgarage.com/pages/pr2/overview. 66

RoboCup Rescue An international contest for autonomous and teleoperated rescue robots to execute search and rescue missions in disaster settings. See: http://www.robocuprescue.org. For 2014 competition rules, see: http://wiki.ssrrsummerschool.org/doku.php?id=rrl-rules-2014. 18, 87, 88, 107, 109–111, 131–134, 136, 205

rosjava A pure Java implementation of the robot operating system (ROS). 85

stepfield A rectangular grid of wooden blocks of varying heights, designed to simulate rough terrain. xiv, xvi, 19, 101, 102, 108, 109, 111, 131, 132, 134, 135, 138–144

task space The coordinate frame of the end-effector of a robot manipulator. Also known as operational space (Nakanishi et al., 2008). 167

teleoperation Remote control of a robot by a human operator, typically via radio frequency (RF) communication. 69

version space The space of possible hypotheses that are consistent with the training data available to a machine learner. Typically bounded by the most-specific and most-general hypotheses (Mitchell, 1982). 171, 187, 193

Weka An open source collection of machine learning algorithms for data mining tasks (Hall et al., 2009). 124, 127, 128

Appendix A

The jmeSim Environment

jmeSim is built on top of a game engine, specifically the open source jMonkeyEngine3 game engine (Vainer, 2013). jmeSim is written in pure Java and is platform neutral: it can run on any operating system. The jMonkeyEngine3 engine uses OpenGL for rendering, incorporating high-performance rendering methods including shaders, per-pixel lighting, and an array of post-processing filters, which efficiently produce excellent graphical performance at real-time frame rates on desktop machines. Qualitatively speaking, the graphics performance of jmeSim is as good as or better than that of other robotics simulators, including commercial packages. Physics in jmeSim is performed using the Bullet Physics library (Coumans, 2010). Bullet is based on a multi-threaded dynamics solver which provides excellent accuracy, speed, and stability, outperforming other open source and even some commercial physics engines (Boeing and Bräunl, 2007). The game engine runs as two threads, one for physics updates and another to render the scene. Notably, Bullet provides facilities for convex hull generation for complex meshes, which means that mesh-accurate collisions can be modelled for shapes that are not easily expressible as a collection of primitives. This is a decided advantage over older physics engines such as ODE, which does not have this capacity natively. Although a quantitative analysis of the CPU requirements of jmeSim is beyond the scope of this thesis,

the multi-threaded Bullet physics allows simulations involving complex articulated robots and hundreds of physics objects to run at realistic frame rates on standard desktop hardware. jMonkeyEngine3 provides a collection of low-level libraries for object culling, rendering, and texturing, so that simulator objects can be specified in a high-level manner. jmeSim takes advantage of these capabilities to provide a high-level interface for defining robots, world objects, and environment configurations. Robot initial locations and sensor capabilities are specified in a configuration file, and environment descriptions are specified using vector graphics files, which can be drawn by hand or generated programmatically, to populate the world with rooms, corridors, and other objects.
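As a concrete illustration of this configuration mechanism, the following short Java snippet loads a toy robot description in Java-properties form. The key names are hypothetical and are not the actual jmeSim configuration format; they simply show the kind of information (start pose, sensor suite) such a file carries.

import java.io.StringReader;
import java.util.Properties;

/** Illustrative only: the property names below are hypothetical, not the jmeSim format. */
public class RobotConfigSketch {
    public static void main(String[] args) throws Exception {
        String example =
                "robot.start.x = 1.5\n" +
                "robot.start.y = 0.0\n" +
                "robot.sensors = camera,depth,thermal,laser\n";
        Properties config = new Properties();
        config.load(new StringReader(example));           // in practice, loaded from a file
        System.out.println("Start pose: (" + config.getProperty("robot.start.x")
                + ", " + config.getProperty("robot.start.y") + ")");
        System.out.println("Sensors: " + config.getProperty("robot.sensors"));
    }
}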

A.1 Environment Generation

Since our research is primarily in the area of mobile rescue robots, the simulation package includes a module for batch generation of random mazes, including obstacles and victims, based on a modified version of Prim’s algorithm (Wiener, 2011), shown in Algorithm 14. An example of a generated vector graphics file and the corresponding simulation environment produced is shown in Figure A.1 on page 242.

A.2 Robot Model and Control

jmeSim contains a model of a wheeled rescue robot with a four degree-of-freedom sensor head containing an array of sensors. The robot model, along with a photo of the real robot it is based on, is shown in Figure 1.4. As in many simulation packages, jMonkeyEngine allows specification of complex articulated physical objects in terms of primitive geometric objects, so that a wide range of robot models can be incorporated into jmeSim. If the simulation software is implemented in a stand-alone configuration, control code for robots can be written in Java. However, the simulation is designed for integration with other packages to define robot control, in particular ROS, which is language neutral, allowing control code to be written in any required formalism. Integration of ROS and jmeSim is described in detail in later sections.


Algorithm 14 Random generation of rescue arena mazes. Java-like Pseudocode

GENERATE_MAZE
Input: Maze width w, height h, nremove
1: Create Maze, a new grid of cell locations
2: Frontiers is an empty list of cells
3: Select a random start cell c
4: Add all of c's neighbours to Frontiers
5: while Frontiers ≠ {} do
6:    c = pick a random cell from Frontiers
7:    c.breakWall(c.randomInNeighbor())    ▷ break the wall between c and a neighbour
8:    c.setIn();
9:    for Cell i ∈ c.frontierNeighbors() do
10:       if !Frontiers.contains(i) then Frontiers.add(i);
11:       end if
12:    end for
13:    Frontiers.remove(c);
14: end while
15: for int i = 0 : nremove do
16:    c = pick a random cell from Maze
17:    c.removeAllWalls()
18: end for
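To make the randomised-Prim procedure of Algorithm 14 concrete, the following self-contained Java sketch generates a rectangular grid maze by repeatedly connecting a random frontier cell to the maze. It is illustrative only: the class and method names are hypothetical, and it omits the final step of Algorithm 14, which removes all walls from nremove random cells to create open areas.

import java.util.*;

/** Minimal randomised-Prim maze generator (illustrative only; not the jmeSim code). */
public class PrimMazeSketch {
    // Direction offsets: 0 = north, 1 = east, 2 = south, 3 = west.
    static final int[] DX = {0, 1, 0, -1};
    static final int[] DY = {-1, 0, 1, 0};

    /** walls[y][x][d] == true means the wall on side d of cell (x, y) is still intact. */
    public static boolean[][][] generate(int w, int h, long seed) {
        Random rnd = new Random(seed);
        boolean[][][] walls = new boolean[h][w][4];
        for (boolean[][] row : walls) for (boolean[] cell : row) Arrays.fill(cell, true);
        boolean[][] in = new boolean[h][w];
        List<int[]> frontier = new ArrayList<>();

        int sx = rnd.nextInt(w), sy = rnd.nextInt(h);
        in[sy][sx] = true;
        addFrontiers(sx, sy, w, h, in, frontier);

        while (!frontier.isEmpty()) {
            int[] cell = frontier.remove(rnd.nextInt(frontier.size()));
            int cx = cell[0], cy = cell[1];
            if (in[cy][cx]) continue;                     // a cell may be queued more than once
            // Pick a random neighbour that is already in the maze and break the shared wall.
            List<Integer> dirs = new ArrayList<>();
            for (int d = 0; d < 4; d++) {
                int nx = cx + DX[d], ny = cy + DY[d];
                if (nx >= 0 && nx < w && ny >= 0 && ny < h && in[ny][nx]) dirs.add(d);
            }
            int d = dirs.get(rnd.nextInt(dirs.size()));
            walls[cy][cx][d] = false;
            walls[cy + DY[d]][cx + DX[d]][(d + 2) % 4] = false;   // other side of the same wall
            in[cy][cx] = true;
            addFrontiers(cx, cy, w, h, in, frontier);
        }
        return walls;
    }

    private static void addFrontiers(int x, int y, int w, int h, boolean[][] in, List<int[]> frontier) {
        for (int d = 0; d < 4; d++) {
            int nx = x + DX[d], ny = y + DY[d];
            if (nx >= 0 && nx < w && ny >= 0 && ny < h && !in[ny][nx]) frontier.add(new int[]{nx, ny});
        }
    }

    public static void main(String[] args) {
        boolean[][][] maze = generate(8, 6, 42);
        System.out.println("Generated a maze with " + maze.length * maze[0].length + " cells.");
    }
}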

A.2.1 Sensor Simulation

jmeSim provides an array of sensors which provide the data typically needed by an autonomous mobile robot. The sensors built in to jmeSim are a laser range finder, sonar range finder, colour camera, depth camera, thermal camera, wheel odometry, and an inertial measurement unit for orientation sensing. Simple sensors such as sonar and LIDAR are modelled using physics collision detection and ray casts, respectively. More advanced sensors such as cameras and depth cameras require rendering the simulation scene from the perspective of the simulated camera,


and jmeSim takes advantage of the multithreaded rendering capabilities of jMonkeyEngine3 to produce high frame rates. The colour camera is produced in exactly the same way as the main scene image; the depth camera, however, is slightly more complex. For each virtual camera in the scene, OpenGL, the rendering engine used by jmeSim, stores the distances of every world object in view in order to produce a visualisation with objects appropriately occluded. These values are stored in an array called the depth buffer or Z-buffer. The depth camera image in jmeSim is rendered by reading the depth buffer and constructing an image from its values. A detailed description of the method can be found in Sheh (2010). We are particularly interested in using jmeSim for simulating autonomous robotic rescue, and it is therefore important to model a thermal camera, which is essential for detecting body heat. As such, jmeSim includes a thermal camera sensor, which assigns each object a temperature and renders its brightness in a greyscale image accordingly. To simulate the image of a real thermal camera, a Gaussian blur is applied to the image. The sensor output from the colour, depth, and thermal cameras can be seen in Figure A.2.

Figure A.1: Environment generation in jmeSim. Right: a vector graphics file specifying the locations of victims, raised areas, and obstacles in blue, red, and black, respectively. Left: the jmeSim environment generated by this world file.


Figure A.2: Sensor data in jmeSim. On the left, the main camera view shows a simulated robot traversing some rough terrain, approaching a victim lying on the ground. Simulated sensor data feeds clockwise from top right: colour camera, thermal camera, depth camera, and laser range finder.
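The depth-image construction described above amounts to mapping Z-buffer values to pixel intensities. The following sketch illustrates the idea on a plain float array using standard Java imaging classes; it is not the jMonkeyEngine3 code, and the buffer layout and value range are assumptions made only for illustration.

import java.awt.image.BufferedImage;

/** Illustrative conversion of an OpenGL-style depth buffer (row-major, values in [0,1])
 *  into a greyscale image, in the spirit of the jmeSim depth camera. Not the jmeSim code. */
public class DepthImageSketch {
    public static BufferedImage depthToImage(float[] depth, int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int v = (int) (255 * Math.min(1f, Math.max(0f, depth[y * width + x])));
                int grey = 255 - v;                       // nearer objects drawn brighter (a presentation choice)
                img.setRGB(x, y, (grey << 16) | (grey << 8) | grey);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        float[] fakeDepth = new float[64 * 48];
        java.util.Arrays.fill(fakeDepth, 0.5f);           // stand-in for a buffer read back from OpenGL
        BufferedImage img = depthToImage(fakeDepth, 64, 48);
        System.out.println("Depth image: " + img.getWidth() + "x" + img.getHeight());
    }
}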

A.3 Integration with ROS

As discussed above, ROS has been gaining steadily in popularity amongst the research community because of its open source, free, and multi-platform nature. For this reason it is important to be able to integrate jmeSim with ROS. This is achieved through the recently developed rosjava framework (Stonier, 2014), a pure Java implementation of ROS. A system built using ROS is composed of a number of independent processes, connected in a peer-to-peer topology, which communicate via publish-subscribe message passing. Each process is referred to as a ROS node. To integrate with ROS, jmeSim runs a rosjava node which acts as an interface between the Java simulation and other ROS processes in the system. The rosjava node runs in a separate thread and allows messages to pass back and forth between the simulation and a ROS system. A schematic of the connectivity of ROS and jmeSim is shown in Figure A.3. Comparisons of the results of ROS software modules run using input from jmeSim and from a real robot are given by Haber et al. (2012).
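The bridging pattern described here, a simulation thread handing data to a node thread that talks to ROS, can be sketched in plain Java as follows. The class and method names are hypothetical, and the rosjava publisher and subscriber calls are replaced by a placeholder, so this is only a schematic of the structure, not the jmeSim implementation.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Schematic of the simulation-to-ROS bridge: the simulation thread enqueues sensor
 *  readings and a separate node thread drains the queue and publishes them.
 *  Hypothetical names; the real jmeSim node uses rosjava publishers and subscribers. */
public class RosBridgeSketch {
    private final BlockingQueue<String> sensorOut = new LinkedBlockingQueue<>();

    /** Called from the simulation (physics/render) side. */
    public void postSensorReading(String reading) {
        sensorOut.offer(reading);
    }

    /** Runs in its own thread, standing in for the rosjava node's publish loop. */
    public void runNodeLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String reading = sensorOut.take();            // blocks until the simulation posts data
            publish(reading);                             // placeholder for a rosjava Publisher.publish(msg)
        }
    }

    private void publish(String reading) {
        System.out.println("publish: " + reading);
    }

    public static void main(String[] args) throws InterruptedException {
        RosBridgeSketch bridge = new RosBridgeSketch();
        Thread node = new Thread(() -> {
            try { bridge.runNodeLoop(); } catch (InterruptedException ignored) { }
        });
        node.start();
        bridge.postSensorReading("laser_scan_stub");
        Thread.sleep(100);                                // let the node thread publish the reading
        node.interrupt();
    }
}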


Figure A.3: ROS-jmeSim connectivity. jmeSim runs in three threads: one solves physics calculations, one performs rendering, and a third runs a rosjava node that publishes sensor data and subscribes to control data.

A.4 Running jmeSim

To use jmeSim, begin by downloading the package30. To run a demonstration of the simulation environment:

1. Ensure that a Java Runtime Environment (JRE) is present on the system.

2. Unzip the jmeSim.zip archive

3. If not already present, install the Apache Ant Java build tool.

4. To build jmeSim type: » ant clean jar

5. Navigate to the /dist directory. Then, to launch jmeSim31, type:

30See http://www.cse.unsw.edu.au/~adamh/jme.html
31jmeSim requires reasonable graphics capability to run at acceptable speeds. See http://hub.jmonkeyengine.org/wiki/doku.php/jme3:hardware_compatibility_list for detailed system requirements.

244 A.4. Running jmeSim

» java -jar jmeSim.jar

You will see a mobile robot which can be driven around the environment using the ‘R,T,Y,G’ keys, while the camera may be moved using ‘W,A,S,D’ and the mouse. The smaller windows on the left of the main window show the output produced by the robot’s simulated sensors. To edit the simulation or add robot control code, view and edit the source files in the /src directory. A preferred option is to download the jMonkeyPlatform32 SDK (or the Netbeans IDE, on which jMonkeyPlatform is based) and edit jmeSim within the integrated development environment. To do this, select ‘Import Project’ from the file menu and navigate to the (unzipped) jmeSim directory, which will be visible as a project directory.

32jMonkeyPlatform can be downloaded from http://jmonkeyengine.org/downloads/.

Appendix B

MALA Implementation

B.1 User Guide

In this section we provide instructions to compile and run MALA. This will launch a jmeSim instance providing a simulation of the robot and environment. Details on running jmeSim in isolation are given in Section A.4.

B.1.1 Installation and Dependencies

MALA depends on several external libraries. To build MALA, install:

• ROS [http://wiki.ros.org/ROS/Installation]

• MATLAB [http://www.mathworks.com.au/products/matlab/]

• Point Cloud Library (PCL) [http://pointclouds.org/downloads/]

• OpenCV [http://opencv.org/downloads.html]

Note that several of the above libraries (particularly PCL) depend on further third-party libraries, and these dependencies must also be satisfied to build MALA.

246 B.1. User Guide

B.1.2 Running MALA

MALA has been tested on Mac OS X and Linux systems; Windows is not supported. To run the MALA system used for the experiments of Chapter 4, the following steps must be followed:

• Run the following command (in the MALA base directory) to compile MALA using the Ant Java build system: » ant clean jar

• Select a machine (local or remote) to run the ROS master process33. Run the ROS core on that machine: » ssh user@coremachine » roscore

• Execute the launch script (in the MALA home directory) to start MALA with a stepfield difficulty index (Section 4.6.2) d = 0.5: » ./scripts/final/exp1/learning.sh coremachine 0.5

To run the active learning system described in Chapter 5, follow a similar set of steps, starting in the LearningByDemonstration directory:

• Run the following command (in the LearningByDemonstration base directory) to compile, using the Ant Java build system: » ant clean jar

• Select a machine (local or remote) to run the ROS master process33. Run the ROS core on that machine: » ssh user@coremachine » roscore

• Execute the launch script (in the LearningByDemonstration home directory) to start MALA: » ./run_learning.sh coremachine

33See http://wiki.ros.org/ROS/Tutorials for ROS tutorials.

247 B.2. Agent Implementation

B.2 Agent Implementation

This section provides implementation details, including the algorithms and packages used, for the complete set of agents used in the MALA instances in this thesis. Table B.1 lists the languages that are currently supported for MALA agents. Tables B.2–B.4 on pages 249–251 give details of the algorithms, packages, and languages used to implement these agents.

Language | Interface | Libraries
Java/Scala | - | -
C/C++ | Java Native Interface (JNI) | JavaCPP (Audet, 2014)
Matlab | Java Matlab Interface (JMI) | matlabcontrol (2014)
ECLiPSe Prolog | ECLiPSe C/C++ Interface | -
Prolog | Java Prolog Library (JPL)34 | -

Table B.1: Agents written in five languages are currently supported in MALA. This table gives details of the interfaces through which the Java agent wrapper is implemented. Note that no interface is needed for Java/Scala, since the base Agent wrapper is written in Java.

34Where YAP Prolog (Santos Costa et al., 2012) is used, JPL could not be used. In this case, interprocess communication was used to communicate with the Prolog interpreter.
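To illustrate how a foreign-language agent can sit behind the common Java wrapper that Table B.1 describes, the sketch below shows one possible shape for such a wrapper. All names here are hypothetical; they do not reproduce the actual MALA Agent API, and the JNI-backed native call is shown only as a commented-out placeholder so the sketch compiles and runs without native code.

/** Hypothetical agent wrapper: names and signatures are illustrative, not the MALA API. */
abstract class AgentSketch implements Runnable {
    /** One perception-action step: consume an input frame, produce an output frame. */
    protected abstract String step(String inputFrame);

    @Override
    public void run() {
        // In MALA, input and output would be exchanged with the rest of the system;
        // here we simply run a single step on a canned input.
        System.out.println("agent output: " + step("example-input-frame"));
    }
}

/** A C/C++ agent would delegate step() to native code loaded via JNI (cf. Table B.1). */
public class NativeAgentSketch extends AgentSketch {
    // static { System.loadLibrary("native_agent"); }    // hypothetical shared library name
    // private native String nativeStep(String inputFrame);

    @Override
    protected String step(String inputFrame) {
        // return nativeStep(inputFrame);                 // a real wrapper would call into C/C++
        return "stub-output";                             // stub so the sketch runs standalone
    }

    public static void main(String[] args) {
        new Thread(new NativeAgentSketch()).start();
    }
}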


Function | Language | Algorithms | Packages
Victim detection | Java | Adaptive Hue thresholding and connected-components blob extraction | OpenCV (Bradski and Kaehler, 2008)
Object detection | C++ | Euclidean cluster detection | Point Cloud Library (Rusu and Cousins, 2011a)
Obstacle detection | Java | Sobel edge detection | -
Mapping | C++ | FastSLAM | (Milstein et al., 2011)
Localisation | C++ | Iterative Closest Point (ICP) | Metric-based Iterative Closest Point (MBICP) (Milstein et al., 2011)
Terrain feature extraction | Java | Planes of best fit | (Sheh et al., 2011)
World model builder | Java | - | -
Damaged ceiling detector | Java | Adaptive Hue thresholding and connected-components blob extraction | OpenCV (Bradski and Kaehler, 2008)

Table B.2: Details of algorithms, language, and software packages used to implement all perception agents needed for the experiments described in Chapters 3, 4 and 5.


Function | Language | Algorithms | Packages
Dynamic Motion Primitive (DMP) induction and planning | Matlab | - | DMP software package (Schaal, 2010), published by the Motor Control Learning lab at the University of Southern California
Inductive Logic Programming (ILP) active learner | Prolog | Progolem (Muggleton et al., 2010) | Generalised Inductive Logic Programming System (GILPS) (Santos, 2010)
Explanation-based learner | Java / Prolog | (Brown, 2009) | -
Terrain drive learner | Java | A* autonomous demonstrator, plane feature extraction from (Sheh, 2010) | J48 Decision Tree from Weka (Hall et al., 2009)
Terrain recognition learner | Java | ” | ”

Table B.3: Details of algorithms, language, and software packages used to implement all learning agents needed for the experiments described in Chapters 3, 4 and 5.


Function | Language | Algorithms | Packages
Robot base path planning | C++ | A* | -
Robot base drive control | C++ | Almost-Voronoi controller (Sheh et al., 2009) | -
Robot arm motion planning | C++ | Rapidly-exploring Random Trees (RRT) | MoveIt! (Sucan and Chitta, 2014)
Inverse dynamics controller | Java/C++ | Damped least squares | Orocos KDL (Smits, 2014)
Constant velocity controller | Java/C++ | ikFast analytical kinematic solver | OpenRave (Diankov, 2008)
Task planner | C | Fast-Forward (Hoffmann and Nebel, 2001) | -
Constraint solver (Ch. 3, 4) | ECLiPSe Prolog | - | -
Constraint solver (Ch. 5) | Java/Scala | - | JaCop (Kuchcinski and Szymanek, 2014) floating point constraints
Parallel-jaw gripper control | Java | PID-control | -
Emu robot sensor ‘neck’ control | Java | PID-control | -
Execution monitor | Java | - | -
Exploration | Java | Frontier-based exploration (Yamauchi, 1997) | -
Goal generator | Java | - | -

Table B.4: Details of algorithms, language, and software packages used to implement all action agents needed for the experiments described in Chapters 3, 4 and 5.

Appendix C

Supplementary Analysis and Results

C.1 Inverse Dynamics Controller

There are a number of possible formulations for inverse dynamics control of redundant robotic manipulators. The fundamental equation relating motion in task and joint space is given in terms of the Jacobian matrix J

ẋ = J(q)q̇    (C.1)

where ẋ = (v_x, v_y, v_z, ω_x, ω_y, ω_z) is the spatial velocity vector of the end effector, the q_i are the joint angles of the robot arm, the ω_i are the components of the angular velocity vector, and the Jacobian is defined as

J(q)_{i,j} = ∂x_i / ∂q_j    (C.2)

Thus, the desired joint velocity q̇ can be obtained by inverting (C.1). This is the basis of resolved-rate motion control (Corke, 2011). However, problems arise when the Jacobian matrix is singular or close to singular, as the matrix inverse no longer exists. A typical approach used in the literature to avoid this problem is based on the Jacobian pseudo-inverse, which has improved properties in the neighbourhood of singularities (Buss, 2009): it always exists. However, although

the pseudo-inverse always exists, controllers based on the pseudo-inverse tend to exhibit significant instability near singular configurations (Siciliano and Khatib, 2008). In this work the damped least-squares method, first proposed by Wampler (1986), is used to further improve performance near singularities. This is achieved by minimising the quantity

|Jq̇ − ẋ|² + λ²|q̇|²,    (C.3)

where λ is a non-zero damping constant. Then the desired joint velocity is given by

q̇ = Jᵀ(JJᵀ + λ²I)⁻¹ẋ.    (C.4)
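The damped least-squares velocity in Eqn. C.4 can be computed directly with any linear algebra package. The following Java sketch does so using the Apache Commons Math library (an illustrative choice; the controller in this work used KDL, as described below) for a made-up Jacobian of a redundant 7-DOF arm, with a damping constant of 0.01 as used in this work.

import org.apache.commons.math3.linear.*;

/** Damped least-squares joint velocities (Eqn. C.4): qdot = J^T (J J^T + lambda^2 I)^-1 xdot.
 *  Illustrative only; the Jacobian and twist below are made up for the example. */
public class DampedLeastSquaresSketch {
    public static RealVector dlsVelocity(RealMatrix J, RealVector xdot, double lambda) {
        int m = J.getRowDimension();                       // task-space dimension (6)
        RealMatrix JJt = J.multiply(J.transpose());
        RealMatrix damped = JJt.add(
                MatrixUtils.createRealIdentityMatrix(m).scalarMultiply(lambda * lambda));
        RealVector y = new LUDecomposition(damped).getSolver().solve(xdot);
        return J.transpose().operate(y);                   // qdot = J^T y
    }

    public static void main(String[] args) {
        // A toy 6x7 Jacobian for a redundant 7-DOF arm (arbitrary values, illustration only).
        double[][] j = new double[6][7];
        for (int r = 0; r < 6; r++) for (int c = 0; c < 7; c++) j[r][c] = Math.sin(r + 2 * c);
        RealVector xdot = new ArrayRealVector(new double[]{0.1, 0, 0, 0, 0, 0.05});
        RealVector qdot = dlsVelocity(new Array2DRowRealMatrix(j), xdot, 0.01);
        System.out.println("qdot = " + qdot);
    }
}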

The damped least-squares velocity given in Eqn. C.4 was calculated using the Kinematics and Dynamics Library (KDL) (Smits, 2014) with the damping constant set to 0.01, and the desired task-space pose at a given time was calculated by spline interpolation of the given task-space trajectory. The desired joint accelerations and positions are obtained by numerical differentiation and integration of the desired joint velocity q̇. Numerical differentiation was performed using the second-order centred difference approximation, which is accurate to fourth-order terms. A recent survey by Sariyildiz and Temeltas (2011) showed that the choice of numerical integration method can significantly impact the tracking accuracy of the inverse dynamics controller, and that 4th-order Adams-Bashforth provides a good balance between accuracy and computational efficiency. Thus, numerical integration was performed using the 4th-order Adams-Bashforth approximation. The final torque command τ sent to the arm is calculated using the computed torque control method with PD control

τ = M(q_r)q̈_r + C(q_r, q̇_r)q̇_r + K_{q,d}(q̇_r − q̇) + K_{q,p}(q_r − q)    (C.5)

where M and C are the inertia and Coriolis matrices for the robot arm, and K_{q,p} and K_{q,d} are positive-definite P and D gain matrices, respectively. Since

the robotic arm has very light links, the values for M and C are both set to the identity matrix I. This classical velocity-based controller, although mathematically simple, performed as well as or better than more sophisticated methods (in terms of tracking accuracy) in a comparative study by Nakanishi et al. (2008), and was adequate for the purposes of this research. The controller was able to track task-space trajectories to within 0.1 cm at an update rate of 100 Hz.


C.2 Action Models Induced by Explanation-Based Learning (EBL)

Action Preconditions Effects open_gripper(A), facing(A,B), above(A,C), level_with(A,B), near(A,B), near(B,C), on(B,C), has_shape(B,complex), longer(B,D), near(D,B), near(B,D), in_room(room1,B), clear(B), above(D,B), grasp(A,B) has_attached_side(B,none), holding(A,C), has_num_attachments(B,0), facing(A,B), has_shape(D,boxlike), longer(B,D), near(A,B), near(D,B), near(B,D), in_room(room1,D), has_shape above(D,B), has_attached_side(D,none), (B,complex). has_num_attachments(D,0) on(C,D), facing(A,C), holding(A,B), near(A,C), perpendicular(B, D), above(A, C), above(A, D), has_shape(C, boxlike), parallel(E,C), parallel(C,E), parallel(C,E), parallel(E,C), near(E, C), place_on near(C, E), above(E,C), on(B,C), (A,B,C) has_num_attachments(C,0), open_gripper(A), level_with(E,C), level_with(C,E), longer(C,E), level_with(C,E), level_with(E,C), has_shape(C, in_room(room1,C), clear(C), boxlike), has_attached_side(C,none), longer(C,E). above(A,C).

Table C.1: The initial action models induced from the trainer’s demonstration by the version of EBL described in Section 5.5.5. The preconditions involve complete state descriptions of the world and are too long to include completely, so irrelevant literals have been removed. Many of these literals are quickly removed from the action model during the active learning process (Section 5.5.5).

Table C.2: The initial action models induced from the trainer’s demonstration by the version of EBL described in Section 5.5.5, giving the preconditions and effects of the hit_with(A,B,C) and open_door actions. As in Table C.1, the preconditions involve complete state descriptions of the world and are too long to include completely, so irrelevant literals have been removed; many of these literals are quickly removed from the action model during the active learning process (Section 5.5.5).
