
CITS 7210 Algorithms for Artificial Intelligence
Topic 1
Introduction

• What is AI?
• Contributions to AI
• History of AI
• Modern AI

Reading: Russell and Norvig, Chapter 1

1.1 AI in the Media - the glitz and glamour

• sci-fi — Kubrick, Spielberg, . . .
• "science" programs — "Towards 2000"
• news/current affairs — Kasparov
• advertisements — washing machines, TVs, cars, . . .

Don't believe a word you hear! (. . . without proof)


1.2 The AI Literature

"[The automation of] activities that we associate with human thinking, activities such as decision-making, problem solving, learning . . ." (Bellman, 1978)

"The study of mental faculties through the use of computational models" (Charniak + McDermott, 1985)

"The study of how to make computers do things at which, at the moment, people are better" (Rich + Knight, 1991)

"The branch of computer science that is concerned with the automation of intelligent behavior" (Luger + Stubblefield, 1993)

Views of AI fall into four categories:
  Thinking humanly    Thinking rationally
  Acting humanly      Acting rationally

1.3 Thinking humanly: cognitive modelling

• determine how humans think
• develop a theory of the human mind — psychological experiments
• model the theory using computer programs
  e.g. General Problem Solver (GPS) [Newell & Simon, 1961]

Requires scientific theories of the internal activities of the brain
• What level of abstraction? "Knowledge" or "circuits"?
• How to validate?
  1. Predicting and testing behavior of human subjects (top-down) ⇒ Cognitive Science
  2. Direct identification from neurological data (bottom-up) ⇒ Cognitive Neuroscience

1.4 Acting humanly: The Turing test

Alan Turing (1950), "Computing machinery and intelligence":

♦ "Can machines think?" −→ "Can they behave intelligently?"
  intelligence = ability to act indistinguishably from a human in cognitive tasks
♦ Operational test for intelligent behavior ⇒ Turing Test
  – human interrogates the computer via teletype
  – the computer passes the test if the human cannot tell whether there is a human or a computer at the other end

[Figure: the human interrogator communicates by teletype with an unseen human and an unseen AI system.]

What else might today's Turing test include. . . ?

♦ Predicted that by 2000, a machine might have a 30% chance of fooling a lay person for 5 minutes
♦ Anticipated all major arguments against AI in the following 50 years
♦ Suggested major components of AI: knowledge, reasoning, language understanding, learning

Problem: the Turing test is not reproducible, constructive, or amenable to mathematical analysis


1.5 Thinking rationally: Laws of Thought

Began with the Greeks in the 4th century BC
— e.g. Aristotle's logical syllogisms:
  All men are mortal,
  Socrates is a man,
  therefore Socrates is mortal

Several Greek schools developed various forms of logic:
notation and rules of derivation for thoughts;
— may or may not have proceeded to the idea of mechanization

Direct line through mathematics and philosophy to modern AI, e.g.
• Boole (1815–1864, Laws of Thought 1854, Boolean algebra)
• Frege (1848–1925, Begriffsschrift 1879, first-order logic)
• Hilbert (1862–1943) (Hilbert systems)
• Russell & Whitehead (Principia Mathematica 1910–13)
• Tarski (1902–1983, Tarski semantics 1933)

Problems:
1. Normative (or prescriptive) rather than descriptive
2. Not all intelligent behavior is mediated by logical deliberation

1.6 Acting rationally

⇒ act in such a way as to achieve goals, given beliefs
(Doesn't necessarily involve thinking — e.g., blinking reflex. Is a thermostatically controlled heater "intelligent"?)

Define agent — an entity that perceives its surroundings and acts accordingly

AI = the study and construction of rational agents.

AI as intelligent agent design
• incorporates aspects from the other three approaches
• currently the dominant view of AI

1.7 An Engineering Viewpoint

"a bunch of methodologies, inspired by socio-biological analogy, for getting machines to do stuff"


2. Pre-history — Contributions to AI

Philosophy (428 BC → present)
• logic, methods of reasoning
• mind as a physical system
• foundations of learning, language, rationality

Mathematics (c. 800 → present)
• formal representation and proof
• computation, (un)decidability, (in)tractability
• probability

Psychology (1879 → present)
• perception and motor control
• cognitive neuroscience
• learning (reinforcement)

Computing (1940 → present)
• provision of programmable machines
• algorithms
• declarative languages (Prolog and Lisp)
• neural computing

Linguistics (1957 → present)
• language theory: grammar, semantics

3. The Chequered History of AI

In the beginning. . .

1943 McCulloch & Pitts: Boolean circuit model of the brain
  — on/off neurons, corresponding to propositions
  — even suggested networks could learn!
  ⇒ forerunner of the symbolic and connectionist traditions

1950 Turing's "Computing Machinery and Intelligence"
  Turing and Shannon writing chess programs — with no computers!

1951 Minsky & Edmonds, first neural net computer, Snark

1950s Newell & Simon's Logic Theorist (machine coded by hand!)
  "We have created a computer program capable of thinking non-numerically, and thereby solved the venerable mind-body problem." — Simon
  Able to prove theorems in Russell and Whitehead's Principia Mathematica — even came up with a shorter proof!

1956 McCarthy's Dartmouth meeting: "Artificial Intelligence"

"Look, Ma, no hands!" era

1952 Samuel's checker playing programs
  — eventually tournament level
  ⇒ disproved "computers can only do what they are told"

1958 McCarthy
  — defined Lisp ⇒ dominant AI programming language
  — he and others invented time-sharing ⇒ birth of DEC
  — published "Programs with Common Sense" ⇒ defined the hypothetical program Advice Taker, an AI program that includes general knowledge, axioms — forerunner to knowledge representation and reasoning today

1959 McCarthy & Hayes, "Philosophical Investigations from the Standpoint of AI" ⇒ KR, reasoning, planning, . . .

1961 Newell and Simon's General Problem Solver (GPS)
  — imitate human problem solving — "thinking humanly"


3. The Chequered History of AI

1965 Robinson's resolution principle
  — complete theorem-proving algorithm for 1st-order logic
  — made Prolog possible

1971 Strips — practical logic-based planning system
  Shakey — integration of logical reasoning and physical activity

Falling off the bike

Bold predictions prove elusive. . .

♦ Lack of domain knowledge
  e.g. machine translation: US government funding to translate Russian to English after Sputnik in 1957. Initially thought syntactic transformations using a grammar and an electronic dictionary would be enough.
  "the spirit is willing but the flesh is weak" −→ "the vodka is good but the meat is rotten"
  1966 report — no immediate prospect of success. All government funding cancelled.

3. The Chequered History of AI

♦ Intractability — early solutions did not scale up!
  development of computational complexity theory and NP-completeness
  a program that finds a solution in principle does not necessarily have any of the mechanisms needed to find it in practice
  difficulties of combinatorial explosion were one of the main criticisms in the Lighthill Report in 1973
  ⇒ British government decision to end support for AI research in all but 2 universities

♦ limitations of basic AI structures
  1969 Minsky and Papert's book Perceptrons
  — a two-input perceptron could not be trained to recognise when its two inputs were different (the exclusive-or problem)
  ⇒ research funding for neural nets all but disappeared

AI goes specialist

1969 Buchanan et al., "Heuristic DENDRAL: a program for generating explanatory hypotheses in organic chemistry."
  Arguably the first knowledge-intensive system.
  Later incorporated McCarthy's Advice Taker approach — clean separation of knowledge from reasoning.
  ⇒ birth of expert systems.

1976 Mycin — diagnosis of blood infections.
  ∼ 450 rules. Performed as well as some experts, better than junior doctors.
  No theoretical model — acquired rules from interviewing experts.
  Early attempt to deal with uncertainty.


3. The Chequered History of AI

1979 Duda et al., Prospector
  Probabilistic — recommended drilling at a geological site that proved to contain a large molybdenum deposit!!

1970s Recognition that language understanding also required knowledge and a means of using it. (Charniak)
  "There is no such thing as syntax." — Schank

1973 Woods' Lunar system — allowed geologists to ask questions in English about Apollo's rock samples.
  — first NLP program used by others for real work.

Expert systems industry booms. . .

1982 McDermott's R1 began operation at DEC
  — first successful commercial expert system.
  Helped configure orders for new computer systems — saved an estimated $40 million a year.

1988 DEC: 40 deployed expert systems
  Du Pont: 100 in use, 500 in development, est. $100m per year

Increased demand for AI languages — e.g. Prolog

1980s Japan's ambitious "Fifth Generation" project.
  "Prolog machines" — millions of inferences per second.
  US funding increased accordingly. The British Alvey report reinstated funding cut by the Lighthill report (under the new name "Intelligent Knowledge-Based Systems").

Boom in expert system development tools, dedicated Lisp workstations (e.g. Symbolics), etc. . .

A few million in sales in 1980 −→ $2 billion in 1988

3. The Chequered History of AI

. . . and busts?

∼1986 Recognition of limitations and some disillusionment
  — buying an expert system shell and filling it with rules is not enough.

1985–95 Neural networks return to popularity
  — rediscovery of the back-propagation learning algorithm.
  Brooks' insects.

Symbolic vs sub-symbolic argument intensifies! (battles for funding)

⇒ predictions of "AI winter".

4. Modern AI

AI matures (at least a bit!)
• Better understanding of difficulty!
• Increase in technical and theoretical depth.
• Recognition (by most) of the need for both symbolic and sub-symbolic approaches, working together.
• Resurgence and incorporation of probabilistic and decision-theoretic methods.
⇒ emphasis on solid foundations and a more holistic approach

The new environment!
• Fast distributed hardware
• New languages (e.g. Java), distributed programs
• Increased communications (WWW)
• New possibilities, e.g. ALife, GAs, . . .
• Advances in understanding of biological and neural systems (biologically-inspired computing)
• New applications, e.g. Web agents, Mars rovers, . . .
⇒ emphasis on intelligent "agents", incorporating a range of AI technologies


CITS 7210 Algorithms for Artificial Intelligence
Topic 2
Intelligent Agents

• What are agents?
• Rational agents
• Agent functions and programs
• Types of agents

Reading: Russell and Norvig, Chapter 2

1. What are agents?

agent — perceives its environment through sensors
      — acts on the environment through effectors

[Figure: the agent receives percepts from the environment through its sensors and acts on the environment through its effectors.]

Examples. . .

Percepts: light, sound, solidity, . . .
Sensors: human — eyes, ears, skin, . . . ; robot — infra-red detectors, cameras, microphone, accelerometers, . . .
Effectors: human — hands, legs, voice, . . . ; robot — grippers, wheels, speakers, . . .
Actions: pickup, throw, speak, . . .

2. Rational agents

Recall: a rational agent tries to. . .
— "do the right thing"
— act to achieve its goals

The "right thing" can be specified by a performance measure defining a numerical value for any environment history.

Rational action: whichever action maximizes the expected value of the performance measure given the percept sequence to date.

Rational ≠ omniscient
Rational ≠ clairvoyant
Rational ≠ successful

3. Agent functions and programs

An agent can be completely specified by an agent function mapping percept sequences to actions.
(In principle, one can supply each possible sequence to see what it does. Obviously, a lookup table would usually be immense.)

One agent function (or a small equivalence class) is rational.

Aim: find a way to implement the rational agent function concisely and efficiently.

An agent program implements an agent function: it takes a single percept as input, keeps internal state, and returns an action:

function Skeleton-Agent(percept) returns action
  static: memory, the agent's memory of the world
  memory ← Update-Memory(memory, percept)
  action ← Choose-Best-Action(memory)
  memory ← Update-Memory(memory, action)
  return action

In OO-speak · · · ⇒


3.1 Agents in Java

[Class diagram: an Environment provides Percept getPercept() and void Update(Action a); an Agent provides Action getAction(Percept p); a checkers Player implements Agent, with a board position as its Percept and a Move as its Action.]

class Player implements Agent {
  ...
  Action getAction(Percept percept) {
    Move myNextMove;
    // my selection algorithm
    return myNextMove;
  }
  ...
}

(A fuller sketch of these interfaces is given at the end of this section.)

4. Types of Agents

An agent program accepts percepts, combines them with any stored knowledge, and selects actions.

A rational agent will choose actions so as to maximise some performance measure. (In practice, try to achieve "good" performance.)

Four basic types, in order of increasing generality:
– simple reflex agents
– reflex agents with state
– goal-based agents
– utility-based agents
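To make the Agents-in-Java snippet above self-contained, here is one possible minimal set of declarations it could compile against. This is only an illustrative sketch: the slides name Agent, Environment, Percept, Action and Move, but everything else (the method bodies, the PassingPlayer class) is an assumption.

import java.util.Objects;

// Hypothetical minimal versions of the interfaces in the class diagram.
interface Percept { }
interface Action { }

interface Agent {
    Action getAction(Percept percept);
}

interface Environment {
    Percept getPercept();          // what the agent currently senses
    void update(Action action);    // apply the agent's action to the world
}

// A Move is one kind of Action (as in the checkers example).
class Move implements Action {
    final String description;
    Move(String description) { this.description = Objects.requireNonNull(description); }
}

// A trivial agent: maps every percept to a fixed "pass" move.
class PassingPlayer implements Agent {
    public Action getAction(Percept percept) {
        return new Move("pass");   // a real player would run a selection algorithm here
    }
}

A driving loop would then alternate environment.getPercept(), agent.getAction(...), environment.update(...), which is exactly the Skeleton-Agent cycle in pseudocode above.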

4.1 Simple Reflex Agents

⇒ choose responses using condition-action rules (or production rules).

  if you see the car in front's brake lights then apply the brakes

[Figure: a simple reflex agent — sensors report "what the world is like now", condition-action rules select "what action I should do now", and effectors act on the environment.]

Some researchers claim this is how simple life-forms (e.g. insects) behave.

4.2 Agents that keep track of the world

While simply reacting to the current state of the world is adequate in some circumstances, most intelligent action requires more knowledge to work from:
• memory of the past
• knowledge about the effects of actions — how the world evolves
• requires internal state

e.g. You notice someone ahead signal to the bus. You know that this will cause the bus driver to stop (ideally), and conclude that you should change lanes.

[Figure: a reflex agent with state — internal state, knowledge of how the world evolves and of what its actions do feed into "what the world is like now", from which condition-action rules select the action.]
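A rough sketch of the difference between the two designs, using the braking and bus examples above. The percept strings, rules and class names here are invented purely for illustration.

import java.util.ArrayDeque;
import java.util.Deque;

// Simple reflex: the rule looks only at the current percept.
class SimpleReflexDriver {
    static String getAction(String percept) {
        if (percept.contains("brake-lights")) return "apply-brakes";
        return "keep-driving";
    }
}

// Reflex agent with state: remembers past percepts, so it can react to
// something it can no longer see directly (the hailed bus that will stop).
class StatefulDriver {
    private final Deque<String> memory = new ArrayDeque<>();

    String getAction(String percept) {
        memory.push(percept);                          // update internal state
        if (memory.contains("passenger-hails-bus")) {  // world-evolution knowledge
            return "change-lanes";
        }
        return "stay-in-lane";
    }
}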


4.3 Goal-based agents

Reacting to an evolving world may keep you from crashing your car, but it doesn't tell you where to go!

Intelligent beings act in such a way as to try to achieve some goals.

[Figure: a goal-based agent — state, knowledge of how the world evolves and of what its actions do, plus goals, feed into "what it will be like if I do action A" and the choice of action.]

Some goals are pretty simple. e.g. Star Trek — "to go where no-one has gone before"

Some are more complex and require planning to achieve them. e.g. Star Wars — to defeat the Empire — find a potential Jedi knight, ship him off to see Yoda, teach him to use the force, etc.

Planning is a fundamental problem of AI. It usually requires search through the available actions to find an appropriate sequence. Some people even say AI is really search!

4.4 Utility-based agents

Two views:
1. a successful agent is only required to satisfy objective goals (emotions are a hindrance)
2. subjective measures such as happiness, security (safety), etc. are important for success

e.g. Luke could maximise his material benefits by turning to the dark side of the force, but he'd be very unhappy :-(

Researchers call this measure of happiness utility (since it sounds more scientific)

[Figure: a utility-based agent — as for the goal-based agent, but with a utility function estimating "how happy I will be in such a state".]

5. Next. . .

• Conceptualising the environment — states, actions (operators), goals
• The fundamental skills of an intelligent agent — problem solving and search

CITS 7210 Algorithms for Artificial Intelligence
Topic 3
Problem Solving and Search

• Problem-solving and search
• Search algorithms
• Uninformed search algorithms
  – breadth-first search
  – uniform-cost search
  – depth-first search
  – iterative deepening search
  – bidirectional search

Reading: Russell and Norvig, Chapter 3


1. Problem Solving and Search

Seen that an intelligent agent has:
• knowledge of the state of the "world"
• a notion of how actions or operations change the world
• some goals, or states of the world, it would like to bring about

Finding a sequence of operations that changes the state of the world to a desired goal state is a search problem (or basic planning problem).

Search algorithms are the cornerstone of AI.

In this section we see some examples of how the above is encoded, and look at some common search strategies.

1.1 States, Operators, Graphs and Trees

state — description of the world at a particular time
  — impossible to describe the whole world
  — need to abstract those attributes or properties that are important.

Examples

Example 1: Say we wish to drive from Arad to Bucharest. First we "discretise" the problem:
  states — map of cities + our location
  operators — drive from one city to the next
  start state — driver located at Arad
  goal state — driver located at Bucharest

Find solution: sequence of cities, e.g., Arad, Sibiu, Fagaras, Bucharest


The states and operators form a graph — states form nodes, operators form arcs or edges. . .

[Figure: the Romania road map — cities as nodes, roads as edges.]

Example 2: The eight puzzle

[Figure: a start state and the goal state of the eight puzzle.]

states — description of the positions of the numbered squares
operators — some alternatives. . .
  (a) move a numbered square to an adjacent place, or
  (b) move the blank left, right, up or down — far fewer operators



Example 3: Missionaries and Cannibals

start state — 3 missionaries, 3 cannibals, and a boat that holds two people, on one side of the river
goal state — all on the other side
states — description of legal configurations (i.e. where no-one gets eaten) of where the missionaries, cannibals, and boat are
operators — state changes possible using the 2-person boat

[Figure: the state graph of legal configurations, from <{MMMCCCB},{}> at the start down to <{},{MMMCCCB}> at the goal.]

1.2 Operator costs

Graphs may also contain the cost of getting from one node to another (i.e. associated with each operator, or each arc).

[Figure: the Romania road map with step costs in km, and a table of straight-line distances to Bucharest.]

cost of path = sum of the costs of the arcs making up the path

Usually we are concerned not just with finding a path to the goal, but with finding a cheap path.

shortest path problem — find the cheapest path from a start state to a goal state

Where operator costs are not given, all operators are assumed to have unit cost.

1.3 Problem formulation

A problem is defined by four items:

initial state — e.g., "at Arad"
operators (or successor function S(x)) — e.g., Arad → Zerind, Arad → Sibiu, etc.
goal test — can be
  explicit, e.g., x = "at Bucharest"
  implicit, e.g., "world peace"
path cost (additive) — e.g., sum of distances, number of operators executed, etc.

A solution is a sequence of operators leading from the initial state to a goal state.
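The four items above map naturally onto a small interface. A minimal sketch in Java, assuming a generic state type S; the interface and method names are illustrative, not part of the unit's code.

import java.util.List;

// A search problem in the sense of Section 1.3: initial state, operators
// (successor function), goal test and (additive) path cost.
interface SearchProblem<S> {
    S initialState();                 // e.g. "at Arad"
    List<S> successors(S state);      // states reachable by applying one operator
    boolean isGoal(S state);          // explicit or implicit goal test
    double stepCost(S from, S to);    // cost of the operator used (1 if unspecified)
}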


1.4 Example: robotic assembly

[Figure: a robot arm assembling a part.]

states??: real-valued coordinates of robot joint angles, parts of the object to be assembled
operators??: continuous motions of robot joints
goal test??: complete assembly with no robot included!
path cost??: time to execute

2. Search algorithms

Basic idea:
offline, simulated exploration of the state space by generating successors of already-explored states (a.k.a. expanding states)

function General-Search(problem, strategy) returns a solution, or failure
  initialize the search tree using the initial state of problem
  loop do
    if there are no candidates for expansion then return failure
    choose a leaf node for expansion according to strategy
    if the node contains a goal state then return the corresponding solution
    else expand the node and add the resulting nodes to the search tree
  end


General search algorithm Arad

Given a start state s0 ∈ S, a goal function g(s) → {true, false}, Zerind Sibiu Timisoara and an (optional) terminal condition t(s) → {true, false}:

Initialise a set U = {s0} of unvisited nodes containing just Rimnicu Arad Oradea Fagaras Vilcea the start state, and an empty set V = {} of visited nodes. 1. If U is empty halt and report no goal found.

Sibiu Bucharest 2. Select, according to some (as yet undefined) strategy, a node s from U. 3. (Optional) If s ∈ V discard s and repeat from 1. 4. If g(s) = true halt and report goal found. 5. (Optional) If t(s) = true discard s and repeat from 1. 6. Otherwise move s to the set V , and add to U all the nodes reachable from s. Repeat from 1.

Step 3 is an occurs check for cycles.

• Some search strategies will still work without this ⇒ trade-off — work to check if visited vs work re-searching same nodes again. • Others may cycle forever.

With these cycles removed, the graph becomes a search tree.
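A minimal sketch of the algorithm above, with the occurs check of Step 3 included and the strategy reduced to "where new nodes are inserted" (cf. Section 2.4). It assumes the hypothetical SearchProblem interface sketched in Section 1.3; everything else is illustrative.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class GeneralSearch {
    // insertAtEnd = true  -> successors go to the back of the queue (breadth-first)
    // insertAtEnd = false -> successors go to the front of the queue (depth-first)
    static <S> S search(SearchProblem<S> problem, boolean insertAtEnd) {
        Deque<S> unvisited = new ArrayDeque<>();   // U in Section 2.2
        Set<S> visited = new HashSet<>();          // V: the occurs check (Step 3)
        unvisited.add(problem.initialState());
        while (!unvisited.isEmpty()) {             // Step 1
            S s = unvisited.poll();                // Step 2: always take from the front
            if (visited.contains(s)) continue;     // Step 3
            if (problem.isGoal(s)) return s;       // Step 4
            visited.add(s);                        // Step 6
            for (S next : problem.successors(s)) {
                if (insertAtEnd) unvisited.addLast(next);
                else unvisited.addFirst(next);
            }
        }
        return null;                               // no goal found
    }
}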


2.3 Comparing search strategies

The search strategy is crucial — it determines the order in which the nodes are expanded. Concerned with:

completeness — does the strategy guarantee finding a solution if there is one
time complexity — how long does it take
space complexity — how much memory is needed to store states
optimality — does it guarantee finding the best solution

Time and space complexity are often measured in terms of
  b — maximum branching factor of the search tree
  d — depth of the least-cost solution
  m — maximum depth of the state space (may be ∞)

2.4 Implementation of search algorithms

Many strategies can be implemented by placing unvisited nodes in a queue (Step 6) and always selecting the next node to expand from the front of the queue (Step 2)

⇒ the way the children of expanded nodes are placed in the queue determines the search strategy.

Many different strategies have been proposed. We'll look at some of the most common ones. . .

3. Uninformed search strategies

Uninformed strategies use only the information available in the problem definition
• Breadth-first search
• Uniform-cost search
• Depth-first search
• Depth-limited search
• Iterative deepening search
• Bidirectional search

Later we will look at informed strategies that use additional information.

3.1 Breadth-first search

Expand the shallowest unexpanded node

Implementation:
  QueueingFn = put successors at the end of the queue

[Figure: breadth-first expansion of the Romania tree — Arad first, then Zerind, Sibiu and Timisoara, then their successors.]

⇒ expand all nodes at one level before moving on to the next


3.1 Breadth-first search

Complete? Yes (if b is finite)
Time? O(1 + b + b^2 + b^3 + . . . + b^d) = O(b^d), i.e., exponential in d
Space? O(b^d) (all leaves in memory)
Optimal? Yes (if cost = 1 per step); not optimal in general

Space is the big problem; can easily generate nodes at 1MB/sec, so 24hrs = 86GB.

Good example of computational explosion. . .
Assume a branching factor of 10, 1000 nodes/sec, 100 bytes/node.

Depth   Nodes     Time            Memory
0       1         1 millisecond   100 bytes
2       111       .1 seconds      11 kilobytes
4       11,111    11 seconds      1 megabyte
6       10^6      18 minutes      111 megabytes
8       10^8      31 hours        11 gigabytes
10      10^10     128 days        1 terabyte
12      10^12     35 years        111 terabytes
14      10^14     3500 years      11,111 terabytes

Welcome to AI!

3.2 Uniform-cost search

Problem: varying cost operations
e.g. Romania with step costs in km. . .

[Figure: the Romania road map with step costs and straight-line distances to Bucharest.]


Expand the least total cost unexpanded node

Implementation:
  QueueingFn = insert in order of increasing path cost

[Figure: uniform-cost expansion from Arad — Zerind (75), Timisoara (118) and Sibiu (140), then their successors with accumulated path costs.]

Complete? Yes (if step cost ≥ ε > 0)
Time? # of nodes with path cost g ≤ cost of optimal solution
Space? # of nodes with g ≤ cost of optimal solution
Optimal? Yes (if step cost ≥ 0)
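A sketch of uniform-cost search using a priority queue ordered by path cost g, again assuming the hypothetical SearchProblem interface from Section 1.3.

import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

class UniformCostSearch {
    private static final class Node<S> {
        final S state; final double g;          // g = path cost so far
        Node(S state, double g) { this.state = state; this.g = g; }
    }

    // Returns the cost of the cheapest path to a goal, or -1 if no goal is reachable.
    static <S> double search(SearchProblem<S> problem) {
        Comparator<Node<S>> byPathCost = Comparator.comparingDouble(n -> n.g);
        PriorityQueue<Node<S>> frontier = new PriorityQueue<>(byPathCost);
        Set<S> expanded = new HashSet<>();
        frontier.add(new Node<>(problem.initialState(), 0.0));
        while (!frontier.isEmpty()) {
            Node<S> n = frontier.poll();               // least total cost unexpanded node
            if (expanded.contains(n.state)) continue;  // already expanded more cheaply
            if (problem.isGoal(n.state)) return n.g;
            expanded.add(n.state);
            for (S next : problem.successors(n.state)) {
                frontier.add(new Node<>(next, n.g + problem.stepCost(n.state, next)));
            }
        }
        return -1;
    }
}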


3.3 Depth-first search

Follow one path until you can go no further, then backtrack to the last choice point and try another alternative.

Implementation:
  QueueingFn = insert successors at the front of the queue
  (or use recursion — "queueing" performed automatically by the internal stack)

Finite tree example. . .

[Figure: depth-first expansion of the Romania tree, following the leftmost branch (Arad, Zerind, Arad, . . .) ever deeper.]

Occurs check or terminal condition needed to prevent infinite cycling.


Complete? No: fails in infinite-depth spaces.
  — complete for a finite tree (in particular, requires a cycle check).
Time? O(b^m)
  — may do very badly if m is much larger than d
  — may be much faster than breadth-first if solutions are dense
Space? O(bm), i.e., linear space!
Optimal? No

Space performance is a big advantage. Time, completeness and optimality can be big disadvantages.


3.4 Depth-limited search

= depth-first search with depth limit l

Implementation:
  Nodes at depth l have no successors. Can be implemented by our terminal condition.

Sometimes used to apply depth-first search to infinite (or effectively infinite) search spaces. Take the "best" solution found with limited resources. (See Game Playing. . . )

Also in. . .

3.5 Iterative deepening search

"Probe" deeper and deeper (often bounded by available resources).

[Figure: iterative deepening on the Romania tree — the tree is re-grown from Arad with successively larger depth limits.]


Summary view. . .

[Figure: iterative deepening shown as a sequence of depth-limited searches with Limit = 0, 1, 2, 3, . . .]

Complete? Yes
Time? (d + 1)b^0 + d·b^1 + (d − 1)b^2 + . . . + b^d → O(b^d)
Space? O(bd)
Optimal? Yes, if step cost = 1

Can also be modified to explore a uniform-cost tree

How do the above compare with
• breadth-first?
• depth-first?
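A sketch of the iterative deepening loop just summarised — repeated depth-limited depth-first searches with an increasing limit — assuming the same hypothetical SearchProblem interface as before.

class IterativeDeepening {
    static <S> S search(SearchProblem<S> problem, int maxLimit) {
        for (int limit = 0; limit <= maxLimit; limit++) {      // Limit = 0, 1, 2, ...
            S result = depthLimited(problem, problem.initialState(), limit);
            if (result != null) return result;
        }
        return null;                                           // no goal within maxLimit
    }

    // Depth-first search that treats nodes at depth 'limit' as having no successors.
    private static <S> S depthLimited(SearchProblem<S> problem, S state, int limit) {
        if (problem.isGoal(state)) return state;
        if (limit == 0) return null;                           // terminal condition
        for (S next : problem.successors(state)) {
            S result = depthLimited(problem, next, limit - 1);
            if (result != null) return result;
        }
        return null;
    }
}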


3.6 Bidirectional search

[Figure: two search frontiers, one growing forward from the Start state and one backward from the Goal state, meeting in the middle.]

Tends to expand fewer nodes than unidirectional search, but raises other difficulties — e.g. how long does it take to check if a node has been visited by the other half of the search?

4. Summary

Problem formulation usually requires abstracting away real-world details to define a state space that can feasibly be explored.
• Abstraction reveals states and operators.
• Evaluation by goal or utility function.
• Strategy implemented by queuing function (or similar).

Variety of uninformed search strategies. . .

Criterion    Breadth-First  Uniform-Cost  Depth-First  Depth-Limited   Iterative Deepening  Bidirectional (if applicable)
Time         b^d            b^d           b^m          b^l             b^d                  b^(d/2)
Space        b^d            b^d           bm           bl              bd                   b^(d/2)
Optimal?     Yes            Yes           No           No              Yes                  Yes
Complete?    Yes            Yes           No           Yes, if l ≥ d   Yes                  Yes

Iterative deepening search uses only linear space and not much more time than other uninformed algorithms.

CITS 7210 Algorithms for Artificial Intelligence
Topic 4
Informed search algorithms

• Best-first search
• Greedy search
• A∗ search
• Admissible heuristics
• Memory-bounded search
• IDA∗
• SMA∗

Reading: Russell and Norvig, Chapter 4, Sections 1–3

1. Informed (or best-first) search

Recall uninformed search:
• select nodes for expansion on the basis of distance from the start
• uses only information contained in the graph
• no indication of distance to go!

Informed search:
• select nodes on the basis of some estimate of the distance to the goal!
• requires additional information — an evaluation function, or heuristic rules
• choose the "best" (most promising) alternative ⇒ best-first search.

Implementation:
  QueueingFn = insert successors in decreasing order of desirability

Examples:
• greedy search
• A∗ search


2. Greedy search

Assume we have an estimate of the distance to the goal.

For example, in our travelling-to-Bucharest problem, we may know straight-line distances to Bucharest. . .

[Figure: the Romania map with a table of straight-line distances to Bucharest, and the greedy expansion from Arad (366) via Sibiu (253) and Fagaras (178) to Bucharest (0).]

Greedy search always chooses to visit the candidate node with the smallest estimate ⇒ that which appears to be closest to goal

Evaluation function h(n) (heuristic) = estimate of cost from n to goal

E.g., hSLD(n) = straight-line distance from n to Bucharest

2. Greedy search

Complete? No, in general — e.g. can get stuck in loops: Iasi → Neamt → Iasi → Neamt → . . .
  Complete in a finite space with repeated-state checking
Time? O(b^m), but a good heuristic can give dramatic improvement
Space? O(b^m) — keeps all nodes in memory
Optimal? No

2.1 Means-ends Analysis

Example of greedy search. (Used in the SOAR problem solver.)

Heuristic: Pick operations that reduce as much as possible the "difference" between the intermediate state and the goal state.

e.g. Missionaries and cannibals

[Figure: the missionaries-and-cannibals state graph again, from <{MMMCCCB},{}> to <{},{MMMCCCB}>.]

The heuristic indicates the best choice in all states except <{MC},{MMCCB}> and <{MMCCB},{MC}> (though there are no other choices in those states).


3. A∗ search

Greedy search minimises the estimated cost to the goal, and thereby (hopefully) reduces search cost, but is neither optimal nor complete.

Uniform-cost search minimises path cost so far and is optimal and complete, but is costly.

Can we get the best of both worlds. . . ?

Yes! Just add the two together to get an estimate of the total path length of the solution as our evaluation function. . .

Evaluation function
  f(n) = g(n) + h(n)
  g(n) = cost so far to reach n
  h(n) = estimated cost to goal from n
  f(n) = estimated total cost of path through n to goal

[Figure: A∗ expansion on the Romania problem, annotating each node with f = g + h — e.g. Arad 366, Sibiu 393, Rimnicu Vilcea 413, Pitesti 415, Bucharest 418.]
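A sketch of A∗ as best-first search on f(n) = g(n) + h(n), with the heuristic passed in as a function. It assumes the hypothetical SearchProblem interface from Topic 3; this is one possible formulation, not the only one.

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.ToDoubleFunction;

class AStarSearch {
    private static final class Node<S> {
        final S state; final double g, f;
        Node(S state, double g, double f) { this.state = state; this.g = g; this.f = f; }
    }

    // Returns the cost of the cheapest path found, or -1 if there is none.
    static <S> double search(SearchProblem<S> problem, ToDoubleFunction<S> h) {
        Comparator<Node<S>> byF = Comparator.comparingDouble(n -> n.f);
        PriorityQueue<Node<S>> frontier = new PriorityQueue<>(byF);
        Map<S, Double> bestG = new HashMap<>();      // cheapest g found so far per state
        S start = problem.initialState();
        frontier.add(new Node<>(start, 0.0, h.applyAsDouble(start)));
        bestG.put(start, 0.0);
        while (!frontier.isEmpty()) {
            Node<S> n = frontier.poll();             // node with smallest f = g + h
            if (problem.isGoal(n.state)) return n.g; // with admissible h this is optimal
            if (n.g > bestG.getOrDefault(n.state, Double.MAX_VALUE)) continue; // stale entry
            for (S next : problem.successors(n.state)) {
                double g = n.g + problem.stepCost(n.state, next);
                if (g < bestG.getOrDefault(next, Double.MAX_VALUE)) {
                    bestG.put(next, g);
                    frontier.add(new Node<>(next, g, g + h.applyAsDouble(next)));
                }
            }
        }
        return -1;
    }
}

With h(n) = 0 for all n this reduces to uniform-cost search; ordering by h alone would give greedy search.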

3. A∗ search

A heuristic h is admissible iff
  h(n) ≤ h∗(n) for all n
where h∗(n) is the true cost from n.

i.e. h(n) never overestimates
e.g., h_SLD(n) never overestimates the actual road distance

Can prove: if h(n) is admissible, f(n) provides a complete and optimal search! ⇒ called A∗ search.

3.1 Optimality of A∗

Theorem: A∗ search is optimal

Proof
Suppose some suboptimal goal G2 has been generated and is in the queue. Let n be an unexpanded node on a shortest path to an optimal goal G1.

[Figure: from the Start node, n lies on the optimal path to G1, while the suboptimal goal G2 lies on another branch.]

  f(G2) = g(G2)          since h(G2) = 0
        > g(G1)          since G2 is suboptimal
        ≥ g(n) + h(n)    since h is admissible
        = f(n)

Since f(G2) > f(n), A∗ will not select G2 for expansion.


3.2 Monotonicity and the pathmax equation

For most admissible heuristics, f-values increase monotonically along any path (see the Romania problem).

For some admissible heuristics, f may be nonmonotonic — i.e. it may decrease at some points.

e.g., suppose n′ is a successor of n:
  n:  g = 5, h = 4, f = 9
  n′: g′ = 6, h′ = 2, f′ = 8

But f′ = 8 is redundant!
  f(n) = 9 ⇒ the true cost of a path through n is ≥ 9
           ⇒ the true cost of a path through n′ is ≥ 9

Pathmax modification to A∗:
  f(n′) = max(g(n′) + h(n′), f(n))

With pathmax, f always increases monotonically . . .

3.3 Contours

To get a more intuitive view, we consider the f-values along any path.

Lemma: A∗ (with pathmax) expands nodes in order of increasing f value

Gradually adds "f-contours" of nodes (cf. breadth-first/uniform-cost adds layers or "circles" — A∗ "stretches" towards the goal)

[Figure: the Romania map with f-contours at 380, 400 and 420 spreading from Arad towards Bucharest.]

If f∗ is the cost of the optimal solution path:
• A∗ expands all nodes with f(n) < f∗
• A∗ expands some nodes with f(n) = f∗

Can see intuitively that A∗ is complete and optimal.

3.4 Properties of A∗

Complete Yes, unless there are infinitely many nodes with f ≤ f(G)
Time Exponential in [relative error in h × length of soln.]
Space Keeps all nodes in memory (see below)
Optimal Yes — cannot expand f_{i+1} until f_i is finished

Among optimal algorithms of this type, A∗ is optimally efficient!
i.e. no other algorithm is guaranteed to expand fewer nodes.
"Proof"
Any algorithm that does not expand all nodes in each contour may miss an optimal solution.

4. Admissible heuristics

Straight-line distance is an obvious heuristic for distance planning. What about other problems?

This section ⇒ examine heuristics in more detail.

E.g., two heuristics for the 8-puzzle:

[Figure: a scrambled start state and the goal state of the 8-puzzle.]

h1(n) = number of misplaced tiles h2(n) = total Manhattan distance (i.e., no. of squares from desired location of each tile)

h1(S) =??

h2(S) =??

Are both admissible?
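A sketch of both heuristics in Java. The board representation (a length-9 array read row by row, with 0 for the blank) and the goal layout (tiles 1–8 followed by the blank) are assumptions for illustration; the slides' goal layout may differ.

class EightPuzzleHeuristics {
    // Index i holds the tile at row i/3, column i%3; 0 is the blank.
    static final int[] GOAL = {1, 2, 3, 4, 5, 6, 7, 8, 0};

    // h1: number of tiles not in their goal position (the blank is not counted).
    static int misplacedTiles(int[] board) {
        int count = 0;
        for (int i = 0; i < 9; i++) {
            if (board[i] != 0 && board[i] != GOAL[i]) count++;
        }
        return count;
    }

    // h2: sum of Manhattan distances of each tile from its goal position.
    static int manhattan(int[] board) {
        int total = 0;
        for (int i = 0; i < 9; i++) {
            int tile = board[i];
            if (tile == 0) continue;
            int goalIndex = tile - 1;   // under GOAL above, tile t belongs at index t-1
            total += Math.abs(i / 3 - goalIndex / 3) + Math.abs(i % 3 - goalIndex % 3);
        }
        return total;
    }
}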


4.1 Measuring performance

Quality of a heuristic can be characterised by its effective branching factor b∗.

Assume:
• A∗ expands N nodes
• solution depth d

b∗ is the branching factor of a uniform tree of depth d with N nodes:
  N = 1 + b∗ + (b∗)^2 + · · · + (b∗)^d

• tends to remain fairly constant over problem instances
• can be determined empirically
⇒ fairly good guide to heuristic performance

A good heuristic would have b∗ close to 1.

Example

Effective branching factors for iterative deepening search and A∗ with h1 and h2 (averaged over 100 randomly generated instances of the 8-puzzle for each solution length):

        Search Cost                 Effective Branching Factor
d       IDS       A*(h1)   A*(h2)   IDS     A*(h1)   A*(h2)
2       10        6        6        2.45    1.79     1.79
4       112       13       12       2.87    1.48     1.45
6       680       20       18       2.73    1.34     1.30
8       6384      39       25       2.80    1.33     1.24
10      47127     93       39       2.79    1.38     1.22
12      364404    227      73       2.78    1.42     1.24
14      3473941   539      113      2.83    1.44     1.23
16      –         1301     211      –       1.45     1.25
18      –         3056     363      –       1.46     1.26
20      –         7276     676      –       1.47     1.27
22      –         18094    1219     –       1.48     1.28
24      –         39135    1641     –       1.48     1.26

• informed better than uninformed
• h2 better than h1

Is h2 always better than h1?

4.2 Dominance

Yes!

We say h2 dominates h1 if h2(n) ≥ h1(n) for all n (both admissible).

dominance ⇒ better efficiency
— h2 will expand fewer nodes on average than h1

"Proof"
A∗ will expand all nodes n with f(n) < f∗.
⇒ A∗ will expand all nodes with h(n) < f∗ − g(n)
But h2(n) ≥ h1(n), so all nodes expanded with h2 will also be expanded with h1 (h1 may expand others as well).

It is always better to use an (admissible) heuristic function with higher values.

4.3 Inventing heuristics — relaxed problems

• How can we come up with a heuristic?
• Can the computer do it automatically?

A problem is relaxed by reducing restrictions on operators.

The cost of an exact solution of a relaxed problem is often a good heuristic for the original problem.

Example
• if the rules of the 8-puzzle are relaxed so that a tile can move anywhere, then h1(n) gives the shortest solution
• if the rules are relaxed so that a tile can move to any adjacent square, then h2(n) gives the shortest solution

Note: Must also ensure the heuristic itself is not too expensive to calculate.
Extreme case: a perfect heuristic can be found by carrying out search on the original problem.


4.4 Automatic generation of heuristics

If a problem is defined in a suitable formal language ⇒ it may be possible to construct relaxed problems automatically.

e.g. 8-puzzle operator description:
  A is adjacent to B & B is blank → can move from A to B

Relaxed rules:
  A is adjacent to B → can move from A to B
  B is blank → can move from A to B
  (can move from A to B)

Absolver (Prieditis 1993)
• new heuristic for the 8-puzzle better than any existing one
• first useful heuristic for Rubik's cube!

5. Memory-bounded Search

Good heuristics improve search, but many problems are still too hard.

Usually there are memory restrictions that impose a hard limit.
(e.g. recall the estimates for breadth-first search above — depth 12 already required 111 terabytes)

This section — algorithms designed to save memory:
• IDA∗
• SMA∗

5.1 Iterative Deepening A∗ (IDA∗)

Recall uninformed search:
• uniform-cost/breadth-first search
  + completeness, optimality
  − exponential space usage
• depth-first
  + linear space usage
  − incomplete, suboptimal

Solution: iterative deepening ⇒ explores "uniform-cost trees", or "contours", using linear space.

[Figure: uniform-cost contours on the Romania map at 380, 400 and 420.]

Can we do the same with A∗?

Contours are more directed, but the same technique applies!

Modify depth-limited search to use an f-cost limit, rather than a depth limit ⇒ IDA∗

[Figure: A∗ f-contours stretching from Arad towards Bucharest.]

Complete? Yes (with an admissible heuristic)
Optimal? Yes (with an admissible heuristic)
Space? Linear in path length
Time? ?


5.1 Iterative Deepening A∗ (IDA∗)

Time complexity of IDA∗

Depends on the number of different values f can take on.
• small number of values, few iterations — e.g. the 8-puzzle
• many values, many iterations — e.g. the Romania example, where each state has a different heuristic value ⇒ only one extra town in each contour

Worst case: A∗ expands N nodes, IDA∗ goes through N iterations
  1 + 2 + · · · + N ⇒ O(N^2)

A solution: increase the f-cost limit by a fixed amount ε in each iteration ⇒ returns solutions at worst ε worse than optimal. Called ε-admissible.

IDA∗ was the first memory-bounded optimal heuristic algorithm and solved many practical problems.

5.2 Simplified memory-bounded A∗ (SMA∗)

• Uses all available memory.
• Complete if available memory is sufficient to store the shallowest solution path.
• Optimal if available memory is sufficient to store the shallowest solution path. Otherwise, returns the best solution given the available memory.

How does it work?
• Needs to generate successor nodes but no memory left ⇒ drop, or "forget", the least promising nodes.
• Keep a record of the best f-cost of forgotten nodes in their ancestor.
• Only regenerate nodes if all more promising options are exhausted.

Example (values of forgotten nodes in parentheses) · · · ⇒

5.2 Simplified memory-bounded A∗ (SMA∗)

[Figure: SMA∗ worked example — a small tree rooted at A with edge costs and f = g + h values at each node (e.g. A 0+12=12, B 10+5=15, G 8+5=13, . . .), followed by eight snapshots of the memory-limited search, with the best f-values of forgotten nodes shown in parentheses.]

• Solves significantly more difficult problems than A∗.
• Performs well on highly-connected state spaces and real-valued heuristics, on which A∗ has difficulty.
• Susceptible to continual "switching" between candidate solution paths.
  i.e. the limit on memory can lead to intractable computation time.


CITS 7210 Algorithms for Artificial Intelligence
Topic 5
Game playing

• broadening our world view — dealing with incompleteness
• why play games?
• perfect decisions — the Minimax algorithm
• dealing with resource limits — evaluation functions — cutting off search
• alpha-beta pruning
• game-playing agents in action

Reading: Russell and Norvig, Chapter 5

1. Broadening our world view

We have assumed we are dealing with world descriptions that are:

complete — all necessary information about the problem is available to the search algorithm
deterministic — effects of actions are uniquely determined

Real-world problems are rarely complete and deterministic. . .

Sources of Incompleteness
  sensor limitations — not possible to gather enough information about the world to completely know its state — includes the future!
  intractability — the full state description is too large to store, or the search tree too large to compute

Sources of (Effective) Nondeterminism
  • humans, the weather, stress fractures, dice, . . .

Aside. . . Debate: incompleteness ↔ nondeterminism

1.1 Approaches for Dealing with Incompleteness

contingency planning
• build all possibilities into the plan
• may make the tree very large
• can only guarantee a solution if the number of contingencies is finite and tractable

interleaving or adaptive planning
• alternate between planning, acting, and sensing
• requires extra work during execution — planning cannot be done in advance (or "off-line")

strategy learning
• learn, from looking at examples, strategies that can be applied in any situation
• must decide on parameterisation, how to evaluate states, how many examples to use, . . . black art??

2. Why Play Games?

• abstraction of the real world
• well-defined, clear state descriptions
• limited operations, clearly defined consequences

but!
• games provide a mechanism for investigating many of the real-world issues outlined above
⇒ more like the real world than the examples so far

Added twist — the domain contains hostile agents (also making it like the real world. . . ?)


2.1 Examples

Tractable Problem with Complete Information

Noughts and crosses (tic-tac-toe) for control freaks — you get to choose moves for both players!
How many states?

[Figure: a sequence of noughts-and-crosses boards leading to a goal state.]

Stop when you get to a goal state.
• What uninformed search would you select? How many states visited?
• What would be an appropriate heuristic for an informed search? How many states visited?

Tractable Contingency Problem

Noughts and crosses — allow for all the opponent's moves. (Opponent is non-deterministic.)

Intractable Contingency Problem

Chess
• average branching factor 35, approx 50 operations
  ⇒ search tree has about 35^100 nodes (although only about 10^40 different legal positions)!
• cannot solve by brute force, must use other approaches, e.g.
  – interleave time- (or space-) limited search with moves ⇒ this section
    ∗ algorithm for perfect play (Von Neumann, 1944)
    ∗ finite horizon, approximate evaluation (Zuse, 1945; Shannon, 1950; Samuel, 1952–57)
    ∗ pruning to reduce costs (McCarthy, 1956)
  – learn strategies that determine what to do based on some aspects of the current position ⇒ later in the course

3. Perfect Decisions — Minimax Algorithm

Perfect play for deterministic, perfect-information games
• two players, Max and Min, both try to win
• Max moves first
⇒ can Max find a strategy that always wins?

Define a game as a kind of search problem with:
• initial state
• set of legal moves (operators)
• terminal test — is the game over?
• utility function — how good is the outcome for each player?

e.g. Tic-tac-toe — can Max choose a move that always results in a terminal state with a utility of +1?

[Figure: the partial game tree for tic-tac-toe, alternating MAX (X) and MIN (O) levels down to terminal states with utilities −1, 0 and +1.]

Even for this simple game the search tree is large. Try an even simpler game. . .

3. Perfect Decisions — Minimax Algorithm 3. Perfect Decisions — Minimax Algorithm eg. Two-ply (made-up game) function Minimax-Decision(game) returns an operator MAX for each op in Operators[game] do A A A 1 2 3 Value[op] ← Minimax-Value(Apply(op, game), game) end MIN return the op with the highest Value[op]

A A A A A A A A 11 A 12 13 21 22 23 31 32 33 function Minimax-Value(state, game) returns a utility value 3 12 8 2 4 6 14 5 2 if Terminal-Test[game](state) then (one move deep, two ply) return Utility[game](state) else if max is to move in state then • Max’s aim — maximise utility of terminal state return the highest Minimax-Value of Successors(state) • Min’s aim — minimise it else return the lowest Minimax-Value of Successors(state) • what is Max’s optimal strategy, assuming Min makes the best possible moves?

MAX 3

A 1 A 2 A 3

MIN 3 2 2

A A A A A A A A 11 A 12 13 21 22 23 31 32 33

3 12 8 2 4 6 14 5 2
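A direct Java rendering of the pseudocode above. The Game interface here is a hypothetical stand-in for Operators, Apply, Terminal-Test and Utility; any real implementation would supply these for a specific game.

import java.util.List;

interface Game<S> {
    List<S> successors(S state);      // apply every legal operator to state
    boolean isTerminal(S state);      // Terminal-Test
    double utility(S state);          // Utility, from Max's point of view
}

class Minimax {
    static <S> double minimaxValue(Game<S> game, S state, boolean maxToMove) {
        if (game.isTerminal(state)) return game.utility(state);
        double best = maxToMove ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
        for (S next : game.successors(state)) {
            double v = minimaxValue(game, next, !maxToMove);
            best = maxToMove ? Math.max(best, v) : Math.min(best, v);
        }
        return best;
    }

    // Minimax-Decision: pick the successor of the root with the highest minimax value.
    static <S> S decision(Game<S> game, S state) {
        S bestMove = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (S next : game.successors(state)) {
            double v = minimaxValue(game, next, false);   // Min moves next
            if (v > bestValue) { bestValue = v; bestMove = next; }
        }
        return bestMove;
    }
}

On the two-ply example above, decision() would return the state reached by A1, whose minimax value is 3.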

Properties of minimax:
Complete — Yes, if tree is finite (chess has specific rules for this)
Optimal — Yes, against an optimal opponent. Otherwise??
Time complexity — O(b^m)
Space complexity — O(bm) (depth-first exploration)
For chess, b ≈ 35, m ≈ 100 for "reasonable" games
⇒ exact solution completely infeasible

Resource limits
Usually time: suppose we have 100 seconds, explore 10^4 nodes/second ⇒ 10^6 nodes per move
Standard approach:
• cutoff test
  e.g., depth limit (perhaps add quiescence search)
• evaluation function
  = estimated desirability of position

4. Evaluation functions

Instead of stopping at terminal states and using the utility function, cut off search and use a heuristic evaluation function.
Chess players have been doing this for years. . .
simple — 1 for pawn, 3 for knight/bishop, 5 for rook, etc
more involved — centre pawns, rooks on open files, etc
[Figure: two chess positions with identical material values — in one White is slightly better, in the other Black is winning]

Can be expressed as a linear weighted sum of features

Eval(s) = w1 f1(s) + w2 f2(s) + . . . + wn fn(s)

e.g., w1 = 9 with f1(s) = (number of white queens) − (number of black queens)
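As a sketch of such a weighted material-only evaluation (the weights follow the simple piece values above; the piece-count representation is an illustrative assumption):

# Material-only evaluation: pawn 1, knight/bishop 3, rook 5, queen 9.
# Positive values favour White.
WEIGHTS = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def material_eval(counts):
    """counts maps piece letters to (white_count, black_count)."""
    return sum(w * (counts.get(p, (0, 0))[0] - counts.get(p, (0, 0))[1])
               for p, w in WEIGHTS.items())

# e.g. White has an extra queen: f1(s) = 1 - 0, weighted by w1 = 9
print(material_eval({"Q": (1, 0), "P": (8, 8)}))   # -> 9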


4.1 Quality of evaluation functions

Success of a program depends critically on the quality of the evaluation function:
• agree with utility function on terminal states
• time efficient
• reflect chances of winning

Note: Exact values don't matter
[Figure: two minimax trees with leaf values 1 2 / 2 4 and 1 20 / 20 400 — the backed-up MIN values are 1, 2 and 1, 20, so the MAX choice is the same in both]
Behaviour is preserved under any monotonic transformation of Eval.
Only the order matters: payoff acts as an ordinal utility function.

5. Cutting off search

Options. . .
• fixed depth limit
• iterative deepening (fixed time limit) — more robust
Problem — inaccuracies of the evaluation function can have disastrous consequences.

5.1 Non-quiescence problem

Consider a chess evaluation function based on material advantage.
White's depth-limited search stops here. . .
[Figure: chess position]
Looks like a win to White — actually a win to Black.
Want to stop search and apply the evaluation function only in positions that are quiescent. May perform quiescence search in some situations — eg. after a capture.

5.2 Horizon problem

[Figure: chess position, Black to move]
Win for White, but Black may be able to chase the king for the extent of its depth-limited search, so does not see this. The queening move is "pushed over the horizon".
No general solution.


6. Alpha-beta pruning

Consider Minimax with a reasonable evaluation function and quiescent cut-off. Will it work in practice?
Assume we can search approx 5000 positions per second and are allowed approx 150 seconds per move ⇒ order of 10^6 positions per move.
b^m = 10^6, b = 35 ⇒ m = 4
4-ply lookahead is a hopeless chess player!
4-ply ≈ human novice
8-ply ≈ typical PC, human master
12-ply ≈ Deep Blue, Kasparov

But do we need to search all those positions? Can we eliminate some before we get there — prune the search tree? One method is alpha-beta pruning. . .

6.1 α–β pruning example

[Figure: the two-ply example again — the first MIN node evaluates to 3; once a 2 is seen under the second MIN node its remaining leaves are pruned (marked X); the third MIN node evaluates 14, 5, 2; the MAX root value is still 3]

6.2 Why is it called α–β?

[Figure: a deep path in the game tree alternating MAX and MIN nodes, with a node of value V deep below the current MIN node]
α is the best value (to Max) found so far off the current path.
If V is worse than α, Max will avoid it ⇒ prune that branch.
Define β similarly for Min.

6.3 The α–β algorithm

Basically Minimax + keep track of α, β + prune

function Max-Value(state, game, α, β) returns the minimax value of state
  inputs: state, current state in game
          game, game description
          α, the best score for Max along the path to state
          β, the best score for Min along the path to state
  if Cutoff-Test(state) then return Eval(state)
  for each s in Successors(state) do
    α ← Max(α, Min-Value(s, game, α, β))
    if α ≥ β then return β
  end
  return α

function Min-Value(state, game, α, β) returns the minimax value of state
  if Cutoff-Test(state) then return Eval(state)
  for each s in Successors(state) do
    β ← Min(β, Max-Value(s, game, α, β))
    if β ≤ α then return α
  end
  return β
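Below is a minimal runnable Python sketch of the same pruning scheme over an explicit tree. The nested-list encoding of the tree is an assumption made for illustration; leaves stand for evaluation values.

import math

def alpha_beta(node, maximizing, alpha=-math.inf, beta=math.inf):
    # A node is either a leaf value or a list of child nodes.
    if not isinstance(node, list):
        return node                      # Cutoff-Test / Eval
    if maximizing:
        for child in node:
            alpha = max(alpha, alpha_beta(child, False, alpha, beta))
            if alpha >= beta:
                return beta              # prune remaining children
        return alpha
    else:
        for child in node:
            beta = min(beta, alpha_beta(child, True, alpha, beta))
            if beta <= alpha:
                return alpha             # prune remaining children
        return beta

# The two-ply example: value 3, with the last two leaves of the middle
# branch never examined.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alpha_beta(tree, True))            # -> 3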


6.4 Properties of α–β

Pruning does not affect the final result
Good move ordering improves effectiveness of pruning
With "perfect ordering," time complexity = O(b^(m/2))
⇒ doubles depth of search
⇒ can easily reach depth 8 and play good chess
Perfect ordering is unknown, but a simple ordering (captures first, then threats, then forward moves, then backward moves) gets fairly close.
Can we learn appropriate orderings? ⇒ speedup learning
(Note: complexity results assume an idealized tree model:
• nodes have the same branching factor b
• all paths reach the depth limit d
• leaf evaluations randomly distributed
Ultimately resort to empirical tests.)

7. Game-playing agents in practice

Games that don't include chance
Checkers: Chinook became world champion in 1994 after the 40-year reign of human world champion Marion Tinsley (who retired due to poor health). Used an endgame database defining perfect play for all positions involving 8 or fewer pieces on the board, a total of 443,748,401,247 positions.
Chess: Deep Blue defeated human world champion Garry Kasparov in a six-game match (not a World Championship) in 1997. Deep Blue searches 200 million positions per second, uses a very sophisticated evaluation, and undisclosed methods for extending some lines of search up to 40 ply.
Othello: human champions refuse to compete against computers, who are too good.
Go: human champions refuse to compete against computers, who are too bad. In Go, b > 300, so most programs use pattern knowledge bases to suggest plausible moves.

Games that include an element of chance
Dice rolls increase b: 21 possible rolls with 2 dice
Backgammon ≈ 20 legal moves (can be 6,000 with a 1-1 roll)
depth 4 = 20 × (21 × 20)^3 ≈ 1.2 × 10^9
As depth increases, the probability of reaching a given node shrinks
⇒ value of lookahead is diminished
⇒ α–β pruning is much less effective
TDGammon uses depth-2 search + a very good Eval ≈ world-champion level

8. Summary

Games are fun to work on! (and can be addictive)
They illustrate several important points about AI
♦ problems raised by
  — incomplete knowledge
  — resource limits
♦ perfection is unattainable ⇒ must approximate
Games are to AI as grand prix racing is to automobile design


CITS 7210 Algorithms for Artificial Intelligence
Topic 6
Agents that Learn

3 why learn?
3 general model of learning agents
3 inductive learning
3 learning decision trees

Reading: Russell & Norvig, Chapter 18

1. Why Learn?

So far, all intelligence comes from the designer:
• time consuming for designer
• restricts capabilities of agent

Learning agents can:
• act autonomously
• deal with unknown environments
• synthesise rules/patterns from large volumes of data
• handle complex data
• improve their own performance


Idea: percepts used not just for acting, but for improving future performance. Four basic components. . .

Performance element
— responsible for selecting actions for good agent performance
— agent function considered previously: percepts → actions
Learning element
— responsible for improving the performance element
— requires feedback on how the agent is doing
Critic element
— responsible for providing feedback
— comparison with objective performance standard (outside agent)
Problem generator
— responsible for generating new experience
— requires exploration — taking suboptimal actions

[Figure: learning agent architecture — the critic, using the performance standard, turns sensor input into feedback for the learning element; the learning element makes changes to the performance element and sets learning goals for the problem generator; the performance element maps percepts to effector actions in the environment]

E.g. taxi driver agent
performance element — "Let's take Winthrop Ave, I know it works."
problem generator — "Nah, let's try Mounts Bay Rd for a change. You never know, it may be quicker."
critic — "Wow, it was 5 mins quicker, and what a lovely view!"
learning element — "Yeah, in future we'll take Mounts Bay Rd."


2.1 The Learning Element

Two separate goals:
1. Improve outcome of performance element
   — how good is the solution
2. Improve time performance of performance element
   — how fast does it reach a solution
   — known as speedup learning
Learning systems may work on one or both tasks.

Design of a learning element is affected by four issues:
1. the components of the performance element to be improved
2. representation of those components
3. feedback available
4. prior information available

Components of the performance element
Can use:
1. direct mapping from states to actions
2. means to infer properties (model) of the world from the percept sequence
3. information on how the world evolves
4. information about how actions change the world
5. utility information about desirability of states
6. action-value information about desirability of actions in states
7. goals whose achievement maximises utility
Each component can be learned, given appropriate feedback.

E.g. our taxi driver agent
• Mounts Bay Rd has a nicer view (5)
• Taking Mounts Bay Rd → arrive more quickly (4,7)
• If travelling from Perth to UWA, always take Mounts Bay Rd (6)


Representation of components
Examples:
• utility functions, eg linear weighted polynomials for a game-playing agent
• logical sentences for a reasoning agent
• probabilistic descriptions, eg belief networks for a decision-theoretic agent

Available feedback
supervised learning
— agent provided with both inputs and correct outputs
— usually by a "teacher"
reinforcement learning
— agent chooses actions
— receives some reward or punishment
unsupervised learning
— no hint about correct outputs
— can only learn relationships between percepts

Prior knowledge
• agent begins with a tabula rasa (empty slate)
• agent makes use of background knowledge
In practice learning is hard ⇒ use background knowledge if available.

Learning = Function approximation
All components of the performance element can be described mathematically by a function. eg.
• how the world evolves: f: state ↦ state
• goal: f: state ↦ {0,1}
• utilities: f: state ↦ [−∞, ∞]
• action values: f: (state, action) ↦ [−∞, ∞]
All learning can be seen as learning a function.


3. Inductive Learning

Assume f is the function to be learned.
Learning algorithm supplied with sample inputs and corresponding outputs ⇒ supervised learning
Define example (or sample): pair (x, f(x))
Task:
given a set of examples of f, return a function h that approximates f.
h is called a hypothesis function
Task is called pure inductive inference or induction.

Many choices for h
[Figure: (a)–(d) four different curves fitted to the same set of data points]
A preference for one over another is called a bias.

Example: Simple reflex agent
Sample is a pair (percept, action)
eg. percept — chess board position
    action — best move supplied by friendly grandmaster

Algorithm:

global examples ← {}

function Reflex-Performance-Element(percept) returns an action
  if (percept, a) in examples then return a
  else
    h ← Induce(examples)
    return h(percept)

procedure Reflex-Learning-Element(percept, action)
  inputs: percept, feedback percept
          action, feedback action
  examples ← examples ∪ {(percept, action)}

Many variations possible, eg.
• incremental learning — learning element updates h with each new sample
• reinforcement — agent receives feedback on quality of action
Many possible representations of h — choice of representation is critical
• type of learning algorithm that can be used
• whether learning problem is feasible
• expressiveness vs tractability

4. Learning Decision Trees

Decision tree
• input — description of a situation
  — abstracted by a set of properties, parameters, attributes, features,. . .
• output — boolean (yes/no) decision
  — can also be thought of as defining a categorisation, classification or concept
  ⇒ set of situations with a positive response
  f: situation ↦ {0, 1}

We consider
1. decision trees as performance elements
2. inducing decision trees (learning element)


4.1 Decision Trees as Performance Elements

Example
Problem: decide whether to wait for a table at a restaurant
Aim: provide a definition, expressed as a decision tree, for the goal concept "Will Wait"
First step: identify attributes
— what factors are necessary to make a rational decision
For example:
1. alternative nearby?
2. bar?
3. Friday/Saturday?
4. hungry?
5. patrons?
6. price?
7. raining?
8. reservation?
9. type of food?
10. estimated wait?

Choice of attributes is critical ⇒ determines whether an appropriate function can be learned ("garbage-in, garbage-out")
No matter how good a learning algorithm is, it will fail if appropriate features cannot be identified (cf. neural nets)
In real world problems feature selection is often the hardest task! (black art?)

Example of a decision tree · · ·
[Figure: decision tree for WillWait — root tests Patrons? (None → No, Some → Yes, Full → WaitEstimate?), with further tests on Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, Raining? leading to Yes/No leaves]

Expressiveness of Decision Trees
• limited to propositional (boolean) problems
  – cannot input an arbitrary number of restaurants and ask it to choose
  – cannot handle continuous information
• fully expressive within the class of propositional problems
• may be exponentially large w.r.t. the number of inputs

Number of trees (size of hypothesis space)
n attributes ⇒ 2^n combinations of inputs ⇒ 2^(2^n) different functions
eg. 6 attributes, approx. 2 × 10^19 functions to choose from
⇒ lots of work for the learning element!

4.2 Decision Tree Induction

Terminology. . .
example — ({attributes}, value)
positive example — value = true
negative example — value = false
training set — set of examples used for learning

Example training set:

Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1       Yes  No   No   Yes  Some  $$$    No    Yes  French   0–10   Yes
X2       Yes  No   No   Yes  Full  $      No    No   Thai     30–60  No
X3       No   Yes  No   No   Some  $      No    No   Burger   0–10   Yes
X4       Yes  No   Yes  Yes  Full  $      No    No   Thai     10–30  Yes
X5       Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    No
X6       No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0–10   Yes
X7       No   Yes  No   No   None  $      Yes   No   Burger   0–10   No
X8       No   No   No   Yes  Some  $$     Yes   Yes  Thai     0–10   Yes
X9       No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    No
X10      Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10–30  No
X11      No   No   No   No   None  $      No    No   Thai     0–10   No
X12      Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30–60  Yes

Goal: find a decision tree that agrees with all the examples in the training set


Trivial Solution
Branch on each attribute in turn, until you reach a distinct leaf for each example.
Problems
• tree is bigger than needed — does not find patterns that "summarise" or "simplify" information
• cannot answer for examples that haven't been seen ⇒ cannot generalise
Two sides of the same coin!

Ockham's razor: the most likely hypothesis is the simplest one that is consistent with the examples

Finding the smallest tree is intractable, but can use heuristics to generate a reasonably good solution.
Basic idea: recursively select the most "important", or discriminating, attribute for successive branch points
(discriminating — formally requires information theory, but humans do it intuitively ⇒ sort the "men from the boys", "sheep from the lambs", "oats from the chaff". . . )

[Figure: splitting the 12 examples (+: X1,X3,X4,X6,X8,X12; −: X2,X5,X7,X9,X10,X11) — (a) on Patrons? (None: all −, Some: all +, Full: mixed); (b) on Type? (every branch still mixed); (c) on Patrons? and then, for the Full branch, on Hungry? — so Patrons? is the more discriminating first test]


Recursive algorithm

Base cases
• remaining examples all +ve or all −ve — stop and label "yes" or "no"
• no examples left — no relevant examples have been seen, return majority in parent node
• no attributes left — problem: either the data is inconsistent ("noisy", or nondeterministic) or the attributes chosen were insufficient to adequately discriminate (start again, or use majority vote)
Recursive case
• both +ve and −ve examples — choose next most discriminating attribute and repeat

Algorithm. . .

function Decision-Tree-Learning(examples, attributes, default) returns a decision tree
  inputs: examples, set of examples
          attributes, set of attributes
          default, default value for the goal predicate
  if examples is empty then return default
  else if all examples have the same classification then return the classification
  else if attributes is empty then return Majority-Value(examples)
  else
    best ← Choose-Attribute(attributes, examples)
    tree ← a new decision tree with root test best
    for each value vi of best do
      examplesi ← {elements of examples with best = vi}
      subtree ← Decision-Tree-Learning(examplesi, attributes − best, Majority-Value(examples))
      add a branch to tree with label vi and subtree subtree
    end
    return tree
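A minimal runnable Python sketch of this recursive scheme follows. Choose-Attribute is simplified here to pick the attribute whose split leaves the fewest misclassified examples (a stand-in for the information-theoretic choice), and the dictionary data format is an assumption for illustration.

from collections import Counter

def majority(examples):
    return Counter(ex["WillWait"] for ex in examples).most_common(1)[0][0]

def choose_attribute(attributes, examples):
    # Simplified "most discriminating": prefer the attribute whose
    # branches contain the fewest minority-class examples.
    def impurity(attr):
        cost = 0
        for v in {ex[attr] for ex in examples}:
            subset = [ex for ex in examples if ex[attr] == v]
            counts = Counter(ex["WillWait"] for ex in subset)
            cost += sum(counts.values()) - max(counts.values())
        return cost
    return min(attributes, key=impurity)

def dtl(examples, attributes, default):
    if not examples:
        return default
    classes = {ex["WillWait"] for ex in examples}
    if len(classes) == 1:
        return classes.pop()
    if not attributes:
        return majority(examples)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = dtl(subset, [a for a in attributes if a != best],
                            majority(examples))
    return tree

# Tiny illustrative training set (two attributes of the restaurant example)
data = [
    {"Pat": "Some", "Hun": "Yes", "WillWait": "Yes"},
    {"Pat": "Full", "Hun": "Yes", "WillWait": "No"},
    {"Pat": "None", "Hun": "No",  "WillWait": "No"},
    {"Pat": "Some", "Hun": "No",  "WillWait": "Yes"},
]
print(dtl(data, ["Pat", "Hun"], "No"))   # splits on Pat first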


Applied to our training set gives:
[Figure: induced tree — Patrons? (None → No, Some → Yes, Full → Hungry? (No → No, Yes → Type? (French → Yes, Italian → No, Thai → Fri/Sat? (No → No, Yes → Yes), Burger → Yes)))]

Note: different to the original tree, despite using data generated from that tree. Is the tree (hypothesis) wrong?
• No — with respect to seen examples! In fact more concise, and highlights new patterns.
• Probably — w.r.t. unseen examples. . .

4.3 Assessing Performance

good hypothesis = predicts the classifications of unseen examples
To assess performance we require further examples with known outcomes — test set
Usual methodology:
1. Collect a large set of examples.
2. Divide into two disjoint sets: training set and test set.
3. Apply learning algorithm to training set to generate hypothesis H.
4. Measure performance of H (% correct) on test set.
5. Repeat for different training sets of different sizes.
Plot prediction quality against training set size. . .

Learning curve (or "Happy Graph")
[Figure: % correct on test set, rising from about 0.4 towards 0.9 as training set size grows from 0 to 100]

4.4 Practical Uses of Decision-Tree Learning

Despite representational limitations, decision-tree learning has been used successfully in a wide variety of applications. eg. . .

Designing Oil Platform Equipment
Gasoil — BP, deployed 1986 (Michie)
— designs complex gas-oil separation systems for offshore oil platforms
— attributes include relative proportions of gas, oil and water, flow rate, pressure, density, viscosity, temperature and susceptibility to waxing
— largest commercial expert system in the world at that time — approx. 2500 rules
— building by hand would have taken ∼ 10 person-years
— decision-tree learning applied to database of existing designs ⇒ developed in 100 person-days
— outperformed human experts
— said to have saved BP millions of dollars!


4.4 Practical Uses of Decision-Tree Learning

Learning to Fly
C4.5 — 1992 (Sammut et al)
— one approach — learn correct mapping from state to action
— Cessna on a flight simulator
— training: 3 skilled human pilots, assigned flight plan, 30 times each
— training example for each action taken by pilot ⇒ 90000 examples
— 20 state variables
— decision tree generated and fed back into simulator to fly the plane
— Results: flies better than its teachers!
⇒ generalisation process "cleans out" mistakes by the teachers

CITS 7210 Algorithms for Artificial Intelligence
Topic 7
Sequential Decision Problems

3 Introduction to sequential decision problems
3 Value iteration
3 Policy iteration
3 Longevity in agents

Reading: Russell and Norvig, Chapter 17, Sections 1–3

1. Sequential decision problems

Previously concerned with single decisions, where the utility of each action's outcome is known.
This section — sequential decision problems
— utility depends on a sequence of decisions

Sequential decision problems, which include utilities, uncertainty, and sensing, generalise search and planning problems. . .
[Figure: Search, with explicit actions and subgoals, becomes Planning; adding uncertainty and utility gives Markov decision problems (MDPs); adding uncertain sensing (belief states) gives Decision-theoretic planning and Partially observable MDPs (POMDPs)]

1.1 From search algorithms to policies

Sequential decision problems in known, accessible, deterministic domains
tools — search algorithms
outcome — sequence of actions that leads to a good state

Sequential decision problems in uncertain domains
tools — techniques originating from control theory, operations research, and decision analysis
outcome — policy

policy = set of state-action "rules"
• tells agent what is the best (MEU) action to try in any situation
• derived from utilities of states

This section is about finding optimal policies.


1.2 From search algorithms to policies – example

Consider the environment:
[Figure: 4 × 3 grid world — START in square (1,1), terminal states +1 at (4,3) and −1 at (4,2), and square (2,2) blocked]

Problem
Utilities only known for terminal states
⇒ even for deterministic actions, depth-limited search fails!
Utilities for other states will depend on the sequence (or environment history) that leads to a terminal state.

Indeterminism
deterministic version — each action (N,S,E,W) moves one square in the intended direction (bumping into a wall results in no change)
stochastic version — actions are unreliable. . . the intended direction is taken with probability 0.8, and each perpendicular direction with probability 0.1

transition model — probabilities of actions leading to transitions between states
M^a_ij ≡ P(j|i, a) = probability that doing a in i leads to j
Cannot be certain which state an action leads to (cf. game playing).
⇒ generating a sequence of actions in advance and then executing it is unlikely to succeed


Policies

But, if
• we know what state we've reached (accessible)
• we can calculate the best action for each state
⇒ always know what to do next!

Mapping from states to actions is called a policy
eg. Optimal policy for step costs of 0.04. . .
[Figure: the 4 × 3 grid with an arrow in each non-terminal square showing the optimal action]
Note: small step cost ⇒ conservative policy (eg. state (3,1))

Expected Utilities

Given a policy, we can calculate expected utilities. . .
[Figure: the 4 × 3 grid with utilities — top row 0.812, 0.868, 0.912, +1; middle row 0.762, (blocked), 0.660, −1; bottom row 0.705, 0.655, 0.611, 0.388]

Aim is therefore not to find an action sequence, but to find the optimal policy — ie. the policy that maximises expected utilities.


1.2 From search algorithms to policies – example

Policy represents the agent function explicitly
utility-based agent ↦ simple reflex agent!

function Simple-Policy-Agent(percept) returns an action
  static: M, a transition model
          U, a utility function on environment histories
          P, a policy, initially unknown
  if P is unknown then P ← the optimal policy given U, M
  return P[percept]

The problem of calculating an optimal policy in an accessible, stochastic environment with a known transition model is called a Markov decision problem.
Markov property — transition probabilities from a given state depend only on the state (not previous history)
How can we calculate optimal policies. . . ?

2. Value Iteration

Basic idea:
• calculate utility of each state U(state)
• use state utilities to select optimal action

Sequential problems usually use an additive utility function (cf. path cost in search problems):

U([s1, s2, . . . , sn]) = R(s1) + R(s2) + · · · + R(sn) = R(s1) + U([s2, . . . , sn])

where R(i) is the reward in state i (eg. +1, −1, −0.04).

Utility of a state (a.k.a. its value):
U(si) = expected sum of rewards until termination assuming optimal actions
Difficult to express mathematically. Easier is the recursive form. . .
expected sum of rewards = current reward + expected sum of rewards after taking best action

2.1 Dynamic programming

Bellman equation (1957)

U(i) = R(i) + max_a Σ_j M^a_ij U(j)

eg.
U(1,1) = −0.04 + max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (up)
                      0.9 U(1,1) + 0.1 U(1,2),                 (left)
                      0.9 U(1,1) + 0.1 U(2,1),                 (down)
                      0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }   (right)

One equation per state = n nonlinear equations in n unknowns

Given the utilities of the states, choosing the best action is just maximum expected utility (MEU) — choose the action such that the expected utility of the immediate successors is highest:

policy(i) = arg max_a Σ_j M^a_ij U(j)

Proven optimal (Bellman & Dreyfus, 1962).

How can we solve

U(i) = R(i) + max_a Σ_j M^a_ij U(j) ?

2.2 Value iteration algorithm

Idea
• start with arbitrary utility values
• update to make them locally consistent with the Bellman eqn.
• repeat until "no change"
Everywhere locally consistent ⇒ global optimality

function Value-Iteration(M, R) returns a utility function
  inputs: M, a transition model
          R, a reward function on states
  local variables: U, utility function, initially identical to R
                   U′, utility function, initially identical to R
  repeat
    U ← U′
    for each state i do
      U′[i] ← R[i] + max_a Σ_j M^a_ij U[j]
    end
  until Close-Enough(U, U′)
  return U
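A minimal runnable Python sketch of this iteration follows. The nested-dict model layout and the stopping tolerance are assumptions made for illustration, not the course's code.

def value_iteration(M, R, eps=1e-4):
    """M[i][a] is a list of (prob, j) pairs; R[i] is the reward in state i.
    Terminal states are those with no actions in M."""
    U = dict(R)                                   # initially identical to R
    while True:
        U_new, delta = {}, 0.0
        for i in R:
            if M.get(i):
                best = max(sum(p * U[j] for p, j in M[i][a]) for a in M[i])
                U_new[i] = R[i] + best
            else:
                U_new[i] = R[i]                   # terminal: just the reward
            delta = max(delta, abs(U_new[i] - U[i]))
        U = U_new
        if delta < eps:                           # Close-Enough test
            return U

# Two-state chain: from s you can 'go' to the terminal t (reward +1)
M = {"s": {"go": [(1.0, "t")]}, "t": {}}
R = {"s": -0.04, "t": 1.0}
print(value_iteration(M, R))                      # ≈ {'s': 0.96, 't': 1.0}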


Applying to our example · · ·
[Figure: utility estimates for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2) plotted against number of iterations — all converge within about 30 iterations to the grid utilities shown earlier (0.812, 0.868, 0.912 / 0.762, 0.660 / 0.705, 0.655, 0.611, 0.388)]

2.3 Assessing performance

Under certain conditions utility values are guaranteed to converge.
Do we require convergence?
Two measures of progress:
1. RMS (root mean square) Error of Utility Values
[Figure: RMS error of the utility values decreasing towards 0 over about 20 iterations]

2. Policy Loss
Actual utility values are less important than the policy they imply
⇒ measure the difference between the expected utility obtained from the policy and the expected utility from the optimal policy
[Figure: policy loss dropping to 0 within a few iterations]
Note: the policy is optimal before the RMS error converges.

3. Policy iteration

• policies may not be highly sensitive to exact utility values
⇒ may be less work to iterate through policies than utilities!

Policy Iteration Algorithm
π ← an arbitrary initial policy
repeat until no change in π
  compute utilities given π (value determination)
  update π as if utilities were correct (i.e., local MEU)

function Policy-Iteration(M, R) returns a policy
  inputs: M, a transition model
          R, a reward function on states
  local variables: U, a utility function, initially identical to R
                   P, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(P, U, M, R)
    unchanged? ← true
    for each state i do
      if max_a Σ_j M^a_ij U[j] > Σ_j M^P[i]_ij U[j] then
        P[i] ← arg max_a Σ_j M^a_ij U[j]
        unchanged? ← false
  until unchanged?
  return P
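A small runnable Python sketch of the same loop follows. Value determination here is approximated by simple iterative sweeps rather than solving the linear system, and the model format matches the value-iteration sketch above; both are illustrative assumptions.

def expected(M, U, i, a):
    return sum(p * U[j] for p, j in M[i][a])

def policy_iteration(M, R, sweeps=50):
    policy = {i: next(iter(M[i])) for i in M if M[i]}   # arbitrary initial policy
    U = dict(R)
    while True:
        # value determination: utilities under the fixed policy
        for _ in range(sweeps):
            U = {i: R[i] + (expected(M, U, i, policy[i]) if i in policy else 0.0)
                 for i in R}
        # policy improvement: local MEU update
        unchanged = True
        for i in policy:
            best = max(M[i], key=lambda a: expected(M, U, i, a))
            if expected(M, U, i, best) > expected(M, U, i, policy[i]):
                policy[i] = best
                unchanged = False
        if unchanged:
            return policy, U

# Same toy model as before, plus a 'stay' action that wastes a step
M = {"s": {"go": [(1.0, "t")], "stay": [(1.0, "s")]}, "t": {}}
R = {"s": -0.04, "t": 1.0}
print(policy_iteration(M, R)[0])                        # {'s': 'go'}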


3.1 Value determination

⇒ simpler than value iteration since the action is fixed
Two possibilities:

1. Simplification of the value iteration algorithm.

U(i) ← R(i) + Σ_j M^π(i)_ij U(j)

May take a long time to converge.

2. Direct solution.

U(i) = R(i) + Σ_j M^π(i)_ij U(j)   for all i

i.e., n simultaneous linear equations in n unknowns, solve in O(n^3) (eg. Gaussian elimination)
Can be the most efficient method for small state spaces.

4. What if I live forever?

Agent continues to exist — using the additive definition of utilities
• U(i)s are infinite!
• value iteration fails to terminate
How should we compare two infinite lifetimes? How can we decide what to do?

One method: discounting
Future rewards are discounted at rate γ ≤ 1

U([s0, . . . , s∞]) = Σ_{t=0}^∞ γ^t R(s_t)

(a small numeric check of this sum is sketched below)

Intuitive justification:
1. purely pragmatic
• smoothed version of limited horizons in game playing
• smaller γ, shorter horizon
2. model of animal and human preference behaviour
• a bird in the hand is worth two in the bush!
• eg. widely used in economics to value investments
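As a quick illustrative check of the discounted sum (values chosen for illustration): with γ = 0.9 and a constant reward of 1 the infinite sum converges to 1/(1 − γ) = 10.

gamma, horizon = 0.9, 200
# finite truncation of U([s0, s1, ...]) = sum_t gamma^t * R(s_t) with R = 1
U = sum(gamma**t * 1.0 for t in range(horizon))
print(round(U, 3))          # -> 10.0 (approaches 1/(1 - gamma))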

CITS 7210 Algorithms for Artificial Intelligence
Topic 8
Reinforcement Learning

3 passive learning in a known environment
3 passive learning in unknown environments
3 active learning
3 exploration
3 learning action-value functions
3 generalisation

Reading: Russell & Norvig, Chapter 20, Sections 1–7.

1. Reinforcement Learning

Previous learning examples
• supervised — input/output pairs provided
  eg. chess — given game situation and best move
Learning can occur in much less generous environments
• no examples provided
• no model of environment
• no utility function
eg. chess — try random moves, gradually build a model of the environment and opponent
Must have some (absolute) feedback in order to make decisions.
eg. chess — comes at end of game
⇒ called reward or reinforcement
Reinforcement learning — use rewards to learn a successful agent function


Harder than supervised learning
eg. reward at end of game — which moves were the good ones?
. . . but . . .
only way to achieve very good performance in many complex domains!

Aspects of reinforcement learning:
• accessible environment — states identifiable from percepts
  inaccessible environment — must maintain internal state
• model of environment known or learned (in addition to utilities)
• rewards only in terminal states, or in any states
• rewards components of utility — eg. dollars for a betting agent
  or hints — eg. "nice move"
• passive learner — watches world go by
  active learner — acts using information learned so far, uses a problem generator to explore the environment

Two types of reinforcement learning agents:
utility learning
• agent learns utility function
• selects actions that maximise expected utility
Disadvantage: must have (or learn) a model of the environment — need to know where actions lead in order to evaluate actions and make a decision
Advantage: uses "deeper" knowledge about the domain

Q-learning
• agent learns action-value function
  — expected utility of taking action in given state
Advantage: no model required
Disadvantage: shallow knowledge
— cannot look ahead
— can restrict ability to learn

We start with utility learning. . .

2. Passive Learning in a Known Environment

Assume:
• accessible environment
• effects of actions known
• actions are selected for the agent ⇒ passive
• known model Mij giving the probability of transition from state i to state j

Example:
[Figure: (a) the 4 × 3 environment with utilities (rewards) of the terminal states (+1 and −1); (b) the transition model Mij with probabilities such as .5, .33, 1.0 between neighbouring states; (c) the exact utilities, eg. −0.0380, 0.0886, 0.2152 on the top row]

Aim: learn utility values for non-terminal states

Terminology
Reward-to-go = sum of rewards from state to terminal state
additive utility function: utility of a sequence is the sum of rewards accumulated in the sequence
Thus for an additive utility function and state s:
expected utility of s = expected reward-to-go of s

Training sequence eg.
(1,1) →(2,1) →(3,1) →(3,2) →(3,1) →(4,1) →(4,2) [-1]
(1,1) →(1,2) →(1,3) →(1,2) →· · · →(3,3) →(4,3) [1]
(1,1) →(2,1) →· · · →(3,2) →(3,3) →(4,3) [1]

Aim: use samples from training sequences to learn (an approximation to) the expected reward for all states.
ie. generate a hypothesis for the utility function
Note: similar to the sequential decision problem, except rewards initially unknown.


2.1 A generic passive reinforcement learning agent

Learning is iterative — successively update estimates of utilities

function Passive-RL-Agent(e) returns an action
  static: U, a table of utility estimates
          N, a table of frequencies for states
          M, a table of transition probabilities from state to state
          percepts, a percept sequence (initially empty)
  add e to percepts
  increment N[State[e]]
  U ← Update(U, e, percepts, M, N)
  if Terminal?[e] then percepts ← the empty sequence
  return the action Observe

Update
• after transitions, or
• after complete sequences
The update function is one key to reinforcement learning.
Some alternatives · · ·

2.2 Naïve Updating — LMS Approach

From Adaptive Control Theory, late 1950s
Assumes:
observed rewards-to-go → actual expected reward-to-go
At end of sequence:
• calculate (observed) reward-to-go for each state
• use observed values to update utility estimates
eg, utility function represented by a table of values — maintain a running average. . .

function LMS-Update(U, e, percepts, M, N) returns an updated U
  if Terminal?[e] then reward-to-go ← 0
  for each ei in percepts (starting at end) do
    reward-to-go ← reward-to-go + Reward[ei]
    U[State[ei]] ← Running-Average(U[State[ei]], reward-to-go, N[State[ei]])
  end
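A minimal Python sketch of this end-of-sequence update follows; representing a training sequence as a list of (state, reward) pairs is an assumption for illustration.

def lms_update(U, N, sequence):
    """U: utility estimates, N: visit counts, sequence: [(state, reward), ...].
    Walk the sequence backwards, accumulating reward-to-go, and fold each
    observation into a running average."""
    reward_to_go = 0.0
    for state, reward in reversed(sequence):
        reward_to_go += reward
        N[state] = N.get(state, 0) + 1
        old = U.get(state, 0.0)
        U[state] = old + (reward_to_go - old) / N[state]
    return U

U, N = {}, {}
# one training sequence ending in the +1 terminal state
lms_update(U, N, [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((4, 3), 1.0)])
print(U[(1, 1)])    # ≈ 0.88, the observed reward-to-go from (1,1)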


Exercise
Show that this approach minimises mean squared error (MSE) (and hence root mean squared (RMS) error) w.r.t. the observed data.
That is, the hypothesis values x_h generated by this method minimise

(1/N) Σ_i (x_i − x_h)^2

where x_i are the sample values.
For this reason this approach is sometimes called the least mean squares (LMS) approach.

In general we wish to learn a utility function (rather than a table).
Have examples with:
• input value — state
• output value — observed reward
⇒ inductive learning problem!
Can apply any techniques for inductive function learning — linear weighted function, neural net, etc. . .

Problem:
The LMS approach ignores important information
⇒ interdependence of state utilities!

Example (Sutton 1998)
[Figure: a NEW state with unknown utility leads with probability ~0.9 to an OLD state with estimated utility ~ −0.8, and with probability ~0.1 to a terminal state with reward +1]
New state awarded an estimate of +1. Real value ∼ −0.8.


Leads to slow convergence. . .
[Figure: LMS utility estimates for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2) and RMS error in utility over 1000 epochs — the error decreases only slowly]

2.3 Adaptive Dynamic Programming

Take into account the relationship between states. . .
utility of a state = probability-weighted average of its successors' utilities + its own reward
Formally, utilities are described by a set of equations:

U(i) = R(i) + Σ_j Mij U(j)

(passive version of the Bellman equation — no maximisation over actions)
Since the transition probabilities Mij are known, once enough training sequences have been seen so that all reinforcements R(i) have been observed:
• problem becomes a well-defined sequential decision problem
• equivalent to the value determination phase of policy iteration
⇒ above equation can be solved exactly

Refer to learning methods that solve the utility equations using dynamic programming as adaptive dynamic programming (ADP).
Good benchmark, but intractable for large state spaces
eg. backgammon: 10^50 equations in 10^50 unknowns

2.4 Temporal Difference Learning

Can we get the best of both worlds — use constraints without solving equations for all states?
⇒ use observed transitions to adjust locally in line with the constraints

U(i) ← U(i) + α(R(i) + U(j) − U(i))

α is the learning rate
Called the temporal difference (TD) equation — updates according to the difference in utilities between successive states.
Note: compared with

U(i) = R(i) + Σ_j Mij U(j)

— only involves the observed successor rather than all successors
However, the average value of U(i) converges to the correct value.
Step further — replace α with a function that decreases with the number of observations ⇒ U(i) converges to the correct value (Dayan, 1992).
Algorithm · · ·


function TD-Update(U, e, percepts, M, N) returns utility table U
  if Terminal?[e] then
    U[State[e]] ← Running-Average(U[State[e]], Reward[e], N[State[e]])
  else if percepts contains more than one element then
    e′ ← the penultimate element of percepts
    i, j ← State[e′], State[e]
    U[i] ← U[i] + α(N[i])(Reward[e′] + U[j] − U[i])

Example runs · · ·
[Figure: TD utility estimates for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2) and RMS error in utility over 1000 epochs]
Notice:
• values more erratic
• RMS error significantly lower than the LMS approach after 1000 epochs
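A minimal Python sketch of the TD update rule above, with a learning rate that decays with the visit count so the estimates settle; the table layout and initial rate are illustrative assumptions.

def td_update(U, N, i, j, reward_i, alpha0=0.5):
    """Apply one temporal-difference update after observing transition i -> j."""
    U.setdefault(i, 0.0)
    U.setdefault(j, 0.0)
    N[i] = N.get(i, 0) + 1
    alpha = alpha0 / N[i]                     # decaying learning rate
    U[i] += alpha * (reward_i + U[j] - U[i])
    return U

U, N = {}, {}
td_update(U, N, "A", "B", reward_i=-0.04)
print(U)      # ≈ {'A': -0.02, 'B': 0.0}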

3. Passive Learning, Unknown Environments

• LMS and TD learning don't use the model directly
⇒ operate unchanged in an unknown environment
• ADP requires an estimate of the model
• All utility-based methods use the model for action selection
Estimate of the model can be updated during learning by observation of transitions
• each percept provides an input/output example of the transition function
eg. for a tabular representation of M, simply keep track of the percentage of transitions to each neighbour
Other techniques for learning stochastic functions — not covered here.

4. Active Learning in Unknown Environments

Agent must decide which actions to take.
Changes:
• agent must include a performance element (and exploration element) ⇒ choose action a
• model must incorporate probabilities given the action — M^a_ij
• constraints on utilities must take account of the choice of action

U(i) = R(i) + max_a Σ_j M^a_ij U(j)

(Bellman's equation from sequential decision problems)

Model Learning and ADP
• Tabular representation — accumulate statistics in a 3-dimensional table (rather than 2-dimensional)
• Functional representation — input to the function includes the action taken
ADP can then use the value iteration (or policy iteration) algorithms · · ·


function Active-ADP-Agent(e) returns an action
  static: U, a table of utility estimates
          M, a table of transition probabilities from state to state for each action
          R, a table of rewards for states
          percepts, a percept sequence (initially empty)
          last-action, the action just executed
  add e to percepts
  R[State[e]] ← Reward[e]
  M ← Update-Active-Model(M, percepts, last-action)
  U ← Value-Iteration(U, M, R)
  if Terminal?[e] then percepts ← the empty sequence
  last-action ← Performance-Element(e)
  return last-action

Temporal Difference Learning
Learn the model as per ADP. Update algorithm...?
No change! Strange rewards only occur in proportion to the probability of strange action outcomes

U(i) ← U(i) + α(R(i) + U(j) − U(i))

5. Exploration

How should the performance element choose actions?
Two outcomes:
• gain rewards on the current sequence
• observe new percepts for learning, and improve rewards on future sequences
trade-off between immediate and long-term good
— not limited to automated agents!
Non-trivial
• too conservative ⇒ get stuck in a rut
• too inquisitive ⇒ inefficient, never get anything done
eg. taxi driver agent


Example
[Figure: the 4 × 3 grid world again, with the 0.8/0.1/0.1 stochastic action model]

Two extremes:
whacky — acts randomly in the hope of exploring the environment
⇒ learns good utility estimates
⇒ never gets better at reaching positive reward
greedy — acts to maximise utility given current estimates
⇒ finds a path to positive reward
⇒ never finds the optimal route
Start whacky, get greedier? Is there an optimal exploration policy?

Optimal is difficult, but can get close. . .
— give weight to actions that have not been tried often, while tending to avoid low utilities
Alter the constraint equation to assign higher utility estimates to relatively unexplored action-state pairs
⇒ optimistic "prior" — initially assume everything is good.
Let
U+(i) — optimistic estimate
N(a, i) — number of times action a tried in state i
ADP update equation:

U+(i) ← R(i) + max_a f(Σ_j M^a_ij U+(j), N(a, i))

where f(u, n) is the exploration function.
Note U+ (not U) on the r.h.s. — propagates the tendency to explore from sparsely explored regions through densely explored regions


f(u, n) determines the trade-off between "greed" and "curiosity"
⇒ should increase with u, decrease with n

Simple example

f(u, n) = R+   if n < Ne
          u    otherwise

where R+ is an optimistic estimate of the best possible reward, and Ne is a fixed parameter
⇒ try each state at least Ne times.

Example for an ADP agent with R+ = 2 and Ne = 5 · · ·
[Figure: utility estimates for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2), and RMS error and policy loss for the exploratory policy, over 100 iterations/epochs]
Note the policy converges on optimal very quickly
(whacky — best policy loss ≈ 2.3
greedy — best policy loss ≈ 0.25)
Utility estimates take longer — after the exploratory period further exploration happens only by "chance"
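A minimal Python sketch of this exploration function and how it might be used to rank actions; R+ = 2 and Ne = 5 follow the example above, everything else (action names, counts) is illustrative.

R_PLUS, N_E = 2.0, 5          # optimistic reward estimate and trial threshold

def f(u, n):
    # Greed vs curiosity: pretend rarely tried actions are worth R+ until
    # they have been tried at least N_E times.
    return R_PLUS if n < N_E else u

# ranking actions by f(expected utility, trial count):
actions = {"north": (0.4, 12), "east": (0.1, 2)}   # action: (exp. utility, tries)
best = max(actions, key=lambda a: f(*actions[a]))
print(best)    # -> "east": barely tried, so it gets the optimistic value 2.0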

6. Learning Action-Value Functions

Action-value functions
• assign an expected utility to taking action a in state i
• also called Q-values
• allow decision-making without use of a model

Relationship to utility values

U(i) = max_a Q(a, i)

Constraint equation

Q(a, i) = R(i) + Σ_j M^a_ij max_a′ Q(a′, j)

Can be used for iterative learning, but need to learn the model.
Alternative ⇒ temporal difference learning

TD Q-learning update equation

Q(a, i) ← Q(a, i) + α(R(i) + max_a′ Q(a′, j) − Q(a, i))

Consistency between values is not enforced by a model.

Algorithm:

function Q-Learning-Agent(e) returns an action
  static: Q, a table of action values
          N, a table of state-action frequencies
          a, the last action taken
          i, the previous state visited
          r, the reward received in state i
  j ← State[e]
  if i is non-null then
    N[a, i] ← N[a, i] + 1
    Q[a, i] ← Q[a, i] + α(r + max_a′ Q[a′, j] − Q[a, i])
  if Terminal?[e] then
    i ← null
  else
    i ← j
    r ← Reward[e]
  a ← arg max_a′ f(Q[a′, j], N[a′, j])
  return a
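A minimal Python sketch of the TD Q-learning update; the defaultdict table, the fixed learning rate and the toy states are illustrative assumptions.

from collections import defaultdict

Q = defaultdict(float)          # Q[(action, state)], default 0
ALPHA = 0.1

def q_update(a, i, reward_i, j, actions_in_j):
    """One TD Q-learning step after doing a in i, receiving R(i), reaching j."""
    best_next = max(Q[(a2, j)] for a2 in actions_in_j)
    Q[(a, i)] += ALPHA * (reward_i + best_next - Q[(a, i)])

q_update("east", (3, 2), -0.04, (3, 3), ["north", "east", "west"])
print(Q[("east", (3, 2))])      # ≈ -0.004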


Example · · ·
[Figure: utility estimates, RMS error and policy loss for the TD Q-learning agent over 100 iterations/epochs]
Note: slower convergence, greater policy loss

7. Generalisation

So far, algorithms have represented hypothesis functions as tables — explicit representation
eg. state/utility pairs
OK for small problems, impractical for most real-world problems.
eg. chess and backgammon → 10^50 – 10^120 states.
Problem is not just storage — do we have to visit all states to learn?
Clearly humans don't!
Require implicit representation — a compact representation that, rather than storing the value, allows the value to be calculated
eg. weighted linear function of features

U(i) = w1 f1(i) + w2 f2(i) + · · · + wn fn(i)

From say 10^120 states to 10 weights ⇒ whopping compression!
But more importantly, returns estimates for unseen states
⇒ generalisation!!


Very powerful. eg. from examining 1 in 10^44 backgammon states, can learn a utility function that can play as well as any human.
On the other hand, may fail completely. . . the hypothesis space must contain a function close enough to the actual utility function
Depends on
• type of function used for the hypothesis, eg. linear, nonlinear (neural net), etc
• chosen features

Trade off:
larger the hypothesis space
⇒ better likelihood it includes a suitable function, but
⇒ more examples needed
⇒ slower convergence

And last but not least. . .
[Figure]


CITS 7210 Algorithms for Artificial Intelligence
Topic 9
Planning

3 Search vs. planning
3 Planning Languages and STRIPS
3 State Space vs. Plan Space
3 Partial-order Planning

Reading: Russell & Norvig, Chapter 11

1. Search vs. Planning

Consider the task get milk, bananas, and a cordless drill
Standard search algorithms seem to fail miserably:
[Figure: a search tree from Start with branches such as Talk to Parrot, Go To Pet Store → Buy a Dog, Go To School → Go To Class, Go To Supermarket → Buy Tuna Fish / Buy Arugula / Buy Milk → . . . → Finish, Go To Sleep, Read A Book, Sit in Chair, Sit Some More, Etc. Etc. . . .]
After-the-fact heuristic/goal test inadequate

Planning systems do the following:
1. open up action and goal representation to allow selection
2. divide-and-conquer by subgoaling
3. relax the requirement for sequential construction of solutions

          Search                           Planning
States    internal state of Java objects   descriptive (logical) sentences
Actions   encoded in Java methods          preconditions/outcomes
Goal      encoded in Java methods          descriptive sentence
Plan      sequence from s0                 constraints on actions
          ⇒ implicit                       ⇒ explicit
          ⇒ hard to decompose              ⇒ easier to decompose

2. Planning Languages and STRIPS

Require a declarative language — declarations or statements about the world.
A range of languages have been proposed — the best descriptive languages we have, but can be difficult to use in practice.
more descriptive power → more difficult to compute (reason) automatically

STRIPS (STanford Research Institute Problem Solver) first to suggest a suitable compromise
• restricted form of logic
• restricted language ⇒ efficient algorithm
Basis of many subsequent languages and planners.

States
At(Home), ¬ Have(Milk), ¬ Have(Bananas), ¬ Have(Drill)
(conjunctions of function-free ground literals)


Goals
At(Home), Have(Milk), Have(Bananas), Have(Drill)
Can have variables
At(x), Sells(x,Milk)
(conjunctions of function-free literals)

Actions
Action (Name): Buy(x)
Precondition: At(p), Sells(p, x)
Effect: Have(x)
(Precondition: conjunction of positive literals
Effect: conjunction of literals)

[Figure: the Buy(x) action drawn as a box with preconditions At(p), Sells(p,x) above and effect Have(x) below]

3. State Space vs. Plan Space

Standard search: node = concrete world state
Planning search: node = partial plan
Definition: an open condition is a precondition of a step not yet fulfilled
Operators on partial plans, eg:
• add a step to fulfill an open condition
• order one step wrt another
• instantiate an unbound variable
Gradually move from incomplete/vague plans to complete, correct plans
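A minimal Python sketch of how a STRIPS-style operator could be represented and applied to a state of ground literals; this encoding (and the instantiated Buy(Milk) example) is hypothetical and for illustration only, not the course's planner.

from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    precond: frozenset = field(default_factory=frozenset)
    effect_add: frozenset = field(default_factory=frozenset)
    effect_del: frozenset = field(default_factory=frozenset)

def applicable(op, state):
    return op.precond <= state          # all preconditions hold in the state

def apply_op(op, state):
    return (state - op.effect_del) | op.effect_add

# the Buy action, already instantiated with p = SM (a supermarket), x = Milk
buy_milk = Op("Buy(Milk)",
              precond=frozenset({"At(SM)", "Sells(SM,Milk)"}),
              effect_add=frozenset({"Have(Milk)"}))

state = frozenset({"At(SM)", "Sells(SM,Milk)"})
if applicable(buy_milk, state):
    print(apply_op(buy_milk, state))    # the new state now includes Have(Milk)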

4. Partial-order planning

Example
Goal: RightShoeOn, LeftShoeOn
Operators:
Op(Action: RightShoe, Precond: RightSockOn, Effect: RightShoeOn)
Op(Action: RightSock, Effect: RightSockOn)
Op(Action: LeftShoe, Precond: LeftSockOn, Effect: LeftShoeOn)
Op(Action: LeftSock, Effect: LeftSockOn)

Consider partial plans:
1. LeftShoe, RightShoe — ordering unimportant
2. RightSock, RightShoe — ordering important
3. RightSock, LeftShoe, RightShoe — ordering between some actions important

partial order planner ⇒ planner that can represent steps in which some are ordered (in sequence) and others not (in "parallel")
least commitment planner — partial order planner that delays commitment to order between steps for as long as possible ⇒ less backtracking

[Figure: a partial-order plan — Start, then Left Sock and Right Sock in parallel, achieving LeftSockOn and RightSockOn for Left Shoe and Right Shoe, which achieve LeftShoeOn, RightShoeOn for Finish]

A plan is complete iff every precondition is achieved
A precondition is achieved iff it is the effect of an earlier step and no possibly intervening step undoes it

!c Cara MacNish. Includes material !c S. Russell & P. Norvig 1995,1998 with permission. AAI 7210 Planning Slide 193 !c Cara MacNish. Includes material !c S. Russell & P. Norvig 1995,1998 with permission. AAI 7210 Planning Slide 194
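The two definitions above can be checked mechanically once a plan has been laid out as a sequence. The Python below is an illustrative version only: the Step tuple and the function names are mine, and it works on a totally ordered plan, whereas the slide's definition quantifies over possibly intervening steps in a partial order.

    from collections import namedtuple

    # Same shape as the earlier Action sketch: preconditions, add list, delete list.
    Step = namedtuple("Step", "name precond add delete")

    def achieved(plan, index, precond):
        """True iff some step strictly before plan[index] adds `precond`
        and no step in between deletes it again."""
        for i in range(index - 1, -1, -1):      # scan backwards from the step
            if precond in plan[i].delete:       # an intervening step undoes it
                return False
            if precond in plan[i].add:          # an earlier step achieves it
                return True
        return False

    def complete(plan):
        """A plan is complete iff every precondition of every step is achieved."""
        return all(achieved(plan, i, p)
                   for i, step in enumerate(plan)
                   for p in step.precond)

    start = Step("Start", frozenset(), frozenset(), frozenset())
    rsock = Step("RightSock", frozenset(), frozenset({"RightSockOn"}), frozenset())
    rshoe = Step("RightShoe", frozenset({"RightSockOn"}),
                 frozenset({"RightShoeOn"}), frozenset())

    print(complete([start, rsock, rshoe]))   # True
    print(complete([start, rshoe, rsock]))   # False: RightSockOn not yet achieved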

4. Partial-order planning (continued)

linearisation — obtaining a totally ordered plan from a partially ordered plan by imposing ordering constraints

[Figure: the partial-order sock/shoe plan on the left and its six total-order plans (linearisations) on the right, each a different interleaving of Left Sock, Right Sock, Left Shoe, Right Shoe between Start and Finish.]

In addition to orderings we must record:
• variable bindings: e.g. x = LocalStore
• causal links: Si —c→ Sj (Si achieves precondition c for Sj)

Thus our initial plan might be:

Plan(Steps: { S1: Op(Action: Start),
              S2: Op(Action: Finish, Precond: RightShoeOn, LeftShoeOn) },
     Orderings: { S1 ≺ S2 },
     Bindings: {},
     Links: {})

Algorithm . . .
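That initial plan can be written down almost verbatim as a data structure. A minimal Python sketch follows, with my own conventions of step names as strings, ordering constraints as (before, after) pairs, and causal links as (producer, condition, consumer) triples.

    from dataclasses import dataclass

    @dataclass
    class Plan:
        steps: dict        # name -> (preconditions, effects)
        orderings: set     # set of (before, after) pairs
        bindings: dict     # variable -> value, e.g. {"x": "LocalStore"}
        links: set         # set of (producer, condition, consumer) causal links

    def make_minimal_plan(initial_effects, goal_preconds):
        """The empty plan: just Start (whose effects are the initial state)
        and Finish (whose preconditions are the goal), with Start before Finish."""
        return Plan(
            steps={"Start": (frozenset(), frozenset(initial_effects)),
                   "Finish": (frozenset(goal_preconds), frozenset())},
            orderings={("Start", "Finish")},
            bindings={},
            links=set(),
        )

    plan = make_minimal_plan([], ["RightShoeOn", "LeftShoeOn"])
    print(plan.orderings)   # {('Start', 'Finish')}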

4.1 POP algorithm sketch

function POP(initial, goal, operators) returns plan
  plan ← Make-Minimal-Plan(initial, goal)
  loop do
    if Solution?(plan) then return plan
    Sneed, c ← Select-Subgoal(plan)
    Choose-Operator(plan, operators, Sneed, c)
    Resolve-Threats(plan)
  end

function Select-Subgoal(plan) returns Sneed, c
  pick a plan step Sneed from Steps(plan)
    with a precondition c that has not been achieved
  return Sneed, c

procedure Choose-Operator(plan, operators, Sneed, c)
  choose a step Sadd from operators or Steps(plan) that has c as an effect
  if there is no such step then fail
  add the causal link Sadd —c→ Sneed to Links(plan)
  add the ordering constraint Sadd ≺ Sneed to Orderings(plan)
  if Sadd is a newly added step from operators then
    add Sadd to Steps(plan)
    add Start ≺ Sadd ≺ Finish to Orderings(plan)

procedure Resolve-Threats(plan)
  for each Sthreat that threatens a link Si —c→ Sj in Links(plan) do
    choose either
      Demotion: add Sthreat ≺ Si to Orderings(plan)
      Promotion: add Sj ≺ Sthreat to Orderings(plan)
    if not Consistent(plan) then fail
  end

POP is sound, complete, and systematic (no repetition)

Extensions exist for more expressive languages (e.g. disjunction, etc.)
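Both the Consistent(plan) test used by Resolve-Threats and the linearisation step described earlier reduce to the same question about Orderings(plan): do the constraints form a directed acyclic graph, and if so, what is one total order? An illustrative Python sketch using Kahn's algorithm (the function and variable names are mine):

    from collections import defaultdict, deque

    def linearise(steps, orderings):
        """Return one total order consistent with the (before, after) constraints,
        or None if they contain a cycle (i.e. the plan is inconsistent)."""
        succs = defaultdict(set)
        indeg = {s: 0 for s in steps}
        for before, after in orderings:
            if after not in succs[before]:
                succs[before].add(after)
                indeg[after] += 1
        ready = deque(s for s in steps if indeg[s] == 0)
        order = []
        while ready:
            s = ready.popleft()
            order.append(s)
            for t in succs[s]:
                indeg[t] -= 1
                if indeg[t] == 0:
                    ready.append(t)
        return order if len(order) == len(steps) else None

    steps = ["Start", "RightSock", "RightShoe", "LeftSock", "LeftShoe", "Finish"]
    orderings = {("Start", "RightSock"), ("Start", "LeftSock"),
                 ("RightSock", "RightShoe"), ("LeftSock", "LeftShoe"),
                 ("RightShoe", "Finish"), ("LeftShoe", "Finish")}
    print(linearise(steps, orderings))                       # one valid total order
    print(linearise(["A", "B"], {("A", "B"), ("B", "A")}))   # None: inconsistent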


4.2 Clobbering and promotion/demotion

A clobberer is a potentially intervening step that destroys the condition achieved by a causal link. E.g., Go(Home) clobbers At(HWS):

[Figure: the causal link Go(HWS) —At(HWS)→ Buy(Drill), threatened by Go(Home), whose effect At(Home) denies At(HWS). Demotion: put the clobberer before Go(HWS). Promotion: put the clobberer after Buy(Drill).]

4.3 Example: Blocks world

[Figure: a blocks-world plan. START (C on A, with A and B on the table): On(C,A), On(A,Table), Cl(B), On(B,Table), Cl(C). FINISH (A on B on C): On(A,B), On(B,C). Plan steps: PutOnTable(C) with preconditions On(C,z), Cl(C); PutOn(B,C) with preconditions Cl(B), On(B,z), Cl(C); PutOn(A,B) with preconditions Cl(A), On(A,z), Cl(B). Threats resolved by ordering: PutOn(A,B) clobbers Cl(B) ⇒ order it after PutOn(B,C); PutOn(B,C) clobbers Cl(C) ⇒ order it after PutOnTable(C).]
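Threat detection can also be sketched in a few lines: a step clobbers a causal link if it can delete the link's condition and the current orderings do not already force it outside the producer-consumer interval. The Python below is illustrative only; the deletes(step) mapping and the (before, after) ordering pairs follow the conventions of the earlier sketches rather than anything fixed by the slides.

    def must_precede(a, b, orderings):
        """True if the ordering constraints force step a before step b."""
        frontier, seen = [a], set()
        while frontier:
            s = frontier.pop()
            if s == b:
                return True
            if s in seen:
                continue
            seen.add(s)
            frontier.extend(after for before, after in orderings if before == s)
        return False

    def threats(link, steps, deletes, orderings):
        """Steps that clobber the causal link (producer, condition, consumer):
        they delete the condition and may fall between producer and consumer."""
        producer, condition, consumer = link
        return [s for s in steps
                if s not in (producer, consumer)
                and condition in deletes(s)
                and not must_precede(s, producer, orderings)    # not already demoted
                and not must_precede(consumer, s, orderings)]   # not already promoted

    # Go(Home) deletes At(HWS), so it clobbers Go(HWS) --At(HWS)--> Buy(Drill)
    # until it is ordered before Go(HWS) (demotion) or after Buy(Drill) (promotion).
    deletes = lambda s: {"Go(Home)": {"At(HWS)"}}.get(s, set())
    link = ("Go(HWS)", "At(HWS)", "Buy(Drill)")
    orderings = {("Go(HWS)", "Buy(Drill)")}
    print(threats(link, ["Go(HWS)", "Buy(Drill)", "Go(Home)"], deletes, orderings))
    # ['Go(Home)']: resolve by adding ("Go(Home)", "Go(HWS)") or ("Buy(Drill)", "Go(Home)")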

CITS 7210 Algorithms for Artificial Intelligence

Topic 10 — Planning and Acting

• The real world
• Conditional planning
• Monitoring and replanning
• Integrated planning and acting

Reading: Russell & Norvig, Chapter 13

1. The real world

[Figure: the flat-tyre domain. START: ¬Flat(Spare), Intact(Spare), Off(Spare), On(Tire1), Flat(Tire1). FINISH: On(x), ¬Flat(x). Operators — Remove(x): precondition On(x); effects Off(x), ClearHub. Puton(x): preconditions Off(x), ClearHub; effects On(x), ¬ClearHub. Inflate(x): preconditions Intact(x), Flat(x); effect ¬Flat(x).]

1.1 Incomplete/incorrect information

Incomplete information
• Unknown preconditions, e.g., Intact(Spare)?
• Disjunctive effects, e.g., Inflate(x) causes Inflated(x) or SlowHiss(x) or Burst(x) or BrokenPump or . . .

Incorrect information
• Current state incorrect, e.g., spare NOT intact
• Missing/incorrect postconditions in operators

"Qualification problem" ⇒ can never finish listing all the required preconditions and possible conditional outcomes of actions

1.2 Solutions

Conditional planning
• Plan to obtain information ⇒ observation actions
• Subplan for each contingency, e.g.,
  [Check(Tire1), If(Intact(Tire1), [Inflate(Tire1)], [CallRAC])]
⇒ Expensive because it plans for many unlikely cases

Monitoring/Replanning
• Assume normal states, outcomes
• Check progress during execution, replan if necessary
⇒ Unanticipated outcomes may lead to failure (e.g., no RAC card)

In general, some monitoring is unavoidable.

2. Conditional planning

[. . . , If(p, [then plan], [else plan]), . . .]

Execution: check p against the current KB, execute "then" or "else"

Conditional planning: just like POP except
• if an open condition can be established by an observation action
  – add the action to the plan
  – complete the plan for each possible observation outcome
  – insert a conditional step with these subplans

[Figure: the observation action CheckTire(x), with effect KnowsIf(Intact(x)).]

2.1 Conditional planning example

[Figure: the initial conditional plan. Start: On(Tire1), Flat(Tire1), Inflated(Spare). Finish: On(x), Inflated(x), in context (True).]
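Executing such a conditional plan is a simple walk over a nested list. A minimal Python sketch, assuming the KB is just a set of ground literals and primitive steps are run by a caller-supplied function (all of the names here are illustrative):

    def execute(plan, kb, do_step):
        """Execute a conditional plan: a list whose elements are either primitive
        steps or ("If", condition, then_plan, else_plan) tuples."""
        for step in plan:
            if isinstance(step, tuple) and step[0] == "If":
                _, condition, then_plan, else_plan = step
                # Check the condition against the current KB and pick a branch.
                branch = then_plan if condition in kb else else_plan
                execute(branch, kb, do_step)
            else:
                do_step(step, kb)   # primitive action; may add observations to kb

    def do_step(step, kb):
        print("executing", step)
        if step == "Check(Tire1)":
            kb.add("Intact(Tire1)")   # pretend the observation came back true

    plan = ["Check(Tire1)",
            ("If", "Intact(Tire1)", ["Inflate(Tire1)"], ["CallRAC"])]
    execute(plan, set(), do_step)     # runs Check(Tire1), then Inflate(Tire1)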

2.1 Conditional planning example (continued)

[Figure: planning for the Intact(Tire1) contingency. Check(Tire1) is added after Start, x is bound to Tire1, and Inflate(Tire1) (preconditions Intact(Tire1), Flat(Tire1)) supports Finish in context (Intact(Tire1)).]

[Figure: planning for the other contingency. In context (¬Intact(Tire1)) the plan binds x to Spare and uses Remove(Tire1) followed by Puton(Spare) to support a second Finish (On(x), Inflated(x)) in context (¬Intact(Tire1)).]

3. Monitoring

Execution monitoring
• "failure" = preconditions of the remaining plan not met
• preconditions = causal links at the current time

Action monitoring
• "failure" = preconditions of the next action not met
  (or the action itself fails, e.g., robot bump sensor)

In both cases, need to replan.

3.1 Preconditions for remaining plan

[Figure: the shopping plan with its causal links. Start → Go(HWS) (precondition At(Home)) → Buy(Drill) (preconditions At(HWS), Sells(HWS,Drill)) → Go(SM) (precondition At(HWS)) → Buy(Milk) (preconditions At(SM), Sells(SM,Milk)) and Buy(Ban.) (preconditions At(SM), Sells(SM,Ban.)) → Go(Home) (precondition At(SM)) → Finish (preconditions Have(Milk), At(Home), Have(Ban.), Have(Drill)). The causal links that cross the current execution point are the preconditions of the remaining plan.]
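Action monitoring can be sketched as a thin loop around the executor: before each step, check its preconditions against the world as currently observed, and hand control back to a planner when they fail. In the Python below, observe, execute and replan are stand-ins for the agent's sensors, effectors and planner (my own names, not the slides').

    from collections import namedtuple

    Act = namedtuple("Act", "name precond")

    def run_with_monitoring(plan, observe, execute, replan):
        """Action monitoring: 'failure' = preconditions of the next action not met."""
        while plan:
            state = observe()                   # current world state (set of literals)
            action = plan[0]
            if not action.precond <= state:     # precondition failure detected
                print("failure before", action.name, "- replanning")
                plan = replan(state)            # replan from scratch, or reconnect
                continue                        # to the best continuation
            execute(action)
            plan = plan[1:]

    # Toy run: the agent believes it is at home, so Buy(Drill) cannot go ahead.
    run_with_monitoring([Act("Buy(Drill)", frozenset({"At(HWS)"}))],
                        observe=lambda: {"At(Home)"},
                        execute=lambda a: print("doing", a.name),
                        replan=lambda state: [])   # placeholder planner: gives up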

3.2 Replanning

Simplest: on failure, replan from scratch

Better (but harder): plan to get back on track by reconnecting to the best continuation
⇒ "loop until done" behavior (with no explicit loop)

[Figure: the chair-painting plan START → Get(Red) → Paint(Red) → FINISH, with a failure response attached to each precondition. START: Color(Chair,Blue), ¬Have(Red). Get(Red): preconditions none, failure response N/A. Paint(Red): precondition Have(Red), failure response "Fetch more red". FINISH: precondition Color(Chair,Red), failure response "Repaint".]

4. Fully Integrated Planning and Acting

Instead of planning and execution monitoring as separate processes. . .

    planning
  + execution monitoring
  ⇒ situated planning

A situated planning agent
• is always "part of the way" through a plan
• its activities include
  – executing a plan step
  – monitoring the world
  – fixing deficiencies in the plan (open conditions, clobbering, etc)
  – refining the plan in light of new information (execution errors, actions by other agents, etc)
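The activity list above amounts to a single control loop in which plan repair and execution are interleaved rather than staged. The Python sketch below is a deliberately simplified illustration: "fixing deficiencies" is reduced to dropping steps whose effects already hold, and anything it cannot handle is handed back to a planner. All names are my own placeholders.

    from collections import namedtuple

    Op = namedtuple("Op", "name precond add")

    def situated_agent(goal, plan, perceive, act):
        """Sketch of a situated planning agent: one loop interleaving monitoring,
        plan repair, and execution; there is no separate planning phase."""
        plan = list(plan)
        while True:
            world = perceive()                      # monitor the world
            if goal <= world:
                return "done"
            # Minimal plan repair: drop steps whose effects already hold
            # (e.g. another agent achieved them); real repair would also
            # handle open conditions and clobbering.
            plan = [s for s in plan if not s.add <= world]
            if not plan or not plan[0].precond <= world:
                return "refine plan"                # hand back to the planner
            act(plan[0])                            # execute one plan step
            plan = plan[1:]

    # Toy run in a simulated blocks world where On(D,B) already holds, so
    # Move(D,B) is dropped as redundant and only Move(C,D) is executed.
    world = {"On(D,B)", "Clear(C)", "Clear(D)", "On(C,F)"}
    def act(step):
        print("executing", step.name)
        world.update(step.add)      # toy executor: apply the add list only

    plan = [Op("Move(C,D)", frozenset({"Clear(C)", "Clear(D)", "On(C,F)"}),
               frozenset({"On(C,D)"})),
            Op("Move(D,B)", frozenset({"Clear(D)", "Clear(B)", "On(D,G)"}),
               frozenset({"On(D,B)"}))]
    print(situated_agent(frozenset({"On(C,D)", "On(D,B)"}), plan,
                         perceive=lambda: set(world), act=act))   # -> "done"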

4.1 Situated Planning Example

The agent wishes to achieve the goal state On(C,D), On(D,B).

[Figure: four snapshots (a)–(d) of a blocks world built over base blocks A, E, F, G:
(a) start state
(b) another agent has put D on B
(c) our agent has executed Move(C,D) but failed, dropping C on A
(d) goal state achieved]

Initial plan (a):

[Figure: Start (effects include On(C,F), Ontable(A), On(B,E), On(D,G), Clear(A), Clear(C), Clear(D), Clear(B)) supports Move(C,D) (preconditions On(C,F), Clear(C), Clear(D)) and Move(D,B) (preconditions On(D,G), Clear(D), Clear(B)), which achieve On(C,D) and On(D,B) for Finish.]

4.1 Situated Planning Example (continued)

An external agent changes the environment (b):

[Figure: Start is updated to reflect the new situation, now asserting On(D,B) and Clear(G) among its effects; Move(C,D) and Move(D,B) remain in the plan, with Move(D,B) now requiring On(D,y), Clear(D), Clear(B).]

The redundant action is removed:

[Figure: Move(D,B) is dropped, since the Finish precondition On(D,B) is now supported directly by Start, leaving only Move(C,D) (preconditions On(C,F), Clear(C), Clear(D)) before Finish (On(C,D), On(D,B)).]

4.1 Situated Planning Example (continued)

The move is executed. . .

[Figure: after acting, Start is updated from the new percepts — Ontable(A), On(B,E), On(C,A), On(D,B), Clear(F), Clear(C), Clear(D), Clear(G) — so On(C,D) is again an open condition for Finish.]

. . . but it failed (c). Replan:

[Figure: a fresh Move(C,D) step (preconditions On(C,A), Clear(C), Clear(D)) is added to re-achieve On(C,D); On(D,B) is still supplied directly by Start.]

4.1 Situated Planning Example (continued)

The move is executed and succeeds (d):

[Figure: Start now asserts Ontable(A), On(B,E), On(C,D), On(D,B), Clear(F), Clear(C), Clear(A), Clear(G); both Finish preconditions On(C,D) and On(D,B) are supported directly, so the goal is achieved.]

The End

Monday Week 11 Seminar Room 2.28 3pm — 4pm Seminars 4pm — 5pm CheX World Cup

Wednesday Week 11 Seminar Room 1.24 3pm — 4pm Seminars 4pm — 5pm CheX World Cup

Monday Week 12 Seminar Room 2.28 3pm — 4.30pm Seminars

Wednesday Week 12 Seminar Room 1.24 3pm — 4.30pm Seminars

Wednesday Week 13 Seminar Room 1.24 3pm — 3.30pm Seminars 3.30pm — 5pm CheX World Cup Finals
