AgentKnowledgeRepresentation

What is knowledge?

● "the fact or condition of knowing something with familiarity gained through experience or association" (Merriam-Webster 1988)
● put simply: associations of statements about the world

Why represent knowledge?

● for reasoning
● for learning
● for sharing
● for ease of transfer to computers
● for ease of explanation to humans

What to represent?

● the environment and the domain
● the task
● the agent itself
● the user
● the other agents

Viewpoints:

● structural
● functional
● behavioral
● causal
● ...

formalisms, models and notations:

● FormalLogic, expert rules

● SemanticNets, frames, OO, SharedOntology and the SemanticWeb

● FiniteStateMachines


● PetriNets

● BayesianNetworks

● decision trees
● task decomposition (TCA)
● neural nets

exercise:

Use the different models to represent, very simply, the different types of knowledge of a soccer player: his behavior on the field, his knowledge of the soccer rules, his perception of the field, his tactics, his model of the other players... Discuss the choice of the models.

reference:

Knowledge Representation: Logical, Philosophical, and Computational Foundations, John F. Sowa, Brooks/Cole.

(last edited September 16, 2004)

http://eureka.sce.carleton.ca:8080/cgi-bin/agentcourse.cgi?AgentKnowledgeRepresentation

FormalLogic

Informal definition:

A logic allows us to assert statements about the world and to prove properties of those statements, given:

● a set of primitive notations, and rules to write sentences by combining them
● an ontology to define the terms
● a syntax that provides the tools to establish the validity of the inference of a statement:
❍ axioms
❍ rules of inference
● semantics that provide the tools for the interpretation of the sentences

Propositional logic:

● boolean propositions, and, or, negation
● semantics: truth tables (true, false)
● syntax: a set of standard axioms plus rules of inference
● good for theorem proving but poor expressiveness

example:

heads => me
tails => ~you
~heads => tails
you => ~me
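Because the semantics are just truth tables, entailment in propositional logic can be checked by brute force. A small sketch (the query `you => tails` is my own) that enumerates every truth assignment for the four rules above:

```python
from itertools import product

# Brute-force truth-table check of the four propositional rules above.
# Variables: heads, tails, me, you; "=>" is material implication.
def implies(a, b):
    return (not a) or b

rules = [
    lambda h, t, m, y: implies(h, m),          # heads => me
    lambda h, t, m, y: implies(t, not y),      # tails => ~you
    lambda h, t, m, y: implies(not h, t),      # ~heads => tails
    lambda h, t, m, y: implies(y, not m),      # you => ~me
]

# Models: the truth assignments that satisfy all four rules
models = [v for v in product([True, False], repeat=4)
          if all(rule(*v) for rule in rules)]

# A sentence is entailed iff it holds in every model
print(all(implies(y, t) for (h, t, m, y) in models))   # does "you => tails" follow?
```

This is exactly why propositional logic is "good for theorem proving": the procedure always terminates, at the cost of enumerating 2^n assignments.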

First order logic:

● predicates that can contain constants, variables and functions, plus existential and universal quantifiers
● good expressiveness (it enables expressing generality, for example) but theorem proving is difficult (i.e. VERY long) and sometimes impossible
● Prolog uses resolution and unification mechanisms on a subset of first order logic

example:


Different ways to represent the sentence "Every trailer truck has 18 wheels":

∀x, trailerTruck(x) => eighteenWheeler(x)

∀x, trailerTruck(x) => numberOfWheels(x, 18)

(∀x)((truck(x) and (∃y)(trailer(y) and part(x,y))) => (∃s)(set(s) and count(s,18) and (∀w)(member(w,s) => part(x,w))))

more:

● conceptual graphs
● ModalLogic

● higher order logics

ref:

● Genesereth's excellent course on logics [1]

● Sowa's book on knowledge representation

(last edited January 14, 2002)


ModalLogic

motivation: limitations of classical logic (example from Wooldridge):

● Bel(Janine, Father(Zeus, Cronos))
● (Zeus = Jupiter)
● yet Bel(Janine, Father(Jupiter, Cronos)) does not follow: we cannot substitute equals inside a belief context.
● Janine believing p does not depend on the truth of p.

Normal Modal Logic for Possible Worlds:

● classical propositional logic extended with two operators: □ ("necessarily") and ◇ ("possibly")
● the semantics of the modal connectives define which worlds are considered accessible from other worlds
● the formula □f is TRUE if f is TRUE in all worlds accessible from the current world
● the formula ◇f is TRUE if f is TRUE in at least one world accessible from the current world
● the two modal operators are duals of each other:

◇f = ~□~f

The modal operators can be used in various contexts and take on specific meanings there, e.g. temporal logic, deontic logic, epistemic logic.

For our purpose, we will use modal operators in the context of epistemic logic (the logic of knowledge). From the point of view of an agent A (we will not represent A, to keep the notation simple), □f will be read as "A believes f".

example: An agent is playing poker. Complete knowledge of the opponent's hand is impossible to obtain, so the ability to play is determined partially by the agent's beliefs about the opponent's hand. Suppose the agent has the Ace of Spades. First compute all the possible ways the cards could be dealt to the opponent: these are the possible worlds. Then eliminate those worlds that are not possible given what the agent knows. What is left over are the epistemic alternatives (the worlds possible given one's beliefs). Something TRUE in all these worlds is said to be believed by the agent (it is TRUE that the agent has the Ace of Spades).

more:
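The possible-worlds procedure of this example can be sketched in a few lines; the tiny 6-card deck and the observed discard are my own simplifications, not from the course notes:

```python
from itertools import combinations

# Toy deck for the poker example: A, K, Q of spades and hearts
deck = [r + s for r in "AKQ" for s in "SH"]   # AS, AH, KS, KH, QS, QH
agent_hand = {"AS"}                           # the agent holds the Ace of Spades

# Possible worlds: every 2-card hand the opponent could have been dealt
worlds = [set(h) for h in combinations(sorted(set(deck) - agent_hand), 2)]

# Epistemic alternatives: eliminate worlds inconsistent with what the agent
# knows (here: suppose the agent saw the opponent discard the King of Hearts)
alternatives = [w for w in worlds if "KH" not in w]

def believes(prop):
    """The agent believes prop iff prop is TRUE in every epistemic alternative."""
    return all(prop(w) for w in alternatives)

print(believes(lambda w: "AS" not in w))  # True: the opponent cannot hold AS
print(believes(lambda w: "KS" in w))      # False: fails in some alternatives
```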


SemanticNets

definitions

A semantic net is a graph of concepts connected with associative links that describe the relationships between them.

Frames refine semantic nets by allowing the description of each concept with a set of slot/filler pairs, and by restricting the types of links. A filler can either be a value or a procedure.

Mammal
  subclass: Animal
  warm_blooded: yes


Elephant
  subclass: Mammal
  * colour: grey
  * size: large

Clyde
  instance: Elephant
  colour: pink
  owner: Fred

Nellie
  instance: Elephant
  size: small
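The frames above can be sketched as nested dictionaries, with slot lookup climbing the instance/subclass links so that local values (Clyde's pink) override inherited defaults (the Elephant's grey). The lookup strategy here is an assumption of the sketch:

```python
# A minimal frame system with default inheritance, encoding the
# Mammal/Elephant/Clyde/Nellie example above.
frames = {
    "Mammal":   {"subclass": "Animal", "warm_blooded": "yes"},
    "Elephant": {"subclass": "Mammal", "colour": "grey", "size": "large"},
    "Clyde":    {"instance": "Elephant", "colour": "pink", "owner": "Fred"},
    "Nellie":   {"instance": "Elephant", "size": "small"},
}

def lookup(name, slot):
    """Return a slot value, falling back to inherited defaults."""
    while name in frames:
        frame = frames[name]
        if slot in frame:
            return frame[slot]          # a local value wins over any default
        # otherwise climb the instance/subclass link
        name = frame.get("instance") or frame.get("subclass")
    return None

print(lookup("Clyde", "colour"))         # pink (local value overrides the default)
print(lookup("Nellie", "colour"))        # grey (inherited from Elephant)
print(lookup("Nellie", "warm_blooded"))  # yes  (inherited from Mammal)
```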

ref:

[3]

more:

SemanticWeb and ResourceDescriptionFramework

(last edited January 15, 2004)


SharedOntology

short definition of an ontology: a "specification of a conceptualization" (Gruber).

in plain English: ontology is "the study of existence, of all the kinds of entities (abstract and concrete) that make up the world." (Sowa)

Ontologies describe concepts and the relationships between them. Think of an ontology as an encyclopedia or a glossary of terms.

Ontologies constitute the backbone of knowledge representation, because they provide:

● the predicates in logics,
● the classes of an object-oriented system,
● the domains of a database,


● the concepts of a semantic network.

For agents to be able to communicate they need to share at least the ontology of the domain they are involved in.

A few known efforts in sharing ontologies:

● Knowledge Interchange Format (KIF) [1], Ontolingua.

● XML, OML [2][4]

● SemanticWeb, DAML, OIL

● CYC [3]

(last edited October 20, 2002)


SemanticWeb

● XML
● XML Schema
● ResourceDescriptionFramework
● Dublin Core
● RDFS
● OIL, DAML, OWL
● DAML-S, OWL-S

Neal Arthorne's presentation [1] puts these concepts nicely together.

references

● official site: http://www.w3.org/2001/sw/

● a good overview by Tim Berners-Lee: http://www.w3.org/DesignIssues/Semantic.html

● http://semanticweb.org

● Protege is a semantic web editing tool: http://protege.semanticweb.org/

(last edited September 21, 2004)


ResourceDescriptionFramework

(The RDF example describes a resource: a person named John Smith, with mailbox [email protected] and company Home, Inc.)
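A description like the one above is written in RDF/XML roughly as follows. This is a sketch, not the reference's exact markup: the contact namespace, property names, and resource URI are my assumptions, and the mailbox address (elided in the source) is left elided:

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://example.org/people/JohnSmith">
    <contact:fullName>John Smith</contact:fullName>
    <contact:mailbox rdf:resource="mailto:..."/>
    <contact:company>Home, Inc.</contact:company>
  </contact:Person>
</rdf:RDF>
```

Each nested element is one (subject, predicate, object) triple: the resource identified by the rdf:about URI is the subject of every property inside it.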

ref:

● example taken from the following introduction to RDF [4]

● Pierre-Antoine Champin's RDF tutorial [2]


FiniteStateMachines

Finite State Machines, or Finite Automata, are a simple type of state transition diagram. Here is a formal definition.

definition (adapted from Sipser)

A finite automaton is a 5-tuple (Q, Σ, δ, q0, F), where:

1. Q is a finite set called the states,
2. Σ is a finite set called the alphabet,
3. δ: Q × Σ -> Q is the transition function,
4. q0, an element of Q, is the start state,
5. F, a subset of Q, is the set of accept states.
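The definition maps directly onto a transition table. The course figure for the example machine is not reproduced in this text, so the machine below is a hypothetical one chosen to match the stated behaviour (it accepts strings over {a, b} that end in "aa", hence accepts abbaaa and rejects aba):

```python
# A DFA as a transition table delta: (state, symbol) -> state.
# Hypothetical machine: accepts strings over {a, b} ending in "aa".
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q0",
    ("q2", "a"): "q2", ("q2", "b"): "q0",
}
start, accept = "q0", {"q2"}

def accepts(s):
    """Run the DFA over s; accept iff it halts in an accept state."""
    state = start
    for symbol in s:
        state = delta[(state, symbol)]
    return state in accept

print(accepts("abbaaa"))  # True
print(accepts("aba"))     # False
```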

The state machine above accepts the string abbaaa, but does not accept aba.

example


(figures: a basic call in telephony; an originating IN call)


use

● to represent change
● pattern recognition

implementation

● a table (door controller example)
● a state variable + a "switch... case" statement
● the State Pattern (see: [6] or [7])

more

● statecharts [8]

● Introduction to the Theory of Computation, Michael Sipser, International Thomson Publishing.

(last edited September 21, 2004)


State

Allow an object to alter its behavior when its internal state changes. The object will appear to change its class.

Applicability

● Need to change an object's behavior when it changes its state
● Operations have large, multipart conditional statements that depend on the object's state

Consequences

● Localizes state-specific behavior and partitions behavior for different states
● Makes state transitions explicit
● State objects can be shared if they have no intrinsic state (see Flyweight).
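As a sketch of the pattern, here is the door-controller example mentioned under FiniteStateMachines done with state objects instead of a switch on a state variable (the class and method names are my own):

```python
# State pattern sketch: a door controller whose behaviour is delegated
# to its current state object.
class DoorState:
    def open(self, door): pass      # default: ignore the event
    def close(self, door): pass

class Open(DoorState):
    def close(self, door):
        door.state = Closed()       # state-specific transition lives here

class Closed(DoorState):
    def open(self, door):
        door.state = Open()

class Door:
    def __init__(self):
        self.state = Closed()
    # delegate to the current state object (no conditional on a state variable)
    def open(self):
        self.state.open(self)
    def close(self):
        self.state.close(self)

door = Door()
door.open()
print(type(door.state).__name__)    # Open
door.close()
print(type(door.state).__name__)    # Closed
```

Note that Open and Closed carry no intrinsic state, so single shared instances could serve every door, which is the Flyweight connection mentioned above.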

http://pages.cpsc.ucalgary.ca/~kremer/patterns/state.html

Flyweight

Use sharing to support large numbers of fine-grained objects efficiently.

Applicability

● An application uses a large number of objects
● Storage costs are high because of the sheer quantity of objects
● Most object state can be made extrinsic
● Many groups of objects may be replaced by relatively few shared objects once extrinsic state is removed
● The application doesn't depend on object identity

Consequences

● Reduces the number of instances
● Most of the state must be extrinsic

http://pages.cpsc.ucalgary.ca/~kremer/patterns/flyweight.html

PetriNets

definition: places, transitions, tokens...

see:

● http://www.cis.um.edu.mt/~jskl/petri.html

use:

● to model concurrent processes
● to simulate and verify distributed systems and protocols

example:

● a bus stop

● the dining philosophers


The location of the tokens reflects the state of the system.

tools:

[3]

refs:

[4]

more:

● Use case maps [5] [6]

● UML activity diagrams [7]

(last edited September 21, 2004)


Petri Networks


Definition and basic ideas

Definition

The MARKED PETRI NET (P-net) is a 5-tuple (P, T, I, O, M) where:

P and T are non-empty finite sets of PLACES and TRANSITIONS respectively (the sets P and T have no common elements).

I is the so-called input function, I: P×T -> N, where N is the set of non-negative integers. The value I(p,t) is the number of (directed) ARCs from the place p to the transition t.

O is the so-called output function, O: T×P -> N, where the value O(t,p) is the number of arcs from the transition t to the place p. So the 4-tuple (P, T, I, O) is a bipartite (directed) multigraph whose arcs connect nodes from two distinct sets (P and T).

M is the initial marking of places, M: P -> N, where the value M(p) is the number of so-called TOKENS located in the place p.

Interpretation

A transition is ENABLED if each of its input places contains at least as many tokens as there are arcs from that place to the transition. A transition that is not enabled is DISABLED. An enabled transition may be FIRED. During firing, every arc whose endpoint is the transition REMOVES one token from its starting (input) place, and every arc starting at the transition ADDS one token to its ending (output) place.

From a modeling point of view, places represent conditions that must be true for certain activities (transition firings) to start. Firing changes the markings of both input and output places, which may in turn enable or disable other transitions, and so on. All enabled transitions that are not mutually exclusive may be fired in parallel.

There are many extensions to the basic P-net theory that improve the modeling power of network models. Petri nets have both analytical and descriptive (modeling) capabilities. For more details, and for a classification of the various types of Petri and related networks, visit the World of Petri Nets - see Reference.
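The firing rule follows directly from the (P, T, I, O, M) definition and can be sketched in a few lines; the one-transition net used here is my own toy example:

```python
# Toy P-net with one transition t1 moving a token from p1 to p2.
I = {("p1", "t1"): 1}          # input function:  arcs place -> transition
O = {("t1", "p2"): 1}          # output function: arcs transition -> place
M = {"p1": 1, "p2": 0}         # initial marking

def enabled(t, marking):
    """A transition is enabled if every input place holds at least as many
    tokens as there are arcs from that place to the transition."""
    return all(marking[p] >= n for (p, tr), n in I.items() if tr == t)

def fire(t, marking):
    """Fire an enabled transition: remove tokens along input arcs,
    add tokens along output arcs; return the new marking."""
    assert enabled(t, marking)
    m = dict(marking)
    for (p, tr), n in I.items():
        if tr == t:
            m[p] -= n
    for (tr, p), n in O.items():
        if tr == t:
            m[p] += n
    return m

print(enabled("t1", M))   # True
print(fire("t1", M))      # {'p1': 0, 'p2': 1}
```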

Drawing


Places are drawn as circles that are either empty or contain some number of tokens. For practical reasons, PetriSim does not draw tokens as small filled circles; instead every place contains an integer (non-negative) number of tokens. The name of a place is displayed at its bottom right. Transitions are drawn as short thick lines; PetriSim displays the name at the bottom right of the transition and allows either a horizontal (default) or vertical orientation. Enabled transitions are displayed with a mark. Arcs are lines ending in an arrow; the points where arcs start and end are defined by the user, and PetriSim allows arcs of any length and shape.


Example

The following picture (created by PetriSim) is an example that models a cooperation between two processes called Producer and Consumer:

The Producer prepares data and writes them to buffers. If there is no empty buffer, the Producer must wait. The Consumer reads the data supplied by the Producer. The initial marking of the place Empty_buffers is the total number of buffers available (initially all the buffers are empty). The place Semaphore ensures that only one process can work with the data at a time. After reading the data, the Consumer returns the empty buffer.

The above picture shows the P-net in its initial state: the Producer is ready to start writing data, all buffers are empty, and the Consumer is ready to accept data. Note that the transition Producer_works is enabled because of the token in the place Producer_starts.

Next picture shows the P-net after firing the transition Producer_works. This picture has been taken in PetriSim Simulation mode.

Note how firing moved the token from the place Producer_starts to the place Data_ready. This has enabled the transition Writing_data. Its firing in the next step moves the token back to the place Producer_starts, so the Producer is again in its initial state. Firing Writing_data also moves one token from Empty_buffers to Data_in_buffer. A token from the place Semaphore is taken only for the duration of this firing (see next picture).

The following picture shows a refinement of the above model. Here both writing and reading data may take some time, which is represented by another two places. There are now two transitions per operation representing start and finish of the operation:

http://www.cis.um.edu.mt/~jskl/petri.html (3 of 5)9/23/2005 3:42:32 PM Petri Networks (jskl)

In this model the use of Semaphore is obvious: either reading or writing takes the token from Semaphore, and the other operation must wait until the successful one is completed. Note that there are two enabled transitions. The one to be fired in the next step is either selected by the user or chosen randomly by PetriSim. This is how P-nets cope with parallelism: in the real situation both operations (preparing data by the Producer and reading data by the Consumer) are independent and may be carried out in parallel.

PetriSim manual contains some more Examples.

Reference

This page contains an extract from the PetriSim manual. For more information visit the World of Petri Nets page that contains many useful links and references.


BayesianNetworks

(adapted from Ramiro Liscano's lecture and from Russell & Norvig)

context: an agent needs to make decisions under uncertainty. One way to do that is to calculate all the probable outcomes of its decisions and to choose the decision that yields the most desirable outcome with the highest probability. That is the basis of decision theory.

introduction: first, some ProbabilityBasics

definition:

A Bayesian network consists of the following

● A set of variables and a set of directed edges between variables.
● Each variable has a finite set of mutually exclusive states.
● The variables together with the directed edges form a directed acyclic graph (DAG).
● To each variable A with parents B1, ..., Bn there is attached a conditional probability table P(A | B1, ..., Bn).


properties:

● calculation of the joint probability distribution: using the product rule and the property of conditional independence, we can calculate any joint probability distribution rather easily.
● d-separation
● inference (see [3])

modelling tips: see BayesNetModeling

more

● course notes [3] [8]

● learning bayesian networks [5]

● Partially Observable Markov Decision Processes (POMDP)... for planning. [4]

(last edited September 23, 2004)


BayesNetModeling

Building Models

The building of Bayesian network models involves two parts:

● building the structure of the network
● computing the conditional probabilities

Structure of the Network

Here are some tips:

● Identify the hypothesis events. The purpose of the model is to give estimates for those variables that are not observable (i.e. the hypothesis events).
● Identify the number of variables needed to encapsulate the hypothesis events.
● Identify the information channels: the obtainable information that reveals something about the state of the hypotheses.
● Identify the causal structure of the network: which variables have an impact on which other events.

Determining Conditional Probabilities

The two common approaches are based either on well-founded frequency data or on subjective estimates. Either way, one has to understand the domain well, and this takes time and effort. So where do these numbers come from?

Here are several approaches:

● Chain graphs (Lauritzen 1996)
● Noisy-OR (Pearl 1986)
● Causal Independence (Heckerman 1993)
● Divorcing (MUNIN, Andreassen et al. 1989)
● Learning Algorithms (Batch Learning)

References

Lauritzen, S. L. 1996. Graphical Models. Oxford: Oxford University Press.

Pearl, J. 1986. Fusion, propagation, and structuring in belief networks. Artificial Intelligence 29(3), 241-88.

Heckerman, D. 1993. Causal Independence, knowledge acquisition, and inference. In Proc. of the ninth conference on uncertainty in artificial intelligence, D. Heckerman & A. Mamdani (eds), 122-7. San Mateo, CA: Morgan Kaufmann.

Andreassen, S., F. V. Jensen, S. K. Andersen, B. Falck, U. Kjaerulff, M. Woldbye, A. Sorensen, A. Rosenfalck, F. Jensen 1989. MUNIN - an expert EMG assistant. In Computer-aided electromyography and expert systems, J. E. Desmedt (ed.), 255-77. Amsterdam: Elsevier Science.

[BayesNetComputationExample | BayesianNetworks | UseOfBayesianNetworks]

(last edited September 15, 1999)


ProbabilityBasics

Before you proceed, make sure you understand the following points (see Russell & Norvig for the answers):

random variables

● P(Cavity) = 0.1 means that the proposition "Cavity" has a 10% chance of being true, assuming nothing else is known about the proposition. It is therefore called a prior (or unconditional) probability.
● the above can be extended to variables whose domain is bigger than just true or false; these are generally called random variables: X, with P(X=1) = 0.1, P(X=2) = 0.2, ...
● the probability distribution of X is then the complete assignment of prior probabilities to each particular value of the variable.
● one can assign probabilities to logical sentences involving one or many variables: P(A and B), P(not(A)), ...

joint probability distribution

● the axioms of probability can then be used to evaluate the probabilities of such logical sentences:
❍ P(true) = 1, P(false) = 0,
❍ P(A or B) = P(A) + P(B) - P(A and B)
● when many random variables are present, we are often interested in the probability of joint events, i.e. the probability of a given assignment of specific values to the variables: P(A=a1, B=b1)
● the joint probability distribution is the complete assignment of prior probabilities to all joint events. Using the joint and the axioms of probability, one can calculate all needed probabilities: P(A), P(A=a1 or B=b2), ...
● but even in the simple case of boolean variables, n variables already require 2^n assignments in the joint, which is not practical.

conditional probability

● cause vs evidence: P(Cavity|Toothache) (Cavity given Toothache) vs P(Toothache|Cavity) (Toothache given Cavity): we usually have one and need to obtain the other
● the definition P(A|B) = P(A and B)/P(B) leads to the product rule: P(A and B) = P(A|B) × P(B)


● by extending the product rule to more than two variables we obtain the chain rule: P(A,B,C) = P(A|B,C) P(B,C) = P(A|B,C) P(B|C) P(C)
● Bayes rule: P(B|A) = P(A|B) × P(B) / P(A)
● conditional independence: P(X|Y,Z) = P(X|Z) when X and Y are independent given Z: P(Catch|Cavity and Toothache) = P(Catch|Cavity)

ref: the Russell & Norvig book, Prof. Oommen's intro to probability theory.

see also: [1]
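These rules are easy to verify numerically. A sketch with a made-up joint distribution over two boolean variables, checking the product rule and Bayes rule:

```python
# Toy joint distribution P(A, B) over two boolean variables (numbers invented).
P = {
    (True, True): 0.08, (True, False): 0.02,
    (False, True): 0.32, (False, False): 0.58,
}

# Marginals obtained by summing the joint over the other variable
def p_a(a):
    return sum(v for (x, b), v in P.items() if x == a)

def p_b(b):
    return sum(v for (a, x), v in P.items() if x == b)

# Definition: P(A|B) = P(A and B) / P(B)
p_a_given_b = P[(True, True)] / p_b(True)

# Product rule: P(A and B) = P(A|B) P(B)
print(abs(p_a_given_b * p_b(True) - P[(True, True)]) < 1e-9)   # True

# Bayes rule: P(B|A) = P(A|B) P(B) / P(A)
p_b_given_a = p_a_given_b * p_b(True) / p_a(True)
print(round(p_b_given_a, 2))   # 0.8 (= P(A and B)/P(A) = 0.08/0.10)
```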

(last edited September 20, 2005)


A Brief Introduction to Graphical Models and Bayesian Networks

By Kevin Murphy, 1998.

"Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity -- and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity -- a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.

Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems." --- Michael Jordan, 1998.

This tutorial

We will briefly discuss the following topics.

● Representation, or, what exactly is a graphical model?

● Inference, or, how can we use these models to efficiently answer probabilistic queries?

● Learning, or, what do we do if we don't know what the model is?

● Decision theory, or, what happens when it is time to convert beliefs into actions?

● Applications, or, what's this all good for, anyway?

Note: Dan Hammerstrom has made a pdf version of this web page. I also have a closely related tutorial in postscript or pdf format.

Articles in the popular press

The following articles provide less technical introductions.

● LA times article (10/28/96) about Bayes nets.

● Economist article (3/22/01) about Microsoft's application of BNs.

Other sources of technical information

● My tutorial on Bayes rule

● AUAI homepage (Association for Uncertainty in Artificial Intelligence)

● The UAI mailing list

● UAI proceedings.

● My list of recommended reading .

● Bayes Net software packages

● My Bayes Net Toolbox for Matlab

● Tutorial slides on graphical models and BNT, presented to the Mathworks, May 2003

● List of other Bayes net tutorials

http://www.cs.ubc.ca/~murphyk/Bayes/bayes.html

Representation

Probabilistic graphical models are graphs in which nodes represent random variables, and the (lack of) arcs represent conditional independence assumptions. Hence they provide a compact representation of joint probability distributions. Undirected graphical models, also called Markov Random Fields (MRFs) or Markov networks, have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given a third set, C, if all paths between the nodes in A and B are separated by a node in C. By contrast, directed graphical models also called Bayesian Networks or Belief Networks (BNs), have a more complicated notion of independence, which takes into account the directionality of the arcs, as we explain below.

Undirected graphical models are more popular with the physics and vision communities, and directed models are more popular with the AI and statistics communities. (It is possible to have a model with both directed and undirected arcs, which is called a chain graph.) For a careful study of the relationship between directed and undirected graphical models, see the books by Pearl88, Whittaker90, and Lauritzen96.

Although directed models have a more complicated notion of independence than undirected models, they do have several advantages. The most important is that one can regard an arc from A to B as indicating that A ``causes'' B. (See the discussion on causality.) This can be used as a guide to construct the graph structure. In addition, directed models can encode deterministic relationships, and are easier to learn (fit to data). In the rest of this tutorial, we will only discuss directed graphical models, i.e., Bayesian networks.

In addition to the graph structure, it is necessary to specify the parameters of the model. For a directed model, we must specify the Conditional Probability Distribution (CPD) at each node. If the variables are discrete, this can be represented as a table (CPT), which lists the probability that the child node takes on each of its different values for each combination of values of its parents. Consider the following example, in which all nodes are binary, i.e., have two possible values, which we will denote by T (true) and F (false).

We see that the event "grass is wet" (W=true) has two possible causes: either the water sprinkler is on (S=true) or it is raining (R=true). The strength of this relationship is shown in the table. For example, we see that Pr(W=true | S=true, R=false) = 0.9 (second row), and hence, Pr(W=false | S=true, R=false) = 1 - 0.9 = 0.1, since each row must sum to one. Since the C node has no parents, its CPT specifies the prior probability that it is cloudy (in this case, 0.5).


The simplest conditional independence relationship encoded in a Bayesian network can be stated as follows: a node is independent of its ancestors given its parents, where the ancestor/parent relationship is with respect to some fixed topological ordering of the nodes.

By the chain rule of probability, the joint probability of all the nodes in the graph above is

P(C, S, R, W) = P(C) * P(S|C) * P(R|C,S) * P(W|C,S,R)

By using conditional independence relationships, we can rewrite this as

P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)

where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.

We can see that the conditional independence relationships allow us to represent the joint more compactly. Here the savings are minimal, but in general, if we had n binary nodes, the full joint would require O(2^n) space to represent, but the factored form would require O(n 2^k) space to represent, where k is the maximum fan-in of a node. And fewer parameters makes learning easier.

Are "Bayesian networks" Bayesian?

Despite the name, Bayesian networks do not necessarily imply a commitment to Bayesian statistics. Indeed, it is common to use frequentist methods to estimate the parameters of the CPDs. Rather, they are so called because they use Bayes' rule for probabilistic inference, as we explain below. (The term "directed graphical model" is perhaps more appropriate.) Nevertheless, Bayes nets are a useful representation for hierarchical Bayesian models, which form the foundation of applied Bayesian statistics (see e.g., the BUGS project). In such a model, the parameters are treated like any other random variable, and become nodes in the graph.

Inference

The most common task we wish to solve using Bayesian networks is probabilistic inference. For example, consider the water sprinkler network, and suppose we observe the fact that the grass is wet. There are two possible causes for this: either it is raining, or the sprinkler is on. Which is more likely? We can use Bayes' rule to compute the posterior probability of each explanation (where 0==false and 1==true):

Pr(S=1|W=1) = Pr(S=1,W=1) / Pr(W=1) = 0.2781/0.6471 = 0.430
Pr(R=1|W=1) = Pr(R=1,W=1) / Pr(W=1) = 0.4581/0.6471 = 0.708

where Pr(W=1) = 0.6471 is a normalizing constant, equal to the probability (likelihood) of the data. So we see that it is more likely that the grass is wet because it is raining: the likelihood ratio is 0.7079/0.4298 = 1.647.
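These posteriors can be reproduced by brute-force enumeration of the joint. The CPT values below are the standard ones for this water-sprinkler example (the figure with the tables is not reproduced in this text), and they yield the numbers quoted here, including the explaining-away value given further on:

```python
from itertools import product

# CPTs for the water-sprinkler network C -> {S, R} -> W.
P_C = {True: 0.5, False: 0.5}                                    # P(C)
P_S = {True: {True: 0.1, False: 0.9},                            # P(S|C=T)
       False: {True: 0.5, False: 0.5}}                           # P(S|C=F)
P_R = {True: {True: 0.8, False: 0.2},                            # P(R|C=T)
       False: {True: 0.2, False: 0.8}}                           # P(R|C=F)
P_W = {(True, True): 0.99, (True, False): 0.9,                   # P(W=1|S,R)
       (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    """Factored joint: P(C) * P(S|C) * P(R|C) * P(W|S,R)."""
    pw = P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return P_C[c] * P_S[c][s] * P_R[c][r] * pw

def posterior(query, evidence):
    """P(query | evidence) by brute-force enumeration of all 16 worlds."""
    num = den = 0.0
    for c, s, r, w in product([True, False], repeat=4):
        world = {"C": c, "S": s, "R": r, "W": w}
        p = joint(c, s, r, w)
        if all(world[k] == v for k, v in evidence.items()):
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

print(round(posterior({"S": True}, {"W": True}), 4))             # 0.4298
print(round(posterior({"R": True}, {"W": True}), 4))             # 0.7079
print(round(posterior({"S": True}, {"W": True, "R": True}), 4))  # 0.1945
```

The last line is the explaining-away effect: once rain is also observed, the posterior on the sprinkler drops well below Pr(S=1|W=1).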

Explaining away

In the above example, notice that the two causes "compete" to "explain" the observed data. Hence S and R become conditionally dependent given that their common child, W, is observed, even though they are marginally independent. For example, suppose the grass is wet, but that we also know that it is raining. Then the posterior probability that the sprinkler is on goes down:


Pr(S=1|W=1,R=1) = 0.1945

This is called "explaining away". In statistics, this is known as Berkson's paradox, or "selection bias". For a dramatic example of this effect, consider a college which admits students who are either brainy or sporty (or both!). Let C denote the event that someone is admitted to college, which is made true if they are either brainy (B) or sporty (S). Suppose in the general population, B and S are independent. We can model our conditional independence assumptions using a graph which is a V structure, with arrows pointing down:

B   S
 \ /
  v
  C

Now look at a population of college students (those for which C is observed to be true). It will be found that being brainy makes you less likely to be sporty and vice versa, because either property alone is sufficient to explain the evidence on C (i.e., P(S=1 | C=1, B=1) <= P(S=1 | C=1)). (If you don't believe me, try this little BNT demo!)
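The college example is easy to verify numerically. In this sketch the priors are invented for illustration: B and S are independent fair coins, and C = B OR S deterministically:

```python
from itertools import product

# Assumed illustrative priors: brainy (B) and sporty (S) are independent
# fair coin flips in the general population; admission C = B OR S.
p = {}  # joint distribution over (B, S, C)
for b, s in product([0, 1], repeat=2):
    c = int(b or s)
    p[(b, s, c)] = 0.5 * 0.5

def cond(event, given):
    """P(event | given) by summing the joint."""
    num = sum(pr for bsc, pr in p.items() if event(*bsc) and given(*bsc))
    den = sum(pr for bsc, pr in p.items() if given(*bsc))
    return num / den

p_s_c = cond(lambda b, s, c: s == 1, lambda b, s, c: c == 1)              # 2/3
p_s_cb = cond(lambda b, s, c: s == 1, lambda b, s, c: c == 1 and b == 1)  # 1/2
```

Conditioning on B=1 as well as C=1 lowers the probability of S=1 from 2/3 to 1/2, exactly the "explaining away" effect described above.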

Top-down and bottom-up reasoning

In the water sprinkler example, we had evidence of an effect (wet grass), and inferred the most likely cause. This is called diagnostic, or "bottom up", reasoning, since it goes from effects to causes; it is a common task in expert systems. Bayes nets can also be used for causal, or "top down", reasoning. For example, we can compute the probability that the grass will be wet given that it is cloudy. Hence Bayes nets are often called "generative" models, because they specify how causes generate effects.

Causality

One of the most exciting things about Bayes nets is that they can be used to put discussions about causality on a solid mathematical basis. One very interesting question is: can we distinguish causation from mere correlation? The answer is "sometimes", but you need to measure the relationships between at least three variables; the intuition is that one of the variables acts as a "virtual control" for the relationship between the other two, so we don't always need to do experiments to infer causality. See the following books for details.

● "Causality: Models, Reasoning and Inference", Judea Pearl, 2000, Cambridge University Press.

● "Causation, Prediction and Search", Spirtes, Glymour and Scheines, 2001 (2nd edition), MIT Press.

● "Cause and Correlation in Biology", Bill Shipley, 2000, Cambridge University Press.

● "Computation, Causation and Discovery", Glymour and Cooper (eds), 1999, MIT Press.

Conditional independence in Bayes Nets

In general, the conditional independence relationships encoded by a Bayes Net are best explained by means of the "Bayes Ball" algorithm (due to Ross Shachter), which is as follows: Two (sets of) nodes A and B are conditionally independent (d-separated) given a set C if and only if there is no way for a ball to get from A to B in the graph, where the allowable movements of the ball are shown below. Hidden nodes are nodes whose values are not known, and are depicted as unshaded; observed nodes (the ones we condition on) are shaded. The dotted arcs indicate direction of flow of the ball.


The most interesting case is the first column, when we have two arrows converging on a node X (so X is a "leaf" with two parents). If X is hidden, its parents are marginally independent, and hence the ball does not pass through (the ball being "turned around" is indicated by the curved arrows); but if X is observed, the parents become dependent, and the ball does pass through, because of the explaining away phenomenon. Notice that, if this graph were undirected, the child would always separate the parents; hence when converting a directed graph to an undirected graph, we must add links between "unmarried" parents who share a common child (i.e., "moralize" the graph) to prevent us reading off incorrect independence statements.

Now consider the second column in which we have two diverging arrows from X (so X is a "root"). If X is hidden, the children are dependent, because they have a hidden common cause, so the ball passes through. If X is observed, its children are rendered conditionally independent, so the ball does not pass through. Finally, consider the case in which we have one incoming and outgoing arrow to X. It is intuitive that the nodes upstream and downstream of X are dependent iff X is hidden, because conditioning on a node breaks the graph at that point.
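As a sketch, the d-separation test can be implemented with the standard "moralized ancestral graph" formulation, which is equivalent to Bayes Ball: restrict the graph to A, B, C and their ancestors, marry co-parents and drop arrow directions, delete the conditioning nodes, and check reachability. The sprinkler network below follows the running example:

```python
def d_separated(parents, A, B, C):
    """Check whether node sets A and B are d-separated given C in a DAG.

    `parents` maps each node to its list of parents. Equivalent to the
    Bayes Ball test: restrict to the ancestral graph of A, B, C, moralize
    (marry co-parents, drop directions), delete C, and test reachability.
    """
    # 1. Ancestral subgraph of A, B, C (closed under taking parents).
    keep, stack = set(), list(A | B | C)
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents[n])
    # 2. Moralize: undirected parent-child edges plus co-parent marriages.
    adj = {n: set() for n in keep}
    for n in keep:
        ps = parents[n]
        for pa in ps:
            adj[n].add(pa); adj[pa].add(n)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # 3. Delete the conditioning set and test whether A can reach B.
    seen, stack = set(), [a for a in A if a not in C]
    while stack:
        n = stack.pop()
        if n in B:
            return False  # path found: not d-separated
        if n not in seen:
            seen.add(n)
            stack.extend(m for m in adj[n] if m not in C and m not in seen)
    return True

sprinkler = {'C': [], 'S': ['C'], 'R': ['C'], 'W': ['S', 'R']}
```

For example, S and R are d-separated given C, but conditioning on the common child W (explaining away) makes them dependent again.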

Bayes nets with discrete and continuous nodes

The introductory example used nodes with categorical values and multinomial distributions. It is also possible to create Bayesian networks with continuous valued nodes. The most common distribution for such variables is the Gaussian. For discrete nodes with continuous parents, we can use the logistic/softmax distribution. Using multinomials, conditional Gaussians, and the softmax distribution, we have a rich toolbox for making complex models. Some examples are shown below. For details, click here. (Circles denote continuous-valued random variables, squares denote discrete rv's, clear means hidden, and shaded means observed.)


For more details, see this excellent paper.

● A Unifying Review of Linear Gaussian Models, Sam Roweis & Zoubin Ghahramani. Neural Computation 11(2) (1999) pp. 305-345

Temporal models

Dynamic Bayesian Networks (DBNs) are directed graphical models of stochastic processes. They generalise hidden Markov models (HMMs) and linear dynamical systems (LDSs) by representing the hidden (and observed) state in terms of state variables, which can have complex interdependencies. The graphical structure provides an easy way to specify these conditional independencies, and hence to provide a compact parameterization of the model.

Note that "temporal Bayesian network" would be a better name than "dynamic Bayesian network", since it is assumed that the model structure does not change, but the term DBN has become entrenched. We also normally assume that the parameters do not change, i.e., the model is time-invariant. However, we can always add extra hidden nodes to represent the current "regime", thereby creating mixtures of models to capture periodic non-stationarities. There are some cases where the size of the state space can change over time, e.g., tracking a variable, but unknown, number of objects. In this case, we need to change the model structure over time.

Hidden Markov Models (HMMs)

The simplest kind of DBN is a Hidden Markov Model (HMM), which has one discrete hidden node and one discrete or continuous observed node per slice. We illustrate this below. As before, circles denote continuous nodes, squares denote discrete nodes, clear means hidden, shaded means observed.

We have "unrolled" the model for 4 "time slices" -- the structure and parameters are assumed to repeat as the model is unrolled further. Hence to specify a DBN, we need to define the intra-slice topology (within a slice), the inter-slice topology (between two slices), as well as the parameters for the first two slices. (Such a two-slice temporal Bayes net is often called a 2TBN.)
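Filtering in an HMM (the forwards half of the forwards-backwards algorithm mentioned later) is only a few lines of code. The transition and observation matrices below are invented for illustration:

```python
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def forward_filter(pi, A, B, obs):
    """HMM filtering: P(hidden state at t | observations up to t).

    pi: initial state distribution; A[i][j] = P(state j at t+1 | state i at t);
    B[i][k] = P(observation k | state i); obs: observed symbol indices.
    """
    n = len(pi)
    # t = 0: weight the prior by the first observation's likelihood.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    filtered = [normalize(alpha)]
    for y in obs[1:]:
        # Predict through the transition model, then condition on y.
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][y]
                 for j in range(n)]
        filtered.append(normalize(alpha))
    return filtered

# Invented two-state example.
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]   # transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]   # observation model
beliefs = forward_filter(pi, A, B, obs=[0, 0, 1])
```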

Some common variants on HMMs are shown below.


Linear Dynamical Systems (LDSs) and Kalman filters

A Linear Dynamical System (LDS) has the same topology as an HMM, but all the nodes are assumed to have linear-Gaussian distributions, i.e.,

x(t+1) = A*x(t) + w(t),  w ~ N(0, Q),  x(0) ~ N(init_x, init_V)
y(t)   = C*x(t) + v(t),  v ~ N(0, R)

The Kalman filter is a way of doing online filtering in this model. Some simple variants of LDSs are shown below.

The Kalman filter has been proposed as a model for how the brain integrates visual cues over time to infer the state of the world, although the reality is obviously much more complicated. The main point is not that the Kalman filter is the right model, but that the brain is combining bottom up and top down cues. The figure below is from a paper called "A Kalman Filter Model of the Visual Cortex", by P. Rao, Neural Computation 9(4):721--763, 1997.


More complex DBNs

It is also possible to create temporal models with much more complicated topologies, such as the Bayesian Automated Taxi (BAT) network shown below. (For simplicity, we only show the observed leaves for slice 2. Thanks to Daphne Koller for providing this figure.)


When some of the observed nodes are thought of as inputs (actions), and some as outputs (percepts), the DBN becomes a POMDP. See also the section on decision theory below.

A generative model for generative models

The figure below, produced by Zoubin Ghahramani and Sam Roweis, is a good summary of the relationships between some popular graphical models.


INFERENCE

A graphical model specifies a complete joint probability distribution (JPD) over all the variables. Given the JPD, we can answer all possible inference queries by marginalization (summing out over irrelevant variables), as illustrated in the introduction. However, the JPD has size O(2^n), where n is the number of nodes, and we have assumed each node can have 2 states. Hence summing over the JPD takes exponential time. We now discuss more efficient methods.

Variable elimination

For a directed graphical model (Bayes net), we can sometimes use the factored representation of the JPD to do marginalisation efficiently. The key idea is to "push sums in" as far as possible when summing (marginalizing) out irrelevant terms, e.g., for the water sprinkler network

Pr(W=w) = sum_c sum_s sum_r Pr(C=c) Pr(S=s|C=c) Pr(R=r|C=c) Pr(W=w|S=s,R=r)
        = sum_c Pr(C=c) sum_s Pr(S=s|C=c) sum_r Pr(R=r|C=c) Pr(W=w|S=s,R=r)

Notice that, as we perform the innermost sums, we create new terms, which need to be summed over in turn e.g.,

Pr(W=w) = sum_c Pr(C=c) sum_s Pr(S=s|C=c) T1(c,w,s)

where

T1(c,w,s) = sum_r Pr(R=r|C=c) Pr(W=w|S=s,R=r)

Continuing in this way,

Pr(W=w) = sum_c Pr(C=c) T2(c,w)

where

T2(c,w) = sum_s Pr(S=s|C=c) T1(c,w,s)

● R. McEliece and S. M. Aji, 2000. The Generalized Distributive Law, IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000), pp. 325-343.

● F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001. Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory, February 2001.

The amount of work we perform when computing a marginal is bounded by the size of the largest term that we encounter. Choosing a summation (elimination) ordering to minimize this is NP-hard, although greedy algorithms work well in practice.
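The sum-pushing idea is easy to express in code. The sketch below computes Pr(W=w) for the water sprinkler network by eliminating R, then S, then C, materializing the intermediate terms created by each sum; the CPT values are the standard ones for this example:

```python
# Sprinkler CPTs (C -> S, C -> R, {S, R} -> W), standard values.
p_c = [0.5, 0.5]                        # P(C=c)
p_s = [[0.5, 0.5], [0.9, 0.1]]          # p_s[c][s] = P(S=s | C=c)
p_r = [[0.8, 0.2], [0.2, 0.8]]          # p_r[c][r] = P(R=r | C=c)
p_w1 = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}  # P(W=1 | s, r)

def p_w_eliminate(w):
    """P(W=w), eliminating R, then S, then C (innermost sums first)."""
    def pw(s, r):
        return p_w1[(s, r)] if w == 1 else 1 - p_w1[(s, r)]
    total = 0.0
    for c in (0, 1):
        # Summing out R creates a new term indexed by (c, s).
        t1 = [sum(p_r[c][r] * pw(s, r) for r in (0, 1)) for s in (0, 1)]
        # Summing out S collapses it to a term indexed by c alone.
        t2 = sum(p_s[c][s] * t1[s] for s in (0, 1))
        total += p_c[c] * t2
    return total
```

Each intermediate term here is at most 2x2, so we never touch the full 2^4-entry joint.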

Dynamic programming

If we wish to compute several marginals at the same time, we can use Dynamic Programming (DP) to avoid the redundant computation that would be involved if we used variable elimination repeatedly. If the underlying undirected graph of the BN is acyclic (i.e., a tree), we can use a local message passing algorithm due to Pearl. This is a generalization of the well-known forwards-backwards algorithm for HMMs (chains). For details, see


● "Probabilistic Reasoning in Intelligent Systems", Judea Pearl, 1988, 2nd ed.

● "Fusion and propagation with multiple observations in belief networks", Peot and Shachter, AI 48 (1991) p. 299-318.

If the BN has undirected cycles (as in the water sprinkler example), local message passing algorithms run the risk of double counting, e.g., the information from S and R flowing into W is not independent, because it came from a common cause, C. The most common approach is therefore to convert the BN into a tree, by clustering nodes together, to form what is called a junction tree, and then running a local message passing algorithm on this tree. The message passing scheme could be Pearl's algorithm, but it is more common to use a variant designed for undirected models. For more details, click here

The running time of the DP algorithms is exponential in the size of the largest cluster (these clusters correspond to the intermediate terms created by variable elimination). This size is called the induced width of the graph. Minimizing this is NP-hard.

Approximation algorithms

Many models of interest, such as those with repetitive structure, as in multivariate time-series or image analysis, have large induced width, which makes exact inference very slow. We must therefore resort to approximation techniques. Unfortunately, approximate inference is #P-hard, but we can nonetheless come up with approximations which often work well in practice. Below is a list of the major techniques.

● Variational methods. The simplest example is the mean-field approximation, which exploits the law of large numbers to approximate large sums of random variables by their means. In particular, we essentially decouple all the nodes, and introduce a new parameter, called a variational parameter, for each node, and iteratively update these parameters so as to minimize the cross-entropy (KL distance) between the approximate and true probability distributions. Updating the variational parameters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood. More sophisticated methods are possible, which give tighter lower (and upper) bounds.

● Sampling (Monte Carlo) methods. The simplest kind is importance sampling, where we draw random samples x from P(X), the (unconditional) distribution on the hidden variables, and then weight the samples by their likelihood, P(y|x), where y is the evidence. A more efficient approach in high dimensions is Markov Chain Monte Carlo (MCMC), which includes as special cases Gibbs sampling and the Metropolis-Hastings algorithm.

● "Loopy belief propagation". This entails applying Pearl's algorithm to the original graph, even if it has loops (undirected cycles). In theory, this runs the risk of double counting, but Yair Weiss and others have proved that in certain cases (e.g., a single loop), events are double counted "equally", and hence "cancel" to give the right answer. Belief propagation is equivalent to exact inference on a modified graph, called the universal cover or unwrapped/computation tree, which has the same local topology as the original graph. This is the same as the Bethe and cavity/TAP approaches in statistical physics. Hence there is a deep connection between belief propagation and variational methods that people are currently investigating.

● Bounded cutset conditioning. By instantiating subsets of the variables, we can break loops in the graph. Unfortunately, when the cutset is large, this is very slow. By instantiating only a subset of values of the cutset, we can compute lower bounds on the probabilities of interest. Alternatively, we can sample the cutsets jointly, a technique known as block Gibbs sampling.

● Parametric approximation methods. These express the intermediate summands in a simpler form, e.g., by approximating them as a product of smaller factors. "Minibuckets" and the Boyen-Koller algorithm fall into this category.

Approximate inference is a huge topic: see the references for more details.
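As an illustration of the sampling approach above, here is a minimal Gibbs sampler for the water sprinkler query Pr(S=1|W=1): each unobserved node (C, S, R) is resampled from its conditional given the current values of the others and the evidence W=1. The CPT values are the standard ones for this example:

```python
import random

# Sprinkler CPTs (standard values); grass observed wet: W = 1.
P_C1 = 0.5
P_S1 = {0: 0.5, 1: 0.1}   # P(S=1 | C=c)
P_R1 = {0: 0.2, 1: 0.8}   # P(R=1 | C=c)
P_W1 = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}  # P(W=1 | S, R)

def gibbs_p_s_given_w1(n_samples=30000, burn_in=2000, seed=0):
    rng = random.Random(seed)
    c, s, r = 1, 1, 1   # any starting state consistent with W = 1
    hits = 0
    for i in range(burn_in + n_samples):
        # Resample C from P(C | s, r) (W is not a neighbor of C given S, R).
        p1 = P_C1 * (P_S1[1] if s else 1 - P_S1[1]) * (P_R1[1] if r else 1 - P_R1[1])
        p0 = (1 - P_C1) * (P_S1[0] if s else 1 - P_S1[0]) * (P_R1[0] if r else 1 - P_R1[0])
        c = int(rng.random() < p1 / (p1 + p0))
        # Resample S from P(S | c, r, W=1) ∝ P(S|c) P(W=1|S, r).
        p1 = P_S1[c] * P_W1[(1, r)]
        p0 = (1 - P_S1[c]) * P_W1[(0, r)]
        s = int(rng.random() < p1 / (p1 + p0))
        # Resample R from P(R | c, s, W=1) ∝ P(R|c) P(W=1|s, R).
        p1 = P_R1[c] * P_W1[(s, 1)]
        p0 = (1 - P_R1[c]) * P_W1[(s, 0)]
        r = int(rng.random() < p1 / (p1 + p0))
        if i >= burn_in:
            hits += s
    return hits / n_samples

estimate = gibbs_p_s_given_w1()   # close to the exact posterior 0.4298
```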

Inference in DBNs

The general inference problem for DBNs is to compute P(X(i,t0) | y(:, t1:t2)), where X(i,t) represents the i'th hidden variable at time t and y(:, t1:t2) represents all the evidence between times t1 and t2. (In fact, we often also want to compute joint distributions of variables over one or more time slices.) There are several special cases of interest, illustrated below. The arrow indicates t0: it is X(t0) that we are trying to estimate. The shaded region denotes t1:t2, the available data.


Here is a simple example of inference in an LDS. Consider a particle moving in the plane at constant velocity subject to random perturbations in its trajectory. The new position (x1, x2) is the old position plus the velocity (dx1, dx2) plus noise w.

[  x1(t) ]   [1 0 1 0] [  x1(t-1) ]   [ wx1  ]
[  x2(t) ] = [0 1 0 1] [  x2(t-1) ] + [ wx2  ]
[ dx1(t) ]   [0 0 1 0] [ dx1(t-1) ]   [ wdx1 ]
[ dx2(t) ]   [0 0 0 1] [ dx2(t-1) ]   [ wdx2 ]

We assume we only observe the position of the particle.

[ y1(t) ]   [1 0 0 0] [  x1(t) ]   [ vx1 ]
[ y2(t) ] = [0 1 0 0] [  x2(t) ] + [ vx2 ]
                      [ dx1(t) ]
                      [ dx2(t) ]

Suppose we start out at position (10,10) moving to the right with velocity (1,0). We sampled a random trajectory of length 15. Below we show the filtered and smoothed trajectories.


The mean squared error of the filtered estimate is 4.9; for the smoothed estimate it is 3.2. Not only is the smoothed estimate better, but we know that it is better, as illustrated by the smaller uncertainty ellipses; this can help in e.g., data association problems. Note how the smoothed ellipses are larger at the ends, because these points have seen less data. Also, note how rapidly the filtered ellipses reach their steady-state (Riccati) values. (See my Kalman filter toolbox for more details.)
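A minimal Kalman filter for this constant-velocity model can be sketched as follows; F and H are the transition and observation matrices given above, while Q, R, and the initial covariance are illustrative choices rather than the exact values behind the figure:

```python
import numpy as np

# Constant-velocity model: state = (x1, x2, dx1, dx2), observation = position.
F = np.array([[1., 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]])
H = np.array([[1., 0, 0, 0], [0, 1, 0, 0]])
Q = 0.1 * np.eye(4)   # process noise covariance (illustrative)
R = 1.0 * np.eye(2)   # observation noise covariance (illustrative)

def kalman_filter(ys, x0, V0):
    """Online filtering: return the filtered means and covariances."""
    x, V = x0, V0
    means, covs = [], []
    for y in ys:
        # Predict one step ahead through the dynamics.
        x_pred = F @ x
        V_pred = F @ V @ F.T + Q
        # Update with the measurement y.
        S = H @ V_pred @ H.T + R                 # innovation covariance
        K = V_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
        x = x_pred + K @ (y - H @ x_pred)
        V = (np.eye(4) - K @ H) @ V_pred
        means.append(x)
        covs.append(V)
    return means, covs

# Simulate a length-15 trajectory starting at (10, 10) with velocity (1, 0).
rng = np.random.default_rng(0)
x_true = np.array([10., 10, 1, 0])
ys = []
for _ in range(15):
    x_true = F @ x_true + rng.multivariate_normal(np.zeros(4), Q)
    ys.append(H @ x_true + rng.multivariate_normal(np.zeros(2), R))
means, covs = kalman_filter(ys, x0=np.array([10., 10, 1, 0]), V0=10 * np.eye(4))
```

The trace of the filtered covariance shrinks rapidly from its initial value toward a steady state, which is the Riccati behavior noted above.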

LEARNING

One needs to specify two things to describe a BN: the graph topology (structure) and the parameters of each CPD. It is possible to learn both of these from data. However, learning structure is much harder than learning parameters. Also, learning when some of the nodes are hidden, or we have missing data, is much harder than when everything is observed. This gives rise to 4 cases:

Structure   Observability   Method
---------   -------------   ------------------------------
Known       Full            Maximum Likelihood Estimation
Known       Partial         EM (or gradient ascent)
Unknown     Full            Search through model space
Unknown     Partial         EM + search through model space

Known structure, full observability

We assume that the goal of learning in this case is to find the values of the parameters of each CPD which maximize the likelihood of the training data, which contains N cases (assumed to be independent). The normalized log-likelihood of the training set D is a sum of terms, one for each node:

L = \frac{1}{N} \log \Pr(D) = \frac{1}{N} \sum_{m=1}^{N} \sum_{i} \log \Pr(X_i = x_{i,m} \mid Pa(X_i) = pa_{i,m})

We see that the log-likelihood scoring function decomposes according to the structure of the graph, and hence we can maximize the contribution to the log-likelihood of each node independently (assuming the parameters in each node are independent of the other nodes).

Consider estimating the Conditional Probability Table for the W node. If we have a set of training data, we can just count the number of times the grass is wet when it is raining and the sprinkler is on, N(W=1,S=1,R=1), the number of times the grass is wet when it is raining and the sprinkler is off, N(W=1,S=0,R=1), etc. Given these counts (which are the sufficient statistics), we can find the Maximum Likelihood Estimate of the CPT as follows:

Pr(W=w|S=s,R=r) = N(W=w,S=s,R=r) / N(S=s,R=r)

where the denominator is N(S=s,R=r) = N(W=0,S=s,R=r) + N(W=1,S=s,R=r). Thus "learning" just amounts to counting (in the case of multinomial distributions). For Gaussian nodes, we can compute the sample mean and variance, and use linear regression to estimate the weight matrix. For other kinds of distributions, more complex procedures are necessary.

As is well known from the HMM literature, ML estimates of CPTs are prone to sparse data problems, which can be solved by using (mixtures of) Dirichlet priors (pseudo counts). This results in a Maximum A Posteriori (MAP) estimate. For Gaussians, we can use a Wishart prior, etc.
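"Learning by counting" is short enough to show directly. The training cases below are invented; pseudo=1.0 gives the add-one (Dirichlet pseudo-count) smoothing mentioned above:

```python
from collections import Counter

# Invented training cases for (S, R, W).
data = [(1, 0, 1), (1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 1, 1),
        (1, 1, 1), (0, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1)]

def estimate_cpt(data, pseudo=0.0):
    """MLE (pseudo=0) or MAP with a symmetric Dirichlet prior (pseudo>0)
    for P(W=1 | S, R), computed from the count sufficient statistics."""
    n = Counter()   # N(S=s, R=r, W=w)
    for s, r, w in data:
        n[(s, r, w)] += 1
    cpt = {}
    for s in (0, 1):
        for r in (0, 1):
            n1 = n[(s, r, 1)] + pseudo
            n0 = n[(s, r, 0)] + pseudo
            cpt[(s, r)] = n1 / (n1 + n0)
    return cpt

mle = estimate_cpt(data)                 # raw counts
map_ = estimate_cpt(data, pseudo=1.0)    # add-one (Laplace) smoothing
```

Note how the raw MLE assigns probability exactly 0 or 1 to sparsely observed configurations, while the pseudo-counts pull those estimates away from the extremes.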

Known structure, partial observability

When some of the nodes are hidden, we can use the EM (Expectation Maximization) algorithm to find a (locally) optimal Maximum Likelihood Estimate of the parameters. The basic idea behind EM is that, if we knew the values of all the nodes, learning (the M step) would be easy, as we saw above. So in the E step, we compute the expected values of all the nodes using an inference algorithm, and then treat these expected values as though they were observed (distributions). For example, in the case of the W node, we replace the observed counts of the events with the number of times we expect to see each event:

P(W=w|S=s,R=r) = E N(W=w,S=s,R=r) / E N(S=s,R=r)

where E N(x) is the expected number of times event x occurs in the whole training set, given the current guess of the parameters. These expected counts can be computed as follows

E N(.) = E sum_k I(. | D(k)) = sum_k P(. | D(k))

where I(x | D(k)) is an indicator function which is 1 if event x occurs in training case k, and 0 otherwise.

Given the expected counts, we maximize the parameters, and then recompute the expected counts, etc. This iterative procedure is guaranteed to converge to a local maximum of the likelihood surface. It is also possible to do gradient ascent on the likelihood surface (the gradient expression also involves the expected counts), but EM is usually faster (since it uses the natural gradient) and simpler (since it has no step size parameter and takes care of parameter constraints (e.g., the "rows" of the CPT having to sum to one) automatically). In any case, we see that when nodes are hidden, inference becomes a subroutine which is called by the learning procedure; hence fast inference algorithms are crucial.

Unknown structure, full observability

We start by discussing the scoring function which we use to select models; we then discuss algorithms which attempt to optimize this function over the space of models, and finally examine their computational and sample complexity.

The objective function used for model selection

The maximum likelihood model will be a complete graph, since this has the largest number of parameters, and hence can fit the data the best. A well-principled way to avoid this kind of over-fitting is to put a prior on models, specifying that we prefer sparse models. Then, by Bayes' rule, the MAP model is the one that maximizes

\Pr(G|D) = \frac{\Pr(D|G) \Pr(G)}{\Pr(D)}

Taking logs, we find

\log \Pr(G|D) = \log \Pr(D|G) + \log \Pr(G) + c

where c = - \log \Pr(D) is a constant independent of G.

The effect of the structure prior P(G) is equivalent to penalizing overly complex models. However, this is not strictly necessary, since the marginal likelihood term


P(D|G) = \int_{\theta} P(D|G, \theta) P(\theta|G) d\theta

has a similar effect of penalizing models with too many parameters (this is known as Occam's razor).

Search algorithms for finding the best model

The goal of structure learning is to learn a DAG (directed acyclic graph) that best explains the data. This is an NP-hard problem, since the number of DAGs on N variables is super-exponential in N. (There is no closed form formula for this, but to give you an idea, there are 543 DAGs on 4 nodes, and O(10^18) DAGs on 10 nodes.)

If we know the ordering of the nodes, life becomes much simpler, since we can learn the parent set for each node independently (since the score is decomposable), and we don't need to worry about acyclicity constraints. For each node, there are at most

\sum_{k=0}^{n} \binom{n}{k} = 2^n

sets of possible parents, which can be arranged in a lattice as shown below for n=4. The problem is to find the highest scoring point in this lattice.

There are three obvious ways to search this graph: bottom up, top down, or middle out. In the bottom up approach, we start at the bottom of the lattice, and evaluate the score at all points in each successive level. We must decide whether the gains in score produced by a larger parent set are ``worth it''. The standard approach in the reconstructibility analysis (RA) community uses the fact that \chi^2(X, Y) \approx I(X,Y) N \ln(4), where N is the number of samples and I(X,Y) is the mutual information (MI) between X and Y. Hence we can use a \chi^2 test to decide whether an increase in the MI score is statistically significant. (This also gives us some kind of confidence measure on the connections that we learn.) Alternatively, we can use a BIC score.
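The MI significance test can be sketched as follows, using the relation \chi^2 \approx 2 N I(X,Y) with I measured in nats (equivalently N ln(4) times the MI in bits); the contingency tables are invented:

```python
import math

def mutual_information_nats(counts):
    """I(X;Y) in nats from a contingency table of counts."""
    n = sum(sum(row) for row in counts)
    px = [sum(row) / n for row in counts]
    py = [sum(counts[i][j] for i in range(len(counts))) / n
          for j in range(len(counts[0]))]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c:
                pxy = c / n
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

def g_statistic(counts):
    """G = 2 N I(X;Y), approximately chi^2-distributed under independence."""
    n = sum(sum(row) for row in counts)
    return 2 * n * mutual_information_nats(counts)

CHI2_CRIT_1DF_95 = 3.841   # 95% critical value, 1 degree of freedom

dependent = [[90, 10], [10, 90]]     # strongly associated pair
independent = [[50, 50], [50, 50]]   # exactly independent pair
```

Comparing g_statistic against the critical value is the significance test used to decide whether adding a candidate parent is worth it.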

Of course, if we do not know if we have achieved the maximum possible score, we do not know when to stop searching, and hence we must evaluate all points in the lattice (although we can obviously use branch-and-bound). For large n, this is computationally infeasible, so a common approach is to only search up until level K (i.e., assume a bound on the maximum number of parents of each node), which takes O(n^K) time.

The obvious way to avoid the exponential cost (and the need for a bound, K) is to use heuristics to avoid examining all possible subsets. (In fact, we must use heuristics of some kind, since the problem of learning optimal structure is NP-hard \cite{Chickering95}.) One approach in the RA framework, called Extended Dependency Analysis (EDA) \cite{Conant88}, is as follows. Start by evaluating all subsets of size up to two, keep all the ones with significant (in the \chi^2 sense) MI with the target node, and take the union of the resulting set as the set of parents.

The disadvantage of this greedy technique is that it will fail to find a set of parents unless some subset of size two has significant MI with the target variable. However, a Monte Carlo simulation in \cite{Conant88} shows that most random relations have this property. In addition, highly interdependent sets of parents (which might fail the pairwise MI test) violate the causal independence assumption, which is necessary to justify the use of noisy-OR and similar CPDs.

An alternative technique, popular in the UAI community, is to start with an initial guess of the model structure (i.e., at a specific point in the lattice), and then perform local search, i.e., evaluate the score of neighboring points in the lattice, and move to the best such point, until we reach a local optimum. We can use multiple restarts to try to find the global optimum, and to learn an ensemble of models. Note that, in the partially observable case, we need to have an initial guess of the model structure in order to estimate the values of the hidden nodes, and hence the (expected) score of each model; starting with the fully disconnected model (i.e., at the bottom of the lattice) would be a bad idea, since it would lead to a poor estimate.

Unknown structure, partial observability

Finally, we come to the hardest case of all, where the structure is unknown and there are hidden variables and/or missing data. In this case, to compute the Bayesian score, we must marginalize out the hidden nodes as well as the parameters. Since this is usually intractable, it is common to use an asymptotic approximation to the posterior called BIC (Bayesian Information Criterion), which is defined as follows:

\log \Pr(D|G) \approx \log \Pr(D|G, \hat{\Theta}_G) - \frac{\log N}{2} \#G

where N is the number of samples, \hat{\Theta}_G is the ML estimate of the parameters, and \#G is the dimension of the model. (In the fully observable case, the dimension of a model is the number of free parameters. In a model with hidden variables, it might be less than this.) The first term is just the likelihood and the second term is a penalty for model complexity. (The BIC score is identical to the Minimum Description Length (MDL) score.)
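Here is a sketch of BIC-based model comparison on an invented dataset of two binary variables: we score the disconnected model against X -> Y and check that the penalized likelihood prefers the dependent structure:

```python
import math

# 200 invented cases of two binary variables that nearly always agree.
data = [(0, 0)] * 90 + [(0, 1)] * 10 + [(1, 1)] * 90 + [(1, 0)] * 10
N = len(data)

def bic_independent(data):
    """Model G1: X and Y independent (2 free parameters)."""
    ll = 0.0
    for var in (0, 1):
        p1 = sum(row[var] for row in data) / N
        ll += sum(math.log(p1 if row[var] else 1 - p1) for row in data)
    return ll - (math.log(N) / 2) * 2

def bic_dependent(data):
    """Model G2: X -> Y (3 free parameters: P(X), P(Y|X=0), P(Y|X=1))."""
    p_x1 = sum(x for x, _ in data) / N
    ll = sum(math.log(p_x1 if x else 1 - p_x1) for x, _ in data)
    for xv in (0, 1):
        ys = [y for x, y in data if x == xv]
        p_y1 = sum(ys) / len(ys)
        ll += sum(math.log(p_y1 if y else 1 - p_y1) for y in ys)
    return ll - (math.log(N) / 2) * 3
```

The extra parameter of the dependent model costs (log N)/2 in score, but here the likelihood gain dwarfs the penalty.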

Although the BIC score decomposes into a sum of local terms, one per node, local search is still expensive, because we need to run EM at each step to compute \hat{\Theta}. An alternative approach is to do the local search steps inside of the M step of EM - this is called Structural EM, and provably converges to a local maximum of the BIC score (Friedman, 1997).

Inventing new hidden nodes

So far, structure learning has meant finding the right connectivity between pre-existing nodes. A more interesting problem is inventing hidden nodes on demand. Hidden nodes can make a model much more compact, as we see below.

(a) A BN with a hidden variable H. (b) The simplest network that can capture the same distribution without using a hidden variable (created using arc reversal and node elimination). If H is binary and the other nodes are trinary, and we assume full CPTs, the first network has 45 independent parameters, and the second has 708.

The standard approach is to keep adding hidden nodes one at a time, to some part of the network (see below), performing structure learning at each step, until the score drops. One problem is choosing the cardinality (number of possible values) for the hidden node, and its type of CPD. Another problem is choosing where to add the new hidden node. There is no point making it a child, since hidden children can always be marginalized away, so we need to find an existing node which needs a new parent, when the current set of possible parents is not adequate.


\cite{Ramachandran98} use the following heuristic for finding nodes which need new parents: they consider a noisy-OR node which is nearly always on, even if its non-leak parents are off, as an indicator that there is a missing parent. Generalizing this technique beyond noisy-ORs is an interesting open problem. One approach might be to examine H(X|Pa(X)): if this is very high, it means the current set of parents are inadequate to ``explain'' the residual entropy; if Pa(X) is the best (in the BIC or \chi^2 sense) set of parents we have been able to find in the current model, it suggests we need to create a new node and add it to Pa(X).

A simple heuristic for inventing hidden nodes in the case of DBNs is to check if the Markov property is being violated for any particular node. If so, it suggests that we need connections to slices further back in time. Equivalently, we can add new lag variables and connect to them.

Of course, interpreting the ``meaning'' of hidden nodes is always tricky, especially since they are often unidentifiable, e.g., we can often switch the interpretation of the true and false states (assuming for simplicity that the hidden node is binary) provided we also permute the parameters appropriately. (Symmetries such as this are one cause of the multiple maxima in the likelihood surface.)

Further reading on learning

The following are good tutorial articles.

● W. L. Buntine, 1994. "Operations for Learning with Graphical Models", J. AI Research, 159--225.

● D. Heckerman, 1996. "A tutorial on learning with Bayesian networks", Microsoft Research tech. report, MSR-TR-95-06.

Decision Theory

It is often said that "Decision Theory = Probability Theory + Utility Theory". We have outlined above how we can model joint probability distributions in a compact way by using sparse graphs to reflect conditional independence relationships. It is also possible to decompose multi-attribute utility functions in a similar way: we create a node for each term in the sum, which has as parents all the attributes (random variables) on which it depends; typically, the utility node(s) will have action node(s) as parents, since the utility depends both on the state of the world and the action we perform. The resulting graph is called an influence diagram. In principle, we can then use the influence diagram to compute the optimal (sequence of) action(s) to perform so as to maximize expected utility, although this is computationally intractable for all but the smallest problems.
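Computing the optimal action from a tiny influence diagram is just expected-utility maximization. All numbers below are invented for illustration (one chance node Rain, one decision node, one utility node depending on both):

```python
# A miniature influence diagram: chance node Rain, decision node Umbrella,
# and a single utility node U(action, rain). All numbers are invented.
p_rain = 0.3
utility = {
    ('take', 0): 80, ('take', 1): 70,    # carrying an umbrella is a mild cost
    ('leave', 0): 100, ('leave', 1): 0,  # getting soaked is a large cost
}

def expected_utility(action):
    """Average the utility node over the chance node Rain."""
    return (1 - p_rain) * utility[(action, 0)] + p_rain * utility[(action, 1)]

best = max(['take', 'leave'], key=expected_utility)
```

With these numbers, EU(take) = 77 beats EU(leave) = 70, so the optimal decision is to take the umbrella.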

Classical control theory is mostly concerned with the special case where the graphical model is a Linear Dynamical System and the utility function is negative quadratic loss, e.g., consider a missile tracking an airplane: its goal is to minimize the squared distance between itself and the target. When the utility function and/or the system model becomes more complicated, traditional methods break down, and one has to use reinforcement learning to find the optimal policy (mapping from states to actions).

Applications

The most widely used Bayes Nets are undoubtedly the ones embedded in Microsoft's products, including the Answer Wizard of Office 95, the Office Assistant (the bouncy paperclip guy) of Office 97, and over 30 Technical Support Troubleshooters.

BNs originally arose out of an attempt to add probabilities to expert systems, and this is still the most common use for BNs. A famous example is QMR-DT, a decision-theoretic reformulation of the Quick Medical Reference (QMR) model.

http://www.cs.ubc.ca/~murphyk/Bayes/bayes.html (18 of 21)9/23/2005 4:06:27 PM Graphical Models

Here, the top layer represents hidden disease nodes, and the bottom layer represents observed symptom nodes. The goal is to infer the posterior probability of each disease given all the symptoms (which can be present, absent or unknown). QMR-DT is so densely connected that exact inference is impossible. Various approximation methods have been used, including sampling, variational and loopy belief propagation.
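A two-layer disease/symptom network of this kind is commonly parameterized with noisy-OR conditional distributions. The following sketch is purely illustrative: the two diseases, their priors, the leak probability and the activation probabilities are all invented, and exact enumeration is used only because the network is tiny (QMR-DT itself is far too large for this).

```python
from itertools import product

# Tiny QMR-style sketch: two hidden disease nodes, one observed symptom,
# noisy-OR parameterization. All parameter values are hypothetical.
prior = {"d1": 0.1, "d2": 0.05}         # P(disease = present)
leak = 0.01                             # P(symptom | no disease present)
activation = {"d1": 0.8, "d2": 0.6}     # P(disease alone triggers symptom)

def p_symptom(active):
    # Noisy-OR: the symptom is absent only if the leak and every
    # active cause independently fail to trigger it.
    p_absent = 1 - leak
    for d in active:
        p_absent *= 1 - activation[d]
    return 1 - p_absent

def posterior(query):
    """P(query disease present | symptom observed present), by enumeration."""
    num = den = 0.0
    for s1, s2 in product([True, False], repeat=2):
        states = {"d1": s1, "d2": s2}
        p = 1.0
        for d, on in states.items():
            p *= prior[d] if on else 1 - prior[d]
        p *= p_symptom([d for d, on in states.items() if on])
        den += p
        if states[query]:
            num += p
    return num / den

print(round(posterior("d1"), 3))   # well above the 0.1 prior
```

Observing the symptom raises the posterior of each disease above its prior; in the real QMR-DT this enumeration is replaced by the sampling, variational or loopy-propagation approximations mentioned above.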

Another interesting fielded application is the Vista system, developed by Eric Horvitz. The Vista system is a decision-theoretic system that has been used at NASA Mission Control Center in Houston for several years. The system uses Bayesian networks to interpret live telemetry and provides advice on the likelihood of alternative failures of the space shuttle's propulsion systems. It also considers time criticality and recommends actions of the highest expected utility. The Vista system also employs decision-theoretic methods for controlling the display of information to dynamically identify the most important information to highlight. Horvitz has gone on to attempt to apply similar technology to Microsoft products, e.g., the Lumiere project.

Special cases of BNs were independently invented by many different communities, for use in e.g., genetics (linkage analysis), speech recognition (HMMs), tracking (Kalman filtering), data compression (density estimation) and coding (turbocodes), etc.

For examples of other applications, see the special issue of Proc. ACM 38(3), 1995, and the Microsoft Decision Theory Group page.

Applications to biology

This is one of the hottest areas. For a review, see

● N. Friedman, 2004. "Inferring cellular networks using probabilistic graphical models", Science, v303, p799, 6 Feb 2004.

Recommended introductory reading

Books

In reverse chronological order (bold means particularly recommended)

● F. V. Jensen. "Bayesian Networks and Decision Graphs". Springer. 2001. Probably the best introductory book available.

● D. Edwards. "Introduction to Graphical Modelling", 2nd ed. Springer-Verlag. 2000. Good treatment of undirected graphical models from a statistical perspective.

● J. Pearl. "Causality". Cambridge. 2000. The definitive book on using causal DAG modeling.

● R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter. "Probabilistic Networks and Expert Systems". Springer-Verlag. 1999. Probably the best book available, although the treatment is restricted to exact inference.

● M. I. Jordan (ed). "Learning in Graphical Models". MIT Press. 1998. Loose collection of papers on machine learning, many related to graphical models. One of the few books to discuss approximate inference.

● B. Frey. "Graphical models for machine learning and digital communication", MIT Press. 1998. Discusses pattern recognition and turbocodes using (directed) graphical models.

● E. Castillo and J. M. Gutierrez and A. S. Hadi. "Expert systems and probabilistic network models". Springer-Verlag, 1997. A Spanish version is available online for free.

● F. Jensen. "An introduction to Bayesian Networks". UCL Press. 1996. Out of print. Superseded by his 2001 book.

● S. Lauritzen. "Graphical Models", Oxford. 1996. The definitive mathematical exposition of the theory of graphical models.

● S. Russell and P. Norvig. "Artificial Intelligence: A Modern Approach". Prentice Hall. 1995. Popular undergraduate textbook that includes a readable chapter on directed graphical models.

● J. Whittaker. "Graphical Models in Applied Multivariate Statistics", Wiley. 1990. This is the first book published on graphical modelling from a statistics perspective.

● R. Neapolitan. "Probabilistic Reasoning in Expert Systems". John Wiley & Sons. 1990.

● J. Pearl. "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference." Morgan Kaufmann. 1988. The book that got it all started! A very insightful book, still relevant today.

Review articles

● P. Smyth, 1998. "Belief networks, hidden Markov models, and Markov random fields: a unifying view", Pattern Recognition Letters.

● E. Charniak, 1991. "Bayesian Networks without Tears", AI magazine.

● Sam Roweis & Zoubin Ghahramani, 1999. A Unifying Review of Linear Gaussian Models, Neural Computation 11(2) (1999) pp.305-345

Exact Inference

● C. Huang and A. Darwiche, 1996. "Inference in Belief Networks: A procedural guide", Intl. J. Approximate Reasoning, 15(3):225-263.

● R. McEliece and S. M. Aji, 2000. "The Generalized Distributive Law", IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000), pp. 325-343.

● F. Kschischang, B. Frey and H. Loeliger, 2001. "Factor graphs and the sum-product algorithm", IEEE Transactions on Information Theory, February, 2001.

● M. Peot and R. Shachter, 1991. "Fusion and propagation with multiple observations in belief networks", Artificial Intelligence, 48:299-318.

Approximate Inference

● M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, 1997. "An introduction to variational methods for graphical models."

● D. MacKay, 1998. "An introduction to Monte Carlo methods".

● T. Jaakkola and M. Jordan, 1998. "Variational probabilistic inference and the QMR-DT database"

Learning

● W. L. Buntine, 1994. "Operations for Learning with Graphical Models", J. AI Research, 159--225.

● D. Heckerman, 1996. "A tutorial on learning with Bayesian networks", Microsoft Research tech. report, MSR-TR-95-06.

DBNs

● L. R. Rabiner, 1989. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, 77(2):257--286.

● Z. Ghahramani, 1998. "Learning Dynamic Bayesian Networks". In C.L. Giles and M. Gori (eds.), Adaptive Processing of Sequences and Data Structures. Lecture Notes in Artificial Intelligence, 168-197. Berlin: Springer-Verlag.


Stanford Encyclopedia of Philosophy (last substantive content change Oct 7, 2003)


Modal Logic

A modal is an expression (like ‘necessarily’ or ‘possibly’) that is used to qualify the truth of a judgement. Modal logic is, strictly speaking, the study of the deductive behavior of the expressions ‘it is necessary that’ and ‘it is possible that’. However, the term ‘modal logic’ may be used more broadly for a family of related systems. These include logics for belief, for tense and other temporal expressions, for the deontic (moral) expressions such as ‘it is obligatory that’ and ‘it is permitted that’, and many others. An understanding of modal logic is particularly valuable in the formal analysis of philosophical argument, where expressions from the modal family are both common and confusing. Modal logic also has important applications in computer science.

● 1. What is Modal Logic?

● 2. Modal Logics

● 3. Deontic Logics

● 4. Temporal Logics

● 5. Conditional Logics

● 6. Possible Worlds Semantics

● 7. Modal Axioms and Conditions on Frames

● 8. Map of the Relationships between Modal Logics

● 9. The General Axiom

● 10. Provability Logics

● 11. Quantifiers in Modal Logic

● Bibliography

● Other Internet Resources

http://plato.stanford.edu/entries/logic-modal/ (1 of 20)9/24/2005 12:09:17 PM Modal Logic

● Related Entries

1. What is Modal Logic?

Narrowly construed, modal logic studies reasoning that involves the use of the expressions ‘necessarily’ and ‘possibly’. However, the term ‘modal logic’ is used more broadly to cover a family of logics with similar rules and a variety of different symbols.

A list describing the best known of these logics follows.

● Modal Logic: □ ‘It is necessary that ..’; ◊ ‘It is possible that ..’

● Deontic Logic: O ‘It is obligatory that ..’; P ‘It is permitted that ..’; F ‘It is forbidden that ..’

● Temporal Logic: G ‘It will always be the case that ..’; F ‘It will be the case that ..’; H ‘It has always been the case that ..’; P ‘It was the case that ..’

● Doxastic Logic: Bx ‘x believes that ..’

2. Modal Logics

The most familiar logics in the modal family are constructed from a weak logic called K (after Saul Kripke). Under the narrow reading, modal logic concerns necessity and possibility. A variety of different systems may be developed for such logics using K as a foundation. The symbols of K include ‘~’ for ‘not’, ‘→’ for ‘if...then’, and ‘□’ for the modal operator ‘it is necessary that’. (The connectives ‘&’, ‘∨’, and ‘↔’ may be defined from ‘~’ and ‘→’ as is done in propositional logic.) K results from adding the following to the principles of propositional logic.

Necessitation Rule: If A is a theorem of K, then so is □A.

Distribution Axiom: □(A→B) → (□A→□B).

http://plato.stanford.edu/entries/logic-modal/ (2 of 20)9/24/2005 12:09:17 PM Modal Logic

(In these principles we use ‘A’ and ‘B’ as metavariables ranging over formulas of the language.) According to the Necessitation Rule, any theorem of logic is necessary. The Distribution Axiom says that if it is necessary that if A then B, then if necessarily A then necessarily B.

The operator ◊ (for ‘possibly’) can be defined from □ by letting ◊A = ~□~A. In K, the operators □ and ◊ behave very much like the quantifiers ∀ (all) and ∃ (some). For example, the definition of ◊ from □ mirrors the equivalence of ∀xA with ~∃x~A in predicate logic. Furthermore, □(A&B) entails □A&□B and vice versa; while □A∨□B entails □(A∨B), but not vice versa. This reflects the patterns exhibited by the universal quantifier: ∀x(A&B) entails ∀xA&∀xB and vice versa, while ∀xA∨∀xB entails ∀x(A∨B) but not vice versa. Similar parallels between ◊ and ∃ can be drawn. The basis for this correspondence between the modal operators and the quantifiers will emerge more clearly in the section on Possible Worlds Semantics.

The system K is too weak to provide an adequate account of necessity. The following axiom is not provable in K, but it is clearly desirable.

(M) □A→A

(M) claims that whatever is necessary is the case. Notice that (M) would be incorrect were □ to be read ‘it ought to be that’, or ‘it was the case that’. So the presence of axiom (M) distinguishes modal from other logics in the modal family. A basic modal logic M results from adding (M) to K. (Some authors call this system T.)

Many logicians believe that M is still too weak to correctly formalize the logic of necessity and possibility. They recommend further axioms to govern the iteration, or repetition of modal operators. Here are two of the most famous iteration axioms:

(4) □A→□□A

(5) ◊A→□◊A

S4 is the system that results from adding (4) to M. Similarly S5 is M plus (5). In S4, the sentence □□A is equivalent to □A. As a result, any string of boxes may be replaced by a single box, and the same goes for strings of diamonds. This amounts to the idea that iteration of the modal operators is superfluous. Saying that A is necessarily necessary is considered a uselessly long-winded way of saying that A is necessary. The system S5 has even stronger principles for simplifying strings of modal operators. In S4, a string of operators of the same kind can be replaced by that operator; in S5, strings containing both boxes and diamonds are equivalent to the last operator in the string. So, for example, saying that it is possible that A is necessary is the same as saying that A is necessary. A summary of these features of S4 and S5 follows.

http://plato.stanford.edu/entries/logic-modal/ (3 of 20)9/24/2005 12:09:17 PM Modal Logic

S4: □□...□ = □ and ◊◊...◊ = ◊

S5: ○○...□ = □ and ○○...◊ = ◊, where each ○ is either □ or ◊

One could engage in endless argument over the correctness or incorrectness of these and other iteration principles for □ and ◊. The controversy can be partly resolved by recognizing that the words ‘necessarily’ and ‘possibly’, have many different uses. So the acceptability of axioms for modal logic depends on which of these uses we have in mind. For this reason, there is no one modal logic, but rather a whole family of systems built around M. The relationship between these systems is diagrammed in Section 8, and their application to different uses of ‘necessarily’ and ‘possibly’ can be more deeply understood by studying their possible world semantics in Section 6.

The system B (for the logician Brouwer) is formed by adding axiom (B) to M.

(B) A→□◊A

It is interesting to note that S5 can be formulated equivalently by adding (B) to S4. The axiom (B) raises an important point about the interpretation of modal formulas. (B) says that if A is the case, then A is necessarily possible. One might argue that (B) should always be adopted in any modal logic, for surely if A is the case, then it is necessary that A is possible. However, there is a problem with this claim that can be exposed by noting that ◊□A→A is provable from (B). So ◊□A→A should be acceptable if (B) is. However, ◊□A→A says that if A is possibly necessary, then A is the case, and this is far from obvious. Why does (B) seem obvious, while one of the things it entails seems not obvious at all? The answer is that there is a dangerous ambiguity in the English interpretation of A→□◊A. We often use the expression ‘If A then necessarily B’ to express that the conditional ‘if A then B’ is necessary. This interpretation corresponds to □(A→B). On other occasions, we mean that if A, then B is necessary: A→□B. In English, ‘necessarily’ is an adverb, and since adverbs are usually placed near verbs, we have no natural way to indicate whether the modal operator applies to the whole conditional, or to its consequent. For these reasons, there is a tendency to confuse (B): A→□◊A with □(A→◊A). But □ (A→◊A) is not the same as (B), for □(A→◊A) is already a theorem of M, and (B) is not. One must take special care that our positive reaction to □(A→◊A) does not infect our evaluation of (B). One simple way to protect ourselves is to formulate B in an equivalent way using the axiom: ◊□A→A, where these ambiguities of scope do not arise.

3. Deontic Logics

Deontic logics introduce the primitive symbol O for ‘it is obligatory that’, from which symbols P for ‘it is permitted that’ and F for ‘it is forbidden that’ are defined: PA = ~O~A and FA = O~A. The deontic analog of the modal axiom (M): OA→A is clearly not appropriate for deontic logic. (Unfortunately, what ought to be is not always the case.) However, a basic system D of deontic logic can be constructed by

adding the weaker axiom (D) to K.

(D) OA→PA

Axiom (D) guarantees the consistency of the system of obligations by insisting that when A is obligatory, A is permissible. A system which obligates us to bring about A, but doesn't permit us to do so, puts us in an inescapable bind. Although some will argue that such conflicts of obligation are at least possible, most deontic logicians accept (D).

O(OA→A) is another deontic axiom that seems desirable. Although it is wrong to say that if A is obligatory then A is the case (OA→A), still, this conditional ought to be the case. So some deontic logicians believe that D needs to be supplemented with O(OA→A) as well.

Controversy about iteration (repetition) of operators arises again in deontic logic. In some conceptions of obligation, OOA just amounts to OA. ‘It ought to be that it ought to be’ is treated as a sort of stuttering; the extra ‘ought’s do not add anything new. So axioms are added to guarantee the equivalence of OOA and OA. The more general iteration policy embodied in S5 may also be adopted. However, there are conceptions of obligation where distinction between OA and OOA is preserved. The idea is that there are genuine differences between the obligations we actually have and the obligations we should adopt. So, for example, ‘it ought to be that it ought to be that A’ commands adoption of some obligation which may not actually be in place, with the result that OOA can be true even when OA is false.

4. Temporal Logics

In temporal logic (also known as tense logic), there are two basic operators, G for the future, and H for the past. G is read ‘it always will be that’ and the defined operator F (read ‘it will be the case that’), can be introduced by FA = ~G~A. Similarly H is read: ‘it always was that’ and P (for ‘it was the case that’) is defined by PA=~H~A. A basic system of temporal logic called Kt results from adopting the principles of K for both G and H, along with two axioms to govern the interaction between the past and future operators:

"Necessitation" Rules: If A is a theorem then so are GA and HA.

Distribution Axioms: G(A→B) → (GA→GB) and H(A→B) → (HA→HB)

Interaction Axioms: A→GPA and A→HFA

The interaction axioms raise questions concerning asymmetries between the past and the future. A standard intuition is that the past is fixed, while the future is still open. The first interaction axiom (A→GPA) conforms to this intuition in reporting that what is the case (A), will at all future times, be in the past (GPA). However A→HFA may appear to have unacceptably deterministic overtones, for it

claims, apparently, that what is true now (A) has always been such that it will occur in the future (HFA). However, possible world semantics for temporal logic reveals that this worry results from a simple confusion, and that the two interaction axioms are equally acceptable.
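Both interaction axioms can also be checked mechanically on a small finite frame. The sketch below (all function names are mine) interprets G, H, F and P over a four-point linear order with ‘earlier than’ as <, and verifies A→GPA and A→HFA at every time under every valuation of a single atom.

```python
from itertools import product

# Sketch: the Kt interaction axioms on a small linear time frame.
# Times 0..3; G/H quantify over strictly later/earlier times.
T = range(4)

def G(phi, t):   # it always will be the case that phi
    return all(phi(u) for u in T if u > t)

def H(phi, t):   # it always was the case that phi
    return all(phi(u) for u in T if u < t)

def F(phi, t):   # it will be the case that phi (dual of G)
    return any(phi(u) for u in T if u > t)

def P(phi, t):   # it was the case that phi (dual of H)
    return any(phi(u) for u in T if u < t)

def valid(formula):
    # Quantify over every valuation of a single atom A and every time t.
    return all(formula(dict(zip(T, bits)).__getitem__, t)
               for bits in product([True, False], repeat=len(T))
               for t in T)

ax1 = lambda A, t: (not A(t)) or G(lambda u: P(A, u), t)   # A -> GPA
ax2 = lambda A, t: (not A(t)) or H(lambda u: F(A, u), t)   # A -> HFA
print(valid(ax1), valid(ax2))
```

Both print True: the two axioms really are on a par, as the text says. (By contrast, GA→FA fails on this frame, because the last moment has no future.)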

Note that the characteristic axiom of modal logic, (M): □A→A, is not acceptable for either H or G, since A does not follow from ‘it always was the case that A’, nor from ‘it always will be the case that A’. However, it is acceptable in a closely related temporal logic where G is read ‘it is and always will be’, and H is read ‘it is and always was’.

Depending on which assumptions one makes about the structure of time, further axioms must be added to temporal logics. A list of axioms commonly adopted in temporal logics follows. An account of how they depend on the structure of time will be found in the section Possible Worlds Semantics.

GA→GGA and HA→HHA

GGA→GA and HHA→HA

GA→FA and HA→PA

It is interesting to note that certain combinations of past tense and future tense operators may be used to express complex tenses in English. For example, FPA corresponds to sentence A in the future perfect tense (as in ‘20 seconds from now the light will have changed’). Similarly, PPA expresses the past perfect tense.

For a more detailed discussion of temporal logic, see the entry on temporal logic.

5. Conditional Logics

The founder of modal logic, C. I. Lewis, defined a series of modal logics which did not have □ as a primitive symbol. Lewis was concerned to develop a logic of conditionals that was free of the so called Paradoxes of Material Implication, namely the classical theorems A→(~A→B) and B→(A→B). He introduced the symbol ⥽ for "strict implication" and developed logics where neither A⥽(~A⥽B) nor B⥽(A⥽B) is provable. The modern practice has been to define A⥽B by □(A→B), and use modal logics governing □ to obtain similar results. However, the provability of such formulas as (A&~A)⥽B in such logics seems at odds with concern for the paradoxes. Anderson and Belnap (1975) have developed systems R (for Relevance Logic) and E (for Entailment) which are designed to overcome such difficulties. These systems require revision of the standard systems of propositional logic. (For a more detailed discussion of relevance logic, see the entry on relevance logic.)

David Lewis (1973) has developed special conditional logics to handle counterfactual expressions, that

is, expressions of the form ‘if A were to happen then B would happen’. (Kvart (1980) is another good source on the topic.) Counterfactual logics differ from those based on strict implication because the former reject while the latter accept contraposition.

6. Possible Worlds Semantics

The purpose of logic is to characterize the difference between valid and invalid arguments. A logical system for a language is a set of axioms and rules designed to prove exactly the valid arguments statable in the language. Creating such a logic may be a difficult task. The logician must make sure that the system is sound, i.e. that every argument proven using the rules and axioms is in fact valid. Furthermore, the system should be complete, meaning that every valid argument has a proof in the system. Demonstrating soundness and completeness of formal systems is a logician's central concern.

Such a demonstration cannot get underway until the concept of validity is defined rigorously. Formal semantics for a logic provides a definition of validity by characterizing the truth behavior of the sentences of the system. In propositional logic, validity can be defined using truth tables. A valid argument is simply one where every row that makes its premises true also makes its conclusion true. However, truth tables cannot be used to provide an account of validity in modal logics because there are no truth tables for expressions such as ‘it is necessary that’, ‘it is obligatory that’, and the like. (The problem is that the truth value of A does not determine the truth value for □A. For example, when A is ‘Dogs are dogs’, □A is true, but when A is ‘Dogs are pets’, □A is false.) Nevertheless, semantics for modal logics can be defined by introducing possible worlds. We will illustrate possible worlds semantics for a logic of necessity containing the symbols ~, →, and □. Then we will explain how the same strategy may be adapted to other logics in the modal family.

In propositional logic, a valuation of the atomic sentences (or row of a truth table) assigns a truth value (T or F) to each propositional variable p. Then the truth values of the complex sentences are calculated with truth tables. In modal semantics, a set W of possible worlds is introduced. A valuation then gives a truth value to each propositional variable for each of the possible worlds in W. This means that the value assigned to p for world w may differ from the value assigned to p for another world w′.

The truth value of the atomic sentence p at world w given by the valuation v may be written v(p, w). Given this notation, the truth values (T for true, F for false) of complex sentences of modal logic for a given valuation v (and member w of the set of worlds W) may be defined by the following truth clauses. (‘iff’ abbreviates ‘if and only if’.)

(~) v(~A, w)=T iff v(A, w)=F.

(→) v(A→B, w)=T iff v(A, w)=F or v(B, w)=T.

(5) v(□A, w)=T iff for every world w′ in W, v(A, w′)=T.


Clauses (~) and (→) simply describe the standard truth table behavior for negation and material implication respectively. According to (5), □A is true (at a world w) exactly when A is true in all possible worlds. Given the definition of ◊, (namely, ◊A = ~□~A) the truth condition (5) insures that ◊A is true just in case A is true in some possible world. Since the truth clauses for □ and ◊ involve the quantifiers ‘all’ and ‘some’ (respectively), the parallels in logical behavior between □ and ∀x, and between ◊ and ∃x noted in section 2 will be expected.

Clauses (~), (→), and (5) allow us to calculate the truth value of any sentence at any world on a given valuation. A definition of validity is now just around the corner. An argument is 5-valid for a given set W (of possible worlds) if and only if every valuation of the atomic sentences that assigns the premises T at a world in W also assigns the conclusion T at the same world. An argument is said to be 5-valid iff it is valid for every non-empty set W of possible worlds.

It has been shown that S5 is sound and complete for 5-validity (hence our use of the symbol ‘5’). The 5-valid arguments are exactly the arguments provable in S5. This result suggests that S5 is the correct way to formulate a logic of necessity.

However, S5 is not a reasonable logic for all members of the modal family. In deontic logic, temporal logic, and others, the analog of the truth condition (5) is clearly not appropriate; furthermore there are even conceptions of necessity where (5) should be rejected as well. The point is easiest to see in the case of temporal logic. Here, the members of W are moments of time, or worlds "frozen", as it were, at an instant. For simplicity let us consider a future temporal logic, a logic where □A reads: ‘it will always be the case that’. (We formulate the system using □ rather than the traditional G so that the connections with other modal logics will be easier to appreciate.) The correct clause for □ should say that □A is true at time w iff A is true at all times in the future of w. To restrict attention to the future, the relation R (for ‘eaRlier than’) needs to be introduced. Then the correct clause can be formulated as follows.

(K) v(□A, w)=T iff for every w′, if wRw′, then v(A, w′)=T.

This says that □A is true at w just in case A is true at all times after w.

Validity for this brand of temporal logic can now be defined. A frame is a pair consisting of a non-empty set W (of worlds) and a binary relation R on W. A model consists of a frame F, and a valuation v that assigns truth values to each atomic sentence at each world in W. Given a model, the values of all complex sentences can be determined using (~), (→), and (K). An argument is K-valid just in case any model whose valuation assigns the premises T at a world also assigns the conclusion T at the same world. As the reader may have guessed from our use of ‘K’, it has been shown that the simplest modal logic K is both sound and complete for K-validity.
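These truth clauses translate almost directly into code. The following sketch evaluates formulas built from atoms, ~, → and □ using clauses (~), (→) and (K); the worlds, relation R and valuation V are invented for illustration, and formulas are represented as nested tuples.

```python
# Sketch of the truth clauses as code. The frame and valuation below are
# hypothetical; formulas are atoms (strings) or tuples ("not", A),
# ("imp", A, B), ("box", A).
W = {1, 2, 3}
R = {(1, 2), (1, 3), (2, 3)}                # wRw' pairs (not reflexive)
V = {("p", 1): False, ("p", 2): True, ("p", 3): True,
     ("q", 1): True,  ("q", 2): True, ("q", 3): False}

def holds(A, w):
    if isinstance(A, str):                  # atomic sentence: look up v(A, w)
        return V[(A, w)]
    if A[0] == "not":                       # clause (~)
        return not holds(A[1], w)
    if A[0] == "imp":                       # clause (->)
        return (not holds(A[1], w)) or holds(A[2], w)
    if A[0] == "box":                       # clause (K): all accessible worlds
        return all(holds(A[1], v) for (u, v) in R if u == w)

def valid_in_model(A):
    return all(holds(A, w) for w in W)

# The Distribution Axiom is K-valid, so it holds in any model; the (M)
# instance box p -> p fails here, because this R is not reflexive.
dist = ("imp", ("box", ("imp", "p", "q")),
               ("imp", ("box", "p"), ("box", "q")))
M_inst = ("imp", ("box", "p"), "p")
print(valid_in_model(dist), valid_in_model(M_inst))
```

The failure of □p→p on this frame anticipates the next section: which axioms hold depends on the properties of R.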

7. Modal Axioms and Conditions on Frames


One might assume from this discussion that K is the correct logic when □ is read ‘it will always be the case that’. However, there are reasons for thinking that K is too weak. One obvious logical feature of the relation R (earlier than) is transitivity. If wRv (w is earlier than v) and vRu (v is earlier than u), then it follows that wRu (w is earlier than u). So let us define a new kind of validity that corresponds to this condition on R. Let a 4-model be any model whose frame is such that R is a transitive relation on W. Then an argument is 4-valid iff any 4-model whose valuation assigns T to the premises at a world also assigns T to the conclusion at the same world. We use ‘4’ to describe such a transitive model because the logic which is adequate (both sound and complete) for 4-validity is K4, the logic which results from adding the axiom (4): □A→□□A to K.

Transitivity is not the only property which we might want to require of the frame if R is to be read ‘earlier than’ and W is a set of moments. One condition (which is only mildly controversial) is that there is no last moment of time, i.e. that for every world w there is some world v such that wRv. This condition on frames is called seriality. Seriality corresponds to the axiom (D): □A→◊A, in the same way that transitivity corresponds to (4). A D-model is a K-model with a serial frame. From the concept of a D-model the corresponding notion of D-validity can be defined just as we did in the case of 4-validity. As you probably guessed, the system that is adequate with respect to D-validity is KD, or K plus (D). Not only that, but the system KD4 (that is K plus (4) and (D)) is adequate with respect to D4-validity, where a D4-model is one where the frame is both serial and transitive.

Another property which we might want for the relation ‘earlier than’ is density, the condition which says that between any two times we can always find another. Density would be false if time were atomic, i.e. if there were intervals of time which could not be broken down into any smaller parts. Density corresponds to the axiom (C4): □□A→□A, the converse of (4), so for example, the system KC4, which is K plus (C4) is adequate with respect to models where the frame is dense, and KDC4, adequate with respect to models whose frames are serial and dense, and so on.

Each of the modal logic axioms we have discussed corresponds to a condition on frames in the same way. The relationship between conditions on frames and corresponding axioms is one of the central topics in the study of modal logics. Once an interpretation of the intensional operator □ has been decided on, the appropriate conditions on R can be determined to fix the corresponding notion of validity. This, in turn, allows us to select the right set of axioms for that logic.

For example, consider a deontic logic, where □ is read ‘it is obligatory that’. Here the truth of □A does not demand the truth of A in every possible world, but only in a subset of those worlds where people do what they ought. So we will want to introduce a relation R for this kind of logic as well, and use the truth clause (K) to evaluate □A at a world. However, in this case, R is not ‘earlier than’. Instead wRw′ holds just in case world w′ is a morally acceptable variant of w, i.e. a world that our actions can bring about which satisfies what is morally correct, or right, or just. Under such a reading, it should be clear that the relevant frames should obey seriality, the condition that requires that each possible world have a morally acceptable variant. The analysis of the properties desired for R makes it clear that a basic deontic logic can be formulated by adding the axiom (D) to K.


Even in modal logic, one may wish to restrict the range of possible worlds which are relevant in determining whether □A is true at a given world. For example, I might say that it is necessary for me to pay my bills, even though I know full well that there is a possible world where I fail to pay them. In ordinary speech, the claim that A is necessary does not require the truth of A in all possible worlds, but rather only in a certain class of worlds which I have in mind (for example, worlds where I avoid penalties for failure to pay). In order to provide a generic treatment of necessity, we must say that □A is true in w iff A is true in all worlds that are related to w in the right way. So for an operator □ interpreted as necessity, we introduce a corresponding relation R on the set of possible worlds W, traditionally called the accessibility relation. The accessibility relation R holds between worlds w and w′ iff w′ is possible given the facts of w. Under this reading for R, it should be clear that frames for modal logic should be reflexive. It follows that modal logics should be founded on M, the system that results from adding (M) to K. Depending on exactly how the accessibility relation is understood, symmetry and transitivity may also be desired.

A list of some of the more commonly discussed conditions on frames and their corresponding axioms along with a map showing the relationship between the various modal logics can be found in the next section.

8. Map of the Relationships Between Modal Logics

The following diagram shows the relationships between the best known modal logics, namely logics that can be formed by adding a selection of the axioms (D), (M), (4), (B) and (5) to K. A list of these (and other) axioms along with their corresponding frame conditions can be found below the diagram.

In this chart, systems are given by the list of their axioms. So, for example M4B is the result of adding
(M), (4) and (B) to K. In boldface, we have indicated traditional names of some systems. When system S appears below and/or to the left of S′ connected by a line, then S′ is an extension of S. This means that every argument provable in S is provable in S′, but S is weaker than S′, i.e. not all arguments provable in S′ are provable in S.

The following list indicates axioms, their names, and the corresponding conditions on the accessibility relation R, for axioms so far discussed in this encyclopedia entry.

Name    Axiom        Condition on Frames             R is...
(D)     □A→◊A        ∃u wRu                          Serial
(M)     □A→A         wRw                             Reflexive
(4)     □A→□□A       (wRv & vRu) ⇒ wRu               Transitive
(B)     A→□◊A        wRv ⇒ vRw                       Symmetric
(5)     ◊A→□◊A       (wRv & wRu) ⇒ vRu               Euclidean
(CD)    ◊A→□A        (wRv & wRu) ⇒ v=u               Unique
(□M)    □(□A→A)      wRv ⇒ vRv                       Shift Reflexive
(C4)    □□A→□A       wRv ⇒ ∃u(wRu & uRv)             Dense
(C)     ◊□A→□◊A      (wRv & wRx) ⇒ ∃u(vRu & xRu)     Convergent
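These frame conditions can be checked mechanically on a finite frame. The sketch below tests the five most commonly used conditions; the example frame (the total relation on three worlds, an S5 frame) is a hypothetical illustration.

```python
# Sketch: checking the frame conditions from the table on a finite frame.
# The example frame is hypothetical.

def reflexive(W, R):  return all((w, w) in R for w in W)
def serial(W, R):     return all(any((w, u) in R for u in W) for w in W)
def symmetric(W, R):  return all((v, w) in R for (w, v) in R)
def transitive(W, R): return all((w, u) in R
                                 for (w, v) in R for (v2, u) in R if v == v2)
def euclidean(W, R):  return all((v, u) in R
                                 for (w, v) in R for (w2, u) in R if w == w2)

W = {1, 2, 3}
R = {(w, v) for w in W for v in W}   # the total relation: an S5 frame
print(reflexive(W, R), transitive(W, R), euclidean(W, R))   # True True True
```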

In the list of conditions on frames, the variables ‘w’, ‘v’, ‘u’, ‘x’ and the quantifier ‘∃u’ are understood to range over W. ‘&’ abbreviates ‘and’ and ‘⇒’ abbreviates ‘if...then’.

9. The General Axiom

The correspondence between axioms and conditions on frames may seem something of a mystery. A beautiful result of Lemmon and Scott (1977) goes a long way towards explaining those relationships. Their theorem concerned axioms which have the following form:

(G) ◊h□iA → □j◊kA

We use the notation ‘◊n’ to represent n diamonds in a row, so, for example, ‘◊3’ abbreviates a string of three diamonds: ‘◊◊◊’. Similarly ‘□n’ represents a string of n boxes. When the values of h, i, j, and k are all 1, we have axiom (C):


(C) ◊□A → □◊A = ◊1□1A → □1◊1A

The axiom (B) results from setting h and i to 0, and letting j and k be 1:

(B) A → □◊A = ◊0□0A → □1◊1A

To obtain (4), we may set h and k to 0, set i to 1 and j to 2:

(4) □A → □□A = ◊0□1A → □2◊0A

Many (but not all) axioms of modal logic can be obtained by setting the right values for the parameters in (G).

Our next task will be to give the condition on frames which corresponds to (G) for a given selection of values for h, i, j, and k. In order to do so, we will need a definition. The composition of two relations R and R′ is a new relation R∘R′ which is defined as follows:

wR∘R′v iff for some u, wRu and uR′v.

For example, if R is the relation of being a brother, and R′ is the relation of being a parent, then R∘R′ is the relation of being an uncle (because w is the uncle of v iff for some person u, both w is the brother of u and u is the parent of v). A relation may be composed with itself. For example, when R is the relation of being a parent, then R∘R is the relation of being a grandparent, and R∘R∘R is the relation of being a great-grandparent. It will be useful to write ‘Rn’ for the result of composing R with itself n times. So R2 is R∘R, and R4 is R∘R∘R∘R. We will let R1 be R, and R0 will be the identity relation, i.e. wR0v iff w=v.
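The composition operation and the powers Rn are easy to state as code. The sketch below uses the parent/grandparent example from the text, with hypothetical names standing in for people.

```python
# Sketch of relation composition and the powers R^n as defined above.
# The family relation and names are illustrative.

def compose(R1, R2):
    """w (R1∘R2) v iff for some u, w R1 u and u R2 v."""
    return {(w, v) for (w, u) in R1 for (u2, v) in R2 if u == u2}

def power(R, n, W):
    """R^n: n-fold composition of R with itself; R^0 is the identity on W."""
    result = {(w, w) for w in W}     # R^0 = identity relation
    for _ in range(n):
        result = compose(result, R)
    return result

W = {"alice", "bob", "carol", "dan"}
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dan")}
print(sorted(power(parent, 2, W)))   # [('alice', 'carol'), ('bob', 'dan')]: grandparent
```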

We may now state the Scott-Lemmon result. It is that the condition on frames which corresponds exactly to any axiom of the shape (G) is the following.

(hijk-Convergence) wRhv & wRju ⇒ ∃x (vRix & uRkx)

It is interesting to see how the familiar conditions on R result from setting the values for h, i, j, and k according to the values in the corresponding axiom. For example, consider (5). In this case i=0, and h=j=k=1. So the corresponding condition is

wRv & wRu ⇒ ∃x (vR0x & uRx).

We have explained that R0 is the identity relation. So if vR0x then v=x. But ∃x (v=x & uRx) is equivalent to uRv, and so the Euclidean condition is obtained:


(wRv & wRu) ⇒ uRv.

In the case of axiom (4), h=0, i=1, j=2 and k=0. So the corresponding condition on frames is

(w=v & wR2u) ⇒ ∃x (vRx & u=x).

Resolving the identities, this amounts to:

vR2u ⇒ vRu.

By the definition of R2, vR2u iff ∃x(vRx & xRu), so this comes to:

∃x(vRx & xRu) ⇒ vRu, which, by predicate logic, is equivalent to transitivity:

vRx & xRu ⇒ vRu.

The reader may find it a pleasant exercise to see how the corresponding conditions fall out of hijk- Convergence when the values of the parameters h, i, j, and k are set by other axioms.
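One way to carry out that exercise is computationally: the hijk-Convergence condition can be tested directly on a finite frame for any choice of parameters. The sketch below (frames are hypothetical examples) recovers the Euclidean condition from the values for axiom (5).

```python
# Sketch: testing hijk-Convergence on a finite frame. compose/power restate
# the relation powers R^n defined earlier, so the block is self-contained.

def compose(R1, R2):
    return {(w, v) for (w, u) in R1 for (u2, v) in R2 if u == u2}

def power(R, n, W):
    result = {(w, w) for w in W}     # R^0 = identity
    for _ in range(n):
        result = compose(result, R)
    return result

def hijk_convergent(W, R, h, i, j, k):
    """wR^h v & wR^j u  ⇒  ∃x (vR^i x & uR^k x)."""
    Rh, Ri, Rj, Rk = (power(R, n, W) for n in (h, i, j, k))
    return all(any((v, x) in Ri and (u, x) in Rk for x in W)
               for (w, v) in Rh for (w2, u) in Rj if w == w2)

W = {1, 2, 3}
R = {(1, 2), (1, 3), (2, 2), (2, 3), (3, 3), (3, 2)}   # a Euclidean relation
print(hijk_convergent(W, R, 1, 0, 1, 1))   # True: (5)'s frame condition holds
```

Setting h=j=k=1 and i=0 here tests exactly the Euclidean condition derived above; other parameter settings from the table can be tried the same way.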

The Scott-Lemmon result provides a quick method for establishing results about the relationship between axioms and their corresponding frame conditions. Since they showed the adequacy of any logic that extends K with a selection of axioms of the form (G) with respect to models that satisfy the corresponding set of frame conditions, they provided "wholesale" adequacy proofs for the majority of systems in the modal family. Sahlqvist (1975) discovered important generalizations of the Scott-Lemmon result covering a much wider range of axiom types.

10. Provability Logics

Modal logic has been useful in clarifying our understanding of central results concerning provability in the foundations of mathematics (Boolos, 1993). Provability logics are systems where the propositional variables p, q, r, etc. range over formulas of some mathematical system, for example Peano's system PA for arithmetic. (The system chosen for mathematics might vary, but assume it is PA for this discussion.) Gödel showed that arithmetic has strong expressive powers. Using code numbers for arithmetic sentences, he was able to demonstrate a correspondence between sentences of mathematics and facts about which sentences are and are not provable in PA. For example, he showed that there is a sentence C that is true just in case no contradiction is provable in PA and there is a sentence G (the famous Gödel sentence) that is true just in case it is not provable in PA.


In provability logics, □p is interpreted as a formula (of arithmetic) that expresses that what p denotes is provable in PA. Using this notation, sentences of provability logic express facts about provability. Suppose that ⊥ is a constant of provability logic denoting a contradiction. Then ~□⊥ says that PA is consistent and □A→A says that PA is sound in the sense that when it proves A, A is indeed true. Furthermore, the box may be iterated. So, for example, □~□⊥ makes the dubious claim that PA is able to prove its own consistency, and ~□⊥ → ~□~□⊥ asserts (correctly as Gödel proved) that if PA is consistent then PA is unable to prove its own consistency.

Although provability logics form a family of related systems, the system GL is by far the best known. It results from adding the following axiom to K:

(GL) □(□A→A)→□A

The axiom (4): □A→□□A is provable in GL, so GL is actually a strengthening of K4. However, axioms such as (M): □A→A, and even the weaker (D): □A→◊A are not available (nor desirable) in GL. In provability logic, provability is not to be treated as a brand of necessity. The reason is that when p is provable in an arbitrary system S for mathematics, it does not follow that p is true, since S may be unsound. Furthermore, if p is provable in S (□p) it need not even follow that ~p lacks a proof (~□~p = ◊p). S might be inconsistent and so prove both p and ~p.

Axiom (GL) captures the content of Loeb's Theorem, an important result in the foundations of arithmetic. □A→A says that PA is sound for A, i.e. that if A were proven, A would be true. (Such a claim might not be secure for an arbitrarily selected system S, since A might be provable in S and false.) (GL) claims that if PA manages to prove the sentence that claims soundness for a given sentence A, then A is already provable in PA. Loeb's Theorem reports a kind of modesty on PA's part (Boolos, 1993, p. 55). PA never insists (proves) that a proof of A entails A's truth, unless it already has a proof of A to back up that claim.

It has been shown that GL is adequate for provability in the following sense. Let a sentence of GL be always provable exactly when the sentence of arithmetic it denotes is provable no matter how its variables are assigned values to sentences of PA. Then the provable sentences of GL are exactly the sentences that are always provable. This adequacy result has been extremely useful, since general questions concerning provability in PA can be transformed into easier questions about what can be demonstrated in GL.

GL can also be outfitted with a possible world semantics for which it is sound and complete. A corresponding condition on frames for GL-validity is that the frame be transitive, finite and irreflexive.
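The frame conditions for GL-validity are easy to verify on a finite frame; finiteness is immediate once W is listed explicitly. The example frame below (a strict linear order) is an illustration, not from the text.

```python
# Sketch: the frame conditions for GL-validity on a finite frame
# (transitive and irreflexive; finiteness is given by listing W).

def transitive(W, R):
    return all((w, u) in R for (w, v) in R for (v2, u) in R if v == v2)

def irreflexive(W, R):
    return all((w, w) not in R for w in W)

W = {0, 1, 2}
R = {(0, 1), (0, 2), (1, 2)}   # a strict order: a legitimate GL frame
print(transitive(W, R) and irreflexive(W, R))   # True
```

Note that irreflexivity rules out the reflexive frames of M, matching the point above that (M) is not available in GL.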

11. Quantifiers in Modal Logic

It would seem to be a simple matter to outfit a modal logic with the quantifiers ∀ (all) and ∃ (some). One

would simply add the standard (or classical) rules for quantifiers to the principles of whichever propositional modal logic one chooses. However, systems of this kind create problems which have motivated some logicians to abandon classical quantifier rules in favor of the weaker rules of free logic (Garson, 1984). The controversy over whether classical principles should be adopted continues today.

The main points of disagreement can be traced back to decisions about how to handle the domain of quantification. The simplest alternative, the fixed-domain (sometimes called the possibilist) approach, assumes a single domain of quantification that contains all the possible objects. On the other hand, the world-relative (or actualist) interpretation, assumes that the domain of quantification changes from world to world, and contains only the objects that actually exist in a given world.

The fixed-domain approach requires no major adjustments to the classical machinery for the quantifiers. Modal logics that are adequate for fixed domain semantics can usually be axiomatized by adding principles of a propositional modal logic to classical quantifier rules together with the Barcan Formula (BF) (Barcan 1946). (For an account of some interesting exceptions see Cresswell (1995)).

(BF) ∀x□A→□∀xA.

The fixed-domain interpretation has advantages of simplicity and familiarity, but it does not provide a direct account of the semantics of certain quantifier expressions of natural language. We do not think that ‘Some man exists who signed the Declaration of Independence’ is true, at least not if we read ‘exists’ in the present tense. Nevertheless, this sentence was true in 1777, which shows that the domain for the natural language expression ‘some man exists who’ changes to reflect which men exist at different times. A related problem is that on the fixed-domain interpretation, the sentence ∀y□∃x(x=y) is valid. However, assuming that ∃x(x=y) is read: y exists, ∀y□∃x(x=y) says that everything exists necessarily. Yet it seems a fundamental feature of common ideas about modality that the existence of many things is contingent, and that different objects exist in different possible worlds.

The defender of the fixed-domain interpretation may respond to these objections by insisting that on his (her) reading of the quantifiers, the domain of quantification contains all possible objects, not just the objects that happen to exist at a given world. So the theorem ∀y□∃x(x=y) makes the innocuous claim that every possible object is necessarily found in the domain of all possible objects. Furthermore, those quantifier expressions of natural language whose domain is world (or time) dependent can be expressed using the fixed-domain quantifier ∃x and a predicate letter E with the reading ‘actually exists’. For example, instead of translating ‘Some Man exists who Signed the Declaration of Independence’ by

∃x(Mx&Sx),

the defender of fixed domains may write:

∃x(Ex&Mx&Sx),

thus ensuring the translation is counted false at the present time. Cresswell (1991) makes the interesting observation that world-relative quantification has limited expressive power relative to fixed-domain quantification. World-relative quantification can be defined with fixed domain quantifiers and E, but there is no way to fully express fixed-domain quantifiers with world-relative ones. Although this argues in favor of the classical approach to quantified modal logic, the translation tactic also amounts to something of a concession in favor of free logic, for the world-relative quantifiers so defined obey exactly the free logic rules.

A problem with the translation strategy used by defenders of fixed domain quantification is that rendering the English into logic is less direct, since E must be added to all translations of all sentences whose quantifier expressions have domains that are context dependent. A more serious objection to fixed-domain quantification is that it strips the quantifier of a role which Quine recommended for it, namely to record robust ontological commitment. On this view, the domain of ∃x must contain only entities that are ontologically respectable, and possible objects are too abstract to qualify. Actualists of this stripe will want to develop the logic of a quantifier ∃x which reflects commitment to what is actual in a given world rather than to what is merely possible.

However, recent work on actualism tends to undermine this objection. For example, Linsky and Zalta (1994) argue that the fixed-domain quantifier can be given an interpretation that is perfectly acceptable to actualists. Actualists who employ possible worlds semantics routinely quantify over possible worlds in their semantical theory of language. So it would seem that possible worlds are actual by these actualists' lights. By cleverly outfitting the domain with abstract entities no more objectionable than the ones actualists accept, Linsky and Zalta show that the Barcan Formula and classical principles can be vindicated. Note however, that actualists may respond that they need not be committed to the actuality of possible worlds so long as it is understood that quantifiers used in their theory of language lack strong ontological import. In any case, it is open to actualists (and non actualists as well) to investigate the logic of quantifiers with more robust domains, for example domains excluding possible worlds and other such abstract entities, and containing only the spatio-temporal particulars found in a given world. For quantifiers of this kind, world-relative domains are appropriate.

Such considerations motivate interest in systems that acknowledge the context dependence of quantification by introducing world-relative domains. Here each possible world has its own domain of quantification (the set of objects that actually exist in that world), and the domains vary from one world to the next. When this decision is made, a difficulty arises for classical quantification theory. Notice that the sentence ∃x(x=t) is a theorem of classical quantification theory, and so □∃x(x=t) is a theorem of K by the Necessitation Rule. Let the term t stand for Saul Kripke. Then this theorem says that it is necessary that Saul Kripke exists, so that he is in the domain of every possible world. The whole motivation for the world-relative approach was to reflect the idea that objects in one world may fail to exist in another. If standard quantifier rules are used, however, every term t must refer to something that exists in all the possible worlds. This seems incompatible with our ordinary practice of using terms to refer to things that only exist contingently.

One response to this difficulty is simply to eliminate terms. Kripke (1963) gives an example of a system that uses the world-relative interpretation and preserves the classical rules. However, the costs are severe. First, his language is artificially impoverished, and second, the rules for the propositional modal logic must be weakened.

Presuming that we would like a language that includes terms, and that classical rules are to be added to standard systems of propositional modal logic, a new problem arises. In such a system, it is possible to prove (CBF), the converse of the Barcan Formula.

(CBF) □∀xA→∀x□A.

This fact has serious consequences for the system's semantics. It is not difficult to show that every world-relative model of (CBF) must meet condition (ND) (for ‘nested domains’).

(ND) If wRv then the domain of w is a subset of the domain of v.

However (ND) conflicts with the point of introducing world-relative domains. The whole idea was that existence of objects is contingent so that there are accessible possible worlds where one of the things in our world fails to exist.
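Condition (ND) is simple to test on a finite world-relative model. In the hypothetical model below the domains grow along R, so (ND) holds; shrinking a domain along R, as the world-relative approach intends to allow, falsifies it.

```python
# Sketch: checking the nested-domains condition (ND) on a world-relative
# model. Worlds, accessibility, and domains here are illustrative.

def nested_domains(W, R, domain):
    """(ND): if wRv then the domain of w is a subset of the domain of v."""
    return all(domain[w] <= domain[v] for (w, v) in R)

W = {"w", "v"}
R = {("w", "v")}
domain = {"w": {"a", "b"}, "v": {"a", "b", "c"}}   # domains grow along R
print(nested_domains(W, R, domain))                 # True

shrinking = {"w": {"a", "b"}, "v": {"a"}}           # b ceases to exist at v
print(nested_domains(W, R, shrinking))              # False: (ND) fails
```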

A straightforward solution to these problems is to abandon classical rules for the quantifiers and to adopt rules for free logic (FL) instead. The rules of FL are the same as the classical rules, except that inferences from ∀xRx (everything is real) to Rp (Pegasus is real) are blocked. This is done by introducing a predicate ‘E’ (for ‘actually exists’) and modifying the rule of universal instantiation. From ∀xRx one is allowed to obtain Rp only if one also has obtained Ep. Assuming that the universal quantifier ∀x is primitive, and the existential quantifier ∃x is defined by ∃xA =df ~∀x~A, then FL may be constructed by adding the following two principles to the rules of propositional logic

Universal Generalization. If B→A(y) is a theorem, so is B→∀xA(x).

Universal Instantiation. (∀xA(x) & En)→A(n)

(Here it is assumed that A(x) is any well-formed formula of predicate logic, and that A(y) and A(n) result from replacing y and n properly for each occurrence of x in A(x).) Note that the principle of universal generalization is standard, but that the instantiation axiom is restricted by mention of En in the antecedent. In FL, proofs of formulas like ∃x□(x=t), ∀y□∃x(x=y), (CBF), and (BF) which seem incompatible with the world-relative interpretation, are blocked.

One philosophical objection to FL is that E appears to be an existence predicate, and many would argue that existence is not a legitimate property like being green or weighing more than four pounds. So philosophers who reject the idea that existence is a predicate may object to FL. However in most (but

not all) quantified modal logics that include identity (=) these worries may be skirted by defining E as follows.

Et =df ∃x(x=t).

The most general way to formulate quantified modal logic is to create FS by adding the rules of FL to a given propositional modal logic S. In situations where classical quantification is desired, one may simply add Et as an axiom to FS, so that the classical principles become derivable rules. Adequacy results for such systems can be obtained for most choices of the modal logic S, but there are exceptions.

A final complication in the semantics for quantified modal logic is worth mentioning. It arises when non-rigid expressions such as ‘the inventor of bifocals’, are introduced to the language. A term is non-rigid when it picks out different objects in different possible worlds. The semantical value of such a term can be given by what Carnap (1947) called an individual concept, a function that picks out the denotation of the term for each possible world. One approach to dealing with non-rigid terms is to employ Russell's theory of descriptions. However, in a language that treats non-rigid expressions as genuine terms, it turns out that neither the classical nor the free logic rules for the quantifiers are acceptable. (The problem can not be resolved by weakening the rule of substitution for identity.) A solution to this problem is to employ a more general treatment of the quantifiers, where the domain of quantification contains individual concepts rather than objects. This more general interpretation provides a better match between the treatment of terms and the treatment of quantifiers and results in systems that are adequate for classical or free logic rules (depending on whether the fixed domains or world-relative domains are chosen).

Bibliography

An excellent bibliography of historical sources can be found in Hughes and Cresswell (1968).

● Anderson, A. and Belnap, N., 1975, 1992, Entailment: The Logic of Relevance and Necessity, vol. 1 (1975), vol. 2 (1992), Princeton: Princeton University Press.
● Barcan, R., 1946, "A Functional Calculus of First Order Based on Strict Implication," Journal of Symbolic Logic, 11: 1-16.
● Bencivenga, E., 1986, "Free Logics," in Gabbay, D., and Guenthner, F. (eds.), Handbook of Philosophical Logic, 3.6, Dordrecht: D. Reidel.
● Bonevac, D., 1987, Deduction, Part II, Palo Alto, California: Mayfield Publishing Company.
● Boolos, G., 1993, The Logic of Provability, Cambridge: Cambridge University Press.
● Bull, R. and Segerberg, K., 1984, "Basic Modal Logic," in Gabbay, D., and Guenthner, F. (eds.), Handbook of Philosophical Logic, 2.1, Dordrecht: D. Reidel.
● Carnap, R., 1947, Meaning and Necessity, Chicago: University of Chicago Press.
● Chellas, B., 1980, Modal Logic: An Introduction, Cambridge: Cambridge University Press.
● Cresswell, M. J., 1995, "Incompleteness and the Barcan Formula," Journal of Philosophical Logic, 24: 379-403.
● Cresswell, M. J., 1991, "In Defence of the Barcan Formula," Logique et Analyse, 135-136: 271-282.
● Fitting, M. and Mendelsohn, R., 1998, First Order Modal Logic, Dordrecht: Kluwer.
● Gabbay, D., 1976, Investigations in Modal and Tense Logics, Dordrecht: D. Reidel.
● Gabbay, D., 1994, Temporal Logic: Mathematical Foundations and Computational Aspects, New York: Oxford University Press.
● Garson, J., 1984, "Quantification in Modal Logic," in Gabbay, D., and Guenthner, F. (eds.), Handbook of Philosophical Logic, 2.5, Dordrecht: D. Reidel.
● Hintikka, J., 1962, Knowledge and Belief: An Introduction to the Logic of the Two Notions, Ithaca, NY: Cornell University Press.
● Hilpinen, R., 1971, Deontic Logic: Introductory and Systematic Readings, Dordrecht: D. Reidel.
● Hughes, G. and Cresswell, M., 1968, An Introduction to Modal Logic, London: Methuen.
● Hughes, G. and Cresswell, M., 1984, A Companion to Modal Logic, London: Methuen.
● Hughes, G. and Cresswell, M., 1996, A New Introduction to Modal Logic, London: Routledge.
● Kripke, S., 1963, "Semantical Considerations on Modal Logic," Acta Philosophica Fennica, 16: 83-94.
● Konyndik, K., 1986, Introductory Modal Logic, Notre Dame: University of Notre Dame Press.
● Kvart, I., 1986, A Theory of Counterfactuals, Indianapolis: Hackett Publishing Company.
● Lemmon, E. and Scott, D., 1977, An Introduction to Modal Logic, Oxford: Blackwell.
● Lewis, C. I. and Langford, C. H., 1959 (1932), Symbolic Logic, New York: Dover Publications.
● Lewis, D., 1973, Counterfactuals, Cambridge, Massachusetts: Harvard University Press.
● Linsky, B. and Zalta, E., 1994, "In Defense of the Simplest Quantified Modal Logic," Philosophical Perspectives (Logic and Language), 8: 431-458.
● Prior, A. N., 1957, Time and Modality, Oxford: Clarendon Press.
● Prior, A. N., 1967, Past, Present and Future, Oxford: Clarendon Press.
● Quine, W. V. O., 1953, "Reference and Modality," in From a Logical Point of View, Cambridge, MA: Harvard University Press, 139-159.
● Rescher, N. and Urquhart, A., 1971, Temporal Logic, New York: Springer Verlag.
● Sahlqvist, H., 1975, "Completeness and Correspondence in First and Second Order Semantics for Modal Logic," in Kanger, S. (ed.), Proceedings of the Third Scandinavian Logic Symposium, Amsterdam: North Holland, 110-143.
● Van Benthem, J. F., 1982, The Logic of Time, Dordrecht: D. Reidel.
● Zeman, J., 1973, Modal Logic: The Lewis-Modal Systems, Oxford: Oxford University Press.

Other Internet Resources

● John Halleck's Logic System Interrelationships Home Page

● John McCarthy's Modal Logic Page

Related Entries
