
DOI:10.1145/2699411

Unifying Logic and Probability
Open-universe models show merit in unifying efforts.

BY STUART RUSSELL

Key Insights
– First-order logic and probability theory have addressed complementary aspects of knowledge representation and reasoning: the ability to describe complex domains concisely in terms of objects and relations, and the ability to handle uncertain information. Their unification holds enormous promise for AI.
– New languages for defining open-universe probability models appear to provide the desired unification in a natural way. As a bonus, they support probabilistic reasoning about the existence and identity of objects, which is important for any system trying to understand the world through perceptual or textual inputs.

PERHAPS THE MOST enduring idea from the early days of AI is that of a declarative system reasoning over explicitly represented knowledge with a general inference engine. Such systems require a formal language to describe the real world; and the real world has things in it. For this reason, classical AI adopted first-order logic, the mathematics of objects and relations, as its foundation.

The key benefit of first-order logic is its expressive power, which leads to concise (and hence learnable) models. For example, the rules of chess occupy 10^0 pages in first-order logic, 10^5 pages in propositional logic, and 10^38 pages in the language of finite automata. The power comes from separating predicates from their arguments and quantifying over those arguments: one can write rules about On(p, c, x, y, t) (piece p of color c is on square x, y at move t) without filling in each specific value for c, p, x, y, and t.

Modern AI research has addressed another important property of the real world, pervasive uncertainty about both its state and its dynamics, using probability theory. A key step was Pearl's development of Bayesian networks, which provided the beginnings of a formal language for probability models and enabled rapid progress in reasoning, learning, vision, and language understanding. The expressive power of Bayes nets is, however, limited. They assume a fixed set of variables, each taking a value from a fixed range; thus, they are a propositional formalism, like Boolean circuits. The rules of chess and of many other domains are beyond them.

What happened next, of course, is that classical AI researchers noticed the pervasive uncertainty, while modern AI researchers noticed, or remembered, that the world has things in it. Both traditions arrived at the same place: the world is uncertain and it has things in it. To deal with this, we have to unify logic and probability. But how? Even the meaning of such a goal is unclear.

Early attempts by Leibniz, Bernoulli, De Morgan, Boole, Peirce, Keynes, and Carnap (surveyed by Hailperin12 and Howson14) involved attaching probabilities to logical sentences. This line of work influenced AI research but has serious shortcomings as a vehicle for representing knowledge. An alternative approach, arising from both branches of AI and from statistics, combines the syntactic and semantic devices of logic (composable function symbols, logical variables, quantifiers) with the compositional semantics of Bayes nets. The resulting languages enable the construction of very large probability models and have changed the way in which real-world data are analyzed.

Despite their successes, these approaches miss an important consequence of uncertainty in a world of things: uncertainty about what things are in the world. Real objects seldom wear unique identifiers or preannounce their existence like the cast of a play. In areas such as vision, language understanding, and Web mining, the existence of objects must be inferred from raw data (pixels, strings, and so on) that contain no explicit object references.

The difference between knowing all the objects in advance and inferring their existence from observation corresponds to the distinction between closed-universe languages such as SQL and logic programs and open-universe languages such as full first-order logic. This article focuses in particular on open-universe probability models. The section on Open-Universe Models describes a formal language, Bayesian logic or BLOG, for writing such models.


It gives several examples, including (in simplified form) a global seismic monitoring system for the Comprehensive Nuclear Test-Ban Treaty.

Logic and Probability
This section explains the core concepts of logic and probability, beginning with possible worlds.^a A possible world is a formal object (think "data structure") with respect to which the truth of any assertion can be evaluated.

For the language of propositional logic, in which sentences are composed from proposition symbols X1, . . . , Xn joined by logical connectives (∧, ∨, ¬, ⇒, ⇔), the possible worlds ω ∈ Ω are all possible assignments of true and false to the symbols. First-order logic adds the notion of terms, that is, expressions referring to objects; a term is a constant symbol, a logical variable, or a k-ary function applied to k terms as arguments. Proposition symbols are replaced by atomic sentences, consisting of either predicate symbols applied to terms or equality between terms. Thus, Parent(Bill, William) and Father(William) = Bill are atomic sentences. The quantifiers ∀ and ∃ make assertions across all objects, for example,

    ∀ p, c (Parent(p, c) ∧ Male(p)) ⇔ Father(c) = p.

For first-order logic, a possible world specifies (1) a set of domain elements (or objects) o1, o2, . . . and (2) mappings from the constant symbols to the domain elements and from the function and predicate symbols to functions and relations on the domain elements. Figure 1a shows a simple example with two constants and one binary predicate. Notice that first-order logic is an open-universe language: even though there are two constant symbols, the possible worlds allow for 1, 2, or indeed arbitrarily many objects. A closed-universe language enforces additional assumptions:

– The unique names assumption requires that distinct terms must refer to distinct objects.
– The domain closure assumption requires that there are no objects other than those named by terms.

These two assumptions force every world to contain the same objects, which are in one-to-one correspondence with the ground terms of the language (see Figure 1b).^b Obviously, the set of worlds under open-universe semantics is larger and more heterogeneous, which makes the task of defining open-universe probability models more challenging.

The formal semantics of a logical language define the truth of a sentence in a possible world. For example, the first-order sentence A = B is true in world ω iff A and B refer to the same object in ω; thus, it is true in the first three worlds of Figure 1a and false in the fourth. (It is always false under closed-universe semantics.) Let T(α) be the set of worlds where the sentence α is true; then one sentence α entails another sentence β, written α ⊨ β, if T(α) ⊆ T(β). Logical inference algorithms generally determine whether a query sentence is entailed by the known sentences.

In probability theory, a probability model P for a countable space Ω of possible worlds assigns a probability P(ω) to each world, such that 0 ≤ P(ω) ≤ 1 and Σω∈Ω P(ω) = 1. Given a probability model, the probability of a logical sentence α is the total probability of all worlds in which α is true (see the sketch following Figure 2):

    P(α) = Σω∈T(α) P(ω).    (1)

The conditional probability of one sentence given another is P(α|β) = P(α ∧ β)/P(β), provided P(β) > 0. A random variable is a function from possible worlds to a fixed range of values; for example, one might define the Boolean random variable V_{A=B} to have the value true in the first three worlds of Figure 1a and false in the fourth. The distribution of a random variable is the set of probabilities associated with each of its possible values. For example, suppose the variable Coin has values 0 (heads) and 1 (tails); then the assertion Coin ∼ Bernoulli(0.6) says that Coin has a Bernoulli distribution with parameter 0.6, that is, probability 0.6 for the value 1 and 0.4 for the value 0.

a. In logic, a possible world may be called a model or structure; in probability theory, a sample point. To avoid confusion, this article uses "model" to refer only to probability models.

b. The open/closed distinction can also be illustrated with a common-sense example. Suppose a system knows that Father(William) = Bill and Father(Junior) = Bill. How many children does Bill have? Under closed-universe semantics, for example in a database, he has exactly two; under open-universe semantics, between 1 and ∞.

Figure 1. (a) Some of the infinitely many possible worlds for a first-order, open-universe language with two constant symbols, A and B, and one binary predicate R(x, y). Gray arrows indicate the interpretations of A and B and black arrows connect pairs of objects satisfying R. (b) The analogous figure under closed-universe semantics; here, there are exactly four possible x, y-pairs and hence 2^4 = 16 worlds.

Figure 2. Left: a Bayes net with three Boolean variables, showing the conditional probability of true for each variable given its parents. Right: the joint distribution defined by Equation (2).

    P(Burglary) = 0.003        P(Earthquake) = 0.002

    B E | P(Alarm | B,E)       B E A | P(B,E,A)
    t t | 0.9                  t t t | 0.0000054
    t f | 0.8                  t t f | 0.0000006
    f t | 0.4                  t f t | 0.0023952
    f f | 0.01                 t f f | 0.0005988
                               f t t | 0.0007976
                               f t f | 0.0011964
                               f f t | 0.00995006
                               f f f | 0.98505594
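To make Equation (1) concrete, the following small sketch (my illustration, not code from the article) enumerates the four worlds of a two-symbol propositional language and sums the probabilities of those satisfying a sentence; the priors are borrowed from Figure 2, with the two symbols treated as independent.

    from itertools import product

    # Possible worlds: all truth assignments to the proposition symbols.
    symbols = ("Burglary", "Earthquake")
    worlds = list(product([True, False], repeat=len(symbols)))

    # A probability model assigns P(w) to each world; here the symbols are
    # independent, with P(Burglary)=0.003 and P(Earthquake)=0.002 as in Figure 2.
    def P(world):
        b, e = world
        return (0.003 if b else 0.997) * (0.002 if e else 0.998)

    assert abs(sum(map(P, worlds)) - 1.0) < 1e-12

    # Equation (1): P(alpha) is the total probability of worlds where alpha holds.
    alpha = lambda w: (not w[0]) or w[1]          # Burglary => Earthquake
    print(sum(P(w) for w in worlds if alpha(w)))  # 0.997006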


Unlike logic, probability theory lacks broad agreement on syntactic forms for expressing nontrivial assertions. For communication among statisticians, a combination of English and LaTeX usually suffices; but precise language definitions are needed as the "input format" for general probabilistic reasoning systems and as the "output format" for general learning systems. As noted, Bayesian networks27 provided a (partial) syntax and semantics for the propositional case. The syntax of a Bayes net for random variables X1, . . . , Xn consists of a directed, acyclic graph whose nodes correspond to the random variables, together with associated local conditional distributions.^c The semantics specifies a joint distribution over the variables as follows:

    P(ω) = P(x1, . . . , xn) = ∏i P(xi | parents(Xi)).    (2)

These definitions have the desirable property that every well-formed Bayes net corresponds to a proper probability model on the associated Cartesian product space. Moreover, a sparse graph, reflecting a sparse causal structure in the underlying domain, leads to a representation that is exponentially smaller than the corresponding complete enumeration.

The Bayes net example in Figure 2 (due to Pearl) shows two independent causes, Earthquake and Burglary, that influence whether or not an Alarm sounds in Professor Pearl's house. According to Equation (2), the joint probability P(Burglary, Earthquake, Alarm) is given by

    P(Burglary) P(Earthquake) P(Alarm | Burglary, Earthquake).

The results of this calculation are shown in the figure. Notice the eight possible worlds are the same ones that would exist for a propositional logic theory with the same symbols.

A Bayes net is more than just a specification for a distribution over worlds; it is also a stochastic "machine" for generating worlds. By sampling the variables in topological order (that is, parents before children), one generates a world exactly according to the distribution defined in Equation (2). This generative view will be helpful in extending Bayes nets to the first-order case.

Adding Probabilities to Logic
Early attempts to unify logic and probability attached probabilities directly to logical sentences. The first rigorous treatment, Gaifman's propositional probability logic,9 was augmented with algorithmic analysis by Hailperin12 and Nilsson.23 In such a logic, one can assert, for example,

    P(Burglary ⇒ Earthquake) = 0.997006,    (3)

a claim that is implicit in the model of Figure 2. The sentence Burglary ⇒ Earthquake is true in six of the eight possible worlds; so, by Equation (1), assertion (3) is equivalent to

    P(t t t) + P(t t f) + P(f t t) + P(f t f) + P(f f t) + P(f f f) = 0.997006,

a sum that can be verified against Figure 2 (and in the code sketch at the end of this section). Because any particular probability model μ assigns a probability to every possible world, such a constraint will be either true or false in μ; thus μ, a distribution over possible propositional worlds, acts as a single possible world, with respect to which the truth of any probability assertion can be evaluated. Entailment between probability assertions is then defined in exactly the same way as in ordinary logic; thus, assertion (3) entails the assertion

    P(Burglary ∧ Earthquake) ≤ 0.997006,

because the latter is true in every probability model in which assertion (3) holds. Satisfiability of sets of such assertions can be determined by linear programming.12 Hence, we have a "probability logic" in the same sense as "temporal logic," that is, a deductive logical system specialized for reasoning with probabilistic assertions.

To apply probability logic to tasks such as proving the theorems of probability theory, a more expressive language was needed. Gaifman8 proposed a first-order probability logic, with possible worlds being first-order model structures and with probabilities attached to sentences of (function-free) first-order logic. Within AI, the most direct descendant of these ideas appears in Lukasiewicz's probabilistic logic programs, in which a probability range is attached to each first-order Horn clause and inference is performed by solving linear programs, as suggested by Hailperin. Within the subfield of probabilistic databases one also finds logical sentences labeled with probabilities6, but in this case probabilities are attached directly to the tuples of the database. (In AI and statistics, probability is attached to general relationships whereas observations are viewed as incontrovertible evidence.) Although probabilistic databases can model complex dependencies, in practice one often finds such systems using global independence assumptions across tuples.

Halpern13 and Bacchus3 adopted and extended Gaifman's technical approach, adding probability expressions to the logic. Thus, one can write

    ∀ h1, h2  Burglary(h1) ∧ ¬Burglary(h2) ⇒ P(Alarm(h1)) > P(Alarm(h2)),

where now Burglary and Alarm are predicates applying to individual houses. The new language is more expressive but does not resolve the difficulty that Gaifman faced: how to define complete and consistent probability models. Each inequality constrains the underlying probability model to lie in a half-space in the high-dimensional space of probability models. Conjoining assertions corresponds to intersecting the constraints. Ensuring that the intersection yields a single point is not easy. In fact, Gaifman's principal result8 is that specifying a single probability model requires (1) a probability for every possible ground sentence and (2) probability constraints for infinitely many existentially quantified sentences.

Researchers have explored two solutions to this problem. The first involves writing a partial theory and then "completing" it by picking out one canonical model in the allowed set. Nilsson23 proposed choosing the maximum entropy model consistent with the specified constraints. Paskin24 developed a "maximum-entropy probabilistic logic" with constraints expressed as weights (relative probabilities) attached to first-order clauses. Such models are often called Markov logic networks or MLNs30 and have become a popular technique for applications involving relational data. There is, however, a semantic difficulty with such models: weights trained in one scenario do not generalize to scenarios with different numbers of objects.

c. Texts on Bayes nets typically do not define a syntax for local conditional distributions other than tables, although Bayes net software packages do.
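As a concrete check (a sketch of mine, not code from the article), the following builds the joint distribution of Figure 2 via Equation (2) and verifies assertion (3):

    from itertools import product

    P_B, P_E = 0.003, 0.002                        # priors from Figure 2
    P_A = {(True, True): 0.9, (True, False): 0.8,  # P(Alarm=true | B, E)
           (False, True): 0.4, (False, False): 0.01}

    joint = {}
    for b, e, a in product([True, False], repeat=3):
        joint[(b, e, a)] = ((P_B if b else 1 - P_B) *            # Equation (2):
                            (P_E if e else 1 - P_E) *            # product of local
                            (P_A[b, e] if a else 1 - P_A[b, e])) # conditionals

    # Burglary => Earthquake holds in six of the eight worlds (all but B=t, E=f).
    print(sum(p for (b, e, a), p in joint.items() if (not b) or e))  # 0.997006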


Moreover, by adding irrelevant objects to a scenario, one can change the model's predictions for a given query to an arbitrary extent.16,20

A second approach, which happens to avoid the problems just noted,^d builds on the fact that every well-formed Bayes net necessarily defines a proper probability distribution (a complete theory, in the terminology of probability logic) over the variables it contains. The next section describes how this property can be combined with the expressive power of first-order logical notation.

Bayes Nets with Quantifiers
Soon after their introduction, researchers developing Bayes nets for applications came up against the limitations of a propositional language. For example, suppose that in Figure 2 there are many houses in the same general area as Professor Pearl's: each one needs an Alarm variable and a Burglary variable with the same CPTs and connected to the Earthquake variable in the same way. In a propositional language, this repeated structure has to be built manually, one variable at a time. The same problem arises with models for sequential data such as text and time series, which contain sequences of identical submodels, and in models for Bayesian parameter learning, where every instance variable is influenced by the parameter variables in the same way.

At first, researchers simply wrote programs to build the networks, using ordinary loops to handle repeated structure. Figure 3 shows pseudocode to build an alarm network for R geological fault regions, each with H(r) houses. The pictorial notation of plates was developed to denote repeated structure, and software tools such as BUGS10 and Microsoft's infer.net have facilitated a rapid expansion in applications of probabilistic methods. In all these tools, the model structure is built by a fixed program, so every possible world has the same random variables connected in the same way. Moreover, the code for constructing the models is not viewed as something that could be the output of a learning algorithm.

Breese4 proposed a more declarative approach reminiscent of Horn clauses. Other declarative languages included Poole's Independent Choice Logic, Sato's PRISM, Koller and Pfeffer's probabilistic relational models, and de Raedt's Bayesian Logic Programs. In all these cases, the head of each clause or dependency statement corresponds to a parameterized set of child random variables, with the parent variables being the corresponding ground instances of the literals in the body of the clause. For example, Equation (4) shows the dependency statements equivalent to the code fragment in Figure 3:

    Burglary(h) ∼ Bernoulli(0.003)
    Earthquake(r) ∼ Bernoulli(0.002)    (4)
    Alarm(h) ∼ CPT[. . .](Earthquake(FaultRegion(h)), Burglary(h))

where CPT denotes a suitable conditional probability table indexed by the corresponding arguments. Here, h and r are logical variables ranging over houses and regions; they are implicitly universally quantified. FaultRegion is a function symbol connecting a house to its geological region. Together with a relational skeleton that enumerates the objects of each type and specifies the values of each function and relation, a set of dependency statements such as Equation (4) corresponds to an ordinary, albeit potentially very large, Bayes net. For example, if there are two houses in region A and three in region B, the corresponding Bayes net is the one in Figure 4 (the sketch following the figures below illustrates this grounding step).

d. Briefly, the problems are avoided by separating the generative models for object existence and for object properties and relations, and by allowing for unobserved objects.

Figure 3. Illustrative pseudocode for building a version of the Bayes net in Figure 2 that handles R different regions with H(r) houses in each.

    loop for r from 1 to R do
       add node Earthquake_r with no parents, prior 0.002
       loop for h from 1 to H(r) do
          add node Burglary_{r,h} with no parents, prior 0.003
          add node Alarm_{r,h} with parents Earthquake_r, Burglary_{r,h}

Figure 4. The Bayes net corresponding to Equation (4), given two houses in region A and three in B. [Diagram: root nodes Earthquake_A and Earthquake_B; Burglary and Alarm nodes for houses A,1; A,2; B,1; B,2; B,3; each Alarm_{r,h} has parents Earthquake_r and Burglary_{r,h}.]
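The grounding step just described can be pictured in a few lines of code. This is an illustrative sketch only (the data structure and names are my inventions, not part of BLOG or any of the systems cited): it expands Equation (4) against a relational skeleton into the node-and-parent structure of Figure 4.

    # Relational skeleton: two houses in region A, three in region B.
    skeleton = {"A": ["A1", "A2"], "B": ["B1", "B2", "B3"]}

    parents = {}  # ground node -> list of parent nodes

    for region, houses in skeleton.items():
        parents[("Earthquake", region)] = []                  # Earthquake(r)
        for h in houses:
            parents[("Burglary", h)] = []                     # Burglary(h)
            parents[("Alarm", h)] = [("Earthquake", region),  # Alarm(h) depends on
                                     ("Burglary", h)]         # Equation (4)'s parents

    print(len(parents))  # 12 ground variables, as in Figure 4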

Open-Universe Models
As noted earlier, closed-universe languages disallow uncertainty about what things are in the world; the existence and identity of all objects must be known in advance. In contrast, an open-universe probability model (OUPM) defines a probability distribution over possible worlds that vary in the objects they contain and in the mapping from symbols to objects. Thus, OUPMs can handle data from sources (text, video, radar, intelligence reports, among others) that violate the closed-universe assumption. Given evidence, OUPMs learn about the objects the world contains.

Looking at Figure 1a, the first problem is how to ensure that the model specifies a proper distribution over a heterogeneous, unbounded set of possible worlds. The key is to extend the generative view of Bayes nets from the propositional to the first-order, open-universe case:

– Bayes nets generate propositional worlds one event at a time; each event fixes the value of a variable.
– First-order, closed-universe models such as Equation (4) define generation steps for entire classes of events.
– First-order, open-universe models include generative steps that add objects to the world rather than just fixing their properties and relations.

Consider, for example, an open-universe version of the alarm model in Equation (4). If one suspects the existence of up to three geological fault regions, with equal probability, this can be expressed as a number statement:

    #Region ∼ UniformInt(1, 3).    (5)

For the sake of illustration, let us assume that the number of houses in a region r is drawn uniformly between 0 and 4:

    #House(FaultRegion = r) ∼ UniformInt(0, 4).    (6)

Here, FaultRegion is called an origin function, since it connects a house to the region from which it originates. Together, the dependency statements (4) and the two number statements (5 and 6), along with the necessary type signatures, specify a complete distribution over all the possible worlds definable with this vocabulary. There are infinitely many such worlds, but, because the number statements are bounded, only finitely many (317,680,374 to be precise) have non-zero probability. Figure 5 shows an example of a particular world constructed from this model.

The BLOG language21 provides a precise syntax, semantics, and inference capability for open-universe probability models composed of dependency and number statements. BLOG models can be arbitrarily complex, but they inherit the key declarative property of Bayes nets: every well-formed BLOG model specifies a well-defined probability distribution over possible worlds.

To make such a claim precise, one must define exactly what these worlds are and how the model assigns a probability to each. The definitions (given in full in Brian Milch's PhD thesis20) begin with the objects each world contains. In the standard semantics of typed first-order logic, objects are just numbered tokens with types. In BLOG, each object also has an origin, indicating how it was generated. (The reason for this slightly baroque construction will become clear shortly.) For number statements with no origin functions, for example Equation (5), the objects have an empty origin; thus ⟨Region, , 2⟩ refers to the second region generated from that statement. For number statements with origin functions, for example Equation (6), each object records its origin; thus ⟨House, ⟨FaultRegion, ⟨Region, , 2⟩⟩, 3⟩ is the third house in the second region.

The number variables of a BLOG model specify how many objects there are of each type with each possible origin; thus #House⟨FaultRegion,⟨Region, ,2⟩⟩(ω) = 4 means that in world ω there are 4 houses in region 2. The basic variables determine the values of predicates and functions for all tuples of objects; thus, Earthquake⟨Region, ,2⟩(ω) = true means that in world ω there is an earthquake in region 2. A possible world is defined by the values of all the number variables and basic variables. A world may be generated from the model by sampling in topological order, as in Figure 5 and the code sketch that follows.

Figure 5. Construction of one possible world by sampling from the burglary/earthquake model. Each row shows the variable being sampled, the value it receives, and the probability of that value conditioned on the preceding assignments.

    Variable                                       Value   Prob.
    #Region                                        2       0.3333
    Earthquake⟨Region, ,1⟩                         false   0.998
    Earthquake⟨Region, ,2⟩                         false   0.998
    #House⟨FaultRegion,⟨Region, ,1⟩⟩               1       0.2
    #House⟨FaultRegion,⟨Region, ,2⟩⟩               1       0.2
    Burglary⟨House,⟨FaultRegion,⟨Region, ,1⟩⟩,1⟩   false   0.997
    Burglary⟨House,⟨FaultRegion,⟨Region, ,2⟩⟩,1⟩   true    0.003
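The sampling process of Figure 5 can be mirrored directly in code. The sketch below is my paraphrase of the model, not the BLOG engine; the CPT numbers are taken from Figure 2.

    import random

    def sample_world(seed=None):
        rng = random.Random(seed)
        world = {}
        n_regions = rng.randint(1, 3)             # #Region ~ UniformInt(1,3)   (5)
        world["#Region"] = n_regions
        for r in range(1, n_regions + 1):
            quake = rng.random() < 0.002          # Earthquake(r) ~ Bernoulli(0.002)
            world["Earthquake", r] = quake
            n_houses = rng.randint(0, 4)          # #House(FaultRegion=r)       (6)
            world["#House", r] = n_houses
            for h in range(1, n_houses + 1):
                burgled = rng.random() < 0.003    # Burglary(h) ~ Bernoulli(0.003)
                world["Burglary", r, h] = burgled
                p_alarm = {(True, True): 0.9, (True, False): 0.4,
                           (False, True): 0.8, (False, False): 0.01}[quake, burgled]
                world["Alarm", r, h] = rng.random() < p_alarm  # Alarm(h) ~ CPT(...)
        return world

    print(sample_world(0))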


The probability of a world so constructed is the product of the probabilities for all the sampled values; in this case, 0.00003972063952. Now it becomes clear why each object contains its origin: this property ensures that every world can be constructed by exactly one sampling sequence. If this were not the case, the probability of a world would be the sum over all possible sampling sequences that create it.

Open-universe models may have infinitely many random variables, so the full theory involves nontrivial measure-theoretic considerations. For example, the number statement #Region ∼ Poisson(μ) assigns probability e^(−μ)μ^k/k! to each nonnegative integer k. Moreover, the language allows recursion and infinite types (integers, strings, and so on). Finally, well-formedness disallows cyclic dependencies and infinitely receding ancestor chains; these conditions are undecidable in general, but certain syntactic sufficient conditions can be checked easily.

Examples
The standard "use case" for BLOG has three elements: the model, the evidence (the known facts in a given scenario), and the query, which may be any expression, possibly with free logical variables. The answer is a posterior joint probability for each possible set of substitutions for the free variables, given the evidence, according to the model.^e Every model includes type declarations, type signatures for the predicates and functions, one or more number statements for each type, and one dependency statement for each predicate and function. (In the examples here, declarations and signatures are omitted where the meaning is clear.) Dependency statements use an if-then-else syntax to handle so-called context-specific dependencies, whereby one variable may or may not be a parent of another, depending on the value of a third variable.

Citation matching. Systems such as CiteSeer and Google Scholar extract a database-like representation, relating papers and researchers by authorship and citation links, from raw ASCII citation strings. These strings contain no object identifiers and include errors of syntax, spelling, punctuation, and content, which lead in turn to errors in the extracted databases. For example, in 2002, CiteSeer reported over 120 distinct books written by Russell and Norvig.

A generative model for this domain (Figure 6) connects an underlying, unobserved world to the observed strings: there are researchers, who have names; researchers write papers, which have titles; people cite the papers, combining the authors' names and the paper's title (with errors) into the text of the citation according to some grammar. Given citation strings as evidence, a multi-author version of this model, trained in an unsupervised fashion, had an error rate two to three times lower than CiteSeer's on four standard test sets.25 The inference process in such a vertically integrated model also exhibits a form of collective, knowledge-driven disambiguation: the more citations there are for a given paper, the more accurately each of them is parsed, because the parses have to agree on the facts about the paper.

Multitarget tracking. Given a set of unlabeled data points generated by some unknown, time-varying set of objects, the goal is to detect and track the underlying objects. In radar systems, for example, each rotation of the radar dish produces a set of blips. New objects may appear, existing objects may disappear, and false alarms and detection failures are possible. The standard model (Figure 7) assumes independent, linear-Gaussian dynamics and measurements; a simplified simulation follows Figure 7 below. Exact inference is provably intractable, but MCMC typically works well in practice. Perhaps more importantly, elaborations of the scenario (formation flying, objects heading for unknown destinations, objects taking off or landing) can be handled by small changes to the model without resorting to new mathematical derivations and complex programming.

e. As with Prolog, there may be infinitely many sets of substitutions of unbounded size; designing exploratory interfaces for such answers is an interesting HCI challenge.

Figure 6. BLOG model for citation information extraction. For simplicity the model assumes one author per paper and omits details of the grammar and error models. OM(a, b) is a discrete log-normal, base 10; that is, the order of magnitude is 10^(a±b).

    type Researcher, Paper, Citation;
    random String Name(Researcher);
    random String Title(Paper);
    random Paper PubCited(Citation);
    random String Text(Citation);
    random Boolean Prof(Researcher);
    origin Researcher Author(Paper);

    #Researcher ∼ OM(3,1);
    Name(r) ∼ CensusDB_NamePrior();
    Prof(r) ∼ Boolean(0.2);
    #Paper(Author=r) if Prof(r) then ∼ OM(1.5,0.5) else ∼ OM(1,0.5);
    Title(p) ∼ CSPaperDB_TitlePrior();
    PubCited(c) ∼ UniformChoice({Paper p});
    Text(c) ∼ HMMGrammar(Name(Author(PubCited(c))), Title(PubCited(c)));

Figure 7. BLOG model for radar tracking of multiple targets. X(a, t) is the state of aircraft a at time t, while Z(b) is the observed position of blip b.

    #Aircraft(EntryTime = t) ∼ Poisson(λa);
    Exits(a,t) if InFlight(a,t) then ∼ Boolean(αe);
    InFlight(a,t) =
        (t == EntryTime(a)) | (InFlight(a,t-1) & !Exits(a,t-1));
    X(a,t) if t == EntryTime(a) then ∼ InitState()
           else if InFlight(a,t) then ∼ Normal(F*X(a,t-1), Σx);
    #Blip(Source=a, Time=t)
        if InFlight(a,t) then ∼ Bernoulli(DetectionProb(X(a,t)));
    #Blip(Time=t) ∼ Poisson(λf);
    Z(b) if Source(b) == null then ∼ UniformInRegion(R)
         else ∼ Normal(H*X(Source(b),Time(b)), Σb);
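To convey the generative reading of Figure 7, here is a much-simplified simulation of mine: the state is one-dimensional, and the rates and noise levels are invented constants, not values from the article.

    import math, random

    rng = random.Random(1)

    def poisson(lam):
        # Knuth's method; adequate for small rates.
        l, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= l:
                return k
            k += 1

    LAMBDA_A, LAMBDA_F = 0.3, 1.0   # arrival / false-alarm rates (assumed)
    P_EXIT, P_DETECT = 0.05, 0.9    # exit / detection probabilities (assumed)

    def simulate(T):
        aircraft, blips = [], []    # blips: (time, measured position)
        for t in range(T):
            for _ in range(poisson(LAMBDA_A)):        # #Aircraft(EntryTime=t)
                aircraft.append({"x": rng.uniform(0, 100), "v": rng.gauss(0, 1)})
            for ac in list(aircraft):
                ac["x"] += ac["v"]                    # X(a,t): linear dynamics
                if rng.random() < P_DETECT:           # #Blip(Source=a, Time=t)
                    blips.append((t, ac["x"] + rng.gauss(0, 0.5)))  # Z(b): noisy obs
                if rng.random() < P_EXIT:             # Exits(a,t)
                    aircraft.remove(ac)
            for _ in range(poisson(LAMBDA_F)):        # #Blip(Time=t): false alarms
                blips.append((t, rng.uniform(0, 100)))
        return blips

    print(len(simulate(50)))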


Nuclear treaty monitoring. Verifying the Comprehensive Nuclear-Test-Ban Treaty requires finding all seismic events on Earth above a minimum magnitude. The UN CTBTO maintains a network of sensors, the International Monitoring System (IMS); its automated processing software, based on 100 years of seismology research, has a detection failure rate of about 30%. The NET-VISA system,2 based on an OUPM, significantly reduces detection failures.

The NET-VISA model (Figure 8) expresses the relevant geophysics directly. It describes distributions over the number of events in a given time interval (most of which are naturally occurring) as well as over their time, magnitude, depth, and location. The locations of natural events are distributed according to a spatial prior trained (like other parts of the model) from historical data; man-made events are, by the treaty rules, assumed to occur uniformly. At every station s, each phase (seismic wave type) p from an event e produces either 0 or 1 detections (above-threshold signals); the detection probability depends on the event magnitude and depth and its distance from the station. ("False alarm" detections also occur according to a station-specific rate parameter.) The measured arrival time, amplitude, and other properties of a detection d depend on the properties of the originating event and its distance from the station.

Once trained, the model runs continuously. The evidence consists of detections (90% of which are false alarms) extracted from raw IMS waveform data, and the query typically asks for the most likely event history, or bulletin, given the data. For reasons previously explained, NET-VISA uses a special-purpose inference algorithm. Results so far are encouraging: for example, in 2009 the UN's SEL3 automated bulletin missed 27.4% of the 27,294 events in the magnitude range 3–4 while NET-VISA missed 11.1%. Moreover, comparisons with dense regional networks show that NET-VISA finds up to 50% more real events than the final bulletins produced by the UN's expert seismic analysts. NET-VISA also tends to associate more detections with a given event, leading to more accurate location estimates (see Figure 9). The CTBTO has announced its intention to deploy NET-VISA as soon as possible.

Remarks on the examples. Despite superficial differences, the three examples are structurally similar: there are unknown objects (papers, aircraft, earthquakes) that generate percepts according to some physical process (citation, radar detection, seismic propagation). The same structure and reasoning patterns hold for areas such as database deduplication and natural language understanding. In some cases, inferring an object's existence involves grouping percepts together, a process that resembles the clustering task in machine learning. In other cases, an object may generate no percepts at all and still have its existence inferred, as happened, for example, when observations of Uranus led to the discovery of Neptune. Allowing object trajectories to be nonindependent in Figure 7 enables such inferences to take place.

Figure 8. The NET-VISA model.

    #SeismicEvents ∼ Poisson(T*λe);
    Natural(e) ∼ Boolean(0.999);
    Time(e) ∼ UniformReal(0,T);
    Magnitude(e) ∼ Exponential(log(10));
    Depth(e) if Natural(e) then ∼ UniformReal(0,700) else = 0;
    Location(e) if Natural(e) then ∼ SpatialPrior()
                else ∼ UniformEarthDistribution();
    #Detections(event=e, phase=p, station=s)
        if IsDetected(e,p,s) then = 1 else = 0;
    IsDetected(e,p,s) ∼
        Logistic(weights(s,p), Magnitude(e), Depth(e), Distance(e,s));
    #Detections(site = s) ∼ Poisson(T*λf(s));
    OnsetTime(d,s)
        if (event(d) = null) then ∼ UniformReal(0,T)
        else = Time(event(d)) + Laplace(μt(s), σt(s))
               + GeoTT(Distance(event(d),s), Depth(event(d)), phase(d));
    Amplitude(d,s)
        if (event(d) = null) then ∼ NoiseAmplitudeDistribution(s)
        else ∼ AmplitudeModel(Magnitude(event(d)),
                              Distance(event(d),s), Depth(event(d)), phase(d));
    ... and similar clauses for azimuth, slowness, and phase type.

Figure 9. Location estimates for the DPRK nuclear test of February 12, 2013: UN CTBTO Late Event Bulletin (green triangle); NET-VISA (blue square). The tunnel entrance (black cross) is 0.75 km from NET-VISA's estimate. Contours show NET-VISA's posterior location distribution.

Inference
By Equation (1), the probability P(α|e) for a closed query sentence α given evidence e is proportional to the sum of probabilities for all worlds in which α and e are satisfied, with the probability for each world being a product of model parameters as explained earlier. Many algorithms exist to calculate or approximate this sum of products for Bayes nets. Hence, it is natural to consider grounding a first-order probability model by instantiating the logical variables with "all possible" ground terms in order to generate a Bayes net, as illustrated in Figure 4; then, existing inference algorithms can be applied.
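In miniature (and ignoring any special-purpose machinery such as NET-VISA's), the computation just described looks like this on the Figure 2 network, by brute-force enumeration:

    from itertools import product

    P_B, P_E = 0.003, 0.002
    P_A = {(True, True): 0.9, (True, False): 0.8,
           (False, True): 0.4, (False, False): 0.01}
    joint = {(b, e, a): (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E) *
                        (P_A[b, e] if a else 1 - P_A[b, e])
             for b, e, a in product([True, False], repeat=3)}

    def query(alpha, evidence):
        # P(alpha | e): normalized sum over worlds satisfying both.
        p_e = sum(p for w, p in joint.items() if evidence(w))
        return sum(p for w, p in joint.items() if alpha(w) and evidence(w)) / p_e

    # P(Burglary | Alarm = true), about 0.1826
    print(query(lambda w: w[0], lambda w: w[2]))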


Because of the large size of these ground networks, exact inference is usually infeasible. The most general method for approximate inference is Markov chain Monte Carlo (MCMC). MCMC methods execute a random walk among possible worlds, guided by the relative probabilities of the worlds visited, and aggregate query results from each world. MCMC algorithms vary in the choice of neighborhood structure for the random walk and in the proposal distribution from which the next state is sampled; subject to an ergodicity condition on these choices, samples will converge to the true posterior in the limit (a toy instance appears in the sketch at the end of this section). MCMC scales well with network size, but its mixing time (the time needed before samples reflect the true posterior) is sensitive to the quantitative structure of the conditional distributions; indeed, hard constraints may prevent convergence.

BUGS10 and MLNs30 apply MCMC to a preconstructed ground network, which requires a bound on the size of possible worlds and enforces a propositional "bit-vector" representation for them. An alternative approach is to apply MCMC directly on the space of first-order possible worlds26,31; this provides much more freedom to use, for example, a sparse graph or even a relational database to represent each world. Moreover, in this view it is easy to see that MCMC moves can not only alter relations and functions but also add or subtract objects and change the interpretations of constant symbols; thus, MCMC can move among the open-universe worlds shown in Figure 1a.

As noted, a typical BLOG model has infinitely many possible worlds, each of potentially infinite size. As an example, consider the multitarget tracking model in Figure 7: the function X(a, t), denoting the state of aircraft a at time t, corresponds to an infinite sequence of variables for an unbounded number of aircraft at each step. Nonetheless, BLOG's MCMC inference algorithm converges to the correct answer, under the usual ergodicity conditions, for any well-formed BLOG model.20

The algorithm achieves this by sampling not completely specified possible worlds but partial worlds, each corresponding to a disjoint set of complete worlds. A partial world is a minimal self-supporting instantiation^f of a subset of the relevant variables, that is, ancestors of the evidence and query variables. For example, variables X(a, t) for values of t greater than the last observation time (or the query time, whichever is greater) are irrelevant, so the algorithm can consider just a finite prefix of the infinite sequence. Moreover, the algorithm eliminates isomorphisms under object renumberings by computing the required combinatorial ratios for MCMC transitions between partial worlds of different sizes.22

MCMC algorithms for first-order languages also benefit from locality of computation27: the probability ratio between neighboring worlds depends on a subgraph of constant size around the variables whose values are changed. Moreover, a logical query can be evaluated incrementally in each world visited, usually in constant time per world, rather than recomputing it from scratch.19,31

Despite these optimizations, generic inference for BLOG and other first-order languages remains too slow. Most real-world applications require a special-purpose proposal distribution to reduce the mixing time. A number of avenues are being pursued to resolve this issue:

– Compiler techniques can generate inference code that is specific to the model, query, and/or evidence. Microsoft's infer.net uses such methods to handle millions of variables; with BLOG, compiled inference shows speedups of over 100×.18
– Special-purpose probabilistic hardware, such as Analog Devices' GP5 chip, offers further constant-factor speedups.
– Special-purpose samplers jointly sample groups of variables that are tightly constrained. A library of such samplers may render most user programs efficiently solvable.
– Static analysis can transform programs for efficiency5,15 and identify exactly solvable submodels.7

Finally, the rapidly advancing techniques of lifted inference29 aim to unify probability and logic at the inferential level, borrowing and generalizing ideas from logical theorem proving. Such methods, surveyed by Van den Broeck,32 may avoid grounding and take advantage of symmetries by manipulating symbolic distributions over large sets of objects.

f. An instantiation of a set of variables is self-supporting if the parents of every variable in the set are also in the set.

Figure 10. A Church program that expresses the burglary/earthquake model in Equations (4)−(6).

    (define num-regions (mem (lambda () (uniform 1 3))))
    (define-record-type region (fields index))
    (define regions (map (lambda (i) (make-region i))
                         (iota (num-regions))))
    (define num-houses (mem (lambda (r) (uniform 0 4))))
    (define-record-type house (fields fault-region index))
    (define houses (map (lambda (r)
                          (map (lambda (i) (make-house r i))
                               (iota (num-houses r))))
                        regions))
    (define earthquake (mem (lambda (r) (flip 0.002))))
    (define burglary (mem (lambda (h) (flip 0.003))))
    (define alarm
      (mem (lambda (h)
             (if (burglary h)
                 (if (earthquake (house-fault-region h))
                     (flip 0.9) (flip 0.8))
                 (if (earthquake (house-fault-region h))
                     (flip 0.4) (flip 0.01))))))
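As a concrete (toy) instance of the random-walk picture, the following Metropolis-Hastings sketch of mine runs MCMC over the four worlds of the burglary model with Alarm fixed to true; real engines walk over far richer, even open-universe, worlds.

    import random

    rng = random.Random(0)
    P_B, P_E = 0.003, 0.002
    P_A = {(True, True): 0.9, (True, False): 0.8,
           (False, True): 0.4, (False, False): 0.01}

    def weight(b, e):
        # Unnormalized probability of a world with Alarm observed true.
        return (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E) * P_A[b, e]

    b = e = False
    hits, steps = 0, 200_000
    for _ in range(steps):
        nb, ne = (not b, e) if rng.random() < 0.5 else (b, not e)   # flip one variable
        if rng.random() < min(1.0, weight(nb, ne) / weight(b, e)):  # MH acceptance
            b, e = nb, ne
        hits += b
    print(hits / steps)  # converges to P(Burglary | Alarm=true), about 0.183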

Learning
Generative languages such as BLOG and BUGS naturally support Bayesian parameter learning with no modification: parameters are defined as random variables with priors, and ordinary inference yields posterior parameter distributions given the evidence. For example, in Figure 8 we could add a prior λe ∼ Gamma(3, 0.1) for the seismic rate parameter λe instead of fixing its value in advance; learning proceeds as data arrive, even in the unsupervised case where no ground-truth events are supplied. A "trained model" with fixed parameter values can be obtained by, for example, choosing maximum a posteriori values. In this way many standard machine learning methods can be implemented using just a few lines of modeling code. Maximum-likelihood parameter estimation could be added by stating that certain parameters are learnable, omitting their priors, and interleaving maximization steps with MCMC inference steps to obtain a stochastic online EM algorithm.

Structure learning, that is, generating new dependencies and new predicates and functions, is more difficult. The standard idea of trading off degree of fit and model complexity can be applied, and some model search methods from the field of inductive logic programming can be generalized to the probabilistic case, but as yet little is known about how to make such methods computationally feasible.

Probability and Programs
Probabilistic programming languages or PPLs17 represent a distinct but closely related approach for defining expressive probability models. The basic idea is that a randomized algorithm, written in an ordinary programming language, can be viewed not as a program to be executed, but as a probability model: a distribution over the possible execution traces of the program, given inputs. Evidence is asserted by fixing any aspect of the trace, and a query is any predicate on traces. For example, one can write a simple Java program that rolls three six-sided dice; fix the sum to be 13; and ask for the probability that the second die is even. The answer is a sum over probabilities of complete traces; the probability of a trace is the product of the probabilities of the random choices made in that trace, as in the sketch below.
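That dice computation is easy to write out; this sketch (in Python rather than Java) enumerates the 216 equiprobable traces directly:

    from itertools import product

    traces = list(product(range(1, 7), repeat=3))   # all 6^3 executions
    p = 1 / len(traces)                             # each trace equally likely

    consistent = [t for t in traces if sum(t) == 13]         # fix the evidence
    p_evidence = p * len(consistent)
    p_query = p * sum(1 for t in consistent if t[1] % 2 == 0)
    print(p_query / p_evidence)                     # 12/21, about 0.571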
The first significant PPL was Pfeffer's IBAL,28 a functional language equipped with an effective inference engine. Church,11 a PPL built on Scheme, generated interest in the cognitive science community as a way to model complex forms of learning; it also led to interesting connections to computability theory1 and programming language research. Figure 10 shows the burglary/earthquake example in Church; notice that the PPL code builds a possible-world data structure explicitly. Inference in Church uses MCMC, where each move resamples one of the stochastic primitives involved in producing the current trace.

Execution traces of a randomized program may vary in the new objects they generate; thus, PPLs have an open-universe flavor. One can view BLOG as a declarative, relational PPL, but there is a significant semantic difference: in BLOG, in any given possible world, every ground term has a single value; thus, expressions such as f(1) = f(1) are true by definition. In a PPL, on the other hand, f(1) = f(1) may be false if f is a stochastic function, because each instance of f(1) corresponds to a distinct piece of the execution trace. Memoizing every stochastic function (via mem in Figure 10) restores the standard semantics.

Prospects
These are early days in the process of unifying logic and probability. Experience in developing models for a wide range of applications will uncover new modeling idioms and lead to new kinds of programming constructs. And of course, inference and learning remain the major bottlenecks.

Historically, AI has suffered from insularity and fragmentation. Until the 1990s, it remained isolated from fields such as statistics and operations research, while its subfields, especially vision and robotics, went their own separate ways. The primary cause was mathematical incompatibility: what could a statistician of the 1960s, well-versed in linear regression and mixtures of Gaussians, offer an AI researcher building a robot to do the grocery shopping? Bayes nets have begun to reconnect AI to statistics, vision, and language research; first-order probabilistic languages, which have both Bayes nets and first-order logic as special cases, will extend and broaden this process.
Acknowledgments
My former students Brian Milch, Hanna Pasula, Nimar Arora, Erik Sudderth, Bhaskara Marthi, David Sontag, Daniel Ong, and Andrey Kolobov contributed to this research, as did NSF, DARPA, the Chaire Blaise Pascal, and ANR.

References
1. Ackerman, N., Freer, C., Roy, D. On the computability of conditional probability. arXiv:1005.3014, 2013.
2. Arora, N.S., Russell, S., Sudderth, E. NET-VISA: Network processing vertically integrated seismic analysis. Bull. Seism. Soc. Am. 103 (2013).
3. Bacchus, F. Representing and Reasoning with Probabilistic Knowledge. MIT Press, 1990.
4. Breese, J.S. Construction of belief and decision networks. Comput. Intell. 8 (1992), 624–647.
5. Claret, G., Rajamani, S.K., Nori, A.V., Gordon, A.D., Borgström, J. Bayesian inference using data flow analysis. In FSE-13 (2013).
6. Dalvi, N.N., Ré, C., Suciu, D. Probabilistic databases. CACM 52, 7 (2009), 86–94.
7. Fischer, B., Schumann, J. AutoBayes: A system for generating data analysis programs from statistical models. J. Funct. Program. 13 (2003).
8. Gaifman, H. Concerning measures in first order calculi. Israel J. Math. 2 (1964), 1–18.
9. Gaifman, H. Concerning measures on Boolean algebras. Pacific J. Math. 14 (1964), 61–73.
10. Gilks, W.R., Thomas, A., Spiegelhalter, D.J. A language and program for complex Bayesian modelling. The Statistician 43 (1994), 169–178.
11. Goodman, N.D., Mansinghka, V.K., Roy, D., Bonawitz, K., Tenenbaum, J.B. Church: A language for generative models. In UAI-08 (2008).
12. Hailperin, T. Probability logic. Notre Dame J. Formal Logic 25, 3 (1984), 198–212.
13. Halpern, J.Y. An analysis of first-order logics of probability. AIJ 46, 3 (1990), 311–350.
14. Howson, C. Probability and logic. J. Appl. Logic 1, 3–4 (2003), 151–165.
15. Hur, C.-K., Nori, A.V., Rajamani, S.K., Samuel, S. Slicing probabilistic programs. In PLDI-14 (2014).
16. Jain, D., Kirchlechner, B., Beetz, M. Extending Markov logic to model probability distributions in relational domains. In KI-07 (2007).
17. Koller, D., McAllester, D.A., Pfeffer, A. Effective Bayesian inference for stochastic programs. In AAAI-97 (1997).
18. Li, L., Wu, Y., Russell, S. SWIFT: Compiled inference for probabilistic programs. Tech. Report EECS-2015-12, UC Berkeley, 2015.
19. McCallum, A., Schultz, K., Singh, S. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In NIPS 22 (2010).
20. Milch, B. Probabilistic models with unknown objects. PhD thesis, UC Berkeley, 2006.
21. Milch, B., Marthi, B., Sontag, D., Russell, S.J., Ong, D., Kolobov, A. BLOG: Probabilistic models with unknown objects. In IJCAI-05 (2005).
22. Milch, B., Russell, S.J. General-purpose MCMC inference over relational structures. In UAI-06 (2006).
23. Nilsson, N.J. Probabilistic logic. AIJ 28 (1986), 71–87.
24. Paskin, M. Maximum entropy probabilistic logic. Tech. Report UCB/CSD-01-1161, UC Berkeley, 2002.
25. Pasula, H., Marthi, B., Milch, B., Russell, S.J., Shpitser, I. Identity uncertainty and citation matching. In NIPS 15 (2003).
26. Pasula, H., Russell, S.J. Approximate inference for first-order probabilistic languages. In IJCAI-01 (2001).
27. Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
28. Pfeffer, A. IBAL: A probabilistic rational programming language. In IJCAI-01 (2001).
29. Poole, D. First-order probabilistic inference. In IJCAI-03 (2003).
30. Richardson, M., Domingos, P. Markov logic networks. Machine Learning 62, 1–2 (2006), 107–136.
31. Russell, S.J. Expressive probability models in science. In Discovery Science (Tokyo, 1999).
32. Van den Broeck, G. Lifted Inference and Learning in Statistical Relational Models. PhD thesis, Katholieke Universiteit Leuven, 2013.

Stuart Russell ([email protected]) is a professor of computer science and Smith-Zadeh Professor of Engineering at the University of California, Berkeley.

Copyright held by author. Publication rights licensed to ACM. $15.00.

Watch the author discuss this work in an exclusive Communications video: http://cacm.acm.org/videos/unifying-logic-and-probability
