review articles

DOI:10.1145/3241978

Markov logic can be used as a general framework for joining logical and statistical AI.

Unifying Logical and Statistical AI with Markov Logic

BY PEDRO DOMINGOS AND DANIEL LOWD

FOR MANY YEARS, the two dominant paradigms in artificial intelligence (AI) have been logical AI and statistical AI. Logical AI uses first-order logic and related representations to capture complex relationships and knowledge about the world. However, logic-based approaches are often too brittle to handle the uncertainty and noise present in many applications. Statistical AI uses probabilistic representations such as probabilistic graphical models to capture uncertainty. However, graphical models only represent distributions over propositional universes and must be customized to handle relational domains. As a result, expressing complex concepts and relationships in graphical models is often difficult and labor-intensive.

key insights
- Intelligent systems must be able to handle the complexity and uncertainty of the real world. Markov logic enables this by unifying first-order logic and probabilistic graphical models into a single representation. Many deep architectures are instances of Markov logic.
- An extensive suite of learning and inference algorithms for Markov logic has been developed, along with open source implementations like Alchemy.
- Markov logic has been applied to natural language understanding, information extraction and integration, robotics, social network analysis, computational biology, and many other areas.

To handle the complexity and uncertainty present in most real-world problems, we need AI that is both logical and statistical, integrating first-order logic and graphical models. One or the other by itself cannot provide the minimum functionality needed to support the full range of AI applications. Further, the two need to be fully integrated, not simply provided alongside each other. Most applications require simultaneously the expressiveness of first-order logic and the robustness of probability, not just one or the other. Unfortunately, the split between logical and statistical AI runs very deep. It dates to the earliest days of the field, and continues to be highly visible today. It takes a different form in each subfield of AI, but it is omnipresent. Table 1 shows examples of this. In each case, both the logical and the statistical approaches contribute something important. This justifies the abundant research on each of them, but also implies that ultimately a combination of the two is required.

Table 1. Examples of logical and statistical AI.
Field | Logical approach | Statistical approach
Knowledge representation | First-order logic | Graphical models
Automated reasoning | Satisfiability testing | Markov chain Monte Carlo
Machine learning | Inductive logic programming | Neural networks
Planning | Classical planning | Markov decision processes
Natural language processing | Definite clause grammars | Probabilistic context-free grammars

Markov logic7 is a simple yet powerful generalization of first-order logic and probabilistic graphical models, which allows it to build on and integrate the best approaches from both logical and statistical AI. A Markov logic network (MLN) is a set of weighted first-order formulas, viewed as templates for constructing Markov networks. This yields a well-defined probability distribution in which worlds are more likely when they satisfy a higher-weight set of ground formulas. Intuitively, the magnitude of the weight corresponds to the relative strength of its formula; in the infinite-weight limit, Markov logic reduces to first-order logic. Weights can be set by hand or learned automatically from data. Algorithms for learning or revising formulas from data have also been developed. Inference algorithms for Markov logic combine ideas from probabilistic and logical inference, such as Markov chain Monte Carlo, belief propagation, satisfiability, and resolution.

Markov logic has already been used to efficiently develop state-of-the-art models for many AI problems, such as collective classification, link prediction, ontology mapping, knowledge base refinement, and semantic parsing, in application areas such as the Web, social networks, molecular biology, information extraction, and others. Markov logic makes solving new problems easier by offering a simple framework for representing well-defined probability distributions over uncertain, relational data. Many existing approaches can be described by a few weighted formulas, and multiple approaches can be combined by including all of the relevant formulas. Many algorithms, as well as sample datasets and applications, are available in the open source Alchemy system17 (alchemy.cs.washington.edu).
In this article, we describe Markov logic and its algorithms, and show how they can be used as a general framework for combining logical and statistical AI. Before presenting background and details on Markov logic, we first discuss how it relates to other methods in AI.

Markov logic is the most widely used approach to unifying logical and statistical AI, but this is an active research area, and there are many others (see Kimmig et al.14 for a recent survey with many examples). Most approaches can be roughly categorized as either extending logic programming languages (for example, Prolog) to handle uncertainty, or extending probabilistic graphical models to handle relational structure. Many of these model classes can also be represented efficiently as MLNs (see Richardson and Domingos32 for a discussion of early approaches to statistical relational AI and how they relate to Markov logic). In recent years, most work on statistical relational AI has assumed a parametric factor (parfactor)29 representation, which is similar to the weighted formulas in an MLN. Probabilistic soft logic (PSL)1 uses weighted formulas like Markov logic, but with a continuous relaxation of the variables in order to reason efficiently. In some cases, PSL can be viewed as Markov logic with a particular choice of approximate inference algorithm. One limitation of PSL's degree-of-satisfaction semantics is that more evidence does not always make an event more likely; many weak sources of evidence do not combine to produce strong evidence, even when the sources are independent.

Probabilistic programming28 is another paradigm for defining and reasoning with rich, probabilistic models. Probabilistic programming is a good fit for problems where the data is generated by a random process, and the process can be described as the execution of a procedural program with random choices. Not every domain is well-suited to this approach; for example, we may wish to describe or predict the behavior of people in a social network without modeling the complete evolution of that network. Inference methods for probabilistic programming languages work backwards from the data to reason about the processes that generate the data and compute conditional probabilities of other events. Probabilistic graphical models perform similar reasoning, but typically have more structure to exploit for reasoning at scale.

Deep learning methods9 have led to competitive or dominant approaches in a growing number of problems. Deep learning fits complex, nonlinear functions directly from data. For domains where we know something about the problem structure, MLNs and other graphical models make it easier to capture background knowledge about the domain, sometimes specifying competitive models without having any training data at all. Due to their complementary strengths, combining deep learning with graphical models is an ongoing area of research.

First-Order Logic
A first-order knowledge base (KB) is a set of sentences or formulas in first-order logic. Formulas are constructed using four types of symbols: constants, variables, functions, and predicates. Constant symbols represent objects in the domain of interest (for example, people: Anna, Bob, and Chris). Variable symbols range over the objects in the domain. Function symbols (MotherOf) represent mappings from tuples of objects to objects. Predicate symbols represent relations among objects in the domain (Friends) or attributes of objects (Smokes). An interpretation specifies which objects, functions, and relations in the domain are represented by which symbols.

A term is any expression representing an object in the domain. It can be a constant, a variable, or a function applied to a tuple of terms. For example, Anna, x, and GreatestCommonDivisor(x, y) are terms. An atomic formula or atom is a predicate symbol applied to a tuple of terms (for example, Friends(x, MotherOf(Anna))). Formulas are recursively constructed from atomic formulas using logical connectives and quantifiers. If F1 and F2 are formulas, the following are also formulas: ¬F1 (negation), which is true if F1 is false; F1 ∧ F2 (conjunction), which is true if both F1 and F2 are true; F1 ∨ F2 (disjunction), which is true if F1 or F2 is true; F1 ⇒ F2 (implication), which is true if F1 is false or F2 is true; F1 ⇔ F2 (equivalence), which is true if F1 and F2 have the same truth value; ∀x F1 (universal quantification), which is true if F1 is true for every object x in the domain; and ∃x F1 (existential quantification), which is true if F1 is true for at least one object x in the domain.
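To make these definitions concrete, here is a minimal sketch (in Python; the atoms and world shown are illustrative, not taken from any MLN toolkit) of evaluating a ground formula against a possible world represented as a truth assignment to ground atoms:

```python
# A possible world: a truth value for each ground atom of interest.
world = {
    "Friends(Anna,Bob)": True,
    "Smokes(Anna)": True,
    "Smokes(Bob)": False,
}

def implies(a, b):
    # Material implication: a => b is false only when a is true and b is false.
    return (not a) or b

# Ground instance of Friends(x, y) => (Smokes(x) <=> Smokes(y)) for x = Anna, y = Bob.
value = implies(world["Friends(Anna,Bob)"],
                world["Smokes(Anna)"] == world["Smokes(Bob)"])
print(value)  # False: Anna and Bob are friends but disagree about smoking
```

Over a finite set of known constants, a universally quantified formula is simply the conjunction of all such ground instances; this is the view Markov logic takes below.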


Parentheses may be used to enforce precedence. A positive literal is an atomic formula; a negative literal is a negated atomic formula. The formulas in a KB are implicitly conjoined, and thus a KB can be viewed as a single large formula. A ground term is a term containing no variables. A ground atom is an atomic formula all of whose arguments are ground terms. A grounding of a predicate or formula is a replacement of all of its arguments by constants (or functions all of whose arguments are constants or other functions, recursively, but we consider only the case of constants in this article). A possible world (along with an interpretation) assigns a truth value to each possible ground atom.

A formula is satisfiable if and only if there is at least one world in which it is true. Determining if a formula is satisfiable is only semidecidable. Because of this, knowledge bases are often constructed using a restricted subset of first-order logic with more desirable properties.

Table 2 shows a simple KB. Notice that although these formulas may be typically true in the real world, they are not always true. In most domains, it is very difficult to come up with non-trivial formulas that are always true, and such formulas capture only a fraction of the relevant knowledge. Thus, despite its expressiveness, pure first-order logic has limited applicability to practical AI problems.

Table 2. Example of a first-order knowledge base and MLN.
English | First-order logic | Weight
"Friends of friends are friends." | ∀x∀y∀z Fr(x, y) ∧ Fr(y, z) ⇒ Fr(x, z) | 0.7
"Friendless people smoke." | ∀x (¬(∃y Fr(x, y)) ⇒ Sm(x)) | 2.3
"Smoking causes cancer." | ∀x Sm(x) ⇒ Ca(x) | 1.5
"If two people are friends, then either both smoke or neither does." | ∀x∀y Fr(x, y) ⇒ (Sm(x) ⇔ Sm(y)) | 1.1
Fr() is short for Friends(), Sm() for Smokes(), and Ca() for Cancer().

Many ad hoc extensions to address this have been proposed. In the more limited case of propositional logic, the problem is well solved by probabilistic graphical models such as Markov networks, as we describe next. We will later show how to generalize these models to the first-order case.

Markov Networks
A Markov network (also known as a Markov random field) represents a joint probability distribution over variables X = {X1, X2, …, Xn} as a product of factors (also known as potential functions):

P(X = x) = (1/Z) ΠC φC(xC)   (1)

where each φC is a nonnegative, real-valued function defined over variables XC ⊂ X and Z is a normalization constant known as the partition function. For convenience, we also define Φ(x) as the unnormalized probability distribution, the product of all potential functions. A Markov network can be represented as a graph with one node per variable and an undirected edge between any two variables that appear together in the same factor.

Markov networks are often conveniently represented as log-linear models, with each potential function replaced by an exponentiated weighted sum of features of the state, leading to

P(X = x) = (1/Z) exp( Σj wj fj(x) )   (2)

A feature may be any real-valued function of the state. We will focus on binary features, fj(x) ∈ {0, 1}, typically indicating if the variables are in some particular state or satisfy some logical expression.

Markov networks have been successfully applied to many problems in AI, such as stereo vision, natural language translation, information extraction, machine reading, social network analysis, and more. However, Markov networks only represent probability distributions over propositional domains with a fixed set of variables. There is no standard language for extending Markov networks to variable-sized domains, such as social networks over different numbers of people or documents with different numbers of words. As a result, applying Markov networks to these problems is a labor-intensive process requiring custom implementations.

Markov Logic
A first-order KB can be seen as a set of hard constraints on the set of possible worlds: if a world violates even one formula, it has zero probability. The basic idea in Markov logic is to soften these constraints: when a world violates one formula in the KB, it is less probable, but not impossible. The fewer formulas a world violates, the more probable it is. Each formula has an associated weight (for example, see Table 2) that reflects how strong a constraint it is: the higher the weight, the greater the difference in log probability between a world that satisfies the formula and one that does not, other things being equal.

Definition 1.32 A Markov logic network (MLN) L is a set of pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number. Together with a finite set of constants C = {c1, c2, …, c|C|}, it defines a Markov network ML,C (Equations 1 and 2) as follows:
- ML,C contains one random variable for each possible grounding of each atom appearing in L. The value of the variable is true if the ground atom is true and false otherwise.
- ML,C contains one feature for each possible grounding of each formula Fi in L. The value of this feature is 1 if the ground formula is true and 0 otherwise. The weight of the feature is the wi associated with Fi in L.

In the graph corresponding to ML,C, there is a node for each grounding of each atom, and an edge appears between two nodes if the corresponding ground atoms appear together in at least one grounding of one formula in L. For example, an MLN containing the formulas ∀x Smokes(x) ⇒ Cancer(x) (smoking causes cancer) and ∀x∀y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y)) (friends have similar smoking habits) applied to the constants Anna and Bob (or A and B for short) yields the ground Markov network in Figure 1. Its features include Smokes(Anna) ⇒ Cancer(Anna).

Figure 1. Ground Markov network obtained by applying an MLN containing the formulas ∀x Smokes(x) ⇒ Cancer(x) and ∀x∀y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y)) to the constants Anna (A) and Bob (B). Its nodes are Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), and Cancer(B).
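As an illustration of Definition 1, the following sketch (Python; purely illustrative, with helper names that are ours rather than from any MLN toolkit) enumerates the ground atoms and weighted ground features produced by applying these two formulas to the constants A and B, using the corresponding weights from Table 2 for concreteness; the result mirrors the nodes and features of Figure 1:

```python
from itertools import product

constants = ["A", "B"]

# Ground atoms become the random variables (nodes) of the ground Markov network.
ground_atoms = (
    [f"Smokes({x})" for x in constants]
    + [f"Cancer({x})" for x in constants]
    + [f"Friends({x},{y})" for x, y in product(constants, constants)]
)

# Each first-order formula yields one weighted feature per grounding.
ground_features = []
for x in constants:                             # forall x: Smokes(x) => Cancer(x)
    ground_features.append((1.5, f"Smokes({x}) => Cancer({x})"))
for x, y in product(constants, constants):      # forall x,y: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    ground_features.append((1.1, f"Friends({x},{y}) => (Smokes({x}) <=> Smokes({y}))"))

print(len(ground_atoms), "ground atoms")          # 8 nodes, as in Figure 1
print(len(ground_features), "weighted features")  # 2 + 4 = 6 ground formulas
```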


Notice that, although the two formulas are false as universally quantified logical statements, as weighted features of an MLN they capture valid statistical regularities, and in fact represent a standard social network model. Notice also that nodes and links in the social network are both represented as nodes in the Markov network; arcs in the Markov network represent probabilistic dependencies between nodes and links in the social network (for example, Anna's smoking habits depend on her friends' smoking habits).

An MLN can be viewed as a template for constructing Markov networks. From Definition 1 and Equations 1 and 2, the probability distribution over possible worlds x specified by the ground Markov network ML,C is given by

P(X = x) = (1/Z) exp( Σi wi ni(x) )   (3)

where the sum ranges over the F formulas in the MLN and ni(x) is the number of true groundings of Fi in x. As formula weights increase, an MLN increasingly resembles a purely logical KB, becoming equivalent to one in the limit of all infinite weights. When the weights are positive and finite, and all formulas are simultaneously satisfiable, the satisfying solutions are the modes of the distribution represented by the ground Markov network. Most importantly, Markov logic allows contradictions between formulas, which it resolves simply by weighing the evidence on both sides.

It is interesting to see a simple example of how Markov logic generalizes first-order logic. Consider an MLN containing the single formula ∀x R(x) ⇒ S(x) with weight w, and C = {A}. This leads to four possible worlds: {¬R(A), ¬S(A)}, {¬R(A), S(A)}, {R(A), ¬S(A)}, and {R(A), S(A)}. From Equation 3 we obtain that P({R(A), ¬S(A)}) = 1/(3e^w + 1) and the probability of each of the other three worlds is e^w/(3e^w + 1). (The denominator is the partition function Z; see Markov Networks.) Thus, if w > 0, the effect of the MLN is to make the world that is inconsistent with ∀x R(x) ⇒ S(x) less likely than the other three. From these probabilities we obtain that P(S(A) | R(A)) = 1/(1 + e^−w). When w → ∞, P(S(A) | R(A)) → 1, recovering the logical entailment.
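These numbers are easy to verify directly. A minimal sketch (Python; for illustration only) enumerates the four worlds and applies Equation 3:

```python
import math
from itertools import product

w = 2.0  # any finite weight for the single formula: forall x, R(x) => S(x)

def n(world):
    # Number of true groundings of R(A) => S(A) in a world (0 or 1 here).
    r, s = world
    return 1 if ((not r) or s) else 0

worlds = list(product([False, True], repeat=2))       # (R(A), S(A))
unnorm = {x: math.exp(w * n(x)) for x in worlds}
Z = sum(unnorm.values())                              # equals 3*e^w + 1
P = {x: u / Z for x, u in unnorm.items()}

print(P[(True, False)], 1 / (3 * math.exp(w) + 1))    # the world violating the formula
p_s_given_r = P[(True, True)] / (P[(True, True)] + P[(True, False)])
print(p_s_given_r, 1 / (1 + math.exp(-w)))            # P(S(A) | R(A)) = 1/(1 + e^{-w})
```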


Bayesian networks, Markov networks, and many other propositional models frequently used in machine learning and data mining can be stated quite concisely as MLNs, and combined and extended simply by adding the corresponding formulas.32 Most significantly, Markov logic facilitates the modeling of multi-relational data, where objects are not independent but are related in diverse and complex ways. Such representations are important for social networks, biological networks, natural language understanding, and more. Boltzmann machines and the deep models based on them are special cases of Markov networks.

MLN weights can be normalized globally, as in undirected graphical models, or locally, as in directed ones. The latter option enables MLNs to handle variable numbers of objects and irrelevant objects similar to directed first-order formalisms (counter to Russell's34 claim; see also Domingos6). In practice, even globally normalized MLNs are quite robust to these variations, largely because the number of relations per object usually varies much less than the number of objects, and because factors from irrelevant objects cancel out in conditional probabilities.

When working with Markov logic, we typically make three assumptions about the logical representation: different constants refer to different objects (unique names), the only objects in the domain are those representable using the constant and function symbols (domain closure), and the value of each function for each tuple of arguments is always a known constant (known functions). These assumptions ensure the number of possible worlds is finite and that the Markov logic network will give a well-defined probability distribution. These assumptions are quite reasonable in most practical applications, and greatly simplify the use of MLNs. We will make these assumptions in most of the remainder of this article, but Markov logic can be generalized to domains where they do not hold, such as those with infinite objects or continuous random variables.36,37 Open-world reasoning methods discussed by Russell34 can also be applied to Markov logic, as Bayesian networks can be translated into MLNs.

Inference
Given an MLN model, the questions of interest are answered by performing inference on it. (For example, "What are the topics of these Web pages, given the words on them and the links between them?") Because an MLN acts as a template for a Markov network, we can always apply standard Markov network inference methods on the instantiated network. However, methods that also exploit the logical structure in an MLN can yield tremendous savings in memory and time. We first provide an overview of inference in Markov networks, and then describe how these methods can be adapted to take advantage of the MLN structure.

Markov network inference. The main inference problem in Markov networks is computing the probability of a set of query variables Q given some evidence E:

P(Q = q | E = e) = P(q, e) / P(e) = Σh′ Φ(q, e, h′) / Σq′,h′ Φ(q′, e, h′) = Zq,e / Ze   (4)

where H = X − Q − E denotes the remaining nonquery, nonevidence variables, Φ is the unnormalized product of potentials from Equation 1, and Zq,e and Ze are the partition functions of reduced Markov networks, where the query and evidence variables have been fixed to constants. Thus, if we can compute partition functions, then we can answer arbitrary probabilistic queries.

In general, computing Z or answering other queries in a Markov network is #P-complete. When the Markov network has certain structural properties, such as a tree or tree-like structure, inference can be done in polynomial time. For network structures with many variables and loops, exact inference is usually intractable and approximate inference algorithms are used instead. The two most common approaches to approximation are to generate random samples through some random process that converges to the true distribution, or to solve a relaxed problem that captures as many constraints from the original as possible. Examples of the former approach include Markov chain Monte Carlo (MCMC) and importance sampling; examples of the latter include loopy belief propagation and variational methods.

Any of these methods could be used to perform inference in an MLN after instantiating the ground network, and many of them have been. However, inference remains challenging in Markov networks and even more challenging in MLNs, which are often very large and have many loops. Next we will discuss one of the most promising inference methods to date, which can take advantage of logical structure, perform exact inference when tractable, and be relaxed to perform approximate inference when necessary.

Weighted model counting. Computing the partition function in a Markov network can be reduced to a weighted model counting (WMC) problem. Weighted model counting finds the total weight of all satisfying assignments to a logical formula F. Following Chavira and Darwiche,2 we focus on literal-weighted model counting, in which each literal is assigned a real-valued weight and the weight of an assignment is the product of the weights of its literals.

To represent Z as a WMC problem, we need each assignment x to receive weight Φ(x). Suppose each potential φi(x) evaluates to a constant Θi when a logical expression Fi is satisfied and 1 otherwise. (If the Markov network is not already in this form, we can convert it efficiently.) To define the WMC problem, for each potential φi, we introduce a literal Ai with weight Θi. We also introduce a logical formula, Ai ⇔ Fi, so that Ai is only true when Fi is satisfied. Thus, the product of the weights of the Ai literals is exactly the product of the original potential functions.

WMC algorithms can then be used to solve the problem and compute Z. One approach is recursive decomposition, in which we break the WMC problem into two subproblems, one where some variable xi is fixed to true and one where xi is fixed to false. This requires exponential time in the worst case, but WMC algorithms can often exploit problem structure to solve it much faster in practice. Another approach is to compile the model into a logical form where WMC is tractable, such as d-DNNF, and build an arithmetic circuit based on it.2 Once compiled, the arithmetic circuit can be reused for multiple queries.
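As a rough sketch of the recursive-decomposition idea (not the algorithm used in any particular WMC system, and with the simplification that the constraint is given as an ordinary Python function), weighted model counting can be written as follows:

```python
import math

def wmc(variables, weights, constraint, assignment=None):
    """Total weight of all assignments satisfying `constraint`.

    variables:  list of variable names
    weights:    dict mapping each variable to (weight_if_false, weight_if_true)
    constraint: function from a complete assignment (dict) to True/False
    The weight of an assignment is the product of its literal weights.
    """
    assignment = assignment or {}
    if len(assignment) == len(variables):
        if not constraint(assignment):
            return 0.0
        total = 1.0
        for v in variables:
            total *= weights[v][1] if assignment[v] else weights[v][0]
        return total
    # Recursive decomposition: condition on the next unassigned variable.
    v = variables[len(assignment)]
    return (wmc(variables, weights, constraint, {**assignment, v: False})
            + wmc(variables, weights, constraint, {**assignment, v: True}))

# Z for a single potential worth e^1.5 when Smokes => Cancer holds, encoded
# with an auxiliary literal A constrained by A <=> (Smokes => Cancer).
variables = ["Smokes", "Cancer", "A"]
weights = {"Smokes": (1.0, 1.0), "Cancer": (1.0, 1.0), "A": (1.0, math.exp(1.5))}
constraint = lambda a: a["A"] == ((not a["Smokes"]) or a["Cancer"])
print(wmc(variables, weights, constraint))  # 3*e^1.5 + 1
```

A real WMC solver would also exploit decomposition into independent components and caching; the point here is only the recursive conditioning step just described.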


Probabilistic theorem proving. WMC is a natural approach to inference in MLNs, as MLNs already use a logical representation for their features. However, MLNs have additional structure to exploit: each formula is instantiated many times with different combinations of constants. For example, suppose we are modeling a social network in which each pair of people is either friends or not. Before introducing any information about the individuals, the probability that any two people are friends must be the same as any other pair. Lifted inference exploits these symmetries to reason efficiently, even on very large domains.29 Probabilistic theorem proving (PTP)8 applies this idea to perform lifted weighted model counting, so that many equivalent groundings of the same formula can be counted at once without instantiating them. As in the propositional case, lifted WMC can also be performed by compiling the first-order knowledge base to a (lifted) arithmetic circuit for repeated querying.5

In some cases, lifted inference lets us reason efficiently independent of domain size, so that inferring probabilities over millions of constants and trillions of ground formulas takes no longer than reasoning over hundreds. More often, evidence about individual constants breaks symmetries, so that different groundings are no longer exchangeable. The efficiency gains from lifting depend on both the structure of the knowledge base and the structure of the evidence.

When there is not enough structure and symmetry to perform inference exactly, we can replace some of the recursive conditioning steps in PTP with sampling.8 This leads to an approximate lifted inference algorithm, where sampling is used to estimate the weighted count of some of the subformulas.

Learning
Here, we discuss methods for automatically learning weights, refining formulas, and constructing new formulas from data.

Weight learning. In generative learning, the goal is to learn a joint probability distribution over all atoms. A standard approach is to maximize the likelihood of the data through gradient-based methods. Note that we can learn to generalize from even a single example because the formula weights are shared across their many respective groundings. This is essential when the training data is a single network, such as the Web. Given mild assumptions about the relational dependencies, maximizing the likelihood (or pseudo-likelihood) of a sufficiently large example will recover the parameters that generated the data.38

For MLNs, the gradient of the log-likelihood is the difference between the true formula counts in the data and the expected counts according to the model. When learning a generative probability distribution over all atoms, even approximating these expectations tends to be prohibitively expensive or inaccurate due to the large state space. Instead, we can maximize pseudo-likelihood, which is the conditional probability of each atom in the database conditioned on all other atoms. Computing the pseudo-likelihood and its gradient does not require inference, and is therefore much faster. However, the pseudo-likelihood parameters may lead to poor results when long chains of inference are required. In order to combat overfitting, we penalize each weight with a Gaussian prior; for simplicity, we ignore that in what follows.

In many applications, we know a priori which atoms will be evidence and which ones will be queried. For these cases, discriminative learning optimizes our ability to predict the query atoms Y given the evidence X. A common approach is to maximize the conditional likelihood of Y given X,

P(y | x) = (1/Zx) exp( Σi∈FY wi ni(x, y) )   (5)

where FY is the set of all MLN formulas with at least one grounding involving a query atom, and ni(x, y) is the number of true groundings of the ith formula involving query atoms. The gradient of the conditional log-likelihood is

∂/∂wi log Pw(y | x) = ni(x, y) − Σy′ Pw(y′ | x) ni(x, y′) = ni(x, y) − Ew[ni(x, y)]   (6)

where the sum is over all possible databases y′, and Pw(y′ | x) is P(y′ | x) computed using the current weight vector w = (…, wi, …). In other words, the ith component of the gradient is simply the difference between the number of true groundings of the ith formula in the data and its expectation according to the current model.

When computing the expected counts Ew[ni(x, y′)] is intractable, we can approximate them using either the MAP state (that is, the most probable state of y given x) or an average over several samples from MCMC. We obtain the best results by applying a version of the scaled conjugate gradient algorithm, using a small number of samples from MCMC to approximate the gradient and Hessian matrix and using the inverse diagonal Hessian as a preconditioner (see Lowd and Domingos18 for more details and results).

MLN weights can also be learned with a max-margin approach, similar to structural support vector machines.11

Structure learning. The structure of an MLN is the set of formulas to which we attach weights. Although these formulas are often specified by one or more experts, such knowledge is not always accurate or complete. In addition to learning weights for the provided formulas, we can revise or extend the MLN structure with new formulas learned from data. We can also learn the entire structure from scratch. The inductive logic programming (ILP) community has developed many methods for this purpose.4 ILP algorithms typically search for rules with high accuracy, high coverage, or similar criteria. However, because an MLN represents a probability distribution, much better results are obtained by using an evaluation function based on pseudo-likelihood.15 Log-likelihood or conditional log-likelihood are potentially better evaluation functions, but are much more expensive to compute.

Most structure learning algorithms focus on clausal knowledge bases, in which each formula is a disjunction of literals (negated or nonnegated atoms). The classic approach is to begin with either an empty network or an existing KB and perform a combinatorial search for formulas that improve the pseudo-likelihood.15 Either way, we have found it useful to start by adding all atomic formulas (single atoms) to the MLN. The weights of these capture (roughly speaking) the marginal distributions of the atoms, allowing the longer formulas to focus on modeling atom dependencies. To extend this initial model, we either repeatedly find the best formula using beam search and add it to the MLN, or add all "good" formulas of length l before trying formulas of length l + 1. Candidate formulas are formed by adding each predicate (negated or otherwise) to each current formula, with all possible combinations of variables, subject to the constraint that at least one variable in the new predicate must appear in the current formula. Hand-coded formulas are also modified by removing predicates.

A wide variety of other methods for MLN structure learning have been developed, such as generative learning with lifted inference,10 discriminative structure learning,11 gradient boosting,13 and generating formulas using a Markov network21 or random walks.16 For the special case where MLN formulas define a relational Bayesian network, consistent Bayesian network structure learning methods can be extended to consistent structure learning in the relational setting.35

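As a minimal illustration of the weight-learning gradient described above — exact only because the domain is tiny, whereas real applications rely on the pseudo-likelihood or sampling approximations just discussed — the following sketch runs a few steps of gradient ascent on the log-likelihood of a single observed world:

```python
import math
from itertools import product

# One formula over one constant A, as in the earlier example: R(A) => S(A).
def n(world):
    r, s = world
    return 1 if ((not r) or s) else 0   # number of true groundings (0 or 1)

observed = (True, True)                 # the training data: one possible world
worlds = list(product([False, True], repeat=2))

w = 0.0
for step in range(100):
    Z = sum(math.exp(w * n(x)) for x in worlds)
    expected_n = sum(n(x) * math.exp(w * n(x)) / Z for x in worlds)
    gradient = n(observed) - expected_n  # true count minus expected count
    w += 0.5 * gradient                  # fixed step size for simplicity
print(round(w, 2))
# The weight keeps growing: the data never violates the formula, so maximum
# likelihood pushes w toward the infinite-weight (purely logical) limit.
```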

Applications
MLNs have been used in a wide variety of applications, often achieving state-of-the-art performance. Their greatest strength is their flexibility for defining rich models in varied domains.

Collective classification. One of the most common uses of MLNs is for predicting the labels of interrelated entities, as in the friends and smoking example. Applications include labeling Web pages and predicting protein function.3 MLNs can also model collective classification tasks on sequential data, such as segmenting text for information extraction.30

Link prediction. A second common task is to predict unknown or future relationships based on known relationships and attributes. Examples include predicting protein interaction,3 predicting advising relationships in a computer science department,32 and predicting work relationships among directors and actors.20

Knowledge base mapping, integration, and refinement. Reasoning about the world requires combining diverse sources of uncertain information, such as noisy KBs. MLNs can easily represent KBs and soft constraints on the knowledge they represent: facts and rules in the knowledge base can be represented directly as atoms and formulas in the MLN. Ontology alignment can then be formulated as a link prediction problem, predicting which concepts in one ontology map to which concepts in the other. MLN formulas enforce structural similarity, so that related concepts in one ontology map to similarly related concepts in the other.23 Similar rules can be used for knowledge base refinement, automatically detecting and correcting errors in uncertain knowledge by enforcing consistency among classes and relations.12

Semantic network extraction (SNE). A semantic network is a type of KB consisting of a collection of concepts and relationships among them. The SNE39 system uses Markov logic to define a probability distribution over semantic networks. The MLN entities are the relation and object symbols from extracted tuples and the cluster assignments that group them into concepts and relationships. The MLN rules state that the truth of an extracted relationship depends on the clusters of the objects and relation involved. SNE uses a specialized bottom-up clustering algorithm to find the semantic clusters for objects and relations. This lets SNE scale to discover thousands of clusters over millions of tuples in just a few hours.

Semantic parsing. The goal of semantic parsing is to map sentences to logical forms representing the same information. The resulting information can be used to build a medical KB from PubMed abstracts, infer the meaning of a news article, or answer questions from an encyclopedia entry. Unsupervised semantic parsing31 learns to map dependency trees from sentences to their logical forms without any explicitly annotated data. The USP system does this by recursively clustering expressions (lambda forms) with similar subexpressions. The MLN for this includes four rules: one to cluster expressions into "core forms," and three to cluster their arguments into "argument forms" of some type and number. A clustering is more probable if expressions in the same cluster tend to have the same number of subexpressions as each other and those subexpressions are in the same clusters. As with SNE, USP uses a clustering algorithm to learn and reason more efficiently.

Extensions
Beyond the capabilities described here, Markov logic has been extended in a variety of ways to satisfy additional properties or accommodate different domains.

For decision theoretic problems, we can extend MLNs to Markov logic decision networks (MLDNs) by attaching a utility to each formula as well as a weight.22 The utility of a world is the sum of the utilities of its satisfied formulas. The optimal decision is the setting of the action predicates that jointly maximizes expected utility.

Domains with continuous as well as discrete variables can be handled by hybrid Markov logic networks (HMLNs).37 HMLNs allow numeric properties of objects as nodes, in addition to Boolean ones, and numeric terms as features, in addition to logical formulas. For example, to reason about distances, we can introduce the numeric property Distance(x, y). To state that a car should be centered in a lane, we can add terms such as:

Car(c) ∧ LeftLine(l) ∧ RightLine(r) ⇒ −(Dist(c, l) − Dist(c, r))²

When c is a car, l is the left lane boundary, and r is the right lane boundary, this term penalizes differences between the distance to the left and right boundaries. Inference algorithms for HMLNs combine ideas from satisfiability testing, slice-sampling MCMC, and numerical optimization. Weight learning algorithms are straightforward extensions of ones for MLNs.
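To give a feel for how such a numeric term enters the model, here is a small sketch (Python; the Boolean rule and its predicate names are hypothetical, used only for illustration): a numeric term simply adds its weighted, real-valued feature to the log of the unnormalized probability, alongside the 0/1 features contributed by ordinary logical formulas.

```python
def log_unnormalized_prob(state, w_lane=2.0, w_rule=1.5):
    """Log of the unnormalized probability of a state mixing Boolean and numeric terms."""
    score = 0.0
    # Ordinary logical formula (hypothetical): Car(c) => InLane(c), feature value 0 or 1.
    if (not state["Car"]) or state["InLane"]:
        score += w_rule
    # Numeric term from the lane example: -(Dist(c, l) - Dist(c, r))^2.
    score += w_lane * -(state["dist_left"] - state["dist_right"]) ** 2
    return score

centered = {"Car": True, "InLane": True, "dist_left": 1.0, "dist_right": 1.0}
off_center = {"Car": True, "InLane": True, "dist_left": 1.8, "dist_right": 0.2}
print(log_unnormalized_prob(centered) > log_unnormalized_prob(off_center))  # True
```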


Markov logic can be extended to infinite domains using Gibbs measures, the infinite-dimensional extension of Markov networks.36 An MLN in an infinite domain is locally finite if each ground atom appears in a finite number of ground formulas. Local finiteness guarantees the existence of a probability measure; when the interactions are not too strong, the measure is unique as well. Nonunique MLNs may still be useful for modeling large systems with strong interactions, such as social networks with strong word-of-mouth effects. In such cases, we can analyze the different "phases" of a nonunique MLN and define a satisfying measure to reason about entailment.

Recursive Markov logic networks, or recursive random fields (RRFs),19 extend MLNs to multiple layers by replacing the logical formulas with MLNs, which can themselves have nested MLNs as features, for as many levels or layers as necessary. RRFs can compactly represent distributions such as noisy DNF, rules with exceptions, and m-of-all quantifiers. RRFs also allow more flexibility in revising or learning first-order representations through weight learning. An RRF can be seen as a type of deep neural network, in which the node activation function is exponential and the network is trained to maximize the joint likelihood of its input. In other ways, an RRF resembles a deep Boltzmann machine, but with no hidden variables to sum out.

Tractable Markov logic (TML)24 is a probabilistic description logic where inference time is linear for all marginal, conditional, and MAP queries. TML defines objects in terms of class and part hierarchies, and allows objects to have probabilistic attributes, probabilistic relations between their subparts, and probabilistic existence. Tractability is ensured by having a direct mapping between the structure of the KB and the computation of its partition function: each split of a class into subclasses corresponds to a sum, and each split of a part into subparts corresponds to a product.

Getting Started with Markov Logic
If you would like to try out Markov logic for yourself, there are several open source software packages for learning or reasoning with MLNs. In some cases, software for learning and reasoning with Markov networks or conditional random fields can also be used; however, the task of translating from an MLN to a ground Markov network is left to you, and standard algorithms do not exploit the structure and symmetries present in MLNs.

The oldest MLN toolkit is Alchemy,17 currently in version 2. Compared to other toolkits, Alchemy offers the widest variety of algorithms, such as multiple methods for generative and discriminative weight learning, structure learning, and marginal and MAP inference. Tuffy25 offers a subset of Alchemy's features but obtains greater scalability by using a database to keep track of groundings. Tuffy is the basis of DeepDive,26 a system for information extraction, integration, and prediction built on Markov logic. Other implementations of Markov logic include Markov thebeast33 and RockIt.27

When applying Markov logic to new problems, it is usually best to start with a simple model on a small amount of data. High-arity predicates and formulas may have a large number of groundings, resulting in large models, high memory use, and slow inference. It is important to determine which modeling choices are most important for making accurate predictions and how expensive they are. Lifted inference techniques or customized grounding or inference methods can help good models scale to larger data.

When choosing the MLN structure, domain knowledge about the relevant relationships is a good place to start. When such knowledge is available, it is usually better to use it than to learn the structure from scratch. As with other knowledge engineering problems, there are often several ways to represent the same knowledge, and some representations may work better than others. For example, a relationship between smoking and cancer could be represented as equivalence (Smokes(A) ⇔ Cancer(A)), implication (Smokes(A) ⇒ Cancer(A) and Cancer(A) ⇒ Smokes(A)), or conjunction (Smokes(A) ∧ Cancer(A) and ¬Smokes(A) ∧ ¬Cancer(A)).
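The practical difference between such encodings is easy to see on a two-atom domain. A small sketch (Python; the weights are arbitrary and purely illustrative) compares the distributions over (Smokes(A), Cancer(A)) induced by the equivalence encoding and by the pair of implications. One concrete difference is that the implication pair can give each direction of the dependency its own weight, which a single equivalence formula cannot:

```python
import math
from itertools import product

def distribution(formulas):
    """Distribution over worlds (Smokes(A), Cancer(A)) induced by weighted
    formulas given as (weight, function of (s, c)) pairs, following Equation 3."""
    worlds = list(product([False, True], repeat=2))
    scores = [math.exp(sum(w for w, f in formulas if f(s, c))) for s, c in worlds]
    Z = sum(scores)
    return {world: score / Z for world, score in zip(worlds, scores)}

equiv = distribution([(1.5, lambda s, c: s == c)])                 # Smokes(A) <=> Cancer(A)
impl = distribution([(2.0, lambda s, c: (not s) or c),             # Smokes(A) => Cancer(A)
                     (0.5, lambda s, c: (not c) or s)])            # Cancer(A) => Smokes(A)

p_cancer_given_smokes = lambda p: p[(True, True)] / (p[(True, True)] + p[(True, False)])
p_smokes_given_cancer = lambda p: p[(True, True)] / (p[(True, True)] + p[(False, True)])

print(p_cancer_given_smokes(equiv), p_smokes_given_cancer(equiv))  # equal, by symmetry
print(p_cancer_given_smokes(impl), p_smokes_given_cancer(impl))    # differ: each direction weighted separately
```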

Conclusion and Directions for Future Research
Markov logic offers a simple yet powerful representation for AI problems in many domains. As it generalizes first-order logic, Markov logic can easily model the full relational structure present in many problems, such as multiple relations and attributes of different types and arities, relational concepts such as transitivity, and background knowledge in first-order logic. And as it generalizes probabilistic graphical models, Markov logic can efficiently represent uncertainty in the concepts, attributes, and relationships required by most AI applications.

For future research, one of the most important directions is improving the efficiency of inference. Lifted inference algorithms obtain exponential speedups by exploiting relational symmetries, but can fail when these symmetries are broken by evidence or more complex structures. Tractable Markov logic guarantees efficient inference but constrains model structure. More research is needed to make reasoning work well in a wider range of models. Because most learning methods rely on inference, this will lead to more reliable learning methods as well.

A second key direction is enriching the representation itself. Markov logic is built on first-order logic, which is not always the best way to compactly encode knowledge, even in logical domains. For example, concepts such as "every person has at least five friends" are difficult to express with standard first-order connectives and quantifiers. Markov logic has been extended to handle decision theory, continuous variables, and more. Some new applications may require new extensions.

We hope that Markov logic will be of use to AI researchers and practitioners who wish to have the full spectrum of logical and statistical inference and learning techniques at their disposal, without having to develop every piece themselves. We also hope that Markov logic will inspire development of even richer representations and more powerful algorithms to further integrate and unify diverse AI approaches and applications. More details on Markov logic and its applications can be found in Domingos and Lowd.7

Acknowledgments
This research was partly supported by ARO grants W911NF-08-1-0242 and W911NF-15-1-0265, DARPA contracts FA8750-05-2-0283, FA8750-07-D-0185, FA8750-16-C-0158, HR0011-06-C-0025, HR0011-07-C-0060, and NBCH-D030010, AFRL contract FA8750-13-2-0019, NSF grants IIS-0534881 and IIS-0803481, ONR grants N-00014-05-1-0313, N00014-08-1-0670, N00014-12-1-0312, and N00014-16-1-2697, an NSF CAREER Award (first author), a Sloan Research Fellowship (first author), an NSF Graduate Fellowship (second author), and a Microsoft Research Graduate Fellowship (second author). The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, DARPA, AFRL, NSF, ONR, or the United States Government.

References
1. Bach, S., Broecheler, M., Huang, B., Getoor, L. Hinge-loss Markov random fields and probabilistic soft logic. J. Mach. Learn. Res. 18, 109 (2017), 1–67.
2. Chavira, M., Darwiche, A. On probabilistic inference by weighted model counting. Artif. Intell. 172, 6–7 (2008), 772–799.
3. Davis, J., Domingos, P. Deep transfer via second-order Markov logic. In Proceedings of the 26th International Conference on Machine Learning. ACM Press, Montréal, Canada.
4. De Raedt, L. Logical and Relational Learning. Springer, Berlin, Germany, 2008.
5. Van den Broeck, G., Taghipour, N., Meert, W., Davis, J., De Raedt, L. Lifted probabilistic inference by first-order knowledge compilation. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI) (2011). Barcelona, Spain.
6. Domingos, P., Kersting, K., Mooney, R., Shavlik, J. What about statistical relational learning? Commun. ACM 58, 12 (2015), 8.
7. Domingos, P., Lowd, D. Markov Logic: An Interface Layer for AI. Morgan & Claypool, San Rafael, CA, 2009.
8. Gogate, V., Domingos, P. Probabilistic theorem proving. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI-11). AUAI Press, Barcelona, Spain, 2011.
9. Goodfellow, I., Bengio, Y., Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
10. Van Haaren, J., Van den Broeck, G., Meert, W., Davis, J. Lifted generative learning of Markov logic networks. Mach. Learn. 103 (2015), 27–55.
11. Huynh, T., Mooney, R. Discriminative structure and parameter learning for Markov logic networks. In Proceedings of the 25th International Conference on Machine Learning (2008). ACM Press, Helsinki, Finland, 416–423.
12. Jiang, S., Lowd, D., Dou, D. Ontology matching with knowledge rules. In Proceedings of the 26th International Conference on Database and Expert Systems Applications (DEXA 2015) (2015). Springer, Valencia, Spain.
13. Khot, T., Natarajan, S., Kersting, K., Shavlik, J. Gradient-based boosting for statistical relational learning: the Markov logic network and missing data cases. Mach. Learn. 100 (2015), 75–100.
14. Kimmig, A., Mihalkova, L., Getoor, L. Lifted graphical models: a survey. Mach. Learn. 99, 1 (2015), 1–45.
15. Kok, S., Domingos, P. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning (2005). ACM Press, Bonn, Germany, 441–448.
16. Kok, S., Domingos, P. Learning Markov logic networks using structural motifs. In Proceedings of the 27th International Conference on Machine Learning (2010). ACM Press, Haifa, Israel.
17. Kok, S., Sumner, M., Richardson, M., Singla, P., Poon, H., Lowd, D., Domingos, P. The Alchemy system for statistical relational AI. Technical Report. Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2000. http://alchemy.cs.washington.edu.
18. Lowd, D., Domingos, P. Efficient weight learning for Markov logic networks. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (2007). Springer, Warsaw, Poland, 200–211.
19. Lowd, D., Domingos, P. Recursive random fields. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (2007). AAAI Press, Hyderabad, India, 950–955.
20. Mihalkova, L., Huynh, T., Mooney, R.J. Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (2007). AAAI Press, Vancouver, Canada, 608–614.
21. Mihalkova, L., Mooney, R. Bottom-up learning of Markov logic network structure. In Proceedings of the 24th International Conference on Machine Learning (2007). ACM Press, Corvallis, OR, 625–632.
22. Nath, A., Domingos, P. A language for relational decision theory. In Proceedings of the International Workshop on Statistical Relational Learning (2009). Leuven, Belgium.
23. Niepert, M., Meilicke, C., Stuckenschmidt, H. A probabilistic-logical framework for ontology matching. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (2010). AAAI Press.
24. Niepert, M., Domingos, P. Learning and inference in tractable probabilistic knowledge bases. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (2015). AUAI Press, Brussels, Belgium.
25. Niu, F., Ré, C., Doan, A., Shavlik, J. Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS. PVLDB 4 (2011), 373–384.
26. Niu, F., Zhang, C., Ré, C., Shavlik, J. DeepDive: web-scale knowledge-base construction using statistical learning and inference. In VLDS, 2012.
27. Noessner, J., Niepert, M., Stuckenschmidt, H. RockIt: exploiting parallelism and symmetry for MAP inference in statistical relational models. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (2013). AAAI Press, Bellevue, WA.
28. Pfeffer, A. Practical Probabilistic Programming. Manning Publications, 2016.
29. Poole, D. First-order probabilistic inference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (2003). Morgan Kaufmann, Acapulco, Mexico, 985–991.
30. Poon, H., Domingos, P. Joint inference in information extraction. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (2007). AAAI Press, Vancouver, Canada, 913–918.
31. Poon, H., Domingos, P. Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (2009). ACL, Singapore.
32. Richardson, M., Domingos, P. Markov logic networks. Mach. Learn. 62 (2006), 107–136.
33. Riedel, S. Improving the accuracy and efficiency of MAP inference for Markov logic. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (2008). AUAI Press, Helsinki, Finland, 468–475.
34. Russell, S. Unifying logic and probability. Commun. ACM 58, 7 (2015), 88–97. https://doi.org/10.1145/2699411
35. Schulte, O., Gholami, S. Locally consistent Bayesian network scores for multi-relational data. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17) (2017). Melbourne, Australia, 2693–2700.
36. Singla, P., Domingos, P. Markov logic in infinite domains. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (2007). AUAI Press, Vancouver, Canada, 368–375.
37. Wang, J., Domingos, P. Hybrid Markov logic networks. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (2008). AAAI Press, Chicago, IL, 1106–1111.
38. Xiang, R., Neville, J. Relational learning with one network: An asymptotic analysis. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (2011). 779–788.
39. Kok, S., Domingos, P. Extracting semantic networks from text via relational clustering. In Proceedings of ECML/PKDD-08 (Antwerp, Belgium, Sept. 2008). Springer, 624–639.

Pedro Domingos ([email protected]) is a professor in the Allen School of Computer Science and Engineering at the University of Washington, Seattle, WA, USA.

Daniel Lowd ([email protected]) is an associate professor in the Department of Computer and Information Science at the University of Oregon, Eugene, OR, USA.

Copyright held by authors/owners. Publication rights licensed to ACM.
