
JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 19, Number 3, 2012
© Mary Ann Liebert, Inc.
Pp. 316–336
DOI: 10.1089/cmb.2011.0234

Review Article

Probabilistic Methods and Some Applications to Biology and Medicine

NIKITA A. SAKHANENKO1 and DAVID J. GALAS1,2

ABSTRACT

For the computational analysis of biological problems (analyzing data, inferring networks and complex models, and estimating model parameters) it is common to use a range of methods based on probabilistic logic constructions, sometimes collectively called machine learning methods. Probabilistic modeling methods such as Bayesian Networks (BN) fall into this class, as do Hierarchical Bayesian Networks (HBN), Probabilistic Boolean Networks (PBN), Hidden Markov Models (HMM), and Markov Logic Networks (MLN). In this review, we describe the most general of these (MLN), and show how the above-mentioned methods are related to MLN and to one another by the imposition of constraints and restrictions. This approach allows us to illustrate a broad landscape of constructions and methods, and to describe some of the attendant strengths, weaknesses, and constraints of many of these methods. We then provide some examples of their applications to problems in biology and medicine, with an emphasis on genetics. The key concepts needed to picture this landscape of methods are the ideas of probabilistic graphical models, the structures of the graphs, and the scope of the logical language repertoire used (from First-Order Logic [FOL] to Boolean logic). These concepts are interlinked and together define the nature of each of the probabilistic logic methods. Finally, we discuss the initial applications of MLN to genetics, show the relationship to less general methods like BN, and then mention several examples where such methods could be effective in new applications to specific biological and medical problems.

Key words: Bayesian networks, first-order logic, hierarchical Bayesian networks, machine learning, Markov logic networks, probabilistic Boolean networks, probabilistic graphical models, propositional logic.

1. INTRODUCTION

Logic and probabilistic graphical models are natural tools of computing and have been studied and used in computer science for many years. Some of these methods have emerged in recent years as promising approaches to a range of problems in artificial intelligence. In biology and medicine, where many of the computational problems involve data analysis and the inference of underlying complex models, simple forms of these methods, such as Hidden Markov Models (HMM) and simple forms of Bayesian Networks

1The Institute for Systems Biology, Seattle, Washington. 2Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg.


(BN), have become more commonly used in recent years (Baldi and Brunak, 2001). There is now a wide range of these methods that have been usefully applied to biological and medical problems, but it is often difficult to see in the reports of successful applications in the literature how they are related to one another. In this review, we attempt to address this difficulty and elucidate some of the key relationships, not specifically for the experts in the field, but more for the computational biologists who are grappling with the realities and complexities of current problems in systems biology, genetics, and their applications to modern medicine. This review is not a comprehensive survey of methods, nor of the specific techniques and implementations in the literature, but rather is focused on providing a map of some of the key concepts that we hope will allow computational biologists to see the relationships, contrasts, and overlaps among widely used methods. Even more importantly, in our view, we hope to point out where new applications of probabilistic logic methods may provide powerful new approaches to the sometimes dauntingly complex problems of understanding biological systems.

Our plan for this review is to begin with the most general combination of logic and probability, and to illustrate how simplifying restrictions and constraints lead to more familiar methods, in order to elucidate the key relationships and to indicate where some power is lost and where efficiency is gained. There are several relevant logical systems and several different forms, and representations, of probability distributions. These are briefly summarized below.

First-Order Logic (FOL) is a very general formal logic, and represents a more powerful tool for the expression of logical relationships than propositional logic (Ershov and Palyutin, 1986). Using propositional logic, we are not able to express assertions about classes of objects or events. FOL is considerably richer than propositional logic and allows just these expressions. FOL allows the propositional symbols to have arguments that range over elements of sets, which allows us to make assertions about the sets or classes. FOL can also deal with recursive statements, while propositional logic cannot. Probability distributions can involve a wide range of simple and complex dependencies and are often represented in graphical form as probabilistic graphical models, where the independencies of the variables, represented as nodes, are encoded in the edges. In this review, we illustrate the key relationships by beginning with the most general representations and proceeding to more restrictive and constrained ones. While there are some significant differences in the graphical representations used in various methods, the logical components of the probabilistic logic are where most of the restrictions occur.

Markov Logic Networks (MLN) represent a new and general approach to modeling based on full FOL and Markov Random Fields (MRF) (Richardson and Domingos, 2006). In addition to the high representational power of FOL, MRFs provide a very compact way of representing probability distributions and are very useful for modeling and reasoning in noisy, uncertain environments. For physicists, MRFs are probably most familiar as the mathematical representation of Ising models, which are simplified, statistical models of the interaction of arrays of magnetic spins (Kindermann and Snell, 1980). MLNs thus combine (and generalize) FOL and MRFs to gain most of the advantages of both the logic and probabilistic modeling worlds.
Among the advantages of MLN are the ability to handle arbitrary classes of variables and recursive statements, and the ability to construct multiple templates for MRFs. Using MLN allows us to perform sophisticated probabilistic modeling while directly incorporating biological knowledge, particularly including partial knowledge, into the models, which is one of our major motivations for developing this general approach. We think that this capability is fundamentally important for the development of system-level analysis and modeling of biological systems, including metabolic, regulatory, and genetic aspects of these systems.

It is useful to presage, or summarize, the conclusions of the later sections here by giving a compact overview of the key relationships that put the MLN and other methods into some context. We use the examples elaborated later in this article to illustrate this overview. Since the most widely used probabilistic graphical models in computational biology are various versions of BN, it is specifically useful to note the key differences between these and MLN. The major differences are these three:

1. The probability distributions that can be represented in MLN are those of MRF, which are more general than those of BN.
2. Relationships and probabilities can be assigned to classes of elements and/or events in MLN, whereas in BN probabilities are assigned to individual elements or events.
3. The representation of the logical relationships in MLN is much more compact than in BN (for those that can be expressed in BN), particularly for more complex models.

These differences will be illustrated and discussed in later sections. A fourth difference is that MLNs are much more computationally intensive to implement, which can be an important practical limitation.

Now we turn to a specific example of model selection in genetic analysis. We have applied MLN by using a search method that allows us to solve various model selection problems with different biological knowledge constraints naturally embedded in them. In a previous article, we used MLN to select models for the influence of the genotype (specific sets of markers, indicating gene variants) on the phenotype (e.g., properties of the organism as expressed in measurements: size, color, shape, gene expression levels). We first used a single-marker model iteratively that focused on relationships between genetic markers and a given phenotype (Sakhanenko and Galas, 2010). Every marker associated with a gene selected in the model had a specific influence on the phenotype, even when considered alone. The model is asked to predict the phenotype given the genotype of a single organism, and the markers are iteratively selected to improve the prediction. Later in this review, we discuss a natural extension of this application that uses a pair-wise gene model that explicitly represents relationships between a phenotype and pair-wise interacting genes. The models discussed here to illustrate the approach, while complex in implementation, are still rather simple genetic models, and the full power of MLNs will come with the natural extension to much more complex models. This is one of the major points we would like to emphasize: much more powerful applications are to come, and probabilistic logic represents a ''language'' for expression of, and calculation with, such complex models.

To illustrate the method in a simple form, a single-marker model encodes the influence of a set of genetic markers on a phenotype as an aggregation of influences of individual markers. When conditioned on the alleles of the markers (predicting the phenotype values given the genotype data), modeling with a single-marker model can be seen (as we show explicitly in a later section) as performing a logistic regression of the marker alleles on the phenotype values. Note, however, that the single-marker model does not assume that the data are independently and identically distributed (iid), as a true logistic regression does. At the first iteration of the search method (see Algorithm 1 of Section 4), when each model is based on only one marker, the corresponding logistic regression has two predictors (assuming a genetic marker can have two possible alleles). For the following iterations, as we add more markers to the set of predictive markers, the corresponding logistic regression is expanded to include more predictors (four, six, and so on). It is important to note, however, that the MLN description of the models at different levels of iteration does not change; what changes is the set of possible values of the MLN variables that correspond to the predictors of the logistic regression. The method is then a systematic application of a template specified by MLN to different subsets of the data. This template is essentially a model. We can then use this template to add biological domain knowledge into the probabilistic search: information about how the various parts of a biological system interact or influence one another.
For example, by using pair-wise models, we expand the template to specifically allow for pair-wise gene interactions. A pair-wise model encodes an aggregation of joint influences of pairs of markers, together with their interactions, on a phenotype. This model is an MLN template that explicitly takes into account possible gene interactions. As with the single-marker model, when predicting phenotype values from genotype values, the pair-wise model can also be seen as a logistic regression. The difference between the single-marker and the pair-wise models becomes apparent when we look closely at their corresponding forms of regression. A logistic regression derived from the pair-wise model contains all the terms of the regression of the single-marker model given the same set of markers, plus the terms that correspond to possible interactions in each pair of the markers. If the markers have no gene interactions, then the corresponding pair-wise model reduces to a single-marker model. On the other hand, if the markers do interact, then the pair-wise model can encompass that effect when predicting the phenotype, even if the markers individually have little influence on the phenotype and cannot therefore be detected as contributing in the single-gene model. This is why the pair-wise models have been successful at capturing synthetic gene interactions that confer a phenotype only when both specific alleles of a pair of genes are present. However, if two genes have an effect on phenotype individually, but also have a significant interaction, the interaction may be detected by the single-gene template as a non-additive effect of the two genes on prediction accuracy (because of the induced correlation or dependence). The pair-wise model, on the other hand, can detect significant interaction effects even if the single-gene effects are in the noise and not detectable. The pair-wise model, while more complex than the single-gene model, is clearly just the tip of the iceberg of the kinds of complex models that likely underlie real complex genetic traits.

MLNs are powerful tools for systems genetics since they allow us to incorporate explicit biological knowledge into genome-wide studies. For example, when using the pair-wise model, we can control the interaction terms of the corresponding logistic regression through specific constraints in the logical relationships based on our biological knowledge. We could also define specific constraints concerning the gene interactions in the model. Moreover, the logic component of MLNs makes it possible to represent large probabilistic models (a logistic regression with many interaction terms, for example) in a highly compact way. Using MLNs as model templates in genetic studies is a huge step beyond Genome-Wide Association Studies (GWAS) and moves us substantially toward true systems genetics.

It is important to compare MLNs to other established and related modeling methods in computational biology. Since MLNs are based on MRFs and FOL, it is natural that MLNs represent a generalization of both of these (Richardson and Domingos, 2006). In the most extreme case, we can reduce MLNs simply to FOL by setting all the weights in an MLN to an arbitrarily large number. Conversely, we can also reduce MLNs to MRFs by assigning an MLN formula to each clique of an MRF and then using the clique's potential as the weight of that formula. As with MRFs, we can also express any probability distribution represented by a BN using MLNs. Consequently, other kinds of BNs, such as Hierarchical BNs (HBN) and Dynamic BNs (DBN), can also be represented in the MLN framework. We note, for example, that an HMM is actually a particularly simple form of a DBN.

To briefly summarize the relationships between these methods, we present a simplified table of properties here (Table 1). These elements will be described and discussed in the body of the review, and though some of the entries in this table may be less than clear at this point, the table represents a rough map of the methods reviewed and of our intent. We revisit the summary of relationships more explicitly in a later section (Fig. 3). To further explore and analyze the relationship between MLNs and other methods, we will now consider in more detail the probabilistic, logic, and structural components of each method and indicate where the methods differ in how constrained their components are. The MLNs have the least constrained logic component, and so we will use them as the general, master method. Next, in Section 2, we set out the notation carefully, and we briefly introduce FOL, MRFs, and MLNs with a little more rigor. In Section 3, we show explicitly the relationship between MLNs and BNs, and then discuss how MLNs fit into a broader picture of probabilistic logic methods. To make it explicitly clear, we illustrate a specific MLN representation of a BN in an example in Section 3. In Section 4, we illustrate the use of MLN-based methods applied thus far to genetics. These examples show two types of models, single-marker and pair-wise models, which are analyzed in Sections 5 and 6, respectively. Finally, we conclude by describing some possible future applications of MLNs to other biological problems in Section 7.

Table 1. Brief Summary of Some of the Key Elements of Probabilistic Logic Methods

Name | Probabilistic component | Logic component | Graphical representation
Markov Logic Networks | Markov Random Fields | First-order logic | General undirected graphs
Markov Random Fields | Conditional independence via graph separation | Propositional logic | General undirected graphs
Bayesian Networks | Conditional probability chain rule | Propositional logic with mutually exclusive constraints and acyclicity requirements | Directed acyclic graphs (DAGs)
Hierarchical Bayesian Networks | Conditional probability chain rule | Propositional logic with mutually exclusive constraints and structural requirements | Directed trees with links inside each layer
Dynamic Bayesian Networks | Conditional probability chain rule, Markov chain | Propositional logic with mutually exclusive constraints and structural requirements | Sequence of snapshots (DAGs) connected in sequential order
Probabilistic Boolean Networks | | Boolean logic | Sequence of snapshots connected in order

2. PRELIMINARIES

We describe here all the necessary preliminary information, definitions, and notation for logic, MRFs, and networks.

2.1. Logic

Among the various systems of logic, the most general one we will use here is FOL. Propositional logic, on the other hand, has the simplest semantics, and many concepts of propositional logic generalize to FOL (Ershov and Palyutin, 1986). In propositional logic, there are atomic (simple) assertions, consisting of propositional letters, and compound assertions, composed from the atomic assertions and the logical connectives: and (∧), or (∨), not (¬), implication (→), and equivalence (↔). An interpretation in propositional logic is a mapping that assigns a truth value (True or False) to every propositional letter. Once the atomic assertions in a proposition have received an interpretation, we can compute the truth value of the proposition. We can do this since all propositional formulas are inductively constructed from the atomic assertions, and the logical connectives are interpreted with the truth table given in Table 2. In propositional logic, propositions that are equivalent irrespective of their interpretations form an equivalence class; the structure based on these equivalence classes is called a Lindenbaum algebra.
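As a tiny illustration (our own, not from the original text), the following Python snippet evaluates the compound assertion (P ∧ Q) → ¬P under one interpretation of the propositional letters, using the connective semantics of Table 2.

# Evaluate the compound proposition (P and Q) -> not P under one
# interpretation of the propositional letters P and Q.
def implies(p, q):
    return (not p) or q

interpretation = {"P": True, "Q": False}

p, q = interpretation["P"], interpretation["Q"]
value = implies(p and q, not p)
print(value)  # True: the antecedent (P and Q) is False under this interpretation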

Using propositional logic, we are not able to express assertions about elements of structures. FOL is considerably richer than propositional logic and allows these expressions: it allows the propositional symbols to have arguments that range over elements of different structures, which lets us make assertions about the sets of elements of structures. In FOL, there are atomic formulas, consisting of predicates applied to logical terms. A term is an entity inductively constructed from variables and functions, and can have many levels of complexity. Note that the simplest term is a value of a variable, called a constant, which can be seen as a function that takes no arguments, or has the same value irrespective of its arguments. Note also that the simplest atomic formula is a predicate that takes no arguments, which is similar to a propositional letter in propositional logic. In FOL, there are formulas, composed inductively from the atomic formulas, the logical connectives ∧, ∨, ¬, →, ↔ (as in propositional logic), and the quantifiers: universal (∀, the usual mathematical symbol for ''each and every'') and existential (∃, the usual mathematical symbol for ''there exists''). In order to assign meaning to the symbols in FOL, we first must define a domain (a universe), which is the domain of the logical variables. An instantiation of a first-order formula is, then, a replacement of the variables of the formula with logical terms. Note that the most common instantiation replaces the variables with values (constants) from their domain. An interpretation in FOL is a mapping that assigns a truth value to every atomic formula for every instantiation of variables. An interpretation is then recursively defined on complex formulas. Note that, given an instantiation, formulas composed from atomic formulas and logical connectives are interpreted just as in propositional logic. The interpretation of quantified formulas is as follows:

(∀x A(x)) is true iff A(d) is true for every d ∈ Domain;
(∃x A(x)) is true iff A(d) is true for some d ∈ Domain.

A possible world (a Herbrand interpretation) in FOL is an assignment of truth values to all possible predicates whose variables are replaced with every possible combination of values. For example, one possible world for the program (13) is given in Table 3.

Table 2. Truth Table Defining the Interpretation of Logical Connectives of Propositional Logic

P | Q | ¬P | P ∧ Q | P ∨ Q | P → Q | P ↔ Q
True | True | False | True | True | True | True
True | False | False | False | True | False | False
False | True | True | False | True | True | False
False | False | True | False | False | True | True

Table 3. Possible World (in FOL) for the Program (13)

Phenotype(s1, t1) | True
Phenotype(s1, t2) | False
... | ...
Phenotype(sN4, tN3) | True
Allele(s1, m1, v1) | False
... | ...
Allele(sN4, mN1, vN2) | False

Here the predicates Phenotype(si, tj) and Allele(si, mk, vl) encode two facts (two data points): that tj is the phenotype value of strain si and that vl is the allele of marker mk for strain si. Note that various model restrictions can be applied by reducing the set of all possible predicates and the set of possible values of the variables.
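To make the notion of a possible world concrete, the following minimal Python sketch (our own illustration; the predicate and constant names are hypothetical) enumerates all truth assignments to the ground atoms of a toy two-strain, one-marker domain.

from itertools import product

# Hypothetical toy domain: two strains, one marker with two alleles,
# and a binary phenotype. A ground atom is a predicate with all of its
# variables replaced by constants.
strains = ["s1", "s2"]
phenotype_values = ["t1", "t2"]
alleles = ["v1", "v2"]

ground_atoms = (
    [("Phenotype", s, t) for s in strains for t in phenotype_values]
    + [("Allele", s, "m1", v) for s in strains for v in alleles]
)

# A possible world (Herbrand interpretation) assigns True/False to every
# ground atom; enumerating all assignments gives 2**len(ground_atoms) worlds.
worlds = [dict(zip(ground_atoms, truth))
          for truth in product([True, False], repeat=len(ground_atoms))]

print(len(ground_atoms), "ground atoms ->", len(worlds), "possible worlds")  # 8 -> 256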

2.2. Probabilistic graphical models

A BN is a probabilistic graphical model that represents random variables and their probabilistic dependencies with a directed, acyclic graph (Pearl, 1988; Koller and Friedman, 2009). A BN is represented by a directed, acyclic graph where each vertex represents a random variable and each edge represents a conditional dependence between two vertices. Since the graph is directed, for every edge we distinguish a parent, the vertex from which the edge originates, and a child, the vertex to which the edge goes. Consequently, every vertex is assigned a conditional probability, in a table defining the probability of its variable given every possible set of values of the parents of the vertex (corresponding to all edges going into the vertex); we could say ''conditioned on'' those values. One powerful feature of BNs is their ability to graphically encode conditional independence between random variables: a variable is conditionally independent of any of its non-descendant variables, given the values of its parents. This feature allows BNs to specify probability distributions in a compact and efficient way, but only those distributions that fit these constraints. Consider a set of random variables V = {V_1, ..., V_N} and consider a directed, acyclic graph G = (V, E) based on the set V (thus, every V_i will be referred to as either a variable or a vertex). Given a parameter set Θ = {Θ_1, ..., Θ_N}, where each Θ_i is a conditional probability distribution of V_i given its parents, Θ_i = Pr(V_i | parents(V_i)), then ⟨V, G, Θ⟩ is a BN if the joint probability distribution on V can be factorized as

\Pr(V) = \prod_{i=1}^{N} \Theta_i = \prod_{i=1}^{N} \Pr(V_i \mid \mathrm{parents}(V_i)). \qquad (1)

Note that in many applications of BNs researchers give causal meaning to the edges of a network. In general, however, the directionality of edges does not imply causality. Note that the significance of the factorization is related to the chain rule for conditional probabilities; the interpretation of this rule is the source of these incorrect causality arguments. Consider here an example from a textbook (Luger, 2008). Suppose there is one event that can cause orange barrels to appear on the road, B = true: a road construction, C = true. There is also one event that can cause flashing lights to appear on the road, L = true: an accident, A = true. Suppose then that there are two events that can cause bad traffic, T = true: either an accident or a road construction. This example can be modeled by the BN shown in Figure 1, whose parameters are given in its conditional probability tables. Note that all the variables are binary in this case. The probabilistic independencies encoded by the BN in Figure 1 allow us to express the joint probability distribution in a compact way with a small number of parameters. Here, the rule is that the probabilities of nodes not directly connected by edges are independent (given their parents) and can be multiplied:

\Pr(C, A, B, T, L) = \Pr(C) \cdot \Pr(A) \cdot \Pr(B \mid C) \cdot \Pr(T \mid C, A) \cdot \Pr(L \mid A).

FIG. 1. Structure of a road traffic Bayesian network. Here B stands for orange barrels on the street, C for a road construction, T for traffic, A for an accident, and L for flashing lights.
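The factorization above is easy to exercise directly. The following is a minimal Python sketch (not the authors' code); the CPT values are assumptions for illustration, chosen to match the weighted-model-counting example of Section 3.1, and only the entries needed for this one instantiation are specified.

# Sketch of Pr(C,A,B,T,L) = Pr(C) Pr(A) Pr(B|C) Pr(T|C,A) Pr(L|A)
# for the traffic network of Figure 1, with assumed CPT values.
pr_C = {"c2": 0.6}                    # Pr(road construction = c2), assumed
pr_A = {"a1": 0.5}                    # Pr(accident = a1), assumed
pr_B_given_C = {("b1", "c2"): 0.2}    # Pr(barrels = b1 | C = c2), assumed
pr_T_given_CA = {("t1", "c2", "a1"): 0.8}
pr_L_given_A = {("l2", "a1"): 0.01}

def joint(c, a, b, t, l):
    """Chain-rule factorization of the BN joint distribution."""
    return (pr_C[c] * pr_A[a] * pr_B_given_C[(b, c)]
            * pr_T_given_CA[(t, c, a)] * pr_L_given_A[(l, a)])

print(joint("c2", "a1", "b1", "t1", "l2"))  # 0.6*0.5*0.2*0.8*0.01 = 0.00048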

MRFs are another kind of probabilistic graphical model, similar to BNs in how dependencies are represented (Kindermann and Snell, 1980; Pearl, 1988; Koller and Friedman, 2009). As opposed to BNs, though, MRFs are defined on undirected graphs. Removing edge directionality eliminates the asymmetry between a parent vertex and a child vertex, which allows MRFs to represent cyclic dependencies that cannot be represented by BNs. Note, however, that there are dependencies, such as induced dependencies, that can be represented by BNs but not by MRFs. Thus, MRFs offer an alternative graphical semantics for probability distributions, where graph separation of two vertices by a separating set implies conditional independence of the two vertices given the separating set, and we are able to make different conditional independence statements in MRFs than in BNs. More formally, given three disjoint sets of vertices, A, B, C, in an undirected graph, the set B separates A from C in the graph if every path from A to C contains at least one vertex from B. We now use the definition of graph separation to define an MRF. Consider a set of random variables V = {V_1, ..., V_N} and consider an undirected graph G = (V, E) based on the set V. An MRF defined on V is a probability distribution such that there exists a graph G with the condition that, for disjoint sets of vertices A, B, C, if A and C are separated by B, then the vertices in A are conditionally independent of the vertices in C given B. The conditional independence property of MRFs implies that if we have two vertices V_i and V_j that are not directly connected,

then they are conditionally independent given all other vertices, i.e.,

\Pr(V_i = v_i, V_j = v_j \mid V \setminus \{V_i, V_j\} = v \setminus \{v_i, v_j\}) = \Pr(V_i = v_i \mid V \setminus \{V_i, V_j\} = v \setminus \{v_i, v_j\}) \cdot \Pr(V_j = v_j \mid V \setminus \{V_i, V_j\} = v \setminus \{v_i, v_j\}).

Here, V_i = v_i stands for the event that a variable V_i takes on a value v_i, V = v is shorthand for V_1 = v_1, ..., V_N = v_N, and V \ {V_i, V_j} is the set V without the vertices V_i and V_j. Note also that two directly connected vertices are not conditionally independent given all other vertices. This can be used to factorize an MRF. A clique c of a graph G is a subgraph such that all vertices of c are fully connected. A Gibbs distribution defined on G is

\Pr(V = v) = \frac{1}{Z} \prod_{c \in Cl} \exp(\phi_c(v_c)). \qquad (2)

The so-called partition function Z = \sum_{v} \prod_{c \in Cl} \exp(\phi_c(v_c)) normalizes the probability to ensure that \sum_{v} \Pr(V = v) = 1. Here Cl is the set of all cliques of G, v_c is the restriction of a configuration v to a clique c, and \phi_c is a real-valued potential function assigned to clique c. Given a Gibbs distribution on G,

\Pr(V_1 = v_1 \mid V \setminus \{V_1\} = v \setminus \{v_1\}) = \frac{\Pr(V = v)}{\Pr(V \setminus \{V_1\} = v \setminus \{v_1\})} = \frac{\prod_{c \in Cl} \exp(\phi_c(v_1, v_2, \ldots, v_N))}{\sum_x \prod_{c \in Cl} \exp(\phi_c(x, v_2, \ldots, v_N))} = \frac{\prod_{\{c \in Cl \mid V_1 \in c\}} \exp(\phi_c(v_1, v_2, \ldots, v_N))}{\sum_x \prod_{\{c \in Cl \mid V_1 \in c\}} \exp(\phi_c(x, v_2, \ldots, v_N))}.

Note that although for convenience we write \phi_c(V_1, \ldots, V_N), this potential function depends only on the vertices in the clique c; hence the right side of the derivation above depends only on the variables from the cliques containing V_1. Thus, the variable V_1 is conditionally dependent on only the variables from the same cliques it belongs to, which means that any Gibbs distribution is an MRF. On the other hand, the Hammersley-Clifford theorem proves the converse: every positive MRF corresponds to some Gibbs distribution. Without loss of generality, we can represent an MRF conveniently as a log-linear model:

\Pr(V = v) = \frac{1}{Z} \exp\Big(\sum_i w_i f_i(v)\Big), \qquad (3)

where the f_i are real-valued functions defining features of the MRF and the w_i are the real-valued weights of the MRF, which are the parameters of the model (Pietra et al., 1997). Features can be as simple as indicator functions representing the presence of some attributes, and they can overlap in arbitrary ways, providing representational flexibility.
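As a concrete illustration of equation (3), here is a short Python sketch (our own, with made-up indicator features and weights) that computes a log-linear distribution over two binary variables by brute-force normalization.

import math
from itertools import product

# A minimal sketch of the log-linear MRF of equation (3) over two
# binary variables; the indicator features and weights are invented.
features = [
    lambda v: 1.0 if v[0] == 1 else 0.0,                 # f1: V1 is on
    lambda v: 1.0 if v[0] == 1 and v[1] == 1 else 0.0,   # f2: V1 and V2 on
]
weights = [0.5, 1.5]

def unnormalized(v):
    return math.exp(sum(w * f(v) for w, f in zip(weights, features)))

# Brute-force partition function Z over all configurations.
states = list(product([0, 1], repeat=2))
Z = sum(unnormalized(v) for v in states)

for v in states:
    print(v, unnormalized(v) / Z)   # the four probabilities sum to 1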

2.3. Markov logic networks

Because of their flexibility and their potential for representing complex relationships, we have proposed using probabilistic logic methods for analyzing genetic data. From the set of related logic-based probabilistic methods, we chose the most general of these, MLNs, and have used the method to identify genetic loci that predict quantitative phenotypes (Sakhanenko and Galas, 2010). MLNs merge MRFs with first-order logic (see Sections 2.1 and 2.2). Any strictly formal, purely logical system (one without probabilities) is not suitable for applications where the data contain any uncertainty or noise. This is because a set of first-order formulas specifying a logical model is seen as a set of uncompromising requirements, so that the model is either true or false by comparison with the data. In other words, models in FOL can only have probability values 1 or 0, and since noise and uncertainty exist in all real data, no model is satisfied exactly and all must therefore be false. Since the data we wish to analyze are actually rich in information while not being exactly satisfied, this is not a useful point of view. MLNs relax this constraint by allowing a model with unsatisfied formulas to have a probability less than 1. The model with the smallest measure of unsatisfied formulas is the most probable, and will therefore represent the most successful extraction of knowledge of reality from the data.

An MLN is a set of first-order formulas, F_i, with assigned weights w_i. An MLN, together with the set of possible values of its variables, is then converted to an MRF as follows. For every predicate of the MLN whose variables are instantiated with every possible combination of values, we create one random variable in the MRF. The value of the random variable is either 1 or 0, corresponding to whether the instantiated predicate is true or false. Furthermore, for every possible instantiation of every formula F_i, we construct one feature of the log-linear MRF (see Section 2.2) whose value is either 1 or 0, depending on the truth value of the instantiated formula. The weight of the MRF feature is the weight w_i assigned to the formula F_i in the MLN. Taking the original definition (3) and replacing all the features corresponding to false formulas with 0, we obtain the probability distribution represented by an instantiated MLN:

\Pr(x) = \frac{1}{Z} \exp\Big(\sum_i w_i n_i(x)\Big), \qquad (4)

where n_i(x) is the number of times the formula F_i is instantiated to a true proposition in the state x (which corresponds to our data). Note that this probability distribution depends on the set of possible values of the MLN variables. Therefore, MLNs can be seen as templates specifying classes of MRFs, similar to FOL specifying propositional formulas (see Section 2.1).
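To make the template idea concrete, here is a minimal Python sketch (our own illustration; predicate names, constants, and the weight are hypothetical) that grounds a single weighted formula Allele(s, m1, A) → Phenotype(s, t1) over two strains and computes the unnormalized weight exp(w · n(x)) of one possible world, as in equation (4).

import math

# Hypothetical domain: two strains; one weighted formula template
# Allele(s, m1, A) -> Phenotype(s, t1), instantiated once per strain.
strains = ["s1", "s2"]
w = 1.2  # assumed formula weight, for illustration only

# One possible world: a truth assignment to all ground atoms.
world = {
    ("Allele", "s1", "m1", "A"): True,
    ("Phenotype", "s1", "t1"): True,
    ("Allele", "s2", "m1", "A"): True,
    ("Phenotype", "s2", "t1"): False,
}

def implication(p, q):
    return (not p) or q

# n(x): number of groundings of the formula that are true in this world.
n = sum(
    implication(world[("Allele", s, "m1", "A")],
                world[("Phenotype", s, "t1")])
    for s in strains
)

print("true groundings:", n)              # 1 for this world
print("unnormalized weight:", math.exp(w * n))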

3. MARKOV LOGIC NETWORKS VERSUS BAYESIAN NETWORKS

3.1. Representing Bayesian networks with conjunctive normal forms

Darwiche (2002) showed that a BN can be encoded in a Conjunctive Normal Form (CNF). A CNF C is a conjunction of logical clauses D_i: C = D_1 ∧ ... ∧ D_N, where a logical clause D_i is a disjunction of propositional variables or their negations (although a variable and its negation cannot be in the same clause): thus, V_{i1} ∨ ¬V_{i2} ∨ ... ∨ V_{iK}.

Following Darwiche's notation, we first establish the alphabet of propositional logic corresponding to the objects of a BN. We use two types of propositional symbols here, λ_v and θ_{v|u}. A propositional symbol λ_v, for each value v of a random variable V, is interpreted as being true iff V = v. Note that V can have more than two values. A propositional symbol θ_{v|u}, for each value v of V and for each combination of values u of the set U of parent vertices of V, is interpreted as being true iff there is an entry in a conditional probability table whose value is Pr(V = v | U = u). These elements can be used to define precisely the logical structure inherent in BNs.

Consider then a BN ⟨V, G, Θ⟩ and construct a CNF encoding this BN. First, for each network variable V, whose possible values are v_1, ..., v_K, we must include the following clauses (disjunctions): λ_{v1} ∨ ... ∨ λ_{vK} and ¬λ_{vi} ∨ ¬λ_{vj}, i ≠ j. These clauses ensure that we use exactly one value for each random variable when evaluating our BN. In addition, for each entry of every conditional probability table of the BN, we must include in the CNF encoding the following clauses: λ_v ∧ λ_{u1} ∧ ... ∧ λ_{uM} ↔ θ_{v|u1,...,uM}. These clauses ensure that, while evaluating the BN, the evidence v, u_1, ..., u_M is necessary and sufficient for the BN to include a conditional probability table entry that contains Pr(v | u_1, ..., u_M).

Let us now construct a CNF encoding the road traffic BN from Figure 1. Our CNF is a conjunction of the disjunctions (clauses) shown in Table 4. The CNF models (possible truth assignments to the propositional variables) are in one-to-one correspondence with the instantiations of the network variables. Following Darwiche (2002), we can now apply weighted model counting to the CNF by assigning weights to all the letters and their negations: the weight of λ_v, ¬λ_v, and ¬θ_{v|u} is 1, and the weight of θ_{v|u} is the value Pr(V = v | U = u) from the corresponding CPT of the BN. Consequently, the probability of an event e can be computed by weighted model counting of the CNF in conjunction with e. For example, one of the 32 (2^{|V|} = 2^5) possible CNF models corresponds to the following instantiation of the random variables of the BN:

C = c2, A = a1, B = b1, T = t1, L = l2.

In this model, the following propositional letters are true:

λ_{c2}, λ_{a1}, λ_{b1}, λ_{t1}, λ_{l2}, θ_{c2}, θ_{a1}, θ_{b1|c2}, θ_{t1|c2,a1}, θ_{l2|a1}.

Consequently, the weight of this model is the product of the weights of these propositional letters:

0.6 · 0.5 · 0.2 · 0.8 · 0.01 ≈ 0.0005.

Thus, it is possible to represent and elucidate the logical structure of BNs using the CNF formulation. We can use this in turn to make the connection directly to MLNs.

3.2. Representing Bayesian networks with Markov logic networks

The relationship between these two kinds of networks can best be elucidated by explicitly formulating one in terms of the other. Let us then encode a BN using MLNs by a construction similar to the CNF encoding in the previous section.

Table 4. Set of Propositional Clauses Representing Each Element of Every Conditional Probability Table (CPT) of the BN Given in Figure 1

BN component | CNF clauses
Variable C | λ_{c1} ∨ λ_{c2};  ¬λ_{c1} ∨ ¬λ_{c2}
Variable A | λ_{a1} ∨ λ_{a2};  ¬λ_{a1} ∨ ¬λ_{a2}
Variable B | λ_{b1} ∨ λ_{b2};  ¬λ_{b1} ∨ ¬λ_{b2}
Variable T | λ_{t1} ∨ λ_{t2};  ¬λ_{t1} ∨ ¬λ_{t2}
Variable L | λ_{l1} ∨ λ_{l2};  ¬λ_{l1} ∨ ¬λ_{l2}
CPT for C | λ_{c1} ↔ θ_{c1};  λ_{c2} ↔ θ_{c2}
CPT for A | λ_{a1} ↔ θ_{a1};  λ_{a2} ↔ θ_{a2}
CPT for B | λ_{b1} ∧ λ_{c1} ↔ θ_{b1|c1};  λ_{b1} ∧ λ_{c2} ↔ θ_{b1|c2};  λ_{b2} ∧ λ_{c1} ↔ θ_{b2|c1};  λ_{b2} ∧ λ_{c2} ↔ θ_{b2|c2}
CPT for T | λ_{t1} ∧ λ_{c1} ∧ λ_{a1} ↔ θ_{t1|c1,a1};  λ_{t1} ∧ λ_{c1} ∧ λ_{a2} ↔ θ_{t1|c1,a2};  λ_{t1} ∧ λ_{c2} ∧ λ_{a1} ↔ θ_{t1|c2,a1};  λ_{t1} ∧ λ_{c2} ∧ λ_{a2} ↔ θ_{t1|c2,a2};  λ_{t2} ∧ λ_{c1} ∧ λ_{a1} ↔ θ_{t2|c1,a1};  λ_{t2} ∧ λ_{c1} ∧ λ_{a2} ↔ θ_{t2|c1,a2};  λ_{t2} ∧ λ_{c2} ∧ λ_{a1} ↔ θ_{t2|c2,a1};  λ_{t2} ∧ λ_{c2} ∧ λ_{a2} ↔ θ_{t2|c2,a2}
CPT for L | λ_{l1} ∧ λ_{a1} ↔ θ_{l1|a1};  λ_{l1} ∧ λ_{a2} ↔ θ_{l1|a2};  λ_{l2} ∧ λ_{a1} ↔ θ_{l2|a1};  λ_{l2} ∧ λ_{a2} ↔ θ_{l2|a2}

Consider a set of random variables V = {V_1, ..., V_N} of a BN, where V_i ∈ {v_i^1, ..., v_i^K}. For each value v_i^j of every variable V_i we define a predicate IV_i(x) such that IV_i(v_i^j) = True ↔ V_i = v_i^j. In BNs, each random variable can take on one and only one value; therefore, we have to add the following formulas to the corresponding encoding MLN to specify this restriction:

\exists v_i^k \; IV_i(v_i^k). \qquad (5)
(v_i^k \neq v_i^l) \wedge IV_i(v_i^k) \rightarrow \neg IV_i(v_i^l). \qquad (6)

Note that formulas (5) and (6) are ''pure'' FOL formulas (they are either true or false, meaning that their probabilistic weight is large without bound). Consider the set of parameters Θ = {Θ_1, ..., Θ_N} of the BN we are trying to encode, where each Θ_i is the conditional probability table assigned to a variable V_i. Assuming variable V_i has K parents in the BN,

V_{i1}, ..., V_{iK}, then

\Theta_i = \{\Pr(V_i = v_i \mid V_{i1} = v_{i1}, \ldots, V_{iK} = v_{iK}), \; \forall v_i, v_{i1}, \ldots, v_{iK}\}. \qquad (7)

For every BN parameter Θ_i, the corresponding encoding MLN contains the following set of formulas:

w_i: \; IV_i(v_i) \wedge IV_{i1}(v_{i1}) \wedge \ldots \wedge IV_{iK}(v_{iK}), \qquad (8)

where the weight w_i = \ln(\Pr(V_i = v_i \mid V_{i1} = v_{i1}, \ldots, V_{iK} = v_{iK})). Note that the set (8) contains N_i \cdot N_{i1} \cdot \ldots \cdot N_{iK} formulas, one formula per element of the conditional probability table Θ_i. Here N_i and N_{ij} are the numbers of values the variables V_i and V_{ij} can take. If \Pr(V_i = v_i \mid V_{i1} = v_{i1}, \ldots, V_{iK} = v_{iK}) = 0, then we instead use a ''hard'' encoding formula (whose weight is large without bound), forbidding that combination of values:

\neg IV_i(v_i) \vee \neg IV_{i1}(v_{i1}) \vee \ldots \vee \neg IV_{iK}(v_{iK}). \qquad (9)

As indicated in equation (4), an MLN represents the following distribution:

\Pr(V_1 = v_1, \ldots, V_N = v_N) = \frac{1}{Z} \exp\Big(\sum_{\{F_i\}} w_i f_i(v_1, \ldots, v_N)\Big), \qquad (10)

where the sum is taken over all formulas of the MLN. For each formula F_i, w_i is its weight and f_i is its characteristic function:

f_i(v_1, \ldots, v_N) = \begin{cases} 1, & F_i(V_1 = v_1, \ldots, V_N = v_N) = \text{True}, \\ 0, & \text{otherwise}. \end{cases}

Because of the way we constructed this MLN to encode a BN, equation (10) is a product of elements of the BN conditional probability tables corresponding to the BN instantiation v_1, ..., v_N. Note also that the partition function Z, defined as Z = \sum_{v_1, \ldots, v_N} \exp\big(\sum_{\{F_i\}} w_i f_i(v_1, \ldots, v_N)\big), is 1 in this case, since the values of the random variables are mutually exclusive and the formulas exhaust all the possibilities; thus we are essentially adding up the joint probabilities for all possible instantiations of the set of random variables. Let us now illustrate with an example how a BN can be encoded by an MLN. Consider the toy BN in Figure 2.

FIG. 2. Structure of a toy Bayesian network. Here A and B stand for two events.

Let us define predicates IA(A) and IB(B):

IA(a_i) = \text{True} \leftrightarrow A = a_i \qquad (11)
IB(b_i) = \text{True} \leftrightarrow B = b_i \qquad (12)

The MLN encoding the toy BN in Figure 2 consists of the following formulas:

\exists x \; IA(x). \qquad (x \neq y) \wedge IA(x) \rightarrow \neg IA(y).
\exists x \; IB(x). \qquad (x \neq y) \wedge IB(x) \rightarrow \neg IB(y).

\ln(0.6): \; IA(a_1)
\ln(0.4): \; IA(a_2)
\ln(0.9): \; IA(a_1) \wedge IB(b_1)
\ln(0.1): \; IA(a_1) \wedge IB(b_2)
\ln(0.2): \; IA(a_2) \wedge IB(b_1)
\ln(0.8): \; IA(a_2) \wedge IB(b_2)

The top four first-order formulas of this MLN ensure that the arguments of IA and IB are mutually exclusive and exhaustive, whereas the remaining formulas represent every element of the corresponding probability tables of the BN. Note that the representational complexity of this MLN and of the original BN is the same, since we are essentially mapping every element of the BN to a formula in the MLN. Note that the description of the MLN is a bit lengthy, since we want to express exactly the same probability distribution represented by the BN. As a result, most of the formulas of this MLN are propositional (since they were manually instantiated) with explicitly specified weights. The representational power of MLN becomes clearer in this example if we decide to learn the weights from data, in which case the propositional part of the MLN can be replaced with the more concise

IA(+x)
IA(+x) \wedge IB(+y)

Moreover, this structure does not change if we change the domains of the logical variables x and y. This demonstrates both that BNs can be represented as MLNs and that part of this representation consists of restrictions that are not necessary if we do not impose the BN constraints.
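As a quick numerical check of this construction (our own sketch, not from the paper), the following Python snippet builds the weighted formulas above from the toy CPT entries (w = ln p) and evaluates equation (10) for one instantiation; by construction, with Z = 1, the result equals Pr(A = a1) · Pr(B = b2 | A = a1).

import math

# Toy CPTs of Figure 2 (values as in the weighted formulas above).
pr_A = {"a1": 0.6, "a2": 0.4}
pr_B_given_A = {("b1", "a1"): 0.9, ("b2", "a1"): 0.1,
                ("b1", "a2"): 0.2, ("b2", "a2"): 0.8}

# MLN weights: one formula per CPT entry, with weight ln(p).
w_A = {a: math.log(p) for a, p in pr_A.items()}
w_AB = {ab: math.log(p) for ab, p in pr_B_given_A.items()}

def mln_unnormalized(a, b):
    # Sum of weights of the formulas true in the world (A=a, B=b); exactly
    # one IA(.) formula and one IA(.) ^ IB(.) formula are true.
    return math.exp(w_A[a] + w_AB[(b, a)])

# Equation (10) with Z = 1: the MLN probability equals the BN joint.
print(mln_unnormalized("a1", "b2"))             # 0.6 * 0.1 = 0.06
print(pr_A["a1"] * pr_B_given_A[("b2", "a1")])  # same value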

3.3. MLN in the landscape of probabilistic logic methods

Markov logic networks are based on a combination of MRFs with FOL. An MLN can be seen as a template (Richardson and Domingos, 2006) for constructing MRFs according to specific logical patterns. MLNs are more general than MRFs, since we can represent any MRF by an MLN (in the worst case we can simply list all the cliques of the MRF using statements in propositional logic). On the other hand, an MLN can be seen as a relaxation of the otherwise strict logical rules specified in FOL, allowing us to define the likelihood of logical models (Richardson and Domingos, 2006). MLNs can thus also be seen as a generalization of FOL: when the weights are equal and infinitely large, MLNs become FOL. Richardson and Domingos showed that, when all weights are equal and tend to infinity, an MLN represents a uniform distribution over all possible worlds satisfying the set of first-order formulas (Richardson and Domingos, 2006). Moreover, using an MLN with infinitely large weights, we can check whether or not a formula can be logically inferred from the set of MLN formulas by computing the probability that the formula is true. Figure 3 schematically depicts the relationship between MLNs and other well-known probabilistic methods. Each method is classified here according to three major components: a probabilistic component, a logic component, and a graphical representation. Note that these components are not completely separate from each other; for example, the graphical representation is intrinsically connected with the probabilistic dependency specified in the probabilistic component. On the other hand, using these three components allows us to illustrate the overall relationships between MLNs and other probabilistic methods. In particular, we can see how some methods derive from others by imposing more constraints.

FIG. 3. The relationships between MLN and different probabilistic representations.

In Figure 3, we position MRFs and BNs on the same level, right under MLNs, since both are the most expressive methods among those based on propositional logic. MLNs generalize MRFs and BNs by using FOL. Note that BNs typically imply some (temporal) ordering: a true value of a variable may ''cause'' another variable to be true. Although MRFs imply no such ordering, MLNs use FOL to make such an implication (see previous section). Two well-known methods, HBNs and DBNs, are specialized versions of BNs; therefore, we place them right under BNs in Figure 3. HBNs and DBNs are essentially BNs with additional structural constraints: an HBN is a directed hierarchical tree where some ''sibling'' vertices may be linked, and a DBN is a sequence of BNs, each representing a system snapshot, such as a system's state at a moment in time, interlinked in one direction, representing a series of events, such as a progression of time. MLNs, which can express BNs, are also able to represent HBNs and DBNs, whose structural constraints can be easily expressed in FOL. Another well-known class of probabilistic models, Probabilistic Boolean Networks (PBN), deals with system dynamics, similarly to DBNs. Moreover, a very close relationship between PBNs and DBNs was shown in Lahdesmaki et al. (2006). Thus, we place PBNs on the same level as DBNs and directly under MLNs, since Boolean logic, on which PBNs are based, is directly generalized by FOL.

4. THE APPLICATION OF MLNS TO GENETICS

We have used MLNs to study the interactions of genetic loci in predicting phenotypes (Sakhanenko and Galas, 2010) and to capture the influence of these interconnected loci on phenotypes. We use a regression-type MLN: given N predictor variables Xi (which could represent alleles of genetic markers), predict a value of an outcome variable Y (which could represent a phenotype, for example). Algorithm 1 summarizes the method, which handles the model selection problem of finding a model that best captures the probability distribution over the training data:

Algorithm 1: MLN-based predictor selection

1: foreach predictor variable Xi do
2:     repeat
3:         shuffle data;
4:         train and test MLN(Xi, Known, Y);
5:         obtain a cross-validation score e(Xi);
6:     until the average score e(Xi) does not change;
7: if e(X̂i) is a max outlier then
8:     Known = Known ∪ {X̂i};
9:     go to line 1;
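A minimal Python paraphrase of this pseudocode follows (our own sketch, not the authors' implementation; `train_and_score` is a placeholder, and the fixed shuffle count and z-score outlier test are simplifying assumptions).

import random
import statistics

def greedy_predictor_selection(markers, data, train_and_score,
                               n_shuffles=20, outlier_z=3.0):
    """Greedy search in the spirit of Algorithm 1: repeatedly add the
    marker whose cross-validation score is a significant (max) outlier.
    train_and_score(marker, known, data) stands in for training and
    testing the MLN template and returning a prediction score."""
    known = set()
    while True:
        scores = {}
        for m in markers:
            if m in known:
                continue
            # Inner loop: average scores over shuffled data to reduce the
            # path-dependence of MLN training (a fixed count here, rather
            # than "until the average stops changing").
            runs = [train_and_score(m, known, random.sample(data, len(data)))
                    for _ in range(n_shuffles)]
            scores[m] = statistics.mean(runs)
        if len(scores) < 2:
            return known
        best = max(scores, key=scores.get)
        mu = statistics.mean(scores.values())
        sd = statistics.stdev(scores.values())
        # Stop when the best remaining marker is no longer a max outlier.
        if sd == 0.0 or (scores[best] - mu) / sd < outlier_z:
            return known
        known.add(best)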

The inner repeat-until loop trains and evaluates the same model on a reordered, or shuffled, data set: since MLN modeling is path-dependent, this loop reduces the effect of path-dependency on a prediction score. The outer foreach loop traverses the set of all predictor variables (genetic markers) and computes how well each marker, together with a small set of known markers, predicts an outcome variable (a phenotype). Once all the markers in the given set are traversed and assigned a score, the most significantly predictive marker is selected and added to the set of known markers, prompting another iteration of the search procedure. The search stops when there are no markers left with significant predictive power. This heuristic search keeps the structure of the MLN and the outcome variable Y the same, but considers different subsets of predictor variables Xi in order to find a model that best ''explains'' the variation of Y (Sakhanenko and Galas, 2010). In the next sections, we present and compare two types of MLNs using Algorithm 1, single-marker models and pair-wise models. We also explain what it means for two markers to predict a phenotype together when using each of these models.

5. SINGLE-MARKER MODELS

We have used MLNs as the underlying model representation in a method that detects genetic loci determining quantitative phenotypes (Sakhanenko and Galas, 2010). This method is an iterative procedure that scans the set of all possible genetic markers during every iteration and selects a marker that, in combination with other known predictors, has significant power to predict the phenotype. The selected marker is then added to the set of known predictors, and the method continues to the next iteration. The first iteration of the method is similar to GWAS in the following way: the method scans all the markers and finds those markers whose individual predictive power is high. However, our method differs from the traditional statistical tools of GWAS in that, during the next iterations, it searches for subsets of markers whose compound effect on the phenotype is high. Consequently, our method is a model selection procedure, where the relationships between markers (gene interactions) and phenotypes are hypothesized and encoded by a model, and the method then evaluates all possible such models and identifies the most probable one from the available data. Furthermore, the models are represented using MLNs, allowing us to bring the power of both logic and probabilistic reasoning into genetic analyses, which allows direct extension to arbitrarily complex models.

5.1. The connection with logistic regression Following the MLN syntax used in Alchemy (Kok et al., 2007), a single-marker model is expressed by the following program:

Phenotype(Strain, +T) \qquad (13)
Allele(Strain, +Marker, +V) \rightarrow Phenotype(Strain, +T).

Here, Phenotype and Allele are logical predicates, and Strain, Marker, V, and T are variables. Assuming Marker = m1, V = v1, T = t1, and Strain = s1, the predicates Phenotype(s1, t1) and Allele(s1, m1, v1) encode two facts (two data points): that t1 is the phenotype value of strain s1 and that v1 is the allele of marker m1 for strain s1. Furthermore, the logical formula Allele(Strain, m1, v1) → Phenotype(Strain, t1) encodes the following statement:

for all strains, an allele value v1 of a marker m1 implies a phenotype value t1. \qquad (14)

If Marker ∈ {m_1, ..., m_{N1}}, V ∈ {v_1, ..., v_{N2}}, T ∈ {t_1, ..., t_{N3}}, and Strain ∈ {s_1, ..., s_{N4}}, then program (13) is actually a set of N3 instances of the first-order formula (15) and N1 · N2 · N3 instances of the first-order formula (16), with separate weights w_i and w_{jkl}, respectively:

w_i: \; Phenotype(Strain, t_i) \qquad (15)
w_{jkl}: \; Allele(Strain, m_j, v_k) \rightarrow Phenotype(Strain, t_l) \qquad (16)

Note that all variables of (15) and (16) are replaced with various combinations of values, except for the variable Strain, which represents the specific, representative organism. Consequently, each formula represents a truth statement for any value of the variable Strain. A weight assigned to a formula indicates the probability of the truth of the encoded statement: the higher the weight, the greater the difference in log probability between a world that satisfies the statement and one that does not. Note that while a weighted formula (16) models a probability distribution over logic statements (14), a weighted formula (15) models a binomial probability distribution of the phenotype values across all possible strains. The model represented by the formulas (15) and (16) is equivalent to

w_i: \; Phenotype(Strain, t_i) \qquad (17)
w_{jkl}: \; Allele(Strain, m_j, v_k) \wedge Phenotype(Strain, t_l) \qquad (18)

if conditioned on Allele(Strain, m_i, v_j) for all i and j. The equivalence between models (15, 16) and (17, 18) will be discussed later in this section. Let us introduce characteristic functions a_{ij} and p_k:

a_{ij} = \begin{cases} 1, & \text{allele of marker } m_i \text{ is } v_j, \\ 0, & \text{otherwise,} \end{cases} \qquad p_k = \begin{cases} 1, & \text{phenotype is } t_k, \\ 0, & \text{otherwise.} \end{cases}

Note that a_{ij} = 1 and p_k = 1 for a strain s iff the corresponding predicates Allele(s, m_i, v_j) and Phenotype(s, t_k) are true. The MLN represented by the formulas (17, 18) encodes the joint probability distribution (see equation (4)):

\Pr(p_k, a_{11}, \ldots, a_{N_1 N_2}) = \frac{1}{Z} \exp\Big(w_k p_k + \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} w_{ijk} a_{ij} p_k\Big). \qquad (19)

This equation yields a logistic regression formula where every characteristic function a_{ij} of each allele is a binary predictor variable of a phenotype function p_k (the outcome variable of the logistic regression):

\log \frac{\Pr(p_k = 1 \mid a_{11}, \ldots, a_{N_1 N_2})}{\Pr(p_k = 0 \mid a_{11}, \ldots, a_{N_1 N_2})} = w_k + \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} w_{ijk} a_{ij}. \qquad (20)
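The correspondence in equation (20) can be exercised directly. Below is a small Python sketch (our own, with invented weights) that computes Pr(p_k = 1 | a) from MLN-style weights via the logistic function.

import math

# Sketch of equation (20) for one marker with two alleles (N1=1, N2=2),
# with invented weights w_k and w_ijk.
w_k = -0.3
w = {(1, 1): 0.9, (1, 2): -0.4}   # w[i, j]: weight of allele j of marker i

def pr_phenotype(a):
    """a[i, j] are the binary allele indicators; returns Pr(p_k = 1 | a)."""
    logit = w_k + sum(w[ij] * a_ij for ij, a_ij in a.items())
    return 1.0 / (1.0 + math.exp(-logit))     # logistic (sigmoid) function

# Strain whose marker m1 carries allele v1 (a11 = 1, a12 = 0).
print(pr_phenotype({(1, 1): 1, (1, 2): 0}))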

Let us now come back to the equivalence of models (15, 16) and (17, 18). For simplicity, we will compare the following two models:

w_1: \; B(x)
w_2: \; A(x) \wedge B(x) \qquad (21)

w_1': \; B(x)
w_2': \; A(x) \rightarrow B(x) \qquad (22)

As in our earlier discussion, we introduce two characteristic functions

a = \begin{cases} 1, & A \text{ is true}, \\ 0, & \text{otherwise}, \end{cases} \qquad b = \begin{cases} 1, & B \text{ is true}, \\ 0, & \text{otherwise}. \end{cases}

We also introduce two binary feature functions representing the truth value of a compound formula:

f_1 = \begin{cases} 1, & A \wedge B \text{ is true}, \\ 0, & \text{otherwise}, \end{cases} \qquad f_2 = \begin{cases} 1, & A \rightarrow B \text{ is true}, \\ 0, & \text{otherwise}. \end{cases}

Table 5. Truth Table for Conjunction and Implication

A | B | A ∧ B | A → B
True | True | True | True
True | False | False | False
False | True | False | True
False | False | False | True

Table 5 provides the truth table for logical conjunction and implication. This table can be rewritten in terms of the characteristic functions (Table 6), revealing the dependency of f_1 and f_2 on a and b, which can be written as:

f_1 = ab \quad \text{and} \quad f_2 = 1 - a(1 - b).

The expression of f_2 in terms of a and b is also evident if we recall the equivalent representation of the logical implication through conjunction and negation: (A \rightarrow B) \equiv \neg(A \wedge \neg B). In terms similar to (19), we can now express the joint probability distribution encoded by model (21) as

\Pr(b, a) = \frac{1}{Z} \exp(w_1 b + w_2 f_1). \qquad (23)

Rewriting this as a logistic regression model, we get:

\frac{\Pr(b = 1 \mid a)}{\Pr(b = 0 \mid a)} = \frac{\exp(w_1 + w_2 f_1|_{b=1})}{\exp(w_2 f_1|_{b=0})} = \frac{\exp(w_1 + w_2 a)}{\exp(0)} = \exp(w_1 + w_2 a). \qquad (24)

Model (22) encodes the joint probability distribution expressed as:

\Pr(b, a) = \frac{1}{Z} \exp(w_1' b + w_2' f_2), \qquad (25)

and can therefore be represented as a logistic regression formula:

\frac{\Pr(b = 1 \mid a)}{\Pr(b = 0 \mid a)} = \frac{\exp(w_1' + w_2' f_2|_{b=1})}{\exp(w_2' f_2|_{b=0})} = \frac{\exp(w_1' + w_2')}{\exp(w_2'(1 - a))} = \exp(w_1' + w_2' a). \qquad (26)

We can see then that, conditioned on a, the logistic regression models (24) and (26) are equivalent.
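A short numeric check of this equivalence (our own sketch): compute the conditionals Pr(b = 1 | a) from the joints (23) and (25) by brute force and confirm that they agree when w_1' = w_1 and w_2' = w_2.

import math

w1, w2 = 0.7, -1.1   # invented weights, shared by both models

def f1(a, b):  # feature of A ^ B
    return a * b

def f2(a, b):  # feature of A -> B
    return 1 - a * (1 - b)

def conditional(feature):
    """Pr(b=1 | a) under Pr(b,a) proportional to exp(w1*b + w2*feature(a,b))."""
    out = {}
    for a in (0, 1):
        num = math.exp(w1 * 1 + w2 * feature(a, 1))
        den = num + math.exp(w1 * 0 + w2 * feature(a, 0))
        out[a] = num / den
    return out

print(conditional(f1))  # model (21): odds follow exp(w1 + w2*a)
print(conditional(f2))  # model (22): the same conditionals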

5.2. Single-marker models in Algorithm 1

Working with haploid yeast genetics, every marker can have only two allele values, A and B, so N2 = 2. When applying a single-marker model (13) to one marker only (such as when we perform the first iteration of the marker search in Algorithm 1), N1 = 1, and the simplified model (20) becomes

Table 6. Relationship Between Characteristic Functions a, b, Representing Two Propositions, and f1, f2, Representing a Conjunction and Implication Based on These Propositions

a | b | f1 | f2
1 | 1 | 1 | 1
1 | 0 | 0 | 0
0 | 1 | 0 | 1
0 | 0 | 0 | 1

FIG. 4. Structure of the logical model underlying a single-marker MLN based on one marker. A single-marker MLN, defined by (13) and applied to a single marker m1, encodes an MRF. For the illustration we assume there are strains s1, ..., sN, and that the phenotype can have two values. Al() and Ph() stand for Allele() and Phenotype().

\log \frac{\Pr(p_k = 1 \mid a_{11}, a_{12})}{\Pr(p_k = 0 \mid a_{11}, a_{12})} = w_k + w_{11k} a_{11} + w_{12k} a_{12}, \qquad (27)

where

a_{11} = \begin{cases} 1, & \text{allele of } m_1 \text{ is } A, \\ 0, & \text{otherwise,} \end{cases} \qquad a_{12} = \begin{cases} 1, & \text{allele of } m_1 \text{ is } B, \\ 0, & \text{otherwise.} \end{cases}

Figure 4 shows the structure of the MRF imposed by the logical formulas of the single-marker MLN (13) applied to one marker. Each node of the structure graph corresponds to a predicate (either Allele or Phenotype) whose variables are substituted with every possible combination of values. There is an edge between two nodes of the graph if the corresponding predicates appear in the same formula. In our example, edges show all possible dependencies of Phenotype on Allele: the probabilistic dependency of p1 and p2 on a11 and a12, and thus every edge is assigned a probabilistic weight. Note that the graphical structure is disjoint, since there are no edges between nodes corresponding to different strains. Note that edges such as Allele(s1, m1, A) - Phenotype(s1, 0), Allele(s2, m1, A) - Phenotype(s2, 0), etc., are instances of the same statistical template, Allele($, m1, A) - Phenotype($, 0), and thus are assigned the same weight, learned from the entire network.

At the second iteration of Algorithm 1 using a single-marker model, each model is applied to two markers, one of which is the known marker m̃1 selected after the first iteration. Therefore, N1 = 2 and the model (20) becomes

\log \frac{\Pr(p_k = 1 \mid a_{11}, a_{12}, a_{21}, a_{22})}{\Pr(p_k = 0 \mid a_{11}, a_{12}, a_{21}, a_{22})} = w_k + w_{11k} a_{11} + w_{12k} a_{12} + w_{21k} a_{21} + w_{22k} a_{22}, \qquad (28)

where

a_{11} = \begin{cases} 1, & \text{allele of } \tilde{m}_1 \text{ is } A, \\ 0, & \text{otherwise,} \end{cases} \qquad a_{12} = \begin{cases} 1, & \text{allele of } \tilde{m}_1 \text{ is } B, \\ 0, & \text{otherwise,} \end{cases}
a_{21} = \begin{cases} 1, & \text{allele of } m_2 \text{ is } A, \\ 0, & \text{otherwise,} \end{cases} \qquad a_{22} = \begin{cases} 1, & \text{allele of } m_2 \text{ is } B, \\ 0, & \text{otherwise.} \end{cases}

FIG. 5. Structure of the logical model underlying a single-marker MLN based on two markers. A single-marker MLN, defined by (13) and applied to a pair of markers m1 and m2, encodes an MRF. We assume a data set with strains s1, ..., sN, and for each strain the phenotype can have one of two possible values. In brown we show the subnetwork, which is the entire network in Figure 4, illustrating the increase in complexity of the model when switching to the second iteration of Algorithm 1.

Figure 5 shows a graphical representation of a model generated for two markers at the second iteration. Notice that this model is similar to the model generated at the first iteration (Fig. 4). The only difference is in the number of predictor variables.

6. PAIR-WISE MODEL

Like a single-marker model (13), a pair-wise model is expressed by the following program, now including more than one marker and variant:

Phenotype(Strain, +T)
Allele(Strain, +M1, +V1) \wedge Allele(Strain, +M2, +V2) \rightarrow Phenotype(Strain, +T). \qquad (29)

Assuming M1 = m1, V1 = v1, M2 = m2, V2 = v2, and T = t1, this program encodes a statement: for all strains, the allele values v1 and v2 of markers m1 and m2 together (as a pair) imply a phenotype value t1. As in the case of a single-marker model, the pair-wise MLN (29) encodes a joint probability distribution:

\Pr(p_m, a_{11}, \ldots, a_{N_1 N_2}) = \frac{1}{Z} \exp\Big(w_m p_m + \sum_{i=1}^{N_1} \sum_{k=1}^{N_2} \sum_{j=1}^{N_1} \sum_{l=1}^{N_2} w_{ijm}^{kl} a_{ik} a_{jl} p_m\Big). \qquad (30)

Since a genetic marker cannot have two different allele values at the same time, which means that a_{im} a_{in} = 0 if m \neq n, expression (30) can be rewritten as:

\Pr(p_m, a_{11}, \ldots, a_{N_1 N_2}, b_{11}^{11}, \ldots, b_{N_1 N_1}^{N_2 N_2}) = \frac{1}{Z} \exp\Big(w_m p_m + \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} w_{ijm} a_{ij} p_m + \sum_{\substack{(i,j) \in N_1 \times N_1 \\ i \neq j}} \; \sum_{(k,l) \in N_2 \times N_2} w_{ijm}^{kl} b_{ij}^{kl} p_m\Big), \qquad (31)

where the a_{ij} are the same characteristic functions as in the single-marker case, and

b_{ij}^{kl} = \begin{cases} 1, & \text{allele of marker } m_i \text{ is } v_k \text{ and allele of marker } m_j \text{ is } v_l, \\ 0, & \text{otherwise.} \end{cases}

Note that $b^{kl}_{ij} = a_{ik}\,a_{jl}$. The pair-wise MLN can be seen as a logistic regression model where all the characteristic functions $a_{ij}$ and $b^{kl}_{ij}$ are the predictor variables of an outcome variable $p_m$:

$$\log\!\left(\frac{\Pr(p_m = 1 \mid a_{11}, \ldots, a_{N_1 N_2}, b^{11}_{11}, \ldots, b^{N_2 N_2}_{N_1 N_1})}{\Pr(p_m = 0 \mid a_{11}, \ldots, a_{N_1 N_2}, b^{11}_{11}, \ldots, b^{N_2 N_2}_{N_1 N_1})}\right) = w_m + \sum_{i=1}^{N_1}\sum_{j=1}^{N_2} w_{ijm}\,a_{ij} + \sum_{\substack{(i,j)\in N_1\times N_1\\ i\neq j}}\;\sum_{(k,l)\in N_2\times N_2} w^{kl}_{ijm}\,b^{kl}_{ij}.\tag{32}$$

Note that this can also be seen as a logistic regression with interaction terms for every pair of predictor variables $a_{ij}$.
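The correspondence in (32) is easy to exercise directly. The following minimal sketch (again ours, with invented data and effect sizes) adds the pair-wise characteristic functions $b^{kl}_{ij} = a_{ik}a_{jl}$ as predictors and checks that the fitted conditional probabilities pick out the causal allele pair:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_strains = 300
alleles = ["A", "B"]
m1 = rng.choice(alleles, size=n_strains)
m2 = rng.choice(alleles, size=n_strains)

# Single-marker characteristic functions a_ik (1 iff marker i has allele k).
a = {(i, k): (m == k).astype(float)
     for i, m in [(1, m1), (2, m2)] for k in alleles}

# Pair-wise characteristic functions b^kl_12 = a_1k * a_2l. Equation (33)
# also carries the symmetric (i, j) = (2, 1) terms; they duplicate these
# columns exactly, so this sketch omits them to keep the toy fit identifiable.
b = {(k, l): a[(1, k)] * a[(2, l)] for k in alleles for l in alleles}

# Hypothetical phenotype driven by one allele pair: A at m1 with B at m2.
logit = -1.0 + 2.5 * b[("A", "B")]
p = (rng.random(n_strains) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = np.column_stack(list(a.values()) + list(b.values()))
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, p)

# Individual weights are not unique (the indicators are collinear), but
# the fitted conditional Pr(p = 1 | alleles) is; (A, B) should stand out.
for v1 in alleles:
    for v2 in alleles:
        row = np.array([v1 == "A", v1 == "B", v2 == "A", v2 == "B"]
                       + [(v1, v2) == (k, l)
                          for k in alleles for l in alleles], dtype=float)
        print(v1, v2, round(model.predict_proba(row.reshape(1, -1))[0, 1], 2))
```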

FIG. 6. Structure of a logical model underlying a pair-wise MLN based on two markers. A pair-wise MLN, defined by (29) and applied to a pair of markers m1 and m2, encodes a MRF. As before, we assume a data set with $N_s$ strains and two phenotypes. In brown we show the subnetwork, which is the entire network in Figure 5, illustrating the increase in complexity when switching from a single-marker model to a pair-wise model. All the new edges, not present in Figure 5, correspond to the pair-wise interactions and are colored in blue.

When applying a pair-wise model (29) to two markers, $N_1 = 2$ and the simplified model (32) becomes

$$\begin{aligned}
&\log\!\left(\frac{\Pr(p_m = 1 \mid a_{11}, a_{12}, a_{21}, a_{22})}{\Pr(p_m = 0 \mid a_{11}, a_{12}, a_{21}, a_{22})}\right) \\
&\quad= w_m + w_{11m}\,a_{11} + w_{12m}\,a_{12} + w_{21m}\,a_{21} + w_{22m}\,a_{22} \\
&\qquad+ w^{11}_{12m}\,b^{11}_{12} + w^{12}_{12m}\,b^{12}_{12} + w^{21}_{12m}\,b^{21}_{12} + w^{22}_{12m}\,b^{22}_{12} \\
&\qquad+ w^{11}_{21m}\,b^{11}_{21} + w^{12}_{21m}\,b^{12}_{21} + w^{21}_{21m}\,b^{21}_{21} + w^{22}_{21m}\,b^{22}_{21}.
\end{aligned}\tag{33}$$

Recall that $p_m$ is a characteristic function equal to 1 when the phenotype has the value $t_m$. Note that this long logistic-regression expression with many predictor variables is compactly represented by the pair-wise model (29), emphasizing the representational power of MLN.

The second line of equation (33) is similar to equation (28) of the single-marker model. That is why all the informative markers identified by Algorithm 1 using a single-marker model can also be identified using a pair-wise model. The third and fourth lines of equation (33) are the interaction terms between alleles of different markers. Consider two markers that have a specific combination of alleles that is predictive of a phenotype, but that have no effect on the phenotype on their own (for example, if the markers have a synthetic interaction). Equation (28) would not work on these markers, since each weight representing the predictive power of an individual marker would be zero. Similarly, the second line of equation (33) would disappear as well (the corresponding weights would also be zero). However, some of the weights of the interaction terms of equation (33) would not be zero, allowing us to detect the synergy between the two markers by using the pair-wise model (a toy numerical sketch of this point closes this section, after Table 7).

Figure 6 illustrates a graphical representation of a pair-wise model generated for two markers. Notice that this model is somewhat similar to the single-marker model generated at the second iteration (Fig. 5). However, the major difference is that in a pair-wise model the predictor variables are interconnected, forming cliques with phenotype variables (see blue edges in Fig. 6). Moreover, the probabilistic dependency is modeled between two predictor variables (alleles) and an outcome variable (a phenotype value); thus, the weights are assigned to every triangle (clique) connecting two marker alleles and a phenotype value.

Table 7 summarizes the three models described above (a single-marker MLN based on one and on two markers, and a pair-wise MLN based on two markers) and their corresponding representations as logistic regressions. The complexity of the regression formula for the pair-wise case points to the significant advantage of the first-order logic expression, even when, as here, the models can be effectively expressed as logistic regressions. The compactness of the expression of the model is striking. Many more complex models cannot, of course, be expressed at all as logistic regressions, even extremely large ones. The power of the compactness of expression of models in MLNs is illustrated by the rapid expansion of the equivalent logistic regressions shown above, relative to the modest expansion of the model complexity.

Table 7. Summary of Single-Marker MLNs Based on One and Two Markers as Well as a Pair-Wise MLN Based on Two Markers

1. First-order logic: Allele(S, +M, +V) ⇒ Phenotype(S, +T),  M = m1
   Logistic regression:
   $\log\frac{\Pr(p_k = 1 \mid a_{11}, a_{12})}{\Pr(p_k = 0 \mid a_{11}, a_{12})} = w_k + w_{11k}a_{11} + w_{12k}a_{12}$

2. First-order logic: Allele(S, +M, +V) ⇒ Phenotype(S, +T),  M ∈ {m1, m2}
   Logistic regression:
   $\log\frac{\Pr(p_k = 1 \mid a_{11}, a_{12}, a_{21}, a_{22})}{\Pr(p_k = 0 \mid a_{11}, a_{12}, a_{21}, a_{22})} = w_k + w_{11k}a_{11} + w_{12k}a_{12} + w_{21k}a_{21} + w_{22k}a_{22}$

3. First-order logic: Allele(S, +M1, +V1) ∧ Allele(S, +M2, +V2) ⇒ Phenotype(S, +T),  M ∈ {m1, m2}
   Logistic regression:
   $\log\frac{\Pr(p_m = 1 \mid a_{11}, a_{12}, a_{21}, a_{22})}{\Pr(p_m = 0 \mid a_{11}, a_{12}, a_{21}, a_{22})} = w_m + w_{11m}a_{11} + w_{12m}a_{12} + w_{21m}a_{21} + w_{22m}a_{22} + w^{11}_{12m}b^{11}_{12} + w^{12}_{12m}b^{12}_{12} + w^{21}_{12m}b^{21}_{12} + w^{22}_{12m}b^{22}_{12} + w^{11}_{21m}b^{11}_{21} + w^{12}_{21m}b^{12}_{21} + w^{21}_{21m}b^{21}_{21} + w^{22}_{21m}b^{22}_{21}$

The first-order logic formulas define the structure of the models, and the logistic regressions are their corresponding representations.

Since we are only scratching the surface of the underlying complexity of models that will be useful in the future, the lesson here is evident.
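As promised above, here is a toy numerical sketch of the synthetic-interaction argument (all data and effect sizes are invented, and a single product column stands in for the full set of $b^{kl}_{ij}$ features): a phenotype driven by an XOR-like allele pattern carries no single-marker signal, while the pair-wise interaction term detects it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
a1 = rng.integers(0, 2, n)    # 1 iff marker m1 carries allele A (invented)
a2 = rng.integers(0, 2, n)    # 1 iff marker m2 carries allele A (invented)
xor = a1 ^ a2                 # synthetic interaction: A at exactly one marker

# Phenotype depends on the XOR pattern only, so neither marker has a
# marginal effect on its own.
p = (rng.random(n) < np.where(xor == 1, 0.9, 0.1)).astype(int)

single = LogisticRegression(C=1e6).fit(np.column_stack([a1, a2]), p)
pair = LogisticRegression(C=1e6).fit(np.column_stack([a1, a2, a1 * a2]), p)

print("single-marker weights:", single.coef_[0].round(2))  # both near 0
print("pair-wise weights:    ", pair.coef_[0].round(2))    # interaction term far from 0
```

The single-marker fit, the analogue of equation (28), assigns weights near zero to both markers, whereas the fit with the interaction column, the analogue of the third and fourth lines of (33), recovers the dependence.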

7. OTHER APPLICATIONS IN BIOLOGY AND MEDICINE

In its most abstract form, genetic analysis is directed to the detection of causative patterns in heritable genomes that predict well the phenotypes of interest. In a similar vein, the detection of patterns in data that predict experimental outcomes is at the heart of almost any modern biological or medical problem. Framed this way, we can quickly conclude that MLNs and other probabilistic logic methods are well suited to the solution of these kinds of problems. Problems in this class include predicting sub-networks of the functioning complex networks that are central to cellular function (we do not presume to infer the entire networks of any system quite yet). This includes inferring networks from mRNA expression data, such as array data and RNA-Seq data. An important characteristic of biological problems of this kind is that we sometimes know something about the system, or have specific hypotheses about the system, that need to be incorporated into the models during inference. MLNs are perfectly well suited to this kind of problem. Classes of data like mRNA expression data present the same kinds of problems; these include microRNA data, alternative splicing data (exon usage and alternative UTR use), protein expression level data, metabolite level data, transcription factor binding site occupation levels, histone modification and methylation patterns, and a variety of other kinds of molecular and non-molecular data. The challenge is to use these different data sets together to extract more information about the reality of the complex models that predict function than can be inferred from the individual sets by themselves. Since we also know something about how one kind of data is related to the others (e.g., proteins are made from mRNA at a rate determined by their sequences, translation factors, and the levels of modulators like miRNAs), we need to represent this knowledge as constraints on the models when using the data together. This concept is a central aspect of true data integration and, as the knowledge of biology and medicine grows, is a major potential application for the methods we describe here.

8. CONCLUSION

Probabilistic logic methods, particularly those that use graphical model representations as a central part of the structure, are powerful tools in the analysis of data. They are particularly potent in dealing with biological data, as the field is currently in a state where the detailed data generation volume is enormous and the knowledge of the underlying complex systems is substantial, but very partial and often uncertain, with non-negligible quantitative error levels. Taking all of this into account (huge diverse data sets and partial knowledge of various types) is all but impossible without the systematic structures provided by methods and models that combine the rigors of representing and tracking multiple logical relationships with the scoring of likelihoods reflecting reality in a balanced and integrated fashion. Neither strictly logical nor completely statistical and probabilistic methods can properly manage the complexity that is presented by the current state and dynamics of modern biological and medical research. Together they hold a great deal of promise.

We have summarized a wide range of both simple and complex mathematical and computational advances by focusing on a very general method, the MLN method, and attempting to put it into the context of some methods more familiar to most experimental and computational biologists. In the computational analysis of biological problems (analyzing data, inferring networks and complex models, and estimating model parameters) it has been common to use a range of methods based on various probabilistic logic models, sometimes collectively called "machine learning" methods. Inference methods based on the widely used BNs fall into this class, as do those based on HBNs, PBNs, and MLNs. We have tried to illustrate this landscape of methods, particularly contrasting MLNs with BNs, and to describe some of their strengths and limitations. The pivotal concepts for sketching this landscape and comparing methods have been probabilistic graphical models, the structures of the graphs, and the scope of the logical language repertoire (from FOL to Boolean logic).

While these methods are powerful, the computational intensity is substantial for the most general, and therefore most potent, of them. This certainly includes the application of MLNs, for which the pair-wise genetic model has proven to be extremely computationally intensive. This limitation is, of course, temporary for several reasons, but two of them are evident. First, the available computing power that can be brought to these problems is increasing rapidly and the costs are dropping. In addition, the engineering of software, and specialized hardware, for efficiency and speed in executing these algorithms, particularly using parallelization methods and hardware embodiments of algorithms, is substantial and growing. It is clear that the complexity of our understanding of biological problems and the application of probabilistic logic methods are both at the very beginnings of their growth curves, and the future is very rich with possibilities.

ACKNOWLEDGMENTS

We are grateful to a number of colleagues for comments and suggestions on the manuscript: Aimee Dudley, Alex Skupin, and Hong Li. This work has been supported by the National Science Foundation FIBR Program and the Luxembourg-ISB Program.

DISCLOSURE STATEMENT

No competing financial interests exist.

REFERENCES

Baldi, P., and Brunak, S. 2001. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA.

Darwiche, A. 2002. A logical approach to factoring belief networks. Proc. KR 409–420.

Ershov, Yu.L., and Palyutin, E.A. 1986. Mathematical Logic. Imported Publications, New York.

Kindermann, R., and Snell, J.L. 1980. Markov Random Fields and Their Applications. American Mathematical Society, New York.

Kok, S., Sumner, M., Richardson, M., et al. 2007. The Alchemy system for statistical relational AI [Technical report]. University of Washington, Seattle.

Koller, D., and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA.

Lahdesmaki, H., Hautaniemi, S., Shmulevich, I., et al. 2006. Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. J. Signal Process. 86, 814–834.

Luger, G.F. 2008. Artificial Intelligence: Structures and Strategies for Complex Problem Solving. Addison-Wesley, New York.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, New York.

Pietra, S.D., Pietra, V.D., and Lafferty, J. 1997. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell. 19, 380–393.

Richardson, M., and Domingos, P. 2006. Markov logic networks. Mach. Learn. 62, 107–136.

Sakhanenko, N.A., and Galas, D.J. 2010. Markov logic networks in the analysis of genetic data. J. Comput. Biol. 17, 1491–1508.

Address correspondence to: Dr. David J. Galas The Institute for Systems Biology 401 Terry Avenue North Seattle, WA 98109

E-mail: [email protected]