<<

4th International Symposium on Imprecise Probabilities and Their Applications, Pittsburgh, Pennsylvania, 2005

Application of a hill-climbing to exact and approximate inference in credal networks

Andr´es Cano, Manuel G´omez, Seraf´ın Moral Department of Computer Science and Artificial Intelligence E.T.S Ingenier´ıa Inform´atica, University of Granada [email protected], [email protected], [email protected]

Abstract function of the number of parents [20]. This hypothe- sis will be accepted in this paper. Some authors have This paper proposes two new for inference considered the propagation of probabilistic intervals in credal networks. These algorithms enable proba- directly in graphical structures [1, 12, 23, 24]. In the bility intervals to be obtained for the states of a given proposed procedures, however, there is no guarantee query variable. The first algorithm is approximate that the calculated intervals are always the same as and uses the hill-climbing technique in the Shenoy- those obtained when a global computation using the Shafer architecture to propagate in join trees; the associated convex sets of probabilities is used. In gen- second is exact and is a modification of Rocha and eral, it can be said that calculated bounds are wider Cozman’s branch-and-bound algorithm, but applied than exact ones. The problem is that exact bounds to general directed acyclic graphs. need a computation with the associated convex sets Keywords. Credal network, probability inter- of probabilities. Following this approach, different vals, Bayesian networks, strong independence, hill- approximate and exact methods have been proposed climbing, branch-and-bound algorithms. [5, 3, 2, 8, 20, 19]. These assume that there is a convex set of conditional probabilities for each configuration of the parent variables in the dependence graph, and 1 Introduction provide a model for obtaining exact bounds through the use of local computations. Working with con- A credal network is a graphical structure (a di- vex sets, however, may be extremely inefficient: if we rected acyclic graph (DAG) [9]) which is similar to a have n variables and each variable, Xi, has a convex Bayesian network [18], where imprecise probabilities set with li extreme points as the conditional infor- are used to represent quantitative knowledge. Each mation, the propagation algorithm has a complexity node in the DAG represents a random event, and n order of O(K. i=1 li), where K is the complexity of is associated with a set of convex sets of conditional carrying out aQsimple probabilistic propagation. This probability distributions. is so since the propagation of convex sets is equiva- There are several mathematical models for imprecise lent to the propagation of all the global probabilities probabilities [25]. We consider convex sets of proba- that can be obtained by choosing an exact conditional bilities to be the most suitable for calculating and rep- probability in each convex set. resenting imprecise probabilities as they are powerful In this paper, we shall propose a new approximate enough to represent the results of basic operations algorithm for inference in credal networks. This is within the model without having to make approxima- a hill-climbing procedure which quickly obtains good tions which cause loss of information (as in interval inner approximations to the correct intervals. The probabilities). algorithm is based on the Shenoy-Shafer architecture In this paper, we shall consider the problem of com- [22] for propagation in join trees. Rocha, Campos and puting upper and lower probabilities for each state of Cozman [13] present another hill-climbing search, in- a specific query variable, given a set of observations. spired by the Lukatskii-Shapot algorithm, for obtain- A very common hypothesis is that of separately speci- ing accurate inner approximations. We shall also pro- fied credal sets [17, 20] (there is no restriction between pose a branch-and-bound procedure (also used in [19]) the set of probabilities associated to a variable given for inference in credal networks, where the initial so- two different configurations of its parents). Under this lution is computed with our hill-climbing algorithm, hypothesis, the size of the conditional convex sets is a reducing its running time. This approach is similar to the one in [13]. Campos and Cozman [11] also give a multilinear programming method which provides a very fast branch-and-bound procedure. The rest of the paper is organized as follows: Sec- tion 2 describes the basics of probability propagation in Bayesian networks and the Shenoy-Shafer architec- ture; Section 3 presents basic notions about credal sets and credal networks; Section 4 introduces proba- Figure 1: A Bayesian network and its join bility trees and briefly describes two algorithms based on variable elimination and probability trees for in- j Each observation, Xi = xi , is represented by means ference in credal networks; Section 5 details the pro- of a valuation which is a Dirac function defined on posed hill-climbing algorithm for inference in credal j j ΩXi as δXi (xi; xi ) = 1 if xi = xi, xi ∈ ΩXi , and networks; Section 6 explains how to apply the pro- j j δ (x ; x ) = 0 if x 6= x . posed algorithm in Rocha and Cozman’s branch-and- Xi i i i i bound algorithm; Section 7 shows the experimental The aim of probability propagation algorithms is work; and finally Section 8 presents the conclusions. to calculate the a posteriori probability function

p(xq|xE), (for every xq ∈ ΩXq ) by making local com- 2 Probability propagation in putations, where Xq is a given query variable. This distribution verifies: Bayesian networks

Let X = {X1, . . . , Xn} be a set of variables. Let us ↓Xq assume that each variable Xi takes values on a finite  j  p(xq|xE) ∝ p(xi|yj ) δXi (xi; xi ) (2) set ΩXi (the frame of Xi). We shall use xi to de- Y Y Xi j  xi ∈xE  note one of the values of Xi, xi ∈ ΩXi . If I is a set of indices, we shall write XI for the set {Xi|i ∈ I}. Let N = {1, . . . , n} be the set of all the indices. The In fact, the previous formula is the expression for p(xq, xE). p(xq|xE ) can be obtained from p(xq, xE) Cartesian product i∈I ΩXi will be denoted by ΩXI . Q by normalization. The elements of ΩXI are called configurations of XI and will be written with x or xI . In Shenoy and A known propagation algorithm can be constructed Shafer’s [22] terminology, a mapping from a set ΩXI by transforming the DAG into a join tree. For exam- R+ into 0 will be called a valuation h for XI . Shenoy ple, the righthand tree in Figure 1 shows a possible and Shafer define two operations on valuations: com- join tree for the DAG on the left. There are sev- bination h1 ⊗ h2 (multiplication) and marginalization eral schemes [16, 22, 21, 14] for propagating on join ↓J h (adding the removed variables). trees. We shall follow Shafer and Shenoy’s scheme

A Bayesian network is a directed acyclic graph (see [22]. Every node in the join tree has a valuation ΨCi the left side of Figure 1), where each node represents attached, initially set to the identity mapping. There a random event Xi, and the topology of the graph are two messages (valuations) MCi→Cj , MCj →Ci be- shows the independence relations between variables tween every two adjacent nodes Ci and Cj . MCi→Cj according to the d-separation criterion [18]. Each is the message that Ci sends to Cj , and MCj →Ci is node Xi also has a conditional probability distribu- the message that Cj sends to Ci. Every conditional tion pi(Xi|Y) for that variable given its parents Y. distribution pi and every observation δi will be as- Following Shafer and Shenoy’s terminology, these con- signed to one node Ci which contains all its variables. ditional distributions can be considered as valuations. Each node obtains a (possibly empty) set of condi- A Bayesian network determines a unique joint proba- tional distributions and Dirac functions, which must bility distribution: then be combined into a single valuation. The propagation algorithm is performed by crossing p(X = x) = p (x |y ) ∀x ∈ Ω (1) Y i i j X the join tree from leaves to root and then from root i∈N to leaves, updating messages as follows:

↓Ci∩Cj An observation is the knowledge of the exact value MCi→Cj = (ΨCi ⊗ ( MCk→Ci )) j O Xi = xi of a variable. The set of observed variables is Ck ∈Ady(Ci,Cj ) denoted by XE; the configuration for these variables (3) is denoted by xE and it is called the evidence set. E where Ady(Ci, Cj ) is the set of adjacent nodes to Ci will be the set of indices of the observed variables. with the exception of Cj . Once the propagation has been performed, the a pos- The propagation of credal sets is completely analo- teriori probability distribution for Xq, p(Xq|xE ), can gous to the propagation of probabilities; the proce- be calculated by looking for a node Ci containing Xq dures are the same. Here, we shall only describe the and normalizing the following expression: main differences and further details can be found in [5]. Valuations are convex sets of possible probabili- p(X , x ) = (Ψ ⊗ ( M ))↓Xq (4) q E Ci O Cj →Ci ties, with a finite number of vertices. A conditional Cj ∈Ady(Ci) valuation is a convex set of conditional probability where Ady(Ci) is the set of adjacent nodes to Ci. The distributions. An observation of a value for a variable normalization factor is the probability of the given will be represented in the same way as in the prob- evidence that can be calculated from the valuation abilistic case. The combination of two convex sets calculated in Expression 4 using: of mappings is the convex hull of the set obtained by combining a mapping of the first convex set with a p(x ) = p(xi , x ) (5) E X q E mapping of the second convex set (repeating the prob- i xq abilistic combination for all pairs of vertices of the two convex sets). The marginalization of a convex set is In fact, we can compute the probability of the evi- defined by marginalizing each mapping of the convex dence, p(x ), by choosing any node C , combining all E i set. With these operations, we can carry out the same the incoming messages with the valuation Ψ , and Ci propagation algorithms as in the probabilistic case. summing out all the variables in Ci. The result of the propagation for a variable, Xq, will

3 Inference in credal networks be a convex set of mappings from ΩXq in [0, 1]. The result of the propagation is a convex set on Rn which shall be called R . For the sake of simplicity, let us A credal set for a variable Xi is a convex, closed set q 1 2 of probability distributions and shall be denoted by assume that this variable has two values: xq , xq. The HXi . We assume that every credal set has a finite points of this convex set, Rq, are obtained in the fol- number of extreme points. The extreme points of a lowing way: if p is a global probability distribution, convex set are also called vertices. A credal set can be formed by selecting a fixed probability for each con- identified by enumerating its vertices. A conditional vex set, then we shall obtain a point (p1, p2) ∈ Rq i x credal set about Xi given a set of variables Y will associated to this probability, where pi = p(xq, E), Xi|Y with x being the given evidence. If we normalize be a closed, convex set H of mappings p : Xi × E i each point of Rq by computing pi(xq|xE ) = pi/ pj , Y −→ [0, 1], verifying x ∈Ω p(xi, yj) = 1, ∀yj ∈ j P i Xi then we obtain the convex set of a posteriori Pcondi- ΩY. Once again, we suppose a finite set of extreme i Xi|Y tional probability distributions for each xq ∈ ΩXq . points, Ext(H ) = {p1, . . . , pl}. A credal network is a directed acyclic graph similar to a Bayesian network. Each node is also associated with 4 Probability trees a variable, but every variable is now associated with a credal set HXi|Y, where Y are the parent variables of Probability trees [4] have been used as a flexible Xi in the graph. In this paper, we suppose that a local that allows asymmetrical indepen- X |Y=y credal set H i j is given for each configuration yj dences and exact or approximate representations of of Y. This is described by Rocha and Cozman [20] probability potentials to be represented. as separately specified credal sets. From a separately A probability tree T is a directed labeled tree, where specified credal set, we obtain a conditional one with each internal node represents a variable and each leaf HXi|Y = {p|f (x ) = p(x , y ) ∈ HXi|Y=yj , ∀y ∈ yj i i j j represents a non-negative real number. Each internal Ω }. Y node has one outgoing arc for each state of the vari- The strong extension of a credal network is the largest able associated with that node. The size of a tree T , joint credal set such that every variable is strongly denoted by size(T ), is defined as its number of leaves. independent [9, 7] of its nondescendants nonparents A probability tree T on variables X = {X |i ∈ I} given its parents. The strong extension of a credal I i represents a valuation h : Ω → R+ if for each x ∈ network is the joint credal set that contains every XI 0 I Ω the value h(x ) is the number stored in the leaf possible combination of vertices for all credal sets in XI I node that is reached by starting from the root node the network, such that the vertices are combined with and selecting the child corresponding to coordinate x Expression 1 [9]. In this paper, we shall consider the i for each internal node labeled X . inference in a credal network as the calculus of tight i bounds for the probability values of a query variable A probability tree is usually a more compact repre- Xq given a set of observed variables XE. sentation of a valuation than a table. This is illus- 4.1 Propagating credal sets using probability trees

Different algorithms have been used for propagation in credal networks using probability trees [3, 2]. For each Xi, we originally have a collection of m local credal sets {HXi|Y=y1 , . . . , HXi|Y=ym }, where m is the number of configurations of Y. The problem is now transformed into an equivalent one by using a Figure 2: A valuation h, its representation as a proba- transparent variable Tyj for each yj ∈ ΩY. Tyj will bility tree and its approximation after pruning various have as many cases as the number of vertices in the lo- branches cal credal set HXi|Y=yj . A vertex of the global credal set HXi|Y can be found by fixing all transparent vari-

ables Tyj to one of its values. Let us use T to denote the set of all transparent variables in the credal net- work. trated in Figure 2, which displays a valuation h and its representation using a probability tree. The tree Probability trees enable a conditional credal set H X|Y contains the same information as the table, but us- to be represented efficiently when we start with m lo- ing only five values instead of eight. Furthermore, cal credal sets {H Xi|Y=y1 , . . . , HXi|Y=ym } and with a trees enable even more compact representations to be single data structure (the necessary space for the tree obtained in exchange for loss of accuracy. This is is proportional to the sum of the necessary spaces for achieved by pruning certain leaves and replacing them the m local trees). In Figure 3, we can see one ex- by the average value, as shown in the second tree in ample where a probability tree represents the global Figure 2. information HX|Y associated to the two credal sets HX|Y =y1 and HX|Y =y2 . The figure also shows the We say that part of a probability tree is a terminal tree global conditional credal set H X|Y by means of a ta- if it contains only one node labeled with a variable, ble. In the probability tree in Figure 3, we obtain and all the children are numbers (leaf nodes). the extreme points by fixing Ty1 and Ty2 to one of If T is a probability tree on XI and XJ ⊆ XI , we its values. For example, if the probability tree is re- R(xJ ) 1 2 use T (probability tree restricted to the config- stricted to Ty1 = ty1 and Ty2 = ty2 , we obtain a new uration xJ ) to denote the restriction operation which probability tree that gives us the extreme point r2. consists in returning the part of the tree which is con- sistent with the values of the configuration xJ ∈ ΩXJ . For example, in the left probability tree in Figure 2, 1 1 T R(X2=x2,X3=x3) represents the terminal tree enclosed by the dashed line square. This operation is used to define combination and marginalization operations. It is also used for conditioning. The basic operations (combination, marginalization) over potentials required for probability propagation can be carried out directly on probability trees (fur- ther details can be found in [4]). The combination Figure 3: A probability tree for H X|Y of a probability tree comprising a single node labeled with a real number r, and another probability tree T The simpler approximate algorithm for propagating is obtained by multiplying all the leaf nodes of T by r. credal sets using probability trees is based on vari- The combination of two general probability trees T1 able elimination [3]. The algorithm applies a pruning and T2 is obtained in the following way: for each leaf procedure in order to reduce the size of the proba- node l in T1, if xJ is the configuration for the ances- bility trees that represent initial conditional distribu- tor nodes of l, then l is replaced by the combination tions, and the potentials obtained after combination R(xJ ) of node l and T2 . The sum of two probability or marginalization in the variable elimination algo- trees is defined in the same way as the combination rithm. The pruning is an that selects but by adding (rather than multiplying) real numbers. a terminal tree and replaces it with the average of its Marginalization is equivalent to deleting variables. A leaf nodes. A terminal tree can be selected for prun- variable is deleted from a probability tree and is re- ing if Kullback-Leibler’s cross-entropy [15] between placed by the sum of its children. the potential before pruning and the potential after pruning is below a given threshold σ. The pruning the configuration ts which results in the upper prob- i operation can select any variable in the probability ability for p(Xq = xq|xE ): tree including a transparent variable. The method is applied until there is no terminal tree with a distance i max pts (Xq = xq|xE ) (6) smaller than a given threshold σ. The greater the pa- ts rameter σ, the smaller the probability tree obtained. Further information about the measure used to select The proposed algorithm is a hill-climbing algorithm the terminal tree for pruning can be found in Propo- based on the Shafer-Shenoy propagation algorithm for sition 2 in [3]. Bayesian networks. A run of the algorithm is directed to compute the lower or upper probability for a single When we apply the variable elimination algorithm de- state xi of a given query variable X . The algorithm scribed above to calculate probability intervals for a q q begins with the construction of a join tree from the given variable X , then, in general, the correct inter- q credal network (see Section 2). Now. a double mes- vals enclose the approximate ones (inner approxima- sage system is necessary. For each pair of connected tion). In [3], a method is also proposed for obtaining nodes, Ci and Cj , there are two messages going from outer approximations. The proposal consists in us- 1 2 Ci to Cj , M and M , and two messages ing two new values at each leaf node in probability Ci→Cj Ci→Cj 1 2 R+ going from Cj to Ci, MC →C and MC →C . Mes- trees. If previously we only had one value r ∈ 0 , we j i j i R+ sages M 1 are calculated as usual, according to now add the values rmin, rmax ∈ 0 at each leaf node. Ci→Cj 2 These values inform us about the interval in which the Formula 3. Messages MCi→Cj are also calculated as i true value can oscillate. When a branch of the tree usual but we assume that the observation Xq = xq is has not been approximated, then rmin, rmax and r will added to the evidence set xE . be the same; when it is approximated, however, then r After constructing the join tree, each conditional will be between rmin and rmax. For example, Figure 4 Xi|Y shows the probability tree in Figure 2, supposing that credal set H is assigned to one node as we ex- the only approximation is the one shown in this fig- plained for Bayesian networks in Section 2. Each con- Xi|Y ure. In [3], details can be found about the calculation ditional credal set H is represented by means of a probability tree (see Section 4 and Figure 3). For of the rmin and rmax values for the probability trees resulting from combination and marginalization oper- observations, we follow a different procedure from the ations. Once the variable elimination algorithm has one explained in Section 2. If we have an observation X = xj , then we should combine its valuation δ finished, we obtain a probability tree with the vari- i i Xi with a valuation ΨC containing Xi. We achieve the able of interest Xq and some transparent variables. i same result with the operation of restriction in prob- The leaf nodes contain rmin, rmax values. These val- j ability trees; that is, for each observation Xi = xi , ues enable outer bounds to be obtained for the correct j ones (see [3] for further details). we restrict ΨCi to Xi = xi if node Ci contains Xi. In this way, valuations are significantly reduced in size, making posterior operations (combination and

marginalization) more efficient. The valuation ΨCi is then calculated for every node Ci by combining all the valuations assigned to node Ci and restricting the combination of all its assigned valuations to the evi-

dence set xE . The resulting ΨCi is saved on a copy c valuation ΨC . Figure 4: Probability tree with rmin and rmax values i At a given time, the algorithm has associated a con- figuration t for the transparent variables. The config- 5 Hill-climbing algorithm uration t will be modified in the hill-climbing step in order to optimize the conditional probability pt(Xq = i The objective of the proposed algorithm is to ob- xq|xE). A random initial configuration t0 is selected i tain the upper or lower bound for p(xq|xE) (poste- for the set of transparent variables T. This initial i T rior probability of a case xq of a given query variable configuration of is appended to the evidence set Xq). This can be solved by selecting the configuration xE. Each probability tree ΨCi is then restricted ac- of transparent variables (a configuration ts of trans- cording to this new set of observations. For example, parent variables determines a global probability dis- in Figure 3, if the initial configuration for T contains 1 2 T 1 = t and T 2 = t , then the valuation (probabil- tribution pts ) which results in a minimum value for y y1 y y2 i Xi|Y p(xq|xE ) (and the configuration for the maximum). ity tree) ΨCi containing H will be restricted and Let us suppose that our problem consists in finding the tree in Figure 5 shall be obtained. c ΨCi to the current configuration of transparent vari-

ables t, but removing the observation for Tyj in such a configuration. In this way, the only transparent

variable in ΨCi is Tyj . A new configuration t will be

obtained by selecting in Tyj the state that maximizes X|Y 1 i Figure 5: Restricted tree for H using Ty1 = ty1 p(xq|xE). The process continues by improving the 2 and Ty2 = ty2 next transparent variable in Ci, and so on until mod- ification of any transparent variable does not improve i p(xq|xE). In this case, a new node is visited.

The valuations ΨCi no longer contain transparent variables because all of them are observed and pruned in probability trees. In this join tree, a propagation 6 Branch-and-Bound algorithm from root to leaves and then from leaves to root would Rocha and Cozman [19] have given an exact algo- allow the probability distribution pt0 (Xq|xE ) to be obtained for the vertex of the a posteriori credal set rithm for inference on credal networks with a polytree Xq |xE H corresponding to T = t0, by looking for a structure which is based on branch-and-bound opti- node Ci containing the variable Xq and using Ex- mization search algorithms. The authors distinguish pression 4; however, we follow a different procedure. between outer and inner approximate inference algo- The algorithm makes an initial propagation (using the rithms. The former are produced when the correct double message system) sending messages from leaf interval between lower and upper probabilities is en- nodes towards the root. Messages are calculated as closed in the approximate interval; the latter approx- we explained above. imations are produced when the correct interval en- closes the approximate one. The algorithm presented After the initial propagation, we begin the hill- in previous section produces inner approximations. climbing step by crossing the join tree from root to leaves and viceversa a given number of times, or un- In the Rocha and Cozman algorithm, given a query til the algorithm does not improve the best solution variable Xq and a credal network N , a single run of found so far. Every time we visit a node Ci, we locally the branch-and-bound algorithm computes the lower i maximize p(Xq = x |xE) by choosing a state for each or upper a posteriori probability for a single state q i transparent variable in Ci. Transparent variables are xq of Xq, as in the hill-climbing algorithm explained improved one by one. The process can require the set in Section 5, but now the exact solution is obtained. of transparent variables of Ci to be treated several The algorithm to find the lower or upper probabil- i times. We leave the current node when there is no ity for a single state xq of Xq uses a credal network further transparent variable to improve. N0 as the input, obtained from N by discarding vari- ables that are not used to compute the inference by In the crossing of the join tree, let us suppose that we d-separation [10]. Let us suppose that we are in- are now visiting a node Ci and that t is the current i terested in looking for max p(xq|xE ). The branch- configuration for the transparent variables. Let us and-bound technique requires a procedure to obtain also assume that Tyj is the transparent variable which a bound r (overestimation) for the solution: r must will be improved (one of the transparent variables in- i produce an estimation r(N ) of max p(xq |xE), veri- cluded in node Ci). The hill-climbing algorithm must i fying r(N ) >= max p(x |xE ). Rocha and Cozman locally select the best value for the variable T . In q yj use Tessem’s A/R algorithm [23], an algorithm that order to do so, we need to calculate the probability of produces outer bounds rather quickly. The A/R al- x i x the evidence p( E ) and the probability p(xq, E) for gorithm focuses on polytrees. The algorithm 1 shows x i x each value of Tyj . With p( E) and p(xq, E ), we can the main steps of the branch-and-bound procedure. i compute p(xq|xE ), the value we want to maximize. The algorithm can be explained as a search in a The vector of values p(xE) is obtained as follows: in the first message system, we discard the previous ob- tree where the root node contains N0. The root node is divided into several simpler credal networks servation of variable Tyj in node Ci, and we apply {N01, . . . , N0k}. Each of these networks is obtained Expression 4, using Tyj instead of Xq, to calculate i i by taking one of the transparent variables Tyj in N0 p(Tyj = ty , xE) for each ty ∈ ΩTy . The vector of j j j and producing as many networks as the number of values p(xi , x ) is calculated in a similar way with q E states in T , by fixing T to the different states the second message system. yj yj (that is, we are selecting one credal set in N0, and we In the previous calculus, we can discard the previous produce as many networks as vertices in that credal observation for Tyj using the following procedure: a set). Resulting networks are inserted as children of new ΨCi is calculated restricting the saved valuation the root node in the search tree. The decomposition Input : A credal network N0 to prune the probability trees resulting from a i Output: The value of p = max p(xq|xE ) combination or marginalization operation. This Initialize pˆ with −∞ enables the probability trees to be maintained in if the credal net N0 contains a single vertex then a reasonable size, making propagation possible i i Update pˆ with p(xq|xE ) if p(xq|xE ) > pˆ when the problem is hard. end Another advantage of this algorithm is that it else Using the k possibilities of one of the credal sets can be used in general Bayesian networks and in N0, obtain a list of k credal networks not only in polytrees. This is a generic outer {N01, . . . , N0k} from N0 . foreach N0h do if r(N0h) > pˆ then Call recursively depth-first 7 Experiments branch-and-bound over N0h end We have analyzed several kind of networks to bet- end ter check the performance of the two proposed algo- end rithms. The basic structure of the networks is the one Take the last pˆ as p used in [19] (see Figure 1), but three different variants Algorithm 1: Branch-and-bound are considered, varying the number of states for the variables and the number of vertices (extreme points) for each conditional credal set H Xi|Y=yj . The fol- procedure is applied recursively. At each step, a trans- lowing combinations are considered: two states and parent variable is expanded. In this way, a leaf node two vertices (BN2s2ext); two states and three ver- contains a Bayesian network, obtained by a particu- tices (BN2s3ext); and three states for all variables lar selection of states in all the transparent variables except variable C (two states), and two vertices (BN- (that is, a selection of vertices in all credal sets in the Mixed2ext). We obtain 30 credal networks for each of credal network). Rocha and Cozman apply the vari- the three variants where the numerical values for the able elimination algorithm to perform inference in the extreme points are randomly generated. This random Bayesian network defined by the leaf. generation is made in the following way. We take a We introduce the following modifications in our ver- Bayesian network and transform it into a credal net- sion of the branch-and-bound algorithm: work by generating a given number l of random ver- tices for each conditional distribution p(Xi|Y = y). The generation of a random vertex from p(X |Y = y) • We use the hill-climbing algorithm in Section 5 i is made by producing a random uniform number to initialize pˆ. This is a better solution because it r ∈ [−1.0, 1.0] for each x ∈ Ω , and adding r to is a fast algorithm and initialization with a good i Xi p(x |Y = y). If the new p(x |Y = y) < 0.0, then underestimation of max p(xi |x ) can save a lot of i i q E we change its sign. Finally, the values p(X |Y = y) search work, and therefore a lot of computation i are normalized to obtain the new vertex. The exper- time in the branch-and-bound algorithm. iments use E as the query variable with no evidence • We also use the variable elimination algorithm (XE = ∅). The potential number of vertices of the for inference in leaf nodes (Bayesian networks) strong extension for the three kind of networks is 211, but now we use the probability tree version [3] 311 and 218 (BN2s2ext, BN2s3ext, and BNMixed2ext, where we always make exact operations. The use respectively). of probability trees in this step does not greatly We have made several tests to check critical issues: affect computing time, but it is used instead of the general method because in the next point we also use probability trees. • For the branch-and-bound algorithm: maximum size of the probability trees required during the • The algorithm for overestimating the solution in search, the size of the search tree, and computa- inner nodes of the search tree is also based on tion time. The maximum size of the probability the variable elimination algorithm with probabil- trees is the size of the biggest tree used in the ity trees. However, we now use the version that computations. These parameters are measured makes use of the rmin and rmax values explained running the algorithm with several values for the in Section 4. In the algorithm, the probability threshold σ (used when pruning the probability trees corresponding with the initial conditional trees). We have proved twelve values for σ, from information are pruned using the method de- 0.0 to 0.004, with a step of 0.0005, and from 0.004 scribed in Section 4. Such a method is also used to 0.01 with a step of 0.002. It should be noted that this algorithm always obtains exact compu- smaller the probability trees obtained. The savings tations for all the values of σ. A small value for when pruning is used should also be observed, even σ requires a low number of visited nodes in the with only a very small threshold σ ( σ = 0.0005). search space, but it needs large probability trees. A large σ, however, requires a high number of visited nodes but smaller probability trees. • For the hill-climbing algorithm: computation i time and root mean square error for p(xq|xE).

The two algorithms are implemented in Java language (j2sdk 1.5) within the Elvira system [6]. The exper- iments have been run on a Pentium IV 3400 MHz computer, with 2 GBytes of Ram memory, and the Linux Fedora Core 3 operating system. In the exper- iments, the hill-climbing algorithm makes only three crosses in the join tree (the initial propagation from leaves towards the root, and then a crossing from the root towards the leaves and viceversa). Figure 6: Prob. tree maximum size - threshold σ

7.1 Hill-climbing algorithm performance Figure 7 shows the number of visited nodes in the search tree of the branch-and-bound algorithm with The performance when only the algo- respect to the threshold σ. In this case, the greater rithm from Section 5 is applied is shown in Table σ is, the greater the search required, the stored valu- 1. This table shows the root mean square error of ations are increasingly approximate, and the solution i i max p(xq |xE) and min p(xq|xE) for all the states of to the problem is reached after a deeper search. In the query variable. In this case, the average run- any case, the size of the search tree by the branch- ning time for the three kinds of networks is 0.014, and-bound is a small fraction of the potential number 0.017 and 0.16 seconds (BN2s2ext, BN2s3ext, and of vertices of the strong extension. BNMixed2ext, respectively). It can be observed that the hill-climbing algorithm produces a very good ap- proximation in a short time.

BN2s2ext BN2s3ext BNMixed2ext 1 x min p(xq| E) 0.0089 0.014 0.024 1 x max p(xq| E) 0.0005 0.021 0.0043 2 x min p(xq| E) 0.0005 0.021 0.0094 2 x max p(xq| E) 0.0089 0.014 0.01 3 x min p(xq| E) - - 0.0034 3 x max p(xq| E) - - 0.00013 i Table 1: Root mean square error for min p(xq|xE) and i max p(xq |xE) using the hill-climbing algorithm

7.2 Branch-and-bound performance Figure 7: Nodes visited during the search Using the branch-and-bound algorithm with the hill- climbing algorithm at the initialization step, we ob- Figure 8 shows the computation time with respect tain the value of the different parameters in Figures to the threshold σ. Time is quickly decreased until 6, 7, and 8. The diagrams show the average values σ reaches 0.0005, and then it is slowly increased. For resulting from the 30 runs for each threshold σ and σ = 0 (no pruning), we have the smallest search space, i for each kind of network. A logarithmic scale (base but the algorithm for overestimating p(xq|xE ) (vari- 10) is used as the differences in the parameters for able elimination with rmin and rmax values) uses large the different kinds of networks are too large. Figure probability trees, requiring a lot of time. When σ is 6 shows the relation between the maximum size of small (close to 0.0005), the search space is increased i the probability trees with respect to the threshold σ because the algorithm for overestimating p(xq|xE) used for pruning. It is clear that the greater σ is, the produces worse bounds than before, but the running time is compensated with a decrease in the time em- variable given a set of observed variables XE. The ployed by that algorithm (the probability trees are first is an approximate algorithm that provides inner now smaller). When σ is large, the search space con- bounds close to the correct ones in a very short time tinues increasing but the reduction in the size of the (see Table 1). The second is based on Rocha and Coz- probability trees no longer compensates for that in- man’s branch-and-bound algorithm. This makes use crease in the search space. The best performance is of the proposed hill-climbing algorithm to initialize obtained when σ = 0.0005. the starting point pˆ. This is a good decision as Table 2 demonstrates. The branch-and-bound algorithm uses the variable elimination algorithm with the rmin and rmax values to overestimate the intervals in the inner nodes of the search. In this way, it can be applied to any credal network structure (not only to polytrees). The variable elimination algorithm is controlled by a parameter σ. When the variable elimination requires more memory than that available in our computer, we apply large values of σ in order to reduce the size of the probability trees which enable propagation, al- though there will be an increase in the search space. In this way, the best performance of the branch-and- bound algorithm is reached with a trade-off between the parameter σ and the number of visited nodes. In the future, we would like to prove the two algo- Figure 8: Computation times rithms in credal networks with a more complex struc- ture. We would also like to prove alternative strate- In order to gain further insight into the performance gies in the hill-climbing algorithm (for example, in- of the hill-climbing algorithm in combination with creasing the number of iterations in the join tree). the branch-and-bound algorithm, we have repeated Another point is to study the effect of the order in the previous experiments with the branch-and-bound which the transparent variables are expanded in the algorithm. Instead of using the hill-climbing algo- search tree on the running time and number of visited rithm to initialize pˆ, we now use +∞ when cal- nodes. i culating min p(xq|xE ), and −∞ when calculating i max p(xq |xE). With this change, the performance of Acknowledgments the branch-and-bound search is not so good. Table 2 includes the average percentages of increases for the This work has been supported by Direcci´on General maximum size of potentials, number of visited nodes, de Investigaci´on e Innovaci´on de la Junta de Comu- and computation times. nidades de Castilla-La Mancha (Spain) under project PBC-02-002. BN2s2ext BN2s3ext BNMixed2ext max. pot. size 0.3399% 0.0074% 0.0051 % visited nodes 55.88% 99.44% 5.61 % References comput. time 46.84% 68.35% 16.55 % [1] S. Amarger, D. Dubois, and H. Prade. Constraint Table 2: Average increase for each parameter without propagation with imprecise conditional probabil- using hill-climbing in the initialization of pˆ ities. In B. D’Ambrosio, Ph. Smets, and P.P. Bonissone, editors, Proceedings of the 7th Con- While the maximum size of potentials is not very dif- ference on Uncertainty in Artificial Intelligence, ferent, the computation time is greatly increased; a pages 26–34. Morgan & Kaufmann, 1991. worse initial bound requires a deeper search in order to reach the final solution. This also explains the in- [2] A. Cano and S. Moral. Computing probability in- crease in the number of visited nodes. tervals with and probability trees. Journal of Applied Non-Classical Logics, 8 Conclusions 12(2):151–171, 2002. [3] A. Cano and S. Moral. Using probabilities trees In this paper, we have presented two new algorithms to compute marginals with imprecise probabili- for inference in credal networks. They are used to ties. International Journal of Approximate Rea- obtain probability intervals for the states of a query soning, 29:1–46, 2002. [4] A. Cano, S. Moral, and A. Salmer´on. Penniless Theoretical Computer Science. Springer-Verlag, propagation in join trees. International Journal 2003. of Intelligent Systems, 15(11):1027–1059, 2000. [15] S. Kullback and R. A. Leibler. On information [5] J. E. Cano, S. Moral, and J. F. Verdegay-L´opez. and sufficiency. Annals of Mathematical Statis- Propagation of convex sets of probabilities in di- tics, 22:76–86, 1951. rected acyclic networks. In B. Bouchon-Meunier et al., editors, Uncertainty in Intelligent Systems, [16] S. L. Lauritzen and D. J. Spiegelhalter. Lo- pages 15–26. Elsevier, 1993. cal computation with probabilities on graphical structures and their application to expert sys- [6] Elvira Consortium. Elvira: An environment for tems. Journal of the Royal Statistical Society, creating and using probabilistic graphical mod- Ser. B, 50:157–224, 1988. els. In J.A. G´amez and A. Salmer´on, editors, Proceedings of the First European Workshop on [17] S. Moral and A. Cano. Strong conditional inde- Probabilistic Graphical Models, pages 222–230, pendence for credal sets. Annals of Mathematics 2002. and Artificial Intelligence, 35:295–321, 2002.

[7] I. Couso, S. Moral, and P. Walley. Examples of [18] J. Pearl. Probabilistic Reasoning with Intelligent independence for imprecise probabilities. In Pro- Systems. Morgan & Kaufman, San Mateo, 1988. ceedings of the First International Symposium on [19] J. C. F. Rocha and F. G. Cozman. Inference Imprecise Probabilities and Their Applications in credal networks with branch-and-bound algo- (ISIPTA’99), 1999. rithms. In Proceedings of the Third International Symposium on Imprecise Probabilities and Their [8] F. G. Cozman. Robustness analysis of Bayesian Applications (ISIPTA ’03), pages 482–495, 2003. networks with local convex sets of distributions. In Proceedings of the 13th Conference on Uncer- [20] J.C. Ferreira Da Rocha and F.G. Cozman. Infer- tainty in Artificial Intelligence. Morgan & Kauf- ence with separately specified sets of probabilities mann, San Mateo, 1997. in credal networks. In A. Darwiche and N. Fried- man, editors, Proceedings of the 18th Conference [9] F. G. Cozman. Credal networks. Artificial Intel- on Uncertainty in Artificial Intelligence. Morgan ligence, 120:199–233, 2000. Kaufmann, 2002. [10] F.G. Cozman. Irrelevance and independence rela- [21] P. P Shenoy. Binary join trees. In Proceedings of tions in quasi-bayesian networks. In Proceedings the Twelfth Annual Conference on Uncertainty in of the Fourteenth Annual Conference on Uncer- Artificial Intelligence (UAI–96), pages 492–499, tainty in Artificial Intelligence (UAI–98), pages Portland, Oregon, 1996. 89–96, San Francisco, CA, 1998. Morgan Kauf- mann Publishers. [22] P. P. Shenoy and G. Shafer. Axioms for probabil- ity and belief-function propagation. In Shachter [11] C. P. de Campos and F. G. Cozman. Inference et al., editors, Uncertainty in Artificial Intelli- in credal networks using multilinear program- gence, 4, pages 169–198. North-Holland, 1990. ming. In Proceedings of the Second Starting AI Researcher Symposium, pages 50–61, 2004. [23] B. Tessen. Interval probability propagation. In- ternational Journal of Approximate Reasoning, [12] K. W. Fertig and J. S. Breese. Interval influence 7:95–120, 1992. diagrams. In M. Henrion, R. D. Shacter, L. N. Kanal, and J. F. Lemmer, editors, Uncertainty in [24] H. Th¨one, U. Gun¨ tzer, and W. Kießling. Towards Artificial Intelligence, 5, pages 149–161. North- precision of probabilistic bounds propagation. In Holland, Amsterdam, 1990. Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence, pages 315–322, 1992. [13] C. P. de Campos J. C. F. da Rocha, F. G. Coz- man. Inference in polytrees with sets of proba- [25] P. Walley. Statistical Reasoning with Imprecise bilities. In Christopher Meek and Uffe Kjærulff, Probabilities. Chapman and Hall, London, 1991. editors, Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, pages 217– 224. Morgan Kaufmann, 2003.

[14] J. Kohlas. Information Algebras: Generic Struc- tures for Inference. Discrete Mathematics and