
From: FLAIRS-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Latin Hypercube Sampling in Bayesian Networks

Jian Cheng & Marek J. Druzdzel
Decision Systems Laboratory
School of Information Sciences and Intelligent Systems Program
University of Pittsburgh
Pittsburgh, PA 15260
{jcheng,marek}@sis.pitt.edu

Abstract

We propose a scheme for producing Latin hypercube samples that can enhance any of the existing sampling algorithms in Bayesian networks. We test this scheme in combination with the likelihood weighting algorithm and show that it can lead to a significant improvement in the convergence rate. While the performance of sampling algorithms in general depends on the numerical properties of a network, in our experiments Latin hypercube sampling always performed better than random sampling. In some cases we observed as much as an order of magnitude improvement in convergence rates. We discuss practical issues related to the storage requirements of Latin hypercube sample generation and propose a low-storage, anytime cascaded version of Latin hypercube sampling that introduces a minimal performance loss compared to the original scheme.

Keywords: Bayesian networks, algorithms, stochastic sampling, Latin hypercube sampling

Introduction

Bayesian networks (Pearl 1988) are increasingly popular tools for modeling problems involving uncertainty. Practical models based on Bayesian networks often reach the size of hundreds of variables (e.g., (Pradhan et al. 1994; Conati et al. 1997)). Although a number of ingenious inference algorithms have been developed, the problem of exact belief updating in Bayesian networks is NP-hard (Cooper 1990). Approximate inference schemes may often be the only feasible alternative for very large and complex models. The problem of achieving a required precision in approximate inference is also NP-hard (Dagum & Luby 1993). However, precision guarantees may not be critical for some types of problems and can be traded off against the speed of computation.

In all situations where precision is not of the essence, stochastic simulation methods can offer several advantages compared to exact algorithms. The execution time of exact algorithms depends on the topology of the network and is generally exponential in its size and connectivity. In stochastic sampling, computation is measured in terms of the number of samples: in general, the more samples are generated, the higher the precision obtained. The execution time of stochastic sampling algorithms is less dependent on the topology of the network and is linear in the number of samples. Computation can be interrupted at any time, yielding an anytime property of the algorithms, important in time-critical applications.

In this paper, we investigate the advantages of applying Latin hypercube sampling to existing sampling algorithms in Bayesian networks. While Henrion (1988) suggested that applying Latin hypercube sampling can lead to modest improvements in the performance of stochastic sampling algorithms, neither he nor anybody else has, to our knowledge, studied this systematically or proposed a scheme for generating Latin hypercube samples in Bayesian networks. We first propose a scheme for producing Latin hypercube samples. We then test this scheme in the likelihood weighting algorithm (Fung & Chang 1990; Shachter & Peot 1990) and show that it can lead to a significant improvement in the convergence rate over random sampling. While the performance of sampling algorithms in general depends on the numerical properties of a network, in our experiments Latin hypercube sampling always performed better than random sampling. In some cases we observed as much as an order of magnitude improvement in convergence rates. We discuss practical issues related to the storage requirements of Latin hypercube sample generation and propose a low-storage, anytime cascaded version of Latin hypercube sampling that introduces a minimal performance loss compared to the original scheme.

All random variables used in this paper are multiple-valued, discrete variables. Capital letters, such as A, B, or C, will denote random variables. Bold capital letters, such as E, will denote sets of variables. Lower case letters a, b, c will denote particular instantiations of variables A, B, and C respectively.

Latin hypercube sampling

Latin hypercube sampling (McKay, Beckman, & Conover 1979) is inspired by the Latin square experimental design, which tries to eliminate the effect of various experimental factors without increasing the number of subjects in the experiment. The purpose of Latin hypercube sampling is to ensure that each value (or range of values) of a variable is represented in the samples, no matter which value might turn out to be more important.

Consider the problem of generating five samples from two independent, uniformly distributed variables X and Y, such that X, Y ∈ [0, 1]. According to the Latin hypercube sampling method, we first divide the range of both X and Y into five equal intervals, resulting in 5 × 5 = 25 cells (Figure 1).

Figure 1: A Latin hypercube sample on the unit square, with n = 5.

The requirement of Latin hypercube sampling is that each row and each column of the constructed table contain only one sample. This ensures that even though there are only five samples, each of the five intervals of both X and Y will be sampled. For each sample [i, j], the sample values of X and Y are determined by:

    X = F_X^{-1}((i - 1 + ξ_X) / n),
    Y = F_Y^{-1}((j - 1 + ξ_Y) / n),        (1)

where n is the sample size, ξ_X and ξ_Y are random numbers (ξ_X, ξ_Y ∈ [0, 1]), and F_X and F_Y are the cumulative probability distribution functions of X and Y respectively.

Valid Latin hypercube samples can be generated by starting with a sequence of n numbers 1, 2, ..., n and taking various permutations of this sequence. Figure 2 illustrates the samples represented by Figure 1. Each column of the table corresponds to a variable, each row to a sample. Each column (please note that the columns list the row and column number of the corresponding sample in Figure 1) is a random permutation of the sequence of integers 1, 2, 3, 4, 5. In general, we can obtain Latin hypercube samples of size n for m variables by randomly selecting m permutations of the sequence of integers 1, 2, ..., n and assigning them to the m columns of the table. We will call this table a Latin hypercube sampling matrix (LHS matrix), LHS_{m,n}.

    Sample   X   Y
       1     2   3
       2     1   2
       3     3   1
       4     4   5
       5     5   4

Figure 2: A Latin hypercube matrix (LHS_{2,5}) for m = 2 variables and sample size n = 5.

We will now illustrate the process of generating n = 100 Latin hypercube samples in the context of a forward sampling algorithm on a simple network consisting of m = 3 random variables A, B, and C, presented in Figure 3.

    Pr(A):  a1 = 0.20,  a2 = 0.35,  a3 = 0.45
    Pr(B):  b1 = 0.4,   b2 = 0.6

    Pr(C|A,B)     a1            a2            a3
               b1     b2     b1     b2     b1     b2
    c1        0.01   0.04   0.28   0.06   0.18   0.82
    c2        0.68   0.93   0.52   0.12   0.50   0.10
    c3        0.31   0.03   0.20   0.82   0.32   0.08

Figure 3: A simple Bayesian network with three nodes.

We first construct the LHS_{3,100} matrix for this network, a fragment of which is illustrated in Figure 4. Since the algorithm applied will be forward sampling, we need to make sure that the columns of the LHS matrix follow the topological order, i.e., that parents always precede their children.

    Sample    A    B    C
       1     25   13   74
       2     14   91    7
      ...    ...  ...  ...
      39     47   32   56
      ...    ...  ...  ...
     100     69    4   84

Figure 4: A LHS_{3,100} matrix for m = 3 variables and sample size n = 100.

The process of sample generation proceeds as follows. We divide the range of each of the variables A, B, and C into 100 equally sized intervals. If LHS_{A,i} < Pr(a1) × 100, we set the state of variable A in the i-th sample to a1. If Pr(a1) × 100 ≤ LHS_{A,i} < (Pr(a1) + Pr(a2)) × 100, we set the state of variable A to a2; otherwise we set it to a3. If LHS_{B,i} < Pr(b1) × 100, we set the state of variable B to b1; otherwise we set it to b2. The state of variable C is generated similarly from the conditional probability table, based on the states of A and B. In our example, row 39 of the LHS matrix contains {47, 32, 56}. According to the above rule, in sample number 39, variable A is instantiated to the state a2 and variable B is instantiated to the state b1. Variable C is instantiated to c2, since Pr(c1|a2,b1) × 100 < LHS_{C,i} < (Pr(c1|a2,b1) + Pr(c2|a2,b1)) × 100. Please note that Equation 1 requires that we add a random number between 0 and 1 to LHS_{m,i}. Since the sample size n is usually very large and, consequently, 1/n is very small (much smaller than the required precision), we can ignore this random number in the case of discrete variables.
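To make the construction concrete, the following is a minimal Python sketch of the LHS matrix generation described above. The paper gives no code; the function name and the use of numpy are our own choices.

```python
import numpy as np

def lhs_matrix(m, n, rng=None):
    """Build a Latin hypercube sampling matrix LHS_{m,n}.

    Each of the m columns is an independent random permutation of
    1, 2, ..., n, so every variable's range is stratified into n
    intervals and each interval is hit exactly once (cf. Figure 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    return np.column_stack([rng.permutation(n) + 1 for _ in range(m)])

# A matrix of the shape shown in Figure 2 (m = 2 variables, n = 5 samples);
# entry k in a column corresponds to the k-th of the n intervals of that
# variable.
print(lhs_matrix(2, 5, np.random.default_rng(0)))
```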

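Continuing with the network of Figure 3, the next sketch (again our own, with illustrative names) maps the entries of row 39 of the LHS matrix in Figure 4 to variable states by the rule just described; the random offset of Equation 1 is dropped, as the text suggests for discrete variables.

```python
import numpy as np

# CPTs from Figure 3; state indices 0, 1, 2 stand for a1/a2/a3, etc.
pr_a = [0.20, 0.35, 0.45]
pr_b = [0.40, 0.60]
pr_c = {  # Pr(C | A, B), keyed by (state of A, state of B)
    (0, 0): [0.01, 0.68, 0.31], (0, 1): [0.04, 0.93, 0.03],
    (1, 0): [0.28, 0.52, 0.20], (1, 1): [0.06, 0.12, 0.82],
    (2, 0): [0.18, 0.50, 0.32], (2, 1): [0.82, 0.10, 0.08],
}

def state_from_lhs(entry, probs, n=100):
    """Locate the stratum of an LHS entry (1..n) in the cumulative
    distribution; entry - 0.5 keeps interval boundaries unambiguous."""
    return int(np.searchsorted(np.cumsum(probs) * n, entry - 0.5))

a = state_from_lhs(47, pr_a)          # 47 falls in the a2 stratum
b = state_from_lhs(32, pr_b)          # 32 falls in the b1 stratum
c = state_from_lhs(56, pr_c[(a, b)])  # 28 < 56 < 80, the c2 stratum
print(a, b, c)                        # 1 0 1, i.e., (a2, b1, c2)
```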
The above Latin hypercube sampling scheme gives us a way of instantiating variables to their states that is applicable to any stochastic sampling algorithm. A generalized procedure based on Latin hypercube sampling is shown in Figure 5.

1. Order the nodes according to their topological order, as required by the algorithm.
2. Generate the LHS matrix.
3. For each sample (row of the LHS matrix), generate the corresponding variable states.
4. Use these samples to calculate the desired probabilities.

Figure 5: A generalized stochastic sampling procedure based on Latin hypercube sampling.

The complexity of Latin hypercube sampling is the same as that of random sampling. A random sample in the latter corresponds to the generation of one element of the LHS matrix, which involves a call to the random number generator. The only additional step in generating one element of the LHS matrix is the swapping of matrix elements, equal to three assignments.

Using the crossed analysis of variance (ANOVA) decomposition (Efron & Stein 1981), Stein (1987) has proven that, in the problem of estimating I = ∫_{[0,1]^d} f(X) dX, the variance of stochastic simulation based on Latin hypercube sampling is asymptotically smaller than that based on simple random sampling. For finite samples, Owen (1997) shows that the variance of Latin hypercube sampling is never much worse than that of simple random sampling. Roughly speaking, because Latin hypercube sampling stratifies each dimension as much as possible and otherwise picks the relation between different dimensions randomly, it can significantly decrease the variance coming from individual dimensions. Since stochastic sampling in Bayesian networks is very similar to stochastic integration, Latin hypercube sampling can be expected to lead to smaller variance here as well.

Some improvements to the Latin hypercube sampling

The main practical problem related to Latin hypercube sampling is that the LHS matrix, which is usually very large, has to be stored in memory for the duration of the sampling. When both the network size and the number of samples are very large, this may become prohibitive. For example, when a network consists of 500 nodes that we want to sample 100,000 times, we will need at least log2 100,000 ≈ 17 bits for each of the elements of the LHS matrix. This means that to store the entire LHS matrix we will need at least 17 × 500 × 100,000 = 850,000,000 bits ≈ 106 M bytes.

The first solution to this problem, which we applied in our implementation, is to store variable instantiations instead of permutations in the LHS matrix. If we generate the permutations in the order of sampling, and the columns of the LHS matrix follow the parent ordering, we can compute at each step the outcome of the variable in question. As most variables have a relatively small number of outcomes, a few bits usually suffice. If the maximum number of outcomes of a variable in the above example is 8, we can use log2 8 = 3 bits per element of the LHS matrix instead of 17. Then the entire matrix will take 3 × 500 × 100,000 = 150 M bits ≈ 19 M bytes. In the case of networks consisting of only binary variables, this number can be reduced even further, to approximately 6 M bytes. Another way of looking at this solution is that it combines steps 2 and 3 of the algorithm of Figure 5 into a single step.

The second improvement concerns the treatment of the root nodes in the network. Latin hypercube sampling requires that the states of these nodes be sampled according to their prior distribution. To achieve this, instead of generating permutations for the root nodes, we can randomly assign a proportion of the elements in their columns to each of their states s_i: Pr(s1) × n positions for state s1, Pr(s2) × n positions for state s2, etc., assigning the remainder of the positions to the last state. In the case of the network in Figure 3, this can save us Pr(a3) × n + Pr(b2) × n = 105 calls to the random number generator (in this simple example, this amounts to 35% of all calls!).
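A sketch of this second improvement, under our own naming assumptions, is shown below. It also illustrates the first improvement, since the column holds small integer state indices rather than permutation entries. How the positions are then randomized, and hence the precise saving in generator calls, is an implementation detail the paper does not spell out.

```python
import numpy as np

def root_column(prior, n, rng):
    """Fill a root node's LHS column with state indices directly:
    Pr(s1)*n positions for s1, Pr(s2)*n for s2, ..., with the
    remaining positions assigned to the last state."""
    counts = [int(p * n) for p in prior[:-1]]
    counts.append(n - sum(counts))                # remainder -> last state
    column = np.repeat(np.arange(len(prior)), counts)
    rng.shuffle(column)                           # randomize the positions
    return column

rng = np.random.default_rng(0)
# Root node A of Figure 3 with n = 100: exactly 20/35/45 occurrences.
print(np.bincount(root_column([0.20, 0.35, 0.45], 100, rng)))  # [20 35 45]
```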
Finally, our third proposed improvement is dividing the LHS matrix into smaller matrices: instead of working with one large m × n matrix, we work with k LHS matrices of size m × n/k. While this solution reduces the precision of sampling (please note that since we use the number of samples to divide the interval [0, 1], the number of samples determines the precision), the loss may be unimportant when the number of samples is sufficiently large and the probabilities in the model are not extreme. We will confirm this observation empirically in the next section. This method, which we call cascaded Latin hypercube sampling, has the advantage of being anytime: processing each of the k LHS_{m,n/k} matrices increases the precision of the algorithm, but a result is available as soon as the first matrix has been processed. In the sequel of the paper, we will use the notation k × LHS to denote a cascaded Latin hypercube sampling scheme that consists of k steps.
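The cascaded scheme can be sketched as follows. This is our own illustration: a uniform integrand on the unit square stands in for a Bayesian network, since, as noted above, stochastic sampling in Bayesian networks closely resembles stochastic integration.

```python
import numpy as np

def lhs_uniform(n, d, rng):
    """n Latin hypercube points on [0,1]^d: Equation 1 with F the identity."""
    return np.column_stack(
        [(rng.permutation(n) + rng.random(n)) / n for _ in range(d)])

def cascaded_estimate(f, n, k, d, rng):
    """k x LHS: process k independent LHS blocks of n // k samples each.
    Only one small block is ever held in memory, and a running estimate
    is available as soon as the first block is done (anytime property)."""
    block, total = n // k, 0.0
    for step in range(1, k + 1):
        total += f(lhs_uniform(block, d, rng)).sum()
        running = total / (step * block)   # could be reported at any time
    return running

rng = np.random.default_rng(0)
f = lambda x: x.prod(axis=1)               # E[XY] = 0.25 on the unit square
print(cascaded_estimate(f, 10_000, 20, 2, rng))
```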

Experimental results

We performed empirical tests comparing Latin hypercube sampling to random sampling in the context of the likelihood sampling algorithm (Fung & Chang 1990; Shachter & Peot 1990). We used two networks in our tests. The first network is Coma, a simple multiply-connected network originally proposed by Cooper (1984).

The second network used in our tests is a subset of the CPCS (Computer-based Patient Case Study) model (Pradhan et al. 1994), a large multiply-connected multi-layer network consisting of 422 multi-valued nodes and covering a subset of the domain of internal medicine. Among the 422 nodes, 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases. The CPCS network is among the largest real networks available to the research community at present. Our analysis is based on a subset of 179 nodes of the CPCS network, created by Max Henrion and Malcolm Pradhan. We used this smaller version in order to be able to compute the exact solution for the purpose of measuring approximation error. We used the clustering algorithm (Lauritzen & Spiegelhalter 1988) to provide the exact results for the experiment.

We focused on the relationship between the number of samples in random sampling (in all experiments, we ran between 1,000 and 10,000 samples in increments of 1,000) and in Latin hypercube sampling (we tried both the straight and the cascaded Latin hypercube sampling) on the one hand, and the accuracy of approximation achieved by the simulation on the other. We measured the latter in terms of the Mean Squared Error (MSE), i.e., the square root of the mean squared difference between Pr′(X_ij), the computed marginal probability of state j (j = 1, 2, ..., n_i) of node i, and Pr(X_ij), the exact marginal probability, taken over all states of all non-evidence nodes. In all diagrams, the reported MSE is an average over 20 trials. More precisely,

    MSE = sqrt( (1 / Σ_{i∈N\E} n_i) Σ_{i∈N\E} Σ_{j=1}^{n_i} (Pr′(X_ij) − Pr(X_ij))² ),

where N is the set of all nodes, E is the set of evidence nodes, and n_i is the number of outcomes of node i.

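In code, this error measure reads as follows; a small sketch with illustrative data, where `approx` and `exact` map each node to its vector of marginal probabilities (names are ours, not the paper's implementation).

```python
import numpy as np

def mse(approx, exact, evidence=()):
    """Root of the average squared difference between computed and exact
    marginals, taken over all states of all non-evidence nodes."""
    nodes = [i for i in exact if i not in set(evidence)]
    total_states = sum(len(exact[i]) for i in nodes)
    squared = sum(((np.asarray(approx[i]) - np.asarray(exact[i])) ** 2).sum()
                  for i in nodes)
    return float(np.sqrt(squared / total_states))

exact = {"A": [0.20, 0.35, 0.45], "B": [0.40, 0.60]}
approx = {"A": [0.22, 0.33, 0.45], "B": [0.38, 0.62]}
print(mse(approx, exact))  # ~0.0179
```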
Figure 6: Mean Squared Error as a function of the number of samples for the Coma network without evidence, for random sampling (RS) and three variants of Latin hypercube sampling (LHS).

Figure 6 shows the MSE as a function of the number of samples for the Coma network. It is clear that Latin hypercube sampling in all three variants outperforms random sampling. For a given sample size, the improvement in MSE can be as high as 75%. The improvement looks even more dramatic when we look at the number of samples needed to achieve a given precision: Latin hypercube sampling with only 2,000 samples can perform better than random sampling with 10,000 samples. In other words, Latin hypercube sampling can achieve a better precision in one fifth of the time. For the cascaded versions of the algorithm, as expected, performance deterioration is minimal when the number of samples is large enough. For small sample sizes (e.g., for 1,000 samples), a cascaded version with 20 steps of 50 samples each performed somewhat worse, because 50 samples divide the interval [0, 1] too coarsely to achieve the desired precision.

Figure 7: Mean Squared Error as a function of the number of samples for the CPCS network without evidence, for random sampling (RS) and three variants of Latin hypercube sampling (LHS).

Analogous results for the CPCS network are presented in Figure 7. Without evidence, the typical improvement in MSE due to Latin hypercube sampling is on the order of 50%. For a given error level, typically fewer than one quarter as many samples were needed by Latin hypercube sampling as by random sampling. Cascaded sampling in the CPCS network requires a larger number of samples to achieve the same precision as pure Latin hypercube sampling. This is due to the fact that the probabilities in the CPCS network are more extreme.

Figures 8 and 9 show the results for the Coma and CPCS networks with evidence. In the case of the Coma network, the evidence was an observation of severe headaches without coma. In the case of the CPCS network, we generated evidence randomly from among those nodes that described various plausible medical findings.


Figure 8: Mean Squared Error as a function of the number of samples for the Coma network with evidence (severe headaches without coma), for random sampling (RS) and three variants of Latin hypercube sampling (LHS).

Figure 9: Mean Squared Error as a function of the number of samples for the CPCS network with evidence (five evidence nodes chosen randomly from among plausible medical observations), for random sampling (RS) and three variants of Latin hypercube sampling (LHS).

These nodes were almost all leaf nodes in the network. We report results for five evidence nodes. When the number of evidence nodes is much larger, the convergence rates deteriorate for all versions of the likelihood sampling algorithm. The improvement due to Latin hypercube sampling in terms of MSE is not as dramatic as in the test runs without evidence or with a small number of evidence nodes.

We have observed that in the case of large networks, the performance of sampling algorithms depends strongly on the number of evidence nodes and their topological location in the network: the closer they are to the leaves of the tree, the worse. The reason for this is a mismatch between the sample space (which in the case of likelihood sampling is dominated by the prior probability distribution) and the posterior distribution. This matches the observations made by Cousins (1993), who concluded that the performance of different algorithms depends strongly on the properties of the network. While in the networks that we have tested Latin hypercube sampling was equal to or better than random sampling, sometimes the improvement was not large. We believe that in the case of a large mismatch between the prior and the posterior distribution, sampling simply performs poorly and its performance cannot be improved much by the Latin hypercube sampling technique. This does not diminish the value of Latin hypercube sampling, which can be used in combination with any sampling algorithm, including one that matches the posterior distribution better.

We also tested the relationship between the block size in the cascaded version of Latin hypercube sampling and the MSE for different sample sizes. Figure 10 shows that the cascaded version very quickly achieves the performance of the original Latin hypercube sampling scheme. As soon as the number of samples in the individual blocks allows for achieving the desired precision, there is little difference between the performance of the cascaded and the original scheme. In the networks that we tested, a block size of 2,000 turned out to be sufficiently large to achieve the desired performance.

Figure 10: Mean Squared Error as a function of block size and sample size for the CPCS network without evidence, for Latin hypercube sampling (LHS).

In terms of absolute computation time, we have observed that the generation of one Latin hypercube sample takes about 25% more time than the generation of a random sample. We believe that this may be related to the large data structures of the Latin hypercube sampling scheme (the LHS matrix) and the resulting decrease in the efficiency of hardware caching. We would like to point out that despite this difference, Latin hypercube sampling outperforms random sampling in terms of mean squared error within the same absolute time.

Conclusion

Computational complexity remains a major problem in the application of probability theory and decision theory in knowledge-based systems. It is important to develop schemes that will reduce it; even though the worst case will remain NP-hard, many practical cases may become tractable. In this paper, we studied the application of Latin hypercube sampling to stochastic sampling algorithms in Bayesian networks. We have outlined a method for generating Latin hypercube samples and demonstrated empirically that it leads to significant performance increases in terms of convergence rate. We proposed several ways of reducing storage requirements, the main problem related to practical implementations of Latin hypercube sampling, and proposed a cascaded version of sampling that is anytime and that introduces minimal performance losses. Even though we tested it only with likelihood sampling, Latin hypercube sampling is applicable to any sampling algorithm and we believe that it should in each case lead to performance improvements. The precise advantages will depend on the numerical parameters of the networks and the algorithm applied. We expect, however, that Latin hypercube sampling should never perform worse than random sampling.

Acknowledgments

This research was supported by the National Science Foundation under the Faculty Early Career Development (CAREER) Program, grant IRI-9624629, and by the Air Force Office of Scientific Research, grants F49620-97-1-0225 and F49620-00-1-0112. Malcolm Pradhan and Max Henrion of the Institute for Decision Systems Research shared with us the CPCS network, with the kind permission of the developers of the Internist system at the University of Pittsburgh. All experimental data have been obtained using SMILE, an inference engine developed at the Decision Systems Laboratory and available at http://www2.sis.pitt.edu/~genie.

References

Conati, C.; Gertner, A. S.; VanLehn, K.; and Druzdzel, M. J. 1997. On-line student modeling for coached problem solving using Bayesian networks. In Proceedings of the Sixth International Conference on User Modeling (UM-96), 231-242. Vienna, New York: Springer Verlag.

Cooper, G. F. 1984. NESTOR: A Computer-based Medical Diagnostic Aid that Integrates Causal and Probabilistic Knowledge. Ph.D. Dissertation, Stanford University, Computer Science Department.

Cooper, G. F. 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42(2-3):393-405.

Cousins, S. B.; Chen, W.; and Frisse, M. E. 1993. A tutorial introduction to stochastic simulation algorithms for belief networks. In Artificial Intelligence in Medicine. Elsevier Science Publishers B.V. Chapter 5, 315-340.

Dagum, P., and Luby, M. 1993. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60(1):141-153.

Efron, B., and Stein, C. 1981. The jackknife estimate of variance. Annals of Statistics 9(3):586-596.

Fung, R., and Chang, K.-C. 1990. Weighting and integrating evidence for stochastic simulation in Bayesian networks. In Henrion, M.; Shachter, R.; Kanal, L.; and Lemmer, J., eds., Uncertainty in Artificial Intelligence 5. North Holland: Elsevier Science Publishers B.V. 209-219.

Henrion, M. 1988. Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Lemmer, J., and Kanal, L., eds., Uncertainty in Artificial Intelligence 2. Elsevier Science Publishers B.V. (North Holland). 149-163.

Lauritzen, S. L., and Spiegelhalter, D. J. 1988. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological) 50(2):157-224.

McKay, M. D.; Beckman, R. J.; and Conover, W. J. 1979. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2):239-245.

Owen, A. B. 1997. Monte Carlo variance of scrambled equidistribution quadrature. SIAM Journal of Numerical Analysis 34(5):1884-1910.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann Publishers, Inc.

Pradhan, M.; Provan, G.; Middleton, B.; and Henrion, M. 1994. Knowledge engineering for large belief networks. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94), 484-490. San Mateo, CA: Morgan Kaufmann Publishers, Inc.

Shachter, R. D., and Peot, M. A. 1990. Simulation approaches to general probabilistic inference on belief networks. In Henrion, M.; Shachter, R.; Kanal, L.; and Lemmer, J., eds., Uncertainty in Artificial Intelligence 5. North Holland: Elsevier Science Publishers B.V. 221-231.

Stein, M. 1987. Large sample properties of simulations using Latin hypercube sampling. Technometrics 29(2):143-151.
