Department of Computer Science Series of Publications A Report A-2012-1

Algorithms for Exact Structure Discovery in Bayesian Networks

Pekka Parviainen

To be presented, with the permission of the Faculty of Science of the University of Helsinki, for public criticism in Auditorium XII, University Main Building, on 3 February 2012 at twelve o’clock noon.

University of Helsinki
Finland

Supervisors
Mikko Koivisto, University of Helsinki, Finland
Heikki Mannila, Aalto University, Finland

Pre-examiners
Tapio Elomaa, Tampere University of Technology, Finland
Kevin Murphy, University of British Columbia, Canada

Opponent Gregory Sorkin, London School of Economics and Political Science, United Kingdom

Custos Esko Ukkonen, University of Helsinki, Finland

Contact information

Department of Computer Science P.O. Box 68 (Gustaf Hällströmin katu 2b) FI-00014 University of Helsinki Finland

Email address: [email protected].fi URL: http://www.cs.Helsinki.fi/ Telephone: +358 9 1911, telefax: +358 9 191 51120

Copyright © 2012 Pekka Parviainen
ISSN 1238-8645
ISBN 978-952-10-7573-5 (paperback)
ISBN 978-952-10-7574-2 (PDF)
Computing Reviews (1998) Classification: G.3, F.2.1, F.2.3, I.2.6
Helsinki 2012
Unigrafia

Algorithms for Exact Structure Discovery in Bayesian Networks

Pekka Parviainen

Department of Computer Science P.O. Box 68, FI-00014 University of Helsinki, Finland [email protected].fi http://www.cs.helsinki.fi/u/pjparvia/

PhD Thesis, Series of Publications A, Report A-2012-1 Helsinki, January 2012, 132 pages ISSN 1238-8645 ISBN 978-952-10-7573-5 (paperback) ISBN 978-952-10-7574-2 (PDF)

Abstract

Bayesian networks are compact, flexible, and interpretable representations of a joint distribution. When the network structure is unknown but there are observational data at hand, one can try to learn the network structure. This is called structure discovery. This thesis contributes to two areas of structure discovery in Bayesian networks: space–time tradeoffs and learning ancestor relations.

The fastest exact algorithms for structure discovery in Bayesian networks are based on dynamic programming and use excessive amounts of space. Motivated by the space usage, several schemes for trading space against time are presented. These schemes are presented in a general setting for a class of computational problems called permutation problems; structure discovery in Bayesian networks is seen as a challenging variant of the permutation problems. The main contribution in the area of the space–time tradeoffs is the partial order approach, in which the standard dynamic programming is extended to run over partial orders. In particular, a certain family of partial orders called parallel bucket orders is considered. A partial order scheme that provably yields an optimal space–time tradeoff within parallel bucket orders is presented. Also practical issues concerning parallel bucket orders are discussed.

Learning ancestor relations, that is, directed paths between nodes, is motivated by the need for robust summaries of the network structures when there are unobserved nodes at work. Ancestor relations are nonmodular features and hence learning them is more difficult than learning modular features. A dynamic programming algorithm is presented for computing posterior probabilities of ancestor relations exactly. Empirical tests suggest that ancestor relations can be learned from observational data almost as accurately as arcs even in the presence of unobserved nodes.

Computing Reviews (1998) Categories and Subject Descriptors:
G.3 [Probability and Statistics] Multivariate Statistics
F.2.1 [Analysis of Algorithms and Problem Complexity] Numerical Algorithms and Problems - Computation of transforms
F.2.3 [Analysis of Algorithms and Problem Complexity] Tradeoffs between Complexity Measures
I.2.6 [Artificial Intelligence] Learning - Knowledge acquisition, Parameter learning

General Terms: Algorithms, Experimentation, Theory

Additional Key Words and Phrases: Bayesian networks, permutation problems, dynamic programming

Acknowledgements

First of all, I would like to thank my advisors. I am deeply grateful to Dr. Mikko Koivisto, who has guided me through all these years. I thank Prof. Heikki Mannila, who in his role as an “elder statesman” provided the research infrastructure and funding and occasional words of wisdom. My pre-examiners Prof. Tapio Elomaa and Prof. Kevin Murphy gave comments that helped me finish the thesis. Other persons who gave feedback about the thesis were Dr. Michael Gutmann, Antti Hyttinen, Laura Langohr, and Dr. Teemu Roos. I also thank Marina Kurtén, who helped improve the language of the thesis.

The Department of Computer Science has been a nice place to work. I thank current and former colleagues, especially Esther Galbrun, Esa Junttila, Jussi Kollin, Janne Korhonen, Pauli Miettinen, Teppo Niinimäki, Stefan Schönauer, Jarkko Toivonen, and Jaana Wessman, for their contributions to various scientific endeavors and well-being at the work place.

Finally, I want to thank my family for their support and encouragement over the years.

Contents

1 Introduction 1

2 Preliminaries 7
   2.1 Bayesian Networks 7
       2.1.1 Bayesian Network Representation 7
       2.1.2 Computational Problems 14
   2.2 Permutation Problems 17
       2.2.1 The Osd Problem 21
       2.2.2 The Fp Problem 22
       2.2.3 The Ancestor Relation Problem 23
   2.3 Dynamic Programming 23
       2.3.1 The Osd Problem 25
       2.3.2 The Fp Problem 28

3 Space–Time Tradeoffs 31
   3.1 Permutation Problems 31
       3.1.1 Two-Bucket Scheme 31
       3.1.2 Divide and Conquer Scheme 33
   3.2 Structure Discovery in Bayesian Networks 34
       3.2.1 On Handling Parent Sets with Limited Space 35
       3.2.2 The Osd Problem 36
       3.2.3 The Fp Problem 40
   3.3 Time–Space Product 41

4 The Partial Order Approach 43
   4.1 Partial Orders 44
   4.2 Dynamic Programming over Partial Orders 45
   4.3 The Osd Problem 50
   4.4 The Fp Problem 56
       4.4.1 Fast Sparse Zeta Transform 57
       4.4.2 Posterior Probabilities of All Arcs 65


5 Parallel Bucket Orders 69
   5.1 Pairwise Scheme 69
   5.2 Parallel Bucket Orders: An Introduction 72
   5.3 On Optimal Time–Space Product 76
   5.4 Parallel Bucket Orders in Practice 83
       5.4.1 Parallelization 86
       5.4.2 Empirical Results 87
       5.4.3 Scalability beyond Partial Orders 90

6 Learning Ancestor Relations 93
   6.1 Computation 94
   6.2 Empirical Results 99
       6.2.1 Challenges of Learning Ancestor Relations 100
       6.2.2 A Simulation Study 102
       6.2.3 Real Life Data 110

7 Discussion 113

References 119

Appendix A Elementary Mathematical Facts 131

Chapter 1

Introduction

Bayesian networks [83] are a widely-used class of probabilistic graphical models. A Bayesian network consists of two components: a directed acyclic graph (DAG) that expresses the conditional independence relations between random variables and conditional probability distributions associated with each variable. Nodes of the DAG correspond to variables and arcs express dependencies between variables.

A Bayesian network representation of a joint distribution is advantageous because it is compact, flexible, and interpretable. A Bayesian network representation exploits the conditional independencies among variables and provides a compact representation of the joint probability distribution. This is beneficial as in general the number of parameters in a joint probability distribution over a variable set grows exponentially with respect to the number of variables and thus storing the parameter values becomes infeasible even with a moderate number of variables. Since a Bayesian network represents the full joint distribution, it allows one to perform any inference task. Thus, it is a flexible representation. Furthermore, the structure of a Bayesian network, that is, the DAG, can be easily visualized and may uncover some important characteristics of the domain, especially if the arcs are interpreted to be causal, or in other words, direct cause–effect relations.

To illustrate the interpretability of a DAG, consider Figure 1.1, which shows a DAG describing factors affecting the weight of a guinea pig. This DAG has some historical value as it is an adaptation of the structural equation model presented by Sewall Wright in 1921 [110], which, to our best knowledge, was among the first graphical models used to model dependencies between variables. For example, an arc from node “Weight at birth” to node “Weight at 33 days” indicates that the weight at birth directly affects the weight at 33 days.


Figure 1.1: A directed acyclic graph (DAG) illustrating the interrelations among the factors which determine the weight of guinea pigs at birth and at weaning (33 days) [110].

In recent decades, Bayesian networks have garnered vast interest from different domains and they have a multitude of diverse applications. Early applications were often expert systems for medical diagnosis such as Pathfinder [49]. Bayesian networks are widely used, for example, in biology, where applications vary from bioinformatical problems like analyzing gene expression data [40, 53] to ecological applications like modeling the population of polar bears [2] and analyzing the effectiveness of different oil combating strategies in the Gulf of Finland [51]. Bayesian networks have also been applied in other fields like business and industry.

How can one find a good Bayesian network model for a phenomenon? An immediate attempt would be to let a domain expert construct the model. However, this approach is often infeasible because either the construction requires too much time and effort to be practical, or the phenomenon to be modeled is not known well enough to warrant modeling by hand. Fortunately, if an expert cannot construct a model, but one has access to a set of observations from the nodes of interest, one can attempt to learn a Bayesian network from data. From the machine learning point of view, the question is, how can one learn such a model accurately and with feasible computational resources?

Learning of a Bayesian network is usually conducted in two phases. First, one learns the structure of the network and then one learns the parameters of the conditional distributions. The learning of the structure, or in other words, structure discovery, is the more challenging and thus more interesting phase.

There are two rather distinct approaches to structure discovery in Bayesian networks. The constraint-based approach [84, 99, 100, 109] relies on testing independencies and dependencies between variables and constructing a DAG that represents these relations. The score-based approach [19, 47], on the other hand, is based on assigning each DAG a score according to how well the DAG fits the data. Compared to the constraint-based approach, the score-based approach benefits from being able to assess the uncertainty in the results, to incorporate prior knowledge, and to combine several models. The constraint-based approach, however, enables principled and computationally feasible handling of unobserved variables. For a more thorough comparison of the approaches, see, for example, texts by Heckerman et al. [48] or Neapolitan [75, p. 617–624].

Structure discovery in Bayesian networks hosts several interesting problem variants. In the optimal structure discovery (Osd) problem the goal is to find a DAG that is a “best” representation of the data. In the constraint-based approach this means that the DAG explains the dependencies and independencies in the data. In the score-based approach, on the other hand, one tries to find a DAG that maximizes the score. However, it is not always necessary or even preferable to try to infer a single optimal DAG. The score-based approach allows one to take a so-called Bayesian approach [39, 65] and report posterior probabilities of structural features. There are several types of structural features. Computing posterior probabilities of modular features, such as arcs, constitutes the feature probability (Fp) problem. Further, one may compute posterior probabilities of ancestor relations, that is, directed paths between two nodes.
In this thesis, we concentrate on the score-based approach, which allows one to take advantage of different problem variants.

All the aforementioned problems are similar in the sense that (under some usual modularity assumptions) the score or the posterior probability of a DAG decomposes into local terms. However, one also has to take into account a global constraint, namely acyclicity. Further, the difference between modular features and ancestor relations is that modular features are local but ancestor relations are global properties of a DAG. The outline of the Osd and Fp problems is similar to several classic combinatorial problems like the traveling salesman problem (Tsp): one has local costs (or scores) and one wants to find an optimal solution under some global constraint. It turns out that Osd, Fp, and Tsp can be formulated in such a way that the global constraint is handled by maximizing or summing over permutations of nodes. Thus, such problems are members of a broad class of computational problems called permutation problems.

Although the score of a DAG is constructed from local scores, the global constraint makes the structure discovery problems computationally challenging. For example, given the data and a consistent scoring criterion¹, finding an optimal Bayesian network is an NP-hard problem [14, 17], even when the search is restricted to some classes of simple DAGs such as polytrees [22] or path graphs [74]². The feature probability variant is presumably similarly hard, though a formal proof is missing. The hardness of the structure discovery problems has prompted active development of various heuristics. Common approaches include greedy equivalence search [16], max-min hill climbing [107], Markov chain Monte Carlo (MCMC) [32, 39, 45, 72], and genetic algorithms [68].

Although structure discovery in Bayesian networks is hard, it is possible to solve structure discovery problems in small and moderate-sized networks exactly. Compared to the heuristics, exact algorithms that are guaranteed to find an optimal solution offer some benefits. As there is no (extra) uncertainty about the quality of the results, the user may concentrate on modeling the domain and interpreting the results. The benefits and theoretical interest have motivated the development of several exact score-based algorithms for structure discovery in Bayesian networks. Most of these algorithms deploy dynamic programming across node sets [27, 62, 65, 73, 78, 79, 86, 96, 98, 103, 112] and run in O*(2^n) time and space (under some usual modularity assumptions), where n is the number of nodes. The O*(·) notation is used throughout this thesis to hide factors polynomial in n. Recently, some new approaches, based on branch-and-bound [24, 25, 33] and linear programming [21, 54], have been introduced. These new methods are often fast in a good case, but we do not know whether they are as fast as dynamic programming in the worst case.

In this thesis, we study exact dynamic programming algorithms. Traditionally such exponential-time algorithms have been considered unsatisfactory. However, not all exponential algorithms are useless in practice and, as is the case with Bayesian networks, it is often possible to solve moderate-sized problem instances. Besides, for NP-hard problems there is no hope for (worst-case) polynomial time algorithms unless P equals NP.
This has motivated much research on exponential algorithms; see, for example, a recent textbook by Fomin and Kratsch [35].

¹ A scoring criterion is consistent if it is maximized by the “true” network structure when the size of the data approaches infinity.
² As an exception, an optimal Bayesian tree can be found in polynomial time [18].

Although both the time and the space requirement of the dynamic programming algorithms grow exponentially in n, the space requirement is the bottleneck in practice. Modern desktop computers usually have a few gigabytes of main memory, which enables the learning of networks of about 25 nodes with running times of a few hours. The state-of-the-art implementation by Silander and Myllymäki [96] uses the hard disk to store a part of the results. For a network of 29 nodes it consumes almost 100 gigabytes of space, while the running time is only about ten hours. For a 32-node network the space requirement is already almost 800 gigabytes. Therefore, in order to learn larger networks, reducing the space requirement is the top priority. Unfortunately, reducing the space requirement without increasing the time requirement seems to be a difficult task. This observation motivates the first research question in this thesis: How can one trade space against time efficiently in structure discovery in Bayesian networks?

To answer the previous question, the first contribution of this thesis is about space–time tradeoff schemes for structure discovery in Bayesian networks. We present algorithmic schemes based on dynamic programming for trading space against time for permutation problems in general; the schemes are then adapted for the Osd and Fp problems. We introduce a partial order approach on which we base most of our schemes. We also investigate the practicality of our approach by applying it with a particular family of partial orders called parallel bucket orders. It turns out that our schemes allow learning of larger networks than the previous dynamic programming algorithms. Furthermore, our schemes are easily parallelized.

Another research question in this thesis is motivated by the need to handle unobserved variables. In real life, we often encounter datasets in which the observed nodes are affected by some unobserved ones. However, although the score-based methods, especially the Bayesian ones, are flexible and are able to take into account all the prior knowledge, unobserved variables pose a computational problem for score-based methods. Often one either refuses to make any causal conclusions about the DAG or one ignores the possibility of unobserved nodes altogether and accepts an unquantified risk of erroneous claims. Therefore, we are interested in features that are preserved in the presence of unobserved nodes. The questions are: What kind of features are preserved when there are unobserved variables in play? How can one compute posterior probabilities of these features?

To answer the previous questions, we consider learning ancestor relations. One often interprets the arcs of a Bayesian network as causal cause–effect relationships. Thus, the existence of a path from node s to node t can be interpreted as s being either a direct or an indirect cause of t.

Compared to the arc probabilities, the path probabilities yield information also about indirect causes and hence they can be more interesting. In the ancestor relation problem, the goal is to output the posterior probability that there is a directed path between two given nodes. Current exact Bayesian algorithms [39, 65] are able to compute posterior probabilities of modular features like arcs; however, they cannot handle the more challenging nonmodular features such as ancestor relations. The second main contribution of this thesis is a Bayesian averaging algorithm for learning ancestor relations; as far as we know, it is the first non-trivial exact algorithm for computing posterior probabilities of nonmodular features. We also test the algorithm in practice. The empirical results suggest that ancestor relations can be learned almost as well as single arcs.

Overview of the thesis. A formal treatment of Bayesian networks, the basics of permutation problems, and a dynamic programming algorithm are given in Chapter 2. Important concepts related to space–time tradeoffs and two simple algorithmic schemes are presented in Chapter 3. Then we introduce the partial order approach in general (Chapter 4) and for parallel bucket orders in particular (Chapter 5). An algorithm for and empirical results about learning ancestor relations are presented in Chapter 6. We end this thesis with a discussion in Chapter 7.

Relation to the author’s prior work. Parts of the results concerning space–time tradeoffs (Chapters 3–5) have been published in the following three articles:

[64] Mikko Koivisto and Pekka Parviainen. A Space–Time Tradeoff for Permutation Problems. Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010.

[80] Pekka Parviainen and Mikko Koivisto. Exact Structure Discovery in Bayesian Networks with Less Space. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

[81] Pekka Parviainen and Mikko Koivisto. Bayesian Structure Discovery in Bayesian Networks with Less Space. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

The results concerning learning ancestor relations (Chapter 6) have been published in the following article:

[82] Pekka Parviainen and Mikko Koivisto. Ancestor Relations in the Presence of Unobserved Variables. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2011.

Chapter 2

Preliminaries

Let us begin this chapter with a formal introduction to Bayesian networks. Further, we introduce permutation problems in general and formulate the Osd and Fp problems as permutation problems. We end this chapter by presenting a basic dynamic programming algorithm for solving permutation problems. The algorithm is extended in later chapters to work in limited space. Dynamic programming also serves as a basis for the development of an algorithm for the ancestor relation problem.

2.1 Bayesian Networks

In this section, we focus on the aspects of Bayesian networks that are needed in this thesis; for a more thorough treatment, see, for example, recent textbooks by Koller and Friedman [66] or Neapolitan [75]. We also define three different computational problems related to structure discovery in Bayesian networks: the Osd problem, the Fp problem, and the ancestor relation problem.

2.1.1 Bayesian Network Representation

A Bayesian network consists of two parts: the structure and the parameters. The structure is represented by a directed acyclic graph (DAG) and the parameters determine the conditional probability distributions for each node.

A DAG is a pair (N, A), where N is the node set and A ⊆ N × N is the arc set. Note that a DAG is acyclic, that is, the arcs in A do not form directed cycles. A node u is said to be a parent of a node v if the arc set contains an arc from u to v, that is, uv ∈ A. The set of the parents of v is denoted by A_v. If u is a parent of v then v is a child of u. Further, a node u is said to be an ancestor of a node v in a DAG (N, A) if there is a directed path from u to v in A; this is denoted u ⇝ v. If u is an ancestor of v, then v is a descendant of u. When there is no ambiguity about the node set, we identify a DAG by its arc set. The cardinality of N is denoted by n.

Each node v corresponds to a random variable and the DAG expresses conditional independence assumptions between variables. Random variables u and v are said to be conditionally independent given a set S of random variables if Pr(u, v | S) = Pr(u | S) Pr(v | S). A DAG represents a joint distribution of the random variables if the joint distribution satisfies the Markov condition, that is, every variable is conditionally independent of its non-descendants given its parents. Now it remains to specify one such distribution. This can be done using local conditional probability distributions (CPD) which specify the distribution of a random variable given the variables corresponding to its parents A_v. CPDs are usually taken from a parametrized class of probability distributions, like discrete or Gaussian distributions. Thus, the CPD of variable v is determined by its parameters θ_v; the type and the number of parameters is specified by the particular class of probability distributions. The parameters of a Bayesian network are denoted by θ; they consist of the parameters of each CPD. Finally, a Bayesian network is a pair (A, θ).

Example 1. Let us consider a company whose employees have a tendency to be late to work. The company conducts a study on the factors which affect being late to work. Figure 2.1 shows a Bayesian network that summarizes the findings. Generally, the people who are not on time at work have either overslept or their bus has been late. The main reason for oversleeping is forgetting to set the alarm. Let us use the shorthands a, b, o, and t for the nodes “Alarm on?”, “Bus Late?”, “Overslept?”, and “In Time?”, respectively. For example, let us compute the probability that a person is at work on time when he has forgotten to set the alarm, that is, Pr(t = yes | a = no) = Pr(t = yes, a = no)/Pr(a = no). To compute the probabilities Pr(t = yes, a = no) and Pr(a = no) we need to marginalize out the variables whose values are not fixed. To this end, we get Pr(t = yes, a = no) = 0.1 × 0.9 × 0.2 × 0.1 + 0.1 × 0.9 × 0.8 × 0.3 + 0.1 × 0.1 × 0.2 × 0.2 + 0.1 × 0.1 × 0.8 × 0.9 = 0.031 and Pr(a = no) = 0.1, thus Pr(t = yes | a = no) = 0.031/0.1 = 0.31. In words, if a person has forgotten to set his alarm clock there is a 31 percent chance that he is at work on time.

Figure 2.1: A Bayesian network illustrating factors affecting being late to work. Each node is associated with a conditional probability table which determines the probabilities of different values of the particular variable (columns) given each configuration of the values of its parents (rows).
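To make the marginalization in Example 1 concrete, the following Python sketch computes Pr(t = yes | a = no) by summing the factorized joint distribution over the unfixed variables b and o. The DAG (a → o, o → t, b → t) follows the description in the example, but the conditional probability tables below are hypothetical placeholders rather than the values of Figure 2.1; only Pr(a = no) = 0.1 is taken from the example.

    from itertools import product

    # Hypothetical CPTs for the DAG a -> o, o -> t, b -> t; the numbers are
    # placeholders, except Pr(a = no) = 0.1 which appears in Example 1.
    pr_a = {"yes": 0.9, "no": 0.1}                       # Pr(a)
    pr_b = {"yes": 0.1, "no": 0.9}                       # Pr(b)
    pr_o = {("yes", "yes"): 0.05, ("no", "yes"): 0.95,   # Pr(o | a), keyed (o, a)
            ("yes", "no"): 0.60, ("no", "no"): 0.40}
    pr_t = {("yes", "no", "no"): 0.95, ("no", "no", "no"): 0.05,    # Pr(t | o, b),
            ("yes", "no", "yes"): 0.40, ("no", "no", "yes"): 0.60,  # keyed (t, o, b)
            ("yes", "yes", "no"): 0.30, ("no", "yes", "no"): 0.70,
            ("yes", "yes", "yes"): 0.10, ("no", "yes", "yes"): 0.90}

    def joint(a, b, o, t):
        # Factorization along the DAG: Pr(a) Pr(b) Pr(o | a) Pr(t | o, b).
        return pr_a[a] * pr_b[b] * pr_o[(o, a)] * pr_t[(t, o, b)]

    # Pr(t = yes, a = no): marginalize out b and o; then divide by Pr(a = no).
    num = sum(joint("no", b, o, "yes") for b, o in product(["yes", "no"], repeat=2))
    print("Pr(t = yes | a = no) =", num / pr_a["no"])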

The compactness of the Bayesian network representation can be illustrated by comparing the number of parameters needed when the joint distribution is factorized according to a Bayesian network structure to the number needed for the non-factorized joint distribution¹. To this end, let us consider discrete variables and let r_v be the number of distinct values of variable v. The conditional distribution associated with node v is determined by (r_v − 1) ∏_{u∈A_v} r_u parameters and thus the joint distribution can be represented with Σ_{v∈N} (r_v − 1) ∏_{u∈A_v} r_u parameters. On the other hand, without a factorization one needs (∏_{v∈N} r_v) − 1 parameters. For example, consider the joint distribution of 10 binary variables. Without a factorization one needs 2^10 − 1 = 1023 parameters. However, if the joint distribution can be represented by a Bayesian network with maximum indegree 3, the number of parameters needed is less than 10 × 2^3 = 80.

The likelihood of a Bayesian network, that is, the probability of the data given a Bayesian network, reflects how well the Bayesian network fits the data. The data D are an n × m matrix, where the rows are associated with the nodes. Each node v is associated with m random variables D_{v1}, D_{v2}, ..., D_{vm}, which constitute the vth row, D_v, of D. The m columns are often treated as observations, each column corresponding to one observation; observations are assumed to be independent and identically distributed. Given the data D on the node set N, the likelihood of a Bayesian network (A, θ) can be factorized according to the DAG A as follows:

¹ The latter case is equivalent to factorization according to a complete DAG.

Pr(D | A, θ) = ∏_{i=1}^{m} Pr(D_i | A, θ) = ∏_{i=1}^{m} ∏_{v∈N} Pr(D_{vi} | D_{A_v i}, A_v, θ_v).

The term Pr(D_{vi} | D_{A_v i}, A_v, θ_v) is called a local likelihood.

Sometimes the Bayesian network is unobserved or unknown to the modeler and one wants to learn it from the data. In the Bayesian approach, one first sets up a full probability model, that is, a joint probability distribution of all observed and unobserved variables, and then calculates the posterior distribution of the unobserved variables given the observed ones [43]. Thus, the unknown values, the DAG A and the parameters θ, can be included as variables in the probability model. To this end, we have a joint probability distribution of the data, the DAG, and the parameters. Based on the chain rule of probability calculus, the joint probability distribution factorizes as

Pr(D, A, θ) = Pr(A) Pr(θ | A) Pr(D | A, θ).

The terms Pr(A) and Pr(θ | A) are called a structure prior and a parameter prior, respectively. Next, we will discuss common assumptions concerning the priors.

Consider the structure prior Pr(A). In the Bayesian approach, a prior distribution reflects the beliefs of the modeler about the uncertainty of the value of a variable. For a structure prior, one commonly assumes a modular prior [95, 96, 105], which often enables convenient computation. A prior Pr(A) is modular if it can be written as Pr(A) = ∏_{v∈N} q_v(A_v), where q_v is a nonnegative function. A common modular prior is the uniform prior over all DAGs, that is, q_v(A_v) ∝ 1. In some cases using a modular prior in structure discovery, however, can lead to a significant increase in running time, as we will see in Section 2.3.2. Sometimes it is more convenient to assume an order-modular prior [39, 65]². To this end, we introduce a new random variable, a linear order L on

² We note that an order-modular prior often assigns different probabilities to the different members of a Markov equivalence class and favors networks that have several topological orderings. Although many researchers (see, for example, Silander [95] or Tian and He [105]) seem to consider the uniform prior as the default or “true” prior, from the Bayesian point of view assuming an order-modular prior is neither an advantage nor a disadvantage, besides the computational advantage. Indeed, an order-modular prior may sometimes better reflect the modeler’s subjective beliefs. One should also notice that generally we are only able to learn structures up to a Markov equivalence class and the uniform prior over DAGs is not uniform over equivalence classes, as pointed out by Friedman and Koller [39]. For a more thorough discussion about different priors, see, for example, Angelopoulos and Cussens [3].

N and define a joint prior of L and a DAG A, Pr(L, A). A linear order L on base-set N is a subset of N × N such that for all x, y, z ∈ N it holds that

1. xx ∈ L (reflexivity),

2. xy ∈ L and yx ∈ L imply x = y (antisymmetry),

3. xy ∈ L and yz ∈ L imply xz ∈ L (transitivity), and

4. xy ∈ L or yx ∈ L (totality).

If xy ∈ L we say that x precedes y. We write L_v for the set of nodes that precede v in L and say that L and A are compatible if A_v ⊆ L_v for all v, that is, L is a topological ordering of A. The joint prior is defined as

Pr(L, A) = 0 if L is not compatible with A, and Pr(L, A) = ∏_{v∈N} ρ_v(L_v) q_v(A_v) otherwise,

where ρ_v and q_v are nonnegative functions. The prior Pr(A) is obtained by marginalizing the joint prior over linear orders, that is,

Pr(A) = Σ_L Pr(L, A).

Note that if ρ_v(L_v) and q_v(A_v) are constant functions then the prior probability of a DAG is proportional to the number of its topological orderings.

Often one wants to restrict the number of possible parent sets of a node. To this end, one can specify a set of possible parent sets, F_v, for each node v. To guarantee that Pr(A) = 0 if A_v ∉ F_v for some v, it is sufficient (for both modular and order-modular priors) to specify that q_v(A_v) = 0 if A_v ∉ F_v. We will return to handling parent sets in Section 3.2.1.

The parameter prior Pr(θ | A) determines the probability of the parameters given a DAG. It is usually assumed (see, for example, Friedman and Koller [39]) that the prior admits global parameter independence, that is,

Pr(θ | A) = ∏_{v∈N} Pr(θ_v | A), and parameter modularity, that is, given two DAGs A and A′ in which A_v = A′_v, then Pr(θ_v | A) = Pr(θ_v | A′). Combining global parameter independence and parameter modularity yields Pr(θ | A) = ∏_{v∈N} Pr(θ_v | A_v).

As this thesis is concerned with structure discovery in Bayesian networks, the parameters of the conditional probability distributions play merely the role of nuisance variables. To this end, we integrate the parameters out and obtain the marginal likelihood Pr(D | A) of A. Heckerman et al. [47]³ show that if the parameter prior satisfies global parameter independence and parameter modularity then

Pr(D | A) = ∫ Pr(D | A, θ) Pr(θ | A) dθ
          = ∏_{v∈N} ∫ ∏_{i=1}^{m} Pr(D_{vi} | D_{A_v i}, A_v, θ_v) Pr(θ_v | A_v) dθ_v
          = ∏_{v∈N} Pr(D_v | D_{A_v}, A_v).

We call Pr(D_v | D_{A_v}, A_v) the local marginal likelihood. The local marginal likelihood expresses how well the data on the node set A_v ∪ {v} fit some class of conditional probability distributions specified by the modeler. For discrete variables, a common choice is to use a Bayesian score [13, 19, 47], which has a closed-form expression.

In the Bayesian approach, we are interested in the posterior probability Pr(A | D) of the DAG A given the data D, that is,

Pr(A | D) = Pr(D | A) Pr(A) / Pr(D),

where Pr(D) is the marginal probability of the data. When the data D are given, the posterior probability is proportional to the unnormalized posterior density, that is, Pr(A | D) ∝ Pr(D | A) Pr(A). Thus, assuming a modular structure prior,

Pr(A | D) ∝ ∏_{v∈N} Pr(D_v | D_{A_v}, A_v) q_v(A_v).

Often it is convenient to operate with the logarithm of the unnormalized posterior. To this end, we define the score of a DAG as

s(A) = Σ_{v∈N} s_v(A_v),

where s_v(A_v) = log Pr(D_v | D_{A_v}, A_v) + log q_v(A_v) is called a local score.

³ Actually, Heckerman et al. [47] derive the existence of the parameters θ given which the data columns are conditionally independent from the assumption of (infinitely) exchangeable data; an interested reader may refer to, for example, Bernardo [7] for more information about this alternative route to establish a fully specified (Bayesian network) model.

A function that assigns the score to a DAG based on the data is called a scoring criterion. A scoring criterion that factorizes according to the underlying DAG is said to be decomposable. For our purposes the scoring criterion can be considered a black-box function, as long as it is decomposable.

Next, we consider the causal interpretation of a Bayesian network. A Bayesian network is called causal if the arcs are interpreted as direct cause–effect relations, that is, an arc from node u to node v denotes that u is a direct cause of v. When a Bayesian network serves only as a representation of a probability distribution, knowing the causal relations does not add any value. From the knowledge discovery point of view, however, all information about causal relations between variables can be highly interesting. To this end, we discuss the concept of identifiability, which describes to what extent one can learn the structure of a causal Bayesian network from observational data.

To consider identifiability, it is essential to define Markov equivalence. Two DAGs are said to be Markov equivalent if they describe the same set of conditional independence statements. To see whether two DAGs are Markov equivalent, we need to define a skeleton and a v-structure. The skeleton of a DAG A is an undirected graph that is obtained by replacing all directed arcs uv ∈ A with undirected edges between u and v. Nodes s, t, and u form a v-structure in a DAG if there is an arc from s to u and from t to u and there is no arc between s and t. It is well-known that two networks belong to the same Markov equivalence class if and only if they have the same skeleton and the same set of v-structures [109].

A scoring criterion is said to be score equivalent if two Markov equivalent DAGs always have the same score. Thus, a score-equivalent score cannot distinguish two Markov equivalent DAGs based on observational data only, that is, the DAGs are nonidentifiable. This is often desired when a DAG is primarily interpreted as a set of independence constraints of some distribution. Using a score-equivalent score we are able to learn Markov equivalence classes, but the most likely network within a class is determined by the prior. There are also non-score-equivalent scores. A notable example is the K2 score [19], which is said to yield good results in practice despite the lack of score equivalence [15].

Due to nonidentifiability, causal Bayesian networks cannot usually be learned from observational data⁴. However, if we are able to intervene on

some variables, that is, to set them to user-specified values, we can infer the directions of arcs whose direction is not specified in the particular Markov equivalence class. It is also possible to infer causal relations by averaging over all DAGs; see the feature probability problem and the ancestor relation problem defined in the next section. The following example illustrates Markov equivalence.

Example 2. Consider the structure of the Bayesian network in Figure 2.1. The DAG has one v-structure, which consists of nodes b, o, and t (t in the middle). The DAG is Markov equivalent with the DAG {bt, oa, ot} (and with no other DAG). Note that if the arcs are interpreted to be causal, the DAGs {bt, ao, ot} and {bt, oa, ot} are not equivalent, as a is a cause of o in the former one but o is a cause of a in the latter one.

⁴ Full causal models can be learned, however, if the assumptions are suitable for that purpose. For example, linear causal models with non-Gaussian noise with no undetermined parameters can be learned from observational data; see, for example, Shimizu et al. [94].
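The Markov-equivalence criterion above (same skeleton, same set of v-structures) is easy to check mechanically. The following sketch, with DAGs given as arc lists, reproduces the conclusion of Example 2.

    def skeleton(arcs):
        # Undirected version of the arc set.
        return {frozenset(a) for a in arcs}

    def v_structures(arcs):
        # Triples (s, t, u) with arcs s -> u and t -> u and no edge between s and t.
        parents = {}
        for u, v in arcs:
            parents.setdefault(v, set()).add(u)
        skel = skeleton(arcs)
        return {(s, t, u)
                for u, ps in parents.items()
                for s in ps for t in ps
                if s < t and frozenset((s, t)) not in skel}

    def markov_equivalent(arcs1, arcs2):
        return (skeleton(arcs1) == skeleton(arcs2)
                and v_structures(arcs1) == v_structures(arcs2))

    # Example 2: {bt, ao, ot} and {bt, oa, ot} are Markov equivalent.
    print(markov_equivalent([("b", "t"), ("a", "o"), ("o", "t")],
                            [("b", "t"), ("o", "a"), ("o", "t")]))  # True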

2.1.2 Computational Problems

We will subsequently tackle three different structure discovery problems. To this end, it is essential to formally define the problems. A key task in the score-based approach is to find a maximum a posteriori (MAP) DAG, or equivalently, a DAG that maximizes the score. For our purposes it is convenient to define two variants of this problem with slightly different inputs. We will discuss the issues concerning the explicit and implicit input in detail in Section 3.2.1. Here F_v is the set of possible parent sets of node v.

Definition 2.1 (Optimal structure discovery (Osd) problem with explicit input) Given the values of the local scores s_v(Y) for v ∈ N and Y ∈ F_v, compute max_A s(A), where A runs through the DAGs on N.

Definition 2.2 (Optimal structure discovery (Osd) problem with implicit input) Given the data D on nodes N and a decomposable scoring criterion s_v(Y), which evaluates to 0 if Y ∉ F_v, compute max_A s(A), where A runs through the DAGs on N.

Note that in Osd the goal is to compute the score of an optimal DAG. In practice we are often more interested in an optimal DAG itself. It turns out (see Section 2.3.1) that an optimal DAG can be constructed with essentially no extra computational burden. Another remark is that the scoring criterion does not have to be based on the local marginal likelihood. The Osd problem remains reasonable even if a scoring criterion does not conveniently admit a structure prior.

For example, maximum likelihood, minimum description length [88], and Bayesian information criterion (BIC) [93] can be used as the scoring criterion.

The size of the search space of Osd, that is, the number of DAGs on n nodes, grows superexponentially with respect to n [89]. Later we will see that, thanks to the decomposable score, Osd can be solved significantly faster than by an exhaustive search.

Choosing just one DAG, even if it is optimal, can seem quite arbitrary: often there are several almost equally probable DAGs and, with large node sets, even the most probable DAGs tend to be very improbable. To circumvent this problem, one can, for example, take an average over the k best DAGs [106]. Another possibility is to find useful summarizations of the DAGs. To this end, one can take a full Bayesian model averaging approach (see, for example, Hoeting et al. [52]) and compute posterior probabilities of features of interest from the data. For Bayesian networks, natural candidates for such a summary feature are subgraphs, like arcs.

We associate such a structural feature f with an indicator function f(A) which evaluates to 1 if A has the feature f and 0 otherwise. For computational convenience, it is often practical to consider modular features. A feature f is modular if the indicator function factorizes as

f(A) = ∏_{v∈N} f_v(A_v),

where f_v(A_v) is a local indicator function. Note that modular features are local in the sense that for the presence of a modular feature, it is necessary and sufficient that n independent local conditions hold. For example, the indicator for an arc uv is represented by letting f_v(A_v) = 1 if u ∈ A_v and f_v(A_v) = 0 otherwise, and f_w(A_w) = 1 for all w ≠ v and all A_w. For another example, the indicator for a v-structure with arcs from nodes s and t to a node u is represented by letting f_u(A_u) = 1 if s, t ∈ A_u and f_u(A_u) = 0 otherwise, f_s(A_s) = 1 if t ∉ A_s and f_s(A_s) = 0 otherwise, f_t(A_t) = 1 if s ∉ A_t and f_t(A_t) = 0 otherwise, and f_w(A_w) = 1 for all other nodes w and all A_w.
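As an illustration of modularity, the arc indicator just described can be written directly as a product of local indicators. The sketch below assumes a DAG is given as a dictionary mapping each node to its parent set.

    def arc_indicator(u, v):
        # Local indicators f_w for the modular feature "arc u -> v is present".
        def f_w(w, parents_of_w):
            if w == v:
                return 1 if u in parents_of_w else 0
            return 1  # all other nodes place no constraint
        return f_w

    def feature_value(f_w, dag):
        # f(A) = product over nodes w of f_w(A_w).
        value = 1
        for w, parents in dag.items():
            value *= f_w(w, parents)
        return value

    # The DAG of Example 2: a -> o, o -> t, b -> t.
    dag = {"a": set(), "b": set(), "o": {"a"}, "t": {"b", "o"}}
    print(feature_value(arc_indicator("o", "t"), dag))  # 1: arc o -> t is present
    print(feature_value(arc_indicator("a", "t"), dag))  # 0: arc a -> t is absent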

Consider the computation of the posterior probability of a modular feature. We notice that Pr(f | D) = Pr(f, D)/Pr(D) and

Pr(f, D) = Σ_A Pr(f, D | A) Pr(A)
         = Σ_A Pr(f | A) Pr(D | A) Pr(A)
         = Σ_A f(A) Pr(D | A) Pr(A).   (2.1)

Here, the first equality is by the law of total probability, the second one by the independence of f and D given A, and the third one by the fact that Pr(f | A) = f(A).

Recall that the marginal likelihood Pr(D | A) decomposes into local terms Pr(D_v | D_{A_v}, A_v) whose values can be computed from the data efficiently. Furthermore, for computational convenience (see Section 2.3.2) we assume here an order-modular structure prior. Now we are ready to define two variants of the feature probability problem.

Definition 2.3 (Feature probability (Fp) problem with explicit input) Given the values of the local likelihoods Pr(D_v | D_{A_v}, A_v) for v ∈ N and A_v ∈ F_v, an order-modular structure prior Pr(A), and a modular structural feature f, compute the posterior probability Pr(f | D).

Definition 2.4 (Feature probability (Fp) problem with implicit input) Given the data D on nodes N, local marginal likelihoods Pr(D_v | D_{A_v}, A_v), an order-modular structure prior Pr(A), and a modular structural feature f, compute the posterior probability Pr(f | D).

The ancestor relation problem, that is, computing the posterior probability of a directed path between two nodes, is an extension of the Fp problem to nonmodular features, namely ancestor relations. The nonmodularity means that the indicator function f(A) of a nonmodular feature does not factorize into a product of local indicator functions. The problem is defined as follows.

Definition 2.5 (Ancestor relation problem) Given the data D on nodes N, a local likelihood function Pr(D_v | D_{A_v}, A_v), an order-modular structure prior Pr(A), and an ancestor relation u ⇝ v, u, v ∈ N, compute the posterior probability Pr(u ⇝ v | D).

2.2 Permutation Problems

Consider the Osd problem. The goal is to find a maximum of the sum of local scores under a global acyclicity constraint. To guarantee that the DAG is acyclic, one may fix a linear order and for each node choose the best parent set from the preceding nodes. To solve the Osd problem one has to consider all the linear orders on the nodes. Next, we introduce a class of computational problems called permutation problems, which are similar to the Osd problem in the sense that they can be solved in a similar fashion.

For another example of permutation problems, consider the traveling salesman problem (Tsp). In Tsp, a traveling salesman is given n cities and he tries to find the shortest route that visits every city exactly once and returns to the starting point. Every route can be seen as a permutation of the cities, that is, the order in which the cities are visited determines the route. The total length of the route, or the cost of the permutation, can be decomposed into a sum of local costs: the cost of moving from the ith city to the (i + 1)th city is the distance between those particular cities. Now the problem is essentially to minimize the cost over the permutations of cities. It is well-known that Tsp can be solved using a dynamic programming algorithm [6, 50]; we will return to dynamic programming in Section 2.3.

More generally, we define a class of problems called permutation or sequencing problems, which ask for a permutation of an n-element set that minimizes a given cost function. The cost function decomposes into local terms, where the local cost of an element depends only on the preceding elements. Besides Tsp, examples of such problems include the feedback arc set problem, the cutwidth problem, the treewidth problem, and the scheduling problem. In Section 2.3 we will show that all permutation problems can be solved using a generic dynamic programming algorithm.

Some of the aforementioned permutation problems are not minimization problems. Therefore, it is convenient to present the permutation problems in a more general setting. To this end, we start by introducing a commutative semiring; for more, see, for example, Aji and McEliece [1] or Koivisto [61]. Recall that a binary operation ∘ on a set R is associative if (x ∘ y) ∘ z = x ∘ (y ∘ z) for all x, y, z ∈ R and commutative if x ∘ y = y ∘ x for all x, y ∈ R.

A commutative semiring is a set R together with two binary operations ⊕ and ⊙, called addition and multiplication, which satisfy the following axioms:

1. The operation ⊕ is associative and commutative, and there is an additive identity element, called “0” or the zero element, such that s ⊕ 0 = s for all s ∈ R.

2. The operation ⊙ is associative and commutative, and there is a multiplicative identity, called “1” or the one-element, such that s ⊙ 1 = s for all s ∈ R.

3. The distributive law holds, that is, for all s, t, u ∈ R, (s ⊙ t) ⊕ (s ⊙ u) = s ⊙ (t ⊕ u).

The semiring is denoted as a triple (R, ⊕, ⊙).

We can clearly see that ordinary addition and multiplication on real numbers form a commutative semiring. Another example of a semiring is the min–sum semiring on non-negative numbers, that is, ([0, ∞), min, +), in which the Tsp problem was defined. Minimization is associative and commutative and it has a zero element ∞ such that min(s, ∞) = s for all s ∈ [0, ∞). Addition is associative and commutative and it has a one-element 0 such that s + 0 = s for all s ∈ [0, ∞). Finally, the distributive law holds as min(s + t, s + u) = s + min(t, u) for all s, t, u ∈ [0, ∞).

The semiring (R, ⊕, ⊙) is idempotent if x ⊕ x = x for all x ∈ R. For example, semirings in which the ⊕ operation is maximization or minimization are idempotent semirings. We also note that the Osd problem is defined in an idempotent semiring.

Let us return to formulating a permutation problem. A permutation σ(M) of an m-element set M is a sequence σ_1 σ_2 ⋯ σ_m, where σ_i ∈ M and σ_i ≠ σ_j if i ≠ j. When there is no ambiguity about the set M, we denote a permutation by σ. Sometimes it is convenient to represent a linear order as a permutation. A permutation σ and a linear order L on a set M are in a one-to-one relationship if and only if it holds that σ_i σ_j ∈ L if and only if i ≤ j. We can clearly see that for every permutation there exists a corresponding linear order. Therefore, we use the terms permutation and linear order interchangeably and choose whichever is more convenient.

A permutation problem is defined in an arbitrary semiring. To this end, assume that the problem is defined in a semiring (R, ⊕, ⊙). We assume that the cost c(σ) of a permutation σ decomposes into a product of n local costs:

c(σ) = ⊙_{j=1}^{n} c({σ_1, σ_2, ..., σ_j}, σ_{j−d+1} ⋯ σ_{j−1} σ_j).

The local cost of the jth element depends on the sequence of d preceding elements and, possibly, the set of all the j preceding elements. Here, if d > j we read σ_{j−d+1} ⋯ σ_{j−1} σ_j as σ_1 ⋯ σ_{j−1} σ_j, and if d = 0 the sequence

c(σ), σ where σ runs over the permutations of N. We note that permutation problems are so-called sum-product problems. For a constant d 0, we ≥ call any problem of this form a permutation problem of degree d. We are especially interested in permutation problems, whose local cost function can be evaluated in polynomial time with respect to the size of the input. These problems are called polynomial local time permutation problems. Next, we present some examples of polynomial local time permutation problems. A reader who is interested in permutation problems may refer to Fomin and Kratsch [35] though their definition of a permutation problem differs slightly from the one presented above. The traveling salesman problem (Tsp). The traveling salesman problem is a permutation problem of degree 2 in the min–sum semiring on real numbers, with c(Y, v) = 0 for Y = 1, and c(Y, uv), for Y > 1, equal- | | | | ing the weight of edge uv in the input graph with vertex set N, indifferent of Y . Strictly speaking, this computes the weight of the minimum weight of the Hamiltonian paths, not cycles. This can be fixed by adding the weight of the edge from v to the starting node to local costs c(N, v) for all nodes v. The feedback arc set problem (Feedback). Given a directed graph A, find the size (or weight) of the smallest (or lightest) arc set A′ such that A A′ is acyclic. The feedback arc set problem is a permutation problem \ of degree 1 in the min–sum semiring, with c(Y, v) equaling the number (or total weight) of arcs from Y v to v in the input graph. \{ } The cutwidth problem (Cutwidth). Given an undirected graph G = (V,E) and a linear order L, the cutwidth of the linear order is the maximum number of arcs that a line inserted between two consecutive nodes cuts. The cutwidth of graph G is the minimum cutwidth over all possible linear orders. The cutwidth problem is a permutation problem of degree 0 in the min–max semiring, with c(Y ) equaling the number of edges with one endpoint in Y and another in V Y . \ The treewidth problem (Treewidth). A tree decomposition of an undirected graph G = (V,E) is a pair (T,X) where X = X ,X ,...,X { 1 2 k} is a family of of V and T is a tree whose nodes are the subsets Xi such that (i) X = V , (ii) if uv E, then u X and v X for at least i i ∈ ∈ i ∈ i 20 2 Preliminaries one i, and (iii) if v X and v X , then v is also a member of every set ∈ i ∈ j along the path between Xi and Xj. The width of a tree decomposition is the size of the largest set Xi minus one. The treewidth of a graph G is the smallest width over all of its tree decompositions. The treewidth problem is a permutation problem of degree 1 in the min–max semiring, with c(Y, v) equaling the number of vertices in V Y that have a neighbor in the unique \ component of the induced subgraph G[Y ] containing v; see, for example, Bodlaender et al. [11]. The scheduling problem (Scheduling). Given n jobs J1,J2,...,Jn such that the execution of job Ji takes time ti, there is an associated cost di(t) for finishing job Ji at time t, the jobs are executed on a single machine and the execution of a job cannot be interrupted until it is finished, the task is to find the order of jobs that minimizes the total cost. The scheduling problem5 is a permutation problem of degree 1 in the min–sum semiring, with cost c(Y, v) equaling dv(t), where t = u∈Y tu. In all the presented problems, the local costs can be computed efficiently, that is, in polynomial time. However, finding the globally optimal permu- tation is more difficult. 
In summary, Tsp [41], Feedback [56], Cutwidth [42], Treewidth [4], and Scheduling [70] are all NP-hard.

It should be noted that the formulation of a polynomial local time permutation problem is quite loose and does not require that a permutation problem has anything to do with the cost of a permutation. In particular, it is possible to formulate a problem in such a way that one local cost function solves the problem and the others do nothing. Thus, all problems that can be solved in polynomial time, for example, can also be formulated as polynomial local time permutation problems. However, all time bounds that we will present are exponential and hence they are not particularly interesting for problems that are known to be solvable in polynomial time. The results hold, of course, for the “easier” permutation problems, too. Thus, we hope that whenever we consider polynomial local time permutation problems, the reader perceives them as problems such as Tsp, Feedback, Treewidth, Cutwidth, and Scheduling, which all have a natural interpretation for the cost of a permutation.

As the permutation problems operate in semirings, it is convenient to analyze their time and space requirements in terms of semiring operations. The time requirement is the total number of semiring additions and multiplications needed during the execution of an algorithm. The space requirement is the maximum number of values stored at any point during execution. It is also convenient to assume that the local costs are either given or they can be computed in polynomial time and space using a black-box function whenever needed. This holds for all of the aforementioned problems.

Next, we will formulate the Osd and Fp problems as permutation problems. We will later see that they are permutation problems with a local cost whose computation can take exponential time. We will also briefly discuss the ancestor relation problem, although we do not know whether it can be efficiently formulated as a permutation problem.

⁵ This problem is sometimes also called the sequencing problem. This is not to be confused with the alternative name of the permutation problem.

2.2.1 The Osd Problem

Recall that in the Osd problem the goal is to maximize the sum

s(A) = Σ_{v∈N} s_v(A_v),

where A runs over the DAGs on N. Next, we formulate the Osd problem as a permutation problem.

We notice that due to the acyclicity every DAG has at least one node, called a sink, that has no outgoing arcs. Thus, the nodes of every DAG A can be topologically ordered, that is, if uv ∈ A, then u precedes v. In particular, an optimal DAG can be topologically ordered. Given a topological order L of A, node v is a sink of the subgraph induced by the set L_v. Thus, given a permutation (or linear order) σ, the score of an optimal DAG compatible with σ can be found by choosing for each node the best parents from its predecessors (excluding the node itself)⁶. To this end, define

ŝ_v(Y) = max_{X⊆Y} s_v(X).   (2.2)

Intuitively, ŝ_v(Y) is the highest local score when the parents of node v are chosen from Y. Now, the score of an optimal DAG compatible with σ can be written as s(σ) = Σ_{j=1}^{n} ŝ_{σ_j}({σ_1, σ_2, ..., σ_{j−1}}). The score of an optimal DAG can be found by maximizing the sum of local scores over all orders, that is, by computing max_σ s(σ); in Section 2.3 we will show how to construct the DAG that maximizes the score. We observe the following.

Observation 2.6 The Osd problem is a permutation problem of degree 1 in a max–sum semiring.

⁶ Note that by the definition of a linear order, every node precedes itself. However, due to the acyclicity constraint, a node cannot be its own parent.
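To make the permutation formulation of Osd concrete, here is a brute-force sketch that evaluates it directly: ŝ_v(Y) is computed by maximizing a black-box local score over the subsets of Y, and the optimal score is the maximum of s(σ) over all permutations. It only illustrates the formulation; the dynamic programming algorithm of Section 2.3 avoids enumerating the n! permutations.

    from itertools import chain, combinations, permutations

    def subsets(Y):
        return chain.from_iterable(combinations(Y, k) for k in range(len(Y) + 1))

    def s_hat(local_score, v, Y):
        # Equation (2.2): s_hat_v(Y) = max over X subseteq Y of s_v(X).
        return max(local_score(v, frozenset(X)) for X in subsets(Y))

    def best_score(nodes, local_score):
        # max over permutations sigma of sum_j s_hat_{sigma_j}({sigma_1, ..., sigma_{j-1}}).
        return max(
            sum(s_hat(local_score, v, sigma[:j]) for j, v in enumerate(sigma))
            for sigma in permutations(nodes)
        )

    # Toy local score (hypothetical): only the arc a -> b is rewarded.
    toy = lambda v, X: 1.0 if (v == "b" and "a" in X) else 0.0
    print(best_score(["a", "b", "c"], toy))   # 1.0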

2.2.2 The Fp Problem

In the Fp problem the goal is to compute the posterior probability of a modular feature f given data D, that is, Pr(f | D). To this end, it is essential to compute the joint probability Pr(f, D). We notice that

Pr(f, D) = Σ_A f(A) Pr(D | A) Pr(A)
         = Σ_A Σ_{L⊇A} f(A) Pr(D | A) Pr(L, A)
         = Σ_L Σ_{A⊆L} ∏_{v∈N} f_v(A_v) Pr(D_v | D_{A_v}, A_v) ρ_v(L_v) q_v(A_v)
         = Σ_L ∏_{v∈N} ρ_v(L_v) Σ_{A_v⊆L_v} f_v(A_v) Pr(D_v | D_{A_v}, A_v) q_v(A_v).   (2.3)

Here, the first equality is equation (2.1), the second equality holds by the definition of an order-modular prior and by the fact that an order-modular prior vanishes when the DAG is not compatible with the order, the third one by the decomposability of the likelihood, the modularity of the feature and the prior, and the commutativity of addition, and the fourth one by the distributivity of multiplication over addition.

Now, let us define a local score β_v for each v and A_v as

β_v(A_v) = f_v(A_v) Pr(D_v | D_{A_v}, A_v) q_v(A_v).

Further, for each v and set L_v we define the sum

$\alpha_v(L_v) = \rho_v(L_v) \sum_{A_v \subseteq L_v} \beta_v(A_v).$

The sum on the right is known as the zeta transform7 of $\beta_v$. We will return to the computation of the zeta transform in Section 2.3.2. Now, we are ready to express $Pr(f, D)$ as

$Pr(f, D) = \sum_L \prod_{v \in N} \alpha_v(L_v), \qquad (2.4)$

where $\alpha_v(L_v)$ depends only on the set of preceding nodes. To finalize the computation of $Pr(f \mid D)$, we notice that the probability $Pr(D)$ can be computed like $Pr(f, D)$ but with the feature f replaced by a trivial indicator function that evaluates to 1 everywhere. Based on the equation (2.4), we observe the following.

7In some papers the zeta transform is referred to as the Möbius transform. In this thesis, however, we use the terminology by Rota [90].

Observation 2.7 The Fp problem is a permutation problem of degree 1 in a sum–product semiring.

2.2.3 The Ancestor Relation Problem

Unlike for the Osd and Fp problems, we are not aware of any efficient way to formulate the ancestor relation problem as a permutation problem, even with an order-modular prior. The difference compared to Osd and Fp is that the indicator function for an ancestor relation, which is a nonmodular feature, cannot be factorized as simply as the indicator function of a modular feature.

2.3 Dynamic Programming

Next, we present a standard way to solve a permutation problem. A naïve algorithm would go through all permutations one by one, thus taking time proportional to n!. However, we can do much better than that: a dynamic programming algorithm that goes through all $2^n$ node subsets will do. Such algorithms have been known for decades for permutation problems such as Tsp [6, 50], Feedback [69], and Treewidth [4], and recently such dynamic programming techniques have been shown to work also for Osd [78, 96, 98] and Fp [65].

Let us consider Tsp on cities N. Let w(u, v) be the distance from a city u to a city v. Without any loss of generality, we can choose a city $s \in N$ to be the starting point of the route. We define g(Y, v) for all $Y \subseteq N \setminus \{s\}$ and $v \in Y$ to be the length of the shortest route from s to v via all cities in Y. Now g(Y, v) can be computed by dynamic programming according to the following recurrences:

$g(Y, v) = w(s, v)$ if $|Y| = 1$,
$g(Y, v) = \min_{u \in Y \setminus \{v\}} \{g(Y \setminus \{v\}, u) + w(u, v)\}$ if $1 < |Y| < n - 1$,
$g(Y, v) = \min_{u \in Y \setminus \{v\}} \{g(Y \setminus \{v\}, u) + w(u, v)\} + w(v, s)$ if $|Y| = n - 1$.

The first equation clearly holds for routes with one stage. Then we start building routes with more stages. For every set Y and a city v we find a city u, the second to last city in the route, that minimizes the total route length (the second equation). Finally, in the third equation we add the distance from the last city to the starting point. The length of the shortest tour is the minimum of $g(N \setminus \{s\}, v)$ over $v \in N \setminus \{s\}$.
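The recurrences above translate directly into code. The following is a small Python sketch (ours, not part of the thesis); the function name tsp_shortest_tour, the distance matrix w, and the toy instance are illustrative assumptions.

from itertools import combinations

def tsp_shortest_tour(w):
    # Held-Karp style dynamic programming following the three recurrences
    # above; w[u][v] is the distance from city u to city v and city 0 plays
    # the role of the fixed starting point s.
    n = len(w)
    s = 0
    cities = [v for v in range(n) if v != s]
    g = {}                                   # g[(Y, v)]: shortest s -> v route via Y
    for v in cities:                         # |Y| = 1
        g[(frozenset([v]), v)] = w[s][v]
    for size in range(2, n):                 # 1 < |Y| <= n - 1
        for Y in combinations(cities, size):
            Y = frozenset(Y)
            for v in Y:
                g[(Y, v)] = min(g[(Y - {v}, u)] + w[u][v] for u in Y - {v})
    full = frozenset(cities)
    # close the tour by adding the distance back to the starting point
    return min(g[(full, v)] + w[v][s] for v in full)

w = [[0, 2, 9, 10], [1, 0, 6, 4], [15, 7, 0, 8], [6, 3, 12, 0]]
print(tsp_shortest_tour(w))                  # prints 21 for this toy instance

As in the analysis below, the sketch stores $O(n2^n)$ values and spends O(n) time per value.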

The algorithm computes and stores the route lengths for each subset Y and city $v \in Y$, that is, for $O(n2^n)$ pairs (Y, v). For each pair the computation takes O(n) time. Thus, the total time and space requirements of the algorithm are $O(n^2 2^n)$ and $O(n2^n)$, respectively.

The dynamic programming algorithm can be generalized to permutation problems in general. Let σ(Y) denote a permutation of the elements in the set Y. In what follows we use the shorthand $l = |Y|$. For any $Y \subseteq N$ and distinct elements $v_1, v_2, \ldots, v_{d-1} \in Y$ we define $g(Y, v_{d-1} \cdots v_2 v_1)$ as

$g(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{\substack{\sigma(Y):\; \sigma_{l-d+2} \cdots \sigma_{l-1}\sigma_l = v_{d-1} \cdots v_2 v_1}} \bigodot_{j=1}^{l} c(\{\sigma_1, \sigma_2, \ldots, \sigma_j\}, \sigma_{j-d+1} \cdots \sigma_{j-1}\sigma_j). \qquad (2.5)$

Note that the summation is over all permutations of the elements in Y ending with $v_{d-1} \cdots v_2 v_1$. Further, the sum of c(σ) over all permutations σ equals

$\bigoplus_{\substack{v_1, v_2, \ldots, v_{d-1} \in N \\ v_i \neq v_j \text{ if } i \neq j}} g(N, v_{d-1} \cdots v_2 v_1).$

Because multiplication distributes over addition in a semiring, we get the following dynamic programming equations. For permutation problems of degree 0, we have

$g(\emptyset) = 1$
$g(Y) = \left(\bigoplus_{u \in Y} g(Y \setminus \{u\})\right) \odot c(Y)$ for $l \geq 1$.

A straightforward dynamic programming algorithm computes the sum in $O(n2^n)$ time and $O(2^n)$ space. For permutation problems of degree d > 0, we have

$g(\emptyset, \emptyset) = 1,$
$g(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{v_d \in Y \setminus \{v_1, v_2, \ldots, v_{d-1}\}} g(Y \setminus \{v_1\}, v_d v_{d-1} \cdots v_2) \odot c(Y, v_d v_{d-1} \cdots v_2 v_1)$ for $l \geq 1$,

where the latter recurrence is computed for all nonempty $Y \subseteq N$ and $v_i \in Y$ with $v_i \neq v_j$ if $i \neq j$. This yields a straightforward dynamic programming algorithm that computes $g(N, v_{d-1} \cdots v_2 v_1)$ for all $v_1, v_2, \ldots, v_{d-1} \in N$ with $v_i \neq v_j$ if $i \neq j$, and hence the sum of c(σ) over all permutations on N, in $O^*(2^n)$ time and space.

The polynomial factors in the time requirement depend on two things: the degree of the permutation problem and the time requirement of computing the local costs. For each set Y, there are at most $\binom{l}{d-1}(d-1)! = O(l^{d-1})$ different sequences $v_{d-1} \cdots v_2 v_1$ and thus $g(Y, v_{d-1} \cdots v_2 v_1)$ is computed and stored at most $O(n^{d-1})$ times per set. As the computation of $g(Y, v_{d-1} \cdots v_2 v_1)$ for a given set and sequence takes O(n) time, the polynomial factor for a permutation problem of degree d > 0, while counting semiring operations, is $O(n^d)$ for time and $O(n^{d-1})$ for space. The following theorems summarize the bounds for time and space.

Theorem 2.8 Permutation problems of degree d can be solved in $O(n^{d'} 2^n)$ time and $O(n^{d'-1} 2^n)$ space, where $d' = \max\{1, d\}$, assuming that the local costs are precomputed and can be evaluated in constant time.

Theorem 2.9 Polynomial local time permutation problems of degree d, where d is a constant, can be solved in $O^*(2^n)$ time and space.

These bounds are the best we know for most of the classical permutation problems we presented in Section 2.2. As an exception, Fomin and Villanger [36] have presented an algorithm for Treewidth that runs in $O^*(1.7549^n)$ time and space.

Next, we will consider the Osd and Fp problems. Here, the difference compared to the polynomial local time permutation problems is that the computation of local costs can take up to exponential time. We will show, however, that compared to the polynomial local time permutation problems the computation of local costs in Osd and Fp increases the time requirement only by a factor linear in n.

2.3.1 The Osd Problem

The dynamic programming algorithm for permutation problems in general applies to the Osd problem: we go through all the node sets, tabulating the intermediate results for the sets of the first i nodes of the order, for i = 0, 1, ..., n. Let $\hat{s}_v(Y)$ be the score of the best parent set of a node v when its parents have to be chosen from the set $Y \subseteq N \setminus \{v\}$; see the equation (2.2). We define $g(\emptyset) = 0$ and for nonempty sets $Y \subseteq N$ recursively

$g(Y) = \max_{v \in Y} \{g(Y \setminus \{v\}) + \hat{s}_v(Y \setminus \{v\})\}. \qquad (2.6)$

In words, $g(Y \setminus \{v\}) + \hat{s}_v(Y \setminus \{v\})$ is the maximum score of the DAGs on a node set Y when the node v is the sink, that is, the parents of v are selected from $Y \setminus \{v\}$.

We notice that the naïve computation of $\hat{s}_v(Y)$ for a fixed Y and v requires $O(2^{|Y|})$ basic operations. Thus, the total number of basic operations for computing $\hat{s}_v(Y)$ for all Y is proportional to $n \sum_{k=0}^{n-1} \binom{n-1}{k} 2^k = n3^{n-1}$ (the equality holds by the binomial theorem; see the equation (A.1) in Appendix A), which is larger than the $O^*(2^n)$ required by the dynamic programming algorithm given precomputed local scores. However, the following observation, which states that the best parent set of v chosen from Y is either Y itself or the best parent set chosen from $Y \setminus \{u\}$ for some $u \in Y$, helps us to reduce the time requirement.

Lemma 2.10 (Ott and Miyano [79]) If $v \in N$ and $Y \subseteq N \setminus \{v\}$, then

$\hat{s}_v(Y) = \max\{s_v(Y), \max_{u \in Y} \hat{s}_v(Y \setminus \{u\})\}.$

Proof. By the definition of $\hat{s}_v(Y)$, $\max_{u \in Y} \hat{s}_v(Y \setminus \{u\})$ equals the maximum over the union of the families $2^{Y \setminus \{u\}}$, where u runs over the elements in Y. We observe that now $\hat{s}_v(Y)$ equals the maximum over the union of $\{Y\}$ and $\bigcup_{u \in Y} 2^{Y \setminus \{u\}}$, which is the maximum over $2^Y$. This, for one, is the definition of $\hat{s}_v(Y)$.

Using this recurrence8, the computation of $\hat{s}_v(Y)$ for a fixed v and Y takes no more than n comparisons, yielding the total time requirement $O(n^2 2^n)$ and the total space requirement $O(n2^n)$ for computing $\hat{s}_v(Y)$ for all v and Y. Assuming that the values $\hat{s}_v(Y)$ have been computed and stored, the value g(N) can be computed in $O(n2^n)$ time and $O(2^n)$ space. Thereby, the total time and space requirements of the algorithm are $O(n^2 2^n)$ and $O(n2^n)$, respectively. This is almost optimal since the input, that is, the local scores, might already contain $n2^{n-1}$ values.

If all the intermediate results are kept in memory, the space requirement of the previous algorithm is $O(n2^n)$. However, we notice that the computation of g(Y) and $\hat{s}_v(Y)$ requires results only from the sets of size $|Y| - 1$. Therefore, we can compute both functions simultaneously and proceed levelwise, that is, in increasing cardinality of Y. While computing scores at level ℓ we need to keep values for only $O(\binom{n}{\ell} + \binom{n}{\ell-1})$ sets in memory. This reduces the space requirement to $O(\sqrt{n}\, 2^n)$ [6, 79].

8Another way to compute the values of $\hat{s}_v$ would be to use a generalized version of Yates's algorithm; see, for example, Koivisto [61] and Koivisto and Sood [65].

Above, we computed the score of an optimal DAG. Next, we show how to find the DAG itself. The construction of an optimal DAG is quite straightforward; see Algorithm 1. In algorithm descriptions, we use the bracket notation to emphasize the program variables that are used while running the algorithm, that is, for example $\hat{s}_v[Y]$ corresponds to the target value $\hat{s}_v(Y)$. We also assume that the arg max operation returns a single arbitrary element that maximizes the expression in question. First, when computing $\hat{s}_v[Y]$ for sets $Y \subseteq N \setminus \{v\}$, for every v and Y we keep track of the best parent set of v when the parents are chosen from Y, that is, $bps_v[Y] = \arg\max_{X \subseteq Y} s_v(X)$. Further, while computing the recurrence (2.6) we keep track of the sink nodes. Once we have found the score of an optimal network, we construct the corresponding DAG A by backtracking. First, we initialize the DAG A on node set N to be an empty graph. We start by setting $Y \leftarrow N$ and find the sink v of Y. Then we assign the parent set $bps_v[Y \setminus \{v\}]$ to node v in graph A. Then we move to the set $Y \leftarrow Y \setminus \{v\}$ and find its sink and the corresponding best parent set. We continue this procedure until we reach the empty set. At this point, the DAG A is an optimal DAG.

Algorithm 1 Dynamic programming for Osd.
Input: Local scores $s_v(Y)$ for all $v \in N$ and $Y \subseteq N \setminus \{v\}$.
Output: An optimal DAG A.
1: for all $Y \subseteq N$ in increasing cardinality do
2:   for all $v \in N \setminus Y$ do
3:     $\hat{s}_v[Y] \leftarrow \max\{s_v(Y), \max_{u \in Y} \hat{s}_v[Y \setminus \{u\}]\}$
4:     $bps_v[Y] \leftarrow \arg\max\{s_v(Y), \max_{u \in Y} \hat{s}_v[Y \setminus \{u\}]\}$
5:   end for
6:   $g[Y] \leftarrow \max_{v \in Y} \{g[Y \setminus \{v\}] + \hat{s}_v[Y \setminus \{v\}]\}$
7:   $sink[Y] \leftarrow \arg\max_{v \in Y} \{g[Y \setminus \{v\}] + \hat{s}_v[Y \setminus \{v\}]\}$
8: end for
9: A is an empty graph.
10: $Y \leftarrow N$
11: while $Y \neq \emptyset$ do
12:   $v \leftarrow sink[Y]$
13:   Add arcs from $bps_v[Y \setminus \{v\}]$ to v to A.
14:   $Y \leftarrow Y \setminus \{v\}$
15: end while
16: return A
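As an illustration of Algorithm 1 and the recurrence of Lemma 2.10, here is a self-contained Python sketch (ours, not part of the thesis). It assumes that the local scores are available through a callable s(v, X), a name of our own choosing, and that every subset is a possible parent set.

from itertools import combinations

def optimal_dag(nodes, s):
    # Dynamic programming for Osd: compute s_hat and bps via Lemma 2.10,
    # g and sink via recurrence (2.6), and finally backtrack to build
    # an optimal DAG as a mapping node -> parent set.
    N = frozenset(nodes)
    subsets = [frozenset(Y) for k in range(len(nodes) + 1)
               for Y in combinations(sorted(nodes), k)]     # increasing cardinality
    s_hat, bps, g, sink = {}, {}, {frozenset(): 0.0}, {}
    for Y in subsets:
        for v in N - Y:
            best, best_set = s(v, Y), Y                      # parents = Y itself ...
            for u in Y:                                      # ... or best within Y \ {u}
                if s_hat[(v, Y - {u})] > best:
                    best, best_set = s_hat[(v, Y - {u})], bps[(v, Y - {u})]
            s_hat[(v, Y)], bps[(v, Y)] = best, best_set
        if Y:
            v0 = max(Y, key=lambda v: g[Y - {v}] + s_hat[(v, Y - {v})])
            g[Y], sink[Y] = g[Y - {v0}] + s_hat[(v0, Y - {v0})], v0
    dag, Y = {}, N                                           # backtracking phase
    while Y:
        v = sink[Y]
        dag[v] = bps[(v, Y - {v})]
        Y = Y - {v}
    return dag

Note that the sketch stores values for all subsets; the levelwise evaluation discussed above reduces the space needed for computing the optimal score.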

2.3.2 The Fp Problem

The probability $Pr(f, D)$ given the values of $\alpha_v(Y)$ can be computed straightforwardly by dynamic programming in a similar fashion as for Osd: we define the recurrence $g(\emptyset) = 1$ and

$g(Y) = \sum_{v \in Y} g(Y \setminus \{v\})\, \alpha_v(Y \setminus \{v\}) \qquad (2.7)$

for every nonempty set $Y \subseteq N$. Finally, g(N) equals $Pr(f, D)$. Here, g is called a forward function as it computes the contribution of the set Y to the target probability assuming that Y are the first $|Y|$ nodes in the order. Later in Section 4.4.2 we will introduce a related backward function.

The difficulty lies in the computation of $\alpha_v(Y)$. As it is a sum over all subsets of Y, the computation for fixed v and Y takes $O(2^{|Y|})$ time and the total time for all Y is $O(n3^n)$. However, this can be computed faster as we notice that $\alpha_v$ is essentially the zeta transform of $\beta_v$ and there is a known fast algorithm for the zeta transform; next, we will discuss the algorithm.
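Before turning to that algorithm, note that the forward recurrence (2.7) itself is only a few lines of code once the α values are available. A short Python sketch (ours), where alpha(v, Y) is an assumed callable returning $\alpha_v(Y)$:

from itertools import combinations

def forward_probability(nodes, alpha):
    # g(Y) = sum over v in Y of g(Y \ {v}) * alpha_v(Y \ {v}); g(N) = Pr(f, D)
    N = sorted(nodes)
    g = {frozenset(): 1.0}
    for k in range(1, len(N) + 1):
        for Y in combinations(N, k):
            Y = frozenset(Y)
            g[Y] = sum(g[Y - {v}] * alpha(v, Y - {v}) for v in Y)
    return g[frozenset(N)]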

Let h be a function from the subsets of an n-element set N to the real numbers. The zeta transform of h is another function hζ defined by

$(h\zeta)(Y) = \sum_{X \subseteq Y} h(X), \quad Y \subseteq N.$

Next, we will show that given h, the zeta transform for all $Y \subseteq N$ can be computed in $O(n2^n)$ time and $O(2^n)$ space. The fast zeta transform algorithm is based on an idea presented by Yates in 1937 [111], restated by Knuth [59], and later rediscovered at least by Kennes and Smets [57, 58]. The fast zeta transform has been used to develop faster algorithms for several combinatorial problems, such as graph coloring [63] and Steiner tree [76], and combinatorial tools like the fast subset convolution [9].

Algorithm 2 computes the zeta transform over N. Here we assume, without any loss of generality, that $N = \{1, 2, \ldots, n\}$. The idea of the algorithm is to conduct the "big" zeta transform in n phases, doing one "little" transform at a time.

Theorem 2.11 Algorithm 2 works correctly.

Proof. We show by induction on j that the computed function satisfies

$h_j(Y) = \sum_{X \in (Y)_j} h(X),$

where we use the shorthand $(Y)_j = \{X \subseteq Y : Y \cap \{j+1, j+2, \ldots, n\} \subseteq X\}.$

Algorithm 2 The fast zeta transform.
Input: h(X) for $X \subseteq N$.
Output: $(h\zeta)(Y)$ for $Y \subseteq N$.
1: for $Y \subseteq N$ do
2:   Let $h_0(Y) \leftarrow h(Y)$.
3: end for
4: for $j \leftarrow 1, \ldots, n$ do
5:   for $Y \subseteq N$ do
6:     if $j \in Y$ then
7:       Let $h_j(Y) \leftarrow h_{j-1}(Y) + h_{j-1}(Y \setminus \{j\})$.
8:     else
9:       Let $h_j(Y) \leftarrow h_{j-1}(Y)$.
10:     end if
11:   end for
12: end for
13: return Zeta transforms $h_n(Y)$ for all $Y \subseteq N$.


This will suffice, since $(Y)_n$ consists of all the subsets of Y. We proceed by induction on j and the size of the set Y.

(i) Trivially $(Y)_0 = \{Y\}$. Therefore, $h_0(Y) = h(Y) = \sum_{X \in (Y)_0} h(X)$, as claimed.

(ii) Assume then that $h_{j-1}(Y) = \sum_{X \in (Y)_{j-1}} h(X)$. First, consider the case $j \notin Y$. Now $(Y)_j = (Y)_{j-1}$ because $Y \cap \{j+1, j+2, \ldots, n\} = Y \cap \{j, j+1, \ldots, n\}$. Thus $h_j(Y) = h_{j-1}(Y)$, as correctly computed at line 9. Second, consider the case $j \in Y$. To prove the correctness of line 7, it suffices, by the induction assumption, to show that $(Y)_{j-1}$ and $(Y \setminus \{j\})_{j-1}$ are disjoint and that their union is $(Y)_j$. To this end, it suffices to observe that $(Y)_{j-1}$ consists of all subsets $X \supseteq Y \cap \{j, j+1, \ldots, n\}$ of Y that do contain j, whereas $(Y \setminus \{j\})_{j-1}$ consists of all subsets $X \supseteq Y \cap \{j+1, j+2, \ldots, n\}$ of Y that do not contain j.
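Algorithm 2 is particularly compact when subsets are encoded as bitmasks. The following Python sketch (ours, not part of the thesis) performs the n phases in place over an array indexed by bitmask.

def fast_zeta_transform(h, n):
    # h is a list of length 2**n indexed by bitmasks over N = {0, ..., n-1};
    # after phase j, hz[Y] equals the sum of h[X] over X in (Y)_j.
    hz = list(h)
    for j in range(n):
        bit = 1 << j
        for Y in range(1 << n):
            if Y & bit:
                hz[Y] += hz[Y ^ bit]     # add the part that does not contain j
    return hz

h = [0, 1, 1, 0, 1, 0, 0, 0]             # indicator of singletons on N = {0, 1, 2}
print(fast_zeta_transform(h, 3))         # [0, 1, 1, 2, 1, 2, 2, 3], i.e. |Y| for each Y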

The following example illustrates the fast zeta transform.

Example 3. The fast zeta transform of a function g(Y) for subsets of $N = \{1, 2, 3\}$ is illustrated in Figure 2.2. Each column corresponds to a subset of N and the boxes correspond to the values of $h_j$. The first row corresponds to the initialization of the values $h_0(Y)$ to g(Y) and the other rows to the computations of $h_j$ for j = 1, 2, 3. At each phase the value of $h_j$ for a set Y is obtained by summing the values that correspond to the tails of the arcs pointing to $h_j(Y)$. The bottom row contains the values $h_n(Y)$, that is, the zeta transform of g at Y. Figure 2.2 shows that there is exactly one directed path from each subset of Y to Y, that is, all subsets contribute to the sum exactly once. Also, there is no directed path to Y from any set that is not a subset of Y; thus only the subsets of Y contribute to the sum.

Figure 2.2: The fast zeta transform on the set $N = \{1, 2, 3\}$. Each column corresponds to a subset of N ({}, 1, 2, 3, 12, 13, 23, 123) and the three phases (I, II, III) correspond to the elements j = 1, 2, 3.

It should be noted that the recurrence (2.7) applies only when we have an order-modular prior for the structure. There are dynamic programming algorithms that compute posterior probabilities of structural features with a uniform prior over DAGs [55, 105], but their time requirement is $O^*(3^n)$. The extra time required is due to the fact that the algorithms use inclusion-exclusion to prevent the same DAG from contributing to the sum more than once.

Chapter 3

Space–Time Tradeoffs

We continue with an introduction to space–time tradeoffs in learning the structure of Bayesian networks. First, we introduce two simple schemes, namely the two-bucket scheme and the divide-and-conquer scheme, for permutation problems in general. Then, we adapt these schemes for Osd and Fp. We end this chapter by defining the notion of "optimality" of a scheme in terms of the time–space product.

3.1 Permutation Problems

In this section, we introduce two schemes to solve permutation problems with less space. Both schemes work under the assumption that the local costs can be evaluated in polynomial time. These schemes are later, in Section 3.2, adapted for structure discovery problems, where the evaluation of local costs can take exponential time.

3.1.1 Two-Bucket Scheme

The two-bucket scheme is a simple scheme for solving permutation problems in less than $O^*(2^n)$ space. The scheme is based on the idea of guessing the s first elements of the linear order and solving the subproblems for the first s and the last n − s elements independently. More formally, this approach is justified by the following observation. Fix an integer $n/2 \leq s \leq n$. Now, there exists a partition of N into $N_0$ and $N_1$ such that $N_0$ are the first s elements in a linear order and $N_1$ are the last n − s elements. For simplicity, we assume that $n - s \geq d$, where d is the degree of the permutation problem in question. We use the shorthand $l = |Y|$. We also assume that whenever the notation $v_1, \ldots, v_d \in X$ is used, $v_i \neq v_j$ if $i \neq j$. Based on the observation, one can solve a permutation problem by trying out all possible partitions $\{N_0, N_1\}$ and solving the recurrences $g_0(\emptyset, \emptyset) = 1$,

$g_0(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{v_d \in Y \setminus \{v_1, \ldots, v_{d-1}\}} g_0(Y \setminus \{v_1\}, v_d v_{d-1} \cdots v_2) \odot c(Y, v_d v_{d-1} \cdots v_2 v_1), \qquad (3.1)$

for $\emptyset \subset Y \subseteq N_0$ and $v_1, v_2, \ldots, v_{d-1} \in Y$,

$g_1(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{v_d \in N_0 \setminus \{v_2, \ldots, v_{d-1}\}} g_0(N_0, v_d v_{d-1} \cdots v_2) \odot c(N_0 \cup Y, v_d v_{d-1} \cdots v_2 v_1), \qquad (3.2)$

for $Y \subseteq N_1$, $l = 1$, $v_1 \in Y$, and $v_2, v_3, \ldots, v_{d-1} \in N_0$,

$g_1(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{v_d \in N_0 \setminus \{v_{l+1}, \ldots, v_{d-1}\}} g_1(Y \setminus \{v_1\}, v_d v_{d-1} \cdots v_2) \odot c(N_0 \cup Y, v_d v_{d-1} \cdots v_2 v_1), \qquad (3.3)$

for $Y \subseteq N_1$, $1 < l < d$, $v_{l+1}, \ldots, v_{d-1} \in N_0$, and $v_1, \ldots, v_l \in Y$, and

$g_1(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{v_d \in Y \setminus \{v_1, \ldots, v_{d-1}\}} g_1(Y \setminus \{v_1\}, v_d v_{d-1} \cdots v_2) \odot c(N_0 \cup Y, v_d v_{d-1} \cdots v_2 v_1), \qquad (3.4)$

for $Y \subseteq N_1$, $l \geq d$, and $v_1, v_2, \ldots, v_{d-1} \in N_1$; the total cost is obtained as the sum $g(N) = \bigoplus_{v_1, \ldots, v_{d-1} \in N_1} g_1'(N_1, v_{d-1} \cdots v_2 v_1)$, where $g_1'(N_1, v_{d-1} \cdots v_2 v_1)$ is the sum of $g_1(N_1, v_{d-1} \cdots v_2 v_1)$ over all partitions $\{N_0, N_1\}$. Here, the recurrence $g_0$ is of the same form as the standard dynamic programming. The recurrence $g_1$, however, differs from the standard form. In the recurrences (3.2) and (3.3) part of the sequence $v_{d-1} \cdots v_2 v_1$ consists of items from the set $N_0$. The recurrence $g_1$ does not have a value for an empty set; instead, in the recurrence (3.2) one takes the values from a suitable $g_0(N_0, v_{d-1} \cdots v_2 v_1)$.

Let us analyze the time and space requirements of the algorithm. We notice that the two subproblems are independent given the partition $\{N_0, N_1\}$, and thus they can be solved separately. Assuming the local costs can be computed in polynomial time, applying the exact dynamic programming algorithm from Section 2.3 solves the first subproblem, that is, computes $g_0$, in $O^*(2^s)$ time and space. Similarly, the second subproblem, that is, the values of $g_1$, is solved in $O^*(2^{n-s})$ time and space. There are $\binom{n}{s}$ different partitions $\{N_0, N_1\}$ and thus in total the problem can be solved in $O^*(\binom{n}{s} 2^s)$ time and $O^*(2^s)$ space. We get the following bounds for time and space.

Theorem 3.1 Polynomial local time permutation problems of bounded degree can be solved in $O^*(\binom{n}{s} 2^s)$ time and $O^*(2^s)$ space for any $n/2 \leq s \leq n$.

Next, we present some examples of these bounds for different values of s. For example, setting s = (4/5)n yields $O^*(2.872^n)$ time and $O^*(1.742^n)$ space. On the other hand, setting s = (3/5)n yields $O^*(2.972^n)$ time and $O^*(1.516^n)$ space.
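The constants above can be reproduced from the standard entropy bound $\binom{n}{\alpha n} = O^*(2^{H(\alpha)n})$, where H is the binary entropy function: for $s = \alpha n$ the time bound becomes $O^*(2^{(H(\alpha)+\alpha)n})$ and the space bound $O^*(2^{\alpha n})$. The following few lines of Python (our own check, not part of the thesis) recover the quoted bases.

from math import log2

def two_bucket_bases(alpha):
    # exponential bases of the two-bucket scheme for s = alpha * n
    H = -alpha * log2(alpha) - (1 - alpha) * log2(1 - alpha)
    return 2 ** (H + alpha), 2 ** alpha      # (time base, space base)

print(two_bucket_bases(4 / 5))               # roughly (2.87, 1.74)
print(two_bucket_bases(3 / 5))               # roughly (2.97, 1.52)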

3.1.2 Divide and Conquer Scheme

We further develop the partitioning idea from the previous section by applying it recursively. This scheme is based on a divide-and-conquer approach introduced by Savitch [91]. The approach has later been applied to Tsp [8, 46] and other problems [11, 108]. In what follows, we loosely follow the presentation by Fomin and Kratsch [35], although we present the divide-and-conquer scheme in a far more general form.

Here the idea is to create subproblems using the partitioning idea of the two-bucket scheme and to solve the subproblems by applying the partitioning idea again. Assume that we have a partition $\{N_0, N_1\}$ where all elements in $N_0$ precede all elements in $N_1$. Now $N_0$ can be partitioned into $\{N_{00}, N_{01}\}$ and $N_1$ into $\{N_{10}, N_{11}\}$ such that all elements in $N_{00}$ precede all elements in $N_{01}$ and all elements in $N_{10}$ precede all elements in $N_{11}$. We define the function $g(N_1, N_0, w_{d-1} \cdots w_2 w_1, v_{d-1} \cdots v_2 v_1)$ to be the score over all permutations on $N_1$ ending with $v_{d-1} \cdots v_2 v_1$, given that exactly the elements of $N_0$ precede $N_1$ and $w_{d-1} \cdots w_2 w_1$ is the sequence of the d − 1 last nodes in $N_0$. The function $g(N_1, N_0, w_{d-1} \cdots w_2 w_1, v_{d-1} \cdots v_2 v_1)$ equals the sum of

$\bigoplus_{x_1, x_2, \ldots, x_{d-1} \in N_{10}} g(N_{10}, N_0, w_{d-1} \cdots w_2 w_1, x_{d-1} \cdots x_2 x_1) \odot g(N_{11}, N_0 \cup N_{10}, x_{d-1} \cdots x_2 x_1, v_{d-1} \cdots v_2 v_1)$

over all partitions $\{N_{10}, N_{11}\}$. In this section, we assume that whenever the notation $v_1, \ldots, v_l \in X$ is used, $v_i \neq v_j$ if $i \neq j$; we also use the shorthand $l = |Y|$. The permutation problem is solved by computing $\bigoplus_{v_1, v_2, \ldots, v_{d-1} \in N} g(N, \emptyset, \emptyset, v_{d-1} \cdots v_2 v_1)$. In general one can apply the recursion to depth t and then solve the subproblems by dynamic programming. The value $g(N_1, N_0, w_{d-1} \cdots w_2 w_1, v_{d-1} \cdots v_2 v_1)$ can be computed by the dynamic programming recurrence $g'(\emptyset, \emptyset) = 1$ and

$g'(Y, v_{d-1} \cdots v_2 v_1) = g'(Y \setminus \{v_1\}, v_l \cdots v_3 v_2) \odot c(Y \cup N_0, w_{d-l-1} \cdots w_2 w_1 v_l \cdots v_2 v_1)$

for nonempty $Y \subseteq N_1$ with $l < d$ and $v_1, \ldots, v_l \in Y$, and

$g'(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{v_d \in Y \setminus \{v_1, v_2, \ldots, v_{d-1}\}} g'(Y \setminus \{v_1\}, v_d v_{d-1} \cdots v_3 v_2) \odot c(Y \cup N_0, v_d v_{d-1} \cdots v_2 v_1)$

for $Y \subseteq N_1$ with $l \geq d$ and $v_1, v_2, \ldots, v_{d-1} \in Y$. The value $g'(N_1, v_{d-1} \cdots v_2 v_1)$ equals $g(N_1, N_0, w_{d-1} \cdots w_2 w_1, v_{d-1} \cdots v_2 v_1)$.

For an analysis of the time and space requirements, it is convenient to assume a balanced scheme: in every step of the recursion, the element set in question is partitioned into two sets of about equal size. Further, for simplicity we assume that the number of elements, n, is a power of 2. Then, at depth $t \geq 0$ of the recursion, the element set of each subproblem is of size $s = n/2^t$. The value of g can be computed using the recurrence g' in $O^*(2^s)$ time and space. For every set of size s, there are at most $\binom{s}{s/2} \leq 2^{s-1}$ possible partitions (the inequality holds by a well-known bound; see the inequality (A.2) in Appendix A). For each partition, the recurrence $g(N_1, N_0, w_{d-1} \cdots w_2 w_1, v_{d-1} \cdots v_2 v_1)$ is called at most $\binom{2s}{d-1}(d-1)! \leq (2s)^{d-1}(d-1)!$ times. Thus, the recurrences are called at most $n^{d-1}(d-1)!\,2^n \cdot (n/2)^{d-1}(d-1)!\,2^{n/2} \cdot (n/4)^{d-1}(d-1)!\,2^{n/4} \cdots (2s)^{d-1}(d-1)!\,2^{2s} \leq n^{(d-1)t}(d-1)!^t\,2^{2n-2s}$ times in total. As the recursion depth is at most $\log n$, we get the following.

Theorem 3.2 Polynomial local time permutation problems of degree d can be solved in $O^*(2^{2n-s}\, n^{(d-1)(\log n - \log s)}\, (d-1)!^{\log n - \log s})$ time and $O^*(2^s)$ space for any $s = n/2, n/4, n/8, \ldots, 1$.

As a theoretically interesting special case with a polynomial space requirement we have the following.

Corollary 3.3 Polynomial local time permutation problems of degree d can be solved in $O^*(4^n\, n^{(d-1)\log n}\, (d-1)!^{\log n})$ time and polynomial space.

3.2 Structure Discovery in Bayesian Networks

So far, we assumed that the local costs can always be evaluated in polynomial time. This is not, however, the case with the Osd and Fp problems.

Luckily, it turns out that one can, under some assumptions about the possible parent sets, overcome the complications caused by the computation of the local score and solve both Osd and Fp using the two-bucket and divide-and-conquer schemes. To this end, it is essential to be precise about the input of these problems. Therefore, we first discuss the handling of the parent sets with limited space.

3.2.1 On Handling Parent Sets with Limited Space

For a rigorous treatment of the structure discovery problems with limited space, we need to be explicit about the input of the problem, that is, the local scores $s_v(A_v)$ for Osd and $\beta_v(A_v)$ for Fp. It is convenient to assume that the scores are nonzero (in the corresponding semiring) only for a family of possible parent sets denoted by $\mathcal{F}_v$; elsewhere we define that $s_v(A_v) = -\infty$ and $\beta_v(A_v) = 0$. Note that the zero scores are a direct consequence of setting the structure prior to zero when $A_v \notin \mathcal{F}_v$, as we did in Section 2.1.1.

The space requirement of the structure discovery problems is affected by whether the input is represented explicitly or implicitly. In the former case, we assume that the input consists of a list of tuples $(v, A_v, s_v(A_v))$ or $(v, A_v, \beta_v(A_v))$. This enables finding the local score for a given pair $(v, A_v)$ in O(n) time1. However, the size of the input is $\sum_v |\mathcal{F}_v|$, which is going to be a lower bound on the space requirement of our algorithms. In the latter case, we assume that the data are kept in memory as an $n \times m$ matrix, where n is the number of nodes and m is the number of observations, and the local scores are computed anew in δ(n) time whenever needed. Although it is often reasonable to assume that δ(n) is polynomial in n, this approach is usually slow in practice and therefore mainly of theoretical interest. It should be noticed that although we present δ(n) as a function of n, it actually depends also on the number of samples in the data, m. In practice, m can be fairly large compared to n and cause the constant factor to be large as well. In this thesis we concentrate on the explicit input but consider the implicit input, too.

We are particularly interested in possible parent set families $\mathcal{F}_v$ that are downward-closed, that is, closed with respect to inclusion: if $Y \in \mathcal{F}_v$ and $X \subseteq Y$ then $X \in \mathcal{F}_v$. Natural examples of such families in the Bayesian network context are (a) sets of size at most k for a fixed k and (b) sets that are subsets of given candidate parents [39, 86]. A useful property

1If the local scores are stored in n arrays (one for each $v \in N$) in lexicographic order, the set $A_v$ can be found using binary search in $O(\log |\mathcal{F}_v|)$ time. As $|\mathcal{F}_v| \leq 2^{n-1}$, $\log |\mathcal{F}_v| \leq \log 2^{n-1} = (n-1)\log 2 = O(n)$.

of the downward-closed set families is that their members can be listed in (nearly) linear time. We will later make use of the following, slightly stronger observation. Define a set interval from X to Y as $[X, Y] = \{Z \subseteq Y : X \subseteq Z\}$.

Proposition 3.4 Given a set N, a downward-closed family $\mathcal{F} \subseteq 2^N$ whose membership can be determined in τ(n) time, and sets X and Y with $X \subseteq Y \subseteq N$, the members of $\mathcal{F} \cap [X, Y]$ can be listed in $O(|\mathcal{F} \cap [X, Y]|\, n\, \tau(n))$ time.

Proof. If $X \notin \mathcal{F}$, then none of its supersets are in $\mathcal{F}$. Otherwise, we can list the members of $\mathcal{F}$ in increasing cardinality. Let $\mathcal{F}_\ell$ consist of the sets $Z \in \mathcal{F}$ satisfying $X \subseteq Z \subseteq Y$ and $|Z| = \ell$. We go through all $Z \in \mathcal{F}_\ell$ in lexicographic order and for every $u \in Y \setminus Z$ such that u is larger than the maximal element of $Z \setminus X$ (in lexicographic order) we test whether $Z \cup \{u\}$ is a member of $\mathcal{F}$, and add it to $\mathcal{F}_{\ell+1}$ if it satisfies the condition. We notice that no member of $\mathcal{F}$ is listed more than once, and for every member of $\mathcal{F}$, the membership in $\mathcal{F}$ is tested no more than n − 1 times. Therefore, the total time requirement is $O(|\mathcal{F} \cap [X, Y]|\, n\, \tau(n))$.

If $\mathcal{F}$ consists of the parent sets of cardinality at most k and we proceed levelwise as in the proof of Proposition 3.4, we always know the cardinality of the sets at each level. In other words, the membership in $\mathcal{F}$ can be determined in constant time, which implies the following.

Proposition 3.5 Given a set N, a family $\mathcal{F} \subseteq 2^N$ that consists of all sets of cardinality at most k, and sets X and Y with $X \subseteq Y \subseteq N$, the members of $\mathcal{F} \cap [X, Y]$ can be listed in $O(|\mathcal{F} \cap [X, Y]|\, n)$ time.
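The levelwise listing in the proof of Proposition 3.4 can be sketched in a few lines of Python (ours, not part of the thesis); is_member is an assumed black-box membership test and the elements of Y are assumed to be comparable so that "lexicographic order" makes sense.

def list_interval_members(is_member, X, Y):
    # lists F ∩ [X, Y] for a downward-closed family F, one cardinality level
    # at a time; each member is generated exactly once by extending with an
    # element larger than the largest element of Z \ X
    if not is_member(X):
        return []                              # no superset of X can be in F
    result, level, universe = [X], [X], sorted(Y - X)
    while level:
        next_level = []
        for Z in level:
            last = max(Z - X) if Z - X else None
            for u in universe:
                if last is not None and u <= last:
                    continue
                cand = Z | {u}
                if is_member(cand):
                    next_level.append(cand)
        result.extend(next_level)
        level = next_level
    return result

members = list_interval_members(lambda Z: len(Z) <= 2, frozenset(), frozenset(range(5)))
print(len(members))                            # 16 sets of cardinality at most 2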

3.2.2 The Osd Problem

We are ready to adapt the two-bucket scheme and the divide-and-conquer scheme for the Osd problem. The recurrences are mostly of the same form as for the permutation problems in general, but the computation of the local scores differs. Also, here it is convenient to formulate the two-bucket scheme with two completely independent recurrences. Based on the recurrences (3.1) and (3.2), one can find the score of an optimal DAG Â by trying out all possible partitions $\{N_0, N_1\}$ and solving the recurrences

$g_0(Y) = \max_{v \in Y} \{g_0(Y \setminus \{v\}) + \hat{s}_v(Y \setminus \{v\})\}, \qquad (3.5)$

for $\emptyset \subset Y \subseteq N_0$ with $g_0(\emptyset) = 0$, and

$g_1(Y) = \max_{v \in Y} \{g_1(Y \setminus \{v\}) + \hat{s}_v(N_0 \cup Y \setminus \{v\})\}, \qquad (3.6)$

for $\emptyset \subset Y \subseteq N_1$ with $g_1(\emptyset) = 0$, where $\hat{s}_v(Y \setminus \{v\})$ is the highest score of a node v and its parents when the parents are chosen from $Y \setminus \{v\}$; see the equation (2.2). The score of Â is obtained as the maximum of $g_0(N_0) + g_1(N_1)$ over all partitions $\{N_0, N_1\}$.

Next, we analyze the time and space requirements of the algorithm. Here the difference compared to some other permutation problems of degree 1, like Feedback, Treewidth, and Scheduling, is that the evaluation of the local score $\hat{s}_v(Y)$ can take exponential time. Let us first consider the explicit input. The two subproblems, $g_0$ and $g_1$, are independent given the partition $\{N_0, N_1\}$, and thus they can be solved separately. Applying the exact dynamic programming algorithm from Section 2.3.1 solves the first subproblem, that is, computes $g_0$, in $O(2^s n^2)$ time and $O(2^s n)$ space. We note that evaluating a local score $s_v(Y)$ now takes O(n) time. This, however, does not affect the total time requirement of the recurrence in Lemma 2.10.

Computation of $g_1$ is slightly more complicated, since evaluating $\hat{s}_v(N_0 \cup Y)$ requires the consideration of all possible subsets of $N_0$ as parents of v, in addition to a subset from $Y \setminus \{v\}$. To this end, for all $X_1 \subseteq N_1 \setminus \{v\}$ we write

$s'_v(X_1) = \max\{s_v(X) : X \cap N_1 = X_1,\ X \in \mathcal{F}_v\}. \qquad (3.7)$

The scores $s'_v(X_1)$ for all $X_1 \subseteq N_1$ can be computed using Algorithm 3.

Algorithm 3 Compute $s'_v(X_1)$.
Input: Scores $s_v(X)$ for all $X \in \mathcal{F}_v$.
Output: $s'_v[X_1]$ for all $X_1 \subseteq N_1$.
1: for $X_1 \subseteq N_1$ do
2:   $s'_v[X_1] \leftarrow -\infty$
3: end for
4: for $X \in \mathcal{F}_v$ do
5:   $X_1 \leftarrow X \cap N_1$
6:   $s'_v[X_1] \leftarrow \max\{s'_v[X_1], s_v(X)\}$
7: end for
8: return $s'_v[X_1]$ for all $X_1 \subseteq N_1$

Observe that the loop on line 1 is executed $2^{n-s}$ times. Listing the sets $X \in \mathcal{F}_v$ required on line 4 takes $O(F\, n\, \tau(n))$ time, where F is the size of $\mathcal{F}_v$ and the membership in $\mathcal{F}_v$ can be evaluated in τ(n) time. The loop on line 4 is executed F times. Lines 5 and 6 can be computed in O(n) time, yielding the total time requirement $O((F\tau(n) + 2^{n-s})n)$. Now, because $\hat{s}_v(N_0 \cup X_1) = \max_{Y \subseteq X_1} s'_v(Y)$, the exact dynamic programming algorithm again applies to computing $g_1$, running in $O((F\tau(n) + 2^{n-s})n + 2^{n-s}n^2) = O((F\tau(n) + 2^{n-s})n^2)$ time and $O(2^{n-s}n)$ space in total. Because there are $\binom{n}{s}$ different partitions $\{N_0, N_1\}$, we get the following result.

Proposition 3.6 The Osd problem with explicit input can be solved in $O(\binom{n}{s}(2^s + F\tau(n))n^2)$ time and $O((2^s + F)n)$ space for any $s = n/2, n/2+1, \ldots, n$, provided that each node has at most F possible parent sets, which form a downward-closed set family, and the membership in the possible parent sets can be evaluated in τ(n) time.
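For explicit input, Algorithm 3 is a single pass over the score list. The following is a direct Python transcription (ours), where scores_v is assumed to map each possible parent set $X \in \mathcal{F}_v$ (given as a frozenset) to $s_v(X)$.

from itertools import combinations

def project_scores(scores_v, N1):
    # Algorithm 3: s'_v(X1) = max{ s_v(X) : X ∩ N1 = X1 } for every X1 ⊆ N1
    N1 = frozenset(N1)
    proj = {frozenset(X1): float("-inf")
            for k in range(len(N1) + 1)
            for X1 in combinations(sorted(N1), k)}
    for X, score in scores_v.items():
        X1 = frozenset(X) & N1
        if score > proj[X1]:
            proj[X1] = score
    return proj

Given proj, the value $\hat{s}_v(N_0 \cup X_1) = \max_{Y \subseteq X_1} s'_v(Y)$ is then obtained with the same dynamic programming as in Lemma 2.10, now run over the subsets of $N_1$ only.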

Now consider the case with implicit input. The only time that the local scores are needed is in the computation of $s'_v(X_1)$. In this case the time requirement of Algorithm 3 is $O((F\tau(n)\delta(n) + 2^{n-s})n)$, yielding the following result.

Proposition 3.7 The Osd problem with implicit input can be solved in $O(\binom{n}{s}(2^s + F\tau(n)\delta(n))n^2)$ time and $O(2^s n)$ space for any $s = n/2, n/2+1, \ldots, n$, provided that each node has at most F possible parent sets, which form a downward-closed set family, the membership in the possible parent sets can be evaluated in τ(n) time, and a local score can be computed from the data in δ(n) time.

Let us now consider the divide-and-conquer scheme. Osd is a permutation problem of degree 1 and thus the sequences are void. Thus, $g(N_1, N_0)$ is the sum of

$\max\{g(N_{10}, N_0) + g(N_{11}, N_0 \cup N_{10})\}$

over all partitions $\{N_{10}, N_{11}\}$, where $N_0$ consists of the nodes preceding $N_1$. One can apply the recursion to depth t and then solve the remaining problem $g(N_1, N_0)$ by the dynamic programming recurrence in which $g'(\emptyset) = 0$ and

$g'(Y) = \max_{v \in Y} \{g'(Y \setminus \{v\}) + \hat{s}_v((Y \setminus \{v\}) \cup N_0)\}$

for nonempty $Y \subseteq N_1$. The Osd problem for the node set N is solved by computing $g(N, \emptyset)$.

The analysis of the time and space requirements follows the same pattern as in Section 3.1.2. Here, however, we have to include the input in our analysis. Again, assume a balanced scheme and that the number of nodes, n, is a power of 2. Then, at depth $t \geq 0$ of the recursion, the node set of each subproblem is of size $s = n/2^t$. Therefore, each subproblem can be solved in $O^*(2^s)$ time and space, assuming that the number of possible parent sets for each variable is polynomial in n. More precisely, assuming explicit input, if every node has at most F possible parent sets, which form a downward-closed set family, and the membership in the possible parent sets can be evaluated in τ(n) time, a subproblem can be solved in $O((2^s n + F\tau(n))n)$ time and $O((2^s + F)n)$ space. Because each subproblem of size 2s is divided into $\binom{2s}{s} \leq 2^{2s}$ subproblems of size s, the total number of subproblems of size s to be solved is at most $2^n 2^{n/2} 2^{n/4} \cdots 2^{2s} = 2^{2n-2s}$. We get the following bounds.

Theorem 3.8 The Osd problem with explicit input can be solved in $O(2^{2n-2s}(2^s + F\tau(n))n^2)$ time and $O((2^s + F)n)$ space for any $s = n/2, n/4, n/8, \ldots, 1$, provided that each node has at most F possible parent sets, which form a downward-closed set family, and the membership in the possible parent sets can be evaluated in τ(n) time.

Corollary 3.9 The Osd problem with explicit input can be solved in $O(4^n n^{k+2}\tau(n))$ time and $O(n^{k+1})$ space, provided that each node has at most $O(n^k)$ possible parent sets, which form a downward-closed set family, and the membership in the possible parent sets can be evaluated in τ(n) time.

Since the divide-and-conquer scheme handles the local scores the same way as the two-bucket scheme, we get the following result for the implicit input.

Theorem 3.10 The Osd problem with implicit input can be solved in $O(2^{2n-2s}(2^s + F\tau(n)\delta(n))n^2)$ time and $O(2^s n)$ space for any $s = n/2, n/4, n/8, \ldots, 1$, provided that each node has at most F possible parent sets, which form a downward-closed set family, the membership in the possible parent sets can be evaluated in τ(n) time, and a local score can be computed from the data in δ(n) time.

The results in this section apply particularly in the scenario in which the number of parents is bounded. If we want to lift this condition, the running time grows substantially. Indeed, the number of possible parent sets can be $2^{n-1}$, yielding the total time requirement $O(2^{3n-s}n^2)$. By setting s = 1, we get a theoretically interesting (but in practice quite useless) algorithm that solves the Osd problem without any restrictions on parent sets in $O(8^n n^2)$ time and polynomial space.

3.2.3 The Fp Problem

The Fp problem admits the two-bucket scheme and the divide-and-conquer scheme in a similar fashion as the Osd problem. Let us analyze the two-bucket scheme. Now we consider all possible partitions $\{N_0, N_1\}$ of N and have the recurrences

$g_0(Y) = \sum_{v \in Y} g_0(Y \setminus \{v\})\, \alpha_v(Y \setminus \{v\}), \qquad (3.8)$

for $\emptyset \subset Y \subseteq N_0$ with $g_0(\emptyset) = 1$, and

$g_1(Y) = \sum_{v \in Y} g_1(Y \setminus \{v\})\, \alpha_v((Y \setminus \{v\}) \cup N_0), \qquad (3.9)$

for $\emptyset \subset Y \subseteq N_1$ with $g_1(\emptyset) = 1$; the score is obtained as the sum of $g_0(N_0)\, g_1(N_1)$ over all partitions $\{N_0, N_1\}$.

The values of $\alpha_v(Z)$ for $Z \subseteq N_0$ can be computed by the standard fast zeta transform algorithm. For sets $N_0 \subseteq Z \subseteq N$, however, we need to make some minor modifications. To this end, for all $X_1 \subseteq N_1 \setminus \{v\}$ we write

$\alpha'_v(X_1) = \sum_{\substack{X \in \mathcal{F}_v \\ X \cap N_1 = X_1}} \beta_v(X),$

which is analogous to the equation (3.7). The values $\alpha'_v(X_1)$ for all $X_1 \subseteq N_1$ can be computed by Algorithm 4, which is a modified version of Algorithm 3. The values of $\alpha_v(X_1)$ for $X_1 \subseteq N_1$ are then obtained by computing the zeta transform of $\alpha'_v(X_1)$ over $N_1$.

A straightforward application of the presented techniques enables solving the Fp problem using the divide-and-conquer scheme. As the same argumentation applies to both Osd and Fp, we summarize the time and space bounds yielded by the two-bucket scheme and the divide-and-conquer scheme as follows.

Theorem 3.11 The bounds presented in Proposition 3.6, Proposition 3.7, Theorem 3.8, Corollary 3.9, and Theorem 3.10 hold for the Fp problem if the word Osd is replaced by Fp.

Algorithm 4 Compute $\alpha'_v(X_1)$.
Input: Scores $\beta_v(X)$ for all $X \in \mathcal{F}_v$.
Output: $\alpha'_v[X_1]$ for all $X_1 \subseteq N_1$.
1: for $X_1 \subseteq N_1$ do
2:   $\alpha'_v[X_1] \leftarrow 0$
3: end for
4: for $X \in \mathcal{F}_v$ do
5:   $X_1 \leftarrow X \cap N_1$
6:   $\alpha'_v[X_1] \leftarrow \alpha'_v[X_1] + \beta_v(X)$
7: end for
8: return $\alpha'_v[X_1]$ for all $X_1 \subseteq N_1$
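Algorithm 4 differs from Algorithm 3 only in using addition instead of maximization, and it is followed by a zeta transform over $N_1$. A combined Python sketch (ours; beta_v is assumed to map parent sets given as iterables to $\beta_v(X)$, and the order-modular factor $\rho_v$ is omitted here and assumed to be applied separately):

def alpha_on_second_bucket(beta_v, N1):
    # project the scores beta_v onto N1 (Algorithm 4) and then zeta-transform
    # over N1, so that entry X1 holds the sum of beta_v(X) over all X in F_v
    # with X ∩ N1 ⊆ X1 (subsets encoded as bitmasks over sorted(N1))
    elems = sorted(N1)
    idx = {u: i for i, u in enumerate(elems)}
    a = [0.0] * (1 << len(elems))
    for X, b in beta_v.items():                  # Algorithm 4
        mask = 0
        for u in X:
            if u in idx:
                mask |= 1 << idx[u]
        a[mask] += b
    for j in range(len(elems)):                  # fast zeta transform over N1
        for Y in range(1 << len(elems)):
            if Y & (1 << j):
                a[Y] += a[Y ^ (1 << j)]
    return a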

3.3 Time–Space Product

So far, we have obtained several different time and space bounds for permutation problems. Recall that the dynamic programming algorithm runs in $O^*(2^n)$ time and space. The two-bucket scheme yields different time and space requirements depending on the parameter s. For example, setting s = (4/5)n yields $O^*(2.872^n)$ time and $O^*(1.742^n)$ space. On the other hand, setting s = (3/5)n yields $O^*(2.972^n)$ time and $O^*(1.516^n)$ space. Also the divide-and-conquer scheme yields several different time and space bounds. For example, Corollary 3.3 yields $O^*(4^n)$ time and polynomial space. Setting s = n/4 yields $O^*(3.364^n)$ time and $O^*(1.190^n)$ space. But how good or efficient are these bounds?

In order to analyze the "goodness" or "efficiency" of a space–time tradeoff, we need a measure, preferably one that summarizes the efficiency into a single number. As a candidate for such a measure we consider the time–space product, which we define as follows.

Definition 3.12 (Time–space product.) The time–space product of an algorithm with the time requirement $O^*(T^n)$ and the space requirement $O^*(S^n)$ is the infimum of

$\theta = T \cdot S.$

Later in Section 4.2 we extend this definition to handle the partial order schemes that will be presented in Chapter 4. We notice that this kind of efficiency measure is sensible only for exponential time algorithms, as all polynomial time (and space) algorithms have the time–space product 1. Our base case is the dynamic programming algorithm running in $O^*(2^n)$ time and space, yielding the time–space product 4, and we will compare the space–time tradeoff schemes against this number. Intuitively, if an algorithm yields a time–space product less than 4, then whenever the space requirement (compared to the base case) is halved, the time requirement less than doubles.

The divide-and-conquer scheme also has time–space product 4, as it yields the bounds $O^*(2^{2n-s})$ for time and $O^*(2^s)$ for space. The two-bucket scheme, however, generally yields larger time–space products. Setting s = (4/5)n yields $O^*(2.872^n)$ time, $O^*(1.742^n)$ space, and time–space product 5. On the other hand, setting s = (3/5)n yields $O^*(2.972^n)$ time, $O^*(1.516^n)$ space, and time–space product 4.51. In general, this approach yields a smooth time–space tradeoff for space bounds from $O^*(2^{n/2})$ to $O^*(2^n)$. However, this is not the end of the story. In Chapter 4 we will generalize the two-bucket scheme and obtain time–space products smaller than 4.
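The products quoted above are straightforward to verify numerically (a check of ours, not part of the thesis):

print(2.0 * 2.0)            # plain dynamic programming: time-space product 4
print(2.872 * 1.742)        # two-bucket scheme, s = (4/5)n: about 5
print(2.972 * 1.516)        # two-bucket scheme, s = (3/5)n: about 4.51
print(3.364 * 1.190)        # divide and conquer, s = n/4: about 4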

Chapter 4

The Partial Order Approach

Linear orders have been a key ingredient in many algorithms for score-based structure discovery in Bayesian networks. Cooper and Herskovits [19] observed that, given a linear order and allowing the nodes to have at most a constant number of parents, an optimal structure can be found in polynomial time. Unfortunately, in practice we usually do not know the correct order and we have to resort to other means. Linear orders have been employed in several heuristic algorithms. For example, Friedman and Koller [39] found out that an MCMC sampler that works in the space of linear orders mixes considerably faster than one that works directly with the structures. Further, Teyssier and Koller [104] introduced an algorithm that searches for an optimal ordering using greedy hill-climbing. Also the exact dynamic programming algorithms (Section 2.3) work in the space of linear orders.

In this chapter we present a partial order approach for permutation problems in general and for structure discovery in Bayesian networks in particular. The partial order approach can be viewed as an extension of the dynamic programming method of Section 2.3. It is also a generalization of the two-bucket scheme. We present a "sparse" dynamic programming algorithm that solves the structure discovery problems over DAGs that are compatible with a given partial order. The idea is to consider a family of partial orders that together cover all linear orders and to solve the subproblem for each partial order separately. In the end, the results from the subproblems are combined to get a solution to the original problem. We begin this chapter by presenting dynamic programming over partial orders for permutation problems in general. Then, we accommodate the algorithm to solve the Osd and Fp problems. In this chapter we present our results for partial orders in general; later in Chapter 5 we consider theoretical and practical issues concerning the choice of partial orders.


4.1 Partial Orders

In this section we introduce the partial order terminology needed in this thesis. For a more thorough treatment of partial orders we refer the reader to Davey and Priestley [23].

A partial order P on a base set M is a subset of $M \times M$ such that for all $x, y, z \in M$ it holds that

1. $xx \in P$ (reflexivity),
2. $xy \in P$ and $yx \in P$ imply $y = x$ (antisymmetry), and
3. $xy \in P$ and $yz \in P$ imply $xz \in P$ (transitivity).

If $xy \in P$ we say that x precedes y. A partial order P is a linear order if in addition $xy \in P$ or $yx \in P$ for all $x, y \in M$ (totality).

A linear order Q is a linear extension of a partial order P if $P \subseteq Q$. Recall that every permutation is in a one-to-one relationship with a linear order. Thus, we say that a permutation σ is a linear extension of a partial order P if the linear order corresponding to σ is a linear extension of P. For our purposes, it is useful to define the trivial order P on M as the "diagonal" partial order $\{xx : x \in M\}$. For a set $Y \subseteq M$, the induced partial order P[Y] is the partial order $P \cap (Y \times Y)$.

A DAG A and a partial order P are said to be compatible with each other if there exists a partial order Q such that $P \subseteq Q$ and $A \subseteq Q$. In other words, some topological ordering of A is a linear extension of the partial order.

A set $\mathcal{P}$ is a partial order family, or a POF for short, on N if $\mathcal{P}$ consists of partial orders on N. Next, we define two types of POFs that will play a key role in our partial order approach.

Definition 4.1 (Cover) A POF $\mathcal{P}$ on N is a cover of, or covers, the linear orders on N if every linear order on N is a linear extension of at least one partial order in $\mathcal{P}$.

Definition 4.2 (Exact cover) A POF $\mathcal{P}$ on N is an exact cover of, or exactly covers, the linear orders on N if every linear order on N is a linear extension of exactly one partial order in $\mathcal{P}$.

An ideal (or a down-set, in operations research also a feasible set) of a partial order P is a set of elements that can "begin" some linear extension of P. Formally, an ideal I of a partial order P is a subset of elements such that $y \in I$ and $xy \in P$ imply $x \in I$. Another ideal I′ of P is called a subideal of I if $I' \subseteq I$. Analogously, an ideal I′ is a superideal of I if $I \subseteq I'$. We denote by $\mathcal{I}(P)$ the set of all ideals of the partial order P.

Sometimes one wants to visualize the aforementioned concepts. Recall that a partially ordered set (poset) is a set M associated with a partial order P on M. It can be illustrated by a Hasse diagram. In a Hasse diagram every $x \in M$ is depicted by a point p(x) in the Euclidean plane. Points p(x) and p(y), $x \neq y$, are connected to each other by an edge if x directly precedes y, that is, $xy \in P$ and there is no $z \in M$, $x \neq z \neq y$, such that $xz \in P$ and $zy \in P$. The points are laid out in such a way that if $xy \in P$ then p(x) is located lower than p(y) in the diagram.

The ideals of a partial order can be illustrated by a subset lattice. A lattice is a partially ordered set where for every pair of elements there exists a least upper bound (join) and a greatest lower bound (meet). If the ideals are ordered by the ⊆-relation, then ∪ is a join operation and ∩ is a meet operation1. The subset lattice can be illustrated by a Hasse diagram; see Example 4. The following example illustrates these concepts.

Example 4. Recall Example 1 with an employee who tends to be late. For an illustration of these concepts, consider the partial order $P = \{aa, ab, ao, bb, bo, oo, tt\}$ on the base set $M = \{a, b, o, t\}$. A Hasse diagram of P is shown in Figure 4.1(a). The set of ideals $\mathcal{I}(P)$ consists of eight sets: $\emptyset$, $\{a\}$, $\{t\}$, $\{a, b\}$, $\{a, t\}$, $\{a, b, t\}$, $\{a, b, o\}$, and $\{a, b, o, t\}$; note that, for example, $\{o\}$ is not an ideal, because node a always precedes node o, and therefore o cannot be the first node in any linear extension of P. Subset lattices induced by the trivial order and P are shown in Figure 4.2.

Consider a DAG $A = \{ao, bt, ot\}$ on $N = \{a, b, o, t\}$, shown in Figure 4.1(b). The partial order P has 4 linear extensions: abot, abto, atbo, and tabo. The permutation abot corresponds to the linear order $Q = \{aa, ab, ao, at, bb, bo, bt, oo, ot, tt\}$; since $P \subseteq Q$ and $A \subseteq Q$, the DAG A is compatible with P.
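The notions of ideals and compatibility in Example 4 are easy to check mechanically. A brute-force Python sketch (ours, for illustration only; it enumerates all subsets and permutations and is therefore feasible only for tiny base sets):

from itertools import combinations, permutations

def ideals(M, P):
    # all ideals of the partial order P (set of pairs xy with x preceding y;
    # reflexive pairs may be omitted) on the base set M
    preds = {y: {x for (x, z) in P if z == y and x != y} for y in M}
    out = []
    for k in range(len(M) + 1):
        for Y in combinations(sorted(M), k):
            Y = set(Y)
            if all(preds[y] <= Y for y in Y):        # downward closed
                out.append(frozenset(Y))
    return out

def compatible(A, P, M):
    # a DAG A (set of arcs) is compatible with P if some linear order extends
    # both P and a topological order of A
    for sigma in permutations(sorted(M)):
        pos = {v: i for i, v in enumerate(sigma)}
        if all(pos[x] < pos[y] for (x, y) in P) and all(pos[u] < pos[v] for (u, v) in A):
            return True
    return False

M = {"a", "b", "o", "t"}
P = {("a", "b"), ("a", "o"), ("b", "o")}             # the chain a < b < o, with t free
print(len(ideals(M, P)))                             # 8, as in Figure 4.2(b)
print(compatible({("a", "o"), ("b", "t"), ("o", "t")}, P, M))   # True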

4.2 Dynamic Programming over Partial Orders

The early development of dynamic programming algorithms that accommodate information about partial orders was motivated by the operations research community's need to solve scheduling problems with precedence constraints, that is, problems where the jobs must be scheduled according to some partial order.

1Join and meet operations mean that if $X, Y \in \mathcal{I}(P)$ then both $X \cup Y \in \mathcal{I}(P)$ and $X \cap Y \in \mathcal{I}(P)$.

Figure 4.1: (a) A Hasse diagram of the partial order $P = \{aa, ab, ao, bb, bo, oo, tt\}$. (b) A DAG with $N = \{a, b, o, t\}$ and $A = \{ao, bt, ot\}$.


Figure 4.2: Subset lattices of the ideals of (a) the trivial order on $N = \{a, b, o, t\}$ and (b) the partial order $P = \{aa, ab, ao, bb, bo, oo, tt\}$.

Baker and Schrage [5, 92] introduced a labeling scheme to reduce the number of states in the dynamic programming. Later, Lawler [71] further developed Baker's and Schrage's work and gave an algorithm that solves a precedence-constrained scheduling problem in time and space proportional to the number of ideals of the partial order determined by the precedence constraints.

Next, we further develop the ideas of Baker, Schrage, and Lawler and generalize the dynamic programming algorithm presented in Section 2.3 to compute the sum of costs over all linear extensions of a given partial order P on N. In what follows we use the shorthand $l = |Y|$. For any ideal $Y \in \mathcal{I}(P)$ and elements $v_1, v_2, \ldots, v_{d-1} \in Y$, $v_i \neq v_k$ if $i \neq k$, define

$g^P(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{\sigma} \bigodot_{j=1}^{l} c(\{\sigma_1, \sigma_2, \ldots, \sigma_j\}, \sigma_{j-d+1} \cdots \sigma_{j-1}\sigma_j),$

where the summation is over all linear extensions σ of the induced partial order P[Y] with $\sigma_{l-d+2} \cdots \sigma_{l-1}\sigma_l = v_{d-1} \cdots v_2 v_1$. Here, if d > j we read $\sigma_{j-d+1} \cdots \sigma_{j-1}\sigma_j$ as $\sigma_1 \cdots \sigma_{j-1}\sigma_j$, and if d = 0 the sequence is void. Note that if P is a trivial order, then $g^P$ equals the g defined in the equation (2.5).

Consider $g^P(Y, v_{d-1} \cdots v_2 v_1)$. If $Y \setminus \{v_1\}$ is not an ideal of P, then P[Y] does not have any linear extensions with $v_1$ as the last element. Thus, the sum is empty and $g^P(Y, v_{d-1} \cdots v_2 v_1) = 0$. Similarly, if $Y \setminus \{v_1, v_2, \ldots, v_k\}$ for $k \leq d - 1$ is not an ideal of P then $g^P(Y, v_{d-1} \cdots v_2 v_1) = 0$. Suppose therefore that $Y \setminus \{v_1, v_2, \ldots, v_k\}$ is an ideal of P for all $k \leq d - 1$. Then, any linear extension σ of P[Y] with $\sigma_{l-d+1} \cdots \sigma_{l-1}\sigma_l = v_{d-1} \cdots v_2 v_1$ determines a linear extension $\sigma' = \sigma(Y \setminus \{v_1\})$ of $P[Y \setminus \{v_1\}]$ with $v_d = \sigma_{l-d+1}$. Thus, the recurrence takes the form

$g^P(\emptyset, \emptyset) = 1, \qquad (4.1)$
$g^P(Y, v_{d-1} \cdots v_2 v_1) = \bigoplus_{v_d \in Y \setminus \{v_1, v_2, \ldots, v_{d-1}\}} g^P(Y \setminus \{v_1\}, v_d v_{d-1} \cdots v_2) \odot c(Y, v_d v_{d-1} \cdots v_2 v_1) \qquad (4.2)$

for $Y \setminus \{v_1, v_2, \ldots, v_k\} \in \mathcal{I}(P)$ for $0 \leq k \leq d$, and $l \geq 1$. That is, the formulas are the same as in the basic version but applied only to ideals Y of P where $v_k \in Y$ is a maximal element of $Y \setminus \{v_1, v_2, \ldots, v_{k-1}\}$. We get the following result.

Theorem 4.3 For any partial order P on N it holds

$\bigoplus_{\substack{v_1, v_2, \ldots, v_{d-1} \in N \\ v_i \neq v_k \text{ if } i \neq k}} g^P(N, v_{d-1} \cdots v_2 v_1) = \bigoplus_{\sigma^P} c(\sigma^P),$

where $\sigma^P$ runs over the linear extensions of P and $c(\sigma^P) = \bigodot_{j=1}^{n} c(\{\sigma^P_1, \sigma^P_2, \ldots, \sigma^P_j\}, \sigma^P_{j-d+1} \cdots \sigma^P_{j-1}\sigma^P_j)$.

Proof. We show by induction on the size of Y that $\bigoplus_{v_1, v_2, \ldots, v_{d-1} \in Y} g^P(Y, v_{d-1} \cdots v_2 v_1)$ equals the sum of costs c(σ(Y)) over the linear extensions of P[Y], assuming Y is an ideal of P.

First, consider an arbitrary singleton $\{v_1\} \in \mathcal{I}(P)$. Clearly, there is exactly one permutation on $\{v_1\}$ and it is compatible with $P[\{v_1\}] = \{v_1 v_1\}$; the score of the permutation is $1 \odot c(\{v_1\}, v_1) = c(\{v_1\}, v_1)$. This is precisely what the recurrence (4.2) gives.

Suppose then that the recurrence (4.2) holds for all proper subideals of an ideal $Y \subseteq N$. Without any loss of generality, assume Y = N for notational convenience. Now, write

$\bigoplus_\sigma c(\sigma) = \bigoplus_\sigma \bigodot_{j=1}^{n} c(\{\sigma_1, \sigma_2, \ldots, \sigma_j\}, \sigma_{j-d+1} \cdots \sigma_{j-1}\sigma_j)$

$= \bigoplus_{v_1, v_2, \ldots, v_{d-1} \in N} \bigoplus_{v_d \in N \setminus \{v_1, v_2, \ldots, v_{d-1}\}} c(N, v_d v_{d-1} \cdots v_2 v_1) \odot \bigoplus_{\substack{\sigma' \\ \sigma'_{n-d+1} \cdots \sigma'_{n-2}\sigma'_{n-1} = v_d v_{d-1} \cdots v_2}} \bigodot_{j=1}^{n-1} c(\{\sigma'_1, \sigma'_2, \ldots, \sigma'_j\}, \sigma'_{j-d+1} \cdots \sigma'_{j-1}\sigma'_j)$

$= \bigoplus_{v_1, v_2, \ldots, v_{d-1} \in N} \bigoplus_{v_d \in N \setminus \{v_1, v_2, \ldots, v_{d-1}\}} c(N, v_d v_{d-1} \cdots v_2 v_1) \odot g^{P[N \setminus \{v_1\}]}(N \setminus \{v_1\}, v_d v_{d-1} \cdots v_3 v_2)$

$= \bigoplus_{v_1, v_2, \ldots, v_{d-1} \in N} g^P(N, v_{d-1} \cdots v_2 v_1),$

where σ runs over the linear extensions of P[N] and σ′ runs over the linear extensions of $P[N \setminus \{v_1\}]$.

The first identity holds by the definitions of a linear extension and decomposability; the second one by the distributive law and the fact that the elements $v_d, v_{d-1}, \ldots, v_2, v_1$ are the last d elements of the permutation σ; the third one by the induction step; and the fourth one by the definition of $g^P$ and the fact that $g^P(Y) = g^{P[Y]}(Y)$ for $Y \in \mathcal{I}(P)$, since clearly a subset X of Y is an ideal of P if and only if X is an ideal of P[Y].

By applying recurrences (4.1) and (4.2) we get the following.

Proposition 4.4 The sum of costs c(σ) over all linear extensions σ of a given partial order P on N can be computed in $O^*(|\mathcal{I}(P)|)$ time and space, assuming that the local costs can be evaluated in polynomial time.

To extend the sum over all linear orders, we generally need to consider more than one partial order. Recall Definition 4.2, which states that a POF $\mathcal{P}$

exactly covers all linear orders on N if every linear order is a linear extension of exactly one partial order in $\mathcal{P}$. Given such an exact cover, the sum over all linear orders can be computed by simply computing the sum separately for each partial order and finally taking the sum over the $|\mathcal{P}|$ results. We get the following theorem.

Theorem 4.5 If $\mathcal{P}$ is a POF that exactly covers the linear orders on N, then

$\bigoplus_{P \in \mathcal{P}} \bigoplus_{v_1, v_2, \ldots, v_{d-1} \in N} g^P(N, v_{d-1} \cdots v_2 v_1) = \bigoplus_{\sigma} c(\sigma),$

where σ runs through all linear orders on N.

Proof. It suffices to note that

$\bigoplus_{P \in \mathcal{P}} \bigoplus_{v_1, v_2, \ldots, v_{d-1} \in N} g^P(N, v_{d-1} \cdots v_2 v_1) = \bigoplus_{P \in \mathcal{P}} \bigoplus_{\sigma^P} c(\sigma^P) = \bigoplus_{\sigma} c(\sigma),$

where $\sigma^P$ runs over the linear extensions of P. Here the first equality holds by Theorem 4.3 and the second one by the definition of an exact cover (Definition 4.2).

Thus, by Proposition 4.4 and Theorem 4.5 we have the following.

Proposition 4.6 Let $\mathcal{P}$ be a family of partial orders that exactly covers the linear orders on N. Then any polynomial local time permutation problem can be solved in $O^*(\sum_{P \in \mathcal{P}} |\mathcal{I}(P)|)$ time and $O^*(\max_{P \in \mathcal{P}} |\mathcal{I}(P)|)$ space.

Note that if the problem is defined in an idempotent semiring, $\mathcal{P}$ does not have to be an exact cover. Any $\mathcal{P}$ that covers all linear orders will do. However, it is not known whether there exists a (non-exact) cover $\mathcal{P}$ such that the time requirement $\sum_{P \in \mathcal{P}} |\mathcal{I}(P)|$ is smaller than the time requirement of every exact cover whose space requirement equals $\max_{P \in \mathcal{P}} |\mathcal{I}(P)|$.

Now that we know how to compute the time and space requirements of dynamic programming over partial orders given a POF, we extend the time–space product to this setting. The time–space product of a POF $\mathcal{P}$ on N is defined as the nth root of the product

$\theta(\mathcal{P}) = \left(\sum_{P \in \mathcal{P}} |\mathcal{I}(P)|\right) \cdot \max_{P \in \mathcal{P}} |\mathcal{I}(P)|. \qquad (4.3)$

Moreover, we define the time–space product of a bounded degree permutation problem to be the infimum of the time–space products of all algorithms that solve the problem in question. To upper bound the time–space product of a bounded degree permutation problem, consider a sequence of partial order families $(\mathcal{P}_n)$ such that each $\mathcal{P}_n$ exactly covers the linear orders on $\{1, 2, \ldots, n\}$. Then the time–space product of the bounded degree permutation problem is at most $\lim_n \theta(\mathcal{P}_n)^{1/n}$ (supposing the limit exists).

4.3 The Osd Problem

The dynamic programming over partial orders can be applied straightforwardly to the Osd problem, since Osd is a permutation problem. By adapting the dynamic programming recurrences from the previous section, we define the function $g^P$ by $g^P(\emptyset) = 0$ and for nonempty $Y \in \mathcal{I}(P)$ recursively

$g^P(Y) = \max_{\substack{v \in Y \\ Y \setminus \{v\} \in \mathcal{I}(P)}} \{g^P(Y \setminus \{v\}) + \hat{s}_v(Y \setminus \{v\})\}, \qquad (4.4)$

where $\hat{s}_v(Y \setminus \{v\})$ is the highest local score when the parents of v are chosen from $Y \setminus \{v\}$. It follows by Theorem 4.3 that $g^P(N)$ equals the maximum score over the DAGs compatible with P. The DAG that maximizes the score can be constructed using an adapted version of Algorithm 1. Now, given a POF $\mathcal{P}$ that covers the linear orders on N, the Osd problem can be solved by computing $\max_{P \in \mathcal{P}} g^P(N)$.

By Proposition 4.4, the dynamic programming algorithm evaluates the function $g^P$ along the recurrence (4.4) in $O^*(|\mathcal{I}(P)|)$ time and space, provided that the values $\hat{s}_v(Y \setminus \{v\})$ are precomputed. However, the computation of $\hat{s}_v(Y \setminus \{v\})$ is a problem of its own. In Section 2.3.1 we were able to use Lemma 2.10 and compute $\hat{s}_v(Y \setminus \{v\})$ for all $Y \subseteq N$ in essentially the same time as needed for solving Osd given the values $\hat{s}_v(Y \setminus \{v\})$. Here, however, the task is more involved due to the need to use only a limited amount of space. We will show that under some mild conditions, the values $\hat{s}_v(Y \setminus \{v\})$ can be computed in $O^*(|\mathcal{I}(P)|)$ time and space. First, however, consider a naïve computation where $\hat{s}_v(Y \setminus \{v\})$, given the values $s_v(Y \setminus \{v\})$, is computed from scratch for each $Y \subseteq N$. We get the following.

Lemma 4.7 Given a partial order P on N and scores $s_v(Y \setminus \{v\})$ for $v \in N$ and $Y \setminus \{v\} \in \mathcal{F}_v$, a DAG A that is compatible with P and maximizes the score s(A) can be found in $O(nF\tau(n)|\mathcal{I}(P)|)$ time and $O(n(F + |\mathcal{I}(P)|))$ space, provided that each node has at most F possible parent sets, which form a downward-closed family, and the membership in the possible parent sets can be evaluated in τ(n) time.

Now combined with Theorem 4.5, this gives the following summary.

Proposition 4.8 Given a POF $\mathcal{P}$ that covers the linear orders on N and explicit input, the Osd problem can be solved in $O(nF\tau(n)\sum_{P \in \mathcal{P}} |\mathcal{I}(P)|)$ time and $O(n(F + \max_{P \in \mathcal{P}} |\mathcal{I}(P)|))$ space, provided that each node has at most F possible parent sets, which form a downward-closed family, and the membership in the possible parent sets can be evaluated in τ(n) time.
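A minimal Python sketch of the recurrence (4.4) (ours, not from the thesis), assuming the ideals I(P) are given as frozensets and the best local scores are available through a callable s_hat(v, Y), a name of our own choosing:

def best_score_given_partial_order(ideals_of_P, s_hat):
    # g^P over the ideals of P: process ideals by cardinality and take, for
    # each ideal Y, the best sink v with Y \ {v} also an ideal
    ideal_set = set(ideals_of_P)
    g = {}
    for Y in sorted(ideals_of_P, key=len):
        if not Y:
            g[Y] = 0.0
            continue
        g[Y] = max(g[Y - {v}] + s_hat(v, Y - {v})
                   for v in Y if (Y - {v}) in ideal_set)
    return g[max(ideals_of_P, key=len)]          # g^P(N)

Combined with the ideals sketch from Example 4 and values of ŝ computed as discussed next, this yields the score of the best DAG compatible with a given P.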

However, we would like to improve the time and space bounds. Especially, we would like to remove the multiplicative factor F (which can be as large as $2^{n-1}$) from the running time bound, like we did in Section 2.3.1 with Lemma 2.10 by reusing the scores computed earlier. Unfortunately, Lemma 2.10 does not apply directly in our general setting. However, under some general assumptions that bound the number of possible parent sets, the multiplicative factor can be turned into an additive factor. Next we address this issue by generalizing the idea of Lemma 2.10.

The idea is to arrange the computation of the recurrence (4.4) in such a way that some of the values of $\hat{s}_v(Y \setminus \{v\})$, the scores of the best parent set chosen from $Y \setminus \{v\}$, may be reused. An immediate attempt would be to do the dynamic programming described in Lemma 2.10 and keep only two levels in memory at any given time, but this seems to take space proportional to $2^n/\sqrt{n}$ for each v [79]. Luckily, it turns out that we can use a sparse dynamic programming variant, which is explained next. The key insight is the following observation, which is stated in plain combinatorial terms.

Lemma 4.9 Let X and Y be sets with X Y . Let ⊆ = [X,Y ], A = Z Y : x / Z for some x X . B { ⊆ ∈ ∈ } Then (i) 2Y = and (ii) = 2Y \{x}. A ∪ B B x∈X Proof. (i) “ ” is obvious. So, consider “ ”: Let Z Y : If X Z, then ⊇ ⊆ ⊆ ⊆ Z . Otherwise, there exists x X such that x / Z, hence Z . ∈ A ∈ ∈ ∈ B (ii) By definition, Z if and only if Z Y x for some x X. ∈ B ⊆ \{ } ∈ Intuitively, consists of the subsets of Y that contain X. The set , on A B the other hand, consists of the subsets of Y that do not contain all elements of X. Lemma 4.9(i) states that every subset of Y either contains X or does not contain X. Further, Lemma 4.9(ii) states that each subset of Y that does not contain X is a subset of at least one Y x with x X. \{ } ∈ In terms of the set functions sv ands ˆv, for an arbitrary v, the above lemma amounts to the following generalization of Lemma 2.10. 52 4 The Partial Order Approach

Lemma 4.10 Let X and Y be subsets of N v with X Y . Then \{ } ⊆

sˆv(Y ) = max max sv(Z), max sˆv(Y u ) . X⊆Z⊆Y u∈X \{ } Proof. By Lemma 4.9(ii) and the definition ofs ˆv, the maximum of s (Z) given Z 2Y \{x} equals max sˆ (Y u ). Further, by v ∈ x∈X u∈X v \{ } Lemma 4.9(i), the larger of maxX⊆Z⊆Y sv(Z) and maxu∈X sˆv(Y u ) \{ } equals maxV ⊆Y sv(V ), which is the definition ofs ˆv(Y ). In words, Lemma 4.10 states that the maximum score for v given that its parents are chosen from Y is the maximum of the maximum over parent sets that contain X and the maximum over the parent sets that do not contain X. Note that Lemma 2.10 is a special case of Lemma 4.10, with X = Y . Lemma 4.10 leaves us the freedom to choose a suitable node subset X for each set of interest Y . How to make the choice is guided by the fact that P in the evaluation of g , the values ofs ˆv(Y ) are needed only for sets Y that belong to (P ); in what follows we consider P fixed. By storing the I ∈ P valuess ˆ (Y ) only for sets Y (P ), we adhere to the space requirement v ∈ I (up to a polynomial factor) already needed for storing gP (Y ) for Y (P ). ∈ I Thus, the goal is to choose X such that Y u (P ) for all u X. To \{ } ∈ I ∈ this end, let X consist of all such nodes in Y that have no strictly larger node in Y (with respect to P ), that is, X consists of maximal elements of Y . Accordingly, for Y (P ) define the set ∈ I Yˇ = u Y : uv / P for all v Y u . { ∈ ∈ ∈ \{ }} Furthermore, define the tail of Y (with respect to P ) as the interval = [Y,Yˇ ]. TY The sets Yˇ and Y are called the bottom and the top of , respectively. TY Example 5. Let us elucidate the concept of tails. Suppose we are given a set N = a, b, o, t and a partial order P = aa, ab, ao, bb, bo, oo, tt . { } { } Now, all ideals of P and their tails are shown in Figure 4.3. Consider, for example, a set Y = a, b, o, t . The elements o and t do not have any { } strictly larger element in Y , and thus Yˇ = o, t . The tail consists of { } TY the sets o, t , a, o, t , b, o, t , and a, b, o, t ; is colored in orange in { } { } { } { } TY Figure 4.3. 3

The next two lemmas show that Yˇ indeed has the desired property (in a maximal sense), and that the tails of different ideals are pairwise disjoint and thus partition the subsets of N. 4.3 The Osd Problem 53

Figure 4.3: The tails of the ideals of a partial order P = aa, ab, ao, bb, bo, oo, tt on N = a, b, o, t . Subsets of N that are sur- { } { } rounded by solid lines are ideals of P and subset with dashed lines are not ideals. An ideal and its tail are colored in the same color.

Lemma 4.11 Let Y (P ) and u Y . Then Y u (P ) if and only ∈ I ∈ \{ } ∈ I if u Yˇ . ∈ Proof. “If”: Let u Yˇ . Let st P . By the definition of (P ) we need to ∈ ∈ I show that t Y u implies s Y u . So, suppose t Y u , hence ∈ \{ } ∈ \{ } ∈ \{ } t Y . Now, since Y (P ), we must have s Y . It remains to show that ∈ ∈ I ∈ s = u. But this holds because ut / P by the definition of Yˇ . ∈ “Only if”: Let u / Yˇ . Then we have uv P for some v Y u . But ∈ ∈ ∈ \{ } u / Y u and v Y u , implying Y u / (P ) by the definition of ∈ \{ } ∈ \{ } \{ } ∈ I (P ). I Lemma 4.12 Let Y and Y ′ be two distinct sets in (P ). Then the tails I of Y and Y ′ are disjoint.

Proof. Suppose to the contrary that Z ′ . By symmetry we may ∈ TY ∩ TY assume that Y Y ′ contains an element w. Thus w / Z, because Z Y ′. \ ∈ ⊆ Because Yˇ Z, we have w / Yˇ . By definition of Yˇ we conclude that ⊆ ∈ for every u Y Yˇ there exists v Yˇ such that uv P . Therefore, in ∈ \ ∈ ∈ particular there exists v Yˇ such that wv P . Since w / Y ′ and Y ′ is ∈ ∈ ∈ 54 4 The Partial Order Approach an ideal of P it follows by definition of an ideal that v / Y ′. On the other ∈ hand, v Yˇ and Yˇ Z Y ′ implies that v Y ′: contradiction. ∈ ⊆ ⊆ ∈ The results of the previous lemmas are illustrated in Figure 4.3. Indeed, every subset of N = a, b, o, t is covered by the tails (Lemma 4.11) and { } the tails are disjoint (Lemma 4.12). As a remark, notice that in the recurrence (3.7) the maximization is over the tail of X with respect to partial order P = xy N N : x = 1 { ∈ × y or (x N and y N ) . Thus, in Section 3.2.2 we actually employed ∈ 0 ∈ 1 } Lemma 4.10. We now merge the ingredients given above into an algorithm for evalu- ating gP using the recurrence (4.4), for fixed P . In Algorithm 5 below, P ∈ P g [Y ] ands ˆv[Y ] denote program variables that correspond to the respec- tive target values gP (Y ) ands ˆ (Y ) to be computed. Also, recall that v Fv denotes the family of possible parent sets for node v.

Algorithm 5 Compute gP . Input: Partial order P , local scores s (Y ) for all v and Y . v ∈ Fv Output: Maximum score of a DAG compatible with P . 1: gP [ ] 0. ∅ ← 2: for all v N do ∈ 3: sˆ [ ] s ( ). v ∅ ← v ∅ 4: end for 5: for all Y (P ), in increasing order of cardinality do P ∈ I P 6: g [Y ] maxv∈Yˇ g [Y v ] +s ˆv[Y v ] . ← \{ } \{ } 7: for all v N do ∈ 8: sˆv[Y ] max maxZ∈TY ∩Fv sv(Z), maxu∈Yˇ sˆv[Y u ] . ← \{ } 9: end for 10: end for

Lemma 4.13 Algorithm 5 correctly computes gP , that is, gP [Y ] = gP (Y ) for all Y (P ). ∈ I Proof. By the definition of gP in the recurrence (4.4) and by Lemma 4.10 it suffices to notice that, by Lemma 4.11, the condition “v Y and Y v ∈ \{ } ∈ (P )” is equivalent to “v Yˇ ”, given that Y (P ). Note that maximizing I ∈ ∈ I over is maximizing over , since, by convention, s (Z) = for TY ∩ Fv TY v −∞ Z/ . ∈ Fv Given a POF that covers the linear orders on N, the Osd problem P 4.3 The Osd Problem 55 can be solved by running Algorithm 5 for each P . The time and space ∈ P requirements are bounded as follows.

Theorem 4.14 Given a POF that covers the linear orders on N Osd P and explicit input, the problem can be solved in O P ∈P ( (P ) + 2 |I | F τ(n)) n time and O maxP ∈P (P ) + F n space, provided that each |I | node has at most F possible parent sets, which form a downward-closed family, and the membership in the possible parent sets can be evaluated in τ(n) time.

Proof. By Theorem 4.5 and Lemma 4.13 it suffices to run Algorithm 5 for each P . ∈ P The time requirement of Algorithm 5 is dominated by the for-loop on line 5. Given Y , the set Yˇ can be constructed in O(n2) time by removing from Y each element that has a successor in Y . Thus, the contribution of line 6 in the total time requirement is O( (P ) n2). |I | We then analyze the time requirement of the line 8, for fixed v. By Proposition 3.4, the maximization of the local scores over can TY ∩ Fv be done in O( nτ(n)) time. Since, by Lemma 4.12, families |TY ∩ Fv| TY are disjoint for different Y (P ), the total contribution to the time ∈ I requirement is proportional to F , for each v. Because the loop in |Fv| ≤ line 7 is executed (P ) times and line 8 n times for every execution of |I | line 7, the total time requirement of line 8 is O( (P ) n2 + F τ(n)n2). |I | Combining the time bounds of lines 6 and 8 and summing over all members of yields the bound O([ ( (P ) + F τ(n))]n2). P P ∈P |I | The space requirement for a given partial order P is O( (P ) n), since, |I | by Lemma 4.11, the values gP [Y ] ands ˆ [Y ] are needed only for Y (P ). v ∈ I Therefore, the total space requirement is O([max (P ) + F ]n). P ∈P |I | Notice that the results above apply to any POF that covers the linear P orders on N. Later in Chapter 5 we will address the issue of choosing an “efficient” POF. It should also be noted that if the number of possible parent sets F is larger than (P ) , the space requirement is dominated by |I | the input size. Thus, when (P ) is small and one wants to use at most |I | O∗( (P ) ) space, it is essential to restrict the number of possible parent |I | sets; we will discuss this issue in more detail in Section 5.4. Another way to address this issue is to solve the Osd problem with implicit input. We see that the results in Theorem 4.14 are easily adapted for implicit input by adding the factor δ(n) whenever a local score is needed and removing the size of the input from the space requirement. We get the following. 56 4 The Partial Order Approach

Theorem 4.15 Given a POF that covers the linear orders on N Osd P and implicit input, the problem can be solved in O P ∈P ( (P ) + 2 |I | F δ(n)τ(n)) n time and O maxP ∈P (P ) n space, provided that each |I | node has at most F possible parent sets, which form a downward-closed family, the membership in the possible parent sets can be evaluated in τ(n) time and a local score can be computed from data in δ(n) time.

4.4 The Fp Problem

We continue with dynamic programming over partial orders by adapting it to the Fp problem. Adapting the dynamic programming algorithm can be done straightforwardly in similar fashion as for the Osd problem in the previous section. Computing the local scores in limited space is, however, a more complicated task. Given a partial order P we get a recursion gP ( ) = 1 and for every ∅ nonempty ideal Y of P , let

gP (Y ) = α (Y v )gP (Y v ), v \{ } \{ } v∈Y :Y \{v}∈I(P ) where α (Y v ) is the zeta transform of the local scores β (Y v ). v \{ } v \{ } Given an exact cover , by Theorem 4.5 the Fp problem can be solved by P P computing P ∈P g (N). The following theorem summarizes the time and space bounds for Fp. Note that in the proof of the theorem we will use a result about computation of a fast sparse zeta transform; this result will be proved in Section 4.4.1.

Theorem 4.16 Given a POF that exactly covers the linear orders on Fp P N and explicit input, the problem can be solved in O P ∈P (P ) + 2 |I | F τ(n) n time and O maxP ∈P (P ) + F n space, provided that each |I | node has a nonzero score in at most F parent sets, which form a downward- closed family, and the membership in the possible parent sets can be evalu- ated in τ(n) time.

Proof. Suppose we precompute the values α (Y v ), for Y (P ). By v \{ } ∈ I Theorem 4.20, the fast sparse zeta transform computes α (Y v ) for a v \{ } fixed v and for all Y (P ) in O(( (P ) +F τ(n))n) time and O( (P ) +F ) ∈ I |I | |I | space, provided that each node has a nonzero score in at most F parent sets, which form a downward-closed family and a membership on the family can be determined in τ(n) time. Thus, computing and storing values of α (Y v ) for all v N and Y (P ) requires O(( (P ) + F τ(n))n2) v \{ } ∈ ∈ I |I | time and O(( (P ) + F )n) space in total. |I | p4.4 The F Problem 57

By Proposition 4.4, gP can be computed in O∗( (P ) ) time and space |I | given that local scores can be evaluated in polynomial time. Since the local scores are precomputed and can be accessed in constant time, computation of gP (N) takes O(n (P ) ) time and O( (P ) ) space. Thus, combining |I | |I | these time and space requirements with the time required to precompute the values α (Y v ), we have the time and space requirements O(( (P ) + v \{ } |I | F τ(n))n2) and O(( (P ) + F )n), respectively. |I | The gP (N) has to be computed for all P . Thus, in total ∈ P the problem can be solved in O( [ (P ) + F τ(n)]n2) time and P ∈P |I | O([maxP ∈P (P ) + F ]n) space. |I | This theorem states the time and space requirement of computing the posterior probability of any fixed arc set. Later, in Section 4.4.2, we will show that the posterior probabilities of all n(n 1) arcs can be computed − simultaneously without increasing the time and space requirement (more than a small constant).

4.4.1 Fast Sparse Zeta Transform

Motivated by Theorem 4.16, we consider a sparse zeta transform, in which the values of the zeta transform (hζ)(Y ) are needed only for the ideals of a partial order P . The goal is to compute the zeta transform for all Y (P ) ∈ I in time proportional to n (P ) . In general, this is not possible as h(X), |I | X N might already have 2n different values. However, we will show that ⊆ the goal can be achieved assuming h(X) is nonzero on at most (P ) sets, |I | which form a downward-closed family. Previously, there has been some work on zeta transforms in a sparse setting. Koivisto [62] has shown that k-truncated zeta transform [65], that is, a zeta transform in which only sets of size k or less contribute to the sum, can be computed in O(k2n) time. As the time requirement is still proportional to 2n, this approach is not feasible for our purposes. Further, Bj¨orklund et al. [10] have shown that a zeta transform for sets belonging to a set family of subsets of N can be computed using a trimmed zeta F transform algorithm in O∗( ) time, where is the upper closure of , |↑F| ↑F F that is, consists of the supersets of the members of . Unfortunately, ↑F F is an ideal of any P , which yields the size of 2n for the upper closure of ∅ the set of ideals of any partial order P . Since the existing methods are not feasible for our purposes, we present a fast sparse zeta transform algorithm, which evaluates the zeta transform only at the ideals of the given partial order. Note, that this generalization subsumes the fast zeta transform algorithm when P is the trivial partial 58 4 The Partial Order Approach order xx : x N . The fast sparse zeta transform algorithm consists of { ∈ } two nested summations. The outer summation works on ideals Y (P ) ∈ I and the inner summation gathers the terms needed for each Y . We split the zeta transform in nested summations as follows. Let

h′(Y ) = h(X) ,Y (P ). ∈ I X∈TY Then, by Lemmas 4.11 and 4.12,

(hζ)(Y ) = h′(X) X⊆Y:X∈I(P ) for all Y (P ). Accordingly, we will compute (hζ) in two phases. First, ∈ I given h we evaluate h′ at all ideals of P ; second, given h′ we evaluate (hζ) at all ideals of P . The first phase is computationally straightforward as the tails are disjoint, and thus h′(Y ) can be computed independently for TY each Y . The second phase is less straightforward. To compute the second phase, we modify Algorithm 2. A natural at- tempt to modify the fast zeta transform to operate on ideals of a given partial order P would be to change the for-loop at line 5 to run over sets Y (P ) and the condition at line 6 to form “j Y and Y j (P )”. ∈ I ∈ \{ } ∈ I However, the next example shows that this modification is not sufficient to guarantee the correct result. Example 6. Consider a set N = 1, 2, 3 and a partial order { } P = 11, 22, 32, 33 . The set of ideals (P ) consists of the sets , 1 , 3 , { } I ∅ { } { } 1, 3 , 2, 3 , and 1, 2, 3 . Figure 4.4 shows the computations conducted { } { } { } by a zeta transform algorithm modified as described in the previous paragraph. We can see that the result is incorrect. Clearly, and 1 are ∅ { } subsets of 1, 2, 3 but the values β ( ) and β ( 1 ) do not contribute to { } v ∅ v { } α ( 1, 2, 3 ) (There is no directed path from the boxes corresponding to v { } sets and 1 at the topmost level to the box corresponding to 1, 2, 3 ∅ { } { } at the bottommost level). Further, the value β ( ) does not contribute to v ∅ α ( 2, 3 ). 3 v { }

The previous example confirms that we need to further modify Algo- rithm 2. Fortunately, a simple trick renders the algorithm to work correctly. While Algorithm 2 splits the transform into a sequence of n transforms in an arbitrary order, here we need a particular order. To this end, let σ be a permutation on N such that if σ σ P then i < j; in other words, σ is a i j ∈ linear extension of P . Now the fast sparse zeta transform can be computed by Algorithm 6. p4.4 The F Problem 59

{} 1 2 3 12 13 23 123

Phase I: 1

Phase II: 2

Phase III: 3

{} 1 2 3 12 13 23 123

Figure 4.4: Incorrect sparse zeta transform given partial order P = 11, 22, 32, 33 . { }

Lemma 4.17 Algorithm 6 works correctly.

Proof. Assume that σ is a linear extension of P . We show by induction on j that if Y is an ideal of P , then the computed function satisfies

′ hj(Y ) = h (X),

X∈(Y )j where we use the shorthand

(Y ) = I (P ): Y σ , σ , . . . , σ I Y ; j { ∈ I ∩ { j+1 j+2 n} ⊆ ⊆ }

This will suffice, since (Y )n consists of all the subideals of Y . We proceed by induction on j and the size of the ideal Y . (i) Trivially ′ ′ (Y )0 = Y . Therefore, h0(Y ) = h (Y ) = h (X), as claimed. (ii) { } X∈(Y )0 Assume then that h (Y ) = h′(X). j−1 X∈(Y )j−1 First, consider the case σ / Yˇ . Now, if σ / Y , then (Y ) = (Y ) j ∈ j ∈ j j−1 because Y σ , . . . , σ = Y σ , . . . , σ . Thus h (Y ) = h (Y ), as ∩ { j+1 n} ∩ { j n} j j−1 correctly computed at line 10. If, on the other hand, σ Y , then σ / Yˇ implies that σ is not j ∈ j ∈ j maximal in Y . This means that there exists another maximal element 60 4 The Partial Order Approach

Algorithm 6 The fast sparse zeta transform. Input: partial order P , h′(X) for X (P ). ∈ I Output: (hζ)(Y ) for Y (P ). ∈ I 1: Find a permutation σ that is a linear extension of P 2: for Y (P ) do ∈ I 3: h [Y ] h′(Y ). 0 ← 4: end for 5: for j 1, . . . , n do ← 6: for Y (P ) do ∈ I 7: if σ Yˇ then j ∈ 8: h [Y ] h [Y ] + h [Y σ ]. j ← j−1 j−1 \{ j} 9: else 10: h [Y ] h [Y ]. j ← j−1 11: end if 12: end for 13: end for 14: return h [Y ] for all Y (P ). n ∈ I

σ Yˇ with σ σ P . Because σ is a linear extension of P , we have j < i. i ∈ j i ∈ Now suppose (Y )j contains an ideal I that is not contained in (Y )j−1. Then it must be that σ I. However, σ I because σ Y and i > j. This is j ∈ i ∈ i ∈ a contradiction, for σ σ P and I is an ideal. Thus (Y ) = (Y ) , and j i ∈ j j−1 so hj(Y ) = hj−1(Y ), as correctly computed at line 10. Second, consider the case σ Yˇ , thus σ Y . Then Y σ is j ∈ j ∈ \{ j} an ideal. To prove the correctness of line 8, it suffices, by the induction assumption, to show that (Y ) and (Y σ ) are disjoint and their j−1 \{ j} j−1 union is (Y )j. To this end, it suffices to observe that (Y )j−1 consists of all subideals I Y σ , σ , . . . , σ of Y that do contain σ , whereas ⊇ ∩ { j j+1 n} j (Y σ ) consists of all subideals I (Y σ ) σ , σ , . . . , σ of \{ j} j−1 ⊇ \{ j} ∩ { j j+1 n} Y that do not contain σj.

The following example illustrates Algorithm 6.

Example 7. Let us again consider a set N = 1, 2, 3 and a partial order { } P = 11, 22, 32, 33 . Figure 4.5 shows the summation when the sparse zeta { } transform is computed by Algorithm 6. The figure clearly shows that, for any ideals X Y , there is exactly one directed path from the input value ⊆ of X to the output level of Y . This means that each subideal contributes to the summation on its superideal exactly once and that the algorithm works correctly. 3 p4.4 The F Problem 61

{} 1 2 3 12 13 23 123

Phase I: 1

Phase II: 3

Phase III: 2

{} 1 2 3 12 13 23 123

Figure 4.5: The fast sparse zeta transform given partial order P = 11, 22, 32, 33 computed by Algorithm 6. { }

Let us analyze the time and space requirement of Algorithm 6 under the assumption that the values of the input function can be accessed in constant time. At line 1, the permutation σ can be found by topologically sorting the DAG P xx : x N . Topological sorting can be done in O(n2) \{ ∈ } time [60, p. 261–268]. The for-loop at lines 2-4 can clearly be computed in O( (P ) ) time. In the for-loop starting at line 6, at most (P ) additions |I | |I | and substitutions are performed in each of the n phases. Assuming that the evaluation of σ Yˇ given Yˇ can be done in constant time, we get the j ∈ total time requirement is O(n (P ) ). However, constructing Yˇ for all sets |I | Y (P ) seems to take O(n (P ) ) time in total. Next we show that the ∈ I |I | sets Yˇ do not have to be constructed explicitly. To this end, for every σj define the set V = v N : σ v P and u N such that σ u P and uv P j { ∈ j ∈ ∈ j ∈ ∈ does not exists . } Intuitively, Vj consists of the nodes that are “one step larger” than σj. The sets Vj can be constructed straightforwardly. First, one computes the transitive reduction2 of the DAG obtained from P by removing self-loops. 2The transitive reduction of a DAG A is the minimal DAG A′ such that the transi- 62 4 The Partial Order Approach

Then, the set Vj consists of the children of j in the transitive reduction. The transitive reduction of a DAG can be determined in O(n2) time [97]. Given the transitive reduction, each Vj can be determined in O(n) time and thus V for all j N can be computed in O(n2) time (and space). j ∈ Notice that σ Yˇ for Y (P ) if and only if σ Y and v / Y for j ∈ ∈ I j ∈ ∈ any v V . We conclude that to determine σ Yˇ it is sufficient to do at ∈ j j ∈ most V + 1 set membership tests, that is, for each ideal Y we first test | j| whether σ Y and if the answer is affirmative we test whether v / Y j ∈ ∈ for all v V in some predetermined order. Once we find a negative test ∈ j we stop as we know that σ / Yˇ ; if all tests are affirmative, we know that j ∈ σ Yˇ . Unfortunately, V can be as large as n 1 and thus we could need j ∈ | j| − O(n) tests. Next lemma shows that on average we need at most a constant number of tests per ideal.

Lemma 4.18 Given a partial order P on N, σ N and the set V , a j ∈ j test whether σ Yˇ can be tested for all Y (P ) in O( (P ) ) time. j ∈ ∈ I |I | Proof. More than one set membership test is needed only for ideals Y that contain σj and all of its predecessors in P ; denote the set of σj and all of its predecessors by W . So let us consider ideals Y (P ) such that W Y . σj ∈ I σj ⊆ Let the sequence v v . . . v , where v V , be the order, in which the 1 2 |Vj | i ∈ j tests on members of Vj are conducted. Next we show by induction on the members of the sequence that the probability of a negative test result given the previous tests are positive is at least 1/2 for each test. Now if Y with W Y is an ideal, then also Y v is an ideal. We also note that if σj ⊆ ∪ { 1} Y X, v / X is an ideal, then also Y X v is an ideal. Therefore, the ∪ 1 ∈ ∪ ∪{ 1} test on v1 is negative at for at least half of the considered sets. Now suppose that tests for i first members of V have been affirmative, that is, we are considering ideals Y such that W Y N v , v , . . . , v . Again, if Y σj ⊆ ⊆ \{ 1 2 i} is an ideal then also Y v is an ideal and the test on v is negative ∪ { i+1} i+1 for at least half of the considered sets. Thus, for sets W Y we have to σj ⊆ conduct on average at most |Vj | i(1/2)i < (1/2)/(1 1/2)2 = 2 tests on i=1 − members of Vj. Thus, in total we conduct at most 3 (P ) set membership |I | tests.

The following theorem summarizes the bounds for the time and space requirement. tive closure of A equals to the transitive closure of A′. For finite DAGs, the transitive reduction is unique. p4.4 The F Problem 63

Theorem 4.19 Algorithm 6 evaluates the sparse zeta transform on a par- tial order P in O( (P ) n) time and O( (P ) ) space assuming that the |I | |I | values h′(Y ) for Y (P ) can be evaluated in constant time. ∈ I Let us now analyze the time requirement of the first phase, that is, computing h′(Y ) given a partial order P for all Y (P ). In the first ∈ I phase, h is evaluated on every subset of N if no assumptions are made. However, assume that h vanishes, that is, equals zero at all sets that are not in , where is some downward-closed set family with F sets. Then F F by Proposition 3.4, we can compute the first phase in O(F nτ(n)) time. Thus the total running time is determined by the larger of the two bounds as follows.

Theorem 4.20 Suppose h(X) vanishes if X/ , where is a downward- ∈ F F closed set family with at most F members and a membership of can be F determined in τ(n) time. Then the zeta transform (hζ)(Y ) can be computed for all ideals Y of a partial order P in O(( (P ) + F τ(n))n) time and |I | O( (P ) + F ) space. |I | Note that if consists of sets of size at most k the time requirement F k n becomes O ( (P ) + i=0 i )n . |I | We end this section by considering a “dual” of the above-described restricted zeta transform. This transform turns out to be handy in Sec- tion 4.4.2. Let P be a partial order on N and h a function from subsets of N to reals. We define the up-zeta transform of h by

(gζ′)(Y ) = g(X) ,Y N. ⊆ X⊇Y

The following lemma shows that the computation of the up-zeta trans- form reduces to the computation of the ordinary zeta transform. We denote by X¯ = N X for any X N. \ ⊆ Lemma 4.21 Given a set function h,

(hζ)(Y ) = (gζ′)(Y¯ ), where g is defined by g(Y¯ ) = h(Y ). 64 4 The Partial Order Approach

Proof. It suffices to notice that

(hζ)(Y ) = h(X) X⊆Y = g(N X) \ X⊆Y = g(X) X⊇N\Y = (gζ′)(Y¯ ).

The first identity holds by the definition of the zeta transform; the second by the definition of g; the third one by the fact that X is a superset of N Y if and only if N X is a subset of Y ; the fourth one by the definitions \ \ of the up-zeta transform and X¯.

Further, we define the sparse up-zeta transform of h by

′ (gζ P )(Y ) = g(X) ,Y N. ⊆ X⊇Y:X∈I(P )

′ In words, the summation is over the superideals of Y . Note that (gζ P )(Y ) is defined also for sets Y that are not ideals of P . To compute the sparse up-zeta transform, we define a set

Yˆ = x N : xy P for some y Y . { ∈ ∈ ∈ } The following lemma shows that Yˆ is an ideal and that the sparse up-zeta transform of Y equals the sparse up-zeta transform of Yˆ .

Lemma 4.22 Given a partial order P , (i) the set Yˆ is an ideal of P and ′ ′ (ii) (gζ P )(Y ) = (gζ P )(Yˆ ).

Proof. (i) By the definition of an ideal, y Yˆ and xy P implies x Yˆ . ∈ ∈ ∈ Assume that such a y exists. If y Y , then by the definition of Yˆ , indeed ∈ x Yˆ . On the other hand, if y / Y then, by transitivity and the definition ∈ ∈ of Yˆ , there exists z Y such that yz P and xz P which implies that ∈ ∈ ∈ x Yˆ . ∈ (ii) It is sufficient to show that if X Y is an ideal of P , then necessarily ⊇ X Yˆ . Suppose on the contrary that there exists X (P ) with Y ⊇ ∈ I ⊆ X Yˆ . Then, by the definition of an ideal, y Y and xy P implies that ⊂ ∈ ∈ x X; but this is the definition of Yˆ : contradiction. We conclude that ∈′ ′ (gζ P )(Y ) = (gζ P )(Yˆ ). p4.4 The F Problem 65

Note that Yˆ = X for all Y X . By Lemma 4.22, it is sufficient to ′ ∈ T compute (gζ P ) over the ideals of P only. The fast sparse up-zeta transform is analogous to the fast sparse (down-)zeta transform. Rather than modifying the construction we re- sort to the notion of “complementary symmetry”. To this end, we denote by P¯ = xy : yx P for any partial order P on N. We get the following { ∈ } lemma.

Lemma 4.23 X is an ideal of P if and only if X¯ is an ideal of P¯.

Proof. “If”: Suppose on the contrary that X¯ is an ideal of P¯ but X is not an ideal of P . Now by the definition of an ideal, there must exist x X ∈ such that yx P and y / X. But by the definitions of X¯ and P¯, y X¯, ∈ ∈ ∈ xy P¯ and x / X¯, that is, X¯ is not an ideal of P¯: contradiction. ∈ ∈ “Only if”: Suppose on the contrary that X is an ideal of P but X¯ is not an ideal of P¯. Now by the definition of an ideal, there must exist x X¯ ∈ such that yx P¯ and y / X¯. But by the definitions of X¯ and P¯, y X, ∈ ∈ ∈ xy P and x / X, that is, X is not an ideal of P : contradiction. ∈ ∈ The following theorem states the fast sparse up-zeta transform can be computed with the fast sparse (down-)zeta transform.

′ Theorem 4.24 Given a set function g and a partial order P , (gζ P )(Y ) = h [N Yˆ ], where h [N Yˆ ] is the output of Algorithm 6 ran with a input n \ n \ consisting of a partial order P¯ and h′(Y ) = g(N Y ) for Y (P¯). \ ∈ I Proof. By Lemma 4.17, h [N Y ] is the sum of the values g(N Y ) over the n \ \ subideals of N Y with respect to a partial order P¯. By Lemmas 4.21 and \ 4.23, the sum of the values g(N Y ) over the subideals of N Y with respect \ \ to a partial order P¯ equals the sparse up-zeta transform of g evaluated at Yˆ . Finally, by Lemma 4.22(ii), the sparse up-zeta transform of g evaluated ′ at Yˆ equals (gζ P )(Y ).

4.4.2 Posterior Probabilities of All Arcs In Section 4.4 we showed how to compute the posterior probability of any arc set. Maybe the most important arc sets are single arcs as it is of- ten handy to present the posterior probabilities for all n(n 1) arcs. A − straightforward method would compute the probabilities independently for every arc. However, a faster forward–backward algorithm [62] avoids the multiplicative factor n(n 1) in running time altogether. Next, we adapt − the forward–backward algorithm to run with limited space. 66 4 The Partial Order Approach

Let P be a partial order on N. Recall that a forward function gP satisfies the recurrence gP ( ) = 1 and ∅ gP (Y ) = α (Y v )gP (Y v ), v \{ } \{ } v∈Y :Y \{v}∈I(P ) where Y is an ideal of P . Analogously, we define a backward function bP by bP ( ) = 1 and ∅ bP (X) = α (N X)bP (X v ), v \ \{ } v∈X:N\(X\{v})∈I(P ) where X is nonempty and N X is an ideal of P . Intuitively, the backward \ function computes the contribution of the nodes in X given that they are the X last nodes in the order. Then we combine the forward and backward | | functions into

γP (A ) = q (Y )gP (Y )bP (N Y v ), v v v \ \{ } Y where A is a subset of N v and Y runs over all ideals of P with v \{ } A Y N v . In words, if a node v has parents A , then γ (A ) v ⊆ ⊆ \{ } v v v computes the contribution of the other nodes. Naturally, a node set Y A ⊇ v must precede v and then the rest N Y v must follow v. \ \{ } We now express the marginal posterior probability of an arc uv as a P weighted average of terms γv (Av). To this end, we observe that

P αt(Lt) = βv(Av)γv (Av). L⊇P t∈N Av⊆N\{v}

P We observe that βv(Av)γv (Av) is the contribution of all DAGs compatible with P in which A (and no one else) are the parents of v. Thus, if is a v P POF that exactly covers the linear orders on N, then, by the equation (2.4), the joint probability of data D and a feature f can be written as

P P r(D, f) = βv(Av)γv (Av). (4.5) P∈P Av⊆N\{v}

Notice that under this assumption, the choice of an arc uv affects only the terms β (A ). Clearly, β (A ) is nonzero only when u A . The terms v v v v ∈ v γP (A ) can thus be precomputed assuming the trivial feature f 1. v v ≡ Let us analyze the time requirement of the computation of the posterior probabilities of arcs using the equation (4.5). First, note that for a fixed p4.4 The F Problem 67 partial order P , the values α (Y v ) for all nodes v and ideals Y v of P v \{ } \{ } can be computed using the fast sparse zeta transform in O(n2 (P ) ) time. |I | Second, note that the forward and backward functions can be computed using a straightforward recursion in O(n (P ) ) time. Third, the values P |I | of γv (Y ) for a fixed v and all ideals Y of P can be computed using the fast sparse up-zeta transform in O(n (P ) ) time and thus for all v in |I | O(n2 (P ) ) time. Finally, given set Y , set Yˆ can be computed in O(n2) |I | P time and thus given the values γv (Y ) for all ideals Y of P , the inner sum in equation (4.5) can be computed in O(F τ(n)n2) time assuming that the score β (A ) vanishes if A / . Thus, for fixed P , the total time v v v ∈ Fv requirement is O(n2 (P ) ) provided that F (P ) . Clearly all steps can |I | ≤ I| | be computed in O(n (P ) ) space. Next theorem concludes the previous |I | analysis.

Theorem 4.25 Given a POF that exactly covers the linear orders on P N and explicit input, the Fp problem for all arc features can be solved in O(n2τ(n) (P ) ) time and O(n max (P ) ) space, provided that P ∈P |I | P ∈P |I | the local score βv(Av) vanishes for all Av / v, v consists of at most F ∈ F F sets, which form a downward-closed family, the membership in can be Fv evaluated in τ(n) time and F (P ) for all v N and P . ≤ |I | ∈ ∈ P Recall that Theorem 4.16 stated that the posterior probability of any modular structural feature (including a single arc) can be computed in O(n2τ(n) (P ) ) time and O(n2 (P ) ) space; thus the pos- P ∈P |I | P ∈P |I | terior probability of all n(n 1) arcs can be computed in essentially the − same time and space as the posterior probability of a single arc. 68 4 The Partial Order Approach Chapter 5

Parallel Bucket Orders

In the previous chapter we presented the partial order approach for solving permutation problems. The time and space requirements were presented in terms of the size of a POF and the number of ideals of partial orders P P . In order to trade space against time “efficiently”, the choice of ∈ P P plays a crucial role. In this chapter we address this issue and concentrate on a particular class of partial orders called parallel bucket orders. We start this chapter by introducing a simple example of parallel bucket orders called a pairwise scheme. The idea of the pairwise scheme is to choose p distinct pairs of elements and fix a linear order within each pair. The pairwise scheme illustrates many characteristics of the parallel bucket orders and hence seems to be an intuitively good starting point for our investigation of parallel bucket order schemes. Then in Section 5.2 we formally define parallel bucket orders and give some elementary results. In Section 5.3 we present a certain parallel bucket scheme which we call the 13 13 scheme and show that it yields an optimal time–space product ∗ (among parallel bucket orders). Finally, we investigate the prospects of the parallel bucket orders in practice in Section 5.4.

5.1 Pairwise Scheme

Formally, the pairwise scheme is represented as follows. Pick 2p n distinct ≤ elements u , u , . . . , u , v , v , . . . , v N and fix a partial order P such 1 2 p 1 2 p ∈ that either u v P or v u P for all 1 i p. For an analysis of i i ∈ i i ∈ ≤ ≤ the number of ideals, consider a fixed partial order P such that uv P . ∈ A set containing neither u nor v, a set containing u but not v, and a set containing both u and v can be ideals of P . However, no set that contains v but not u can be an ideal of P . As all p pairs are independent and any

69 70 5 Parallel Bucket Orders subset of the remaining n 2p elements can be part of an ideal, the total − number of ideals is 3p2n−2p. Further, we note that for each linear order L either u v L or v u L for all i. To cover the linear orders on N we need i i ∈ i i ∈ to consider combinations for each pair, that is, in total 2p different partial orders. These results can be summarized in the following proposition.

Proposition 5.1 Polynomial local time permutation problems of bounded degree can be solved in O∗(2n(3/2)p) time and O∗(2n(3/4)p) space, where p = 0, 1, . . . , n/2.

Next, we will consider the Osd and Fp problems. Based on Theo- rems 4.14 and 4.16, we conclude as follows.

Proposition 5.2 Given explicit input, both the Osd problem and the Fp problem can be solved in O([2n(3/2)p + F τ(n)]n2) time and O([2n(3/4)p + F ]n) space, where p = 0, 1, . . . , n/2, provided that each node has at most F possible parent sets, which form a downward-closed family, and the mem- bership in the possible parent sets can be evaluated in τ(n) time.

In order to guarantee that the number of possible parent sets does not dominate the time and space requirement we need to restrict the number of possible parent sets. By allowing at most k parents for each node, each k n−1 node has at most i=0 i possible parent sets. It is well-known (see inequality (A.3) in Appendix A) that n n−1 2H()n, where H() is i=0 i ≤ the binary entropy function and 1/2. Now we compare this bound to ≤ the space requirement 3p2n−2p in the most stringent case, at p = n/2. We solve in 2H()n 3n/2, which is equivalent to H()n (n/2) log 3, that ≤ ≤ 2 is, H() (1/2) log 3. Solving the inequality numerically yields 0.238. ≤ 2 ≤ We summarize as follows.

Proposition 5.3 Given explicit input, both the Osd problem and the Fp problem can be solved in O([2n(3/2)p]n2) time and O([2n(3/4)p]n) space, where p = 0, 1, . . . , n/2, provided that each node has at most 0.238n parents.

The following example illustrates the pairwise scheme. Here it is conve- nient to measure the time and space requirements in terms of the number of ideals.

Example 8. Again, consider node set N = a, b, o, t . Now we can have { } the pairwise scheme with either one or two pairs. Figure 5.1 shows exam- ples of partial orders on N with a linear order for (a) one pair (p = 1) and (b) two pairs (p = 2) fixed. The corresponding subset lattices are shown in 5.1 Pairwise Scheme 71

Figure 5.2. Recall that when dynamic programming over the trivial order which is equivalent to the standard dynamic programming algorithm, we get the time and space requirement of 16 (sets); see Figure 4.2(a). For the pairwise scheme with one fixed pair, we get 12 ideals and thus the space requirement 12; see Figure 5.2(a). As we need to consider 2 such partial orders to cover the linear orders, the total time requirement is 24. For the pairwise scheme with two pairs, we get 9 ideals; see Figure 5.2(b). The cover size is 2 2 = 4 and thus the total time requirement is 36. 3 × t o t o

b a b a

(a) (b)

Figure 5.1: Hasse diagrams of partial orders (a) P = aa, bb, bt, oo, tt and 1 { } (b) P = aa, ao, bb, bt, oo, tt . 2 { }

abot abot

abo abt bot abo abt ab ao bo bt ab ao bt

a b o a b

{} {} (a) (b)

Figure 5.2: Lattices of the ideals of partial orders (a) P = aa, bb, bt, oo, tt 1 { } and (b) P = aa, ao, bb, bt, oo, tt . 2 { }

The time–space product of the pairwise scheme is θ = (2n(3/2)p2n(3/4)p)1/n = 4(9/8)p/n. We see that the time–space product is the smaller the less space is saved, that is, when p is small. However, the term (9/8)p/n is always at least 1 and thus the time–space product 72 5 Parallel Bucket Orders is always at least 4. The time–space product of the pairwise scheme is maximized when p = n/2; the maximum is 4(9/8)1/2 4.2426 and thus ≈ the time–space product is always lower than 4.243. Next we generalize the pairwise scheme and show that the time–space product can be further lowered.

5.2 Parallel Bucket Orders: An Introduction

We will consider parallel bucket orders which are defined as follows. A partial order P is a parallel composition of partial orders P1,P2,...,Pk if the P are pairwise disjoint and their union is P , that is, P ,P ,...,P is i { 1 2 k} a partition of P . Given P , the partition becomes unique if each component Pi is required to be connected, that is, Pi does not further partition into two nonempty partial orders. A partial order P is a series composition of partial orders P and P if uv P if and only if uv P , uv P , or u P 1 2 ∈ ∈ 1 ∈ 2 ∈ 1 and v P . A partial order B on a base-set M is a bucket order if M can ∈ 2 be partitioned into nonempty sets B1,B2,...,Bℓ, called buckets, such that xy B if and only if x = y or x B and y B for some i < j; the bucket ∈ ∈ i ∈ j sequence is unique. The bucket order is denoted by B = B1B2 ...Bℓ and is said to be of length ℓ and type B B B . A partial order P | 1| ∗ | 2| ∗ ∗ | ℓ| is a parallel bucket order if P is a parallel composition of bucket orders.

Example 9. We have already seen several bucket orders and paral- lel bucket orders. For example, a linear order is a bucket order with n buckets of size 1. The partial orders induced by the two-bucket scheme (Section 3.1.1) are bucket orders with two buckets whose sizes are s and n s. Also the trivial order is a bucket order; all elements are in the same − bucket. The pairwise scheme introduced in the previous section operates on parallel bucket orders; every pair is a bucket order with two buckets, each of size 1. Further, Figure 5.3 shows Hasse diagrams of the bucket order B = a b, c d and the parallel composition of bucket orders { }{ }{ } Q = h, i a, b and Q = j e, f, g c, d . A parallel bucket or- 1 { }{ } 2 { }{ }{ } der can be illustrated by a Hasse diagram, in which parallel compositions do not share any lines and the nodes inside each bucket are placed in parallel on one level and a line is drawn between every pair of nodes in consecutive levels. 3

Bucket orders and parallel bucket orders are a special case of series– parallel partial orders. Indeed, the order on each single bucket is a trivial 5.2 Parallel Bucket Orders: An Introduction 73

d a b c d

b c e f g

a h i j

(a) (b)

Figure 5.3: (a) A bucket order B = a b, c d and (b) a parallel com- { }{ }{ } position P of bucket orders B = h, i a, b and B = j e, f, g c, d . 1 { }{ } 2 { }{ }{ }

order, which is a parallel composition of singletons. Further, each bucket order is a series composition of buckets. Finally, parallel bucket orders are parallel compositions of bucket orders. Simple rules are known for calculating the number of ideals of a series–parallel partial order; see, for example, Equations 3.1 and 3.2 in Steiner [102] and references therein. The following two lemmas are mere applications of these rules.

Lemma 5.4 The number of ideals of a bucket order B = B1B2 ...Bℓ is given by (B) = 1 ℓ + 2|B1| + 2|B2| + + 2|Bℓ|. |I | − Lemma 5.5 Let P be the parallel composition of partial orders P ,P ,...,P . Then the number of ideals of P is given by (P ) = 1 2 k |I | (P ) (P ) (P ) . |I 1 ||I 2 | |I k | To construct a POF that exactly covers the linear orders on N and consists of parallel bucket orders, we introduce a notion of reordering. We say that two bucket orders are reorderings of each other if they have the same base-set and are of the same type (and length). Further, we say that two parallel bucket orders are reorderings of each other if their connected components can be labeled as P1,P2,...,Pk and Q1,Q2,...,Qk such that Pi is a reordering of Qi for all i. If P is a parallel bucket order, we denote by (P ) the family of the partial orders that are reorderings of P . We call P (P ) the equivalence class of P (with respect to the reordering relation1). P 1The reordering relation is an equivalence relation as it is obviously reflexive, sym- metric and transitive. 74 5 Parallel Bucket Orders

Next, we show that (P ) exactly covers the linear orders on N. P Lemma 5.6 Let P be a parallel bucket order. Then (P ) exactly covers P the linear orders on the base-set of P .

Proof. Let P1,P2,...,Pk be the connected components of P with base-sets N ,N ,...,N , respectively. Let σ = σ σ σ be a linear order on the 1 2 k 1 2 n base-set of P . It suffices to show that there is a unique partial order Q equivalent to P such that σ is an extension of Q. For each i = 1, 2, . . . , k, we construct a bucket order Qi on Ni as follows. Let m1 m2 mℓ be the type of Pi. For j = 1, 2, . . . , ℓ denote sj = ∗ ∗ ∗ ′ ′ ′ ′ m1 +m2 + +mj, s0 = 0 and m = sℓ. Let σ = σ1σ2 σm be the induced ′ order σ[Ni]. Now, let Qi = C1C2 Cℓ with Cj = σt : sj−1 < t sj . ′ { ≤ } Note that, on one hand, σ is an extension of Qi, and on the other hand, any other reordering of Q must contain a pair xy with yx σ′. Thus, Q i ∈ i is unique. Finally, let Q be the parallel composition of the bucket orders Qi. Note that σ is an extension of Q, since if xy Q then xy Q for some i, and ∈ ∈ i hence, xy σ. ∈ By basic combinatorial arguments we find the number of reorderings of a given parallel bucket order:

Lemma 5.7 The number of reorderings of a bucket order of type m m 1 ∗ 2 ∗ m is given by (m + m + + m )!/(m !m ! m !). ∗ ℓ 1 2 ℓ 1 2 ℓ Lemma 5.8 The number of reorderings of the parallel composition of bucket orders P ,P ,...,P is given by p p p , where p is the num- 1 2 k 1 2 k i ber of reorderings of Pi.

When P is a parallel composition of p bucket orders of type m m 1∗ 2∗ ∗ m we find it convenient to denote the POF (P ) by (m m m )p; ℓ P 1 ∗ 2 ∗ ∗ ℓ this notation is explicit about the combinatorial structure of the POF while ignoring the arbitrariness of the labeling of the elements of N. When the size of the base-set, n, is clear from the context, we may extend the notation (m m m )p to refer to a family (P ), where P is the parallel 1 ∗ 2 ∗ ∗ ℓ P composition of p bucket orders of type (m m m )p and one trivial 1 ∗ 2 ∗ ∗ ℓ order on the remaining n (m + m + + m )p elements. − 1 2 ℓ Example 10. Consider the bucket order B = a b, c d from the { }{ }{ } previous example. By Lemma 5.4 it has 1 3 + 21 + 22 + 21 = 6 ideals. − Further, B generates an exact cover which by Lemma 5.7 consists of (1 + 2 + 1)!/(1!2!1!) = 12 reorderings. 5.2 Parallel Bucket Orders: An Introduction 75

Next, consider the parallel composition P of bucket orders P = h, i a, b and P = j e, f, g c, d . Bucket order P has 1 { }{ } 2 { }{ }{ } 1 1 2 + 22 + 22 = 7 and P has 1 3 + 21 + 23 + 22 = 12 ideals. Thus, by − 2 − Lemma 5.5, P has 7 12 = 84 ideals. Bucket order P has 4!/(2!2!) = 6 × 1 and P2 has 6!/(1!3!2!) = 60 reorderings and thus, by Lemma 5.8, P has in total 6 60 = 360 reorderings. 3 ×

Equipped with Lemmas 5.4–5.8, we can calculate the time–space prod- uct of the equivalence class of any parallel bucket order. Let P be a parallel composition of bucket orders P1,P2,...,Pk. Then we have

k (Q) = (P ) (P ) = (P ) (P ) , |I | |P ||I | |P i ||I i | Q∈P(P ) i=1 k max (Q) = (P ) = (Pi) , Q∈P(P ) |I | |I | |I | i=1 and thus

k θ( (P )) = (P ) (P ) 2 = (P ) (P ) 2. P |P ||I | |P i ||I i | i=1

By a parallel bucket order scheme we refer to a set of reorderings of some parallel bucket order, parameterized by one or more parameters. When the size of the base-set, n, is the only parameter, the scheme is degenerate, being just a sequence of POFs, ( ), that induces a single algorithm for permu- Pn tation problems, with some asymptotic time and space complexity bounds. A scheme becomes more interesting when there are additional free param- eters, varying the values of which yields, respectively, varying asymptotic time and space complexity bounds. An example of such a bucket order scheme is the two-bucket scheme of Section 3.1.1, which corresponds to POFs (s (n s))1; here, the size of the first bucket, s, parameterizes the ∗ − scheme. Other examples are the pairwise scheme, corresponding to POFs (1 1)p, with p as the parameter, and the generalized two-bucket scheme ∗ defined by POFs ( m/2 m/2 )⌊n/m⌋, with m as the parameter. In the ⌈ ⌉ ∗ ⌊ ⌋ sequel we will identify a bucket order scheme with the form of the corre- sponding POFs; for example, we may talk about the scheme (5 5)p. ∗ In the next section, we will find the parallel bucket order scheme that is optimal with respect to the time–space product. 76 5 Parallel Bucket Orders

5.3 On Optimal Time–Space Product

We continue our search for “efficient” POFs by showing that one can achieve a time–space product less than 4 by using parallel bucket orders. As we have seen in the previous section, there is a closed-form expression for the time–space product of the equivalence class of a parallel bucket order. Thus, one can numerically show that out of all parallel bucket order schemes, the 13 13 scheme, that is, the scheme corresponding to a POF (13 13)⌊n/26⌋ ∗ ∗ yields an optimal time–space product. Next, we will prove this. For natural numbers n and k with 26k n, use a shorthand for a ≤ Pn,k scheme corresponding to a POF (13 13)k. We get the following result. ∗ Theorem 5.9 Polynomial local time permutation problems can be solved in O∗((2κ1/26)n) time and O∗((2λ1/26)n) space, where

26 κ = (214 1) 226 < 2.539055 103 , 13 − × λ = (214 1) 226 < 2.441258 10−4 . − × Proof. By Proposition 4.6, permutation problems can be solved in O∗( (P ) ) time and O∗(max (P ) ) space, where is an exact P ∈P |I | P ∈P |I | P cover. By Lemma 5.6, the equivalence class of a parallel bucket order is an exact cover and thus is an exact cover. Pn,k By applying Lemmas 5.4 and 5.5 for , we get max (Q) = Pn,k Q∈Pn,k |I | λk2n and further applying Lemmas 5.7 and 5.8 we get (Q) = Q∈Pn,k |I | κk2n. Moreover, set k = n/26 . Notice that κ⌊n/26⌋2n κn/262n = (2κ1/26)n ⌊ ⌋ ≤ and thus permutation problems can be solved in O∗((2κ1/26)n) time. Fur- thermore, notice that for any ǫ > 0, we have λ⌊n/26⌋2n λn/26−12n = ≤ (λ1/26λ−1/n)n2n < (2λ1/26)n + ǫ for sufficiently large n and therefore per- mutation problems can be solved in O∗((2λ1/26)n) space.

We note that the time and space requirements in the previous theorem are O∗(2.71n) and O∗(1.46n), respectively, yielding the time–space product less than 3.9272. This time–space product is lower than the time–space products of the schemes presented earlier. This is also the first time that we encounter a time–space product less than 4. We summarize the time– space product by the following theorem that is immediate by Theorem 5.9.

Theorem 5.10 Polynomial local time permutation problems have a par- allel bucket order scheme with a time–space product 4(κλ)1/26 < 3.93. 5.3 On Optimal Time–Space Product 77

Let us show that the 13 13 scheme (Theorems 5.9 and 5.10) is optimal ∗ in the sense that it minimizes the time–space product within the class of all parallel bucket order schemes. To get started, let P be a parallel composition of k bucket orders on N = 1, 2, . . . , n . To lower-bound the { } product θ( (P )), define ψ(m) as the minimum of (B) (B) 2 over all P |P ||I | bucket orders B on m elements. Before we calculate ψ(m) below, we first note the bound

k θ( (P )) ψ( N ) P ≥ | k| i=1 min ψ(m)n/m : m = 1, 2, . . . , n . (5.1) ≥ Here the first inequality follows by the definitions of θ( (P )) and ψ and P the second by ψ(m) > 0 and the following elementary observation.

Lemma 5.11 Let s , s , . . . , s 1 be numbers that sum up to s, and 1 2 k ≥ let φ(r) > 0 for any r. Then φ(s )φ(s ) φ(s ) min φ(s )s/si : i = 1 2 k ≥ i 1, 2, . . . , k . s/sj Proof. Suppose the contrary. Then i φ(si) is less than φ(sj) for all sj s j = 1, 2, . . . , k and thus i φ(si) is less than φ(sj) for all j = 1, 2, . . . , k. sj Taking products on both sides yields the contradiction that j i φ(si) = s s i φ(si) is less than j φ(sj) . As ψ(m)1/m lower bounds the time–space product of any bucket or- der on m elements, it remains to calculate ψ(m) and show that ψ(m)1/m is minimized at m = 26. We begin by calculating (B) (B) 2 for a |P ||I | bucket order B = B B B . If each B consists of m elements, then 1 2 ℓ j j (B) (B) 2 is given by |P ||I | (m + m + + m )! θ(m , m , . . . , m ) = 1 2 ℓ 1 2 ℓ m !m ! m ! 1 2 ℓ (1 ℓ + 2m1 + 2m2 + + 2mℓ )2 . × −

We next show that θ(m1, m2, . . . , mℓ) is minimized subject to m1 + m2 + + m = m either at ℓ = 1 with m = m or at ℓ = 2 with ℓ 1 m = m/2 and m = m/2 ; this will allow us to express ψ(m) as 1 ⌈ ⌉ 2 ⌊ ⌋ min 4n, θ( m/2 , m/2 ) . { ⌈ ⌉ ⌊ ⌋ } We consider first the case ℓ = 2 and show that θ(m1, m2) is minimized when m1 and m2 are as close to each other as possible, formalized in the following “balancing lemma”. 78 5 Parallel Bucket Orders

Lemma 5.12 If m1 and m2 are positive integers with m1 + m2 = m, then θ(m , m ) θ( m/2 , m/2 ). 1 2 ≥ ⌈ ⌉ ⌊ ⌋ Proof. We consider even and odd values of m separately. Suppose m = 2a is even. Let c a 1 be a positive integer. We will ≤ − show that θ(a + c, a c)/θ(a, a) 1. To this end, observe first that − ≥ (2a+c + 2a−c 1)/(2a + 2a 1) 2c−1. − − ≥ Thus, θ(a + c, a c) a!a! − 4c−1 =: ρ(a, c) . θ (a,a) ≥ ( a + c)!(a c)! − Next, note that ρ(a, c) grows with a for any fixed c; to see this, observe that the ratio ρ(a, c)/ρ(a 1, c) equals a2/[(a + c)(a c)] > 1 for 0 < c < a. − − Thus, for any fixed c it would suffice to show that ρ(c+1, c) 1. What we, ≥ in fact, can do is to show that ρ(c + 1, c) grows with c and that ρ(3, 2) > 1, which leaves the case c = 1 open for a moment. Here, the former claim is proven by ρ(c + 1, c) 4(c + 1)2 4(c + 1)2 = > = 1 , ρ (c,c 1) (2 c + 1)2c (2 c + 2)2 − and the latter claim by calculation: ρ(3, 2) = 6/5 > 1. Finally, the case of c = 1 is handled by

θ(a + 1, a 1) a 5 2a−1 1 2 − = − θ (a,a) a1 + 4 2a−1 1 − 2 5 2 > 2 + 14 50 = > 1 . 48 Suppose then that m = 2a + 1 is odd. Again, let c a 1 be a positive ≤ − integer. To show that θ(a + 1 + c, a c)/θ(a + 1, a) 1 we will repeat the − ≥ line of argumentation given above for even m. Observe

(2a+1+c + 2a−c 1)/(2a+1 + 2a 1) 2c+1/3 . − − ≥ Thus, θ(a + 1 + c, a c) (a + 1)!a! 4c+1 − θ1(, a + ) ≥ ( a + 1+ c)!(a c)! 9 − =: ρ′(a, c) . 5.3 On Optimal Time–Space Product 79

Next, note that ρ′(a, c) grows with a for any fixed c; to see this, observe that the ratio ρ′(a, c)/ρ′(a 1, c) equals (a + 1)a/[(a + 1 + c)(a c)] > 1 for − − 0 < c < a. Thus, for any fixed c it would suffice to show that ρ′(c+1, c) 1. ≥ What we, in fact, can do is to show that ρ′(c + 1, c) grows with c and that ρ′(3, 2) > 1, which leaves the case c = 1 open for a moment. Here, the former claim is proved by ρ′(c + 1, c) 4(c + 2)(c + 1) 4(c + 2)(c + 1) = > = 1 , ρ ′(c,c 1) (2 c + 2)(2c + 1) (2 c + 4)(2c + 2) − and the latter by calculation: ρ′(3, 2) = 64/45 > 1. Finally, the case of c = 1 is handled by

θ(a + 2, a 1) a 9 2a−1 1 2 − = − θ1(, a + ) a2 + 6 2a−1 1 − 2 3 2 > 2 + 22 18 = > 1 . 16

At first glance, one might think that the uniform distribution should minimize θ(m1, m2, . . . , mℓ) also for ℓ > 2. However, this is in fact not the case. A counter-example is θ(3, 3, 3) 4.536 > 4.421 θ(4, 4, 1). ≈ ≈ Therefore, the proof technique we used for Lemma 5.12 or other convexity arguments seems not applicable. Instead, we are able to prove the following “shortening lemma”, which states that for any bucket order of length ℓ+1 ≥ 3 there is another bucket order of length ℓ 2 that yields a smaller time– ≥ space product.

Lemma 5.13 Let ℓ 2 and let m m m 0 and c c ≥ 1 ≥ 2 ≥ ≥ ℓ+1 ≥ 1 ≥ 2 ≥ c 0 be integers such that m = c + c + + c and c c 1. ≥ ℓ ≥ ℓ+1 1 2 ℓ 1 − ℓ ≤ Then θ(m1, m2, . . . , mℓ+1) > θ(m1 + c1, m2 + c2, . . . , mℓ + cℓ).

Proof. Put ai := mi + ci for i = 1, 2, . . . , ℓ. Also, denote b := mℓ+1 for brevity. We will show that θ(a1, a2, . . . , aℓ)/θ(m1, m2, . . . , mℓ, b) < 1. We begin with the case b = 1. Then it suffices to show that

θ(m + 1, m , . . . , m ) 1 2m1+1 + 2m2 + + 2mℓ + 1 ℓ 2 1 2 ℓ = − < 1 . mθ(m , , . . . , m , 1) m1 + 2m1 + m2 + + 2mℓ + 2 ℓ 1 2 ℓ 1 − To see that this holds, we consider a few cases to show that the squared term is always less than m1 + 1. Because the squared term is always less 80 5 Parallel Bucket Orders than 4, we are done for m 3. Now, if m = 2, then the squared term is 1 ≥ 1 at most [(23 + 21 1)/(22 + 21 + 21 2)]2 = (9/6)2 = 9/4 < 3. Finally, if − − m = 1, then the squared term is at most [(22 +21 1)/(21 +21 +21 2)]2 = 1 − − (5/4)2 = 25/16 < 2. Then, for any b 1, we notice the bound ≥ m !m ! m !b! b! 1 2 ℓ (5.2) a !a ! a ! ≤ ( b + 1)b 1 2 ℓ that holds because the denominator contains the factorials in the numera- tor, except for b!, plus b other terms all greater or equal to b + 1. Next suppose 2 b ℓ. Under this assumption a = m + 1 for ≤ ≤ i i i = 1, 2, . . . , b and ai = mi for i = b + 1, b + 2, . . . , ℓ. So we find that

2a1 + 2a2 + + 2aℓ + 1 ℓ 2 2m1 + 2m2 + + 2mℓ + 2b ℓ . (5.3) − ≤ − To see this, subtract the terms on the left from the ones on the right to get

2mb+1 + 2mb+2 + + 2mℓ + 2b+1 ℓ 1 (ℓ b + 2)2b ℓ 1 0 . − − ≥ − − − ≥ Here the last inequality follows because (ℓ b + 2)2b clearly grows with b − for 2 b ℓ, and at b = 2 we have ℓ22 ℓ + 1, which holds for all ℓ > 1. ≤ ≤ ≥ Combining the bounds (5.2) and (5.3) yields

2 θ(a1, a2, . . . , aℓ) 2 b! 4b! b b < 1 θm(m1, 2, . . . , mℓ, b) ≤ ( b + 1) ≤ ( b + 1) for b 2, since (4 2!)/(2+1)2 = 8/9 and it is easy to verify that 4b!/(b+1)b ≥ decreases when b grows. It remains to consider the case b > ℓ. We will first examine the cases b = 3 and b = 4, and then the remaining case b 5. ≥ Suppose b = 3; hence, ℓ = 2. Thus a1 = m1 + 2 and a2 = m2 + 1, and so

θ(a , a ) 3! 2m1+2 + 2m2+1 1 2 1 2 = − . θm(m , , b) ( m + 2)(m + 1)(m + 1)2m1 + m2 + 23 2 1 2 1 1 2 − Now, if m1 = 3, then m2 = 3, and the above ratio evaluates to 6/80(47/22)2 < 1. Otherwise m 4, and the ratio can be bounded 1 ≥ from above by 6/(6 5 4)42 = 4/5 < 1. Next suppose b = 4. This means a m + 2 for all i = 1, 2, . . . , ℓ. i ≤ i Using the bound (5.2) yields

2 θ(a1, a2, . . . , aℓ) 4 4! 384 4 = < 1 . θm(m1, 2, . . . , mℓ, b) ≤ (4 + 1) 625 5.3 On Optimal Time–Space Product 81

Finally, suppose b 5. Observe a m + b/ℓ m + (b + ℓ 1)/ℓ ≥ i ≤ i ⌈ ⌉ ≤ i − ≤ mi + (b + 1)/2. Thus,

2a1 + 2a2 + + 2aℓ + 1 ℓ − 2(b+1)/2 2m1 + 2m2 + + 2mℓ + 2b ℓ ; (5.4) ≤ − note that 2b ℓ is positive, since b > ℓ. Combining the bounds (5.2) and − (5.4) yields

b+1 θ(a1, a2, . . . , aℓ) 2 b! b < 1 . θm(m1, 2, . . . , mℓ, b) ≤ ( b + 1) Here the last inequality follows because at b = 5 we have 25+15!/(5 + 1)5 = 80/81 < 1 and because this bound decreases when b grows. To verify the latter claim, observe that the bound at b divided by the bound at b 1 − equals 2[b/(b + 1)]b 2[(b + 1)/b]e−1 8/(3e) < 1 for any b 3. ≤ ≤ ≥ Combining the shortening lemma (Lemma 5.13) with the balancing lemma (Lemma 5.12) immediately yields the following summary.

Lemma 5.14 Let m 1 be an integer. Then ψ(m) equals the smaller of ≥ θ(m) = 4m and θ( m/2 , m/2 ). ⌈ ⌉ ⌊ ⌋ We are now ready to show that the time–space product of the 13 13 ∗ scheme is the smallest one can achieve with parallel compositions of bucket orders.

Theorem 5.15 (Lower bound) Let P be a parallel composition of bucket orders on 1, 2, . . . , n . Then the time–space product θ( (P ))1/n is at least { } P 4(κλ)1/26 3.9271, where κ and λ are as defined in Theorem 5.9. ≥ Proof. By the bound (5.1), it suffices to show that ψ(m)1/m is minimized at m = 26. By calculation with a computer, using Lemma 5.14, we find that this is indeed the case when 1 m 149 (results not shown). ≤ ≤ It remains to show that ψ(m)1/m ψ(26)1/26 = 3.9271 ... for all m ≥ ≥ 150. Because this clearly holds if ψ(m) = θ(m), we may, by Lemma 5.14, without any loss in generality assume ψ(m) = θ( m/2 , m/2 ). To this m 1/m ⌈ ⌉ ⌊ ⌋ end, define υ(m) := ⌊m/2⌋ and observe ψ(m)1/m = υ(m) 2⌈m/2⌉ + 2⌊m/2⌋ 1 2/m − 2/m υ(m) 2m/2+1 1 ≥ − 2υ(m) . ≥ 82 5 Parallel Bucket Orders

Next we show that υ(m) grows with m, by proving υ(2a 1)/υ(2a) 1 − ≤ and υ(2a)/υ(2a + 1) 1 for any a = 1, 2,... (actually, for any m 150 ≤ ≥ would do). The former is shown by

1 1 2a−1 2a υ(2a 1) 2a a 2a 1 − = 22a = 1 ; υ2a( ) a 2a ≤ 2 the latter is shown by

2a+1 1 υ(2a) 2a 2a a + 1 = υ2a( + 1) a 21a + 2a 1 2 2a 1 1 1 + ≤ e 2 21a + − 1 1 e 2a e 2a+1 ≤ < e0 = 1 , where the first inequality holds for a 2, whereas in the case a = 1 we ≥ replace e by 2 and obtain the bound 2√2. /3 < 1 Now it suffices to verify that 2υ(150) = 3.92778 . . . > 3.9271.

The previous results can be adapted to solve the structure discovery problems in Bayesian networks. Combining Theorems 4.14, 4.16 and 5.9 yields the following theorem.

Theorem 5.16 Given explicit input, the Osd problem and the Fp prob- lem can be solved in O(2.71nn2τ(n)) time and O(1.46nn) space provided that each node has at most O(1.45n) possible parent sets, which form a downward-closed set family, and the membership in the possible parent sets can be evaluated in τ(n) time.

Finding an optimal partial order scheme in general is an interesting question. While both the number of ideals and the number of the reorder- ings of a parallel bucket order admit closed-form formulas, this is not the case in general. Both computing the number of ideals [26] and the num- ber of linear extensions [12]2 are #P-complete problems. So far it remains an open problem whether the 13 13 scheme is the optimal partial order ∗ scheme. We have implemented a script and computed the number of ideals and the number of linear extensions for all possible partial orders up to 8 elements. For all numbers of elements n from 2 to 8, the balanced scheme

2The n! divided by the number of linear extensions sets a lower bound for the size of the cover. 5.4 Parallel Bucket Orders in Practice 83 described in Lemma 5.12 is optimal. Therefore, we present the following conjecture that states that the 13 13 scheme is actually an optimal partial ∗ order scheme.

Conjecture 5.17 The time–space product of a POF on N is at least P 4(κλ)1/26.

5.4 Parallel Bucket Orders in Practice

In the previous section we showed that out of all possible parallel bucket order schemes, the (13 13)⌊n/26⌋ scheme minimizes the time–space prod- ∗ uct. Unfortunately, the practical value of this result is rather limited. To see this, we recall Theorem 5.9 which shows that permutation problems can be solved using the (13 13)⌊n/26⌋ scheme in O∗((2κ1/26)n) time and ∗ O∗((2λ1/26)n) space, where κ > 2.539054 103 and λ < 2.441258 10−4. × × In words, the (13 13)⌊n/26⌋ scheme reduces the space requirement at least ∗ by factor 4000 but the time requirement increases at least by factor 2500, which renders the time requirement the bottleneck in practice. In practice, we usually have a fixed amount of memory available, which sets a strict limit to the space usage. Therefore, we want to solve the problem in a given space as fast as possible. Lemma 5.12 shows that for bucket orders on a base-set of size m, a balanced bucket order (buckets are about the same size) with two buckets gives the best tradeoff. Table 5.1 shows the time and space requirements of the POFs ( m/2 , m/2 )⌊n/m⌋ ⌊ ⌋ ⌈ ⌉ in the range 1 m 40. Notice that when m < 13, both the space ≤ ≤ requirement and the time–space product decrease whenever m increases. Beyond that point both the space requirement and the time–space product decrease for even m. The results suggest that in practice one should choose the smallest m that gives a small enough space requirement. We also notice that the time–space product for m 10 is always less than 4. This suggests ≥ that even these “suboptimal” schemes can yield good tradeoffs. In what follows, we present time and space bounds for solving the Osd and Fp problems. Naturally, the same upper bounds hold also for all the polynomial local time permutation problems of degree 1. However, the structure discovery problems are more challenging as one needs to han- dle the possibly exponential-sized input. Next, we consider a scenario in which the number of parallel bucket orders, p, is between 0 and n/m . p ⌊ ⌋ A POF ( m/2 , m/2 )p consists of m partial orders, each of which ⌊ ⌋ ⌈ ⌉ ⌊m/2⌋ has 2n−mp(2⌈m/2⌉ + 2⌊m/2⌋ 1)p ideals. Now plugging these numbers into − Theorems 4.14, 4.15 and 4.16, we get the following time and space bounds. 84 5 Parallel Bucket Orders

Table 5.1: Time and space requirement and time–space product of ( m/2 ⌊ ⌋∗ m/2 )⌊n/m⌋ schemes for 1 m 40. The optimal tradeoff is shown in ⌈ ⌉ ≤ ≤ bold font.

m T S T S m T S T S 1 2.0000 2.0000 4.0000 21 2.6930 1.4658 3.9472 2 2.4495 1.7321 4.2427 22 2.6918 1.4595 3.9285 3 2.4663 1.7100 4.2172 23 2.6995 1.4613 3.9445 4 2.5458 1.6266 4.1409 24 2.6981 1.4557 3.9275 5 2.5603 1.6154 4.1358 25 2.7054 1.4574 3.9428 6 2.5874 1.5705 4.0633 26 2.7039 1.4525 3.9271 7 2.6009 1.5651 4.0705 27 2.7107 1.4542 3.9417 8 2.6126 1.5362 4.0131 28 2.7091 1.4497 3.9272 9 2.6253 1.5339 4.0268 29 2.7154 1.4514 3.9411 10 2.6308 1.5134 3.9812 30 2.7138 1.4473 3.9276 11 2.6427 1.5129 3.9979 31 2.7198 1.4490 3.9408 12 2.6452 1.4974 3.9608 32 2.7181 1.4452 3.9282 13 2.6563 1.4979 3.9787 33 2.7238 1.4469 3.9408 14 2.6573 1.4856 3.9476 34 2.7221 1.4434 3.9289 15 2.6676 1.4867 3.9658 35 2.7275 1.4450 3.9410 16 2.6677 1.4767 3.9392 36 2.7258 1.4418 3.9298 17 2.6773 1.4781 3.9571 37 2.7309 1.4433 3.9413 18 2.6767 1.4697 3.9338 38 2.7292 1.4403 3.9307 19 2.6856 1.4713 3.9512 39 2.7340 1.4418 3.9418 20 2.6847 1.4641 3.9305 40 2.7323 1.4390 3.9316

Theorem 5.18 Given explicit input, the Osd and the Fp problem can be m 2 solved in O ⌊m/2⌋ (I + F τ(n)) n time and O( I + F n) space, where I = 2n−mp(2⌈m/2⌉+2⌊m/2⌋ 1)p, for any m = 2, . . . , n and p= 0,..., n/m , − ⌊ ⌋ provided that each node has at most F possible parent sets, which form a downward-closed family, and the membership in the possible parent sets can be evaluated in τ(n) time.

Theorem 5.19 Given implicit input, the Osd and the Fp problem can m 2 be solved in O ⌊m/2⌋ (I + F τ(n)δ(n)) n time and O(In) space, where I = 2n−mp(2⌈m/2⌉+2⌊m/2⌋ 1)p, for any m = 2, . . . , n and p = 0,..., n/m , − ⌊ ⌋ provided that each node has at most F possible parent sets, which form a downward-closed family, the membership in the possible parent sets can be 5.4 Parallel Bucket Orders in Practice 85 evaluated in τ(n) time and a local score can be computed from data in δ(n) time.

We observe that the number of ideals, I, dominates both the time and the space requirements, as long as the size of input, F is not too large. Generally, the number of ideals grows exponentially with n and thus we can in practice handle an exponential number of possible parent sets with essentially no extra cost. Next, we take a closer look at this issue and allow the number of possible parents to grow linearly with n. Formally, let the maximum indegree be k = n, with some slope 1/2. The interesting ≤ question is how large can be be still guaranteeing that the size of the input does not dominate the time requirement. Next, we present a bound for . The argumentation follows the same pattern as in Section 5.1. We notice that the slope depends on the scheme that is used. For a moment, let us focus on the ( m/2 m/2 )⌊n/m⌋ ⌊ ⌋ ∗ ⌈ ⌉ scheme, and for simplicity assume that m is even and n is divisible by m. For any fixed m, we bound the largest slope by m in similar fashion as in Section 5.1 for the pairwise scheme. We observe that, by the inequal- mn n−1 ity (A.3) in Appendix A, the number of possible parent sets i=0 i is bounded from above by 2H(m)n, where H is the binary entropy func- tion. On the other hand, every member of a POF (m/2 m/2)n/m has ∗ ((2m/2+1 1)1/m)n ideals. Thus, the number of ideals dominates the space − and time requirements if 2H(m)n ((2m/2+1 1)1/m)n; or equivalently, ≤ − H( ) (1/m) log (2m/2+1 1). By solving this inequality numerically, m ≤ 2 − we get a bound for . Table 5.2 shows the for each even m 26. m m ≤

Table 5.2: Bounds on the maximum indegree slopes for the (m/2 m/2)n/m ∗ schemes.

m m m m m m 2 0.238 12 0.139 20 0.127 4 0.190 14 0.135 22 0.125 6 0.167 16 0.131 24 0.124 8 0.153 18 0.129 26 0.123 10 0.145

Since each member of the (m/2 m/2)n/m scheme has as many or fewer ∗ ideals than each member of the (m/2 m/2)p scheme, for p n/m, we ∗ ≤ have the following characterization. 86 5 Parallel Bucket Orders

Corollary 5.20 Given explicit input, the Osd and the Fp problem can be solved in O( m In2) time and O(In) space, where I = 2n−mp(2m/2+1 m/2 − 1)p, for anyp = 0, 1, 2,..., n/m and m = 2, 4, 6,..., 26, provided that ⌊ ⌋ each node has at most mn parents, with m as given in Table 5.2.

We note that given that each node has at most mn parents, the number of ideals dominates the input size and both the Osd and the Fp problem with both explicit and implicit input can be solved in the same space (within a constant factor). We further note that in this case explicit input seems superior as the time usage is, in practice, considerably lower as there is no need to recompute local scores. To get a grasp of the bounds for the slope, consider, for example, a 30-node DAG. Now, the maximum indegree can be set to 0.238 30 = 7 ⌊ × ⌋ for the (1 1)p scheme (the pairwise scheme) and to 0.139 30 = 4 for ∗ ⌊ × ⌋ the (6 6)p scheme. A larger maximum indegree may render the size of ∗ the input to dominate the running time. In Section 5.4.2 we will show that these bounds are only slightly conservative.

5.4.1 Parallelization

All the schemes presented so far are easily parallelized onto several proces- sors each with its own memory. This is due to the fact that computations for every P are independent of each other and hence each partial order ∈ P can be considered separately. For example, the two-bucket scheme (Sec- n tion 3.1.1) can be parallelized onto s processors. The divide-and-conquer n scheme (Section 3.1.2) can be parallelized onto n/2 processors. If the re- cursion is applied several times the execution at each level can be further parallelized. The ( m/2 , m/2 )p schemes can obviously be parallelized onto m p ⌊ ⌋ ⌈ ⌉ Osd ⌊m/2⌋ processors. Furthermore, considering the problem, the opti- mal local scores in lines 7–9 in Algorithm 5 can be precomputed, that is, not merging with the computation of the score of an optimal network in line 8– in parallel for each n nodes as in the Silander–Myllym¨aki implemen- m p tation [96]. Thus, in total this amounts to parallelization onto ⌊m/2⌋ n processors. This parallelization is efficient in the sense that the running time decreases by the same factor that the number of parallel processors increases. By Corollary 5.20, ignoring factors polynomial in n, the run- ning time per processors becomes O((2m/2+1 1)p/m2n−mp) (under the − conditions of Corollary 5.20) which is exponentially less than 2n when p grows. This can be verified by noticing that (2m/2+1 1)p/m2n−mp − ≤ (2m/2+1)p/m2n−mp = 2p/2+p/m2n−mp = 2n−(m−(m+2)/(2m))p. When m 2, ≥ 5.4 Parallel Bucket Orders in Practice 87 the expression decreases whenever p grows.

5.4.2 Empirical Results So far we have analyzed the asymptotic time and space requirements of parallel bucket orders. Although the asymptotic bounds are interesting, they do not necessarily match the bounds in practice with small input. Therefore, an empirical study on the running times and space usage of parallel bucket orders was conducted. The dynamic programming algorithm over parallel bucket orders for the Osd problem was implemented in the C++ language. The Osd problem was chosen as it allows the investigations related to the number of possible parent sets. The Fp problem was not considered as its behavior is similar to the Osd problem, and thus the second algorithm would not have added value to the current investigation. The implementation is not optimized, and so the empirical results should be viewed as a mere proof of concept3. The implementation was tested using the ( m/2 , m/2 )p scheme varying ⌊ ⌋ ⌈ ⌉ the number of nodes n, bucket order sizes m and the number of parallel bucket orders p. The experiments were run on Intel Xeon R5540 processors, each with 32 GB of RAM. In all experiments explicit input was used, that is, the local scores were taken as given, so computing them is not included in the running time estimates. First, the running time for the limit of 16 GB of memory was examined, letting the number of nodes n vary from 25 to 34, with maximum indegree set to 3. First, the smallest bucket order size m that yields a memory requirement of 16 GB or less was estimated. Then Algorithm 5 was run for ten partial orders in a POF ( m/2 , m/2 )1 and the average of the running ⌈ ⌉ ⌊ ⌋ times was computed; finally, the average was multiplied by the size of the POF to get an estimate of the total running time; see Table 5.3. We observe that, as expected, the time requirement grows rapidly with n: An optimal 25-node DAG can be found in about 25 minutes. Already 30-node DAGs require over 16 days of CPU time. And 34-node DAGs become feasible only with large-scale parallelization: with 1000 processors an optimal network can be found in about 6 days. Next, consider how different schemes affect the running times and the space usage in practice. Two specializations of the generalized bucket or- der scheme ( m/2 m/2 )p were chosen for the analysis: the practical ⌈ ⌉ ∗ ⌊ ⌋ scheme, where p = 1, and the pairwise scheme, where m = 2. The re-

3The Silander–Myllym¨aki implementation [96] is about five times faster when the algorithms were run on the same setting, that is, the algorithm for dynamic programming over partial orders is given just the trivial partial order on the nodes. 88 5 Parallel Bucket Orders

Table 5.3: Running times for a varying number of nodes when space is limited to 16 GB. Columns: n = the number of nodes, p = the number of parallel bucket orders, m = the size of the balanced bucket order, Time = running time (in CPU hours) per one partial order, PO = the number of partial orders to cover all linear orders, TT = total running time (in CPU hours).

n p m Time PO TT 25 0 0 0.42 1 0.42 26 1 3 0.44 3 1.34 27 1 5 0.49 10 4.9 28 1 8 0.35 70 24.8 29 1 10 0.39 252 97.7 30 1 12 0.43 924 394 31 1 14 0.49 3432 1671 32 1 16 0.53 12 870 6784 33 1 18 0.83 48 620 40 332 34 1 20 0.78 184 756 144 930

sults are shown in Tables 5.4 and 5.5. As expected, the practical scheme yields a clearly better space–time tradeoff than the pairwise scheme. This is perhaps even more clearly pronounced in Figure 5.4, which shows the em- pirical tradeoffs in the space–time plane along with the analytical bounds (Theorem 5.9, Proposition 5.3); here the empirical results were normal- ized as follows. Each empirical time and space requirement was divided, respectively, by the empirical time and space requirement of the basic dy- namic programming algorithm (the case of the trivial order) and the 26th root of the ratio, multiplied by 2, was taken as the normalized value. The analytical bounds were normalized analogously. We see that, in general, the empirical and analytical bounds seem to be in a good agreement, the analytical bounds being slightly conservative for medium m (the practical scheme) and medium p (the pairwise scheme). Also the influence of the maximum indegree k on various characteristics of the implementation was investigated. To this end, the (degenerate) scheme (10 10)1 with n = 20 nodes was analyzed, varying k from 1 to ∗ 8. For interpretation of the results, shown in Table 5.6, it is useful to note that the partial orders in question have 2047 ideals. For comparison, the (worst-case) input size is 1160 for k = 3 and 5036 for k = 4. So we conclude that the number of ideals dominates the input size precisely when 5.4 Parallel Bucket Orders in Practice 89

Table 5.4: Running times (in CPU hours) and memory usage (in MB) of the practical scheme ( m/2 m/2 )1 with 2 m 26 and n = 26. ⌈ ⌉ ∗ ⌊ ⌋ ≤ ≤ m p Space Time m p Space Time 0 0 21248 1.01 14 1 331 32.51 2 1 15936 1.13 15 1 248 43.97 3 1 13280 1.41 16 1 166 64.35 4 1 9296 1.75 17 1 124 87.79 5 1 7304 2.13 18 1 83 129.65 6 1 4980 2.80 19 1 62 171.93 7 1 3818 3.57 20 1 41 256.61 8 1 2573 4.96 21 1 31 342.92 9 1 1950 6.88 22 1 21 489.88 10 1 1307 9.06 23 1 16 638.48 11 1 986 12.15 24 1 10 976.50 12 1 659 17.43 25 1 8 1444.53 13 1 495 24.31 26 1 5 2600.15

Table 5.5: Running times (in CPU hours) and memory usage (in MB) of the pairwise scheme (1 1)p with 1 p 13 and n = 26. ∗ ≤ ≤ m p Space Time m p Space Time 0 0 21248 1.01 2 7 2836 12.79 2 1 15936 1.13 2 8 2127 20.25 2 2 11952 1.48 2 9 1595 32.36 2 3 8964 2.09 2 10 1197 51.66 2 4 6723 3.19 2 11 897 78.96 2 5 5042 4.97 2 12 673 126.52 2 6 3782 7.90 2 13 505 198.88

k 3. Now, recall that the bounds in Corollary 5.20 guarantee this only ≤ for k 2, indicating that the analytical bounds are not tight but slightly ≤ too pessimistic. Table 5.6 also shows, perhaps somewhat surprisingly, that even if the input size for k = 5 is more than 3 times the input size for k = 4, the total running time less than doubles. This can be explained by the fact that the respective increase in the tail accesses, from 46,520 to 160,260, does not yet pay off, since the number of computation steps that 90 5 Parallel Bucket Orders

Tradeoffs in practice 3.2 Practical scheme (empirical) Pairwise scheme (empirical) Practical scheme (analytical) 3 Pairwise scheme (analytical)

2.8 ) n T

2.6 (Time = T 2.4

2.2

2 1.4 1.5 1.6 1.7 1.8 1.9 2 S (Space = Sn)

Figure 5.4: Empirical and analytical time and space requirements of the pairwise scheme and the practical scheme, for n = 26 nodes; see text for a more detailed description.

are not related to the input (nor the maximum indegree) appears to be as large as 358, 300. For larger k, the running time will grow about linearly with the number of tail accesses.

5.4.3 Scalability beyond Partial Orders

The results presented in the previous section show that the partial order ap- proach allows one the learn networks of little more than 30 nodes. Although this is an improvement compared to the standard dynamic programming, the scalability of the partial order approach is rather limited. Next, we dis- cuss the ways to solve the structure discovery problems when the number of nodes is large. The weakness of the presented dynamic programming algorithms is that the time requirement in the best-case equals to the time requirement in the worst-case. This is due to the fact that they do not take advantage of the structure of the data. However, it is often unnecessary to go explicitly 5.4 Parallel Bucket Orders in Practice 91

Table 5.6: Characteristics of the practical scheme (10 10)1 with n = 20 ∗ nodes, for varying maximum indegree k. Columns: Input size = the number of parent sets per node, Regular accesses = the number of accesses to input when the parent set is an ideal, Tail accesses = the number of accesses to input when the parent set is in the tail of an ideal, Total time = the running time in hours. Regular and tail accesses are presented per partial order.

k Input size Regular accesses Tail accesses Total time 1 20 210 90 2.4 2 191 1020 1350 2.5 3 1160 3060 9840 2.8 4 5036 6420 46500 4.3 5 16664 10200 160260 8.5 6 43796 13140 429480 18.3 7 94184 14700 932160 35.5 8 169766 15240 1687530 60.4

through the whole search space. Therefore, one way to improve the scala- bility of the structure discovery is to make use of the inner structure of the data whenever possible. Recently, there have been attempts to improve the scalability of the dynamic programming approach using the A∗ search in the subset lattice [73, 112]. Although the A∗ search procedure often speeds up the com- putation, this approach does not seem to yield substantial improvements in scalability. Other approaches like linear programming [21] and branch- and-bound [24] have been used to find optimal networks on about 60 nodes; thorough evaluations on the capabilities of these approaches are so far miss- ing. Here the running time can be greatly influenced by the data. For example, the running time of the branch-and-bound method skyrockets if one uses BDeu score with a large equivalent sample size. In this thesis we have considered exact algorithms only. Nevertheless, one can improve scalability of the structure discovery in Bayesian networks by employing approximation algorithms and heuristics. Here, exact algo- rithms can be used as components of approximation algorithms and heuris- tics. For example, one could apply the partial order approach to develop more efficient MCMC for approximating the posterior probabilities of struc- tural features. Many modern MCMC algorithms [32, 39] sample linear or- ders and compute the posterior probability conditional to the linear order. If one samples partial orders instead of linear orders, the size of the sample 92 5 Parallel Bucket Orders space can be decreased, since we need to consider only the partial orders in an exact cover. Simultaneously, one could compute the posterior condi- tional to a partial order in essentially the same time as for a linear order given that the partial order in question has a sufficiently small number of ideals. This is due to the fact that the computation of the posterior con- ditional to the linear order still requires evaluating the zeta transform at some points. It is worth pointing out that, in fact, this kind of heuristic algorithm has been implemented in a recent paper; see Niinim¨aki et al. [77]. One factor that limits scalability of all score-based algorithms is the need to compute (or store) local scores. If one has n nodes and a node can k n−1 have at most k parents, the total number of local scores is n i=0 i . When n grows, this number becomes a limiting factor. For example, setting n = 200 and k = 5 yields almost 500 billion local scores. Thus, for data sets consisting of hundreds of nodes, one has to resort to local search methods. Chapter 6

Learning Ancestor Relations

So far we have implicitly assumed complete data, or in other words, that all relevant nodes are observed. In practice, however, this is often not the case. Thus, we investigate how the unobserved nodes affect structure discovery. As mentioned earlier, there are two approaches to structure discov- ery in Bayesian networks: constraint-based and score-based. While the constraint-based methods are not particularly suitable for importing prior knowledge, using the data efficiently or handling nonidentifiability, they have, however, given rise to a profound theory for dealing with unobserved variables. The best known algorithms that identify unobserved nodes are IC∗ [84, 85] and FCI [100, 101]. The score-based methods [19, 47], par- ticularly Bayesian ones [39, 65], on the other hand, excel in flexibility and efficiency. However, a principled treatment of unobserved nodes is computationally infeasible and the handling of unobserved nodes in prac- tice is limited to some score-based heuristics for finding unobserved nodes [28, 29, 31, 38]. Therefore, the unobserved nodes are often ignored alto- gether: one either refuses to make any conclusions, especially causal ones, about the DAG, or one makes conclusions with unquantified risk of erro- neous claims. Motivated by these issues, we study in this chapter the potential of Bayesian averaging in learning of structural features on observed nodes only. As the score-based methods are supposed to model the data without unob- served variables, it is crucial to know what kind of features can be reliably learned from observational data, even when there are unobserved nodes at work. To this end, we investigate ancestor relations, that is, the existence of a directed path between two nodes (see Definition 2.5). The idea of studying ancestor relations is not new. Spirtes et al. [101] investigated the prospects of learning ancestor relations using FCI algo- rithm in a small case study. Their results suggest that reliable learning of

93 94 6 Learning Ancestor Relations ancestor relations is possible in the presence of unobserved nodes; however, direct comparison to Bayesian averaging is not reasonable, as the predic- tions by FCI are unquantified and predictions are not necessarily made for all pairs of nodes. Later, Friedman and Koller [39] studied ancestor rela- tions using the Bayesian approach under the assumption of no unobserved variables. They sampled DAGs (via node orderings) from their posterior distribution using a well-known Markov chain Monte Carlo simulation1 and the posterior probabilities of ancestor relations, also called path features, are estimated based on the sampled DAGs; based on the posterior prob- abilities, the ancestor relation is either claimed to hold or not to hold, potentially depending on the relative costs of making incorrect positive or negative claims. Next, in Section 6.1 we continue the development of Bayesian algorithms and present an exact Bayesian averaging algorithm for computing posterior probabilities of ancestor relations. Then in Section 6.2 we report results from empirical tests.

6.1 Computation

Recall the ancestor relation problem (Definition 2.5). To present the an- cestor relation problem formally, let f(A) be an indicator for the ancestor relation that returns 1 if there is a directed path from the source node s to the target node t in A and 0 otherwise. We use the shorthand notation s ; t to denote the event f(A) = 1. The goal is to compute the posterior probability of the ancestor relation, that is, P r(s ; t D). Now assuming | an order-modular prior on A and a decomposable likelihood score, we have

P r(s;t, D) = f(A)P r(D A)P r(A) | A = f(A) P r(D D ,A ) P r(A, L) v| Av v A v∈N L⊇A = f(A) P r(D D ,A ) ρ (L )q (A ) v| Av v v v v v A v∈N L⊇A v∈N = f(A) P r(D D ,A )q (A )ρ (L ) v| Av v v v v v A L⊇A v∈N = f(A) P r(D D ,A )q (A )ρ (L ) . v| Av v v v v v L A⊆L v∈N

1For more about Markov chain Monte Carlo (MCMC) methods, see, for example, Gilks et al. [44]. 6.1 Computation 95

The difference compared to the equation (2.3) is that the indicator f(A) does not factorize and therefore the straightforward dynamic programming algorithm from Section 2.3.2 does not apply here. Next, we modify the dynamic programming algorithm to solve the ancestor relations problem. The idea of the algorithm is to compute for every node subset S its contribution to the target probability, P r(s;t, D), assuming the nodes in S are the first S nodes in the linear order L; the contribution is over all | | DAGs on the node set S. The key difference to the dynamic programming algorithms in Sections 2.3.1 and 2.3.2 is that, aside from the node set S, we need to keep a handle on the nodes in S that are descendants of the source node s. To this end, define a set T S such that t T if and only ⊆ ∈ if s is an ancestor of t or t = s. Thus, every DAG on S determines exactly one such set T S. ⊆ Furthermore, for sets S N and T S and a linear order L S S ⊆ ⊆ S ⊆ × on S, we use the shorthand

(L , S, T ) = A L : v S (s;v in A iff v T ) . A S { S ⊆ S ∀ ∈ S ∈ } In words, (L , S, T ) contains a particular DAG A on S if and only if A A S S S is compatible with L and A contains a path from s to every node v T , S S ∈ and not to any other node in S. For fixed S and L , the sets (L , S, T ) S A S parametrized by T partition the space of DAGs on N compatible with LS. Figure 6.1 shows an example that illustrates the partitioning of the sets of DAGs by (L , S, T ). A S The dynamic programming algorithm will compute a function gs(S, T ), defined for all S N and T S by ⊆ ⊆

gs(S, T ) = ρv(Lv)βv(Av) ,

LS AS ∈A(LS ,S,T ) v∈S where the outer summation is over all linear orders LS on S. Intuitively, gs(S, T ) is the sum of p(A, D, L) over all DAGs AS and linear orders L, with A L, such that S are the first nodes in the order L and there is a S ⊆ path from s to v S in A if and only if v T . That the values g (S, T ) ∈ S ∈ s are sufficient for computing the target quantity p(s;t, D) is shown by the following result.

Lemma 6.1

p(s;t, D) = gs(N,T ) . T :s,t∈T 96 6 Learning Ancestor Relations

s u v s u v

s u v

(a) (b)

s u v s u v

s u v s u v

s u v

(c) (d)

Figure 6.1: DAGs on node set S = s, u, v compatible with linear or- { } der L = ss,su,sv,uu,uv,vv partitioned according to (L , S, T ) given S { } A S source node s. (a) T = s , (b) T = s, u , (c) T = s, v , and (d) { } { } { } T = s, u, v . { }

Proof. The definitions directly yield

p(s;t, D) = ρv(Lv)βv(Av) , L A⊆L v∈N s;t in A the outer summation being over all linear orders L on N. We next break the inner summation into two nested summations by observing that the sets (L, N, T ), for s, t T , form a partition of the set A ∈ (L) = A L : s;t in A : indeed, each DAG A (L) determines A { ⊆ } ∈ A precisely one node set T such that A contains a path from s to v for the nodes in v T and no paths from s to any other nodes. Thus we have ∈

p(s;t, D) = ρv(Lv)βv(Av) L T :s,t∈T A∈A(L,S,T ) v∈N

= ρv(Lv)βv(Av) T :s,t∈T L A∈A(L,S,T ) v∈N

= gs(N,T ) . T :s,t∈T

This completes the proof. 6.1 Computation 97

From the algorithmic point of view, the pair (S, T ) is sufficient for en- abling a factorization of the sum over the AS into independent sums over the parent sets A , for v S. Indeed, we have the following recurrence. v ∈ Lemma 6.2

g (S, T ) = 1 for S = and T = , s ∅ ∅ g (S, T ) = 0 for s T and (s S or T = ) , s ∈ ∈ ∅ g (S, T ) = g (S v ,T v )ρ (S v )¯α (S, T ) otherwise, s s \{ } \{ } v \{ } v v∈S where

β (A ) if v T , v = s, v v ∈  Av⊆S\{v} α¯ (S, T ) = Av∩T =∅ v   βv(Av) if v S T or v = s. Av⊆(S\{v})\T ∈ \   Proof. Proof is by straightforward induction on the size of S. First, observe that the sum over LS in the definition of gs(S, T ) breaks into a double- summation, in which the outer summation is over the last node v S in ∈ the order LS and the inner summation is over all linear orders, LS\{v}, on the remaining nodes S v . Second, observe that the summation over \{ } A (L , S, T ) breaks into a double-summation, in which the outer S ∈ A S summation is over the DAGs A L ,S v ,T v and S\{v} ∈ A S\{v} \{ } \{ } the inner summation is over the parent sets A S v satisfying the v ⊆ \{ } requirement that (a) if there is no path from s to v, that is, v / T , then ∈ there must be no path from s to u for any parent u A of v, and (b) if ∈ v there exists a path from s to v, that is, v T , then there must exist a path ∈ from s to u for at least one parent u of v.

Figure 6.2 illustrates the requirements on choosing parent sets for the node v in the last equation in Lemma 6.2. In Figure 6.2(a) v T meaning ∈ that it is required that there is a path from s to v. Now, we can choose any parent set for v as long as at least one of the parents is in T . On the other hand, in Figure 6.2(b) v / T , and thus it is required that there is no ∈ path from s to v. Now, we have to choose the parents of v from S T . \ The evaluation of the values gs(S, T ) using the recurrence is complicated by the fact that the inner summation,α ¯v(S, T ), is over exponentially many sets Av and, furthermore, there is a condition that depends not only on the set S but the set T . Fortunately, the inner summation can be precomputed for each v N and S N v . Indeed, if v / T , then the sum is over all ∈ ∈ \{ } ∈ 98 6 Learning Ancestor Relations

s s v

T T

v

(a) (b)

Figure 6.2: Choosing parent sets for a node v S when (a) v T and (b) ∈ ∈ v / T . ∈ subsets A of (S v ) T ; if v T , then the sum is over all the remaining v \{ } \ ∈ subsets of S v . Thus, it suffices to precompute the zeta transform \{ } αv(U) = βv(Av) Av⊆U for all U N v . The sumsα ¯ (S, T ) for the cases v / T and v T ⊆ \{ } v ∈ ∈ are then obtained as α ((S v ) T ) and α (S v ) α ((S v ) T ), v \{ } \ v \{ } − v \{ } \ respectively. Recall from Section 2.3.2 that the zeta transform of βv can be n computed, given βv, by the fast zeta transform algorithm in O(n2 ) time and O(2n) space. After computing the probability P r(s;t, D) we get the posterior prob- ability P r(s;t D) by normalizing, that is, | P r(s;t D) = P r(s;t, D)/P r(D). | The probability P r(D) can be computed using the recurrence (2.7) in Sec- tion 2.3.2 with a trivial indicator function. The following theorem summarizes the time and space bounds of the algorithm.

Theorem 6.3 Given the local scores P r(D D ,A ) for all v and A v| Av v v ⊆ N v , the ancestor relation problem for any pair s ; t, s, t N can \{ } ∈ be solved in O(n3n) time and O(3n) space. Further, the ancestor relation problem for all n(n 1) pairs s;t, s, t N can be solved in O(n23n) time − ∈ and O(3n) space.

Proof. Let us consider the computation of the posterior probability of one ancestor relation. The precomputation of the inner sum requires n zeta 6.2 Empirical Results 99 transforms and thus takes O(n22n) time and O(n2n) space. The value n n i n of gs(S, T ) is computed and stored for i=0 i 2 = 3 set pairs (S, T ); the identity holds by the binomial theorem, see the equation (A.1) in Ap- pendix A. Given the precomputations, each (S, T ) requires O(n) time. Finally, the posterior probability for s;t is obtained by summing the val- ues g (N,T ) over O(2n) sets t T N. Thus the total time requirement s { } ⊆ ⊆ is O(n3n) and the space requirement O(3n). The posterior probabilities for a fixed source node s and all different target nodes t can be computed as above with the exception that the last step is repeated n 1 times. Thus the time and space requirement stay at − O(n3n) and O(3n), respectively. For computing the posterior probabilities for all pairs, it suffices to repeat the previous procedure for each possible source node s, that is, n times. The time requirement is, thus, O(n23n) and the space requirement O(3n).

6.2 Empirical Results

Now we study how learning ancestor relations performs in practice. Our approach is to generate data from a Bayesian network, called the ground truth, and compare the learned arcs and ancestor relations to the ground truth. Obviously, the learning performance is not expected to be perfect: when there are unobserved nodes at work, we easily learn arcs that are not present in the ground truth; this happens especially when an unobserved node is a common parent of two nodes that are not connected by an arc; namely, the two nodes are marginally dependent, and thus, in absence of the common parent, it is likely that we learn an arc between them, a false positive. On the other hand, we may expect that much of the structure can be learned even in the presence of unobserved nodes. For example, if an unobserved node has exactly one child and one parent in the ground truth, both observed, then it is likely that the two arcs through the unobserved node in the middle will be just contracted to a single arc, which encodes a correct ancestor relation. We call the graph obtained from the ground truth by such contractions—that is, by connecting each parent of an unobserved node to every child of the node—the shrunken ground truth. The shrunken ground truth captures the ancestor relations of the ground truth: if a node s is an ancestor of a node t in the ground truth it is an ancestor of t in the shrunken ground truth as well. Likewise, if s is not an ancestor of t in the ground truth, it will not be an ancestor of t in the shrunken ground truth, either. Thus, the shrunken ground truth will serve as the representation of 100 6 Learning Ancestor Relations the true ancestor relations of the underlying Bayesian network. The algorithm of Section 6.1 for Bayesian learning of ancestor rela- tions was implemented in Matlab. The experiments discussed next were conducted using the well-known BDeu (Bayesian Dirichlet equivalence uni- form) score [13, 47] with the equivalent sample size of 1, a uniform prior over linear orders on the nodes, and a uniform prior over parent sets of size at most a user-defined bound, which was set to 6. To illustrate the challenges that are faced while learning ancestor rela- tions, a case-study is presented in the following section. Then, the statis- tical efficiency of learning ancestor relations is investigated by conducting a simulation study in Section 6.2.2. Finally, results on real life data are presented in Section 6.2.3.

6.2.1 Challenges of Learning Ancestor Relations It is instructive to examine some representative challenges that we face when learning ancestor relations and arcs. We consider a Bayesian network whose DAG is shown in Figure 6.3(a). All 14 variables are binary. The parameters of the network, that is, the probability of a node taking the value 1 given a particular value combination of its parents was drawn uniformly at random from the range [0, 1] for each node and value combination of its parents. We generated 10, 000 samples from the Bayesian network and learned ancestor relations from the data. Note that there are 16 arcs and 39 ancestor–descendant pairs in the ground truth. The DAG has quite a large Markov equivalence class, 140 graphs in total, and so one cannot expect to deduce ancestor relations from a single MAP DAG reliably. For clarity of presentation, the findings are discussed mainly in terms of arcs instead of ancestor relations. Figure 6.3(b) shows arcs that are assigned a posterior probability of 0.5 or larger. Suppose we claim every arc or ancestor relation with probability 0.5 or larger to be present. Then, in total there are 12 true positive arcs, 4 false positive arcs, 20 true positive ancestor relations, and 4 false positive ancestor relations. Inspection reveals that the ancestor relation errors are due to a few flipped arcs. For example, in the ground truth there is a path from node 1 to eight different nodes. Thus, flipping the arc from 1 to 2 causes one false positive and eight false negative ancestor relations. While arc errors are rather independent, one flipped arc can lead to numerous ancestor relation errors, as seen earlier. It should also be noted that arc flips that are prone to cause a larger number of ancestor relation errors are also more probable, namely, an arc is easily flipped when it does not break or create any v-structure, which is typically the case when one of the nodes is a source node in the ground truth. 6.2 Empirical Results 101

(a) (b)

(c) (d) (e)

Figure 6.3: Graphs. (a) The ground truth, from which 10, 000 samples were generated. (b) Arcs with posterior probability at least 0.5. The arrow- heads >, £, and ◮ indicate that the probability is in the interval (0.5, 0.8], (0.8, 0.99], or (0.99, 1], respectively. (c) The shrunken ground truth when nodes 1, 4, 7, and 11 are not observed. (d) Arcs with posterior probability at least 0.5 when nodes 1, 4, 7, and 11 are not observed. (e) A partially directed graph learned using the FCI algorithm when nodes 1, 4, 7, and 11 are not observed. 102 6 Learning Ancestor Relations

The presence of unobserved nodes leads to claiming arcs between nodes that are only marginally dependent. Discarding nodes 1, 4, 7, and 11 leads to the shrunken ground truth shown in Figure 6.3(c). In Figure 6.3(d) we see a DAG constructed from the arcs with probability 0.5 or larger. Node 1 does not have children, so its disappearance should not affect the structure among the rest of the nodes. However, the removal of nodes 4, 7, and 11 affects the rest of the nodes: For instance, node 11 is a common cause of nodes 13 and 14, and so an arc appears between nodes 13 and 14. Also, nodes 5 and 6, which are parents of node 7 in the ground truth, have become parents of node 10, a child of node 7 in the ground truth. Similarly, the removal of node 4 also leads to appearance of some direct arcs from its parents to its children. After discarding the unobserved nodes, the shrunken ground truth contains 14 arcs and 18 ancestor relations. The algorithm finds 8 true positive arcs, 6 false positive arcs, 11 true positive ancestor relations, and 7 false positive ancestor relations. This suggests that ancestor relations can sometimes be learned as well as individual arcs. For comparison, also a partial ancestral graph (PAG) [87] was learned from the data with unobserved nodes using the fast causal inference (FCI) algorithm [100], which is designed for causal discovery with unobserved variables. The output PAG is shown in Figure 6.3(e). An arc marked with two arrowheads indicates that the algorithm claims the two nodes have a common (unobserved) cause; the symbol is a wildcard, indicating ◦ that there can be an arrowhead or there is no arrowhead. The results are generally in good agreement with the ground truth. The FCI algorithm is able to detect the unobserved parent of nodes 12 and 13. However, it is not sure whether there is an unobserved parent between nodes 13 and 14, and it is unable to detect the unobserved parent between nodes 12 and 14. It also finds an unobserved parent between nodes 8 and 9, which is not in agreement with the ground truth. As the wildcards assigned by the FCI algorithm do not quantify the uncertainty about the associated arcs, but the algorithm is ignorant regarding some ancestor relations, the algorithm may loose statistical power in detecting such relations; this issue will be further examined and discussed in the next section.

6.2.2 A Simulation Study Synthetic data consisting of one hundred Bayesian networks on 14 binary nodes and maximum indegree 4, each with 10, 000 data points were obtained as follows.

1. Draw a linear order L on the node set 1, 2,..., 14 uniformly at { } 6.2 Empirical Results 103

random (u.a.r.).

2. For each node v independently:

(a) let dv be the number of predecessors of v in L;

(b) draw the number of parents of v, denoted as nv, from 0, 1,..., min 4, d u.a.r.; { { v}} (c) draw the nv parents of v from the predecessors of v in L u.a.r.; (d) for each value configuration of the parents: draw the probability of a sample getting the value 1 from the uniform distribution on range [0, 1].

3. Draw 10, 000 samples independently from the Bayesian network.

From each data set 24 subsets were generated by discarding ℓ = 0, 2, 4, 6, 8, 10 randomly picked nodes and the associated data, and by in- cluding the first m = 100, 500, 2000, 10000 data points. Bayesian averaging was applied to each data set and the performance of learning arcs and ancestor relations is summarized by ROC curves in Figure 6.4. The ROC curve is obtained by setting a threshold for the posterior probability (of arcs or ancestor relations), and every time the posterior probability exceeds the threshold, we claim the respective arc or ancestor relation is present. Comparing these claims to the arc and ancestor relations that actually hold in the (shrunken) ground truth, we obtain true positives (TP) and false positives (FP) rates. By varying the threshold the pairs of these rates form a ROC curve, which shows the learning power (TP rate) as a function of the FP rate. The running times of the algorithm for 10, 12, and 14 observed nodes were roughly 3 minutes, 40 minutes, and 8 hours, respectively. As expected, the more data one has, the easier it is to learn both an- cestor relations and arcs. Likewise, the task becomes harder as the number of unobserved nodes grows. Note that Koivisto [62] has studied learning undirected edges with no unobserved nodes with data generated by a pro- cedure that is similar to the one presented here. The results presented in Figure 6.4 for undirected graphs with no unobserved nodes are in good agreement with Koivisto’s results. The results (Figure 6.4) also suggest that the power of learning directed and undirected arcs is about the same, however, the power of learning ancestor relations being slightly smaller. The Bayesian averaging approach was then compared to deducing struc- tural features from a single MAP DAG. Two ways to pick a MAP DAG were considered: an optimistic and a random approach. In the optimistic 104 6 Learning Ancestor Relations

Ancestor Relations Directed Arcs Undirected Arcs 1 1 1

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4 14/14 obs. 12/14 obs. 10/14 obs. 0.2 0.2 0.2 8/14 obs. 6/14 obs. 4/14 obs. 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 (a) Ancestor Relations Directed Arcs Undirected Arcs 1 1 1

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4 14/14 obs. 12/14 obs. 10/14 obs. 0.2 0.2 0.2 8/14 obs. 6/14 obs. 4/14 obs. 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 (b) Ancestor Relations Directed Arcs Undirected Arcs 1 1 1

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4 14/14 obs. 12/14 obs. 10/14 obs. 0.2 0.2 0.2 8/14 obs. 6/14 obs. 4/14 obs. 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 (c) Ancestor Relations Directed Arcs Undirected Arcs 1 1 1

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4 14/14 obs. 12/14 obs. 10/14 obs. 0.2 0.2 0.2 8/14 obs. 6/14 obs. 4/14 obs. 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 (d)

Figure 6.4: ROC curves. The data contain (a) 100, (b) 500, (c) 2000 or (d) 10, 000 samples over 14 nodes. The straight red line is the curve obtained by random guessing. The data-generating graphs contained on average 23.7 arcs and the shrunken ground truths on average 19.7, 15.9, 11.1, 7.0, and 3.3 arcs for 12, 10, 8, 6, and 4 observed nodes, respectively. 6.2 Empirical Results 105 approach, one chooses a member of the Markov equivalence class of a MAP DAG that yields the largest true positives rate, and uses its true and false positives rates. This approach is arguably unrealistic in practice but serves as an upper bound for any approach based on a single MAP DAG. In the random approach, the true and false positives rates over all DAGs in the Markov equivalence class of a MAP DAG are averaged; the averaged rates correspond to the respective expectations if one picks such a DAG at random. The true and false positives rates for these two approached are shown in Tables 6.1 and 6.2; column “diff.” shows the difference be- tween the true positives rates of the MAP DAG approach and the Bayesian averaging approach (the false positives rate being matched, of course); a negative value indicates that the Bayesian averaging approach is more pow- erful. Figures 6.5 and 6.6 illustrate the differences between the true positive rates.

Optimal MAP DAG Random MAP DAG

100 0.3 500 0.2 2000 Samples 0.1 10000 0 0 2 4 6 8 10 0 2 4 6 8 10 Unobserved variables Unobserved variables −0.1

Arcs > 0.5 FCI −0.2

100 −0.3

500 −0.4

2000 −0.5 Samples 10000 −0.6

0 2 4 6 8 10 0 2 4 6 8 10 Unobserved variables Unobserved variables

Figure 6.5: The differences in true positive rates between various methods and full Bayesian averaging for ancestor relations. The negative values imply that full Bayesian averaging is more powerful.

The results suggest that the random MAP DAG approach performs sig- nificantly worse than Bayesian averaging. On the other hand, the optimistic MAP DAG approach sometimes performs better than Bayesian averaging, especially when the data are abundant and there are many unobserved 106 6 Learning Ancestor Relations FCI TP FP diff. 0.002 0.0000.001 0.000 0.002 0.002 -0.001 0.0000.002 0.000 0.002 0.003 0.000 0.002 0.000 0.000 0.003 0.014 0.002 0.000 0.014 -0.218 0.0020.010 -0.143 0.0010.011 -0.073 0.0000.014 -0.028 0.0000.005 0.000 0.014 0.048 0.004 0.005 0.047 -0.482 0.0050.041 -0.329 0.0070.037 -0.217 0.0040.020 -0.090 0.0020.005 -0.033 0.0000.129 0.011 0.005 0.121 -0.660 0.0100.100 -0.410 0.0150.086 -0.326 0.0100.037 -0.199 0.0060.014 -0.125 0.002 -0.025 0.5 > Arcs TP FP diff. 0.20 0.040.17 -0.01 0.030.16 -0.01 0.030.15 -0.01 0.030.12 -0.00 0.030.12 0.02 0.00 0.58 0.04 0.01 0.50 0.05 0.01 0.42 0.06 0.01 0.35 0.06 0.02 0.27 -0.01 0.060.27 0.06 0.00 0.78 -0.02 0.050.69 0.07 0.01 0.60 0.09 0.02 0.51 0.09 0.02 0.40 0.10 0.01 0.41 -0.00 0.090.87 0.02 0.07 0.79 0.07 0.00 0.70 -0.00 0.090.60 0.13 0.01 0.54 0.14 0.00 0.54 0.14 0.02 -0.03 TP FP diff. Rand. MAP DAG 0.37 0.210.30 -0.14 0.180.27 -0.14 0.140.24 -0.11 0.120.23 -0.08 0.090.17 -0.01 0.090.61 -0.06 0.130.54 -0.12 0.140.45 -0.13 0.140.40 -0.12 0.140.33 -0.11 0.120.32 -0.10 0.120.78 -0.10 0.080.70 -0.08 0.120.60 -0.09 0.150.54 -0.11 0.160.45 -0.10 0.170.44 -0.09 0.180.86 -0.18 0.060.79 -0.06 0.110.70 -0.08 0.150.62 -0.11 0.180.57 -0.10 0.200.57 -0.09 0.21 -0.11 Table 6.1: Comparison of TP and FP rates for ancestor relations Opt. MAP DAG TP FP diff. 0.41 0.190.35 -0.09 0.160.34 -0.07 0.110.34 0.08 0.01 0.36 0.05 0.07 0.31 0.04 0.19 0.65 0.10 0.16 0.61 -0.04 0.110.53 -0.02 0.110.50 0.10 0.01 0.48 0.07 0.06 0.51 0.05 0.19 0.84 0.06 0.26 0.76 0.10 0.02 0.67 0.12 0.02 0.64 0.12 0.01 0.59 0.12 0.06 0.69 0.07 0.14 0.93 0.02 0.36 0.86 0.08 0.07 0.80 0.11 0.06 0.73 0.13 0.07 0.72 0.14 0.11 0.84 0.09 0.19 0.38 m ℓ 100100 0 100 2 100 4 100 6 100 8 500 10 500 0 500 2 500 4 500 6 500 8 10 20002000 0 2000 2 2000 4 2000 6 2000 8 10 10 00010 000 0 10 000 2 10 000 4 10 000 6 10 000 8 10 6.2 Empirical Results 107 FCI TP FP diff. 0.003 0.0000.002 0.000 0.003 0.003 -0.001 0.0000.002 0.000 0.003 0.003 0.000 0.002 0.000 0.000 0.003 0.023 0.001 0.000 0.020 -0.273 0.0020.014 -0.153 0.0010.012 -0.076 0.0000.016 -0.026 0.0000.005 0.000 0.016 0.075 0.003 0.005 0.067 -0.511 0.0030.054 -0.319 0.0050.042 -0.206 0.0040.023 -0.088 0.0020.005 -0.029 0.0000.173 0.006 0.005 0.146 -0.672 0.0060.111 -0.395 0.0110.090 -0.304 0.0090.040 -0.190 0.0060.015 -0.118 0.002 -0.027 0.5 > Arcs TP FP diff. 0.26 0.030.22 0.00 0.020.18 0.00 0.020.16 0.00 0.030.13 0.00 0.030.14 0.00 0.030.62 0.00 0.020.52 0.00 0.030.42 0.00 0.030.35 0.00 0.040.28 0.00 0.050.30 0.00 0.070.81 0.00 0.020.69 0.00 0.030.59 0.00 0.040.50 0.00 0.060.41 0.00 0.080.44 0.00 0.100.89 0.00 0.010.80 0.00 0.030.70 0.00 0.050.61 0.00 0.080.54 0.00 0.110.58 0.00 0.14 0.00 TP FP diff. Rand. MAP DAG 0.31 0.060.26 -0.10 0.060.23 -0.11 0.060.21 -0.08 0.060.20 -0.06 0.060.18 -0.02 0.070.60 -0.05 0.030.50 -0.09 0.040.41 -0.09 0.050.34 -0.09 0.060.30 -0.08 0.070.32 -0.04 0.080.78 -0.08 0.020.66 -0.06 0.040.55 -0.06 0.060.47 -0.08 0.070.40 -0.07 0.100.43 -0.05 0.120.87 -0.09 0.020.77 -0.04 0.040.67 -0.05 0.060.58 -0.06 0.090.52 -0.06 0.120.56 -0.04 0.15 -0.09 Table 6.2: Comparison of TP and FP rates for arcs Opt. MAP DAG TP FP diff. 
0.34 0.050.30 -0.05 0.060.28 -0.04 0.050.28 0.04 0.01 0.30 0.03 0.05 0.31 0.03 0.15 0.64 0.03 0.15 0.55 -0.03 0.030.46 0.04 0.00 0.42 0.04 0.02 0.42 0.04 0.06 0.49 0.04 0.18 0.83 0.02 0.28 0.71 0.03 0.03 0.60 0.05 0.02 0.55 0.05 0.00 0.51 0.07 0.05 0.65 0.06 0.16 0.93 0.01 0.33 0.83 0.03 0.08 0.75 0.05 0.07 0.67 0.07 0.08 0.64 0.09 0.09 0.79 0.08 0.15 0.34 m ℓ 100 0 100100 2 100 4 100 6 100 8 500 10 500 0 500 2 500 4 500 6 500 8 10 20002000 0 2000 2 2000 4 2000 6 2000 8 10 10 00010 000 0 10 000 2 10 000 4 10 000 6 10 000 8 10 108 6 Learning Ancestor Relations

Optimal MAP DAG Random MAP DAG

100 0.3 500 0.2 2000 Samples 0.1 10000 0 0 2 4 6 8 10 0 2 4 6 8 10 Unobserved variables Unobserved variables −0.1

Arcs > 0.5 FCI −0.2

100 −0.3

500 −0.4

2000 −0.5 Samples 10000 −0.6

0 2 4 6 8 10 0 2 4 6 8 10 Unobserved variables Unobserved variables

Figure 6.6: The differences in true positive rates between various methods and full Bayesian averaging for arcs. The negative values imply that full Bayesian averaging is more powerful.

nodes. Furthermore, Bayesian averaging was compared to deducing ancestor relations from the arc probabilities. For this purpose, a graph was con- structed to consist of the arcs whose posterior probability was larger than 0.5, that is, the arc is more likely to be present than absent, and the ances- tor relations were deduced from this graph. The results (Tables 6.1 and 6.2, Figures 6.5 and 6.6) show that the performance of deducing ancestor rela- tions from arcs does not differ significantly from learning ancestor relations directly. The aforementioned approach was further compared to direct learning of ancestor relations. To this end, it was assumed that only the ancestor relations whose probability is more than 0.5 exist. The ancestor relation predictions for deducing the ancestor relations from arcs and the direct computation of ancestor relations were cross-tabulated; see Table 6.3. Ta- ble 6.3 shows the average number of the node pairs for which either both methods, only deducing from arc probabilities, only the direct computation of ancestor relation probabilities or neither method claims an ancestor rela- tion to be present. Table 6.3 also shows the probability that the claim made 6.2 Empirical Results 109 by the direct computation is correct; “.” denotes that no claims falling into the particular category were made. Most of the time, both methods make the same predictions. Whenever the predictions differ, the prediction by direct computation is usually slightly more probable to be correct. Also notice that the two methods follow each other closely with larger datasets.

Table 6.3: Ancestor Relations predicted by arcs and direct computation

Pred. Ancestor Relations Corr. Pred. by Dir. Comp. m ℓ both arcs direct none both arcs direct none 100 0 131. .6 1 1.8 165.5 010.6 . 64 0.50 0.79 100 2 80.3 4 0.9 122.4 030.6 . 59 0.61 0.79 100 4 50.3 3 0.5 84.0 030.6 . 52 0.59 0.78 100 6 30.1 2 0.2 52.5 020.6 . 56 0.57 0.78 100 8 10.4 1 0.0 28.4 090.5 . 25 1.00 0.77 100 10 0.6 0 0.0 11.4 08.6 . 1.00 0.77 500 0 300. .5 5 1.3 149.7 020.8 . 48 0.53 0.88 500 2 200. .7 4 0.6 110.2 060.7 . 77 0.48 0.86 500 4 120. .7 5 0.6 76.3 000.7 . 65 0.35 0.84 500 6 70.0 2 0.3 48.5 060.6 . 81 0.63 0.82 500 8 30.3 1 0.1 26.5 010.6 . 56 0.45 0.79 500 10 10.3 0 0.0 10.6 000.6 . 33 . 0.79 2000 0 390. .7 2 0.4 141.8 050.8 . 82 0.39 0.93 2000 2 280. .4 2 0.4 103.0 060.7 . 55 0.61 0.91 2000 4 180. .6 2 0.4 70.8 090.6 . 47 0.30 0.88 2000 6 100. .6 2 0.3 44.9 040.6 . 70 0.47 0.86 2000 8 50.2 1 0.1 24.6 080.5 . 89 0.54 0.82 2000 10 20.0 1 0.1 9.9 010.6 . 25 0.60 0.82 10000 0 400. .9 1 0.4 140.7 020.9 . 60 0.61 0.96 10000 2 310. .2 2 0.3 100.3 090.7 . 78 0.32 0.94 10000 4 210. .6 1 0.2 68.0 010.7 . 60 0.36 0.91 10000 6 130. .3 2 0.2 42.3 000.6 . 68 0.53 0.88 10000 8 70.2 0 0.1 22.7 070.5 . 75 0.44 0.85 10000 10 20.8 0 0.0 9.1 080.5 . 67 0.00 0.84

Bayesian averaging was also compared to the fast causal inference (FCI) method [100]; see Tables 6.1 and 6.2. It is quite challenging to make a fair comparison because FCI outputs a partial ancestral graph (PAG) that can- not be directly compared to a DAG. It was decided to ignore the wildcard arcs and claim only arcs and ancestor relations that FCI is sure about; this 110 6 Learning Ancestor Relations follows the approach of Spirtes et al. [101]. The results (Tables 6.1 and 6.2, Figures 6.5 and 6.6) show that FCI is very conservative: it does not make many mistakes but it often answers “don’t know”. This results in a relatively low statistical power of discovering arcs and ancestor relations, sometimes significantly lower than that of the Bayesian averaging approach (at matched FP rates). One should notice, though, that FCI can discover unobserved nodes with some success. However, usually the unobserved nodes that FCI “finds,” do not seem to match the ones in the ground truth. For exam- ple, when the sample size is 2000 and there are no unobserved nodes, FCI finds on average 6.0 unobserved nodes. And when there are two unob- served nodes, only 11% of the “found” 4.8 unobserved nodes match the ground truth. In general, as the number of unobserved nodes increases, the number of found unobserved nodes decreases, but the percentage of correctly detected unobserved nodes increases; for example, when there are 8 unobserved nodes 54% of the claimed 0.7 unobserved nodes are correct.

6.2.3 Real Life Data The Bayesian averaging algorithm was tested on two real-life datasets found from the UCI machine learning repository [37]: Adult (15 variables, 32,561 samples) and Housing (14 variables, 506 samples). All continuous vari- ables were discretized to binary variables using the median as the cutpoint. Furthermore, in the Adult dataset the variable “native-country”, which had 40 distinct values, was transformed to a binary variable where 0 corre- sponded to value “USA” and 1 to all other values. For both datasets, the maximum indegree was set to be 4. Figures 6.7 and 6.8 show the graphs deduced from the arc probabilities. Here, one does not know the ground truth and thus one has to resort to other comparisons. The goal was to investigate whether learning ancestor relations uncovers some information that cannot be obtained simply by analyzing the arc probabilities. To this end, ancestor relations were deduced both from the ancestor relations probabilities and the arc probabilities. For Adult, there are 210 potential ancestor relations. Both methods imply the presence of the same 79 ancestor relations. For Housing the methods are in almost as good agreement as for the Adult. For 71 ordered pairs, both methods claim that an ancestor relation is present and for 110 pairs that an ancestor relation is not present. There is, however, one node pair, ZN and RAD, for which deducing from arcs suggests that there is no ancestor relation while deducing from ancestor probabilities claims the opposite. This discrepancy is, however, due to the arbitrariness of the threshold: 6.2 Empirical Results 111


Figure 6.7: The Adult data. Arcs with posterior probability at least 0.5. The arrowheads >, £, and ◮ indicate that the probability is in the interval (0.5, 0.8], (0.8, 0.99], or (0.99, 1], respectively. INC = income, AGE = age, WORK = work class, FNL = fnlwgt, ED = education, ED-N = education-num, MAR = marital status, OCC = occupation, REL = relationship, RACE = race, SEX = sex, CAP-G = capital gain, CAP-L = capital loss, HPW = hours per week, NAT = native country.

This discrepancy is, however, due to the arbitrariness of the threshold: the posterior probability of an arc from ZN to RAD was 0.49, while the probability of an ancestor relation from ZN to RAD was 0.53. The experiments on real data suggest that exact learning of ancestor relations does not provide significant added value compared to deducing ancestor relations from arc probabilities.
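The two deduction rules compared above can be sketched as follows (an illustration with assumed matrix encodings, not code from the thesis): thresholding the exact ancestor-relation posteriors directly, versus thresholding the arc posteriors and taking the transitive closure of the resulting graph.

import numpy as np

def ancestors_from_arcs(arc_prob: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Threshold the arc posteriors and take the transitive closure of the result."""
    n = arc_prob.shape[0]
    reach = arc_prob > threshold
    for k in range(n):  # Floyd-Warshall style Boolean transitive closure
        reach = reach | (reach[:, [k]] & reach[[k], :])
    return reach

def ancestors_from_posteriors(anc_prob: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Threshold the exact ancestor-relation posteriors directly."""
    return anc_prob > threshold

# Disagreements such as the ZN/RAD case correspond to entries where the two
# Boolean matrices differ:
# disagreement = ancestors_from_arcs(arc_prob) != ancestors_from_posteriors(anc_prob)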


Figure 6.8: The Housing data. Arcs with posterior probability at least 0.5. The arrowheads >, £, and ◮ indicate that the probability is in the interval (0.5, 0.8], (0.8, 0.99], or (0.99, 1], respectively. CRIM = per capita crime rate by town, ZN = proportion of residential land zoned for lots over 25,000 sq.ft., INDUS = proportion of non-retail business acres per town, CHAS = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), NOX = nitric oxides concentration (parts per 10 million), RM = average number of rooms per dwelling, AGE = proportion of owner-occupied units built prior to 1940, DIS = weighted distances to five Boston employment centres, RAD = index of accessibility to radial highways, TAX = full-value property-tax rate per $10,000, PTRATIO = pupil-teacher ratio by town, B = 1000(Bk − 0.63)², where Bk is the proportion of blacks by town, LSTAT = lower status of the population, MEDV = median value of owner-occupied homes in $1000's.

Chapter 7

Discussion

This thesis concerned two areas of structure discovery in Bayesian networks: space–time tradeoffs and learning ancestor relations. For space–time tradeoffs we introduced a variety of algorithmic schemes which yield varying time and space requirements for solving permutation problems in general and the Osd and Fp problems in particular. A summary of the time and space bounds of the schemes presented in this thesis is shown in Figure 7.1. The main contribution, the partial order approach, applies to the space range between polynomial space¹ and O∗(2^n). In particular, it applies to the space range that is most interesting in practical implementations, namely the space requirement between O∗(2^{n/2}) and O∗(2^n).

To analyze the efficiency of the presented schemes, we introduced the time–space product. Time–space products less than 4 are considered good, as they imply that whenever the space requirement is halved the time requirement less than doubles. We have shown that when the available space is at least O∗(2^{n/2}), one can always (in theory) use schemes with a time–space product less than 4. Although we did not present any scheme that works with space less than O∗(2^{n/2}) and has a time–space product less than 4, it seems that this can be achieved straightforwardly by combining the divide-and-conquer scheme with partial order schemes. This is, however, not particularly interesting because such schemes are infeasible in practice due to their large time requirements.

The motivation for the space–time tradeoffs comes from practical needs for solving large problem instances while guaranteeing that an optimum will be found. The presented algorithmic schemes address these issues in two ways.

¹ Polynomial space is achieved with a POF that consists of the reorderings of a linear order.


The plot shows the space exponent S (Space = S^n) on the horizontal axis against the time exponent T (Time = T^n) on the vertical axis, together with the curve TS = 4. The plotted schemes are Dynamic Programming, Divide and Conquer, 1 ∗ 1, 13 ∗ 13, m ∗ m for 1 ≤ m ≤ 12, m ∗ m for 14 ≤ m ≤ n/2, and m ∗ (n − m).

Figure 7.1: Comparison of time and space requirements of algorithmic schemes.
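To illustrate how the curves in Figure 7.1 relate to the time–space product, the following small sketch (with made-up scheme parameters, not values from the thesis) compares hypothetical schemes with time O∗(T^n) and space O∗(S^n) against plain dynamic programming: when T·S < 4, the per-element slowdown base T/2 is smaller than the per-element space-saving base 2/S, so each halving of space costs less than a doubling of time.

# Illustrative sketch only; the scheme parameters below are hypothetical.
def time_space_report(name, T, S):
    product = T * S
    space_saving = 2.0 / S   # per-element space saving vs. plain DP (T = S = 2)
    slowdown = T / 2.0       # per-element slowdown vs. plain DP
    print(f"{name}: T*S = {product:.2f}, "
          f"space-saving base {space_saving:.2f}, slowdown base {slowdown:.2f}")

time_space_report("plain dynamic programming", T=2.0, S=2.0)
time_space_report("hypothetical scheme A", T=2.4, S=1.5)
time_space_report("hypothetical scheme B", T=3.0, S=1.2)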

First, they reduce space, which is often the bottleneck, and thus enable solving larger problem instances. Second, they can be easily parallelized, which can be a serious advantage as large computer clusters with thousands of cores are becoming more common.

The results on space–time tradeoffs seem promising. However, this is not the end of the story, as there are interesting open questions. For example, the partial order family that is optimal with respect to the time–space product is not known. In Conjecture 5.17 we proposed that the 13 ∗ 13 scheme is optimal; however, proving or disproving this conjecture seems to be a cumbersome task.

It is worth noting that sometimes the phenomenon to be modeled has an inherent partial order and the modeler knows this partial order (or a part of it). This kind of precedence order is often known when dealing with, for example, causal discovery or time series. In such a case one needs to run the dynamic programming algorithm only for that particular partial order, thus saving both space and time, possibly significantly. In other words, the partial order approach generalizes the basic dynamic programming algorithm to fully exploit given precedence constraints, if any.

The previous observation concerning the partial order approach motivates the following research question about structure discovery in Bayesian networks. Sometimes it may be desired that the DAG obeys some structural constraint. For example, one may want the DAG to be compatible with some partial order, as mentioned above. On the other hand, one may want to learn a sparse graph and thus limit the number of arcs. Further, one may want to guarantee that inference in the Bayesian network is fast in the worst case, and thus bound the treewidth of the DAG by some constant². The question is how the addition of constraints affects the running time. As mentioned above, any constraint on the order of the nodes makes the problem easier to solve. However, in general adding more constraints does not necessarily make a problem easier to solve. For example, if one wants to learn a sparse graph, that is, a graph with at most s arcs for some fixed s, one can modify the dynamic programming algorithm to keep track of the best network for each possible arc count. This increases both the time and the space requirement by a factor polynomial in n. On the other hand, if one restricts the search space to the DAGs that have treewidth³ at most t for some constant t, the effect on the running time remains unclear; current algorithms for this problem are heuristics [30]. As both Treewidth and Osd are permutation problems, one might hope to bundle these two problems together and solve them by dynamic programming. However, straightforward combinations seem to fail, and it is an open question whether this variant can be solved in, say, O∗(2^n) time.
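To make the arc-count bookkeeping mentioned above concrete, here is a minimal sketch (an illustration, not the thesis implementation) of sink-based dynamic programming over node subsets extended with an arc-count dimension; local_score(v, parents) is an assumed black-box local score and the maximum indegree is bounded by a small constant.

from itertools import combinations

def best_score_with_arc_budget(n, local_score, max_arcs, max_indegree=2):
    """Sketch: best DAG score for every exact arc count 0..max_arcs.
    Runs over all 2^n node subsets, so n must be small."""
    NEG = float("-inf")
    # best[S][k] = best score of a DAG on the node set S that uses exactly k arcs
    best = [[NEG] * (max_arcs + 1) for _ in range(1 << n)]
    best[0][0] = 0.0
    for S in range(1, 1 << n):
        members = [v for v in range(n) if S >> v & 1]
        for v in members:                       # v acts as a sink of the sub-DAG on S
            rest = S & ~(1 << v)
            candidates = [u for u in members if u != v]
            for d in range(min(max_indegree, len(candidates)) + 1):
                for parents in combinations(candidates, d):
                    s = local_score(v, parents)
                    for k in range(d, max_arcs + 1):
                        prev = best[rest][k - d]
                        if prev > NEG and prev + s > best[S][k]:
                            best[S][k] = prev + s
    return best[(1 << n) - 1]

Taking the maximum of the returned array over k ≤ s then gives the best score among DAGs with at most s arcs; the extra arc-count index k is what causes the polynomial-in-n overhead mentioned above.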

Recall that solving the Osd, Fp, and ancestor relation problems requires considering a superexponential number of DAGs. However, modularity assumptions allow us to solve these problems significantly faster. Thus, this thesis serves as an example of how hard problems can be made easier (but not necessarily easy) by exploiting the modularity of the problem. In the Osd and Fp problems, we made several modularity and independence assumptions, which enabled us to formulate the problems as permutation problems and thus allowed us to use dynamic programming over only the 2^n subsets of the elements. However, merely assuming that something is modular is not enough: one needs to assume the right kind of modularity. To see this, consider the structure prior in the Osd and Fp problems. In Osd we assumed a modular structure prior. By contrast, in Fp we assumed an order-modular structure prior. However, if the modularity assumptions are reversed, that is, one assumes an order-modular prior for Osd and a modular prior for Fp, both problems seem to become more difficult: the fastest known algorithm for Fp with a modular prior runs in O(3^n) time [105], and solving Osd with an order-modular prior seems to be a complicated task. In the ancestor relation problem we made the same assumptions as in Fp. However, ancestor relations are not modular features, and thus the ancestor relation problem seems inherently harder than Fp; the Bayesian averaging algorithm for the ancestor relation problem runs in O(3^n) time. This further exemplifies how modularity enables faster computation.

² Exact inference in Bayesian networks with unbounded treewidth is not possible in polynomial time in the worst case, unless the exponential time hypothesis (ETH) fails [67]. Also, the time requirement of some common inference algorithms, such as variable elimination, is lower bounded by a factor exponential in the treewidth.
³ The treewidth of a DAG equals the treewidth of its moralized graph. The moralized graph of a DAG A is obtained by taking the skeleton of A and adding an edge (if such an edge does not already exist) between nodes that have a common child in A.
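The moralization used in footnote 3 can be sketched as follows (an illustration, not code from the thesis): take the skeleton of the DAG and add an edge between every pair of nodes that share a child.

def moralize(nodes, arcs):
    """Sketch: moralized (undirected) graph of a DAG given as a list of arcs (u, v).
    Edges are returned as frozensets {u, v}; node labels are assumed comparable."""
    edges = {frozenset((u, v)) for u, v in arcs}          # skeleton of the DAG
    parents = {v: set() for v in nodes}
    for u, v in arcs:
        parents[v].add(u)
    for v in nodes:                                       # marry parents of a common child
        ps = sorted(parents[v])
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                edges.add(frozenset((ps[i], ps[j])))
    return edges

# Example: in the DAG a -> c <- b, moralization adds the edge {a, b}.
# print(moralize(["a", "b", "c"], [("a", "c"), ("b", "c")]))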

The study of learning ancestor relations was motivated by the need for structural features that are preserved when there are unobserved nodes in play. Here, we assumed that the arcs of a DAG represent the direct causal relations between nodes. We developed an algorithm for computing posterior probabilities of ancestor relations. As far as we know, this is the first non-trivial exact algorithm to compute posterior probabilities of nonmodular features. Our experiments show that the Bayesian model averaging approach outperforms the obvious rivals: deducing the ancestor relations from a MAP DAG, and the constraint-based FCI algorithm [101]. The Bayesian averaging algorithm has, however, a big disadvantage compared to its competitors: time consumption. The O(3^n) time requirement makes the Bayesian averaging algorithm infeasible for data sets with more than 20 variables and therefore limits the practical usefulness of the algorithm. Luckily, partial Bayesian averaging, that is, first inferring arcs based on their marginal posterior probabilities and then deducing ancestor relations from the so-constructed graph, seems to perform almost as well as full Bayesian averaging. Therefore, one can resort to partial Bayesian averaging whenever full Bayesian averaging takes too much time. This finding suggests that the arc probabilities convey almost all the information relevant for learning ancestor relations. Although some may say that the competitiveness of partial Bayesian averaging undermines the usefulness of the comparatively slow full Bayesian averaging algorithm, it should be noted that the insight about partial Bayesian averaging was achieved because we were able to compute the exact ancestor relation probabilities.

An obvious target for improvement in learning ancestor relations is to decrease the time requirement. Besides that, extending our algorithm to handle priors that are not order-modular would be desirable. From the empirical point of view, it would be interesting to see how well some existing score-based heuristics [29] for discovering unobserved nodes perform in terms of learning ancestor relations.

References

[1] Srinivas M. Aji and Robert J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[2] Steven C. Amstrup, Eric T. DeWeaver, David C. Douglas, Bruce G. Marcot, George M. Durner, Cecilia M. Bitz, and David A. Bailey. Greenhouse gas mitigation can reduce sea–ice loss and increase polar bear persistence. Nature, 468(7326):955–960, 2010.

[3] Nicos Angelopoulos and James Cussens. Bayesian learning of Bayesian networks with informative priors. Annals of Mathematics and Artificial Intelligence, 54:53–98, 2008.

[4] Stefan Arnborg, Derek G. Corneil, and Andrzej Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods, 8(2):277–284, 1987.

[5] Kenneth R. Baker and Linus E. Schrage. Finding an optimal sequence by dynamic programming: An extension to precedence-related tasks. Operations Research, 26(1):111–120, 1978.

[6] Richard Bellman. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM, 9(1):61–63, 1962.

[7] Jose M. Bernardo. The concept of exchangeability and its applications. Far East Journal of Mathematical Sciences, 4:111–121, 1996.

[8] Andreas Björklund and Thore Husfeldt. Exact algorithms for exact satisfiability and number of perfect matchings. Algorithmica, 52:226–249, 2008.

[9] Andreas Björklund, Thore Husfeldt, Petteri Kaski, and Mikko Koivisto. Fourier meets Möbius: Fast subset convolution. In Proceedings of the 39th ACM Symposium on Theory of Computing (STOC), pages 67–74. ACM, 2007.


[10] Andreas Björklund, Thore Husfeldt, Petteri Kaski, and Mikko Koivisto. Trimmed Moebius inversion and graphs of bounded degree. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pages 85–96. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Germany, 2008.

[11] Hans L. Bodlaender, Fedor V. Fomin, Arie M. C. A. Koster, Dieter Kratsch, and Dimitrios M. Thilikos. On exact algorithms for treewidth. In Proceedings of the 14th Annual European Symposium on Algorithms (ESA), pages 672–683. Springer, 2006.

[12] Graham Brightwell and Peter Winkler. Counting linear extensions. Order, 8:225–242, 1991.

[13] Wray Buntine. Theory refinement on Bayesian networks. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 52–60. Morgan Kaufmann, 1991.

[14] David Maxwell Chickering. Learning Bayesian networks is NP-Complete. In Doug Fisher and Hans-J. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics, pages 121–130. Springer-Verlag, 1996.

[15] David Maxwell Chickering. Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2:445–498, 2002.

[16] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

[17] David Maxwell Chickering, David Heckerman, and Chris Meek. Large-sample learning of Bayesian networks is NP-Hard. Journal of Machine Learning Research, 5:1287–1330, 2004.

[18] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462–467, 1968.

[19] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.

[20] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, second edition, 2001.

[21] James Cussens. Bayesian network learning with cutting planes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 153–160. AUAI Press, 2011.

[22] Sanjoy Dasgupta. Learning polytrees. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 134–141. Morgan Kaufmann, 1999.

[23] Brian A. Davey and Hilary A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 2003.

[24] Cassio P. de Campos and Qiang Ji. Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12:663–689, 2011.

[25] Cassio P. de Campos, Zhi Zeng, and Qiang Ji. Structure learning of Bayesian networks using constraints. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 113–120. Omnipress, 2009.

[26] Martin Dyer, Leslie Ann Goldberg, Catherine Greenhill, and Mark Jerrum. On the relative complexity of approximate counting problems. Algorithmica, 38(3):471–500, 2004.

[27] Daniel Eaton and Kevin Murphy. Bayesian structure learning using dynamic programming and MCMC. In Proceedings of the Twenty-Third Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 101–108. AUAI Press, 2007.

[28] Gal Elidan and Nir Friedman. Learning the dimensionality of hidden variables. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 144–151. Morgan Kaufmann, 2001.

[29] Gal Elidan and Nir Friedman. Learning hidden variable networks: The information bottleneck approach. Journal of Machine Learning Research, 6:81–127, 2005.

[30] Gal Elidan and Stephen Gould. Learning bounded treewidth Bayesian networks. Journal of Machine Learning Research, 9:2699–2731, 2008.

[31] Gal Elidan, Noam Lotner, Nir Friedman, and Daphne Koller. Discovering hidden variables: A structure-based approach. In Advances in Neural Information Processing Systems 13 (NIPS) 2000, pages 479–485. MIT Press, 2001.

[32] Byron Ellis and Wing Hung Wong. Learning causal Bayesian network structures from data. Journal of the American Statistical Association, 103:778–789, 2008.

[33] Kobra Etminani, Mahmoud Naghibzadeh, and Amir Reza Razavi. Globally optimal structure learning of Bayesian networks from data. In Artificial Neural Networks - ICANN, number 6352 in LNCS, pages 101–106. Springer, 2010.

[34] Jörg Flum and Martin Grohe. Parameterized Complexity Theory. Springer, 2006.

[35] Fedor V. Fomin and Dieter Kratsch. Exact Exponential Algorithms. Springer, 2010.

[36] Fedor V. Fomin and Yngve Villanger. Treewidth computation and extremal combinatorics. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming (ICALP), number 5125 in LNCS, pages 210–221. Springer, 2008.

[37] Andrew Frank and Arthur Asuncion. UCI machine learning repository, 2010.

[38] Nir Friedman. Learning belief networks in the presence of missing values and hidden variables. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), pages 125–133. Morgan Kaufmann, 1997.

[39] Nir Friedman and Daphne Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1–2):95–125, 2003.

[40] Nir Friedman, Michael Linial, Iftach Nachman, and Dana Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.

[41] Michael R. Garey and David S. Johnson. Computers and Intractability. W.H. Freeman and Company, 1979.

[42] Fanica Gavril. Some NP-Complete problems on graphs. In Proceedings of the 11th Conference on Information Sciences and Systems (CISS), pages 91–95, 1977.

[43] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2004.

[44] Walter R. Gilks, Sylvia Richardson, and David J. Spiegelhalter, editors. Markov chain Monte Carlo in Practice. Chapman & Hall/CRC, 1996.

[45] Marco Grzegorczyk and Dirk Husmeier. Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Machine Learning, 71:265–305, 2008.

[46] Yuri Gurevich and Saharon Shelah. Expected computation time for Hamiltonian path problem. SIAM Journal on Computing, 16(3):486–502, 1987.

[47] David Heckerman, Dan Geiger, and David Maxwell Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.

[48] David Heckerman, Chris Meek, and Gregory F. Cooper. A Bayesian approach to causal discovery. In Clark Glymour and Gregory F. Cooper, editors, Computation, Causation and Discovery, pages 141–166. AAAI Press, 1999.

[49] David E. Heckerman, Eric J. Horvitz, and Bharat N. Nathwani. Toward normative expert systems: Part I. The Pathfinder project. Methods of Information in Medicine, 31:90–105, 1992.

[50] Michael Held and Richard M. Karp. A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied Mathematics, 10(1):196–210, 1962.

[51] Inari Helle, Tiina Lecklin, Ari Jolma, and Sakari Kuikka. Modeling the effectiveness of oil combating from an ecological perspective. Journal of Hazardous Materials, 185(1):182–192, 2011.

[52] Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial. Statistical Science, 14(4):382–417, 1999.

[53] Dirk Husmeier. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19(17):2271–2282, 2003.

[54] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning Bayesian network structure using LP relaxations. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, pages 358–365, 2010.

[55] Eun Yong Kang, Ilya Shpitser, and Eleazar Eskin. Respecting Markov equivalence in computing posterior probabilities of causal graphical features. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pages 1175–1180. AAAI Press, 2010.

[56] Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.

[57] Robert Kennes. Computational aspects of the Möbius transformation of graphs. IEEE Transactions on Systems, Man, and Cybernetics, 22(2):201–223, 1992.

[58] Robert Kennes and Philippe Smets. Computational aspects of the Möbius transformation. In Uncertainty in Artificial Intelligence 6 (UAI), pages 401–416. Elsevier, 1991.

[59] Donald E. Knuth. The Art of Computer Programming, volume 2: Seminumerical algorithms. Addison-Wesley, third edition, 1997.

[60] Donald E. Knuth. The Art of Computer Programming, volume 1: Fundamental algorithms. Addison-Wesley, third edition, 1997.

[61] Mikko Koivisto. Sum-Product Algorithms for the Analysis of Genetic Risks. PhD thesis, University of Helsinki, 2004.

[62] Mikko Koivisto. Advances in exact Bayesian structure discovery in Bayesian networks. In Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 241–248. AUAI Press, 2006.

[63] Mikko Koivisto. An O∗(2^n) algorithm for graph coloring and other partitioning problems via inclusion-exclusion. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 583–590. IEEE Computer Society, 2006.

[64] Mikko Koivisto and Pekka Parviainen. A space–time tradeoff for permutation problems. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 484–492. SIAM, 2010.

[65] Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.

[66] Daphne Koller and Nir Friedman. Probabilistic Graphical Models. MIT Press, 2009.

[67] Johan H. P. Kwisthout, Hans L. Bodlaender, and Linda C. van der Gaag. The necessity of bounded treewidth for efficient inference in Bayesian networks. In Proceedings of the 19th European Conference on Artificial Intelligence (ECAI), pages 237–242. IOS Press, 2010.

[68] Pedro Larrañaga, Mikel Poza, Yosu Yurramendi, Roberto H. Murga, and Cindy M. H. Kuijpers. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):912–926, 1996.

[69] Eugene L. Lawler. A comment on minimum feedback arc sets. IEEE Transactions on Circuit Theory, pages 296–297, 1964.

[70] Eugene L. Lawler. Sequencing jobs to minimize total weighted completion time subject to precedence constraints. Annals of Discrete Mathematics, 2:75–90, 1978.

[71] Eugene L. Lawler. Efficient implementation of dynamic programming algorithms for sequencing problems. Technical Report BW 106/79, Stichting Mathematisch Centrum, Amsterdam, 1979.

[72] David Madigan and Jeremy York. Bayesian graphical models for discrete data. International Statistical Review, 63:215–232, 1995.

[73] Brandon Malone, Changhe Yuan, Eric A. Hansen, and Susan Bridges. Improving the scalability of optimal Bayesian network learning with external-memory frontier breadth-first branch and bound search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 479–488. AUAI Press, 2011.

[74] Chris Meek. Finding a path is harder than finding a tree. Journal of Artificial Intelligence Research, 15:383–389, 2001.

[75] Richard E. Neapolitan. Learning Bayesian Networks. Prentice Hall, 2004.

[76] Jesper Nederlof. Fast polynomial-space algorithms using Möbius inversion: Improving on Steiner tree and related problems. In Proceedings of the 36th International Colloquium on Automata, Languages and Programming (ICALP), pages 713–725. Springer, 2009.

[77] Teppo Niinimäki, Pekka Parviainen, and Mikko Koivisto. Partial order MCMC for structure discovery in Bayesian networks. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 557–564. AUAI Press, 2011.

[78] Sascha Ott, Seiya Imoto, and Satoru Miyano. Finding optimal models for small gene networks. In Pacific Symposium on Biocomputing, pages 557–567. World Scientific, 2004.

[79] Sascha Ott and Satoru Miyano. Finding optimal gene networks using biological constraints. Genome Informatics, 14:124–133, 2003.

[80] Pekka Parviainen and Mikko Koivisto. Exact structure discovery in Bayesian networks with less space. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), pages 436–443. AUAI Press, 2009.

[81] Pekka Parviainen and Mikko Koivisto. Bayesian structure discovery in Bayesian networks with less space. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, pages 589–596, 2010.

[82] Pekka Parviainen and Mikko Koivisto. Ancestor relations in the presence of unobserved variables. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 581–596. Springer, 2011.

[83] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[84] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[85] Judea Pearl and Thomas Verma. A theory of inferred causation. In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 441–452. Morgan Kaufmann, 1991.

[86] Eric Perrier, Seiya Imoto, and Satoru Miyano. Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, 9:2251–2286, 2008.

[87] Thomas Richardson. A discovery algorithm for directed cyclic graphs. In Proceedings of the 12th Conference of Uncertainty in Artificial Intelligence (UAI), pages 454–461. Morgan Kaufmann, 1996.

[88] Jorma Rissanen. Modeling by the shortest data description. Automatica, 14:465–471, 1978.

[89] Robert W. Robinson. Counting labeled acyclic digraphs. In New Directions in the Theory of Graphs: Proceedings of the Third Ann Arbor Conference on Graph Theory, 1971, pages 239–273. Academic Press, 1973.

[90] Gian-Carlo Rota. On the foundations of combinatorial theory. I. Theory of Möbius functions. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 2:340–368, 1964.

[91] Walter J. Savitch. Relationship between nondeterministic and deterministic tape complexities. Journal of Computer and System Sciences, 4:177–192, 1970.

[92] Linus Schrage and Kenneth R. Baker. Dynamic programming solution of sequencing problems with precedence constraints. Operations Research, 26:444–449, 1978.

[93] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

[94] Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.

[95] Tomi Silander. The Most Probable Bayesian Network and Beyond. PhD thesis, University of Helsinki, 2009.

[96] Tomi Silander and Petri Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 445–452. AUAI Press, 2006.

[97] Klaus Simon. Finding a minimal transitive reduction in a strongly connected digraph within linear time. In Proceedings of the 15th International Workshop on Graph-Theoretic Concepts in Computer Science (WG’89), pages 245–259. Springer, 1990.

[98] Ajit P. Singh and Andrew W. Moore. Finding optimal Bayesian networks by dynamic programming. Technical Report CMU-CALD-05-106, Carnegie Mellon University, June 2005.

[99] Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9:62–72, 1991.

[100] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. Springer Verlag, 2000.

[101] Peter Spirtes, Chris Meek, and Thomas Richardson. Causal inference in the presence of latent variables and selection bias. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 499–506. Morgan Kaufmann, 1995.

[102] George Steiner. On the complexity of dynamic programming with precedence constraints. Annals of Operations Research, 26:103–123, 1990.

[103] Yoshinori Tamada, Seiya Imoto, and Satoru Miyano. Parallel algorithm for learning optimal Bayesian network structure. Journal of Machine Learning Research, 12:2437–2459, 2011.

[104] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the Twenty-First Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 584–590. AUAI Press, 2005.

[105] Jin Tian and Ru He. Computing posterior probabilities of structural features in Bayesian networks. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), pages 538–547. AUAI Press, 2009.

[106] Jin Tian, Ru He, and Lavanya Ram. Bayesian model averaging using the k-best Bayesian network structures. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), pages 589–597. AUAI Press, 2010.

[107] Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65:31–78, 2006.

[108] Virginia Vassilevska and Ryan Williams. Finding, minimizing, and counting weighted subgraphs. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 455–464. ACM Press, 2009.

[109] Thomas S. Verma and Judea Pearl. Equivalence and synthesis of causal models. In Proceedings of the 6th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 255–270. Elsevier, 1990.

[110] Sewall Wright. Correlation and causation. Journal of Agricultural Research, 20(7):557–585, 1921.

[111] Frank Yates. The Design and Analysis of Factorial Experiments. Number 35 in Technical Communication. Harpenden: Imperial Bureau of Soil Science, 1937.

[112] Changhe Yuan, Brandon Malone, and Xiaojian Wu. Learning optimal Bayesian networks using A* search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), pages 2186–2191. AAAI Press, 2011.

Appendix A

Elementary Mathematical Facts

In this appendix, we present well-known mathematical facts that are used throughout the thesis.

Binomial Theorem

The binomial theorem, whose discovery is attributed to Isaac Newton [60, p. 57–58], states that

    \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i} = (x + y)^n

for n ≥ 0. In particular, setting x = 2 and y = 1 yields

    \sum_{i=0}^{n} \binom{n}{i} 2^i = 3^n.    (A.1)
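As a quick numerical check (not part of the original appendix), identity (A.1) for n = 2 gives

    \sum_{i=0}^{2} \binom{2}{i} 2^i = 1 + 2 \cdot 2 + 1 \cdot 4 = 9 = 3^2.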

Upper Bounds for Binomial Coefficients

Consider the central binomial coefficient \binom{2n}{n} for n ≥ 1. We have

    \binom{2n}{n} = \frac{2n(2n-1) \cdots 2 \cdot 1}{n(n-1) \cdots 2 \cdot 1 \; n(n-1) \cdots 2 \cdot 1}
                  = 2^n \, \frac{(2n-1)(2n-3) \cdots 3 \cdot 1}{n(n-1) \cdots 2 \cdot 1}
                  = 2^{2n} \, \frac{(2n-1)(2n-3) \cdots 3 \cdot 1}{2n(2n-2) \cdots 4 \cdot 2}
                  \le 2^{2n-1}.    (A.2)
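For instance (a check not in the original), for n = 2 the bound (A.2) gives

    \binom{4}{2} = 6 \le 2^{3} = 8.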

The binary entropy function is defined as

    H(p) = -p \log p - (1 - p) \log(1 - p)

if 0 < p < 1, and H(p) = 0 if p = 0 or p = 1; see Figure A.1. It holds (see, for example, Cormen et al. [20, p. 1097–1098]) that

    \binom{n}{pn} \le 2^{H(p)n}.


Figure A.1: The binary entropy function.

For p ≤ 1/2 actually a stronger result holds. It is well-known (for a proof, we refer the reader to Flum and Grohe [34, p. 427]) that

    \sum_{i=0}^{pn} \binom{n}{i} \le 2^{H(p)n}.    (A.3)
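As an illustration (not in the original), for n = 4 and p = 1/4 the bound (A.3) gives

    \sum_{i=0}^{1} \binom{4}{i} = 1 + 4 = 5 \le 2^{H(1/4) \cdot 4} \approx 9.5,

since H(1/4) ≈ 0.811.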
