Learning Polytrees
Sanjoy Dasgupta
Department of Computer Science
University of California, Berkeley

Abstract

We consider the task of learning the maximum-likelihood polytree from data. Our first result is a performance guarantee establishing that the optimal branching (or Chow-Liu tree), which can be computed very easily, constitutes a good approximation to the best polytree. We then show that it is not possible to do very much better, since the learning problem is NP-hard even to approximately solve within some constant factor.

1 Introduction

The huge popularity of probabilistic nets has created a pressing need for good algorithms that learn these structures from data. Currently the theoretical terrain of this problem has two major landmarks. First, it was shown by Edmonds (1967) that there is a very simple greedy algorithm for learning the maximum-likelihood branching (undirected tree, or equivalently, directed tree in which each node has at most one parent). This algorithm was further simplified by Chow and Liu (1968), and as a result branchings are often called Chow-Liu trees in the literature. Second, Chickering (1996) showed that learning the structure of a general directed probabilistic net is NP-hard, even if each node is constrained to have at most two parents. His learning model uses Bayesian scoring, but the proof can be modified to accommodate a maximum-likelihood framework.

With these two endposts in mind, it is natural to ask whether there is some subclass of probabilistic nets which is more general than branchings, and for which a provably good structure-learning algorithm can be found. Even if this learning problem is NP-hard, it may well be the case that a good approximation algorithm exists – that is, an algorithm whose answer is guaranteed to be close to optimal, in terms of log-likelihood or some other scoring function. We emphasize that from a theoretical standpoint an NP-hardness result closes one door but opens many others. Over the last two decades, provably good approximation algorithms have been developed for a wide host of NP-hard optimization problems, including for instance the Euclidean travelling salesman problem. Are there approximation algorithms for learning structure?

We take the first step in this direction by considering the task of learning polytrees from data. A polytree is a directed acyclic graph with the property that ignoring the directions on edges yields a graph with no undirected cycles. Polytrees have more expressive power than branchings, and have turned out to be a very important class of directed probabilistic nets, largely because they permit fast exact inference (Pearl, 1988). In the setup we will consider, we are given i.i.d. samples from an unknown distribution D over (X1, ..., Xn) and we wish to find some polytree with n nodes, one per Xi, which models the data well. There are no assumptions about D.

The question "how many samples are needed so that a good fit to the empirical distribution reflects a good fit to D?" has been considered elsewhere (Dasgupta, 1997; Höffgen, 1993), and so we will not dwell upon sample complexity issues here.

How do we evaluate the quality of a solution? A natural choice, and one which is very convenient for these factored distributions, is to rate each candidate solution M by its log-likelihood. If M assigns parents Πi to variable Xi, then the maximum-likelihood choice is simply to set the conditional probabilities P(Xi = xi | Πi = πi) to their empirically observed values. The negative log-likelihood of M is then

−ll(M) = Σ_{i=1}^{n} H(Xi | Πi),

where H(Xi|Πi) is the conditional entropy of Xi given Πi (with respect to the empirical distribution), that is to say the randomness left in Xi once the values of its parents Πi are known. A discussion of entropy and related quantities can be found in the book of Cover and Thomas (1991).
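To make this scoring function concrete, here is a minimal sketch (ours, not the paper's) that estimates each H(Xi | Πi) from data and sums them as above. It assumes samples are given as a list of dicts mapping variable names to discrete values; the function names conditional_entropy and negative_log_likelihood are our own.

```python
from collections import Counter
from math import log2

def conditional_entropy(samples, child, parents):
    """Empirical H(child | parents), in bits, over a list of dict-valued samples."""
    n = len(samples)
    joint = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
    parent_marg = Counter(tuple(s[p] for p in parents) for s in samples)
    # H(X | Pi) = - sum over (pa, x) of p(pa, x) * log2 p(x | pa)
    h = 0.0
    for (pa, _), count in joint.items():
        p_joint = count / n
        p_cond = count / parent_marg[pa]
        h -= p_joint * log2(p_cond)
    return h

def negative_log_likelihood(samples, parent_sets):
    """The score above: the sum of H(Xi | Pi_i) over all nodes, in bits per sample."""
    return sum(conditional_entropy(samples, x, pa) for x, pa in parent_sets.items())

# Small illustration: X2 and X3 are fair coin flips and X1 is their exclusive-or,
# so X1 alone looks random but is fully explained by its two parents.
data = [{"X1": a ^ b, "X2": a, "X3": b} for a in (0, 1) for b in (0, 1)]
print(negative_log_likelihood(data, {"X1": ("X2", "X3"), "X2": (), "X3": ()}))  # 2.0
print(negative_log_likelihood(data, {"X1": (), "X2": (), "X3": ()}))            # 3.0
```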
The core combinatorial problem can now be defined as

PT: Select a (possibly empty) set of parents Πi for each node Xi, so that the resulting structure is a polytree and so as to minimize the quantity Σ_{i=1}^{n} H(Xi | Πi),

or, in deference to sample complexity considerations,

PT(k): just like PT, except that each node is allowed at most k parents (we shall call these k-polytrees).

It is not at all clear that PT or PT(k) is any easier than learning the structure of a general probabilistic net. However, we are able to obtain two results for this problem. First, the optimal branching (which can be found very quickly) constitutes a provably good approximation to the best polytree. Second, it is not possible to do very much better, because there is some constant c > 1 such that it is NP-hard to find any 2-polytree whose cost (negative log-likelihood) is within a multiplicative factor c of optimal. That is to say, if on input I the optimal solution to PT(2) has cost OPT(I), then it is NP-hard to find any 2-polytree whose cost is ≤ c · OPT(I). To the best of our knowledge, this is the first result that treats hardness of approximation in the context of learning structure.

Although our positive result may seem surprising because the class of branchings is less expressive than the class of general polytrees, it is rather good news. Branchings have the tremendous advantage of being easy to learn, with low sample complexity requirements – since each node has at most one parent, only pairwise correlations need to be estimated – and this has made them the centerpiece of some very interesting work in pattern recognition. Recently, for instance, Friedman, Geiger, and Goldszmidt (1997) have used branchings to construct classifiers, and Meilă, Jordan, and Morris (1998) have modelled distributions by mixtures of branchings.

There has been a lot of work on reconstructing polytrees given data which actually comes from a polytree distribution – see, for instance, the papers of Geiger, Paz and Pearl (1990) and Acid, de Campos, González, Molina and Pérez de la Blanca (1991). In our framework we are trying to approximate an arbitrary distribution using polytrees. Although there are various local search techniques, such as EM or gradient descent, which can be used for this problem, no performance guarantees are known for them. Such guarantees are very helpful, because they provide a scale along which different algorithms can be meaningfully compared, and because they give the user some indication of how much confidence he can have in the solution.

The reader who seeks a better intuitive understanding of the polytree learning problem might find it useful to study the techniques used in the performance guarantee and the hardness result. Such proofs inevitably bring into sharp focus exactly those aspects of a problem which make it difficult, and thereby provide valuable insight into the hurdles that must be overcome by good algorithms.

2 An approximation algorithm

We will show that the optimal branching is not too far behind the optimal polytree. This instantly gives us an O(n²) approximation algorithm for learning polytrees.

Let us start by reviewing our expression for the cost (negative log-likelihood) of a candidate solution:

−ll(M) = Σ_{i=1}^{n} H(Xi | Πi).     (†)

This simple additive scoring function has a very pleasing intuitive explanation. The distribution to be modelled (call it D) has a certain inherent randomness content H(D). There is no structure whose score is less than this. The most naive structure, which has n nodes and no edges, will have score Σi H(Xi), where H(Xi) is the entropy (randomness) of the individual variable Xi. There is likewise no structure which can possibly do worse than this. We can think of this as a starting point. The goal then is to carefully add edges to decrease the entropy-cost as much as possible, while maintaining a polytree structure. The following examples illustrate why the optimization problem might not be easy.

Example 1. Suppose X1 by itself looks like a random coin flip but its value is determined completely by X2 and X3. That is, H(X1) = 1 but H(X1|X2,X3) = 0. There is then a temptation to add edges from X2 and X3 to X1, but it might ultimately be unwise to do so because of problematic cycles that this creates. In particular, a polytree can contain at most n − 1 edges, which means that each node must on average have (slightly less than) one parent.

Example 2. If X2, X3 and X4 are independent random coin flips, and X1 = X2 ⊕ X3 ⊕ X4, then adding edges from X2, X3 and X4 to X1 will reduce the entropy of X1 by one. However, any proper subset of these three parents provides no information about X1; for instance, H(X1|X2,X3) = H(X1) = 1.
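Example 2 is easy to check numerically. The sketch below (again our own, not the paper's) enumerates the eight equally likely outcomes of the three coin flips with X1 = X2 ⊕ X3 ⊕ X4, so the empirical distribution coincides with the true one, and reuses the conditional_entropy helper from the earlier sketch.

```python
from itertools import combinations, product

# Uniform distribution over (X2, X3, X4); X1 is their XOR, as in Example 2.
xor_data = [{"X1": a ^ b ^ c, "X2": a, "X3": b, "X4": c}
            for a, b, c in product((0, 1), repeat=3)]

print(conditional_entropy(xor_data, "X1", ()))                  # 1.0: H(X1)
print(conditional_entropy(xor_data, "X1", ("X2", "X3", "X4")))  # 0.0: all three parents
for pair in combinations(("X2", "X3", "X4"), 2):
    print(pair, conditional_entropy(xor_data, "X1", pair))      # 1.0: no pair helps at all
```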
The cost (†) of a candidate solution can lie anywhere in the range [0, O(n)]. The most interesting situations are those in which the optimal structure has very low cost (low entropy), since it is in these cases that there is structure to be discovered.

We will show that the optimal branching has cost (and therefore log-likelihood) which is at most about a constant factor away from that of the optimal polytree. This is a very significant performance guarantee, given that this latter cost can be as low as one or as high as n.
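For completeness, here is one way the optimal branching itself can be computed, following the Chow-Liu recipe cited in the introduction: weight each pair of variables by its empirical mutual information, take a maximum-weight spanning tree, and orient the edges away from an arbitrary root so that every node has at most one parent. The implementation details below (Prim's algorithm, the breadth-first orientation, the function names) are our own sketch, not the paper's.

```python
from collections import Counter, deque
from math import log2

def mutual_information(samples, a, b):
    """Empirical I(Xa; Xb) in bits, from a list of dict-valued samples."""
    n = len(samples)
    pa = Counter(s[a] for s in samples)
    pb = Counter(s[b] for s in samples)
    pab = Counter((s[a], s[b]) for s in samples)
    return sum((c / n) * log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def chow_liu_branching(samples, variables):
    """Maximum-weight spanning tree on pairwise mutual information,
    oriented away from variables[0] so that each node has at most one parent."""
    # The O(n^2) step: precompute all pairwise mutual informations.
    mi = {frozenset((u, v)): mutual_information(samples, u, v)
          for i, u in enumerate(variables) for v in variables[i + 1:]}
    # Prim's algorithm over the complete graph with these edge weights.
    root, in_tree = variables[0], {variables[0]}
    neighbours = {v: [] for v in variables}
    while len(in_tree) < len(variables):
        u, v = max(((u, v) for u in in_tree for v in variables if v not in in_tree),
                   key=lambda e: mi[frozenset(e)])
        neighbours[u].append(v)
        neighbours[v].append(u)
        in_tree.add(v)
    # Orient edges away from the root (breadth-first) to obtain the branching.
    parents, queue, seen = {root: ()}, deque([root]), {root}
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in seen:
                parents[v] = (u,)
                seen.add(v)
                queue.append(v)
    return parents
```

On the XOR data of Example 2 every pairwise mutual information is zero, so any branching returned here scores Σi H(Xi) = 4 bits, whereas the best polytree (all three edges into X1) scores 3; it is this kind of gap that the performance guarantee of this section bounds.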