
Competitive generative models with structure learning for NLP classification tasks

Kristina Toutanova
Microsoft Research
Redmond, WA
[email protected]

Abstract

In this paper we show that generative models are competitive with and sometimes superior to discriminative models, when both kinds of models are allowed to learn structures that are optimal for discrimination. In particular, we compare Bayesian Networks and conditional log-linear models on two NLP tasks. We observe that when the structure of the generative model encodes very strong independence assumptions (a la Naive Bayes), a discriminative model is superior, but when the generative model is allowed to weaken these independence assumptions via learning a more complex structure, it can achieve very similar or better performance than a corresponding discriminative model. In addition, as structure learning for generative models is far more efficient, they may be preferable for some tasks.

1 Introduction

Discriminative models have become the models of choice for NLP tasks, because of their ability to easily incorporate non-independent features and to more directly optimize classification accuracy. State of the art models for many NLP tasks are either fully discriminative or trained using discriminative reranking (Collins, 2000). These include models for part-of-speech tagging (Toutanova et al., 2003), semantic-role labeling (Punyakanok et al., 2005; Pradhan et al., 2005b) and Penn Treebank parsing (Charniak and Johnson, 2005).

The superiority of discriminative models has been shown on many tasks when the discriminative and generative models use exactly the same model structure (Klein and Manning, 2002). However, the advantage of the discriminative models can be very slight (Johnson, 2001), and for small training set sizes generative models can be better because they need fewer training samples to converge to the optimal parameter setting (Ng and Jordan, 2002). Additionally, many discriminative models use a generative model as a base model and add discriminative features with reranking (Collins, 2000; Charniak and Johnson, 2005; Roark et al., 2004), or train discriminatively a small set of weights for features which are generatively estimated probabilities (Raina et al., 2004; Och and Ney, 2002). Therefore it is important to study generative models and to find ways of making them better even when they are used only as components of discriminative models.

Generative models may often perform poorly due to making strong independence assumptions about the joint distribution of features and classes. To avoid this problem, generative models for NLP tasks have often been manually designed to achieve an appropriate representation of the joint distribution, such as in the parsing models of (Collins, 1997; Charniak, 2000). This shows that when the generative models have a good model structure, they can perform quite well.

In this paper, we look differently at comparing generative and discriminative models. We ask the question: given the same set of input features, what is the best a generative model can do if it is allowed to learn an optimal structure for the joint distribution, and what is the best a discriminative model can do if it is also allowed to learn an optimal structure? That is, we do not impose any independence assumptions on the generative or discriminative models and let them learn the best representation of the data they can.
Structure learning is very efficient for generative models in the form of directed graphical models (Bayesian Networks (Pearl, 1988)), since the optimal parameters for such models can be estimated in closed form. We compare Bayesian Networks with structure learning to their closely related discriminative counterpart – conditional log-linear models with structure learning. Our conditional log-linear models can also be seen as Conditional Random Fields (Lafferty et al., 2001), except we do not have a structure on the labels, but want to learn a structure on the features.

We compare the two kinds of models on two NLP classification tasks – prepositional phrase attachment and semantic role labelling. Our results show that the generative models are competitive with or better than the discriminative models. When a small set of interpolation parameters for the conditional probability tables are fit discriminatively, the resulting hybrid generative-discriminative models perform better than the generative-only models and sometimes better than the discriminative models.

In Section 2, we describe in detail the form of the generative and discriminative models we study and our structure search methodology. In Section 3 we present the results of our empirical study.

2 Model Classes and Methodology

2.1 Generative Models

In classification tasks, given a training set of instances D = {[xi, yi]}, where xi are the input features for the i-th instance and yi is its label, the task is to learn a classifier that predicts the labels of new examples. If X is the space of inputs and Y is the space of labels, a classifier is a function f : X → Y. A generative model is one that models the joint probability of inputs and labels PD(x, y) through a distribution Pθ(x, y), dependent on some parameter vector θ. The classifier based on this generative model chooses the most likely label given an input according to the conditionalized estimated joint distribution. The parameters θ of the fitted distribution are usually estimated using the maximum joint likelihood estimate, possibly with a prior.

We study generative models represented as Bayesian Networks (Pearl, 1988), because their parameters can be estimated extremely fast, as the maximizer of the joint likelihood is the closed-form relative frequency estimate. A Bayesian Network is an acyclic directed graph over a set of nodes. For every variable Z, let Pa(Z) denote the set of parents of Z. The structure of the Bayesian Network encodes the following set of independence assumptions: every variable is conditionally independent of its non-descendants given its parents. For example, the structure of the Bayesian Network model in Figure 1 encodes the independence assumption that the input features are conditionally independent given the class label.

Let the input be represented as a vector of m nominal features. We define Bayesian Networks over the m input variables X1, X2, ..., Xm and the class variable Y. In all networks, we add links from the class variable Y to all input features. In this way we have generative models which estimate class-specific distributions over features P(X|Y) and a prior over labels P(Y). Figure 1 shows a simple Bayesian Network of this form, which is the well-known Naive Bayes model.

Figure 1: The Naive Bayes Bayesian Network (the class variable Y is a parent of every input feature X1, X2, ..., Xm).

A specific joint distribution for a given Bayesian Network (BN) is given by a set of conditional probability tables (CPTs) which specify the distribution over each variable given its parents, P(Z|Pa(Z)). The joint distribution P(Z1, Z2, ..., Zm) is given by:

P(Z1, Z2, ..., Zm) = ∏_{i=1...m} P(Zi|Pa(Zi))

The parameters of a Bayesian Network model given its graph structure are the values of the conditional probabilities P(Zi|Pa(Zi)). If the model is trained through maximizing the joint likelihood of the data, the optimal parameters are the relative frequency estimates:

P̂(Zi = v|Pa(Zi) = ū) = count(Zi = v, Pa(Zi) = ū) / count(Pa(Zi) = ū)

Here v denotes a value of Zi and ū denotes a vector of values for the parents of Zi.
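To make the estimation and classification steps above concrete, the following is a minimal Python sketch (not the paper's code) of relative-frequency CPT estimation and prediction for a network with the Naive Bayes structure of Figure 1; all function names and the toy data are illustrative assumptions only.

```python
# Minimal sketch (not the paper's code) of relative-frequency CPT estimation and
# classification for the Naive Bayes structure of Figure 1: the class Y is the
# only parent of every input feature X1..Xm. Names and toy data are illustrative.
from collections import Counter

def estimate_cpts(data, labels):
    """Closed-form MLE: prior P(Y=y) and per-feature CPTs P(Xi=v | Y=y)."""
    m = len(data[0])
    label_counts = Counter(labels)
    feature_counts = [Counter() for _ in range(m)]
    for x, y in zip(data, labels):
        for i, v in enumerate(x):
            feature_counts[i][(v, y)] += 1          # count(Xi=v, Y=y)
    prior = {y: c / len(labels) for y, c in label_counts.items()}
    cpts = [{key: cnt / label_counts[key[1]] for key, cnt in feature_counts[i].items()}
            for i in range(m)]
    return prior, cpts

def classify(x, prior, cpts):
    """argmax_y P(y) * prod_i P(xi | y) -- the conditionalized joint distribution."""
    best_label, best_score = None, -1.0
    for y, p_y in prior.items():
        score = p_y
        for i, v in enumerate(x):
            score *= cpts[i].get((v, y), 0.0)        # unsmoothed; see Witten-Bell below
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Toy PP-attachment-style features (verb, noun1, prep, noun2) with labels v/n.
train_x = [("ate", "pizza", "with", "fork"), ("saw", "man", "with", "telescope")]
train_y = ["v", "n"]
prior, cpts = estimate_cpts(train_x, train_y)
print(classify(("ate", "pasta", "with", "fork"), prior, cpts))
```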
Most often, smoothing is applied to avoid zero probability estimates. A simple form of smoothing is add-α smoothing, which is equivalent to a Dirichlet prior. For NLP tasks it has been shown that other smoothing methods are far superior to add-α smoothing – see, for example, Goodman (2001). In particular, it is important to incorporate lower-order information based on subsets of the conditioning information. Therefore we assume a structural form of the conditional probability tables which implements a more sophisticated type of smoothing – interpolated Witten-Bell (Witten and Bell, 1991). This kind of smoothing has also been used in the generative parser of (Collins, 1997) and has been shown to have a relatively good performance for language modeling (Goodman, 2001).

To describe the form of the conditional probability tables, we introduce some notation. Let Z denote a variable in the BN and Z1, Z2, ..., Zk denote the set of its parents. The probability P(Z = z|Z1 = z1, Z2 = z2, ..., Zk = zk) is estimated using Witten-Bell smoothing as follows (below, the tuple of values z1, z2, ..., zk is denoted by z1k):

PWB(z|z1k) = λ(z1k) × P̂(z|z1k) + (1 − λ(z1k)) × PWB(z|z1k−1)

The interpolation weights λ(z1k) depend on a parameter d for each back-off level of the conditioning context in every CPT – i.e., each CPT has as many d parameters as there are back-off levels.

We place some restrictions on the Bayesian Networks learned, for closer correspondence with the discriminative models and for tractability: every input variable node has the label node as a parent, and at most three parents per variable are allowed.

2.1.1 Structure Search Methodology

Our structure search method differs slightly from previously proposed methods in the literature (Heckerman, 1999; Pernkopf and Bilmes, 2005). The search space is defined as follows. We start with a Bayesian Network containing only the class variable. We denote by CHOSEN the set of variables already in the network and by REMAINING the set of unplaced variables. Initially, only the class variable Y is in CHOSEN and all other variables are in REMAINING.
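Returning to the interpolated Witten-Bell estimates defined above, the sketch below implements the back-off recursion for a single CPT. It assumes the common Witten-Bell parameterization λ(z1k) = count(z1k) / (count(z1k) + d · u(z1k)), where u(z1k) is the number of distinct values of Z observed with context z1k, and it uses a single shared discount d rather than one per back-off level; the class name and this exact parameterization are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a single CPT P(Z | Z1..Zk) with interpolated Witten-Bell smoothing.
# Assumption: lambda(z1k) = n(z1k) / (n(z1k) + d * u(z1k)), with u(z1k) the number of
# distinct values of Z seen with context z1k. A single discount d is shared across
# levels for simplicity, whereas the text above uses one d per back-off level.
from collections import Counter

class WittenBellCPT:
    def __init__(self, num_values, d=1.0):
        self.num_values = num_values      # |Z|; the recursion terminates at the uniform distribution
        self.d = d                        # shared discount (illustrative simplification)
        self.context_counts = []          # level j: count(z1..zj)
        self.joint_counts = []            # level j: count(Z=z, z1..zj)
        self.type_counts = []             # level j: number of distinct z seen with z1..zj

    def fit(self, rows):
        """rows: list of (z, (z1, ..., zk)) training tuples."""
        k = len(rows[0][1])
        self.context_counts = [Counter() for _ in range(k + 1)]
        self.joint_counts = [Counter() for _ in range(k + 1)]
        self.type_counts = [Counter() for _ in range(k + 1)]
        seen = set()
        for z, parents in rows:
            for j in range(k + 1):        # back-off level j keeps the first j parents
                ctx = parents[:j]
                self.context_counts[j][ctx] += 1
                self.joint_counts[j][(z, ctx)] += 1
                if (z, ctx) not in seen:
                    seen.add((z, ctx))
                    self.type_counts[j][ctx] += 1

    def prob(self, z, parents, level=None):
        """P_WB(z|z1j) = lambda(z1j) * Phat(z|z1j) + (1 - lambda(z1j)) * P_WB(z|z1j-1)."""
        j = len(parents) if level is None else level
        if j < 0:
            return 1.0 / self.num_values
        ctx = parents[:j]
        n = self.context_counts[j][ctx]
        lower = self.prob(z, parents, j - 1)
        if n == 0:
            return lower                  # unseen context: fall back entirely to lower order
        lam = n / (n + self.d * self.type_counts[j][ctx])
        p_hat = self.joint_counts[j][(z, ctx)] / n
        return lam * p_hat + (1.0 - lam) * lower

# Usage: estimate P(prep | verb, noun1) from toy tuples and query a smoothed value.
cpt = WittenBellCPT(num_values=3, d=1.0)
cpt.fit([("with", ("ate", "pizza")), ("with", ("ate", "pasta")), ("of", ("saw", "man"))])
print(cpt.prob("with", ("ate", "soup")))  # backs off to P(prep | verb) and P(prep)
```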