Leveraging VerbNet to Build Corpus-Specific Verb Clusters
Daniel W. Peterson, Jordan Boyd-Graber, and Martha Palmer
University of Colorado
{daniel.w.peterson, jordan.boyd.graber, martha.palmer}@colorado.edu

Daisuke Kawahara
Kyoto University, JP
[email protected]

Abstract

In this paper, we aim to close the gap between extensive, human-built semantic resources and corpus-driven unsupervised models. The particular resource explored here is VerbNet, whose organizing principle is that semantics and syntax are linked. To capture patterns of usage that can augment knowledge resources like VerbNet, we expand a Dirichlet process mixture model to predict a VerbNet class for each sense of each verb, allowing us to incorporate annotated VerbNet data to guide the clustering process. The resulting clusters align more closely to hand-curated syntactic/semantic groupings than any previous models, and can be adapted to new domains since they require only corpus counts.

1 Introduction

In this paper, we aim to close the gap between extensive, human-built semantic resources and corpus-driven unsupervised models. The work done by linguists over years of effort has been validated by the scientific community, and promises real traction on the fuzzy problem of deriving meaning from words. However, lack of coverage and adaptability currently limit the usefulness of this work.

The particular resource explored here is VerbNet (Kipper-Schuler, 2005), a semantic resource built upon the foundation of verb classes by Levin (1993). Levin's verb classes are built on the hypothesis that syntax and semantics are fundamentally linked. The semantics of a verb affect the allowable syntactic constructions involving that verb, creating regularities in language to which speakers are extremely sensitive. It follows that grouping verbs by allowable syntactic realizations leads from syntax to meaningful semantic groupings. This seed grew into VerbNet, a process which involved dozens of linguists and a decade of work, making careful decisions about the allowable syntactic frames for various verb senses, informed by text examples.

VerbNet is useful for semantic role labeling and related tasks (Giuglea and Moschitti, 2006; Yi, 2007; Yi et al., 2007; Merlo and van der Plas, 2009; Kshirsagar et al., 2014), but its widespread use is limited by coverage. Not all verbs have a VerbNet class, and some polysemous verbs have important senses unaccounted for. In addition, VerbNet is not easily adaptable to domain-specific corpora, so these omissions may be more prominent outside of the general-purpose corpora and linguistic intuition used in its construction. Its great strength is also its downfall: adding new verbs, new senses, and new classes requires trained linguists, at least if the integrity of the resource is to be preserved.

According to Levin's hypothesis, knowing the set of allowable syntactic patterns for a verb sense is sufficient to make meaningful semantic classifications. Large-scale corpora provide an extremely comprehensive picture of the possible syntactic realizations for any particular verb. With enough data in the training set, even infrequent verbs have sufficient data to support learning. Kawahara et al. (2014) showed that, using a Dirichlet Process Mixture Model (DPMM), a VerbNet-like clustering of verb senses can be built from counts of syntactic features.

We develop a model to extend VerbNet, using a large corpus with machine-annotated dependencies. We build on prior work by adding partial supervision from VerbNet, treating VerbNet classes as additional latent variables. The resulting clusters are more similar to the evaluation set, and each cluster in the DPMM predicts its VerbNet class distribution naturally. Because the technique is data-driven, it is easily adaptable to domain-specific corpora.
2 Prior Work

Parisien and Stevenson (2011) and Kawahara et al. (2014) showed distinct ways of applying the Hierarchical Dirichlet Process (Teh et al., 2006) to uncover the latent clusters from cluster examples. The latter used significantly larger corpora, and explicitly separated verb sense induction from the syntactic/semantic clustering, which allowed more fine-grained control of each step.

In Kawahara et al. (2014), two identical DPMMs were used. The first clustered verb instances into senses, and one such model was trained for each verb. These verb-sense clusters are available publicly, and are used unmodified in this paper. The second DPMM clusters verb senses into VerbNet-like clusters of verbs. The result is a resource that, like VerbNet, captures the inherent polysemy of verbs. We focus our improvements on this second step, and try to derive verb clusters that more closely align to VerbNet.

[Figure 1: The DPMM used in Kawahara et al. (2014) for clustering verb senses. M is the number of verb senses, and N is the sum total of slot counts for that verb sense.]

2.1 Dirichlet Process Mixture Models

The DPMM used in Kawahara et al. (2014) is shown in Figure 1. The clusters are drawn from a Dirichlet Process with hyperparameter α and base distribution G. The Dirichlet process prior creates a clustering effect described by the Chinese Restaurant Process. Each cluster is chosen proportionally to the number of elements it already contains, i.e.,

P(k \mid \alpha, C_k(*)) \propto \begin{cases} C_k(*), & \text{if } C_k(*) > 0 \\ \alpha, & \text{if } k = k_{new} \end{cases}    (1)

where C_k(*) is the count of clustered items already in cluster k. Each cluster k has an associated multinomial distribution over vocabulary items (e.g. slot:token pairs), φ_k, which is drawn from G, a Dirichlet distribution of the same size as the vocabulary, parameterized by a constant β. Because the Dirichlet is the multinomial's conjugate prior, we can integrate out φ_k analytically, given counts of vocabulary items drawn from φ_k. For a particular vocabulary item w, we compute

P(w \mid \phi_k, \beta) = \frac{C_k(w) + \beta}{C_k(*) + |V|\beta},    (2)

where C_k(w) is the number of times w has been drawn from φ_k, C_k(*) = \sum_i C_k(i), and |V| is the size of the vocabulary.

When assigning a verb instance to a sense, a single instance may have multiple syntactic arguments w. Using Bayes's law, we update each assignment iteratively using Gibbs sampling, using Equations (1) and (2), according to

P(k \mid \alpha, C_k(*), \phi_k, \beta) \propto P(k \mid \alpha, C_k(*)) \prod_w P(w \mid \phi_k, \beta).    (3)

Setting β < 1 encourages the clusters to have a sparse representation in the vocabulary space. α = 1 is a typical choice, and encourages a small number of clusters to be used.

2.2 Step-wise Verb Cluster Creation

By separating the verb sense induction and the clustering of verb senses, the features can be optimized for the distinct tasks. According to Kawahara et al. (2014), the best features for inducing verb senses are joint slot:token pairs. For the verb clustering task, slot features which ignore the lexical items were the most effective. This aligns with Levin's hypothesis of diathesis alternations: the syntactic contexts are sufficient for the clustering.
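To make this two-stage setup concrete, the following minimal Python sketch (not the authors' released code) shows one assignment step of the second-stage clustering: a verb sense is represented by counts of slot-only features (the first, sense-induction stage would count slot:token pairs instead), and a cluster is sampled for it according to Equations (1)-(3). All function names, variable names, and hyperparameter values are illustrative assumptions.

```python
import math
import random
from collections import Counter

# A first-stage sense counts slot:token pairs (e.g. "nsubj:she", "dobj:book");
# for second-stage clustering only the slots themselves are counted, e.g.:
sense = Counter({"nsubj": 40, "dobj": 35, "iobj": 22, "prep_to": 13})

def log_assignment_probs(sense, clusters, alpha, beta, vocab_size):
    """Unnormalized log-probabilities of assigning one verb sense (a Counter of
    slot features, already removed from its current cluster) to each existing
    cluster or to a new, empty cluster (Equations 1-3)."""
    log_probs = []
    for cluster in list(clusters) + [Counter()]:             # final entry = new cluster
        size = sum(cluster.values())                         # C_k(*)
        lp = math.log(size) if size > 0 else math.log(alpha)  # Eq. 1: CRP prior
        for feature, count in sense.items():                 # Eq. 3: product over arguments w
            p_w = (cluster[feature] + beta) / (size + vocab_size * beta)  # Eq. 2
            lp += count * math.log(p_w)
        log_probs.append(lp)
    return log_probs

def sample_assignment(sense, clusters, alpha=1.0, beta=0.1, vocab_size=50):
    """Draw a cluster index for `sense`; the last index means 'open a new cluster'."""
    log_probs = log_assignment_probs(sense, clusters, alpha, beta, vocab_size)
    shift = max(log_probs)                                   # stabilize exponentiation
    weights = [math.exp(lp - shift) for lp in log_probs]
    return random.choices(range(len(weights)), weights=weights)[0]
```

One full Gibbs sweep removes each sense from its cluster, recomputes these weights, and resamples its assignment; with β < 1 the per-cluster slot distributions stay sparse, and α controls how readily new clusters are opened.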
In this paper, we re-create the second stage clustering with the same features, but add supervision. Supervised Topic Modeling (Mimno and McCallum, 2008; Ramage et al., 2009) builds on the Bayesian framework by adding, for each item, a prediction about a variable of interest, which is observed at least some of the time. This encourages the topics to be useful at predicting a supervised signal, as well as coherent as topics. We do not have explicit knowledge of the VerbNet class for any of the first-level DPMM's verb senses, so our supervision is informed only at the level of the verb.

3 Supervised DPMM

Adding supervision to the DPMM is fairly straightforward: at each step, we sample both a mixture component k and a VerbNet class y. For this, we assign each cluster (mixture component) a unique distribution ρ over VerbNet classes, drawn from a fixed-size Dirichlet prior with parameter γ. As before, this allows us to estimate the likelihood of a VerbNet class y knowing only the counts of assigned senses, C_k(y), for each y, as

P(y \mid \rho_k, \gamma) = \frac{C_k(y) + \gamma}{C_k(*) + |S|\gamma},    (4)

where |S| is the number of classes in the supervision.

The likelihood of choosing a class for a particular verb requires us to form an estimate of that verb's probability of joining a particular VerbNet class.

Sampling a cluster for a verb sense now depends on the VerbNet class y,

P(k \mid y, \alpha, \phi_k, \beta, \rho_k, \gamma, \theta_v, \eta) \propto P(k \mid \alpha, C_k(*)) \times P(y \mid \rho_k, \gamma, \theta_v, \eta) \times \prod_w P(w \mid \phi_k, \beta).    (7)

We then update y based on Equation 6, and then resample for the next batch.

The supervised process is depicted in Figure 2. In brief, we know for each verb an η, given by counts from SemLink, which we use as a prior for θ. We sample, in addition to the cluster label k, a VerbNet class y, which depends on θ and ρ, where ρ is the distribution over VerbNet classes in cluster k. ρ is drawn from a Dirichlet distribution parameterized by γ < 1, encouraging each cluster to have a sparse distribution over VerbNet classes. Because y depends on both θ and ρ, the clusters are encouraged to align with VerbNet classes.
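A correspondingly minimal sketch of the supervised assignment step, continuing the conventions of the previous snippet. It scores each candidate cluster with Equation (7), using Equation (4) for the cluster-side class term; the verb-level distribution θ_v (whose estimation, the paper's Equations 5-6, is not reproduced in this excerpt) does not depend on the cluster, so it drops out when sampling k for a fixed y. All names and default values are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter

def sample_cluster_given_class(sense, y, clusters, cluster_classes, num_classes,
                               alpha=1.0, beta=0.1, gamma=0.1, vocab_size=50):
    """Sample a cluster k for one verb sense whose current VerbNet class is y (Eq. 7).

    `clusters` holds per-cluster slot-feature counts; `cluster_classes` holds, for each
    cluster, the counts C_k(y) of VerbNet classes already assigned to its senses.
    y itself is afterwards resampled (Equation 6 of the paper, not shown here)."""
    log_probs = []
    for k, cluster in enumerate(list(clusters) + [Counter()]):  # last entry = new cluster
        size = sum(cluster.values())
        lp = math.log(size) if size > 0 else math.log(alpha)    # Eq. 1: CRP prior
        for feature, count in sense.items():                    # Eqs. 2-3: slot likelihood
            lp += count * math.log((cluster[feature] + beta) / (size + vocab_size * beta))
        classes = cluster_classes[k] if k < len(clusters) else Counter()
        class_total = sum(classes.values())                     # C_k(*) over classes
        # Eq. 4: cluster k's smoothed affinity for the current class y
        lp += math.log((classes[y] + gamma) / (class_total + num_classes * gamma))
        log_probs.append(lp)
    shift = max(log_probs)
    weights = [math.exp(lp - shift) for lp in log_probs]
    return random.choices(range(len(weights)), weights=weights)[0]
```

With γ < 1 each cluster concentrates its class distribution on a few VerbNet classes, which is what pulls the induced clusters toward the annotated SemLink/VerbNet signal during sampling.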