Unsupervised Grammar Induction with Depth-Bounded PCFG
Lifeng Jin, Department of Linguistics, The Ohio State University
Finale Doshi-Velez, Harvard University
Timothy Miller, Boston Children’s Hospital & Harvard Medical School
William Schuler, Department of Linguistics, The Ohio State University
Lane Schwartz, Department of Linguistics, University of Illinois at Urbana-Champaign

Abstract

There has been recent interest in applying cognitively- or empirically-motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). This work extends this depth-bounding approach to probabilistic context-free grammar induction (DB-PCFG), which has a smaller parameter space than hierarchical sequence models, and therefore more fully exploits the space reductions of depth-bounding. Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.

1 Introduction

Grammar acquisition or grammar induction (Carroll and Charniak, 1992) has been of interest to linguists and cognitive scientists for decades. This task is interesting because a well-performing acquisition model can serve as a good baseline for examining factors of grounding (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010), or as a piece of evidence (Clark, 2001; Zuidema, 2003) about the Distributional Hypothesis (Harris, 1954) against the poverty of the stimulus (Chomsky, 1965). Unfortunately, previous attempts at inducing unbounded context-free grammars (Johnson et al., 2007; Liang et al., 2009) converged to weak modes of a very multimodal distribution of grammars.

There has been recent interest in applying cognitively- or empirically-motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). Ponvert et al. (2011) and Shain et al. (2016) in particular report benefits for depth bounds on grammar acquisition using hierarchical sequence models, but either without the capacity to learn full grammar rules (e.g. that a noun phrase may consist of a noun phrase followed by a prepositional phrase), or with a very large parameter space that may offset the gains of depth-bounding. This work extends the depth-bounding approach to directly induce probabilistic context-free grammars,[1] which have a smaller parameter space than hierarchical sequence models, and therefore arguably make better use of the space reductions of depth-bounding. This approach employs a procedure for deriving a sequence model from a PCFG (van Schijndel et al., 2013), developed in the context of a supervised learning model, and adapts it to an unsupervised setting.

Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, as shown in a noun phrase discovery task, something which has not been demonstrated by other acquisition models.

[1] https://github.com/lifengjin/db-pcfg
2 Related work

This paper describes a Bayesian Dirichlet model of depth-bounded probabilistic context-free grammar (PCFG) induction. Bayesian Dirichlet models have been applied to the related area of latent variable PCFG induction (Johnson et al., 2007; Liang et al., 2009), in which subtypes of categories like noun phrases and verb phrases are induced on a given tree structure. The model described in this paper is given only words, and it induces not only categories for constituents but also tree structures.

There are a wide variety of approaches to grammar induction outside the Bayesian modeling paradigm. The CCL system (Seginer, 2007a) uses deterministic scoring systems to generate bracketed output from raw text. UPPARSE (Ponvert et al., 2011) uses a cascade of HMM chunkers to produce syntactic structures. BMMM+DMV (Christodoulopoulos et al., 2012) combines an unsupervised part-of-speech (POS) tagger, BMMM, with an unsupervised dependency grammar inducer, DMV (Klein and Manning, 2004), alternating between phases of inducing POS tags and inducing dependency structures. A large amount of work (Klein and Manning, 2002; Klein and Manning, 2004; Bod, 2006; Berg-Kirkpatrick et al., 2010; Gillenwater et al., 2011; Headden et al., 2009; Bisk and Hockenmaier, 2013; Scicluna and de la Higuera, 2014; Jiang et al., 2016; Han et al., 2017) has focused on grammar induction with input annotated with POS tags, mostly for dependency grammar induction. Although POS tags can also be induced, this separate induction has been criticized (Pate and Johnson, 2016) for missing an opportunity to leverage information learned in grammar induction to estimate POS tags. Moreover, most of these models explore a search space that includes syntactic analyses that may be extensively center-embedded and therefore are unlikely to be produced by human speakers. Unlike most of these approaches, the model described in this paper uses cognitively motivated bounds on the depth of human recursive processing to constrain its search of possible trees for input sentences.

Some previous work uses depth bounds in the form of sequence models (Ponvert et al., 2011; Shain et al., 2016), but these either do not produce complete phrase structure grammars (Ponvert et al., 2011) or do so at the expense of large parameter sets (Shain et al., 2016). Other work implements depth bounds on left-corner configurations of dependency grammars (Noji and Johnson, 2016), but the use of a dependency grammar makes the system impractical for addressing questions of how category types such as noun phrases may be learned. Unlike these, the model described in this paper induces a PCFG directly and then bounds it with a model-to-model transform, which yields a smaller space of learnable parameters and directly models the acquisition of category types as labels.

Some induction models learn semantic grammars from text annotated with semantic predicates (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2012). There is evidence that humans use semantic bootstrapping during grammar acquisition (Naigles, 1990), but these models typically rely on a set of pre-defined universals, such as combinators (Steedman, 2000), which simplify the induction task. In order to help address the question of whether such universals are indeed necessary for grammar induction, the model described in this paper does not assume any strong universals except independently motivated limits on working memory.

3 Background

Like Noji and Johnson (2016) and Shain et al. (2016), the model described in this paper defines bounding depth in terms of the memory elements required in a left-corner parse. A left-corner parser (Rosenkrantz and Lewis, 1970; Johnson-Laird, 1983; Abney and Johnson, 1991; Resnik, 1992) uses a stack of memory elements to store derivation fragments during incremental processing. Each derivation fragment represents a disjoint connected component of phrase structure a/b, consisting of a top sign a lacking a bottom sign b yet to come. For example, Figure 1 shows the derivation fragments in a traversal of a phrase structure tree for the sentence The cart the horse the man bought pulled broke. Immediately before processing the word man, the traversal has recognized three fragments of tree structure: two from category NP to category RC (covering the cart and the horse) and one from category NP to category N (covering the). Derivation fragments at every time step are numbered top-down by depth d, up to a maximum depth of D. A left-corner parser requires more derivation fragments, and thus more memory, to process center-embedded constructions than to process left- or right-embedded constructions.

[Figure 1: Derivation fragments before the word man in a left-corner traversal of the sentence The cart the horse the man bought pulled broke.]
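As a concrete illustration of the memory bound, the following sketch (not the authors' implementation; the Fragment type, the variable names, and the choice of bound are assumptions of this example) represents each derivation fragment a/b as a pair of categories and reproduces the store contents immediately before the word man, where double center embedding requires three memory elements.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Fragment:
    """A derivation fragment a/b: a top sign a lacking a bottom sign b."""
    top: str     # category of the recognized top sign, e.g. 'NP'
    bottom: str  # category of the bottom sign yet to come, e.g. 'RC'

# Store contents immediately before processing 'man' in Figure 1,
# numbered top-down by depth d = 1..3:
store: Tuple[Fragment, ...] = (
    Fragment('NP', 'RC'),  # depth 1: covers 'the cart'
    Fragment('NP', 'RC'),  # depth 2: covers 'the horse'
    Fragment('NP', 'N'),   # depth 3: covers 'the'
)

D = 2  # an illustrative bounding depth (assumption of this sketch)
# The doubly center-embedded example needs three memory elements, so it
# falls outside a bound of D = 2; left- or right-embedded constructions
# generally require fewer elements.
print(len(store) <= D)  # False
```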
At each time step t, the parser makes a fork decision f_t, generates a preterminal category p_t, makes a join decision j_t, and chooses top and bottom categories a_t^{1..D} and b_t^{1..D} for the derivation fragments at each depth (the fork and join decisions correspond, in Johnson-Laird (1983) terms, to shift with or without match and to predict with or without match). Probabilities from these distributions are then multiplied together to define a transition model M over hidden states:

    M[q_{t-1}, q_t] = P(q_t \mid q_{t-1})                                                                  (1a)
                    \overset{\text{def}}{=} P(f_t \, p_t \, j_t \, a_t^{1..D} \, b_t^{1..D} \mid q_{t-1})  (1b)
                    = P(f_t \mid q_{t-1})
                      \cdot P(p_t \mid q_{t-1} \, f_t)
                      \cdot P(j_t \mid q_{t-1} \, f_t \, p_t)
                      \cdot P(a_t^{1..D} \mid q_{t-1} \, f_t \, p_t \, j_t)
                      \cdot P(b_t^{1..D} \mid q_{t-1} \, f_t \, p_t \, j_t \, a_t^{1..D}).                 (1c)

For example, just after the word horse is recognized in Figure 1, the parser store contains two derivation fragments yielding the cart and the horse, both with top category NP and bottom category RC. The parser then decides to fork out the next word the based on the bottom category RC of the last derivation fragment on the store. Then the parser generates a preterminal category D for this word based on this fork decision and the bottom category of the last derivation fragment on the store.
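To make the factorization in Equation 1c concrete, here is a minimal sketch, assuming each component distribution is supplied as a plain callable returning a probability; the function name transition_prob, the argument order, and the dummy uniform distribution are illustrative assumptions, not the paper's implementation.

```python
def transition_prob(q_prev, f, p, j, a, b, P_f, P_p, P_j, P_a, P_b):
    """Equation (1c): P(q_t | q_{t-1}) as a product of five conditional
    factors over the fork decision f, the preterminal category p, the
    join decision j, and the per-depth category vectors a = a^{1..D}
    and b = b^{1..D}."""
    return (P_f(f, q_prev)
            * P_p(p, q_prev, f)
            * P_j(j, q_prev, f, p)
            * P_a(a, q_prev, f, p, j)
            * P_b(b, q_prev, f, p, j, a))


def uniform(*args):
    """Dummy conditional distribution used only for this illustration."""
    return 0.5


# Toy usage: fork out the next word, tag it D, no join, one NP/N fragment.
prob = transition_prob(q_prev=None, f=1, p='D', j=0,
                       a=('NP',), b=('N',),
                       P_f=uniform, P_p=uniform, P_j=uniform,
                       P_a=uniform, P_b=uniform)
print(prob)  # 0.03125, i.e. 0.5 ** 5
```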