Unsupervised Grammar Induction with Depth-bounded PCFG

Lifeng Jin, Department of Linguistics, The Ohio State University, [email protected]
Finale Doshi-Velez, Harvard University, [email protected]
Timothy Miller, Boston Children's Hospital & Harvard Medical School, [email protected]

William Schuler, Department of Linguistics, The Ohio State University, [email protected]
Lane Schwartz, Department of Linguistics, University of Illinois at Urbana-Champaign, [email protected]

Abstract

There has been recent interest in applying cognitively- or empirically-motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). This work extends this depth-bounding approach to probabilistic context-free grammar induction (DB-PCFG), which has a smaller parameter space than hierarchical sequence models, and therefore more fully exploits the space reductions of depth-bounding. Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.

1 Introduction

Grammar acquisition or grammar induction (Carroll and Charniak, 1992) has been of interest to linguists and cognitive scientists for decades. This task is interesting because a well-performing acquisition model can serve as a good baseline for examining factors of grounding (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010), or as a piece of evidence (Clark, 2001; Zuidema, 2003) about the Distributional Hypothesis (Harris, 1954) against the poverty of the stimulus (Chomsky, 1965). Unfortunately, previous attempts at inducing unbounded context-free grammars (Johnson et al., 2007; Liang et al., 2009) converged to weak modes of a very multimodal distribution of grammars. There has been recent interest in applying cognitively- or empirically-motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). Ponvert et al. (2011) and Shain et al. (2016) in particular report benefits for depth bounds on grammar acquisition using hierarchical sequence models, but either without the capacity to learn full grammar rules (e.g. that a noun phrase may consist of a noun phrase followed by a prepositional phrase), or with a very large parameter space that may offset the gains of depth-bounding. This work extends the depth-bounding approach to directly induce probabilistic context-free grammars,[1] which have a smaller parameter space than hierarchical sequence models, and therefore arguably make better use of the space reductions of depth-bounding. This approach employs a procedure for deriving a sequence model from a PCFG (van Schijndel et al., 2013), developed in the context of a supervised learning model, and adapts it to an unsupervised setting.

Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, as shown in a noun phrase discovery task, something which has not been demonstrated by other acquisition models.

[1] https://github.com/lifengjin/db-pcfg


2 Related work

This paper describes a Bayesian Dirichlet model of depth-bounded probabilistic context-free grammar (PCFG) induction. Bayesian Dirichlet models have been applied to the related area of PCFG induction (Johnson et al., 2007; Liang et al., 2009), in which subtypes of categories like noun phrases and verb phrases are induced on a given tree structure. The model described in this paper is given only words, and not only induces categories for constituents but also tree structures.

There is a wide variety of approaches to grammar induction outside the Bayesian modeling paradigm. The CCL system (Seginer, 2007a) uses deterministic scoring systems to generate bracketed output of raw text. UPPARSE (Ponvert et al., 2011) uses a cascade of HMM chunkers to produce syntactic structures. BMMM+DMV (Christodoulopoulos et al., 2012) combines an unsupervised part-of-speech (POS) tagger, BMMM, and an unsupervised dependency grammar inducer, DMV (Klein and Manning, 2004). The BMMM+DMV system alternates between phases of inducing POS tags and inducing dependency structures. A large amount of work (Klein and Manning, 2002; Klein and Manning, 2004; Bod, 2006; Berg-Kirkpatrick et al., 2010; Gillenwater et al., 2011; Headden et al., 2009; Bisk and Hockenmaier, 2013; Scicluna and de la Higuera, 2014; Jiang et al., 2016; Han et al., 2017) has been on grammar induction with input annotated with POS tags, mostly for dependency grammar induction. Although POS tags can also be induced, this separate induction has been criticized (Pate and Johnson, 2016) for missing an opportunity to leverage information learned in grammar induction to estimate POS tags. Moreover, most of these models explore a search space that includes syntactic analyses that may be extensively center-embedded and therefore are unlikely to be produced by human speakers. Unlike most of these approaches, the model described in this paper uses cognitively motivated bounds on the depth of human recursive processing to constrain its search of possible trees for input sentences.

Some previous work uses depth bounds in the form of sequence models (Ponvert et al., 2011; Shain et al., 2016), but these either do not produce complete phrase structure grammars (Ponvert et al., 2011) or do so at the expense of large parameter sets (Shain et al., 2016). Other work implements depth bounds on left-corner configurations of dependency grammars (Noji and Johnson, 2016), but the use of a dependency grammar makes the system impractical for addressing questions of how category types such as noun phrases may be learned. Unlike these, the model described in this paper induces a PCFG directly and then bounds it with a model-to-model transform, which yields a smaller space of learnable parameters and directly models the acquisition of category types as labels.

Some induction models learn semantic grammars from text annotated with semantic predicates (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2012). There is evidence humans use semantic bootstrapping during grammar acquisition (Naigles, 1990), but these models typically rely on a set of pre-defined universals, such as combinators (Steedman, 2000), which simplify the induction task. In order to help address the question of whether such universals are indeed necessary for grammar induction, the model described in this paper does not assume any strong universals except independently motivated limits on working memory.

3 Background

Like Noji and Johnson (2016) and Shain et al. (2016), the model described in this paper defines bounding depth in terms of memory elements required in a left-corner parse. A left-corner parser (Rosenkrantz and Lewis, 1970; Johnson-Laird, 1983; Abney and Johnson, 1991; Resnik, 1992) uses a stack of memory elements to store derivation fragments during incremental processing. Each derivation fragment represents a disjoint connected component of phrase structure a/b, consisting of a top sign a lacking a bottom sign b yet to come. For example, Figure 1 shows the derivation fragments in a traversal of a phrase structure tree for the sentence The cart the horse the man bought pulled broke. Immediately before processing the word man, the traversal has recognized three fragments of tree structure: two from category NP to category RC (covering the cart and the horse) and one from category NP to category N (covering the).


[Figure 1: Derivation fragments before the word man in a left-corner traversal of the sentence The cart the horse the man bought pulled broke.]

Derivation fragments at every time step are numbered top-down by depth d to a maximum depth of D. A left-corner parser requires more derivation fragments — and thus more memory — to process center-embedded constructions than to process left- or right-embedded constructions, consistent with observations that center embedding is more difficult for humans to process (Chomsky and Miller, 1963; Miller and Isard, 1964). Grammar acquisition models (Noji and Johnson, 2016; Shain et al., 2016) then restrict this memory to some low bound, e.g. two derivation fragments.

For sequences of observed word tokens w_t for time steps t ∈ {1..T}, sequence models like Ponvert et al. (2011) and Shain et al. (2016) hypothesize sequences of hidden states q_t. Models like Shain et al. (2016) implement bounded grammar rules as depth bounds on a hierarchical sequence model implementation of a left-corner parser, using random variables within each hidden state q_t for:

1. preterminal labels p_t and labels of top and bottom signs, a_t^d and b_t^d, of derivation fragments at each depth level d (which correspond to left and right children in tree structure), and

2. Boolean variables for decisions to 'fork out' f_t and 'join in' j_t derivation fragments (in Johnson-Laird (1983) terms, to shift with or without match and to predict with or without match); a minimal sketch of this state representation is given below.
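To make the shape of this hidden state concrete, here is a minimal sketch of a store of derivation fragments as used by a bounded left-corner parser. The class and field names are illustrative assumptions, not the representation used in the released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DerivationFragment:
    top: str      # category of the top sign a
    bottom: str   # category of the awaited bottom sign b

@dataclass
class HiddenState:
    fork: bool                     # f_t: the 'fork out' decision
    join: bool                     # j_t: the 'join in' decision
    preterminal: Optional[str]     # p_t
    store: List[Optional[DerivationFragment]] = field(default_factory=list)  # depths 1..D, None = empty

# Store contents just before the word "man" in Figure 1 (three fragments, so D >= 3 here):
store_before_man = [
    DerivationFragment(top="NP", bottom="RC"),  # covers "the cart"
    DerivationFragment(top="NP", bottom="RC"),  # covers "the horse"
    DerivationFragment(top="NP", bottom="N"),   # covers "the"
]
```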

Probabilities from these distributions are then multiplied together to define a transition model M over hidden states:

M[q_{t-1}, q_t] = P(q_t | q_{t-1})    (1a)
  = P(f_t p_t j_t a_t^{1..D} b_t^{1..D} | q_{t-1})    (1b)
  = P(f_t | q_{t-1}) · P(p_t | q_{t-1} f_t) · P(j_t | q_{t-1} f_t p_t) · P(a_t^{1..D} | q_{t-1} f_t p_t j_t) · P(b_t^{1..D} | q_{t-1} f_t p_t j_t a_t^{1..D}).    (1c)

For example, just after the word horse is recognized in Figure 1, the parser store contains two derivation fragments yielding the cart and the horse, both with top category NP and bottom category RC. The parser then decides to fork out the next word the based on the bottom category RC of the last derivation fragment on the store. Then the parser generates a preterminal category D for this word based on this fork decision and the bottom category of the last derivation fragment on the store. Then the parser decides not to join the resulting D directly to the RC above it, based on these fork and preterminal decisions and the bottom category of the store. Finally the parser generates NP and N as the top and bottom categories of a new derivation fragment yielding just the new word the based on all these previous decisions, resulting in the store state shown in the figure.

The model over the fork decision (shift with or without match) is defined in terms of a depth-specific sub-model θ_{F,d̄}, where ⊥ is an empty derivation fragment and d̄ is the depth of the deepest non-empty derivation fragment at time step t−1:

P(f_t | q_{t-1}) = P_{θ_{F,d̄}}(f_t | b_{t-1}^{d̄});  d̄ = max_d {b_{t-1}^d ≠ ⊥}.    (2)

The model over the preterminal category label is then conditioned on this fork decision. When there is no fork, the preterminal category label is deterministically linked to the category label of the bottom sign of the deepest derivation fragment at the previous time step (using ⟦φ⟧ as a deterministic indicator function, equal to one when φ is true and zero otherwise).


When there is a fork, the preterminal category label is defined in terms of a depth-specific sub-model θ_{P,d̄}:[2]

P(p_t | q_{t-1} f_t) = { ⟦p_t = b_{t-1}^{d̄}⟧                  if f_t = 0
                       { P_{θ_{P,d̄}}(p_t | b_{t-1}^{d̄})       if f_t = 1.    (3)

The model over the join decision (predict with or without match) is also defined in terms of a depth-specific sub-model θ_{J,d̄}, with parameters depending on the outcome of the fork decision:[3]

P(j_t | q_{t-1} f_t p_t) = { P_{θ_{J,d̄}}(j_t | b_{t-1}^{d̄-1} a_{t-1}^{d̄})    if f_t = 0
                           { P_{θ_{J,d̄+1}}(j_t | b_{t-1}^{d̄} p_t)            if f_t = 1.    (4)

Decisions about the top categories of derivation fragments a_t^{1..D} (which correspond to left siblings in tree structures) are decomposed into fork- and join-specific cases. When there is a join, the top category of the deepest derivation fragment deterministically depends on the corresponding value at the previous time step. When there is no join, the top category is defined in terms of a depth-specific sub-model:[4]

P_{θ_A}(a_t^{1..D} | q_{t-1} f_t p_t j_t) =
  { φ_{d̄-2} · ⟦a_t^{d̄-1} = a_{t-1}^{d̄-1}⟧ · ψ_{d̄+0}                          if f_t, j_t = 0, 1
  { φ_{d̄-1} · P_{θ_{A,d̄}}(a_t^{d̄} | b_{t-1}^{d̄-1} a_{t-1}^{d̄}) · ψ_{d̄+1}    if f_t, j_t = 0, 0
  { φ_{d̄-1} · ⟦a_t^{d̄} = a_{t-1}^{d̄}⟧ · ψ_{d̄+1}                              if f_t, j_t = 1, 1
  { φ_{d̄-0} · P_{θ_{A,d̄+1}}(a_t^{d̄+1} | b_{t-1}^{d̄} p_t) · ψ_{d̄+2}          if f_t, j_t = 1, 0.    (5)

Decisions about the bottom categories b_t^{1..D} (which correspond to right children in tree structures) also depend on the outcome of the fork and join variables, but are defined in terms of a side- and depth-specific sub-model in every case:[5]

P_{θ_B}(b_t^{1..D} | q_{t-1} f_t p_t j_t a_t^{1..D}) =
  { φ_{d̄-2} · P_{θ_{B,R,d̄-1}}(b_t^{d̄-1} | b_{t-1}^{d̄-1} a_{t-1}^{d̄}) · ψ_{d̄+0}    if f_t, j_t = 0, 1
  { φ_{d̄-1} · P_{θ_{B,L,d̄}}(b_t^{d̄} | a_t^{d̄} a_{t-1}^{d̄}) · ψ_{d̄+1}              if f_t, j_t = 0, 0
  { φ_{d̄-1} · P_{θ_{B,R,d̄}}(b_t^{d̄} | b_{t-1}^{d̄} p_t) · ψ_{d̄+1}                  if f_t, j_t = 1, 1
  { φ_{d̄-0} · P_{θ_{B,L,d̄+1}}(b_t^{d̄+1} | a_t^{d̄+1} p_t) · ψ_{d̄+2}                if f_t, j_t = 1, 0.    (6)

[2] Again, d̄ = max_d {b_{t-1}^d ≠ ⊥}.
[3] Again, d̄ = max_d {b_{t-1}^d ≠ ⊥}.
[4] Here φ_d̄ = ⟦a_t^{1..d̄} = a_{t-1}^{1..d̄}⟧, ψ_d̄ = ⟦a_t^{d̄+1..D} = ⊥⟧, and again, d̄ = max_d {b_{t-1}^d ≠ ⊥}.
[5] Here φ_d̄ = ⟦b_t^{1..d̄} = b_{t-1}^{1..d̄}⟧, ψ_d̄ = ⟦b_t^{d̄+1..D} = ⊥⟧, and again, d̄ = max_d {b_{t-1}^d ≠ ⊥}.

In a sequence model inducer like Shain et al. (2016), these depth-specific models are assumed to be independent of each other and fit with a Gibbs sampler, backward sampling hidden variable sequences from forward distributions using this compiled transition model M (Carter and Kohn, 1996), then counting individual sub-model outcomes from sampled hidden variable sequences, then resampling each sub-model using these counts with Dirichlet priors over a, b, and p models and Beta priors over f and j models, then re-compiling these resampled models into a new M.

However, note that with K category labels this model contains DK² + 3DK³ separate parameters for preterminal categories and top and bottom categories of derivation fragments at every depth level, each of which can be independently learned by the Gibbs sampler. Although this allows the hierarchical sequence model to learn grammars that are more expressive than PCFGs, the search space is several times larger than the K³ space of PCFG nonterminal expansions. The model described in this paper instead induces a PCFG and derives sequence model distributions from the PCFG, which has fewer parameters, and thus strictly reduces the search space of the model.
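As a quick arithmetic check of the size comparison in the preceding paragraph, this sketch computes both parameter counts for the hyper-parameter values used later in the paper (D=2, K=15); the function names are illustrative.

```python
# Compare the parameter count of the depth-specific sequence model
# (DK^2 preterminal parameters plus 3DK^3 top/bottom category parameters)
# with the K^3 nonterminal expansion space of a single PCFG.
def sequence_model_params(D, K):
    return D * K**2 + 3 * D * K**3

def pcfg_params(K):
    return K**3

D, K = 2, 15
print(sequence_model_params(D, K))  # 20700
print(pcfg_params(K))               # 3375
```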


4 The DB-PCFG Model

The depth-bounded probabilistic context-free grammar (DB-PCFG) model described in this paper directly induces a PCFG and then deterministically derives the parameters of a probabilistic left-corner parser from this single source. This derivation is based on an existing derivation of probabilistic left-corner parser models from PCFGs (van Schijndel et al., 2013), which was developed in a supervised parsing model, adapted here to run more efficiently within a larger unsupervised grammar induction model.[6]

A PCFG can be defined in Chomsky normal form as a matrix G of binary rule probabilities with one row for each of K parent symbols c and one column for each of K² + W² combinations of left and right child symbols a and b, which can be pairs of nonterminals or observed words from vocabulary W followed by null symbols ⊥:[7]

G = Σ_{a,b,c} P(c → a b | c) · δ_c (δ_a ⊗ δ_b)^⊤.    (7)

A depth-bounded grammar is a set of side- and depth-specific distributions:

G_D = {G_{s,d} | s ∈ {L, R}, d ∈ {1..D}}.    (8)

The posterior probability of a depth-bounded model G_D given a corpus (sequence) of words w_{1..T} is proportional to the product of a likelihood and a prior:

P(G_D | w_{1..T}) ∝ P(w_{1..T} | G_D) · P(G_D).    (9)

The likelihood is defined as a marginal over bounded PCFG trees τ of the probability of that tree given the grammar times the product of the probability of the word at each time step or token index t given this tree:[8]

P(w_{1..T} | G_D) = Σ_τ P(τ | G_D) · Π_t P(w_t | τ).    (10)

The probability of each tree is defined to be the product of the probabilities of each of its branches:[9]

P(τ | G_D) = Π_{τ_η ∈ τ} P_{G_D}(τ_η → τ_{η0} τ_{η1} | τ_η).    (11)

The probability P(G_D) is itself an integral over the product of a deterministic transform φ from an unbounded grammar to a bounded grammar, P(G_D | G) = ⟦G_D = φ(G)⟧, and a prior over unbounded grammars P(G):

P(G_D) = ∫ P(G_D | G) · P(G) dG.    (12)

Distributions P(G) for each nonterminal symbol (rows) within this unbounded grammar can then be sampled from a Dirichlet distribution with a symmetric parameter β:

G ~ Dirichlet(β),    (13)

which then yields a corresponding transformed sample in P(G_D) for corresponding nonterminals. Note that this model is different than that of Shain et al. (2016), who induce a hierarchical HMM directly.

A depth-specific grammar G_D is (deterministically) derived from G via transform φ, with probabilities for expansions constrained to and renormalized over only those outcomes that yield terminals within a particular depth bound D. This depth-bounded grammar is then used to derive left-corner expectations (anticipated counts of categories appearing as left descendants of other categories), and ultimately the parameters of the depth-bounded left-corner parser defined in Section 3. Counts for G are then obtained from sampled hidden state sequences, and rows of G are then directly sampled from the posterior updated by these counts.

[6] More specifically, the derivation differs from that of van Schijndel et al. (2013) in that it removes terminal symbols from conditional dependencies of models over fork and join decisions and top and bottom category labels, substantially reducing the size of the derived model that must be run during induction.
[7] This definition assumes a Kronecker delta function δ_i, defined as a vector with value one at index i and zeros everywhere else, and a Kronecker product M ⊗ N over matrices M and N, which tiles copies of N weighted by values in M as follows:

M ⊗ N = [ M[1,1]·N  M[1,2]·N  ⋯
          M[2,1]·N  M[2,2]·N  ⋯
          ⋮         ⋮         ⋱ ].    (1')

The Kronecker product specializes to vectors as single-column matrices, generating vectors that contain the products of all combinations of elements in the operand vectors.
[8] This notation assumes the observed data w_{1..T} is a single long sequence of words, and the hidden variable τ is a single large but depth-bounded tree structure (e.g. a right-branching discourse structure). Since the implementation is incremental, segmentation decisions may indeed be treated as hidden variables in τ, but the experiments described in Section 5 are run on sentence-segmented input.
[9] Here, η is a node address, with left child η0 and right child η1, or with right child equal to ⊥ if unary.
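To make the matrix encoding of Eq. 7 and the Dirichlet prior of Eq. 13 concrete, here is a small numpy sketch under a toy symbol inventory. It indexes child pairs exactly as δ_a ⊗ δ_b does, but, as a simplification, it places the symmetric Dirichlet over all child-pair columns rather than masking out ill-formed pairs; the sizes, seed, and names are illustrative assumptions, not the released implementation.

```python
import numpy as np

K, W = 3, 4                      # toy nonterminal and vocabulary sizes
S = K + W + 1                    # child symbols: K nonterminals, W words, and the null symbol
NULL = S - 1                     # index of the null symbol

rng = np.random.default_rng(0)
beta = 0.2

# One row per parent nonterminal, one column per (left child, right child) pair;
# the column index of a pair (a, b) is a*S + b, mirroring delta_a (x) delta_b.
G = rng.dirichlet(beta * np.ones(S * S), size=K)

def rule_prob(G, c, a, b):
    """P(c -> a b | c) read out of the matrix encoding."""
    return G[c, a * S + b]

# Lexical rules are pairs (word, null); e.g. probability that parent 0 expands to word 3:
print(rule_prob(G, 0, K + 3, NULL))
```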


4.1 Depth-bounded grammar

In order to ensure the bounded version of G is a consistent probability model, it must be renormalized in transform φ to assign a probability of zero to any derivation that exceeds its depth bound D. For example, if D = 2, then it is not possible to expand a left sibling at depth 2 to anything other than a lexical item, so the probability of any non-lexical expansion must be removed from the depth-bounded model, and the probabilities of all remaining outcomes must be renormalized to a new total without this probability. Following van Schijndel et al. (2013), this can be done by iteratively defining a side- and depth-specific containment likelihood h_{s,d}^{(i)} for left- or right-side siblings s ∈ {L, R} at depth d ∈ {1..D} at each iteration i ∈ {1..I},[10] as a vector with one row for each nonterminal or terminal symbol (or null symbol ⊥) in G, containing the probability of each symbol generating a complete yield within depth d as an s-side sibling:

h_{s,d}^{(0)} = 0    (14a)

h_{L,d}^{(i)} = { G (1 ⊗ δ_⊥ + h_{L,d}^{(i-1)} ⊗ h_{R,d}^{(i-1)})      if d ≤ D + 1
              { 0                                                      if d > D + 1    (14b)

h_{R,d}^{(i)} = { δ_T                                                  if d = 0
              { G (1 ⊗ δ_⊥ + h_{L,d+1}^{(i-1)} ⊗ h_{R,d}^{(i-1)})      if 0 < d ≤ D
              { 0                                                      if d > D,    (14c)

where 'T' is a top-level category label at depth zero.

A depth-bounded grammar G_{s,d} can then be defined to be the original grammar G reweighted and renormalized by this containment likelihood:[11]

G_{L,d} = G diag(1 ⊗ δ_⊥ + h_{L,d}^{(I)} ⊗ h_{R,d}^{(I)}) / h_{L,d}^{(I)}    (15a)

G_{R,d} = G diag(1 ⊗ δ_⊥ + h_{L,d+1}^{(I)} ⊗ h_{R,d}^{(I)}) / h_{R,d}^{(I)}.    (15b)

This renormalization ensures the depth-bounded model is consistent. Moreover, this distinction between a learned unbounded grammar G and a derived bounded grammar G_{s,d} which is used to derive a parsing model may be regarded as an instance of Chomsky's (1965) distinction between linguistic competence and performance.

The side- and depth-specific grammar can then be used to define expected counts of categories occurring as left descendants (or 'left corners') of right-sibling ancestors:

E_d^{(1)} = G_{R,d} (diag(1) ⊗ 1)    (16a)
E_d^{(i)} = E_d^{(i-1)} G_{L,d} (diag(1) ⊗ 1)    (16b)
E_d^+ = Σ_{i=1}^{I} E_d^{(i)}.    (16c)

This left-corner expectation will be used to estimate the marginalized probability over all grammar rule expansions between derivation fragments, which must traverse an unknown number of left children of some right-sibling ancestor.

4.2 Depth-bounded parsing

Again following van Schijndel et al. (2013), the fork and join decisions, and the preterminal, top and bottom category label sub-models described in Section 3, can now be defined in terms of these side- and depth-specific grammars G_{s,d} and depth-specific left-corner expectations E_d^+.

First, probabilities for no-fork and yes-fork outcomes below some bottom sign of category b at depth d are defined as the normalized probabilities, respectively, of any lexical expansion of a right sibling b at depth d, and of any lexical expansion following any number of left child expansions from b at depth d:

P_{θ_{F,d}}(0 | b) = δ_b^⊤ G_{R,d} (1 ⊗ δ_⊥) / [δ_b^⊤ (G_{R,d} + E_d^+ G_{L,d}) (1 ⊗ δ_⊥)]    (17a)

P_{θ_{F,d}}(1 | b) = δ_b^⊤ E_d^+ G_{L,d} (1 ⊗ δ_⊥) / [δ_b^⊤ (G_{R,d} + E_d^+ G_{L,d}) (1 ⊗ δ_⊥)].    (17b)

The probability of a preterminal p given a bottom category b is simply a normalized left-corner expected count of p under b:

P_{θ_{P,d}}(p | b) = δ_b^⊤ E_d^+ δ_p / (δ_b^⊤ E_d^+ 1).    (18)

[10] Experiments described in this article use I = 20, following observations of convergence at this point in supervised parsing.
[11] where diag(v) is a diagonalization of a vector v:

diag(v) = [ v[1]  0
            0     v[2]
            ⋮           ⋱ ].    (2')
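The numpy sketch below follows Eqs. 14-16 as written, under the assumption (not spelled out above) that G is stored with one row per symbol, with all-zero rows for words and the null symbol, so that the Kronecker-product terms line up; the division guard and the fixed iteration counts are implementation conveniences, not part of the model.

```python
import numpy as np

def bound_grammar(G, S, D, NULL, TOP, I=20):
    """G: (S, S*S) rule matrix; NULL, TOP: indices of the null and top-level symbols."""
    ones = np.ones(S)
    lex = np.kron(ones, np.eye(S)[NULL])          # 1 (x) delta_null, length S*S
    hL = np.zeros((D + 2, S))                     # hL[d] for d = 1..D+1 (Eq. 14a: start at 0)
    hR = np.zeros((D + 2, S))                     # hR[d] for d = 0..D; hR[D+1] stays 0
    for _ in range(I):                            # Eq. 14, iterated I times
        hL_new, hR_new = np.zeros_like(hL), np.zeros_like(hR)
        hR_new[0] = np.eye(S)[TOP]                # delta_T at depth zero
        for d in range(1, D + 2):                 # Eq. 14b
            hL_new[d] = G @ (lex + np.kron(hL[d], hR[d]))
        for d in range(1, D + 1):                 # Eq. 14c
            hR_new[d] = G @ (lex + np.kron(hL[d + 1], hR[d]))
        hL, hR = hL_new, hR_new
    GL, GR, E = {}, {}, {}
    marg = np.kron(np.eye(S), np.ones((S, 1)))    # diag(1) (x) 1: marginalizes right children
    for d in range(1, D + 1):
        # Eq. 15: reweight columns by containment, then renormalize rows (guarding zero rows)
        GL[d] = G * (lex + np.kron(hL[d], hR[d])) / np.where(hL[d] > 0, hL[d], 1.0)[:, None]
        GR[d] = G * (lex + np.kron(hL[d + 1], hR[d])) / np.where(hR[d] > 0, hR[d], 1.0)[:, None]
        Ei = GR[d] @ marg                         # Eq. 16a
        E[d] = Ei.copy()
        for _ in range(I - 1):                    # Eq. 16b-c: accumulate further left corners
            Ei = Ei @ (GL[d] @ marg)
            E[d] += Ei
    return GL, GR, E
```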


Yes-join and no-join probabilities below bottom sign b and above top sign a at depth d are then defined similarly to fork probabilities, as the normalized probabilities, respectively, of an expansion to left child a of a right sibling b at depth d, and of an expansion to left child a following any number of left child expansions from b at depth d:

P_{θ_{J,d}}(1 | b a) = δ_b^⊤ G_{R,d} (δ_a ⊗ 1) / [δ_b^⊤ (G_{R,d} + E_d^+ G_{L,d}) (δ_a ⊗ 1)]    (19a)

P_{θ_{J,d}}(0 | b a) = δ_b^⊤ E_d^+ G_{L,d} (δ_a ⊗ 1) / [δ_b^⊤ (G_{R,d} + E_d^+ G_{L,d}) (δ_a ⊗ 1)].    (19b)

The distribution over category labels for top signs a above some top sign of category c and below a bottom sign of category b at depth d is defined as the normalized distribution over category labels following a chain of left children expanding from b which then expands to have a left child of category c:

P_{θ_{A,d}}(a | b c) = δ_b^⊤ E_d^+ diag(δ_a) G_{L,d} (δ_c ⊗ 1) / [δ_b^⊤ E_d^+ diag(1) G_{L,d} (δ_c ⊗ 1)].    (20)

The distribution over category labels for bottom signs b below some sign a and sibling of top sign c is then defined as the normalized distribution over right children of grammar rules expanding from a to c followed by b:

P_{θ_{B,s,d}}(b | a c) = δ_a^⊤ G_{s,d} (δ_c ⊗ δ_b) / [δ_a^⊤ G_{s,d} (δ_c ⊗ 1)].    (21)

Finally, a lexical observation model L is defined as a matrix of unary rule probabilities with one row for each combination of store state and preterminal symbol and one column for each observation symbol:

L = 1 ⊗ G (diag(1) ⊗ δ_⊥).    (22)

4.3 Gibbs sampling

Grammar induction in this model then follows a forward-filtering backward-sampling algorithm (Carter and Kohn, 1996). This first computes a forward distribution v_t over hidden states at each time step t from an initial value ⊥:

v_0^⊤ = δ_⊥^⊤    (23a)
v_t^⊤ = v_{t-1}^⊤ M diag(L δ_{w_t}).    (23b)

The algorithm then samples hidden states backward from a multinomial distribution given the previously sampled state q_{t+1} at time step t+1 (assuming input parameters to the multinomial function are normalized):

q_t ~ Multinom(diag(v_t) M diag(L δ_{w_{t+1}}) δ_{q_{t+1}}).    (24)

Grammar rule applications C are then counted from these sampled sequences:[12]

C = Σ_t { δ_{b_{t-1}^{d̄-1}} (δ_{a_{t-1}^{d̄}} ⊗ δ_{b_t^{d̄-1}})^⊤    if f_t, j_t = 0, 1
        { δ_{a_t^{d̄}} (δ_{a_{t-1}^{d̄}} ⊗ δ_{b_t^{d̄}})^⊤            if f_t, j_t = 0, 0
        { δ_{b_{t-1}^{d̄}} (δ_{p_t} ⊗ δ_{b_t^{d̄}})^⊤                if f_t, j_t = 1, 1
        { δ_{a_t^{d̄+1}} (δ_{p_t} ⊗ δ_{b_t^{d̄+1}})^⊤                if f_t, j_t = 1, 0
  + Σ_t δ_{p_t} (δ_{w_t} ⊗ δ_⊥)^⊤,    (25)

and a new grammar G is sampled from a Dirichlet distribution with counts C and a symmetric hyper-parameter β as parameters:

G ~ Dirichlet(C + β).    (26)

This grammar is then used to define transition and lexical models M and L as defined in Sections 3 through 4.2 to complete the cycle.

[12] Again, d̄ = max_d {a_{t-1}^d ≠ ⊥}.
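Below is a compact sketch of the forward-filtering backward-sampling pass of Eqs. 23-24 and the Dirichlet resampling of Eq. 26, assuming a precompiled transition matrix M and lexical matrix L and taking hidden-state index 0 to be the empty store; the count extraction of Eq. 25 is omitted, and the rescaling step is a numerical convenience rather than part of the equations.

```python
import numpy as np

def sample_state_sequence(M, L, words, rng):
    """words: list of word indices; returns one sampled hidden-state sequence."""
    T = len(words)
    H = M.shape[0]                          # number of hidden states
    v = np.zeros((T + 1, H))
    v[0, 0] = 1.0                           # Eq. 23a; assumes state 0 encodes the empty store
    for t in range(1, T + 1):               # forward filtering, Eq. 23b
        v[t] = v[t - 1] @ M @ np.diag(L[:, words[t - 1]])
        v[t] /= v[t].sum()                  # rescale to avoid underflow (not part of Eq. 23)
    q = np.zeros(T + 1, dtype=int)
    q[T] = rng.choice(H, p=v[T] / v[T].sum())
    for t in range(T - 1, 0, -1):           # backward sampling, Eq. 24
        p = v[t] * M[:, q[t + 1]] * L[q[t + 1], words[t]]
        q[t] = rng.choice(H, p=p / p.sum())
    return q[1:]

def resample_grammar(counts, beta, rng):
    """Eq. 26: one Dirichlet draw per grammar row, with counts C plus symmetric beta."""
    return np.vstack([rng.dirichlet(row + beta) for row in counts])
```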
4.4 Model hyper-parameters and priors

There are three hyper-parameters in the model. K is the number of non-terminal categories in the grammar G, D is the maximum depth, and β is the parameter for the symmetric Dirichlet prior over multinomial distributions in the grammar G.

As seen from the previous subsection, the prior is over all possible rules in an unbounded PCFG grammar. Because the number of non-terminal categories of the unbounded PCFG grammar is given as a hyper-parameter, the number of rules in the grammar is always known. It is possible to use non-parametric priors over the number of non-terminal categories; however, due to the need to mitigate the computational complexity of dynamically filtering and sampling using arbitrarily large category sets, this is left for future work.

5 Evaluation

The DB-PCFG model described in Section 4 is evaluated first on synthetic data to determine whether it can reliably learn a recursive grammar from data with a known optimum solution, and to determine the hyper-parameter value for β for doing so. Two experiments on natural data are then carried out.


First, the model is run on natural data from the Adam and Eve parts of the CHILDES corpus (MacWhinney, 1992) to compare with other grammar induction systems on a human-like acquisition task. Then data from the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) is used for further comparison in a domain for which competing systems are optimized. The competing systems include UPPARSE (Ponvert et al., 2011),[13] CCL (Seginer, 2007a),[14] BMMM+DMV with undirected dependency features (Christodoulopoulos et al., 2012),[15] and UHHMM (Shain et al., 2016).[16]

For the natural language datasets, the variously parametrized DB-PCFG systems[17] are first validated on a development set, and the optimal system is then run until convergence with the chosen hyperparameters on the test set. In development experiments, the log-likelihood of the dataset usually plateaus after 500 iterations. The system is therefore run at least 500 iterations in all test set experiments, with one iteration being a full cycle of Gibbs sampling. The system is then checked to see whether the log-likelihood has plateaued, and halted if it has.

The DB-PCFG model assigns trees sampled from conditional posteriors to all sentences in a dataset in every iteration as part of the inference. The system is further allowed to run at least 250 iterations after convergence, and proposed parses are chosen from the iteration with the greatest log-likelihood after convergence. However, once the system reaches convergence, the evaluation scores of parses from different iterations post-convergence appear to differ very little.

5.1 Synthetic data

Following Liang et al. (2009) and Scicluna and de la Higuera (2014), an initial set of experiments on synthetic data is used to investigate basic properties of the model—in particular:

1. whether the model is balanced or biased in favor of left- or right-branching solutions,

2. whether the model is able to posit recursive structure in appropriate places, and

3. what hyper-parameters enable the model to find optimal modes more quickly.

[Figure 2: Synthetic left-branching (a,b) and right-branching (c,d) datasets.]

The risk of bias in branching structure is important because it might unfairly inflate induction results on languages like English, which are heavily right-branching. In order to assess its bias, the model is evaluated on two synthetic datasets, each consisting of 200 sentences. The first dataset is a left-branching corpus, which consists of 100 sentences of the form a b and 100 sentences of the form a b b, with optimal tree structures as shown in Figure 2 (a) and (b). The second dataset is a right-branching corpus, which consists of 100 sentences of the form a b and 100 sentences of the form a a b, with optimal tree structures as shown in Figure 2 (c) and (d). Results show both structures (and both corresponding grammars) are learnable by the model, and result in approximately the same log likelihood. These synthetic datasets are also used to tune the β hyper-parameter of the model (as defined in Section 4) to enable it to find optimal modes more quickly. The resulting β setting of 0.2 is then used in induction on the CHILDES and Penn Treebank corpora.

After validating that the model is not biased, the model is also evaluated on a synthetic center-embedding corpus consisting of 50 sentences each of the form a b c; a b b c; a b a b c; and a b b a b b c, which has optimal tree structures as shown in Figure 3.[18] Note that the (b) and (d) trees have depth 2 because they each have a complex sub-tree spanning a b and a b b embedded in the center of the yield of the root.

[13] https://github.com/eponvert/upparse
[14] https://github.com/DrDub/cclparser
[15] BMMM: https://github.com/christos-c/bmmm DMV: https://code.google.com/archive/p/pr-toolkit/
[16] https://github.com/tmills/uhhmm/tree/coling16
[17] The most complex configuration that would run on available GPUs was D=2, K=15. Analysis of full WSJ (Schuler et al., 2010) shows 47.38% of sentences require depth 2, 38.32% require depth 3 and 6.26% require depth 4.
[18] Here, in order to more closely resemble natural language input, tokens a, b, and c are randomly chosen uniformly from {a_1, ..., a_50}, {b_1, ..., b_50} and {c_1, ..., c_50}, respectively.
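For concreteness, here is a sketch of how such synthetic corpora could be generated. Footnote 18 specifies the 50-type vocabularies only for the center-embedding corpus; applying the same sampling to the two branching corpora, and representing each corpus as an in-memory list of token lists, are assumptions made for illustration.

```python
import random

random.seed(0)
a = [f"a{i}" for i in range(1, 51)]
b = [f"b{i}" for i in range(1, 51)]
c = [f"c{i}" for i in range(1, 51)]
pick = random.choice

# 100 sentences of the form "a b" plus 100 of the form "a b b" (left-branching corpus),
# and 100 of "a b" plus 100 of "a a b" (right-branching corpus).
left_branching = [[pick(a), pick(b)] for _ in range(100)] + \
                 [[pick(a), pick(b), pick(b)] for _ in range(100)]
right_branching = [[pick(a), pick(b)] for _ in range(100)] + \
                  [[pick(a), pick(a), pick(b)] for _ in range(100)]

# 50 sentences each of the four center-embedding templates.
center_embedding = []
for template in (["a", "b", "c"], ["a", "b", "b", "c"],
                 ["a", "b", "a", "b", "c"], ["a", "b", "b", "a", "b", "b", "c"]):
    for _ in range(50):
        center_embedding.append([pick({"a": a, "b": b, "c": c}[t]) for t in template])
```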


[Figure 3: Synthetic center-embedding structure. Note that tree structures (b) and (d) have depth 2 because they have complex sub-trees spanning a b and a b b, respectively, embedded in the center of the yield of their roots.]

Results show the model is capable of learning depth-2 (recursive) grammars.

Finally, as a gauge of the complexity of this task, results of the model described in this paper are compared with those of other grammar induction models on the center-embedding dataset. In this experiment, all models are assigned hyper-parameters matching the optimal solution. The DB-PCFG is run with K=5 and D=2 and β=0.2 for all priors, the BMMM+DMV (Christodoulopoulos et al., 2012) is run with 3 preterminal categories, and the UHHMM model is run with 2 active states, 4 awaited states and 3 parts of speech.[19] Table 1 shows the PARSEVAL scores for parsed trees using the learned grammar from each unsupervised system. Only the DB-PCFG model is able to recognize the correct tree structures and the correct category labels on this dataset, showing the task is indeed a robust challenge. This suggests that hyper-parameters optimized on this dataset may be portable to natural data.

System      Precision  Recall  F1
CCL         83.2       71.1    76.7
UPPARSE     91.4       80.7    85.7
UHHMM       37.7       37.7    37.7
BMMM+DMV    99.2       83.2    90.5
DB-PCFG     100.0      100.0   100.0

Table 1: The performance scores of unlabeled parse evaluation of different systems on synthetic data.

5.2 Child-directed speech corpus

After setting the β hyperparameter on synthetic datasets, the DB-PCFG model is evaluated on 14,251 sentences of transcribed child-directed speech from the Eve section of the Brown corpus of CHILDES (MacWhinney, 1992). Hyperparameters D and K are set to optimize performance on the Adam section of the Brown Corpus of CHILDES, which is about twice as long as Eve. Following previous work, these experiments leave all punctuation in the input for learning, then remove it in all evaluations on development and test data.

Model performance is evaluated against Penn Treebank style annotations of both the Adam and Eve corpora (Pearl and Sprouse, 2013). Table 2 shows the PARSEVAL scores of the DB-PCFG system with different hyperparameters on the Adam corpus for development. The simplest configuration, D1K15 (depth 1 only with 15 non-terminal categories), obtains the best score, so this setting is applied to the test corpus, Eve. Results of the D=1, K=15 DB-PCFG model on Eve are then compared against those of other grammar induction systems which use only raw text as input on the same corpus. Following Shain et al. (2016), the BMMM+DMV system is run for 10 iterations with 45 categories and its output is converted from dependency graphs to constituent trees (Collins et al., 1999).

Hyperparameters  Precision  Recall  F1
D1K15            57.1       70.7    63.2
D1K30            52.8       65.4    58.5
D1K45            44.4       54.9    49.1
D2K15            44.0       54.5    48.7

Table 2: PARSEVAL results of different hyperparameter settings for the DB-PCFG system on the Adam dataset. Hyperparameter D is the number of possible depths, and K is the number of non-terminals.

[19] It is not possible to use just 2 awaited states, which is the gold setting, since the UHHMM system errors out when the number of categories is small.
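The unlabeled PARSEVAL-style scoring used throughout this section is standard; the sketch below is a generic reimplementation of precision, recall, and F1 over constituent spans for illustration, not the evaluation script behind the reported numbers, and it assumes trees represented as nested lists with string leaves. Conventions differ on whether single-word spans and the whole-sentence span are counted; this sketch counts every constituent it finds.

```python
from collections import Counter

def spans(tree, start=0):
    """Return (end, list of (start, end) spans) for a nested-list tree with token leaves."""
    if isinstance(tree, str):
        return start + 1, []
    end, out = start, []
    for child in tree:
        end, child_spans = spans(child, end)
        out.extend(child_spans)
    out.append((start, end))
    return end, out

def parseval(gold_trees, test_trees):
    """Unlabeled bracket precision, recall, and F1 over a parallel list of trees."""
    match = gold_total = test_total = 0
    for g, t in zip(gold_trees, test_trees):
        gs, ts = Counter(spans(g)[1]), Counter(spans(t)[1])
        match += sum((gs & ts).values())
        gold_total += sum(gs.values())
        test_total += sum(ts.values())
    p, r = match / test_total, match / gold_total
    return p, r, 2 * p * r / (p + r)
```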


The UHHMM system is run on the Eve corpus using the settings in Shain et al. (2016), which also includes a post-process option to flatten trees (reported here as UHHMM-F).

Table 3 shows the PARSEVAL scores for all the competing systems on the Eve dataset. The right-branching baseline is still the most accurate in terms of PARSEVAL scores, presumably because of the highly right-branching structure of child-directed speech in English. The DB-PCFG system with only one memory depth and 15 non-terminal categories achieves the best performance in terms of F1 score and recall among all the competing systems, significantly outperforming the other systems (p < 0.0001, permutation test).[20]

System           Precision  Recall  F1
CCL              50.5       53.5    51.9
UPPARSE          60.5       51.9    55.9
UHHMM            55.5       69.3    61.7
BMMM+DMV         63.5       63.3    63.4
UHHMM-F          62.9       68.4    65.6
DB-PCFG          64.5       80.5    71.6**
Right-branching  68.7       85.8    76.3

Table 3: PARSEVAL scores on the Eve dataset for all competing systems. These are unlabeled precision, recall and F1 scores on constituent trees without punctuation. Both the right-branching baseline and the best performing system are in bold. (**: p < 0.0001, permutation test)

The Eve corpus has about 5,000 sentences with more than one depth level, therefore one might expect a depth-two model to perform better than a depth-one model, but this is not true if only PARSEVAL scores are considered. This issue will be revisited in the following section with the noun phrase discovery task.

5.3 NP discovery on child-directed speech

When humans acquire grammar, they do not only learn tree structures, they also learn category types: noun phrases, verb phrases, prepositional phrases, and where each type can and cannot occur. Some of these category types — in particular, noun phrases — are fairly universal across languages, and may be useful in downstream tasks such as (unsupervised) named entity recognition. The DB-PCFG and other models that can be made to produce category types are therefore evaluated on a noun phrase discovery task.

Two metrics are used for this evaluation. First, the evaluation counts all constituents proposed by the candidate systems, and calculates recall against the gold annotation of noun phrases. This metric is not affected by which branching paradigm the system is using and reveals more about the systems' performances. This metric differs from that used by Ponvert et al. (2011) in that it takes NPs at all levels in the gold annotation into account, not just base NPs.[21]

The second metric, for systems that produce category labels, calculates F1 scores of induced categories that can be mapped to noun phrases. The first 4,000 sentences are used as the development set for learning mappings from induced category labels to phrase types. The evaluation calculates precision, recall and F1 of all spans of proposed categories against the gold annotations of noun phrases in the development set, and aggregates the categories ranked by their precision scores so that the F1 score of the aggregated category is the highest on the development set. The evaluation then calculates the F1 score of this aggregated category on the remainder of the dataset, excluding this development set.

System            NP Recall  NP agg F1
CCL               35.5       -
UPPARSE           69.1       -
UHHMM             61.4       27.4
BMMM+DMV          71.3       61.2
DB-PCFG (D1K15)   75.7       28.7
DB-PCFG (D1K30)   78.6       60.7
DB-PCFG (D1K45)   76.9       64.0
DB-PCFG (D2K15)   85.1       65.9
Right-branching   64.2       -

Table 4: Performances of different systems for noun phrase recall and aggregated F1 scores on the Eve dataset.

[20] Resulting scores are better when applying Shain et al. (2016) flattening to output binary-branching trees. For the D=1, K=15 model, precision and F1 can be raised to 70.31% and 74.33%. However, since the flattening is a heuristic which may not apply in all cases, these scores are not considered to be comparable results.
[21] Ponvert et al. (2011) define base NPs as NPs with no NP descendants, a restriction motivated by their particular task (chunking).
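The sketch below is one reading of the aggregation procedure just described: each induced category is scored against gold NP spans on the development portion, categories are ranked by precision, and a prefix of that ranking is chosen to maximize development-set F1. The per-sentence data structures (lists of (label, span) pairs and sets of gold NP spans) are assumptions for illustration, not the format used by the released evaluation code.

```python
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def choose_np_labels(dev_proposed, dev_gold_nps):
    """dev_proposed: per-sentence lists of (label, span); dev_gold_nps: per-sentence sets of spans."""
    labels = {lab for sent in dev_proposed for lab, _ in sent}
    def stats(label_set):
        tp = fp = fn = 0
        for sent, gold in zip(dev_proposed, dev_gold_nps):
            pred = {span for lab, span in sent if lab in label_set}
            tp += len(pred & gold)
            fp += len(pred - gold)
            fn += len(gold - pred)
        return prf(tp, fp, fn)
    ranked = sorted(labels, key=lambda lab: stats({lab})[0], reverse=True)  # rank by precision
    best, best_f1 = set(), 0.0
    for i in range(1, len(ranked) + 1):                                     # grow the aggregate
        f1 = stats(set(ranked[:i]))[2]
        if f1 > best_f1:
            best, best_f1 = set(ranked[:i]), f1
    return best   # the aggregated 'NP' category, to be scored on held-out data
```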


                  WSJ10test                   WSJ20test
System            Precision  Recall  F1       Precision  Recall  F1
CCL               63.4       71.9    67.4     60.1       61.7    60.9**
UPPARSE           54.7       48.3    51.3     47.8       40.5    43.9
UHHMM             49.1       63.4    55.3     -          -       -
BMMM+DMV (K10)    36.2       40.6    38.2     25.3       29.0    27.0
UHHMM-F           57.1       54.4    55.7     -          -       -
DB-PCFG (D2K15)   64.5       82.6    72.4**   53.0       70.5    60.5
Right-branching   55.1       70.5    61.8     41.5       55.3    47.4

Table 5: PARSEVAL scores for all competing systems on the WSJ10 and WSJ20 test sets. These are unlabeled precision, recall and F1 scores on constituent trees without punctuation (**: p < 0.0001, permutation test).

The UHHMM system is the only competing system that is natively able to produce labels for proposed constituents. BMMM+DMV does not produce constituents with labels by default, but can be evaluated using this metric by converting dependency graphs into constituent trees, then labeling each constituent with the part-of-speech tag of the head. For CCL and UPPARSE, the NP agg F1 scores are not reported because they do not produce labeled constituents.

Table 4 shows the scores for all systems on the Eve dataset and four runs of the DB-PCFG system on these two evaluation metrics. Surprisingly, the D=2, K=15 model, which has the lowest PARSEVAL scores, is most accurate at discovering noun phrases. It has the highest scores on both evaluation metrics. The best model in terms of PARSEVAL scores, the D=1, K=15 DB-PCFG model, performs poorly among the DB-PCFG models, despite the fact that its NP recall is higher than the competing systems. The low NP agg F1 score of the DB-PCFG at D1K15 shows a diffusion of induced syntactic categories when the model is trying to find a balance among labeling and branching decisions. The UPPARSE system, which is proposed as a base NP chunker, is relatively poor at NP recall by this definition.

The right-branching baseline does not perform well in terms of NP recall. This is mainly because noun phrases are often left children of some other constituent and the right-branching model is unable to incorporate them into the syntactic structures of whole sentences. Therefore, although the right-branching model is the best model in terms of PARSEVAL scores, it is not helpful in terms of finding noun phrases.

5.4 Penn Treebank

To further facilitate direct comparison to previous work, we run experiments on sentences from the Penn Treebank (Marcus et al., 1993). The first experiment uses the sentences from the Wall Street Journal part of the Penn Treebank with at most 20 words (WSJ20). The first half of the WSJ20 dataset is used as a development set (WSJ20dev) and the second half is used as a test set (WSJ20test). We also extract sentences in WSJ20test with at most 10 words from the proposed parses from all systems and report results on them (WSJ10test). WSJ20dev is used for finding the optimal hyperparameters for both the DB-PCFG and BMMM+DMV systems.[22]

Table 5 shows the PARSEVAL scores of all systems. The right-branching baseline is relatively weak on these two datasets, mainly because formal writing is more complex and uses more non-right-branching structures (e.g., subjects with modifiers or parentheticals) than child-directed speech. For WSJ10test, both the DB-PCFG system and CCL are able to outperform the right-branching baseline. The F1 difference between the best-performing previous-work system, CCL, and DB-PCFG is highly significant. For WSJ20test, again both CCL and DB-PCFG are above the right-branching baseline.

[22] Although UHHMM also needs tuning, in practice we find that this system is too inefficient to be tuned on a development set, and it requires too many resources when the hyperparameters become larger than used in previous work. We believe that further increasing the hyperparameters of UHHMM may lead to a performance increase, but the released version is not scalable to larger values of these settings. We also do not report UHHMM on WSJ20test for the same scalability reason. The UHHMM results on WSJ10test are induced with all WSJ10 sentences.


The difference between the F1 scores of CCL and DB-PCFG is very small compared to WSJ10; however, it is also significant.

It is possible that the DB-PCFG is being penalized for inducing fully binarized parse trees. The accuracy of the DB-PCFG model is dominated by recall rather than precision, whereas CCL and other systems are more balanced. This is an important distinction if it is assumed that phrase structure is binary (Kayne, 1981; Larson, 1988), in which case precision merely scores non-linguistic decisions about whether to suppress annotation of non-maximal projections. However, since other systems are not optimized for recall, it would not be fair to use only recall as a comparison metric in this study.

Finally, Table 6 shows the published results of different systems on WSJ. The CCL results come from Seginer (2007b), where the CCL system is trained with all sentences from WSJ, and evaluated on sentences with 40 words or fewer from WSJ (WSJ40) and WSJ10. The UPPARSE results come from Ponvert et al. (2011), where the UPPARSE system is trained using sections 00-21 of WSJ, and evaluated on section 23 and the WSJ10 subset of section 23. The DB-PCFG system uses hyperparameters optimized on the WSJ20dev set, and is evaluated on WSJ40 and WSJ10, both excluding WSJ20dev. The results are not directly comparable, but the results from the DB-PCFG system are competitive with those of the other systems, and numerically have the best recall scores.

                  WSJ10                       WSJ40
System            Precision  Recall  F1       Precision  Recall  F1
CCL               75.3       76.1    75.7     58.7       55.9    57.2
UPPARSE           74.6       66.7    70.5     60.0       49.4    54.2
DB-PCFG (D2K15)   65.5       83.6    73.4     47.0       63.6    54.1
Right-branching   55.2       70.0    61.7     35.4       47.4    40.5

Table 6: Published PARSEVAL results for competing systems. Please see text for details, as the systems are trained and evaluated differently.

6 Conclusion

This paper describes a Bayesian Dirichlet model of depth-bounded PCFG induction. Unlike earlier work, this model implements depth bounds directly on PCFGs by derivation, reducing the search space of possible trees for input words without exploding the search space of parameters with multiple side- and depth-specific copies of each rule. Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.

In addition to its practical merits, this model may offer some theoretical insight for linguists and other cognitive scientists. First, the model does not assume any universals except independently motivated limits on working memory, which may help address the question of whether universals are indeed necessary for grammar induction. Second, the distinction this model draws between its learned unbounded grammar G and its derived bounded grammar G_D seems to align with Chomsky's (1965) distinction between competence and performance, and has the potential to offer some formal guidance to linguistic inquiry about both kinds of models.

Acknowledgments

The authors would like to thank Cory Shain and William Bryce for their valuable input. We would also like to thank the Action Editor, Xavier Carreras, and the anonymous reviewers for insightful comments. Computations for this project were partly run on the Ohio Supercomputer Center (1987). This research was funded by the Defense Advanced Research Projects Agency award HR0011-15-2-0022. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.


References

Steven P. Abney and Mark Johnson. 1991. Memory requirements and local ambiguities of parsing strategies. Journal of Psycholinguistic Research, 20(3):233–250.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 582–590.

Yonatan Bisk and Julia Hockenmaier. 2013. An HDP model for inducing Combinatory Categorial Grammars. Transactions of the Association for Computational Linguistics, pages 75–88.

Rens Bod. 2006. Unsupervised parsing with U-DOP. In Proceedings of the Conference on Computational Natural Language Learning, pages 85–92.

Glenn Carroll and Eugene Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. Working Notes of the Workshop on Statistically-Based NLP Techniques, (March):1–13.

C. K. Carter and R. Kohn. 1996. Markov chain Monte Carlo in conditionally Gaussian state space models. Biometrika, 83(3):589–601.

Noam Chomsky and George A. Miller. 1963. Introduction to the formal analysis of natural languages. In Handbook of Mathematical Psychology, pages 269–321. Wiley, New York, NY.

Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2012. Turning the pipeline into a loop: Iterated unsupervised dependency parsing and POS induction. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Workshop on the Induction of Linguistic Structure, pages 96–99.

Alexander Clark. 2001. Unsupervised induction of stochastic context-free grammars using distributional clustering. In Proceedings of the Workshop on Computational Natural Language Learning, volume 7, pages 1–8.

Michael Collins, Lance Ramshaw, Jan Hajič, and Christoph Tillmann. 1999. A statistical parser for Czech. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 505–512.

Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, and Ben Taskar. 2011. Posterior sparsity in unsupervised dependency parsing. Journal of Machine Learning Research, 12:455–490.

Wenjuan Han, Yong Jiang, and Kewei Tu. 2017. Dependency grammar induction with neural lexicalization and big training data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1684–1689.

Zellig Harris. 1954. Distributional structure. In Jerry A. Fodor and Jerrold J. Katz, editors, The Structure of Language: Readings in the Philosophy of Language, volume 10, pages 33–49. Prentice-Hall.

William P. Headden, III, Mark Johnson, and David McClosky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 101–109.

Yong Jiang, Wenjuan Han, and Kewei Tu. 2016. Unsupervised neural dependency parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 763–771.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, volume 19, page 641.

Philip N. Johnson-Laird. 1983. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, Cambridge, MA, USA.

Richard Kayne. 1981. Unambiguous paths. In R. May and J. Koster, editors, Levels of Syntactic Representation, pages 143–183. Foris Publishers.

Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 128–135.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 1, pages 478–485.

Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1223–1233.

Tom Kwiatkowski, Sharon Goldwater, Luke S. Zettlemoyer, and Mark Steedman. 2012. A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics.


Richard K. Larson. 1988. On the double object construction. Linguistic Inquiry, 19(3):335–391.

Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Probabilistic grammars and hierarchical Dirichlet processes. The Handbook of Applied Bayesian Analysis.

Brian MacWhinney. 1992. The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, Mahwah, NJ, third edition.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

George A. Miller and Stephen Isard. 1964. Free recall of self-embedded English sentences. Information and Control, 7:292–303.

Letitia R. Naigles. 1990. Children use syntax to learn verb meanings. The Journal of Child Language, 17:357–374.

Hiroshi Noji and Mark Johnson. 2016. Using left-corner parsing to encode universal structural constraints in grammar induction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 33–43.

The Ohio Supercomputer Center. 1987. Ohio Supercomputer Center. http://osc.edu/ark:/19495/f5s1ph73.

John K. Pate and Mark Johnson. 2016. Grammar induction from (lots of) words alone. In Proceedings of the International Conference on Computational Linguistics, pages 23–32.

Lisa Pearl and Jon Sprouse. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20(1):23–68.

Elias Ponvert, Jason Baldridge, and Katrin Erk. 2011. Simple unsupervised grammar induction from raw text with cascaded finite state models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1077–1086.

Philip Resnik. 1992. Left-corner parsing and psychological plausibility. In Proceedings of the International Conference on Computational Linguistics, pages 191–197.

Stanley J. Rosenkrantz and Philip M. Lewis, II. 1970. Deterministic left corner parser. In IEEE Conference Record of the 11th Annual Symposium on Switching and Automata, pages 139–152.

William Schuler, Samir AbdelRahman, Tim Miller, and Lane Schwartz. 2010. Broad-coverage parsing using human-like memory constraints. Computational Linguistics, 36(1):1–30.

James Scicluna and Colin de la Higuera. 2014. PCFG induction for unsupervised parsing and language modelling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1353–1362.

Yoav Seginer. 2007a. Fast unsupervised incremental parsing. In Proceedings of the Annual Meeting of the Association of Computational Linguistics, pages 384–391.

Yoav Seginer. 2007b. Learning Syntactic Structure. Ph.D. thesis, University of Amsterdam.

Cory Shain, William Bryce, Lifeng Jin, Victoria Krakovna, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2016. Memory-bounded left-corner unsupervised grammar induction on child-directed input. In Proceedings of the International Conference on Computational Linguistics, pages 964–975.

Mark Steedman. 2000. The Syntactic Process. MIT Press/Bradford Books, Cambridge, MA.

Marten van Schijndel, Andy Exley, and William Schuler. 2013. A model of language processing as hierarchic sequential prediction. Topics in Cognitive Science, 5(3):522–540.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 658–666, Arlington, Virginia. AUAI Press.

Willem Zuidema. 2003. How the poverty of the stimulus solves the poverty of the stimulus. In Advances in Neural Information Processing Systems, volume 15, page 51.

