
Proceedings of the Society for Computation in Linguistics

Volume 3 Article 19

2020

Unsupervised Formal Grammar Induction with Confidence

Jacob Collard Cornell University, [email protected]


Recommended Citation: Collard, Jacob (2020) "Unsupervised Formal Grammar Induction with Confidence," Proceedings of the Society for Computation in Linguistics: Vol. 3, Article 19. DOI: https://doi.org/10.7275/5qfp-sg41. Available at: https://scholarworks.umass.edu/scil/vol3/iss1/19

Unsupervised Formal Grammar Induction with Confidence

Jacob Collard Cornell University [email protected]

Abstract

I present a novel algorithm for minimally supervised formal grammar induction using a linguistically-motivated grammar formalism. This algorithm, called the Missing Link algorithm (ML), is built off of classic chart parsing methods, but makes use of a probabilistic confidence measure to keep track of potentially ambiguous lexical items. Because ML uses a structured grammar formalism, each step of the algorithm can be easily understood by linguists, making it ideal for studying the learnability of different linguistic phenomena. The algorithm requires minimal annotation in its training data, but is capable of learning nuanced data from relatively small training sets and can be applied to a variety of grammar formalisms. Though evaluating an unsupervised syntactic model is difficult, I present an evaluation using the Corpus of Linguistic Acceptability and show state-of-the-art performance.¹

¹In the interest of reproducibility, the code used to generate these results is provided at https://github.com/thorsonlinguistics/scil2020

1 Introduction

Most research on learning algorithms for natural language has focused on supervised parsing, in which the parser learns from sentences in the target language paired with a corresponding, hand-constructed parse. Major natural language corpora, such as the Penn Treebank (Marcus et al., 1994) and the Universal Dependencies framework (Nivre et al., 2016), exemplify this tendency. This has allowed for highly performant models for dependency parsing such as ClearNLP (Choi and McCallum, 2013), CoreNLP (Manning et al., 2014), Mate (Bohnet, 2010), and Turbo (Martins et al., 2013), all of which have achieved an accuracy of over 89% on standard evaluation tasks (Choi et al., 2015).

Unsupervised learning for natural language processing is a much more difficult task, as the algorithm must explore the entire search space with minimal confirmation of its hypotheses. Nevertheless, a number of algorithms have attempted to solve the problem of unsupervised parsing. Most of these rely on gold standard part-of-speech tags (Headden III et al., 2009; Spitkovsky et al., 2010), though there are some exceptions, such as Spitkovsky et al. (2011). Almost all unsupervised algorithms for natural language syntactic processing are based on dependency parsing; most of the published literature on other grammar formalisms, such as tree-adjoining grammar (TAG) and combinatory categorial grammar (CCG), is either supervised or hand-engineered. Again, there are some exceptions, such as Bisk et al. (2015), which learns CCGs using a small amount of initial part-of-speech data. Edelman et al. (2003) also present a model of unsupervised learning which blends properties of construction grammars with tree-adjoining grammars; however, their model has not, as yet, been evaluated empirically.

Other models are only indirectly supervised; the syntax of the target language is learned without any syntactic annotations, but annotations may be present representing other facts about the sentence, such as its logical form. Notable examples of this include work by Kwiatkowski et al. (2010; 2011) and Artzi and Zettlemoyer (2013).

Another recent innovation in unsupervised learning is the introduction of pre-trained language models for deep learning algorithms, such as BERT (Devlin et al., 2018) and its relatives. These algorithms can be pre-trained on raw text in order to produce a language model which can then be used to bootstrap learning for a wide variety of additional tasks.
Though supervision may be required by these downstream tasks, the unsupervised component has been shown to greatly improve learning.

The representations that these models produce are somewhat opaque; though they have been shown to represent syntactic information for some tasks (Goldberg, 2019), an exact description of what the model is representing is difficult to produce.

Though most of the above systems do not rely on a strict notion of grammar formalism, in this paper, I will argue that a well-defined grammar formalism can produce strong results when used as the basis for an unsupervised learning algorithm. Dependence on a grammar formalism has a number of benefits. First, it means that each step of the algorithm can be (relatively) easily understood by humans. Each processing step either produces a novel derivation for a sentence or reinforces an old one, and each derivation conforms to the rules of the given formalism. Thus, as long as the rules of the formalism are understood, the meaning behind each processing step can also be understood. Second, using a grammar formalism ensures that certain facts about the resulting grammar will always hold. For example, using CCG or TAG will guarantee that the resulting grammar is in the class of mildly context-sensitive grammars. Third, using a grammar formalism means that the properties of the grammar formalism can be studied as well. Though formalisms such as CCG and TAG are weakly equivalent (Joshi et al., 1990), there may be differences between the two formalisms with respect to learning. Similarly, different variants of a particular formalism can be studied as well. For example, different combinators can be added or removed from CCG to produce different learning results. By using the grammar formalism as a core parameter in learning, the formalism becomes an independent variable that can be explored.

In this paper I introduce an algorithm, called the Missing Link algorithm (ML), which has several interesting properties which, I argue, are beneficial to the study of linguistics, grammar formalisms, and natural language processing. These properties include:

• Minimal supervision. The Missing Link algorithm learns from raw, tokenized text. The only annotation required is assurance of the sentential category, which is trivial for most training sets and grammar formalisms.

• Formalism Dependence. The grammar formalism is the core motivator for learning and parsing in Missing Link. This means that the formalism can easily be replaced with another and that the formalism can be studied as a parameter.

• Interpretability. Due partly to formalism dependence, the Missing Link algorithm is highly interpretable. Each step of the algorithm can be viewed as a derivation using the input formalism.

• Performance. The Missing Link algorithm performs well on an evaluation using linguistic acceptability judgments. The results of the evaluation are competitive with supervised algorithms such as BERT for the specific task used. Missing Link supplements the input formalism with a model of confidence for lexical entries that allows it to robustly handle potential ambiguity.

1.1 Related Work

The Missing Link algorithm builds off of relatively simple models for grammar induction and parsing. Parsing is done via a simple bottom-up chart-based method (Younger, 1967; Kasami, 1965). Learning is done in a top-down fashion using the same chart, with some extensions described in Section 2.2.

The Missing Link algorithm is closely related to the Unification-Based Learning (UBL) algorithm described in Kwiatkowski et al. (2010). UBL is an algorithm for semantic parsing, but the decomposition operations used in Missing Link are essentially the same as the higher-order unification used in UBL, albeit applied directly to syntactic categories instead of logical forms. Unlike UBL, Missing Link ensures that every step of processing is interpretable; the probabilistic grammar used in UBL can potentially obscure why individual parses are excluded.

2 The Missing Link Algorithm

There are two main stages to the Missing Link algorithm: parsing and learning, which are performed in order for every sentence in the training set. As the algorithm processes more sentences, it updates a lexicon, mapping words to their syntactic categories and a probability representing how confident the algorithm is that the given category is valid in the target grammar.
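As a purely schematic illustration of this control flow (the function names below are placeholders of my own, not the paper's API), the two stages can be thought of as a loop over the training data that folds each processed chart back into the lexicon:

def train(examples, parse, learn, update_lexicon, lexicon):
    """One pass over the training data: for each sentence, build bottom-up
    parse values, compute top-down learn values against the annotated
    sentential categories, then fold the length-1 cells into the lexicon."""
    for sentence, sentence_categories in examples:
        chart = parse(sentence, lexicon)
        learn(chart, sentence_categories)
        update_lexicon(lexicon, chart)
    return lexicon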

2.1 Inputs

The core inputs to the Missing Link algorithm are:

• A grammar formalism, which defines a (possibly infinite) set of grammatical units E and two functions: COMPOSE : E × E → E* and DECOMPOSE : E × E × E → (E × E)*. E always contains a special null element 0 indicating that the grammatical category is not known.

• A collection of training examples. Each training example consists of a tokenized sentence and an annotation describing the possible grammatical categories of the sentence.

The two functions of the grammar formalism determine the behavior of the parser and learner. The COMPOSE function returns a collection of elements in E that can be produced by combining the two input elements. This typically represents the basic structure-building operation of the given formalism. In Minimalism, for example, it corresponds to MERGE; in CCG, it corresponds to the various combinators; in TAG, to substitution and adjunction, etc.

The DECOMPOSE function is essentially the inverse of COMPOSE. It returns the set of pairs of elements that can be composed to produce a given input. Though it takes three elements, only the first, representing the root, is necessary. The remaining arguments are supplied to force one or both of the elements in the results to take a particular value. This will become important to avoid exploring the entirety of the search space; when a value is already known, the DECOMPOSE function will maintain that value whenever possible.

Note that the input of Missing Link places some restrictions on the types of formalisms that can be used. In particular, the formalism does not carry any language-specific information outside of the lexicon – the formalism must be strongly lexicalized. Many major formalisms, including CCG and TAG, adhere to this rule, though some formalisms, such as standard context-free grammars, cannot be represented by Missing Link. In addition, the results of DECOMPOSE and COMPOSE must be finite sets. Though this seems problematic for formalisms like CCG, where there are infinite ways to compose two arbitrary elements to produce a third, this can usually be avoided by schematization. That is, instead of returning every possible result, the results can be summarized using variables.

2.2 The Chart

The Missing Link algorithm is built around a CKY-style chart. The chart consists of cells representing potential parses for each substring of the sentence. In Missing Link, each cell stores two separate analyses: one for the potential bottom-up parses of the sentence, and one for the potential top-down decompositions of the sentence, based on the category assigned to the sentence. These will be referred to as the "parse value" and the "learn value" for each cell.

Both the parse value and the learn value are represented using the same data structure, which is also used in the lexicon. This data structure maps potential categories (elements of E according to the target formalism) to probabilities, which represent the algorithm's confidence that the given category is valid for the corresponding substring. Note that the sum of the probabilities for the different categories is not necessarily 1: in the case where the algorithm is certain that a substring is ambiguous, the probabilities will sum to at least 2. Missing Link does not assign probabilities based on frequency relative to other substrings, sentences, or categories, and makes no distinction between alternative parses other than their probability of being grammatical.

An example chart is shown in Figure 1.

[Figure 1: A chart showing the parse of the sentence "It might work". Parse values are given at the bottom of each node, while learn values are given at the top.]
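As a rough sketch of these inputs in code (my own illustration of the definitions above; the names Formalism, compose, decompose, and Cell are hypothetical and not taken from the released implementation):

from typing import Dict, List, Optional, Tuple

# A category is whatever the chosen formalism provides; a confidence map
# relates categories to the algorithm's probability that they are valid.
Category = str
Confidence = Dict[Category, float]

class Formalism:
    """Interface for a strongly lexicalized grammar formalism."""

    NULL: Category = "0"  # special null element: category not known

    def compose(self, left: Category, right: Category) -> List[Category]:
        """All categories derivable by combining left and right (COMPOSE)."""
        raise NotImplementedError

    def decompose(self, root: Category,
                  left: Optional[Category] = None,
                  right: Optional[Category] = None) -> List[Tuple[Category, Category]]:
        """All pairs that compose to root (DECOMPOSE), keeping any supplied
        left/right value fixed whenever possible."""
        raise NotImplementedError

class Cell:
    """One chart cell; lexicon entries use the same confidence-map structure."""
    def __init__(self) -> None:
        self.parse_value: Confidence = {}  # bottom-up hypotheses
        self.learn_value: Confidence = {}  # top-down hypotheses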
2.3 Parsing

For each sentence in the training data, the Missing Link algorithm begins by looking up each word in the lexicon. The results of the lookup are assigned to the parse value for the bottom cells of the chart, which represent the length-1 substrings. The algorithm then attempts to parse as much of the sentence as possible. Initially, it will not be possible to parse any sentences, as no words have been introduced to the lexicon. However, as the algorithm sees more sentences in the training set, the lexicon will expand and the parses will become more complete.

The parsing step is fairly typical for a probabilistic CKY parser, proceeding in a bottom-up direction by calling COMPOSE for each pair of adjacent substrings. The only major difference does not come from the parsing strategy per se, but from necessary constraints on the formalisms compatible with Missing Link.

The confidence values for subsequent cells in the chart are determined under the assumption that all assignments are independent. Thus, in most cases, P(COMPOSE(A, B)) = P(A)P(B). If the same category is found multiple times, these values are also treated as independent; thus, P(A) = P(A1) + P(A2) − P(A1)P(A2), where A1 and A2 are separate instances of the category A.

Once as much of the chart has been filled by the parser as possible, the parse stage ends for that sentence. The parse values of the cells are retained when the chart is passed to the learning stage; incomplete parses are used to inform the learning stage. Even if the parse was successful, the learning stage still occurs, in order to update the system's confidence in the values used to produce the successful parse.

2.4 Learning

The learning stage is similar to the parsing stage, although it is somewhat more involved as it is able to take advantage of the results of the parse to minimize its search space.

First, the learn value of the root of the chart is initialized with the annotated category for the sentence. Then, the algorithm proceeds in a top-down manner, calling DECOMPOSE on each cell and two corresponding substrings. For the most part, this proceeds in the same manner as the parsing stage, except in a top-down direction. There are, however, a few differences.

The base probability for a learned category is based on the probability of the root and the probability of any known sub-constituents.

When assigning probabilities in the learning stage, the learner must contend with the fact that there are multiple possibilities and no guarantee that they are all valid. The learner must also contend with the fact that there are multiple possible tree structures. Since chart parsing deals only with binary-branching trees, it is possible to calculate the number of possible trees for a sentence of a given length. The probability must then also be modulated by the (n − 1)st Catalan number, where n is the length of the substring corresponding to the current cell. Thus, the total probability for a given result is p / (C_{n−1} · l), where p is the base probability, C_n is the nth Catalan number, n is the length of the substring, and l is the number of results produced by DECOMPOSE.

The results are also dependent on the parse values of the corresponding sub-constituents. If both of the sub-constituents have known parse values, then learning does not necessarily need to occur. If these sub-constituents can be composed to produce at least one value in the current cell's learn value, then those sub-constituents will be added to the learn values of their cells, with their original probabilities. In other words, if the values are known and can produce the target value, then no additional learning needs to occur. If the target value cannot be produced, then learning occurs as normal, as if neither value were known.

A similar situation occurs when only one of the sub-constituents is known. In this case, DECOMPOSE is applied as normal, with the restriction that the known value remains constant. If DECOMPOSE is successful, then the probability remains constant as well. On the other hand, if DECOMPOSE cannot learn any values, then the decomposition is attempted again, as though neither value were known.

2.5 Lexical Update

Once both parsing and learning have occurred for a given sentence, the algorithm updates the lexicon based on the learned values for the length-1 substring cells in the chart. Since values in the lexicon are represented the same way as cells in the chart, this is a fairly straightforward process. If the category for a word in the current sentence is the same as a category already in the lexicon, the probability of the word is updated according to the assumption that the two probabilities are independent, as described above.
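To make the bookkeeping in Sections 2.3–2.5 concrete, here is a minimal sketch of the probability updates (my own illustration, assuming the Python representation sketched earlier; the function names are not from the released code):

from math import comb

def catalan(n: int) -> int:
    """nth Catalan number: the number of binary-branching trees."""
    return comb(2 * n, n) // (n + 1)

def compose_probability(p_left: float, p_right: float) -> float:
    """Parse stage: sub-constituent assignments are treated as independent,
    so P(COMPOSE(A, B)) = P(A) P(B)."""
    return p_left * p_right

def add_category(cell: dict, category, p: float) -> None:
    """Repeated findings of the same category are also independent, so
    P(A) = P(A1) + P(A2) - P(A1) P(A2); the lexical update uses the same rule."""
    old = cell.get(category, 0.0)
    cell[category] = old + p - old * p

def learn_probability(base: float, n: int, num_decompositions: int) -> float:
    """Learn stage: modulate the base probability by the (n-1)st Catalan
    number and by the number of results l returned by DECOMPOSE."""
    return base / (catalan(n - 1) * num_decompositions)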

3 Implementing Combinatory Categorial Grammar

For the purposes of this paper, Combinatory Categorial Grammar (CCG) will be used as an example formalism with Missing Link. CCG is an efficiently parseable grammar formalism in which all language-specific rules are stored in the lexicon (Steedman and Baldridge, 2002).

To implement a formalism for Missing Link, it is only necessary to define the set E and the functions COMPOSE and DECOMPOSE. In CCG, the set E will be the set of categories, which is defined as follows:

• Given a set A of atoms, if a ∈ A, then a is a category.

• If a and b are categories, then (a\b) and (a/b) are categories. These are referred to as complex or functional categories.

For the purposes of Missing Link, it is also necessary to define one additional type of category: the variable. In this paper, variables are represented using lowercase letters while atoms are represented using capital letters.

In this paper, I start with the variant of CCG defined in Eisner (1996). In this variant, the COMPOSE operator can be defined fairly straightforwardly using two combinators, generalized forward and backward composition:

(X/Y)   Y|nZn · · · |2Z2|1Z1   ⇒   X|nZn · · · |2Z2|1Z1      (>Bn)

Y|nZn · · · |2Z2|1Z1   (X\Y)   ⇒   X|nZn · · · |2Z2|1Z1      (<Bn)

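For illustration only, the sketch below encodes the category datatype and the applicative special case (n = 0) of the combinators above; the class and function names are my own hypothetical choices, not the paper's implementation:

from dataclasses import dataclass
from typing import List, Union

@dataclass(frozen=True)
class Atom:
    name: str                # e.g. "S"

@dataclass(frozen=True)
class Var:
    name: str                # e.g. "a", "b"

@dataclass(frozen=True)
class Complex:
    result: "CCGCategory"
    slash: str               # "/" (argument to the right) or "\\" (to the left)
    arg: "CCGCategory"

CCGCategory = Union[Atom, Var, Complex]

def apply_combinators(left: CCGCategory, right: CCGCategory) -> List[CCGCategory]:
    """n = 0 case only: forward application (X/Y  Y => X) and
    backward application (Y  X\\Y => X)."""
    results: List[CCGCategory] = []
    if isinstance(left, Complex) and left.slash == "/" and left.arg == right:
        results.append(left.result)
    if isinstance(right, Complex) and right.slash == "\\" and right.arg == left:
        results.append(right.result)
    return results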
3.1 Modalities

As written above, this formalism performs poorly with Missing Link. Though COMPOSE and DECOMPOSE always produce finite results, the generalized combinators prove problematic for learning. This is because crossed composition in CCG can result in permutation, which makes it impossible for Missing Link to distinguish between certain alternatives involving functional categories.

For example, consider the first learning instance involving a two-word sentence with no words known. This will produce a derivation such as the following, where multiple categories on a node correspond to alternative hypotheses proposed by Missing Link: the root is S, the left word is hypothesized to be either (S/a) or b, and the right word either a or (S\b).

If the algorithm then attempts to parse this same sentence (or any similar one), it will run into a problem. Since the algorithm cannot tell which hypothesis for the left node was originally paired with which hypothesis for the right node, it is free to recombine mismatched hypotheses; with crossed composition available, such mismatches can still converge to S. To address this, I use the slash modalities used in many CCG variants, such as Baldridge (2002). These variants place modalities on the slash categories that restrict their application to certain combinators. I use four basic modalities, taken from Steedman and Baldridge (2002).

• Star (⋆): Categories with this slash can only apply in simple applicative contexts (function composition of all types is forbidden).

• Diamond (⋄): Categories with this slash can compose only with categories of the same slash direction.

• Cross (×): Categories with this slash can compose only with categories of the opposite slash direction.

• Dot (·): Categories with this slash can compose with any other category.

Using these modalities restricts the cases where certain compositions may occur. During learning, the most restrictive category that applies to the given decomposition is always used. For example, the decomposition of S results in ((S/⋆a), a) and (b, (S\⋆b)). Due to the restrictions on the modalities, it is no longer possible to compose the results using crossed composition: (S/⋆a) and (S\⋆b) are incompatible! The problem of convergence to S no longer exists, and the computational properties of the formalism are maintained: crossed composition can still occur as a last resort, in cases where it is clear that it can be derived.

As an aside, it is no accident that crossed composition was the cause of this issue. The permutations caused by crossed composition also make it necessary in mildly context-sensitive CCGs to account for data in languages such as Swiss German and Dutch. It is interesting, though not surprising, that this comes with its own difficulties in learning.

3.2 Variables and State

As described in previous sections, this implementation of CCG for Missing Link uses variables to represent categories whose exact value cannot be determined. Though variables are necessary, they also introduce additional complexity into the algorithm. There are a few special notes that relate to the treatment of variables in CCG.

The values of variables are stored in a global state, which maps variable IDs to their values (if a value is known). When a variable is evaluated in parsing or learning, it is first resolved according to the state. In most cases, a variable cannot be resolved completely (the final representation will still contain one or more variables); this is expected, as the algorithm is not able to induce new atomic categories; it must instead make use of variables that are designated to represent new categories.

The value of a variable may be a complex category, an atomic category, or another variable (the latter case being used primarily to set two variables equal to one another). If a variable is set equal to a complex category, it may be that the complex category itself contains variables. This creates the possibility of reference cycles, which would produce undefined values. To avoid this, every time an assignment is made, the algorithm performs an occurs check to ensure that the assignment will not produce any reference cycles: the new value is checked for any instances, direct or indirect, of the variable. If there are any, the assignment cannot be completed and the algorithm must try another alternative.

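An occurs check of this kind is standard in unification; the following is a minimal sketch of how it might look over the state just described (the value representation and helper names are my own assumptions, not the paper's code):

from typing import Dict, Tuple, Union

# A value is an atom name, a variable ID, or a (result, slash, argument) triple.
Value = Union[str, Tuple]

def occurs(var_id: str, value: Value, state: Dict[str, Value]) -> bool:
    """True if var_id appears, directly or indirectly, inside value
    (assuming the state is currently cycle-free)."""
    if isinstance(value, str):
        if value == var_id:
            return True
        bound = state.get(value)                # follow another variable's binding
        return bound is not None and occurs(var_id, bound, state)
    return any(occurs(var_id, part, state) for part in value)

def assign(var_id: str, value: Value, state: Dict[str, Value]) -> bool:
    """Bind var_id to value unless doing so would create a reference cycle."""
    if occurs(var_id, value, state):
        return False                            # rejected; the caller tries another alternative
    state[var_id] = value
    return True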
Once a variable is assigned a value, it is permanent. However, the algorithm is still free to introduce a new variable by re-decomposing categories in future training samples, according to the rules given in Section 2. Furthermore, if a variable assignment fails within a combinator (due to an occurs check), the state is rolled back to the way it was before the combinator began processing. This prevents known inconsistencies from filling the state; though the state may still contain inconsistencies, all values in the state are part of a potential analysis. Any remaining inconsistencies typically do not achieve high probability during learning, as they cannot be used to successfully parse many sentences.

4 Evaluation

Evaluating an unsupervised learning algorithm is a difficult prospect. Many unsupervised syntactic learning algorithms are evaluated by comparing the resulting dependency structures to a gold standard, typically by checking that each predicted dependency is directed in the same way between the same two lexical units as the gold-standard dependency. However, comparison by dependency structures is not always a good choice for unsupervised learning algorithms. In particular, because one of the goals of Missing Link is to provide a framework for analyzing grammar formalisms in an otherwise theory-independent manner, it is undesirable to make use of any theory-dependent analysis. Though dependencies may be largely independent of theory, they still make conventional decisions. For example, the Universal Dependencies project does not usually allow functional categories to be heads, while alternatives, such as Stanford Dependencies, do allow functional heads. If the learning algorithm is free to choose from the alternatives on its own, then it cannot be accurately evaluated against such a standard.

To evaluate Missing Link, I therefore use a secondary task. Similar to BERT (Devlin et al., 2018), I use Missing Link as an unsupervised pre-training algorithm for a downstream task. In this case, I use linguistic acceptability (grammaticality) judgments as the downstream task. This is an ideal task for Missing Link, since Missing Link's confidence values essentially capture the notion of probability that a sentence (or substring) is grammatical. I use the Corpus of Linguistic Acceptability (Warstadt et al., 2018) to provide the gold standard and training data. The Corpus of Linguistic Acceptability (CoLA) provides a basic classification task in which sentences are annotated with boolean grammaticality judgments – that is, each sentence is either considered grammatical or not. Missing Link will provide the pre-trained linguistic model, and a simple logistic regression will use Missing Link confidence values to classify the test data.

4.1 Pre-Training

Before training the model, Missing Link is used to pre-train a linguistic model of the target language, in this case English. In order to keep the conditions between the training set and the testing set as close as possible, it is necessary to pre-train Missing Link on a different dataset than the annotated linguistic acceptability data. In addition, the Corpus of Linguistic Acceptability is relatively small. To this end, I pre-train Missing Link using the much larger Billion Words corpus (Chelba et al., 2013).

For pre-training, I sorted the sentences of the Billion Words corpus (BWB) in ascending order by length. Missing Link is able to assign higher confidence to words in shorter sentences. This allows it to have a relatively small set of hypotheses for many words that occur in shorter sentences, which it can then use to better learn nearby words in longer sentences. Sentences of length less than 3 were excluded, since most short sentences in BWB are simple noun phrases or noisy punctuation, which can confuse Missing Link. Sentences of length greater than 10 were excluded as well, since they tend to contribute little to Missing Link's confidence.

For practical reasons, I also restricted the number of categories learned for each cell in the chart and lexical entry to 50, to improve efficiency while still allowing for multiple simultaneous hypotheses.

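Expressed as code, the data selection just described might look like the following sketch (the corpus is assumed to be an iterable of token lists, and the pruning criterion — keeping the highest-confidence categories — is my assumption, since the paper only states the cap of 50):

MAX_CATEGORIES = 50  # cap on hypotheses per chart cell and lexical entry

def select_pretraining_sentences(corpus):
    """Keep Billion Words sentences of length 3-10 and sort them by length."""
    kept = [s for s in corpus if 3 <= len(s) <= 10]
    return sorted(kept, key=len)

def prune(confidence: dict, limit: int = MAX_CATEGORIES) -> dict:
    """Retain only the `limit` highest-confidence categories."""
    ranked = sorted(confidence.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:limit])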
4.2 Training and Testing

Once the linguistic model is pre-trained, it can be used to train a logistic regression. To train the logistic regression, Missing Link processes each sentence in the CoLA training set, producing a confidence value for each potential sentential category. If no parse is produced, or if S is not in the results, then Missing Link is allowed to learn from the sentence. By learning from sentences where no valid parse was produced, Missing Link becomes robust to sentences where some words were not in the original pre-training set. After learning, Missing Link attempts to parse the sentence again.

The confidence of the category S is used as one independent variable for training the logistic regression. The other independent variables are the length of the sentence, whether the category S was found (including after re-training), and whether re-training was required. Longer sentences are inherently associated with lower probabilities due to the independence assumptions used in composition. Sentences where the category S was never found are also distinguished from sentences where the probability of S was negligible; although both are likely to indicate an ungrammatical sentence, for long sentences the distinction may be necessary. Lastly, whether retraining was necessary is a plausible predictor as well, since retraining indicates that either some words are out of vocabulary or the sentence could not be parsed with the previously learned categories.

Once the logistic regression has been fit to the training data, the same process is applied to the testing data in order to predict whether each sentence is grammatical or not.
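A sketch of this downstream classifier under the stated assumptions (scikit-learn is my choice of library here, and the parse/learn methods on the pre-trained model are hypothetical names, not the paper's API):

from sklearn.linear_model import LogisticRegression

def acceptability_features(sentence, model):
    """The four predictors from Section 4.2: confidence of S, sentence length,
    whether S was found (possibly after re-training), and whether re-training
    was required."""
    result = model.parse(sentence)            # hypothetical Missing Link interface
    retrained = False
    if not result or "S" not in result:
        model.learn(sentence)                 # allow learning, then parse again
        result = model.parse(sentence)
        retrained = True
    found = bool(result) and "S" in result
    confidence = result.get("S", 0.0) if result else 0.0
    return [confidence, len(sentence), float(found), float(retrained)]

def fit_classifier(sentences, labels, model):
    X = [acceptability_features(s, model) for s in sentences]
    return LogisticRegression().fit(X, labels)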

5 Results and Conclusion

The Corpus of Linguistic Acceptability is evaluated using Matthews Correlation Coefficients, since there are far more grammatical sentences in the data than ungrammatical ones. A number of other systems have been tested against CoLA and can be used as benchmarks for Missing Link, since the same training–testing split is used by all systems.

The results of the evaluation of Missing Link as well as some of the top performing competitors are given in Table 1. With an MCC of 63.0, Missing Link does not advance the state of the art compared to deep learning models, but it does perform competitively. Given that Missing Link uses only a basic logistic regression on top of the pre-trained model, this presents evidence that Missing Link is producing a reasonable grammar for the data.

Table 1: CoLA Benchmark²

Model                  MCC
MT-DNN                 68.4
RoBERTa                67.8
XLNet-Large            67.8
GLUE Human Baseline    66.4
Missing Link           63.0
XLM                    62.9
BERT-24                60.5

²These baselines are taken from the GLUE leaderboard at the time of writing (Wang et al., 2019).

Given that Missing Link produces a reasonable grammar, it can then be used for further study in the fields of grammar formalisms and theoretical linguistics. Different grammar formalisms can be compared using the same core algorithm, allowing for any variation in performance to be attributed to properties of the grammar formalism. The algorithm can also be used to explore specific linguistic phenomena from a learning perspective. Given two alternatives to a linguistic phenomenon, it is possible to use Missing Link as one potential way of distinguishing between the two. This presents a new paradigm in linguistic research as a means of exploring generative linguistics through a formal, but nuanced, model of learning and learnability. Missing Link is not necessarily the only algorithm that supports this paradigm, but presents evidence that such a paradigm is feasible for linguistic theory.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.

Jason Baldridge. 2002. Lexically specified control in combinatory categorial grammar. Ph.D. dissertation, University of Edinburgh.

Yonatan Bisk, Christos Christodoulopoulos, and Julia Hockenmaier. 2015. Labeled grammar induction with minimal supervision. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 870–876, Beijing, China. Association for Computational Linguistics.

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of COLING.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Philipp Koehn. 2013. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005.

Jinho D. Choi and Andrew McCallum. 2013. Transition-based dependency parsing with selectional branching. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1052–1062.

Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 387–396.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Shimon Edelman, Zach Solan, David Horn, and Eytan Ruppin. 2003. Rich syntax from a raw corpus: Unsupervised does it. In Syntax, Semantics and Statistics Workshop at NIPS '03, Whistler, British Columbia, Canada.

Jason Eisner. 1996. Efficient normal-form parsing for combinatory categorial grammar. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 79–86, Santa Cruz.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. Unpublished manuscript.

William P. Headden III, Mark Johnson, and David McClosky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 101–109, Stroudsburg, PA, USA. Association for Computational Linguistics.

Aravind K. Joshi, K. Vijay-Shanker, and David Weir. 1990. The convergence of mildly context-sensitive grammar formalisms. Technical Report MS-CIS-90-01, Department of Computer and Information Science, University of Pennsylvania.

Tadao Kasami. 1965. An efficient recognition and syntax-analysis algorithm for context-free languages. Technical report, AFCRL.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1223–1233, Cambridge, MA.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1512–1523, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshops on Human Language Technology, pages 114–119. Association for Computational Linguistics.

André F. T. Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the ACL.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of LREC, pages 1659–1666.

Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010. From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 751–759, Stroudsburg, PA, USA. Association for Computational Linguistics.

Valentin I. Spitkovsky, Hiyan Alshawi, Angel X. Chang, and Daniel Jurafsky. 2011. Unsupervised dependency parsing without gold part-of-speech tags. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1281–1290, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mark Steedman and Jason Baldridge. 2002. Combinatory categorial grammar. In Robert D. Borsley and Kersti Börjars, editors, Non-Transformational Syntax, pages 181–224. Blackwell.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. CoRR, abs/1805.12471.

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n³. Information and Control, 10:189–208.