
Proceedings of the Society for Computation in Linguistics

Volume 3 Article 19

2020

Unsupervised Formal Grammar Induction with Confidence

Jacob Collard Cornell University, [email protected]


Recommended Citation: Collard, Jacob (2020) "Unsupervised Formal Grammar Induction with Confidence," Proceedings of the Society for Computation in Linguistics: Vol. 3, Article 19. DOI: https://doi.org/10.7275/5qfp-sg41. Available at: https://scholarworks.umass.edu/scil/vol3/iss1/19

Unsupervised Formal Grammar Induction with Confidence

Jacob Collard Cornell University [email protected]

Abstract

I present a novel algorithm for minimally supervised formal grammar induction using a linguistically-motivated grammar formalism. This algorithm, called the Missing Link algorithm (ML), is built off of classic chart parsing methods, but makes use of a probabilistic confidence measure to keep track of potentially ambiguous lexical items. Because ML uses a structured grammar formalism, each step of the algorithm can be easily understood by linguists, making it ideal for studying the learnability of different linguistic phenomena. The algorithm requires minimal annotation in its training data, but is capable of learning nuanced data from relatively small training sets and can be applied to a variety of grammar formalisms. Though evaluating an unsupervised syntactic model is difficult, I present an evaluation using the Corpus of Linguistic Acceptability and show state-of-the-art performance.¹

¹In the interest of reproducibility, the code used to generate these results is provided at https://github.com/thorsonlinguistics/scil2020

1 Introduction

Most research on learning algorithms for natural language has focused on supervised parsing, in which the parser learns from sentences in the target language paired with a corresponding, hand-constructed parse. Major natural language corpora, such as the Penn Treebank (Marcus et al., 1994) and the Universal Dependencies framework (Nivre et al., 2016), exemplify this tendency. This has allowed for highly performant models for dependency parsing such as ClearNLP (Choi and McCallum, 2013), CoreNLP (Manning et al., 2014), Mate (Bohnet, 2010), and Turbo (Martins et al., 2013), all of which have achieved an accuracy of over 89% on standard evaluation tasks (Choi et al., 2015).

Unsupervised learning for natural language processing is a much more difficult task, as the algorithm must explore the entire search space with minimal confirmation of its hypotheses. Nevertheless, a number of algorithms have attempted to solve the problem of unsupervised parsing. Most of these rely on gold standard part-of-speech tags (Headden III et al., 2009; Spitkovsky et al., 2010), though there are some exceptions, such as Spitkovsky et al. (2011). Almost all unsupervised algorithms for natural language syntactic processing are based on dependency parsing; most of the published literature on other grammar formalisms, such as tree-adjoining grammar (TAG) and combinatory categorial grammar (CCG), is either supervised or hand-engineered. Again, there are some exceptions, such as Bisk et al. (2015), which learns CCGs using a small amount of initial part-of-speech data. Edelman et al. (2003) also present a model of unsupervised learning which blends properties of construction grammars with tree-adjoining grammars; however, their model has not, as yet, been evaluated empirically.

Other models are only indirectly supervised; the syntax of the target language is learned without any syntactic annotations, but annotations may be present representing other facts about the sentence, such as its logical form. Notable examples of this include work by Kwiatkowski et al. (2010; 2011) and Artzi and Zettlemoyer (2013).

Another recent innovation in unsupervised learning is the introduction of pre-trained language models for deep learning algorithms, such as BERT (Devlin et al., 2018) and its relatives. These algorithms can be pre-trained on raw text in order to produce a language model which can then be used to bootstrap learning for a wide variety of additional tasks.
Though supervision may be required by these downstream tasks, the unsupervised component has been shown to greatly improve learning.

The representations that these models produce are somewhat opaque; though they have been shown to represent syntactic information for some tasks (Goldberg, 2019), an exact description of what the model is representing is difficult to produce.

Though most of the above systems do not rely on a strict notion of grammar formalism, in this paper, I will argue that a well-defined grammar formalism can produce strong results when used as the basis for an unsupervised learning algorithm. Dependence on a grammar formalism has a number of benefits. First, it means that each step of the algorithm can be (relatively) easily understood by humans. Each processing step either produces a novel derivation for a sentence or reinforces an old one, and each derivation conforms to the rules of the given formalism. Thus, as long as the rules of the formalism are understood, the meaning behind each processing step can also be understood. Second, using a grammar formalism ensures that certain facts about the resulting grammar will always hold. For example, using CCG or TAG will guarantee that the resulting grammar is in the class of mildly context-sensitive grammars. Third, using a grammar formalism means that the properties of the grammar formalism can be studied as well. Though formalisms such as CCG and TAG are weakly equivalent (Joshi et al., 1990), there may be differences between the two formalisms with respect to learning. Similarly, different variants of a particular formalism can be studied as well. For example, different combinators can be added or removed from CCG to produce different learning results. By using the grammar formalism as a core parameter in learning, the formalism becomes an independent variable that can be explored.

In this paper I introduce an algorithm, called the Missing Link algorithm (ML), which has several interesting properties which, I argue, are beneficial to the study of linguistics, grammar formalisms, and natural language processing. These properties include:

• Minimal supervision. The Missing Link algorithm learns from raw, tokenized text. The only annotation required is assurance of the sentential category, which is trivial for most training sets and grammar formalisms.

• Formalism Dependence. The grammar formalism is the core motivator for learning and parsing in Missing Link. This means that the formalism can easily be replaced with another and that the formalism can be studied as a parameter.

• Interpretability. Due partly to formalism dependence, the Missing Link algorithm is highly interpretable. Each step of the algorithm can be viewed as a derivation using the input formalism.

• Performance. The Missing Link algorithm performs well on an evaluation using linguistic acceptability judgments. The results of the evaluation are competitive with supervised algorithms such as BERT for the specific task used. Missing Link supplements the input formalism with a model of confidence for lexical entries that allows it to robustly handle potential ambiguity.

1.1 Related Work

The Missing Link algorithm builds off of relatively simple models for grammar induction and parsing. Parsing is done via a simple bottom-up chart-based method (Younger, 1967; Kasami, 1965). Learning is done in a top-down fashion using the same chart, with some extensions described in Section 2.2.

The Missing Link algorithm is closely related to the Unification-Based Learning (UBL) algorithm described in Kwiatkowski et al. (2010). UBL is an algorithm for semantic parsing, but the decomposition operations used in Missing Link are essentially the same as the higher-order unification used in UBL, albeit applied directly to syntactic categories instead of logical forms. Unlike UBL, Missing Link ensures that every step of processing is interpretable; the probabilistic grammar used in UBL can potentially obscure why individual parses are excluded.

2 The Missing Link Algorithm

There are two main stages to the Missing Link algorithm: parsing and learning, which are performed in order for every sentence in the training set. As the algorithm processes more sentences, it updates a lexicon, mapping words to their syntactic categories and a probability representing how confident the algorithm is that the given category is valid in the target grammar.
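As a purely schematic illustration of this control flow (the function names below are placeholders of my own, not the paper's API), the two stages can be thought of as a loop over the training data that folds each processed chart back into the lexicon:

def train(examples, parse, learn, update_lexicon, lexicon):
    """One pass over the training data: for each sentence, build bottom-up
    parse values, compute top-down learn values against the annotated
    sentential categories, then fold the length-1 cells into the lexicon."""
    for sentence, sentence_categories in examples:
        chart = parse(sentence, lexicon)
        learn(chart, sentence_categories)
        update_lexicon(lexicon, chart)
    return lexicon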

2.1 Inputs

The core inputs to the Missing Link algorithm are:

• A grammar formalism, which defines a (possibly infinite) set of grammatical units E and two functions: COMPOSE : E × E → E* and DECOMPOSE : E × E × E → (E × E)*. E always contains a special null element 0 indicating that the grammatical category is not known.

• A collection of training examples. Each training example consists of a tokenized sentence and an annotation describing the possible grammatical categories of the sentence.

The two functions of the grammar formalism determine the behavior of the parser and learner. The COMPOSE function returns a collection of elements in E that can be produced by combining the two input elements. This typically represents the basic structure-building operation of the given formalism. In Minimalism, for example, it corresponds to MERGE; in CCG, it corresponds to the various combinators; in TAG, to substitution and adjunction, etc.

The DECOMPOSE function is essentially the inverse of COMPOSE. It returns the set of pairs of elements that can be composed to produce a given input. Though it takes three elements, only the first, representing the root, is necessary. The remaining arguments are supplied to force one or both of the elements in the results to take a particular value. This will become important to avoid exploring the entirety of the search space; when a value is already known, the DECOMPOSE function will maintain that value whenever possible.

Note that the input of Missing Link places some restrictions on the types of formalisms that can be used. In particular, the formalism does not carry any language-specific information outside of the lexicon – the formalism must be strongly lexicalized. Many major formalisms, including CCG and TAG, adhere to this rule, though some formalisms, such as standard context-free grammars, cannot be represented by Missing Link. In addition, the results of DECOMPOSE and COMPOSE must be finite sets. Though this seems problematic for formalisms like CCG, where there are infinite ways to compose two arbitrary elements to produce a third, this can usually be avoided by schematization. That is, instead of returning every possible result, the results can be summarized using variables.

2.2 The Chart

The Missing Link algorithm is built around a CKY-style chart. The chart consists of cells representing potential parses for each substring of the sentence. In Missing Link, each cell stores two separate analyses: one for the potential bottom-up parses of the sentence, and one for the potential top-down decompositions of the sentence, based on the category assigned to the sentence. These will be referred to as the "parse value" and the "learn value" for each cell.

Both the parse value and the learn value are represented using the same data structure, which is also used in the lexicon. This data structure maps potential categories (elements of E according to the target formalism) to probabilities, which represent the algorithm's confidence that the given category is valid for the corresponding substring. Note that the sum of the probabilities for the different categories is not necessarily 1: in the case where the algorithm is certain that a substring is ambiguous, the probabilities will sum to at least 2. Missing Link does not assign probabilities based on frequency relative to other substrings, sentences, or categories, and makes no distinction between alternative parses other than their probability of being grammatical.

An example chart is shown in Figure 1.

[Figure 1: A chart showing the parse of the sentence "It might work". Parse values are given at the bottom of each node, while learn values are given at the top.]
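As a rough sketch of these inputs in code (my own illustration of the definitions above; the names Formalism, compose, decompose, and Cell are hypothetical and not taken from the released implementation):

from typing import Dict, List, Optional, Tuple

# A category is whatever the chosen formalism provides; a confidence map
# relates categories to the algorithm's probability that they are valid.
Category = str
Confidence = Dict[Category, float]

class Formalism:
    """Interface for a strongly lexicalized grammar formalism."""

    NULL: Category = "0"  # special null element: category not known

    def compose(self, left: Category, right: Category) -> List[Category]:
        """All categories derivable by combining left and right (COMPOSE)."""
        raise NotImplementedError

    def decompose(self, root: Category,
                  left: Optional[Category] = None,
                  right: Optional[Category] = None) -> List[Tuple[Category, Category]]:
        """All pairs that compose to root (DECOMPOSE), keeping any supplied
        left/right value fixed whenever possible."""
        raise NotImplementedError

class Cell:
    """One chart cell; lexicon entries use the same confidence-map structure."""
    def __init__(self) -> None:
        self.parse_value: Confidence = {}  # bottom-up hypotheses
        self.learn_value: Confidence = {}  # top-down hypotheses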
2.3 Parsing

For each sentence in the training data, the Missing Link algorithm begins by looking up each word in the lexicon. The results of the lookup are assigned to the parse value for the bottom cells of the chart, which represent the length-1 substrings. The algorithm then attempts to parse as much of the sentence as possible. Initially, it will not be possible to parse any sentences, as no words have been introduced to the lexicon. However, as the algorithm sees more sentences in the training set, the lexicon will expand and the parses will become more complete.

The parsing step is fairly typical for a probabilistic CKY parser, proceeding in a bottom-up direction by calling COMPOSE for each pair of adjacent substrings. The only major difference does not come from the parsing strategy per se, but from necessary constraints on the formalisms compatible with Missing Link.

The confidence values for subsequent cells in the chart are determined under the assumption that all assignments are independent. Thus, in most cases, P(COMPOSE(A, B)) = P(A)P(B). If the same category is found multiple times, these values are also treated as independent; thus, P(A) = P(A1) + P(A2) − P(A1)P(A2), where A1 and A2 are separate instances of the category A.

Once as much of the chart has been filled by the parser as possible, the parse stage ends for that sentence. The parse values of the cells are retained when the chart is passed to the learning stage; incomplete parses are used to inform the learning stage. Even if the parse was successful, the learning stage still occurs, in order to update the system's confidence in the values used to produce the successful parse.

2.4 Learning

The learning stage is similar to the parsing stage, although it is somewhat more involved as it is able to take advantage of the results of the parse to minimize its search space.

First, the learn value of the root of the chart is initialized with the annotated category for the sentence. Then, the algorithm proceeds in a top-down manner, calling DECOMPOSE on each cell and two corresponding substrings. For the most part, this proceeds in the same manner as the parsing stage, except in a top-down direction. There are, however, a few differences.

The base probability for a learned category is based on the probability of the root and the probability of any known sub-constituents.

When assigning probabilities in the learning stage, the learner must contend with the fact that there are multiple possibilities and no guarantee that they are all valid. The learner must also contend with the fact that there are multiple possible tree structures. Since chart parsing deals only with binary-branching trees, it is possible to calculate the number of possible trees for a sentence of a given length. The probability must then also be modulated by the (n − 1)st Catalan number, where n is the length of the substring corresponding to the current cell. Thus, the total probability for a given result is p / (C_{n−1} · l), where p is the base probability, C_n is the nth Catalan number, n is the length of the substring, and l is the number of results produced by DECOMPOSE.

The results are also dependent on the parse values of the corresponding sub-constituents. If both of the sub-constituents have known parse values, then learning does not necessarily need to occur. If these sub-constituents can be composed to produce at least one value in the current cell's learn value, then those sub-constituents will be added to the learn values of their cells, with their original probabilities. In other words, if the values are known and can produce the target value, then no additional learning needs to occur. If the target value cannot be produced, then learning occurs as normal, as if neither value were known.

A similar situation occurs when only one of the sub-constituents is known. In this case, DECOMPOSE is applied as normal, with the restriction that the known value remains constant. If DECOMPOSE is successful, then the probability remains constant as well. On the other hand, if DECOMPOSE cannot learn any values, then the decomposition is attempted again, as though neither value were known.

2.5 Lexical Update

Once both parsing and learning have occurred for a given sentence, the algorithm updates the lexicon based on the learned values for the length-1 substring cells in the chart. Since values in the lexicon are represented the same way as cells in the chart, this is a fairly straightforward process. If the category for a word in the current sentence is the same as a category already in the lexicon, the probability of the word is updated according to the assumption that the two probabilities are independent, as described above.
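To make the bookkeeping in Sections 2.3–2.5 concrete, here is a minimal sketch of the probability updates (my own illustration, assuming the Python representation sketched earlier; the function names are not from the released code):

from math import comb

def catalan(n: int) -> int:
    """nth Catalan number: the number of binary-branching trees."""
    return comb(2 * n, n) // (n + 1)

def compose_probability(p_left: float, p_right: float) -> float:
    """Parse stage: sub-constituent assignments are treated as independent,
    so P(COMPOSE(A, B)) = P(A) P(B)."""
    return p_left * p_right

def add_category(cell: dict, category, p: float) -> None:
    """Repeated findings of the same category are also independent, so
    P(A) = P(A1) + P(A2) - P(A1) P(A2); the lexical update uses the same rule."""
    old = cell.get(category, 0.0)
    cell[category] = old + p - old * p

def learn_probability(base: float, n: int, num_decompositions: int) -> float:
    """Learn stage: modulate the base probability by the (n-1)st Catalan
    number and by the number of results l returned by DECOMPOSE."""
    return base / (catalan(n - 1) * num_decompositions)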

3 Implementing Combinatory Categorial Grammar

For the purposes of this paper, Combinatory Categorial Grammar (CCG) will be used as an example formalism with Missing Link. CCG is an efficiently parseable grammar formalism in which all language-specific rules are stored in the lexicon (Steedman and Baldridge, 2002).

To implement a formalism for Missing Link, it is only necessary to define the set E and the functions COMPOSE and DECOMPOSE. In CCG, the set E will be the set of categories, which is defined as follows:

• Given a set A of atoms, if a ∈ A, then a is a category.

• If a and b are categories, then (a\b) and (a/b) are categories. These are referred to as complex or functional categories.

For the purposes of Missing Link, it is also necessary to define one additional type of category: the variable. In this paper, variables are represented using lowercase letters while atoms are represented using capital letters.

In this paper, I start with the variant of CCG defined in Eisner (1996). In this variant, the COMPOSE operator can be defined fairly straightforwardly using two combinators, generalized forward and backward composition:

(X/Y)   Y|nZn · · · |2Z2|1Z1   ⇒   X|nZn · · · |2Z2|1Z1      (>Bn)

Y|nZn · · · |2Z2|1Z1   (X\Y)   ⇒   X|nZn · · · |2Z2|1Z1      (<Bn)

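For illustration only, the sketch below encodes the category datatype and the applicative special case (n = 0) of the combinators above; the class and function names are my own hypothetical choices, not the paper's implementation:

from dataclasses import dataclass
from typing import List, Union

@dataclass(frozen=True)
class Atom:
    name: str                # e.g. "S"

@dataclass(frozen=True)
class Var:
    name: str                # e.g. "a", "b"

@dataclass(frozen=True)
class Complex:
    result: "CCGCategory"
    slash: str               # "/" (argument to the right) or "\\" (to the left)
    arg: "CCGCategory"

CCGCategory = Union[Atom, Var, Complex]

def apply_combinators(left: CCGCategory, right: CCGCategory) -> List[CCGCategory]:
    """n = 0 case only: forward application (X/Y  Y => X) and
    backward application (Y  X\\Y => X)."""
    results: List[CCGCategory] = []
    if isinstance(left, Complex) and left.slash == "/" and left.arg == right:
        results.append(left.result)
    if isinstance(right, Complex) and right.slash == "\\" and right.arg == left:
        results.append(right.result)
    return results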
3.1 Modalities

As written above, this formalism performs poorly with Missing Link. Though COMPOSE and DECOMPOSE always produce finite results, the generalized combinators prove problematic for learning. This is because crossed composition in CCG can result in permutation, which makes it impossible for Missing Link to distinguish between certain alternatives involving functional categories.

For example, consider the first learning instance involving a two-word sentence with no words known. This will produce a derivation such as the following, where multiple categories on a node correspond to alternative hypotheses proposed by Missing Link: the root is S, the left word is hypothesized to be either (S/a) or b, and the right word either a or (S\b).

If the algorithm then attempts to parse this same sentence (or any similar one), it will run into a problem. Since the algorithm cannot tell which hypothesis for the left node was originally paired with which hypothesis for the right node, it is free to recombine mismatched hypotheses; with crossed composition available, such mismatches can still converge to S. To address this, I use the slash modalities used in many CCG variants, such as Baldridge (2002). These variants place modalities on the slash categories that restrict their application to certain combinators. I use four basic modalities, taken from Steedman and Baldridge (2002).

• Star (⋆): Categories with this slash can only apply in simple applicative contexts (function composition of all types is forbidden).

• Diamond (⋄): Categories with this slash can compose only with categories of the same slash direction.

• Cross (×): Categories with this slash can compose only with categories of the opposite slash direction.

• Dot (·): Categories with this slash can compose with any other category.

Using these modalities restricts the cases where certain compositions may occur. During learning, the most restrictive category that applies to the given decomposition is always used. For example, the decomposition of S results in ((S/⋆a), a) and (b, (S\⋆b)). Due to the restrictions on the modalities, it is no longer possible to compose the results using crossed composition: (S/⋆a) and (S\⋆b) are incompatible! The problem of convergence to S no longer exists, and the computational properties of the formalism are maintained: crossed composition can still occur as a last resort, in cases where it is clear that it can be derived.

As an aside, it is no accident that crossed composition was the cause of this issue. The permutations caused by crossed composition also make it necessary in mildly context-sensitive CCGs to account for data in languages such as Swiss German and Dutch. It is interesting, though not surprising, that this comes with its own difficulties in learning.

3.2 Variables and State

As described in previous sections, this implementation of CCG for Missing Link uses variables to represent categories whose exact value cannot be determined. Though variables are necessary, they also introduce additional complexity into the algorithm. There are a few special notes that relate to the treatment of variables in CCG.

The values of variables are stored in a global state, which maps variable IDs to their values (if a value is known). When a variable is evaluated in parsing or learning, it is first resolved according to the state. In most cases, a variable cannot be resolved completely (the final representation will still contain one or more variables); this is expected, as the algorithm is not able to induce new atomic categories; it must instead make use of variables that are designated to represent new categories.

The value of a variable may be a complex category, an atomic category, or another variable (the latter case being used primarily to set two variables equal to one another). If a variable is set equal to a complex category, it may be that the complex category itself contains variables. This creates the possibility of reference cycles, which would produce undefined values. To avoid this, every time an assignment is made, the algorithm performs an occurs check to ensure that the assignment will not produce any reference cycles: the new value is checked for any instances, direct or indirect, of the variable. If there are any, the assignment cannot be completed and the algorithm must try another alternative.

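An occurs check of this kind is standard in unification; the following is a minimal sketch of how it might look over the state just described (the value representation and helper names are my own assumptions, not the paper's code):

from typing import Dict, Tuple, Union

# A value is an atom name, a variable ID, or a (result, slash, argument) triple.
Value = Union[str, Tuple]

def occurs(var_id: str, value: Value, state: Dict[str, Value]) -> bool:
    """True if var_id appears, directly or indirectly, inside value
    (assuming the state is currently cycle-free)."""
    if isinstance(value, str):
        if value == var_id:
            return True
        bound = state.get(value)                # follow another variable's binding
        return bound is not None and occurs(var_id, bound, state)
    return any(occurs(var_id, part, state) for part in value)

def assign(var_id: str, value: Value, state: Dict[str, Value]) -> bool:
    """Bind var_id to value unless doing so would create a reference cycle."""
    if occurs(var_id, value, state):
        return False                            # rejected; the caller tries another alternative
    state[var_id] = value
    return True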
Once a variable is assigned a value, it is permanent. However, the algorithm is still free to introduce a new variable by re-decomposing categories in future training samples, according to the rules given in Section 2. Furthermore, if a variable assignment fails within a combinator (due to an occurs check), the state is rolled back to the way it was before the combinator began processing. This prevents known inconsistencies from filling the state; though the state may still contain inconsistencies, all values in the state are part of a potential analysis. Any remaining inconsistencies typically do not achieve high probability during learning, as they cannot be used to successfully parse many sentences.

4 Evaluation

Evaluating an unsupervised learning algorithm is a difficult prospect. Many unsupervised syntactic learning algorithms are evaluated by comparing the resulting dependency structures to a gold standard, typically by checking that each predicted dependency is directed in the same way between the same two lexical units as the gold-standard dependency. However, comparison by dependency structures is not always a good choice for unsupervised learning algorithms. In particular, because one of the goals of Missing Link is to provide a framework for analyzing grammar formalisms in an otherwise theory-independent manner, it is undesirable to make use of any theory-dependent analysis. Though dependencies may be largely independent of theory, they still make conventional decisions. For example, the Universal Dependencies project does not usually allow functional categories to be heads, while alternatives, such as Stanford Dependencies, do allow functional heads. If the learning algorithm is free to choose from the alternatives on its own, then it cannot be accurately evaluated against such a standard.

To evaluate Missing Link, I therefore use a secondary task. Similar to BERT (Devlin et al., 2018), I use Missing Link as an unsupervised pre-training algorithm for a downstream task. In this case, I use linguistic acceptability (grammaticality) judgments as the downstream task. This is an ideal task for Missing Link, since Missing Link's confidence values essentially capture the notion of probability that a sentence (or substring) is grammatical. I use the Corpus of Linguistic Acceptability (Warstadt et al., 2018) to provide the gold standard and training data. The Corpus of Linguistic Acceptability (CoLA) provides a basic classification task in which sentences are annotated with boolean grammaticality judgments – that is, each sentence is either considered grammatical or not. Missing Link will provide the pre-trained linguistic model, and a simple logistic regression will use Missing Link confidence values to classify the test data.

4.1 Pre-Training

Before training the model, Missing Link is used to pre-train a linguistic model of the target language, in this case English. In order to keep the conditions between the training set and the testing set as close as possible, it is necessary to pre-train Missing Link on a different dataset than the annotated linguistic acceptability data. In addition, the Corpus of Linguistic Acceptability is relatively small. To this end, I pre-train Missing Link using the much larger Billion Words corpus (Chelba et al., 2013).

For pre-training, I sorted the sentences of the Billion Words corpus (BWB) in ascending order by length. Missing Link is able to assign higher confidence to words in shorter sentences. This allows it to have a relatively small set of hypotheses for many words that occur in shorter sentences, which it can then use to better learn nearby words in longer sentences. Sentences of length less than 3 were excluded, since most short sentences in BWB are simple noun phrases or noisy punctuation, which can confuse Missing Link. Sentences of length greater than 10 were excluded as well, since they tend to contribute little to Missing Link's confidence.

For practical reasons, I also restricted the number of categories learned for each cell in the chart and lexical entry to 50, to improve efficiency while still allowing for multiple simultaneous hypotheses.

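Expressed as code, the data selection just described might look like the following sketch (the corpus is assumed to be an iterable of token lists, and the pruning criterion — keeping the highest-confidence categories — is my assumption, since the paper only states the cap of 50):

MAX_CATEGORIES = 50  # cap on hypotheses per chart cell and lexical entry

def select_pretraining_sentences(corpus):
    """Keep Billion Words sentences of length 3-10 and sort them by length."""
    kept = [s for s in corpus if 3 <= len(s) <= 10]
    return sorted(kept, key=len)

def prune(confidence: dict, limit: int = MAX_CATEGORIES) -> dict:
    """Retain only the `limit` highest-confidence categories."""
    ranked = sorted(confidence.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:limit])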
4.2 Training and Testing

Once the linguistic model is pre-trained, it can be used to train a logistic regression. To train the logistic regression, Missing Link processes each sentence in the CoLA training set, producing a confidence value for each potential sentential category. If no parse is produced, or if S is not in the results, then Missing Link is allowed to learn from the sentence. By learning from sentences where no valid parse was produced, Missing Link becomes robust to sentences where some words were not in the original pre-training set. After learning, Missing Link attempts to parse the sentence again.

The confidence of the category S is used as one independent variable for training the logistic regression. The other independent variables are the length of the sentence, whether the category S was found (including after re-training), and whether re-training was required. Longer sentences are inherently associated with lower probabilities due to the independence assumptions used in composition. Sentences where the category S was never found are also distinguished from sentences where the probability of S was negligible; although both are likely to indicate an ungrammatical sentence, for long sentences the distinction may be necessary. Lastly, whether retraining was necessary is a plausible predictor as well, since retraining indicates that either some words are out of vocabulary or the sentence could not be parsed with the previously learned categories.

Once the logistic regression has been fit to the training data, the same process is applied to the testing data in order to predict whether each sentence is grammatical or not.
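A sketch of this downstream classifier under the stated assumptions (scikit-learn is my choice of library here, and the parse/learn methods on the pre-trained model are hypothetical names, not the paper's API):

from sklearn.linear_model import LogisticRegression

def acceptability_features(sentence, model):
    """The four predictors from Section 4.2: confidence of S, sentence length,
    whether S was found (possibly after re-training), and whether re-training
    was required."""
    result = model.parse(sentence)            # hypothetical Missing Link interface
    retrained = False
    if not result or "S" not in result:
        model.learn(sentence)                 # allow learning, then parse again
        result = model.parse(sentence)
        retrained = True
    found = bool(result) and "S" in result
    confidence = result.get("S", 0.0) if result else 0.0
    return [confidence, len(sentence), float(found), float(retrained)]

def fit_classifier(sentences, labels, model):
    X = [acceptability_features(s, model) for s in sentences]
    return LogisticRegression().fit(X, labels)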

5 Results and Conclusion

The Corpus of Linguistic Acceptability is evaluated using Matthews Correlation Coefficients, since there are far more grammatical sentences in the data than ungrammatical ones. A number of other systems have been tested against CoLA and can be used as benchmarks for Missing Link, since the same training–testing split is used by all systems.

The results of the evaluation of Missing Link as well as some of the top performing competitors are given in Table 1. With an MCC of 63.0, Missing Link does not advance the state of the art compared to deep learning models, but it does perform competitively. Given that Missing Link uses only a basic logistic regression on top of the pre-trained model, this presents evidence that Missing Link is producing a reasonable grammar for the data.

Table 1: CoLA Benchmark²

Model                  MCC
MT-DNN                 68.4
RoBERTa                67.8
XLNet-Large            67.8
GLUE Human Baseline    66.4
Missing Link           63.0
XLM                    62.9
BERT-24                60.5

²These baselines are taken from the GLUE leaderboard at the time of writing (Wang et al., 2019).

Given that Missing Link produces a reasonable grammar, it can then be used for further study in the fields of grammar formalisms and theoretical linguistics. Different grammar formalisms can be compared using the same core algorithm, allowing for any variation in performance to be attributed to properties of the grammar formalism. The algorithm can also be used to explore specific linguistic phenomena from a learning perspective. Given two alternatives to a linguistic phenomenon, it is possible to use Missing Link as one potential way of distinguishing between the two. This presents a new paradigm in linguistic research as a means of exploring generative linguistics through a formal, but nuanced, model of learning and learnability. Missing Link is not necessarily the only algorithm that supports this paradigm, but presents evidence that such a paradigm is feasible for linguistic theory.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.

Jason Baldridge. 2002. Lexically specified control in combinatory categorial grammar. Ph.D. dissertation, University of Edinburgh.

Yonatan Bisk, Christos Christodoulopoulos, and Julia Hockenmaier. 2015. Labeled grammar induction with minimal supervision. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 870–876, Beijing, China. Association for Computational Linguistics.

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of COLING.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Philipp Koehn. 2013. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005.

Jinho D. Choi and Andrew McCallum. 2013. Transition-based dependency parsing with selectional branching. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1052–1062.

Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 387–396.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Shimon Edelman, Zach Solan, David Horn, and Eytan Ruppin. 2003. Rich syntax from a raw corpus: Unsupervised does it. In Syntax, Semantics and Statistics Workshop at NIPS '03, Whistler, British Columbia, Canada.

Jason Eisner. 1996. Efficient normal-form parsing for combinatory categorial grammar. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 79–86, Santa Cruz.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. Unpublished manuscript.

William P. Headden III, Mark Johnson, and David McClosky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 101–109, Stroudsburg, PA, USA. Association for Computational Linguistics.

Aravind K. Joshi, K. Vijay-Shanker, and David Weir. 1990. The convergence of mildly context-sensitive grammar formalisms. Technical Report MS-CIS-90-01, Department of Computer and Information Science, University of Pennsylvania.

Tadao Kasami. 1965. An efficient recognition and syntax-analysis algorithm for context-free languages. Technical report, AFCRL.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1223–1233, Cambridge, MA.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1512–1523, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshops on Human Language Technology, pages 114–119. Association for Computational Linguistics.

André F. T. Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the ACL.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of LREC, pages 1659–1666.

Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010. From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 751–759, Stroudsburg, PA, USA. Association for Computational Linguistics.

Valentin I. Spitkovsky, Hiyan Alshawi, Angel X. Chang, and Daniel Jurafsky. 2011. Unsupervised dependency parsing without gold part-of-speech tags. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1281–1290, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mark Steedman and Jason Baldridge. 2002. Combinatory categorial grammar. In Robert D. Borsley and Kersti Börjars, editors, Non-Transformational Syntax, pages 181–224. Blackwell.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. CoRR, abs/1805.12471.

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n³. Information and Control, 10:189–208.