Judging Grammaticality with Tree Substitution Grammar Derivations

Judging Grammaticality with Tree Substitution Grammar Derivations Matt Post Human Language Technology Center of Excellence Johns Hopkins University Baltimore, MD 21211 Abstract as for machine translation. As a result, the output of such text generation systems is often very poor In this paper, we show that local features com- grammatically, even if it is understandable. puted from the derivations of tree substitution Since grammaticality judgments are a matter of grammars — such as the identify of particu- lar fragments, and a count of large and small the syntax of a language, the obvious approach for fragments — are useful in binary grammatical modeling grammaticality is to start with the exten- classification tasks. Such features outperform sive work produced over the past two decades in n-gram features and various model scores by the field of parsing. This paper demonstrates the a wide margin. Although they fall short of utility of local features derived from the fragments the performance of the hand-crafted feature of tree substitution grammar derivations. Follow- set of Charniak and Johnson (2005) developed ing Cherry and Quirk (2008), we conduct experi- for parse tree reranking, they do so with an ments in a classification setting, where the task is to order of magnitude fewer features. Further- more, since the TSGs employed are learned distinguish between real text and “pseudo-negative” in a Bayesian setting, the use of their deriva- text obtained by sampling from a trigram language tions can be viewed as the automatic discov- model (Okanohara and Tsujii, 2007). Our primary ery of tree patterns useful for classification. points of comparison are the latent SVM training On the BLLIP dataset, we achieve an accuracy of Cherry and Quirk (2008), mentioned above, and of 89.9% in discriminating between grammat- the extensive set of local and nonlocal feature tem- ical text and samples from an n-gram language plates developed by Charniak and Johnson (2005) model. for parse tree reranking. In contrast to this latter set of features, the feature sets from TSG derivations 1 Introduction require no engineering; instead, they are obtained The task of a language model is to provide a measure directly from the identity of the fragments used in of the grammaticality of a sentence. Language mod- the derivation, plus simple statistics computed over els are useful in a variety of settings, for both human them. Since these fragments are in turn learned au- and machine output; for example, in the automatic tomatically from a Treebank with a Bayesian model, grading of essays, or in guiding search in a machine their usefulness here suggests a greater potential for translation system. Language modeling has proved adapting to other languages and datasets. to be quite difficult. The simplest models, n-grams, 2 Tree substitution grammars are self-evidently poor models of language, unable to (easily) capture or enforce long-distance linguis- Tree substitution grammars (Joshi and Schabes, tic phenomena. However, they are easy to train, are 1997) generalize context-free grammars by allow- long-studied and well understood, and can be ef- ing nonterminals to rewrite as tree fragments of ar- ficiently incorporated into search procedures, such bitrary size, instead of as only a sequence of one or 217 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:shortpapers, pages 217–222, Portland, Oregon, June 19-24, 2011. c 2011 Association for Computational Linguistics S. tions with larger fragments, whereas ungrammatical sentences will be forced to resort to small fragments. NP. VP. .. This is the central idea explored in this paper. VBD. NP. SBAR. This raises the question of what exactly the larger fragments are. A fundamental problem with TSGs is said. that they are hard to learn, since there is no annotated corpus of TSG derivations and the number of possi- Figure 1: A Tree Substitution Grammar fragment. ble derivations is exponential in the size of a tree. The most popular TSG approach has been Data- Oriented Parsing (Scha, 1990; Bod, 1993), which more children. Evaluated by parsing accuracy, these takes all fragments in the training data. The large grammars are well below state of the art. However, size of such grammars (exponential in the size of the they are appealing in a number of ways. Larger frag- training data) forces either implicit representations ments better match linguists’ intuitions about what (Goodman, 1996; Bansal and Klein, 2010) — which the basic units of grammar should be, capturing, for do not permit arbitrary probability distributions over example, the predicate-argument structure of a verb the grammar fragments — or explicit approxima- (Figure 1). The grammars are context-free and thus tions to all fragments (Bod, 2001). A number of re- retain cubic-time inference procedures, yet they re- searchers have presented ways to address the learn- duce the independence assumptions in the model’s ing problems for explicitly represented TSGs (Zoll- generative story by virtue of using fewer fragments mann and Sima’an, 2005; Zuidema, 2007; Cohn et (compared to a standard CFG) to generate a tree. al., 2009; Post and Gildea, 2009a). Of these ap- 3 A spectrum of grammaticality proaches, work in Bayesian learning of TSGs pro- duces intuitive grammars in a principled way, and The use of large fragments in TSG grammar deriva- has demonstrated potential in language modeling tions provides reason to believe that such grammars tasks (Post and Gildea, 2009b; Post, 2010). Our ex- might do a better job at language modeling tasks. periments make use of Bayesian-learned TSGs. Consider an extreme case, in which a grammar con- sists entirely of complete parse trees. In this case, 4 Experiments ungrammaticality is synonymous with parser fail- We experiment with a binary classification task, de- ure. Such a classifier would have perfect precision fined as follows: given a sequence of words, deter- but very low recall, since it could not generalize mine whether it is grammatical or not. We use two at all. On the other extreme, a context-free gram- datasets: the Wall Street Journal portion of the Penn mar containing only depth-one rules can basically Treebank (Marcus et al., 1993), and the BLLIP ’99 produce an analysis over any sequence of words. dataset,1 a collection of automatically-parsed sen- However, such grammars are notoriously leaky, and tences from three years of articles from the Wall the existence of an analysis does not correlate with Street Journal. grammaticality. Context-free grammars are too poor For both datasets, positive examples are obtained models of language for the linguistic definition of from the leaves of the parse trees, retaining their to- grammaticality (a sequence of words in the language kenization. Negative examples were produced from of the grammar) to apply. a trigram language model by randomly generating TSGs permit us to posit a spectrum of grammati- sentences of length no more than 100 so as to match cality in between these two extremes. If we have a the size of the positive data. The language model grammar comprising small and large fragments, we was built with SRILM (Stolcke, 2002) using inter- might consider that larger fragments should be less polated Kneser-Ney smoothing. The average sen- likely to fit into ungrammatical situations, whereas tence lengths for the positive and negative data were small fragments could be employed almost any- 23.9 and 24.7, respectively, for the Treebank data where as a sort of ungrammatical glue. Thus, on average, grammatical sentences will license deriva- 1LDC Catalog No. LDC2000T43. 218 dataset training devel. test 4. The Charniak parser (Charniak, 2000), run in Treebank 3,836 2,690 3,398 language modeling mode 91,954 65,474 79,998 BLLIP 100,000 6,000 6,000 The parsing models for both datasets were built from 2,596,508 155,247 156,353 sections 2 - 21 of the WSJ portion of the Treebank. These models were used to score or parse the train- Table 1: The number of sentences (first line) and words ing, development, and test data for the classifier. (second line) using for training, development, and test- From the output, we extract the following feature ing of the classifier. Each set of sentences is evenly split sets used in the classifier. between positive and negative examples. Sentence length (l). • and 25.6 and 26.2 for the BLLIP data. Model scores (S). Model log probabilities. Each dataset is divided into training, develop- • ment, and test sets. For the Treebank, we trained Rule features (R). These are counter features the n-gram language model on sections 2 - 21. The • based on the atomic unit of the analysis, i.e., in- classifier then used sections 0, 24, and 22 for train- dividual n-grams for the n-gram models, PCFG ing, development, and testing, respectively. For rules, and TSG fragments. the BLLIP dataset, we followed Cherry and Quirk Reranking features (C&J). From the Char- (2008): we randomly selected 450K sentences to • train the n-gram language model, and 50K, 3K, and niak parser output we extract the complete set 3K sentences for classifier training, development, of reranking features of Charniak and Johnson 3 and testing, respectively. All sentences have 100 (2005), and just the local ones (C&J local). or fewer words. Table 1 contains statistics of the l Frontier size ( n, n). Instances of this fea- datasets used in our experiments. • ture class countF theF number of TSG fragments To build the classifier, we used liblinear (Fan having frontier size n, 1 n 9.4 Instances et al., 2008). A bias of 1 was added to each feature of l count only lexical items≤ ≤ for 0 n 5. vector. We varied a cost or regularization parame- Fn ≤ ≤ ter between 1e 5 and 100 in orders of magnitude; 4.2 Results at each step, we− built a model, evaluating it on the Table 2 contains the classification results.

Load more