Ensemble Methods for Automatic Thesaurus Extraction
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 222-229. Association for Computational Linguistics.

James R. Curran
Institute for Communicating and Collaborative Systems
University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom
[email protected]

Abstract

Ensemble methods are state of the art for many NLP tasks. Recent work by Banko and Brill (2001) suggests that this would not necessarily be true if very large training corpora were available. However, their results are limited by the simplicity of their evaluation task and individual classifiers.

Our work explores ensemble efficacy for the more complex task of automatic thesaurus extraction on up to 300 million words. We examine our conflicting results in terms of the constraints on, and complexity of, different contextual representations, which contribute to the sparseness- and noise-induced bias behaviour of NLP systems on very large corpora.

1 Introduction

Ensemble learning is a machine learning technique that combines the output of several different classifiers with the goal of improving classification performance. The classifiers within the ensemble may differ in several ways, such as the learning algorithm or knowledge representation used, or the data they were trained on. Ensemble learning has been successfully applied to numerous NLP tasks, including POS tagging (Brill and Wu, 1998; van Halteren et al., 1998), chunking (Tjong Kim Sang, 2000), word sense disambiguation (Pedersen, 2000) and statistical parsing (Henderson and Brill, 1999). Dietterich (2000) presents a good introduction to ensemble methods.

Ensemble methods ameliorate learner bias by amortising individual classifier bias over different systems. For an ensemble to be more effective than its constituents, the individual classifiers must have better than 50% accuracy and must produce diverse erroneous classifications (Dietterich, 2000). Brill and Wu (1998) call this complementary disagreement "complementarity". Although ensembles are often effective on problems with small training sets, recent work suggests this may not be true as dataset size increases. Banko and Brill (2001) found that for confusion set disambiguation with corpora larger than 100 million words, the best individual classifiers outperformed ensemble methods.

One limitation of their results is the simplicity of the task and methods used to examine the efficacy of ensemble methods. However, both the task and applied methods are constrained by the ambitious use of one billion words of training material. Disambiguation is relatively simple because confusion sets are rarely larger than four elements. The individual methods must be inexpensive because of the computational burden of the massive training set, so they must perform limited processing of the training corpus and can only consider a fairly narrow context surrounding each instance.

We explore the value of ensemble methods for the more complex task of automatic thesaurus extraction, training on corpora of up to 300 million words. The increased complexity leads to results contradicting Banko and Brill (2001), which we explore using ensembles of different contextual complexity. This work emphasises the link between contextual complexity and the problems of representation sparseness and noise as corpus size increases, which in turn impacts on learner bias and ensemble efficacy.

2 Automatic Thesaurus Extraction

The development of large thesauri and semantic resources, such as WordNet (Fellbaum, 1998), has allowed lexical semantic information to be leveraged to solve NLP tasks, including collocation discovery (Pearce, 2001), model estimation (Brown et al., 1992; Clark and Weir, 2001) and text classification (Baker and McCallum, 1998).

Unfortunately, thesauri are expensive and time-consuming to create manually, and tend to suffer from problems of bias, inconsistency, and limited coverage. In addition, thesaurus compilers cannot keep up with constantly evolving language use and cannot afford to build new thesauri for the many subdomains that NLP techniques are being applied to. There is a clear need for automatic thesaurus extraction methods.

Much of the existing work on thesaurus extraction and word clustering is based on the observation that related terms will appear in similar contexts. These systems differ primarily in their definition of "context" and the way they calculate similarity from the contexts each term appears in. Many systems extract co-occurrence and syntactic information from the words surrounding the target term, which is then converted into a vector-space representation of the contexts that each target term appears in (Pereira et al., 1993; Ruge, 1997; Lin, 1998b). Curran and Moens (2002b) evaluate thesaurus extractors based on several different models of context on large corpora. The context models used in our experiments are described in Section 3.

We define a context relation instance as a tuple (w, r, w′), where w is a thesaurus term occurring in a relation of type r with another word w′ in the sentence. We refer to the tuple (r, w′) as an attribute of w. The relation type may be grammatical, or it may label the position of w′ in a context window: e.g. the tuple (dog, direct-obj, walk) indicates that the term dog was the direct object of the verb walk. After the contexts have been extracted from the raw text, they are compiled into attribute vectors describing all of the contexts each term appears in. The thesaurus extractor then uses clustering or nearest-neighbour matching to select similar terms based on a vector similarity measure.

(adjective, good)          2005
(adjective, faintest)        89
(direct-obj, have)         1836
(indirect-obj, toy)          74
(adjective, preconceived)    42
(adjective, foggiest)        15

Figure 1: Example attributes of the noun idea

Our experiments use k-nearest-neighbour matching for thesaurus extraction, which calculates the pairwise similarity of the target term with every potential synonym. Given n terms and up to m attributes for each term, the asymptotic time complexity of the k-nearest-neighbour algorithm is O(n²m). We reduce the number of terms by introducing a minimum occurrence filter that eliminates potential synonyms with a frequency of less than five.
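To make this pipeline concrete, the sketch below compiles context relation instances into attribute vectors and ranks candidate synonyms by k-nearest-neighbour matching. It is a minimal illustration rather than the system used in the paper: raw co-occurrence counts stand in for the TTEST weights introduced in Section 3, the min/max formulation of the weighted JACCARD measure is assumed from Curran and Moens (2002a), and all function names are ours.

    from collections import defaultdict

    def compile_vectors(instances):
        # Compile (w, r, w') context relation instances into
        # attribute vectors: term -> {(r, w'): count}.
        vectors = defaultdict(lambda: defaultdict(int))
        for w, r, w2 in instances:
            vectors[w][(r, w2)] += 1
        return vectors

    def weighted_jaccard(u, v):
        # Weighted JACCARD: sum of minimum attribute weights over
        # sum of maximum attribute weights (raw counts stand in
        # for TTEST weights in this sketch).
        attrs = set(u) | set(v)
        num = sum(min(u.get(a, 0), v.get(a, 0)) for a in attrs)
        den = sum(max(u.get(a, 0), v.get(a, 0)) for a in attrs)
        return num / den if den else 0.0

    def nearest_neighbours(target, vectors, k=200, min_freq=5):
        # Rank every candidate synonym by similarity to the target.
        # The minimum occurrence filter removes candidates seen
        # fewer than min_freq times, shrinking n in the O(n^2 m)
        # pairwise comparison.
        scores = []
        for term, vec in vectors.items():
            if term == target or sum(vec.values()) < min_freq:
                continue
            scores.append((weighted_jaccard(vectors[target], vec), term))
        scores.sort(reverse=True)
        return scores[:k]

For the noun idea, compile_vectors would produce a vector containing attributes such as ((adjective, good), 2005), as in Figure 1.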
3 Individual Methods

The individual methods in these ensemble experiments are based on different extractors of contextual information. All the systems use the JACCARD similarity metric and TTEST weighting function that were found to be most effective for thesaurus extraction by Curran and Moens (2002a).

The simplest and fastest contexts to extract are the word(s) surrounding each thesaurus term up to some fixed distance. These window methods are labelled W(L1R1), where L1R1 indicates that the window extends one word on either side of the target term. Methods marked with an asterisk, e.g. W(L1R1∗), do not record the word's position in the relation type.
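For illustration, a window extractor of this kind fits in a few lines; the relation labels L1/R1 and the collapsed label used for the starred variant are our own assumed naming, not the paper's:

    def window_contexts(tokens, positional=True):
        # Extract W(L1R1) context relation instances: one word on
        # either side of each target term. With positional=False
        # (the starred W(L1R1*) variant), the word's position is
        # not recorded in the relation type.
        instances = []
        for i, term in enumerate(tokens):
            for offset, label in ((-1, "L1"), (1, "R1")):
                j = i + offset
                if 0 <= j < len(tokens):
                    rel = label if positional else "W"
                    instances.append((term, rel, tokens[j]))
        return instances

    # window_contexts("the dog walked home".split()) yields, e.g.,
    # ('dog', 'L1', 'the') and ('dog', 'R1', 'walked').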
The more complex methods extract grammatical relations using shallow statistical tools or a broad coverage parser. We use the grammatical relations extracted from the parse trees of Lin's broad coverage principle-based parser, MINIPAR (Lin, 1998a), and Abney's cascaded finite-state parser, CASS (Abney, 1996). Finally, we have implemented our own relation extractor, based on Grefenstette's SEXTANT (Grefenstette, 1994), which we describe below as an example of the NLP systems used to extract relations from the raw text.

Processing begins with POS tagging and NP/VP chunking using a Naïve Bayes classifier trained on the Penn Treebank. Noun phrases separated by prepositions and conjunctions are then concatenated, and the relation attaching algorithm is run on the sentence. This involves four passes over the sentence, associating each noun with the modifiers and verbs from the syntactic contexts they appear in:

1. nouns with pre-modifiers (left to right)
2. nouns with post-modifiers (right to left)
3. verbs with subjects/objects (right to left)
4. verbs with subjects/objects (left to right)

This results in relations representing the contexts:

1. term is the subject of a verb
2. term is the (direct/indirect) object of a verb
3. term is modified by a noun or adjective
4. term is modified by a prepositional phrase

The relation tuple is then converted to root form using the Sussex morphological analyser (Minnen et al., 2000) and the POS tags are stripped. The relations for each term are collected together produc-

Corpus                    Sentences   Words
British National Corpus   6.2M        114M
Reuters Corpus Vol 1      8.7M        193M

Table 1: Training Corpora Statistics

… MIXTURE ranking based on the mean score for each term. The individual extractor scores are not normalised because each extractor uses the same similarity measure and weight function.

We assigned a rank of 201 and similarity score of zero to terms that did not appear in the 200 synonyms returned by the individual extractors. Finally, we build ensembles from all the available extractor methods (e.g. MEAN(∗)) and the top three performing extractors (e.g. MEAN(3)).

To measure the complementary disagreement between ensemble constituents we calculated both the complementarity C and the Spearman rank-order correlation Rs:

C(A, B) = \left(1 - \frac{|\mathrm{errors}(A) \cap \mathrm{errors}(B)|}{|\mathrm{errors}(A)|}\right) \times 100\%    (1)

R_s(A, B) = \frac{\sum_i \left(r(A_i) - \overline{r(A)}\right)\left(r(B_i) - \overline{r(B)}\right)}{\sqrt{\sum_i \left(r(A_i) - \overline{r(A)}\right)^2}\,\sqrt{\sum_i \left(r(B_i) - \overline{r(B)}\right)^2}}    (2)

where r(x) is the rank of synonym x. The Spearman rank-order correlation coefficient is the linear correlation coefficient between the rankings of elements of A and B. Rs is a useful non-parametric comparison for when the rank order is more relevant than the actual values in the distribution.
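The following sketch implements equations (1) and (2) over the top-200 synonym lists returned by two extractors, using the rank-201 convention described above for missing terms; the helper names are ours, not the paper's.

    import math

    def complementarity(errors_a, errors_b):
        # Equation (1): percentage of A's errors that B does not share.
        return (1.0 - len(errors_a & errors_b) / len(errors_a)) * 100.0

    def rank_vector(synonyms, vocabulary, missing_rank=201):
        # Rank of each vocabulary term in a top-200 synonym list;
        # terms absent from the list receive rank 201.
        position = {term: i + 1 for i, term in enumerate(synonyms)}
        return [position.get(term, missing_rank) for term in vocabulary]

    def spearman(ranks_a, ranks_b):
        # Equation (2): linear correlation between two rank vectors.
        mean_a = sum(ranks_a) / len(ranks_a)
        mean_b = sum(ranks_b) / len(ranks_b)
        num = sum((a - mean_a) * (b - mean_b)
                  for a, b in zip(ranks_a, ranks_b))
        dev_a = math.sqrt(sum((a - mean_a) ** 2 for a in ranks_a))
        dev_b = math.sqrt(sum((b - mean_b) ** 2 for b in ranks_b))
        return num / (dev_a * dev_b) if dev_a and dev_b else 0.0

Because both measures operate on ranks rather than raw similarity scores, they are unaffected by the decision not to normalise the individual extractor scores.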