
FROM N-GRAMS TO COLLOCATIONS: AN EVALUATION OF XTRACT

Frank A. Smadja
Department of Computer Science, Columbia University
New York, NY 10027

Abstract

In previous papers we presented methods for retrieving collocations from large samples of texts. We described a tool, Xtract, that implements these methods and is able to retrieve a wide range of collocations in a two-stage process. These methods, as well as other related methods, however have some limitations. Mainly, the produced collocations do not include any kind of functional information and many of them are invalid. In this paper we introduce methods that address these issues. These methods are implemented in an added third stage to Xtract that examines the set of collocations retrieved during the previous two stages to both filter out a number of invalid collocations and add useful syntactic information to the retained ones. By combining parsing and statistical techniques, the addition of this third stage has raised the overall precision level of Xtract from 40% to 80%, with a recall of 94%. In the paper we describe the methods and the evaluation experiments.

1 INTRODUCTION

In the past, several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. Pairwise associations (bigrams or 2-grams) (e.g., [Smadja, 1988], [Church and Hanks, 1989]) as well as n-word (n > 2) associations (or n-grams) (e.g., [Choueka et al., 1983], [Smadja and McKeown, 1990]) were retrieved. These techniques automatically produced large numbers of collocations along with statistical figures intended to reflect their relevance. However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations reflecting some spurious aspect of the training corpus that did not stand for true collocations. This paper addresses these two problems.

Previous papers (e.g., [Smadja and McKeown, 1990]) introduced a set of techniques and a tool, Xtract, that produces various types of collocations from a two-stage statistical analysis of large textual corpora, briefly sketched in the next section. In Sections 3 and 4, we show how robust parsing technology can be used both to filter out a number of invalid collocations and to add useful syntactic information to the retained ones. This filter/analyzer is implemented in a third stage of Xtract that automatically goes over the output collocations to reject the invalid ones and label the valid ones with syntactic information. For example, if the first two stages of Xtract produce the collocation "make-decision," the goal of this third stage is to identify it as a verb-object collocation. If no such syntactic relation is observed, then the collocation is rejected. In Section 5 we present an evaluation of Xtract as a collocation retrieval system. The addition of the third stage has been evaluated to raise the precision of Xtract from 40% to 80%, with a recall of 94%. Throughout the paper we use examples related to the word "takeover" from a 10 million word corpus containing stock market reports originating from the Associated Press newswire.

2 FIRST 2 STAGES OF XTRACT, PRODUCING N-GRAMS

In a first stage, Xtract uses statistical techniques to retrieve pairs of words (or bigrams) whose common appearances within a single sentence are correlated in the corpus. A bigram is retrieved if its frequency of occurrence is above a certain threshold and if the two words are used in relatively rigid ways. Some bigrams produced by the first stage of Xtract are given in Table 1: the bigrams all contain the word "takeover" and an adjective. In the table, the distance parameter indicates the usual distance between the two words. For example, distance = 1 indicates that the two words are frequently adjacent in the corpus.

In a second stage, Xtract uses the output bigrams to produce collocations involving more than two words (or n-grams). It examines all the sentences containing the bigram and analyzes the statistical distribution of words and parts of speech for each position around the pair. It retains words (or parts of speech) occupying a position with probability greater than a given threshold.

For example, the bigram "average-industrial" produces the n-gram "the Dow Jones industrial average," since the words are always used within this phrase in the training corpus. Example outputs of the second stage of Xtract are given in Figure 1. In the figure, the numbers on the left indicate the frequency of the n-grams in the corpus, NN indicates that a noun is expected at this position, AT indicates that an article is expected, NP stands for a proper noun, and VBD stands for a verb in the past tense. See [Smadja and McKeown, 1990] and [Smadja, 1991] for more details on these two stages.

Table 1: Output of Stage 1

wi            wj          distance
hostile       takeovers   1
hostile       takeover    1
corporate     takeovers   1
hostile       takeovers   2
unwanted      takeover    1
potential     takeover    1
unsolicited   takeover    1
unsuccessful  takeover    1
friendly      takeover    1
takeover      expensive   2
takeover      big         4
big           takeover    1
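To make the two statistical stages more concrete, the following sketch illustrates the kind of counting they involve. It is only a minimal illustration, not Xtract's actual implementation: the window size, the two thresholds, and the representation of sentences as lists of tokens are all assumptions made for this example.

    from collections import Counter, defaultdict

    WINDOW = 5            # assumed maximum distance considered between the two words
    FREQ_THRESHOLD = 30   # assumed Stage 1 frequency cutoff
    POS_THRESHOLD = 0.75  # assumed Stage 2 cutoff on positional probability

    def stage1_bigrams(sentences, seed):
        """Stage 1 (sketch): count words co-occurring with `seed` at each distance."""
        counts = Counter()
        for tokens in sentences:
            for i, word in enumerate(tokens):
                if word != seed:
                    continue
                for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
                    if j != i:
                        counts[(tokens[j], j - i)] += 1
        # Keep (word, distance) pairs that are frequent enough; rigid usage shows
        # up as one distance dominating the counts for a given word.
        return {pair: c for pair, c in counts.items() if c >= FREQ_THRESHOLD}

    def stage2_ngram(sentences, w1, w2, distance):
        """Stage 2 (sketch): find words that reliably occupy positions around the pair."""
        by_offset = defaultdict(Counter)
        matches = 0
        for tokens in sentences:
            for i, word in enumerate(tokens):
                if word == w1 and 0 <= i + distance < len(tokens) and tokens[i + distance] == w2:
                    matches += 1
                    for j, other in enumerate(tokens):
                        by_offset[j - i][other] += 1
        slots = {}
        for offset, counter in by_offset.items():
            word, count = counter.most_common(1)[0]
            if count / matches > POS_THRESHOLD:
                slots[offset] = word  # this position is filled by `word` most of the time
        return slots

In the real system the positional analysis is carried out over both words and parts of speech, and Stage 1 also tests how concentrated the distance distribution is; the sketch keeps only the counting skeleton of the two stages.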

3 STAGE THREE: SYNTACTICALLY LABELING COLLOCATIONS

In the past, Debili [Debili, 1982] parsed corpora of French texts to identify non-ambiguous predicate argument relations. He then used these relations for disambiguation in parsing. Since then, the advent of robust parsers such as Cass [Abney, 1990] and Fidditch [Hindle, 1983] has made it possible to process large amounts of text with good performance. This enabled Hindle and Rooth [Hindle and Rooth, 1990] to improve on Debili's work by using bigram statistics to enhance the task of prepositional phrase attachment. Combining statistical and parsing methods has also been done by Church and his colleagues. In [Church et al., 1989] and [Church et al., 1991] they consider predicate argument relations in the form of questions such as "What does a boat typically do?" They preprocess a corpus with the Fidditch parser in order to statistically analyze the distribution of the predicates used with a given argument such as "boat."

Our goal is different, since we analyze a set of collocations automatically produced by Xtract to either enrich them with syntactic information or reject them. For example, if a bigram collocation produced by Xtract involves a noun and a verb, the role of Stage 3 of Xtract is to determine whether it is a subject-verb or a verb-object collocation. If no such relation can be identified, then the collocation is rejected. This section presents the algorithm for Xtract Stage 3 in some detail. For illustrative purposes we use the example words takeover and thwart with a distance of 2.

3.1 DESCRIPTION OF THE ALGORITHM

Input: A bigram with some distance information indicating the most probable distance between the two words. For example, takeover and thwart with a distance of 2.

Output/Goal: Either a syntactic label for the bigram or a rejection. In the case of takeover and thwart the collocation is accepted and its produced label is VO for verb-object.

The algorithm works in the following 3 steps.

3.1.1 Step 1: PRODUCE TAGGED CONCORDANCES

All the sentences in the corpus that contain the two words in this given position are produced. This is done with a concordancing program which is part of Xtract (see [Smadja, 1991]). The sentences are labeled with part of speech information by preprocessing the corpus with an automatic stochastic tagger; we use the part of speech tagger described in [Church, 1988], developed at Bell Laboratories by Ken Church.

3.1.2 Step 2: PARSE THE SENTENCES

Each sentence is then processed by Cass, a bottom-up incremental parser [Abney, 1990]. Cass (Cascaded Analysis of Syntactic Structure) was developed at Bell Communications Research by Steve Abney; I am grateful to Steve Abney for helping us use and customize Cass for this work. Cass takes input sentences labeled with part of speech and attempts to identify syntactic structure. One of the Cass modules identifies predicate argument relations. We use this module to produce binary syntactic relations (or labels) such as "verb-object" (VO), "subject-verb" (SV), "noun-adjective" (NJ), and "noun-noun" (NN). Consider Sentence (1) below and the labels produced by Cass on it.

(1) "Under the recapitalization plan it proposed to thwart the takeover."

label   bigram
SV      it proposed
NN      recapitalization plan
VO      thwart takeover

For each sentence in the concordance set, Xtract determines from the output of Cass the syntactic relation of the two words among VO, SV, NJ, and NN, and assigns this label to the sentence. If no such relation is observed, Xtract associates the label U (for undefined) with the sentence. We note label[i] the label associated with Sentence i. For example, the label for Sentence (1) is: label[1] = VO.
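As an illustration of Steps 1 and 2, the sketch below retrieves the concordance for a bigram and maps the relations found by a parser onto a per-sentence label. It is a minimal sketch only: the (label, word, word) triples it consumes are an assumed stand-in for the output of Cass, whose actual interface is not described here.

    VALID_LABELS = {"VO", "SV", "NJ", "NN"}

    def concordance(sentences, w1, w2, distance):
        """Step 1 (sketch): all sentences containing w1 and w2 at the given distance."""
        hits = []
        for tokens in sentences:  # each sentence is assumed to be a list of tokens
            for i, word in enumerate(tokens):
                if word == w1 and 0 <= i + distance < len(tokens) and tokens[i + distance] == w2:
                    hits.append(tokens)
                    break
        return hits

    def label_sentence(relations, w1, w2):
        """Step 2 (sketch): derive the label of one sentence from its parsed relations.

        `relations` is an assumed format: a list of (label, word_a, word_b) triples
        such as ("VO", "thwart", "takeover") produced by a predicate-argument module.
        """
        for label, a, b in relations:
            if label in VALID_LABELS and {a, b} == {w1, w2}:
                return label
        return "U"  # undefined: no syntactic relation observed between the two words

For Sentence (1), a parse containing the triple ("VO", "thwart", "takeover") would yield label[1] = VO.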

681  ... takeover bid ...
310  ... takeover offer ...
258  ... takeover attempt ...
177  ... takeover battle ...
154  ... NN NN takeover defense ...
153  ... takeover target ...
119  ... a possible takeover NN ...
118  ... takeover law ...
109  ... takeover rumors ...
102  ... takeover speculation ...
84   ... takeover strategist ...
69   ... AT takeover fight ...
62   ... corporate takeover ...
50   ... takeover proposals ...
40   ... Federated's poison pill takeover defense ...
33   ... NN VBD a sweetened takeover offer from NP ...

Figure 1: Some n-grams containing "takeover"

3.1.3 Step 3: REJECT OR LABEL COLLOCATION

This last step consists of deciding on a label for the bigram from the set of label[i]'s. For this, we count the frequency of each label for the bigram and perform a statistical analysis of this distribution. A collocation is accepted if the two seed words are consistently used with the same syntactic relation. More precisely, the collocation is accepted if and only if there is a label L, different from U, satisfying the following inequation:

    probability(label[i] = L) > T

in which T is a given threshold to be determined by the experimenter. A collocation is thus rejected if no valid label satisfies the inequation or if U satisfies it.
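The decision rule is short enough to state directly in code. The sketch below is a minimal rendering of the inequation above, assuming the per-sentence labels from Step 2 are available as a list; the default threshold of 80% is the value used for the examples in Figure 2.

    from collections import Counter

    def decide(labels, threshold=0.80):
        """Step 3 (sketch): return the accepted syntactic label, or None to reject.

        `labels` is the list of per-sentence labels, e.g. ["VO", "VO", "U", "VO", ...].
        """
        counts = Counter(labels)
        total = len(labels)
        for label, count in counts.most_common():
            if label != "U" and count / total > threshold:
                return label  # accepted: this label is consistent enough
        return None  # rejected: no valid label passes the threshold (or U dominates)

For the takeover/thwart bigram, a concordance dominated by VO parses returns "VO"; a bigram whose sentences mostly receive U is rejected.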

Figure 2 lists some accepted collocations in the format produced by Xtract with their syntactic labels. For these examples, the threshold T was set to 80%. For each collocation, the first line is the output of the first stage of Xtract: the seed bigram with the distance between the two words. The second line is the output of the second stage of Xtract: a multiple word collocation (or n-gram). The numbers on the left indicate the frequency of occurrence of the n-gram in the corpus. The third line indicates the syntactic label as determined by the third stage of Xtract. Finally, the last lines simply list an example sentence and the position of the collocation in the sentence. Such collocations can then be used for various purposes including lexicography, spelling correction, and language generation. In [Smadja and McKeown, 1990] and [Smadja, 1991] we describe how they are used to build a lexicon for language generation in the domain of stock market reports.

4 A LEXICOGRAPHIC EVALUATION

The third stage of Xtract can thus be considered as a retrieval system which retrieves valid collocations from a set of candidates. This section describes an evaluation experiment of the third stage of Xtract as a retrieval system. Evaluation of retrieval systems is usually done with the help of two parameters: precision and recall [Salton, 1989]. Precision of a retrieval system is defined as the ratio of retrieved valid elements divided by the total number of retrieved elements; it measures the quality of the retrieved material. Recall is defined as the ratio of retrieved valid elements divided by the total number of valid elements; it measures the effectiveness of the system. This section presents an evaluation of the retrieval performance of the third stage of Xtract.

4.1 THE EVALUATION EXPERIMENT

Deciding whether a given word combination is a valid or invalid collocation is actually a difficult task that is best done by a lexicographer. Jeffery Triggs is a lexicographer working for the Oxford English Dictionary (OED), coordinating the North American Readers program of the OED at Bell Communications Research. Jeffery Triggs agreed to manually go over several thousand collocations; I am grateful to him, as his professionalism and kindness helped me understand some of the difficulty of lexicography, and without him this evaluation would not have been possible.

We randomly selected a subset of about 4,000 collocations that contained the information compiled by Xtract after the first 2 stages. This data set was then the subject of the following experiment. We gave the 4,000 collocations to the lexicographer, asking him to select the ones that he would consider for a domain specific dictionary and to cross out the others.
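The two measures used in this section reduce to two ratios, restated below as a small computation. The function is only a sketch, and the counts in the example call are hypothetical numbers chosen to show the calling convention, not the experimental figures reported later in this section.

    def precision_recall(retrieved_valid, retrieved_total, valid_total):
        """Precision: valid retrieved / all retrieved.  Recall: valid retrieved / all valid."""
        precision = retrieved_valid / retrieved_total
        recall = retrieved_valid / valid_total
        return precision, recall

    # Hypothetical counts, for illustration only: 1,000 collocations accepted by
    # Stage 3, of which 800 are judged valid, out of 850 valid collocations overall.
    print(precision_recall(800, 1000, 850))  # -> (0.8, 0.941...)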

takeover bid -1
681 ... takeover bid IN ...
Syntactic Label: NN
10 11  An investment partnership on Friday offered to sweeten its takeover bid for Gencorp Inc.

takeover fight -1
69 ... AT takeover fight IN ...
Syntactic Label: NN
10 11  Later last year Hanson won a hostile 3.9 billion takeover fight for Imperial Group the giant British food tobacco and brewing conglomerate and raised more than 1.4 billion pounds from the sale of Imperial's Courage brewing operation and its leisure products businesses.

takeover thwart 2
44 ... to thwart AT takeover NN ...
Syntactic Label: VO
13 11  The 48.50 a share offer announced Sunday is designed to thwart a takeover bid by GAF Corp.

takeover make 2
68 ... MD make a takeover NN . JJ ...
Syntactic Label: VO
14 12  Meanwhile the North Carolina Senate approved a bill Tuesday that would make a takeover of North Carolina based companies more difficult and the House was expected to approve the measure before the end of the week.

takeover related -1
59 ... takeover related ...
Syntactic Label: SV
2 3  Among takeover related issues Kidde jumped 2 to 66.

Figure 2: Some examples of collocations with "takeover"

[Two bar charts: on the left, the breakdown of the manual categories YY, Y, and N into T and U; on the right, the breakdown of the automatic categories T and U into YY, Y, and N.]
Figure 3: Overlap of the manual and automatic evaluations

The lexicographer came up with three simple tags: YY, Y, and N. Both Y and YY are good collocations, and N are bad collocations. The difference between YY and Y is that Y collocations are of better quality than YY collocations. YY collocations are often too specific to be included in a dictionary, or some words are missing, etc. After Stage 2, about 20% of the collocations are Y, about 20% are YY, and about 60% are N. This told us that the precision of Xtract at Stage 2 was only about 40%.

Although this would seem like a poor precision, one should compare it with the much lower rates currently in practice in lexicography. For the OED, for example, the first stage roughly consists of reading numerous documents to identify new or interesting expressions. This task is performed by professional readers. For the OED, the readers for the American program alone produce some 10,000 expressions a month. These lists are then sent off to the dictionary and go through several rounds of careful analysis before actually being submitted to the dictionary. The ratio of proposed candidates to good candidates is usually low. For example, out of the 10,000 expressions proposed each month, fewer than 400 are serious candidates for the OED, which represents a current rate of 4%. Automatically producing lists of candidate expressions could actually be of great help to lexicographers, and even a precision of 40% would be helpful. Such lexicographic tools could, for example, help readers retrieve sublanguage specific expressions by providing them with lists of candidate collocations. The lexicographer then manually examines the list to remove the irrelevant data. Even low precision is useful for lexicographers, as manual filtering is much faster than manual scanning of the documents [Marcus, 1990]. Such techniques are not able to replace readers, though, as they are not designed to identify low frequency expressions, whereas a human reader immediately identifies interesting expressions with as few as one occurrence.

The second stage of this experiment was to use Xtract Stage 3 to filter out and label the sample set of collocations. As described in Section 3, there are several valid labels (VO, SV, NN, etc.). In this experiment, we grouped them under a single label: T. There is only one non-valid label: U (for unlabeled). A T collocation is thus accepted by Xtract Stage 3, and a U collocation is rejected. The results of the use of Stage 3 on the sample set of collocations are similar to the manual evaluation in terms of numbers: about 40% of the collocations were labeled (T) by Xtract Stage 3, and about 60% were rejected (U).

Figure 3 shows the overlap of the classifications made by Xtract and the lexicographer. In the figure, the first diagram on the left represents the breakdown into T and U of each of the manual categories (Y, YY, and N). The diagram on the right represents the breakdown into Y, YY, and N of the T and U categories. For example, the first column of the diagram on the left represents the application of Xtract Stage 3 on the YY collocations. It shows that 94% of the collocations accepted by the lexicographer were also accepted by Xtract. In other words, this means that the recall of the third stage of Xtract is 94%. The first column of the diagram on the right represents the lexicographic evaluation of the collocations automatically accepted by Xtract. It shows that about 80% of the T collocations were accepted by the lexicographer and that about 20% were rejected. This shows that precision was raised from 40% to 80% with the addition of Xtract Stage 3. In summary, these experiments allowed us to evaluate Stage 3 as a retrieval system. The results are:

Precision = 80%    Recall = 94%

5 SUMMARY AND CONTRIBUTIONS

In this paper, we described a new set of techniques for syntactically filtering and labeling collocations. Using such techniques for post-processing the set of collocations produced by Xtract has two major results. First, it adds syntax to the collocations, which is necessary for computational use. Second, it provides a considerable improvement to the quality of the retrieved collocations, as the precision of Xtract is raised from 40% to 80% with a recall of 94%.

By combining statistical techniques with a sophisticated robust parser we have been able to design and implement some original techniques for the automatic extraction of collocations. Results so far are very encouraging and they indicate that more effort should be made at combining statistical techniques with more symbolic ones.
ACKNOWLEDGMENTS

The research reported in this paper was partially supported by DARPA grant N00039-84-0165, by NSF grant IRT-84-51438 and by ONR grant N00014-89-J-1782. Most of this work was also done in collaboration with Bell Communications Research, 445 South Street, Morristown, NJ 07960-1910. I wish to express my thanks to Kathy McKeown for her comments on the research presented in this paper. I also wish to thank Dorée Seligmann and Michael Elhadad for the time they spent discussing this paper and other topics with me.

References

[Abney, 1990] S. Abney. Rapid Incremental Parsing with Repair. In Waterloo Conference on Electronic Text Research, 1990.

[Choueka et al., 1983] Y. Choueka, T. Klein, and E. Neuwitz. Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal for Literary and Linguistic Computing, 4:34-38, 1983.

[Church and Hanks, 1989] K. Church and P. Hanks. Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th Meeting of the ACL, pages 76-83. Association for Computational Linguistics, 1989. Also in Computational Linguistics, vol. 16.1, March 1990.

[Church et al., 1989] K. W. Church, W. Gale, P. Hanks, and D. Hindle. Parsing, Word Associations and Typical Predicate-Argument Relations. In Proceedings of the International Workshop on Parsing Technologies, pages 103-112, Carnegie Mellon University, Pittsburgh, PA, 1989. Also appears in Masaru Tomita (ed.), Current Issues in Parsing Technology, pp. 103-112, Kluwer Academic Publishers, Boston, MA, 1991.

[Church et al., 1991] K. W. Church, W. Gale, P. Hanks, and D. Hindle. Using Statistics in Lexical Analysis. In Uri Zernik, editor, Lexical Acquisition: Using On-line Resources to Build a Lexicon. Lawrence Erlbaum, 1991. In press.

[Church, 1988] K. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas, 1988.

[Debili, 1982] F. Debili. Analyse Syntactico-Sémantique Fondée sur une Acquisition Automatique de Relations Lexicales Sémantiques. PhD thesis, Paris XI University, Orsay, France, 1982. Thèse de Doctorat d'État.

[Hindle and Rooth, 1990] D. Hindle and M. Rooth. Structural Ambiguity and Lexical Relations. In DARPA Speech and Natural Language Workshop, Hidden Valley, PA, June 1990.

[Hindle, 1983] D. Hindle. User Manual for Fidditch, a Deterministic Parser. Technical Memorandum 7590-142, Naval Research Laboratory, 1983.

[Marcus, 1990] M. Marcus. Tutorial on Tagging and Processing Large Textual Corpora. Presented at the 28th Annual Meeting of the ACL, June 1990.

[Salton, 1989] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, NY, 1989.

[Smadja and McKeown, 1990] F. Smadja and K. McKeown. Automatically Extracting and Representing Collocations for Language Generation. In Proceedings of the 28th Annual Meeting of the ACL, Pittsburgh, PA, June 1990. Association for Computational Linguistics.

[Smadja, 1988] F. Smadja. Lexical Co-occurrence, The Missing Link in Language Acquisition. In Program and Abstracts of the 15th International ALLC Conference of the Association for Literary and Linguistic Computing, Jerusalem, Israel, June 1988.

[Smadja, 1991] F. Smadja. Retrieving Collocational Knowledge from Textual Corpora. An Application: Language Generation. PhD thesis, Computer Science Department, Columbia University, New York, NY, April 1991.
