Comparative Evaluation of Collocation Extraction Metrics

Aristomenis Thanopoulos, Nikos Fakotakis, George Kokkinakis

Wire Communications Laboratory, Electrical & Computer Engineering Dept., University of Patras, 265 00 Rion, Patras, Greece {aristom,fakotaki,gkokkin}@wcl.ee.upatras.gr

Abstract

Corpus-based automatic extraction of collocations is typically carried out by employing some statistic indicating co-occurrence in order to identify words that co-occur more often than expected by chance. In this paper we are concerned with some typical measures, such as the t-score, Pearson’s χ-square test, the log-likelihood ratio, pointwise mutual information, and a novel information-theoretic measure, namely mutual dependency. Apart from some theoretical discussion about their correlation, we perform comparative evaluation experiments, judging performance by the ability of each measure to identify lexically associated bigrams. We use two different gold standards: WordNet and lists of named entities. Besides discovering that a frequency-biased version of mutual dependency performs best, followed closely by the likelihood ratio, we point out some implications that the usage of available electronic dictionaries such as WordNet for the evaluation of collocation extraction encompasses.

1. Introduction

Collocational information is important not only for second language learning but also for many natural language processing tasks. Specifically, in natural language generation and machine translation it is necessary to ensure the generation of lexically correct expressions; for example, “strong”, unlike “powerful”, modifies “coffee” but not “computers”. On the other hand, in automatic construction of thesauri and ontologies, identification of multiwords representing characteristic entities and concepts of the domain, such as “General Motors” or “general relativity”, is an obvious necessity. These two types of collocations are quite different in terms of both compositionality and offset (signed distance between constituents in text). In this paper we are concerned with the latter type of collocations, that is, multiwords whose meaning is not compositionally derivable.

Collocations are abundant in language and vary significantly in terms of length, syntactic patterns and offset. They are also domain-dependent and language-dependent; therefore their automatic extraction from domain-specific corpora is of high importance as far as the portability of NLP systems is concerned. Therefore, in this paper we focus on purely corpus-based automatic extraction of collocations, assuming that other available knowledge sources, such as those in (Justeson and Katz, 1995) or (Pearce, 2001), can be employed additionally.

2. Corpus-based Collocation Extraction

Collocations are recurrent in texts and express lexical selectivity. Therefore two or more words that co-occur in text corpora (much) more often than expected by chance (most) possibly constitute a collocation. In order to test this hypothesis of dependence, several metrics have been adopted by the community. Typical statistics are the t-score (TSC), Pearson’s χ-square test (χ2), the log-likelihood ratio (LLR) and pointwise mutual information (PMI). However, usually no systematic comparative evaluation is accomplished. For example, Kita et al. (1993) accomplish partial and intuitive comparisons between metrics, while Smadja (1993) resorts to lexicographic evaluation of his approach. It is obvious, however, that an objective and automated evaluation process offers repeatability, allowing for comparison among different collocation extraction measures and techniques.

The aforementioned metrics estimate lexical selectivity on bigrams. Smadja (1993) proposes an algorithm for joining significant bigrams together to construct significant n-grams. However, since his target task is NLG and he is therefore interested in collocations that are not necessarily non-compositional, he keeps the larger produced n-grams, while conceptually-oriented tasks, such as automatic construction of ontologies and thesauri, require identification of the minimal semantic constituents. For example, “the president of United States” is a collocation, but it can be decomposed into two minimal lexico-semantic units whose straightforward combination produces its meaning: “president” and “United States”. However, extraction of significant bigrams is an important task for both applications. In order to eliminate the factor of offset and to exploit available resources for evaluation, we focus on sequential collocations (multi-words).

An objective set of multi-words is necessary, as a “gold standard”, for the comparative experiments. Unfortunately, complete machine-readable databases of collocations are not widely available, even for English. On-line lexical resources, such as WordNet (Miller, 1990), contain such information, although incomplete (Roark and Charniak, 1998) and not pure, as we will note. Obvious, easily obtained and explicitly non-compositional multi-words are also entity names, such as location and organisation names. Terminology lists can also be used in the case of restricted-domain corpora.

3. Pairwise significance measures

Since multiwords can be quite long (e.g. “Default Proof Credit Card System Inc”), straightforwardly applying statistics on n-gram counts is computationally prohibitive. The only feasible strategy for extraction of multiwords from large corpora is to initially extract a set of significant bigrams (according to a measure of pairwise dependence) and then, connecting significant bigrams together, to calculate the statistical significance of the larger strings (Smadja, 1993). In order to simplify the task of comparison among measures of significance, the present study is confined to the first stage, i.e. the extraction of significant bigrams.

The most simplistic approach to collocation extraction is to exploit the property of recurrency and therefore to extract the most frequent word co-occurrences. However, this has the obvious drawback that the a priori word frequencies are not taken into account; therefore such collocations are often fully compositional and thus of no lexical interest (e.g. “analyst said”, “three years”, etc.). Therefore, measures of dependence taking word probabilities into account (using maximum likelihood estimators) have been widely employed for collocation extraction. In the following sub-sections we define some of the most prominent measures (see (Manning and Schütze, 1999) for more details), we discuss some of their prominent characteristics, and we propose a few more measures.

3.1. Information theoretic measures

In one of the premier studies in automatic corpus-based collocation extraction, Church and Hanks (1990) proposed the association ratio, a metric based on the information-theoretic concept of mutual information, and specifically on pointwise mutual information (PMI), which is defined as:

I(w1,w2) = log2 [ P(w1w2) / (P(w1)·P(w2)) ]

However, PMI is actually a measure of independence rather than of dependence (Manning and Schütze, 1999). Considering perfectly dependent bigrams, i.e. P(w1) = P(w2) = P(w1w2), we obtain I(w1,w2) = −log2(P(w1)), which shows that PMI exhibits a preference for rare events, because they a priori contain a higher amount of information than frequent events. This suggests that dependence can actually be identified by subtracting from PMI the information that the whole event bears, which is its self-information (Gallager, 1968): I(x) = −log(P(x)). We call the proposed measure Mutual Dependency (MD):

D(w1,w2) = I(w1,w2) − I(w1w2) = log2 [ P²(w1w2) / (P(w1)·P(w2)) ]

which is obviously maximized for perfectly dependent bigrams, without dependence on their frequency. However, intuition suggests that a slight bias towards frequency can be beneficial, reflecting statistical confidence; that is, among similarly dependent bigrams the most frequent ones should be favored. Therefore we also tested a few such measures (for example combining MD with the t-score), of which Log-Frequency biased MD (LFMD) proved experimentally the most satisfactory:

DLF(w1w2) = D(w1,w2) + log2 P(w1w2)

3.2. Hypothesis testing

From a statistical point of view, our problem can be expressed as determining whether a word co-occurrence indicates lexical correlation or is due to chance. The latter case, which constitutes the null hypothesis, is that the considered words w1 and w2 are generated independently in the corpus, and thus their co-occurrence probability can be estimated as P(w1w2) = P(w1)·P(w2). Word probabilities are calculated using maximum likelihood estimators (MLE).

The most widely used statistical tests employed to estimate the divergence of the observed co-occurrence frequency from the one predicted by the null hypothesis are the t-score (TSC), Pearson’s χ-square test (XSQ) and the log-likelihood ratio (LLR). The null hypothesis can be rejected with a certain confidence when the used statistic surpasses a certain threshold.

3.2.1. T-score

The t statistic is defined as:

t = (x̄ − μ) / √(s²/N)

where x̄ is the sample mean, s² is the sample variance, N is the sample size and μ is the mean of the distribution that the generation of words is assumed to follow. If the t statistic is large enough the null hypothesis can be rejected. The t-test is typically calculated supposing a normal distribution. That is, for a bigram uv we have x̄ = P(uv), μ = P(u)·P(v) and s² = P(uv)·(1 − P(uv)) ≈ P(uv). In terms of frequency counts, we have:

t(u,v) = (fuv − fu·fv/N) / √fuv

Then the relation between the t-score and mutual information can be easily derived:

t = √fuv · (1 − 2^(−I(u,v)))

This formula reveals that even loosely related bigrams (e.g. PMI ≈ 3) are ranked roughly according to their frequency, which indicates that the t-score is rather frequency-biased.

3.2.2. Pearson’s χ-square test

Pearson’s χ-square test bypasses the arbitrary assumption of normality employed in the calculation of the t-test. It sums the squared differences between the expected and observed frequencies, scaled by the expected frequencies, over all combinations of occurrence or non-occurrence of the words under consideration:

χ² = (fuv − fu·fv/N)²/(fu·fv/N) + (fūv − fū·fv/N)²/(fū·fv/N) + (fuv̄ − fu·fv̄/N)²/(fu·fv̄/N) + (fūv̄ − fū·fv̄/N)²/(fū·fv̄/N)

where uv represents the sequence of events u and v, ū the event (word or bigram) not(u), and N the sample size. In the case of statistical data from large corpora, the frequency of non-occurrence is always many orders of magnitude higher than the frequency of occurrence, and therefore the three latter terms are insignificant compared to the first, which is actually a different formulation of the MD measure. Indeed, the ranking lists of the two measures are quite similar.

3.2.3. Likelihood ratio

A widely accepted measure of statistical significance is the logarithm of the ratio between the likelihoods of the hypotheses of dependence and independence (LLR):

−2 log λ = 2 · log [ L(H1) / L(H0) ]

where L(H) is the likelihood of hypothesis H based on the observed data, assuming that the occurrence of words follows a binomial distribution (Dunning, 1993), which is a more plausible assumption than the normality assumption.

4. Experimental evaluation

We conducted comparative evaluation experiments on the discussed measures using the English language, for the obvious reason of availability of both training text corpora and, mainly, lexical resources for automatic evaluation. Since the involved linguistic analysis is minimal, we deem that our results and conclusions are valid for any language.

Statistical data were acquired from the Wall Street Journal corpus (WSJC) of 1988, comprised of about 11 million content words. Since sentence boundary detection is, although not trivial, a rather solved NLP task, and in the WSJC sentence boundaries are annotated, we considered only intra-sentential co-occurrences. Moreover, we eliminated bigrams containing functional words or numbers and bigrams appearing fewer than 3 times in the corpus. Therefore our statistical data consist of roughly 4.4 million bigram occurrences, that is, 800,000 distinct bigrams were considered.

The 30-best scoring bigrams for each measure are shown in Table 1. We can see that the t-score produces exactly the same hits (ranked slightly differently) as plain frequency, some of which are compositional and uninteresting (e.g. “company said”, “says mr”), while LFMD and LLR produce some interesting and important collocations (e.g. “merrill lynch”, “los angeles”, “dow jones”). The PMI and MD (or χ2) 30-best lists contain exclusively named-entity bigrams (for all of them, fu = fv = fuv); the former, unlike the latter, ranks the low-frequency bigrams higher.

For the automatic evaluation we employed two completely unrelated gold standards. The first is WordNet (Miller, 1990), one of the most copious lexical resources in electronic form for English. WordNet contains semantic relations between lexical entities representing entities and concepts, and therefore this set of lexical entities was used as a “gold standard”. Since we study bigram dependency measures, we kept only lexical entities comprised of two words. We did not extract bigrams from WordNet n-gram entities with n>2, because in many cases they are analytical descriptions of synsets or semantic categories (e.g. “capital_of_the_United_Kingdom”, “ABO_blood_group_system”) rather than lexical collocations.

The second source is comprised of named entities which appear in abundance in American newswire texts and were gathered mainly from the Internet for this specific purpose. Namely, they are 5700 US cities (2000 of them periphrastic), 11000 US companies and 11000 person names which recurrently occur in newswire texts in the domains of politics, business, sports, arts, etc. In this case we maintained all extracted bigrams, excluding only words which are designators of the respective category (e.g. city, corp, company, ltd, sir, etc.) and whose contribution to the meaning of the phrase is therefore compositional. In total there were 23000 distinct bigrams, 5400 of which occur in our corpus.
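To make the definitions of Section 3 concrete, the sketch below computes all six statistics from raw counts. This is our own illustration, not the authors' code; the argument names fu, fv, fuv (unigram and bigram frequencies) and N (total number of bigram tokens) are our conventions, probabilities are maximum likelihood estimates, and boundary cases are handled only minimally.

```python
import math

# Sketch of the Section 3 measures (our illustration, not the authors' code).
# fu, fv: unigram frequencies of the two words; fuv: frequency of the bigram uv;
# N: total number of bigram tokens.

def pmi(fu, fv, fuv, N):
    """Pointwise mutual information: log2(P(uv) / (P(u) * P(v)))."""
    return math.log2((fuv / N) / ((fu / N) * (fv / N)))

def md(fu, fv, fuv, N):
    """Mutual dependency: PMI minus the self-information of the bigram,
    i.e. log2(P(uv)**2 / (P(u) * P(v))); equals 0 for perfect dependence."""
    return pmi(fu, fv, fuv, N) + math.log2(fuv / N)

def lfmd(fu, fv, fuv, N):
    """Log-frequency biased MD: MD + log2(P(uv))."""
    return md(fu, fv, fuv, N) + math.log2(fuv / N)

def t_score(fu, fv, fuv, N):
    """t = (P(uv) - P(u)P(v)) / sqrt(P(uv) / N), under the normality assumption."""
    p_uv = fuv / N
    return (p_uv - (fu / N) * (fv / N)) / math.sqrt(p_uv / N)

def chi_square(fu, fv, fuv, N):
    """Pearson's chi-square over the 2x2 contingency table of u and v."""
    cells = [(fu, fv, fuv),                        # u and v both occur
             (N - fu, fv, fv - fuv),               # not-u, v
             (fu, N - fv, fu - fuv),               # u, not-v
             (N - fu, N - fv, N - fu - fv + fuv)]  # not-u, not-v
    total = 0.0
    for row, col, observed in cells:
        expected = row * col / N
        total += (observed - expected) ** 2 / expected
    return total

def llr(fu, fv, fuv, N):
    """Log-likelihood ratio, -2 log(lambda), binomial model (Dunning, 1993)."""
    def log_l(k, n, p):
        # binomial log-likelihood without the constant binomial coefficient
        ll = 0.0
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1 - p)
        return ll
    p, p1, p2 = fv / N, fuv / fu, (fv - fuv) / (N - fu)
    return 2 * (log_l(fuv, fu, p1) + log_l(fv - fuv, N - fu, p2)
                - log_l(fuv, fu, p) - log_l(fv - fuv, N - fu, p))
```

A production implementation would precompute log-probabilities over the whole bigram table rather than recomputing the ratios per call, but the functional form above is what each measure ranks by.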

R | Frequency | T-score | LLR | LFMD | MD, χ2 | PMI
1 | vice president | vice president | vice president | vice president | bala cynwyd | leonie rysanek
2 | stock exchange | stock exchange | stock exchange | wall street | zoete wedd | lineas aereas
3 | chief executive | chief executive | chief executive | chief executive | ralston purina | yand renjun
4 | year earlier | year earlier | cents share | cents share | bateman eichler | yue-kong pao
5 | cents share | cents share | year earlier | stock exchange | corpus christi | fayez sarofim
6 | york stock | york stock | executive officer | gallium arsenide | steer-mom | pop's
7 | million shares | executive officer | wall street | year earlier | kaposi's sarcoma | tech-ops landauer
8 | company said | composite trading | composite trading | dow jones | rotan mosle | sunder rajan
9 | executive officer | million shares | york stock | los angeles | cahora bassa | tiang siew
10 | composite trading | company said | net income | composite trading | mager dietz | tercel ez
11 | spokesman said | spokesman said | exchange composite | real estate | vazquez rana | toa nenryo
12 | wall street | wall street | interest rates | shearson lehman | ku klux | rabi blancos
13 | interest rates | net income | real estate | hong kong | kwik kopy | rodrigo borja
14 | net income | interest rates | dow jones | merrill lynch | nissho iwai | rubik's cube
15 | shares outstanding | exchange composite | exchange composite | net income | deja vu | roussel uclaf
16 | trading yesterday | trading yesterday | tender offer | exchange composite | fii fyffes | schiapparelli farmaceutici
17 | common shares | common shares | years ago | interest rates | epeda bertrand | tu bishvat
18 | los angeles | shares outstanding | shares outstanding | tender offer | usinor sacilor | revolucionario institucional
19 | years ago | years ago | shearson lehman | burnham lambert | minas gerais | ils sont
20 | inc said | inc said | million shares | shares outstanding | ds bancor | immanuel kant
21 | corp said | tender offer | trading yesterday | years ago | yom kippur | jebel kusha
22 | says mr | real estate | common shares | york stock | munoz ledo | jovito salonga
23 | stock market | says mr | leveraged buy-out | merrill lynch | kumagai gumi | koninklijke nedlloyd
24 | tender offer | stock market | hong kong | lehman hutton | josip broz | aerolineas argentinas
25 | real estate | corp said | years old | morgan stanley | mats wilander | lanthanum gallate
26 | holding company | holding company | spokesman said | san francisco | aer lingus | katsuya takanashi
27 | shares closed | shares closed | fourth quarter | years old | khalifa al-sabah | sont partis
28 | million year | dow jones | white house | white house | modus operandi | sotto voce
29 | dow jones | years old | holding company | drexel burnham | dextran sulfate | hau pei-tsun
30 | said mr | exchange commission | seasonally adjusted | west german | twyla tharp | harve benard

Table 1: The 30-best bigrams for all tested measures of bigram association. Bigrams lacking interesting (i.e. lexical) association are marked with a diagonal line, while lexically associated bigrams appearing in any of the gold standards are indicated with grey fill. Note that some named entity fragments (e.g. “ku klux” from “Ku Klux Klan” and “york stock” from “New York Stock Exchange”) have not matched WordNet entities due to the exclusion of n-grams with n>2.
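Rankings like those in Table 1 can be reproduced in outline as follows. This is a hypothetical sketch of the pipeline described in Section 4, not the authors' code: count sequential intra-sentential bigrams, drop function words, numbers and bigrams seen fewer than 3 times, then rank what remains by one of the measures (LFMD here). The function-word list and the helper names are our own.

```python
import math
from collections import Counter

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "in", "that", "is"}  # illustrative subset

def n_best(sentences, n=30):
    """Return the n best-scoring bigrams from tokenized sentences, ranked by LFMD."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = [w.lower() for w in sent]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))   # sequential, intra-sentential bigrams only
    total = sum(bigrams.values()) or 1
    def score(u, v):                            # LFMD = MD + log2 P(uv)
        p_uv = bigrams[u, v] / total
        p_u, p_v = unigrams[u] / total, unigrams[v] / total
        return math.log2(p_uv ** 2 / (p_u * p_v)) + math.log2(p_uv)
    candidates = [(u, v) for (u, v) in bigrams
                  if bigrams[u, v] >= 3                         # frequency threshold
                  and u not in FUNCTION_WORDS and v not in FUNCTION_WORDS
                  and not u.isdigit() and not v.isdigit()]      # no functional words or numbers
    return sorted(candidates, key=lambda b: score(*b), reverse=True)[:n]
```

Swapping the inner score function for any of the other measures changes a single line, which is how the columns of Table 1 differ from one another.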

[Figure 1: line chart. Y-axis: coverage (0–60%); x-axis: N-best bigrams (0–100,000); curves: Baseline, Freq, LLR, PMI, MD/χ2, T-score, LFMD.]

Figure 1: Comparative evaluation against WordNet entities.
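The Baseline curve corresponds to random selection of bigrams: if G of the T candidate bigrams belong to the gold standard, the expected coverage of a random N-best sample is G/T for every N, which is why the baseline is flat. A quick simulation with hypothetical numbers (ours, not the paper's data) confirms this:

```python
import random

def random_coverage(candidates, gold, n, trials=2000, seed=0):
    """Average fraction of gold-standard items in a random sample of n candidates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(candidates, n)
        total += sum(1 for b in sample if b in gold) / n
    return total / trials
```

With, say, 100 gold items among 1000 candidates, the estimate stays near 10% regardless of the sample size n.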

[Figure 2: line chart. Y-axis: coverage (0–60%); x-axis: N-best bigrams (0–100,000); curves: Baseline, Freq, LLR, PMI, MD/χ2, T-score, LFMD.]

Figure 2: Comparative evaluation against Named Entities
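The curves in Figures 1 and 2 follow the N-best evaluation scheme described in the text: rank all candidate bigrams by a measure, then report, for each cutoff N, the fraction of the N best candidates that appear in a gold-standard set. A minimal reconstruction (our sketch, not the authors' script):

```python
def coverage_curve(ranked_bigrams, gold, cutoffs):
    """For each cutoff N, report the fraction of the N best candidates
    that appear in the gold-standard set."""
    curve, hits, seen = [], 0, 0
    pending = sorted(cutoffs)
    for bigram in ranked_bigrams:
        seen += 1
        if bigram in gold:
            hits += 1
        if pending and seen == pending[0]:
            curve.append((seen, hits / seen))
            pending.pop(0)
    return curve
```

Running this once per measure, with cutoffs from 200 to 100,000 and each gold standard in turn, yields one curve per measure as plotted above.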

Since for a certain level of confidence different statistics may provide different numbers of suggestions, resulting in different levels of coverage and precision, we follow a simple but objective evaluation scheme: we rank the bigrams according to every measure and depict the percentage of pairs retrieved correctly among the N-best candidates, with N ranging from 200 to 100,000. As we noted in Section 3, the performance of Pearson’s χ-square test is almost identical to that of Mutual Dependency. The baseline is drawn supposing random selection of bigrams.

The evaluation results show that LFMD and LLR outperform the other measures in both gold-standard evaluations. It is interesting, however, that evaluation against WordNet favours not only the t-score but plain frequency as well, at the expense of the information-theoretic metrics. We argue that this demonstrates that straightforwardly employing WordNet as a “gold standard” multiword repository is not quite infallible, for two reasons:

1. Many WordNet entities are rather analytical descriptions of lexical entities and therefore by no means non-compositional multiwords of interest. For example, both “Japanese_capital” and “capital_of_Japan” are included in WordNet, as well as many phrases of the same pattern. A preprocessing filter should probably be applied to exclude compositional expressions using the WordNet hypernym relations. Indeed, the compositionality of “capital_of_Japan” can be easily identified, since there are numerous “capital_of_X” entities where X has the same hypernym as “Japan”. Moreover, the specific level of the hierarchy at which the entity occurs can be of assistance towards the same end. That is, the lower the level, the more possible it is that the WordNet word sequence is a non-compositional lexical entity rather than a synset description.

2. No systematic attempt to include (or omit) all named entities in WordNet (as, probably, in most dictionaries) has been made. Therefore only the most frequent named entities are included, introducing a preference for frequency-biased measures of association.

5. Conclusion

We have studied certain pairwise word association measures typically applied for collocation extraction, and we have discussed some associations among them, both analytically and experimentally. Although the length and offset of collocations vary significantly, our study was restricted to bigrams in order to eliminate other intervening factors and to exploit available lexical resources for evaluation. In specific, the strong bias of the t-score towards frequency makes it incapable of identifying rare collocations, which comprise a large portion of terminology, as Zipf’s law suggests. On the other hand, the inverse-frequency bias of pointwise mutual information can be corrected by taking into consideration the self-information of the co-occurrence. Introducing a slight bias towards frequency results in a top-performing measure, surpassing even the likelihood ratio.

The evaluation procedure also needs special attention. Available lexical resources such as WordNet are both impure and incomplete regarding non-compositional collocations. Therefore, enrichment with terminological information and elaborate preprocessing for the exclusion of descriptive expressions seem necessary prior to evaluation experiments.

6. References

Church K. and Hanks P., 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16/1: 22-29.

Dunning T., 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19/1: 61-74.

Gallager R., 1968. Information Theory and Reliable Communication. John Wiley & Sons, Inc.

Justeson J. and Katz S., 1995. Principled Disambiguation: Discriminating Adjective Senses with Modified Nouns. Computational Linguistics, 21/1: 1-27.

Kita K., Kato Y., Omoto T. and Yano Y., 1993. A Comparative Study of Automatic Extraction of Collocations from Corpora: Mutual Information vs. Cost Criteria. Journal of Natural Language Processing, 1/1: 21-32.

Manning C. and Schütze H., 1999. Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.

Miller G., 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 3/4.

Pearce D., 2001. Synonymy in collocation extraction. Proceedings of the NAACL'01 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations. Pittsburgh, PA.

Roark B. and Charniak E., 1998. Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. COLING-ACL 1998, 1: 1110-1116.

Smadja F., 1993. Retrieving Collocations from Text: Xtract. Computational Linguistics, 19/1: 143-177.