A Comparative Evaluation of Collocation Extraction Techniques
Total Page:16
File Type:pdf, Size:1020Kb
A Comparative Evaluation of Collocation Extraction Techniques Darren Pearce School of Cognitive and Computing Sciences (COGS) University of Sussex Falmer Brighton BN1 9QH [email protected] Abstract This paper describes an experiment that attempts to compare a range of existing collocation extraction techniques as well as the imple- mentation of a new technique based on tests for lexical substitutability. After a description of the experiment details, the techniques are discussed with particular emphasis on any adaptations that are required in order to evaluate it in the way proposed. This is followed by a discussion on the relative strengths and weaknesses of the techniques with reference to the results obtained. Since there is no general agreement on the exact nature of collocation, evaluating techniques with reference to any single standard is somewhat controversial. De- parting from this point, part of the concluding discussion includes initial proposals for a common framework for evaluation of collocation extraction techniques. 1. Introduction With recent significant increases in parsing efficiency Over the last thirty years, there has been much discus- and accuracy, there is no reason why explicit parse infor- sion in the linguistics literature on the exact nature of col- mation should not be used. Indeed, Goldman et al. (2001) location. Even now, there is no widely-accepted definition. emphasises the necessity to use such information since some collocational dependencies can span as many as 30 A by-product of the lack of any agreed definition is a words (in French). Several researchers have exploited the lack of any consistent evaluation methodology. For exam- availability of parse data. Lin (1998) uses an information- ple, Smadja (1993) employs the skills of a professional lex- theoretic approach over parse dependency triples, Blaheta icographer, Blaheta and Johnson (2001) use native speaker and Johnson (2001) use log-linear models to measure asso- judgements and Lin (1998) uses a term bank. Evaluation ciation between verb-particle pairs and Pearce (2001) uses can also take the form of a discussion of the ‘quality’ of the restrictions on synonym substitutions within parse depen- extracted collocations (e.g. Kita and Ogata (1997)). dency pairs. Since it is somewhat controversial to evaluate these Recent research into collocation extraction (Pearce, techniques with reference to anything considered as a stan- 2001) has produced a new technique based on analysing dard whether relying on human judgement or machine read- the possible substitutions for synonyms within candidate able resources, relatively little work has been done on com- phrases. For example, emotional baggage (a collocation) parative evaluation in this area (e.g. Evert and Krenn occurs more frequently than the phrase emotional luggage (2001), Krenn and Evert (2001) and Kita et al. (1994)). formed when baggage is substituted for its synonym lug- However, the view taken in this paper is that such evalua- gage. This technique uses WordNet (Miller, 1990) as a tion does offer at least some departure points for a discus- source of synonymic information. sion of the relative merits of each technique. 3. Experiment Description 2. Existing Collocation Extraction The experiment compares implementations of many of Techniques the techniques discussed in the previous section. Each tech- Various researchers in natural language processing have nique was supplied with bigram co-occurrence frequencies proposed computationally tractable definitions of colloca- obtained from 5.3 million words of the BNC. For each pair tion accompanying empirical experiments seeking to vali- of words in this frequency data, the technique was used to date their formulation. With the availability of large cor- assign a ‘score’ where the higher the score, the ‘better’ the pora came several techniques based on N-gram statistics collocation (according to the technique). Alternatively, the derived from these large bodies of text. This begins with technique, based on some predefined criteria (such as the the -score of Berry-Rogghe (1973) and later with Church application of a threshold), could opt to ignore scoring the and Hanks (1990), Kita et al. (1994) and Shimohata et al. pair. Such a strategy tends to increase precision at the ex- (1997). pense of recall. Smadja (1993) went one step further and developed a Evaluation of the output list uses the multi-word infor- technique that not only extracted continuous sequences of mation in the machine-readable version of the New Oxford words that form collocations but also those with gaps (e.g. Dictionary of English (NODE) (Pearsall and Hanks, 1998). ¡ break-down, door ¢ in break down the old door). Under- This was processed to produce a list of 17,485 two-word pinning the technique was the implicit inference of syntax collocations such as adhesive tape, hand grenade and value through analysis of co-occurrence frequency distributions. judgement. This list was subsequently reduced to 4,152 en- tries that occurred at least once in the same data supplied to IJII the techniques, thus forming the ‘gold’ standard. IIJI open the door but the door III was open III so I left the door open III open III the open door 4. Technique Details 6¤LK © ?¤NM The following subsections briefly describe each of the © open and door so: evaluated techniques with particular emphasis on the as- M © door > 6¤ @ ¤ ¤QO(¨ O%ORM signment of scores to word sequences, especially sequences door PO%O%O!-AK of length two. In addition, accompanying each description -A© open is a figure consisting of a small example of the application of the technique to a fabricated corpus of a 1000 words. The and subsection titles also serve to identify the way in which the B * techniques will be referred to in Section 5. > ©DP© ©DS ©¡'& ¢( 6¤ door open ¡ ¢ Throughout the descriptions, a word, ( , , etc), when ¡ ¤QO(¨ O%ORM4TUKTVM;¤QO(¨ O%W §¦©¨ ¨ ¨ ¢ part of a (contiguous) sequence of words, £¥¤ (or ¦ ), may be written indicating its position within With ©¡'& ¢( 6¤XM : £©¤ this sequence where and . It is use- MY-ZO(¨ O%W © ! ful to distinguish the frequency of a word, , the fre- [ ¤ ¤^P_(¨ ` ©£" O%W\7]-ZO(¨ O%ORMS quency of a sequence, , and, in particular, the distance O(¨ frequency, $#%¡'& ¢( , which represents the count of word ) ¡ ) ¢ occurring words after . In general. the distance, , Figure 1: Example for Berry-Rogghe (1973). can be negative as well as positive, represented by the set * ¤,+.- /& ¨ ¨ ¨0%& %& ¨ ¨ ¨1©2 . When this set is restricted to a *43 positive range only, this is written ¤5+.6¨ ¨ ¨7©2 . The 4.2. Church and Hanks (1990) (church) 96:8; arithmetic mean of a bag, 8 is notated and the stan- Although widely used as a basis for collocation extrac- 1 dard deviation by </:8; . tion, Church and Hanks (1990) in fact measured word as- ¢ sociation. The probability of seeing words ¡ and within 4.1. Berry-Rogghe (1973) (berry) a pre-defined window is given by: One of the earliest attempts at automatic extraction, this a # ¡'& ¢( technique uses a window of words. Given a word, ¡ , and > ©¡= ¢ its frequency, , the probability of another word, , oc- # bRc6d ¡'& ¢( ?¤ * @ 3 ED curring in any other position in the corpus is approximated by: where the normalisation factor takes account of the possi- bility of counting the same word more than once within the ©¢( > 2 @ ¢( ?¤ window. The co-occurrence of the two words is scored -A©¡= using point-wise mutual information: @ > -"©¡= e where the normalisation factor ( ) is the number of ¡'& ¢( ¡'& ¢( ?¤Nf g%h.i > > words in the corpus that aren’t ¡ . The expected frequency ¡= ©D ¢( ¢ of ¡ and within a certain window of words is then: where word probabilities are calculated directly using rel- B * > ¢( CDE©¡= ©D. C¡'& ¢( ?¤ ative frequency. This gives the amount of information (in bits) that the co-occurrence gives over and above the infor- * ¡ ¢ in which there are chances around each for to occur. mation of the individual occurrences of the two words. The significance of the actual frequency count, ©¡'& ¢( in Since this measure becomes unstable when the counts B ©¡'& ¢( ]jNK comparison to ©¡'& ¢( is computed using a normal approx- are small, it is only calculated if . imation to the binomial distribution, yielding the -score: 4.3. Kita et al. (1994) (kita) B ©¡'& ¢( ©¡'& ¢( C- This technique is based on the idea of the cognitive cost F ¤ B of processing a sequence of words, kR£" . In general, the > ¢( ©¡'& ¢( ©D%7- non-collocational cost of a word sequence is assumed to be 3 £ linear in the number of words ( k l]£" §¤, ). However, In order to process bigram data, this must be modified collocations are processed as a single unit ( kEm £" ¤n ). slightly such that the window is just one word wide. Motivated by the rationale that a collocation would serve to reduce the cognitive cost of processing a word sequence, 1These functions, de®ned purely for the sake of brevity, neces- collocations are extracted by considering cost reduction sitate the use of bag-theory. Bags are similar to sets except that 2 p!q/o rtsvu they can contain repeated elements. To correspond with this sim- Note that o corresponds to in Church and Hanks H ilarity, they are written using the same font as for sets ( G , , etc). (1990). The distinction is made where appropriate although this is usually 3This is, as the authors admit, a greatly simplifying assump- obvious from the context. tion. IJII III III III butterfly stroke butterfly IIJI animal liberation animal lib- III III III stroke butterfly stroke stroke eration front IIJI animal liberation III III III stroke stroke animal liberation III animal liberation front III _ > > The table below shows the cost reduction calculations ?¤ ?¤ butterfly stroke £ ¦ PO%O%O PO%O%O for two word sequences: animal liberation ( ) and i animal liberation front ( £ ). so, using a window of two words: ¢ ¢ ¡ £ £ l m£ ¡ > 6¤ ¡ K PO K K butterfly stroke £"¦ ¡ ¡ T PO%O%O i £ _ M i n£ and However, since £V¦ , the calculations above i have considered two occurrences of £ also as occur- e O(¨ O%O( £V¦ 6¤Xf g%h ¤LK ¨ W butterfly stroke i rences of a larger collocation, .