Evaluation of automatic collocation extraction methods for language learning

Vishal Bhalla
Department of Informatics
Technical University of Munich
Arcisstraße 21, 80333 München, Germany
[email protected]

Klara Klimcikova
Department of English and American Studies
Ludwig Maximilian University of Munich
Schellingstraße 3, 80799 München, Germany
[email protected]

Abstract

A number of methods have been proposed to automatically extract collocations, i.e., conventionalized lexical combinations, from text corpora. However, attempts to evaluate and compare them with a specific application in mind lag behind. This paper compares three end-to-end resources for collocation learning, all of which use the same corpus but different methods. Adopting a gold-standard evaluation method, the results show that the method of dependency parsing outperforms regex-over-pos in collocation identification. The lexical association measures (AMs) used for collocation ranking perform about the same overall but differently for individual collocation types. Further analysis has also revealed that there are considerable differences between other commonly used AMs.

1 Introduction

Collocations, as the most common manifestation of formulaic language, have attracted a great deal of research in the last decade (Wray, 2012). Most of the research on collocations has been connected to their definition (section 2.1) and extraction (section 2.2), but also to their acquisition and, consequently, teaching. Herbst and Schmid (2014) argue, "Any reflection upon what is important in the learning and, consequently, also in the teaching of a foreign language will have to take into account the crucial role of conventionalized but unpredictable collocations. Any attempt by a learner to achieve some kind of near-nativeness will have to include facts of language such as the fact that it is lay or set the table in English, but Tisch decken in German, and mettre la table in French" (p. 1).

Collocation learning comes down to three main benefits for language learners: accurate production, efficient comprehension and increased fluency of processing (e.g., Men, 2017; Durrant and Mathews-Aydınlı, 2011). To increase their language proficiency, beginners and advanced learners often look up words and their common collocates online, using a mobile app or web browser, and it is beneficial to provide personalized items tailored to the user's interests and proficiency. Examples of using collocations for building educational applications include question generation (e.g., Lin et al., 2007), distractor generation for multiple-choice cloze items (e.g., Liu et al., 2005; Lee and Seneff, 2007) and an online collocation writing assistant, Collocation Inspector (Wu et al., 2010a), in the form of a web service.

Despite their widely recognized importance and ubiquity in language use, collocations pose a great challenge for language learners because of their arbitrary nature and the learner's insufficient experience with the target language (Ellis, 2012). Thus, there is a pressing need to create resources for language learners that support explicit collocation learning. Given the vast number of collocations and the different goals of language learners, various methods have been proposed to extract them automatically from text. Yet it is still not conclusive which one performs best for language learning, and "the selection of one or another seems to be somewhat arbitrary" (González Fernández and Schmitt, 2015, p. 96).

This paper¹ attempts to evaluate three end-to-end resources of collocations built for language learning: Sketch Engine² (Kilgarriff et al., 2014), Flexible Language Acquisition (FLAX)³ (Wu, 2010) and Elia (Bhalla et al., 2018). They use the same British Academic Written English (BAWE) corpus (Nesi, 2011), but different methods for collocation identification, i.e., regex-over-pos, n-grams combined with regex-over-pos and dependency parsing, respectively, and also different association measures for collocation ranking, i.e., Log Dice, raw frequency and Formula Teaching Worth (FTW). On top of that, we compare other widely used lexical association measures, namely MI, MI2, MI3, t-score, log-likelihood, Salience and Delta P, using the data from the best performing candidate identification method as a baseline. For our evaluation, we use the expert-judged Academic Collocation List (ACL) (Ackermann and Chen, 2013) as a reference set (section 3.1), and calculate recall and precision separately for collocation identification and ranking.

¹ Code, data and evaluation results are available at https://github.com/vishalbhalla/autocoleval
² https://www.sketchengine.eu/
³ http://flax.nzdl.org/greenstone3/flax

2 Theoretical Background

2.1 Notion of Collocation

Among the many different interpretations of collocations in the literature, three leading approaches can be distinguished: psychological, phraseological and distributional (Men, 2017).

The psychological approach envisages collocations as lexical associations in the mental lexicon of language users underlying their fluent and meaningful language use (e.g., Ellis et al., 2008). This perspective on collocations is supported by evidence from psycholinguistic research using reaction-time tasks, free-association tasks, self-paced reading and eye-tracking, which suggests that collocations are stored holistically as chunks and thus processed faster (Wray, 2012). However, as found by Meara (2009), the storage of word associations in the mental lexicon of native speakers is different from that of nonnative speakers.

The phraseological approach focuses predominantly on delimiting collocations (call a meeting) from free word combinations with a predictable meaning (call a doctor), on the one hand, and from fixed idioms with an unpredictable meaning (call it a day) on the other (e.g., Cowie, 1998), by defining a set of criteria related to the compositionality of meaning and the fixedness of form. Schmitt (2010) argues that such an approach is rather problematic for the identification task, as it is not clear how to operationalize such criteria without making the task subjective and labor-intensive.

The distributional approach, also called Firthian or frequency-based, shifts the focus from the semantic aspects of collocations to structural ones. As Sinclair (1991) put it, "Collocation is the co-occurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening" (p. 170). Following this definition, various criteria have been considered for identifying collocations, e.g., distance, frequency, exclusivity, directionality, dispersion, type-token distribution and connectivity (Brezina et al., 2015). However, some researchers (e.g., Bartsch, 2004) argue that, because it takes little account of the syntactic features of the words, this approach fails to capture certain collocations, e.g., the collocation collect stamps in the sentence They collect many things, but chiefly stamps, or, vice versa, captures false collocations such as things but.

Despite the obvious differences, there is considerable overlap between the three approaches, as Durrant and Mathews-Aydınlı (2011) rightly point out: "Non-compositionality and high frequency of occurrence can both be cited as evidence for holistic mental storage, and non-substitutability of parts can be evidenced in terms of co-occurrence frequencies in a corpus" (p. 59). It is precisely this extended notion of two-word collocations which was adopted by the collocation resources under investigation in this study.

2.2 Automatic Extraction of Collocations

The task of collocation extraction is usually split into two steps: candidate identification, which automatically generates a list of potential collocations from a text according to some criteria, and candidate ranking, which ranks that list to keep the best collocations on top according to some association measure (Seretan, 2008).

2.2.1 Candidate Identification

In the candidate identification step, four prominent methods can be distinguished based on the proximity of words and the amount of linguistic information used: window, n-gram, regex-over-pos and parsing. The first two are based on linear proximity whereas the other two are based on syntactic proximity.

The window-based method (e.g., Brezina et al., 2015) identifies collocations within a window of n words before and after the target word. It is among the most commonly known and used methods and directly follows the Firthian definition of collocations. Similarly, the n-gram method (e.g., Smadja and McKeown, 1990) extracts sequences of n adjacent words including the target word, as illustrated in the sketch below.
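As a toy illustration of identification by linear proximity, the following sketch counts collocate candidates for a target word within a symmetric window of four; the tokenized input, the window size and the absence of any POS filtering are assumptions made for brevity rather than the settings of any of the tools discussed later.

```python
from collections import Counter

def window_candidates(tokens, target, window=4):
    """Count words co-occurring with `target` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

tokens = "the results of the present study support the research hypothesis".split()
print(window_candidates(tokens, "research"))
# Counter({'present': 1, 'study': 1, 'support': 1, 'the': 1, 'hypothesis': 1})
```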

The application of these two methods can vary along several dimensions, e.g., the nature of the words considered, such as word forms, lemmas or word families (Seretan, 2008), the context span on the left and right or the number of grams, part-of-speech filtering, etc. However, due to the lack of linguistic information used, these methods are prone to many recall and precision errors (for a detailed discussion, see Lehmann and Schneider, 2009).

In contrast to the previous two methods, the regex-over-pos method takes into account the grammatical relations between words (e.g., Wu, 2010). It identifies collocations in text via regular expressions over part-of-speech tags which match a certain grammatical pattern of the collocation. An alternative, though less frequent, method identifies collocations in a syntactic relation via parsing (Seretan, 2008), and thus accounts for the syntactic flexibility of collocations. Bartsch and Evert (2014) found that collocation extraction using the parsing method improved the results in comparison to the window method. However, they also caution that the success depends on the accuracy of the parser and the set of grammatical relations used.

2.2.2 Candidate Ranking

The next step, candidate ranking, entails measuring the strength of association between the two words, hence association measures (AMs). In principle, AMs compare the observed and expected frequencies of collocations in different ways, and thus differ in how much they highlight or downplay different features of collocations (for a detailed overview, see Pecina, 2010). There is no single best performing AM; rather, the choice of an appropriate measure depends on the particular purpose and theoretical criteria. In language learning research and practice, the following AMs have received most attention: raw frequency, MI, MI2, MI3, Log Dice, t-score, log-likelihood, Salience, FTW and Delta P.

The Mutual Information (MI) measure prioritizes the rare exclusivity of collocations, which is strongly linked to predictability (Gablasova et al., 2017). However, it is also biased towards low-frequency combinations, which can be circumvented by setting a minimum frequency threshold or by giving extra weight to the collocation frequency by squaring (MI2) or cubing (MI3) it.

The Log Dice score is similar to MI2 and highlights the exclusivity of word combinations without putting too much weight on rare combinations. However, Log Dice, in contrast to MI2, is suitable for comparing scores from different corpora and has been described as a "lexicographer-friendly association score" (Rychlý, 2008, p. 6-9). Another measure adjusted for lexicographic purposes is Salience, the forerunner of Log Dice, which combines the strengths of MI and log frequency (Kilgarriff and Tugwell, 2002).

The t-score represents the strength of association between words by calculating the probability that a certain collocation will occur, without considering the level of significance (Pecina, 2010). It prioritizes the frequency of the whole collocation, and hence there is a tendency for frequent collocations to rank higher.

The only measure created specifically for pedagogical purposes is the Formula Teaching Worth (FTW), which is again a combined measure of MI and raw frequency, with more weight given to the former. It was derived from empirical research using both statistical measures and instructor judgments. Basically, the score represents "a prediction of how instructors would judge their teaching worth" (Simpson-Vlach and Ellis, 2010, p. 496).

In contrast to the previous measures, log-likelihood is a statistic which determines whether the word combination occurs more frequently than chance or not. In particular, the score does not provide information on "how large the difference is" but rather on "whether we have enough evidence in the data to reject the null hypothesis" (Brezina et al., 2015, p. 161).

The last measure is Delta P, which takes directionality into account and calculates the strength of the attraction between two words for each word separately. Therefore, in contrast to all the previous measures, it does not treat the collocational relationship as symmetrical (Gries, 2013).
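To make the differences between these AMs concrete, the following sketch computes several of them from the observed co-occurrence frequency of a pair and the marginal frequencies of its two words, following the textbook formulas collected in Brezina et al. (2015) and Rychlý (2008). The function name, the example counts and the choice of base-2 logarithms are illustrative assumptions; the exact constants inside the evaluated tools may differ, and the tool-specific Salience and FTW scores are omitted.

```python
import math

def association_scores(o11, f1, f2, n):
    """Association measures for a word pair from a 2x2 contingency table.

    o11: co-occurrence frequency of (w1, w2)
    f1, f2: marginal frequencies of w1 and w2
    n: total number of co-occurrence units in the corpus
    """
    e11 = f1 * f2 / n  # expected co-occurrence frequency under independence
    scores = {
        "MI": math.log2(o11 / e11),
        "MI2": math.log2(o11 ** 2 / e11),
        "MI3": math.log2(o11 ** 3 / e11),
        "t-score": (o11 - e11) / math.sqrt(o11),
        "logDice": 14 + math.log2(2 * o11 / (f1 + f2)),
        # directional Delta P: how strongly w1 predicts w2, and the reverse
        "deltaP(2|1)": o11 / f1 - (f2 - o11) / (n - f1),
        "deltaP(1|2)": o11 / f2 - (f1 - o11) / (n - f2),
    }
    # log-likelihood (G2) over the full 2x2 contingency table
    observed_expected = [
        (o11, e11),
        (f1 - o11, f1 * (n - f2) / n),
        (f2 - o11, (n - f1) * f2 / n),
        (n - f1 - f2 + o11, (n - f1) * (n - f2) / n),
    ]
    scores["log-likelihood"] = 2 * sum(
        o * math.log(o / e) for o, e in observed_expected if o > 0
    )
    return scores

# Purely illustrative counts; n is roughly the size of BAWE in tokens.
print(association_scores(o11=120, f1=800, f2=2500, n=6_506_995))
```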

3 Methodology

3.1 Reference Set

The recently compiled Academic Collocation List (ACL) (Ackermann and Chen, 2013) was selected as the reference set (gold standard) to be compared against the test sets. Five main considerations drove this decision. First, it needed to be in line with the nature of the BAWE⁴ corpus that was chosen as the source input for extracting collocations. BAWE contains around 3,000 good-standard student assignments (6,506,995 words), evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Since BAWE is a collection of academic writing by university students, the reference set should also consist of academic collocations. Second, it should contain collocations consisting of two words, as all three resources focus on two-word collocations. Third, the collocations should preferably be grouped into collocation types based on their word classes or syntactic functions, as in the test sets. Fourth, the reference set should be human-made or human-judged to ensure the quality of the collocations. And finally, it should be compiled for pedagogical purposes.

The ACL comprises 2,469 lexical collocations in written academic English and is based on the written part of the Pearson International Corpus of Academic English (PICAE) of around 25 million words. It was carefully compiled using a combination of automatic computational analysis, to ensure adequate recall, and human judgment, to ensure the quality and relevance of the collocations for pedagogical purposes. Consisting of the most frequent and pedagogically relevant entries, the ACL can therefore be immediately operationalized by English for Academic Purposes (EAP) teachers and students. By highlighting the most important cross-disciplinary collocations, the ACL can help learners increase their collocational competence and thus their proficiency in academic English. The collocations are grouped into eight collocation types: adjective + noun, noun + noun, verb + noun, verb + adjective, adverb + verb, verb + adverb, adverb + verb past participle, and adverb + adjective.

To make it comparable to the test sets, we lemmatized all its inflected word forms using the automatic lemmatization tool from SpaCy⁵ and then manually checked all the errors. Next, it was organized by headwords with the POS tags noun, adjective and verb, grouped by possible collocation types, and with the respective collocates appended, resulting in a list of 1,455 headwords, 11 collocation types and 4,626 collocations, as presented in Table 1. For example, the notation of the collocation type n2 adj1 indicates that the headword is a noun (n) in the 2nd position in the collocation pair and the collocate is an adjective (adj) in the 1st position, so when a learner searches for adjectival collocates of the word feature, the resource returns, among others, the collocate distinguishing.

Collocation Type | Headwords | Collocations
n1 n2     |   39 |   62
n2 n1     |   52 |   62
n2 v1     |  156 |  306
n2 adj1   |  483 | 1769
v1 n2     |  107 |  306
v1 adj2   |    8 |   30
v1 adv2   |   19 |   29
v2 adv1   |   79 |  139
adj1 n2   |  416 | 1769
adj2 v1   |   23 |   30
adj2 adv1 |   73 |  124
Total     | 1455 | 4626

Table 1: Reference set grouped by collocation types starting with a headword, where noun is n, adjective is adj, verb is v, adverb is adv, and the numbers 1 and 2 indicate their positions in the collocation pair.

⁴ https://www.coventry.ac.uk/research/research-directories/current-projects/2015/british-academic-written-english-corpus-bawe/
⁵ https://spacy.io/
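A minimal sketch of this lemmatization step, under the assumption that the ACL entries are stored one collocation per line in a plain-text file (the file name is hypothetical), could look as follows; as noted above, the automatic output still had to be checked manually.

```python
import spacy

# The paper does not state which spaCy model was used; the small English one suffices here.
nlp = spacy.load("en_core_web_sm")

def lemmatize_collocation(collocation):
    """Return the lemmatized form of a two-word collocation, e.g. 'distinguishing features'."""
    return " ".join(token.lemma_ for token in nlp(collocation))

# Hypothetical input file with one ACL entry per line.
with open("acl_collocations.txt", encoding="utf-8") as handle:
    lemmatized = [lemmatize_collocation(line.strip()) for line in handle if line.strip()]

print(lemmatized[:5])
```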
3.2 Test Sets

3.2.1 Sketch Engine

Sketch Engine (SE) is an online corpus software with a wide range of functions and preloaded corpora which can be used for pedagogical purposes either indirectly, in the creation of textbooks and dictionaries, or directly in the classroom (Kilgarriff et al., 2014). One of its functions is the Word Sketch for extracting collocations in a range of grammatical patterns, and one of its corpora is BAWE. The corpus is automatically POS-tagged using CLAWS 7⁶ and the collocations are identified with the help of its embedded Sketch Grammar⁷, which is a set of regular expressions over POS tags. The retrieved collocates are then organized according to their grammatical relation to the headword and, within each relation, sorted by the Log Dice measure (alternatively, by raw frequency).

The SE collocations were extracted using web scraping: firstly, the URL was built using the lemma and POS tag of each word, and then the eleven collocation types from the reference set were mapped to the collocation types used in SE to pick up the collocations of interest (Table 2). Taking all the headwords (lemmas) from the reference set, the count and score of each collocate were stored in an intermediate file for each lemma in order to generate the SE files⁸ for the final evaluation.

ACL       | Sketch Engine                | FLAX   | Elia
n1 n2     | modifies                     | n-nn   | NOUN + NOUN
n2 n1     | modifier                     | n-nn   | NOUN + NOUN
n2 v1     | object of                    | n-vn   | VERB + NOUN
n2 adj1   | modifier                     | n-an   | ADJ + NOUN
v1 n2     | object                       | v-vn   | VERB + NOUN
v1 adj2   | adj comp / np adj comp       | v-vppa | VERB + ADJ
v1 adv2   | modifier                     | v-vr   | VERB + ADV
v2 adv1   | modifier                     | v-rv   | VERB + ADV
adj1 n2   | modifies                     | a-an   | ADJ + NOUN
adj2 v1   | adj comp of / np adj comp of | a-vppa | VERB + ADJ
adj2 adv1 | modifier                     | a-ra   | ADV + ADJ

Table 2: Mapping of collocation types between the reference set (ACL) and the test sets (Sketch Engine, FLAX, Elia).

⁶ https://www.sketchengine.eu/english-claws7-part-of-speech-tagset/
⁷ For a full list of the Sketch Grammar, see https://the.sketchengine.co.uk/corpus/wsdef?corpname=preloaded/bawe2
⁸ Code and data in the 'sketchengine' folder of the Supplementary Material.
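In code, the mapping in Table 2 can be kept as a plain lookup table. The dictionary below simply restates the FLAX and Elia columns of the table; the variable name and the decision to key it by the reference-set notation are illustrative choices, not part of any of the tools' APIs.

```python
# Reference-set collocation type -> (FLAX pattern, Elia type), as in Table 2.
TYPE_MAPPING = {
    "n1 n2":     ("n-nn",   "NOUN + NOUN"),
    "n2 n1":     ("n-nn",   "NOUN + NOUN"),
    "n2 v1":     ("n-vn",   "VERB + NOUN"),
    "n2 adj1":   ("n-an",   "ADJ + NOUN"),
    "v1 n2":     ("v-vn",   "VERB + NOUN"),
    "v1 adj2":   ("v-vppa", "VERB + ADJ"),
    "v1 adv2":   ("v-vr",   "VERB + ADV"),
    "v2 adv1":   ("v-rv",   "VERB + ADV"),
    "adj1 n2":   ("a-an",   "ADJ + NOUN"),
    "adj2 v1":   ("a-vppa", "VERB + ADJ"),
    "adj2 adv1": ("a-ra",   "ADV + ADJ"),
}

flax_pattern, elia_type = TYPE_MAPPING["n2 adj1"]
print(flax_pattern, elia_type)  # n-an ADJ + NOUN
```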

3.2.2 FLAX

FLAX (Flexible Language Acquisition) is an online library and tool specifically created for collocation learning (Wu, 2010). It consists of large collections of collocations and phrases extracted from different corpora, one of which is BAWE, and can be used for searching the collocations of a particular word or for the automatic generation of a variety of collocation exercises and games. The collocations are extracted using a combination of the n-gram and regex-over-pos methods, which involves the following steps. Firstly, n-grams (n=5) are extracted from the corpus and tagged with the OpenNLP tagger⁹. The tagged 5-grams are then matched against a set of regular expressions based on predefined collocation types¹⁰. Finally, the individual collocations, organized by collocation type, are sorted by raw frequency within each collocation type (Wu, 2010, p. 98).

For the evaluation, all FLAX collocations¹¹ were extracted using the same web scraping process as for SE. However, as FLAX operates on word forms, in contrast to the reference set, which operates on lemmas, all lemmas from the reference set had to be converted to their word forms using Pattern¹² (Smedt and Daelemans, 2012) in order to retrieve the corresponding collocations from FLAX for each headword in the reference set, and then remapped back to their lemmas to continue with the same evaluation flow as for SE.

⁹ http://opennlp.apache.org/
¹⁰ For a full list of the collocation types with examples, see Wu et al. (2010b, p. 9).
¹¹ Code and data in the 'flax' folder of the Supplementary Material.
¹² https://github.com/clips/pattern
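The FLAX-style identification step can be approximated in a few lines: tag an extracted 5-gram and test whether its POS sequence matches a pattern associated with a collocation type. The sketch below substitutes NLTK's tagger for OpenNLP purely for brevity, and the single adjective-noun pattern stands in for the full set of regular expressions FLAX uses.

```python
import re
import nltk  # the 'averaged_perceptron_tagger' resource must be downloaded once

# Illustrative pattern for an adjective + noun collocation inside a tagged n-gram.
ADJ_NOUN = re.compile(r"JJ[RS]? NNS?\b")

def adj_noun_candidates(ngram):
    """Return (adjective, noun) pairs whose POS sequence matches the pattern."""
    tagged = nltk.pos_tag(ngram.split())
    pos_string = " ".join(tag for _, tag in tagged)
    pairs = []
    for match in ADJ_NOUN.finditer(pos_string):
        adj_index = pos_string[: match.start()].count(" ")  # token index of the adjective
        pairs.append((tagged[adj_index][0], tagged[adj_index + 1][0]))
    return pairs

print(adj_noun_candidates("the empirical evidence supports this claim"))
# [('empirical', 'evidence')]
```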

3.2.3 Elia

Elia¹³ is an intelligent personal assistant for language learning which provides immediate assistance to English learners when they use English online (Bhalla et al., 2018). One of its design features is to provide the learner with a list of collocates for a given word which are in line with the learner's proficiency level. It is based on BAWE: firstly, all the dependency relations are extracted using the SpaCy parser¹⁴ and mapped to a predefined set of 15 collocation types, and this extraction is then run for the Academic Vocabulary List (Gardner and Davies, 2013) of the 20,000 most frequent academic words from the Corpus of Contemporary American English (COCA)¹⁵. Subsequently, the collocations are organized for each headword (lemma) and collocation type and ranked according to the Formula Teaching Worth (Simpson-Vlach and Ellis, 2010).

For the evaluation, only the collocations for the headwords and collocation types present in the reference set were filtered out from Elia after running the code from the link shared previously. With this setup, intermediate files containing all the collocates of each headword, along with the chosen metric, were generated. These are in line with the web scraping files from Sketch Engine and FLAX in order to generate the final evaluation files¹⁶.

¹³ Elia is not available online yet; however, the code to generate the database of collocations can be accessed at https://drive.google.com/open?id=1FGFy_yp797saphx8-wzcLkMxQbkCVZlp
¹⁴ https://spacy.io/usage/linguistic-features#section-dependency-parse
¹⁵ https://corpus.byu.edu/coca/
¹⁶ Code and data in the 'elia' folder of the Supplementary Material.
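A dependency-based identification step of the kind Elia uses can be sketched with spaCy as follows; the mapping from dependency labels to the two collocation types shown is a simplified illustration, not Elia's actual inventory of 15 types.

```python
import spacy

# Requires the English model, e.g. `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")

# Simplified label mapping; Elia's real inventory distinguishes 15 collocation types.
DEP_TO_TYPE = {"amod": "ADJ + NOUN", "dobj": "VERB + NOUN"}

def dependency_candidates(text):
    """Yield (headword_lemma, collocate_lemma, collocation_type) triples."""
    for token in nlp(text):
        ctype = DEP_TO_TYPE.get(token.dep_)
        if ctype:
            yield (token.head.lemma_, token.lemma_, ctype)

for triple in dependency_candidates("The study collected empirical evidence on collocation use."):
    print(triple)
# e.g. ('evidence', 'empirical', 'ADJ + NOUN'), ('collect', 'evidence', 'VERB + NOUN')
```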
4 Results and Discussion

For the comparative evaluation of the three test sets, the standard metrics recall and precision were calculated separately for the identification and the ranking of collocations grouped into collocation types. On top of that, an additional evaluation was performed on the best performing test set as a baseline in order to compare the different collocation ranking measures introduced in section 2.2.2.

4.1 Candidate Identification

Table 3 clearly shows that the method of dependency parsing used by Elia resulted in a higher overall recall (99%) than the method of regex-over-pos used by Sketch Engine (91%) and FLAX (84%). It seems that some dependency parsers have reached a sufficiently high accuracy to be used for collocation extraction and other NLP tasks (Levy et al., 2015). At the same time, there are obvious differences between Sketch Engine and FLAX, despite their using the same method (regex-over-pos), which leads to the conclusion that the manual mapping of collocation types to syntactic patterns might be as important as the method itself. Another plausible explanation could be the fact that FLAX applied regex patterns over 5-grams extracted from the corpus, whereas Sketch Engine applied them over full sentences.

Collocation Type | SE R | SE P | FL R | FL P | EL R | EL P
n1 n2     | 89 |  6 |  82 | 2 |  98 | 1
n2 n1     | 89 |  5 |  81 | 1 |  98 | 0
n2 v1     | 94 |  8 |  88 | 3 |  99 | 1
n2 adj1   | 88 | 10 |  86 | 4 |  99 | 2
v1 n2     | 92 |  4 |  84 | 2 |  99 | 1
v1 adj2   | 90 | 10 |  13 | 2 | 100 | 1
v1 adv2   | 90 |  5 | 100 | 4 | 100 | 1
v2 adv1   | 92 |  9 |  90 | 5 |  99 | 2
adj1 n2   | 93 |  6 |  84 | 5 |  99 | 2
adj2 v1   | 90 | 40 |   7 | 9 | 100 | 7
adj2 adv1 | 87 | 10 |  89 | 8 |  99 | 5
Total     | 91 |  7 |  84 | 4 |  99 | 2

Table 3: Candidate identification comparison of Sketch Engine (SE), FLAX (FL) and Elia (EL) across collocation types, with recall (R) and precision (P) values in percentages.

Turning to individual collocation types (CTs), all of them achieved a high recall of above 80% in all three test sets, except for v1 adj2 and adj2 v1 in FLAX, with a recall of only 13% and 7% respectively. Tempting as it might seem, this does not explain the lowest overall recall of FLAX, as these two types account for only 7% (54 out of 710) of all its missed collocations. FLAX performed especially well for v1 adv2 (100%) in comparison to its other CTs, which range from 90% (v2 adv1) down to 7% (adj2 v1). On the other hand, the results for Sketch Engine are rather consistent across individual CTs, ranging from 87% (adj2 adv1) to 94% (n2 v1). The same applies to Elia, ranging from 98% (n1 n2, n2 n1) to 100% (v1 adj2, adj2 v1, v1 adv2).

Looking closer at the results for Elia, we found that exactly one half (19) of all the missed collocations (38) was due to parsing or tagging errors, whereas the other half was due to different type classification; for example, the collocation learning activity was grouped under n2 adj1 in the reference set whereas, in Elia, it was assigned to n1 n2, and was thus missed. This might as well be the case for some of the missed collocations in Sketch Engine and FLAX.

The precision, on the other hand, is very low for all three test sets (the highest value, 7%, is reached by Sketch Engine), which is the price of the high recall. This, however, is not that important at this stage, since the next step of ranking should shift all the irrelevant collocations to the bottom.
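The metrics reported in Tables 3 and 4 are plain set comparisons between the extracted candidates and the reference set. The helper below shows this computation on toy data; the pair representation and the example collocations are illustrative only.

```python
def recall_precision(candidates, gold):
    """Recall and precision of an extracted candidate set against a gold set."""
    candidates, gold = set(candidates), set(gold)
    hits = candidates & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(candidates) if candidates else 0.0
    return recall, precision

# Toy example: pairs are (headword, collocate) tuples for one collocation type.
gold = {("evidence", "empirical"), ("feature", "distinguishing"), ("factor", "key")}
extracted = {("evidence", "empirical"), ("factor", "key"), ("thing", "but")}

r, p = recall_precision(extracted, gold)
print(f"recall={r:.2f} precision={p:.2f}")  # recall=0.67 precision=0.67
```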

4.2 Candidate Ranking

For candidate ranking, recall and precision values were calculated for three samples of the n-best candidates per headword for each test set: Top 4,626, where n refers to the exact number of collocates per headword in the reference set; Top 14,550, corresponding to the 10 best collocates per headword; and Top 29,100, corresponding to the 20 best collocates per headword.

                 | Top 4,626 (R=P) | Top 14,550 (R/P)        | Top 29,100 (R/P)
Collocation Type | SE | FL | EL    | SE    | FL    | EL      | SE    | FL    | EL
n1 n2     | 19 | 15 | 10 | 58/10 | 52/8  | 37/6  | 77/8  | 73/6  | 68/5
n2 n1     | 31 | 23 | 19 | 69/10 | 61/7  | 56/7  | 82/8  | 68/4  | 69/4
n2 v1     | 42 | 44 | 48 | 73/15 | 70/14 | 83/16 | 87/10 | 75/8  | 90/9
n2 adj1   | 36 | 41 | 39 | 50/18 | 52/19 | 50/18 | 68/13 | 66/12 | 65/12
v1 n2     | 37 | 31 | 35 | 55/16 | 47/14 | 52/15 | 68/10 | 57/8  | 64/9
v1 adj2   | 47 |  0 | 43 | 73/28 | 3/4   | 63/24 | 80/17 | 7/5   | 73/14
v1 adv2   | 48 | 34 | 34 | 86/13 | 86/13 | 83/13 | 90/8  | 97/8  | 86/7
v2 adv1   | 46 | 52 | 37 | 81/15 | 79/15 | 80/14 | 91/11 | 86/9  | 90/8
adj1 n2   | 36 | 43 | 44 | 40/18 | 47/21 | 49/20 | 57/13 | 61/14 | 64/14
adj2 v1   | 70 |  3 | 33 | 90/29 | 7/9   | 97/15 | 90/29 | 7/9   | 100/10
adj2 adv1 | 37 | 54 | 28 | 80/16 | 82/17 | 80/14 | 87/11 | 85/11 | 89/8
Total     | 37 | 41 | 40 | 51/17 | 52/17 | 54/17 | 67/12 | 65/11 | 68/11

Table 4: Candidate ranking comparison of Sketch Engine (SE), FLAX (FL) and Elia (EL) across collocation types for three samples: Top 4,626 (n-best collocates per headword, where n refers to the number of collocations per headword in the reference set), Top 14,550 (10 best collocates per headword) and Top 29,100 (20 best collocates per headword), with recall (R) and precision (P) values in percentages. Note that recall and precision are identical (R=P) for Top 4,626, because the number of missed collocations (false negatives) equals the number of unwanted collocations (false positives): the number of top collocations in the first sample (4,626) is the same as the total number of collocations in the reference set (4,626).

As illustrated in Table 4, the association measure Log Dice used by Sketch Engine performed slightly worse overall (37%) than FTW, a combination of MI and frequency, used by Elia (40%), and raw frequency used by FLAX (41%), for the Top 4,626 sample. As the sample increased to 14,550, Elia, with a recall of 54%, outperformed FLAX (52%) and Sketch Engine (51%). In the even larger sample of 29,100, Elia was still marginally better, reaching 68%, whereas Sketch Engine outperformed FLAX with recall values of 67% and 65% respectively. It seems that Log Dice improves its performance as more of the data is examined, whereas raw frequency behaves in quite the opposite way. However, it should also be pointed out that the differences between all of the scores are very subtle, less than 4% in all the samples. This is even more pronounced in the overall precision results, which are the same for all three resources (17%) in Top 14,550 and almost the same (12%, 11%, 11%) in Top 29,100.

Looking at the individual CTs, an interesting picture of differences emerges. Sketch Engine's measure performed consistently better for n1 n2, n2 n1, v1 n2 and v1 adj2 in all three samples. Elia's measure performs consistently better for n2 v1 and adj1 n2. FLAX seems to perform better only for v2 adv1 and adj2 adv1 in Top 4,626, and not consistently so for the other samples. Variability can be found not only among the individual resources but also among the individual CTs within one resource. For example, in Top 4,626, Sketch Engine reaches a recall of 19% for n1 n2 and of as high as 70% for adj2 v1. Recall values for Elia range from 37% (n1 n2) to 97% (adj2 v1), and for FLAX from 7% (adj2 v1) to 86% (v1 adv2), in Top 14,550. The syntactic structure underlying collocations thus seems to have a great impact on the results, and should therefore always be considered and specified, as already suggested in some previous studies (e.g., Evert and Krenn, 2001; Bartsch and Evert, 2014).

To sum up, despite the apparent similarities in the overall recall and precision values, it would be misleading to conclude that the three measures are equally efficient, since they each started from different data produced by the identification step. This becomes clear when looking at the individual collocation types, for example v1 adj2, where Elia reached 43% compared to 0% for FLAX; this could have been caused by the low recall (13%) of FLAX in the identification part. The issue with credit assignment is that it is not clear how much of the success can be attributed to the identification method discussed in the previous section and how much to the metric itself. To exclude the identification method as a factor, we decided to perform another analysis: a comparison of different AMs using the best-performing data from the candidate identification step, that is Elia, as a baseline, to find out the differences when all other things are equal.

4.2.1 Comparison of Different AMs with Elia as a Baseline

Using the Elia collocations as a baseline, we computed recall and precision for the ten different ranking measures described in section 2.2.2. As in the previous section, we computed them separately for the three samples of Top 4,626, Top 14,550 and Top 29,100, this time, however, not for all individual CTs separately. The formulas used to compute each of the different AMs can be found in Brezina et al. (2015, p. 169-170).

[Figure 1: Comparison of different association measures (AMs) using Elia as a baseline.]

The results on candidate ranking, arranged progressively by the best measures in Figure 1, show that recall curves for all AMs increase whereas precision curves decrease with increasing sample size, as expected. In terms of coverage, the best performing measures are t-score and log-likelihood across all samples, with recall values of 42%, 56%, 70% and 42%, 55%, 69% respectively. They are followed by Salience and FTW with the same values of 40%, 54%, 68%. All four of these measures exhibit consistent behavior, increasing by about 14% with each larger sample. On the other hand, the raw frequency measure, even though it reaches similarly high scores for Top 4,626 (41%) and Top 29,100 (70%), increases by only 1% for Top 14,550. The next two measures, MI3 and Log Dice, lag slightly behind, with scores of 36%, 50%, 64% and 36%, 49%, 64% respectively, consistently increasing by about 14%. The MI2 score performs significantly worse, with recall values of 23%, 36%, 51%. The most collocations are missed by the measures Delta P and MI, both reaching recall values of only 3%, 10%, 21%.

The precision values, which reflect the quality of the retained collocations, point to very similar tendencies, with t-score and log-likelihood reaching the highest precision in all three samples, with scores of 42%, 18%, 11% and 42%, 17%, 11% respectively, and with the FTW and Salience measures right behind, both with 40%, 17%, 11%. MI3 and Log Dice perform about the same, with 37%, 16%, 10% and 36%, 15%, 10% respectively. Again, the MI2 score performs significantly worse than the previous measures, reaching a precision of 23%, 11%, 8%. Surprisingly enough, MI and Delta P both reached the lowest precision score of 3% for all samples. Thus, it can be concluded that the sample size does not affect the precision of the MI and Delta P association measures, which, in the case of MI, is consistent with the previous findings of Evert and Krenn (2001).

These results question the dominant role of MI for collocation extraction (Gablasova et al., 2017), at least for language learning purposes. They also question the assumption that Log Dice is fairly similar to MI or MI2, as our results suggest that it is actually more similar to MI3 (Gablasova et al., 2017). Furthermore, Delta P did not fulfill the expectations expressed by Gries (2013). However, in defense of Delta P, it must be pointed out that the reference list did not indicate the direction of attraction for collocations, which is the underlying assumption of the Delta P measure, and this might be the reason for its poor results. On the other hand, there are only subtle differences between some of the best-performing measures, such as log-likelihood and t-score, or FTW and Salience.
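The n-best evaluation used in this section can be restated compactly: score every identified candidate with an AM, keep the n highest-scoring collocates per headword, and compare the result against the reference set. The sketch below does this with logDice and invented counts; it is a self-contained illustration of the procedure, not the released evaluation code.

```python
import math
from collections import defaultdict

def log_dice(o11, f1, f2):
    """logDice association score (Rychlý, 2008)."""
    return 14 + math.log2(2 * o11 / (f1 + f2))

def top_n_per_headword(candidates, n):
    """Keep the n highest-scoring collocates for every headword.

    candidates: iterable of (headword, collocate, score) triples.
    """
    by_head = defaultdict(list)
    for head, coll, score in candidates:
        by_head[head].append((score, coll))
    best = set()
    for head, scored in by_head.items():
        for score, coll in sorted(scored, reverse=True)[:n]:
            best.add((head, coll))
    return best

# Made-up scored candidates for one headword; the counts are purely illustrative.
candidates = [
    ("evidence", "empirical", log_dice(120, 800, 2500)),
    ("evidence", "strong", log_dice(60, 800, 4000)),
    ("evidence", "available", log_dice(15, 800, 3000)),
]
gold = {("evidence", "empirical"), ("evidence", "anecdotal")}

ranked = top_n_per_headword(candidates, n=2)
hits = ranked & gold
print("recall", len(hits) / len(gold), "precision", len(hits) / len(ranked))
```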

5 Conclusion

The aim of this study was to evaluate three collocation learning resources, namely Sketch Engine, FLAX and Elia, against a pedagogical reference, the Academic Collocation List, where all of them use the same corpus of academic writing by university students but different methods for collocation identification and different lexical association measures for collocation ranking.

The findings indicate that using dependency parsing (Elia) for collocation identification led to much better results than using regular expressions over a tagged corpus (Sketch Engine and FLAX). However, the success does not depend entirely on the specific method, but also on the quality of the set of syntactic structures: using the same method with differently designed collocation types might lead to very different results, as was the case for Sketch Engine and FLAX.

The evaluation of collocation ranking has revealed that, overall, some of the association measures perform equally well, such as t-score, log-likelihood, FTW (used by Elia) and Salience. Raw frequency (used by FLAX) was also found to perform well, but inconsistently across different sample sizes. The Log Dice measure (used by Sketch Engine) worked best for the majority of individual collocation types in comparison to raw frequency and FTW. On the other hand, the widely used MI and the newly introduced Delta P were relatively poor in comparison to the other AMs, but exhibited consistency in precision across varying sample sizes.

It has also become apparent that there are considerable differences between individual collocation types, which should therefore always be considered as a factor in collocation extraction. However, a future line of work is required to substantiate the consistency of these results on different reference lists and corpora.

Acknowledgments

We would like to thank Mrs. Wu for her insights on FLAX extraction, Aisulu for preparing the COCA list, Ivet for the help with large-scale experiments, and all the anonymous reviewers for their critical feedback.

References

Kirsten Ackermann and Yu-Hua Chen. 2013. Developing the Academic Collocation List (ACL) – A corpus-driven and expert-judged approach. Journal of English for Academic Purposes, 12(4):235–247.

Sabine Bartsch. 2004. Structural and functional properties of collocations in English: A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Gunter Narr Verlag.

Sabine Bartsch and Stefan Evert. 2014. Towards a Firthian notion of collocation. In Network Strategies, Access Structures and Automatic Extraction of Lexicographical Information. 2nd Work Report of the Academic Network Internet Lexicography, OPAL – Online publizierte Arbeiten zur Linguistik. Institut für Deutsche Sprache, Mannheim, to appear.

Vishal Bhalla, Klara Klimcikova, and Aisulu Rakhmetullina. 2018. The missing link between learners' language use and their language learning - Elia. In Proceedings of the 13th Teaching and Language Corpora Conference.

Vaclav Brezina, Tony McEnery, and Stephen Wattam. 2015. Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2):139–173.

Anthony Paul Cowie. 1998. Phraseology: Theory, analysis, and applications. OUP Oxford.

Philip Durrant and Julie Mathews-Aydınlı. 2011. A function-first approach to identifying formulaic language in academic writing. English for Specific Purposes, 30(1):58–72.

Nick C. Ellis. 2012. Formulaic language and second language acquisition: Zipf and the phrasal teddy bear. Annual Review of Applied Linguistics, 32:17–44.

Nick C. Ellis, Rita Simpson-Vlach, and Carson Maynard. 2008. Formulaic language in native and second language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly, 42(3):375–396.

Stefan Evert and Brigitte Krenn. 2001. Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195. Association for Computational Linguistics.

Dana Gablasova, Vaclav Brezina, and Tony McEnery. 2017. Collocations in corpus-based language learning research: Identifying, comparing, and interpreting the evidence. Language Learning, 67(S1):155–179.

Dee Gardner and Mark Davies. 2013. A new academic vocabulary list. Applied Linguistics, 35(3):305–327.

Beatriz González Fernández and Norbert Schmitt. 2015. How much collocation knowledge do L2 learners have? ITL – International Journal of Applied Linguistics, 166(1):94–126.

Stefan Th. Gries. 2013. 50-something years of work on collocations. International Journal of Corpus Linguistics, 18(1):137–166.

Thomas Herbst, Hans-Jörg Schmid, and Susen Faulhaber. 2014. Constructions Collocations Patterns, volume 282. Walter de Gruyter GmbH & Co KG.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography, 1(1):7–36.

Adam Kilgarriff and David Tugwell. 2002. Sketching words. In Lexicography and Natural Language Processing: A Festschrift in Honour of B.T.S. Atkins, pages 125–137.

John Lee and Stephanie Seneff. 2007. Automatic generation of cloze items for prepositions. In Eighth Annual Conference of the International Speech Communication Association.

Hans Martin Lehmann and Gerold Schneider. 2009. Parser-based analysis of syntax-lexis interactions. Language and Computers: Studies in Practical Linguistics, 68:477.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Yi-Chien Lin, Li-Chun Sung, and Meng Chang Chen. 2007. An automatic multiple-choice question generation scheme for English adjective understanding. In Workshop on Modeling, Management and Generation of Problems/Questions in eLearning, 15th International Conference on Computers in Education (ICCE 2007), pages 137–142.

Chao-Lin Liu, Chun-Hung Wang, Zhao-Ming Gao, and Shang-Ming Huang. 2005. Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, pages 1–8. Association for Computational Linguistics.

Paul Meara. 2009. Simulating word associations in an L2: Approaches to lexical organisation. International Journal of English Studies, 7(2):1–20.

Haiyan Men. 2017. Vocabulary Increase and Collocation Learning: A Corpus-Based Cross-sectional Study of Chinese Learners of English. Springer.

Hilary Nesi. 2011. BAWE: An introduction to a new resource. In New Trends in Corpora and Language Learning, pages 213–228.

Pavel Pecina. 2010. Lexical association measures and collocation extraction. Language Resources and Evaluation, 44(1-2):137–158.

Pavel Rychlý. 2008. A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, pages 6–9.

Norbert Schmitt. 2010. Researching Vocabulary: A Vocabulary Research Manual. Springer.

Violeta Seretan. 2008. Collocation Extraction Based on Syntactic Parsing. Ph.D. thesis, University of Geneva.

Rita Simpson-Vlach and Nick C. Ellis. 2010. An academic formulas list: New methods in phraseology research. Applied Linguistics, 31(4):487–512.

John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press.

Frank A. Smadja and Kathleen R. McKeown. 1990. Automatically extracting and representing collocations for language generation. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pages 252–259. Association for Computational Linguistics.

Tom De Smedt and Walter Daelemans. 2012. Pattern for Python. Journal of Machine Learning Research, 13(Jun):2063–2067.

Alison Wray. 2012. What do we (think we) know about formulaic language? An evaluation of the current state of play. Annual Review of Applied Linguistics, 32:231–254.

Jian-Cheng Wu, Yu-Chia Chang, Teruko Mitamura, and Jason S. Chang. 2010a. Automatic collocation suggestion in academic writing. In Proceedings of the ACL 2010 Conference Short Papers, pages 115–119. Association for Computational Linguistics.

Shaoqun Wu. 2010. Supporting Collocation Learning. Ph.D. thesis, University of Waikato.

Shaoqun Wu, Margaret Franken, and Ian H. Witten. 2010b. Supporting collocation learning with a digital library. Computer Assisted Language Learning, 23(1):87–110.

A Formulas

The formulas for all the association measures (AMs) tried out for Elia can be found on pages 169–170 of Brezina et al. (2015).

B Supplemental Material

The authors have also released the code, reference data, evaluation results and plots, together with a Readme document, on GitHub at https://github.com/vishalbhalla/autocoleval to assist incremental research in the ACL-NLP community. The repository contains the code for web scraping of both Sketch Engine and FLAX, as well as for extracting the filtered collocations for Elia. Since the data and evaluation files for all three test sets (Sketch Engine, FLAX and Elia) are large, they can be accessed from https://drive.google.com/open?id=17eydi0KkviG2VxB12l_oNt5LAhuQ6FR0.
