Sentence Analysis and Collocation Identification

Total Page:16

File Type:pdf, Size:1020Kb

Sentence Analysis and Collocation Identification Sentence Analysis and Collocation Identification Eric Wehrli, Violeta Seretan, Luka Nerima Language Technology Laboratory University of Geneva Eric.Wehrli, Violeta.Seretan, Luka.Nerima @unige.ch { } Abstract the actual use made of these resources in other NLP applications. Identifying collocations in a sentence, in In this paper, we consider the particular appli- order to ensure their proper processing in cation of syntactic parsing. Just as other types of subsequent applications, and performing multi-word expressions (henceforth, MWEs), col- the syntactic analysis of the sentence are locations are problematic for parsing because they interrelated processes. Syntactic informa- have to be recognised and treated as a whole, ra- tion is crucial for detecting collocations, ther than compositionally, i.e., in a word by word and vice versa, collocational information fashion (Sag et al., 2002). The standard approach is useful for parsing. This article describes in dealing with MWEs in parsing is to apply an original approach in which collocations a “words-with-spaces” preprocessing step, which are identified in a sentence as soon as pos- marks the MWEs in the input sentence as units sible during the analysis of that sentence, which will later be integrated as single blocks in rather than at the end of the analysis, as in the parse tree built during analysis. our previous work. In this way, priority is We argue that such an approach, albeit suffi- given to parsing alternatives involving col- ciently appropriate for some subtypes of MWEs2, locations, and collocational information is not really adequate for processing colloca- guide the parser through the maze of alter- tions. Unlike other expressions that are fixed or natives. This solution was shown to lead semi-fixed3, collocations do not allow a “words- to substantial improvements in the perfor- with-spaces” treatment because they have a high mance of both tasks (collocation identifi- morpho-syntactic flexibility. cation and parsing), and in that of a sub- There is no systematic restriction, for instance, sequent task (machine translation). on the number of forms a lexical item (such as a 1 Introduction verb) may have in a collocation, on the order of items in a collocation, or on the number of words Collocations1 constitute a central language phe- that may intervene between these items. Collo- nomenon and an impressive amount of work has cations are situated at the intersection of lexicon been devoted over the past decades to the automa- and grammar; therefore, they cannot be accounted tic acquisition of collocational resources – as at- for merely by the lexical component of a parsing tested, among others, by initiatives like the MWE system, but have to be integrated to the grammati- 2008 shared task aimed at creating a repository of cal component as well, as the parser has to consi- reference data (Gregoire´ et al., 2008). However, 2 little or no reference exist in the literature about Sag et al. (2002) thoroughly discusses the extend to which a “words-with-spaces” approach is appropriate for dif- 1We adopt the lexicographic understanding for the term ferent kinds of MWEs. collocation (Benson et al., 1986), as opposed to the British 3For instance, compound words: by and large, ad hoc; contextualist tradition focused on statistical co-occurrence named entities: New York City; and non-decomposable (Firth, 1957; Sinclair, 1991). idioms: shoot the breeze. 28 Proceedings of the Multiword Expressions: From Theory to Applications (MWE 2010), pages 28–36, Beijing, August 2010 der all the possible syntactic realisations of collo- sing results. For instance, Brun (1998) compared cations. the coverage of a French parser with and wi- Alternatively, a post-processing approach (such thout terminology recognition in the preproces- as the one we pursued previously in Wehrli et sing stage. She found that the integration of 210 al. (2009b)) would identify collocations after the nominal terms in the preprocessing components of syntactic analysis has been performed, and out- the parser resulted in a significant reduction of the put a parse tree in which collocational relations number of alternative parses (from an average of are highlighted between the composing items, in 4.21 to 2.79). The eliminated parses were found order to inform the subsequent processing appli- to be semantically undesirable. No valid analy- cations (e.g., a machine translation application). sis were ruled out. Similarly, Zhang and Kor- Again, this solution is not fully appropriate, and doni (2006) extended a lexicon with 373 additio- the reason lies with the important observation that nal MWE lexical entries and obtained a significant prior collocational knowledge is highly relevant increase in the coverage of an English grammar for parsing. Collocational restrictions are, along (14.4%, from 4.3% to 18.7%). with other types of information like selectional In the cases mentioned above, a “words-with- preferences and subcategorization frames, a major spaces” approach was used. In contrast, Ale- means of structural disambiguation. Collocational gria et al. (2004) and Villavicencio et al. (2007) relations between the words in a sentence proved adopted a compositional approach to the enco- very helpful in selecting the most plausible among ding of MWEs, able to capture more morpho- all the possible parse trees for a sentence (Hindle syntactically flexible MWEs. Alegria et al. (2004) and Rooth, 1993; Alshawi and Carter, 1994; Ber- showed that by using a MWE processor in the pre- thouzoz and Merlo, 1997; Wehrli, 2000). Hence, processing stage of their parser (in development) the question whether collocations should be iden- for Basque, a significant improvement in the POS- tified in a sentence before or after parsing is not an tagging precision is obtained. Villavicencio et al. easy one. The previous literature on parsing and (2007) found that the addition of 21 new MWEs collocations fails to provide insightful details on to the lexicon led to a significant increase in the how this circular issue is (or can be) solved. grammar coverage (from 7.1% to 22.7%), without In this paper, we argue that the identification of altering the grammar accuracy. collocations and the construction of a parse tree An area of intensive research in parsing is are interrelated processes, that must be accounted concerned with the use of lexical preferences, co- for simultaneously. We present a processing mo- occurrence frequencies, collocations, and contex- del in which collocations, if present in a lexicon, tually similar words for PP attachment disambi- are identified in the input sentence during the ana- guation. Thus, an important number of unsupervi- lysis of that sentence. At the same time, they are sed (Hindle and Rooth, 1993; Ratnaparkhi, 1998; used to rank competing parsing hypotheses. Pantel and Lin, 2000), supervised (Alshawi and The paper is organised as follows. Section 2 Carter, 1994; Berthouzoz and Merlo, 1997), and reviews the previous work on the interrelation combined (Volk, 2002) methods have been deve- between parsing and processing of collocations loped to this end. (or, more generally, MWEs). Section 3 introduces However, as Hindle and Rooth (1993) pointed our approach, and section 4 evaluates it by compa- out, the parsers used by such methods lack pre- ring it against the standard non-simultaneous ap- cisely the kind of corpus-based information that proach. Section 5 provides concluding remarks is required to resolve ambiguity, because many and presents directions for future work. of the existing attachments may be missing or wrong. The current literature provides no indi- 2 Related Work cation about the manner in which this circular Extending the lexical component of a parser with problem can be circumvented, and on whether MWEs was proved to contribute to a significant flexible MWEs should be processed before, du- improvement of the coverage and accuracy of par- ring or after the sentence analysis takes place. 29 3 Parsing and Collocations contribute to the disambiguation process so cru- cial for parsing. To put it differently, identifying As argued by many researchers – e.g., Heid (1994) collocations should not be seen as a burden, as an – collocation identification is best performed on additional task the parser should perform, but on the basis of parsed material. This is due to the the contrary as a process which may help the par- fact that collocations are co-occurrences of lexi- ser through the maze of alternatives. Collocations, cal items in a specific syntactic configuration. The in their vast majority, are made of frequently used collocation break record, for instance, is obtained terms, often highly ambiguous (e.g., break record, only in the configurations where break is a verb loose change). Identifying them and giving them whose direct object is (semantically) headed by high priority over alternatives is an efficient way the lexical item record. In other words, the collo- to reduce the ambiguity level. Ambiguity reduc- cation is not defined in terms of linear proximity, tion through the identification of collocations is but in terms of a specific grammatical relation. not limited to lexical ambiguities, but also applies As the examples in this section show, the rela- to attachment ambiguities, and in particular to the tive order of the two items is not relevant, nor is well-known problem of PP attachment. Consider the distance between the two terms, which is unli- the following French examples in which the pre- mited as long as the grammatical relation holds4. positions are highlighted: In our system, the grammatical relations are com- puted by a syntactic parser, namely, Fips (Wehrli, (1)a. ligne de partage des eaux (“watershed”) 2007; Wehrli and Nerima, 2009). Until now, the b. systeme` de gestion de base de donnees´ (“da- collocation identification process took place at the tabase management system”) end of the parse in a so-called “interpretation” c. force de maintien de la paix (“peacekeeping procedure applied to the complete parse trees.
Recommended publications
  • The Impact of Corpus-Based Collocation Instruction on Iranian EFL Learners' Collocation Learning
    Universal Journal of Educational Research 2(6): 470-479, 2014 http://www.hrpub.org DOI: 10.13189/ ujer.2014.020604 The Impact of Corpus-Based Collocation Instruction on Iranian EFL Learners' Collocation Learning Shabnam Ashouri*, Masoume Arjmandi, Ramin Rahimi Department of English, Guilan Science and Research Branch, Islamic Azad University, Guilan, Iran *Corresponding Author: [email protected] Copyright © 2014 Horizon Research Publishing All rights reserved. Abstract Over the past decades, studies of EFL/ESL drawn to vocabulary teaching and its collocation. Afterward vocabulary acquisition have identified the significance of they came to this truth that lack of knowledge of collocation collocations in language learning. Due to the fact that can prevent learners from inferring, and they also collocations have been regarded as one of the major concerns misunderstand native speakers. Wrong use of collocation can of both EFL teachers and learners for many years, the present cause native speakers not to comprehend what the learners study attempts to shed light on the impact of corpus-based say. According to Zarie and Gholami (2007), another collocation on EFL learners' collocation learning and problem of second language learners is the function of awareness. 60 Iranian EFL learners, who participated in this near-synonyms, particularly their collocations. study, were chosen randomly based on their scores in an Near-synonyms are pairs of words with parallel meanings, OPT exam. There were two groups, experimental and control but diverse collocations. Strong and powerful are two cases ones. The study examined the effects of direct corpus-based of near-synonyms. Tea can be strong, but not powerful.
    [Show full text]
  • English Collocations in Use Intermediate Book with Answers
    McCarthy and O’Dell McCarthy and ENGLISH COLLOCATIONS IN USE Collocations are combinations of words, which frequently appear together. Using Intermediate them makes your English sound more natural. Knowledge of collocations is often tested in examinations such as Cambridge FCE, CAE, CPE and IELTS. This book is suitable for ENGLISH students at good intermediate level and above. Using collocations will improve your style of written and spoken English: ENGLISH • Instead of ‘a big amount’, say ‘a substantial amount’ • Instead of ‘think about the options’, say ‘consider the options’ COLLOCATIONS • Using collocations will make your English sound more natural: • Instead of ‘get ill’, say ‘fall ill’ COLLOCATIONS • Instead of ‘a bigCURRENT fine’, say ‘a BCC heavy TOO fine’ LONG Using collocationsFOR will helpNEW you DESIGN avoid common learner errors: How words work • Instead of ‘do a choice’, say ‘make a choice’ together for fluent • Instead of ‘make your homework’, say ‘do your homework’ IN USE and natural English English Collocations in Use Intermediate Self-study and • 60 easy-to-use two-page units: collocations are presented and explained IN USE on left-hand pages with a range of practice exercises on right-hand pages. classroom use • Presents and explains approximately 1,500 collocations in typical contexts Second Edition using short texts, dialogues, tables and charts. Also available • Contains a comprehensive answer key and full index for easy reference. CAMBRIDGE LEARNER’S DICTIONARY• FOURTHHighlights EDITION register to help students choose the appropriate language for ENGLISH VOCABULARY IN USE UPPER-INTERMEDIATEparticular situations. Intermediate ENGLISH PRONUNCIATION IN USE INTERMEDIATE • Informed by the Cambridge English Corpus to ensure that the most frequently used collocations are presented.
    [Show full text]
  • Collocation As Source of Translation Unacceptabilty: Indonesian Students’ Experiences
    International Journal of English Linguistics; Vol. 3, No. 5; 2013 ISSN 1923-869X E-ISSN 1923-8703 Published by Canadian Center of Science and Education Collocation as Source of Translation Unacceptabilty: Indonesian Students’ Experiences Syahron Lubis1 1 Postgraduate Studies of Linguistics, University of Sumatera Utara, Indonesia Correspondence: Syahron Lubis, Postgraduate Studies of Linguistics, University of Sumatera Utara, Indonesia. E-mail: [email protected] Received: July 24, 2013 Accepted: August 19, 2013 Online Published: September 23, 2013 doi:10.5539/ijel.v3n5p20 URL: http://dx.doi.org/10.5539/ijel.v3n5p20 Abstract The aims of the present study are to explore wrong English collocations made by some Indonesian English learners and to find out the causes of the wrong collocations. Twenty seven wrong English lexical collocations and nine wrong grammatical collocations collected from the students’ translation and writing assignments have been examined. After comparing the patterns of English collocations and the Indonesian collocations it is found out that the erroneous English collocations are attributed to four causes: (1) learners’ lack of knowledge of collocation, (2) differences of collocations between English and Bahasa Indonesia, (3) learners’ low mastery of vocabulary and (4) strong interferences of the learners’ native language.It is suggested that the students should be informed that collocation, like grammar, is one aspect of not only English but all other languages that should be learnt if they want their English sound natural and mastery of vocabulary can contribute to the constructing of correct collocations. Keywords: collocation, translation, vocabulary, interference 1. Introduction In learning a foreign language a learner is faced with a variety of problems.
    [Show full text]
  • Towards a Model for Investigating Predicate-Intensifier Collocations
    WORK IN PROGRESS Towards a Model for Investigating Predicate-Intensifier Collocations. Silvia Cacchiani Post-doctoral student Dipartimento di Anglistica, Università di Pisa Via Santa Maria 67,1 - 56126 Pisa (PI) [email protected] Abstract Adverbial intensifiers express the semantic role of degree. Here, we shall focus on English upgrading intensifiers like very, absolutely, extremely, impossibly. Specifically, what we have mainly aspire at is to develop and apply a simple but efficient model that investigates the motivations behind choosing from among competing intensifiers in a non-haphazard way. Such a model is meant to work as a "combinatory chart" that allows for fair comparison ofnear-synonymic intensifiers with respect to a number ofparameters ofvariations (or textual preferences) on the morpho-syntactic, lexico-semantic and discourse-pragmatic levels. Its ultimate lexicographic contribution to the issue of predicate-intensifier collocations will be building a combinatory dictionary of English intensifiers - and, later on, a bilingual combinatory dictionary of English and Italian intensifiers. 1. Corpus Data and Methodology bi the current paper we want to depict a model for investigating predicate-intensifier collocations.1 Upgrading intensifiers constitute an extremely varied lexico-functional category. They boost a quality ab:eady present in their predicate (i.e. head) along an imaginary scale of degree of intensity. The modification introduced cannot be objectively measured (e.g. deadgorgeous as agaiastfully developed countries). The corpus consulted was the BNC (http://www.natcorp.ox.ac.uk/index.html). We only searched the www in the case ofpoorly or not represented at all intensifiers. Since most intensifiers are polyfimctional words ambiguous between different interpretations, one of which is precisely intensification (e.g.
    [Show full text]
  • A Corpus-Based Study of the Effects of Collocation, Phraseology
    A CORPUS-BASED STUDY OF THE EFFECTS OF COLLOCATION, PHRASEOLOGY, GRAMMATICAL PATTERNING, AND REGISTER ON SEMANTIC PROSODY by TIMOTHY PETER MAIN A thesis submitted to the University of Birmingham for the degree of DOCTOR OF PHILOSOPHY Department of English Language and Applied Linguistics College of Arts and Law The University of Birmingham December 2017 University of Birmingham Research Archive e-theses repository This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder. Abstract This thesis investigates four factors that can significantly affect the positive-negative semantic prosody of two high-frequency verbs, CAUSE and HAPPEN. It begins by exploring the problematic notion that semantic prosody is collocational. Statistical measures of collocational significance, including the appropriate span within which evaluative collocates are observed, the nature of the collocates themselves, and their relationship to the node are all examined in detail. The study continues by examining several semi-preconstructed phrases associated with HAPPEN. First, corpus data show that such phrases often activate evaluative modes other than semantic prosody; then, the potentially problematic selection of the canonical form of a phrase is discussed; finally, it is shown that in some cases collocates ostensibly evincing semantic prosody occur in profiles because of their occurrence in frequent phrases, and as such do not constitute strong evidence of semantic prosody.
    [Show full text]
  • Lexical Collocation Errors in Literary Translation
    Dil Dergisi Sayı/Number: 170/1 Ocak/January Gönderildiği tarih: 15.11.2018 Kabul edildiği tarih: 10.01.2019 DOI: 10.33690/dilder.528981 Keywords Anahtar Sözcükler Lexical Collocation Translation, L1 Influence, Action, Verb, Russian, Turkish, Semantic Restriction On Collocability, Translations Of Literary Texts LEXICAL COLLOCATION ERRORS IN LITERARY TRANSLATION YAZINSAL METİNLERİN ÇEVİRİSİNDE SÖZCÜKSEL EŞDİZİM HATALARI • Özgür Şen Bartan Dr. Öğr. Üyesi, Kırıkkale Üniversitesi, Fen-Edebiyat Fakültesi, Batı Dilleri ve Edebiyatları Bölümü, Mütercim Tercümanlık A.B.D., [email protected] Abstract Öz This study aims to explore lexical collocation errors in Bu çalışmanın amacı Türkçe-İngilizce yazınsal metinlerin Turkish-English translations of literary texts. To identify, çevirisinde sözcüksel eşdizim hatalarını araştırmaktır. Türkçe describe and explain the errors (Ellis, 1985), a written corpus yazınsal metinlerin İngilizce çevirilerinde yapılan hataları of Turkish literary texts and their English translation texts belirlemek, tanımlamak ve açıklamak (Ellis, 1985) için İngilizce (ETT) was compiled from students of the Department of Mütercim Tercümanlık Anabilim dalındaki öğrencilerin English Translation and Interpretation. Benson et al.’s (1997) çevirilerinden oluşan yazılı derlem oluşturulmuştur. Bunun classifications for lexical collocations have been used in order yanı sıra, en sık yapılan sözcüksel eşdizim hataları (eylem+ad) to analyse lexical collocation errors for this study. Also, the kısıtlı eşdizimlilik ve anadilin etkisi açısından
    [Show full text]
  • Semantic Prosody and Intensifier Variation in Academic Speech
    SEMANTIC PROSODY AND INTENSIFIER VARIATION IN ACADEMIC SPEECH by ALLISON REBECCA WACHTER (Under the Direction of Chad Howe) ABSTRACT The study of English intensifiers has been of interest in sociolinguistic research. This paper analyzes the variation of common intensifiers very and really in a corpus of Academic English and the language-internal and –external factors that predict this variation. All factors are related to a speaker’s evaluation, specifically to the semantic notion of positive, negative, or neutral prosody. Using a variationst approach, this paper furthers insight to the grammaticalization of intensifiers and how the effect of delexicalization can predict the semantic properties of a modified adjective. The significant factors predicting very/really variation were academic setting, semantic prosody, academic discipline, and gender. The study further develops a methodological framework for operationalizing semantic prosody and producing quantitative results. A second analysis concerning the distinctions between other modifiers, namely reinforcers and attenuators, are also analyzed in the academic corpus. Finally, the paper discusses its support for the use of smaller corpora to examine and compare linguistic effects in more specific registers. INDEX WORDS: semantic prosody, intensifiers, language variation SEMANTIC PROSODY AND INTENSIFIER VARIATION IN ACADEMIC SPEECH by ALLISON REBECCA WACHTER BA, University of Michigan, 2008 A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree MASTER OF ARTS ATHENS, GEORGIA 2012 © 2012 Allison Rebecca Wachter All Rights Reserved SEMANTIC PROSODY AND INTENSIFIER VARIATION IN ACADEMIC SPEECH by ALLISON REBECCA WACHTER Major Professor: Chad Howe Committee: Anya Lunden Paula Schwanenflugel Electronic Version Approved: Maureen Grasso Dean of the Graduate School The University of Georgia May 2012 ACKNOWLEDGEMENTS This thesis would not have been possible without the guidance of my advisor, Dr.
    [Show full text]
  • Collocations As a Language Resource a Functional and Cognitive Study in English Phraseology
    Collocations as a language resource A functional and cognitive study in English phraseology Sonja Poulsen Ph.d. dissertation Institute of Language and Communication University of Southern Denmark 2005 Supervisors Associate professor Fritz Larsen, University of Southern Denmark Associate professor John M. Dienhart, University of Southern Denmark Associate professor Alex Klinge, Copenhagen Business School Front cover Peter Harder’s analogy (1996a: 91) between the functional role that wings play for the survival of birds in biology and the functionality of linguistic constructions inspired me to use birds for my front cover illustration. My thanks to Kurt Normann for letting me use parts of his etching of flying rooks. Sonja Poulsen Institute of Language and Communication University of Southern Denmark DK-5230 Odense M Tel.: +45 6550 2188 E-mail: [email protected] © Sonja Poulsen 2005 Contents: Typographical conventions 3 Abbreviations 3 Tables 4 Figures 4 English résumé 6 Preface and acknowledgements 10 PART ONE 1. Introduction 11 1.1. Phraseology: the traditional approach 12 1.2. Definitions of ‘collocation’ 14 1.3. Methodology and classification in phraseology 17 1.4. Motivation for a functional and cognitive approach 23 1.5. What should a theory of collocations be able to account for? 28 1.6. Overview 33 PART TWO 2. The foundations of a traditional approach to phraseology 34 2.1. Theoretical influences on phraseology 35 2.1.1. A practical concern: teaching English as a foreign language 35 2.1.2. Firthian linguistics 38 2.1.3. Underlying assumptions 48 2.1.3.1. Structuralist dichotomies 48 2.1.3.2.
    [Show full text]
  • Verb Collocations in Medical English
    US-China Foreign Language, ISSN 1539-8080 August 2013, Vol. 11, No. 8, 609-618 D DAVID PUBLISHING Verb Collocations in Medical English Bionote Evelina Miščin College of Business and Management, Zagreb, Croatia The aim of this paper was to investigate the most frequent verb collocations which occur in medical texts and to check the competence of the first-year students of medical English in using them. Merck’s Manual of Medical Information (2004) was used as a corpus to find the most frequent collocations. It was analyzed by several programmes like Simple Concordance, Collocation Extract, TermeX. The corpus analysis established that the nouns “function” and “infection” occur with most verbs (30). They are followed by “pain” (28 verbs), “muscle” (24 verbs). Three hundred and sixty-two verbs occur with nouns and among them the most frequent are “cause”, “have”, “develop”, “treat”, “prevent”, and “produce”. After that, the test was devised to see which collocations students use with the most competence. Two hundred and ninety-seven first-year students of School of Medicine in Zagreb were tested. There were four types of exercises—multiple choice, gap-fill, translation from English into Croatian, and vice versa. The average result in multiple choice was 9.8 with the s = 2.0 as a standard deviation, in gap-fill 5.0 with s = 2.17, in translation into Croatian 6.7 with s = 2.10, and in translation into English 5.2 with s = 2.53. Then, glossary was made which should help future users of medical English. Keywords: collocations, collocational competence, corpus, collocational mistakes, glossary Introduction This paper deals with verb collocations in medical English.
    [Show full text]
  • Collocations Or Free Variation? Empirical Study on the Use of Adverbs in Pre-Verb Position by German ESL Learners
    Philipps-Universität Marburg Institut für Anglistik und Amerikanistik – Englische Sprachwissenschaft SS 2004 10 038 SE Foreign Language Vocabulary Prof. Dr. Rüdiger Zimmermann Collocations or Free Variation? Empirical Study on the Use of Adverbs in Pre-Verb Position by German ESL Learners Dirk Hovy Hirschberg 9 35037 Marburg Subjects: DsuL, Spw. Angl., Ling. Eng. MA Matr.Nr.: 1420461 Table of Content 1. INTRODUCTION.......................................................................................................................3 2. RESEARCH QUESTIONS .......................................................................................................3 3. PREVIOUS RESEARCH ..........................................................................................................6 4. METHODOLOGY .....................................................................................................................9 5. RESULTS...................................................................................................................................14 5.1 PARTICIPANTS .......................................................................................................................14 5.2 ADVERBS...............................................................................................................................15 Sentence 1 ..............................................................................................................................15 Sentence 2 ..............................................................................................................................16
    [Show full text]
  • Induction of Syntactic Collocation Patterns from Generic Syntactic Relations
    Induction of Syntactic Collocation Patterns from Generic Syntactic Relations Violeta Seretan University of Geneva Language Technology Laboratory 2 Rue de Candolle, Geneva, CH-1211, Switzerland [email protected] Abstract 2 Syntactic Patterns for Collocations Syntactic configurations used in collocation extrac- In the most permissive case (when no extraction patterns are tion are highly divergent from one system to an- defined), any pair of words is regarded as a valid collocation other, this questioning the validity of results and candidate. This results in much noise, since the most fre- making comparative evaluation difficult. We de- quent word combinations are also the least interesting (e.g., scribe a corpus-driven approach for inferring an ex- “of the”, “in the”). Therefore, many extraction systems per- haustive set of configurations from actual data by form the linguistic analysis of text and apply a linguistic filter finding, with a parser, all the productive syntactic on collocation candidates (very often, they perform POS tag- associations, then by appealing to human expertise ging in order to filter out the pairs involving function words2). for relevance judgements. There is unfortunately much disagreement with respect to the accepted syntactic configurations for collocations (often as a consequence of the lack of consensus in the understand- 1 Introduction ing of the notion of collocation). There is much divergence, The term collocation, often used in different senses in the first, with respect to the POS of participating words. Some literature, is understood here as in the following statement: authors consider open-class words only ([Justeson and Katz, “The term collocation refers to the idiosyncratic syntagmatic 1995; Hausmann, 1989]), while most of them allow for func- combination of lexical items and is independent of word tion words too (as in “agree on”).
    [Show full text]
  • Idioms, Collocations, and Structure Conventionalized Expressions and the Analysis of Ditransitives
    Idioms, Collocations, and Structure Conventionalized Expressions and the Analysis of Ditransitives Benjamin Bruening (University of Delaware) rough draft, January 15, 2018; comments welcome Abstract Noncompositional phrasal idioms have been used as evidence in syntactic theorizing for decades. A standard assumption, occasionally made explicit (e.g., Larson 2017), is that phrasal idioms differ signifi- cantly from fully compositional collocations in the kinds of syntactic structures they can be built from. I show with a detailed empirical study that this is false. In fact, the syntactic constraints on idioms and col- locations are identical. They should therefore be treated the same, as a broad class of conventionalized expressions. I show that the selection theory of idioms proposed in Bruening (2010) extends straight- forwardly to cover literal collocations. A minor modification accounts for patterns of conventionalized expressions that are striking in their absence. Furthermore, a detailed analysis of the patterns of conven- tionalized expressions in ditransitive constructions shows that they are systematic and decide between theoretical proposals for the structure of ditransitive constructions in English (contra Larson 2017). Only the ApplP analysis developed in Bruening (2010) accounts for the patterns. 1 Introduction There is a long tradition in generative syntax of examining the syntactic patterns of non-literal, convention- alized expressions, and using those patterns to inform our theories of syntax, or even to decide between competing analyses of some phenomenon. As an example, patterns of non-literal phrasal idioms have been used as arguments for the proper analysis of ditransitive constructions (e.g., Green 1974, Larson 1988, Richards 2001, Harley 2002, Bruening 2010).
    [Show full text]