Collocation and Term Extraction Using Linguistically Enhanced Statistical Methods

Collocation and Term Extraction Using Linguistically Enhanced Statistical Methods Dissertation zur Erlangung des akademischen Grades Doctor philosophiae (Dr. phil.) vorgelegt dem Rat der Philosophischen Fakultät der Friedrich-Schiller-Universität Jena von Joachim Wermter, MA (Master of Arts) geboren am 10.08.1969 in Horb am Neckar Gutachter 1. Prof. Dr. Udo Hahn 2. Prof. Dr. Rüdiger Klar 3. Prof. Dr. Adrian P. Simpson Tag des Kolloquiums: 04.08.2008 Acknowledgements My heartfelt thanks go out to Prof. Dr. Udo Hahn (Professor of Computational Linguistics at Jena University) who greatly supported me during the course of my work in a unique way. Without his advice and assistance this work probably would not have seen the light of day. I would also like to thank Prof. Dr. Rüdiger Klar and PD Dr. med. Stefan Schulz from the Department of Medical Informatics at Freiburg University Hospital who greatly helped me at the early stages of my research within the DFG-funded MorphoSaurus project. Next, I would like to thank Sabine Demsar, Kristina Meller, and Konrad Feld- meier, who did a great job at annoting the PNV triple candidates as a collocation gold standard. Many thanks also to the developers of the Umls Metathesaurus for setting up and maintaing such a great terminological resource. Both efforts made the evaluative part of this work possible in the first place. Many thanks also to Dr. med Peter Horn (Hannover Medical School) who greatly assisted me in selecting the right MeSH terms for HSCT and immunology. My dearest thanks go out to my wife Holly. Without her constant support and encouragement, particularly in difficult times, the whole enterprise would not have been worth while. Contents 1 Introduction 1 1.1 Main Objectives and Contributions . 4 1.2 StructureofthisThesis........................... 6 2 Defining Collocations and Terms 9 2.1 DefiningCollocations............................ 10 2.1.1 Defining Collocations from the Lexicographic Perspective . 11 2.1.1.1 The Meaning-based Lexicographic Approach to Collo- cations........................... 12 2.1.1.2 Other Lexicographic Accounts of Collocations . 16 2.1.2 Defining Collocations from the Frequentist Perspective . 18 2.1.2.1 Firth’s Model of Language Description . 19 2.1.2.2 The Collocational Layer of Firth’s Model . 20 2.1.2.3 Neo-Firthian Developments: Halliday and Sinclair . 21 2.1.3 Defining Collocations from a Computational Linguistics Per- spective ............................... 23 2.1.4 Linguistic Properties of Collocations Adopted . 27 2.1.4.1 Four Basic Characteristic Properties . 28 2.1.4.2 Demarcation Line: Collocations vs. Free Word Com- binations ......................... 31 2.2 DefiningTerms ............................... 34 2.2.1 General Theory of Terminology . 36 2.2.2 Beyond General Theory of Terminology . 38 2.2.3 Conventional Definitions of Terms . 39 2.2.4 Problems with the Classical and Conventional Approaches . 40 CONTENTS vi 2.2.5 Pragmatic Definitions of Terms . 42 2.2.6 TermsandSublanguage . 43 2.2.7 (Computational) Linguistic Definitions of Terms . 46 2.3 Assessment of Linguistic Definitions for Collocations and Terms . 49 3 Approaches to the Extraction of Collocations and Terms 53 3.1 Approaches to Collocation Extraction . 54 3.1.1 Berry-Rogghe ............................ 55 3.1.2 Smadja ............................... 56 3.1.3 Lin.................................. 59 3.1.4 EvertandKrenn .......................... 62 3.2 ApproachestoTermExtraction . 65 3.2.1 JustesonandKatz ......................... 66 3.2.2 Daille ................................ 68 3.2.3 FrantziandAnaniadou . 69 3.2.4 Jacquemin.............................. 73 3.3 Lexical Association Measures and their Application . 75 3.3.1 StatisticalFoundations . 77 3.3.2 T-test ................................ 82 3.3.3 Log-Likelihood ........................... 84 3.3.4 MutualInformation. 85 3.3.5 ExtensionstoLarger-SizeN-Grams . 86 3.3.6 Shortcomings and Linguistic Filtering . 88 3.3.7 Frequency of Co-Occurrence . 90 3.3.8 C-value ............................... 91 4 Linguistically Enhanced Statistics To Measure Lexical Association 93 4.1 StatisticalRequirements . 95 4.1.1 Avoidance of Non-Linguistic Assumptions . 96 4.1.2 ExtensibilityofSize. 97 4.1.3 The Frequency Of Co-Occurrence Factor . 98 4.1.4 OutputRanking .......................... 98 4.2 LinguisticRequirements . .100 CONTENTS vii 4.2.1 Firth as Linguistic Frame of Reference . 101 4.2.2 Linguistic Requirements for Collocations . 103 4.2.3 Linguistic Requirements for Terms . 105 4.3 Limited Syntagmatic Modifiability for CollocationExtraction . .108 4.3.1 Defining Limited Syntagmatic Modifiability . 109 4.3.2 Illustrating Limited Syntagmatic Modifiability . 111 4.4 Limited Paradigmatic Modifiability for Term Extraction . .......114 4.4.1 Defining Limited Paradigmatic Modifiability . 114 4.4.2 Illustrating Limited Paradigmatic Modifiability . 117 4.5 EvaluationSetting .............................119 4.5.1 General Requirements for Evaluation . 120 4.5.1.1 Assembling the Text Corpus . 121 4.5.1.2 Extracting and Counting the Candidates . 121 4.5.1.3 Classification of the Targets . 123 4.5.1.4 Quantitative Performance Evaluation . 125 4.5.1.5 Baselines and Significance . 128 4.5.1.6 Qualitative Performance Evaluation . 130 4.5.2 Evaluation Setting for Collocation Extraction . 131 4.5.2.1 Text Corpus and Linguistic Filtering . 131 4.5.2.2 Target Structure and Candidate Sets . 133 4.5.2.3 Classification of Candidate Set and Quality Control . 135 4.5.3 Evaluation Setting for Term Extraction . 139 4.5.3.1 Text Corpus and Linguistic Filtering . 139 4.5.3.2 Target Structures and Candidate Sets . 141 4.5.3.3 Classification of Candidate Sets . 143 5 Experimental Results 147 5.1 Experimental Results for Collocation Extraction . .......148 5.1.1 QuantitativeResults . .148 5.1.1.1 Results on Performance Metrics . 149 5.1.1.2 Results on Significance Testing . 157 5.1.2 QualitativeResults . .158 5.1.2.1 Results on the Static Criteria . 159 CONTENTS viii 5.1.2.2 Results on the Dynamic Criteria . 161 5.1.3 Limited Syntagmatic Modifiability Revisited . 165 5.2 Experimental Results for Term Extraction . 167 5.2.1 QuantitativeResults . .167 5.2.1.1 Results on Performance Metrics . 168 5.2.1.2 Results on Significance Testing . 180 5.2.2 QualitativeResults . .183 5.2.2.1 Results on the static criteria . 184 5.2.2.2 Results on the dynamic criteria . 189 5.3 Assessment of Experimental Results . 196 6 Conclusions and Outlook 201 7 Summary 205 A Collocation Classification Manual 207 A.1 Definitionen aus der einschlägigen Literatur . 207 A.2 Unsere Richtlinien für die Verwendung des Begriffs ‘Kollokation’ . 208 A.3 DetailszurKlassifizierung . .210 B MeSH Terms and UMLS Source Vocabularies 213 B.1 MeSHTerms ................................213 B.2 UMLSSourceVocabularies. .215 List of Tables 3.1 Observed and marginal frequencies ...................... 78 3.2 Expected frequencies and their computation from marginal frequencies . 79 3.3 Observed and marginal frequencies for a German PNV collocation. 80 3.4 Expected frequencies for the same German PNV collocation. ...... 80 4.1 Collocational and non-collocational PNV Triples with Associated Syntag- matic Attachments ..............................112 4.2 Support Verb Construction PNV Triple with Associated Syntagmatic At- tachments ..................................113 4.3 Possible selections for k = 1, k = 2 and k = 3 for a trigram noun phrase 115 4.4 LPM and k-selection modifiabilities for k = 1 and k = 2 for the trigram term “open reading frame” ..........................118 4.5 LPM and k-selection modifiabilities for k = 1 and k = 2 for the trigram non-term “t cell response” ..........................118 4.6 Context of quantitative performance evaluation. ........126 4.7 The McNemar significance test of differences for comparing two lexical association measures (LAMs). .........................129 4.8 Frequency distribution of PNV triple tokens and types for our 100-million- word German newspaper corpus .......................134 4.9 Frequency distribution of PNV triple tokens and types for 10 million words of German newspaper corpus .........................134 4.10 Proportion of actual PNV triple collocations and sub proportions of the three collocational categories. ........................136 4.11 Ranges of the Kappa coefficient and designated strengths of agreement . 137 4.12 Observed coarse-grained collocation classifications by two annotators . 137 LIST OF TABLES x 4.13 Overview of agreement rates and Kappa values for coarse-grained classification. ..................................138 4.14 Overview of agreement rates and Kappa values for fine-grained classification. 138 4.15 Frequency distribution for n-gram noun phrase term candidate tokens and types for the 100-million-word Medline text corpus .............142 4.16 Frequency distribution for n-gram noun phrase term candidate tokens and types for the 10-million-word Medline text corpus .............143 4.17 Proportion of actual terms among the bigram, trigram and quadgram term candidates in the large and small corpus. ..................144 5.1 Precision scores of association measures for collocation extraction on the 114 million word (upper table) and 10 million word (lower table) German newspaper corpus. ..............................150 5.2 Recall scores of association measures for collocation extraction on the 114 million word (upper table) and 10 million word (lower table) German newspaper corpus. .................................152 5.3 F-scores of association measures

Collocation and Term Extraction Using Linguistically Enhanced Statistical Methods

D1.3 Deliverable Title: Action Ontology and Object Categories Type

Collocation Classification with Unsupervised Relation Vectors

Comparative Evaluation of Collocation Extraction Metrics

Collocation Extraction Using Parallel Corpus

Collocation Classification with Unsupervised Relation Vectors

Dish2vec: a Comparison of Word Embedding Methods in an Unsupervised Setting

Evaluating Language Models for the Retrieval and Categorization of Lexical Collocations

Collocation Segmentation for Text Chunking

17058 FULLTEXT.Pdf (1.188Mb)

Exploratory Search on Mobile Devices

Automatic Extraction of Synonymous Collocation Pairs from a Text Corpus

Comparison of Latent Dirichlet Modeling and Factor Analysis for Topic Extraction: a Lesson of History