Using the Web As an Implicit Training Set: Application to Noun Compound
Total Page:16
File Type:pdf, Size:1020Kb
Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics by Preslav Ivanov Nakov Magistar (Sofia University, Bulgaria) 2001 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY arXiv:1912.01113v1 [cs.CL] 23 Nov 2019 Committee in charge: Professor Marti Hearst, Co-chair Professor Dan Klein, Co-chair Professor Jerome Feldman Professor Lynn Nichols Fall 2007 The dissertation of Preslav Ivanov Nakov is approved: Professor Marti Hearst, Co-chair Date Professor Dan Klein, Co-chair Date Professor Jerome Feldman Date Professor Lynn Nichols Date University of California, Berkeley Fall 2007 Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics Copyright 2007 by Preslav Ivanov Nakov 1 Abstract Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics by Preslav Ivanov Nakov Doctor of Philosophy in Computer Science University of California, Berkeley Professor Marti Hearst, Co-chair Professor Dan Klein, Co-chair An important characteristic of English written text is the abundance of noun compounds – sequences of nouns acting as a single noun, e.g., colon cancer tumor suppressor protein. While eventually mastered by domain experts, their interpretation poses a major challenge for automated analysis. Understanding noun compounds’ syntax and semantics is important for many natural language applications, including question answering, machine translation, information retrieval, and information extraction. For example, a question answering system might need to know whether ‘protein acting as a tumor suppressor’ is an acceptable paraphrase of the noun compound tumor suppressor protein, and an information extraction system might need to decide if the terms neck vein thrombosis and neck thrombosis can possibly co-refer when used in the same document. Similarly, a phrase-based machine 2 translation system facing the unknown phrase WTO Geneva headquarters, could benefit from being able to paraphrase it as Geneva headquarters of the WTO or WTO headquarters located in Geneva. Given a query like migraine treatment, an information retrieval system could use paraphrasing verbs like relieve and prevent for page ranking and query refinement. I address the problem of noun compounds syntax by means of novel, highly accu- rate unsupervised and lightly supervised algorithms using the Web as a corpus and search engines as interfaces to that corpus. Traditionally the Web has been viewed as a source of page hit counts, used as an estimate for n-gram word frequencies. I extend this approach by introducing novel surface features and paraphrases, which yield state-of-the-art results for the task of noun compound bracketing. I also show how these kinds of features can be applied to other structural ambiguity problems, like prepositional phrase attachment and noun phrase coordination. I address noun compound semantics by automatically generat- ing paraphrasing verbs and prepositions that make explicit the hidden semantic relations between the nouns in a noun compound. I also demonstrate how these paraphrasing verbs can be used to solve various relational similarity problems, and how paraphrasing noun compounds can improve machine translation. Professor Marti Hearst Dissertation Committee Co-chair Professor Dan Klein Dissertation Committee Co-chair i To Petya. ii Contents 1 Introduction 1 2 Noun Compounds 6 2.1 TheProcessofCompounding . 6 2.2 Defining the Notion of Noun Compound .................... 9 2.3 NounCompoundsinOtherLanguages . 14 2.3.1 GermanicLanguages........................... 15 2.3.2 RomanceLanguages ........................... 16 2.3.3 SlavicLanguages ............................. 16 2.3.4 TurkicLanguages............................. 21 2.4 Noun Compounds: My Definition, Scope Restrictions . ........ 22 2.5 LinguisticTheories .............................. 24 2.5.1 Levi’s Recoverably Deletable Predicates . ...... 24 2.5.2 Lauer’s Prepositional Semantics . 31 2.5.3 DescentofHierarchy........................... 31 3 Parsing Noun Compounds 35 3.1 Introduction.................................... 36 3.2 RelatedWork................................... 38 3.2.1 UnsupervisedModels........................... 38 3.2.2 SupervisedModels ............................ 40 3.2.3 WebasaCorpus............................. 42 3.3 AdjacencyandDependency . 46 3.3.1 AdjacencyModel............................. 46 3.3.2 DependencyModel............................ 46 3.3.3 Adjacencyvs.Dependency . 47 3.4 Frequency-based Association Scores . ...... 49 3.4.1 Frequency................................. 49 3.4.2 ConditionalProbability . 50 3.4.3 Pointwise Mutual Information . 50 3.4.4 Chi-square ................................ 52 3.5 Web-Derived Surface Features . 53 iii 3.5.1 Dash.................................... 54 3.5.2 GenitiveMarker ............................. 54 3.5.3 Internal Capitalization . 54 3.5.4 EmbeddedSlash ............................. 55 3.5.5 Parentheses ................................ 55 3.5.6 ExternalDash .............................. 55 3.5.7 OtherPunctuation ............................ 55 3.6 Other Web-Derived Features . 56 3.6.1 Genitive Marker – Querying Directly . 56 3.6.2 Abbreviation ............................... 57 3.6.3 Concatenation .............................. 57 3.6.4 Wildcard ................................. 59 3.6.5 Reordering ................................ 61 3.6.6 Internal Inflection Variability . 62 3.6.7 SwappingtheFirstTwoWords . 63 3.7 Paraphrases.................................... 63 3.8 Datasets...................................... 67 3.8.1 Lauer’sDataset.............................. 67 3.8.2 BiomedicalDataset............................ 70 3.9 Evaluation..................................... 72 3.9.1 ExperimentalSetup ........................... 72 3.9.2 Results for Lauer’s Dataset ....................... 73 3.9.3 Results for the Biomedical Dataset ................... 77 3.9.4 ConfidenceIntervals ........................... 79 3.9.5 Statistical Significance . 81 3.10 UsingMEDLINEinsteadoftheWeb . 86 3.11 ConclusionandFutureWork . 87 4 Noun Compounds’ Semantics and Relational Similarity 89 4.1 Introduction.................................... 90 4.2 RelatedWork................................... 91 4.3 Using Verbs to Characterize Noun-Noun Relations . ........ 96 4.4 Method ...................................... 97 4.5 SemanticInterpretation . 98 4.5.1 Verb-based Vector-Space Model . 98 4.5.2 ComponentialAnalysis . 100 4.5.3 Dynamic Componential Analysis . 101 4.6 Comparison to Abstract Relations in the Literature . .......... 102 4.6.1 Comparison to Girju et al. (2005) ................... 102 4.6.2 Comparison to Barker & Szpakowicz (1998) . 104 4.6.3 Comparison to Rosario et al. (2002) .................. 104 4.7 Comparison to Human-Generated Verbs . 109 4.8 ComparisontoFrameNet . .. .. .. .. .. .. .. .. 120 4.9 Application to Relational Similarity . ....... 124 4.9.1 Introduction ............................... 124 iv 4.9.2 Method .................................. 125 4.9.3 SATVerbalAnalogy .. .. .. .. .. .. .. .. 131 4.9.4 Head-Modifier Relations . 133 4.9.5 Relations Between Nominals . 136 4.10Discussion..................................... 143 4.11 ConclusionandFutureWork . 144 5 Improving Machine Translation with Noun Compound Paraphrases 146 5.1 Introduction.................................... 147 5.2 RelatedWork................................... 148 5.3 Method ...................................... 149 5.4 Paraphrasing ................................... 152 5.5 Evaluation..................................... 156 5.6 ResultsandDiscussion. 158 5.7 ConclusionandFutureWork . 164 6 Other Applications 166 6.1 Prepositional Phrase Attachment . 166 6.1.1 Introduction ............................... 166 6.1.2 RelatedWork............................... 167 6.1.3 ModelsandFeatures ........................... 170 6.1.4 Experiments and Evaluation . 177 6.1.5 ConclusionandFutureWork . 181 6.2 NounPhraseCoordination. 183 6.2.1 Introduction ............................... 183 6.2.2 RelatedWork............................... 185 6.2.3 ModelsandFeatures ........................... 187 6.2.4 Evaluation ................................ 190 6.2.5 ConclusionandFutureWork . 192 7 Quality and Stability of Page Hit Estimates 194 7.1 Using Web Page Hits: Problems and Limitations . 195 7.1.1 Lack of Linguistic Restrictions on the Query . 195 7.1.2 NoIndexingforPunctuation . 196 7.1.3 Page Hits are Not n-grams . 197 7.1.4 Rounding and Extrapolation of Page Hit Estimates . 197 7.1.5 Inconsistencies of Page Hit Estimates and Boolean Logic . 198 7.1.6 Instability of Page Hit Estimates . 199 7.2 Noun Compound Bracketing Experiments . 201 7.2.1 VariabilityoverTime . .. .. .. .. .. .. .. 201 7.2.2 Variability by Search Engine . 211 7.2.3 Impact of Language Filtering . 214 7.2.4 ImpactofWordInflections . 217 7.2.5 Variability by Search Engine: 28 Months Later . 219 7.2.6 Page Hit Estimations: Top 10 vs. Top 1,000 . 220 v 7.3 ConclusionandFutureWork . 220 8 Conclusions and Future Work 227 8.1 ContributionsoftheThesis . 227 8.2 DirectionsforFutureResearch . 229 A Semantic Entailment 255 B Phrase Variability in Textual Entailment 259 B.1 PrepositionalParaphrase . 260 B.2 Prepositional Paraphrase & Genitive Noun Compound . ........ 260 B.3 Verbal & Prepositional Paraphrases . 261 B.4 Nominalization .................................