Improving Exploration in Reinforcement Learning Through
Total Page:16
File Type:pdf, Size:1020Kb
Extraction of Linguistic Resources from Multilingual Corpora and their Exploitation AHMAD RAZA SHAHID Ph.D. Thesis The University of York Artificial Intelligence Group Department of Computer Science United Kingdom 9th February 2012 Abstract Increasing availability of on-line and off-line multilingual resources along with the developments in the related automatic tools that can process this information, such as GIZA++ (Och & Ney 2003), has made it possible to build new multilin- gual resources that can be used for NLP/IR tasks. Lexicon generation is one such task, which if done by hand is quite expensive with human and capital costs involved. Generation of multilingual lexicons can now be automated, as is done in this research work. Wikipedia1, an on-line multilingual resource was gainfully employed to automatically build multilingual lexicons using simple search strategies. Europarl parallel corpus (Koehn 2002) was used to create multilingual sets of synonyms, that were later used to carry out the task of Word Sense Disambigua- tion (WSD) on the original corpus from which they were derived. The theoretical analysis of the methodology validated our approach. The multilingualsets of synonymswere then used to learn unsupervised mod- 1http://www.wikipedia.org/ ii Abstract iii els of word morphology in the individual languages. The set of experiments we carried out, along with another unsupervised technique, were evaluated against the gold standard. Our results compared very favorably with the other approach. The combination of the two approaches gave even better results. Contents ListofTables .............................. ix ListofFigures.............................. xi Acknowledgements . .. .. .. .. .. .. .. xii Declaration ............................... xiii 1 Introduction and Motivation 2 1.1 InitialMotivation ......................... 2 1.2 NLPandInformationRetrieval(IR) . 3 1.2.1 NLP............................ 3 1.2.2 CorporabasedApproaches. 4 1.2.3 InformationRetrieval(IR) . 6 1.2.4 MultilingualNLPandIR. 7 1.3 MultilingualResources . 8 1.4 ProblemStatements........................ 9 1.4.1 BuildingMultilingualLexicons . 9 1.4.2 CreatingandusingMultilingualSynsets . 10 1.4.3 MorphologicalAnalysis of MultilingualSynsets . 11 1.4.4 EvaluationanditsChallenges . 12 1.5 ThesisOutline........................... 12 1.6 NoteonTerminology . .. .. .. .. .. .. 13 2 Literature Review 14 2.1 Resources ............................. 15 2.1.1 Wikipedia......................... 15 2.1.2 ParallelCorpora . .. .. .. .. .. 16 2.1.3 WordNet:ALexicalSemanticResource . 18 iv CONTENTS v 2.2 NLPandIR ............................ 20 2.2.1 WordSenseDisambiguation . 20 2.2.1.1 SupervisedDisambiguation . 22 2.2.1.2 UnsupervisedDisambiguation. 26 2.2.1.3 PPAttachmentAmbiguity. 31 2.2.1.4 WordNetandWSD . 35 2.2.1.5 MultilingualDisambiguation . 37 2.2.1.6 DisambiguationinWikipedia . 38 2.2.1.7 UsingWikipediaforWSD . 40 2.2.2 Morphology ....................... 41 2.2.2.1 Analogy .................... 42 2.2.2.2 Harris’sApproach . 43 2.2.2.3 UnsupervisedApproach . 44 2.2.3 InformationRetrieval. 46 2.2.3.1 VectorSpaceModel . 47 2.2.3.2 TFIDF..................... 49 2.2.3.3 PerformanceMeasures . 50 2.2.3.4 ProbabilisticInformationRetrieval . 51 2.2.3.5 Multi-Lingual Information Retrieval (MLIR) . 53 2.2.3.6 ProbabilisticMulti-lingualIR . 56 2.2.3.7 Dictionary-BasedMLIR. 57 2.2.3.8 CorporabasedApproachesforIR . 59 2.3 MachineLearning(ML) . 60 2.3.1 Clustering......................... 60 2.3.2 MeasuresofClusteringQuality . 63 2.3.3 DecisionTrees ...................... 64 2.4 BuildingResourcesfromCorpora . 64 2.4.1 Extracting LinguisticResources from Wikipedia . 64 2.4.2 Building MultilingualLexicons and WordNets using Par- allelCorpora ....................... 68 3 Extraction of Multilingual Lexicons from Wikipedia 73 3.1 MainIdea ............................. 73 3.2 Methodology ........................... 75 3.3 LexiconGeneration. .. .. .. .. .. .. 77 3.3.1 Algorithm......................... 79 3.3.2 GeneralLexicons. 80 3.3.2.1 EBGandEGFP ................ 80 3.3.2.2 HistogramsforEBG. 82 3.3.2.3 HistogramsforEGFP . 85 3.3.2.4 Removal of Redundancy and Numeric Values 86 vi CONTENTS 3.3.2.5 HeptaLex ................... 94 3.3.3 DomainSpecificDictionaries . 96 3.3.3.1 ComputerScienceSpecificLexicon . 97 3.3.3.2 CategoryTranslations . 99 3.4 SomeProgrammingRelatedIssues . 100 3.5 AnalysisofLanguagesinWikipedia . 104 3.6 Evaluation............................. 107 3.7 Conclusion ............................ 107 4 Extraction of Multilingual Synsets from Aligned Corpora 113 4.1 MainIdea ............................. 113 4.2 Assumptions ........................... 114 4.3 ParallelCorporaandPre-processing . 115 4.3.1 StructureoftheEuroparlCorpus . 116 4.3.2 Pre-processing . 118 4.4 WordAlignment ......................... 119 4.4.1 Creation of Vocabulary and Sentence Files . 120 4.4.2 CreationofWordClasses. 122 4.5 CollationofWordsintoPhrases . 124 4.5.1 Author’sNote. 134 4.6 Disambiguation .......................... 135 4.7 Evaluation............................. 135 4.7.1 Baseline Comparisonfor Extractionof MultilingualSynsets136 4.7.1.1 ExperimentalDesign . 137 4.7.2 IssueswithEvaluation . 140 4.7.3 UsingClusteringforEvaluation . 141 4.7.3.1 Experiments . 144 4.7.4 Discussion ........................ 152 4.7.4.1 WhytheTags?. 153 4.7.4.2 Which Machine Learning Approach to Use? . 153 4.7.5 UsingDecisionTreesforEvaluation . 153 4.7.5.1 ExperimentalDesign . 155 4.8 Discussion............................. 156 4.9 ErrorAnalysis........................... 157 4.10 SemEval Parallel Corpora and Generation of Multilingual Synsets 159 4.11 TheoreticalAnalysis . 161 4.11.1 SenseInventory. 161 4.11.2 GoldStandard . 163 4.11.3 Methodology . 164 4.11.4 Experiments . 165 4.11.5 Baseline Comparison for Extraction of Multilingual Synsets174 CONTENTS vii 4.12 Discussion............................. 175 4.13 Conclusion ............................ 175 5 Morphology and Lexical Distances 179 5.1 MainIdea ............................. 179 5.2 MorphologicalAnalysis. 180 5.3 Experiments............................ 181 5.4 EditDistance ........................... 183 5.4.1 CalculatingEditDistances . 184 5.4.2 Edit Distances between Multilingual Proto-Synsets . 186 5.5 LookingforWordParadigms . 188 5.5.1 MergingParadigms. 191 5.5.2 Merging Paradigms based on Common Number of Stems 194 5.5.2.1 Signature Refinement and Merging Paradigms basedonCommonNumberofStems . 197 5.5.3 Discussion ........................ 199 5.6 Further Experimentsin MultilingualMorphology . 201 5.7 EvaluationofMorphologicalAnalysis . 201 5.7.1 Evaluation ........................ 203 5.7.2 SegmentationwithSupport. 204 5.7.3 AnalogyPrinciple . 205 5.7.4 Results .......................... 205 5.8 Conclusion ............................ 206 6 Conclusion 210 6.1 Summary ............................. 210 6.1.1 Automatic Generation of MultilingualLexicons . 211 6.1.2 Extraction of Multilingual Proto-Synsets from Parallel Corpora.......................... 212 6.1.3 MorphologicalAnalysis . 213 6.2 Contributions ........................... 215 6.3 FutureWork............................ 215 References 217 List of Tables 3.1 100entrieschosenfromHeptaLex,part1 . 109 3.2 100entrieschosenfromHeptaLex,part2 . 110 3.3 100entrieschosenfromHeptaLexpart3. 111 3.4 ResultsofevaluationofHeptaLex . 112 4.1 AsamplefromEnglishpartoftheEuroparlcorpus . 117 4.2 Sample of German translationof the Englishexample . 118 4.3 SampleoftheEnglishvocabularyfile . 120 4.4 Sentence file containing 2 pairs of sentences for English-German 121 4.5 SampleofEnglish-Germaninput . 121 4.6 Examples of word alignment probability for English and German 123 4.7 TheresultofwordalignmentforEnglish-German . 123 4.8 Sentences in French and Greek corresponding to Table 4.7 . 124 4.9 Generatedmultilingualsynsets . 125 4.10 Number of English words aligned with a non-pivotal language . 128 4.11 GroupnumbersassignedtoEnglishwords . 130 4.12 Corresponding German words and their Num values. 130 4.13 Example of non-consecutive alignment of German words . 131 4.14 The German entries followingthe ones in Table 4.13 . 131 4.15 proto-synsetsforEnglishandGermanonly . 131 4.16 Corresponding entries for French for the generation of phrases . 132 4.17 Revised group numbers for English alignment with French ... 132 4.18 CorrespondinginformationforFrench . 132 4.19 Entries for Greek for English-Greek word alignment . 133 4.20 Revised group numbers for English after English-Greek alignment 133 viii LIST OF TABLES ix 4.21 Greek table after changing group number information in English 133 4.22 Phrase group information for final generation of phrases. .... 134 4.23 Thefinalphrasesgenerated. 134 4.24 ResultsofBaselineComparison . 140 4.25 Outputoftheclusteringtool. 149 4.26 Impuritymeasuresfordifferentscenarios . 151 4.27 Accuracy figures for Word Alignment when English is used as a targetlanguage .......................... 158 4.28 A sample of the sense inventory for the target word bank..... 162 4.29 Summary for all languages for the five target words in SemEval 172 4.30 Avg. figuresforFrench,Spanish,andItalian . 173 4.31 Average figures for French and Italian for the target words. 173 4.32 Average figures for Spanish and Italian for the target words.. 173 5.1 Anexampleofmorphologicalsyntacticvariation . 186 5.2 PairofGreeksynonyms. 186 5.3 Asampleofeditdistancesforasynset-pair . 187 5.4 Asampleofsynset-pairs . 188 5.5 Asampleoftheparadigmscreated.. 193 5.6 Numbersofparadigmscreatedvsmergethreshold .