Similarity Measures for Semantic Relation Extraction
Université catholique de Louvain & Bauman Moscow State Technical University

The dissertation is presented by Alexander Panchenko in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Jury:
Prof. Cédrick Fairon (supervisor), Université catholique de Louvain
Prof. Andrey Philippovich (supervisor), Bauman Moscow State Technical University
Prof. Henri Bouillon (jury president), Université catholique de Louvain
Prof. Marco Saerens, Université catholique de Louvain
Dr. Jean-Michel Renders, Xerox Research Center Europe
Prof. Marie-Francine Moens, KU Leuven

Louvain-la-Neuve 2012-2013

To my parents Luidmila and Ivan for their unconditional love and support.

Contents

Acknowledgments
Publications Related to this Thesis
List of Notations and Abbreviations
Introduction
1 Semantic Relation Extraction: the Context and the Problem
  1.1 Semantic Relations and Resources
    1.1.1 Definition
    1.1.2 Examples
  1.2 Semantic Relation Extraction
    1.2.1 Extraction Process
    1.2.2 Similarity-Based Extraction
    1.2.3 Evaluation
  1.3 Conclusion
2 Single Semantic Similarity Measures
  2.1 Related Work
  2.2 SDA-MWE: A Similarity Measure Based on Syntactic Distributional Analysis
    2.2.1 Dataset
    2.2.2 Method
    2.2.3 Evaluation
    2.2.4 Results
    2.2.5 Summary
  2.3 DefVectors: A Similarity Measure Based on Definitions
    2.3.1 Method
    2.3.2 Results
    2.3.3 Discussion
    2.3.4 Summary
  2.4 PatternSim: A Similarity Measure Based on Lexico-Syntactic Patterns
    2.4.1 Lexico-Syntactic Patterns
    2.4.2 Semantic Similarity Measures
    2.4.3 Evaluation and Results
    2.4.4 Summary
  2.5 Conclusion
3 Comparison of Network-, Corpus-, and Definition-Based Similarity Measures
  3.1 Related Work
  3.2 Network-Based Measures
  3.3 Corpus-Based Measures
    3.3.1 Distributional Measures
    3.3.2 Web-Based Measures
    3.3.3 Latent Semantic Analysis
  3.4 Definition-Based Measures
  3.5 Classification of the Measures
  3.6 Results
    3.6.1 Correlation with Human Judgments
    3.6.2 Semantic Relation Ranking
    3.6.3 Comparison of Semantic Relation Distributions
  3.7 Discussion
  3.8 Conclusion
4 Hybrid Semantic Similarity Measures
  4.1 Features: Single Semantic Similarity Measures
  4.2 Combination Methods
  4.3 Measure Selection Methods
  4.4 Results
    4.4.1 General Performance
    4.4.2 Semantic Relation Distribution of the Hybrid Measure Logit-E15
  4.5 Discussion
  4.6 Conclusion
5 Applications of Semantic Similarity Measures
  5.1 Serelex: Search and Visualization of Semantically Similar Words
    5.1.1 The System
    5.1.2 Evaluation and Results
    5.1.3 Summary
  5.2 Short Text Categorization
    5.2.1 Related Work
    5.2.2 Filename Classification
    5.2.3 Evaluation and Results
    5.2.4 Examples of the Vocabulary Projection
    5.2.5 Discussion
    5.2.6 Summary
  5.3 Possible Applications to Text-Based Information Retrieval
  5.4 Conclusion
Conclusion
Bibliography
Appendix A: Additional Examples of the Serelex System

Acknowledgments

First of all, I thank my supervisor, Professor Cédrick Fairon of Université catholique de Louvain, and my co-supervisor, Professor Andrey Philippovich of Bauman Moscow State Technical University, for their invaluable help and support during these years. Next, I would like to acknowledge the financial support of the “Wallonie-Bruxelles International (WBI)” foundation and the “Institut Langage et Communication (IL&C)” of Université catholique de Louvain.

I am also thankful to the members of my scientific committee: Professor Marco Saerens from Université catholique de Louvain, Dr. Jean-Michel Renders from Xerox Research Center and Professor Marie-Francine Moens from KU Leuven.
Their advanced questions, precise suggestions and critical comments significantly improved the quality of this dissertation. Moreover, I want to thank Professor Yuri N. Philippovich from Bauman State Technical University for helping me take my first steps in Computational Linguistics and for all our scientific discussions.

CENTAL, the NLP laboratory of Université catholique de Louvain, provided me with an excellent research environment. I especially acknowledge the help of Dr. Thomas François, who was always ready to answer a question and share his knowledge of Statistics and Natural Language Processing. Several people from CENTAL provided helpful comments on the first versions of this text: Adrien Dessy, Olga Morozova, Jean-Léon Bouraoui, Thomas François, Sandrine Brognaux, Hubert Naets, Stéphanie Weiser, Patrick Watrin, Adrien Bibal and Louise-Amélie Cougnon. This help was essential to the success of the work.

Last but not least, I thank all contributors to the “Serelex” project, especially Pavel Romanov, Hubert Naets, Olga Morozova and Alexey Romanov. It was a great pleasure and fun to collaborate with you.

Finally, I thank Polina for her love and support.

Alexander Panchenko
Louvain-la-Neuve, 14th February 2013

Publications Related to this Thesis

[1] Panchenko A., Beaufort R., Naets H., Fairon C. Towards Detection of Child Sexual Abuse Media: Classification of the Associated Filenames. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013). Lecture Notes in Computer Science (Springer), vol. 7814, Moscow, Russia.

[2] Panchenko A., Romanov P., Morozova O., Naets H., Philippovich A., Romanov A., Fairon C. Serelex: Search and Visualization of Semantically Related Words. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013). Lecture Notes in Computer Science (Springer), vol. 7814, Moscow, Russia.

[3] Panchenko A., Morozova O., Naets H. A Semantic Similarity Measure Based on Lexico-Syntactic Patterns.
// In Proceedings of the 11th Conference on Natural Language Processing (KONVENS 2012) – Vienna (Austria), 2012 – pp. 174–178.

[4] Panchenko A., Beaufort R., Fairon C. Detection of Child Sexual Abuse Media on P2P Networks: Normalization and Classification of Associated Filenames. // In Proceedings of Public Security Applications Workshop, International Conference on Language Resources and Evaluation (LREC 2012) – Istanbul (Turkey), 2012 – pp. 27–31.

[5] Panchenko A., Adeykin S., Romanov P., Romanov A. Extraction of Semantic Relations between Concepts with KNN Algorithms on Wikipedia. // In Proceedings of Concept Discovery in Unstructured Data Workshop (CDUD), International Conference On Formal Concept Analysis (ICFCA 2012) – Leuven (Belgium), 2012.

[6] Panchenko A., Morozova O. A Study of Hybrid Similarity Measures for Semantic Relation Extraction. // In Proceedings of Innovative Hybrid Approaches to the Processing of Textual Data Workshop, Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012) – Avignon (France), 2012 – pp. 10–18.

[7] Panchenko A. A Study of Heterogeneous Similarity Measures for Semantic Relation Extraction. // In Proceedings of 14e Rencontres des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (JEP-TALN-RECITAL 2012) – Grenoble (France), 2012 – pp. 29–42.

[8] Panchenko A. Towards an Efficient Combination of Similarity Measures for Semantic Relation Extraction. // Abstract in Computational Linguistics in the Netherlands (CLIN 22) – Tilburg (The Netherlands): Tilburg University, 2012 – p. 6.

[9] Panchenko A. Comparison of the Knowledge-, Corpus-, and Web-based Similarity Measures for Semantic Relations Extraction. // In Proceedings of GEometrical Models of Natural Language Semantics Workshop (GEMS), Conference on Empirical Methods in Natural Language Processing (EMNLP 2011) – Edinburgh (UK), 2011 – pp. 11–21.
[10] Panchenko A. Comparison of the Knowledge-, Corpus-, and Web-based Similarity Measures for Semantic Relations Extraction. // Poster in Russian Young Scientists Conference in Information Retrieval (YSC 2011), Russian Summer School in Information Retrieval (RuSSIR 2011) – Saint-Petersburg (Russia), 2011.

[11] Panchenko A. Can We Automatically Reproduce Semantic Relations of an Information Retrieval Thesaurus? // In Proceedings of Russian Young Scientists Conference in Information Retrieval (YSC 2010), Russian Summer School in Information Retrieval (RuSSIR 2010) – Voronezh (Russia), 2010 – pp. 36–51. http://elar.usu.ru/bitstream/1234.56789/3058/1/russir-2010-04.pdf

[12] Panchenko A. Computing Semantic Relations from Heterogeneous Evidence. // Abstract in Computational Linguistics in the Netherlands (CLIN 21) – Ghent (Belgium): University College Ghent, 2011 – p. 39.

In Russian

[13] Panchenko A., Adeykin S., Romanov P., Romanov A. Extraction of Semantic Relations from Wikipedia Articles with Nearest Neighbor Algorithms. // In Proceedings of the Conference on Analysis of Social Networks, Images and Texts (AIST) – Ekaterinburg, 2012 – pp. 208–219.

[14] Panchenko A. A Method for Automatic Construction of Semantic Relations between Concepts of an Information Retrieval Thesaurus. // Proceedings of Voronezh State University. Series “Systems Analysis and Information Technologies”, 2010 – Vol. 2 – pp. 160–168. http://www.vestnik.vsu.ru/program/view/view.asp?sec=analiz&year=2010&num=02&f_name=2010-02-26

[15] Panchenko A. Towards an Efficient Combination of Similarity Measures for Semantic Relations Extraction. // Abstracts of the scientific and technical international youth