Thesis M Badieh Habib Morgan
Named Entity Extraction and Disambiguation for Informal Text
The Missing Link

Mena B. Habib

PhD dissertation committee:
Chairman and Secretary: Prof. dr. P.M.G. Apers, University of Twente, NL
Promotor: Prof. dr. P.M.G. Apers, University of Twente, NL
Assistant promotor: Dr. ir. M. van Keulen, University of Twente, NL
Members:
Prof. dr. W. Jonker, University of Twente, NL
Prof. dr. F.M.G. de Jong, University of Twente, NL
Prof. dr. A. van den Bosch, Radboud University Nijmegen, NL

CTIT Ph.D. thesis Series No. 14-301
Centre for Telematics and Information Technology
P.O. Box 217, 7500 AE Enschede, The Netherlands.

SIKS Dissertation Series No. 2014-20
The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.

ISBN: 978-90-365-3647-9
ISSN: 1381-3617 (CTIT Ph.D. thesis Series No. 14-301)
DOI: 10.3990/1.9789036536479
http://dx.doi.org/10.3990/1.9789036536479

Cover design: Hany Maher
Printed by: Ipskamp Drukkers
Copyright © 2014 Mena Badieh Habib Morgan, Enschede, The Netherlands

NAMED ENTITY EXTRACTION AND DISAMBIGUATION FOR INFORMAL TEXT
THE MISSING LINK

DISSERTATION

to obtain the degree of doctor at the University of Twente, on the authority of the rector magnificus, prof. dr. H. Brinksma, on account of the decision of the graduation committee, to be publicly defended on Friday, May 9th, 2014 at 12:45

by

Mena Badieh Habib Morgan
born on June 29th, 1981 in Cairo, Egypt

This dissertation is approved by:
Prof. dr. P.M.G. Apers (promotor)
Dr. ir. M. van Keulen (assistant promotor)

Dedicated to the soul of my father

Acknowledgments

“I can do all things through Christ who strengthens me. (Philippians 4:13)”

I always say that I am lucky. I am lucky because I am always surrounded by wonderful and kind people.

I am lucky to have Peter Apers as my promoter. He supported my research directions and gave me freedom and independence.
His words always gave me the confidence and persistence to complete my PhD.

I am lucky to have Maurice van Keulen as my daily supervisor. Although we went through some foggy times, he never lost his positive attitude. He was always there to give advice, optimism, support and ideas. Besides learning how to be a good researcher, I have learned from Maurice how to be a supervisor, which is something I will definitely need throughout my academic career. Words could never express my sincere gratitude to Maurice.

I am lucky to have Willem Jonker, Franciska de Jong, and Antal van den Bosch as my committee members. I would like to thank them for their careful reading of my thesis.

I am lucky to be a member of the databases group at the University of Twente. I would like to thank them all for providing me with a pleasant working climate. Thanks to Maarten Fokkinga, Djoerd Hiemstra, Andreas Wombacher, Robin Aly, Ida den Hamer-Mulder, Suse Engbers, Jan Flokstra, Iwe Muiser, Juan Amiguet, Sergio Duarte, Victor de Graaff, Rezwan Huq, Mohammad Khelghati, Kien Tjin-Kam-Jet, Brend Wanders, Zhemin Zhu, Lei Wang, Ghita Berrada, Almer Tigelaar, Riham Abdel Kader and Dolf Trieschnigg. I would like to give special thanks to two of them: Ida den Hamer-Mulder and Juan Amiguet. Ida is the dynamo of the group. She helped me settle in the Netherlands and offered help even with things beyond her duty. The DB group is really lucky to have Ida as their secretary. Juan, my office mate, is the person who knows at least one thing about everything, the man who is willing to help at any time. Juan, I am grateful for your help and for the nice conversations we had together, discussing almost everything from food recipes to astronomy.

I am lucky to have spent my PhD years in this peaceful, quiet spot of the world called Enschede. In Enschede life is easy!
I would also like to express my gratitude to the Egyptian Coptic community in the Netherlands, who helped me overcome my homesickness. Thanks to Bishop Arsany, Father Maximos, Father Pavlos, Samuel Poulos, Adel Saweiros, Sameh Ibrahim, Moneer Basalyous and Maher Rasla.

I am lucky because I did an internship at the highly reputable databases and information systems group of the Max Planck Institute for Informatics in Saarbrücken, Germany. I learned a lot during my stay there. Thanks to Gerhard Weikum, Marc Spaniol, Mohamed Amir and Johannes Hoffart.

I am lucky to have studied and worked at the Faculty of Computers and Information Sciences at Ain Shams University in Cairo, where I received my Bachelor and Master degrees. I would like to thank all my professors and colleagues there, especially Abdel-Badieh Salem, Mohammed Roushdy, Mostafa Aref, Tarek Gharib, Emad Monier, Ayad Barsom, Marco Alfonse and many others.

I am lucky to be the son of Badieh Habib and Aida Makien. My parents did their best to raise me as a researcher. I genetically inherited my interest in research, math and science from them. I hope I have been able to fulfill their wishes. I also could never forget to thank my sisters Hanan and Eman, in addition to the rest of my family and my family-in-law, who always provide love and support.

I am lucky to have Shery, my lovely wife, who did her best to offer me the best atmosphere, the lady who provides unconditional care and love. Indeed, ‘Who can find a virtuous woman? For her price is far above rubies.’ (Proverbs 31:10).

I am lucky to have Maria and Marina, my sweet twin angels. Whenever I was stressed, just one hour of playing with them was enough to release all the stress and put a smile on my face.

I am lucky to have received my Christian doctrine at the Sunday school of Saint George Church in El-Matariya, Cairo, the church within whose walls I lived my best days. It contributed strongly to building my personality.
I would like to thank all the church fathers: Georgios Botros, Beshoy Boules, Tomas Naguib, Pola Fouad and Shenouda Dawood. I also could never forget all my teachers there, especially Onsy Naguib, for their care, love and support.

Finally, I am lucky to have my friends, with whom I shared my best life moments. Thanks to Ehab Gamil, Gerges Saber, Maged Makram, Maged Matta, Mena George, Mena Samir, Mena William, Ramy Anwar, Romany Edwar, Sameh Samir and many others. Thanks to everyone with whom I once shared my dreams.

I am lucky to have all these people surrounding me. This thesis would have been very different (or would not exist) without these people.

No, it is not luck. It is God’s hand that leads me through life. He said “I have raised him up in righteousness, and I will direct all his ways. (Isaiah 45:13)”

Mena B. Habib
Enschede, March 2014.

Contents

I Introduction

1 Introduction
1.1 Introduction
1.2 Examples of Application Domains
1.3 Challenges
1.4 General Approach
1.5 Research Questions
1.6 Contributions
1.7 Thesis Structure

II Toponyms in Semi-formal Text

2 Related Work
2.1 Summary
2.2 Information Extraction
2.3 Named Entity Recognition
2.3.1 Rule-based Approaches
2.3.2 Machine Learning-based Approaches
2.3.3 Toponyms Extraction
2.3.4 Language Independence
2.3.5 Robustness
2.4 Named Entity Disambiguation
2.4.1 Toponyms Disambiguation

3 The Reinforcement Effect
3.1 Summary
3.2 Introduction
3.3 Toponyms Extraction
3.3.1 GATE Toolkit
3.3.2 JAPE Rules
3.3.3 Extraction Rules
3.3.4 Entity matching
3.4 Toponyms Disambiguation
3.4.1 Bayes Approach
3.4.2 Popularity Approach
3.4.3 Clustering Approach
3.5 The Reinforcement Effect
3.6 Experimental Results
3.6.1 Dataset
3.6.2 Initial Effectiveness of Extraction
3.6.3 Initial Effectiveness of Disambiguation
3.6.4 The Reinforcement Effect
3.6.5 Further Analysis and Discussion
3.7 Conclusions and Future Directions

4 Improving Disambiguation by Iteratively Enhancing Certainty of Extraction
4.1 Summary
4.2 Introduction
4.3 Problem Analysis and General Approach
4.4 Extraction and Disambiguation Approaches
4.4.1 Toponyms Extraction
4.4.2 Toponyms Disambiguation
4.4.3 Improving Certainty of Extraction
4.5 Experimental Results
4.5.1 Dataset
4.5.2 Effect of Extraction with Confidence Probabilities
4.5.3 Effect of Extraction Certainty Enhancement
4.5.4 Optimal cutting threshold
4.5.5 Further Analysis and Discussion
4.6 Conclusions and Future Directions

5 Multilinguality and Robustness
5.1 Summary
5.2 Introduction
5.3 Hybrid Approach
5.3.1 System Phases
5.3.2 Toponyms Disambiguation
5.3.3 Selected Features
5.4 Experimental Results
5.4.1 Dataset
5.4.2 Dataset Analysis
5.4.3 SVM Features Analysis
5.4.4 Multilinguality, Different Thresholding Robustness and Competitors
5.4.5 Low Training Data Robustness
5.5 Conclusions and Future Directions

III Named Entities in Informal Text of Tweets

6 Related Work
6.1 Summary
6.2 Named Entity Disambiguation