Improving Search Via Named Entity Recognition in Morphologically Rich Languages – a Case Study in Urdu

A DISSERTATION SUBMITTED TO THE FACULTY OF THE UNIVERSITY OF MINNESOTA
BY
Kashif H. Riaz
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
Adviser: Dr. Vipin Kumar
Co-Adviser: Dr. Blake Howald
Co-Adviser: Dr. Jeanette Gundel
February 2018

This dissertation is copyrighted to Kashif H. Riaz. Copyright © 2018

Dedication

To my teachers, parents, and Komayl

Acknowledgments

To my late grandfather Intizar Hussain: I will always cherish the time I spent scribing for him. The access he provided to his library was no less than the Book of Kells for me.

I am thankful to my committee members and my advisers for their support on this journey. This work would not have been possible without your encouragement to pursue multidisciplinary research.

I thank Dr. Vipin Kumar for his immense support, guidance, and encouragement, and most of all for being gracious despite my procrastination and for supporting me through the ups and downs of life on this journey.

I immensely thank Dr. Blake Howald for his guidance and for showing me the path through the fog of this multifaceted research. Most importantly, I am grateful for the generosity of his time, his encouraging comments, his feedback, and for taking me across the finish line.

My gratitude to Dr. Michael Steinbach for his gentle guidance and for his feedback on this dissertation and on components of my research. He has always listened patiently and talked me through the challenges of my research.

Thanks to Claudia Neuhauser for teaching me how to look into complex societal problems in a multidisciplinary way and for giving me hope that there is light at the end of the tunnel for the problems close to my heart.

I thank Dr. Jeanette Gundel for her time, advice, and guidance through this journey, and for her patience as she explained linguistic concepts while we sat in her office going over my research.
To the Information Retrieval and Computational Linguistics research communities, who guided me at conferences and workshops by providing feedback during my presentations. Specifically, I thank the presenters at the 6th ESSIR (European Summer School in Information Retrieval) in Glasgow for setting a direction for my research. Your guidance on research in Information Retrieval for resource-scarce languages was invaluable. My special thanks to Stephen Robertson and Leif Azzopardi, who on multiple occasions encouraged me and provided direction. To Karim Darwish, who explained in detail the complexity of writing a stemmer for Arabic. I am truly standing on the shoulders of giants.

To my supervisors and managers, who supported me through this journey. Without their support this journey would not have been possible.

Most importantly, I am thankful to my family, who had to suffer through “wasted” snow days, missed fall colors, shortened spring breaks, and at times a messy house, as I was busy with research. We will make up the time!

Contents

Table of Figures
List of Tables
1 Synopsis
1.1 Morphologically Rich Languages
1.2 Proposition
1.3 Named Entity Recognition (NER)
1.4 Search
1.4.1 Keyword Search
1.4.2 Concept-based Search
1.5 Urdu
1.6 Methodology
1.6.1 Enabling Technologies
1.6.2 Evaluation Measures
1.6.3 Base-line for Urdu Search
1.6.4 NER for Urdu
1.7 Other Language Evaluation
1.8 Experiment Sketch and Results
1.9 Analysis
1.10 Roadmap
2 Named Entity Recognition
2.1 Introduction
2.1.1 Named Entity
2.1.2 The Named Entity Task (Message Understanding Conference)
2.1.3 Applications of Name Recognition
2.2 General Challenges in NER
2.2.1 Ambiguity of Proper Names
2.2.2 NER as an Information Extraction task
2.2.3 Evaluation Issues in IE and in NER
2.2.4 Architecture of the NER System
2.3 Name Recognition Systems and their Approaches
2.3.1 Nymble: a High Performance Learning Name Finder
2.3.2 NetOwl™ Extractor from IsoQuest
2.3.3 Nominator: IBM T.J. Watson Research Center
2.3.4 Nationality Specific Methods
2.4 Conditional Random Fields
2.4.1 Background
2.4.2 Hidden Markov Models
2.4.3 Conditional Models
2.4.4 Maximum Entropy Markov Model (MEMM)
2.4.5 Conditional Random Fields (CRF)
2.4.6 Use of Conditional Random Fields for NER
2.5 Recent Trends in Statistical and Machine Learning Approaches
2.5.1 Word2vec
2.5.2 Deep Learning
2.6 Reflections on Deep Learning and Word Embeddings for Morphologically Rich Languages
2.7 Summary
3 Search – Information Retrieval
3.1 Named Entities in Search
3.1.1 Challenges of Names in Search
3.2 The Search Challenge
3.3 Keyword Based Search
3.3.1 Boolean Keyword Searching
3.3.2 Ranked Retrieval
3.3.3 Probabilistic Retrieval
3.3.4 Vector Space Model – a variant of Ranked Retrieval
3.3.5 Challenges of Keyword search
3.4 Traditional Concept Searching
