Statistical Parsing by Machine Learning from a Classical Arabic Treebank
Kais Dukes

Submitted in accordance with the requirements for the degree of Doctor of Philosophy

The University of Leeds
School of Computing
September, 2013

The candidate confirms that the work submitted is his own, except where work which has formed part of jointly-authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated overleaf. The appropriate credit has been given where reference has been made to the work of others.

This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

Publications

Chapters 4 to 10 in parts II, III and IV of this thesis are based on jointly-authored publications. I was the lead author and the co-authors acted in an advisory capacity, providing supervision and review. All original contributions presented here are my own.

Part II – Modelling Classical Arabic

The formal representations of Classical Arabic orthography, morphology and syntax presented in Chapters 4 to 6 are based on the following papers:

Kais Dukes and Nizar Habash (2010a). Morphological Annotation of Quranic Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC) (2530-2536). Valletta, Malta.

Kais Dukes, Eric Atwell and Abdul-Baquee Sharaf (2010b). Syntactic Annotation Guidelines for the Quranic Arabic Dependency Treebank. In Proceedings of the Language Resources and Evaluation Conference (LREC) (1822-1827). Valletta, Malta.

Kais Dukes and Timothy Buckwalter (2010c). A Dependency Treebank of the Quran using Traditional Arabic Grammar. In Proceedings of the International Conference on Informatics and Systems (INFOS). Cairo, Egypt.

Part III – Developing the Quranic Arabic Corpus

The descriptions of the collaborative annotation methodology and online software platform in Chapters 7 and 8 are based on the following publications:

Kais Dukes, Eric Atwell and Nizar Habash (2013). Supervised Collaboration for Syntactic Annotation of Quranic Arabic. Language Resources and Evaluation Journal (LREJ): Special Issue on Collaboratively Constructed Language Resources, 47:1 (33-62).

Kais Dukes and Eric Atwell (2012). LAMP: A Multimodal Web Platform for Collaborative Linguistic Analysis. In Proceedings of the Language Resources and Evaluation Conference (LREC) (3268-3275). Istanbul.

Kais Dukes, Eric Atwell and Abdul-Baquee Sharaf (2010d). Online Visualization of Traditional Quranic Grammar using Dependency Graphs. In Proceedings of the Foundations of Arabic Linguistics Conference. Cambridge.

Part IV – Statistical Parsing

Chapters 9 and 10 discuss statistical parsing and machine learning experiments. These chapters form an expanded description of the work summarized in the following paper:

Kais Dukes and Nizar Habash (2011). One-step Statistical Parsing of Hybrid Dependency-Constituency Syntactic Representations. In Proceedings of the International Conference on Parsing Technologies (IWPT) (92-103). Dublin, Ireland.

Acknowledgments

First and foremost, my sincere gratitude is owed to my two PhD supervisors, Eric Atwell at the University of Leeds and Nizar Habash at Columbia University in the City of New York. During my work on this thesis over the last four years, I benefited immensely from Eric's expert advice on corpus annotation and computational linguistics. I am also deeply indebted to Eric for his encouragement to complete the parts of my work that led to peer-reviewed papers. He allowed me the level of intellectual freedom that I needed to make original contributions to new areas of research.

I owe my deepest appreciation to Nizar, who acted as an external supervisor. He provided expert guidance on Arabic morphological and syntactic theory and how best to approach the problem of Arabic statistical parsing using machine learning. His belief in the direction and quality of my work helped provide the motivation I needed to see my research through to completion.

My sincere gratitude and thanks are also directed to members of the research community who gave me invaluable advice, encouragement and support. Abdul-Rahman Adnan, Imran Alawiye, Mohammed Alyousef, Tim Buckwalter, Michael Carter, Teuku Edward, Lydia Lau, Katja Markert, Mazhar Nurani, Jonathan Owens, Fatma Said, Hind Salhi, Majdi Sawalha, Abdul-Baquee Sharaf, Wajdi Zaghouani and Mai Zaki deserve special mention.

I owe my gratitude to Ahmed El-Helw and Nour Sharabash for kindly donating and administrating the web servers used to host the Quranic Arabic Corpus. I would also like to acknowledge the hard work of the numerous volunteers who contributed their time and effort to continuously improve the annotations online.

Finally, I will be forever grateful for the love, kindness and support shown by my wife, Imen. Without her tireless patience and calming presence, our attempt to combine my part-time PhD with full-time work while raising two young happy children would never have been possible. From the bottom of my heart Imen, thank you for supporting me every step of the way.

Abstract

Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran.
The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic.

Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i'rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state-of-the-art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations.

A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic.

The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website: http://corpus.quran.com, an educational resource with over two million users per year.

بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ

سُبْحَانَكَ لَا عِلْمَ لَنَا إِلَّا مَا عَلَّمْتَنَا إِنَّكَ أَنْتَ ٱلْعَلِيمُ ٱلْحَكِيمُ

'Glory be to thee! We have no knowledge except what you have taught us. Indeed it is you who is the all-knowing, the all-wise.'

A prayer of the angels – The Quran, verse (2:32)

Contents

Part I: Introduction and Background

1 Introduction
  1.1 Motivation
  1.2 Research Questions
    1.2.1 Is Statistical Parsing Viable for Classical Arabic?
    1.2.2 Is a Hybrid Representation Suitable for Parsing?
    1.2.3 Can Crowdsourcing be used for Annotating Arabic?
  1.3 Original Contributions of the Thesis
    1.3.1 Theoretical Contributions
    1.3.2 Practical Contributions
  1.4 Thesis Outline

2 Literature Review
  2.1 Introduction
  2.2 Arabic Morphological Analysis
    2.2.1 The Buckwalter Arabic Morphological Analyzer
    2.2.2 Lexeme and Feature Representations
    2.2.3 Fine-Grained Morphological Analysis
    2.2.4 Finite State Morphological Analysis of the Quran
  2.3 Arabic Syntactic Treebanks
    2.3.1 The Penn Arabic Treebank
    2.3.2 The Prague Arabic Treebank
    2.3.3 The Columbia Arabic Treebank
  2.4 Statistical Parsing Models
    2.4.1 Classical Arabic Parsing