Annotation of Conceptual Co-Reference and Text Mining the Qur'an

Annotation of Conceptual Co-reference and Text Mining the Qur’an Abdul Baquee Muhammad Submitted in accordance with the requirements for the degree of Doctor of Philosophy The University of Leeds School of Computing September, 2012 The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others. This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement. - ii - Dedication أهدي هذه اﻷطروحة للدكتور سفر بن عبد الرمحن احلوايل ـ رئيس قسم العقيدة جبامعة أم القرى سابقا. لقد ظلت فكرة اﻻلتحاق بربنامج الدكتوراه حلماً يدور يف خاطري إىل أن التقيت به. فكان لدعمه وتشجيعه اﻷثر الكبري ﻹمتام هذا البحث. I dedicate this thesis to Dr. Safar ibn Abdur-Rahman Al-Hawali – former dean of Islamic theology Department, Umm Al-Qura University, Makkah, Saudi Arabia. Pursuing a PhD journey was a dream until I met Dr. Safar. His support and encouragement made this happen. - iii - Declaration I declare that the work presented in this thesis, is the best of my knowledge of the domain, original, and my own work. Most of the work presented in this thesis have been published. When co-authored with my supervisor, I prepared the first draft and then Eric reviewed and suggested amendments. Publications are listed below: (Abdul Baquee Muhammad1) Chapter 2 Sections 2.1 and 2.2 Muhammad, A. and Atwell, E. (2009) A Corpus-based computational model for knowledge representation of the Qur'an. 5th Corpus Linguistics Conference, Liverpool. Muhammad, A. and Atwell, Eric, (2012) "QurAna: corpus of the Qur’an annotated with pronominal anaphora", LREC 2012 Atwell, ES; Brierley, C; Dukes, K; Sawalha, M; Muhammad, A (2011) An Artificial Intelligence Approach to Arabic and Islamic Content on the Internet in: Proceedings of NITS 3rd National Information Technology Symposium. 2011. Chapter 4 Section 4.3 Muhammad, A. and Atwell, E. (2009) A Corpus-based computational model for knowledge representation of the Qur'an. 5th Corpus Linguistics Conference, Liverpool. Chapter 5 Muhammad, A. and Atwell, Eric, (2012) "QurAna: corpus of the Qur’an annotated with pronominal anaphora", LREC 2012 Chapter 6 Muhammad, A. and Atwell, Eric. (2012) "QurSim: A corpus for evaluation of relatedness in short texts", LREC 2012 Chapter 8 1 My surname was officially changed at the University records from ‘Sharaf’ to ‘Muhammad’ to resolve earlier confusion between surname and father’s name in my passport. Although, in this section I have used the corrected surname, readers may note that my earlier surname i.e.,‘Sharaf’ was used in most of the already published papers. - iv - Automatic" التصنيف اﻵلي للسور القرآنية (Muhammad, A and Atwell, Eric (2011 categorization of the Qur’anic chapters". 7th International Computing Conference in Arabic (ICCA11).31th May - 2nd June 2011, Imam Mohammed Ibn Saud University, Riyadh, Saudi Arabia Chapter 9 Muhammad, A. and Atwell, E. (2009) A Corpus-based computational model for knowledge representation of the Qur'an. 5th Corpus Linguistics Conference, Liverpool. In addition all text mining applications cited in chapter 7 were scripted by me and made available online along with their documentation under the website which I own: http://textminingtheQuran.com/wiki - v - Acknowledgements First and foremost I praise Allah for His bounty and blessings who provided me the strength to research His Words (i.e., the Qur’an) leveraging on the advancement of computational sciences. I am indebted to my advisor, Dr. Eric Atwell, for the guidance he provided me during my studies. Eric has given me plenty of freedom to explore the directions I was most interested in. At the same time he brought me back to the right focused course, when temptations used to drag me to distant directions. Eric’s comments on draft papers were very instrumental and resulted in quality publications. Eric also was very understanding during the ups and downs of my PhD journey, and I always found in his counselling and support a moral boost towards destination. I would like to acknowledge Kais Dukes for his Qur’anic Arabic Corpus (QAC). I have used it extensively while annotating Qur’anic anaphora, and it has substantially reduced the manual effort I would had to exercise otherwise . I have enjoyed being part of the Natural Language Processing group at Leeds University. The weekly seminars were very informative and we used to exchange ideas and suggestions on each other’s research. Special thanks to Majdi Sawalha and Justin Washtell for their constructive suggestions and support on various topics. I would like to thank also the Islamic Development Bank – my employer- for granting me leave to study. My colleagues in the IT Department were very cooperative and supportive during the course of my studies. This achievement was not possible without the support and understanding of all members of my family. My parents – Dr. Sharful Islam and Fatima Islam- were always behind me with their supplications and encouragement. Kids – Anas, Muhammad, Ayesha and Safiyyah- were very understanding and cooperative when dad spent yet another weekend behind the screen annotating or experimenting with datasets. And of course, my wife – Momtaj Begum- for her patience and support taking care of the kids and guaranteeing a conducive atmosphere for research at home. - vi - Abstract This research contributes to the area of corpus annotation and text mining by developing novel domain specific language resources. Most practical text mining applications restrict their domain. This research restricts the domain to the Qur’anic Text. In this thesis, a number of pre-processing steps were undertaken and annotation information were added to the Qur’an. The raw Arabic Qur’an was pre-processed into morphological units using the Qur’anic Arabic Corpus (QAC). Qur’anic terms were indexed and converted into a vector space model using techniques in Information Retrieval (IR). In parallel, nearly 24,000 Qur’anic personal pronouns were annotated with information on their referents. These referents are consolidated and organized into a total of over 1,000 ontological concepts. Moreover, a dataset of nearly 8,000 pairs of related Qur’anic verses are compiled from books of scholarly commentary on the Qur’an. This vector space model, the pronoun tagging, the verse relatedness dataset, and the part-of-speech tags available in QAC all together served for a number of Qur’anic text mining applications which were rendered online for public use. Among these applications: lemma concordance, collocation, POS search of the Qur’an, verse similarity measures, concept clouds of a given verse, pronominal anaphora and Qur’anic chapter similarity. Furthermore, machine learning experiments were conducted on automatic detection of verse similarity/relatedness as well as categorization of Qur’anic chapters based on their chronology of revelation. Domain specific linguistic features were investigated to induct learning algorithms. Results show that deep linguistic and world knowledge is needed to reach the human upper bound in certain computational tasks such as detecting text relatedness, question answering and textual entailment. However, many useful queries can be addressed using text mining techniques and layers of annotations made available through this research. The works presented here can be extended to include other similar texts like Hadith (i.e., saying of Prophet Muhammad), or other scriptures like the Gospels. - vii - Table of Contents Dedication .................................................................................................... ii Declaration .................................................................................................. iii Acknowledgements ..................................................................................... v Abstract ....................................................................................................... vi Table of Contents ...................................................................................... vii List of Tables ............................................................................................ xiii List of Qur’anic Verses ............................................................................ xiv List of Figures ......................................................................................... xvii Chapter 1 Introduction ................................................................................ 1 1.1 Proposed Text Mining Solution ...................................................... 1 1.2 Novel Contributions of this thesis .................................................... 2 1.3 Potential Usage of the resources produced by this thesis ............... 3 1.3.1 Arabic NLP ........................................................................... 3 1.3.2 Computational Text similarity and relatedness ..................... 4 1.3.3 Computational Anaphora Resolution .................................... 4 1.3.4 Computational Stylistics ....................................................... 4 1.3.5 Translation studies ............................................................... 5 1.4 Thesis Outline ................................................................................. 6 1.5 Summary ......................................................................................... 7 Chapter 2 Background on the language of the Qur’an ...........................

Annotation of Conceptual Co-Reference and Text Mining the Qur'an

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support