An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures

Rochester Institute of Technology
RIT Scholar Works
Theses
11-2014

An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures
Salha Hassan Muhammed Qahl

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation
Qahl, Salha Hassan Muhammed, "An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures" (2014). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].

ROCHESTER INSTITUTE OF TECHNOLOGY

An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures

by Salha Hassan Muhammed Qahl

Supervisor: Professor Ernest Fokoué

A 6 credits thesis submitted in partial fulfillment for the degree of Master of Science in Applied Statistics in the Kate Gleason College of Engineering, Center for Quality and Applied Statistics

November 2014

© 2014 Salha Qahl. All rights reserved.

Committee Approval

Thesis Advisor: Professor Ernest Fokoué, Associate Professor, Center for Quality and Applied Statistics
Committee Member: Professor Linlin Chen, Assistant Professor, Department of Mathematics
Committee Member: Professor Robert Parody, Associate Professor, Center for Quality and Applied Statistics

“Motivation isn’t enough. If you’ve an idiot and you motivate him, now you’ve a motivated idiot.”
Stiff Jokes (2014 – present)

Abstract

Kate Gleason College of Engineering, Center for Quality and Applied Statistics
Master of Science
by Salha Hassan Muhammed Qahl

Is there any similarity between the contexts of the Holy Bible and the Holy Quran, and can this be proven mathematically?
The purpose of this research is to explore, using the Bible and the Quran as our corpus, the performance of various feature extraction and machine learning techniques. The unstructured nature of text data adds an extra layer of complexity to the feature extraction task, and the inherently sparse nature of the corresponding data matrices makes text mining a distinctly difficult task. Among other things, we assess the difference between domain-based syntactic feature extraction and domain-free feature extraction, and then use a variety of similarity measures (Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square and Clark) to identify similarities and differences between sacred texts. Initially, I compared chapters of the two raw texts using the proximity measures to visualize their behavior in a high-dimensional, sparse space. It was apparent that some of the chapters were similar, but the evidence was not conclusive. The noise therefore had to be cleaned using natural language processing (NLP). For example, to reduce the size of the two vectors, we compiled lists of vocabulary that is worded differently in the two texts but carries the same meaning. The program would thus recognize Lord in the Holy Bible and Allah in the Quran as God, and Jacob in the Bible and Yaqub in the Quran as the same prophet. This process was repeated many times to give relative comparisons across a variety of words. After the comparison of the raw texts was complete, the comparison was repeated for the processed text. The next comparison applied probabilistic topic modeling to the feature-extracted matrix, projecting the topical matrix into a low-dimensional space for a denser comparison.
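To make the pipeline described above concrete, the following is a minimal sketch, not the thesis implementation. It maps differently worded terms to a shared form, builds relative-frequency vectors over a joint vocabulary, and evaluates a few of the named measures. The synonym map, the function names, the natural-log base and the 1/√2 normalization of the Hellinger distance are all assumptions made for illustration. Only the Lord/Allah and Jacob/Yaqub pairs come from the text.

```python
import math
from collections import Counter

# Hypothetical synonym map: only the Lord/Allah and Jacob/Yaqub pairs are
# mentioned in the abstract; the canonical forms chosen here are assumptions.
SYNONYMS = {"lord": "god", "allah": "god", "yaqub": "jacob"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, apply synonyms."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            tokens.append(SYNONYMS.get(word, word))
    return tokens

def term_distributions(doc_a, doc_b):
    """Relative term frequencies of two non-empty documents over a joint vocabulary."""
    count_a, count_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    vocab = sorted(set(count_a) | set(count_b))
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    p = [count_a[w] / total_a for w in vocab]
    q = [count_b[w] / total_b for w in vocab]
    return p, q

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def hellinger(p, q):
    # Normalized form (assumed here), which lies in [0, 1].
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def kullback_leibler(p, q):
    # Terms with p_i = 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q):
    # The mixture m is positive wherever p or q is, so both KL terms are finite.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kullback_leibler(p, m) + 0.5 * kullback_leibler(q, m)

# Toy usage with two made-up sentences (not taken from either corpus).
p, q = term_distributions("The Lord spoke to Jacob.", "Allah spoke to Yaqub.")
print(hellinger(p, q), jensen_shannon(p, q), cosine_similarity(p, q))
```

Jensen-Shannon is used in the sketch because plain Kullback-Leibler divergence is undefined whenever a term occurs in one document and not in the other, which is the usual case for sparse term vectors.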
Among the distance measures applied to the sacred corpora, the probability-based measures such as Kullback-Leibler and Jensen-Shannon showed the best results. Another similarity result, based on Hellinger distance on the CTM, also shows good discrimination between documents. This work started with the belief that if there is an intersection between the Bible and the Quran, it will show clearly between the book of Deuteronomy and some Quranic chapters. It is now not only historically but also mathematically correct to say that there is more similarity between the Biblical and Quranic contexts than within the holy books themselves. Furthermore, we conclude that distances based on probabilistic measures, such as Jeffreys divergence and Hellinger distance, are the recommended methods for unstructured sacred texts.

Acknowledgements

It would not have been possible to write this thesis without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here. Above all, I would like to acknowledge the financial, academic and technical support of the Ministry of Higher Education in Saudi Arabia, particularly the award of the King Abdullah Foreign Postgraduate scholarship, which provided the necessary financial support for the entire degree. I would like to thank my friend Christopher Robert Jones for his personal support, love, guidance and great patience at all times. My parents, brothers and sisters have given me their unequivocal support throughout, as always, for which my mere expression of thanks is not sufficient. It cannot be disputed that the most influential person in my graduate career has been my supervisor, Prof. Ernest Fokoué. Fokoué's passion, guidance, and discipline have been indispensable to my growth as a scientist and as a person over these past two years. Prof. Ernest Fokoué, this thesis would not have been possible without you.
I would like to use this opportunity to express my gratitude for his unconditional, insightful support and for the immense knowledge that guided me in conducting this thesis. Besides my adviser, I would like to thank the rest of my thesis committee, Prof. Robert Parody and Prof. Linlin Chen, for their encouragement and insightful comments. I also thank the Center for Quality and Applied Statistics for their support and assistance since the start of my postgraduate work in 2012, especially Professor Daniel Lawrence, Professor Peter Bajorski, Professor Steve Lalonde and Professor Joseph Voelkel. I greatly value the friendship of Jo Bill and I deeply appreciate her belief in me. Thanks to Jo and Chris for helping me keep focused in the lab so many nights; your help, guidance, and support will not be forgotten. Last but not least, I would like to thank Rebecca Ziebarth, the graduate coordinator for the Center for Quality and Applied Statistics. You never made me feel my questions were being asked at the wrong time and always made me feel that my question was the most important question at that moment; for that I cannot thank you enough.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations

1 INTRODUCTION
  1.1 Thesis Scope
  1.2 Thesis Organization
  1.3 Major Components of the Engine
  1.4 Algorithm

2 DATA COLLECTION AND PROCESSING
  2.1 Quran
  2.2 Bible
  2.3 Document's Name Code
  2.4 DTM for the Raw Data
  2.5 Processing the Raw Corpus
    2.5.1 Information Retrieval
    2.5.2 Filter the Text
    2.5.3 Categorized Terms
    2.5.4 Minimize Distance Between Vectors
    2.5.5 Synonymy and Polysemy
    2.5.6 Stemming the Texts
  2.6 Document Term Matrix Representation
  2.7 Distance Performance and the Ψ matrix

3 SIMILARITY MEASURES
  3.1 Measures of Similarity
  3.2 Minkowski Family
    3.2.1 Euclidean and Manhattan Distance
  3.3 Inner Product Family
    3.3.1 Cosine Similarity
  3.4 Squared-Chord Family
    3.4.1 Bhattacharyya Distance
    3.4.2 Hellinger Distance
  3.5 Chi-Square Family
    3.5.1 Probabilistic Symmetric Chi-Square and Clark Distance
  3.6 Shannon's Entropy Family
    3.6.1 Kullback-Leibler Divergence
    3.6.2 Jensen-Shannon Divergence
  3.7 Jaccard Similarity on the Expert Matrix

4 PROBABILISTIC TOPIC MODELING
  4.1 Probabilistic Latent Semantic Analysis
  4.2 Latent Dirichlet Allocation
  4.3 Correlated Topic Modeling
    4.3.1 Posterior Distribution of CTM
  4.4 Learning Algorithm Using Variational Expectation Maximization
  4.5 Number of Topics K

5 VALIDATION AND RESULTS
  5.1 General Topic Annotation
    5.1.1 The Structure and Dimension of DTM for Bible and Quran
  5.2 The Structure and Dimension of All the Data Sets Used in the Analysis
    5.2.1 K Topics
  5.3 Topical Assignment
