
ROCHESTER INSTITUTE OF TECHNOLOGY

An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures

by

Salha Hassan Muhammed Qahl

Supervisor: Professor Ernest Fokoue

A six-credit thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Applied Statistics

in the Kate Gleason College of Engineering Center for Quality and Applied Statistics

November 2014

© 2014 Salha Qahl. All rights reserved.

Committee Approval

Date    Thesis Advisor: Professor Ernest Fokoué, Associate Professor, Center for Quality and Applied Statistics

Date    Committee Member: Professor Linlin Chen, Assistant Professor, Department of Mathematics

Date    Committee Member: Professor Robert Parody, Associate Professor, Center for Quality and Applied Statistics

"Motivation isn't enough. If you've an idiot and you motivate him, now you've a motivated idiot."

Stiff Jokes (2014 – present)

ROCHESTER INSTITUTE OF TECHNOLOGY

Abstract

Kate Gleason College of Engineering Center for Quality and Applied Statistics

Master of Science

by Salha Hassan Muhammed Qahl

Is there any similarity between the contexts of the Holy Bible and the Holy Quran, and can this be proven mathematically? To answer this question, we use the Bible and the Quran as our corpus and explore the performance of various feature extraction and machine learning techniques. The unstructured nature of text data adds an extra layer of complexity to the feature extraction task, and the inherently sparse nature of the corresponding data matrices makes text mining a distinctly difficult task. Among other things, we assess the difference between domain-based syntactic feature extraction and domain-free feature extraction, and then use a variety of similarity measures, such as Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square, and Clark, to identify similarities and differences between the sacred texts.

Initially, I started by comparing chapters of the two raw texts using the proximity measures to visualize their behavior in a high-dimensional and sparse space. It was apparent that there was similarity between some of the chapters, but it was not conclusive. Therefore, there was a need to clean the noise using so-called natural language processing (NLP). For example, to minimize the size of the two vectors, we initiated lists of vocabulary that is worded differently in the two texts but indicates exactly the same meaning. Thus, the program would recognize Lord as God in the Holy Bible and Allah as God in the Quran, and Jacob as a prophet in the Bible and Yaqub as a prophet in the Quran.

This process was completed many times to give relative comparisons on a variety of different words. After the comparison of the raw texts was completed, the comparison was repeated for the processed text. The next comparison used probabilistic topic modeling on the feature-extracted matrix to project the topical matrix into a low-dimensional space for a denser comparison. Among the distance measures introduced to the sacred corpora, the analysis of similarities based on probability-based measures such as Kullback-Leibler and Jensen-Shannon showed the best results. Another similarity result, based on the Hellinger distance on the CTM, also shows good discrimination between documents.

This work started with the belief that if there is an intersection between the Bible and the Quran, it would show clearly between the book of Deuteronomy and some Quranic chapters. It is now not only historically but also mathematically correct to say that there is more similarity between the Biblical and Quranic contexts than there is within the holy books themselves. Furthermore, the conclusion is that distances based on probabilistic measures such as the Jeffreys divergence and the Hellinger distance are the recommended methods for unstructured sacred texts.

Acknowledgements

It would not have been possible to write this thesis without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here.

Above all, I would like to acknowledge the financial, academic, and technical support of the Ministry of Higher Education in Saudi Arabia, particularly the award of the King Abdullah Foreign Postgraduate Scholarship that provided the necessary financial support for the entire degree. I would like to thank my friend Christopher Robert Jones for his personal support, love, guidance, and great patience at all times. My parents, brothers, and sisters have given me their unequivocal support throughout, as always, for which my mere expression of thanks is not sufficient.

It cannot be argued that the most influential person in my graduate career has been my supervisor, Prof. Ernest Fokoué. Fokoué's passion, guidance, and discipline have been indispensable to my growth as a scientist and as a person over these past two years. Prof. Ernest Fokoué, this thesis would not have been possible without you. I would like to use this opportunity to express my gratitude for his unconditional, insightful support and for the immense knowledge that guided me in conducting this thesis. Besides my adviser, I would like to thank the rest of my thesis committee, Prof. Robert Parody and Prof. Linlin Chen, for their encouragement and insightful comments. I also thank the Center for Quality and Applied Statistics for their support and assistance since the start of my postgraduate work in 2012, especially Professor Daniel Lawrence, Professor Peter Bajorski, Professor Steve Lalonde, and Professor Voelkel.

I greatly value the friendship of Jo Bill and I deeply appreciate her belief in me. Thanks to Jo and Chris for helping me keep focused in the lab so many nights; your help, guidance, and support will not be forgotten. Last but not least, I would like to thank Ziebarth, the graduate coordinator for the Center for Quality and Applied Statistics. You never made me feel my questions were being asked at a wrong time and always made me feel that my question was the most important question at that moment; for that I cannot thank you enough.

Contents

Abstract iv

Acknowledgements vi

List of Figures x

List of Tables xii

Abbreviations xiii

1 INTRODUCTION 1
  1.1 Thesis Scope 3
  1.2 Thesis Organization 5
  1.3 Major Components of the Engine 6
  1.4 Algorithm 7

2 DATA COLLECTION AND PROCESSING 8
  2.1 Quran 9
  2.2 Bible 9
  2.3 Document's Name Code 11
  2.4 DTM for the Raw Data 12
  2.5 Processing the Raw Corpus 12
    2.5.1 Information Retrieval 12
    2.5.2 Filter the Text 12
    2.5.3 Categorized Terms 13
    2.5.4 Minimize Distance Between Vectors 14
    2.5.5 Synonymy and Polysemy 14
    2.5.6 Stemming the Texts 15
  2.6 Document Term Matrix Representation 16
  2.7 Distance Performance and the Ψ matrix 19

3 SIMILARITY MEASURES 20
  3.1 Measures of Similarity 21
  3.2 Minkowski Family 23
    3.2.1 Euclidean and Manhattan Distance 24
  3.3 Inner Product Family 24
    3.3.1 Cosine Similarity 24
  3.4 Squared-Chord Family 25
    3.4.1 Bhattacharyya Distance 25
    3.4.2 Hellinger Distance 25
  3.5 Chi-Square Family 25
    3.5.1 Probabilistic Symmetric Chi-Square and Clark Distance 25
  3.6 Shannon's Entropy Family 26
    3.6.1 Kullback-Leibler Divergence 26
    3.6.2 Jensen-Shannon Divergence 27
  3.7 Jaccard Similarity on the Expert Matrix 27

4 PROBABILISTIC TOPIC MODELING 28
  4.1 Probabilistic Latent Semantic Analysis 29
  4.2 Latent Dirichlet Allocation 30
  4.3 Correlated Topic Modeling 31
    4.3.1 Posterior Distribution of CTM 32
  4.4 Learning Algorithm Using Variational Expectation Maximization 33
  4.5 Number of Topics K 34

5 VALIDATION AND RESULTS 35
  5.1 General Topic Annotation 36
    5.1.1 The structure and dimension of DTM for Bible and Quran 36
  5.2 The structure and dimension of all the data sets used in the analysis 37
    5.2.1 K Topics 38
  5.3 Topical Assignment 40
    5.3.1 Topical Proportion 41
  5.4 Topical Content 42

6 PROXIMITY, SIMILARITY AND DISTANCE 43
  6.1 Distances Between Probabilities Distribution 44
  6.2 Cosine Degree of Similarity 45
  6.3 Hellinger Degree of Similarity 46
  6.4 Bhattacharyya Distance Converted to Similarity 47
  6.5 Symmetric Kullback-Leibler Divergence 48
  6.6 Jensen-Shannon Similarity 49
  6.7 Euclidean Similarity 50
  6.8 Manhattan Similarity 51
  6.9 Symmetric Chi-Square 52
  6.10 Clark Similarity 53
    6.10.1 Distances Between Probabilities Distribution of the Raw Corpus 54
    6.10.2 Expert Topical Assignment 62
  6.11 Evaluation Levels 65
    6.11.1 First Similarity Evaluation 65
    6.11.2 Second Step of Evaluation 67

7 CONCLUSIONS AND FUTURE WORK 72
  7.1 Research Summary 73
  7.2 Feature Extension to this Research 75

A Appendix A 76

B Appendix B 83

Bibliography 84

List of Figures

1.1 Major Components of the System 6
1.2 The Research Algorithm 7

2.1 Knowledge Discovery Process in Databases (KDD) 13

3.1 Cosine similarities between five synthetic Vectors 25

4.1 Plate Diagram of PLSA. For more information, see Blei et al. [2003a] 30
4.2 Plate Diagram of LDA. For more information, see Blei et al. [2003a] 31
4.3 Graphical model representation of the Correlated Topic Model, see Blei and Lafferty [2006a] 32

5.1 Number of Topics using LDA Algorithm 38
5.2 Number of Topics using CTM Algorithm 39

6.1 Cosine Similarity of the Raw Corpus 54
6.2 Cosine Similarity Clusters Result 54
6.3 Manhattan Similarity of the Raw Corpus 55
6.4 Manhattan Similarity Clusters Result 55
6.5 Hellinger Similarity of the Raw Corpus 56
6.6 Hellinger Similarity Clusters Result 56
6.7 Bhattacharyya Similarity of the Raw Corpus 57
6.8 Bhattacharyya Similarity Clusters Result 57
6.9 Chi-Square Similarity of the Raw Corpus 58
6.10 Chi-Square Similarity Clusters Result 58
6.11 Clark Similarity of the Raw Corpus 59
6.12 Clark Similarity Clusters Result 59
6.13 The Kullback-Leibler Jeffrey's Divergence Matrix of the Raw Corpus 60
6.14 The Kullback-Leibler Jeffrey's Divergence Cluster 60
6.15 Jensen-Shannon Divergence Matrix of the Raw Corpus 61
6.16 Jensen-Shannon Divergence Clusters Result 61
6.17 Jaccard Similarity on the experts-topics matrix 62
6.18 Expert-wise similarity matrix 1 63
6.19 Expert-wise similarity matrix 2 64
6.20 Binary Similarity applied on small density topics 64
6.21 Distance behaviour detection through the two levels of noise 66
6.22 The difference between the expert matrix and the distances applied on the CTM matrix 68
6.23 The difference between the expert matrix and the distances applied on the Raw matrix 70

B.1 The density of the chosen topics sorted by biblical experts 83

List of Tables

5.3 Topical Proportion of learning set 41
5.4 The set of neighboring words for the highest probability topic per chapter 42

6.1 Cosine Similarity 45
6.2 Hellinger Similarity in 0–100 scale 46
6.3 Bhattacharyya Similarity in 0–100 scale 47
6.4 Symmetric KL of Two Probability 48
6.5 Jensen-Shannon Similarity in 1–100 scale 49
6.6 Euclidean Similarity in a scale of 1–100 50
6.7 Manhattan Similarity in 1–100 scale 51
6.8 Symmetric Chi-Square 52
6.9 Clark Similarity Measure 53
6.10 Distance Score table 71

A.1 Topics Projection of learning Set 77
A.2 Similarity and Distance table 78
A.3 Summary of similarity and Distance table 79

Abbreviations

Corpus  A collection of documents (plural: corpora)
NLP  Natural Language Processing
IR  Information Retrieval
FA  Factor Analysis
BOW  Bag of Words
PCA  Principal Component Analysis
DTM  Document Term Matrix
LSA  Latent Semantic Analysis
LDA  Latent Dirichlet Allocation
CTM  Correlated Topic Modeling
VEM  Variational Expectation Maximization
KDD  Knowledge Discovery and Data Mining
Expert Matrix  Expert representation of the religious books, stored in matrix format
ψ  Jaccard distance applied to the Expert matrix

I dedicate this thesis to my loving people ...

1 Introduction


Say, O believers, "We have believed in Allah (God) and what has been revealed to us and what has been revealed to Abraham and Ishmael and Isaac and Jacob and the Descendants, and what was given to Moses and Jesus and what was given to the prophets from their Lord. We make no distinction between any of them, and we are believers in submission to Him." (Ch2. Baqara 2:136) Delano et al. [2007].

I would like to start this thesis by asking these questions: why are so many wars fought in the name of religion? What differences are so important that thousands, if not hundreds of thousands, of lives have been given for this cause? How do we, as a collective or as individuals, make a change to stop the relentless killing in the name of religion? It is my intention to prove mathematically that an analysis of the Holy Bible and the Quran, using different measurement techniques from different mathematical families, can help people see the similarities rather than the differences in religions. This is one person's attempt to use mathematical calculations to bring peace to the world. Regardless of the religion, the idea of worshiping a power greater than yourself is generally accepted, so why do we focus on the differences and use these differences to differentiate ourselves from others? If a person believes in God, then it is generally accepted that God is perfect and our creator. One should ask the question of why the creator of everything would give different commandments to each of His prophets. Why would there be differences in what He requires of humanity? Why would there be differences in the messages? Does this describe the omnipotent, perfect God that each faith, perspective, belief, and religion has of their creator? If anyone reads the books of the Christians or the Muslims, they will read that indeed there are some differences. So where did these differences come from? Are they what one might call mathematical noise, or differences of context? Many whys need to be answered, and it is the purpose of this research to use statistical and programming tools to answer some of them mathematically; we hope to take the first step toward proving that the words of the creator are more similar than different.

1.1 Thesis Scope

The main motivation for our research is to find answers to questions like: how can one define and use a valid and useful measure of similarity between two sacred sources? Are the sacred texts of the Abrahamic religions as different as naysayers make them out to be? How similar are the Biblical and Quranic sacred texts?

The thesis research starts by considering two fragments of sacred texts, say:

{d1 = Quranic texts, d2 = Biblical texts}, and let 0 ≤ ℑ(di, dj) ≤ 1 be the similarity function. Given the data in the form of corpora, how can one estimate and learn ℑ(·, ·)?

The first attempt applies the similarities to our raw corpora, ℑraw(·, ·), using the term-document matrix along with the proposed similarity measures on real numbers. The challenge we faced here is that the corpora are in text format and unstructured. Hence, they need to be processed using natural language processing to obtain the dimension. By doing so, we map the raw data into a new space, $d_i \rightarrow \check{d}_i \in \mathbb{R}^{P \times 1}$, such that

P ≡ the number of terms or n-grams in the bag of words,

and P is typically very large. In the mapping process, each document $\check{d}_i$ is projected onto the term space:

$$D = \begin{pmatrix} \check{d}_{11} & \check{d}_{12} & \check{d}_{13} & \cdots & \check{d}_{1p} \\ \check{d}_{21} & \check{d}_{22} & \check{d}_{23} & \cdots & \check{d}_{2p} \\ \check{d}_{31} & \check{d}_{32} & \check{d}_{33} & \cdots & \check{d}_{3p} \\ \vdots & & & & \vdots \\ \check{d}_{n1} & \check{d}_{n2} & \check{d}_{n3} & \cdots & \check{d}_{np} \end{pmatrix}$$

Once D is created, ℑraw(ď1, ď2) can be obtained using one of the distances presented in Chapter 3. To remedy the limitations of the raw document-term matrix, we propose extracting topics from the joint corpus, then using the topical allocation (the projection of a given document) as the input to the similarity engine, as follows:

• obtain the corpus

• processing the corpus using NLP

• performing latent Dirichlet allocation (LDA) or correlated topic modeling (CTM) on the processed corpus

• so, given (di, dj), we obtain {Zi = Topics(di), Zj = Topics(dj)}

• we finally perform similarity measures again on the projected space, ℑCTM(Zi, Zj), such that 0 ≤ ℑCTM(·, ·) ≤ 1

The topic projections Zi, Zj contain aspects of the semantic meaning and are therefore better representations of the documents than the raw data matrix. Typically, Zi and Zj are the topical proportions, Zi, Zj ∈ (0, 1). This domain provides an extra advantage for probability-based distances like Kullback-Leibler and Jensen-Shannon. Throughout, similarities are computed, but since this is all unsupervised learning, it is hard to determine how good a measure is. Therefore, we propose an indirect supervised validation scheme based on what we call the Expert Similarity Measure.

→ Assume that an expert theologian, well versed in the Quran, the Bible, or both, can provide us with a topical description of the texts of interest. In practice, we found such topical descriptions quite readily available on the Internet. We can then create a matrix Ψ of the similarities computed based on the experts.

In other words, for di and dj, two sacred texts to be compared, the expert will have assigned them to some expert topical space:

W =

  Docs   W1  W2  W3  W4  W5  W6  W7  ...  Wq
  d1      0   0   1   0   0   0   0  ...   0
  d2      0   0   1   0   1   0   0  ...   0
  d3      0   0   0   0   0   0   0  ...   0
  d4      0   0   1   0   0   0   0  ...   0
  ...
  dn      0   0   1   0   0   0   1  ...   0

So, the expert has q topics and n = 60 documents, → W ∈ {0, 1}^{n×q}.

$$W_{i\ell} = \begin{cases} 1 & \text{if topic } \ell \text{ is present in document } i \\ 0 & \text{otherwise} \end{cases}$$

ψij = ℑ(wi, wj) = Jaccard(wi, wj)

$$\psi = \begin{pmatrix} \psi_{11} & \psi_{12} & \cdots & \psi_{1n} \\ \psi_{21} & \psi_{22} & \cdots & \psi_{2n} \\ \vdots & & \ddots & \vdots \\ \psi_{n1} & \psi_{n2} & \cdots & \psi_{nn} \end{pmatrix}$$

So, ψ is our standard baseline matrix.

For assessing the goodness of a given similarity and projection method, define:

1. |ℑraw(di, dj) − Ψ(di, dj)| = δraw(di, dj)

2. |ℑCTM(di, dj) − Ψ(di, dj)| = δCTM(di, dj)

If δCTM < δraw, we conclude that CTM mimics (approximates) the expert better. We assume that Ψ is the ground truth for the sacred books.

The distances applied to the corpora of this research are a collection of different distances, such as Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square, and Clark. To obtain a similarity, we used Similarity = 1 − distance for distances that satisfy d(x, y) = d(y, x), and Similarity = 1/(1 + distance) when d(x, y) ≠ d(y, x).

1.2 Thesis Organization

The thesis is organized as follows. Chapter 2 describes the data collection and processing. Chapter 3 gives an overview of the proposed techniques related to the similarity measures along with their mathematical properties. Chapter 4 continues the explanation of the techniques related to feature extraction and the different topic modeling techniques. Chapters 5 and 6 describe the research setup and report the results of using the proposed approach. Chapter 7 draws conclusions from the presented research and outlines potential future research that could extend this thesis.

1.3 Major Components of the Engine

The diagram below illustrates the major steps used for the analysis followed by the evaluation algorithm.

Figure 1.1: Major Components of the System

1.4 Algorithm

Figure 1.2: The Research Algorithm

2 Data Collection and Processing


2.1 Quran

The first data source is the English translation of the Quran available in Delano et al. [2007]. It is the religious book of Muslim people of the past, present, and future. Muslims believe that the Quran is in its original form today and will remain so under the protection of God until the end of time. As in the Bible, many topics are covered. The Quran addresses many themes, such as spiritual guidance, political guidance, community and family guidance, love, punishment, mercy, the creation and its conception, as well as God, just to name a few. The Quran is considered a spiritual guide for those willing to learn what God has ordained for them and is used as a rule book for the game of life, so to speak. Similarly to the Bible, many prophets and their stories are mentioned: Jesus, Moses, Noah, and Abraham, to name a few. The stories of the prophets are recorded as a means of guidance and used as a medium to teach present and future generations how to conduct oneself (Ali [1934]). One of the greatest difficulties was finding a trusted translation of the Quran, mostly because it was revealed and passed down through the generations in the Arabic language. The chosen version for this research is from the Maududi [2011] website, which has translations available in many different languages, produced by many respected and well-educated scholars. Considering that the author of this research has memorized half of the Quran and has now spent years learning English, this translation is believed to be a legitimate choice for the comparison. Mathematically, a comparison of the Bible and the Quran requires a unit of comparison. For the purpose of this research, one chapter is considered one unit of comparison, and 30 chapters in total will be compared. In order to create equal units of comparison, text measuring 30-32 kb is considered one chapter.

2.2 Bible

The second data source is the Bible. The chosen copy of the Bible is obtained from Biblica [1973, 1978, 1984, 2011], the 21st Century King James Version (KJ21). It is the revelation of God given to humanity through the prophet Jesus. This book is respected as God's guidance and law across all Christian faiths. The Bible is the Christian source for spiritual guidance, family and community guidance, love, punishment, mercy, the creation and its conception, as well as God, just to name a few (Hyers [1984]). The Bible tells the stories of the prophets of God as a means of teaching His believers how to handle difficult situations, how to treat others, what to do when being mistreated, and how to spread love and mercy, to name a few. It is much more than just the collection of the prophets Jesus, Moses, Noah, and Abraham.

It is the guidance and the rulebook for Christians who desire to follow what God has ordained for them (Eissfeldt [1965]). The mathematical units for this study are the holy chapters in both books. In order to use similar means of measurement, the following books were used from the Bible: Deuteronomy, Genesis, Exodus, Isaiah, and Jeremiah. These books, as well as the Quran, were equally divided into files of size 30-32 kb to obtain the same comparison unit. The following two fragments of the sacred corpus represent the natural structure of our data.

Deuteronomy Part of CHAPTER 1

8 Behold, I have set the land before you: go in and possess the land which the LORD sware unto your fathers, Abraham, Isaac, and Jacob, to give unto them and to their seed after them. 9 ¶ And I spake unto you at that time, saying, I am not able to bear you myself alone: 10 The LORD your God hath multiplied you, and, behold, ye [are] this day as the stars of heaven for multitude. 11 (The LORD God of your fathers make you a thousand times so many more as ye [are], and bless you, as he hath promised you!) 12 How can I myself alone bear your cumbrance, and your burden, and your strife? 13 Take you wise men, and understanding, and known among your tribes, and I will make them rulers over you. 14 And ye answered me, and said, The thing which thou hast spoken [is] good [for us] to do. 15 So I took the chief of your tribes, wise men, and known, and made them heads over you, captains over thousands, and captains over hundreds, and captains over fifties, and captains over tens, and officers among your tribes

Quran Part of CHAPTER 1

2|2|This is the Book of Allah: there is no doubt about it. It is guidance to God fearing people, 2|3|who believe in the unseen, establish the Salats and expend (in Our way) out of what We have bestowed on them; 2|4|who believe in the Book We have sent down to you (i.e. the Qur’an) and in the Books sent down before you, and firmly believe in the Hereafter. 2|5|Such people are on the right way from their Lord and such are truly successful. 2|6|As for those who have rejected (these things), it is all the same to them whether you warn them or do not warn them: they are not going to believe. 2|7|Allah has sealed up their hearts and ears and a covering has fallen over their eyes, and they have incurred the severest punishment. 2|8|Then there are some who say, "We believe in Allah and the Last Day", whereas they do not believe at all. 2|9|They thus try to deceive Allah and the Believers, but they succeed in deceiving none except themselves and they realize it not.

2.3 Document’s Name Code

Data Size

• Mathematical unit of comparison = chapter = 30 kilobytes of terms (the kilobyte is a multiple of the unit byte for digital information)

• Bible textual data:

1. Deuteronomy

2. Genesis

3. Exodus

4. Isaiah

5. Jeremiah

Document’s Code

The chapters are coded with the prefix Ch followed by the chapter number x, for the thirty chapters we have for each document.

1. Chapters obtained from the Holy Bible are coded as follows:

• Ch.x.D refers to chapters obtained from Deuteronomy.
• Ch.x.G refers to chapters obtained from Genesis.
• Ch.x.E refers to chapters obtained from Exodus.
• Ch.x.Is refers to chapters obtained from Isaiah.
• Ch.x.JE refers to chapters obtained from Jeremiah.

2. Chapters obtained from the Holy Quran are coded as follows:

• Name ch.x, such that x = 1, 2, ..., 30, for the full Quran.

2.4 DTM for the Raw Data

The first input to the similarity measures is the raw corpus. The raw corpus in this case refers to the corpora without applying any of the knowledge discovery algorithms. We are interested in handling the big corpus without any modification in order to test the improvement of the distance measures. Hence, the term-document matrix of the raw texts was normalized and unitized as follows: each document is represented in a vector space with tf-idf normalization, and the vectors were then unitized such that $\sum_{j} x_{ij} = 1$. None of the information retrieval techniques for feature extraction were applied.
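As a rough illustration of this step, the sketch below builds a document-term matrix with the tm package and unitizes each document vector so its entries sum to one; the corpus object raw_corpus is a placeholder name, and the weighting here is simplified to raw counts rather than the full normalization described above.

    library(tm)

    # raw_corpus is assumed to be a tm VCorpus holding the 60 chapter files (placeholder).
    dtm <- DocumentTermMatrix(raw_corpus)   # raw term counts, no filtering applied
    X   <- as.matrix(dtm)

    # Unitize every document vector so that sum_j x_ij = 1.
    X_unit <- X / rowSums(X)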

2.5 Processing the Raw Corpus

The second input used to assess the distances is again the raw corpora, but with reduced noise. The steps for preparing the raw texts are as follows:

2.5.1 Information Retrieval

First we need to understand some concepts related to the typical process of knowledge discovery. Information retrieval is the activity of finding material (usually documents) of an unstructured nature (usually text) that is relevant to an information need, from a collection of information resources (usually stored on a computer) (Manning and Schütze [1999]). A common example of information retrieval with respect to text analysis is using Google to search for any topic. By measuring which documents or websites are retrieved (recall) and what fraction of those documents contain information relevant to the user's need (precision) (Singhal [2001]), you end up with the modern-day Google; however, this application has many more everyday uses, such as email search or searching for a file on your laptop, to name a couple. By continually recalculating the quality metrics of recall and precision you end up with more accurate or relevant information. To search all the documentation in a corporation or sift through all the information on the Internet manually would be impossible. However, information retrieval takes most of the hard work away from us and is therefore an essential part of today's world.

2.5.2 Filter the Text

Figure 2.1: Knowledge Discovery Process in Databases (KDD)

In order to reduce the noise in the data, stop words and redundant words are removed. Any words that have fewer than three letters were omitted. To improve the performance of the research, a specific list of stop words was created. The Bible and the Quran use numbers and letters to distinguish one verse or chapter from another; those marks were removed. Based on prior knowledge and a scan of the text, an additional stop-word list was created; see Appendix A for the full list. This list includes the original English stop words, any noise words, and any words believed to give no value to the analysis or to have little semantic meaning, such as rare words. This is implemented in R using the tm package (Feinerer).
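A minimal sketch of this filtering step with the tm package is shown below; corpus and the custom stop-word list sacred_stops are illustrative placeholders rather than the exact lists used in the thesis (see Appendix A for the full list).

    library(tm)

    # Placeholder custom stop words: verse/chapter markers and other noise terms.
    sacred_stops <- c(stopwords("english"), "chapter", "verse", "thy", "thee")

    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)                 # drop verse/chapter numbering
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, sacred_stops)     # English + custom stop words
    corpus <- tm_map(corpus, stripWhitespace)

    # Drop tokens shorter than three letters.
    drop_short <- content_transformer(function(x) gsub("\\b\\w{1,2}\\b", " ", x))
    corpus <- tm_map(corpus, drop_short)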

2.5.3 Categorized Terms

The interest of this research is to study the Bible and the Quran for similarities, not for the specific content details of each book. Therefore, we implicitly defined equivalence classes of tokens by grouping words. The importance of this step is to give more accurate results with respect to distance, as well as to reduce the dimensionality of the text. Related words are, for comparison's sake, programmed to be counted as a single word or category. Some examples are as follows:

1. Whenever the name of a prophet is mentioned, even with different spellings of the prophet's name, or the noun prophet itself is mentioned, it is counted and labelled prophet.

2. Any female mentioned by name, the pronoun she, or the words woman/women are grouped under the label of woman.

3. Words related to fire and hell were collected under the term punishment.

4. Any specific types or names of food are generalized as food.

5. Whenever the words Lord, God, and Father are mentioned in the Bible and the word Allah in the Quran, they are all grouped as God.

6. The words related to the devil were collected under the term evil.

7. Any food, including fruits, vegetables, meat, etc., mentioned by name was grouped under the label of food.

8. Any drink mentioned by name was grouped under the label of drink.

9. All Arabic and Islamic expressions that have a high frequency were translated to their English meaning:

• giblah converted to direction.
• aiiah|surah converted to verses.
• mushrik|mushriken converted to disbeliever.
• asham converted to location.
• zakah|zakat converted to charity.
• salat converted to prayer.
• ahikam translated to wisdom.

See Appendix B for full list
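A sketch of how such equivalence classes can be applied is given below; the mapping table is a small illustrative subset of the full list in Appendix B, and corpus is again a placeholder tm corpus.

    library(tm)

    # Illustrative subset of the grouping rules described above (regex -> label).
    term_map <- c("\\ballah\\b|\\blord\\b|\\bfather\\b" = "god",
                  "\\bzakah\\b|\\bzakat\\b"             = "charity",
                  "\\bsalat\\b"                         = "prayer",
                  "\\bmushrik(en)?\\b"                  = "disbeliever")

    group_terms <- content_transformer(function(x) {
      for (pat in names(term_map)) x <- gsub(pat, term_map[[pat]], x)
      x
    })
    corpus <- tm_map(corpus, group_terms)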

2.5.4 Minimize Distance Between Vectors

The sacred texts used for this research were obtained from an online source. Therefore, extra meta-characters were expected to be found within the data. Dots and hyphens that were used to break single words were removed. Additionally, further steps were applied to the texts, such as removing punctuation from strings, chapter marks, and HTML tags, as well as white space. Furthermore, a lower-case conversion was applied (Ng et al. [1997]).

2.5.5 Synonymy and Polysemy

Synonymy is the case of two terms that are written differently but hold the same meaning. It is common in sacred texts that two different words have the same or a similar meaning, and there are many such cases in the Bible and the Quran. However, we are not concerned about this issue in this study, because performing probabilistic topical analysis will allocate words according to their co-occurrence. In other words, when performing probabilistic topic modeling, the same word can have different probabilities in different topics according to its context. For example, different synonyms of the word "old" could be antiquated, ancient, obsolete, extinct, etc., and when applying topic modeling, the word antique may be put under a topic that is different from the topic of the word ancient. With that being said, words with the same semantic meaning will not mislead the result.

Another problem arising when analyzing text data is polysemy. A polyseme is a word that has several meanings. To tackle this problem, a WordNet dictionary can be used through the R environment and many other languages. WordNet is an annotated semantic lexicon database introduced to R via the wordnet package (Stevens et al. [2011]); it has a collection of nouns, verbs, adjectives, and definitions of vocabulary and places them into similar sets. It is freely available and commonly used for text analysis and natural language processing (NLP). This program is used to reduce the differences in the translations of the Bible and the Quran from their original languages by combining most synonymous and polysemous words into a root word. However, we are not concerned about polysemy, for the same reason mentioned for synonymy.

2.5.6 Stemming the Texts

'Stemming refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of verbs are identified and indexed (counted) as the same word' (Nisbet et al. [2009]). So basically we are grouping morphologically related words. This step is an important approach for information retrieval and text analysis applications in general, for example clustering, measures of textual similarity, and automatic text processing. The benefits of stemming the text are to reduce the size of the data and to enhance information retrieval performance.

There are several approaches to stemming: affix or suffix removal, the Porter stemming algorithm, n-grams, table look-up, stochastic algorithms, and matching algorithms (Hull [1996]). R by default implements a popular method in information retrieval known as Porter's stemming. In that technique, the stemmed word is reduced, in most cases, to a form with little comprehensible meaning. For this research, out of the many stemming algorithms, we used an affix and suffix removal algorithm to support the comparison of the two groups (Harman [1991]). In this technique, the algorithm replaces the related group of words with the root word (the primary lexical unit). This stemming technique is particularly useful because removing the affix and suffix leaves the root word easily recognizable and readable.
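The snippet below sketches the stemming step; tm's stemDocument applies Porter's stemmer by default, so it stands in here for the affix/suffix removal algorithm described above rather than reproducing it exactly.

    library(tm)

    corpus <- tm_map(corpus, stemDocument)   # Porter stemming (tm's default, via SnowballC)

    # e.g. stemDocument(c("believers", "believed", "believing")) returns "believ" three times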

2.6 Document Term Matrix Representation

To extract useful information from unstructured texts, we transform the textual data into vector spaces. One approach to achieve this is by using the bag-of-words (BOW) representation (Deerwester et al. [1990]). BOW simplifies the statistical analysis by converting the data into a matrix format. This matrix is called the document-term matrix (DTM), where columns represent terms and rows represent documents. Let D = {d1, d2, ..., dj} represent the collection of documents in the data set, and T = {t1, t2, ..., tn} represent the set of terms in document dj. The DTM is defined as the matrix of the frequencies of terms tk within documents dj. Because the occurrence of each term varies among the documents, the DTM can be very sparse. Sparsity is the percentage of cells in the matrix that are equal to zero. A high percentage of sparsity in a DTM indicates that the terms occur in only a few documents. To illustrate the concept of the DTM, here is a fragment of the actual term-document matrix from our corpora.

Quran.DTM =

  Docs  oaths  obeys  obligation  obvious  partner  payment  pearls  perceiving  perform
  ch1     0      0        0          2        0        0        0        0          4
  ch10    0      0        3          0        0        0        0        0          0
  ch11    0      0        0          0        0        0        0        0          2
  ch12    0      0        0          0        0        0        0        0          0
  ch13    0      0        0          0        0        0        0        1          0
  ch14    3      0        0          0        0        0        0        0          0
  ch15    0      0        0          0        1        0        0        0          0
  ch16    0      0        0          0        0        1        0        0          0
  ch17    0      0        0          0        0        0        1        0          0
  ch18    1      0        0          0        1        0        0        0          1

Bible.DTM =

  Docs     affliction  aflame  afraid  afreedomat  afterward  agate  agather  aggression
  Ch1.D        0         0       4         0           0        0       0         0
  Ch10.G       0         0       1         0           1        0       0         0
  Ch11.G       0         0       3         0           0        0       0         0
  Ch12.G       0         0       1         0           0        0       0         0
  Ch13.E       5         0       0         0           0        0       0         0
  Ch14.E       0         0       0         0           0        0       0         0
  Ch15.E       0         0       0         0           0        0       0         0
  Ch16.E       0         0       0         0           0        1       0         0
  Ch17.E       0         0       0         0           0        0       0         0
  Ch18.E       0         0       1         0           0        1       0         0
  Ch19.IS      0         0       1         0           1        0       0         0
  Ch2.D        0         0       3         0           0        0       0         0
  Ch20.IS      0         1       3         0           0        0       0         1

Since each document may contain only a small number of words from the entire vocabulary, this process yields a sparse matrix. Indeed, although this matrix transformation is efficient, information is lost because the text loses its syntactic structure.

An important aspect of the DTM is weighting the term occurrences within each document and within the entire set of documents. Since we have two approaches for measuring the similarity, the raw text and the filtered corpora, we applied a different weighting to each DTM. For the raw text, we applied term frequency weighting, combining global and local weighting.

$$\text{Local.Weight} = L(t, d) = tf_{ij} \tag{2.1}$$

As the first step of local weighting, every distinct word within each corpus is given a measurement relative to the whole text. Higher frequencies increase this value.

The second step is weighting the matrix globally to penalize each document for features that are common across the data sets. The relative frequency of each word within the entire set of documents is calculated. If any particular word has a high frequency throughout the entirety of the text, then this word is deemed less valuable. A lower frequency shows a differentiation between chapters.

$$\text{Global.Weight} = idf(t, D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right) \tag{2.2}$$

where N is the total number of documents and |{d ∈ D : t ∈ d}| is the number of documents that contain the feature (Sebastiani and Ricerche [2002]). Finally, a new matrix contains the normalized term frequency t_ij, the product of the global and local weighting (Azzopardi et al. [2009]). We express

tij = L(t, d) ∗ idf(t, D) (2.3)

Weighting the terms leads to much better results, as it takes into account the relative importance of potential search terms. Although weighting is assumed to be an unnecessary step in latent Dirichlet allocation (LDA), Andrew and Beter [2010] show that applying different weighting schemes can significantly improve the result. For this research, we applied tf weighting to the DTM used for topic modeling, with N ∼ Poisson(ζ).
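The local and global weights of equations (2.1)-(2.3) can be computed directly from a count matrix as sketched below; X is a placeholder documents-by-terms count matrix.

    # X: documents x terms matrix of raw counts (placeholder).
    tf    <- X                                # local weight, equation (2.1)
    df    <- colSums(X > 0)                   # number of documents containing each term
    idf   <- log(nrow(X) / df)                # global weight, equation (2.2)
    tfidf <- sweep(tf, 2, idf, `*`)           # t_ij = L(t, d) * idf(t, D), equation (2.3)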

2.7 Distance Performance and the Ψ matrix

To assess the performance of a given similarity measure, a new matrix needs to be defined as a standard baseline matrix of comparison. To obtain this matrix, we mapped the "true" topics to their chapters for both books using knowledge obtained from a theologian in the field. Once the topics were obtained, the presence of topic_i in chapter_j was scored as 1, and 0 otherwise. We then applied a binary distance to the resulting matrix. Accordingly, a strong similarity measure is one that is close to Ψ.

Let [d1, d2, ..., dn] be the documents of the expert matrix, and let Sim = [Sim1, Sim2, ..., Sim9] be the nine distance measures applied to the CTM and raw DTM matrices. The Ψ matrix and the evaluation can then be defined as follows:

1. Ψ = Expert.Sim = Jaccard.Sim(di, dj)

2. |ℑraw(di, dj) − Ψ(di, dj)| = δraw(di, dj)

3. |ℑCTM(di, dj) − Ψ(di, dj)| = δCTM(di, dj)

If δCTM < δraw, we conclude that CTM mimics the expert better.
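This comparison against the expert baseline can be sketched as follows; psi, sim_raw, and sim_ctm are placeholder n-by-n similarity matrices on the same 0-1 scale.

    # psi:     expert (Jaccard) similarity matrix, treated as ground truth
    # sim_raw: a similarity measure computed on the raw DTM
    # sim_ctm: the same measure computed on the CTM topic proportions
    delta_raw <- abs(sim_raw - psi)
    delta_ctm <- abs(sim_ctm - psi)

    # If the CTM-based similarity tracks the expert more closely on average,
    # mean(delta_ctm) will be smaller than mean(delta_raw).
    mean(delta_ctm) < mean(delta_raw)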

One difficulty with this step is utilizing the pure knowledge from the expert's point of view. Extracting and sorting the relevant data from multiple experts is always associated with many online text-mining steps. It is challenging because a search engine may make life a little easier, but even then, the probability of extracting irrelevant material within a corpus is high. We had to get rid of extremely repetitive words and differentiate relevant synonyms and translation mistakes. All of these factors and many more add to the challenges and complexity of extracting the expert matrix from web sources.

3 Similarity Measures


3.1 Measures of Similarity

One of the major text-based retrieval applications is determining the semantic similarity of texts. In supervised and unsupervised learning, the detection of semantic similarity is widely used in many areas of research, such as automatic plagiarism detection (Karatzoglou et al. [2004]), biomedical informatics (Pedersen et al. [2007]), semantic analysis (Maguitman et al. [2005]), recommendation systems, web clustering and browsing (Schultz and Joachims [2004]), and so on. The importance of evaluating different dissimilarity measures is due to the fact that different distance algorithms can yield slower exact or faster approximate results according to the context of the data. For example, the study of White and Jose [2004] investigated three classes of proximity measures (association, correlation, and distance) on ten different pairs of chosen topics. The study showed that the results vary from one measure to another and behave differently according to the feature selection methods used on the data. The study by Penney et al. [1998] compares the performance of six similarity measures in two-dimensional and three-dimensional medical image registration. According to that study, only two out of six intensity-based similarity measures were able to register the clinical images accurately and robustly.

In this section we introduce different mathematical distances grouped by family, and we empirically evaluate their performance. Each distance family has specific mathematical properties that differentiate it from the others. The effectiveness of applying a similarity measure is believed to be related to the mathematical properties of its family.

There are two types of similarity for measuring the distance between strings:

• String Similarity:

One way to calculate the similarity between two strings is by using string kernel-based methods. In string kernel functions, the kernel function calculates the inner product between two string vectors. There are various types of string kernel methods, such as spectrum, boundrange, constant, exponential, fullstring, and string; each type has its own character-matching functionality. The string kernel functions are implemented in R with the package kernlab (Karatzoglou et al. [2004]). The string metric family has several algorithms for measuring the similarity of two strings. It calculates similarities between two strings by counting the number of operations required for matching common substrings of a given sequence to another. Examples of such algorithms are the Levenshtein distance (edit distance), Damerau-Levenshtein, the Hamming distance, and the longest common subsequence metric. The main difference between those distances is the string alignment and set of operations chosen by each algorithm.

Using either the string kernel or the string metric approach has limitations with respect to this analysis. These two approaches are more applicable for evaluating semantic similarity in gene analysis in computational biology, quantifying the similarity of DNA structures (Leslie et al. [2002]), spell checkers (Cucerzan and Brill [2004]), SVM text classification (Lodhi et al. [2002]), and discriminative protein classification (Leslie et al. [2004]).

This can be a useful choice in sacred texts, to compute the differences between two short sacred strings such as the difference of two random verses.

For R implementations, see van der Loo [2013], Keuleers and Keuleers [2013], Goslee and Urban [2007], and Chessel et al. [2004].
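For example, base R's adist() computes the Levenshtein (edit) distance between two short verses, which is the kind of string-metric comparison described above; the two strings here are purely illustrative.

    v1 <- "In the beginning God created the heaven and the earth"
    v2 <- "In the beginning was the Word"
    adist(v1, v2)   # number of single-character edits needed to turn v1 into v2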

• Metric Similarity :

The first measure of similarity refers to so-called metric similarity. Metric similarity refers to the distance between two points in a metric space (Cha). Some important topological properties of a metric distance are:

1. d(x, y) ≥ 0

2. d(x, y) = 0 , if and only if , x = y

3. d(x, y) = d(y, x)

4. d(x, z) ≤ d(x, y) + d(y, z)

To apply this concept to this research, we need a measure to evaluate the closeness of two vectors of distributions in the latent space, taking into consideration that each corpus has a number of different latent topics. A specific technique is used to estimate the number of topics over all documents, focused on the distribution of words per topic. This is done in order to give an equivalent number of topics, which results in a uniform matrix of comparison. Two topics are considered similar if they share a similar distribution of tokens; two topics are identical when they have exactly the same distribution of words.

An important question to be answered is to what extent two documents share similar topics. In other words, for a set of documents with topic proportions θ_dj = (θ_t1j, θ_t2j, ..., θ_tkj), what is the topical interaction between any given documents dj, where θ_tij represents the token distribution of topic i in document j?

Applying topic modeling techniques helps project the features from the DTM space into a lower-dimensional space. Obtaining the topical assignment matrix identifies the topical proportions of each document. Therefore, the similarity between two given documents can be computed on their topic projections. Two topics of different documents are considered similar if they share a similar word distribution, and two documents, accordingly, are considered similar if their topic distributions are similar to some extent.

3.2 Minkowski Family

When it comes to text and data mining, we commonly face the challenge of a high-dimensional space, which has been shown to have an impact on the choice of the distance. In our analysis it is even worse, because the raw data are very sparse. When p = 2, the Minkowski distance is the Euclidean distance, and when p = 1 it is known as the Manhattan distance. Studies have shown that the members of this family tend to be consistently sensitive to the p-th power of the absolute value (Aggarwal et al. [2001]). In our case, the unitized sparse document-term matrix is especially likely to suffer when the Euclidean distance is applied: the Euclidean distance between any two objects will be around √2.

For two probability distributions P = (p1, p2, p3, ..., pj) and Q = (q1, q2, q3, ..., qj), the Minkowski distance is a metric distance class on Euclidean space given by the formula (Deza and Deza [2009]):

$$D(P, Q) = \left( \sum_{i=1}^{j} |p_i - q_i|^p \right)^{1/p} \tag{3.1}$$

When p = 1, the Minkowski distance is the Manhattan distance. When p = 2, the formula becomes the popular Euclidean distance. Many studies have investigated the weak performance of this family in high-dimensional spaces and examined the sensitivity of applying proximity and distance measures in such spaces. Indeed, many of those studies concluded that these metric distances lose their qualitative meaning at high p-th powers.

3.2.1 Euclidean and Manhattan Distance

The Euclidean distance is probably the most popular and commonly used type of distance. It is defined as follows:

$$E(P, Q) = \sqrt{\sum_{i=1}^{j} (p_i - q_i)^2} \tag{3.2}$$

The Manhattan Distance is defined as follows:

$$D_M(P, Q) = \sum_{i=1}^{n} |p_i - q_i| \tag{3.3}$$

This distance function tends to be more robust in high dimensions and is preferable to the Euclidean distance metric (Aggarwal et al. [2001]).
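A direct base-R sketch of equations (3.2) and (3.3), assuming p and q are two document vectors of equal length:

    euclidean <- function(p, q) sqrt(sum((p - q)^2))   # equation (3.2)
    manhattan <- function(p, q) sum(abs(p - q))        # equation (3.3)

    # e.g. with two unitized document vectors from the earlier sketch:
    # euclidean(X_unit[1, ], X_unit[2, ]); manhattan(X_unit[1, ], X_unit[2, ])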

3.3 Inner Product Family

3.3.1 Cosine Similarity

The cosine similarity measure is the normalized inner product between two documents in the vector space; it measures the cosine of the angle between them. The formula can be written as

$$\text{Similarity} = \cos(\theta) = \frac{p \cdot q}{\|p\|\,\|q\|} = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}} \tag{3.4}$$

where 1 indicates the two vectors are the same and 0 means they are different.

Mathematically, the cosine measure is considered a very efficient measure of similarity and is popularly used for many applications, but from a text analysis point of view this measure tends to be biased (Li and Han [2013]). The following figure shows an example of the cosine similarity values between two vectors with shared features. The result indicates that cosine tends to behave less accurately for the DTM.

The study of Qian et al. [2004] shows that in a high-dimensional space, the cosine distance tends to give similar results to the Euclidean distance.

Figure 3.1: Cosine similarities between five synthetic Vectors
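Equation (3.4) in base R, again assuming p and q are document vectors of equal length:

    cosine_sim <- function(p, q) sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))  # equation (3.4)

    # cosine_sim(p, p) returns 1; orthogonal (no shared terms) vectors return 0.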

3.4 Squared-Chord Family

3.4.1 Bhattacharyya Distance

The Bhattacharyya distance is another measure of variation with certain properties: it is symmetric, positive semi-definite, and unbounded. It is defined as:

$$D_B(P, Q) = -\log \sum_{i=1}^{d} \sqrt{p_i q_i} \tag{3.5}$$

Clearly 0 ≤ DB(P,Q) ≤ ∞

The Bhattacharyya distance does not satisfy the triangle inequality, but the Hellinger distance $D_H(P, Q) = \sqrt{1 - \sum_{i=1}^{k} \sqrt{p_i q_i}}$ does (Kailath [1967]).

3.4.2 Hellinger Distance

The Hellinger distance between two probability measures is

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{d} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2} = \sqrt{1 - \sum_{i=1}^{d} \sqrt{p_i q_i}} \tag{3.6}$$

The values of the Hellinger distance fall between 0 and 1.
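Under the assumption that p and q are discrete probability vectors (non-negative and summing to one), equations (3.5) and (3.6) can be sketched as:

    bhattacharyya <- function(p, q) -log(sum(sqrt(p * q)))                       # equation (3.5)
    hellinger     <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)   # equation (3.6)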

3.5 Chi-Square Family

3.5.1 Probabilistic Symmetric chi-Square and Clark Distance

$$D(P, Q) = 2 \sum_{i=1}^{d} \frac{(p_i - q_i)^2}{p_i + q_i} \tag{3.7}$$

The Clark distance (Clark 1952) is defined as Deza and Deza[2006]:

$$D(P, Q) = \sqrt{\sum_{i=1}^{d} \left(\frac{|p_i - q_i|}{p_i + q_i}\right)^2} \tag{3.8}$$
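Equations (3.7) and (3.8) for probability vectors p and q; the small epsilon guarding against division by zero in sparse vectors is an implementation detail of this sketch, not something specified in the thesis.

    chisq_sym <- function(p, q, eps = 1e-12) 2 * sum((p - q)^2 / (p + q + eps))        # equation (3.7)
    clark     <- function(p, q, eps = 1e-12) sqrt(sum((abs(p - q) / (p + q + eps))^2)) # equation (3.8)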

3.6 Shannon’s Entropy Family

3.6.1 Kullback-Leibler Divergence

The Kullback-Leibler divergence (KL divergence), or relative entropy, is one way to measure the dissimilarity between two probability distributions. Using this divergence, we measure the amount of knowledge we obtain by moving from a prior distribution to a posterior distribution (Lin [1991a]). Given two probability distribution vectors P(i) and Q(i), the KL divergence is defined as follows:

$$D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{3.9}$$

This measure is only defined if P and Q both sum to 1 and both P(i) and Q(i) are non-negative. Although this non-negative measure is widely used in many applications in pattern recognition, image processing, and statistical measures of dissimilarity in general, it is worth knowing that the KL divergence is not a distance, since D_KL(p, q) ≠ D_KL(q, p). An extension of Kullback-Leibler is the symmetric Jeffreys divergence, which is numerically stable and symmetric, so the comparison direction does not matter to us. The Jeffreys divergence can be defined as

$$D_J(p \| q) = D_{KL}(p \| q) + D_{KL}(q \| p),$$

which leads to:

$$D_J(p \| q) = \sum_{i=1}^{d} (p_i - q_i) \log\left(\frac{p_i}{q_i}\right) \tag{3.10}$$

Clearly, 0 ≤ D_J(P, Q) ≤ ∞. As can be seen, both divergence measures are undefined when Q(i) = 0 and P(i) ≠ 0. This can be a problem when dealing with the raw texts: because of the high sparsity of the DTM, the absolute continuity condition cannot be satisfied. Therefore, a small value epsilon was added to avoid log(0/0) for the raw matrix. Lin [1991b] presents a new class of information-theoretic divergence measures between two probability distributions that does not require the condition of absolute continuity.
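A sketch of the smoothed Kullback-Leibler and Jeffreys divergences of equations (3.9) and (3.10); as noted above, a small epsilon is added so that log(0/0) never occurs on the sparse raw matrix.

    kl <- function(p, q, eps = 1e-12) {
      p <- p + eps; q <- q + eps
      sum(p * log(p / q))                          # equation (3.9)
    }
    jeffreys <- function(p, q) kl(p, q) + kl(q, p) # symmetric extension, equation (3.10)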

3.6.2 Jensen-Shannon Divergence

The Jensen-Shannon divergence was derived to overcome the weaknesses of the Kullback-Leibler divergence; see Budka et al. [2011].

It is defined in terms of a symmetrizing relative entropy as follows:

$$D_{JS}(P \| Q) = 0.5 \left\{ \sum_{i} P(i) \log\left(\frac{2P(i)}{P(i) + Q(i)}\right) + \sum_{i} Q(i) \log\left(\frac{2Q(i)}{P(i) + Q(i)}\right) \right\} \tag{3.11}$$
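Equation (3.11) follows directly, reusing the kl() helper sketched in the previous section:

    jensen_shannon <- function(p, q) {
      m <- (p + q) / 2
      0.5 * (kl(p, m) + kl(q, m))   # equation (3.11)
    }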

3.7 Jaccard Similarity on the Expert Matrix

Given two binary documents, the Jaccard similarity measures the intersection of the two documents' attributes. It is given by

$$J(d_i, d_j) = \frac{T_{11}}{T_{01} + T_{10} + T_{11}} \tag{3.12}$$

The Jaccard distance is complementary to the Jaccard coefficient and is obtained by:

DJ (di, dj) = 1 − J(di, dj) (3.13)
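For two binary topic-assignment vectors wi and wj, equations (3.12) and (3.13) reduce to a few lines:

    jaccard_sim <- function(wi, wj) {
      t11 <- sum(wi == 1 & wj == 1)      # topics present in both documents
      t10 <- sum(wi == 1 & wj == 0)
      t01 <- sum(wi == 0 & wj == 1)
      t11 / (t01 + t10 + t11)            # equation (3.12)
    }
    jaccard_dist <- function(wi, wj) 1 - jaccard_sim(wi, wj)   # equation (3.13)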

Unlike the cosine measure, Jaccard satisfies the Lorenz concentration theory, which implies that adding a constant value to both documents under comparison will consequently increase the similarity.

4 Probabilistic Topic Modeling


In a high-dimensional space, it is expected that many features (words) will be redundant and irrelevant. The more noise we have, the less accurate the similarity performance we will get. We need, therefore, to achieve a trade-off between the features and the dimensionality. In other words, we want to reduce the noise in the data without losing the meaning of the features. We have a set of sacred corpora, and each corpus is a mixture of observed terms. The goal is to find a method that is able, first, to find the latent topics among the mixture of words and, second, to label these words probabilistically under categories called themes. In other words, we need a tool that extracts features and reduces the dimensionality of the data. This is indeed similar to the idea behind principal component analysis (PCA) and factor analysis. In fact, the idea of applying probability to draw a conclusion based on a topical projection is very similar to the idea behind the loading matrix in factor analysis.

However, from a text analysis point of view, we need to extend the idea of factor analysis so that we can use the joint distribution to compute the conditional distribution of the hidden structure given the observed words. A document is a distribution over topics, and each topic is a distribution over the underlying words. The aim is to compute word probabilities under topics and topic probabilities under documents. The problem is totally unsupervised: the input of our research is a set of sacred chapters with no predefined topics. The only prior knowledge specified in this work is the optimal number of topics to extract. For R applications, see Hornik and Grün [2011], Chang and Chang [2010], and Jurka et al. [2014] for topic classification.
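A minimal sketch of fitting both models with the topicmodels package (Hornik and Grün [2011]); dtm is a placeholder document-term matrix and k = 20 is an arbitrary illustrative number of topics, not the value selected in Chapter 5.

    library(topicmodels)

    k <- 20                                   # illustrative only; the actual k is chosen in Chapter 5
    lda_fit <- LDA(dtm, k = k, method = "VEM")
    ctm_fit <- CTM(dtm, k = k)

    theta_lda <- posterior(lda_fit)$topics    # documents x topics proportion matrix
    theta_ctm <- posterior(ctm_fit)$topics    # input to the similarity measures of Chapter 3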

4.1 Probabilistic Latent Semantic Analysis

The generative probabilistic model, also called the aspect model, is a bag-of-words model.

The process of this generative model can be described with a graphical model. The ww are observed random variables and the zw are latent variables. An arrow between circles indicates conditional probability. θd is the prior mixture of topics z assigned to a given document, and it is a parameter.


Figure 4.1: Plate Diagram of PLSA. For more information, see Blei et al. [2003a]

So for each document we generate θd; for each term we draw a topic Z randomly from the distribution θd; and finally we draw ww from the distribution specified by that topic Z. This process is repeated D times. W is shaded in the diagram because it is the only variable with known values. An issue with this model is overfitting, which leads to poor predictive performance (Hofmann [2004]).

4.2 Latent Dirichlet Allocation

To overcome the over-fitting in PLSA, we need a model that is structured to fit the training data set well and, at the same time, is able to perform well on test data. Latent Dirichlet allocation (LDA) is more robust than PLSA in the sense that LDA overcomes the over-fitting problem and produces a more consistent probabilistic model.

LDA basically is a bag-of-words generative model where each topic is modeled as a probability distribution over terms. In the previous graphical model, we did not have the parameters alpha and beta. The main difference between the PLSA model and LDA is that θd is no longer a parameter but a random variable that comes from a distribution governing all the thetas. This distribution is specified to be the Dirichlet distribution, such that

$$Dir(p \mid \alpha_1, \ldots, \alpha_m) = \frac{1}{Z} \prod_k p_k^{\alpha_k - 1} \tag{4.1}$$

where Z, a normalization factor, is the multinomial Beta function expressed in terms of the gamma function:

$$Z = \frac{\prod_{k=1}^{m} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{m} \alpha_k\right)} \tag{4.2}$$

The Dirichlet distribution, from a mathematical point of view, is the conjugate prior of the multinomial: if the prior distribution of the multinomial parameters Z ∼ Mult(θd) and W ∼ Mult(β) is Dirichlet, then the posterior distribution is also a Dirichlet distribution (Blei et al. [2003b]). It is assumed that any chosen word within a document D is independently selected from a mixture of k topics. So sampling a word from document D will not have an effect on the choice of any of the subsequent words. This feature of LDA can limit the extraction of hidden information.


Figure 4.2: Plate Diagram of LDA. For more information, see Blei et al. [2003a]

Generative Process

1: for each document dd in corpus D do
2:   choose θd ∼ Dirichlet(α)
3:   for each position w in dd do
4:     generate a topic zw ∼ Multinomial(θd)
5:     generate a word ww ∼ p(ww | zw, β), a multinomial distribution over words conditioned on the topic and the prior β.

So in this generative process we have:

• D, k, N are fixed known parameters.

• α and β are fixed unknown parameters.

• θd, zw, ww are random variables.

A toy simulation of this generative process is sketched below.
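The following simulation illustrates the process above; all the sizes (K, V, D, N) and the use of rdirichlet() from the gtools package are illustrative assumptions, not values or code from the thesis.

library(gtools)

set.seed(1)
K <- 3; V <- 50; D <- 5; N <- 100               # topics, vocabulary size, documents, words per document
alpha <- rep(0.5, K); beta <- rep(0.1, V)

phi  <- rdirichlet(K, beta)                     # K x V topic-word distributions (one row per topic)
docs <- vector("list", D)
for (d in 1:D) {
  theta.d <- as.vector(rdirichlet(1, alpha))    # step 2: topic proportions for document d
  z <- sample(1:K, N, replace = TRUE, prob = theta.d)           # step 4: a topic for every position
  w <- sapply(z, function(k) sample(1:V, 1, prob = phi[k, ]))   # step 5: a word given the topic
  docs[[d]] <- w
}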

4.3 Correlated Topic Modeling

One of the limitations of LDA is its inability to capture the correlation of the underlying topics. When the correlation among latent topics is ignored, the resulting topics are less associated than those produced by CTM Blei and Lafferty[2007], Li and McCallum[2006], Wang et al.[2007]. LDA uses a Dirichlet prior, and each topic is selected independently without taking into account correlations among the topics themselves. CTM does not use the Dirichlet; instead it uses a Gaussian distribution, which captures the correlation of the latent topics more efficiently. Our corpora are sacred texts, and it is assumed that there is a strong relationship between the topics. Based on this, applying CTM should result in a more realistic model and will introduce many more latent topics than LDA does Murphy[2012].

Figure 4.3: Graphical model representation of the Correlated Topic Model, see Blei and Lafferty[2006a]

We modify Step 2 in the LDA generative process to the following: with η ∈ R^(K−1) and Σ ∈ R^((K−1)×(K−1)) we have:

• η_k ∼ MVN(µ_k, Σ_k)

• set f(η_{k,i}) = exp(η_{k,i}) / Σ_{i=1}^{N_k} exp(η_{k,i}), for i = 1, ..., N_k and k = 1, ..., K

Here µ_k is the mean of the topic distribution, and Σ_{K−1} is the variance-covariance matrix of the topics on the logit scale. As can be seen, the generative process of CTM is identical to that of LDA except that the topic proportions are drawn from a logistic normal rather than a Dirichlet; so in CTM we have a logistic normal distribution as the prior over θd. Another generative method is Dynamic Topic Modeling (DTM) Blei and Lafferty[2006b], in which the underlying topics are modeled according to the time evolution of the document corpora. With regard to our research, it could be interesting to investigate the evolution of the sacred text language throughout the history of revelations, and accordingly one could study in further detail specific events surfaced by DTM. However, it is not our intent to apply this technique because it would require a larger collection of texts.

4.3.1 Posterior Distribution of CTM

The posterior probability of the CTM model is defined as:

$$f(\eta_{k,i}) = \frac{\exp(\eta_{k,i})}{\sum_{i=1}^{N_k}\exp(\eta_{k,i})} \qquad (4.3)$$

for i = 1, ..., N_k and k = 1, ..., K, where K is the number of chosen topics. η is drawn from a multivariate Gaussian; in other words, we have a prior that depends on µ and on Σ, the variance-covariance matrix of the logistic normal. This step guarantees that dependent terms are grouped under one theme more efficiently Cohen et al.[2008].
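A minimal sketch of this draw in R, assuming an identity covariance matrix for illustration (mvrnorm() is from the MASS package; appending a zero coordinate is one common identifiability convention, not necessarily the thesis implementation):

library(MASS)                                    # provides mvrnorm()

K   <- 20
mu  <- rep(0, K - 1)                             # mean of the (K-1)-dimensional Gaussian
Sig <- diag(K - 1)                               # covariance matrix: encodes topic correlations

eta   <- mvrnorm(1, mu, Sig)                     # eta ~ MVN(mu, Sigma)
eta   <- c(eta, 0)                               # fix the K-th coordinate for identifiability
theta <- exp(eta) / sum(exp(eta))                # logistic (softmax) map onto the topic simplex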

4.4 Learning Algorithm Using Variational Expectation Maximization

To test the model on a hold-out set, the data set needs to be partitioned into representative groups for both independent sets. When dividing the data purely at random, a test set that is informative about a given chapter is less likely than with a balanced design. Therefore, in this case, it is important to use stratified sampling for both the training and test sets, as follows:

Training set: first training portion = 75% of the Bible; second training portion = 75% of the Quran.
Test set: first test portion = 15% of the Bible; second test portion = 15% of the Quran.
Full sets: full training set = first training portion + second training portion; full test set = the sum of both test portions.
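A sketch of this stratified split in R, assuming the full DTM is named dtm and a vector book labels each row as "Bible" or "Quran" (both names are hypothetical):

set.seed(2014)
split.book <- function(idx, p.train = 0.75, p.test = 0.15) {
  # Sample a training and a disjoint test portion from one book's chapter indices.
  tr <- sample(idx, round(p.train * length(idx)))
  te <- sample(setdiff(idx, tr), round(p.test * length(idx)))
  list(train = tr, test = te)
}

b <- split.book(which(book == "Bible"))
q <- split.book(which(book == "Quran"))

dtm.train <- dtm[c(b$train, q$train), ]          # full training set = both training portions
dtm.test  <- dtm[c(b$test,  q$test), ]           # full test set     = both test portions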

As a reminder, the prior of the CTM model, η_k ∼ MVN(µ_k, Σ_k), has logistic-normal form. Although significant work has been done on determining the posterior probability using the conditional distribution of η given w, exact inference remains a challenge due to the non-conjugacy of the logistic normal and multinomial distributions. See Roberts et al.[2014], a recent study that presented scalable inference using a Gibbs sampling algorithm for the logistic-normal topic model. In this research, the algorithm used is Variational Expectation Maximization (VEM) Wainwright and Jordan[2008]. The basic idea of VEM is to approximate the posterior with a tractable family of distributions when it is too complex to draw from the joint distribution Chen et al.[2013]. This is an optimization problem rather than exact inference, and it trades accuracy for speed. In fact, using VEM will not hurt our analysis given the size of the corpora: although this "optimization" algorithm computes the inference less accurately, it is computationally more tractable and often faster than exact inference Murphy[2012].

4.5 Number of Topics K

K denotes the fixed number of topics specified prior to running any of the probabilistic topic models. Each document can exhibit K > 1 topics because we sample repeatedly within a document. According to the chosen K, the resulting topical allocation matrix will have K columns of topics. Fixing K a priori is considered one of the limitations of LDA and CTM: specifying K requires prior knowledge of the corpus and also affects the interpretability of the resulting latent topics. For example, if we want to investigate the topics related to a specific term, then with a high K the term will be assigned to many unrelated topics, and with a low K the modeled topics will cover a high number of unrelated words Landauer et al.[2013]. So, what is the right number of topics K within a document? There is no exact answer to this question. A recent way to answer it is by applying the hierarchical Dirichlet process LDA model (HDP-LDA) Teh et al.[2006] to estimate the optimal K; however, this gives infeasible results on our data because of the small corpus size. Another way to estimate K is by applying the maximum log likelihood over a sequence of topic numbers and choosing the K associated with the highest log likelihood value Griffiths and Steyvers[2004]. This leads to another goodness-of-fit measure, perplexity, which tests the ability of a model to predict unseen new documents Srivastava and Sahami[2010].

$$\mathrm{perp}(D_t) = \exp\left\{-\frac{\sum_{m}\log p(w_{d_m})}{\sum_{m} N_{d_m}}\right\} \qquad (4.4)$$

Empirically speaking, the choice of hyperparameters can control the number of chosen topics Blei et al.[2003b]. The number of topics chosen by log-likelihood (or, equivalently, perplexity) tends to be proportional to the choice of the hyperparameters alpha and beta, and modifying the hyperparameters can affect the perplexity result at each iteration. Therefore, we computed the perplexity over the held-out test data with three folds and fixed hyperparameters, as sketched below.
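A sketch of this computation with the topicmodels package (the 3-fold assignment and the grid of K values are illustrative; perplexity() is the package's held-out perplexity for VEM-fitted LDA models):

library(topicmodels)

ks    <- seq(2, 50, by = 2)                                  # candidate numbers of topics
folds <- sample(rep(1:3, length.out = nrow(dtm.train)))      # 3-fold assignment of training chapters

perp <- sapply(ks, function(k) {
  mean(sapply(1:3, function(f) {
    fit <- LDA(dtm.train[folds != f, ], k = k, method = "VEM",
               control = list(seed = 1))                      # fixed seed, default hyperparameters
    perplexity(fit, newdata = dtm.train[folds == f, ])        # perplexity on the held-out fold
  }))
})
plot(ks, perp, type = "b", xlab = "Number of topics K", ylab = "Held-out perplexity")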

5 Validation and Results

5.1 General Topic Annotation

5.1.1 The structure and dimension of DTM for Bible and Quran

The following is the representation of the term-document matrices of both sacred texts.

Bible.DTM1
<<DocumentTermMatrix (documents: 30, terms: 2334)>>
Non-/sparse entries: 16588/53432
Sparsity           : 76%
Maximal term length: 14
Weighting          : term frequency (tf)

Quran.DTM1
<<DocumentTermMatrix (documents: 30, terms: 6442)>>
Non-/sparse entries: 25709/167551
Sparsity           : 87%
Maximal term length: 16
Weighting          : term frequency (tf)

As illustrated, the number of terms that appear in the DTM is not large in comparison to the number of documents. When applying LDA, and especially CTM, we need a saturated corpus for convergence; otherwise the optimization scheme for CTM will converge very slowly or not at all. However, if we had more corpora of sacred text, then we could remove the most frequent terms and include only terms whose term frequency (tf) is higher than the median or the mean, in order to gain more precision.
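If such filtering were applied, a minimal sketch would look as follows (Bible.DTM1 is the matrix shown above; the median threshold is purely illustrative):

tf   <- colSums(as.matrix(Bible.DTM1))        # total frequency of each term in the Bible DTM
keep <- tf > median(tf)                       # keep only terms above the median term frequency
Bible.DTM.filtered <- Bible.DTM1[, keep]      # column-subset the document-term matrix
dim(Bible.DTM.filtered)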

5.2 The structure and dimension of all the data sets used in the analysis

• »»The structure and dimension of the CTM-training»»

$ nrow    : int 44
$ ncol    : int 9259
$ dimnames: List of 2
  $ Docs : chr [1:44] "ch27" "ch3" "ch29" "ch23" ...
  $ Terms: chr [1:9259] "abide" "abode" "abrogate" ...
class = "DocumentTermMatrix", weighting = "term frequency" "tf"

• »» The structure and dimension of the CTM-test set»»:

$ ncol    : int 5386
$ dimnames: List of 2
  $ Docs : chr [1:16] "ch11" "ch15" "ch16" "ch17"
  $ Terms: chr [1:5386] "abide" "abiding" "ablaze" ...
class = "DocumentTermMatrix", weighting = chr [1:2] "term frequency" "tf"

• »»The structure of the normalized RAW DTM for the raw analysis»»:

$ nrow    : int 60
$ ncol    : int 8119
$ dimnames: List of 2
  $ Docs : chr [1:60] "Ch1.D" "Ch10.G" "Ch11.G" ...
  $ Terms: chr [1:8119] "abid" "abl" "abod" "abov" ...
class = chr [1:2] "DocumentTermMatrix", weighting = chr [1:2] "term frequency - inverse document frequency (normalized)" "tf-idf"

• »» The structure of the expert topical assignment matrix »»:

chr [1:2120, 1:60] "Gen.1" "0" "0" "0" "0" "0" "0" ...
dimnames = List of 2
  $ : chr [1:2120] "Chapters" "aaron" "abd" "abel" ...
  $ : chr [1:60] "Gen.1" "Jer.12" "Deut.22" "Isa.1" ...

5.2.1 K Topics

The following plots show the result of applying CTM and LDA over a sequence of topic numbers from K=2 to K=50, with 3 folds.

Figure 5.1: Number of Topics using LDA Algorithm

The perplexity and log-likelihood graphs show that the curves for CTM and LDA flatten after K=20 to K=25; that means adding more than 20 topics will not add extra information to the model and might only add noise. Therefore, the optimal number of topics is chosen to be 20 for the rest of this analysis. Although the choice of optimal K helped to identify the proper number of topics for better document-topic categorization, this step might not be ideal for similarity measurement, especially for small corpora. Let us look back at the total probability of a model chosen by LDA.

Figure 5.2: Number of Topics using CTM Algorithm

$$p(W, Z, \theta, \varphi;\, \alpha, \beta) = \prod_{i} p(\varphi_i; \beta)\, \prod_{j} p(\theta_j; \alpha)\, \prod_{t} p(Z_{j,t} \mid \theta_j)\, p(W_{j,t} \mid \varphi_{Z_{j,t}}) \qquad (5.1)$$

In the total probability of LDA, θd is a K-dimensional vector of topic probabilities per document, which must add up to one. That means the larger the assigned K, the more terms we include and, consequently, the thinner the probability per topic. The main goal here is to find the optimal K that supplies us with enough information, but not so much that it distorts the similarity between two documents. Let us look at a few different examples of K and see how these different values influence the analysis.

Apparently, chapters tend to be more similar when fewer topics K are initiated. Hence, we need to be very aware of the influence of the value of K on the analysis: the more topics we include, the more extra "confusion" of the "cleaned" corpus we add to the similarity matrix. Accordingly, the choice of K should interpret the data and at the same time help us obtain the lowest distance. Hence, a correlated model with exactly K=20 is chosen for the rest of the analysis.

5.3 Topical Assignment

A document d_i ∈ {d_1, d_2, ..., d_N} can be represented as a mixed proportion of K topics. As another representation of the θ matrix, the topical assignment matrix projects the D_n vectors of K topics ranked by their topic distribution. For example, document ch27 can be viewed as a mixed proportion of topics:

V_d1 = [13, 20, 14, 3, 6, 15, 7, 1, 18, 17, 16, 12, 10, 4, 19, 8, 11, 9, 2, 5]

Needless to say, some topics are essentially absent for ch27, such as topics 15, 2 and 5. It is more interesting to have a closer look at the topical proportions in the θ matrix to highlight the size of each topic per document. Appendix A contains the table of topical proportions for each chapter.

5.3.1 Topical Proportion

We fit the CTM model to the training data and trained the hidden variables using VEM. The resulting posterior probability is given by the chapters-by-topics matrix summarized in Table 5.3.

Table 5.3: Topical proportions of the learning set (Topics 1-20 per training chapter).
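A sketch of how the θ matrix and the per-chapter topic ranking (such as V_d1 above) can be recovered from the fitted model with the topicmodels package (ctm.fit, the fitted CTM object, is an assumed name):

theta <- posterior(ctm.fit)$topics                     # documents x 20 matrix of topic proportions

# Rank the topics of each chapter from the most to the least prevalent.
rank.topics <- t(apply(theta, 1, order, decreasing = TRUE))

rank.topics["ch27", ]          # the ordering reported as V_d1 for chapter ch27
round(theta["ch27", ], 4)      # the corresponding topical proportions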

5.4 Topical Content

Table 5.4: The set of neighboring words for the highest probability topic per chapter (Topics 1-20).

6 Proximity, Similarity and Distance


6.1 Distances Between Probability Distributions

To increase the computational and mathematical precision, the similarity measures should be applicable to comparing two continuous probability distributions. In the case of LDA, we are comparing the densities of two continuous distributions; in fact, when k=2 the Dirichlet reduces to the Beta distribution, which is a family of continuous distributions. In the case of CTM, the topic proportions are sampled from a logit-normal distribution, which is also a family of continuous distributions. Therefore, in both LDA and CTM, the rows of the resulting θ matrix are continuous distributions.

In the following sections, we try to answer how the distance measures are affected by more or less noise in the data. We compare three main categories of distances (metric, divergence, and entropy) using the posterior probability as well as the raw unprocessed texts. The robustness of the measures may vary from one category to another and within a category. Weakness, sensitivity, and strength with respect to noisy data are critical and should be considered when it comes to sacred contexts. This is due to the fact that filtering the sacred texts can be computationally expensive, a challenging task because of the required prior knowledge, and less effective at reflecting the actual meaning of the texts. Thus, we chose to compare the raw unfiltered corpora with the cleaned ones, so that we have two extreme cases of comparison of the same texts.

6.2 Cosine Degree of Similarity

Table 6.1: Cosine similarity between the training chapters.

6.3 Hellinger Degree of Similarity

Table 6.2: Hellinger similarity on a 0-100 scale.

6.4 Bhattacharyya Distance Converted to Similarity

Table 6.3: Bhattacharyya similarity on a 0-100 scale.

6.5 Symmetric Kullback-Leibler Divergence

Table 6.4: Symmetric K-L divergence of two probability distributions.

6.6 Jensen-Shannon Similarity

Table 6.5: Jensen-Shannon similarity on a 1-100 scale.

6.7 Euclidean Similarity

Table 6.6: Euclidean similarity on a scale of 1-100.

6.8 Manhattan Similarity

Table 6.7: Manhattan similarity on a 1-100 scale.

6.9 Symmetric Chi-Square

Table 6.8: Symmetric Chi-Square similarity.

6.10 Clark Similarity

Table 6.9: Clark similarity measure.
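The following sketch shows how such similarity matrices can be computed from the rows of θ (each row is one chapter's topic distribution). The small eps constant and the conversion of a distance onto a 0-100 similarity scale are illustrative assumptions; the thesis may normalize differently.

eps <- 1e-12   # guards against log(0) and division by zero

hellinger <- function(p, q) sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
cosine    <- function(p, q) sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))
sym.kl    <- function(p, q) sum(p * log((p + eps) / (q + eps))) +
                            sum(q * log((q + eps) / (p + eps)))
jensen.sh <- function(p, q) {
  m <- (p + q) / 2
  0.5 * sum(p * log((p + eps) / (m + eps))) + 0.5 * sum(q * log((q + eps) / (m + eps)))
}

n <- nrow(theta)
H <- matrix(0, n, n, dimnames = list(rownames(theta), rownames(theta)))
for (i in 1:n) for (j in 1:n) H[i, j] <- hellinger(theta[i, ], theta[j, ])
sim.hellinger <- 100 * (1 - H / max(H))   # one way to map a distance onto a 0-100 similarity scale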

6.10.1 Distances Between Probability Distributions of the Raw Corpus

The following is the result of applying the distance algorithms to the raw data. It clearly indicates that the distances lose their stability in the high dimensional space. The majority of them were able to detect, at some level, the syntactic strength within each book in its entirety, but not across books. Similarity matrix plots along with their clustering summaries are presented below; a sketch of how such plots can be generated follows.
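The plot annotations below indicate that the dendrograms were produced with hclust() and Ward's method ("ward.D2") applied to as.dist() of each similarity matrix. A sketch of one heatmap and one dendrogram, using sim.cosi (the cosine similarity matrix named in Figure 6.2) and a similarity-to-distance conversion that is an assumption on our part:

library(ggplot2)
library(reshape2)

# Heatmap of the chapters-by-chapters similarity matrix.
ggplot(melt(sim.cosi), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  labs(x = "Sacred Chapters", y = "Sacred Chapters", fill = "value")

# Hierarchical clustering of the chapters with Ward's criterion.
d  <- as.dist(100 - sim.cosi)                  # turn the 0-100 similarity into a distance
hc <- hclust(d, method = "ward.D2")
plot(hc, main = "Cosine Similarity Cluster")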

Figure 6.1: Cosine Similarity of the Raw Corpus

Figure 6.2: Cosine Similarity Clusters Result (hclust of as.dist(sim.cosi), method "ward.D2")

Figure 6.3: Manhattan Similarity of the Raw Corpus

Figure 6.4: Manhattan Similarity Clusters Result (hclust of as.dist(sim), method "ward.D2")

Figure 6.5: Hellinger Similarity of the Raw Corpus

Figure 6.6: Hellinger Similarity Clusters Result (hclust of as.dist(m2), method "ward.D2")

Figure 6.7: Bhattacharyya Similarity of the Raw Corpus

Figure 6.8: Bhattacharyya Similarity Clusters Result (hclust of as.dist(s1), method "ward.D2")

Figure 6.9: Chi-Square Similarity of the Raw Corpus

Figure 6.10: Chi-Square Similarity Clusters Result (hclust of as.dist(chi.D.sim), method "ward.D2")

Figure 6.11: Clark Similarity of the Raw Corpus

Figure 6.12: Clark Similarity Clusters Result (hclust of as.dist(clark.S), method "ward.D2")

Figure 6.13: The Kullback-Leibler Jeffrey's Divergence Matrix of the Raw Corpus

Figure 6.14: The Kullback-Leibler Jeffrey's Divergence Cluster (hclust of as.dist(mJ), method "ward.D2")

Figure 6.15: Jensen-Shannon Divergence Matrix of the Raw Corpus

Figure 6.16: Jensen-Shannon Divergence Clusters Result (hclust of as.dist(JSD), method "ward.D2")

6.10.2 Expert Topical Assignment

This plot shows the Jaccard distance between the books according to the true expert topical assignment matrix. We build this matrix by removing the meaningless words from the expert annotation and identifying the presence of topics per book using the term-document matrix. The problem with this method is that we treated each token as a topic; in reality, each token can only be considered a topic if it is penalized and normalized across documents. These methods are valid for capturing the most frequent tokens within a given chapter, but they are not going to do much for us when our intention is to merge different expert topics, because each expert for a given book has a different writing style for expressing the main topic. A recommended way to build a low-dimension expert matrix is to manually build a binary weighted matrix based on a set of chosen topics. The drawback of this method is that it is slow and does not meet the requirement of this research to build an automatic detection engine.
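A sketch of the Jaccard computation on such a binary expert matrix (expert.mat, a chapters-by-tokens presence matrix, is a hypothetical name; base R's "binary" distance is the Jaccard distance for 0/1 data):

expert.bin <- (expert.mat > 0) * 1                  # binarize the expert topical-assignment matrix
jac.dist   <- dist(expert.bin, method = "binary")   # Jaccard distance between chapter rows
jac.sim    <- 100 * (1 - as.matrix(jac.dist))       # 0-100 Jaccard similarity, as plotted below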

Figure 6.17: Jaccard Similarity on the experts-topics matrix

It is interesting to look at the Quranic Ψ matrix separately from the biblical expert matrix. We did this to see whether merging three Ψ matrices could have some influence on the distance result. According to this matrix plot, we notice that the Quranic semantics are strongly related to each other, which indicates that the Quranic structure emphasizes similar issues multiple times throughout the Quranic chapters. According to the biblical Ψ matrix, it is noticed that some of the biblical chapters show empty cells while some others are 100% similar.

Figure 6.18: Expert-wise similarity matrix 1 (Quranic syntactic chapters)

The reason for this is the limited topical space provided by the expert for some of the chapters. We either need to merge or average more than one expert DTM to increase the topical dimensions of those chapters, or apply a dimensionality reduction technique like PCA to the expert DTM. However, we do not want to lose any information in the matrix or reduce the dimension at this time; moreover, PCA is not recommended for a sparse DTM matrix. Consequently, we calculated the most frequent topics across the biblical chapters to determine whether those empty chapters are empty because of their small size relative to other, higher-dimensional chapters, or because they have no shared topics with other biblical chapters. The following graph indicates that there were shared topics among those empty cells.

Figure 6.19: Expert-wise similarity matrix 2 (Biblical syntactic chapters)

Figure 6.20: Binary Similarity applied on small density topics

6.11 Evaluation levels

6.11.1 First Similarity Evaluation

The first comparison was done to evaluate the similarity measures according to their behavior at different levels of noise and dimensionality. The input vectors for the comparison were chosen based on the result of the expert matrix, which gives the following:

• Sim(Deut.1, ch26 )=33.3%

• Sim(Deut.1, ch14)=50%

• Sim(Deut.1, ch15)=16%

• Sim(Deut.29, ch1)=33%

• Sim(Deut.29, ch3)=25%

As the figure shows, for the five pairs of sacred chapters, only a few measures were able to distinguish between the two different input matrices. The ψ (Similarity.Expert) result is plotted to visualize how each distance relates to the true distance; we explain this in detail in the section labelled Second Step of Evaluation.

Figure 6.21: Distance behaviour detection through the two levels of noise (Similarity.CTM, Similarity.Raw and Similarity.Expert for the five chapter pairs, across the nine measures)

The Similarity.CTM level of comparison shows the result of the similarity calculations for the five sacred chapter pairs applied to θ, while Similarity.Raw displays the result using the texts with noise. Results from the distance measures applied to the raw corpus and to the θ matrix show that most of the distances struggle, at some level, to distinguish between the documents and cluster around the same similarity percentage. Bhattacharyya, Clark, Manhattan, and symmetric chi-square, for example, exhibit low discrimination power for both input matrices. On the other hand, symmetric Kullback-Leibler, Euclidean and Hellinger show more realistic results: they are able to distinguish the different types of input data. Cosine and Jensen-Shannon gave contrasting results.

6.11.2 Second Step of Evaluation

1. Similarity Evaluation on CTM

Figure 6.22 illustrates the difference between the similarity percentages obtained from ψ and from θ. The similarity measure that results in the shortest distance between these two matrices is preferred.

The distances under evaluation here are the ones that were able to pass the first level of comparison: symmetric Kullback-Leibler, Euclidean, Hellinger, Cosine and Jensen-Shannon. Among these, Figure 6.22 indicates that Kullback-Leibler, Euclidean, Hellinger and Jensen-Shannon are the closest to the ψ matrix.

Figure 6.22: The difference between the expert matrix and the distances applied on the CTM matrix

2. Similarity Evaluation on Document Term Matrix of the Raw Data:

Figure 6.23 shows the difference between the raw document-term matrix and the Ψ matrix. Looking at the results, Euclidean, Hellinger, Kullback-Leibler and Jensen-Shannon achieved success at this level by giving results relatively close to the true distances of the ψ matrix.

As mentioned in the previous analysis, Jensen-Shannon and Cosine show contrasting results, and this step is important for resolving that confusion. In other words, Jensen-Shannon was able to distinguish between the structures of the different inputs as well as Cosine did; however, Cosine gives the farthest difference when compared to the expert matrix, whereas Jensen-Shannon's result was closer to the true distance given by the expert Ψ matrix. This is how we verified the strength of the algorithm at this step.

Figure 6.23: The difference between the expert matrix and the distances applied on the Raw matrix

Below is the summary table of the above analysis. To summarise, we score each distance based on the following intervals:

• close distances if δx(di, dj) ≤ 20%

• medium distance if 20% ≤ δx(di, dj) ≤ 40%

• var distances if δx(di, dj) ≥ 40%
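A sketch of how these intervals translate the mean differences into the scores reported in Table 6.10 (sim.expert, sim.ctm and sim.raw are hypothetical vectors holding, for one measure, the similarity values of the five chapter pairs):

diff.ctm <- abs(sim.ctm - sim.expert)     # differences from the expert baseline on the CTM input
diff.raw <- abs(sim.raw - sim.expert)     # differences from the expert baseline on the raw input

score <- function(d) cut(mean(d), breaks = c(-Inf, 20, 40, Inf),
                         labels = c("close", "medium", "var"))
score(diff.ctm)
score(diff.raw)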

Distance              | Distinguishing level | mean.Difference.raw | raw score | mean.Difference.CTM | CTM score
Bhattacharyya         | bad                  | 9.400               | close     | 12.400              | close
Clark                 | bad                  | 67.964              | var       | 59.132              | var
Cosine                | confused             | 27.400              | medium    | 67.760              | var
Euclidean             | bad                  | 58.400              | var       | 11.306              | close
Hellinger             | medium               | 35.800              | medium    | 20.800              | close
Jensen-Shannon        | confused             | 11.432              | close     | 19.000              | close
Manhattan             | bad                  | 9.984               | close *   | 9.624               | close *
Symmetric Chi-Square  | bad                  | 46.446              | var       | 46.840              | var
Symmetric K-L         | good                 | 14.768              | close     | 22.700              | medium to close

Table 6.10: Distance Score table

Table 6.10 summarizes the three levels of comparison demonstrated above: the distinguishing level, the difference between the distances applied on the document-term matrix of the raw data and ψ, and the difference between the distances applied on the θ matrix and ψ. As can be seen from the summary of the first level, the entropy family and the Hellinger distance scored as good, which means they are the only distances that were able to differentiate between the different data inputs. The mean.Difference.raw and mean.Difference.CTM columns are the averages of the differences over the five chosen chapter pairs; see the appendix for the full calculations related to this table. For these two levels, the entropy family and the Hellinger distance also demonstrate relative stability in performance. Remarkably, the Hellinger distance provides a lower bound for the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.

7 Conclusion and Future Considerations


The goal of this study is to find which distance works best for comparing sacred texts, in order to automate similarity detection for the sacred corpora. Using the Bible and the Quran as our corpora, we explore the performance of various statistical methodologies and feature extraction techniques to prepare the sacred chapters for the final comparison. An initial comparison of the similarity between chapters of the two books, with their noise, showed strong similarity within each sacred book and rarely across the sacred books. Once similarity was determined, a program was needed to mine the text before projecting it into a latent space for the second step of the evaluation. The distances presented in this research are Euclidean, Hellinger, Manhattan, cosine, Bhattacharyya, symmetric Kullback-Leibler, Jensen-Shannon, probabilistic chi-square and Clark. We evaluate each measure according to its ability to distinguish the different levels of noise as well as its closeness to the baseline matrix. A variety of analysis techniques were applied to determine which technique best fits this type of text comparison.

7.1 Research Summary

I hope that this thesis can be a first step towards developing a proximity engine for sacred texts. The aim of this thesis has been to build a framework for automatically mining the text and performing similarity analysis. This work differs from existing approaches in that it automatically cleans the text, projects it into a latent space and evaluates different similarity measures accordingly. Many NLP algorithms were used to complement the analysis, such as stemming, a translation algorithm, and filtering, in order to estimate the true model as accurately as possible and to obtain efficient and sufficient knowledge for comparison. After obtaining the three comparison matrices, we introduced nine distance categories to contribute to studying the similarity between the sacred texts. Throughout this work, we were able to contribute to the fields of statistics, computer science and religious studies by answering the following questions:

1. Is there similarity between the Quran and the Bible? There are many factors related to the answer. A similarity measure is just an indication of how close two textual objects are to each other; it is a tool that conveys some of the shared information between the tested objects, but not all of it. In fact, different approaches for measuring the similarity of corpora lead to different answers. In other words, many factors should be considered when answering this question, such as the algorithm used for NLP processing, the dimension of the data, and, most importantly, the distance measure used. As we showed earlier, two distance measures can give contrasting answers for the same data sets.

Bearing this in mind, this research provides an estimated answer: an intersection between the Bible and the Quran that was most clearly visible between Deuteronomy and four chapters of the Quran.

2. What is the main feature of sacred text?

The semantic structure was found to be significant for the sacred books. That is, the sacred chapters tend to be strongly related to each other within a book, but not across books. One reason for this is the hidden semantic relationship between the texts, such as the writing style, the translation, and the language and slang used at a specific period of time. A future research approach for handling this is to use Markov chain Monte Carlo to learn the pattern of the sacred structure.

3. What is the recommended proximity measure for sacred text? We found that the entropy family is quite effective for measuring similarity among the texts, and the results obtained by this family are the closest to the baseline result.

7.2 Future Extensions to this Research

There are clearly many future directions to enrich this area of research. One direct extension of this research is to average the topic matrices of many experts and to narrow this matrix so that it is as accurate as possible. Another area of research is to classify the translations of the sacred books according to their level of translation accuracy; one approach could be to compare the distance between an author's topical assignment and the averaged ψ matrix. As stated before, the semantic structure is strong within the sacred texts for many reasons; therefore, Markov chain Monte Carlo can be applied to extract the main semantic features.
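As a purely illustrative sketch of these two directions, assuming the topicmodels package interface and hypothetical placeholder objects (dtm for the document-term matrix, expert1 to expert3 for expert-provided topic matrices with the same dimensions as θ), one possible starting point is:

library(topicmodels)

# Fit a topic model by collapsed Gibbs sampling (Markov chain Monte Carlo);
# dtm and the number of topics k are assumptions for illustration.
k   <- 20
fit <- LDA(dtm, k = k, method = "Gibbs",
           control = list(burnin = 1000, iter = 2000, seed = 2014))
theta <- posterior(fit)$topics          # chapters x topics proportions

# Average several hypothetical expert topic matrices into one reference matrix
psi.avg <- Reduce(`+`, list(expert1, expert2, expert3)) / 3

# Compare one chapter's topical assignment with the averaged matrix
hellinger <- function(p, q) sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
hellinger(theta[1, ], psi.avg[1, ])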

A Appendix A

ch27 2 20 14 16 6 15 13 4 19 5 10 17 9 3 8 18 12 1 11 7
ch3 10 20 14 16 13 2 15 4 5 6 19 17 9 3 8 18 12 11 1 7
ch29 2 20 14 19 16 6 13 4 15 5 10 9 17 3 8 18 12 1 11 7
ch23 1 16 6 20 9 2 14 19 7 10 13 15 5 4 8 17 3 18 12 11
ch7 10 16 20 14 2 13 6 9 15 4 5 19 17 3 8 18 12 7 11 1
ch22 9 20 16 14 13 10 2 19 6 15 4 5 17 3 8 18 12 1 11 7
ch12 7 16 20 14 6 10 13 9 2 15 4 5 17 19 8 3 18 12 1 11
ch19 6 20 16 14 2 4 15 5 13 17 9 19 8 3 10 18 12 11 1 7
ch24 9 19 16 20 6 14 10 2 13 4 15 5 17 3 8 18 7 12 1 11
ch14 16 10 9 20 14 19 6 2 7 13 15 4 5 17 3 8 18 12 1 11
ch6 16 14 20 10 13 6 2 4 15 9 5 17 3 19 8 18 12 7 1 11
ch18 9 20 13 14 16 2 15 4 6 19 10 5 17 3 8 18 12 1 11 7
ch5 10 14 20 13 16 4 2 5 15 19 17 3 6 9 8 18 12 1 11 7
ch2 13 20 14 16 4 2 15 5 10 17 6 3 9 8 18 19 12 11 1 7
ch9 16 9 6 20 14 13 10 2 7 4 5 15 19 17 3 8 18 12 1 11
ch4 14 20 16 13 10 2 15 4 5 6 19 17 9 3 8 18 12 1 11 7
ch30 19 20 14 2 4 15 16 5 9 10 6 17 13 3 8 18 12 11 1 7
ch10 14 20 16 13 4 15 5 10 2 17 6 19 3 8 18 9 12 1 11 7
ch13 7 20 16 14 4 15 5 6 17 13 2 10 9 8 18 19 3 11 12 1
ch1 2 16 13 20 10 14 6 9 4 15 5 19 17 3 7 8 18 12 1 11
ch20 16 6 9 20 14 2 7 19 10 13 4 15 5 17 3 8 18 12 1 11
ch26 9 16 20 14 10 2 6 19 13 4 15 5 17 3 8 18 12 7 1 11
Ch27.JE 12 8 17 20 5 15 4 3 14 13 2 18 16 6 10 19 9 11 1 7
Ch3.D 3 20 5 15 4 14 17 13 2 8 16 11 18 6 19 10 12 9 1 7
Ch29.JE 17 8 20 15 4 5 12 14 3 13 2 6 18 16 10 19 9 11 1 7
Ch23.IS 8 20 4 15 5 14 17 16 2 13 3 12 6 18 10 19 9 1 11 7
Ch7.G 15 20 4 5 14 17 8 3 18 2 13 16 6 12 19 10 9 11 1 7
Ch22.IS 8 20 17 4 15 5 14 12 3 13 2 16 18 6 19 10 9 1 11 7
Ch12.G 11 20 4 15 3 5 17 14 18 8 6 16 13 2 19 10 12 9 7 1
Ch19.IS 12 20 15 8 4 5 17 14 3 2 16 13 18 6 19 10 9 11 1 7
Ch24.IS 1 8 20 15 5 17 3 12 14 4 2 13 6 16 18 19 10 11 9 7
Ch14.E 5 4 20 15 3 14 17 8 2 18 13 6 16 12 19 10 9 11 7 1
Ch6.G 15 20 4 5 14 17 3 8 2 18 13 16 6 12 19 10 9 11 1 7
Ch18.E 18 20 5 4 15 14 3 17 2 8 13 16 6 10 12 19 9 11 1 7
Ch5.D 5 20 3 4 15 14 17 8 18 2 16 13 6 12 10 19 11 9 7 1
Ch2.D 5 3 20 4 15 14 17 8 2 13 16 18 6 10 12 11 19 9 1 7
Ch9.G 15 20 4 5 17 14 18 3 8 2 13 16 6 12 19 10 9 11 1 7
Ch4.D 11 3 20 5 4 15 14 17 8 16 18 13 2 6 19 10 12 9 1 7
Ch30.JE 17 20 4 15 14 5 8 3 13 2 6 12 18 16 10 19 9 11 7 1
Ch10.G 18 15 20 4 5 17 14 11 3 8 13 2 16 6 12 10 19 9 7 1
Ch13.E 4 20 5 15 14 17 3 13 8 2 16 18 6 19 12 11 9 10 7 1
Ch1.D 5 20 3 15 4 14 17 8 2 13 16 18 6 10 12 19 9 11 7 1
Ch20.IS 12 20 15 4 8 5 17 14 3 2 16 13 6 18 19 10 9 11 1 7
Ch26.JE 12 20 5 17 4 15 8 14 3 13 16 2 6 18 10 19 9 11 1 7

Table A.1: Topics Projection of the Learning Set
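A ranking of this kind can be read off a fitted topic model; the following is a hedged R illustration, assuming the topicmodels package and a document-term matrix dtm for the learning set, rather than the exact code behind Table A.1.

library(topicmodels)

# Fit a correlated topic model on the learning set (dtm and the seed are assumptions)
fit   <- CTM(dtm, k = 20, control = list(seed = 2014))
theta <- posterior(fit)$topics                 # chapters x topics proportions

# For each chapter, order the 20 topics from largest to smallest weight,
# giving one row of a ranking table in the style of Table A.1
rank.table <- t(apply(theta, 1, order, decreasing = TRUE))
rownames(rank.table) <- rownames(theta)
head(rank.table)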

Chapter Distance Similarity.R Similarity.CTM Difference.row Difference.CTM.full
Deut.1, ch26 Cosine 5 98.9 28 65.9
Deut.1, ch14 Cosine 3 99.1 47 49.1
Deut.1, ch15 Cosine 3 99.7 13 83.7
Deut.29, ch1 Cosine 4 99.1 29 66.1
Deut.29, ch3 Cosine 5 99 20 74
Deut.1, ch26 euclidean 90 27.97 57 5.03
Deut.1, ch14 euclidean 90 27.97 40 22.03
Deut.1, ch15 euclidean 89 23.56 73 7.56
Deut.29, ch1 euclidean 90 18.32 57 14.68
Deut.29, ch3 euclidean 90 17.77 65 7.23
Deut.1, ch26 Hellinger 52 10 19 23
Deut.1, ch14 Hellinger 52 9 2 41
Deut.1, ch15 Hellinger 52 10 36 6
Deut.29, ch1 Hellinger 90 12 57 21
Deut.29, ch3 Hellinger 90 12 65 13
Deut.1, ch26 Bhatt. 33 38 0 5
Deut.1, ch14 Bhatt. 32 37 18 13
Deut.1, ch15 Bhatt. 33 37 17 21
Deut.29, ch1 Bhatt. 35 41 2 8
Deut.29, ch3 Bhatt. 35 40 10 15
Deut.1, ch26 Manh. 35.33 34.94 2.33 1.94
Deut.1, ch14 Manh. 35.26 34.84 14.74 15.16
Deut.1, ch15 Manh. 35.32 34.91 19.32 18.91
Deut.29, ch1 Manh. 35.78 35.1 2.78 2.1
Deut.29, ch3 Manh. 35.75 35.01 10.75 10.01
Deut.1, ch26 K-L 51.21 8.82 18.21 24.18
Deut.1, ch14 K-L 52.8 7.61 2.8 42.39
Deut.1, ch15 K-L 5.2 7.6 10.8 8.4
Deut.29, ch1 K-L 50.86 9.74 17.86 23.26
Deut.29, ch3 K-L 49.17 9.73 24.17 15.27
Deut.1, ch26 clark 99.92 89.01 66.92 56.01
Deut.1, ch14 clark 99.9 90.95 49.9 40.95
Deut.1, ch15 clark 99 91.37 83 75.37
Deut.29, ch1 clark 99 90.5 66 57.5
Deut.29, ch3 clark 99 90.83 74 65.83
Deut.1, ch26 Chi-Square 78 78.39 45 45.39
Deut.1, ch14 Chi-Square 78 78.28 28 28.28
Deut.1, ch15 Chi-Square 78.03 78.28 62.03 62.28
Deut.29, ch1 Chi-Square 77.6 78.16 44.6 45.16
Deut.29, ch3 Chi-Square 77.6 78.09 52.6 53.09
Deut.1, ch26 Jenson.S 22 50 11 17
Deut.1, ch14 Jenson.S 21.69 50 28.31 0
Deut.1, ch15 Jenson.S 21.65 50 5.65 34
Deut.29, ch1 Jenson.S 22.92 51 10.08 18
Deut.29, ch3 Jenson.S 22.88 51 2.12 26

Table A.2: Similarity and Distance table

Distance mean.Similarity.CTM mean.Difference.CTM mean.Similarity.row mean.Difference.row
Bhattacharyya 38.6 12.4 33.6 9.4
clark 90.532 59.132 99.364 67.964
Cosine 99.16 67.76 4 27.4
euclidean 23.118 11.306 89.8 58.4
Hellinger 10.6 20.8 67.2 35.8
Jensen Shannon 50.4 19 22.228 11.432
Manhattan 34.96 9.624 35.488 9.984
Symmetric Chi-Square 78.24 46.84 77.846 46.446
Symmetric K-L 8.7 22.7 41.848 14.768

Table A.3: Summary of the similarity and Distance table
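For transparency, the entries of Table A.3 are the per-distance averages of the corresponding columns of Table A.2 over the five chapter pairs; a minimal R sketch, assuming a data frame tab.A2 holding the columns of Table A.2:

# tab.A2 is assumed to contain the columns shown in Table A.2
tab.A3 <- aggregate(cbind(Similarity.R, Similarity.CTM,
                          Difference.row, Difference.CTM.full) ~ Distance,
                    data = tab.A2, FUN = mean)
tab.A3
# Check for the cosine rows: mean(c(98.9, 99.1, 99.7, 99.1, 99)) is 99.16,
# which matches mean.Similarity.CTM for Cosine in Table A.3.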

# Quran prophet list (the replacement term "prophet" is assumed from the thesis description)
quran.prophets <- paste(
  "yaqub|yaghuth|yahya|Ishaq|O prophet|messenger|Messenger of God|ismail|Ahmed|ishmael",
  "Zakariyya|Ta-Ha|Noah|Idris|Lut|Ya'qub|Ibrahim|Saleh|Al-Yasa|Zakariya|Shu'ayb",
  "Sulayman|Uzayr|Musa|Ilyas|Luqman|Harun|Hud|Ayub|Dawud|Dhul-Kifl|Zedekiah|Jeremiah",
  "Muhammad|Isaac|Sulaiman|Aaron|Jesus|Adam", sep = "|")
docs[[j]] <- gsub(quran.prophets, "prophet", docs[[j]], ignore.case = TRUE)

# Bible prophet list (replacement term "prophet" assumed from the thesis description)
docs[[j]] <- gsub("Adam|Abraham|Methusaleh|Lot|Noah|Heber|Joseph|Enoch|Ishmael|Isaac|Jacob|Jethro|Job|Ezekiel|Moses|Aaron|David|Solomon|Elias|Jonah|Zachariah|Jesus",
                  "prophet", docs[[j]], ignore.case = TRUE)

# Quran female list (replacement term "woman" assumed from the thesis description)
docs[[j]] <- gsub("she|Maryam|woman|womenn|Asiyah|Wife|Wives|female|womanr|females|Sara|Eve|Hawwa|Mary|wife|wives|Asia",
                  "woman", docs[[j]], ignore.case = TRUE)

# Bible female list (replacement term "woman" assumed from the thesis description)
bible.females <- paste(
  "Abigail|Abihail|Abijah|Abishag|Abital|Achsah|Ada|Adah|Ahinoam|Ahlai|Aholah|Aholiamah",
  "Aholibah|Aholibamah|Anammelech|Anna|Apphia|Asenath|Asherah|Ashtoreth|Atarah|Athaliah",
  "Azubah|Gomer|Hadassah|Haggith|Hammoleketh|Hamutal|Handmaid|Hannah|Harlot|Hazelelponi",
  "Heifer|Helah|Hephzibah|Herodias|Hodesh|Hoglah|Huldah|Iscah|Ishtar|Jecoliah|Jedidah",
  "Jehoaddan|Jehosheba|Jemima|Jerioth|Jerusha|Jewess|Jezebel|Joanna|Jochebed|Judith",
  "Julia|Junias|Keren-happuch|Keturah|Kezia|Lapidoth|Lois|Lo-ruhamah|Lydia|Maachah",
  "Magdalene|Mahalath|Mahlah|Mara|Martha|Mary|Mehetabel|Merab|Meshullemeth|Michaiah",
  "Michal|Midwife|Miriam|Mrs.Noah|Bashemath|Bath-sheba|Bathshua|Bernice|Bithiah|Candace",
  "Chloe|Claudia|Concubine|Cozbi|Cushite|Damaris|Daughter|Deaconess|Deborah|Delilah",
  "Diana|Diblaim|Dorcas|Drusilla|Eglah|Elect lady|Elisabeth|Elisheba|Elizabeth|Ephah",
  "Ephratah|Ephrath|Esther|Ethiopian woman|Eunice|Euodia|Euodias|Eve|Naamah|Naarah",
  "Nagge|Naomi|Nehushta|Noadiah|Nymphas|Oholibamah|Orpah|Peninnah|Persis|daughters",
  "Phebe|Priscilla|Prophetess|Prostitute|Puah|Queen|Queen of Heaven|Rachab|Rebecca",
  "Rebekah|Reumah|Rhoda|Rizpah|Ruhamah|Salome|Sapphira|Sara|Sarai|Serah|Sherah",
  "Shelomith|Shelomoth|womann|Shimeath|Shimrith|Shiphrah|Shomer|Shua|Shulamite woman",
  "Spouse|Succoth-benoth|Susanna|Syntyche|Syrophenician|Tabitha|Tahpenes|Taphath|Timna",
  "Tirza|Tirzah|Tryphena|Tryphosa|Vashti|Witch|Zebudah|Zeresh|Zeruah|Zeruiah|Zibiah",
  "Zillah|womanary|Zipporah", sep = "|")
docs[[j]] <- gsub(bible.females, "woman", docs[[j]], ignore.case = TRUE)

# Drinks in the Quran and Bible (replacement term "drink" assumed from the thesis description)
docs[[j]] <- gsub("Water|Milk|Honey|Wine|Olive|oil|Vinegar",
                  "drink", docs[[j]], ignore.case = TRUE)

# Food list in the Quran (replacement term "food" assumed from the thesis description)
docs[[j]] <- gsub("Dates|Fruit|Grapes|Grains|Olives|Buckthorn|Pomegranate|Mustard|Onion|Herbs|beans|vegetables|Cucumbers|Garlic|food|Lentil|Gourd|Banana|Herbage|Abb|Fig",
                  "food", docs[[j]], ignore.case = TRUE)

# Food in the Bible
docs[[j]] <- gsub("Dates|Almonds|Pistachio|Raisins|Sycamore|Leeks|Barley|Millet|Spelt|Pigeon|Partridge|Curds",
                  "food", docs[[j]], ignore.case = TRUE)

B Appendix B

Figure B.1: The density of the chosen topics sorted by biblical experts. (Placeholder for the original plot: x-axis Expert.Sort from 0.00 to 1.00, y-axis density; the topics shown are Moses, Idol, Believe, Evil, Messenger, Human, Abraham, Money, Children, Birth, Death, Satan, Communication, Prayer, Marriage, Faith, Creation, Jesus, Abortion, Health, Women, Love, God, and Israel.)
