
ALGORITHMS FOR ENHANCING INFORMATION RETRIEVAL USING

SEMANTIC WEB

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by Moy’awiah Al-Shannaq August, 2015


Dissertation written by

Moy’awiah A. Al-Shannaq

B.S., Yarmouk University, 2003

M.S., Yarmouk University, 2005

Ph.D., Kent State University, 2015

Approved by

______Austin Melton, Professor, Ph.D., Doctoral Advisor

______Johnnie Baker, Professor Emeritus, Ph.D., Computer Science

______Angela Guercio, Associate Professor, Ph.D., Computer Science

______Farid Fouad, Associate Professor, Ph.D., Chemistry

______Donald White, Professor, Ph.D., Mathematical Sciences

Accepted by

______Javed I. Khan, Professor, Ph.D., Chair, Department of Computer Science

______James L. Blank, Professor, Ph.D., Dean, College of Arts and Sciences


Table of Contents

LIST OF FIGURES...... viii

LIST OF TABLES...... xi

DEDICATION...... xiii

ACKNOWLEDGMENTS...... xiv

ABSTRACT...... xv

CHAPTER 1: INTRODUCTION…...... 1

1.2 Goals of the Research...... 3

1.3 Contributions...... 5

1.4 Dissertation Outline ...... 6

CHAPTER 2: BACKGROUND…...... 8

2.1 Introduction……………………………………………………………………………8

2.2 Information Retrieval………………………………………………………………….9

2.3 Web Evolution……………………………………………………………………….13

2.3.1 Web 1.0…………………………………………………………………………..14

2.3.2 Web 2.0…………………………………………………………...... 14

2.3.3 Web 3.0…………………………………………………………………………..15

2.3.4 Web 4.0…………………………………………………………………………..16

2.4 The Semantic Web Vision………………………………………………………………..18

2.5 Standards, Recommendations, and Models………………………………………….20


2.5.1 URI and Unicode………………………………………………………………...21

2.5.2 Extensible Markup Language……………………………………………………22

2.5.3 RDF: Describing Web Resources………………………………………………..23

2.5.4 RDF Schema: Adding Semantics………………………………………………...24

2.5.5 OWL: Web Ontology Language………………………………………………....25

2.5.6 Logic and Proof…………………………………………………………………..26

2.5.7 Trust……………………………………………………………………………...27

2.6 What is SPARQL? …………………………………………………………………..27

2.7 Integrating Information Retrieval and Semantic Web……………………………….30

CHAPTER 3: RELATED WORK…………………………………………………….34

3.1 Topic Modeling………………………………………………………………………34

3.1.1 From Vector Space Modeling to Latent Semantic Indexing…………………….36

3.1.2 The Probabilistic Latent Semantic Indexing (PLSI) model……………………...43

3.1.3 The Latent Dirichlet Allocation Model (LDA)…………………………………..45

3.1.4 Topics in LDA…………………………………………………………………...47

3.1.4.1 Model………………………………………………………………………...48

3.2 Automatic Query Expansion (AQE)…………………………………………………50

3.3 Document Ranking with AQE……………………………………………………….53

3.4 Why and when AQE works………………………………………………………….54

3.5 How AQE works……………………………………………………………………..57

3.5.1 Preprocessing of Data Source……………………………………………………57


3.5.2 Generation and Ranking of Candidate Expansion Features……………………..58

3.5.2.1 One-to-One Associations…………………………………………………….59

3.5.2.2 One-to-Many Associations…………………………………………………..59

3.5.2.3 Analysis of Feature Distribution in Top-Ranked Documents………………..60

3.5.2.4 Query Language Modeling…………………………………………………..60

3.5.2.5 A Web Search Example……………………………………………………...61

3.6 Selection of Expansion Features……………………………………………………..64

3.7 Query Reformulation………………………………………………………………...65

3.8 Related Work………………………………………………………………………...66

CHAPTER 4: SEMANTIC WEB AND ARABIC LANGUAGE……………………69

4.1 Importance of Arabic Language……………………………………………………..69

4.2 Right-to-Left Languages and the Semantic Web…………………………………….72

4.3 Arabic Ontology……………………………………………………………………...73

4.4 The Arabic Language and the Semantic Web: Challenges and Opportunities………75

CHAPTER 5: GENERATING TOPICS……………………………………………...82

5.1 Introduction…………………………………………………………………………..82

5.2 Data Set………………………………………………………………………………84

5.3 Experimental Results………………………………………………………………...85

5.3.1 Experiments on an English Corpus……………………………………………....88

5.3.1.1 Preprocessing Steps………………………………………………………….88

5.3.1.2 English Corpus Creation……………………………………………………..89


5.3.1.3 Experiment 1…………………………………………………………………90

5.3.1.4 Experiment 2…………………………………………………………………91

5.3.1.5 Experiment 3…………………………………………………………………91

5.3.1.6 Experiment 4…………………………………………………………………91

5.3.1.7 Experiment 5…………………………………………………………………91

5.3.1.8 Experiment 6…………………………………………………………………91

5.3.2 Experiments on an Arabic Corpus……………………………………………….92

5.3.2.1 Arabic Corpus Creation……………………………………………………...92

5.3.2.2 Experiment 1…………………………………………………………………94

5.3.2.3 Experiment 2…………………………………………………………………94

5.3.2.4 Experiment 3…………………………………………………………………94

5.3.2.5 Experiment 4………………………………………………………………....94

5.3.2.6 Experiment 5…………………………………………………………………95

5.3.2.7 Experiment 6…………………………………………………………………95

5.4 Discussion……………………………………………………………………………96

CHAPTER 6: TOPIC MODELING AND QUERY EXPANSION………………...102

6.1 Introduction…………………………………………………………………………102

6.2 Why use the combination of topic modeling and query expansion? ……………105

6.3 Semantic Search…………………………………………………………………….106

6.3.1 Handling Generalizations……………………………………………………….106

6.3.2 Handling Morphological Variations……………………………………………107


6.3.3 Handling Concept Matches…………………………………………………….108

6.3.4 Handling Synonyms with Correct Sense……………………………………….109

6.4 Methodology………………………………………………………………………..110

6.4.1 Query Expansion………………………………………………………………..114

6.4.2 Our Work……………………………………………………………………….116

6.4.2.1 Topic Modeling Subsystem……………………………………………….121

6.4.2.2 Query Expansion Subsystem……………………………………………….121

6.5 Experimental Results and Discussion………………………………………………122

6.5.1 Experiment 1…………………………………………………………………....123

6.5.2 Experiment 2……………………………………………………………………124

6.5.3 Experiment 3……………………………………………………………………125

6.5.4 Experiment 4……………………………………………………………………126

6.5.5 Experiment 5……………………………………………………………………127

6.5.6 Experiment 6……………………………………………………………………128

6.5.7 Experiment 7……………………………………………………………………129

6.5.8 Experiment 8……………………………………………………………………130

CHAPTER 7: CONCLUSION AND FUTURE WORK……………………………132

APPENDIX A………………………………………………………………………….141

REFERENCES………………………………………………………………………...158


List of Figures

Figure 1: The classic search model……………………….……………………………...10

Figure 2: Architecture of a textual IR system……………………………………………13

Figure 3: Evolution of the Web………….………………………………………………17

Figure 4: Web of documents………………….…………………………………………19

Figure 5: The structure of Web data……………….…………………………………….20

Figure 6: Semantic Web layered architecture……………………………………………21

Figure 7: An RDF graph storing information about Arab Bank.………………………...24

Figure 8: An RDF graph storing information about transportation services between cities…...... 25

Figure 9: Two-dimensional plot of terms and documents………………………………40

Figure 10: Two-dimensional plot of terms and documents along with the query application theory ………………………………………………………………………..41

Figure 11: Two-dimensional plot of terms and documents using the SVD of a reconstructed Term-Document Matrix………………………………………………….42


Figure 12: Plate notation representing the LDA model………………………………….48

Figure 13: Main steps of Automatic Query Expansion……………….…………………57

Figure 14: The first ten results returned by Google in response to the query "foreign minorities Germany"……………………………………………………………………..62

Figure 15: The first ten results returned by Google in response to the expanded query

"foreign minorities Germany sorbs Frisians"……………………………………………63

Figure 16: The real strength of the top ten languages…………..…………...... 71

Figure 17: The three main components of the semantic annotation process…………….76

Figure 18: A pie chart showing the distribution of languages used in creating ontologies stored in the OntoSelect library………………………………………………………….78

Figure 19: A visualization of the probabilistic generative process for three documents..87

Figure 20: The intuitions behind Latent Dirichlet Allocation…………………………...88

Figure 21: Segmentation of the Arabic agglutinated form سيعلمونه……………………100

Figure 22: System components for Arabic corpus using LSI topic modeling.…………117

Figure 23: System components for Arabic corpus using LDA topic modeling.………..118

Figure 24: System components for English corpus using LDA topic modeling.………119


Figure 25: System components for English corpus using LSI topic modeling...... 120

Figure 26: Query one for Arabic corpus with LDA topic modeling……………………123

Figure 27: Query two for Arabic corpus with LDA topic modeling and query expansion……………………………………………………………………………….124

Figure 28: Query three for Arabic corpus with LSI topic modeling……………………125

Figure 29: Query four for Arabic corpus with LSI topic modeling and query expansion……………………………………………………………………………….126

Figure 30: Query five for English corpus with LDA topic modeling………...... 127

Figure 31: Query six for English corpus with LDA topic modeling and query expansion……………………………………………………………………………….128

Figure 32: Query seven for English corpus with LSI topic modeling………………....129

Figure 33: Query eight for English corpus with LSI topic modeling and query expansion……………………………………………………………………………….130

Figure 34: LSI steps………………………………………………………………...... 141

Figure 35: The LDA model……………………………………………………..………149

Figure 36: A taxonomy of approaches to AQE………………………………..……….156

Figure 37: Gensim……………………………………………………...……………….157


List of Tables

Table 1: A comparison between Web 1.0 and Web 2.0…...……………………………..15

Table 2: A comparison between Web 2.0 and Web 3.0…...……………………………..16

Table 3: The result of SPARQL query…………………………………………………..28

Table 4: Computing Capabilities Assessment…………………………………………...33

Table 5: Book titles………………………………………………………………………38

Table 6: The 16*17 Term-Document Matrix corresponding to book titles in Table 5 ....39

Table 7: The three topics…………………………………………………………………43

Table 8: Clustering of the world’s languages based on their script family……………...72

Table 9: Listing of tools used in building Semantic Web applications………………….74

Table 10: Arabic support summary………………………………………………………75

Table 11: English and Arabic Wikipedia………………………………………………...84

Table 12: Summary of stemmer features………………………………………………...90

Table 13: Example of two words sharing the same stem but having different senses…….93


Table 14: Accuracy of topic generation for the English corpus…………………………..95

Table 15: Accuracy of topic generation for the Arabic corpus……………………………96

Table 16: Some derivations from the root علم…………………………………………..98

Table 17: Four possible solutions for the word بسم..…………………………..………99

Table 18: Handling generalizations…………………………………………………….107

Table 19: Handling Morphological Variations………………………………………....108

Table 20: Handling Concept Matches………………………………………………….109

Table 21: Different senses for the word "شعب"......………………………109

Table 22: Document 1 ………………………………………………………………….142

Table 23: Document 2 …………………………………………………...... 143

Table 24: Two texts………………………………...... 145

Table 25: Definition of variables in the model ………………………………………...150


DEDICATION

To my parents, Abdulla and Mariam

To my wife, Hebah

To my daughters, Rafeef and Salma

To my family and friends


ACKNOWLEDGMENTS

First and foremost, I would like to thank Allah, my God, who gives me the confidence to pursue my doctoral studies at Kent State University and surrounds me with a wonderful advisor, family, and friends. I would like to take this opportunity to thank them.

I would like to show my greatest appreciation and gratitude to my research advisor,

Prof. Austin Melton, for his guidance and support over four enjoyable, memorable, and wonderful years. Prof. Melton's support goes further than words can say. I would also like to thank the committee members: Prof. Johnnie Baker, Dr. Angela Guercio, Dr. Farid Fouad, and Prof. Donald White for their time, comments, valuable suggestions, and feedback.

I should not forget to thank the faculty members and colleagues of the Computer Science Department for their encouragement and great company.

Last but not least, I am grateful to my friends. Special thanks to Mazen AlZyoud,

Bilal Sayheen, and Dr. Nouh Alhindawi for their help, support and friendship.

Thank you one and all.

Moy’awiah Al-Shannaq

August, 2015, Kent, Ohio


Abstract

The research explores the use of advanced Semantic Web tools for problems in Information Retrieval (IR). Specifically, IR methods that retrieve information from documents based on traditional keyword matching suffer from several drawbacks. Among these drawbacks are word sense ambiguity, query intent ambiguity, and the inability to exploit semantic knowledge within documents. These drawbacks can negatively affect precision. The work advances the field by investigating approaches that discover the abstract "topics" occurring in a collection of documents based on a topic modeling approach, and then expand the query to explore and discover hidden relations between the documents. The work is applied to Arabic and English documents.

The hypothesis is that classifying the corpus with meaningful descriptive information, and then expanding the query by using Automatic Query Expansion, will improve the results of the IR methods for indexing and querying information. In particular, the work uses advanced Information Retrieval methods that have been widely used for indexing and analyzing a corpus, and then applies advanced Semantic Web techniques in order to increase the accuracy. These techniques are tested on both English and Arabic documents.

Recent topic modeling techniques combined with query expansion are discussed and compared with respect to issues such as efficiency and scaling.


CHAPTER 1

Introduction

1.1 Introduction

Technology is developing dramatically. Its main objective is to enhance the level of service provided and to facilitate human activity. The development of technology gives better potential for users to access information using various search tools.

The World Wide Web is not the same thing as the Internet, as most people believe; rather, the WWW is the most crucial part of the Internet, giving us access to millions of resources regardless of their physical location and language and, with the rapid development of the Internet, access to media storage (containing large textual corpora) and digital encyclopedias. One of the main problems we face today in the information society is information overload. This problem has become more serious because of the huge amount of data that has been created, and made available online, in the last decade. With the expected continuous growth of the WWW (in size, languages, and formats), researchers believe that search engines will have a hard time maintaining the quality of retrieval results and that it is very difficult or impossible to analyze the huge amount of information. To deal with this problem, and since the current search engine technology has its limits, we need a new vision for the web that enables intelligent choices and a better grasp of the meaning of the information on the Internet; this new vision is called the Semantic Web

[44, 113].

A breakthrough in the field of technology is the mass use of the Internet. People from all around the world and from different walks of life now have the chance to use the

Internet and to interact with one another. The world has become a global village that guarantees a better interaction among people and better access to resources. In addition to launching the global village, the Semantic Web has made it easy to access resources online.

The Semantic Web can enhance the Web through a layer of machine-interpretable metadata (data about data). Therefore, by using this method, a computer program can process the content and draw conclusions about a Web page [65]. An important innovation was introduced by Tim Berners-Lee, who highlighted the role of the Web as a medium for finding, sharing, and integrating information. The Semantic Web provides procedures for sharing and exchanging data, which makes access to knowledge easier and faster.

The Semantic Web is not only a user-friendly tool, but it can also help in the evolution of knowledge. For that reason, it is essential for Semantic Web tools and applications to support various languages such as Arabic. In other words, the Semantic Web aims at creating a global village through creating an information network and sharing knowledge.

This dissertation focuses on investigating approaches for discovering the abstract "topics" that occur in a collection of documents. The approach is based on a topic model and then expands the query to explore and discover hidden relations between the documents in an efficient manner with respect to accuracy and time complexity. More precisely, we examine how advanced Information Retrieval techniques and the Semantic Web can be utilized to enhance retrieval accuracy. The work is applied to Arabic and English Wikipedia documents, first generating topics and then combining topic generation with query expansion. Further details are presented in Chapters 5 and 6.

1.2 Goals of the Research

Representing the content of text documents is considered a critical part in any

Information Retrieval approach. The traditional techniques treat the words in these documents without any semantics or relationships between them. In other words, documents are normally represented as a "bag of words" which means that the words are expected to occur independently without any relationships between the words [110]. For that reason many researchers have developed and suggested many approaches to capture the hidden relationships between the words in order to group words into topics.

In general, the interface in Information Retrieval systems is designed in a way that gives the user the opportunity to search for keywords in a single input box. These keywords are used to retrieve the relevant documents by matching the collection index to find the documents that contain those keywords. The system is likely to retrieve suitable matches only if a user query contains multiple topic-specific keywords that accurately describe the user's information needs. Yet, users may submit short queries, and natural language is inherently ambiguous, which may result in retrieving erroneous and incomplete results in this traditional retrieval model [41].

Term mismatch is considered to be one of the critical language issues for the effectiveness of the retrieval process. In other words, users usually do not use the same words that indexers use, which is called the vocabulary problem. The vocabulary problem can be understood through synonymy and polysemy. These problems may result in erroneous and irrelevant retrievals which decrease the accuracy of the results [53].

Expanding the original query by using other words that best fit the user's intention is considered to be one of the most natural and successful techniques. In other words, the goal is to generate a usable query that is likely to retrieve the relevant documents.

The idea is that we can improve the accuracy by applying topic modeling to the corpus, rather than dealing with the documents as bags of words, in order to capture the hidden relationships between the words and group the words into specific topics, and then expand the query to explore and discover hidden relations between documents. Since synonyms can play an important role in increasing the number of relevant documents retrieved for a query, we will use the WordNet tool in order to capture the semantic information of document words.

In summary, this dissertation investigates different techniques for discovering the topics that occur in a collection of documents and then expands the query by using Automatic Query Expansion. By applying advanced

Semantic Web techniques, this dissertation will attempt to provide more accurate retrieval


results. These techniques have been tested and applied to both English and Arabic documents and the results are discussed.

1.3 Contributions

The main contributions of this work involve improving the results of previous work in topic modeling and query expansion for both Arabic and English documents. In addition, the research also involves applying the work to Arabic documents since only a few works have tackled this topic so far.

The main contributions of this dissertation are to:

● Explore the use of advanced Semantic Web tools to problems in Information

Retrieval (IR).

● Investigate approaches to discover the abstract “topics” that occur in a collection of

documents based on topic model approaches and then expand the query to explore

and discover hidden relations between the documents.

● Demonstrate that classifying the corpus with meaningful descriptive information

using Latent Semantic Indexing can improve the results of IR methods applied to

Arabic and English documents.

● Demonstrate that classifying the corpus with meaningful descriptive information

using Latent Dirichlet Allocation can improve the results of IR methods applied to

Arabic and English documents.


● Test if a topic model can describe the semantics of a corpus without explicit

semantics.

● Test if Latent Dirichlet Allocation has higher accuracy than Latent Semantic

Indexing for Arabic documents, as has already been shown for English documents.

● Test if Latent Semantic Indexing is faster to execute than Latent Dirichlet

Allocation for Arabic documents, as has already been shown for English documents.

● Combine topic model and query expansion and apply to English documents.

● Combine topic model and query expansion and apply to Arabic documents.

● Apply to Arabic and English documents and verify the results.

1.4 Dissertation outline

Chapter 1 describes the significance of the present study and the research hypotheses. The remainder of the dissertation is organized as follows.

Chapter 2 provides an overview of Information Retrieval, the Semantic Web, and bridging the gap between Information Retrieval and semantics. Chapter 3 provides a review of the related literature. It starts with a general overview of topic modeling and query expansion, and it then covers work that combines topic modeling and query expansion for English documents as well as for Arabic documents.


Chapter 4 discusses the Semantic Web and the Arabic language. The chapter starts by introducing the importance of the Arabic language and includes a section on Right-to-

Left Languages and the Semantic Web. After that, the opportunities for Arabic ontologies are discussed in a separate section. Finally, the chapter concludes by examining the Arabic language and the Semantic Web in terms of challenges and opportunities.

Chapter 5 discusses topic generation using topic modeling. The chapter starts by defining the data set; then it explains the steps of the proposed approach. A section presents the experimental results. The chapter ends with experimental evaluations and discussions. Chapter 6 discusses query expansion. The chapter starts by defining what query expansion is. Then it explains the steps of the proposed approach for Arabic and

English documents. The experimental evaluation and discussions are presented in a separate section.

The last chapter of this dissertation, chapter 7, summarizes the dissertation’s work and discusses open issues and future research directions.


Chapter 2

Background

2.1 Introduction

"There have always been things which people are good at, and things

computers have been good at, and little overlap between the two."

- Tim Berners-Lee, May 1998

The World Wide Web (abbreviated as WWW or W3, commonly known as the

Web) plays a very important role in communicating and accessing information in all the

fields of knowledge. Moreover, the WWW has become an essential part of people's lives in most

societies today. As a matter of fact, the World Wide Web has been one of the fastest rising

and most universal phenomena in history [65].

In October 1990, Tim Berners-Lee, who is the inventor of the World Wide Web,

specified three fundamental technologies which remain the foundation of today’s Web.

The three fundamental technologies can be described in brief as the following:

1. Hyper Text Markup Language (HTML): can be defined as the publishing format

for the Web that provides the ability to format documents and link them to other

documents and resources.

2. Uniform Resource Identifier (URI): can be defined as the unique address for each

resource on the Web.


3. Hypertext Transfer Protocol (HTTP): can be defined as a protocol that allows one

to retrieve linked resources from the Web.

This chapter will describe traditional Information Retrieval approaches, how the

World Wide Web has improved over time, the basic Semantic Web concepts and technologies including the Resource Description Framework (RDF), ontology languages such as RDFS and OWL, and the SPARQL query language for the Semantic Web. It will also provide a summary of Web 1.0, Web 2.0, Web 3.0, and Web 4.0.

2.2 Information Retrieval

First, it is important to define what is meant by Information Retrieval (IR), which can be very broad and entail different senses. In general, Information Retrieval refers to the process of finding information resources (usually documents) of an unstructured nature

(usually text) that are considered relevant to the exact types of information needed from a large collection of information resources. The purpose of automated Information Retrieval systems is to reduce what is called "information overload" by using intelligent search methods in order to help users to find the required information with high performance [77,

99].


Figure 1: The classic search model [15].

C. Manning, P. Raghavan, and H. Schutze show that in the 1990s most people preferred to obtain information from other people, rather than opting for Information

Retrieval systems. For example, during that period of time, most people used human travel agents to reserve their travel tickets or even packages. However, since the beginning of the last decade, many developments and enhancements of Information

Retrieval have driven web search engines to new levels of quality. These improvements added more value to the previous process, which in turn led to greater customer satisfaction. Consequently, web search has become a standard and favored source of information for most people [77]. For example, the Pew Internet Survey in


2004 (Fallows 2004) showed that 92% of internet users affirm that the Internet is a

convenient source for finding everyday information [51].

An Information Retrieval process starts when a user enters a query into the system.

For example, searching the web for the query “Kent State University” may show irrelevant results. In this case the query will not uniquely identify and retrieve a single document.

Alternatively, many documents could match the query with different degrees of accuracy.

Most Information Retrieval systems compute a numeric score for each document in the database to decide which documents match and which do not match the query; after that, the system ranks the documents according to these values.

Documents with the top ranking are then shown to the user and considered as relevant documents. The process could then be iterated if the user wishes to improve the query [15].

Most Information Retrieval investigations have focused on retrieval effectiveness, which is usually based on document relevance judgments. However, there are some problems associated with relevance judgments such as being subjective and changeable.

For example, different judges will give different relevance values to a document retrieved in answer to a given query. More details about these issues can be found in Salton and

McGill (1983) and Sparck-Jones (1981) [98]

To evaluate the performance of Information Retrieval systems many measures have been proposed. These measures require a collection of documents and a query. Every document is simply classified to be either relevant or non-relevant to a particular query.

The most commonly used measures are recall and precision. Recall, in Information

Retrieval, is the ratio of retrieved and relevant documents for a given query over the total


number of relevant documents for that query in the database. Except for small test collections, the result is generally unknown and should be estimated by sampling or some other methods. Precision, in Information Retrieval, is the ratio of retrieved and relevant documents over the total number of documents retrieved. Fall-out, in Information

Retrieval, is the ratio of retrieved and non-relevant documents over the total number of all non-relevant documents. F-measure, in Information Retrieval, is used to measure the accuracy by considering both the precision and recall. The results for recall, precision, fall- out and F-measure values are between 0 and 1 [98].

Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

F-measure = (2 × Precision × Recall) / (Precision + Recall)

Fall-out = |{non-relevant documents} ∩ {retrieved documents}| / |{non-relevant documents}|
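To make these definitions concrete, the following minimal Python sketch (the document identifiers and set sizes are invented purely for illustration) computes the four measures from sets of document IDs:

# Toy illustration of recall, precision, F-measure, and fall-out.
# The document identifiers below are hypothetical.
relevant = {"d1", "d2", "d3", "d7"}            # relevant documents for the query
retrieved = {"d1", "d3", "d4", "d5"}           # documents returned by the system
collection = {f"d{i}" for i in range(1, 11)}   # the whole (tiny) collection
non_relevant = collection - relevant

hits = relevant & retrieved                    # relevant AND retrieved

recall = len(hits) / len(relevant)
precision = len(hits) / len(retrieved)
f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
fall_out = len(non_relevant & retrieved) / len(non_relevant)

print(recall, precision, f_measure, fall_out)  # 0.5 0.5 0.5 0.333...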


Figure 2: Architecture of a textual IR system [98].

2.3 Web Evolution

The idea of the web was first presented by Tim Berners-Lee in 1989, and it has since become the largest medium for transferring and organizing information [52, 67]. The web has witnessed many developments in the last two decades [52], evolving over time from Web 1.0 to Web 4.0. Web 1.0 is considered a web of information connections and a web of cognition, Web 2.0 a web of communication, Web 3.0 a web of co-operation, and Web 4.0 a web of integration [3].


2.3.1 Web 1.0

Web 1.0 is also known as Static Web since it was used as Web of Information

Connections. Web 1.0 is a read-only web. Its websites consist of static HTML pages that are updated infrequently, and it is considered mono-directional. Web 1.0 was used by businesses to present their products using catalogs or brochures, introducing and advertising their products on the web so that people could read about them and contact the businesses. The main idea in Web 1.0 is that websites are used to publish information for anyone at any time in order to establish an online presence. As a result, these websites are considered not interactive and indeed as brochure-ware [3, 62].

2.3.2 Web 2.0

Web 2.0 developed from Web of Information into Social Web 2.0 (Social), i.e.,

Web of People Connections. Web 2.0 differs from Web 1.0 because it supports writing as well as reading, so it can be viewed as a read-write web, a wisdom web, and a people-centric web. Since

Web 2.0 has writing as well as reading properties, Web 2.0 is considered bi-directional [3, 84]. Table 1 compares Web 1.0 and Web 2.0 [3].


Table 1: A comparison between web 1.0 and web 2.0 [3].

Web 1.0 (Static Web) | Web 2.0 (Social Web)

Inventor: Tim Berners-Lee | Inventor: Tim O'Reilly

Reading | Reading/Writing

Companies | Communities

Client-Server | Peer to Peer

HTML | XML

Taxonomy | Tags

Owning | Sharing

Netscape | Google

Web forms | Web applications

Screen scraping | APIs

Services sold over the web | Web services

Lectures | Conversation

2.3.3 Web 3.0

In 2006 John Markoff suggested a new third version or generation of the web which is now known as web 3.0 [102]. The main idea of web 3.0 is to define structured data and link them in order to allow data to be more efficiently discovered, automated, integrated, and reused across various applications [3]. Web 3.0 is also known as Semantic Web.

The Semantic Web was first proposed by Tim Berners-Lee, who is also the inventor of the World Wide Web [85], and it will be discussed in more detail later in this chapter. Table 2 compares Web 2.0 and Web 3.0 [3].

Table 2: A comparison between Web 2.0 and Web 3.0

Web 2.0 (Social Web) | Web 3.0 (Semantic Web)

Inventor: Tim O'Reilly | Inventor: Tim Berners-Lee

Read/Write Web | Portable Personal Web

Communities | Individuals

Sharing Content | Consolidating Dynamic Content

Blogs | Lifestream

AJAX | RDF

Wikipedia, Google | DBpedia, iGoogle

Tagging | User engagement

2.3.4 Web 4.0

Web 4.0 is also known as the symbiotic web. Web 4.0 is an idea that is still being developed, and there is no exact definition of what it will be. The goal of Web 4.0 is to enable interaction between humans and machines in symbiosis, so the web is moving towards being an intelligent web. Web 4.0 will be a multi-tasking web that will enable users to read, write, and execute, and to interact with the web concurrently through intelligent interactions. Web 4.0 technologies will be able to connect information or connect knowledge through semantic techniques. In addition, Web 4.0 technologies will be able to apply the knowledge shared between data items and define context in order to do basic reasoning. In simple words,


machines will be smart enough to read the contents of the web and respond by deciding what to execute first and then executing it; they will be able to load websites quickly with superior quality and performance and to build more commanding interfaces. All of these features will increase the quality and performance of the entire process [39]. Figure 3 shows the evolution of the web.

Figure 3: Evolution of the web [39].


2.4 The Semantic Web Vision

"The Semantic Web is not a separate Web but an extension of the current

one, in which information is given well-defined meaning, better enabling

computers and people to work in cooperation."

- Tim Berners-Lee et al; Scientific American, May 2001

The web has witnessed several developments over time. The latest available version is called the Semantic Web, which is the third generation of the web. The Semantic Web

(Web 3.0) is built on top of Web 2.0 to add more features to the previous versions of the web. These new features go beyond reading and writing to include the discovery and retrieval of new information.

According to the World Wide Web Consortium (W3C), "The Semantic Web provides a common framework that allows data to be shared and reused across applications, enterprises, and community boundaries" [70]. The Semantic Web is a web of meaning that can represent things in a manner that computers can understand. The main purpose of the Semantic Web is to make the web readable by machines and not only by humans.

In fact, the current web, Web 2.0, can be described as a web of documents; it is similar to a global file system which makes use of different types of data such as natural language text, multimedia data, files, graphics, and much more. An important problem posed by this model is that the web of documents is built for human consumption, with documents and the links between them (or parts of them) as the main objects. In [28, 29, 60] the authors indicate that the semantics of the content, as well as of the links between documents, are hidden and that the degree of structure between objects is low. Figure 4 represents the structure of the web of documents in a simple way [29].

Figure 4: Web of documents (Web 2.0) [29].

Semantics has been proposed as the means for overcoming the problems of the current web. The Semantic Web can be broadly defined as the web of data. It is much like a global database that includes most of a database's features. The primary target of the design of the web of data is first the machines and then the humans. In the web of data, links are always between things, simply because things are the primary objects. The degree of structure between objects is mostly based on the RDF model (to be defined later), and the semantics of the content as well as of the links are always explicit [28, 29, 60]. In Figure 5, the structure of the web of data is shown [29]. In this figure, the data web has its main evolution in connected information and interlinked data. In this model, information is given a well-defined meaning that is understandable, analyzable, and machine readable [111].


Figure 5: The structure of web of data (Web 3.0) [111].

The Semantic Web, or web of meaning, has been introduced as a new version of the web that, in one way or another, will transform the web of linked structured documents into a web of linked data. By using Linked Data principles, developers can query Linked Data from multiple sources at once and combine the results without the need for a single common schema shared by all the data. This allows machines to understand, analyze, and process large-scale data automatically instead of manually.

2.5 Standards, Recommendations, and Models

As we discussed in previous sections, the Semantic Web can be viewed as a system that enables machines to understand and respond to complex human requests based on their meanings. The Semantic Web can be briefly described in one sentence: "How to make the web more readable to a computer" [22].


Tim Berners-Lee proposed a layered architecture for the Semantic Web that is often represented using a diagram, with many variations. Figure 6 gives a typical representation of this diagram [56].

Figure 6: The Semantic Web layered architecture [56].

The functionality of each layer of the Semantic Web layered architecture is described briefly in the following subsections.

2.5.1 URI and Unicode:

URI stands for Uniform Resource Identifier, and it is an expansion of the concept of the Uniform Resource Locator, or URL, which is the commonly known identifier for websites and webpages. URI is the W3C recommendation for describing the name and location of current and future objects on the Internet. Unicode is a universal character-set standard and is the almost universally adopted successor to ASCII [18]. The purpose of Unicode is to represent any character of the world uniquely. The Uniform Resource Identifier (URI) is a string of characters that is used to uniquely identify the name of a resource over a network [4, 18, 22]. Both Unicode and URI can be described as providing a unique identification mechanism within the language stack of the Semantic Web [54].

2.5.2 The Extensible Markup Language

XML is used as a base syntax for other technologies developed for the upper layers of the Semantic Web. XML and its related standards, such as Namespaces (NS) and schemas, are used to provide a common means to structure data on the web without conveying the meaning of the data. A namespace is used to identify and distinguish XML elements of different vocabularies. It supports mixing elements from various vocabularies to represent a specific function. An XML schema guarantees that the information sent matches the information received whenever two applications exchange information at the XML level [4].


2.5.3 RDF: Describing Web Resources

RDF stands for Resource Description Framework. RDF is the World Wide Web

Consortium (W3C) recommended data model for the representation of information about resources on the Web. RDF is a knowledge representation language dedicated to the annotation of resources within the Semantic Web for the purpose of utilizing metadata

(data describing data) to provide more information about certain elements on the web.

Since metadata itself is understandable to machines, RDF is used as a data model for managing, structuring, and reasoning with data on the web and the real-life relations among those data. Many documents are now annotated via RDF due to its simple data model and its formal semantics. In its abstract syntax, an RDF document is a set of triples or statements of the form Subject-Predicate-Object. The linking structures form a directed, labeled graph, where the edges represent the named links between two resources, which are represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations [8, 79, 89, 111].

The following example (Example 1) means that there exists a resource named

"Arab Bank" whose homepage, phone, and fax are ,

"+962-6-5694901" and "962-6-56949141," respectively. Note that foaf1 and vcard2 (which are used in this example) are two RDF vocabularies [8].

Example 1: The assertion of the following RDF triples:

{ _:b1 foaf:name "Arab Bank" .

_:b1 foaf:homepage <http://www.arabbank.com.jo/> .

_:b1 foaf:phone "+962-6-5694901" .

_:b1 vcard:fax "+962-6-56949141" .

}

An RDF document can be represented by a directed labeled graph, as shown in

Figure 7, where nodes are terms appearing as the subject or object in a triple and arcs define the set of triples (i.e., if (S, P, O) is a triple, then there is an arc labeled P from the node S to the node O).

Figure 7: An RDF graph storing information about Arab Bank.
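As a hedged illustration only, the triples of Example 1 could be built and inspected programmatically with the Python rdflib library; the choice of rdflib and the vcard namespace URI below are assumptions of this sketch, not part of the cited example.

from rdflib import Graph, BNode, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")          # standard FOAF namespace
VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")   # assumed vcard namespace URI

g = Graph()
bank = BNode()  # corresponds to the blank node _:b1 in Example 1

g.add((bank, FOAF.name, Literal("Arab Bank")))
g.add((bank, FOAF.homepage, URIRef("http://www.arabbank.com.jo/")))
g.add((bank, FOAF.phone, Literal("+962-6-5694901")))
g.add((bank, VCARD.fax, Literal("+962-6-56949141")))

# Serializing to Turtle makes the Subject-Predicate-Object structure visible.
print(g.serialize(format="turtle"))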

2.5.4 RDF Schema: Adding Semantics

In spite of the fact that RDF is a flexible graph data model for representing metadata about resources, it does not provide any semantics for these data.

Consequently, RDF Schema is applied to provide a predefined basic type system for RDF models, enabling a machine to make inferences over them. RDF Schema officially became a W3C recommendation in February 2004 as a semantic extension to RDF. It describes classes and properties of the resources in the basic RDF model. RDF Schema, or RDFS, provides a simple reasoning framework to infer the types of resources [3].

Figure 8 shows a graph representation of RDF and RDFS layers that store information about transportation services between cities [89].


Figure 8: An RDF graph storing information about transportation services between cities

[89].
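As an illustration of the kind of schema information RDFS adds on top of plain RDF, the following Python/rdflib sketch declares a tiny class hierarchy and a typed property; the vocabulary (cities and a busTo property) is hypothetical and only loosely inspired by Figure 8.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/transport#")  # hypothetical vocabulary

g = Graph()

# Schema layer: classes, a subclass relation, and a property with domain/range.
g.add((EX.City, RDF.type, RDFS.Class))
g.add((EX.Capital, RDF.type, RDFS.Class))
g.add((EX.Capital, RDFS.subClassOf, EX.City))
g.add((EX.busTo, RDF.type, RDF.Property))
g.add((EX.busTo, RDFS.domain, EX.City))
g.add((EX.busTo, RDFS.range, EX.City))

# Data layer: instances described using the schema.
g.add((EX.Amman, RDF.type, EX.Capital))
g.add((EX.Irbid, RDF.type, EX.City))
g.add((EX.Amman, EX.busTo, EX.Irbid))

# An RDFS-aware reasoner could now infer, for example, that Amman is also a City.
print(len(g), "triples in the combined RDF/RDFS graph")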

2.5.5 OWL: Web Ontology Language

Ontologies are considered one of the most important pillars of the Semantic Web.

They are engineering artifacts that can formally describe some concepts and their relationships within a given knowledge domain [91]. They can provide a computationally conceptual representation of our current understanding of reality, as described within the information we have [13]. The ontology layer describes properties and the relations between properties.

An ontology can be defined as a collection of terms used to describe a specific domain with inference ability [3]. OWL stands for Web Ontology Language. OWL (and its new version, OWL 2) is the World Wide Web Consortium (W3C) recommended data model for creating and representing ontologies. OWL can be viewed as a data modeling language that increases the semantic richness of RDF. This comes as a result of OWL characteristics that include effectiveness, flexibility, and speed.

There are differences between OWL and RDFS in terms of the ability to add semantics to RDF. In general, the features of OWL enable one to add more semantics in this regard. Antoniou et al. [12] note that, when comparing OWL to RDF, "in many cases we need to express more advanced, more 'expressive' knowledge. For example, that every person has exactly one birth date, or that no person can be both male and female at the same time."

So far, the top layers of the Semantic Web stack architecture, which are Logic, Proof, and Trust, are still unstandardized by the W3C and have yet to be realized in order to achieve the web of data. In this regard, the W3C states in [70] that "Logic, Proof, and Trust, are currently being researched and simple application demonstrations are being constructed."

2.5.6 Logic and Proof

The Logic and Proof layer is above the ontology layer. This layer will enable making new inferences by using an automatic reasoning system. Logic is used to improve the ontology language further in addition to allowing the writing of application-specific declarative knowledge. Reasoning systems will allow agents to perform deductions independent of using particular resources that satisfy their requirements [56].


2.5.7 Trust

Trust is the top layer of the stack. This layer provides an assurance of quality of the information on the web and a degree of confidence in the resources providing this information [3]. In other words, trust, which is used to derive statements, will be supported by verifying that the premises come from trusted sources and by relying on formal logic when deriving new information.

2.6 What is SPARQL?

SPARQL stands for SPARQL Protocol and RDF Query Language. SPARQL has been a

World Wide Web Consortium (W3C) recommendation since January 2008 as a specification of a query language for RDF. SPARQL is designed much in the spirit of classical relational languages such as SQL [92].

SPARQL is an RDF query language, that is, a query language for RDF databases, which is able to retrieve and manipulate data stored in Resource Description Framework format, for example by retrieving nodes from RDF graphs [92].

Since the entire database is a set of "subject-predicate-object" triples, SPARQL allows users to write queries against data that follows the RDF specifications of the W3C.

RDF data can also be considered in SQL relational database terms as a table with three columns. The first column is the subject column, the second one is the predicate column, and the third one is the object column.


A simple SPARQL query is expressed by using a form resembling a Structured Query Language (SQL) SELECT statement:

SELECT B FROM u WHERE P

where 'u' is the URL of an RDF graph G to be queried, 'P' is a SPARQL graph pattern (i.e., a pattern constructed over RDF graphs with variables), and 'B' is a tuple of variables that appear in 'P' [8]. The following is an example of a SPARQL query.

Example 2: Consider the following SPARQL query:

SELECT ?name ?phone

FROM

WHERE {

?p1 foaf:homepage <http://www.arabbank.com.jo/> .

?p1 foaf:name ?name .

?p1 foaf:phone ?phone .

}

This SPARQL query can be used to retrieve the name and the phone of a bank whose homepage is http://www.arabbank.com.jo/. When evaluated against the RDF graph of

Figure 7, the following answers will be returned [8]:

Table 3: The result of SPARQL query

# | ?name | ?phone

1 | Arab Bank | "+962-6-5694901"


In addition to Select, the SPARQL language specifies three forms of SPARQL queries for different purposes. These forms are ASK, CONSTRUCT, and DESCRIBE [8,

12].

ASK query: an ASK SPARQL query performs the same matching as a SELECT query, but instead of returning the variable bindings, it returns 'true' if there is at least one answer for P in the RDF graph. Otherwise, 'false' is returned.

The format for ASK: ASK FROM u WHERE P.

Example 3: ASK { ?x foaf:name "Arab Bank" }

When evaluated against the RDF graph of Figure 7, the answer is:

True.

CONSTRUCT query: a CONSTRUCT SPARQL query is used to build a new RDF graph extracted from a larger set of RDF data [12].

Example 4: PREFIX ex:

PREFIX :

PREFIX geo:

CONSTRUCT {?apartment swp:hasNumberOfBedrooms ?bedrooms.}

Example 5: The following example depends on Figure 7.

CONSTRUCT { ?p1 foaf:name ?name . } WHERE { ?p1 foaf:name ?name . }

DESCRIBE SPARQL query: a DESCRIBE query is used to extract an RDF graph from a SPARQL endpoint, the contents of which are left to the endpoint to decide based on what the maintainer deems as useful information. According to the SPARQL Protocol for RDF specification, a SPARQL endpoint can be defined as a conformant SPARQL protocol service that enables users to query a knowledge base via the SPARQL language, with the returned results typically appearing in one or more machine-processable formats [12].

Example 6: PREFIX :

DESCRIBE ?x

WHERE { ?x :writesBook :book3 }

Example 7: The following example depends on Figure 7.

DESCRIBE ?p1

WHERE { ?p1 foaf:name "Arab Bank" }
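A minimal, self-contained sketch of evaluating the SELECT query of Example 2 with Python's rdflib follows; the small in-memory graph stands in for the FROM clause, and only the triples needed for the query are loaded.

from rdflib import Graph, BNode, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
bank = BNode()
g.add((bank, FOAF.name, Literal("Arab Bank")))
g.add((bank, FOAF.homepage, URIRef("http://www.arabbank.com.jo/")))
g.add((bank, FOAF.phone, Literal("+962-6-5694901")))

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?phone
WHERE {
    ?p1 foaf:homepage <http://www.arabbank.com.jo/> .
    ?p1 foaf:name ?name .
    ?p1 foaf:phone ?phone .
}
"""

for row in g.query(query):
    print(row.name, row.phone)   # expected: Arab Bank +962-6-5694901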

2.7 Integrating Information Retrieval and Semantic Web

The common goal of both Information Retrieval (IR) and the Semantic Web (SW) with respect to the Internet is enabling users of the Web to find and make use of relevant and appropriate online information in a timely fashion. The common problem addressed by both IR and the Semantic Web is the need for Web users to sift through billions of web pages to find the information that is relevant to them regarding any given topic of interest [20].

Both Information Retrieval and Semantic Web are dealing with similar problems, but they are coming from different directions. "IR was developed before anyone had


conceived of the Internet, indeed even before computer networks were common. With the invention of the Internet, the theories and tools of IR were pressed into service with great success, emerging as the core algorithms of Web search engines" [20]. Information

Retrieval (IR) is considered as a bottom-up approach. "IR researchers began with very low- level structures and continue to work up, IR researchers began with the most basic structures the bit patterns representing words in computer memory and did bit by bit comparisons to determine if certain key words chosen by the searcher existed within a given document or document set. Their algorithms grew in power and complexity from there" [20].

"The Semantic Web, on the other hand, is a vision of how the future might be, a reaction to the problems that exist with the Web in its current state. In this sense IR researchers are steadily moving incrementally forward from the past whereas Semantic

Web researchers are working back from a revolutionary vision of what will be in the future"

[20]. Semantic Web is considered as a top-down approach. "Semantic Web researchers are beginning with a very high-level blueprint and are trying to work down to fill in the details, they started with complex, high-level goals as expressed by humans. For example, such as

I need to make my travel arrangements for the sales conference next month, and outlining out the systems and structures that would need to be built to support the satisfaction of such goals automatically." [20]

In general, the Semantic Web aims to address the problems of Information Retrieval on the Web in several ways, such as: understanding the semantics of terms in context, incorporating deductive and inferential rules of reasoning into relevancy determinations,


and by gathering enough information about the users and their information needs to determine what is truly relevant in each case. To reach these goals in the vision of the

Semantic Web, three technological pillars are used:

● Web ontology in order to specify semantics

● markup languages for achieving smooth interoperability

● agents for reasoning and action [22].

As we discussed earlier, IR is considered a bottom-up approach, while Semantic

Web is considered a top-down approach. To integrate these two fields so each field can complement the other, we need to pay more attention to ontology development and query formulation.

Fortunately, just as Internet and World Wide Web protocols have helped connect huge amounts of information for human consumption, new approaches are being developed in order to help connect equal or greater amounts of information for machine manipulation and processing. These advances will simplify information interoperability and provide better information relevance and confidence within the enterprise and on the World Wide

Web. Over time, they will pave the way for new intelligent brokering and knowledge reasoning capabilities across the field of collected information. Table 4 shows key capabilities of semantic computing and the resulting impact for stakeholders.


Table 4: Computing Capabilities Assessment (Adapted by Richard Murphy)

Near-term

Capability: Semantic Web Services
Purpose: Provides flexible look-up and connections discovery
Stakeholders: System Developers and System Integrators
Impact: Reduced friction in web services adoption and deployment
Take-away: More automated and flexible data and schema transformation

Capability: Information Integration and/or Interoperability
Purpose: Reduces integration complexity from n² to n
Stakeholders: Data and Metadata Architects
Impact: Reduced cost to integrate heterogeneous data sources
Take-away: Increased interoperability at improved speed and reduced cost

Capability: Intelligent Search
Purpose: Provides context sensitive search, queries on concepts, and personalized filtering
Stakeholders: Business and Technology Managers, Analysts, and Individuals
Impact: Reduced human filtering of search results, more relevant searches
Take-away: Increased search accuracy translates into greater productivity

Longer-term

Capability: Model-Driven Applications
Purpose: Enables software applications to process domain logic from actionable models
Stakeholders: Software Developers
Impact: Less coding required, faster changes to domain logic
Take-away: Less code maintenance and faster change responsiveness

Capability: Adaptive and Autonomic Computing
Purpose: Provides the ability for applications to diagnose and forecast system administration of complex systems
Stakeholders: System Administrators
Impact: Increased reliability and reduced cost through self-diagnostics and planning
Take-away: Reduced cost to maintain systems and lessened human intervention

Capability: Intelligent Reasoning
Purpose: Supports machine inference based on rich data and evolvable schemas and constraints
Stakeholders: Applications and Cognitive Agents
Impact: Reduced requirements for embedding logic apart from domain models
Take-away: Reduced application development cost


CHAPTER 3

Related Work

3.1 Topic Modeling

Representing the content of text documents is considered a critical part in any

Information Retrieval approach. Documents are normally represented as a "bag of words" which means that the words are expected to occur independently without any relationships between the words [110]. For that reason many researchers have developed and suggested many approaches to capture the hidden relationships between the words in order to group words into topics.

Topic modeling is a kind of statistical modeling that is used for discovering the topics that occur in a collection of documents. Topic modeling is used in and in natural language processing as well. It is clear that if you are given a document that is about a particular topic, you would expect certain words to appear in the document more frequently than other words. For example: we expect that the words "dog" and "bone" will appear more often in documents about dogs, and we also expect that the words "cat" and

"meow" will appear more often in documents about cats. A document normally contains multiple topics in different proportions. Therefore, in a document that is 10% about cats and 90% about dogs, the word "dog" will probably appear approximately nine times more often than the word "cat". A topic model uses a mathematical framework based on the statistics of the words in each document. This framework captures such statistics in order to allow examining a set of documents and discovering what the topics might be and what each document's balance of topics is. While topic models were first described and implemented in the context of natural language processing and machine learning, other fields now have applications using topic models as well [33].

Topic modeling has evolved over time. In 1988, an early topic model, known as Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA), was developed by Dumais and co-workers [87]. After that, in 1999, Thomas Hofmann presented a new model, known as the Probabilistic Latent Semantic Indexing (PLSI) model or Probabilistic Latent Semantic Analysis (PLSA) [64], which was developed as an alternative to LSI. In 2002, David Blei, Andrew Ng, and Michael I. Jordan developed a new model, considered a generalization of PLSI, called Latent Dirichlet Allocation (LDA) [33]. Latent Dirichlet Allocation (LDA) is considered the most common topic model currently in use. Other topic models are generally extensions of LDA.
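To make this concrete, here is a minimal sketch using the Gensim library (which provides implementations of both LSI and LDA); the four toy documents are invented purely for illustration.

from gensim import corpora, models

# Tiny invented corpus: two documents about dogs, two about cats.
texts = [
    ["dog", "bone", "dog", "puppy"],
    ["dog", "bark", "bone"],
    ["cat", "meow", "kitten"],
    ["cat", "meow", "cat", "purr"],
]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=1)

for i, bow in enumerate(corpus):
    print("document", i, lda.get_document_topics(bow))  # per-document topic mixture
print(lda.print_topics(num_words=3))                     # top words per topic

With only two topics and a handful of words, the model typically separates the "dog" and "cat" vocabulary into the two topics, mirroring the intuition described above.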


3.1.1 From Vector Space Modeling to Latent Semantic Indexing

The Vector Space Model (VSM) was the prevalent IR model until 1988. In this model each document in the collection is a list of the main terms, and the frequency of each term in the document is counted. In VSM a document is regarded as a vector of terms. Each unique term in the document collection corresponds to a dimension in the space, and a value indicates the frequency of this term. Documents and queries are treated as vectors in a multidimensional space, and queries are treated as documents in the search space. In this term space, it is not possible to assign a position to terms simply because the terms are the dimensions of the space, and the VSM assumes that terms are independent. Term weights, which are calculated by adopting a particular weighting schema, specify the coordinate values that are assigned to document and query vectors [14, 33].
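A minimal sketch of this vector representation in plain Python (toy documents invented for illustration; raw term frequencies used as weights) builds the term space and ranks documents against a query by cosine similarity:

import math
from collections import Counter

# Toy documents and query, invented for illustration.
docs = ["human machine interface", "graph of trees", "machine interface for trees"]
query = "machine interface"

vocab = sorted({w for d in docs for w in d.split()})   # the dimensions of the term space

def to_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]                  # raw term-frequency weights

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = to_vector(query)                                   # the query is treated as a document
for d in docs:
    print(f"{cosine(q, to_vector(d)):.3f}  {d}")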

The VSM model is built on matching query terms to terms found in the documents.

For that reason, VSM assumes that terms are independent. However, terms can be dependent through synonymy, such as car and automobile, and polysemy: for example, the verb

"to get" can mean "to procure" (e.g., I'll get the drinks), "to become" (e.g., She got scared), or

"to understand" (e.g., I get it). Another example of polysemy is that the adjective "plain" can mean "simple" (e.g., English is a plain subject) or "with nothing added" (e.g., The chocolate is too plain) [33].

As we mentioned earlier, Latent Semantic Indexing (LSI) is a model that was developed in 1988 by Dumais and co-workers at Bellcore to overcome the shortcomings of the Vector Space Model (VSM).


LSI is a machine-learning model that uses a mathematical technique called Singular Value Decomposition (SVD) to build representations of the meanings of words by analyzing the relations between vocabulary terms and passages in large bodies of text.

LSI is built on the principle that words used in the same contexts tend to have similar meanings. An important feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. The method LSI uses to capture the essential semantic information applies Singular Value Decomposition to a co-occurrence matrix and then performs dimension reduction, keeping only the most important dimensions of the decomposed matrix [25, 46, 78].
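As a rough illustration of this idea (a minimal sketch that assumes a small hard-coded term-document matrix and a rank-2 latent space; it is not the dissertation's actual implementation), truncated SVD can be applied directly with NumPy:

    import numpy as np

    # D: a small term-document-style matrix (documents as rows, terms as columns).
    D = np.array([[2., 1., 1., 0., 0.],
                  [1., 2., 0., 1., 0.],
                  [0., 0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(D, full_matrices=False)

    k = 2                                           # number of latent dimensions to keep
    D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k reconstruction of D

    # Documents (and queries folded in the same way) are then compared in the
    # k-dimensional latent space rather than in the raw term space.
    doc_latent = U[:, :k] * s[:k]
    print(np.round(D_k, 2))
    print(np.round(doc_latent, 2))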

LSI was an improvement over the Vector Space Model (VSM), which was based on pure term matching and could not capture term dependencies. The adoption of LSI came as a result of its ability to discover term dependencies through synonyms, polysemy, and related terms used in the documents.

Many researchers, such as Binkley and Lawrie (2010), demonstrated the ability of LSI to deal with synonyms [25]. For example, LSI correctly answered 64% of the synonym questions on the Test of English as a Foreign Language, where the students' average was far below that of LSI [25]. For more details about LSI see the Appendix.

Example: This example illustrates how LSI works when classifying a corpus into three topics. Suppose we have the book titles shown in Table 5 [45].

Table 5: Book titles [45].

Table 6: The 16×17 Term-Document Matrix corresponding to the book titles in Table 5 [45].

Figure 9: Two-dimensional plot of terms and documents [45].

Figure 10: Two-dimensional plot of terms and documents along with the query "application theory" [45].

Figure 11: Two-dimensional plot of terms and documents using the SVD of a reconstructed Term-Document Matrix [45].

Table 7: The three topics [45].

3.1.2 The Probabilistic Latent Semantic Indexing (PLSI) model

Even though LSI is considered an improvement over the Vector Space Model (VSM), it has some problems or challenges. The first challenge has to do with scalability and performance: compared to other Information Retrieval techniques, LSI requires relatively high computational power and memory [68]. However, this challenge has largely been overcome by modern computer systems, which offer high-speed processors and inexpensive memory; for example, it is now common for real-world LSI applications to process more than 30 million documents [94]. LSI performs such processing through matrix and SVD computations. One tool that supports these computations is Gensim, a software package that contains a completely scalable (unlimited number of documents, online training) implementation of LSI.
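As a brief illustration (a minimal sketch with a hypothetical toy corpus; the dissertation's own experiments use Wikipedia documents, described in Chapter 5), Gensim's scalable LSI implementation can be used as follows:

    from gensim import corpora, models

    # Hypothetical toy corpus: each document is already tokenized.
    texts = [["human", "machine", "interface", "time"],
             ["survey", "user", "computer", "system", "response", "time"],
             ["graph", "trees", "minors", "survey"]]

    dictionary = corpora.Dictionary(texts)                   # term <-> id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]    # bag-of-words vectors

    tfidf = models.TfidfModel(corpus)                        # optional tf-idf weighting
    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

    for topic in lsi.print_topics(2):
        print(topic)    # each topic is a weighted combination of words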

The second challenge to LSI is the difficulty of determining the optimal number of dimensions to use when performing the SVD. The general idea is that fewer dimensions result in broader comparisons of the concepts contained in a collection of text, while a higher number of dimensions results in more specific (and more relevant) comparisons of concepts. Researchers have found that the number of documents in the collection can be used to determine a good number of dimensions. For example, around 300 dimensions usually provide the best results for moderate-sized document collections (hundreds of thousands of documents), and possibly 400 dimensions for larger collections (millions of documents) [34]. However, recent studies suggest that the number of dimensions should be between 50 and 1000, depending on two major factors: the size and the nature of the document collection [45].

In 1999, Thomas Hofmann developed a new model to overcome the previous challenges of LSI, known as Probabilistic Latent Semantic Indexing (PLSI). This model is also known as the Probabilistic Latent Semantic Analysis (PLSA) model or as the Aspect Model.

Hofmann applied the Probabilistic Latent Semantic Indexing model to retrieval tasks within the Vector Space Model (VSM) framework; however, he applied it only over small collections. He examined the PLSI model in two different ways: first as a unigram model, in order to fit the empirical word distributions, and second as a latent space model, in order to provide a low-dimensional document/query representation. PLSI provided improved retrieval performance over standard term-frequency weighting and Latent Semantic Indexing (LSI); this improvement was observed on four collections containing 1033, 1400, 3204, and 1460 document abstracts. Despite these improvements, PLSI still suffers from some drawbacks. Among them is that the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting; in addition, the collections used were relatively small, so their documents cannot be considered representative [110].

3.1.3 The Latent Dirichlet Allocation model (LDA)

Starting in 2003, scholars such as David Blei, Andrew Ng, and Michael Jordan began to question the PLSI model developed by Hofmann. They tried to overcome the drawbacks of PLSI by developing a new model called the Latent Dirichlet Allocation (LDA) model. According to Blei, et al., PLSI "is incomplete in that it provides no probabilistic model at the level of documents. In PLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers." [33].

Blei and his colleagues argued that mixture models capturing the exchangeability of both words and documents should be considered. For example, if two words tend to occur in similar documents, the words are considered to be similar; if two documents tend to include similar words, the documents are considered to be similar. This reasoning led them to develop the Latent Dirichlet Allocation (LDA) model.

LDA is a well-known generative model in machine learning and is a standard example of a model for building topic models. If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. Because LDA is based on a generative model, it allows sets of observations to be explained by unobserved groups, which explains why some parts of the data should be considered similar [33].

The difference between generative models and discriminative models is discussed in the following points:

 A generative model can be considered a full model: the probabilistic model covers all variables, whereas a discriminative model is not a full model; it models only the target variable(s) conditioned on the observed variables. The fundamental difference between the two is the following: if you have input data (x) and you want to label it with classes (y), then a generative model learns the joint probability P(x, y), while a discriminative model learns the conditional probability P(y | x).

 A generative model can be used to generate values for any variable in the model, whereas a discriminative model can generate values only for the target variables conditioned on the observed quantities.

 In general, a discriminative model cannot express more complex relationships between the observed and target variables [27]. (A small illustrative sketch contrasting the two follows this list.)
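To make the distinction concrete (a minimal sketch on hypothetical synthetic data; this is a generic machine-learning illustration, not part of the dissertation's experiments), a Gaussian Naive Bayes classifier models the joint distribution P(x, y), while logistic regression models only P(y | x):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB            # generative: models P(x, y)
    from sklearn.linear_model import LogisticRegression   # discriminative: models P(y | x)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)),             # class 0 samples
                   rng.normal(3, 1, (50, 2))])            # class 1 samples
    y = np.array([0] * 50 + [1] * 50)

    gen = GaussianNB().fit(X, y)
    disc = LogisticRegression().fit(X, y)

    x_new = np.array([[1.5, 1.5]])
    print("generative     P(y|x):", gen.predict_proba(x_new))
    print("discriminative P(y|x):", disc.predict_proba(x_new))

    # Only the generative view also lets us sample new feature values for a class,
    # here approximated from the class-1 training statistics.
    mu, sd = X[y == 1].mean(axis=0), X[y == 1].std(axis=0)
    print("a feature vector sampled from class 1:", rng.normal(mu, sd))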

The previous models (VSM, LSI, PLSI) do not take word order into account and assume that ignoring it causes no problems; they treat a document as a "bag of words." However, this assumption does not hold in all cases, since word semantics is sensitive to word order. For example, searching Google for college junior or junior college yields very different results.

3.1.4 Topics in LDA

In both LDA and PLSA, each document is viewed as a mixture of different topics. The difference is that in LDA the topic distribution is assumed to have a Dirichlet prior, which results in more realistic mixtures of topics per document [55].

For example, suppose the topics in an LDA model can be categorized as CHICKEN_related and COW_related. The CHICKEN_related topic can generate various words such as hen, egg, and roasted, which the user may categorize as CHICKEN_related; obviously, the word chicken itself will have high probability given this topic. Likewise, the COW_related topic will probably generate words such as barn, milk, and beef. Function words such as "a," "an," "and," and "the" will have approximately equal probabilities across topics.

3.1.4.1 Model

The LDA model is represented as a probabilistic graphical model in Figure 12. As the figure shows, there are three levels in the LDA representation [33]:

 Documents are represented by the outer plate.

 The repeated choice of topics and words within a document is represented by the inner plate.

 α and β are corpus-level parameters.

Figure 12: Plate notation representing the LDA model.

For more details about the LDA model, see Appendix A.

Examples:

Example 1: Suppose we have the following set of sentences and we need to classify them into two topics [43]:

 I like to eat broccoli and bananas.

 I ate a banana and spinach smoothie for breakfast.


 Chinchillas and kittens are cute.

 My sister adopted a kitten yesterday.

 Look at this cute hamster munching on a piece of broccoli.

Answer: The topics that we discover using the LDA are likely to be:

 Sentences 1 and 2: 100% Topic A (Topic A represents the food topic).

 Sentences 3 and 4: 100% Topic B (Topic B represents the animal topic).

 Sentence 5: 60% Topic A, 40% Topic B.

Example 2: Suppose we have the following set of documents and we need to classify them into two topics [43]:

Doc1: After I eat my breakfast of apples, oranges, bananas, and grapes, I'm going to go snowboarding in the Alps if it's not too cold outside.

Doc2: Apples, oranges, bananas, and grapes make good smoothies.

Doc3: Apples, oranges, bananas, and grapes are tasty fruits.

Doc4: Snowboarding in the Alps is a lot of fun, but cold.

Doc5: My friend lives in the Alps, where he teaches snowboarding.

Answer: The topics that we discover using the LDA are likely to be:

 Topic 1 (the "fruit" topic): represented most strongly by apples, oranges,

bananas, grapes.

49

 Topic 2 (the "Alps" topic): represented most strongly by Alps, snowboarding,

cold.

 Document 1: 66.66% Topic 1, 33.33% Topic 2.

 Documents 2 and 3: 100% Topic 1.

 Documents 4 and 5: 100% Topic 2.
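A minimal sketch of how such an analysis could be run with Gensim's LDA implementation (using the five sentences of Example 1 after naive lowercasing and stop-word removal; since LDA is probabilistic, the exact topic proportions will differ from the idealized answers above):

    from gensim import corpora, models

    sentences = ["i like to eat broccoli and bananas",
                 "i ate a banana and spinach smoothie for breakfast",
                 "chinchillas and kittens are cute",
                 "my sister adopted a kitten yesterday",
                 "look at this cute hamster munching on a piece of broccoli"]
    stop = {"i", "to", "and", "a", "for", "are", "my", "at", "this", "on", "of"}
    texts = [[w for w in s.split() if w not in stop] for s in sentences]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                          passes=50, random_state=1)

    for i, bow in enumerate(corpus):
        print("sentence", i + 1, lda.get_document_topics(bow))   # per-sentence topic mixture
    print(lda.print_topics(num_topics=2, num_words=4))           # top words per topic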

3.2 Automatic Query Expansion (AQE)

In general, the interface of an Information Retrieval system is designed to give the user the opportunity to search for keywords in a single input box. These keywords are used to retrieve the relevant documents by matching them against the collection index to find the documents that contain them. The system is more likely to retrieve suitable matches if the user query contains multiple topic-specific keywords that accurately describe the information need. Yet users often submit short queries, and natural language is inherently ambiguous, which may cause this traditional retrieval model to return erroneous and incomplete results [41].

Term mismatch is considered one of the critical language issues for the effectiveness of the retrieval process: users usually do not use the same words that indexers use, which is also called the vocabulary problem. The vocabulary problem can be understood through synonymy and polysemy. In this context, synonymy refers to different words that have the same meaning (it may also include word inflection, such as the singular and plural forms of a word), whereas polysemy refers to a word that has different meanings. These problems may result in erroneous and irrelevant retrievals, which decreases accuracy. More precisely, the synonymy problem may decrease recall, whereas polysemy may decrease precision [53].

Scholars have suggested different approaches to solve the vocabulary problem, including relevance feedback, word sense disambiguation, interactive query refinement, and search-results clustering. Expanding the original query with other words that best capture the user's intention is considered one of the most natural and successful techniques; in other words, the goal is to generate a more usable query that is likely to retrieve the relevant documents.

Research on Automatic Query Expansion (AQE) affirms that although the volume of data is increasing drastically, users still use very few terms when searching, and web search is a good example of this. Hitwise reported in 2009 that the average query length is 2.3 words, the same length reported in 1999 by Lau and Horvitz [74].

Studies show that there has been a slight increase in query length, which can reach five or more words; yet the most common query lengths are still one, two, and three words. Such relatively short queries aggravate the vocabulary problem, since the lack of query terms reduces the possibility of dealing with synonymy. At the same time, the large volume of data increases the problem of polysemy. All of these factors have resulted in a greater need for Automatic Query Expansion (AQE) [41].

In the past few years, many scholars have presented a large number of AQE techniques employing various approaches that leverage several data sources and use effective methods to find new features correlated with the query terms. Nowadays, AQE rests on solid theoretical frameworks and on careful study of its usability and limitations, for example, determining the important parameters that affect a method's performance and the kinds of queries for which AQE performs best.

Meanwhile, there is an increasing usage of the basic techniques in conjunction with other mechanisms in order to increase their effectiveness. This includes combining methods, using active selection of information sources, and applying discriminative policies of method application.

These improvements have been further verified by results obtained in laboratory experiments. The evaluation results presented at the Text REtrieval Conference (TREC) series gave more importance to AQE: many participants relied on AQE techniques and reported significant improvements in retrieval performance [41].

Nowadays, AQE has proved to be one of the most efficient techniques for enhancing the retrieval effectiveness of documents. Many commercial applications have started to employ AQE in desktop and intranet search. Several products, such as Google Enterprise, MySQL, and Lucene, provide an AQE feature that users may choose to enable. Yet operational web IR systems, including search engines, have not fully adopted AQE [41].

This reluctance is due to the following reasons. First, performing a query with AQE is computationally expensive for web search applications. Second, the AQE techniques currently in use provide good results only on average and may provide poor results for some queries; moreover, AQE tends to emphasize recall at the expense of precision, whereas web users mostly care about the results on the first page, so relevant results introduced by AQE may not help if they do not appear there. Third, AQE sometimes retrieves documents in the result page that do not contain the original query terms, which may confuse users [41].

The remainder of this chapter is organized as follows. Section 3.3 describes document ranking with AQE. Section 3.4 discusses why, and under which assumptions, AQE produces more accurate results than unexpanded queries. Section 3.5 describes how AQE works, Section 3.6 discusses the selection of expansion features, and Section 3.7 describes query reformulation.

3.3 Document Ranking with AQE

In general, many IR systems, such as search engines, completely or partially rely on calculating the importance of the terms used in the query and in the documents to decide on their responses [41]. The score of a document d with respect to a query q is typically computed as

sim(q, d) = Σ_{t ∈ q ∩ d} w_{t,q} · w_{t,d}          (1)

where w_{t,q} is the weight of term t in query q and w_{t,d} is the weight of term t in document d. In this formula, the weight of each term is usually proportional to the term frequency and inversely proportional to the frequency and length of the documents containing the term.

We can modify the ranking scheme of formula (1) to accommodate query expansion, abstracting away from the specific underlying weighting model. The basic input to AQE is the original query q together with a source of data from which to compute and weight the expansion terms; the output of AQE is a query q' formed by an expanded set of terms with their associated weights w'. The new weighted query terms are used to compute the similarity between the expanded query q' and a document d [41]:

sim(q', d) = Σ_{t ∈ q' ∩ d} w'_{t,q'} · w_{t,d}          (2)

"The most typical data source for generating new terms is the collection itself being searched and the simplest way of weighting the query expansion terms is to use just the weighting function used by the ranking system. If more complex features than single terms are used for query expansion (e.g., phrases), the underlying ranking system must be able to handle such features" [41].

3.4 Why and When AQE Works

A document ranking system connects query terms by an implicit "OR". The advantage of this assumption is that it expands the boundaries of the query. In other words, there is a chance for increasing the recall and retrieving documents that do not include the original query terms.


For example, if the query "Thanksgiving" is expanded to also include "an official holidays in the USA, last Thursday of November, American tradition family gathering, feast, turkey, and black Friday," then the new query will retrieve the documents that contain the original term "Thanksgiving" in addition to other documents that include the other expanded terms.

This is one of the most distinguishing features of AQE, and it can be very helpful for search applications in specific domains such as the financial, medical, scientific, and technical domains.

These recall improvements can be achieved whether the query uses the implicit Boolean operator OR or AND. However, query expansion can also negatively affect the search results; it might hurt the precision of the results, since some expanded terms might not be related to the original query terms [81]. This problem may arise for various reasons. For example, the expanded terms might be related to only part of the query rather than to the query as a whole, and using proper nouns as expansion terms may make the results even worse [106].

Besides, the expanded terms might simply be irrelevant to the original query, for instance when AQE draws them from a top-ranked document that is itself irrelevant. A decrease in precision can also occur when relevant documents are pushed to a lower ranking, regardless of whether the expanded terms are related to the query concept [40].

For example, if the query "Will Smith" is expanded with "American actor,"

"movie," and "comedy", then documents about different actors that include the expanded

55

terms might appear as higher score results than a document about Will Smith that does not include the expanded terms.

Many experimental studies have shown the possibility of losing precision (e.g., Voorhees and Harman [1998]). Two crucial measures are used to evaluate the effectiveness of IR systems: recall and precision. Recent experiments affirm that AQE is one of the best approaches for improving both, and that it improves precision and recall by about 10% over other approaches (e.g., Mitra, et al. [1998], Carpineto, et al. [2002], Liu, et al. [2004], Lee, et al. [2008]) [41].

The findings discussed above further demonstrate the effectiveness of AQE as a technique. However, this effectiveness does not necessarily guarantee high precision when precision is the user's main concern; still, many recent studies confirm that AQE does not always decrease precision.

AQE might not be the appropriate method for all types of queries, especially in web search. According to Broder, web queries can generally be classified into three basic types: informational, navigational, and transactional [36]. Informational queries are the most amenable to AQE, since the users do not know specifically what they are looking for or cannot describe it clearly in words.

On the other hand, for the other two types of web queries, navigational and transactional, the intended pages are almost always well defined by the users through specific terms. There are two reasons for this: first, navigational queries target specific URLs; second, in transactional queries users are looking for a specific Web-mediated activity.

3.5 How AQE Works

Figure 13 shows the four major steps that can explain how AQE works [55].

Figure 13: Main steps of automatic query expansion.

As shown in Figure 13, the main steps of automatic query expansion are preprocessing of the data source, generation and ranking of candidate expansion features, selection of expansion features, and query reformulation. The following sections discuss each of these steps in detail.

3.5.1 Preprocessing of Data Source

This step transforms the raw data used for expansion into a format that can be used effectively in the following steps. It does not depend on the particular user query that is to be expanded; rather, it depends on the expansion method and on the type of data source in question. This section discusses the most popular preprocessing procedures.

Many query expansion techniques use the information contained in the top-ranked documents retrieved for the user's original query as a point of departure. Computing this initial retrieval run requires indexing the collection and running the query against the collection index. Indexing involves several steps [41] (a minimal sketch of these steps follows the list):

 Extracting text from documents such as HTML, PDF, MS Word, etc. (this applies only if the collection is made of such types of documents);

 Performing tokenization, which refers to extracting individual words, dismissing punctuation and case;

 Removing stop words, i.e., removing common words such as articles and prepositions;

 Applying stemming to the words, which refers to reducing words to their roots by removing inflection;

 Applying weighting to the words, which refers to deciding on the importance of each word in each document.
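The following minimal sketch shows these indexing steps with NLTK (a commonly used toolkit chosen here for illustration; it is not necessarily the preprocessing pipeline used in the dissertation's experiments, and it assumes the NLTK tokenizer and stop-word data have been downloaded):

    import nltk
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # One-time downloads, commented out after the first run:
    # nltk.download("punkt"); nltk.download("stopwords")

    raw = "Stemming reduces the inflected words to their roots."

    tokens = [t.lower() for t in word_tokenize(raw) if t.isalpha()]   # tokenization, drop punctuation/case
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]                     # stop-word removal
    stems = [PorterStemmer().stem(t) for t in tokens]                 # stemming

    weights = Counter(stems)    # a simple weighting step: raw term frequency
    print(stems)
    print(weights)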

3.5.2 Generation and Ranking of Candidate Expansion Features

After the first stage is finished, AQE generates and ranks the candidate expansion features. The importance of this step stems from the fact that most query expansion methods choose only a small number of the candidate expansion features to add to the query.


The input to this stage consists of the original query and the transformed data source; the output is a set of expansion features, each usually assigned a score. The original query may itself be preprocessed in order to remove common words and/or extract the important terms that are to be expanded [41]. The methods used to generate and rank the candidate features can be classified by the type of relationship between the query terms and the expansion features [41].

3.5.2.1 One-to-One Associations

There are many ways for generating and ranking candidates and one of the simplest ways depends on one-to-one associations between the query terms and the expansion features. In other words, each single query term is related to one expansion feature. By using different kinds of techniques, expansion features can be generated and scored for each single query term [41].
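As an illustration of a one-to-one association measure (a minimal sketch using a simple co-occurrence count over hypothetical documents; real systems typically use stronger measures such as mutual information or term correlation):

    from collections import Counter
    from itertools import combinations

    docs = ["java program compiles to bytecode",
            "the tv program starts at nine",
            "a computer program is a set of instructions",
            "java is a programming language"]

    # Count how often each pair of distinct terms co-occurs in a document.
    cooc = Counter()
    for doc in docs:
        terms = set(doc.split())
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1

    def associations(query_term, top_n=3):
        """Rank candidate expansion features for a single query term."""
        scores = Counter()
        for (a, b), c in cooc.items():
            if a == query_term:
                scores[b] += c
            elif b == query_term:
                scores[a] += c
        return scores.most_common(top_n)

    print(associations("program"))   # terms most associated with "program"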

3.5.2.2 One-to-Many Associations

Bai, et al. [2007] explain that one-to-one association suffers from the problem that it may add a term that is only partially relevant to a query term, which may not reflect the accurate relationship between the expansion term and the query as a whole. This problem can be explained with the following example: the term "program" might be strongly associated with the word "computer," but this expansion works only for some queries, such as "application program" and "Java program," while it might not work well for queries such as "government program," "TV program," or "space program" [16].

3.5.2.3 Analysis of Feature Distribution in Top-Ranked Documents

This technique does not fall into the previous categories because it does not focus on features directly related to individual query terms, whether single or multiple. The main idea is to use the top-ranked documents returned for the original query as the source of expansion: the terms extracted from those documents are used as expansion features. "In a sense, the expansion features are related to the full meaning of the query because the extracted terms are those that best characterize the pseudo-relevant documents as a whole, but their association with the query terms is not analyzed explicitly" [41].

3.5.2.4 Query Language Modeling

AQE also uses techniques such as query language modeling, which builds a statistical language model for each query, specifying a probability distribution over terms; the terms with the highest probabilities are taken as the best representative terms. These techniques are also called model-based techniques. Two well-known techniques exist: the first, suggested by Zhai, et al., is called the mixture model, and the second, suggested by Lavrenko, et al., is called the relevance model. In general, both models depend heavily on the top retrieved documents [75, 112].
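As a rough sketch of the model-based idea (a simplified relevance-model-style estimate over hypothetical pseudo-relevant documents; the actual formulations of Zhai, et al. and Lavrenko, et al. are more involved):

    from collections import Counter

    # Hypothetical top-ranked (pseudo-relevant) documents for some query,
    # together with retrieval scores normalized to sum to 1.
    top_docs = [("foreign minorities in germany include sorbs and frisians", 0.5),
                ("the sorbs are a slavic minority living in germany", 0.3),
                ("germany hosts several national minorities", 0.2)]

    expansion_scores = Counter()
    for text, doc_weight in top_docs:
        tokens = text.split()
        tf = Counter(tokens)
        for term, count in tf.items():
            # P(term | query model) ~ sum over docs of P(term | doc) * doc weight
            expansion_scores[term] += (count / len(tokens)) * doc_weight

    # The highest-probability terms become candidate expansion features.
    for term, p in expansion_scores.most_common(5):
        print(term, round(p, 3))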


3.5.2.5 A Web Search Example

To explain the features that different expansion methods use, consider the following example. Assume that you are looking for web pages with information about "foreign minorities Germany." Figure 14 shows the top results page that Google returned in response to this query (as of May 2009). Five of the first ten results are irrelevant to the query because of inappropriate matching with the query terms (e.g., some of the results concern German minorities that do not live in Germany). Figure 15 shows the results after applying automatic query expansion.


Figure 14: The first ten results returned by Google in response to the query "foreign minorities Germany" (as of May 2009) [41].


Figure 15: The first ten results returned by Google in response to the expanded query "foreign minorities Germany sorbs Frisians" (as of May 2009) [41].

For more details about applications of AQE and a classification of approaches, see Appendix A.


3.6 Selection of Expansion Features

The top expansion features are selected by ranking the candidate features. In general, only a small number of features is selected, so that the expanded query can be processed quickly. This is acceptable because using a limited number of representative terms is not, by default, less effective than using all the candidate expansion terms, an effect that can be attributed to noise reduction.

Researchers have suggested ideal numbers of features to include. According to Amati, et al., the ideal number ranges from five to ten, whereas Carpineto, et al., Buckley, et al., and Wong, et al. suggest that it can reach up to a few hundred [41].

According to Carpineto, et al. [41], the performance decrease accompanying non-ideal values is usually modest and, based on many experiments, the exact number of expansion features is of low relevance. In general, ten to thirty is the default number of expansion features. If we interpret the feature scores as probabilities, then we should choose the terms whose probability exceeds a certain threshold, e.g., p = 0.001 as suggested by Zhai and Lafferty [112]. Adopting more informed selection policies, such as using multiple term-ranking functions and selecting for each query the most common terms, or choosing a variable amount of expansion depending on the query difficulty, can be more convenient than searching for the ideal number of expansion terms. According to Billerbeck, et al., Buckley, et al., and Cao, et al., the optimal number of expansion features can vary depending on the type of query.


Besides, Cao, et al. suggest that when the fraction of expansion terms reaches one third, it can negatively affect retrieval performance [41]. Moreover, Carpineto, et al. and Cao, et al. claim that performance can be greatly enhanced when the best features are chosen for each query [41].

3.7 Query Reformulation

Query reformulation is the last step of AQE. This step builds the expanded query that will be submitted to the IR system; it usually consists of assigning a weight to each feature of the expanded query, which is referred to as query reweighting.

"The most popular query reweighting technique is modeled after Rocchio’s formula for relevance feedback [Rocchio 1971] and its subsequent improvements [Salton and

Buckley 1990], adapted to the AQE setting" [41].

The general formulation can be written as

w_{t,q'} = (1 - λ) · w_{t,q} + λ · score_t

where q' is the expanded query, q is the original query, λ is the parameter responsible for weighting the relative contributions of the original query terms and the expansion terms, and score_t is the weight assigned to the expansion term t.
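A minimal sketch of this reweighting step (hypothetical toy weights and λ = 0.4; the formula above is the one actually being applied):

    lam = 0.4   # relative contribution of the expansion terms

    original_query = {"thanksgiving": 1.0}                    # w_{t,q}
    expansion_scores = {"turkey": 0.8, "holiday": 0.5,        # score_t for candidate terms
                        "thanksgiving": 0.9}

    terms = set(original_query) | set(expansion_scores)
    expanded_query = {
        t: (1 - lam) * original_query.get(t, 0.0) + lam * expansion_scores.get(t, 0.0)
        for t in terms
    }
    print(expanded_query)   # w_{t,q'}: the reweighted, expanded query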


3.8 Related Work

Many algorithms have been implemented to solve the problem of text categorization, but most of the work in this area is restricted to English text. That is to say, text categorization has not been applied to many other languages, including Arabic.

Text classification is a fundamental task in document processing, especially because the information flood is getting enormous. Various classification approaches are tested on languages like English, German, French, and other European languages. For many European languages, there are many rule-based and statistical approaches that can be used for all fields of Information Retrieval and knowledge management. However, there are only a few approaches that can deal with the Arabic language and which are able to handle problems with the inflections that do not appear in other languages in a similar fashion.

In their paper entitled "Automatic Arabic Document Categorization Based on the

Naïve Bayes Algorithm," Elkourdi and Bensaid suggest a new approach to classify the

Arabic language automatically [48]. To do so, they used a statistical machine learning algorithm called Naive Bayes (NB). This approach can "classify non-vocalized Arabic web documents … to one of five pre-defined categories." The results of their work showed that the overall accuracy of this approach is 62% [48].

Similarly, Sawaf, Zaplo, and Ney provide a new approach for classifying and clustering Arabic documents. Their method mainly depends on statistical methods without providing any morphological analysis. Sawaf, Zaplo, and Ney argue that their approach seems to be very successful with Arabic-language documents [100].


Other scholars, such as Zrigui, Ayad, Mars, and Maraoui, propose a new approach for classifying Arabic-language documents. They use two types of algorithms: Support Vector Machine (SVM) and Latent Dirichlet Allocation (LDA). They conclude that this approach can be very effective for classifying Arabic-language documents [113].

In their paper entitled "An Arabic lemma-based stemmer for latent topic modeling," Brahmi, Ech-Cherif, and Benyettou introduce a new approach for classifying Arabic-language documents: "A new lemma-based stemmer is developed and compared to a root-based one for characterizing Arabic text. The Latent Dirichlet Allocation (LDA) model is adapted to extract Arabic latent topics from various real-world corpora." They maintain that this approach improves the classification of Arabic-language documents compared to the existing root-based stemming [35].

In 2009, Al-Shalabi, et al. expanded queries using Word Sense Disambiguation (WSD) and then compared the performance of a search engine before and after expanding the query. Their work can be summarized in the following steps:

 Submitting a query in Arabic.

 Reviewing the retrieved documents to decide whether each document is relevant or irrelevant to the query.

 Resubmitting the query after adding expansion terms, based on synonyms provided to the user by a dictionary.

The problem with this study is that it does not report the results that were found; the authors only mention that precision may decrease in many cases, especially when polysemous terms are added [9].


In 2014, Mahgoub, et al. expanded queries using three Arabic resources: the Arabic Wikipedia dump, the "Al Raed" dictionary, and the "Google WordNet" dictionary. Their work can be summarized as follows:

The first step locates the named entities or concepts used in the query in Wikipedia. If a named entity or a concept is located, the titles of its "redirect" pages, which point to the same concept, and its subcategories from the Wikipedia categorization system are added; otherwise, the other two dictionaries, Google WordNet and Al Raed, are used. Moreover, they applied two query expansion methodologies: the first is a single expanded query containing all expansion terms, and the second issues a separate expanded query for each term. Finally, they combined the results of these queries into one result list. The problem with their study was the accuracy (recall and precision, which in some cases reached only about 10%) [76].

A survey of literature shows that most of the studies in the area of topic modeling and query expansion are limited to European languages. Moreover, most of the studies have applied either topic modeling or query expansion to retrieve information. In other words, very few studies have combined both approaches, and none of them, to the best of my knowledge, has applied the combination of these two methods to Arabic language documents. This study investigates these issues thoroughly with regard to Arabic language in the following chapters. Furthermore, this study will compare the results in terms of accuracy and time complexity for both English and Arabic.


Chapter 4

Semantic Web and Arabic language

4.1 Importance of Arabic Language

Arabic is one of the Semitic languages, which include Hebrew, Akkadian, Phoenician, Tigre, Aramaic, Syriac, Ugaritic, Amharic, Geez, and Tigrinya. All of these languages except Arabic have died or are used only in limited ways. For example, the Akkadian and Ugaritic languages died out long ago. Hebrew is one of the oldest Semitic languages; it had disappeared but has recently been revived in Israel. The Tigre language is used mainly as a liturgical language of the Ethiopian and Eritrean Orthodox Tewahedo Churches. Geez, which was the official language of the Kingdom of Aksum and the Ethiopian imperial court, is now used only in the literature of the Ethiopian Orthodox Tewahedo Church, the Eritrean Orthodox Tewahedo Church, the Ethiopian Catholic Church, and the Beta Israel Jewish community.

There are two main reasons why the Arabic language is still used and still has many speakers. The first reason is the prosperity of the Arabic language during the Jahiliyyah (pre-Islamic) age of Arab society, owing to the art of poetry.

Many poems were written by Arab poets, and each tribe encouraged its members to develop their skills in Arabic and become skillful poets. Therefore, the Jahiliyyah era is considered the richest period of Arabic speakers and poets. In that era, some poets composed the Mu'allaqat, celebrated long poems, and annual competitions were held to choose the poem with the greatest eloquence and meaning.

The second reason was the mission of the prophet Mohammed and the holy Qur'an. The Qur'an, the holy book of all Muslims, is in Arabic. The prophet Mohammed taught the Islamic rules in Arabic, and most of the basic rituals of worship in Islam must be performed in Arabic by both Arab and non-Arab people. For example, there are five daily prayers in which part of the Qur'an must be recited in Arabic. These are the two main reasons why the Arabic language survived and why non-Arabs were influenced to learn the basics of the Arabic language.

The Arabic language is one of the most widely spoken languages in the world in terms of the number of speakers. It is spoken by approximately 330 million people in the twenty-three countries of the Middle East and North Africa where Arabic is the official language.

Besides, the Arabic language is the religious language for all Muslims from all over the world since Arabic is the language of the holy Qur’an.

The Arabic alphabet has 28 letters, and Arabic is written in the opposite direction to English, i.e., from right to left. Arabic is considered one of the semantically richest languages, often having a specific word for each specific thing, and it is one of the six official languages of the United Nations [31, 97].

According to "The World’s Ten Most Influential Languages" by George Weber,

Arabic is among the world’s ten most influential languages. The formula that they used to decide the most influential languages is based on the following factors:


 Number of primary speakers: maximum 4 points.

 Number of secondary speakers: maximum 6 points.

 Economic power of the countries using the language: maximum 8 points.

 Number of major areas of human activity in which the language is important: maximum 8 points.

 Population of the countries using the language: maximum 7 points.

 Socio-literary prestige of the language: maximum 4 points (plus an additional point for being an official UN language).

As a result, based on this formula, the Arabic language is the fifth most influential language in the world as shown in Figure 16 [109].

Figure 16: The real strength of the top ten languages [109].


4.2 Right-to-Left Languages and the Semantic Web

Although the Arabic script accounts for 8.9% of the world's languages, the Arabic language is not well represented in the Semantic Web. As Table 8 shows, other scripts with a lower percentage, such as Hanzi, are better represented in the Semantic Web. In terms of script, it is also clear that other languages that use the same script, such as Farsi, do have some Semantic Web applications. In theory, therefore, neither the percentage of users of the script nor the script itself should prevent Arabic from being represented in the Semantic Web [6].

English has a left-to-right script and also has the highest percentage of use on the web (56%) [1]. This has led most Semantic Web developers to design their applications and tools to fit the English script, i.e., to support left-to-right scripts. However, there have been attempts to support right-to-left scripts such as those of Hebrew and Farsi.

Table 8: Clustering of the world’s languages based on their script family [86].


4.3 Arabic Ontology

In this section, Arabic ontology is introduced because in Chapter 6 we use the Arabic WordNet ontology for the Arabic language and the WordNet ontology for the English language.

Ontology learning aims at extracting related concepts and relations from the corpus in question in a semi-automatic way in order to form an ontology. The life cycle of an ontology involves six phases: ontology creation, ontology population, ontology validation, ontology deployment, ontology maintenance, and ontology evolution [37].

The ontology learning process can be further sub-divided into: extracting terms, discovering synonyms, obtaining concepts, extracting concept hierarchies, defining relations among concepts, and deducing rules or axioms. Applying these steps makes ontology matching possible and also makes the branches related to a topic available to any user [23].

Supporting ontologies in different languages is also a challenge for web designers who must satisfy the needs of millions of World Wide Web users. The demand for obtaining information in the specific language users are searching in has increased [23]. Thus, the need for Arabic ontologies has also increased, because English ontologies cannot simply be translated into Arabic, for the following reasons:

 There is a clear lack of Arabic editors for the OWL language and RDF files.

 Tools that support the Arabic character set do not fully support right-to-left script.

 There is almost no Arabic-language parser that supports RDF files in semantic editors.

 There is no Arabic metadata definition similar to English [10].

 There are no open-source Arabic Semantic Web software tools and Web services.

Table 9: Listing of tools used in building Semantic Web applications [6].

Function: Ontology Editor
Tool: Protégé
Description: Visual ontology editor written in Java with many plug-in tools.

Function: Ontology Repository
Tool: Sesame
Description: An open-source RDF database with support for RDF Schema inferencing and querying.

Function: Information Extraction
Tool: GATE
Description: Open-source infrastructure that provides language processing capabilities. Plug-ins for processing Arabic documents are available at: http://gate.ac.uk/gate/doc/plugins.html#arabic

Function: Reasoners and Processors
Tool: Jena
Description: A Java framework to construct Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL, and SPARQL and includes a rule-based inference engine.

Table 10: Arabic support summary [58].

Tool     | RDF                   | OWL                   | Query
Protégé  | Supports Arabic       | Limited support       | Limited support
Jena     | Supports Arabic       | Supports Arabic       | Limited support
Sesame   | Limited support       | Limited support       | No support for Arabic
KAON2    | No support for Arabic | No support for Arabic | No support for Arabic

4.4 The Arabic language and the Semantic Web: Challenges and opportunities

A classic Semantic Web application consists of two main parts: the semantic annotation process and the semantic query and reasoning process.

The semantic annotation process, as depicted in Figure 17, requires three components: an extraction engine, an ontology, and an annotation generator. First, the extraction engine is responsible for finding the entities of interest in a document using techniques such as wrappers and information extraction ("IE").

Second, an ontology describes the domain of interest. Third, the annotation generator is responsible for explicitly adding semantic meaning to the extracted entities using the ontology.


Figure 17: The three main components of the semantic annotation process [6].
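To illustrate the role of the annotation generator (a minimal sketch using the rdflib library with a hypothetical ontology namespace and an entity assumed to have been found by the extraction engine; this is not a description of any specific tool in Table 9):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/ontology#")     # hypothetical domain ontology
    g = Graph()
    g.bind("ex", EX)

    # Entity assumed to have been extracted from a document by the extraction engine.
    entity = URIRef("http://example.org/doc1#WillSmith")

    # The annotation generator ties the extracted entity to an ontology concept.
    g.add((entity, RDF.type, EX.Actor))
    g.add((entity, RDFS.label, Literal("Will Smith", lang="en")))
    g.add((entity, EX.mentionedIn, URIRef("http://example.org/doc1")))

    print(g.serialize(format="turtle"))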

The semantic query and reasoning process, on the other hand, requires tools that enable "intelligent" searching through Semantic Web data rather than uninformed string matching, which is currently employed in traditional Information Retrieval systems. These tools use the layer of descriptive semantics as a medium for communication between agents and as a way to inference over data and derive conclusions.

Investigation of the Semantic Web field reveals several issues that explain the lack of Arabic research. These reasons relate to the lack of technology support, tools, applications, and adequate resources. There are many languages that most companies around the world do not take into consideration, and Arabic is one of them. For example, for many years Android and Apple devices did not support Arabic fonts, so Arabic consumers had no choice but to use technologically inferior alternatives that allowed Arabic support on Apple and Android products. In addition, there are no Arabic companies concerned with manufacturing devices that deal directly with the Arabic language, and only a few Arabic companies have developed tools and applications. We can summarize the challenges and difficulties in the following points:

1) Lack of Arabic support in existing Semantic Web tools:

A specific problem with Semantic Web tools that process Arabic text concerns encoding. Different encodings of Arabic script exist on the Web; the dominant encodings are UTF-8, Windows-1256, and ISO-8859-6 [105]. Moreover, most of the Semantic Web tools we have encountered were built using Java, which supports internationalization. Therefore, there is a strong need to consolidate the different Arabic encodings or simply adhere to one encoding scheme when representing Arabic text (i.e., Unicode). Typical Semantic Web developers' tools use Unicode throughout (Carroll, 2005); hence this might solve part of the support problem [42].

2) Lack of Arabic Semantic Web applications:

According to the OntoSelect ontology library, English accounts for 49% of the ontologies in the library, which suggests an obvious lack of Arabic in the Semantic Web world. Figure 18 shows the limited number of Arabic Semantic Web applications compared to other languages. This problem can be explained as a result of the lack of tools and software to process Arabic script throughout all the steps of the semantic annotation process [61].


Figure 18: A pie chart showing the distribution of languages used in creating ontologies stored in the OntoSelect library [6].

3) Limited support for Arabic research in the field of Semantic Web technologies:

Collaboration between academic research centers and grant bodies has resulted in many Semantic Web research projects, and investment in the Semantic Web has produced many Semantic Web tools such as GATE, Jena, and Protégé. Several reasons lead to the limited research on the Arabic language, among them a lack of funding, a lack of adequate resources, and a lack of interest in the field of the Semantic Web [6].


In addition, most Arab countries have no plans or goals to build or support research centers for the Arabic language and the Semantic Web that would help people catch up with the evolution of the technology.

4) Intrinsic difficulties of the language:

Several intrinsic difficulties have played a negative role in the process of creating such tools for Arabic. Specifically, the Arabic language presents complex morphology, the absence of capital letters, and short vowels that are usually omitted in writing.

The Arabic language is composed of verbs, particles, and nouns, which are derived from approximately 10,000 roots [97]. A noun is a name or a word that describes a person, a thing, or an idea. Verbs are similar to English verbs; verbs in Arabic are classified into perfect, imperfect, and imperative. Arabic particles include prepositions, adverbs, conjunctions, interrogative particles, exceptions, and interjections. As stated in [96], "Arabic is highly inflectional and derivational, which makes morphological analysis a very complex task" and "Capitalization is not used in Arabic, which makes it hard to identify proper names, acronyms, and abbreviations."

All the previously discussed issues indicate that introducing an Arabic Semantic Web is not an easy task.

There are many potential challenges and opportunities that users of the Arabic language may encounter with the commencement of Semantic Web solutions. Some of these challenges include the following:

 The need for the development of controlled vocabularies and ontologies to explicitly define machine-processable semantics. These controlled vocabularies and ontologies would help people and machines communicate concisely and would support the exchange of semantics and syntax between them. Promising efforts of this kind are being developed; they include the construction of an Arabic WordNet [49].

 Computational processing of Arabic text differs from its English counterpart. The Arabic language encompasses more complex morphological, grammatical, and semantic aspects, so existing Natural Language Processing (NLP) algorithms used for the English language cannot be directly re-purposed for Arabic. This problem also has its roots in the Semantic Web domain. Information extraction, one of the Semantic Web's main processes, relies extensively on extracting concepts from Web documents; this process requires analyzing the document content (morphologically, grammatically, or semantically) in order to relate the extracted instances to a pre-defined ontology concept. Thus, more tailored NLP tools such as GATE are needed to process the Arabic language accordingly [3].

As for opportunities, the Semantic Web provides rich opportunities for users of the Arabic language to process data at many levels. In terms of data, large and growing repositories of Arabic content in business, science, government documents, and email messages already exist on the Web. The Semantic Web provides users with a common framework to integrate these existing Arabic repositories and to derive new meaning, value, and knowledge from them. An important element of the Semantic Web is the language used to record how the data relates to real-world concepts. Arabic content on the Web can be networked in meaningful ways in semantic-enhanced applications, since the concept of a multilingual Semantic Web supporting Arabic content is attainable. In terms of language barriers, the Semantic Web offers users of the Arabic language, and machines alike, the chance to move from one part of the Web to another based on a related meaning, regardless of the language used to represent that meaning.

Opportunities can also be highlighted from local and regional perspectives. In the era of knowledge economies, modern organizations and governments are increasingly turning to knowledge management as a differentiating asset to generate greater productivity and create new value. As nations in the Arab region begin the era of e-Government and e-Commerce, shared data understanding and representation for Arabic content is essential. This can be achieved by using Semantic Web technologies; thus computer-mediated Arabic content is attainable by employing Semantic Web tools and applications that aim to provide machine-processable data with a universal representation [6].


CHAPTER 5

Generating Topics

5.1 Introduction

Nowadays, we encounter a huge amount of information in many different fields, for example Wikipedia articles, news articles, astronomical survey data, Flickr images, and social networks. According to Hilbert and López, the world's technological capacity to store information has roughly doubled every 40 months since the 1980s [63], and according to IBM, since 2012 about 2.5 exabytes of data have been created every day. This suggests that the main problem we face today with regard to information is information overload. For that reason, there is a need for algorithmic tools to search, organize, and understand this huge amount of information. For our purposes, when we use Latent Dirichlet Allocation (LDA) we can consider a topic as a probability distribution over a collection of words, and when we use Latent Semantic Indexing (LSI) we can consider a topic as a similarity distribution over a collection of words. A topic model in LDA is a generative model, i.e., a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics. According to Landauer and Dumais, LSI is a "theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text" [72]. The main goal of topic modeling is to discover and provide a "thematic summary" of a collection of documents. In other words, we need to answer the following question: what themes are being presented in the documents in question [103]?

In topic models, the semantic properties of a text are expressed in terms of topics; as a result, topics can be used as background knowledge that provides semantically related words to expand the literal matching of the words present in a given text [103]. Topic modeling can be used in various IR applications to overcome the problems of literal word-matching algorithms, since it adds an extra semantic layer that addresses the vocabulary mismatch problem in terms of synonymy and polysemy, as described earlier. These problems occur because the users of IR systems usually do not use the same words that indexers use. For example, assume that a person writes a query with the term "pool" to search the web for information about thread pools in a programming language. The web search engine may return results related to a swimming pool, a group of people, a programming language, or even an online shop.

In this chapter Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) have been used to classify the corpus into specific topics. The experiments were performed using both Arabic and English documents. The results show that LDA is more accurate than LSI for both Arabic and English documents; we discuss the results in more detail later in this chapter. The results also show that LSI and LDA can identify topics for the corpus with high accuracy, in both Arabic and English documents, and that LSI is faster than LDA in terms of time complexity.


5.2 Data set

Due to the unavailability of free Arabic resources, we have adopted the Arabic Wikipedia and the English Wikipedia to select Arabic and English documents. According to Wikipedia, the Arabic Wikipedia is currently the 22nd largest edition of Wikipedia by article count, and is the first among the Semitic languages. As of August 2014, the Arabic Wikipedia has over 300,000 articles. The English Wikipedia, on the other hand, is currently the largest edition of Wikipedia by article count; it includes more than 4,878,735 articles and grows every day by over 800 new articles. Table 11 shows more details about the English and Arabic Wikipedia.

Table 11: English and Arabic Wikipedia

Number | Language | Wiki | Number of articles | Number of users | Number of images
1      | English  | en   | 4,879,148          | 25,255,450      | 855,816
22     | Arabic   | ar   | 371,803            | 1,012,177       | 23,607

Studying several other corpora to measure the impact of different corpora on the results is a possible future direction. Moreover, as this study is limited to Wikipedia documents, future studies could examine different types of corpora. Furthermore, future studies could use domain-specific corpora, such as medical, political, or economic texts, which might yield different results.


5.3 Experimental Results

In this section we measure the accuracy for topic modeling using LSI and LDA for both English and Arabic documents.

Why LSI?

As we mentioned earlier, one of the major shortcomings of a number of IR methods is that they fail to treat synonymy and polysemy correctly. LSI is considered an advanced

Information Retrieval method that solves the problems caused by synonymy and polysemy.

It goes beyond traditional Information Retrieval (IR) methods that rely on keyword matching techniques and rather "deals with the concept, and carries out a search on this basis" [11]. Deerwester et al. describe the three major advantages of using the LSI representation with the following labels: synonymy, polysemy, and term dependence [46].

As we discussed earlier, synonymy refers to different words that have the same meaning. Synonymy may also include word inflection (singular and plural forms of a word), whereas polysemy refers to a word that has different meanings. These problems may result in erroneous and irrelevant retrievals, which decrease the accuracy. More exactly, the synonymy problem may decrease the recall, whereas polysemy may decrease the precision. The third advantage of using LSI is that it can deal with term dependency. "The traditional vector space model assumes term independence and terms serve as the orthogonal basis vectors of the vector space. Since there are strong associations between terms in language, this assumption is never satisfied. While term independence represents the most reasonable first-order approximation, it should be


possible to obtain improved performance by using term associations in the retrieval process. Adding common phrases as search items is a simple application of this approach.

On the other hand, the LSI factors are orthogonal by definition, and terms are positioned in the reduced space in a way that reflects the correlations in their use across documents. It is very difficult to take advantage of term associations without dramatically increasing the computational requirements of the retrieval problem. While the LSI solution is difficult to compute for large collections, it needs only be constructed once for the entire collection and performance at retrieval time is not affected" [95].

Why LDA?

As we discussed earlier, Latent Dirichlet Allocation (LDA), which was introduced by Blei et al. [33], is a formal generative latent mixture model of documents. LDA has quickly become one of the most popular probabilistic text modeling techniques in machine learning and natural language processing (NLP).

One of the major problems with LSI is ambiguity. For example, when searching for the term "apple", how could a search engine determine whether the user means Apple's products or the apple that is a kind of fruit? LDA, on the other hand, tries to solve the problem of ambiguity by comparing a document to two topics and determining which topic is closer to the document, across all combinations of topics that seem most relevant. For that reason, LDA helps an Information Retrieval system (such as a search engine) determine which documents are most relevant to which topics, and for that reason the results show that LDA is more accurate than LSI for generating topics.


Figure 19: A visualization of the probabilistic generative process for three documents [103].

As we can see from Figure 19, DOC1 draws from Topic 1 with probability 1, DOC2 draws from Topic 1 with probability 0.5 and from Topic 2 with probability 0.5, and DOC3 draws from Topic 2 with probability 1. The topics are represented by β1:K (with K = 2 in this case), where β is the corpus level parameter and K represents the number of topics to be generated.
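In the standard notation of Blei et al., the process visualized in Figure 19 can be summarized as follows (a restatement of the model, not a new assumption): for each document d a topic-proportion vector is drawn, and each word is produced by first sampling a topic assignment and then sampling the word from that topic's distribution:

    \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
    z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
    w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})

In the example of Figure 19, the topic proportions are θ_DOC1 = (1, 0), θ_DOC2 = (0.5, 0.5), and θ_DOC3 = (0, 1).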


Figure 20: The intuitions behind latent Dirichlet allocation [32].

5.3.1 Experiments on an English corpus

5.3.1.1 Processing Steps

A corpus consists of a set of documents. The present study makes use of the Gensim tool (an open source topic-modeling tool implemented in the Python programming language) to run the LSI and LDA experiments. Each corpus is saved in one text file containing a number of lines equal to the number of documents in that corpus; in other words, each line contains one document. The first corpus is for English Wikipedia documents and the second corpus is for Arabic Wikipedia documents.
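A minimal sketch of this Gensim workflow is shown below; the file name, the whitespace tokenization, and the topic and keyword counts are illustrative placeholders rather than the exact configuration used in our experiments.

    # Build a dictionary and a bag-of-words corpus from a one-document-per-line
    # text file, then fit LSI and LDA models with Gensim.
    from gensim import corpora, models

    def load_documents(path):
        # Each line of the file holds one (already preprocessed) document.
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f if line.strip()]

    docs = load_documents("english_corpus.txt")          # placeholder file name
    dictionary = corpora.Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=10)
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=10, passes=5)

    # Inspect the ten keywords that represent each generated topic.
    for topic_id in range(10):
        print("LSI topic", topic_id, lsi.show_topic(topic_id, topn=10))
        print("LDA topic", topic_id, lda.show_topic(topic_id, topn=10))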


5.3.1.2 English Corpus Creation

Constructing the corpus is an important step for topic modeling. The second step in the LSI and LDA analysis requires pre-processing the corpus by removing non-English words, removing digits and punctuation marks { : , ; ? @ % * ! & $ # [ ] … }, and removing stop words. Stop words are words, such as prepositions and articles, that have less lexical meaning compared to content words such as verbs or nouns. The importance of creating such a list stems from the fact that LSI and LDA should capture the relations between meaningful words in order to reach a good level of accuracy. Also, in all the experiments, we used the Paice/Husk stemmer, as it showed superior performance over other stemming approaches. Using stems for searching increases recall by retrieving terms that have the same roots but different endings.
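The following sketch illustrates such a preprocessing pass; the regular expression and the stop-word list are illustrative assumptions, and NLTK's LancasterStemmer is used because it implements the Paice/Husk algorithm.

    # Strip digits/punctuation, remove stop words, and stem with NLTK's
    # LancasterStemmer, which implements the Paice/Husk algorithm.
    import re
    from nltk.corpus import stopwords                # requires nltk.download("stopwords")
    from nltk.stem import LancasterStemmer

    STOPWORDS = set(stopwords.words("english"))
    stemmer = LancasterStemmer()

    def preprocess_english(text):
        text = text.lower()
        text = re.sub(r"[^a-z\s]", " ", text)        # keep English letters only
        tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 1]
        return [stemmer.stem(t) for t in tokens]

    print(preprocess_english("The indexes and the index: teaching information retrieval."))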

Paice proposes some metrics to evaluate a stemmer regardless of the task carried out: the under-stemming index (UI), the over-stemming index (OI), the stemming weight (SW), and an error rate relative to truncation (ERRT) [82]. According to Paice, Paice/Husk is the strongest stemmer, followed by Lovins, which is still considered a strong stemmer, and finally Porter, which is the weakest among the three. Later, in 2003, Frakes and Fox supported the Paice/Husk result and affirmed that the Paice/Husk stemmer is the strongest, followed by the Lovins stemmer, which is considerably stronger than the Porter stemmer [82]. Table 12 shows a summary of stemmer features.


Table 12: Summary of stemmer features [82].

Stemmer      Number of rules   Number of suffixes   Use of recoding   Use of partial matching   Strength      Use of constraint rules
Lovins       29                294                  35 rules          Yes                       Strong        Yes
Dawson       unknown           1200                 No                Yes                       Strong        Yes
Porter       62                51                   No                No                        Weak          Yes
Paice/Husk   115               unknown              No                No                        Very strong   Yes
Krovetz      unknown           5                    Yes               No                        Very weak     No

It is worth mentioning that only ten keywords are used to represent each topic, regardless of whether the corpus is classified into 5, 10, or 15 topics. The reasons behind selecting ten words as keywords are supported by the experiments, which show that: 1) selecting five keywords or fewer does not necessarily represent a topic such as politics or economics, since these five keywords can describe different topics; 2) selecting 10 keywords is considered sufficient to represent a topic; 3) selecting 15 keywords or more can be used to represent very specific topics, such as politics in the Middle East. Moreover, manual investigation, in this regard, refers to the user's judgment, i.e., only the user can decide whether the keywords chosen for each topic are relevant or not.

5.3.1.3 Experiment 1

We apply LSI to an English corpus, and we set the number of topics to five, and

the number of keywords for each topic to ten. After manual investigation, we find that the

accuracy is about 82%.


5.3.1.4 Experiment 2

We apply LSI to an English corpus, and we set the number of topics to ten, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 84%.

5.3.1.5 Experiment 3

We apply LSI to an English corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 79%.

5.3.1.6 Experiment 4

We apply LDA to an English corpus, and we set the number of topics to five topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 85%.

5.3.1.7 Experiment 5

We apply LDA to an English corpus, and we set the number of topics to ten topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 88%.

5.3.1.8 Experiment 6

We apply LDA to an English corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 81.3%.


5.3.2 Experiments on an Arabic corpus

5.3.2.1 Arabic Corpus Creation

Constructing the corpus is an important step for topic modeling in LSI and LDA.

The second step in the LSI and LDA analysis requires pre-processing the corpus by performing the following steps:

 Remove digits and punctuation marks {: @ % * ! & $ # [ ] …}.

 Remove all vowels {~, ُ, َ, ِ, …}.

 Replace "ا", "آ", "إ", "ؤ", "ئ", and "ء" with the letter "أ" (aleph). The reason for this conversion is the fact that all forms of Hamza "ء" are represented in dictionaries as one form, and people often misspell the different forms of aleph "أ".

 Replace "ة" with "ه". The reason behind this normalization is the fact that there is not a single convention for spelling "ة" or "ه" when they appear at the end of a word.

 Replace "ى" with "ي". The reason behind this normalization is the fact that there is not a single convention for spelling "ى" or "ي" when they appear at the end of a word.

 Remove diacritics.

 Remove "~".

 Replace the letter "ئ" with the letter "ء".

 Replace the letter "ؤ" with the letter "ء".

 Remove all non-Arabic words in the documents.

 Remove all Arabic function words (stop words), such as "من", "عن", "في", "حتى", "حين", etc. The Arabic function words are words, such as pronouns and prepositions, that are not useful in a document retrieval system.

 Apply the Light-10 stemmer. In all the experiments on the Arabic corpus, we apply the Light-10 stemmer, which was developed earlier by Larkey (2007). This stemmer was adopted as a result of its superior performance over other stemming approaches [76].

Instead of stemming the whole corpus before indexing it, the Light-10 stemmer is based on grouping sets of words with the same stem that were found in the same document into a dictionary. Later on, this dictionary can be used in expansion. This helps reduce the probability of matching words that have the same stem but do not have the same meaning, since they must first be found in the same document in a given corpus in order to be used in expansion. Consider the following example in Table 13:

Table 13: Example of two words sharing the same stem but having different senses [76].

Arabic Word    Stem    English Equivalent
الطاعة          طاع      Obedience
الطاعون         طاع      Plague

We see that both words share the same stem "طاع", yet we do not expand the word "الطاعة" with the word "الطاعون", as there is no document in the corpus that contains both words.
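A rough sketch of the normalization steps listed above, together with the per-document grouping of same-stem words illustrated in Table 13, is given below. Here light10_stem is only a placeholder for an implementation of Larkey's Light-10 stemmer, and the stop-word set is a tiny illustrative subset.

    import re
    from collections import defaultdict

    DIACRITICS = re.compile(r"[\u064B-\u0652\u0670~]")       # short vowels, shadda, sukun, "~"

    def normalize_arabic(text):
        """Apply the normalization steps listed above (slightly simplified)."""
        text = DIACRITICS.sub("", text)                      # remove vowels / diacritics
        text = re.sub(r"[اآإؤئء]", "أ", text)                # unify aleph / hamza forms
        text = text.replace("ة", "ه")                        # ta marbuta -> ha
        text = text.replace("ى", "ي")                        # alef maqsura -> ya
        return re.sub(r"[^\u0621-\u064A\s]", " ", text)      # drop digits, punctuation, non-Arabic

    # A tiny illustrative subset of the Arabic function (stop) words.
    ARABIC_STOPWORDS = {normalize_arabic(w) for w in ("من", "عن", "في", "حتى", "حين")}

    def light10_stem(word):
        # Placeholder for the Light-10 stemmer: only a leading definite article
        # is stripped here so that the sketch stays self-contained.
        return word[2:] if word.startswith(("ال", "أل")) and len(word) > 4 else word

    def build_stem_dictionary(documents):
        """Group same-stem words only when they co-occur in one document (cf. Table 13)."""
        stem_dict = defaultdict(set)
        for doc in documents:
            tokens = [w for w in normalize_arabic(doc).split() if w not in ARABIC_STOPWORDS]
            per_doc = defaultdict(set)
            for w in tokens:
                per_doc[light10_stem(w)].add(w)
            for stem, forms in per_doc.items():
                if len(forms) > 1:                           # only co-occurring forms are grouped
                    stem_dict[stem] |= forms
        return stem_dict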


5.3.2.2 Experiment 1

We apply LSI to an Arabic corpus, and we set the number of topics to five topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 74.1%.

5.3.2.3 Experiment 2

We apply LSI to an Arabic corpus, and we set the number of topics to ten topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 77%.

5.3.2.4 Experiment 3

We apply LSI to an Arabic corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 71.2%.

5.3.2.5 Experiment 4

We apply LDA to an Arabic corpus, and we set the number of topics to five topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 76.3%.


5.3.2.6 Experiment 5

We apply LDA to an Arabic corpus, and we set the number of topics to ten topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 79.8%.

5.3.2.7 Experiment 6

We apply LDA to an Arabic corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 73.4%.

Table 14 and Table 15 summarize the experimental results.

Table 14: Accuracy for topics generation from an English corpus.

English corpus Accuracy

LSI with 5 topics, number of keywords = 10 82%

LSI with 10 topics, number of keywords =10 84%

LSI with 15 topics, number of keywords = 10 79%

LDA with 5 topics, number of keywords = 10 85%

LDA with 10 topics, number of keywords =10 88%

LDA with 15 topics, number of keywords = 10 81.3%


Table 15: Accuracy for topics generation from an Arabic corpus.

Arabic corpus Accuracy

LSI with 5 topics, number of keywords =10 74.1%

LSI with 10 topics, number of keywords = 10 77%

LSI with 15 topics, number of keywords =10 71.2%

LDA with 5 topics, number of keywords = 10 76.3%

LDA with 10 topics, number of keywords = 10 79.8%

LDA with 15 topics, number of keywords = 10 73.4%

5.4 Discussion

As we discussed earlier in section 5.3 (experimental results), the experiments show that LDA is more accurate in generating the topics than LSI for both Arabic and English corpora. The LDA is more accurate in generating the topics than LSI for the English documents by about 4 percentage points, whereas, LDA is more accurate in generating the topics than LSI for the Arabic documents by about 3 percentage points.

Also, the experiments show that LSI is faster in generating the topics than LDA for both Arabic and English corpora. Moreover, the experiments show that the topics generated for English documents with both techniques, LSI and LDA, are more accurate than the topics generated for Arabic documents with both techniques. To the best of my knowledge, there is no study that compares the accuracy of generating topics between LSI and LDA for both English and Arabic documents.


In topic modeling for the Arabic language, only a few works have been published so far. EL-Kourdi and Ben Said [48] suggest a new approach to classify Arabic documents automatically based on the Naïve Bayes algorithm. They used 1500 web documents, and they classified the documents into five categories. They report that the accuracy is about 68.78%. When we compare our results to their results, we find that our results show an improvement in accuracy of more than 5 percentage points when we use LSI to generate 5 topics, and an improvement of about 8 percentage points when we use LDA to generate 5 topics.

In our experiments, we find that the accuracy in classifying Arabic topics is less than the accuracy in classifying English topics due to the following challenges:

 Arabic language morphology is considered a complex morphology. For that

reason, Arabic is considered one of the highly varied languages. In general, an

Arabic word can be classified into one of three morpho-syntactic categories:

nouns, verbs, or particles. Other scholars try to use other categories, such as

prepositions and adverbs that are used in English language, but this method did

not add any advantage [73, 83, 104].

 In Arabic language the root is an important element since different words may

be derived based on specific patterns or schemas. The root is considered a

linguistic unit that provides the semantics for both Arabic and English language.

For Arabic language, the root is a non-vocalized word that usually consists of

three letters, and sometimes four or five letters. Table 16 provides some derivations from the root علم [66, 104].


Table 16: Some derivations from the root علم [35].

Arabic Word       عُلِم        عَلَّم              عَلَم          عِلم
English Meaning   Be known   Teach/instruct   Flag/banner   Knowledge

 Vocalization: In Arabic language the words are vocalized with diacritics such

as {~, ُ,َُ , ُ ِ ُ,…}, but in real life only the holy Qur’an and some formal documents

include full vocalization. Readers with sufficient knowledge of Arabic may

understand Arabic texts without vocalization by depending on the context of

the word or the term in question. However, this is not always the case. This fact

emphasizes the ambiguity that Arabic words might cause, which is a problem

that needs to be taken care of, specifically as part of the morphology. Table 17

gives an example for a word composed of three letters [35].


Table 17: Four possible solutions for the word بسم.

Solution   Morphology    Vocalization   English Meaning
1          Noun          بَسْم           Smiling
2          Verb          بَسَم           Smile
3          Prep + Noun   بِسْم           By the name of
4          Prep + Noun   بِسَم           With poison

 Agglutination: refers to the idea of using more than one lexical unit within the

same word. These lexical units can be embodied within the word and linked

together at the same time. In Arabic language, a word can be extended by

attaching four kinds of affixes. The four affixes are antefixes, prefixes, suffixes,

and postfixes. Figure 21 shows several types of affixes that are attached to the core علم in the agglutinated form سيعلمونه. This case can lead to high ambiguity when extracting the correct form from the agglutinated form. In addition, as we demonstrated above in Table 17, the morphology analysis becomes more difficult for non-vocalized texts [35].


Figure 21: Segmentation of the Arabic agglutinated form سيعلمونه [35].

According to the results in Table 14 and Table 15, we find that the most accurate results after manual investigation occur when the number of topics equals ten. We applied our experiments by setting the number of topics to five, ten, and fifteen topics, respectively.

When we classify our corpus into five topics, the returned results were general, i.e., not detailed. For example, terms such as "information" and "bacteria" were placed in one topic (sciences), but when we classify our corpus into ten topics, terms such as "information" and "bacteria" were classified into two topics (computers and medical). This supports the claim that our system is working well, since the first five topics are also included in the ten topics. But when we classify our corpus into fifteen topics, there were some redundant results. For example, the term "information" appeared in several topics, such as computers, medical, and finance.


In order to find the LSI time complexity, LSI needs to be solved by Singular Value Decomposition (SVD), as we discussed in Chapter 3. For more details about SVD, see the Appendix. Normally, the time complexity of the SVD application is O(min{MN², NM²}), where M refers to the number of rows of the input matrix and N refers to the number of columns. Therefore, the time complexity becomes a problem when both M and N are large [108]. On the other hand, the time complexity of inference in LDA strongly depends on the effective number of topics per document. There are two possible cases:

Case one: When a document is generated from a small number of topics, the time complexity for the MAX-LDA can be solved in polynomial time.

Case two: If a document can use an arbitrary number of topics, the time complexity for the MAX-LDA is NP-hard.

Moreover, the application of LDA can be computationally expensive both in terms of time and memory. The time complexity of LDA is O(MNT + T³). In terms of space or memory, LDA requires O(MN + MT + NT), where M refers to the number of samples, N refers to the number of features, and T = min(M, N) [38, 101]. For more details about LDA, see Appendix A.
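To make the SVD step behind LSI concrete, a toy numerical example is shown below; the term-document matrix is invented for illustration and is not data from our corpora.

    # Factor a small M x N term-document matrix with SVD and keep the top k
    # singular values/vectors, as LSI does.
    import numpy as np

    A = np.array([[2, 0, 1, 0],       # toy counts: M = 5 terms, N = 4 documents
                  [1, 1, 0, 0],
                  [0, 3, 1, 0],
                  [0, 0, 2, 1],
                  [1, 0, 0, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # cost grows like O(min(M*N^2, N*M^2))
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation used by LSI
    print(np.round(A_k, 2))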


Chapter 6

Topic Modeling and Query Expansion

6.1 Introduction

As we discussed in the previous chapters, the current web is understandable by humans only. That is to say, machines cannot understand information on the Web as people can understand it. To explain this problem in more detail, consider the following examples.

Example 1: Assume that a person writes a query in English by using the term "bank" to search the web for a financial institution. Example 2: Assume that a person writes a query in Arabic by using the term "علم" to search the web for a flag.

The web search engine may return results in Example 1 related to a financial institution, or it might search for the geographical meaning of the word bank, which can refer to a side of a river. In the second example the web search engine may return results related to a flag, or it might search for knowledge, or for a famous person or thing; this can occur since the word "علم" without vocalization can refer to any of the three mentioned meanings. This means that the retrieved results have no semantic relationships. In order to solve the ambiguity problem, we need to add an extra semantic layer to the current Web in order to enable machines to understand what a Web page is about.


Since the arrival of the Semantic Web, many Web applications were developed to take advantage of the capabilities provided by the Semantic Web technologies such as intelligent reasoning over data, semantic search, and data interoperability. However, most

Semantic Web technologies focus on processing Latin-family scripts, and only a few studies have applied Semantic Web technologies to develop Arabic language applications.

Query expansion (QE) is the process of reformulating the original query by evaluating the user's query words and then expanding the query words in order to match additional relevant documents.

The process of query expansion is considered a complex task, since a query expansion engine needs to analyze the search results before displaying and ranking them as relevant search documents. In general, query expansion (QE) works as follows:

 User enters a query to the search engine.

 Based on its underlying QE algorithm, the search engine tries to enhance the

original query by matching additional documents that might be important to the

user.

 The search engine displays the documents based on their relevancy; in other words,

more relevant documents will be displayed first followed by less relevant

documents.

Query expansion (QE) is used to improve retrieval performance in Information

Retrieval processes [107]. Query expansion is used to:


 Find the synonyms of the words and then search for the synonyms,

 Find morphological forms of words by using a stemmer for each word in the search query,

 Re-weight the terms in the original query.

Most of the time, users do not use the best terms in order to formulate search queries. Search engines use query expansion to increase the quality of user search results.

Expanding a query can be done by stemming the user’s original query terms, which leads to matching more documents.

Matching more documents can be a result of two factors. The first is by using alternate word forms for the original terms entered by the user. The second is by searching for synonyms of the user’s original query terms. The recall in many cases will be increased at the cost of precision.

There are problems in query expansion, such as deciding which terms are important enough to include and handling the precision and recall tradeoff. The precision and recall tradeoff concerns whether it is worthwhile to perform query expansion at all, since high recall in many cases causes a decrease in precision.

The current research is focused on applying the topic modeling approach using LSI and LDA in order to discover and classify the topics that occur in a collection of documents

(semantic representation of a specific corpus). Afterward, we will use the topic words that are generated by applying LSI and LDA for each topic to formulate the queries in order to


measure the accuracy in terms of recall and precision. Then, we will expand the query in order to explore and discover the hidden relations between the documents, and to enrich the retrieval process with the more organized, topic-driven data obtained from topic modeling. Finally, we will compare the accuracy of the results between the two approaches.

The experiments are applied over Arabic and English corpora.

This chapter explains the steps of the proposed approach for Arabic and English documents and shows the experimental results. Finally, a discussion is presented in a separate section.

6.2 Why use a combination of topic modeling and query expansion?

The usage of both methods can be attributed to several reasons. On the one hand, query expansion can enhance the recall but, in many cases, it may hurt the precision. On the other hand, topic modeling can help classify the corpus with a specific number of topics.

This classification should result in improving the accuracy of the system. To better explain this situation, consider the following example. Let us classify a general corpus into four specific topics: politics, economy, food, and sport, in order to formulate and narrow the query for each topic. In this situation, searching the political topic will be done by using the keywords extracted from the classified political topic. Thus, adopting such an approach may result in decreasing the search time and enhancing the overall accuracy.

Topic modeling can also provide a glossary that was originally extracted from the corpus based on the classified topics. For example, the query "apple laptop" in a corpus that is not classified according to topic modeling might show inaccurate results. These


results might show documents that contain the word apple related to a food or economy topic. However, applying topic modeling to the corpus should determine that "apple laptop" is more likely to be found in an economy or computer topic. The usage of query expansion can retrieve the related documents based on the expanded words such as computer, personal computer, apple air, apple mac, etc.

To test the possibility of using both methods, i.e., topic modeling and query expansion, this research applies topic modeling to two different types of corpora. The first type is an Arabic corpus and second one is an English corpus.

6.3 Semantic Search

The purpose of query expansion is to take advantage of the main features of a semantic search that make it a more appealing choice over the traditional keyword based techniques. The main features of a semantic search are handling generalizations, handling morphological variants, handling concept matches, and handling synonyms with the correct sense (Word Sense Disambiguation).

6.3.1 Handling Generalizations

The main goal of handling generalizations is to enable the system to analyze a user's query in order to provide the user with pages that contain material which is considered related or relevant to sub-concepts of the user's query. The following example, shown in Table 18, illustrates the handling generalizations feature with a query that contains the general term or concept "عنف" (violence) [76].


Table 18: Handling generalizations

User's Query in Arabic    Equivalent Query in English
اعمال عنف في افريقيا        "Violence in Africa"

Handling generalizations aims at allowing the system to analyze queries submitted by users to deliver relevant results to the user’s sub-concepts.

Semantic-based search engines should be able to recognize documents with similar concepts, such as "ابادة" (extermination), "قمع" (suppression), and "تعذيب" (torture), as relevant to the user's query. Using semantic search, if the user query is "اعمال عنف في افريقيا", the system should match the documents that describe the user query. The query should also be expanded to match documents that describe topics which include similar concepts, with similar semantic meaning, related to the original query, such as "تعذيب في افريقيا", "قمع في افريقيا", "ابادة في افريقيا", and "عنف في افريقيا".

In this case it will improve the original query by retrieving more relevant documents using the semantic meaning rather than exact matching. Handling generalization can be achieved by applying topic modeling techniques such as LSI and LDA.
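The following toy sketch shows one way such semantic matching can be realized with an LSI space in Gensim; the three miniature documents, the query, and the number of topics are illustrative assumptions only.

    # Project a query into an LSI space and rank a toy corpus by cosine
    # similarity, so conceptually related documents can match without
    # sharing exact keywords.
    from gensim import corpora, models, similarities

    docs = [["violence", "africa", "conflict"],
            ["torture", "africa", "prisoners"],
            ["football", "league", "results"]]
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)
    index = similarities.MatrixSimilarity(lsi[bow])

    query_bow = dictionary.doc2bow(["violence", "africa"])
    sims = index[lsi[query_bow]]                       # similarity of the query to every document
    print(sorted(enumerate(sims), key=lambda pair: -pair[1]))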

6.3.2 Handling Morphological Variations

Handling morphological variations is based on allowing the system to use words derived from the same root as words of the original query. To illustrate further the


mechanism of handling morphological variations, consider the following example as described in Table 19 [76].

Table 19: Handling Morphological Variations

User's Query in Arabic              Equivalent Query in English
"التطور في الشرق الأوسط"              "Development in the Middle East"

Documents that contain morphological variants of the word "التطور" (development), such as "تطور" and "تطوير", should also be considered relevant to the user's query. As shown in Table 19, if the user's original query is "التطور في الشرق الأوسط", a semantic search should handle morphological variations of the word by using a stemmer. In this case, the original query "التطور في الشرق الأوسط" should also retrieve different variants of the word "التطور" in the same context, such as "التطورات في الشرق الأوسط", "تطور في الشرق الأوسط", and "تطوير في الشرق الأوسط". This can be achieved by using the stemmer for both Arabic and English.

6.3.3 Handling Concept Matches

Handling concept matches is based on allowing the system to use concepts or named entities that the words in the original query can refer to. To illustrate further the mechanism of handling concept matches, consider the example described in Table 20

[76].


Table 20: Handling Concept Matches

User's Query in Arabic    Equivalent Query in English
"مصر"                      "Egypt"

The term "مصر" has other equivalent expressions, such as "جمهورية مصر العربية" and "أرض الكنانة", so documents that contain any of these expressions should be considered as relevant to the term "مصر". Also, the term "UK" has other equivalent expressions, such as "England", "United Kingdom", and "Great Britain", so documents that contain any of these expressions should be considered relevant to the term "UK".

6.3.4 Handling Synonyms with Correct Sense

In general, the meaning of Arabic words depends on their diacritics. The word "شعب" can have different meanings according to its diacritization. Consequently, systems should take this issue into consideration for expansion. For example, Table 21 shows different senses of the word "شعب" [76].

Table 21: Different senses for the word "شعب"

Arabic vowelized word    English equivalent    Arabic synonyms
شَعب                      People, Nation        مواطنين، أمم
شُعب                      Branches              فروع


In order to make query expansion able to solve the problems of handling generalizations, handling concept matches, and handling synonyms with the correct interpretation (word sense disambiguation), our system uses topic modeling with LSI and LDA, which rely on the semantic meaning. As for handling morphological variants, this problem is solved by applying the preprocessing steps first and then the stemmer.

6.4 Methodology

In this section, we summarize the steps of our approach and we give the details for each step.

1. We built the following two corpora for our system:

 The first corpus is for the system with Arabic Wikipedia documents, which is

called Arabic corpus.

 The second corpus is for the system with English Wikipedia documents, which

is called English corpus.

There are many preprocessing steps that are going to be applied to the above two corpora: stemming, normalization and handling composite terms (separating terms), as we mentioned in Chapter 5.

2. As a next step, we will classify and categorize the two corpora into a specific

number of topics (using topic modeling). We do topic modeling using the Gensim

tool.


Gensim is considered one of the most effective topic modeling toolkits. The main reason for using Gensim is its ability to handle large text collections using efficient online algorithms [93]. For more details about Gensim see Appendix A.

 The Arabic corpus is classified into five topics by using LSI, which in this study

is called LSI five topics Arabic corpus.

 The Arabic corpus is classified into ten topics by using LSI, which in this study

is called LSI ten topics Arabic corpus.

 The Arabic corpus is classified into fifteen topics by using LSI, which in this

study is called LSI fifteen topics Arabic corpus.

 The Arabic corpus is classified into five topics by using LDA, which in this

study is called LDA five topics Arabic corpus.

 The Arabic corpus is classified into ten topics by using LDA, which in this

study is called LDA ten topics Arabic corpus.

 The Arabic corpus is classified into fifteen topics by using LDA, which in this

study is called LDA fifteen topics Arabic corpus.

 The English corpus is classified into five topics by using LSI, which in this

study is called LSI five topics English corpus.

 The English corpus is classified into ten topics by using LSI, which in this study

is called LSI ten topics English corpus.

 The English corpus is classified into fifteen topics by using LSI, which in this

study is called LSI fifteen topics English corpus.


 The English corpus is classified into five topics by using LDA, which in this

study is called LDA five topics English corpus.

 The English corpus is classified into ten topics by using LDA, which in this

study is called LDA ten topics English corpus.

 The English corpus is classified into fifteen topics by using LDA, which in this

study is called LDA fifteen topics English corpus.

3. We compare the efficiency and the accuracy for the Arabic corpus and the English

corpus as follows:

 For LSI five topics Arabic corpus, we generate a query for each topic in this

category using the same keywords yielded in each topic. These keywords

which are used to form the query were identified as representative keywords

for the topic in question. This observation applies to all the next bulleted

descriptions.

 For LSI ten topics Arabic corpus, we generate a query for each topic in this

category using the same keywords yielded in each topic.

 For LSI fifteen topics Arabic corpus, we generate a query for each topic in

this category using the same keywords yielded in each topic.

 For LDA five topics Arabic corpus, we generate a query for each topic in

this category using the same keywords yielded in each topic.

 For LDA ten topics Arabic corpus, we generate a query for each topic in

this category using the same keywords yielded in each topic.


 For LDA fifteen topics Arabic corpus, we generate a query for each topic

in this category using the same keywords yielded in each topic.

Typically, the result of each query will retrieve the documents that are related to the specific topic (query). Here, the results are measured according to the time needed for producing the retrieved list of each query and according to the accuracy.

At the end of this step we have obtained two main contributions. The first contribution is to do topic modeling for a system with semantic representation. The importance of this contribution stems from the fact that classifying a huge corpus can simplify the process of searching for documents. For example, searching the corpus in question can be done more easily if the corpus has been classified into different sub-topics such as economic, political, and scientific topics. The second contribution is to formulate the query based on the same keywords yielded in each topic, as explained above.

4. We compare the efficiency and the accuracy for the yielded topics of the Arabic corpus and the English corpus as follows:

 For LSI five topics English corpus, we generate a query for each topic in

this category using query expansion.

 For LSI ten topics English corpus, we generate a query for each topic in this

category using query expansion.

 For LSI fifteen topics English corpus, we generate a query for each topic in

this category using query expansion.

 For LDA five topics English corpus, we generate a query for each topic in

this category using query expansion.


 For LDA ten topics English corpus, we generate a query for each topic in

this category using query expansion.

 For LDA fifteen topics English corpus, we generate a query for each topic

in this category using query expansion.

At the end of this step we have obtained three main contributions. The first contribution is to do topic modeling for a system with semantic representation, the second contribution is to formulate the query based on the same keywords yielded in each topic, and the third contribution is to expand the original query terms by using WordNet for both

Arabic and English corpora in order to handle synonyms. We will discuss these issues in more detail in the next section.

6.4.1 Query Expansion

Many resources such as ontologies have been used for different knowledge-based applications. Among the knowledge-based applications are natural language processing, intelligent information integration, Information Retrieval, etc. [21]. Ontologies have been used widely in the Semantic Web. The primary purpose of the Semantic Web is "annotation of the data on the Web with the use of ontologies in order to have machine-readable and machine-understandable Web that will enable computers, autonomous software agents, and humans to work and cooperate better through sharing knowledge and resources" [7].

Ontologies have many features such as denoting knowledge, relating multiple concepts to a specific domain, and establishing relationships among these concepts [57].

These features give ontologies the advantage to be used as a query expansion method.


The WordNet ontology is considered to be one of the ontologies that can be used in query expansion. The main function of this ontology is that it can classify different

English terms into groups of synonyms, which are called synsets. The WordNet ontology can define the relationships among these synonyms, and it can give suggested definitions

[80]. In other words, WordNet can function as both a thesaurus and a dictionary.

The latest version of the WordNet, version 3.1, contains 155,287 words. These words are structured in 117,659 synsets for a total of 206,941 word-sense pairs. The size of this database is roughly 12 megabytes in compressed form [80].

In general, WordNet contains the major lexical categories, such as adjectives, nouns, adverbs, and verbs. However, WordNet neglects other secondary lexical categories such as prepositions.

Deciding on the similarity between words is considered to be one of the popular uses of WordNet. Consequently, many algorithms have been introduced [17, 90]. They can measure the distance among the synsets and the words "in WordNet's graph structure, such as by counting the number of edges among synsets" [88]. For example, we can consider two words as synonyms if the distance between the two words equals 2 edges.
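A small sketch of this edge-counting idea, using NLTK's WordNet interface, is shown below; the exact distance threshold used to call two words synonyms is a design choice and not prescribed by WordNet itself.

    # Count the shortest number of edges between the synsets of two words in
    # WordNet; words whose synsets are close can be treated as near-synonyms.
    from nltk.corpus import wordnet as wn              # requires nltk.download("wordnet")

    def min_edge_distance(word1, word2):
        distances = []
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                d = s1.shortest_path_distance(s2)
                if d is not None:
                    distances.append(d)
        return min(distances) if distances else None

    print(min_edge_distance("teach", "instruct"))      # 0: the words share a synset
    print(min_edge_distance("car", "truck"))           # a small positive edge count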

Arabic WordNet is considered a lexical database for the Arabic language. The conception and methodology used for WordNet in English and European languages are also used for Arabic, i.e., Arabic WordNet is being constructed following methods developed for EuroWordNet.

Arabic WordNet structure is similar to a thesaurus; the organization of the Arabic

WordNet relies on the structure of synsets [30, 49, 50]. In other words, "sets of synonyms


and pointers describing relations to other synsets. Each word can belong to one or more synsets, and one or more categories of the discourse. These categories are organized in four classes: noun, verb, adjective, and adverb. Arabic WordNet is a lexical network whose nodes are synsets and relations between synsets are the arcs. It currently counts 11,269 synsets (7,960 nouns, 2,538 verbs, 661 adjectives, 110 adverbs) and 23,481 words" [2].

6.4.2 Our Work

To achieve our purpose, which is combining topic modeling (using LSI and LDA) and query expansion for both English and Arabic documents, our system contains many components (subsystems) that are used to represent the main points of evaluation, and then to summarize the results in the precision and recall measurements.

The system contains two corpora, one for Arabic Wikipedia documents and the other for English Wikipedia documents.

The system also contains two types of queries (Arabic and English). Figures 22, 23,

24 and 25 show the main components of our system.


Figure 22: System components for Arabic corpus using LSI topic modeling.

For the Arabic corpus, we first applied the preprocessing steps. As we discussed in Chapter 5, these steps include stop-word removal (e.g., "من", "عن", "في", "حتى", "حين"), removal of digits and punctuation marks {: , ; ? @ % * ! & $ # [ ] …}, removal of non-Arabic words, removal of all vowels {~, ُ, َ, ِ, …}, replacing "ا", "آ", "إ", "ؤ", "ئ", and "ء" with the letter "أ" (aleph), replacing "ة" with "ه", replacing "ى" with "ي", replacing the letter "ئ" with the letter "ء", replacing the letter "ؤ" with the letter "ء", and removing diacritics, including "~".


After that we apply the Light-10 stemmer; then we classify the Arabic corpus using LSI into a specific number of topics, each with a specific number of terms. In Figure 22 the arrow with number 1 refers to generating or formulating the query with the topic keyword terms to retrieve the relevant documents from the corpus. In other words, if we classify our corpus into ten topics (politics, economy, etc.), each with ten keyword terms, then we can generate 10 queries and each query contains 10 terms. Arrows with numbers 2 and 3 refer to expanding the query terms using the Arabic WordNet ontology.
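A possible sketch of this query-formulation step is shown below; here model stands for an already fitted Gensim LsiModel or LdaModel such as the ones built in Chapter 5.

    # Turn the generated topics into queries: the top ten keywords of each
    # topic become one ten-term query.
    def topics_to_queries(model, num_topics, topn=10):
        queries = []
        for topic_id in range(num_topics):
            keywords = [term for term, _weight in model.show_topic(topic_id, topn=topn)]
            queries.append(keywords)
        return queries

    # e.g. queries = topics_to_queries(lda_model, num_topics=10)
    # yields ten queries, each consisting of ten topic keywords.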

Figure 23: System components for Arabic corpus using LDA topic modeling.


The same steps discussed in the above paragraph are also applied for our second topic modeling technique, which is LDA.

Figure 24: System components for English corpus using LDA topic modeling.

For the English corpus, we first applied preprocessing steps. As we discussed in chapter 5, preprocessing steps include stop words removal, digits and punctuation marks removal, and non-English words removal. After that we apply the Paice/Husk stemmer, and then we classify the English corpus using LDA into a specific number of topics with a specific number of terms. In Figure 24 the arrow with number 1 refers to generating or


formulating the query with the topic terms to retrieve the relevant documents from the corpus based on the topic terms. In other words, if we classify our corpus into ten topics

(politics, economy, etc.), each with ten terms, then we can generate 10 queries and each query contains 10 terms. Arrows with numbers 2 and 3 refer to expanding the query terms using the WordNet ontology.

Figure 25: System components for English corpus using LSI topic modeling.

The same steps discussed in the above paragraph are also applied for our second topic modeling technique, which is LSI.


6.4.2.1 Stemming Subsystem

We applied stemming for each corpus as we discussed in Chapter 5, and we also applied stemming to the two types of queries (Arabic and English). For the English queries we applied the Paice/Husk stemmer, and for the Arabic queries we applied the Light-10 stemmer. Applying stemming can enhance the efficiency and the accuracy of the system: it adds a conceptual flavor by depending on stems instead of exact words, and so combines several words into one concept expressed by a stem. On the other hand, stemming can decrease the precision of the system while it increases its recall.

6.4.2.2 Query Expansion Subsystem

The query expansion subsystem expands the original query terms or keywords using three steps:

 Finding the set of synonyms (synset) for each keyword or term in the original

query,

 Selecting the best synonyms to use after finding the set of synonyms for each

keyword for query expansion,

 Expanding the query by using the pre-found best synonyms for the keywords

of the original query from step 2.

For example, searching the query "teaching foreign language" should return synonymous words such as learn, instruct, educate. The user may select instruct as the best


synonym for the term teach, then the query can be expanded to include the best synonyms for the original query.
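The three steps can be sketched as follows, using WordNet lemma names as candidate synonyms; here the "best synonym" is simply the first candidate, whereas in our system that choice is left to the user (or a ranking heuristic).

    # Step 1: collect WordNet synonyms for each query term; Steps 2-3: pick a
    # "best" synonym and append it to the query.
    from nltk.corpus import wordnet as wn

    def candidate_synonyms(term):
        names = {lemma.name().replace("_", " ")
                 for synset in wn.synsets(term)
                 for lemma in synset.lemmas()}
        names.discard(term)
        return sorted(names)

    def expand_query(query_terms):
        expanded = list(query_terms)
        for term in query_terms:
            candidates = candidate_synonyms(term)
            if candidates:
                expanded.append(candidates[0])         # stand-in for the user's choice
        return expanded

    print(expand_query(["teach", "foreign", "language"]))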

6.5 Experimental Results and Discussion

To evaluate the results of our system, we use the standard Information Retrieval

(IR) measurements recall and precision. The values of recall and precision are between 0 and 1 and are usually given in percentage. When the value of recall equals 0% that means that none of the relevant documents are retrieved, and when the value of recall equals 100% that means that all the relevant documents have been retrieved, although there could be retrieved documents that are not relevant. When the value of precision equals 0% means that all of the retrieved documents are irrelevant, and when the value of precision equals

100% that means all the retrieved documents are relevant, although there could be relevant documents that have not been retrieved [26].

As we mentioned earlier, there is a tradeoff between precision and recall. If the value of the recall is high, then the value of the precision usually is low, and if the value of the precision is high, then the value of the recall usually is low. Recall, precision, and F-measure are defined as follows:

Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

F-measure = (2 × Precision × Recall) / (Precision + Recall)

To measure the accuracy (recall, precision) of our system, we include the first 100 ranked retrieved documents for each query. We run 20 queries for each experiment and then we calculate the average recall and precision.
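The following sketch shows how these measures can be computed for one query; the document identifiers and relevance judgments in the example are hypothetical, and in the experiments the values are averaged over the 20 queries.

    # Compare the retrieved document ids against the judged relevant ids for
    # one query; in the experiments this is averaged over the 20 queries.
    def precision_recall_f(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f

    # Hypothetical judgments for a single query:
    print(precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 3, 5]))
    # -> (0.5, 0.666..., 0.571...)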

6.5.1 Experiment 1


Figure 26: Query one for Arabic corpus with LDA topic modeling.


After using the LDA topic modeling technique for the Arabic corpus, we use the 10 keywords already generated by the LDA to formulate the query. We chose to use 10 words because, after experimenting with 5, 10, 15, and 20 words, we found that a topic of 10 keywords is the most representative (fewer than 10 keywords produces more general topics, while more than 10 keywords gives more specific and restricted topics). This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.

6.5.2 Experiment 2


Figure 27: Query two for Arabic corpus with LDA topic modeling and query expansion.

After using the LDA topic modeling technique for the Arabic corpus, we use the

10 keywords already generated by the LDA to formulate the query and we use Arabic


WordNet ontology to expand the words that are used in the query. This step is applied to the three categories of 5, 10, and 15 topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15 topics. On the other hand, the average precision is slightly lower in the two cases of 5 and 15 topics.

6.5.3 Experiment 3


Figure 28: Query three for Arabic corpus with LSI topic modeling.

After using the LSI topic modeling technique for the Arabic corpus, we use the 10 keywords already generated by the LSI to formulate the query. This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.


6.5.4 Experiment 4


Figure 29: Query four for Arabic corpus with LSI topic modeling and query expansion.

After applying the LSI topic modeling technique for the Arabic corpus, we use the ten keywords already generated by the LSI to formulate the query, and we use Arabic

WordNet ontology to expand the keywords that are used in the query. This step is applied to the three categories of 5, 10, and 15-topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15-topics. On the other hand, the average precision is slightly less in only one case, which is the case of 10 topics.


6.5.5 Experiment 5


Figure 30: Query five for English corpus with LDA topic modeling.

After using the LDA topic modeling technique for the English corpus, we use the ten keywords already generated by the LDA to formulate the query. This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.


6.5.6 Experiment 6


Figure 31: Query six for English corpus with LDA topic modeling and query expansion.

After using the LDA topic modeling technique for the English corpus, we use the ten keywords already generated by the LDA to formulate the query, and we use WordNet ontology to expand the words that are used in the query. This step is applied to the three categories of 5, 10, and 15-topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15-topics. On the other hand, the average precision is slightly less in the two cases of 5 and 15-topics.


6.5.7 Experiment 7


Figure 32: Query seven for English corpus with LSI topic modeling.

After using the LSI topic modeling technique for the English corpus, we use the ten keywords already generated by the LSI to formulate the query. This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.


6.5.8 Experiment 8


Figure 33: Query eight for English corpus with LSI topic modeling and query expansion.

After using the LSI topic modeling technique for the English corpus, we use the ten keywords already generated by the LSI to formulate the query, and we use WordNet ontology to expand the words that are used in the query. This step is applied to the three categories of 5, 10, and 15-topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15-topics. On the other hand, the average precision is slightly less in the two cases of 10 and 15-topics.

From the previous results, we find that LSI is more accurate for retrieving relevant documents in terms of recall and precision than LDA. In all experiments we find the highest average recall and the highest average precision with the category of 10 topics. In all experiments we find that the expanded Arabic query with Arabic WordNet enhances the average recall for all categories, while the average precision is slightly lower in some cases; overall, in terms of F-measure, the system is enhanced with query expansion. Likewise, the expanded English query with WordNet enhances the average recall for all the categories, and the average precision is lower in some cases, but overall, in terms of F-measure, the system is enhanced with query expansion. Arabic expansion using the Arabic WordNet ontology enhances the overall accuracy of the system, but the English WordNet ontology enhances the accuracy more. The reason for that, as we mentioned in Chapter 5, is probably the Arabic language challenges that we face in the Semantic Web.


CHAPTER 7

Conclusion and Future Work

The dissertation addresses a very common problem we face today in the information society, which is information overload. This problem became more serious because of the huge size of the World Wide Web (WWW) and the rapid development of the Internet that the world has witnessed in the last decade. With the expected continuous growth of the WWW (in size, languages, and formats), search engines will have a hard time maintaining the quality of retrieval results, and it is very difficult to analyze the huge amount of information which queries now return. To deal with this problem, we need a new vision for the Web that is able to make intelligent choices and extract a better meaning from the information on the Internet. This new version of the Web is called the Semantic Web.

The contribution of this study lies in comparing the accuracy of generating the topics between LSI and LDA for both English and Arabic documents. Most of the studies that have been conducted dealt with English corpora and a few have coped with Arabic.

The studies that have been done on Arabic examine small corpora, and the accuracy was low. Therefore, my study is an attempt to compare Wikipedia documents in Arabic and English, which are huge in terms of size. Also, topic modeling is enhanced using query expansion for both Arabic and English.


Furthermore, previous research has not applied the combination of topic modeling and query expansion to Arabic. Since there is a lack of research on the Arabic language in the Semantic Web field, this research involves applying the work to an Arabic language corpus. Also, my study has compared the accuracy (in terms of precision and recall) of topic modeling and query expansion in both Arabic and English. Even though topic modeling and query expansion performed better for English than for Arabic, they still yield a tangible improvement for Arabic.

The dissertation also addresses some advantages of the current Semantic Web technologies that can be applied for Arabic language to allow Arabic users to benefit from the Semantic Web.

This dissertation focuses on investigating approaches to discover the topics that occur in a collection of documents based on a topic modeling approach, and then to expand the query by exploring and discovering hidden relations between the documents in an efficient manner according to accuracy. More precisely, the research explores how the application of advanced Semantic Web tools to problems in Information Retrieval (IR) can be utilized to enhance the accuracy. The work is applied on Arabic and English documents.

The current study hypothesizes that classifying the corpus with meaningful descriptive information, and then expanding the query by using Automatic Query

Expansion, will improve the results of the IR methods for indexing and querying information. In particular, the work uses topic modeling techniques, which are considered advanced Information Retrieval methods that have been widely used for indexing and analyzing the corpus and then applies advanced Semantic Web techniques in order to


increase the accuracy. We can summarize the main contributions of this dissertation as follows:

 Investigate approaches to discover the abstract "topics" that occur in a collection of documents based on topic modeling approaches, and then expand the query to explore and discover hidden relations between the documents.

 Demonstrate that classifying a corpus with meaningful descriptive information using Latent Semantic Indexing can improve the results of IR methods applied to Arabic and English documents.

 Demonstrate that classifying a corpus with meaningful descriptive information using Latent Dirichlet Allocation can improve the results of IR methods applied to Arabic and English documents.

 Test whether Latent Dirichlet Allocation has higher accuracy than Latent Semantic Indexing for Arabic documents, as it does for English documents.

 Test whether Latent Semantic Indexing is faster to execute than Latent Dirichlet Allocation for Arabic documents, as it is for English documents.

 Combine topic modeling and query expansion for English documents.

 Combine topic modeling and query expansion for Arabic documents.

Due to the unavailability of free Arabic resources, we adopted Arabic Wikipedia and English Wikipedia as the data sets for our experiments. The results show that Latent Semantic Indexing and Latent Dirichlet Allocation can identify topics for the corpus with high accuracy for both Arabic and English documents. The experiments were conducted with 5, 10, and 15 topics using both LSI and LDA. The results show that LDA is more accurate in generating the topics than LSI for both the Arabic and English corpora, while LSI is faster in generating the topics than LDA for both corpora. Moreover, the topics generated for the English documents are more accurate than the topics generated for the Arabic documents under both techniques. To the best of my knowledge, there is no study that compares the accuracy of topic generation between LSI and LDA for both English and Arabic documents. The following tables show the results.

Table 14: Accuracy of topic generation for the English corpus

English corpus                                 Accuracy
LSI with 5 topics,  number of keywords = 10    82%
LSI with 10 topics, number of keywords = 10    84%
LSI with 15 topics, number of keywords = 10    79%
LDA with 5 topics,  number of keywords = 10    85%
LDA with 10 topics, number of keywords = 10    88%
LDA with 15 topics, number of keywords = 10    81.3%


Table 15: Accuracy of topic generation for the Arabic corpus

Arabic corpus                                  Accuracy
LSI with 5 topics,  number of keywords = 10    74.1%
LSI with 10 topics, number of keywords = 10    77%
LSI with 15 topics, number of keywords = 10    71.2%
LDA with 5 topics,  number of keywords = 10    76.3%
LDA with 10 topics, number of keywords = 10    79.8%
LDA with 15 topics, number of keywords = 10    73.4%

In our experiments, we find that the accuracy in classifying Arabic topics is lower than the accuracy in classifying English topics due, at least in part, to the following challenges:

 The Arabic language has a complex morphology.

 In Arabic, which is one of the Semitic languages, the root is an important element from which different words may be derived based on specific patterns or schemes.

 In Arabic, words may be vocalized with diacritics (short vowel and other marks), but in real life only the holy Qur'an and some formal documents include full vocalization.

 In Arabic, a word can be extended by attaching four kinds of affixes: antefixes, prefixes, suffixes, and postfixes.


We applied our experiments to five, ten, and fifteen topics. According to the experimental results for topic modeling and a manual investigation, the most accurate results are obtained when the number of topics is equal to ten.

Query expansion (QE) is the process of reformulating the original query by evaluating the user's query words and then expanding the query in order to match additional relevant documents. The process of query expansion is considered a complex task, since a query expansion engine needs to analyze the search results before displaying and ranking them as relevant search documents.

The purpose of query expansion is to support the main features of semantic search that make it a more appealing choice than traditional keyword-based techniques. The main features of semantic search are: handling generalizations, handling morphological variants, handling concept matches, and handling synonyms with the correct sense (Word Sense Disambiguation).

To achieve our purpose, which is combining topic modeling (using LSI and LDA) and query expansion for both English and Arabic documents, our system contains several components (subsystems) that carry out the main evaluation steps and then summarize the results using precision and recall measurements.

For the Arabic corpus, we first applied preprocessing steps, which include stop-word removal, removal of digits and punctuation marks, etc. After that we apply the light-10 stemmer; then we classify the Arabic corpus using LSI and LDA into a specific number of topics, each with a specific number of terms. A query is generated from the terms of each topic to retrieve the relevant documents from the corpus. In other words, if we classify our corpus into ten topics (politics, economy, etc.), each with ten terms, then we can generate 10 queries, each containing 10 terms. After that we expand the query terms using the Arabic WordNet ontology.

For the English corpus, we first applied the same preprocessing steps (stop-word removal, removal of digits and punctuation marks, etc.). After that we apply the Paice/Husk stemmer; then we classify the English corpus using LSI and LDA into a specific number of topics, each with a specific number of terms. A query is generated from the terms of each topic to retrieve the relevant documents from the corpus. In other words, if we classify our corpus into ten topics (politics, economy, etc.), each with ten terms, then we can generate 10 queries, each containing 10 terms. After that we expand the query terms using the WordNet ontology.
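To make these steps concrete, the following Python sketch (using Gensim and NLTK's WordNet interface) illustrates the flow from topic model to expanded queries. It is a minimal illustration under simplifying assumptions, not the exact system used in the experiments: the two-document corpus, the crude length-based token filter, and the use of English WordNet are stand-ins for the real preprocessing, the light-10/Paice-Husk stemmers, and the Arabic WordNet ontology.

    # Minimal sketch of the topic-modeling + query-expansion flow (illustrative only).
    # Assumes gensim and nltk are installed and the NLTK WordNet data has been downloaded.
    from gensim import corpora, models
    from nltk.corpus import wordnet as wn  # English WordNet; Arabic WordNet would be used for Arabic

    docs = [
        "the economy grew as markets and trade expanded",
        "the election results changed the political landscape",
    ]

    # Preprocessing: lower-case, tokenize, and drop very short tokens (a crude stop-word filter).
    texts = [[w for w in doc.lower().split() if len(w) > 3] for doc in docs]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    # Topic modeling with LSI (models.LdaModel would be the LDA counterpart).
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

    # One query per topic, built from the topic's top terms.
    queries = [[term for term, _ in lsi.show_topic(k, topn=10)] for k in range(lsi.num_topics)]

    # Query expansion: add WordNet synonyms of every query term.
    def expand(query_terms):
        expanded = set(query_terms)
        for term in query_terms:
            for synset in wn.synsets(term):
                expanded.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
        return sorted(expanded)

    for k, q in enumerate(queries):
        print("topic", k, "expanded query:", expand(q))

The expanded queries would then be run against the indexed corpus, and precision and recall computed against the relevance judgments.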

In all the experiments we find that expanding the Arabic query with Arabic WordNet enhances the average recall for all the categories, while the average precision is slightly lower in some cases. But overall, relative to the F-measure, the system is enhanced by the query expansion.

Similarly, in all experiments we find that expanding the English query with WordNet enhances the average recall for all the categories, while the average precision is slightly lower in some cases. Overall, relative to the F-measure, the system is enhanced by the query expansion. Expansion using the Arabic WordNet ontology enhanced the overall accuracy of the system, but the English WordNet ontology enhanced the accuracy more.

The work presented in this dissertation forms the basis for a number of research opportunities in topic modeling and query expansion, bridging the gap between Information Retrieval and the Semantic Web. We plan to extend our work; the main directions for future work are as follows.

 Since the majority of existing tools and applications for the Semantic Web do not support the Arabic language, there is much to be done for the Semantic Web model to support Arabic. The Arabic Web is a good target for building an Arabic ontology that has its own vocabulary of terms to represent the semantic core of documents, and for studying the effect of every semantic relationship used in this process, such as synonymy and polysemy.

 Another future step is to study several other corpora to measure the impact of this approach on different corpora and evaluate the results. In this research the corpus was obtained from Wikipedia documents, which form a broad collection. A similar approach can be applied to more specific corpora, such as newspapers or journals in the medical, political, or social fields.

The results of this research will be published as two separate papers. The first paper deals with comparing the accuracy of topic generation between LSI and LDA for both English and Arabic documents. The second deals with combining topic modeling with query expansion for Arabic, since this combination has not been researched for the Arabic language in the Semantic Web field; it will compare the results of this process with those for an English corpus.

We have seen that the combined approach of topic modeling and query expansion is effective in enhancing search accuracy for the English corpus. Likewise, my study has also shown its effectiveness when applied to the Arabic corpus. Therefore, I recommend that future research examine the efficiency of this combined approach of topic modeling and query expansion on other languages, for which my study can serve as a model.


Appendix A

1. LSI

LSI has been used widely because of its ability to infer topics in a well-organized manner and for its solid theoretical background. LSI finds co-occurrences among a wide range of terms, which makes it possible to project the documents into a low-dimensional space. Inference is done with linear-algebra operations, namely a truncated Singular Value Decomposition (SVD) of the sparse term-document matrix. Applying the SVD enables us to project the documents into a low-dimensional space through a folding-in process. Figure 34 shows the Latent Semantic Indexing steps [25].

Corpus → Term-Document matrix → Singular Value Decomposition → Vectors (Semantic Space)

Figure 34: LSI Steps

As shown in Figure 34, before applying LSI over the corpus to build the vectors (semantic space), several preprocessing steps must be applied. First, documents in the corpus are treated as bags of words, and each document is represented as a vector. Second, we perform a transformation to obtain a tf-idf representation for each document (bag of words); tf-idf stands for term frequency-inverse document frequency.


The main goal of tf-idf, which is a numerical statistic, is to show the importance of each word to a document in a corpus. Term frequency (tf) is measured by counting the number of times that term t occurs in document d. The inverse document frequency reflects how many documents contain the term [77]:

idf(t, D) = log ( N / |{d ∈ D : t ∈ d}| )

where:

 N refers to the total number of documents in the corpus, and

 |{d ∈ D : t ∈ d}| refers to the number of documents in which the term t appears (i.e., tf(t, d) ≠ 0). If the term is not in the corpus, this leads to a division by zero, so it is common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.

Then tf-idf is calculated as: tf-idf(t, d, D) = tf(t, d) × idf(t, D).

Example: Suppose we have term frequency tables for a collection consisting of only two documents [77].

Table 22: Document 1

Term     Term count
this     1
is       1
a        2
sample   1

Table 23: Document 2

Term      Term count
this      1
is        1
another   2
example   3

a. Calculate the tf-idf for the term "this" in document 1:
tf(this, d1) = 1, and idf(this, D) = log(2/2) = 0. Since the term occurs in all documents, its tf-idf is zero.

b. Calculate the tf-idf for the term "example" in document 2:
tf(example, d2) = 3, idf(example, D) = log(2/1) = 0.3010, so tf-idf(example, d2, D) = 3 × 0.3010 = 0.903.
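As a quick check of this worked example, the short Python sketch below reproduces the two tf-idf values; it is only an illustrative helper, with the toy documents transcribed from Tables 22 and 23 and the base-10 logarithm matching the numbers above.

    import math

    # Toy collection from Tables 22 and 23 (terms repeated according to their counts).
    docs = {
        "d1": ["this", "is", "a", "a", "sample"],
        "d2": ["this", "is", "another", "another", "example", "example", "example"],
    }

    def tf(term, doc_terms):
        # Raw term frequency: number of occurrences of the term in the document.
        return doc_terms.count(term)

    def idf(term, collection):
        # log10(N / df), where df is the number of documents containing the term.
        n_docs = len(collection)
        df = sum(1 for terms in collection.values() if term in terms)
        return math.log10(n_docs / df)

    def tf_idf(term, doc_id, collection):
        return tf(term, collection[doc_id]) * idf(term, collection)

    print(tf_idf("this", "d1", docs))     # 0.0, since "this" occurs in every document
    print(tf_idf("example", "d2", docs))  # ~0.903 = 3 * log10(2)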

Third, LSI is used to build a corresponding vector that contains a description of the document (LSI-Space). Finally, different similarity measures are used in order to find and compare the similarities between all documents. The similarity between documents is typically measured by the cosine between the corresponding vectors, which increases as more terms are shared.


Cosine similarity measures the similarity between two vectors of an inner product space by calculating the cosine of the angle between them. If the angle is 0°, the cosine is 1; the cosine is less than 1 for any other angle. "It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1]" [24].

A and B represent two vectors of attributes, and their cosine similarity is cos(θ). The dot product and the magnitudes of the vectors are used to compute the similarity:

cos(θ) = (A · B) / (||A|| ||B||)

The resulting value ranges from -1 to 1, where -1 means exactly opposite, 1 means exactly the same, and in-between values indicate intermediate similarity or dissimilarity.

When it comes to Information Retrieval, the cosine similarity of two documents ranges from 0 to 1, for the reason that the term frequencies (tf-idf weights) cannot be negative. As a result, the angle between two term frequency vectors cannot be greater than 90°.

Example: Suppose we have two texts, and Table 24 shows the words in the texts and the frequency of each word. How do we calculate the cosine similarity between the two texts, T1 and T2 [24]?

Table 24: Two texts

Text   Advance   Semantic   Web   Tools
T1     1         1          0     1
T2     2         0          1     1
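The question above can be answered in a few lines of code. The sketch below computes the cosine similarity between the frequency vectors of T1 and T2 taken from Table 24; it is an illustrative calculation (the result is about 0.71), not part of the dissertation's system.

    import math

    # Frequency vectors over the terms (Advance, Semantic, Web, Tools) from Table 24.
    t1 = [1, 1, 0, 1]
    t2 = [2, 0, 1, 1]

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))        # A . B
        norm_a = math.sqrt(sum(x * x for x in a))     # ||A||
        norm_b = math.sqrt(sum(y * y for y in b))     # ||B||
        return dot / (norm_a * norm_b)

    print(cosine_similarity(t1, t2))  # 3 / (sqrt(3) * sqrt(6)) = 0.707...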

In general, two documents are considered similar if their corresponding vectors in the VSM space point in the same (general) direction. LSI relies on a Singular Value Decomposition (SVD) of the co-occurrence matrix [25, 78]. SVD is a form of factor analysis and acts as a method for reducing the dimensionality of a feature space without serious loss of specificity. One of the most successful applications of SVD in Information Retrieval is the Google search engine (www.google.com).

Latent Semantic Indexing is so called because it can correlate terms that are semantically related; these relations are latent in a collection of text. LSI, also known as LSA, can reveal the hidden (latent) semantic structure in the way words are used in a body of text. It can also process queries issued by users to extract the meaning of a text, a process known as concept searching. Queries or concept searches that adopt LSI will return results that are similar in meaning to the search criteria even if they share no words with the search criteria.

Any matrix can be decomposed and then recomposed exactly using a number of dimensions equal to the smallest dimension of the original matrix; recomposing the matrix with fewer dimensions, however, reveals an interesting phenomenon. In SVD, a rectangular matrix X is decomposed into the product of three different matrices [5].

One component matrix (U) defines the original row entries as vectors of derived orthogonal factor values, the second (V) defines the original column entries in the same way, and the third (Σ) is a diagonal matrix containing scaling values such that when the three component matrices are multiplied, the original matrix is reconstructed (i.e., X = UΣV^T). The columns of V and U represent the right and left singular vectors, respectively, corresponding to the monotonically decreasing (in value) diagonal entries of Σ, which are called the singular values of the matrix X [5, 71].
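The decomposition and the low-rank reconstruction can be illustrated with NumPy as follows; the small random term-document matrix and the chosen rank k = 2 are assumptions made only for this sketch.

    import numpy as np

    # Toy term-document matrix X (rows = terms, columns = documents); values are illustrative.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(6, 4)).astype(float)

    # Full SVD: X = U * diag(s) * V^T, with the singular values s in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    assert np.allclose(X, U @ np.diag(s) @ Vt)

    # Rank-k reconstruction keeps only the k largest singular values; this is the
    # dimensionality-reduction step exploited by LSI.
    k = 2
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print("rank-%d approximation error: %.4f" % (k, np.linalg.norm(X - X_k)))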

2. LDA Model

In order to capture the dependencies among the various variables concisely, LDA employs plate notation. In plate notation, the boxes, also called plates, represent replication. Figure 12 in Chapter 3 shows that there are two plates.

Documents are represented in the outer plate, whereas the repeated choice of topics and words within a document is represented in the inner plate. M represents the number of documents, and N represents the number of words in a document [55]. Based on Figure 12 in Chapter 3:

 α: the parameter of the Dirichlet prior on the per-document topic distributions; it can be considered a corpus-level parameter.

 β: the parameter of the Dirichlet prior on the per-topic word distribution; it can also be considered a corpus-level parameter.

 θ_i: the topic distribution for document i.

 φ_k: the word distribution for topic k.

 z_ij: the topic for the j-th word in document i.

 w_ij: the specific word [55].

Note: The parameters α and β are considered to be the corpus level parameters, and the corpus level parameters are assumed to be sampled once in the process of generating a corpus. Finally, the variables z and w are considered to be word-level variables and the word-level variables are assumed to be sampled once for each word in each document [33].

The w_ij are the only observable variables (manifest variables). In statistics, observable variables are considered the opposite of latent variables.

Observable variables are those variables that can be observed and directly measured. The other variables in this model are known as latent variables (hidden variables). Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Latent variable models are mathematical models that aim to explain observed variables in terms


of latent variables. Latent variable models are used in many areas such as machine learning, artificial intelligence, and natural language processing [47].

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for a corpus. The basic idea of the generative process is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for a corpus D consisting of M documents, each of length N_i [33]:

1. Choose θ_i ~ Dir(α), where i ∈ {1, …, M} and Dir(α) is the Dirichlet distribution with parameter α.

2. Choose φ_k ~ Dir(β), where k ∈ {1, …, K}.

3. For each of the word positions (i, j), where i ∈ {1, …, M} and j ∈ {1, …, N_i}:

a) choose a topic z_{i,j} ~ Multinomial(θ_i), and

b) choose a word w_{i,j} ~ Multinomial(φ_{z_{i,j}}).
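To illustrate this generative process, the sketch below samples a tiny synthetic corpus with NumPy. The vocabulary, the numbers of topics, documents, and words, and the symmetric α and β values are all assumptions chosen for the example; it mirrors the three steps above and performs no inference.

    import numpy as np

    rng = np.random.default_rng(42)

    vocab = ["election", "vote", "market", "trade", "economy"]  # toy vocabulary (V = 5)
    K, M, N = 2, 3, 8            # topics, documents, and words per document (illustrative)
    alpha, beta = 0.5, 0.1       # symmetric Dirichlet priors

    # Per-topic word distributions phi_k ~ Dir(beta)  (step 2).
    phi = rng.dirichlet([beta] * len(vocab), size=K)

    for i in range(M):
        # Per-document topic distribution theta_i ~ Dir(alpha)  (step 1).
        theta = rng.dirichlet([alpha] * K)
        words = []
        for _ in range(N):
            z = rng.choice(K, p=theta)            # step 3a: choose a topic z_ij
            w = rng.choice(len(vocab), p=phi[z])  # step 3b: choose a word w_ij
            words.append(vocab[w])
        print("document", i, ":", " ".join(words))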


Figure 35: The LDA model.


Table 25: Definition of variables in the model [33]

Variable   Type                                                 Meaning
K          integer                                              number of topics (e.g., 50)
V          integer                                              number of words in the vocabulary (e.g., 50,000 or 1,000,000)
M          integer                                              number of documents
N_d        integer                                              number of words in document d
N          integer                                              total number of words in all documents (sum of all N_d values)
α_k        positive real                                        prior weight of topic k in a document; usually the same for all topics; normally a number less than 1 (e.g., 0.1) to prefer sparse topic distributions, i.e., few topics per document
α          K-dimensional vector of positive reals               collection of all α_k values, viewed as a single vector
β_w        positive real                                        prior weight of word w in a topic; usually the same for all words; normally a number much less than 1 (e.g., 0.001) to strongly prefer sparse word distributions, i.e., few words per topic
β          V-dimensional vector of positive reals               collection of all β_w values, viewed as a single vector
φ_{k,w}    probability (real number between 0 and 1)            probability of word w occurring in topic k
φ_k        V-dimensional vector of probabilities summing to 1   distribution of words in topic k
W          N-dimensional vector of integers between 1 and V     identity of all words in all documents
θ_{d,k}    probability (real number between 0 and 1)            probability of topic k occurring in document d for any given word
θ_d        K-dimensional vector of probabilities summing to 1   distribution of topics in document d
z_{d,w}    integer between 1 and K                              identity of the topic of word w in document d
w_{d,w}    integer between 1 and V                              identity of word w in document d
Z          N-dimensional vector of integers between 1 and K     identity of the topic of all words in all documents


It is important to distinguish between the LDA model and a simple Dirichlet-multinomial clustering model. The following points show the differences between the two models:

 A classical clustering model is a two-level model, whereas LDA is a three-level model.

 A classical clustering model restricts a document to being associated with a single topic, whereas in LDA documents can be associated with multiple topics.

In a classical clustering model, the Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words is selected for the document conditional on the cluster variable. For these reasons, a document can be associated with only a single topic. In the LDA model, on the other hand, the topic node is sampled repeatedly within the document, so documents can be associated with multiple topics [33].

3. Applications of AQE

The main application of AQE is to expand the query in order to improve document ranking. Beyond that, there are other applications and tasks for AQE. The following sections briefly discuss four of these areas.


3.1 Question Answering

Question Answering aims at providing precise or concise answers to specific questions such as "How many states are in the USA?" This application is similar to document ranking with AQE. However, Question Answering faces a major problem, which is the mismatch between question and answer vocabularies [41].

3.2 Multimedia Information Retrieval

The advent of digital media has increased searching for multimedia documents such as speeches, images, and videos. In general, multimedia IR systems depend on metadata (e.g., annotations, captions, and surrounding HTML/XML descriptions) to perform their searches. The absence of metadata forces IR systems to rely on multimedia content analysis combined with AQE methods.

3.3 Information Filtering

Information filtering (IF) refers to selecting the documents most relevant to the user from a set of documents. These documents arrive regularly, and the users' information needs also keep growing. Hanani et al. provide a detailed explanation of filtering application domains (e.g., electronic news, blogs, e-commerce, and e-mail) [59]. Information filtering involves two main approaches: collaborative IF and content-based IF. Collaborative IF depends on the preferences of users who have similar methods, i.e., similar ways of searching. According to Belkin and Croft, content-based IF resembles, to a great extent, IR systems, since user profiles can be adapted as queries and data streams can be modeled as collections of documents [19].

3.4 Cross-Language Information Retrieval

Cross-language information retrieval (CLIR) refers to the ability to retrieve documents in languages other than the language the user used in the query. According to Koehn, CLIR applies traditional approaches such as monolingual retrieval and query translation. The latter approach can be carried out by several methods: machine-readable bilingual dictionaries, parallel corpora, or machine translation [69].

4. A Classification of Approaches

According to their conceptual paradigm, AQE techniques can be classified into five main groups: Web data, search log analysis, query-specific statistical approaches, corpus-specific statistical approaches, and linguistic methods. Each group can be further divided into a few subclasses. Figure 36 shows the general taxonomy, including the five main groups and their subclasses.


Figure 36: A taxonomy of approaches to AQE [55].

5. Gensim

Gensim implements a wide variety of methods, including "tf–idf, random projections, deep learning with Google's word2vec algorithm (reimplemented and optimized in Cython), hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), including distributed parallel versions" [93]. Gensim has proved useful in many applications, both commercial and academic.
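As a brief illustration of how Gensim is typically used, the sketch below builds a dictionary and bag-of-words corpus from a tiny hand-made document list and trains a small LDA model, then prints the discovered topics. The documents and parameter values are assumptions for the example, not the settings used in the dissertation's experiments.

    from gensim import corpora, models

    # A tiny illustrative corpus; the actual experiments used Wikipedia dumps.
    texts = [
        ["economy", "market", "trade", "growth"],
        ["election", "government", "policy", "vote"],
        ["market", "trade", "policy", "economy"],
    ]

    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]

    # Train LDA; models.LsiModel would be the LSI counterpart.
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=10, random_state=1)

    for topic_id, topic in lda.show_topics(num_topics=2, num_words=4, formatted=True):
        print(topic_id, topic)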


Figure 37: Gensim

NumPy is an extension of the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions that operate on these arrays [93].

SciPy is an open-source Python library used by engineers, analysts, and scientists for scientific and technical computing; it provides modules for tasks such as linear algebra, optimization, and integration [93].

The Cython programming language is a superset of Python that can invoke C/C++ routines through a foreign function interface. It also allows declaring static types for subroutine parameters and results, class attributes, and local variables [93].


References

[1] Abdelali, Ahmed, James Cowie, David Farwell, Bill Ogden, and Stephen Helmreich.

"Cross-language information retrieval using ontology." Proceedings of TALN Batz-sur-

Mer (2003).

[2] Abderrahim, Mohammed Alaeddine, Mohammed El Amine Abderrahim, and

Mohammed Amine Chikh. "Using Arabic Wordnet for semantic indexation in information

retrieval system." arXiv preprint arXiv:1306.2499 (2013).

[3] Aghaei, Sareh, Mohammad Ali Nematbakhsh, and Hadi Khosravi Farsani. "Evolution

of the world wide web: from Web 1.0 to Web 4.0." International Journal of Web &

Semantic Technology 3, no. 1 (2012): 1-10.

[4] Al-Feel, Haytham, M. A. Koutb, and Hoda Suoror. "Toward An Agreement on

Semantic Web Architecture." Europe 49, no. 384,633,765 (2009): 806-810.

[5] Alhindawi, Nouh Talal. "Supporting Source Code Comprehension During Software

Evolution and Maintenance." PhD diss., Kent State University, 2013.

[6] Al-Khalifa, Hend, and Areej Al-Wabil. "The Arabic language and the semantic web:

Challenges and opportunities." In The 1st int. symposium on computer and Arabic

language. 2007.

[7] Alkhalifa, Musa, and Horacio Rodríguez. "Automatically Extending Named Entities

coverage of Arabic WordNet using Wikipedia." International Journal on Information and

Communication Technologies 3, no. 3 (2010).


[8] Alkhateeb, Faisal, A. Manasrah, and A. Bsoul. "Bank Web Sites Phishing Detection and Notification System Based on Semantic Web technologies." International Journal of

Security & Its Applications 6, no. 4 (2012): 53-66.

[9] Al-Shalabi, Riyad, Ghassan Kanaan, Mustafa Yaseen, Bashar Al-Sarayreh, and N. Al-

Naji. "Arabic query expansion using interactive word sense disambiguation." In

Proceedings of the Second International Conference on Arabic Language Resources and

Tools, Cairo, Egypt. 2009.

[10] Al-Zoghby, Aya M., Ahmed Sharaf Eldin Ahmed, and Taher T. Hamza. "Arabic

Semantic Web Applications–A Survey." Journal of Emerging Technologies in Web

Intelligence 5, no. 1 (2013): 52-69.

[11] Antai, Roseline, Chris Fox, and Udo Kruschwitz. "The Use of Latent Semantic

Indexing to Cluster Documents into Their Subject Areas." (2011).

[12] Antoniou, Grigoris, and Frank Van Harmelen. A semantic web primer. MIT press,

2012.

[13] Aranguren, Mikel Egaña, Erick Antezana, Martin Kuiper, and Robert Stevens.

"Ontology Design Patterns for bio-ontologies: a case study on the Cell Cycle Ontology."

BMC bioinformatics 9, no. Suppl 5 (2008): S1.

[14] Aswani Kumar, Ch, M. Radvansky, and J. Annapurna. "Analysis of a vector space model, latent semantic indexing and formal concept analysis for information retrieval."

Cybernetics and Information Technologies 12, no. 1 (2012): 34-48.

[15] Baeza-Yates, Ricardo, and William Bruce Frakes, eds. Information retrieval: data structures & algorithms. Prentice Hall, 1992.


[16] Bai, Jing, Jian-Yun Nie, Guihong Cao, and Hugues Bouchard. "Using query contexts in information retrieval." In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 15-22. ACM, 2007.

[17] Ballatore, Andrea, Michela Bertolotto, and David C. Wilson. "Linking geographic vocabularies through WordNet." Annals of GIS 20, no. 2 (2014): 73-84.

[18] Beckett, Dave, and Brian McBride. "RDF/XML syntax specification (revised)." W3C recommendation 10 (2004).

[19] Belkin, Nicholas J., and W. Bruce Croft. "Information filtering and information retrieval: Two sides of the same coin?." Communications of the ACM 35, no. 12 (1992):

29-38.

[20] Benton, Morgan, Eunhee Kim, and Benjamin Ngugi. "Bridging The Gap: From

Traditional Information Retrieval To The Semantic Web." AMCIS 2002 Proceedings

(2002): 198.

[21] Berners-Lee, Tim, and Mark Fischetti. "The original design and ultimate destiny of the world wide web by its inventor." Weaving the Web (1999).

[22] Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web." Scientific

American 284.5, 28-37, 2001.

[23] Beseiso, Majdi, Abdul Rahim Ahmad, and Roslan Ismail. "A Survey of Arabic language Support in Semantic web." International Journal of Computer Applications 9, no. 1 (2010): 35-40.

[24] Bhakkad, Ankit, S. C. Dharmadhikari, M. Emmanuel, and Parag Kulkarni. "E-VSM:

Novel text representation model to capture context-based closeness between two text


documents." In Intelligent Systems and Control (ISCO), 2013 7th International Conference on, pp. 345-348. IEEE, 2013.

[25] Binkley, David, and Dawn Lawrie. "Information retrieval applications in software maintenance and evolution." Encyclopedia of Software Engineering (2010): 454-463.

[26] Binkley, David, and Dawn Lawrie. "Information retrieval applications in software development." Encyclopedia of Software Engineering (2010): 231-242.

[27] Bishop, Christopher M., and Julia Lasserre. "Generative or discriminative? getting the best of both worlds." Bayesian statistics 8 (2007): 3-24.

[28] Bizer, Christian, Tom Heath, and Tim Berners-Lee. "Linked data-the story so far."

Semantic Services, Interoperability and Web Applications: Emerging Concepts (2009):

205-227.

[29] Bizer, Christian, Tom Heath, Kingsley Idehen, and Tim Berners-Lee. "Linked data on the web (LDOW2008)." In Proceedings of the 17th international conference on World

Wide Web, pp. 1265-1266. ACM, 2008.

[30] Black, William J., and Sabri ElKateb. "A prototype English-Arabic dictionary based on WordNet." In Proceedings of 2nd Global WordNet Conference, GWC2004, Czech

Republic, pp. 67-74. 2004.

[31] Black, William, Sabri Elkateb, Horacio Rodriguez, Musa Alkhalifa, Piek Vossen,

Adam Pease, and Christiane Fellbaum. "Introducing the Arabic WordNet project." In

Proceedings of the Third International WordNet Conference, pp. 295-300. 2006.

[32] Blei, David M. "Probabilistic topic models." Communications of the ACM 55, no. 4

(2012): 77-84.


[33] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." the Journal of machine Learning research 3 (2003): 993-1022.

[34] Bradford, Roger B. "An empirical study of required dimensionality for large-scale latent semantic indexing applications." In Proceedings of the 17th ACM conference on

Information and knowledge management, pp. 153-162. ACM, 2008.

[35] Brahmi, Abderrezak, Ahmed Ech-Cherif, and Abdelkader Benyettou. "An arabic lemma-based stemmer for latent topic modeling." Int. Arab J. Inf. Technol. 10, no. 2

(2013): 160-168.

[36] Broder, Andrei. "A taxonomy of web search." In ACM Sigir forum, vol. 36, no. 2, pp.

3-10. ACM, 2002.

[37] Buitelaar, Paul. "Human Language Technology for the Semantic Web." (2005).

[38] Cai, Deng, Xiaofei He, and Jiawei Han. "Training linear discriminant analysis in linear time." In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pp. 209-217. IEEE, 2008.

[39] Cake, Marcus. "Web 1.0, Web 2.0, Web 3.0 and Web 4.0 explained." (2008).

[40] Carmel, David, Eitan Farchi, Yael Petruschka, and Aya Soffer. "Automatic query refinement using lexical affinities with maximal information gain." In Proceedings of the

25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 283-290. ACM, 2002.

[41] Carpineto, Claudio, and Giovanni Romano. "A survey of automatic query expansion in information retrieval." ACM Computing Surveys (CSUR) 44, no. 1 (2012): 1.


[42] Carroll, Jeremy J. "An Introduction to the Semantic Web: Considerations for building multilingual Semantic Web sites and applications." Multilingual Computing 68, no. 15

(2005): 7.

[43] Chen, Edwin. "Introduction to latent Dirichlet allocation." Website (accessed May 2, 2012), http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation (2011).

[44] Contreras, Jesús, Oscar Corcho, and Asunción Gómez-Pérez. "Six Challenges for the

Semantic Web." (2009).

[45] Deerwester, Scott C., Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and

Richard A. Harshman. "Indexing by latent semantic analysis." JAsIs 41, no. 6 (1990): 391-

407.

[46] Deerwester, Scott. "Improving information retrieval with latent semantic indexing."

(1988).

[47] Dodge, Yadolah. The Oxford dictionary of statistical terms. Oxford University Press,

2006.

[48] El Kourdi, Mohamed, Amine Bensaid, and Tajje-eddine Rachidi. "Automatic Arabic document categorization based on the Naïve Bayes algorithm." In Proceedings of the

Workshop on Computational Approaches to Arabic Script-based Languages, pp. 51-58.

Association for Computational Linguistics, 2004.

[49] Elkateb, Sabri, William Black, Horacio Rodríguez, Musa Alkhalifa, Piek Vossen,

Adam Pease, and Christiane Fellbaum. "Building a wordnet for arabic." In Proceedings of

The fifth international conference on Language Resources and Evaluation (LREC 2006).

2006.


[50] Elkateb, Sabry, William Black, Piek Vossen, David Farwell, H. Rodríguez, A. Pease, and M. Alkhalifa. "Arabic WordNet and the challenges of Arabic." In Proceedings of

Arabic NLP/MT Conference, London, UK. 2006.

[51] Fallows, Deborah. The Internet and daily life. Washington, DC: Pew Internet &

American Life Project, 2004.

[52] Fuchs, Christian, Wolfgang Hofkirchner, Matthias Schafranek, Celina Raffl, Marisol

Sandoval, and Robert Bichler. "Theoretical foundations of the web: cognition, communication, and co-operation. Towards an understanding of Web 1.0, 2.0, 3.0." Future

Internet 2, no. 1 (2010): 41-59.

[53] Furnas, George W., Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais.

"The vocabulary problem in human-system communication." Communications of the ACM

30, no. 11 (1987): 964-971.

[54] Gerber, Aurona J., Andries Barnard, and Alta J. Van der Merwe. "Towards a semantic web layered architecture." (2007).

[55] Girolami, Mark, and Ata Kabán. "On an equivalence between PLSI and LDA." In

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pp. 433-434. ACM, 2003.

[56] Greenberg, Jane, Stuart Sutton, and D. Grant Campbell. "Metadata: A fundamental component of the Semantic Web." Bulletin of the American Society for Information Science and Technology 29, no. 4 (2003): 16-18.

[57] Guarino, Nicola, Daniel Oberle, and Steffen Staab. "What is an Ontology?." In

Handbook on ontologies, pp. 1-17. Springer Berlin Heidelberg, 2009.


[58] Hammo, Bassam H. "Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents." Information Retrieval 12, no. 3 (2009): 300-323.

[59] Hanani, Uri, Bracha Shapira, and Peretz Shoval. "Information filtering: Overview of issues, research and systems." User Modeling and User-Adapted Interaction 11, no. 3

(2001): 203-259.

[60] Hassanzadeh, Oktie. "Introduction to Semantic Web Technologies & Linked Data."

University of Toronto (2011).

[61] Hatem, M., D. Neagu, and H. Ramadan. "Towards personalization and a Unique

Uniform Resource Identifier for Semantic Web Users with in an Academic Environment."

The Journal of Instructional Technology and Distance learning 3, no. 6 (2006).

[62] Heindl, Eduard, and Norasak Suphakorntanakit. "Web 3.0." (2008).

[63] Hilbert, Martin, and Priscila López. "The world’s technological capacity to store, communicate, and compute information." science 332, no. 6025 (2011): 60-65.

[64] Hofmann, Thomas. "Probabilistic latent semantic indexing." In Proceedings of the

22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57. ACM, 1999.

[65] Huiping, Jiang. "Information Retrieval and the semantic web." In Educational and

Information Technology (ICEIT), 2010 International Conference on, vol. 3, pp. V3-461.

IEEE, 2010.

[66] Kadri, Youssef, and Jian-Yun Nie. "Effective stemming for Arabic information retrieval." In The Challenge of Arabic for NLP/MT, Intl Conf. at the BCS, pp. 68-74. 2006.


[67] Kamel Boulos, Maged N., and Steve Wheeler. "The emerging Web 2.0 social software: an enabling suite of sociable technologies in health and health care education1."

Health Information & Libraries Journal 24, no. 1 (2007): 2-23.

[68] Karypis, George, and Eui-Hong Sam Han. "Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval." In Proceedings of the ninth international conference on Information and knowledge management, pp. 12-19.

ACM, 2000.

[69] Koehn, Philipp. Statistical machine translation. Cambridge University Press, 2009.

[70] Koivunen, Marja-Riitta, and Eric Miller. "W3c semantic web activity." Semantic Web

Kick-Off in Finland (2001): 27-44.

[71] Landauer, Thomas K., Darrell Laham, and Peter Foltz. "Learning human-like knowledge by singular value decomposition: A progress report." In Advances in Neural Information Processing

Systems 10: Proceedings of the 1997 Conference, vol. 10, p. 45. MIT Press, 1998.

[72] Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. "An introduction to latent semantic analysis." Discourse processes 25, no. 2-3 (1998): 259-284.

[73] Larkey, Leah S., Lisa Ballesteros, and Margaret E. Connell. "Improving stemming for

Arabic information retrieval: light stemming and co-occurrence analysis." In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 275-282. ACM, 2002.

[74] Lau, Tessa, and Eric Horvitz. Patterns of search: analyzing and modeling web query refinement. Springer Vienna, 1999.


[75] Lavrenko, Victor, and W. Bruce Croft. "Relevance based language models." In

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 120-127. ACM, 2001.

[76] Mahgoub, Ashraf Y., Mohsen A. Rashwan, Hazem Raafat, Mohamed A. Zahran, and

Magda B. Fayek. "Semantic Query Expansion for Arabic Information Retrieval." In

EMNLP: The Arabic Natural Language Processing Workshop, Conference on Empirical

Methods in Natural Language Processing, Doha, Qatar, pp. 87-92. 2014.

[77] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Vol. 1. Cambridge: Cambridge university press, 2008.

[78] Marcus, Andrian, Andrey Sergeyev, Vaclav Rajlich, and Jonathan Maletic. "An information retrieval approach to concept location in source code." In Reverse

Engineering, 2004. Proceedings. 11th Working Conference on, pp. 214-223. IEEE, 2004.

[79] McGuinness, Deborah L., and Frank Van Harmelen. "OWL web ontology language overview." W3C recommendation 10, no. 10 (2004): 2004.

[80] Miller, George A. "WordNet: a lexical database for English." Communications of the

ACM 38, no. 11 (1995): 39-41.

[81] Mitra, Mandar, Amit Singhal, and Chris Buckley. "Improving automatic query expansion." In Proceedings of the 21st annual international ACM SIGIR conference on

Research and development in information retrieval, pp. 206-214. ACM, 1998.

[82] Moral, Cristian, Angélica de Antonio, Ricardo Imbert, and Jaime Ramírez. "A Survey of Stemming Algorithms in Information Retrieval." Information Research: An

International Electronic Journal 19, no. 1 (2014): n1.


[83] Moukdad, Haidar. "Stemming and root-based approaches to the retrieval of Arabic documents on the Web." Webology 3, no. 1 (2006).

[84] Murugesan, San. "Understanding Web 2.0." IT professional 9, no. 4 (2007): 34-41.

[85] Palmer, Sean B. "The semantic web: An introduction, 2001." (2009).

[86] Paolillo, John, and Daniel Pimienta. "Measuring linguistic diversity on the Internet."

(2005): 92-104.

[87] Papadimitriou, Christos H., Hisao Tamaki, Prabhakar Raghavan, and Santosh

Vempala. "Latent semantic indexing: A probabilistic analysis." In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pp. 159-168. ACM, 1998.

[88] Pedersen, Ted, Siddharth Patwardhan, and Jason Michelizzi. "WordNet:: Similarity: measuring the relatedness of concepts." In Demonstration papers at hlt-naacl 2004, pp.

38-41. Association for Computational Linguistics, 2004.

[89] Pérez, Jorge, Marcelo Arenas, and Claudio Gutierrez. "nSPARQL: A navigational language for RDF." Web Semantics: Science, Services and Agents on the World Wide Web

8, no. 4 (2010): 255-270.

[90] Pilehvar, Mohammad Taher, David Jurgens, and Roberto Navigli. "Align,

Disambiguate and Walk: A Unified Approach for Measuring ." In ACL

(1), pp. 1341-1351. 2013.

[91] Presutti, Valentina, and Aldo Gangemi. "Content ontology design patterns as practical building blocks for web ontologies." In Conceptual Modeling-ER 2008, pp. 128-141.

Springer Berlin Heidelberg, 2008.


[92] Prud’hommeaux, Eric, and Andy Seaborne. "SPARQL Query Language for RDF.

W3C Recommendation, January 2008." (2008).

[93] Řehůřek, Radim, and Petr Sojka. "Software framework for topic modelling with large corpora." (2010).

[94] Řehůřek, Radim. "Subspace tracking for latent semantic analysis." In Advances in

Information Retrieval, pp. 289-300. Springer Berlin Heidelberg, 2011.

[95] Rosario, Barbara. "Latent semantic indexing: An overview." Techn. rep. INFOSYS

240 (2000).

[96] Saeed, Abdullah. The Qur'an: an introduction. Routledge, 2008.

[97] Saleh, Layan M. Bin, and Hend S. Al-Khalifa. "AraTation: an Arabic semantic annotation tool." In Proceedings of the 11th International Conference on Information

Integration and Web-based Applications & Services, pp. 447-451. ACM, 2009.

[98] Salton, Gerard, and Michael J. McGill. "Introduction to modern information retrieval."

(1983).

[99] Sanderson, Mark, and W. Bruce Croft. "The history of information retrieval research."

Proceedings of the IEEE 100, no. Special Centennial Issue (2012): 1444-1451.

[100] Sawaf, Hassan, Jörg Zaplo, and Hermann Ney. "Statistical classification methods for

Arabic news articles." Natural Language Processing in ACL2001, Toulouse, France

(2001).

[101] Sontag, David, and Daniel M. Roy. "Complexity of inference in topic models." In

Advances in Neural Information Processing: Workshop on Applications for Topic Models:

Text and Beyond. 2009.


[102] Spivack, N. "Web 3.0: The third generation web is coming." Lifeboat Foundation

(2013).

[103] Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of latent semantic analysis 427, no. 7 (2007): 424-440.

[104] Tuerlinckx, Laurence. "La lemmatisation de l'arabe non classique." 7e JADT (2004).

[105] Van Harmelen, Frank. "The semantic web: What, why, how, and when." Distributed

Systems Online, IEEE 5, no. 3 (2004).

[106] Vechtomova, Olga, and Murat Karamuftuoglu. "Elicitation and use of relevance feedback information." Information processing & management 42, no. 1 (2006): 191-206.

[107] Vechtomova, Olga, and Ying Wang. "A study of the effect of term proximity on query expansion." Journal of Information Science 32, no. 4 (2006): 324-333.

[108] Wang, Quan, Jun Xu, Hang Li, and Nick Craswell. "Regularized latent semantic indexing: A new approach to large-scale topic modeling." ACM Transactions on

Information Systems (TOIS) 31, no. 1 (2013): 5.

[109] Weber, George. "Top languages." Lang Mon 3 (1997): 12-18.

[110] Wei, Xing, and W. Bruce Croft. "LDA-based document models for ad-hoc retrieval."

In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 178-185. ACM, 2006.

[111] Yu, Liyang. A developer’s guide to the semantic Web. Springer Science & Business

Media, 2011.


[112] Zhai, Chengxiang, and John Lafferty. "Model-based feedback in the language modeling approach to information retrieval." In Proceedings of the tenth international conference on Information and knowledge management, pp. 403-410. ACM, 2001.

[113] Zrigui, Mounir, Rami Ayadi, Mourad Mars, and Mohsen Maraoui. "Arabic text classification framework based on latent dirichlet allocation." CIT. Journal of Computing and Information Technology 20, no. 2 (2012): 125-140.
