
ALGORITHMS FOR ENHANCING INFORMATION RETRIEVAL USING

SEMANTIC WEB

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by Moy’awiah Al-Shannaq August, 2015


Dissertation written by

Moy’awiah A. Al-Shannaq

B.S., Yarmouk University, 2003

M.S., Yarmouk University, 2005

Ph.D., Kent State University, 2015

Approved by

______Austin Melton, Professor, Ph.D., Doctoral Advisor

______Johnnie Baker, Professor Emeritus, Ph.D., Computer Science

______Angela Guercio, Associate Professor, Ph.D., Computer Science

______Farid Fouad, Associate Professor, Ph.D., Chemistry

______Donald White, Professor, Ph.D., Mathematical Sciences

Accepted by

______Javed I. Khan, Professor, Ph.D., Chair, Department of Computer Science

______James L. Blank, Professor, Ph.D., Dean, College of Arts and Sciences


Table of Contents

LIST OF FIGURES...... viii

LIST OF TABLES...... xi

DEDICATION...... xiii

ACKNOWLEDGMENTS...... xiv

ABSTRACT...... xv

CHAPTER 1: INTRODUCTION…...... 1

1.2 Goals of the Research...... 3

1.3 Contributions...... 5

1.4 Dissertation Outline ...... 6

CHAPTER 2: BACKGROUND…...... 8

2.1 Introduction……………………………………………………………………………8

2.2 Information Retrieval………………………………………………………………….9

2.3 Web Evolution……………………………………………………………………….13

2.3.1 Web 1.0…………………………………………………………………………..14

2.3.2 Web 2.0…………………………………………………………...... 14

2.3.3 Web 3.0…………………………………………………………………………..15

2.3.4 Web 4.0…………………………………………………………………………..16

2.4 The Semantic Web Vision………………………………………………………………..18

2.5 Standards, Recommendations, and Models………………………………………….20


2.5.1 URI and Unicode………………………………………………………………...21

2.5.2 Extensible Markup Language……………………………………………………22

2.5.3 RDF: Describing Web Resources………………………………………………..23

2.5.4 RDF Schema: Adding Semantics………………………………………………...24

2.5.5 OWL: Web Ontology Language………………………………………………....25

2.5.6 Logic and Proof…………………………………………………………………..26

2.5.7 Trust……………………………………………………………………………...27

2.6 What is SPARQL? …………………………………………………………………..27

2.7 Integrating Information Retrieval and Semantic Web……………………………….30

CHAPTER 3: RELATED WORK…………………………………………………….34

3.1 Topic Modeling………………………………………………………………………34

3.1.1 From Vector Space Modeling to Latent Semantic Indexing…………………….36

3.1.2 The Probabilistic Latent Semantic Indexing (PLSI) model……………………...43

3.1.3 The Latent Dirichlet Allocation Model (LDA)…………………………………..45

3.1.4 Topics in LDA…………………………………………………………………...47

3.1.4.1 Model………………………………………………………………………...48

3.2 Automatic Query Expansion (AQE)…………………………………………………50

3.3 Document Ranking with AQE……………………………………………………….53

3.4 Why and when AQE works………………………………………………………….54

3.5 How AQE works……………………………………………………………………..57

3.5.1 Preprocessing of Data Source……………………………………………………57


3.5.2 Generation and Ranking of Candidate Expansion Features……………………..58

3.5.2.1 One-to-One Associations…………………………………………………….59

3.5.2.2 One-to-Many Associations…………………………………………………..59

3.5.2.3 Analysis of Feature Distribution in Top-Ranked Documents………………..60

3.5.2.4 Query Language Modeling…………………………………………………..60

3.5.2.5 A Web Search Example……………………………………………………...61

3.6 Selection of Expansion Features……………………………………………………..64

3.7 Query Reformulation………………………………………………………………...65

3.8 Related Work………………………………………………………………………...66

CHAPTER 4: SEMANTIC WEB AND ARABIC LANGUAGE……………………69

4.1 Importance of Arabic Language……………………………………………………..69

4.2 Right-to-Left Languages and the Semantic Web…………………………………….72

4.3 Arabic Ontology……………………………………………………………………...73

4.4 The Arabic Language and the Semantic Web: Challenges and Opportunities………75

CHAPTER 5: GENERATING TOPICS……………………………………………...82

5.1 Introduction…………………………………………………………………………..82

5.2 Data Set………………………………………………………………………………84

5.3 Experimental Results………………………………………………………………...85

5.3.1 Experiments on an English Corpus……………………………………………....88

5.3.1.1 Preprocessing Steps………………………………………………………….88

5.3.1.2 English Corpus Creation……………………………………………………..89


5.3.1.3 Experiment 1…………………………………………………………………90

5.3.1.4 Experiment 2…………………………………………………………………91

5.3.1.5 Experiment 3…………………………………………………………………91

5.3.1.6 Experiment 4…………………………………………………………………91

5.3.1.7 Experiment 5…………………………………………………………………91

5.3.1.8 Experiment 6…………………………………………………………………91

5.3.2 Experiments on an Arabic Corpus……………………………………………….92

5.3.2.1 Arabic Corpus Creation……………………………………………………...92

5.3.2.2 Experiment 1…………………………………………………………………94

5.3.2.3 Experiment 2…………………………………………………………………94

5.3.2.4 Experiment 3…………………………………………………………………94

5.3.2.5 Experiment 4………………………………………………………………....94

5.3.2.6 Experiment 5…………………………………………………………………95

5.3.2.7 Experiment 6…………………………………………………………………95

5.4 Discussion……………………………………………………………………………96

CHAPTER 6: TOPIC MODELING AND QUERY EXPANSION………………...102

6.1 Introduction…………………………………………………………………………102

6.2 Why use the combination of topic modeling and query expansion? ……………105

6.3 Semantic Search…………………………………………………………………….106

6.3.1 Handling Generalizations……………………………………………………….106

6.3.2 Handling Morphological Variations……………………………………………107


6.3.3 Handling Concept Matches…………………………………………………….108

6.3.4 Handling Synonyms with Correct Sense……………………………………….109

6.4 Methodology………………………………………………………………………..110

6.4.1 Query Expansion………………………………………………………………..114

6.4.2 Our Work……………………………………………………………………….116

6.4.2.1 Topic Modeling Subsystem……………………………………………….121

6.4.2.2 Query Expansion Subsystem……………………………………………….121

6.5 Experimental Results and Discussion………………………………………………122

6.5.1 Experiment 1…………………………………………………………………....123

6.5.2 Experiment 2……………………………………………………………………124

6.5.3 Experiment 3……………………………………………………………………125

6.5.4 Experiment 4……………………………………………………………………126

6.5.5 Experiment 5……………………………………………………………………127

6.5.6 Experiment 6……………………………………………………………………128

6.5.7 Experiment 7……………………………………………………………………129

6.5.8 Experiment 8……………………………………………………………………130

CHAPTER 7: CONCLUSION AND FUTURE WORK……………………………132

APPENDIX A………………………………………………………………………….141

REFERENCES………………………………………………………………………...158


List of Figures

Figure 1: The classic search model……………………….……………………………...10

Figure 2: Architecture of a textual IR system……………………………………………13

Figure 3: Evolution of the Web………….………………………………………………17

Figure 4: Web of documents………………….…………………………………………19

Figure 5: The structure of Web data……………….…………………………………….20

Figure 6: Semantic Web layered architecture……………………………………………21

Figure 7: An RDF graph storing information about Arab Bank.………………………...24

Figure 8: An RDF graph storing information about transportation services between cities…...... 25

Figure 9: Two-dimensional plot of terms and documents………………………………40

Figure 10: Two-dimensional plot of terms and documents along with the query application theory ………………………………………………………………………..41

Figure 11: Two-dimensional plot of terms and documents using the SVD of a reconstructed Term-Document Matrix………………………………………………….42


Figure 12: Plate notation representing the LDA model………………………………….48

Figure 13: Main steps of Automatic Query Expansion……………….…………………57

Figure 14: The first ten results returned by Google in response to the query "foreign minorities Germany"……………………………………………………………………..62

Figure 15: The first ten results returned by Google in response to the expanded query

"foreign minorities Germany sorbs Frisians"……………………………………………63

Figure 16: The real strength of the top ten languages…………..…………...... 71

Figure 17: The three main components of the semantic annotation process…………….76

Figure 18: A pie chart showing the distribution of languages used in creating ontologies stored in the OntoSelect library………………………………………………………….78

Figure 19: A visualization of the probabilistic generative process for three documents..87

Figure 20: The intuitions behind Latent Dirichlet Allocation…………………………...88

Figure 21: Segmentation of the Arabic agglutinated form سيعلمونه……………………100

Figure 22: System components for Arabic corpus using LSI topic modeling.…………117

Figure 23: System components for Arabic corpus using LDA topic modeling.………..118

Figure 24: System components for English corpus using LDA topic modeling.………119


Figure 25: System components for English corpus using LSI topic modeling...... 120

Figure 26: Query one for Arabic corpus with LDA topic modeling……………………123

Figure 27: Query two for Arabic corpus with LDA topic modeling and query expansion……………………………………………………………………………….124

Figure 28: Query three for Arabic corpus with LSI topic modeling……………………125

Figure 29: Query four for Arabic corpus with LSI topic modeling and query expansion……………………………………………………………………………….126

Figure 30: Query five for English corpus with LDA topic modeling………...... 127

Figure 31: Query six for English corpus with LDA topic modeling and query expansion……………………………………………………………………………….128

Figure 32: Query seven for English corpus with LSI topic modeling………………....129

Figure 33: Query eight for English corpus with LSI topic modeling and query expansion……………………………………………………………………………….130

Figure 34: LSI steps………………………………………………………………...... 141

Figure 35: The LDA model……………………………………………………..………149

Figure 36: A taxonomy of approaches to AQE………………………………..……….156

Figure 37: Gensim……………………………………………………...……………….157


List of Tables

Table 1: A comparison between Web 1.0 and Web 2.0…...……………………………..15

Table 2: A comparison between Web 2.0 and Web 3.0…...……………………………..16

Table 3: The result of SPARQL query…………………………………………………..28

Table 4: Computing Capabilities Assessment…………………………………………...33

Table 5: Book titles………………………………………………………………………38

Table 6: The 16*17 Term-Document Matrix corresponding to book titles in Table 5 ....39

Table 7: The three topics…………………………………………………………………43

Table 8: Clustering of the world’s languages based on their script family……………...72

Table 9: Listing of tools used in building Semantic Web applications………………….74

Table 10: Arabic support summary………………………………………………………75

Table 11: English and Arabic Wikipedia………………………………………………...84

Table 12: Summary of stemmer features………………………………………………...90

Table 13: Example of two words sharing the same stem but having different senses…….93


Table 14: Accuracy of topic generation for the English corpus…………………………..95

Table 15: Accuracy of topic generation for the Arabic corpus……………………………96

Table 16: Some derivations from the root علم…………………………………………..98

Table 17: Four possible solutions for the word بسم..…………………………..………99

Table 18: Handling generalizations…………………………………………………….107

Table 19: Handling Morphological Variations………………………………………....108

Table 20: Handling Concept Matches………………………………………………….109

Table 21: Different senses for the word "شعب"......………………………109

Table 22: Document 1 ………………………………………………………………….142

Table 23: Document 2 …………………………………………………...... 143

Table 24: Two texts………………………………...... 145

Table 25: Definition of variables in the model ………………………………………...150


DEDICATION

To my parents, Abdulla and Mariam

To my wife, Hebah

To my daughters, Rafeef and Salma

To my family and friends


ACKNOWLEDGMENTS

First and foremost, I would like to thank Allah, my God, who gives me the confidence to pursue my doctoral studies at Kent State University and surrounds me with a wonderful advisor, family, and friends. I would like to take this opportunity to thank them.

I would like to show my greatest appreciation and gratitude to my research advisor,

Prof. Austin Melton, for his guidance and support over four enjoyable, memorable, and wonderful years. Prof. Melton's support goes further than words can say. I would also like to thank the committee members: Prof. Johnnie Baker, Dr. Angela Guercio, Dr. Farid Fouad, and Prof. Donald White for their time, comments, valuable suggestions, and feedback.

I should not forget to thank the faculty members and colleagues of the Computer Science Department for their encouragement and great company.

Last but not least, I am grateful to my friends. Special thanks to Mazen AlZyoud,

Bilal Sayheen, and Dr. Nouh Alhindawi for their help, support and friendship.

Thank you one and all.

Moy’awiah Al-Shannaq

August, 2015, Kent, Ohio


Abstract

The research explores the use of advanced Semantic Web tools for problems in Information Retrieval (IR). Specifically, IR methods that retrieve information from documents based on traditional keyword matching suffer from several drawbacks. Among these drawbacks are word sense ambiguity, query intent ambiguity, and the inability to exploit semantic knowledge within documents. These drawbacks can negatively affect precision. The work advances the field by investigating approaches that discover the abstract "topics" occurring in a collection of documents based on a topic modeling approach, and then expand the query to explore and discover hidden relations between the documents. The work is applied to Arabic and English documents.

The hypothesis is that classifying the corpus with meaningful descriptive information, and then expanding the query by using Automatic Query Expansion, will improve the results of the IR methods for indexing and querying information. In particular, the work uses advanced Information Retrieval methods that have been widely used for indexing and analyzing a corpus, and then applies advanced Semantic Web techniques in order to increase the accuracy. These techniques are tested on both English and Arabic documents.

Recent topic modeling techniques combined with query expansion are discussed and compared with respect to issues such as efficiency and scaling.


CHAPTER 1

Introduction

1.1 Introduction

Technology is developing dramatically. Its main objective is to enhance the level of service provided and to facilitate human activity. The development of technology gives better potential for users to access information using various search tools.

The World Wide Web is not the same thing as the Internet, as most people believe; rather, the WWW is the most crucial part of the Internet, giving us access to millions of resources regardless of their physical location and language and, with the rapid development of the Internet, access to media storage (containing large textual corpora) and digital encyclopedias. One of the main problems we face today in the information society is information overload. This problem has become more serious because of the huge amount of data that has been created, and made available online, in the last decade. With the expected continuous growth of the WWW (in size, languages, and formats), researchers believe that search engines will have a hard time maintaining the quality of retrieval results and that it is very difficult or impossible to analyze the huge amount of information. To deal with this problem, and since the current search engine technology has its limits, we need a new vision for the web that enables intelligent choices and a better grasp of the meaning of the information on the Internet; this new vision is called the Semantic Web

[44, 113].

A breakthrough in the field of technology is the mass use of the Internet. People from all around the world and from different walks of life now have the chance to use the

Internet and to interact with one another. The world has become a global village that guarantees a better interaction among people and better access to resources. In addition to launching the global village, the Semantic Web has made it easy to access resources online.

The Semantic Web can enhance the Web through a layer of machine-interpretable metadata (data about data). Therefore, by using this method, a computer program can process the content and draw conclusions about a Web page [65]. An important innovation was introduced by Tim Berners-Lee, who highlighted the role of the Web as a medium for finding, sharing, and integrating information. The Semantic Web provides procedures for sharing and exchanging data, which makes access to knowledge easier and faster.

The Semantic Web is not only a user-friendly tool, but it can also help in the evolution of knowledge. For that reason, it is essential for Semantic Web tools and applications to support various languages such as Arabic. In other words, the Semantic Web aims at creating a global village through creating an information network and sharing knowledge.

This dissertation focuses on investigating approaches for discovering the abstract "topics" that occur in a collection of documents. The approach is based on a topic model and then expands the query to explore and discover hidden relations between the documents in an efficient manner with respect to accuracy and time complexity. More precisely, we examine how advanced Information Retrieval techniques and the Semantic Web can be utilized to enhance retrieval accuracy. The work is applied to Arabic and English Wikipedia documents, first generating topics and then combining topic generation with query expansion. Further details are presented in Chapters 5 and 6.

1.2 Goals of the Research

Representing the content of text documents is considered a critical part in any

Information Retrieval approach. The traditional techniques treat the words in these documents without any semantics or relationships between them. In other words, documents are normally represented as a "bag of words" which means that the words are expected to occur independently without any relationships between the words [110]. For that reason many researchers have developed and suggested many approaches to capture the hidden relationships between the words in order to group words into topics.

In general, the interface in Information Retrieval systems is designed in a way that gives the user the opportunity to search for keywords in a single input box. These keywords are used to retrieve the relevant documents by matching the collection index to find the documents that contain those keywords. The system is likely to retrieve suitable matches only if a user query contains multiple topic-specific keywords that accurately describe the user's information needs. Yet, users may submit short queries, and natural language is inherently ambiguous, which may result in retrieving erroneous and incomplete results in this traditional retrieval model [41].

Term mismatch is considered to be one of the critical language issues for the effectiveness of the retrieval process. In other words, users usually do not use the same words that indexers use, which is called the vocabulary problem. The vocabulary problem can be understood through synonymy and polysemy. These problems may result in erroneous and irrelevant retrievals which decrease the accuracy of the results [53].

Expanding the original query by using other words that best fit the user's intention is considered to be one of the most natural and successful techniques. In other words, the goal is to generate a usable query that is likely to retrieve the relevant documents.

The idea is that we can improve the accuracy by applying topic modeling to the corpus, rather than dealing with the documents as bags of words, in order to capture the hidden relationships between the words and group the words into specific topics, and then expand the query to explore and discover hidden relations between documents. Since synonyms can play an important role in increasing the number of relevant documents retrieved for a query, we will use the WordNet tool in order to capture the semantic information of document words.

In summary, this dissertation investigates different techniques for discovering the topics that occur in a collection of documents and then expands the query by using Automatic Query Expansion. By applying advanced

Semantic Web techniques, this dissertation will attempt to provide more accurate retrieval


results. These techniques have been tested and applied to both English and Arabic documents and the results are discussed.

1.3 Contributions

The main contributions of this work involve improving the results of previous work in topic modeling and query expansion for both Arabic and English documents. In addition, the research also involves applying the work to Arabic documents since only a few works have tackled this topic so far.

The main contributions of this dissertation are to:

● Explore the use of advanced Semantic Web tools to problems in Information

Retrieval (IR).

● Investigate approaches to discover the abstract “topics” that occur in a collection of

documents based on topic model approaches and then expand the query to explore

and discover hidden relations between the documents.

● Demonstrate that classifying the corpus with meaningful descriptive information

using Latent Semantic Indexing can improve the results of IR methods applied to

Arabic and English documents.

● Demonstrate that classifying the corpus with meaningful descriptive information

using Latent Dirichlet Allocation can improve the results of IR methods applied to

Arabic and English documents.


● Test if a topic model can describe the semantics of a corpus without explicit

semantics.

● Test if Latent Dirichlet Allocation has higher accuracy than Latent Semantic

Indexing for Arabic documents, as has already been shown for English documents.

● Test if Latent Semantic Indexing is faster to execute than Latent Dirichlet

Allocation for Arabic documents, as has already been shown for English documents.

● Combine topic model and query expansion and apply to English documents.

● Combine topic model and query expansion and apply to Arabic documents.

● Apply to Arabic and English documents and verify the results.

1.4 Dissertation outline

Chapter 1 describes the significance of the present study and the research hypotheses. The remainder of the dissertation is organized as follows.

Chapter 2 provides an overview of Information Retrieval, the Semantic Web, and bridging the gap between Information Retrieval and semantics. Chapter 3 provides a review of the related literature. It starts with a general overview of topic modeling and query expansion, and it then covers work that combines topic modeling and query expansion for English documents as well as for Arabic documents.


Chapter 4 discusses the Semantic Web and the Arabic language. The chapter starts by introducing the importance of the Arabic language and includes a section on Right-to-

Left Languages and the Semantic Web. After that, the opportunities for Arabic ontologies are discussed in a separate section. Finally, the chapter concludes by examining the Arabic language and the Semantic Web in terms of challenges and opportunities.

Chapter 5 discusses topic generation using topic modeling. The chapter starts by defining the data set; then it explains the steps of the proposed approach. A section presents the experimental results. The chapter ends with experimental evaluations and discussions. Chapter 6 discusses query expansion. The chapter starts by defining what query expansion is. Then it explains the steps of the proposed approach for Arabic and

English documents. The experimental evaluation and discussions are presented in a separate section.

The last chapter of this dissertation, chapter 7, summarizes the dissertation’s work and discusses open issues and future research directions.


Chapter 2

Background

2.1 Introduction

"There have always been things which people are good at, and things

computers have been good at, and little overlap between the two."

- Tim Berners-Lee, May 1998

The World Wide Web (abbreviated as WWW or W3, commonly known as the

Web) plays a very important role in communicating and accessing information in all the

fields of knowledge. Moreover, the WWW has become an essential part of people's lives in most

societies today. As a matter of fact, the World Wide Web has been one of the fastest rising

and most universal phenomena in history [65].

In October 1990, Tim Berners-Lee, who is the inventor of the World Wide Web,

specified three fundamental technologies which remain the foundation of today’s Web.

The three fundamental technologies can be described in brief as the following:

1. Hyper Text Markup Language (HTML): can be defined as the publishing format

for the Web that provides the ability to format documents and link them to other

documents and resources.

2. Uniform Resource Identifier (URI): can be defined as the unique address for each

resource on the Web.


3. Hypertext Transfer Protocol (HTTP): can be defined as a protocol that allows one

to retrieve linked resources from the Web.

This chapter will describe traditional Information Retrieval approaches, how the

World Wide Web has improved over time, the basic Semantic Web concepts and technologies including the Resource Description Framework (RDF), ontology languages such as RDFS and OWL, and the SPARQL query language for the Semantic Web. It will also provide a summary of Web 1.0, Web 2.0, Web 3.0, and Web 4.0.

2.2 Information Retrieval

First, it is important to define what is meant by Information Retrieval (IR), which can be very broad and entail different senses. In general, Information Retrieval refers to the process of finding information resources (usually documents) of an unstructured nature

(usually text) that are considered relevant to the exact types of information needed from a large collection of information resources. The purpose of automated Information Retrieval systems is to reduce what is called "information overload" by using intelligent search methods in order to help users to find the required information with high performance [77,

99].


Figure 1: The classic search model [15].

C. Manning, P. Raghavan, and H. Schutze show that in the 1990s most people preferred to obtain information from other people, rather than opting for Information

Retrieval systems. For example, during that period of time, most people used human travel agents to reserve their travel tickets or even packages. However, since the beginning of the last decade, many developments and enhancements of Information

Retrieval have driven web search engines to new levels of quality. These improvements added more value to the previous process, which in turn led to greater customer satisfaction. Consequently, web search has become a standard and favored source of information for most people [77]. For example, the Pew Internet Survey in


2004 (Fallows 2004) showed that 92% of internet users affirm that the Internet is a

convenient source for finding everyday information [51].

An Information Retrieval process starts when a user enters a query into the system.

For example, searching the web for the query “Kent State University” may show irrelevant results. In this case the query will not uniquely identify and retrieve a single document.

Alternatively, many documents could match the query with different degrees of accuracy.

Most Information Retrieval systems compute a numeric score for each document in the database to decide which documents match and which do not match the query; after that, the system ranks the documents according to these values.

Documents with the top ranking are then shown to the user and considered as relevant documents. The process could then be iterated if the user wishes to improve the query [15].

Most Information Retrieval investigations have focused on retrieval effectiveness, which is usually based on document relevance judgments. However, there are some problems associated with relevance judgments such as being subjective and changeable.

For example, different judges will give different relevance values to a document retrieved in answer to a given query. More details about these issues can be found in Salton and

McGill (1983) and Sparck-Jones (1981) [98]

To evaluate the performance of Information Retrieval systems many measures have been proposed. These measures require a collection of documents and a query. Every document is simply classified to be either relevant or non-relevant to a particular query.

The most commonly used measures are recall and precision. Recall, in Information

Retrieval, is the ratio of retrieved and relevant documents for a given query over the total


number of relevant documents for that query in the database. Except for small test collections, the result is generally unknown and should be estimated by sampling or some other methods. Precision, in Information Retrieval, is the ratio of retrieved and relevant documents over the total number of documents retrieved. Fall-out, in Information

Retrieval, is the ratio of retrieved and non-relevant documents over the total number of all non-relevant documents. F-measure, in Information Retrieval, is used to measure the accuracy by considering both the precision and recall. The results for recall, precision, fall- out and F-measure values are between 0 and 1 [98].

Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

F-measure = (2 × Precision × Recall) / (Precision + Recall)

Fall-out = |{non-relevant documents} ∩ {retrieved documents}| / |{non-relevant documents}|
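To make these definitions concrete, the following minimal Python sketch (the document identifiers and set sizes are invented purely for illustration) computes the four measures from sets of document IDs:

# Toy illustration of recall, precision, F-measure, and fall-out.
# The document identifiers below are hypothetical.
relevant = {"d1", "d2", "d3", "d7"}            # relevant documents for the query
retrieved = {"d1", "d3", "d4", "d5"}           # documents returned by the system
collection = {f"d{i}" for i in range(1, 11)}   # the whole (tiny) collection
non_relevant = collection - relevant

hits = relevant & retrieved                    # relevant AND retrieved

recall = len(hits) / len(relevant)
precision = len(hits) / len(retrieved)
f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
fall_out = len(non_relevant & retrieved) / len(non_relevant)

print(recall, precision, f_measure, fall_out)  # 0.5 0.5 0.5 0.333...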


Figure 2: Architecture of a textual IR system [98].

2.3 Web Evolution

The idea of the web was first presented by Tim Berners-Lee in 1989, and it has since become the largest medium for transferring and organizing information [52, 67]. The web has witnessed many developments in the last two decades [52], evolving over time from Web 1.0 to Web 4.0. Web 1.0 is considered a web of information connections and a web of cognition, Web 2.0 a web of communication, Web 3.0 a web of co-operation, and Web 4.0 a web of integration [3].


2.3.1 Web 1.0

Web 1.0 is also known as Static Web since it was used as Web of Information

Connections. Web 1.0 is a read-only web. Its websites consist of static HTML pages that are updated infrequently, and it is considered mono-directional. Web 1.0 was used by businesses to present their products using catalogs or brochures, introducing and advertising their products on the web so that people could read about them and contact the businesses. The main idea in Web 1.0 is that websites are used to publish information for anyone at any time in order to establish an online presence. As a result, these websites are considered not interactive and indeed as brochure-ware [3, 62].

2.3.2 Web 2.0

Web 2.0 developed from Web of Information into Social Web 2.0 (Social), i.e.,

Web of People Connections. Web 2.0 differs from Web 1.0 because it supports writing as well as reading, so it can be viewed as a read-write web, a wisdom web, and a people-centric web. Since

Web 2.0 has writing as well as reading properties, Web 2.0 is considered bi-directional [3, 84]. Table 1 compares Web 1.0 and Web 2.0 [3].


Table 1: A comparison between web 1.0 and web 2.0 [3].

Web 1.0 (Static Web) | Web 2.0 (Social Web)

Inventor: Tim Berners-Lee | Inventor: Tim O'Reilly

Reading | Reading/Writing

Companies | Communities

Client-Server | Peer to Peer

HTML | XML

Taxonomy | Tags

Owning | Sharing

Netscape | Google

Web forms | Web applications

Screen scraping | APIs

Services sold over the web | Web services

Lectures | Conversation

2.3.3 Web 3.0

In 2006 John Markoff suggested a new third version or generation of the web which is now known as web 3.0 [102]. The main idea of web 3.0 is to define structured data and link them in order to allow data to be more efficiently discovered, automated, integrated, and reused across various applications [3]. Web 3.0 is also known as Semantic Web.

The Semantic Web was first proposed by Tim Berners-Lee, who is also the inventor of the World Wide Web [85], and it will be discussed in more detail later in this chapter. Table 2 compares Web 2.0 and Web 3.0 [3].

Table 2: A comparison between Web 2.0 and Web 3.0

Web 2.0 (Social Web) | Web 3.0 (Semantic Web)

Inventor: Tim O'Reilly | Inventor: Tim Berners-Lee

Read/Write Web | Portable Personal Web

Communities | Individuals

Sharing Content | Consolidating Dynamic Content

Blogs | Lifestream

AJAX | RDF

Wikipedia, Google | DBpedia, iGoogle

Tagging | User engagement

2.3.4 Web 4.0

Web 4.0 is also known as the symbiotic web. Web 4.0 is an idea that is still being developed, and there is no exact definition of what it will be. The goal of Web 4.0 is to enable interaction between humans and machines in symbiosis, so the web is moving towards being an intelligent web. Web 4.0 will be a multi-tasking web that will enable users to read, write, and execute, and to interact with the web concurrently through intelligent interactions. Web 4.0 technologies will be able to connect information or connect knowledge through semantic techniques. In addition, Web 4.0 technologies will be able to apply the knowledge shared between data items and define context in order to do basic reasoning. In simple words,


machines will be smart enough to read the contents of the web and respond by deciding what to execute first and then executing it; they will be able to load websites quickly with superior quality and performance and to build more commanding interfaces. All of these features will increase the quality and performance of the entire process [39]. Figure 3 shows the evolution of the web.

Figure 3: Evolution of the web [39].


2.4 The Semantic Web Vision

"The Semantic Web is not a separate Web but an extension of the current

one, in which information is given well-defined meaning, better enabling

computers and people to work in cooperation."

- Tim Berners-Lee et al; Scientific American, May 2001

The web has witnessed several developments over time. The latest available version is called the Semantic Web, which is the third generation of the web. The Semantic Web

(Web 3.0) is built on top of Web 2.0 to add more features to the previous versions of the web. These new features go beyond reading and writing to include the discovery and retrieval of new information.

According to the World Wide Web Consortium (W3C), "The Semantic Web provides a common framework that allows data to be shared and reused across applications, enterprises, and community boundaries" [70]. The Semantic Web is a web of meaning that can represent things in a manner that computers can understand. The main purpose of the Semantic Web is to make the web readable by machines and not only by humans.

In fact, the current web, Web 2.0, can be described as a web of documents; it is similar to a global file system which makes use of different types of data such as natural language text, multimedia data, files, graphics, and much more. An important problem posed by this model is that the web of documents is built for human consumption, with documents and the links between them (or parts of them) as the main objects. In [28, 29, 60] the authors indicate that the semantics of the content, as well as of the links between documents, are hidden and that the degree of structure between objects is low. Figure 4 represents the structure of the web of documents in a simple way [29].

Figure 4: Web of documents (Web 2.0) [29].

Semantics has been proposed as the means for overcoming the problems of the current web. The Semantic Web can be broadly defined as the web of data. It is much like a global database that includes most of a database's features. The primary target of the design of the web of data is first the machines and then the humans. In the web of data, links are always between things, simply because things are the primary objects. The degree of structure between objects is mostly based on the RDF model (to be defined later), and the semantics of the content as well as of the links are always explicit [28, 29, 60]. In Figure 5, the structure of the web of data is shown [29]. In this figure, the data web has its main evolution in connected information and interlinked data. In this model, information is given a well-defined meaning that is understandable, analyzable, and machine readable [111].


Figure 5: The structure of web of data (Web 3.0) [111].

The Semantic Web, or web of meaning, has been introduced as a new version of the web that, in one way or another, will transform the web of linked structured documents into a web of linked data. By using Linked Data principles, developers can query Linked Data from multiple sources at once and combine the results without the need for a single common schema shared by all the data. This allows machines to understand, analyze, and process large-scale data automatically instead of manually.

2.5 Standards, Recommendations, and Models

As we discussed in previous sections, the Semantic Web can be viewed as a system that enables machines to understand and respond to complex human requests based on their meanings. The Semantic Web can be briefly described in one sentence: "How to make the web more readable to a computer" [22].


Tim Berners-Lee proposed a layered architecture for the Semantic Web that is often represented using a diagram, with many variations. Figure 6 gives a typical representation of this diagram [56].

Figure 6: The Semantic Web layered architecture [56].

The functionality of each layer of the Semantic Web layered architecture is described briefly in the following subsections.

2.5.1 URI and Unicode:

URI stands for Uniform Resource Identifier, and it is an expansion of the concept of the Uniform Resource Locator, or URL, which is the commonly known identifier for websites and webpages. URI is the W3C recommendation for describing the name and location of current and future objects on the Internet. Unicode is a universal character-set standard and is the almost universally adopted successor to ASCII [18]. The purpose of Unicode is to represent any character of the world uniquely. The Uniform Resource Identifier (URI) is a string of characters that is used to uniquely identify the name of a resource over a network [4, 18, 22]. Both Unicode and URI can be described as providing a unique identification mechanism within the language stack of the Semantic Web [54].

2.5.2 The Extensible Markup Language

XML is used as a base syntax for other technologies developed for the upper layers of the Semantic Web. XML and its related standards, such as Namespaces (NS) and schemas, are used to provide a common means to structure data on the web without conveying the meaning of the data. A namespace is used to identify and distinguish XML elements of different vocabularies. It supports mixing elements from various vocabularies to represent a specific function. An XML schema guarantees that the information sent matches the information received whenever two applications exchange information at the XML level [4].


2.5.3 RDF: Describing Web Resources

RDF stands for Resource Description Framework. RDF is the World Wide Web

Consortium (W3C) recommended data model for the representation of information about resources on the Web. RDF is a knowledge representation language dedicated to the annotation of resources within the Semantic Web for the purpose of utilizing metadata

(data describing data) to provide more information about certain elements on the web.

Since metadata itself is understandable to machines, RDF is used as a data model for managing, structuring, and reasoning with data on the web and the real-life relations among those data. Many documents are now annotated via RDF due to its simple data model and its formal semantics. In its abstract syntax, an RDF document is a set of triples or statements of the form Subject-Predicate-Object. The linking structures form a directed, labeled graph, where the edges represent the named links between two resources, which are represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations [8, 79, 89, 111].

The following example (Example 1) means that there exists a resource named

"Arab Bank" whose homepage, phone, and fax are ,

"+962-6-5694901" and "962-6-56949141," respectively. Note that foaf1 and vcard2 (which are used in this example) are two RDF vocabularies [8].

Example 1: The assertion of the following RDF triples:

{ _:b1 foaf:name "Arab Bank" .

_:b1 foaf:homepage <http://www.arabbank.com.jo/> .

_:b1 foaf:phone "+962-6-5694901" .

_:b1 vcard:fax "+962-6-56949141" .

}

An RDF document can be represented by a directed labeled graph, as shown in

Figure 7, where nodes are terms appearing as the subject or object in a triple and arcs define the set of triples (i.e., if (S, P, O) is a triple, then there is an arc labeled P from the node S to the node O).

Figure 7: An RDF graph storing information about Arab Bank.
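As a hedged illustration only, the triples of Example 1 could be built and inspected programmatically with the Python rdflib library; the choice of rdflib and the vcard namespace URI below are assumptions of this sketch, not part of the cited example.

from rdflib import Graph, BNode, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")          # standard FOAF namespace
VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")   # assumed vcard namespace URI

g = Graph()
bank = BNode()  # corresponds to the blank node _:b1 in Example 1

g.add((bank, FOAF.name, Literal("Arab Bank")))
g.add((bank, FOAF.homepage, URIRef("http://www.arabbank.com.jo/")))
g.add((bank, FOAF.phone, Literal("+962-6-5694901")))
g.add((bank, VCARD.fax, Literal("+962-6-56949141")))

# Serializing to Turtle makes the Subject-Predicate-Object structure visible.
print(g.serialize(format="turtle"))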

2.5.4 RDF Schema: Adding Semantics

In spite of the fact that RDF is a flexible graph data model for representing metadata about resources, it does not provide any semantics for these data.

Consequently, RDF Schema is applied to provide a predefined basic type system for RDF models, enabling a machine to make inferences over them. RDF Schema officially became a W3C recommendation in February 2004 as a semantic extension to RDF. It describes classes and properties of the resources in the basic RDF model. RDF Schema, or RDFS, provides a simple reasoning framework to infer the types of resources [3].

Figure 8 shows a graph representation of RDF and RDFS layers that store information about transportation services between cities [89].


Figure 8: An RDF graph storing information about transportation services between cities

[89].
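As an illustration of the kind of schema information RDFS adds on top of plain RDF, the following Python/rdflib sketch declares a tiny class hierarchy and a typed property; the vocabulary (cities and a busTo property) is hypothetical and only loosely inspired by Figure 8.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/transport#")  # hypothetical vocabulary

g = Graph()

# Schema layer: classes, a subclass relation, and a property with domain/range.
g.add((EX.City, RDF.type, RDFS.Class))
g.add((EX.Capital, RDF.type, RDFS.Class))
g.add((EX.Capital, RDFS.subClassOf, EX.City))
g.add((EX.busTo, RDF.type, RDF.Property))
g.add((EX.busTo, RDFS.domain, EX.City))
g.add((EX.busTo, RDFS.range, EX.City))

# Data layer: instances described using the schema.
g.add((EX.Amman, RDF.type, EX.Capital))
g.add((EX.Irbid, RDF.type, EX.City))
g.add((EX.Amman, EX.busTo, EX.Irbid))

# An RDFS-aware reasoner could now infer, for example, that Amman is also a City.
print(len(g), "triples in the combined RDF/RDFS graph")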

2.5.5 OWL: Web Ontology Language

Ontologies are considered one of the most important pillars of the Semantic Web.

They are engineering artifacts that can formally describe some concepts and their relationships within a given knowledge domain [91]. They can provide a computationally conceptual representation of our current understanding of reality, as described within the information we have [13]. The ontology layer describes properties and the relations between properties.

An ontology can be defined as a collection of terms used to describe a specific domain with inference ability [3]. OWL stands for Web Ontology Language. OWL (and its new version, OWL 2) is the World Wide Web Consortium (W3C) recommended data model for creating and representing ontologies. OWL can be viewed as a data modeling language that increases the semantic richness of RDF. This comes as a result of OWL characteristics that include effectiveness, flexibility, and speed.

There are differences between OWL and RDFS in terms of the ability to add semantics to RDF. In general, the features of OWL enable one to add more semantics in this regard. Antoniou et al. [12] note that, when comparing OWL to RDF, "in many cases we need to express more advanced, more 'expressive' knowledge. For example, that every person has exactly one birth date, or that no person can be both male and female at the same time."

So far, the top layers of the Semantic Web stack architecture, which are Logic, Proof, and Trust, are still unstandardized by the W3C and have yet to be realized in order to achieve the web of data. In this regard, the W3C states in [70] that "Logic, Proof, and Trust, are currently being researched and simple application demonstrations are being constructed."

2.5.6 Logic and Proof

The Logic and Proof layer is above the ontology layer. This layer will enable making new inferences by using an automatic reasoning system. Logic is used to improve the ontology language further in addition to allowing the writing of application-specific declarative knowledge. Reasoning systems will allow agents to perform deductions independent of using particular resources that satisfy their requirements [56].


2.5.7 Trust

Trust is the top layer of the stack. This layer provides an assurance of quality of the information on the web and a degree of confidence in the resources providing this information [3]. In other words, trust, which is used to derive statements, will be supported by verifying that the premises come from trusted sources and by relying on formal logic when deriving new information.

2.6 What is SPARQL?

SPARQL stands for SPARQL Protocol and RDF Query Language. SPARQL has been a

World Wide Web Consortium (W3C) recommendation since January 2008 as a specification of a query language for RDF. SPARQL is designed much in the spirit of classical relational languages such as SQL [92].

SPARQL is an RDF query language, that is, a query language for RDF databases, which is able to retrieve and manipulate data stored in Resource Description Framework format, for example by retrieving nodes from RDF graphs [92].

Since the entire database is a set of "subject-predicate-object" triples, SPARQL allows users to write queries against data that follows the RDF specifications of the W3C.

RDF data can also be considered in SQL relational database terms as a table with three columns. The first column is the subject column, the second one is the predicate column, and the third one is the object column.


A simple SPARQL query is expressed by using a form resembling a Structured Query Language (SQL) SELECT statement:

SELECT B FROM u WHERE P

where 'u' is the URL of an RDF graph G to be queried, 'P' is a SPARQL graph pattern (i.e., a pattern constructed over RDF graphs with variables), and 'B' is a tuple of variables that appear in 'P' [8]. The following is an example of a SPARQL query.

Example 2: Consider the following SPARQL query:

SELECT ?name ?phone

FROM

WHERE {

?p1 foaf:homepage <http://www.arabbank.com.jo/> .

?p1 foaf:name ?name .

?p1 foaf:phone ?phone .

}

This SPARQL query can be used to retrieve the name and the phone of a bank whose homepage is http://www.arabbank.com.jo/. When evaluated against the RDF graph of

Figure 7, the following answers will be returned [8]:

Table 3: The result of SPARQL query

# | ?name | ?phone

1 | Arab Bank | "+962-6-5694901"


In addition to Select, the SPARQL language specifies three forms of SPARQL queries for different purposes. These forms are ASK, CONSTRUCT, and DESCRIBE [8,

12].

ASK query: an ASK SPARQL query performs the same matching as a SELECT query, but instead of returning the variable bindings, it returns 'true' if there is at least one answer for P in the RDF graph. Otherwise, 'false' is returned.

The format for ASK: ASK FROM u WHERE P.

Example 3: ASK { ?x foaf:name "Arab Bank" }

When evaluated against the RDF graph of Figure 7, the answer is:

True.

CONSTRUCT query: a CONSTRUCT SPARQL query is used to build a new RDF graph extracted from a larger set of RDF data [12].

Example 4: PREFIX ex:

PREFIX :

PREFIX geo:

CONSTRUCT {?apartment swp:hasNumberOfBedrooms ?bedrooms.}

Example 5: The following example depends on Figure 7.

CONSTRUCT { ?p1 foaf:name ?name . } WHERE { ?p1 foaf:name ?name . }

DESCRIBE SPARQL query: a DESCRIBE query is used to extract an RDF graph from a SPARQL endpoint, the contents of which are left to the endpoint to decide based on what the maintainer deems as useful information. According to the SPARQL Protocol for RDF specification, a SPARQL endpoint can be defined as a conformant SPARQL protocol service that enables users to query a knowledge base via the SPARQL language, with the returned results typically appearing in one or more machine-processable formats [12].

Example 6: PREFIX :

DESCRIBE ?x

WHERE { ?x :writesBook :book3 }

Example 7: The following example depends on Figure 7.

DESCRIBE ?p1

WHERE { ?p1 foaf:name "Arab Bank" }
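A minimal, self-contained sketch of evaluating the SELECT query of Example 2 with Python's rdflib follows; the small in-memory graph stands in for the FROM clause, and only the triples needed for the query are loaded.

from rdflib import Graph, BNode, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
bank = BNode()
g.add((bank, FOAF.name, Literal("Arab Bank")))
g.add((bank, FOAF.homepage, URIRef("http://www.arabbank.com.jo/")))
g.add((bank, FOAF.phone, Literal("+962-6-5694901")))

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?phone
WHERE {
    ?p1 foaf:homepage <http://www.arabbank.com.jo/> .
    ?p1 foaf:name ?name .
    ?p1 foaf:phone ?phone .
}
"""

for row in g.query(query):
    print(row.name, row.phone)   # expected: Arab Bank +962-6-5694901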

2.7 Integrating Information Retrieval and Semantic Web

The common goal of both Information Retrieval (IR) and the Semantic Web (SW) with respect to the Internet is enabling users of the Web to find and make use of relevant and appropriate online information in a timely fashion. The common problem addressed by both IR and the Semantic Web is the need for Web users to sift through billions of web pages to find the information that is relevant to them regarding any given topic of interest [20].

Both Information Retrieval and Semantic Web are dealing with similar problems, but they are coming from different directions. "IR was developed before anyone had


conceived of the Internet, indeed even before computer networks were common. With the invention of the Internet, the theories and tools of IR were pressed into service with great success, emerging as the core algorithms of Web search engines" [20]. Information

Retrieval (IR) is considered as a bottom-up approach. "IR researchers began with very low- level structures and continue to work up, IR researchers began with the most basic structures the bit patterns representing words in computer memory and did bit by bit comparisons to determine if certain key words chosen by the searcher existed within a given document or document set. Their algorithms grew in power and complexity from there" [20].

"The Semantic Web, on the other hand, is a vision of how the future might be, a reaction to the problems that exist with the Web in its current state. In this sense IR researchers are steadily moving incrementally forward from the past whereas Semantic

Web researchers are working back from a revolutionary vision of what will be in the future"

[20]. Semantic Web is considered as a top-down approach. "Semantic Web researchers are beginning with a very high-level blueprint and are trying to work down to fill in the details, they started with complex, high-level goals as expressed by humans. For example, such as

I need to make my travel arrangements for the sales conference next month, and outlining out the systems and structures that would need to be built to support the satisfaction of such goals automatically." [20]

In general, the Semantic Web aims to address the problems of Information Retrieval on the Web in several ways, such as: understanding the semantics of terms in context, incorporating deductive and inferential rules of reasoning into relevancy determinations,


and by gathering enough information about the users and their information needs to determine what is truly relevant in each case. To reach these goals in the vision of the

Semantic Web, three technological pillars are used:

● Web ontology in order to specify semantics

● markup languages for achieving smooth interoperability

● agents for reasoning and action [22].

As we discussed earlier, IR is considered a bottom-up approach, while Semantic

Web is considered a top-down approach. To integrate these two fields so each field can complement the other, we need to pay more attention to ontology development and query formulation.

Fortunately, just as Internet and World Wide Web protocols have helped connect huge amounts of information for human consumption, new approaches are being developed in order to help connect equal or greater amounts of information for machine manipulation and processing. These advances will simplify information interoperability and provide better information relevance and confidence within the enterprise and on the World Wide

Web. Over time, they will pave the way for new intelligent brokering and knowledge reasoning capabilities across the field of collected information. Table 4 shows key capabilities of semantic computing and the resulting impact for stakeholders.


Table 4: Computing Capabilities Assessment (Adapted by Richard Murphy)

Near-term

Capability: Semantic Web Services
Purpose: Provides flexible look-up and connections discovery
Stakeholders: System Developers and System Integrators
Impact: Reduced friction in web services adoption and deployment
Take-away: More automated and flexible data and schema transformation

Capability: Information Integration and/or Interoperability
Purpose: Reduces integration complexity from n² to n
Stakeholders: Data and Metadata Architects
Impact: Reduced cost to integrate heterogeneous data sources
Take-away: Increased interoperability at improved speed and reduced cost

Capability: Intelligent Search
Purpose: Provides context sensitive search, queries on concepts, and personalized filtering
Stakeholders: Business and Technology Managers, Analysts, and Individuals
Impact: Reduced human filtering of search results, more relevant searches
Take-away: Increased search accuracy translates into greater productivity

Longer-term

Capability: Model-Driven Applications
Purpose: Enables software applications to process domain logic from actionable models
Stakeholders: Software Developers
Impact: Less coding required, faster changes to domain logic
Take-away: Less code maintenance and faster change responsiveness

Capability: Adaptive and Autonomic Computing
Purpose: Provides the ability for applications to diagnose and forecast system administration of complex systems
Stakeholders: System Administrators
Impact: Increased reliability and reduced cost through self-diagnostics and planning
Take-away: Reduced cost to maintain systems and lessened human intervention

Capability: Intelligent Reasoning
Purpose: Supports machine inference based on rich data and evolvable schemas and constraints
Stakeholders: Applications and Cognitive Agents
Impact: Reduced requirements for embedding logic apart from domain models
Take-away: Reduced application development cost


CHAPTER 3

Related Work

3.1 Topic Modeling

Representing the content of text documents is considered a critical part in any

Information Retrieval approach. Documents are normally represented as a "bag of words" which means that the words are expected to occur independently without any relationships between the words [110]. For that reason many researchers have developed and suggested many approaches to capture the hidden relationships between the words in order to group words into topics.

Topic modeling is a kind of statistical modeling that is used for discovering the topics that occur in a collection of documents. Topic modeling is used in and in natural language processing as well. It is clear that if you are given a document that is about a particular topic, you would expect certain words to appear in the document more frequently than other words. For example: we expect that the words "dog" and "bone" will appear more often in documents about dogs, and we also expect that the words "cat" and

"meow" will appear more often in documents about cats. A document normally contains multiple topics in different proportions. Therefore, in a document that is 10% about cats and 90% about dogs, the word "dog" will probably appear approximately nine times more often than the word "cat". A topic model uses a mathematical framework based on the statistics of the words in each document. This framework captures such statistics in order to allow examining a set of documents and discovering what the topics might be and what each document's balance of topics is. While topic models were first described and implemented in the context of natural language processing and machine learning, other fields now have applications using topic models as well [33].

Topic modeling has evolved over time. In 1988, an early topic model, known as Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA), was developed by Dumais and co-workers [87]. After that, in 1999, Thomas Hofmann presented a new model, known as the Probabilistic Latent Semantic Indexing (PLSI) model or Probabilistic Latent Semantic Analysis (PLSA) [64], which was developed as an alternative to LSI. In 2002, David Blei, Andrew Ng, and Michael I. Jordan developed a new model, considered a generalization of PLSI, called Latent Dirichlet Allocation (LDA) [33]. Latent Dirichlet Allocation (LDA) is considered the most common topic model currently in use. Other topic models are generally extensions of LDA.
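To make this concrete, here is a minimal sketch using the Gensim library (which provides implementations of both LSI and LDA); the four toy documents are invented purely for illustration.

from gensim import corpora, models

# Tiny invented corpus: two documents about dogs, two about cats.
texts = [
    ["dog", "bone", "dog", "puppy"],
    ["dog", "bark", "bone"],
    ["cat", "meow", "kitten"],
    ["cat", "meow", "cat", "purr"],
]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=1)

for i, bow in enumerate(corpus):
    print("document", i, lda.get_document_topics(bow))  # per-document topic mixture
print(lda.print_topics(num_words=3))                     # top words per topic

With only two topics and a handful of words, the model typically separates the "dog" and "cat" vocabulary into the two topics, mirroring the intuition described above.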


3.1.1 From Vector Space Modeling to Latent Semantic Indexing

The Vector Space Model (VSM) was the prevalent IR model until 1988. In this model each document in the collection is a list of the main terms, and the frequency of each term in the document is counted. In VSM a document is regarded as a vector of terms. Each unique term in the document collection corresponds to a dimension in the space, and a value indicates the frequency of this term. Documents and queries are treated as vectors in a multidimensional space, and queries are treated as documents in the search space. In this term space, it is not possible to assign a position to terms simply because the terms are the dimensions of the space, and the VSM assumes that terms are independent. Term weights, which are calculated by adopting a particular weighting schema, specify the coordinate values that are assigned to document and query vectors [14, 33].
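A minimal sketch of this vector representation in plain Python (toy documents invented for illustration; raw term frequencies used as weights) builds the term space and ranks documents against a query by cosine similarity:

import math
from collections import Counter

# Toy documents and query, invented for illustration.
docs = ["human machine interface", "graph of trees", "machine interface for trees"]
query = "machine interface"

vocab = sorted({w for d in docs for w in d.split()})   # the dimensions of the term space

def to_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]                  # raw term-frequency weights

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = to_vector(query)                                   # the query is treated as a document
for d in docs:
    print(f"{cosine(q, to_vector(d)):.3f}  {d}")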

The VSM model is built on matching query terms to terms found in the documents.

For that reason, VSM assumes that terms are independent. However, terms can be dependent through synonymy, such as car and automobile, and polysemy: for example, the verb

"to get" can mean "to procure" (e.g., I'll get the drinks), "to become" (e.g., She got scared), or

"to understand" (e.g., I get it). Another example of polysemy is that the adjective "plain" can mean "simple" (e.g., English is a plain subject) or "with nothing added" (e.g., The chocolate is too plain) [33].

As we mentioned earlier, Latent Semantic Indexing (LSI) is a model that was developed in 1988 by Dumais and co-workers at Bellcore to overcome the shortcomings of the Vector Space Model (VSM).


LSI is a machine-learning model that uses a mathematical technique called Singular Value Decomposition (SVD) to build representations of the meanings of words by analyzing the relations between vocabulary terms and passages in large bodies of text.

LSI is built on the principle that words used in the same contexts tend to have similar meanings. An important feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. The method LSI uses to capture the essential semantic information applies Singular Value Decomposition to a co-occurrence matrix and then performs dimension reduction, keeping only the most important dimensions of the decomposed matrix [25, 46, 78].
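As a rough illustration of this idea (a minimal sketch that assumes a small hard-coded term-document matrix and a rank-2 latent space; it is not the dissertation's actual implementation), truncated SVD can be applied directly with NumPy:

    import numpy as np

    # D: a small term-document-style matrix (documents as rows, terms as columns).
    D = np.array([[2., 1., 1., 0., 0.],
                  [1., 2., 0., 1., 0.],
                  [0., 0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(D, full_matrices=False)

    k = 2                                           # number of latent dimensions to keep
    D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k reconstruction of D

    # Documents (and queries folded in the same way) are then compared in the
    # k-dimensional latent space rather than in the raw term space.
    doc_latent = U[:, :k] * s[:k]
    print(np.round(D_k, 2))
    print(np.round(doc_latent, 2))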

LSI was an improvement over the Vector Space Model (VSM), which was based on pure term matching and could not capture term dependencies. The adoption of LSI came as a result of its ability to discover term dependencies through synonyms, polysemy, and related terms used in the documents.

Many researchers, such as Binkley and Lawrie (2010), demonstrated the ability of LSI to deal with synonyms [25]. For example, LSI correctly answered 64% of the synonym questions on the Test of English as a Foreign Language, where the students' average was far below that of LSI [25]. For more details about LSI see the Appendix.

Example: This example illustrates how LSI works when classifying a corpus into three topics. Suppose we have the book titles shown in Table 5 [45].

Table 5: Book titles [45].

Table 6: The 16×17 Term-Document Matrix corresponding to the book titles in Table 5 [45].

Figure 9: Two-dimensional plot of terms and documents [45].

Figure 10: Two-dimensional plot of terms and documents along with the query "application theory" [45].

Figure 11: Two-dimensional plot of terms and documents using the SVD of a reconstructed Term-Document Matrix [45].

Table 7: The three topics [45].

3.1.2 The Probabilistic Latent Semantic Indexing (PLSI) model

Even though LSI is considered an improvement over the Vector Space Model (VSM), it has some problems or challenges. The first challenge has to do with scalability and performance: compared to other Information Retrieval techniques, LSI requires relatively high computational power and memory [68]. However, this challenge has largely been overcome by modern computer systems, which offer high-speed processors and inexpensive memory; for example, it is now common for real-world LSI applications to process more than 30 million documents [94]. LSI performs such processing through matrix and SVD computations. One tool that supports these computations is Gensim, a software package that contains a completely scalable (unlimited number of documents, online training) implementation of LSI.
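As a brief illustration (a minimal sketch with a hypothetical toy corpus; the dissertation's own experiments use Wikipedia documents, described in Chapter 5), Gensim's scalable LSI implementation can be used as follows:

    from gensim import corpora, models

    # Hypothetical toy corpus: each document is already tokenized.
    texts = [["human", "machine", "interface", "time"],
             ["survey", "user", "computer", "system", "response", "time"],
             ["graph", "trees", "minors", "survey"]]

    dictionary = corpora.Dictionary(texts)                   # term <-> id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]    # bag-of-words vectors

    tfidf = models.TfidfModel(corpus)                        # optional tf-idf weighting
    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

    for topic in lsi.print_topics(2):
        print(topic)    # each topic is a weighted combination of words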

The second challenge to LSI is the difficulty of determining the optimal number of dimensions to use when performing the SVD. The general idea is that fewer dimensions result in broader comparisons of the concepts contained in a collection of text, while a higher number of dimensions results in more specific (and more relevant) comparisons of concepts. Researchers have found that the number of documents in the collection can be used to determine a good number of dimensions. For example, around 300 dimensions usually provide the best results for moderate-sized document collections (hundreds of thousands of documents), and possibly 400 dimensions for larger collections (millions of documents) [34]. However, recent studies suggest that the number of dimensions should be between 50 and 1000, depending on two major factors: the size and the nature of the document collection [45].

In 1999, Thomas Hofmann developed a new model to overcome the previous challenges of LSI, known as Probabilistic Latent Semantic Indexing (PLSI). This model is also known as the Probabilistic Latent Semantic Analysis (PLSA) model or as the Aspect Model.

Hofmann applied the Probabilistic Latent Semantic Indexing model to retrieval tasks within the Vector Space Model (VSM) framework; however, he applied it only over small collections. He examined the PLSI model in two different ways: first as a unigram model, in order to fit the empirical word distributions, and second as a latent space model, in order to provide a low-dimensional document/query representation. PLSI provided improved retrieval performance over standard term-frequency weighting and Latent Semantic Indexing (LSI); this improvement was observed on four collections containing 1033, 1400, 3204, and 1460 document abstracts. Despite these improvements, PLSI still suffers from some drawbacks. Among them is that the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting; in addition, the collections used were relatively small, so their documents cannot be considered representative [110].

3.1.3 The Latent Dirichlet Allocation model (LDA)

Starting in 2003, scholars such as David Blei, Andrew Ng, and Michael Jordan began to question the PLSI model developed by Hofmann. They tried to overcome the drawbacks of PLSI by developing a new model called the Latent Dirichlet Allocation (LDA) model. According to Blei, et al., PLSI "is incomplete in that it provides no probabilistic model at the level of documents. In PLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers." [33].

Blei and his colleagues argued that mixture models capturing the exchangeability of both words and documents should be considered. For example, if two words tend to occur in similar documents, the words are considered to be similar; if two documents tend to include similar words, the documents are considered to be similar. This reasoning led them to develop the Latent Dirichlet Allocation (LDA) model.

LDA is a well-known generative model in machine learning and is a standard example of a model for building topic models. If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. Because LDA is based on a generative model, it allows sets of observations to be explained by unobserved groups, which explains why some parts of the data should be considered similar [33].

The difference between generative models and discriminative models is discussed in the following points:

 A generative model can be considered a full model: the probabilistic model covers all variables, whereas a discriminative model is not a full model; it models only the target variable(s) conditioned on the observed variables. The fundamental difference between the two is the following: if you have input data (x) and you want to label it with classes (y), then a generative model learns the joint probability P(x, y), while a discriminative model learns the conditional probability P(y | x).

 A generative model can be used to generate values for any variable in the model, whereas a discriminative model can generate values only for the target variables conditioned on the observed quantities.

 In general, a discriminative model cannot express more complex relationships between the observed and target variables [27]. (A small illustrative sketch contrasting the two follows this list.)
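To make the distinction concrete (a minimal sketch on hypothetical synthetic data; this is a generic machine-learning illustration, not part of the dissertation's experiments), a Gaussian Naive Bayes classifier models the joint distribution P(x, y), while logistic regression models only P(y | x):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB            # generative: models P(x, y)
    from sklearn.linear_model import LogisticRegression   # discriminative: models P(y | x)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)),             # class 0 samples
                   rng.normal(3, 1, (50, 2))])            # class 1 samples
    y = np.array([0] * 50 + [1] * 50)

    gen = GaussianNB().fit(X, y)
    disc = LogisticRegression().fit(X, y)

    x_new = np.array([[1.5, 1.5]])
    print("generative     P(y|x):", gen.predict_proba(x_new))
    print("discriminative P(y|x):", disc.predict_proba(x_new))

    # Only the generative view also lets us sample new feature values for a class,
    # here approximated from the class-1 training statistics.
    mu, sd = X[y == 1].mean(axis=0), X[y == 1].std(axis=0)
    print("a feature vector sampled from class 1:", rng.normal(mu, sd))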

The previous models (VSM, LSI, PLSI) do not take word order into account and assume that ignoring it causes no problems; they treat a document as a "bag of words." However, this assumption does not hold in all cases, since word semantics is sensitive to word order. For example, searching Google for college junior or junior college yields very different results.

3.1.4 Topics in LDA

In both LDA and PLSA, each document is viewed as a mixture of different topics. The difference is that in LDA the topic distribution is assumed to have a Dirichlet prior, which results in more realistic mixtures of topics per document [55].

For example, suppose the topics in an LDA model can be categorized as CHICKEN_related and COW_related. The CHICKEN_related topic can generate various words such as hen, egg, and roasted, which the user may categorize as CHICKEN_related; obviously, the word chicken itself will have high probability given this topic. Likewise, the COW_related topic will probably generate words such as barn, milk, and beef. Function words such as "a," "an," "and," and "the" will have approximately equal probabilities across topics.

3.1.4.1 Model

The LDA model is represented as a probabilistic graphical model in Figure 12. As the figure shows, there are three levels in the LDA representation [33]:

 Documents are represented by the outer plate.

 The repeated choice of topics and words within a document is represented by the inner plate.

 α and β are corpus-level parameters.

Figure 12: Plate notation representing the LDA model.

For more details about the LDA model, see Appendix A.

Examples:

Example 1: Suppose we have the following set of sentences and we need to classify them into two topics [43]:

 I like to eat broccoli and bananas.

 I ate a banana and spinach smoothie for breakfast.


 Chinchillas and kittens are cute.

 My sister adopted a kitten yesterday.

 Look at this cute hamster munching on a piece of broccoli.

Answer: The topics that we discover using the LDA are likely to be:

 Sentences 1 and 2: 100% Topic A (Topic A represents the food topic).

 Sentences 3 and 4: 100% Topic B (Topic B represents the animal topic).

 Sentence 5: 60% Topic A, 40% Topic B.

Example 2: Suppose we have the following set of documents and we need to classify them into two topics [43]:

Doc1: After I eat my breakfast of apples, oranges, bananas, and grapes, I'm going to go snowboarding in the Alps if it's not too cold outside.

Doc2: Apples, oranges, bananas, and grapes make good smoothies.

Doc3: Apples, oranges, bananas, and grapes are tasty fruits.

Doc4: Snowboarding in the Alps is a lot of fun, but cold.

Doc5: My friend lives in the Alps, where he teaches snowboarding.

Answer: The topics that we discover using the LDA are likely to be:

 Topic 1 (the "fruit" topic): represented most strongly by apples, oranges,

bananas, grapes.

49

 Topic 2 (the "Alps" topic): represented most strongly by Alps, snowboarding,

cold.

 Document 1: 66.66% Topic 1, 33.33% Topic 2.

 Documents 2 and 3: 100% Topic 1.

 Documents 4 and 5: 100% Topic 2.
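A minimal sketch of how such an analysis could be run with Gensim's LDA implementation (using the five sentences of Example 1 after naive lowercasing and stop-word removal; since LDA is probabilistic, the exact topic proportions will differ from the idealized answers above):

    from gensim import corpora, models

    sentences = ["i like to eat broccoli and bananas",
                 "i ate a banana and spinach smoothie for breakfast",
                 "chinchillas and kittens are cute",
                 "my sister adopted a kitten yesterday",
                 "look at this cute hamster munching on a piece of broccoli"]
    stop = {"i", "to", "and", "a", "for", "are", "my", "at", "this", "on", "of"}
    texts = [[w for w in s.split() if w not in stop] for s in sentences]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                          passes=50, random_state=1)

    for i, bow in enumerate(corpus):
        print("sentence", i + 1, lda.get_document_topics(bow))   # per-sentence topic mixture
    print(lda.print_topics(num_topics=2, num_words=4))           # top words per topic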

3.2 Automatic Query Expansion (AQE)

In general, the interface of an Information Retrieval system is designed to give the user the opportunity to search for keywords in a single input box. These keywords are used to retrieve the relevant documents by matching them against the collection index to find the documents that contain them. The system is more likely to retrieve suitable matches if the user query contains multiple topic-specific keywords that accurately describe the information need. Yet users often submit short queries, and natural language is inherently ambiguous, which may cause this traditional retrieval model to return erroneous and incomplete results [41].

Term mismatch is considered one of the critical language issues for the effectiveness of the retrieval process: users usually do not use the same words that indexers use, which is also called the vocabulary problem. The vocabulary problem can be understood through synonymy and polysemy. In this context, synonymy refers to different words that have the same meaning (it may also include word inflection, such as the singular and plural forms of a word), whereas polysemy refers to a word that has different meanings. These problems may result in erroneous and irrelevant retrievals, which decreases accuracy. More precisely, the synonymy problem may decrease recall, whereas polysemy may decrease precision [53].

Scholars have suggested different approaches to solve the vocabulary problem, including relevance feedback, word sense disambiguation, interactive query refinement, and search-results clustering. Expanding the original query with other words that best capture the user's intention is considered one of the most natural and successful techniques; in other words, the goal is to generate a more usable query that is likely to retrieve the relevant documents.

Research on Automatic Query Expansion (AQE) affirms that although the volume of data is increasing drastically, users still use very few terms when searching, and web search is a good example of this. Hitwise reported in 2009 that the average query length is 2.3 words, the same length reported in 1999 by Lau and Horvitz [74].

Studies show that there has been a slight increase in query length, which can reach five or more words; yet the most common query lengths are still one, two, and three words. Such relatively short queries aggravate the vocabulary problem, since the lack of query terms reduces the possibility of dealing with synonymy. At the same time, the large volume of data increases the problem of polysemy. All of these factors have resulted in a greater need for Automatic Query Expansion (AQE) [41].

In the past few years, many scholars have presented a large number of AQE techniques employing various approaches that leverage several data sources and use effective methods to find new features correlated with the query terms. Nowadays, AQE rests on solid theoretical frameworks and on careful study of its usability and limitations, for example, determining the important parameters that affect a method's performance and the kinds of queries for which AQE performs best.

Meanwhile, there is an increasing usage of the basic techniques in conjunction with other mechanisms in order to increase their effectiveness. This includes combining methods, using active selection of information sources, and applying discriminative policies of method application.

These improvements have been further verified by results obtained in laboratory experiments. The evaluation results presented at the Text REtrieval Conference (TREC) series gave more importance to AQE: many participants relied on AQE techniques and reported significant improvements in retrieval performance [41].

Nowadays, AQE has proved to be one of the most efficient techniques for enhancing the retrieval effectiveness of documents. Many commercial applications have started to employ AQE in desktop and intranet search. Several products, such as Google Enterprise, MySQL, and Lucene, provide an AQE feature that users may choose to enable. Yet operational web IR systems, including search engines, have not fully adopted AQE [41].

This reluctance is due to the following reasons. First, performing a query with AQE is computationally expensive for web search applications. Second, the AQE techniques currently in use provide good results only on average and may provide poor results for some queries; moreover, AQE tends to emphasize recall at the expense of precision, whereas web users mostly care about the results on the first page, so relevant results introduced by AQE may not help if they do not appear there. Third, AQE sometimes retrieves documents in the result page that do not contain the original query terms, which may confuse users [41].

The remainder of this chapter is organized as follows. Section 3.3 describes document ranking with AQE. Section 3.4 discusses why, and under which assumptions, AQE produces more accurate results than unexpanded queries. Section 3.5 describes how AQE works, Section 3.6 discusses the selection of expansion features, and Section 3.7 describes query reformulation.

3.3 Document Ranking with AQE

In general, many IR systems, such as search engines, completely or partially rely on calculating the importance of the terms used in the query and in the documents to decide on their responses [41]. The score of a document d with respect to a query q is typically computed as

sim(q, d) = Σ_{t ∈ q ∩ d} w_{t,q} · w_{t,d}          (1)

where w_{t,q} is the weight of term t in query q and w_{t,d} is the weight of term t in document d. In this formula, the weight of each term is usually proportional to the term frequency and inversely proportional to the frequency and length of the documents containing the term.

We can modify the ranking scheme of formula (1) to accommodate query expansion, abstracting away from the specific underlying weighting model. The basic input to AQE is the original query q together with a source of data from which to compute and weight the expansion terms; the output of AQE is a query q' formed by an expanded set of terms with their associated weights w'. The new weighted query terms are used to compute the similarity between the expanded query q' and a document d [41]:

sim(q', d) = Σ_{t ∈ q' ∩ d} w'_{t,q'} · w_{t,d}          (2)

"The most typical data source for generating new terms is the collection itself being searched and the simplest way of weighting the query expansion terms is to use just the weighting function used by the ranking system. If more complex features than single terms are used for query expansion (e.g., phrases), the underlying ranking system must be able to handle such features" [41].

3.4 Why and When AQE Works

A document ranking system connects query terms by an implicit "OR". The advantage of this assumption is that it expands the boundaries of the query. In other words, there is a chance for increasing the recall and retrieving documents that do not include the original query terms.


For example, if the query "Thanksgiving" is expanded to also include "an official holidays in the USA, last Thursday of November, American tradition family gathering, feast, turkey, and black Friday," then the new query will retrieve the documents that contain the original term "Thanksgiving" in addition to other documents that include the other expanded terms.

This is one of the most distinguishing features of AQE, and it can be very helpful for search applications in specific domains such as the financial, medical, scientific, and technical domains.

These recall improvements can be achieved whether the query uses the implicit Boolean operator OR or AND. However, query expansion can also negatively affect the search results; it might hurt the precision of the results, since some expanded terms might not be related to the original query terms [81]. This problem may arise for various reasons. For example, the expanded terms might be related to only part of the query rather than to the query as a whole, and using proper nouns as expansion terms may make the results even worse [106].

Besides, the expanded terms might simply be irrelevant to the original query, for instance when AQE draws them from a top-ranked document that is itself irrelevant. A decrease in precision can also occur when relevant documents are pushed to a lower ranking, regardless of whether the expanded terms are related to the query concept [40].

For example, if the query "Will Smith" is expanded with "American actor,"

"movie," and "comedy", then documents about different actors that include the expanded

55

terms might appear as higher score results than a document about Will Smith that does not include the expanded terms.

Many experimental studies have shown the possibility of losing precision (e.g., Voorhees and Harman [1998]). Two crucial measures are used to evaluate the effectiveness of IR systems: recall and precision. Recent experiments affirm that AQE is one of the best approaches for improving both, and that it improves precision and recall by about 10% over other approaches (e.g., Mitra, et al. [1998], Carpineto, et al. [2002], Liu, et al. [2004], Lee, et al. [2008]) [41].

The findings discussed above further demonstrate the effectiveness of AQE as a technique. However, this effectiveness does not necessarily guarantee high precision when precision is the user's main concern; still, many recent studies confirm that AQE does not always decrease precision.

AQE might not be the appropriate method for all types of queries, especially in web search. According to Broder, web queries can generally be classified into three basic types: informational, navigational, and transactional [36]. Informational queries are the most amenable to AQE, since the users do not know specifically what they are looking for or cannot describe it clearly in words.

On the other hand, for the other two types of web queries, navigational and transactional, the intended pages are almost always well defined by the users through specific terms. There are two reasons for this: first, navigational queries target specific URLs; second, in transactional queries users are looking for a specific Web-mediated activity.

3.5 How AQE Works

Figure 13 shows the four major steps that can explain how AQE works [55].

Figure 13: Main steps of automatic query expansion.

As shown in Figure 13, the main steps of automatic query expansion are preprocessing of the data source, generation and ranking of candidate expansion features, selection of expansion features, and query reformulation. The following sections discuss each of these steps in detail.

3.5.1 Preprocessing of Data Source

This step transforms the raw data used for expansion into a format that can be used effectively in the following steps. It does not depend on the particular user query that is to be expanded; rather, it depends on the expansion method and on the type of data source in question. This section discusses the most popular preprocessing procedures.

Many query expansion techniques use the information contained in the top-ranked documents retrieved for the user's original query as a point of departure. Computing this initial retrieval run requires indexing the collection and running the query against the collection index. Indexing involves several steps [41] (a minimal sketch of these steps follows the list):

 Extracting text from documents such as HTML, PDF, MS Word, etc. (this applies only if the collection is made of such types of documents);

 Performing tokenization, which refers to extracting individual words, dismissing punctuation and case;

 Removing stop words, i.e., removing common words such as articles and prepositions;

 Applying stemming to the words, which refers to reducing words to their roots by removing inflection;

 Applying weighting to the words, which refers to deciding on the importance of each word in each document.
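The following minimal sketch shows these indexing steps with NLTK (a commonly used toolkit chosen here for illustration; it is not necessarily the preprocessing pipeline used in the dissertation's experiments, and it assumes the NLTK tokenizer and stop-word data have been downloaded):

    import nltk
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # One-time downloads, commented out after the first run:
    # nltk.download("punkt"); nltk.download("stopwords")

    raw = "Stemming reduces the inflected words to their roots."

    tokens = [t.lower() for t in word_tokenize(raw) if t.isalpha()]   # tokenization, drop punctuation/case
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]                     # stop-word removal
    stems = [PorterStemmer().stem(t) for t in tokens]                 # stemming

    weights = Counter(stems)    # a simple weighting step: raw term frequency
    print(stems)
    print(weights)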

3.5.2 Generation and Ranking of Candidate Expansion Features

After the first stage is finished, AQE generates and ranks the candidate expansion features. The importance of this step stems from the fact that most query expansion methods choose only a small number of the candidate expansion features to add to the query.


The input to this stage consists of the original query and the transformed data source; the output is a set of expansion features, each usually assigned a score. The original query may itself be preprocessed in order to remove common words and/or extract the important terms that are to be expanded [41]. The methods used to generate and rank the candidate features can be classified by the type of relationship between the query terms and the expansion features [41].

3.5.2.1 One-to-One Associations

There are many ways for generating and ranking candidates and one of the simplest ways depends on one-to-one associations between the query terms and the expansion features. In other words, each single query term is related to one expansion feature. By using different kinds of techniques, expansion features can be generated and scored for each single query term [41].
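As an illustration of a one-to-one association measure (a minimal sketch using a simple co-occurrence count over hypothetical documents; real systems typically use stronger measures such as mutual information or term correlation):

    from collections import Counter
    from itertools import combinations

    docs = ["java program compiles to bytecode",
            "the tv program starts at nine",
            "a computer program is a set of instructions",
            "java is a programming language"]

    # Count how often each pair of distinct terms co-occurs in a document.
    cooc = Counter()
    for doc in docs:
        terms = set(doc.split())
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1

    def associations(query_term, top_n=3):
        """Rank candidate expansion features for a single query term."""
        scores = Counter()
        for (a, b), c in cooc.items():
            if a == query_term:
                scores[b] += c
            elif b == query_term:
                scores[a] += c
        return scores.most_common(top_n)

    print(associations("program"))   # terms most associated with "program"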

3.5.2.2 One-to-Many Associations

Bai, et al. [2007] explain that one-to-one association suffers from the problem that it may add a term that is only partially relevant to a query term, which may not reflect the accurate relationship between the expansion term and the query as a whole. This problem can be explained with the following example: the term "program" might be strongly associated with the word "computer," but this expansion works only for some queries, such as "application program" and "Java program," while it might not work well for queries such as "government program," "TV program," or "space program" [16].

3.5.2.3 Analysis of Feature Distribution in Top-Ranked Documents

This technique does not fall into the previous categories because it does not focus on features directly related to individual query terms, whether single or multiple. The main idea is to use the top-ranked documents returned for the original query as the source of expansion: the terms extracted from those documents are used as expansion features. "In a sense, the expansion features are related to the full meaning of the query because the extracted terms are those that best characterize the pseudo-relevant documents as a whole, but their association with the query terms is not analyzed explicitly" [41].

3.5.2.4 Query Language Modeling

AQE also uses techniques such as query language modeling, which builds a statistical language model for each query, specifying a probability distribution over terms; the terms with the highest probabilities are taken as the best representative terms. These techniques are also called model-based techniques. Two well-known techniques exist: the first, suggested by Zhai, et al., is called the mixture model, and the second, suggested by Lavrenko, et al., is called the relevance model. In general, both models depend heavily on the top retrieved documents [75, 112].
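As a rough sketch of the model-based idea (a simplified relevance-model-style estimate over hypothetical pseudo-relevant documents; the actual formulations of Zhai, et al. and Lavrenko, et al. are more involved):

    from collections import Counter

    # Hypothetical top-ranked (pseudo-relevant) documents for some query,
    # together with retrieval scores normalized to sum to 1.
    top_docs = [("foreign minorities in germany include sorbs and frisians", 0.5),
                ("the sorbs are a slavic minority living in germany", 0.3),
                ("germany hosts several national minorities", 0.2)]

    expansion_scores = Counter()
    for text, doc_weight in top_docs:
        tokens = text.split()
        tf = Counter(tokens)
        for term, count in tf.items():
            # P(term | query model) ~ sum over docs of P(term | doc) * doc weight
            expansion_scores[term] += (count / len(tokens)) * doc_weight

    # The highest-probability terms become candidate expansion features.
    for term, p in expansion_scores.most_common(5):
        print(term, round(p, 3))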


3.5.2.5 A Web Search Example

To explain the features that different expansion methods use, consider the following example. Assume that you are looking for web pages with information about "foreign minorities Germany." Figure 14 shows the top results page that Google returned in response to this query (as of May 2009). Five of the first ten results are irrelevant to the query because of inappropriate matching with the query terms (e.g., some of the results concern German minorities that do not live in Germany). Figure 15 shows the results after applying automatic query expansion.


Figure 14: The first ten results returned by Google in response to the query "foreign minorities Germany" (as of May 2009) [41].


Figure 15: The first ten results returned by Google in response to the expanded query "foreign minorities Germany sorbs Frisians" (as of May 2009) [41].

For more details about applications of AQE and a classification of approaches, see Appendix A.


3.6 Selection of Expansion Features

The top expansion features are selected by ranking the candidate features. In general, only a small number of features is selected, so that the expanded query can be processed quickly. This is acceptable because using a limited number of representative terms is not, by default, less effective than using all the candidate expansion terms, an effect that can be attributed to noise reduction.

Researchers have suggested ideal numbers of features to include. According to Amati, et al., the ideal number ranges from five to ten, whereas Carpineto, et al., Buckley, et al., and Wong, et al. suggest that it can reach up to a few hundred [41].

According to Carpineto, et al. [41], the performance decrease accompanying non-ideal values is usually modest and, based on many experiments, the exact number of expansion features is of low relevance. In general, ten to thirty is the default number of expansion features. If we interpret the feature scores as probabilities, then we should choose the terms whose probability exceeds a certain threshold, e.g., p = 0.001 as suggested by Zhai and Lafferty [112]. Adopting more informed selection policies, such as using multiple term-ranking functions and selecting for each query the most common terms, or choosing a variable amount of expansion depending on the query difficulty, can be more convenient than searching for the ideal number of expansion terms. According to Billerbeck, et al., Buckley, et al., and Cao, et al., the optimal number of expansion features can vary depending on the type of query.


Besides, Cao, et al. suggest that when the fraction of expansion terms reaches one third, it can negatively affect retrieval performance [41]. Moreover, Carpineto, et al. and Cao, et al. claim that performance can be greatly enhanced when the best features are chosen for each query [41].

3.7 Query Reformulation

Query reformulation is the last step of AQE. This step builds the expanded query that will be submitted to the IR system; it usually consists of assigning a weight to each feature of the expanded query, which is referred to as query reweighting.

"The most popular query reweighting technique is modeled after Rocchio’s formula for relevance feedback [Rocchio 1971] and its subsequent improvements [Salton and

Buckley 1990], adapted to the AQE setting" [41].

The general formulation can be written as

w_{t,q'} = (1 - λ) · w_{t,q} + λ · score_t

where q' is the expanded query, q is the original query, λ is the parameter responsible for weighting the relative contributions of the original query terms and the expansion terms, and score_t is the weight assigned to the expansion term t.
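A minimal sketch of this reweighting step (hypothetical toy weights and λ = 0.4; the formula above is the one actually being applied):

    lam = 0.4   # relative contribution of the expansion terms

    original_query = {"thanksgiving": 1.0}                    # w_{t,q}
    expansion_scores = {"turkey": 0.8, "holiday": 0.5,        # score_t for candidate terms
                        "thanksgiving": 0.9}

    terms = set(original_query) | set(expansion_scores)
    expanded_query = {
        t: (1 - lam) * original_query.get(t, 0.0) + lam * expansion_scores.get(t, 0.0)
        for t in terms
    }
    print(expanded_query)   # w_{t,q'}: the reweighted, expanded query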


3.8 Related Work

Many algorithms have been implemented to solve the problem of text categorization, but most of the work in this area is restricted to English text. That is to say, text categorization has not been applied to many other languages, including Arabic.

Text classification is a fundamental task in document processing, especially because the information flood is getting enormous. Various classification approaches are tested on languages like English, German, French, and other European languages. For many European languages, there are many rule-based and statistical approaches that can be used for all fields of Information Retrieval and knowledge management. However, there are only a few approaches that can deal with the Arabic language and which are able to handle problems with the inflections that do not appear in other languages in a similar fashion.

In their paper entitled "Automatic Arabic Document Categorization Based on the

Naïve Bayes Algorithm," Elkourdi and Bensaid suggest a new approach to classify the

Arabic language automatically [48]. To do so, they used a statistical machine learning algorithm called Naive Bayes (NB). This approach can "classify non-vocalized Arabic web documents … to one of five pre-defined categories." The results of their work showed that the overall accuracy of this approach is 62% [48].

Similarly, Sawaf, Zaplo, and Ney provide a new approach for classifying and clustering Arabic documents. Their method mainly depends on statistical methods without providing any morphological analysis. Sawaf, Zaplo, and Ney argue that their approach seems to be very successful with Arabic-language documents [100].


Other scholars, such as Zrigui, Ayad, Mars, and Maraoui, propose a new approach for classifying Arabic-language documents. They use two types of algorithms: Support Vector Machine (SVM) and Latent Dirichlet Allocation (LDA). They conclude that this approach can be very effective for classifying Arabic-language documents [113].

In their paper entitled "An Arabic lemma-based stemmer for latent topic modeling," Brahmi, Ech-Cherif, and Benyettou introduce a new approach for classifying Arabic-language documents: "A new lemma-based stemmer is developed and compared to a root-based one for characterizing Arabic text. The Latent Dirichlet Allocation (LDA) model is adapted to extract Arabic latent topics from various real-world corpora." They maintain that this approach improves the classification of Arabic-language documents compared to the existing root-based stemming [35].

In 2009, Al-Shalabi, et al. expanded queries using Word Sense Disambiguation (WSD) and then compared the performance of a search engine before and after expanding the query. Their work can be summarized in the following steps:

 Submitting a query in Arabic.

 Reviewing the retrieved documents to decide whether each document is relevant or irrelevant to the query.

 Resubmitting the query after adding expansion terms, based on synonyms provided to the user by a dictionary.

The problem with this study is that it does not report the results that were found; the authors only mention that precision may decrease in many cases, especially when polysemous terms are added [9].


In 2014, Mahgoub, et al. expanded queries using three Arabic resources: the Arabic Wikipedia dump, the "Al Raed" dictionary, and the "Google WordNet" dictionary. Their work can be summarized as follows:

The first step locates the named entities or concepts used in the query in Wikipedia. If a named entity or a concept is located, the titles of its "redirect" pages, which point to the same concept, and its subcategories from the Wikipedia categorization system are added; otherwise, the other two dictionaries, Google WordNet and Al Raed, are used. Moreover, they applied two query expansion methodologies: the first is a single expanded query containing all expansion terms, and the second issues a separate expanded query for each term. Finally, they combined the results of these queries into one result list. The problem with their study was the accuracy (recall and precision, which in some cases reached only about 10%) [76].

A survey of literature shows that most of the studies in the area of topic modeling and query expansion are limited to European languages. Moreover, most of the studies have applied either topic modeling or query expansion to retrieve information. In other words, very few studies have combined both approaches, and none of them, to the best of my knowledge, has applied the combination of these two methods to Arabic language documents. This study investigates these issues thoroughly with regard to Arabic language in the following chapters. Furthermore, this study will compare the results in terms of accuracy and time complexity for both English and Arabic.


Chapter 4

Semantic Web and Arabic language

4.1 Importance of Arabic Language

Arabic is one of the Semitic languages, which include Hebrew, Akkadian, Phoenician, Tigre, Aramaic, Syriac, Ugaritic, Amharic, Geez, and Tigrinya. All of these languages except Arabic have died or are used only in limited ways. For example, the Akkadian and Ugaritic languages died out long ago. Hebrew is one of the oldest Semitic languages; it had disappeared but has recently been revived in Israel. The Tigre language is used mainly as a liturgical language of the Ethiopian and Eritrean Orthodox Tewahedo Churches. Geez, which was the official language of the Kingdom of Aksum and the Ethiopian imperial court, is now used only in the literature of the Ethiopian Orthodox Tewahedo Church, the Eritrean Orthodox Tewahedo Church, the Ethiopian Catholic Church, and the Beta Israel Jewish community.

There are two main reasons why the Arabic language is still used and still has many speakers. The first reason is the prosperity of the Arabic language during the Jahiliyyah (pre-Islamic) age of Arab society, owing to the art of poetry.

Many poems were written by Arab poets, and each tribe encouraged its members to develop their skills in Arabic and become skillful poets. Therefore, the Jahiliyyah era is considered the richest period of Arabic speakers and poets. In that era, some poets composed the Mu'allaqat, celebrated long poems, and annual competitions were held to choose the poem with the greatest eloquence and meaning.

The second reason was the mission of the prophet Mohammed and the holy Qur'an. The Qur'an, the holy book of all Muslims, is in Arabic. The prophet Mohammed taught the Islamic rules in Arabic, and most of the basic rituals of worship in Islam must be performed in Arabic by both Arab and non-Arab people. For example, there are five daily prayers in which part of the Qur'an must be recited in Arabic. These are the two main reasons why the Arabic language survived and why non-Arabs were influenced to learn the basics of the Arabic language.

The Arabic language is one of the most widely spoken languages in the world in terms of the number of speakers. It is spoken by approximately 330 million people in the twenty-three countries of the Middle East and North Africa where Arabic is the official language.

Besides, the Arabic language is the religious language for all Muslims from all over the world since Arabic is the language of the holy Qur’an.

The Arabic alphabet has 28 letters, and Arabic is written in the opposite direction to English, i.e., from right to left. Arabic is considered one of the semantically richest languages, often having a specific word for each specific thing, and it is one of the six official languages of the United Nations [31, 97].

According to "The World’s Ten Most Influential Languages" by George Weber,

Arabic is among the world’s ten most influential languages. The formula that they used to decide the most influential languages is based on the following factors:


 Number of primary speakers: maximum 4 points.

 Number of secondary speakers: maximum 6 points.

 Economic power of the countries using the language: maximum 8 points.

 Number of major areas of human activity in which the language is important: maximum 8 points.

 Population of the countries using the language: maximum 7 points.

 Socio-literary prestige of the language: maximum 4 points (plus an additional point for being an official UN language).

As a result, based on this formula, the Arabic language is the fifth most influential language in the world as shown in Figure 16 [109].

Figure 16: The real strength of the top ten languages [109].


4.2 Right-to-Left Languages and the Semantic Web

Although the Arabic script accounts for 8.9% of the world's languages, the Arabic language is not well represented in the Semantic Web. As Table 8 shows, other scripts with a lower percentage, such as Hanzi, are better represented in the Semantic Web. In terms of script, it is also clear that other languages that use the same script, such as Farsi, do have some Semantic Web applications. In theory, therefore, neither the percentage of users of the script nor the script itself should prevent Arabic from being represented in the Semantic Web [6].

English has a left-to-right script and also has the highest percentage of use on the web (56%) [1]. This has led most Semantic Web developers to design their applications and tools to fit the English script, i.e., to support left-to-right scripts. However, there have been attempts to support right-to-left scripts such as those of Hebrew and Farsi.

Table 8: Clustering of the world’s languages based on their script family [86].


4.3 Arabic Ontology

In this section, Arabic ontology is introduced because in Chapter 6 we use the Arabic WordNet ontology for the Arabic language and the WordNet ontology for the English language.

Ontology learning aims at extracting related concepts and relations from the corpus in question in a semi-automatic way in order to form an ontology. The life cycle of an ontology involves six phases: ontology creation, ontology population, ontology validation, ontology deployment, ontology maintenance, and ontology evolution [37].

The ontology learning process can be further sub-divided into: extracting terms, discovering synonyms, obtaining concepts, extracting concept hierarchies, defining relations among concepts, and deducing rules or axioms. Applying these steps makes ontology matching possible and also makes the branches related to a topic available to any user [23].

Supporting ontologies in different languages is also a challenge for web designers who must satisfy the needs of millions of World Wide Web users. The demand for obtaining information in the specific language users are searching in has increased [23]. Thus, the need for Arabic ontologies has also increased, because English ontologies cannot simply be translated into Arabic, for the following reasons:

 There is a clear lack of Arabic editors for the OWL language and RDF files.

 Tools that support the Arabic character set do not fully support right-to-left script.

 There is almost no Arabic-language parser that supports RDF files in semantic editors.

 There is no Arabic metadata definition similar to English [10].

 There are no open-source Arabic Semantic Web software tools and Web services.

Table 9: Listing of tools used in building Semantic Web applications [6].

Function: Ontology Editor
Tool: Protégé
Description: Visual ontology editor written in Java with many plug-in tools.

Function: Ontology Repository
Tool: Sesame
Description: An open-source RDF database with support for RDF Schema inferencing and querying.

Function: Information Extraction
Tool: GATE
Description: Open-source infrastructure that provides language processing capabilities. Plug-ins for processing Arabic documents are available at: http://gate.ac.uk/gate/doc/plugins.html#arabic

Function: Reasoners and Processors
Tool: Jena
Description: A Java framework to construct Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL, and SPARQL and includes a rule-based inference engine.

Table 10: Arabic support summary [58].

Tool     | RDF                   | OWL                   | Query
Protégé  | Supports Arabic       | Limited support       | Limited support
Jena     | Supports Arabic       | Supports Arabic       | Limited support
Sesame   | Limited support       | Limited support       | No support for Arabic
KAON2    | No support for Arabic | No support for Arabic | No support for Arabic

4.4 The Arabic language and the Semantic Web: Challenges and opportunities

A classic Semantic Web application consists of two main parts: the semantic annotation process and the semantic query and reasoning process.

The semantic annotation process, as depicted in Figure 17, requires three components: an extraction engine, an ontology, and an annotation generator. First, the extraction engine is responsible for finding the entities of interest in a document using techniques such as wrappers and information extraction ("IE").

Second, an ontology describes the domain of interest. Third, the annotation generator is responsible for explicitly adding semantic meaning to the extracted entities using the ontology.


Figure 17: The three main components of the semantic annotation process [6].
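To illustrate the role of the annotation generator (a minimal sketch using the rdflib library with a hypothetical ontology namespace and an entity assumed to have been found by the extraction engine; this is not a description of any specific tool in Table 9):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/ontology#")     # hypothetical domain ontology
    g = Graph()
    g.bind("ex", EX)

    # Entity assumed to have been extracted from a document by the extraction engine.
    entity = URIRef("http://example.org/doc1#WillSmith")

    # The annotation generator ties the extracted entity to an ontology concept.
    g.add((entity, RDF.type, EX.Actor))
    g.add((entity, RDFS.label, Literal("Will Smith", lang="en")))
    g.add((entity, EX.mentionedIn, URIRef("http://example.org/doc1")))

    print(g.serialize(format="turtle"))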

The semantic query and reasoning process, on the other hand, requires tools that enable "intelligent" searching through Semantic Web data rather than uninformed string matching, which is currently employed in traditional Information Retrieval systems. These tools use the layer of descriptive semantics as a medium for communication between agents and as a way to inference over data and derive conclusions.

Investigation of the Semantic Web field reveals several issues that explain the lack of Arabic research. These reasons relate to the lack of technology support, tools, applications, and adequate resources. There are many languages that most companies around the world do not take into consideration, and Arabic is one of them. For example, for many years Android and Apple devices did not support Arabic fonts, so Arabic consumers had no choice but to use technologically inferior alternatives that allowed Arabic support on Apple and Android products. In addition, there are no Arabic companies concerned with manufacturing devices that deal directly with the Arabic language, and only a few Arabic companies have developed tools and applications. We can summarize the challenges and difficulties in the following points:

1) Lack of Arabic support in existing Semantic Web tools:

A specific problem with Semantic Web tools that process Arabic text concerns encoding. Different encodings of Arabic script exist on the Web; the dominant encodings are UTF-8, Windows-1256, and ISO-8859-6 [105]. Moreover, most of the Semantic Web tools we have encountered were built using Java, which supports internationalization. Therefore, there is a strong need to consolidate the different Arabic encodings or simply adhere to one encoding scheme when representing Arabic text (i.e., Unicode). Typical Semantic Web developers' tools use Unicode throughout (Carroll, 2005); hence this might solve part of the support problem [42].

2) Lack of Arabic Semantic Web applications:

According to the OntoSelect ontology library, English accounts for 49% of the ontologies in the library, which suggests an obvious lack of Arabic in the Semantic Web world. Figure 18 shows the limited number of Arabic Semantic Web applications compared to other languages. This problem can be explained as a result of the lack of tools and software to process Arabic script throughout all the steps of the semantic annotation process [61].


Figure 18: A pie chart showing the distribution of languages used in creating ontologies stored in the OntoSelect library [6].

3) Limited support for Arabic research in the field of Semantic Web technologies:

Collaboration between academic research centers and grant bodies has resulted in many Semantic Web research projects, and investment in the Semantic Web has produced many Semantic Web tools such as GATE, Jena, and Protégé. Several reasons lead to the limited research on the Arabic language, among them a lack of funding, a lack of adequate resources, and a lack of interest in the field of the Semantic Web [6].


In addition, most Arab countries have no plans or goals to build or support research centers for the Arabic language and the Semantic Web that would help people catch up with the evolution of the technology.

4) Intrinsic difficulties of the language:

Several intrinsic difficulties have played a negative role in the process of creating such tools for Arabic. Specifically, the Arabic language presents complex morphology, the absence of capital letters, and short vowels that are usually omitted in writing.

The Arabic language is composed of verbs, particles, and nouns, which are derived from approximately 10,000 roots [97]. A noun is a name or a word that describes a person, a thing, or an idea. Verbs are similar to English verbs; verbs in Arabic are classified into perfect, imperfect, and imperative. Arabic particles include prepositions, adverbs, conjunctions, interrogative particles, exceptions, and interjections. As stated in [96], "Arabic is highly inflectional and derivational, which makes morphological analysis a very complex task" and "Capitalization is not used in Arabic, which makes it hard to identify proper names, acronyms, and abbreviations."

All the previously discussed issues indicate that introducing an Arabic Semantic Web is not an easy task.

There are many potential challenges and opportunities that users of the Arabic language may encounter with the commencement of Semantic Web solutions. Some of these challenges include the following:

 The need for the development of controlled vocabularies and ontologies to explicitly define machine-processable semantics. These controlled vocabularies and ontologies would help people and machines communicate concisely and would support the exchange of semantics and syntax between them. Promising efforts of this kind are being developed; they include the construction of an Arabic WordNet [49].

 Computational processing of Arabic text differs from its English counterpart. The Arabic language encompasses more complex morphological, grammatical, and semantic aspects, so existing Natural Language Processing (NLP) algorithms used for the English language cannot be directly re-purposed for Arabic. This problem also has its roots in the Semantic Web domain. Information extraction, one of the Semantic Web's main processes, relies extensively on extracting concepts from Web documents; this process requires analyzing the document content (morphologically, grammatically, or semantically) in order to relate the extracted instances to a pre-defined ontology concept. Thus, more tailored NLP tools such as GATE are needed to process the Arabic language accordingly [3].

As for opportunities, the Semantic Web provides rich opportunities for users of the Arabic language to process data at many levels. In terms of data, large and growing repositories of Arabic content in business, science, government documents, and email messages already exist on the Web. The Semantic Web provides users with a common framework to integrate these existing Arabic repositories and to derive new meaning, value, and knowledge from them. An important element of the Semantic Web is the language used to record how the data relates to real-world concepts. Arabic content on the Web can be networked in meaningful ways in semantic-enhanced applications, since the concept of a multilingual Semantic Web supporting Arabic content is attainable. In terms of language barriers, the Semantic Web offers users of the Arabic language, and machines alike, the chance to move from one part of the Web to another based on a related meaning, regardless of the language used to represent that meaning.

Opportunities can also be highlighted from local and regional perspectives. In the era of knowledge economies, modern organizations and governments are increasingly turning to knowledge management as a differentiating asset to generate greater productivity and create new value. As nations in the Arab region begin the era of e-Government and e-Commerce, shared data understanding and representation for Arabic content is essential. This can be achieved by using Semantic Web technologies; thus computer-mediated Arabic content is attainable by employing Semantic Web tools and applications that aim to provide machine-processable data with a universal representation [6].


CHAPTER 5

Generating Topics

5.1 Introduction

Nowadays, we encounter a huge amount of information in many different fields, for example Wikipedia articles, news articles, astronomical survey data, Flickr images, and social networks. According to Hilbert and López, the world's technological capacity to store information has roughly doubled every 40 months since the 1980s [63], and according to IBM, since 2012 about 2.5 exabytes of data have been created every day. This suggests that the main problem we face today with regard to information is information overload. For that reason, there is a need for algorithmic tools to search, organize, and understand this huge amount of information. For our purposes, when we use Latent Dirichlet Allocation (LDA) we can consider a topic as a probability distribution over a collection of words, and when we use Latent Semantic Indexing (LSI) we can consider a topic as a similarity distribution over a collection of words. A topic model in LDA is a generative model, i.e., a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics. According to Landauer and Dumais, LSI is a "theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text" [72]. The main goal of topic modeling is to discover and provide a "thematic summary" of a collection of documents. In other words, we need to answer the following question: what themes are being presented in the documents in question [103]?

In topic models, the semantic properties of a text are expressed in terms of topics; as a result, topics can be used as background knowledge that provides semantically related words to expand the literal matching of the words present in a given text [103]. Topic modeling can be used in various IR applications to overcome the problems of literal word-matching algorithms, since it adds an extra semantic layer that addresses the vocabulary mismatch problem in terms of synonymy and polysemy, as described earlier. These problems occur because the users of IR systems usually do not use the same words that indexers use. For example, assume that a person writes a query with the term "pool" to search the web for information about thread pools in a programming language. The web search engine may return results related to a swimming pool, a group of people, a programming language, or even an online shop.

In this chapter Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) have been used to classify the corpus into specific topics. The experiments were performed using both Arabic and English documents. The results show that LDA is more accurate than LSI for both Arabic and English documents; we discuss the results in more detail later in this chapter. The results also show that LSI and LDA can identify topics for the corpus with high accuracy, in both Arabic and English documents, and that LSI is faster than LDA in terms of time complexity.


5.2 Data set

Due to the unavailability of free Arabic resources, we have adopted the Arabic Wikipedia and the English Wikipedia to select Arabic and English documents. According to Wikipedia, the Arabic Wikipedia is currently the 22nd largest edition of Wikipedia by article count, and is the first among the Semitic languages. As of August 2014, the Arabic Wikipedia has over 300,000 articles. The English Wikipedia, on the other hand, is currently the largest edition of Wikipedia by article count; it includes more than 4,878,735 articles and grows every day by over 800 new articles. Table 11 shows more details about the English and Arabic Wikipedia.

Table 11: English and Arabic Wikipedia

Number | Language | Wiki | Number of articles | Number of users | Number of images
1      | English  | en   | 4,879,148          | 25,255,450      | 855,816
22     | Arabic   | ar   | 371,803            | 1,012,177       | 23,607

Studying several other corpora to measure the impact of different corpora on the results is a possible future direction. Moreover, as this study is limited to Wikipedia documents, future studies could examine different types of corpora. Furthermore, future studies could use domain-specific corpora, such as medical, political, or economic texts, which might yield different results.


5.3 Experimental Results

In this section we measure the accuracy for topic modeling using LSI and LDA for both English and Arabic documents.

Why LSI?

As we mentioned earlier, one of the major shortcomings of a number of IR methods is that they fail to treat synonymy and polysemy correctly. LSI is considered an advanced

Information Retrieval method that solves the problems caused by synonymy and polysemy.

It goes beyond traditional Information Retrieval (IR) methods that rely on keyword matching techniques and rather "deals with the concept, and carries out a search on this basis" [11]. Deerwester et al. describe the three major advantages of using the LSI representation with the following labels: synonymy, polysemy, and term dependence [46].

As we discussed earlier, synonymy refers to different words that have the same meaning. Synonymy may also include word inflection (singular and plural forms of a word), whereas polysemy refers to a word that has different meanings. These problems may result in erroneous and irrelevant retrievals, which decrease the accuracy. More exactly, the synonymy problem may decrease the recall, whereas polysemy may decrease the precision. The third advantage of using LSI is that it can deal with term dependency. "The traditional vector space model assumes term independence and terms serve as the orthogonal basis vectors of the vector space. Since there are strong associations between terms in language, this assumption is never satisfied. While term independence represents the most reasonable first-order approximation, it should be


possible to obtain improved performance by using term associations in the retrieval process. Adding common phrases as search items is a simple application of this approach.

On the other hand, the LSI factors are orthogonal by definition, and terms are positioned in the reduced space in a way that reflects the correlations in their use across documents. It is very difficult to take advantage of term associations without dramatically increasing the computational requirements of the retrieval problem. While the LSI solution is difficult to compute for large collections, it needs only be constructed once for the entire collection and performance at retrieval time is not affected" [95].

Why LDA?

As we discussed earlier, Latent Dirichlet Allocation (LDA), which was introduced by Blei et al. [33], is a formal generative latent mixture model of documents. LDA has quickly become one of the most popular probabilistic text modeling techniques in machine learning and natural language processing (NLP).

One of the major problems with LSI is ambiguity. For example, when searching for the term "apple", how could a search engine determine whether the user means Apple's products or the apple that is a kind of fruit? LDA, on the other hand, tries to solve the problem of ambiguity by comparing a document to two topics and determining which topic is closer to the document, across all combinations of topics that seem most relevant. For that reason, LDA helps an Information Retrieval system (such as a search engine) determine which documents are most relevant to which topics, and for that reason the results show that LDA is more accurate than LSI for generating topics.


Figure 19: A visualization of the probabilistic generative process for three documents [103].

As we can see from Figure 19, DOC1 draws from Topic 1 with probability 1, DOC2 draws from Topic 1 with probability 0.5 and from Topic 2 with probability 0.5, and DOC3 draws from Topic 2 with probability 1. The topics are represented by β1:K (with K = 2 in this case), where β is the corpus level parameter and K represents the number of topics to be generated.
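In the standard notation of Blei et al., the process visualized in Figure 19 can be summarized as follows (a restatement of the model, not a new assumption): for each document d a topic-proportion vector is drawn, and each word is produced by first sampling a topic assignment and then sampling the word from that topic's distribution:

    \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
    z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
    w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})

In the example of Figure 19, the topic proportions are θ_DOC1 = (1, 0), θ_DOC2 = (0.5, 0.5), and θ_DOC3 = (0, 1).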


Figure 20: The intuitions behind latent Dirichlet allocation [32].

5.3.1 Experiments on an English corpus

5.3.1.1 Processing Steps

A corpus consists of a set of documents. The present study makes use of the Gensim tool (an open source topic-modeling tool implemented in the Python programming language) to run the LSI and LDA experiments. Each corpus is saved in one text file containing a number of lines equal to the number of documents in that corpus; in other words, each line contains one document. The first corpus is for English Wikipedia documents and the second corpus is for Arabic Wikipedia documents.
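A minimal sketch of this Gensim workflow is shown below; the file name, the whitespace tokenization, and the topic and keyword counts are illustrative placeholders rather than the exact configuration used in our experiments.

    # Build a dictionary and a bag-of-words corpus from a one-document-per-line
    # text file, then fit LSI and LDA models with Gensim.
    from gensim import corpora, models

    def load_documents(path):
        # Each line of the file holds one (already preprocessed) document.
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f if line.strip()]

    docs = load_documents("english_corpus.txt")          # placeholder file name
    dictionary = corpora.Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=10)
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=10, passes=5)

    # Inspect the ten keywords that represent each generated topic.
    for topic_id in range(10):
        print("LSI topic", topic_id, lsi.show_topic(topic_id, topn=10))
        print("LDA topic", topic_id, lda.show_topic(topic_id, topn=10))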


5.3.1.2 English Corpus Creation

Constructing the corpus is an important step for topic modeling. The second step in the LSI and LDA analysis requires pre-processing the corpus by removing non-English words, removing digits and punctuation marks { : , ; ? @ % * ! & $ # [ ] … }, and removing stop words. Stop words are words, such as prepositions and articles, that have less lexical meaning compared to content words such as verbs or nouns. The importance of creating such a list stems from the fact that LSI and LDA should capture the relations between meaningful words in order to reach a good level of accuracy. Also, in all the experiments, we used the Paice/Husk stemmer, as it showed superior performance over other stemming approaches. Using stems for searching increases recall by retrieving terms that have the same roots but different endings.
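The following sketch illustrates such a preprocessing pass; the regular expression and the stop-word list are illustrative assumptions, and NLTK's LancasterStemmer is used because it implements the Paice/Husk algorithm.

    # Strip digits/punctuation, remove stop words, and stem with NLTK's
    # LancasterStemmer, which implements the Paice/Husk algorithm.
    import re
    from nltk.corpus import stopwords                # requires nltk.download("stopwords")
    from nltk.stem import LancasterStemmer

    STOPWORDS = set(stopwords.words("english"))
    stemmer = LancasterStemmer()

    def preprocess_english(text):
        text = text.lower()
        text = re.sub(r"[^a-z\s]", " ", text)        # keep English letters only
        tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 1]
        return [stemmer.stem(t) for t in tokens]

    print(preprocess_english("The indexes and the index: teaching information retrieval."))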

Paice proposes some metrics to evaluate a stemmer regardless of the task carried out: the under-stemming index (UI), the over-stemming index (OI), the stemming weight (SW), and an error rate relative to truncation (ERRT) [82]. According to Paice, Paice/Husk is the strongest stemmer, followed by Lovins, which is still considered a strong stemmer, and finally Porter, which is the weakest among the three. Later, in 2003, Frakes and Fox supported the Paice/Husk result and affirmed that the Paice/Husk stemmer is the strongest, followed by the Lovins stemmer, which is considerably stronger than the Porter stemmer [82]. Table 12 shows a summary of stemmer features.


Table 12: Summary of stemmer features [82].

Stemmer      Number of rules   Number of suffixes   Use of recoding   Use of partial matching   Strength      Use of constraint rules
Lovins       29                294                  35 rules          Yes                       Strong        Yes
Dawson       unknown           1200                 No                Yes                       Strong        Yes
Porter       62                51                   No                No                        Weak          Yes
Paice/Husk   115               unknown              No                No                        Very strong   Yes
Krovetz      unknown           5                    Yes               No                        Very weak     No

It is worth mentioning that only ten keywords are used to represent each topic, regardless of whether the corpus is classified into 5, 10, or 15 topics. The reasons behind selecting ten words as keywords are supported by the experiments, which show that: 1) selecting five keywords or fewer does not necessarily represent a topic such as politics or economics, since these five keywords can describe different topics; 2) selecting 10 keywords is considered sufficient to represent a topic; 3) selecting 15 keywords or more can be used to represent very specific topics, such as politics in the Middle East. Moreover, manual investigation, in this regard, refers to the user's judgment, i.e., only the user can decide whether the keywords chosen for each topic are relevant or not.

5.3.1.3 Experiment 1

We apply LSI to an English corpus, and we set the number of topics to five, and

the number of keywords for each topic to ten. After manual investigation, we find that the

accuracy is about 82%.


5.3.1.4 Experiment 2

We apply LSI to an English corpus, and we set the number of topics to ten, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 84%.

5.3.1.5 Experiment 3

We apply LSI to an English corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 79%.

5.3.1.6 Experiment 4

We apply LDA to an English corpus, and we set the number of topics to five topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 85%.

5.3.1.7 Experiment 5

We apply LDA to an English corpus, and we set the number of topics to ten topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 88%.

5.3.1.8 Experiment 6

We apply LDA to an English corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 81.3%.


5.3.2 Experiments on an Arabic corpus

5.3.2.1 Arabic Corpus Creation

Constructing the corpus is an important step for topic modeling in LSI and LDA.

The second step in the LSI and LDA analysis requires pre-processing the corpus by performing the following steps:

 Remove digits and punctuation marks {: @ % * ! & $ # [ ] …}.

 Remove all vowels {~, ُ, َ, ِ, …}.

 Replace "ا", "آ", "إ", "ؤ", "ئ", and "ء" with the letter "أ" (aleph). The reason for this conversion is the fact that all forms of Hamza "ء" are represented in dictionaries as one form, and people often misspell the different forms of aleph "أ".

 Replace "ة" with "ه". The reason behind this normalization is the fact that there is not a single convention for spelling "ة" or "ه" when they appear at the end of a word.

 Replace "ى" with "ي". The reason behind this normalization is the fact that there is not a single convention for spelling "ى" or "ي" when they appear at the end of a word.

 Remove diacritics.

 Remove "~".

 Replace the letter "ئ" with the letter "ء".

 Replace the letter "ؤ" with the letter "ء".

 Remove all non-Arabic words in the documents.

 Remove all Arabic function words (stop words), such as "من", "عن", "في", "حتى", "حين", etc. The Arabic function words are words, such as pronouns and prepositions, that are not useful in a document retrieval system.

 Apply the Light-10 stemmer. In all the experiments on the Arabic corpus, we apply the Light-10 stemmer, which was developed earlier by Larkey (2007). This stemmer was adopted as a result of its superior performance over other stemming approaches [76].

Instead of stemming the whole corpus before indexing it, the Light-10 stemmer is based on grouping sets of words with the same stem that were found in the same document into a dictionary. Later on, this dictionary can be used in expansion. This helps reduce the probability of matching words that have the same stem but do not have the same meaning, since they must first be found in the same document in a given corpus in order to be used in expansion. Consider the following example in Table 13:

Table 13: Example of two words sharing the same stem but having different senses [76].

Arabic Word    Stem    English Equivalent
الطاعة          طاع      Obedience
الطاعون         طاع      Plague

We see that both words share the same stem "طاع", yet we do not expand the word "الطاعة" with the word "الطاعون", as there is no document in the corpus that contains both words.
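A rough sketch of the normalization steps listed above, together with the per-document grouping of same-stem words illustrated in Table 13, is given below. Here light10_stem is only a placeholder for an implementation of Larkey's Light-10 stemmer, and the stop-word set is a tiny illustrative subset.

    import re
    from collections import defaultdict

    DIACRITICS = re.compile(r"[\u064B-\u0652\u0670~]")       # short vowels, shadda, sukun, "~"

    def normalize_arabic(text):
        """Apply the normalization steps listed above (slightly simplified)."""
        text = DIACRITICS.sub("", text)                      # remove vowels / diacritics
        text = re.sub(r"[اآإؤئء]", "أ", text)                # unify aleph / hamza forms
        text = text.replace("ة", "ه")                        # ta marbuta -> ha
        text = text.replace("ى", "ي")                        # alef maqsura -> ya
        return re.sub(r"[^\u0621-\u064A\s]", " ", text)      # drop digits, punctuation, non-Arabic

    # A tiny illustrative subset of the Arabic function (stop) words.
    ARABIC_STOPWORDS = {normalize_arabic(w) for w in ("من", "عن", "في", "حتى", "حين")}

    def light10_stem(word):
        # Placeholder for the Light-10 stemmer: only a leading definite article
        # is stripped here so that the sketch stays self-contained.
        return word[2:] if word.startswith(("ال", "أل")) and len(word) > 4 else word

    def build_stem_dictionary(documents):
        """Group same-stem words only when they co-occur in one document (cf. Table 13)."""
        stem_dict = defaultdict(set)
        for doc in documents:
            tokens = [w for w in normalize_arabic(doc).split() if w not in ARABIC_STOPWORDS]
            per_doc = defaultdict(set)
            for w in tokens:
                per_doc[light10_stem(w)].add(w)
            for stem, forms in per_doc.items():
                if len(forms) > 1:                           # only co-occurring forms are grouped
                    stem_dict[stem] |= forms
        return stem_dict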


5.3.2.2 Experiment 1

We apply LSI to an Arabic corpus, and we set the number of topics to five topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 74.1%.

5.3.2.3 Experiment 2

We apply LSI to an Arabic corpus, and we set the number of topics to ten topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 77%.

5.3.2.4 Experiment 3

We apply LSI to an Arabic corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 71.2%.

5.3.2.5 Experiment 4

We apply LDA to an Arabic corpus, and we set the number of topics to five topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 76.3%.


5.3.2.6 Experiment 5

We apply LDA to an Arabic corpus, and we set the number of topics to ten topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 79.8%.

5.3.2.7 Experiment 6

We apply LDA to an Arabic corpus, and we set the number of topics to fifteen topics, and the number of keywords for each topic to ten. After manual investigation, we find that the accuracy is about 73.4%.

Table 14 and Table 15 summarize the experimental results.

Table 14: Accuracy for topics generation from an English corpus.

English corpus Accuracy

LSI with 5 topics, number of keywords = 10 82%

LSI with 10 topics, number of keywords =10 84%

LSI with 15 topics, number of keywords = 10 79%

LDA with 5 topics, number of keywords = 10 85%

LDA with 10 topics, number of keywords =10 88%

LDA with 15 topics, number of keywords = 10 81.3%


Table 15: Accuracy for topics generation from an Arabic corpus.

Arabic corpus Accuracy

LSI with 5 topics, number of keywords =10 74.1%

LSI with 10 topics, number of keywords = 10 77%

LSI with 15 topics, number of keywords =10 71.2%

LDA with 5 topics, number of keywords = 10 76.3%

LDA with 10 topics, number of keywords = 10 79.8%

LDA with 15 topics, number of keywords = 10 73.4%

5.4 Discussion

As we discussed earlier in section 5.3 (experimental results), the experiments show that LDA is more accurate in generating the topics than LSI for both Arabic and English corpora. The LDA is more accurate in generating the topics than LSI for the English documents by about 4 percentage points, whereas, LDA is more accurate in generating the topics than LSI for the Arabic documents by about 3 percentage points.

Also, the experiments show that LSI is faster in generating the topics than LDA for both Arabic and English corpora. Moreover, the experiments show that the topics generated for English documents with both techniques, LSI and LDA, are more accurate than the topics generated for Arabic documents with both techniques. To the best of my knowledge, there is no study that compares the accuracy of generating topics between LSI and LDA for both English and Arabic documents.


In topic modeling for the Arabic language, only a few works have been published so far. EL-Kourdi and Ben Said [48] suggest a new approach to classify Arabic documents automatically based on the Naïve Bayes algorithm. They used 1500 web documents, and they classified the documents into five categories. They report that the accuracy is about 68.78%. When we compare our results to their results, we find that our results show an improvement in accuracy of more than 5 percentage points when we use LSI to generate 5 topics, and an improvement of about 8 percentage points when we use LDA to generate 5 topics.

In our experiments, we find that the accuracy in classifying Arabic topics is less than the accuracy in classifying English topics due to the following challenges:

 Arabic language morphology is considered a complex morphology. For that

reason, Arabic is considered one of the highly varied languages. In general, an

Arabic word can be classified into one of three morpho-syntactic categories:

nouns, verbs, or particles. Other scholars try to use other categories, such as

prepositions and adverbs that are used in English language, but this method did

not add any advantage [73, 83, 104].

 In Arabic language the root is an important element since different words may

be derived based on specific patterns or schemas. The root is considered a

linguistic unit that provides the semantics for both Arabic and English language.

For Arabic language, the root is a non-vocalized word that usually consists of

three letters, and sometimes four or five letters. Table 16 provides some derivations from the root علم [66, 104].


Table 16: Some derivations from the root علم [35].

Arabic Word       عُلِم        عَلَّم              عَلَم          عِلم
English Meaning   Be known   Teach/instruct   Flag/banner   Knowledge

 Vocalization: In Arabic language the words are vocalized with diacritics such

as {~, ُ,َُ , ُ ِ ُ,…}, but in real life only the holy Qur’an and some formal documents

include full vocalization. Readers with sufficient knowledge of Arabic may

understand Arabic texts without vocalization by depending on the context of

the word or the term in question. However, this is not always the case. This fact

emphasizes the ambiguity that Arabic words might cause, which is a problem

that needs to be taken care of, specifically as part of the morphology. Table 17

gives an example for a word composed of three letters [35].


Table 17: Four possible solutions for the word بسم.

Solution   Morphology    Vocalization   English Meaning
1          Noun          بَسْم           Smiling
2          Verb          بَسَم           Smile
3          Prep + Noun   بِسْم           By the name of
4          Prep + Noun   بِسَم           With poison

 Agglutination: refers to the idea of using more than one lexical unit within the

same word. These lexical units can be embodied within the word and linked

together at the same time. In Arabic language, a word can be extended by

attaching four kinds of affixes. The four affixes are antefixes, prefixes, suffixes,

and postfixes. Figure 21 shows several types of affixes that are attached to the core علم in the agglutinated form سيعلمونه. This case can lead to high ambiguity when extracting the correct form from the agglutinated form. In addition, as we demonstrated above in Table 17, the morphology analysis becomes more difficult for non-vocalized texts [35].


Figure 21: Segmentation of the Arabic agglutinated form سيعلمونه [35].

According to the results in Table 14 and Table 15, we find that the most accurate results after manual investigation occur when the number of topics equals ten. We applied our experiments by setting the number of topics to five, ten, and fifteen topics, respectively.

When we classify our corpus into five topics, the returned results were general, i.e., not detailed. For example, terms such as "information" and "bacteria" were placed in one topic (sciences), but when we classify our corpus into ten topics, terms such as "information" and "bacteria" were classified into two topics (computers and medical). This supports the claim that our system is working well, since the first five topics are also included in the ten topics. But when we classify our corpus into fifteen topics, there were some redundant results. For example, the term "information" appeared in several topics, such as computers, medical, and finance.


In order to find the LSI time complexity, LSI needs to be solved by Singular Value Decomposition (SVD), as we discussed in Chapter 3. For more details about SVD, see the Appendix. Normally, the time complexity of the SVD application is O(min{MN², NM²}), where M refers to the number of rows of the input matrix and N refers to the number of columns. Therefore, the time complexity becomes a problem when both M and N are large [108]. On the other hand, the time complexity of inference in LDA strongly depends on the effective number of topics per document. There are two possible cases:

Case one: When a document is generated from a small number of topics, the time complexity for the MAX-LDA can be solved in polynomial time.

Case two: If a document can use an arbitrary number of topics, the time complexity for the MAX-LDA is NP-hard.

Moreover, the application of LDA can be computationally expensive both in terms of time and memory. The time complexity of LDA is O(MNT + T³). In terms of space or memory, LDA requires O(MN + MT + NT), where M refers to the number of samples, N refers to the number of features, and T = min(M, N) [38, 101]. For more details about LDA, see Appendix A.
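To make the SVD step behind LSI concrete, a toy numerical example is shown below; the term-document matrix is invented for illustration and is not data from our corpora.

    # Factor a small M x N term-document matrix with SVD and keep the top k
    # singular values/vectors, as LSI does.
    import numpy as np

    A = np.array([[2, 0, 1, 0],       # toy counts: M = 5 terms, N = 4 documents
                  [1, 1, 0, 0],
                  [0, 3, 1, 0],
                  [0, 0, 2, 1],
                  [1, 0, 0, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # cost grows like O(min(M*N^2, N*M^2))
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation used by LSI
    print(np.round(A_k, 2))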


Chapter 6

Topic Modeling and Query Expansion

6.1 Introduction

As we discussed in the previous chapters, the current web is understandable by humans only. That is to say, machines cannot understand information on the Web as people can understand it. To explain this problem in more detail, consider the following examples.

Example 1: Assume that a person writes a query in English by using the term "bank" to search the web for a financial institution. Example 2: Assume that a person writes a query in Arabic by using the term "علم" to search the web for a flag.

The web search engine may return results in Example 1 related to a financial institution, or it might search for the geographical meaning of the word bank, which can refer to a side of a river. In the second example the web search engine may return results related to a flag, or it might search for knowledge, or for a famous person or thing; this can occur since the word "علم" without vocalization can refer to any of the three mentioned meanings. This means that the retrieved results have no semantic relationships. In order to solve the ambiguity problem, we need to add an extra semantic layer to the current Web in order to enable machines to understand what a Web page is about.


Since the arrival of the Semantic Web, many Web applications were developed to take advantage of the capabilities provided by the Semantic Web technologies such as intelligent reasoning over data, semantic search, and data interoperability. However, most

Semantic Web technologies focus on processing Latin-family scripts, and only a few studies have applied Semantic Web technologies to develop Arabic language applications.

Query expansion (QE) is the process of reformulating the original query by evaluating the user's query words and then expanding the query words in order to match additional relevant documents.

The process of query expansion is considered a complex task, since a query expansion engine needs to analyze the search results before displaying and ranking them as relevant search documents. In general, query expansion (QE) works as follows:

 User enters a query to the search engine.

 Based on its underlying QE algorithm, the search engine tries to enhance the

original query by matching additional documents that might be important to the

user.

 The search engine displays the documents based on their relevancy; in other words,

more relevant documents will be displayed first followed by less relevant

documents.

Query expansion (QE) is used to improve retrieval performance in Information

Retrieval processes [107]. Query expansion is used to:


 Find the synonyms of the words and then search for the synonyms,

 Find morphological forms of words by using a stemmer for each word in the search query,

 Re-weight the terms in the original query.

Most of the time, users do not use the best terms in order to formulate search queries. Search engines use query expansion to increase the quality of user search results.

Expanding a query can be done by stemming the user’s original query terms, which leads to matching more documents.

Matching more documents can be a result of two factors. The first is by using alternate word forms for the original terms entered by the user. The second is by searching for synonyms of the user’s original query terms. The recall in many cases will be increased at the cost of precision.

There are problems in query expansion, such as deciding which terms are important enough to include and handling the precision and recall tradeoff. The precision and recall tradeoff concerns whether it is worthwhile to perform query expansion at all, since high recall in many cases causes a decrease in precision.

The current research is focused on applying the topic modeling approach using LSI and LDA in order to discover and classify the topics that occur in a collection of documents

(semantic representation of a specific corpus). Afterward, we will use the topic words that are generated by applying LSI and LDA for each topic to formulate the queries in order to


measure the accuracy in terms of recall and precision. Then, we will expand the query in order to explore and discover the hidden relations between the documents, and to enrich the retrieval process with the more organized, topic-driven data obtained from topic modeling. Finally, we will compare the accuracy of the results between the two approaches.

The experiments are applied over Arabic and English corpora.

This chapter explains the steps of the proposed approach for Arabic and English documents and shows the experimental results. Finally, a discussion is presented in a separate section.

6.2 Why use a combination of topic modeling and query expansion?

The usage of both methods can be attributed to several reasons. On the one hand, query expansion can enhance the recall but, in many cases, it may hurt the precision. On the other hand, topic modeling can help classify the corpus with a specific number of topics.

This classification should result in improving the accuracy of the system. To better explain this situation, consider the following example. Let us classify a general corpus into four specific topics: politics, economy, food, and sport, in order to formulate and narrow the query for each topic. In this situation, searching the political topic will be done by using the keywords extracted from the classified political topic. Thus, adopting such an approach may result in decreasing the search time and enhancing the overall accuracy.

Topic modeling can also provide a glossary that was originally extracted from the corpus based on the classified topics. For example, the query "apple laptop" in a corpus that is not classified according to topic modeling might show inaccurate results. These


results might show documents that contain the word apple related to a food or economy topic. However, applying topic modeling to the corpus should determine that "apple laptop" is more likely to be found in an economy or computer topic. The usage of query expansion can retrieve the related documents based on the expanded words such as computer, personal computer, apple air, apple mac, etc.

To test the possibility of using both methods, i.e., topic modeling and query expansion, this research applies topic modeling to two different types of corpora. The first type is an Arabic corpus and second one is an English corpus.

6.3 Semantic Search

The purpose of query expansion is to take advantage of the main features of a semantic search that make it a more appealing choice over the traditional keyword based techniques. The main features of a semantic search are handling generalizations, handling morphological variants, handling concept matches, and handling synonyms with the correct sense (Word Sense Disambiguation).

6.3.1 Handling Generalizations

The main goal of handling generalizations is to enable the system to analyze a user's query in order to provide the user with pages that contain material which is considered related or relevant to sub-concepts of the user's query. The following example, shown in Table 18, illustrates the handling generalizations feature with a query that contains the general term or concept "عنف" (violence) [76].


Table 18: Handling generalizations

User's Query in Arabic    Equivalent Query in English
اعمال عنف في افريقيا        "Violence in Africa"

Handling generalizations aims at allowing the system to analyze queries submitted by users to deliver relevant results to the user’s sub-concepts.

Semantic-based search engines should be able to recognize documents with similar concepts, such as "ابادة" (extermination), "قمع" (suppression), and "تعذيب" (torture), as relevant to the user's query. Using semantic search, if the user query is "اعمال عنف في افريقيا", the system should match the documents that describe the user query. The query should also be expanded to match documents that describe topics which include similar concepts, with similar semantic meaning, related to the original query, such as "تعذيب في افريقيا", "قمع في افريقيا", "ابادة في افريقيا", and "عنف في افريقيا".

In this case it will improve the original query by retrieving more relevant documents using the semantic meaning rather than exact matching. Handling generalization can be achieved by applying topic modeling techniques such as LSI and LDA.
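The following toy sketch shows one way such semantic matching can be realized with an LSI space in Gensim; the three miniature documents, the query, and the number of topics are illustrative assumptions only.

    # Project a query into an LSI space and rank a toy corpus by cosine
    # similarity, so conceptually related documents can match without
    # sharing exact keywords.
    from gensim import corpora, models, similarities

    docs = [["violence", "africa", "conflict"],
            ["torture", "africa", "prisoners"],
            ["football", "league", "results"]]
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)
    index = similarities.MatrixSimilarity(lsi[bow])

    query_bow = dictionary.doc2bow(["violence", "africa"])
    sims = index[lsi[query_bow]]                       # similarity of the query to every document
    print(sorted(enumerate(sims), key=lambda pair: -pair[1]))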

6.3.2 Handling Morphological Variations

Handling morphological variations is based on allowing the system to use words derived from the same root as words of the original query. To illustrate further the


mechanism of handling morphological variations, consider the following example as described in Table 19 [76].

Table 19: Handling Morphological Variations

User's Query in Arabic              Equivalent Query in English
"التطور في الشرق الأوسط"              "Development in the Middle East"

Documents that contain morphological variants of the word "التطور" (development), such as "تطور" and "تطوير", should also be considered relevant to the user's query. As shown in Table 19, if the user's original query is "التطور في الشرق الأوسط", a semantic search should handle morphological variations of the word by using a stemmer. In this case, the original query "التطور في الشرق الأوسط" should also retrieve different variants of the word "التطور" in the same context, such as "التطورات في الشرق الأوسط", "تطور في الشرق الأوسط", and "تطوير في الشرق الأوسط". This can be achieved by using the stemmer for both Arabic and English.

6.3.3 Handling Concept Matches

Handling concept matches is based on allowing the system to use concepts or named entities that the words in the original query can refer to. To illustrate further the mechanism of handling concept matches, consider the example described in Table 20

[76].


Table 20: Handling Concept Matches

User's Query in Arabic    Equivalent Query in English
"مصر"                      "Egypt"

The term "مصر" has other equivalent expressions, such as "جمهورية مصر العربية" and "أرض الكنانة", so documents that contain any of these expressions should be considered as relevant to the term "مصر". Also, the term "UK" has other equivalent expressions, such as "England", "United Kingdom", and "Great Britain", so documents that contain any of these expressions should be considered relevant to the term "UK".

6.3.4 Handling Synonyms with Correct Sense

In general, the meaning of Arabic words depends on their diacritics. The word "شعب" can have different meanings according to its diacritization. Consequently, systems should take this issue into consideration for expansion. For example, Table 21 shows different senses of the word "شعب" [76].

Table 21: Different senses for the word "شعب"

Arabic vowelized word    English equivalent    Arabic synonyms
شَعب                      People, Nation        مواطنين، أمم
شُعب                      Branches              فروع


In order to make query expansion able to solve the problems of handling generalizations, handling concept matches, and handling synonyms with the correct interpretation (word sense disambiguation), our system uses topic modeling with LSI and LDA, which rely on the semantic meaning. As for handling morphological variants, this problem is solved by applying the preprocessing steps first and then the stemmer.

6.4 Methodology

In this section, we summarize the steps of our approach and we give the details for each step.

1. We built the following two corpora for our system:

 The first corpus is for the system with Arabic Wikipedia documents, which is

called Arabic corpus.

 The second corpus is for the system with English Wikipedia documents, which

is called English corpus.

There are many preprocessing steps that are going to be applied to the above two corpora: stemming, normalization and handling composite terms (separating terms), as we mentioned in Chapter 5.

2. As a next step, we will classify and categorize the two corpora into a specific

number of topics (using topic modeling). We do topic modeling using the Gensim

tool.


Gensim is considered one of the most effective topic modeling toolkits. The main reason for using Gensim is its ability to handle large text collections using efficient online algorithms [93]. For more details about Gensim see Appendix A.

 The Arabic corpus is classified into five topics by using LSI, which in this study

is called LSI five topics Arabic corpus.

 The Arabic corpus is classified into ten topics by using LSI, which in this study

is called LSI ten topics Arabic corpus.

 The Arabic corpus is classified into fifteen topics by using LSI, which in this

study is called LSI fifteen topics Arabic corpus.

 The Arabic corpus is classified into five topics by using LDA, which in this

study is called LDA five topics Arabic corpus.

 The Arabic corpus is classified into ten topics by using LDA, which in this

study is called LDA ten topics Arabic corpus.

 The Arabic corpus is classified into fifteen topics by using LDA, which in this

study is called LDA fifteen topics Arabic corpus.

 The English corpus is classified into five topics by using LSI, which in this

study is called LSI five topics English corpus.

 The English corpus is classified into ten topics by using LSI, which in this study

is called LSI ten topics English corpus.

 The English corpus is classified into fifteen topics by using LSI, which in this

study is called LSI fifteen topics English corpus.


 The English corpus is classified into five topics by using LDA, which in this

study is called LDA five topics English corpus.

 The English corpus is classified into ten topics by using LDA, which in this

study is called LDA ten topics English corpus.

 The English corpus is classified into fifteen topics by using LDA, which in this

study is called LDA fifteen topics English corpus.

3. We compare the efficiency and the accuracy for the Arabic corpus and the English

corpus as follows:

 For LSI five topics Arabic corpus, we generate a query for each topic in this

category using the same keywords yielded in each topic. These keywords

which are used to form the query were identified as representative keywords

for the topic in question. This observation applies to all the next bulleted

descriptions.

 For LSI ten topics Arabic corpus, we generate a query for each topic in this

category using the same keywords yielded in each topic.

 For LSI fifteen topics Arabic corpus, we generate a query for each topic in

this category using the same keywords yielded in each topic.

 For LDA five topics Arabic corpus, we generate a query for each topic in

this category using the same keywords yielded in each topic.

 For LDA ten topics Arabic corpus, we generate a query for each topic in

this category using the same keywords yielded in each topic.


 For LDA fifteen topics Arabic corpus, we generate a query for each topic

in this category using the same keywords yielded in each topic.

Typically, the result of each query will retrieve the documents that are related to the specific topic (query). Here, the results are measured according to the time needed for producing the retrieved list of each query and according to the accuracy.

At the end of this step we have obtained two main contributions. The first contribution is to do topic modeling for a system with semantic representation. The importance of this contribution stems from the fact that classifying a huge corpus can simplify the process of searching for documents. For example, searching the corpus in question can be done more easily if the corpus has been classified into different sub-topics such as economic, political, and scientific topics. The second contribution is to formulate the query based on the same keywords yielded in each topic, as explained above.

4. We compare the efficiency and the accuracy for the yielded topics of the Arabic corpus and the English corpus as follows:

 For LSI five topics English corpus, we generate a query for each topic in

this category using query expansion.

 For LSI ten topics English corpus, we generate a query for each topic in this

category using query expansion.

 For LSI fifteen topics English corpus, we generate a query for each topic in

this category using query expansion.

 For LDA five topics English corpus, we generate a query for each topic in

this category using query expansion.


 For LDA ten topics English corpus, we generate a query for each topic in

this category using query expansion.

 For LDA fifteen topics English corpus, we generate a query for each topic

in this category using query expansion.

At the end of this step we have obtained three main contributions. The first contribution is to do topic modeling for a system with semantic representation, the second contribution is to formulate the query based on the same keywords yielded in each topic, and the third contribution is to expand the original query terms by using WordNet for both

Arabic and English corpora in order to handle synonyms. We will discuss these issues in more detail in the next section.

6.4.1 Query Expansion

Many resources such as ontologies have been used for different knowledge-based applications. Among the knowledge-based applications are natural language processing, intelligent information integration, Information Retrieval, etc. [21]. Ontologies have been used widely in the Semantic Web. The primary purpose of the Semantic Web is "annotation of the data on the Web with the use of ontologies in order to have machine-readable and machine-understandable Web that will enable computers, autonomous software agents, and humans to work and cooperate better through sharing knowledge and resources" [7].

Ontologies have many features such as denoting knowledge, relating multiple concepts to a specific domain, and establishing relationships among these concepts [57].

These features give ontologies the advantage to be used as a query expansion method.


The WordNet ontology is considered to be one of the ontologies that can be used in query expansion. The main function of this ontology is that it can classify different

English terms into groups of synonyms, which are called synsets. The WordNet ontology can define the relationships among these synonyms, and it can give suggested definitions

[80]. In other words, WordNet can function as both a thesaurus and a dictionary.

The latest version of the WordNet, version 3.1, contains 155,287 words. These words are structured in 117,659 synsets for a total of 206,941 word-sense pairs. The size of this database is roughly 12 megabytes in compressed form [80].

In general, WordNet contains the major lexical categories, such as adjectives, nouns, adverbs, and verbs. However, WordNet neglects other secondary lexical categories such as prepositions.

Deciding on the similarity between words is considered to be one of the popular uses of WordNet. Consequently, many algorithms have been introduced [17, 90]. They can measure the distance among the synsets and the words "in WordNet's graph structure, such as by counting the number of edges among synsets" [88]. For example, we can consider two words as synonyms if the distance between the two words equals 2 edges.
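A small sketch of this edge-counting idea, using NLTK's WordNet interface, is shown below; the exact distance threshold used to call two words synonyms is a design choice and not prescribed by WordNet itself.

    # Count the shortest number of edges between the synsets of two words in
    # WordNet; words whose synsets are close can be treated as near-synonyms.
    from nltk.corpus import wordnet as wn              # requires nltk.download("wordnet")

    def min_edge_distance(word1, word2):
        distances = []
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                d = s1.shortest_path_distance(s2)
                if d is not None:
                    distances.append(d)
        return min(distances) if distances else None

    print(min_edge_distance("teach", "instruct"))      # 0: the words share a synset
    print(min_edge_distance("car", "truck"))           # a small positive edge count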

Arabic WordNet is considered a lexical database for the Arabic language. The conception and methodology used for WordNet in English and European languages are also used for Arabic, i.e., Arabic WordNet is being constructed following methods developed for EuroWordNet.

Arabic WordNet structure is similar to a thesaurus; the organization of the Arabic

WordNet relies on the structure of synsets [30, 49, 50]. In other words, "sets of synonyms


and pointers describing relations to other synsets. Each word can belong to one or more synsets, and one or more categories of the discourse. These categories are organized in four classes: noun, verb, adjective, and adverb. Arabic WordNet is a lexical network whose nodes are synsets and relations between synsets are the arcs. It currently counts 11,269 synsets (7,960 nouns, 2,538 verbs, 661 adjectives, 110 adverbs) and 23,481 words" [2].

6.4.2 Our Work

To achieve our purpose, which is combining topic modeling (using LSI and LDA) and query expansion for both English and Arabic documents, our system contains many components (subsystems) that are used to represent the main points of evaluation, and then to summarize the results in the precision and recall measurements.

The system contains two corpora, one for Arabic Wikipedia documents and the other for English Wikipedia documents.

The system also contains two types of queries (Arabic and English). Figures 22, 23,

24 and 25 show the main components of our system.


Figure 22: System components for Arabic corpus using LSI topic modeling.

For the Arabic corpus, we first applied the preprocessing steps. As we discussed in Chapter 5, these steps include stop-word removal (e.g., "من", "عن", "في", "حتى", "حين"), removal of digits and punctuation marks {: , ; ? @ % * ! & $ # [ ] …}, removal of non-Arabic words, removal of all vowels {~, ُ, َ, ِ, …}, replacing "ا", "آ", "إ", "ؤ", "ئ", and "ء" with the letter "أ" (aleph), replacing "ة" with "ه", replacing "ى" with "ي", replacing the letter "ئ" with the letter "ء", replacing the letter "ؤ" with the letter "ء", and removing diacritics, including "~".


After that we apply the Light-10 stemmer; then we classify the Arabic corpus using LSI into a specific number of topics, each with a specific number of terms. In Figure 22 the arrow with number 1 refers to generating or formulating the query with the topic keyword terms to retrieve the relevant documents from the corpus. In other words, if we classify our corpus into ten topics (politics, economy, etc.), each with ten keyword terms, then we can generate 10 queries and each query contains 10 terms. Arrows with numbers 2 and 3 refer to expanding the query terms using the Arabic WordNet ontology.
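A possible sketch of this query-formulation step is shown below; here model stands for an already fitted Gensim LsiModel or LdaModel such as the ones built in Chapter 5.

    # Turn the generated topics into queries: the top ten keywords of each
    # topic become one ten-term query.
    def topics_to_queries(model, num_topics, topn=10):
        queries = []
        for topic_id in range(num_topics):
            keywords = [term for term, _weight in model.show_topic(topic_id, topn=topn)]
            queries.append(keywords)
        return queries

    # e.g. queries = topics_to_queries(lda_model, num_topics=10)
    # yields ten queries, each consisting of ten topic keywords.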

Figure 23: System components for Arabic corpus using LDA topic modeling.


The same steps discussed in the above paragraph are also applied for our second topic modeling technique, which is LDA.

Figure 24: System components for English corpus using LDA topic modeling.

For the English corpus, we first applied preprocessing steps. As we discussed in chapter 5, preprocessing steps include stop words removal, digits and punctuation marks removal, and non-English words removal. After that we apply the Paice/Husk stemmer, and then we classify the English corpus using LDA into a specific number of topics with a specific number of terms. In Figure 24 the arrow with number 1 refers to generating or


formulating the query with the topic terms to retrieve the relevant documents from the corpus based on the topic terms. In other words, if we classify our corpus into ten topics

(politics, economy, etc.), each with ten terms, then we can generate 10 queries and each query contains 10 terms. Arrows with numbers 2 and 3 refer to expanding the query terms using the WordNet ontology.

Figure 25: System components for English corpus using LSI topic modeling.

The same steps discussed in the above paragraph are also applied for our second topic modeling technique, which is LSI.


6.4.2.1 Stemming Subsystem

We applied stemming for each corpus as we discussed in Chapter 5, and we also applied stemming to the two types of queries (Arabic and English). For the English queries we applied the Paice/Husk stemmer, and for the Arabic queries we applied the Light-10 stemmer. Applying stemming can enhance the efficiency and the accuracy of the system: it adds a conceptual flavor by depending on stems instead of exact words, and so combines several words into one concept expressed by a stem. On the other hand, stemming can decrease the precision of the system while it increases its recall.

6.4.2.2 Query Expansion Subsystem

The query expansion subsystem expands the original query terms or keywords using three steps:

 Finding the set of synonyms (synset) for each keyword or term in the original

query,

 Selecting the best synonyms to use after finding the set of synonyms for each

keyword for query expansion,

 Expanding the query by using the pre-found best synonyms for the keywords

of the original query from step 2.

For example, searching the query "teaching foreign language" should return synonymous words such as learn, instruct, educate. The user may select instruct as the best


synonym for the term teach, then the query can be expanded to include the best synonyms for the original query.
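The three steps can be sketched as follows, using WordNet lemma names as candidate synonyms; here the "best synonym" is simply the first candidate, whereas in our system that choice is left to the user (or a ranking heuristic).

    # Step 1: collect WordNet synonyms for each query term; Steps 2-3: pick a
    # "best" synonym and append it to the query.
    from nltk.corpus import wordnet as wn

    def candidate_synonyms(term):
        names = {lemma.name().replace("_", " ")
                 for synset in wn.synsets(term)
                 for lemma in synset.lemmas()}
        names.discard(term)
        return sorted(names)

    def expand_query(query_terms):
        expanded = list(query_terms)
        for term in query_terms:
            candidates = candidate_synonyms(term)
            if candidates:
                expanded.append(candidates[0])         # stand-in for the user's choice
        return expanded

    print(expand_query(["teach", "foreign", "language"]))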

6.5 Experimental Results and Discussion

To evaluate the results of our system, we use the standard Information Retrieval

(IR) measurements recall and precision. The values of recall and precision are between 0 and 1 and are usually given in percentage. When the value of recall equals 0% that means that none of the relevant documents are retrieved, and when the value of recall equals 100% that means that all the relevant documents have been retrieved, although there could be retrieved documents that are not relevant. When the value of precision equals 0% means that all of the retrieved documents are irrelevant, and when the value of precision equals

100% that means all the retrieved documents are relevant, although there could be relevant documents that have not been retrieved [26].

As we mentioned earlier, there is a tradeoff between precision and recall. If the value of the recall is high, then the value of the precision usually is low, and if the value of the precision is high, then the value of the recall usually is low. Recall, precision, and F-measure are defined as follows:

Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

F-measure = (2 × Precision × Recall) / (Precision + Recall)

To measure the accuracy (recall, precision) of our system, we include the first 100 ranked retrieved documents for each query. We run 20 queries for each experiment and then we calculate the average recall and precision.
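The following sketch shows how these measures can be computed for one query; the document identifiers and relevance judgments in the example are hypothetical, and in the experiments the values are averaged over the 20 queries.

    # Compare the retrieved document ids against the judged relevant ids for
    # one query; in the experiments this is averaged over the 20 queries.
    def precision_recall_f(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f

    # Hypothetical judgments for a single query:
    print(precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 3, 5]))
    # -> (0.5, 0.666..., 0.571...)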

6.5.1 Experiment 1


Figure 26: Query one for Arabic corpus with LDA topic modeling.


After using the LDA topic modeling technique for the Arabic corpus, we use the 10 keywords already generated by the LDA to formulate the query. We chose to use 10 words because, after experimenting with 5, 10, 15, and 20 words, we found that a topic of 10 keywords is the most representative (fewer than 10 keywords produces more general topics, while more than 10 keywords gives more specific and restricted topics). This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.

6.5.2 Experiment 2


Figure 27: Query two for Arabic corpus with LDA topic modeling and query expansion.

After using the LDA topic modeling technique for the Arabic corpus, we use the

10 keywords already generated by the LDA to formulate the query and we use Arabic


WordNet ontology to expand the words that are used in the query. This step is applied to the three categories of 5, 10, and 15 topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15 topics. On the other hand, the average precision is slightly lower in the two cases of 5 and 15 topics.

6.5.3 Experiment 3


Figure 28: Query three for Arabic corpus with LSI topic modeling.

After using the LSI topic modeling technique for the Arabic corpus, we use the 10 keywords already generated by the LSI to formulate the query. This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.


6.5.4 Experiment 4


Figure 29: Query four for Arabic corpus with LSI topic modeling and query expansion.

After applying the LSI topic modeling technique for the Arabic corpus, we use the ten keywords already generated by the LSI to formulate the query, and we use Arabic

WordNet ontology to expand the keywords that are used in the query. This step is applied to the three categories of 5, 10, and 15-topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15-topics. On the other hand, the average precision is slightly less in only one case, which is the case of 10 topics.


6.5.5 Experiment 5


Figure 30: Query five for English corpus with LDA topic modeling.

After using the LDA topic modeling technique for the English corpus, we use the ten keywords already generated by the LDA to formulate the query. This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.


6.5.6 Experiment 6


Figure 31: Query six for English corpus with LDA topic modeling and query expansion.

After using the LDA topic modeling technique for the English corpus, we use the ten keywords already generated by the LDA to formulate the query, and we use WordNet ontology to expand the words that are used in the query. This step is applied to the three categories of 5, 10, and 15-topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15-topics. On the other hand, the average precision is slightly less in the two cases of 5 and 15-topics.


6.5.7 Experiment 7


Figure 32: Query seven for English corpus with LSI topic modeling.

After using the LSI topic modeling technique for the English corpus, we use the ten keywords already generated by the LSI to formulate the query. This step is applied to the three categories of 5, 10, and 15 topics. The highest average recall is achieved with the 10-topics category and the lowest is obtained with the 5-topics category. The highest average precision is achieved with the 10-topics category and the lowest is obtained with the 5-topics category.


6.5.8 Experiment 8


Figure 33: Query eight for English corpus with LSI topic modeling and query expansion.

After using the LSI topic modeling technique for the English corpus, we use the ten keywords already generated by the LSI to formulate the query, and we use WordNet ontology to expand the words that are used in the query. This step is applied to the three categories of 5, 10, and 15-topics. Applying the query expansion technique enhances the average recall for all the categories, i.e., 5, 10, and 15-topics. On the other hand, the average precision is slightly less in the two cases of 10 and 15-topics.

From the previous results, we find that LSI is more accurate for retrieving relevant documents in terms of recall and precision than LDA. In all experiments we find the highest average recall and the highest average precision with the category of 10 topics. In all experiments we find that the expanded Arabic query with Arabic WordNet enhances the average recall for all categories, while the average precision is slightly lower in some cases; overall, in terms of F-measure, the system is enhanced with query expansion. Likewise, the expanded English query with WordNet enhances the average recall for all the categories, and the average precision is lower in some cases, but overall, in terms of F-measure, the system is enhanced with query expansion. Arabic expansion using the Arabic WordNet ontology enhances the overall accuracy of the system, but the English WordNet ontology enhances the accuracy more. The reason for that, as we mentioned in Chapter 5, is probably the Arabic language challenges that we face in the Semantic Web.


CHAPTER 7

Conclusion and Future Work

The dissertation addresses a very common problem we face today in the information society, which is information overload. This problem became more serious because of the huge size of the World Wide Web (WWW) and the rapid development of the Internet that the world has witnessed in the last decade. With the expected continuous growth of the WWW (in size, languages, and formats), search engines will have a hard time maintaining the quality of retrieval results, and it is very difficult to analyze the huge amount of information which queries now return. To deal with this problem, we need a new vision for the Web that is able to make intelligent choices and extract a better meaning from the information on the Internet. This new version of the Web is called the Semantic Web.

The contribution of this study lies in comparing the accuracy of generating the topics between LSI and LDA for both English and Arabic documents. Most of the studies that have been conducted dealt with English corpora and a few have coped with Arabic.

The studies that have been done on Arabic examine small corpora, and the accuracy was low. Therefore, my study is an attempt to compare Wikipedia documents in Arabic and English, which are huge in terms of size. Also, topic modeling is enhanced using query expansion for both Arabic and English.


Furthermore, previous research has not applied the combination of topic modeling and query expansion to Arabic. Since there is a lack of research on the Arabic language in the Semantic Web field, this research involves applying the work to an Arabic language corpus. Also, my study has compared the accuracy (in terms of precision and recall) of topic modeling and query expansion in both Arabic and English. Even though topic modeling and query expansion performed better for English than for Arabic, they still yield a tangible improvement for Arabic.

The dissertation also addresses some advantages of the current Semantic Web technologies that can be applied for Arabic language to allow Arabic users to benefit from the Semantic Web.

This dissertation focuses on investigating approaches to discover the topics that occur in a collection of documents based on a topic modeling approach, and then to expand the query by exploring and discovering hidden relations between the documents in an efficient manner according to accuracy. More precisely, the research explores how the application of advanced Semantic Web tools to problems in Information Retrieval (IR) can be utilized to enhance the accuracy. The work is applied on Arabic and English documents.

The current study hypothesizes that classifying the corpus with meaningful descriptive information, and then expanding the query by using Automatic Query

Expansion, will improve the results of the IR methods for indexing and querying information. In particular, the work uses topic modeling techniques, which are considered advanced Information Retrieval methods that have been widely used for indexing and analyzing the corpus and then applies advanced Semantic Web techniques in order to


increase the accuracy. We can summarize the main contributions of this dissertation as follows:

 Investigate approaches to discover the abstract "topics" that occur in a collection of documents based on topic modeling approaches, and then expand the query to explore and discover hidden relations between the documents.

 Demonstrate that classifying a corpus with meaningful descriptive information using Latent Semantic Indexing can improve the results of IR methods applied to Arabic and English documents.

 Demonstrate that classifying a corpus with meaningful descriptive information using Latent Dirichlet Allocation can improve the results of IR methods applied to Arabic and English documents.

 Test whether Latent Dirichlet Allocation has higher accuracy than Latent Semantic Indexing for Arabic documents, as it does for English documents.

 Test whether Latent Semantic Indexing is faster to execute than Latent Dirichlet Allocation for Arabic documents, as it is for English documents.

 Combine topic modeling and query expansion for English documents.

 Combine topic modeling and query expansion for Arabic documents.

Due to the unavailability of free Arabic resources, we adopted Arabic Wikipedia and English Wikipedia as the data sets for our experiments. The results show that Latent Semantic Indexing and Latent Dirichlet Allocation can identify topics for the corpus with high accuracy for both Arabic and English documents. The experiments were conducted with 5, 10, and 15 topics using both LSI and LDA. The results show that LDA is more accurate in generating the topics than LSI for both the Arabic and English corpora, while LSI is faster in generating the topics than LDA for both corpora. Moreover, the topics generated for the English documents are more accurate than the topics generated for the Arabic documents under both techniques. To the best of my knowledge, there is no study that compares the accuracy of topic generation between LSI and LDA for both English and Arabic documents. The following tables show the results.

Table 14: Accuracy of topic generation for the English corpus

English corpus                                 Accuracy
LSI with 5 topics,  number of keywords = 10    82%
LSI with 10 topics, number of keywords = 10    84%
LSI with 15 topics, number of keywords = 10    79%
LDA with 5 topics,  number of keywords = 10    85%
LDA with 10 topics, number of keywords = 10    88%
LDA with 15 topics, number of keywords = 10    81.3%


Table 15: Accuracy of topic generation for the Arabic corpus

Arabic corpus                                  Accuracy
LSI with 5 topics,  number of keywords = 10    74.1%
LSI with 10 topics, number of keywords = 10    77%
LSI with 15 topics, number of keywords = 10    71.2%
LDA with 5 topics,  number of keywords = 10    76.3%
LDA with 10 topics, number of keywords = 10    79.8%
LDA with 15 topics, number of keywords = 10    73.4%

In our experiments, we find that the accuracy in classifying Arabic topics is lower than the accuracy in classifying English topics due, at least in part, to the following challenges:

 The Arabic language has a complex morphology.

 In Arabic, which is one of the Semitic languages, the root is an important element from which different words may be derived based on specific patterns or schemes.

 In Arabic, words may be vocalized with diacritics (short vowel and other marks), but in real life only the holy Qur'an and some formal documents include full vocalization.

 In Arabic, a word can be extended by attaching four kinds of affixes: antefixes, prefixes, suffixes, and postfixes.


We applied our experiments to five, ten, and fifteen topics. According to the experimental results for topic modeling and a manual investigation, the most accurate results are obtained when the number of topics is equal to ten.

Query expansion (QE) is the process of reformulating the original query by evaluating the user's query words and then expanding the query in order to match additional relevant documents. The process of query expansion is considered a complex task, since a query expansion engine needs to analyze the search results before displaying and ranking them as relevant search documents.

The purpose of query expansion is to support the main features of semantic search that make it a more appealing choice than traditional keyword-based techniques. The main features of semantic search are: handling generalizations, handling morphological variants, handling concept matches, and handling synonyms with the correct sense (Word Sense Disambiguation).

To achieve our purpose, which is combining topic modeling (using LSI and LDA) and query expansion for both English and Arabic documents, our system contains several components (subsystems) that carry out the main evaluation steps and then summarize the results using precision and recall measurements.

For the Arabic corpus, we first applied preprocessing steps, which include stop-word removal, removal of digits and punctuation marks, etc. After that we apply the light-10 stemmer; then we classify the Arabic corpus using LSI and LDA into a specific number of topics, each with a specific number of terms. A query is generated from the terms of each topic to retrieve the relevant documents from the corpus. In other words, if we classify our corpus into ten topics (politics, economy, etc.), each with ten terms, then we can generate 10 queries, each containing 10 terms. After that we expand the query terms using the Arabic WordNet ontology.

For the English corpus, we first applied the same preprocessing steps (stop-word removal, removal of digits and punctuation marks, etc.). After that we apply the Paice/Husk stemmer; then we classify the English corpus using LSI and LDA into a specific number of topics, each with a specific number of terms. A query is generated from the terms of each topic to retrieve the relevant documents from the corpus. In other words, if we classify our corpus into ten topics (politics, economy, etc.), each with ten terms, then we can generate 10 queries, each containing 10 terms. After that we expand the query terms using the WordNet ontology.
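To make these steps concrete, the following Python sketch (using Gensim and NLTK's WordNet interface) illustrates the flow from topic model to expanded queries. It is a minimal illustration under simplifying assumptions, not the exact system used in the experiments: the two-document corpus, the crude length-based token filter, and the use of English WordNet are stand-ins for the real preprocessing, the light-10/Paice-Husk stemmers, and the Arabic WordNet ontology.

    # Minimal sketch of the topic-modeling + query-expansion flow (illustrative only).
    # Assumes gensim and nltk are installed and the NLTK WordNet data has been downloaded.
    from gensim import corpora, models
    from nltk.corpus import wordnet as wn  # English WordNet; Arabic WordNet would be used for Arabic

    docs = [
        "the economy grew as markets and trade expanded",
        "the election results changed the political landscape",
    ]

    # Preprocessing: lower-case, tokenize, and drop very short tokens (a crude stop-word filter).
    texts = [[w for w in doc.lower().split() if len(w) > 3] for doc in docs]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    # Topic modeling with LSI (models.LdaModel would be the LDA counterpart).
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

    # One query per topic, built from the topic's top terms.
    queries = [[term for term, _ in lsi.show_topic(k, topn=10)] for k in range(lsi.num_topics)]

    # Query expansion: add WordNet synonyms of every query term.
    def expand(query_terms):
        expanded = set(query_terms)
        for term in query_terms:
            for synset in wn.synsets(term):
                expanded.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
        return sorted(expanded)

    for k, q in enumerate(queries):
        print("topic", k, "expanded query:", expand(q))

The expanded queries would then be run against the indexed corpus, and precision and recall computed against the relevance judgments.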

In all the experiments we find that expanding the Arabic query with Arabic WordNet enhances the average recall for all the categories, while the average precision is slightly lower in some cases. But overall, relative to the F-measure, the system is enhanced by the query expansion.

Similarly, in all experiments we find that expanding the English query with WordNet enhances the average recall for all the categories, while the average precision is slightly lower in some cases. Overall, relative to the F-measure, the system is enhanced by the query expansion. Expansion using the Arabic WordNet ontology enhanced the overall accuracy of the system, but the English WordNet ontology enhanced the accuracy more.

The work presented in this dissertation forms the basis for a number of research opportunities in topic modeling and query expansion, bridging the gap between Information Retrieval and the Semantic Web. We plan to extend our work; the main directions for future work are as follows.

 Since the majority of existing tools and applications for the Semantic Web do not support the Arabic language, there is much to be done for the Semantic Web model to support Arabic. The Arabic Web is a good target for building an Arabic ontology that has its own vocabulary of terms to represent the semantic core of documents, and for studying the effect of every semantic relationship used in this process, such as synonymy and polysemy.

 Another future step is to study several other corpora to measure the impact of this approach on different corpora and evaluate the results. In this research the corpus was obtained from Wikipedia documents, which form a broad collection. A similar approach can be applied to more specific corpora, such as newspapers or journals in the medical, political, or social fields.

The results of this research will be published as two separate papers. The first paper deals with comparing the accuracy of topic generation between LSI and LDA for both English and Arabic documents. The second deals with combining topic modeling with query expansion for Arabic, since this combination has not been researched for the Arabic language in the Semantic Web field; it will compare the results of this process with those for an English corpus.

We have seen that the combined approach of topic modeling and query expansion is effective in enhancing search accuracy for the English corpus. Likewise, my study has also shown its effectiveness when applied to the Arabic corpus. Therefore, I recommend that future research examine the efficiency of this combined approach of topic modeling and query expansion on other languages, for which my study can serve as a model.


Appendix A

1. LSI

LSI has been used widely because of its ability to infer topics in a well-organized manner and for its solid theoretical background. LSI finds co-occurrences among a wide range of terms, which makes it possible to project the documents into a low-dimensional space. Inference is done with linear-algebra operations, namely a truncated Singular Value Decomposition (SVD) of the sparse term-document matrix. Applying the SVD enables us to project the documents into a low-dimensional space through a folding-in process. Figure 34 shows the Latent Semantic Indexing steps [25].

Corpus → Term-Document matrix → Singular Value Decomposition → Vectors (Semantic Space)

Figure 34: LSI Steps

As shown in Figure 34, before applying LSI over the corpus to build the vectors (semantic space), several preprocessing steps must be applied. First, documents in the corpus are treated as bags of words, and each document is represented as a vector. Second, we perform a transformation to obtain a tf-idf representation for each document (bag of words); tf-idf stands for term frequency-inverse document frequency.


The main goal of tf-idf, which is a numerical statistic, is to show the importance of each word to a document in a corpus. Term frequency (tf) is measured by counting the number of times that term t occurs in document d. The inverse document frequency reflects how many documents contain the term [77]:

idf(t, D) = log ( N / |{d ∈ D : t ∈ d}| )

where:

 N refers to the total number of documents in the corpus, and

 |{d ∈ D : t ∈ d}| refers to the number of documents in which the term t appears (i.e., tf(t, d) ≠ 0). If the term is not in the corpus, this leads to a division by zero, so it is common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.

Then tf-idf is calculated as: tf-idf(t, d, D) = tf(t, d) × idf(t, D).

Example: Suppose we have term frequency tables for a collection consisting of only two documents [77].

Table 22: Document 1

Term     Term count
this     1
is       1
a        2
sample   1

Table 23: Document 2

Term      Term count
this      1
is        1
another   2
example   3

a. Calculate the tf-idf for the term "this" in document 1:
tf(this, d1) = 1, and idf(this, D) = log(2/2) = 0. Since the term occurs in all documents, its tf-idf is zero.

b. Calculate the tf-idf for the term "example" in document 2:
tf(example, d2) = 3, idf(example, D) = log(2/1) = 0.3010, so tf-idf(example, d2, D) = 3 × 0.3010 = 0.903.
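As a quick check of this worked example, the short Python sketch below reproduces the two tf-idf values; it is only an illustrative helper, with the toy documents transcribed from Tables 22 and 23 and the base-10 logarithm matching the numbers above.

    import math

    # Toy collection from Tables 22 and 23 (terms repeated according to their counts).
    docs = {
        "d1": ["this", "is", "a", "a", "sample"],
        "d2": ["this", "is", "another", "another", "example", "example", "example"],
    }

    def tf(term, doc_terms):
        # Raw term frequency: number of occurrences of the term in the document.
        return doc_terms.count(term)

    def idf(term, collection):
        # log10(N / df), where df is the number of documents containing the term.
        n_docs = len(collection)
        df = sum(1 for terms in collection.values() if term in terms)
        return math.log10(n_docs / df)

    def tf_idf(term, doc_id, collection):
        return tf(term, collection[doc_id]) * idf(term, collection)

    print(tf_idf("this", "d1", docs))     # 0.0, since "this" occurs in every document
    print(tf_idf("example", "d2", docs))  # ~0.903 = 3 * log10(2)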

Third, LSI is used to build a corresponding vector that contains a description of the document (LSI-Space). Finally, different similarity measures are used in order to find and compare the similarities between all documents. The similarity between documents is typically measured by the cosine between the corresponding vectors, which increases as more terms are shared.


Cosine similarity measures the similarity between two vectors of an inner product space by calculating the cosine of the angle between them. If the angle is 0°, the cosine is 1; the cosine is less than 1 for any other angle. "It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1]" [24].

A and B represent two vectors of attributes, and their cosine similarity is cos(θ). The dot product and the magnitudes of the vectors are used to compute the similarity:

cos(θ) = (A · B) / (||A|| ||B||)

The resulting value ranges from -1 to 1, where -1 means exactly opposite, 1 means exactly the same, and in-between values indicate intermediate similarity or dissimilarity.

When it comes to Information Retrieval, the cosine similarity of two documents ranges from 0 to 1, for the reason that the term frequencies (tf-idf weights) cannot be negative. As a result, the angle between two term frequency vectors cannot be greater than 90°.

Example: Suppose we have two texts, and Table 24 shows the words in the texts and the frequency of each word. How do we calculate the cosine similarity between the two texts, T1 and T2 [24]?

Table 24: Two texts

Text   Advance   Semantic   Web   Tools
T1     1         1          0     1
T2     2         0          1     1
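The question above can be answered in a few lines of code. The sketch below computes the cosine similarity between the frequency vectors of T1 and T2 taken from Table 24; it is an illustrative calculation (the result is about 0.71), not part of the dissertation's system.

    import math

    # Frequency vectors over the terms (Advance, Semantic, Web, Tools) from Table 24.
    t1 = [1, 1, 0, 1]
    t2 = [2, 0, 1, 1]

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))        # A . B
        norm_a = math.sqrt(sum(x * x for x in a))     # ||A||
        norm_b = math.sqrt(sum(y * y for y in b))     # ||B||
        return dot / (norm_a * norm_b)

    print(cosine_similarity(t1, t2))  # 3 / (sqrt(3) * sqrt(6)) = 0.707...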

In general, two documents are considered similar if their corresponding vectors in the VSM space point in the same (general) direction. LSI relies on a Singular Value Decomposition (SVD) of the co-occurrence matrix [25, 78]. SVD is a form of factor analysis and acts as a method for reducing the dimensionality of a feature space without serious loss of specificity. One of the most successful applications of SVD in Information Retrieval is the Google search engine (www.google.com).

Latent Semantic Indexing is so called because it can correlate terms that are semantically related; these relations are latent in a collection of text. LSI, also known as LSA, can reveal the hidden (latent) semantic structure in the way words are used in a body of text. It can also process queries issued by users to extract the meaning of a text, a process known as concept searching. Queries or concept searches that adopt LSI will return results that are similar in meaning to the search criteria even if they share no words with the search criteria.

Any matrix can be decomposed and then recomposed exactly using a number of dimensions equal to the smallest dimension of the original matrix; recomposing the matrix with fewer dimensions, however, reveals an interesting phenomenon. In SVD, a rectangular matrix X is decomposed into the product of three different matrices [5].

One component matrix (U) defines the original row entries as vectors of derived orthogonal factor values, the second (V) defines the original column entries in the same way, and the third (Σ) is a diagonal matrix containing scaling values such that when the three component matrices are multiplied, the original matrix is reconstructed (i.e., X = UΣV^T). The columns of V and U represent the right and left singular vectors, respectively, corresponding to the monotonically decreasing (in value) diagonal entries of Σ, which are called the singular values of the matrix X [5, 71].
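The decomposition and the low-rank reconstruction can be illustrated with NumPy as follows; the small random term-document matrix and the chosen rank k = 2 are assumptions made only for this sketch.

    import numpy as np

    # Toy term-document matrix X (rows = terms, columns = documents); values are illustrative.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(6, 4)).astype(float)

    # Full SVD: X = U * diag(s) * V^T, with the singular values s in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    assert np.allclose(X, U @ np.diag(s) @ Vt)

    # Rank-k reconstruction keeps only the k largest singular values; this is the
    # dimensionality-reduction step exploited by LSI.
    k = 2
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print("rank-%d approximation error: %.4f" % (k, np.linalg.norm(X - X_k)))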

2. LDA Model

In order to capture the dependencies among the various variables concisely, LDA employs plate notation. In plate notation, the boxes, also called plates, represent replication. Figure 12 in Chapter 3 shows that there are two plates.

Documents are represented in the outer plate, whereas the repeated choice of topics and words within a document is represented in the inner plate. M represents the number of documents, and N represents the number of words in a document [55]. Based on Figure 12 in Chapter 3:

 α: the parameter of the Dirichlet prior on the per-document topic distributions; it can be considered a corpus-level parameter.

 β: the parameter of the Dirichlet prior on the per-topic word distribution; it can also be considered a corpus-level parameter.

 θ_i: the topic distribution for document i.

 φ_k: the word distribution for topic k.

 z_ij: the topic for the j-th word in document i.

 w_ij: the specific word [55].

Note: The parameters α and β are considered to be the corpus level parameters, and the corpus level parameters are assumed to be sampled once in the process of generating a corpus. Finally, the variables z and w are considered to be word-level variables and the word-level variables are assumed to be sampled once for each word in each document [33].

The w_ij are the only observable variables (manifest variables). In statistics, observable variables are considered the opposite of latent variables.

Observable variables are those variables that can be observed and directly measured. The other variables in this model are known as latent variables (hidden variables). Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Latent variable models are mathematical models that aim to explain observed variables in terms


of latent variables. Latent variable models are used in many areas such as machine learning, artificial intelligence, and natural language processing [47].

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for a corpus. The basic idea of the generative process is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for a corpus D consisting of M documents, each of length N_i [33]:

1. Choose θ_i ~ Dir(α), where i ∈ {1, …, M} and Dir(α) is the Dirichlet distribution with parameter α.

2. Choose φ_k ~ Dir(β), where k ∈ {1, …, K}.

3. For each of the word positions (i, j), where i ∈ {1, …, M} and j ∈ {1, …, N_i}:

a) choose a topic z_{i,j} ~ Multinomial(θ_i), and

b) choose a word w_{i,j} ~ Multinomial(φ_{z_{i,j}}).
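To illustrate this generative process, the sketch below samples a tiny synthetic corpus with NumPy. The vocabulary, the numbers of topics, documents, and words, and the symmetric α and β values are all assumptions chosen for the example; it mirrors the three steps above and performs no inference.

    import numpy as np

    rng = np.random.default_rng(42)

    vocab = ["election", "vote", "market", "trade", "economy"]  # toy vocabulary (V = 5)
    K, M, N = 2, 3, 8            # topics, documents, and words per document (illustrative)
    alpha, beta = 0.5, 0.1       # symmetric Dirichlet priors

    # Per-topic word distributions phi_k ~ Dir(beta)  (step 2).
    phi = rng.dirichlet([beta] * len(vocab), size=K)

    for i in range(M):
        # Per-document topic distribution theta_i ~ Dir(alpha)  (step 1).
        theta = rng.dirichlet([alpha] * K)
        words = []
        for _ in range(N):
            z = rng.choice(K, p=theta)            # step 3a: choose a topic z_ij
            w = rng.choice(len(vocab), p=phi[z])  # step 3b: choose a word w_ij
            words.append(vocab[w])
        print("document", i, ":", " ".join(words))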


Figure 35: The LDA model.


Table 25: Definition of variables in the model [33]

Variable   Type                                                 Meaning
K          integer                                              number of topics (e.g., 50)
V          integer                                              number of words in the vocabulary (e.g., 50,000 or 1,000,000)
M          integer                                              number of documents
N_d        integer                                              number of words in document d
N          integer                                              total number of words in all documents (sum of all N_d values)
α_k        positive real                                        prior weight of topic k in a document; usually the same for all topics; normally a number less than 1 (e.g., 0.1) to prefer sparse topic distributions, i.e., few topics per document
α          K-dimensional vector of positive reals               collection of all α_k values, viewed as a single vector
β_w        positive real                                        prior weight of word w in a topic; usually the same for all words; normally a number much less than 1 (e.g., 0.001) to strongly prefer sparse word distributions, i.e., few words per topic
β          V-dimensional vector of positive reals               collection of all β_w values, viewed as a single vector
φ_{k,w}    probability (real number between 0 and 1)            probability of word w occurring in topic k
φ_k        V-dimensional vector of probabilities summing to 1   distribution of words in topic k
W          N-dimensional vector of integers between 1 and V     identity of all words in all documents
θ_{d,k}    probability (real number between 0 and 1)            probability of topic k occurring in document d for any given word
θ_d        K-dimensional vector of probabilities summing to 1   distribution of topics in document d
z_{d,w}    integer between 1 and K                              identity of the topic of word w in document d
w_{d,w}    integer between 1 and V                              identity of word w in document d
Z          N-dimensional vector of integers between 1 and K     identity of the topic of all words in all documents


It is important to distinguish between the LDA model and a simple Dirichlet-multinomial clustering model. The following points show the differences between the two models:

 A classical clustering model is a two-level model, whereas LDA is a three-level model.

 A classical clustering model restricts a document to being associated with a single topic, whereas in LDA documents can be associated with multiple topics.

In a classical clustering model, the Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words is selected for the document conditional on the cluster variable. For these reasons, a document can be associated with only a single topic. In the LDA model, on the other hand, the topic node is sampled repeatedly within the document, so documents can be associated with multiple topics [33].

3. Applications of AQE

The main application of AQE is to expand the query in order to improve document ranking. Beyond that, there are other applications and tasks for AQE. The following sections briefly discuss four of these areas.


3.1 Question Answering

Question Answering aims at providing precise or concise answers to specific questions such as "How many states are in the USA?" This application is similar to document ranking with AQE. However, Question Answering faces a major problem, which is the mismatch between question and answer vocabularies [41].

3.2 Multimedia Information Retrieval

The advent of digital media has increased searching for multimedia documents such as speeches, images, and videos. In general, multimedia IR systems depend on metadata (e.g., annotations, captions, and surrounding HTML/XML descriptions) to perform their searches. The absence of metadata forces IR systems to rely on multimedia content analysis combined with AQE methods.

3.3 Information Filtering

Information filtering (IF) refers to selecting the documents most relevant to the user from a set of documents. These documents arrive regularly, and the users' information needs also keep growing. Hanani et al. provide a detailed explanation of filtering application domains (e.g., electronic news, blogs, e-commerce, and e-mail) [59]. Information filtering involves two main approaches: collaborative IF and content-based IF. Collaborative IF depends on the preferences of users who have similar methods, i.e., similar ways of searching. According to Belkin and Croft, content-based IF resembles, to a great extent, IR systems, since user profiles can be adapted as queries and data streams can be modeled as collections of documents [19].

3.4 Cross-Language Information Retrieval

Cross-language information retrieval (CLIR) refers to the ability to retrieve documents in languages other than the language the user used in the query. According to Koehn, CLIR applies traditional approaches such as monolingual retrieval and query translation. The latter approach can be carried out by several methods: machine-readable bilingual dictionaries, parallel corpora, or machine translation [69].

4. A Classification of Approaches

According to their conceptual paradigm, AQE techniques can be classified into five main groups: Web data, search log analysis, query-specific statistical approaches, corpus-specific statistical approaches, and linguistic methods. Each group can be further divided into a few subclasses. Figure 36 shows the general taxonomy, including the five main groups and their subclasses.


Figure 36: A taxonomy of approaches to AQE [55].

5. Gensim

Gensim implements a wide variety of methods, including "tf–idf, random projections, deep learning with Google's word2vec algorithm (reimplemented and optimized in Cython), hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), including distributed parallel versions" [93]. Gensim has proved useful in many applications, both commercial and academic.
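As a brief illustration of how Gensim is typically used, the sketch below builds a dictionary and bag-of-words corpus from a tiny hand-made document list and trains a small LDA model, then prints the discovered topics. The documents and parameter values are assumptions for the example, not the settings used in the dissertation's experiments.

    from gensim import corpora, models

    # A tiny illustrative corpus; the actual experiments used Wikipedia dumps.
    texts = [
        ["economy", "market", "trade", "growth"],
        ["election", "government", "policy", "vote"],
        ["market", "trade", "policy", "economy"],
    ]

    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]

    # Train LDA; models.LsiModel would be the LSI counterpart.
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=10, random_state=1)

    for topic_id, topic in lda.show_topics(num_topics=2, num_words=4, formatted=True):
        print(topic_id, topic)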


Figure 37: Gensim

NumPy is an extension of the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions that operate on these arrays [93].

SciPy is an open-source Python library used by engineers, analysts, and scientists for scientific and technical computing; it provides modules for tasks such as linear algebra, optimization, and integration [93].

The Cython programming language is a superset of Python that can invoke C/C++ routines through a foreign function interface. It also allows declaring static types for subroutine parameters and results, class attributes, and local variables [93].


References

[1] Abdelali, Ahmed, James Cowie, David Farwell, Bill Ogden, and Stephen Helmreich.

"Cross-language information retrieval using ontology." Proceedings of TALN Batz-sur-

Mer (2003).

[2] Abderrahim, Mohammed Alaeddine, Mohammed El Amine Abderrahim, and

Mohammed Amine Chikh. "Using Arabic Wordnet for semantic indexation in information

retrieval system." arXiv preprint arXiv:1306.2499 (2013).

[3] Aghaei, Sareh, Mohammad Ali Nematbakhsh, and Hadi Khosravi Farsani. "Evolution

of the world wide web: from Web 1.0 to Web 4.0." International Journal of Web &

Semantic Technology 3, no. 1 (2012): 1-10.

[4] Al-Feel, Haytham, M. A. Koutb, and Hoda Suoror. "Toward An Agreement on

Semantic Web Architecture." Europe 49, no. 384,633,765 (2009): 806-810.

[5] Alhindawi, Nouh Talal. "Supporting Source Code Comprehension During Software

Evolution and Maintenance." PhD diss., Kent State University, 2013.

[6] Al-Khalifa, Hend, and Areej Al-Wabil. "The Arabic language and the semantic web:

Challenges and opportunities." In The 1st int. symposium on computer and Arabic

language. 2007.

[7] Alkhalifa, Musa, and Horacio Rodríguez. "Automatically Extending Named Entities

coverage of Arabic WordNet using Wikipedia." International Journal on Information and

Communication Technologies 3, no. 3 (2010).


[8] Alkhateeb, Faisal, A. Manasrah, and A. Bsoul. "Bank Web Sites Phishing Detection and Notification System Based on Semantic Web technologies." International Journal of

Security & Its Applications 6, no. 4 (2012): 53-66.

[9] Al-Shalabi, Riyad, Ghassan Kanaan, Mustafa Yaseen, Bashar Al-Sarayreh, and N. Al-

Naji. "Arabic query expansion using interactive word sense disambiguation." In

Proceedings of the Second International Conference on Arabic Language Resources and

Tools, Cairo, Egypt. 2009.

[10] Al-Zoghby, Aya M., Ahmed Sharaf Eldin Ahmed, and Taher T. Hamza. "Arabic

Semantic Web Applications–A Survey." Journal of Emerging Technologies in Web

Intelligence 5, no. 1 (2013): 52-69.

[11] Antai, Roseline, Chris Fox, and Udo Kruschwitz. "The Use of Latent Semantic

Indexing to Cluster Documents into Their Subject Areas." (2011).

[12] Antoniou, Grigoris, and Frank Van Harmelen. A semantic web primer. MIT press,

2012.

[13] Aranguren, Mikel Egaña, Erick Antezana, Martin Kuiper, and Robert Stevens.

"Ontology Design Patterns for bio-ontologies: a case study on the Cell Cycle Ontology."

BMC bioinformatics 9, no. Suppl 5 (2008): S1.

[14] Aswani Kumar, Ch, M. Radvansky, and J. Annapurna. "Analysis of a vector space model, latent semantic indexing and formal concept analysis for information retrieval."

Cybernetics and Information Technologies 12, no. 1 (2012): 34-48.

[15] Baeza-Yates, Ricardo, and William Bruce Frakes, eds. Information retrieval: data structures & algorithms. Prentice Hall, 1992.


[16] Bai, Jing, Jian-Yun Nie, Guihong Cao, and Hugues Bouchard. "Using query contexts in information retrieval." In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 15-22. ACM, 2007.

[17] Ballatore, Andrea, Michela Bertolotto, and David C. Wilson. "Linking geographic vocabularies through WordNet." Annals of GIS 20, no. 2 (2014): 73-84.

[18] Beckett, Dave, and Brian McBride. "RDF/XML syntax specification (revised)." W3C recommendation 10 (2004).

[19] Belkin, Nicholas J., and W. Bruce Croft. "Information filtering and information retrieval: Two sides of the same coin?." Communications of the ACM 35, no. 12 (1992):

29-38.

[20] Benton, Morgan, Eunhee Kim, and Benjamin Ngugi. "Bridging The Gap: From

Traditional Information Retrieval To The Semantic Web." AMCIS 2002 Proceedings

(2002): 198.

[21] Berners-Lee, Tim, and Mark Fischetti. "The original design and ultimate destiny of the world wide web by its inventor." Weaving the Web (1999).

[22] Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web." Scientific

American 284.5, 28-37, 2001.

[23] Beseiso, Majdi, Abdul Rahim Ahmad, and Roslan Ismail. "A Survey of Arabic language Support in Semantic web." International Journal of Computer Applications 9, no. 1 (2010): 35-40.

[24] Bhakkad, Ankit, S. C. Dharmadhikari, M. Emmanuel, and Parag Kulkarni. "E-VSM:

Novel text representation model to capture context-based closeness between two text


documents." In Intelligent Systems and Control (ISCO), 2013 7th International Conference on, pp. 345-348. IEEE, 2013.

[25] Binkley, David, and Dawn Lawrie. "Information retrieval applications in software maintenance and evolution." Encyclopedia of Software Engineering (2010): 454-463.

[26] Binkley, David, and Dawn Lawrie. "Information retrieval applications in software development." Encyclopedia of Software Engineering (2010): 231-242.

[27] Bishop, Christopher M., and Julia Lasserre. "Generative or discriminative? getting the best of both worlds." Bayesian statistics 8 (2007): 3-24.

[28] Bizer, Christian, Tom Heath, and Tim Berners-Lee. "Linked data-the story so far."

Semantic Services, Interoperability and Web Applications: Emerging Concepts (2009):

205-227.

[29] Bizer, Christian, Tom Heath, Kingsley Idehen, and Tim Berners-Lee. "Linked data on the web (LDOW2008)." In Proceedings of the 17th international conference on World

Wide Web, pp. 1265-1266. ACM, 2008.

[30] Black, William J., and Sabri ElKateb. "A prototype English-Arabic dictionary based on WordNet." In Proceedings of 2nd Global WordNet Conference, GWC2004, Czech

Republic, pp. 67-74. 2004.

[31] Black, William, Sabri Elkateb, Horacio Rodriguez, Musa Alkhalifa, Piek Vossen,

Adam Pease, and Christiane Fellbaum. "Introducing the Arabic WordNet project." In

Proceedings of the Third International WordNet Conference, pp. 295-300. 2006.

[32] Blei, David M. "Probabilistic topic models." Communications of the ACM 55, no. 4

(2012): 77-84.


[33] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." the Journal of machine Learning research 3 (2003): 993-1022.

[34] Bradford, Roger B. "An empirical study of required dimensionality for large-scale latent semantic indexing applications." In Proceedings of the 17th ACM conference on

Information and knowledge management, pp. 153-162. ACM, 2008.

[35] Brahmi, Abderrezak, Ahmed Ech-Cherif, and Abdelkader Benyettou. "An arabic lemma-based stemmer for latent topic modeling." Int. Arab J. Inf. Technol. 10, no. 2

(2013): 160-168.

[36] Broder, Andrei. "A taxonomy of web search." In ACM Sigir forum, vol. 36, no. 2, pp.

3-10. ACM, 2002.

[37] Buitelaar, Paul. "Human Language Technology for the Semantic Web." (2005).

[38] Cai, Deng, Xiaofei He, and Jiawei Han. "Training linear discriminant analysis in linear time." In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pp. 209-217. IEEE, 2008.

[39] Cake, Marcus. "Web 1.0, Web 2.0, Web 3.0 and Web 4.0 explained." (2008).

[40] Carmel, David, Eitan Farchi, Yael Petruschka, and Aya Soffer. "Automatic query refinement using lexical affinities with maximal information gain." In Proceedings of the

25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 283-290. ACM, 2002.

[41] Carpineto, Claudio, and Giovanni Romano. "A survey of automatic query expansion in information retrieval." ACM Computing Surveys (CSUR) 44, no. 1 (2012): 1.


[42] Carroll, Jeremy J. "An Introduction to the Semantic Web: Considerations for building multilingual Semantic Web sites and applications." Multilingual Computing 68, no. 15

(2005): 7.

[43] Chen, Edwin. "Introduction to latent Dirichlet allocation." Website (accessed May 2, 2012), http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation (2011).

[44] Contreras, Jesús, Oscar Corcho, and Asunción Gómez-Pérez. "Six Challenges for the

Semantic Web." (2009).

[45] Deerwester, Scott C., Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and

Richard A. Harshman. "Indexing by latent semantic analysis." JAsIs 41, no. 6 (1990): 391-

407.

[46] Deerwester, Scott. "Improving information retrieval with latent semantic indexing."

(1988).

[47] Dodge, Yadolah. The Oxford dictionary of statistical terms. Oxford University Press,

2006.

[48] El Kourdi, Mohamed, Amine Bensaid, and Tajje-eddine Rachidi. "Automatic Arabic document categorization based on the Naïve Bayes algorithm." In Proceedings of the

Workshop on Computational Approaches to Arabic Script-based Languages, pp. 51-58.

Association for Computational Linguistics, 2004.

[49] Elkateb, Sabri, William Black, Horacio Rodríguez, Musa Alkhalifa, Piek Vossen,

Adam Pease, and Christiane Fellbaum. "Building a wordnet for arabic." In Proceedings of

The fifth international conference on Language Resources and Evaluation (LREC 2006).

2006.


[50] Elkateb, Sabry, William Black, Piek Vossen, David Farwell, H. Rodríguez, A. Pease, and M. Alkhalifa. "Arabic WordNet and the challenges of Arabic." In Proceedings of

Arabic NLP/MT Conference, London, UK. 2006.

[51] Fallows, Deborah. The Internet and daily life. Washington, DC: Pew Internet &

American Life Project, 2004.

[52] Fuchs, Christian, Wolfgang Hofkirchner, Matthias Schafranek, Celina Raffl, Marisol

Sandoval, and Robert Bichler. "Theoretical foundations of the web: cognition, communication, and co-operation. Towards an understanding of Web 1.0, 2.0, 3.0." Future

Internet 2, no. 1 (2010): 41-59.

[53] Furnas, George W., Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais.

"The vocabulary problem in human-system communication." Communications of the ACM

30, no. 11 (1987): 964-971.

[54] Gerber, Aurona J., Andries Barnard, and Alta J. Van der Merwe. "Towards a semantic web layered architecture." (2007).

[55] Girolami, Mark, and Ata Kabán. "On an equivalence between PLSI and LDA." In

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pp. 433-434. ACM, 2003.

[56] Greenberg, Jane, Stuart Sutton, and D. Grant Campbell. "Metadata: A fundamental component of the Semantic Web." Bulletin of the American Society for Information Science and Technology 29, no. 4 (2003): 16-18.

[57] Guarino, Nicola, Daniel Oberle, and Steffen Staab. "What is an Ontology?." In

Handbook on ontologies, pp. 1-17. Springer Berlin Heidelberg, 2009.


[58] Hammo, Bassam H. "Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents." Information Retrieval 12, no. 3 (2009): 300-323.

[59] Hanani, Uri, Bracha Shapira, and Peretz Shoval. "Information filtering: Overview of issues, research and systems." User Modeling and User-Adapted Interaction 11, no. 3

(2001): 203-259.

[60] Hassanzadeh, Oktie. "Introduction to Semantic Web Technologies & Linked Data."

University of Toronto (2011).

[61] Hatem, M., D. Neagu, and H. Ramadan. "Towards personalization and a Unique

Uniform Resource Identifier for Semantic Web Users with in an Academic Environment."

The Journal of Instructional Technology and Distance learning 3, no. 6 (2006).

[62] Heindl, Eduard, and Norasak Suphakorntanakit. "Web 3.0." (2008).

[63] Hilbert, Martin, and Priscila López. "The world’s technological capacity to store, communicate, and compute information." science 332, no. 6025 (2011): 60-65.

[64] Hofmann, Thomas. "Probabilistic latent semantic indexing." In Proceedings of the

22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57. ACM, 1999.

[65] Huiping, Jiang. "Information Retrieval and the semantic web." In Educational and

Information Technology (ICEIT), 2010 International Conference on, vol. 3, pp. V3-461.

IEEE, 2010.

[66] Kadri, Youssef, and Jian-Yun Nie. "Effective stemming for Arabic information retrieval." In The Challenge of Arabic for NLP/MT, Intl Conf. at the BCS, pp. 68-74. 2006.


[67] Kamel Boulos, Maged N., and Steve Wheeler. "The emerging Web 2.0 social software: an enabling suite of sociable technologies in health and health care education1."

Health Information & Libraries Journal 24, no. 1 (2007): 2-23.

[68] Karypis, George, and Eui-Hong Sam Han. "Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval." In Proceedings of the ninth international conference on Information and knowledge management, pp. 12-19.

ACM, 2000.

[69] Koehn, Philipp. Statistical machine translation. Cambridge University Press, 2009.

[70] Koivunen, Marja-Riitta, and Eric Miller. "W3c semantic web activity." Semantic Web

Kick-Off in Finland (2001): 27-44.

[71] Landauer, Thomas K., Darrell Laham, and Peter Foltz. "Learning human-like knowledge by singular value decomposition: A progress report." In Advances in Neural Information Processing

Systems 10: Proceedings of the 1997 Conference, vol. 10, p. 45. MIT Press, 1998.

[72] Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. "An introduction to latent semantic analysis." Discourse processes 25, no. 2-3 (1998): 259-284.

[73] Larkey, Leah S., Lisa Ballesteros, and Margaret E. Connell. "Improving stemming for

Arabic information retrieval: light stemming and co-occurrence analysis." In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 275-282. ACM, 2002.

[74] Lau, Tessa, and Eric Horvitz. Patterns of search: analyzing and modeling web query refinement. Springer Vienna, 1999.


[75] Lavrenko, Victor, and W. Bruce Croft. "Relevance based language models." In

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 120-127. ACM, 2001.

[76] Mahgoub, Ashraf Y., Mohsen A. Rashwan, Hazem Raafat, Mohamed A. Zahran, and

Magda B. Fayek. "Semantic Query Expansion for Arabic Information Retrieval." In

EMNLP: The Arabic Natural Language Processing Workshop, Conference on Empirical

Methods in Natural Language Processing, Doha, Qatar, pp. 87-92. 2014.

[77] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Vol. 1. Cambridge: Cambridge university press, 2008.

[78] Marcus, Andrian, Andrey Sergeyev, Vaclav Rajlich, and Jonathan Maletic. "An information retrieval approach to concept location in source code." In Reverse

Engineering, 2004. Proceedings. 11th Working Conference on, pp. 214-223. IEEE, 2004.

[79] McGuinness, Deborah L., and Frank Van Harmelen. "OWL web ontology language overview." W3C recommendation 10, no. 10 (2004): 2004.

[80] Miller, George A. "WordNet: a lexical database for English." Communications of the

ACM 38, no. 11 (1995): 39-41.

[81] Mitra, Mandar, Amit Singhal, and Chris Buckley. "Improving automatic query expansion." In Proceedings of the 21st annual international ACM SIGIR conference on

Research and development in information retrieval, pp. 206-214. ACM, 1998.

[82] Moral, Cristian, Angélica de Antonio, Ricardo Imbert, and Jaime Ramírez. "A Survey of Stemming Algorithms in Information Retrieval." Information Research: An

International Electronic Journal 19, no. 1 (2014): n1.


[83] Moukdad, Haidar. "Stemming and root-based approaches to the retrieval of Arabic documents on the Web." Webology 3, no. 1 (2006).

[84] Murugesan, San. "Understanding Web 2.0." IT professional 9, no. 4 (2007): 34-41.

[85] Palmer, Sean B. "The semantic web: An introduction, 2001." (2009).

[86] Paolillo, John, and Daniel Pimienta. "Measuring linguistic diversity on the Internet."

(2005): 92-104.

[87] Papadimitriou, Christos H., Hisao Tamaki, Prabhakar Raghavan, and Santosh

Vempala. "Latent semantic indexing: A probabilistic analysis." In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pp. 159-168. ACM, 1998.

[88] Pedersen, Ted, Siddharth Patwardhan, and Jason Michelizzi. "WordNet:: Similarity: measuring the relatedness of concepts." In Demonstration papers at hlt-naacl 2004, pp.

38-41. Association for Computational Linguistics, 2004.

[89] Pérez, Jorge, Marcelo Arenas, and Claudio Gutierrez. "nSPARQL: A navigational language for RDF." Web Semantics: Science, Services and Agents on the World Wide Web

8, no. 4 (2010): 255-270.

[90] Pilehvar, Mohammad Taher, David Jurgens, and Roberto Navigli. "Align,

Disambiguate and Walk: A Unified Approach for Measuring ." In ACL

(1), pp. 1341-1351. 2013.

[91] Presutti, Valentina, and Aldo Gangemi. "Content ontology design patterns as practical building blocks for web ontologies." In Conceptual Modeling-ER 2008, pp. 128-141.

Springer Berlin Heidelberg, 2008.


[92] Prud’hommeaux, Eric, and Andy Seaborne. "SPARQL Query Language for RDF.

W3C Recommendation, January 2008." (2008).

[93] Řehůřek, Radim, and Petr Sojka. "Software framework for topic modelling with large corpora." (2010).

[94] Řehůřek, Radim. "Subspace tracking for latent semantic analysis." In Advances in

Information Retrieval, pp. 289-300. Springer Berlin Heidelberg, 2011.

[95] Rosario, Barbara. "Latent semantic indexing: An overview." Techn. rep. INFOSYS

240 (2000).

[96] Saeed, Abdullah. The Qur'an: an introduction. Routledge, 2008.

[97] Saleh, Layan M. Bin, and Hend S. Al-Khalifa. "AraTation: an Arabic semantic annotation tool." In Proceedings of the 11th International Conference on Information

Integration and Web-based Applications & Services, pp. 447-451. ACM, 2009.

[98] Salton, Gerard, and Michael J. McGill. "Introduction to modern information retrieval."

(1983).

[99] Sanderson, Mark, and W. Bruce Croft. "The history of information retrieval research."

Proceedings of the IEEE 100, no. Special Centennial Issue (2012): 1444-1451.

[100] Sawaf, Hassan, Jörg Zaplo, and Hermann Ney. "Statistical classification methods for

Arabic news articles." Natural Language Processing in ACL2001, Toulouse, France

(2001).

[101] Sontag, David, and Daniel M. Roy. "Complexity of inference in topic models." In

Advances in Neural Information Processing: Workshop on Applications for Topic Models:

Text and Beyond. 2009.


[102] Spivack, N. "Web 3.0: The third generation web is coming." Lifeboat Foundation

(2013).

[103] Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of latent semantic analysis 427, no. 7 (2007): 424-440.

[104] Tuerlinckx, Laurence. "La lemmatisation de l'arabe non classique." 7e JADT (2004).

[105] Van Harmelen, Frank. "The semantic web: What, why, how, and when." Distributed

Systems Online, IEEE 5, no. 3 (2004).

[106] Vechtomova, Olga, and Murat Karamuftuoglu. "Elicitation and use of relevance feedback information." Information processing & management 42, no. 1 (2006): 191-206.

[107] Vechtomova, Olga, and Ying Wang. "A study of the effect of term proximity on query expansion." Journal of Information Science 32, no. 4 (2006): 324-333.

[108] Wang, Quan, Jun Xu, Hang Li, and Nick Craswell. "Regularized latent semantic indexing: A new approach to large-scale topic modeling." ACM Transactions on

Information Systems (TOIS) 31, no. 1 (2013): 5.

[109] Weber, George. "Top languages." Lang Mon 3 (1997): 12-18.

[110] Wei, Xing, and W. Bruce Croft. "LDA-based document models for ad-hoc retrieval."

In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 178-185. ACM, 2006.

[111] Yu, Liyang. A developer’s guide to the semantic Web. Springer Science & Business

Media, 2011.


[112] Zhai, Chengxiang, and John Lafferty. "Model-based feedback in the language modeling approach to information retrieval." In Proceedings of the tenth international conference on Information and knowledge management, pp. 403-410. ACM, 2001.

[113] Zrigui, Mounir, Rami Ayadi, Mourad Mars, and Mohsen Maraoui. "Arabic text classification framework based on latent dirichlet allocation." CIT. Journal of Computing and Information Technology 20, no. 2 (2012): 125-140.
