AMERICAN UNIVERSITY OF BEIRUT

DBpediaSearch: An Effective Search Engine for DBpedia

by
HIBA KHALED ARNAOUT

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science to the Department of Computer Science of the Faculty of Arts and Sciences at the American University of Beirut

Beirut, Lebanon
February 2017

AMERICAN UNIVERSITY OF BEIRUT
THESIS, DISSERTATION, PROJECT RELEASE FORM

Student Name: Last, First, Middle
( ) Master's Thesis   ( ) Master's Project   ( ) Doctoral Dissertation

( ) I authorize the American University of Beirut to: (a) reproduce hard or electronic copies of my thesis, dissertation, or project; (b) include such copies in the archives and digital repositories of the University; and (c) make freely available such copies to third parties for research or educational purposes.

( ) I authorize the American University of Beirut to: (a) reproduce hard or electronic copies of it; (b) include such copies in the archives and digital repositories of the University; and (c) make freely available such copies to third parties for research or educational purposes after:
( ) One year from the date of submission of my thesis, dissertation, or project.
( ) Two years from the date of submission of my thesis, dissertation, or project.
( ) Three years from the date of submission of my thesis, dissertation, or project.

Signature                Date

This form is signed when submitting the thesis, dissertation, or project to the University Libraries.

Acknowledgements

Firstly, I would like to thank my thesis advisor, Prof. Shady Elbassuoni, for supporting me throughout the course of this experience. It all started with an Information Retrieval course, during which I began my research journey with Dr. Elbassuoni. His patience, motivation, and positivity made everything look easier. I learned a lot from his vast experience in two domains, namely Information Retrieval and Machine Learning. I would also like to thank him for teaching me how to write a proper research paper. I would like to thank my committee members, Prof. Mohamad Jaber and Prof. Wassim ElHajj, for their comments and suggestions. Additionally, I am very thankful for Dr. ElHajj's constant support since day one. I am grateful to the American University of Beirut's research board (URB) for funding my research. Finally, a special thanks to my family. Words cannot express how grateful I am to my mother, who encouraged me. My appreciation goes to my father for his moral and financial support, and for his interest in having a look at the results of my thesis.

An Abstract of the Thesis of

Hiba Arnaout for Master of Science
Major: Computer Science
Title: DBpediaSearch: An Effective Search Engine for DBpedia

The active progress of knowledge-sharing communities like Wikipedia has made it achievable to build large knowledge-bases, such as DBpedia. These knowledge-bases use the Resource Description Framework (RDF) as a flexible data model for representing information on the Web. The semantic query language for RDF is known as SPARQL. Although querying with SPARQL gives very specific results, the user's knowledge of the underlying data is a must. In this thesis, we propose an effective search engine for DBpedia. The search system takes SPARQL queries augmented with keywords as input and returns the most relevant results as output. To be able to do this, we develop a novel ranking model based on statistical machine translation for both triple-pattern queries and keyword-augmented triple-pattern queries.
Our system supports automatic query relaxation in case no results are found. Our ranking model also takes into consideration result diversity in order to ensure that the user is provided with a wide range of aspects of the query results. We develop a diversity-aware evaluation metric based on the Discounted Cumulative Gain to evaluate diversified result sets. Finally, we build an evaluation benchmark on DBpedia, which we use to evaluate the effectiveness of our search engine.

Contents

Acknowledgements
Abstract
1 Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Contributions
2 Indexing and Query Processing
  2.1 Knowledge Graph: DBpedia
  2.2 Triple Patterns
  2.3 Keyword-augmented Triple Patterns
  2.4 Query Relaxation
3 Ranking
  3.1 Ranking Model
    3.1.1 Result Ranking
    3.1.2 Triple Weight
    3.1.3 Keyword Augmentation
  3.2 Relaxation Model
  3.3 Related Work
    3.3.1 Unstructured Queries on Unstructured Data
    3.3.2 Unstructured Queries on Structured Data
    3.3.3 Structured Queries on Structured Data
    3.3.4 Query Relaxation
  3.4 Experimental Evaluation
    3.4.1 Setup
    3.4.2 Results
4 Diversity
  4.1 Diversity Problem
  4.2 Maximal Marginal Relevance
  4.3 Diversity Notions in RDF Setting
    4.3.1 Resource-based Diversity
    4.3.2 Term-based Diversity
    4.3.3 Text-based Diversity
  4.4 Diversity-aware Re-ranking Algorithm
  4.5 Diversity-aware Evaluation Metric
    4.5.1 Resource-based Novelty
    4.5.2 Text-based Novelty
  4.6 Related Work
  4.7 Experimental Evaluation
    4.7.1 Evaluation Using Diversity-aware Metric
    4.7.2 Evaluation Using Crowdsourcing
5 Efficiency
  5.1 Database Statistics
  5.2 Running Time Statistics
  5.3 Running Time Challenges
A Guidelines: Choosing the Better Result Set
B Guidelines: Assessing a Subgraph
  B.1 Purely Structured Queries Guidelines
  B.2 Keyword-augmented Queries Guidelines
  B.3 Requiring Relaxation Queries Guidelines
  B.4 Keyword-augmented and Requiring Relaxation Queries Guidelines
  B.5 Gold Queries
C Abbreviations

List of Figures

1.1 An example RDF graph on movies
2.1 Facts table in PostgreSQL
2.2 Sample answer of a SPARQL query in PostgreSQL
2.3 Witness counts associated with DBpedia triples
2.4 Triples with keywords and keyword weights
3.1 A subgraph of the "WikiLinks" graph
4.1 A subgraph and its description
5.1 Percentage of time needed per step
A.1 Which set is better?
A.2 One query and two sets of answers
B.1 Purely structured queries guidelines
B.2 Sample assessment question (purely structured)
B.3 Keyword-augmented queries guidelines
B.4 Sample assessment question (keyword-augmented)
B.5 Requiring relaxation queries guidelines
B.6 Sample assessment question (requiring relaxation)
B.7 Keyword-augmented and requiring relaxation queries guidelines
B.8 Sample assessment question (keyword-augmented and requiring relaxation)

List of Tables

1.1 The set of RDF triples for the RDF graph in Figure 1.1
1.2 Un-ranked result set for: director/star of the same show
1.3 Non-diversified result set for: director couples
2.1 List of relaxed queries for movies starred by Allen and directed by Madonna
2.2 List of possible subgraphs for queries in Table 2.1
3.1 Top-10 un-ranked results: producer/director of the same show
3.2 Top-10 ranked results: producer/director of the same show
3.3 Top-10 ranked results: producer/director of the same show, with the keyword "Chaplin"
3.4 Top-10 ranked results: people who graduated from Oxford and were born in Agatha Christie's death place
3.5 Top-10 ranked results of: people who worked with Frank Sinatra
3.6 Results for 100 evaluation queries: ranking algorithm
3.7 Ours vs. No-rank explanation samples
3.8 Ours vs. Indegrees explanation samples
3.9 Ours vs. LMRQ explanation samples
4.1 Top-10 ranked results of: Democratic politicians who had roles in military events or crises
4.2 Average DIV-NDCG10 of the training queries for different values of λ
4.3 Average DIV-NDCG10 of the test queries
4.4 Top-5 results: director-actor couple query with and without diversity
4.5 Top-5 results for the Disney query with and without diversity
4.6 Top-5 results for the Columbia Pictures query with and without diversity
4.7 Results for 100 evaluation queries: diversity algorithm
4.8 No Diversity vs. Resource-based Diversity
4.9 No Diversity vs. Term-based Diversity
4.10 No Diversity vs. Text-based Diversity
5.1 Database information
5.2 Running time per query category
5.3 Running time per diversity notion

Chapter 1

Introduction

1.1 Motivation

The continuous growth of knowledge-sharing communities like Wikipedia and the active progress in modern information extraction technologies have made it achievable to build large knowledge-bases (KBs) such as DBpedia [1], Freebase [2], and YAGO [3]. In addition, some domain-specific knowledge-bases, such as DBLife [4] and Libra [5], have also been constructed. These knowledge-bases use the Resource Description Framework (RDF) [6] as a flexible data model for representing the extracted information and for publishing such information on the Web. The RDF data model is based on representing binary facts as triples of the form Subject-Predicate-Object, where Subjects are URIs for resources, Predicates are URIs for properties, and Objects are URIs for resources or literals. A certain fact can be represented in a table store as a triple, or in a graph as two nodes with one edge relating them. As an example, the statement "The sky has the color blue" can be converted into the triple form <Subject: Sky, Predicate: hasColor, Object: Blue>.
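For illustration, the same fact can be written down in RDF's Turtle syntax and then retrieved with a single triple-pattern SPARQL query. This is only a minimal sketch: the ex: namespace and the resource names below are made up for the example and are not DBpedia URIs.

# The fact <Sky, hasColor, Blue> as one RDF triple, in Turtle syntax
# (ex: is a hypothetical example namespace, not a DBpedia one).
@prefix ex: <http://example.org/> .
ex:Sky ex:hasColor ex:Blue .

# A SPARQL query with a single triple pattern over the same data:
# the variable ?color stands in the Object position and is bound to
# ex:Blue when the query is evaluated.
PREFIX ex: <http://example.org/>
SELECT ?color
WHERE { ex:Sky ex:hasColor ?color . }

Triple patterns of this kind, with variables in the Subject or Object position, are the building blocks of the structured and keyword-augmented queries discussed in the following chapters.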