An Entity Retrieval Model
Total Page:16
File Type:pdf, Size:1020Kb
Searching Web Data: an Entity Retrieval Model Renaud Delbru Supervisor: Dr. Giovanni Tummarello Internal Examiner: Prof. Stefan Decker External Examiner: Dr. J´er^omeEuzenat External Examiner: Dr. Fabrizio Silvestri Dissertation submitted in pursuance of the degree of Doctor of Philosophy Digital Enterprise Research Institute Galway National University of Ireland, Galway / Ollscoil na hEireann,´ Gaillimh December 14, 2011 Abstract More and more (semi) structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large scale systems providing effective means of searching and retrieving this semi-structured information with the ultimate goal of making it exploitable by humans and machines alike. This dissertation examines the shift from the traditional web doc- ument model to a web data object (entity) model and studies the challenges and issues faced in implementing a scalable and high per- formance system for searching semi-structured data objects on a large heterogeneous and decentralised infrastructure. Towards this goal, we define an entity retrieval model, develop novel methodologies for sup- porting this model, and design a web-scale retrieval system around this model. In particular, this dissertation focuses on the following four main aspects of the system: reasoning, ranking, indexing and querying. We introduce a distributed reasoning framework which is tolerant against low data quality. We present a link analysis approach for computing the popularity score of data objects among decentralised data sources. We propose an indexing methodology for semi-structured data which offers a good compromise between query expressiveness, query processing and index maintenance compared to other approaches. Finally, we develop an index compression technique which increase both the update and query throughput of the system. The resulting system can index billions of data objects and provides keyword-based as well as more advanced search interfaces for retrieving the most relevant data objects. This work has been part of the Sindice search engine project at the Digital Enterprise Research Institute (DERI), NUI Galway. The Sindice system currently maintains more than 100 million pages downloaded from the Web and is being used actively by many researchers within and outside of DERI. The reasoning, ranking, indexing and querying compo- nents of the Sindice search engine is a direct result of this dissertation research. iv Declaration I declare that this thesis is composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. Renaud Delbru v Contents I. Prelude 1 1. Introduction 3 1.1. On Web Search and Retrieval . .5 1.2. Challenges in Information Retrieval for Web Data . .8 1.3. Scope of Research . .8 1.4. What the Thesis is not about . .9 1.5. Outline of the Thesis . 10 1.6. Contribution . 12 II. Foundations 13 2. An Entity Retrieval Model for Web Data 15 2.1. Management of Decentralised and Heterogeneous Information Sources . 15 2.2. Abstract Model . 20 2.3. Formal Model . 25 2.4. Search Model . 28 2.5. Conclusion . 36 3. Background 39 3.1. Reasoning over Web Data . 39 3.2. Link Analysis for Ranking Web Resources . 44 3.3. Indexing and Searching Semi-Structured Data . 46 3.4. Inverted Indexes and Inverted Lists . 50 vii Contents III.Methods 59 4. Context-Dependent Reasoning for Web Data 61 4.1. Introduction . 61 4.2. Context-Dependent Reasoning of RDF Models . 62 4.3. Context-Dependent Ontology Base . 67 4.4. Implementation . 71 4.5. Performance Evaluation . 72 4.6. Discussion and Future Works . 77 4.7. Conclusions . 78 5. Link Analysis over Web Data 79 5.1. Introduction . 79 5.2. A Two-Layer Model for Ranking Web Data . 80 5.3. The DING Model . 83 5.4. Scalability of the DING approach . 88 5.5. Experiments and Results . 89 5.6. Conclusion and Future Work . 96 6. Semi-Structured Indexing for Web Data 99 6.1. Introduction . 99 6.2. Node-Labelled Tree Model for RDF . 100 6.3. Query Model . 101 6.4. Implementing the Model . 104 6.5. Comparison among Entity Retrieval Systems . 111 6.6. Experiments . 115 6.7. Discussion . 121 6.8. Conclusion and Future Work . 121 7. Adaptive Frame Of Reference for Compressing Inverted Lists 123 7.1. Introduction . 123 7.2. Adaptive Frame of Reference . 124 7.3. Experiments . 130 7.4. Discussion and Future Work . 143 7.5. Conclusion . 144 viii Contents IV.System 145 8. The Sindice Search Engine: Putting the Pieces Together 147 8.1. Sindice Overview . 147 8.2. Data Acquisition . 148 8.3. Data Preprocessing and Denormalisation . 151 8.4. Distributed Entity Storage and Indexing . 154 8.5. Conclusions . 155 V. Conclusions 157 9. Conclusions 159 9.1. Summary of the Thesis . 159 9.2. Directions for Future Research . 161 VI.Appendices 163 A. Questionnaires of the User Study 165 B. Query Sets of the Index Evaluation 169 B.1. Real-World's Query Set . 169 B.2. Barton's Query Set . 170 B.3. 10 Billion's Query Set . 171 C. Results of the Compression Benchmark 173 Bibliography 181 List of figures 197 List of tables 201 ix Part I. Prelude Chapter 1. Introduction On the Web, a growing number of data management scenarios have to deal with hetero- geneous and inter-linked data sources. More and more structured and semi-structured data sources are becoming available. Hundreds of millions of documents embedding semi-structured data, e.g., HTML tables, Microformats, RDFa, are already published and publicly accessible. We observe a rapidly growing amount of online Semantic Web data repositories, e.g., the Linked Data community1. Also, we are expecting that millions of databases supporting simple web applications or more advanced Web 2.0 services, e.g., Wordpress or Facebook, will open their data to the Web. There is a growing demand for the exploitation of these data sources [AAB+09], and in particular there is the need for searching and retrieving information from these data sources. With the current availability of data publishing standards and tools, publishing semi-structured data on the Web, which from here on we will simply call Web Data, is becoming a mass activity. Indeed, nowadays it is not limited to a few trained specialists any more. Instead, it is open to industries, governments and individuals. Some prominent fields in which examples of semi-structured data publishing efforts exist are: e-government with the launch of data.gov by the US Federal Government, http: //data.gov.uk/ by the UK government or data.vancouver.ca by the City of Vancouver to publicly share open government data; editorial world in which we can cite the effort of the New York Times or Reuters to publish rich metadata about their news articles; 1Linked Data: http://linkeddata.org/ 3 Introduction e-commerce with for example BestBuy, which published product descriptions, store details and other information in a machine-readable format; web content management systems such as Drupal or Wordpress that provides means to expose their structured information as Web Data. social networking such as the Facebook's Open Graph Protocol which enables any web page to become a rich object in a social graph. e-science such as Bio2RDF2 which offers the integration of over 30 major biological databases. Also, the emergence of Web 2.0 communities in which users join ad-hoc communities to create, collaborate and curate domain-specific information has created many \closed data silos" that are not reusable outside the communities. Some of these communities, e.g., Last.fm or LiveJournal, are already publishing open data, and it is expected that other communities will follow the lead. The Web is becoming an universal interactive medium for exchanging data, being unstructured, semi-structured or structured. The result is a sea of data objects ranging from traditional web documents to records in online databases and covering various domains such as social information or business data. However, the usefulness of Web Data publishing is clearly dependent of the ease by which data can be discovered and consumed by others. Taking the e-commerce example, how can a user find products matching a certain description pattern over thousands of e-commerce data sources ? By entering simple keywords into a web search system, the results are likely to be irrelevant since the system will return pages mentioning the keywords and not the matching products themselves. Current search systems are inadequate for this task since they have been developed for a totally different model, i.e., a Web of Documents. The shift from documents to data objects poses new challenges for web search systems. One of the main challenge is to develop efficient and effective methods for searching and retrieving data objects from a world of distributed and heterogeneous data sources. Throughout this thesis, we introduce a novel methodology for searching semi-structured information coming from distributed data sources. This methodology applied directly and has been motivated by the Web Data scenario. We introduce a logical view of the distributed data sources and propose a boolean model to search over these data sources. The logical view is built around two important concepts, the concept of dataset and 2Bio2RDF: http://bio2rdf.org/ 4 Introduction entity. Based on this model, we propose appropriate methods to search data objects and improve the search results. All of the methods are designed especially for large scale scenarios. We finally show how to integrate and interconnect these novel techniques into a single system, a search engine for Web Data, which retains the same class of scalability than traditional web information retrieval systems.