Web-Scale Profiling of Semantic Annotations in HTML Pages

Web-Scale Profiling of Semantic Annotations in HTML Pages Inauguraldissertation zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften der Universitat¨ Mannheim vorgelegt von Dipl.-Wirtsch.-Inf. Robert Meusel aus Berlin Mannheim, 2016 Dekan: Professor Dr. Heinz Jurgen¨ Muller,¨ Universitat¨ Mannheim Referent: Professor Dr. Christian Bizer, Universitat¨ Mannheim Korreferent: Professor Dr. Wolfgang Nejdl, Leibniz Universitat¨ Hannover Tag der mundlichen¨ Prufung:¨ 10. Marz¨ 2017 Abstract The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision which was adopted by large numbers of web sites in the last years. Semantic annotations are integrated into the code of HTML pages using one of the three markup languages Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases. However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What are the topics covered by semantic annotations? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful for others than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps: In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis evaluates first the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations in the Web. In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations from the same and across different web sites. The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze the capability of product-related semantic annotations to be integrated into an existing product categorization schema. Especially, the potential of exploiting the diverse category information given by the web sites providing semantic annotations is evaluated. Zusammenfassung Vor mehr als 20 Jahren veroffentlichte¨ Tim Berners-Lee seine Idee des Semantic Webs. Basierend auf seiner Vision, sollte das semantische Web eine Erweiterung des bestehenden Webs sein, in dem die enthaltenen Informationen semantisch definiert sind, wodurch die Kooperation zwischen Mensch und Maschine vereinfacht werden wurde.¨ [Berners-Lee et al., 2001] Semantische Annotationen in HTML Seiten sind eine konkrete Umsetzung dieser Idee, die in den letzten Jahren von sehr vielen Webseitenbetreibern adap- tiert wurden. Semantische Annotationen werden direkt im HTML Quellcode der Webseite mithilfe der drei HTML-Markup-Erweiterungen Microformats, RDFa, und Microdata eingefugt.¨ Hauptsachlich¨ werden so annotierte Informationen von den großen Suchmaschinenfirmen, Bing, Google, Yahoo! oder Yandex verarbeitet. Diese Firmen nutzen semantische Annotationen, die sie in dem HTML Quell- code von gecrawlten Webseiten finden, um die Anzeige von Suchergebnissen zu verbessern oder ihren internen Wissensgraphen zu erweitern. Trotz der starken Nutzung durch die Suchmaschinenfirmen ist wenig uber¨ die Einbindung und Ver- breitung von semantischen Annotationen im Web bekannt: Wie viele Webseiten bieten semantische Annotationen an? Welche Themengebiete werden beschrieben? Wie detailliert sind die annotierten Informationen und nutzen Webseitenbetreiber die Annotationen korrekt? Sind die so angebotenen Informationen nutzlich¨ und wie konnen¨ sie effizient gesammelt werden? Diese Fragen werden in den drei, aufeinanderfolgenden Teilen dieser Disserta- tion im Zuge einer umfassenden Profilierung des Datenraumes, der von semantischen Annotationen aufgespannt wird, beantwortet. Im ersten Teil werden zwei Moglichkeiten¨ zur Sammlung von semantischen An- notationen diskutiert. Zuerst evaluiert die Dissertation eine Methodik, die sich an der Idee des fokussierten Crawlens orientiert. Daraufhin wird ein Frame- work vorgestellt, welches semantische Annotationen aus bestehenden Webcrawl- datensatzen¨ extrahieren kann. Beide Vorgehensweisen werden verglichen und mit Bezug auf die Reprasentativit¨ at¨ der gewonnen Daten evaluiert. Im zweiten Teil analysiert die Arbeit empirisch die allgemeine, wie auch markup- spezifische Verbreitung von semantischen Annotationen im Web basierend auf dem großten¨ offentlich¨ zuganglichen¨ Webcrawldatensatzes. Uber¨ die Verbreitung hinaus werden die enthaltenen Themengebiete sowie deren Veranderung¨ uber¨ die Zeit be- trachtet. Nachfolgend untersucht die Arbeit, zu welchem Grad Webseitenbetreiber semantische Annotationen korrekt benutzen. Der abschließende Teil der Arbeit fokussiert sich auf eine anwendungsbezogene Analyse von semantischen Annotationen. Basierend auf den Anforderungen eines Onlineshops, einer Nachrichtenaggregationsseite und eines Reiseportals wird die Nutzlichkeit¨ von semantischen Annotationen evaluiert. Anschließend wird untersucht, in wie weit es moglich¨ ist, die seitenspezifischen Produktkategorisierungen zu nutzen um Produktinformationen, auf eine bestehende Produktklassifizierung abzubilden und somit eine feingranulare Themenanalyse zu ermoglichen.¨ Contents 1 Introduction 1 1.1 Motivation . .4 1.2 Problem Description and Contributions . .5 1.3 Thesis Outline . .8 1.4 Published Work . 11 2 Preliminaries 13 2.1 Semantic Markup Languages . 13 2.1.1 Microformats . 14 2.1.2 RDFa . 15 2.1.3 Microdata . 16 2.2 Common Vocabularies . 16 2.2.1 Open Graph Protocol . 17 2.2.2 Schema.org . 18 2.2.3 Other Vocabularies . 19 2.2.4 Namespaces and Abbreviations . 21 2.3 Parsing Semantic Annotations to RDF . 21 2.4 Dataspaces . 23 I Extraction of Semantic Annotations 25 3 Data Extraction from the Web using Focused Crawling 27 3.1 Related Work . 28 3.1.1 Structure of the Web . 28 3.1.2 Crawling . 31 3.1.3 Focused Crawling . 32 3.2 Focused Crawling Methodology . 34 3.2.1 Online Classification . 34 3.2.2 Bandit-Based Selection . 36 3.3 Experimental Setup . 39 3.3.1 System Architecture and Process Flow . 40 3.3.2 Research Data . 41 iii iv CONTENTS 3.3.3 Experiments Description . 42 3.3.4 Evaluation Metrics . 44 3.4 Results . 44 3.4.1 Online Classification Optimization . 44 3.4.2 Offline versus Online Classification . 45 3.4.3 Evaluation of Different Bandit Functions . 46 3.4.4 Adaptability to More Specific Semantic Annotations Crawl- ing Tasks . 49 3.4.5 Evaluation of Runtime . 50 3.5 Conclusion . 52 4 Data Extraction from Web Corpora 55 4.1 Public Web Corpora . 56 4.2 Overall Extraction Workflow . 58 4.3 Extraction of Microformats, RDFa, and Microdata . 59 4.4 Additional Use Cases . 60 4.5 Discussion and Conclusion . 61 5 Comparison of the Extraction Approaches 63 5.1 Representativity . 63 5.2 Sampling Errors . 66 5.3 Conclusion . 68 II Analysis of Semantic Annotations 69 6 Overall Adoption of Semantic Annotations 71 6.1 Introduction to Data Profiling . 72 6.1.1 Different Dimensions of Profiling . 72 6.1.2 Profiling Semantic Annotations in HTML pages . 73 6.2 Profiling of the Adoption of Semantic Annotations . 74 6.2.1 Research Data and Measures . 74 6.2.2 Overall Adoption . 75 6.2.3 Microformats Adoption . 76 6.2.4 RDFa Adoption . 76 6.2.5 Microdata Adoption . 78 6.3 Related Work . 81 6.3.1 Web Data Profiling . 81 6.3.2 Microformats, RDFa and Microdata . 82 6.4 Summary . 83 7 Profiling of schema.org Microdata 85 7.1 Problem Statement . 87 7.1.1 Duplicates . 88 CONTENTS v 7.1.2 Non-compliance to the Schema . 90 7.2 Methodology . 93 7.2.1 Syntactic Duplicate Removal . 93 7.2.2 Heuristics to Correct Schema Violations . 94 7.2.3 Combined Approach . 96 7.2.4 Semantic Identity Resolution . 97 7.3 Empirical Findings . 99 7.3.1 Syntactical Duplicate Removal and Correction of Schema Violations . 99 7.3.2 Semantic Duplicate Detection . 103 7.4 Discussion . 107 7.4.1 Quality of Structural Duplicate Detection . 107 7.4.2 Limitation of Duplicate Detection by RDF Graph Equivalence109 7.4.3 Limitation of Heuristics with High Precision . 109 7.4.4 Selection of (Pseudo-)Key Properties . 110 7.5 Related Work . 110 7.6 Summary . 112 8 Evolution of the Deployment of schema.org Microdata over Time 115 8.1 Research Data . 116 8.2 Research Questions and Methodology . 118 8.2.1 Top-down Processes . 118 8.2.2 Bottom-up Processes . 119 8.2.3 Overall Convergence of Vocabulary Usage . 121 8.2.4 Influence of Data Consumers . 122 8.3 Empirical Findings . 123 8.3.1 Top-down Processes . 123 8.3.2 Bottom-up Processes . 128 8.3.3 Overall Convergence of Vocabulary Usage . 131 8.4 Related Work . 134 8.5 Summary . 135 III Use Case-Specific Profiling 137 9 Use Case-specific Utility Analysis of schema.org Data 139 9.1 Use Cases . 140 9.1.1 Marketplace . 140 9.1.2 News Aggregator . 141 9.1.3 Travel Portal . 141 9.2 Related Work . 141 9.3 Research Data and Methodology . 142 9.3.1 Research Data . 142 9.3.2 Methodology . 143 vi CONTENTS 9.4 Empirical Findings . 144 9.4.1 Use Case-Independent Analysis .

Load more