Kurt Uwe Stoll
Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce
Doctoral Thesis
Fakultät für Wirtschafts- und Organisationswissenschaften
Univ.-Prof. Dr. Hans A. Wüthrich Univ.-Prof. Dr. Martin Hepp
Univ.-Prof. Dr. Claudius Steinhardt Univ.-Prof. Dr. Stephan Kaiser Univ.-Prof. Dr. Karl Morasch
12.7.2016
Dr. rerum politicarum
(Dr. rer. pol.)
1. November 2016 Doctoral Thesis
Author: Kurt Uwe Stoll
Supervisor: Prof. Dr. Martin Hepp
A thesis submitted in partial fulfillment of the requirements for the degree of Dr. rer. pol.
at the
UNIVERSITÄT DER BUNDESWEHR MÜNCHEN
November 1, 2016

“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
“But it was the Great Question! The Ultimate Question of Life, the Universe and Everything,” howled Loonquawl.
“Yes,” said Deep Thought with the air of one who suffers fools gladly, “but what actually is it?”
A slow stupefied silence crept over the men as they stared at the computer and then at each other.
“Well, you know, it’s just Everything ... Everything ...” offered Phouchg weakly.
“Exactly!” said Deep Thought. “So once you know what the question actually is, you’ll know what the answer means.”
Douglas Adams - The Hitchhiker’s Guide to the Galaxy

Abstract
Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce
by Kurt Uwe Stoll
In recent years, e-commerce has grown massively and evolved into a main driver of technological innovation on the Web. The Semantic Web is a vision to advance the technological foundation of the Web so that computers are empowered to better extract and process information from Web content [AH04, p. 1f.]. A core principle of the Semantic Web is to augment Web markup with structured data suited for machine processing, instead of markup just suitable for rendering the information for human consumption [AH04, p. 1f.]. The application of the Semantic Web to e-commerce shows significant potential, in particular for the efficiency and precision of search, improving data quality, and raising market efficiency.
Despite a significant increase in adoption, the percentage of Web sites that provide data markup for e-commerce information is still limited and will likely remain limited for many years to come. Predominantly, the data is generated with shop software extension modules, covering only a small fraction of the Web. At the same time, automatic methods for Web Information Extraction are still not able to reconstruct the full amount of structured data behind Web content.
In order to address this issue, we propose a novel method for Web Information Extraction, targeted to the e-commerce domain. The approach exploits (1) the market dominance of a small number of e-commerce systems, (2) the patterns those systems expose in Web page generation, and (3) the existing structured data in e-commerce.
We evaluate our findings by splitting our dataset into a learning set and an evaluation set. Our results show that the approach is feasible for extracting structured data from e-commerce sites that do not include data markup solely on the basis of template similarity and existing markup as training data.
The fundamental idea is to combine similarities in Web page templates, caused by the popularity of off-the-shelf shop software, with the use of data markup found in the subset of Web pages as training data for machine learning.

Kurzzusammenfassung
Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce
by Kurt Uwe Stoll
The e-commerce sector has grown strongly in recent years and has established itself as a driving force of technical innovation on the Web. The Semantic Web is a vision of improving the technological foundations of the Web so that computers can more easily extract and process information from Web content [AH04, p. 1f.]. The core principle is to enrich Web page code, which was originally designed for presentation to humans, with structured data that is machine-readable [AH04, p. 1f.]. In the context of e-commerce, applying Semantic Web technologies holds significant potential, in particular for efficiency and search precision, for improving data quality, and for improving market efficiency.
Despite a significant increase in the use of these technologies, the share of Web sites that use structured data remains limited and will in all likelihood stay so over the coming years. The data is predominantly generated by shop extensions. At the same time, automated methods from the field of Web Information Extraction are not yet able to represent the entirety of the information contained in Web pages as structured data.
To solve this problem, a new method for Web Information Extraction in e-commerce is proposed. It exploits the dominant market position of a few e-commerce systems, the patterns these systems produce when generating Web pages, and the existing structured data from semantic e-commerce.
The results are evaluated by splitting the available data into training data and test data. Our results show that the approach can generate additional structured data solely by using similarities in templates and existing markup. The fundamental idea is to combine similarities in Web page templates, which arise from the popularity of standard shop software, with the use of structured markup as training data for machine learning.

Acknowledgements
First of all, I would like to sincerely thank my supervisor, Prof. Dr. Martin Hepp, for his guidance, support and encouragement. Without his supervision and trust in my ideas, this thesis would have never existed. Working with him was a highly inspiring experience. Additionally, I want to thank Prof. Dr. Claudius Steinhardt for taking over the role of co-supervisor.
I want to thank my colleagues Dr. Mouzhi Ge, Andreas Radinger, Dr. Bene Rodriguez, Alex Stolz and Laszlo Török for the inspiring discussions and the productive atmosphere at work. I owe progress at many critical points along the way to you. Many thanks also go to all my dear friends, without whom life would have never been so colorful.
Most of all, I want to thank my wife Nadine. You are the best thing that has ever happened to me. Without your love, I would have never come so far. Especially, I want to thank my family. In the rest of my life, I can never pay back the love and care I owe to my mother.
Finally, I want to thank Christopher David Ryan for the friendly provision of the title page graphic.
Last but not least, I would like to thank the Universität der Bundeswehr München, which funded this research for a significant period and provided a highly creative atmosphere.
Contents
Abstract
Kurzzusammenfassung
Acknowledgements
List of Figures
List of Tables
Listings
Abbreviations
1 Introduction
1.1 Problem Statement and Hypothesis
1.2 Relevance
1.2.1 Potential of the Semantic Web for E-Commerce
1.2.2 Existing Semantic E-Commerce Data and Limitations
1.3 Contributions
1.4 Research Questions
1.5 Experimental Design
1.6 Organization of the Thesis
1.7 Previously Published Work
2 Structured Data: Fundamentals and Usage for E-Commerce
2.1 Semi-Automated Structured Data Generation on the Semantic Web
2.1.1 The Web
2.1.1.1 Economical Dimensions
2.1.1.2 Social Dimensions
2.1.1.3 Design Principles of the Web
2.1.1.4 Fundamental Problems of the Web
2.1.2 Semantic Web
2.1.2.1 Vision
2.1.2.2 Semantic Web Technology Stack
2.1.2.3 Linked Data
2.1.2.4 Schema.org, Google Semantic Web Tools and Google Knowledge Graph
2.1.3 Conclusion
2.2 Semantic E-Commerce
2.2.1 Technological Foundations of E-Commerce
2.2.2 The GoodRelations Web Ontology for E-Commerce
2.2.2.1 Goals and Design Principles
2.2.2.2 Data Model
2.2.2.3 Features, Documentation, and Ecosystem
2.2.2.4 Existing GoodRelations Data on the Web
2.2.3 Existing Research in Semantic E-Commerce
2.2.4 Real-World Usage of Structured E-Commerce Data
2.2.5 Economical Implications of Semantic E-Commerce
2.2.6 Conclusion
2.3 Automated Generation of Structured Data with Web Information Extraction
2.3.1 Research Strains in Web Information Extraction and Relation to Semantic Web Research
2.3.2 Classical Web Information Extraction Approaches
2.3.3 Recent Approaches to Web Information Extraction
2.3.4 Web Information Extraction Targeting the E-Commerce Domain
2.3.5 Ontology-Based Web Information Extraction
2.3.6 Semantic Web Information Extraction Approaches Targeting E-Commerce
2.3.7 Novelty of Our Approach
2.3.8 Related Field: Web Mining
2.4 Big Data and Validity of the Contribution
3 Foundational Building Blocks
3.1 Impact of E-Commerce Systems on the Availability of Structured Data in E-Commerce
3.1.1 Related Work
3.1.1.1 Market Studies
3.1.1.2 Functional Comparisons
3.1.2 Understanding the Impact of E-Commerce Software on the Adoption of Structured Data on the Web
3.1.3 Implementation
3.1.3.1 Obtaining a List of Relevant Site URIs
3.1.3.2 Counting Product Pages Based on XML Sitemaps
3.1.4 Results
3.1.4.1 Summary
3.1.4.2 Impact of E-Commerce Software on the Adoption of Structured Data
3.1.4.3 Site Popularity
3.1.5 Evaluation
3.1.6 Discussion and Limitations
3.1.7 Conclusion
3.2 E-Commerce System Identification Based on Sparse Features
3.2.1 Related Work
3.2.1.1 Web Page Classification
3.2.1.2 Supervised Classification
3.2.2 Methodology, Approach, and Implementation
3.2.2.1 Overview
3.2.2.2 Design Rationales
3.2.2.3 Generating Datasets and Preprocessing
3.2.2.4 Building a Classifier
3.2.2.5 Implementation
3.2.3 Results
3.2.3.1 Feature Set and Algorithm Performance
3.2.3.2 Speed
3.2.3.3 Performance on Different Clusters
3.2.3.4 Consolidated Algorithm Review
3.2.4 Evaluation
3.2.4.1 Evaluation on GR-Notify Dataset
3.2.4.2 Evaluation on Targeted ECS Reference Shops
3.2.4.3 Evaluation on Non-Targeted ECS Reference Shops
3.2.4.4 Evaluation on Non-Shop Sites
3.2.5 Limitations
3.2.6 Conclusion
3.3 Structured E-Commerce Data on the Web
3.3.1 Related Work
3.3.2 GR-Notify as a Registry for GoodRelations-enabled Shops
3.3.2.1 Approach
3.3.2.2 Implementation
3.3.3 Analysis of GR-Notify Data
3.3.3.1 Approach
3.3.3.2 Implementation
3.3.3.3 Results
3.3.4 Generating a Sample of GoodRelations Data on the Web
3.3.4.1 Approach
3.3.4.2 Implementation
3.3.5 Analysis of the Sample
3.3.5.1 Implementation
3.3.5.2 Results
3.3.6 Evaluation
3.3.7 Limitations
3.3.8 Conclusion
4 Structured Data for Web Information Extraction in E-Commerce
4.1 Approach
4.1.1 Fundamentals
4.1.1.1 Web Information Extraction in Comparison to Shop Extensions
4.1.1.2 Focussing on the Promise Part of GoodRelations’ APO Principle
4.1.2 Properties in Regard
4.1.2.1 Properties Used in the Approach
4.1.2.2 Additional Properties Regarded in the Use Case
4.1.2.3 Excluded Properties
4.1.3 Experimental Design
4.1.3.1 Evaluation
4.1.3.2 High-level Pseudocode Overview
4.1.4 Conclusion
4.2 Implementation
4.2.1 Python as Main Programming Language
4.2.2 Dataset Generation
4.2.3 Extraction of Provided Data from Offering Pages
4.2.4 Quality of the Extracted Data
4.2.5 Generation of Extraction Rules
4.2.6 Evaluation
4.2.7 Conclusion
4.3 Results
4.3.1 Dataset Generation
4.3.2 Extraction of Data from Offering Pages
4.3.3 Rule Generation
4.3.4 Conclusion
4.4 Evaluation
4.4.1 Standard Settings
4.4.2 Modified Evaluation
4.4.3 Modified Sample
4.4.4 Modified Rule Generation
4.4.5 Additional Dataset: Manually Labeled, n=20 per ECS
4.4.6 Conclusion
4.5 Use Case: Real-time E-Commerce Web Information Extraction System
4.5.1 Design
4.5.2 Implementation of the Extraction System
4.5.3 Implementation of the Frontend
4.5.4 Output of a Typical Run of the WIE System
4.5.5 Frontend Overview
4.5.6 Conclusion
5 Conclusion and Outlook
5.1 Contributions
5.1.1 Structured Data: Fundamentals and Usage in the E-Commerce Domain
5.1.2 Foundational Building Blocks
5.1.3 Structured Data for WIE in E-Commerce
5.2 Limitations and Future Work
5.2.1 Dataset
5.2.2 Approach
5.2.3 Evaluation
5.2.4 Use Cases
5.2.5 Scale
5.2.6 Scope
5.3 Outlook: On the Self-Replicating Nature of Structured Data
Bibliography

List of Figures
1.1 Approach
1.2 Search engine bottleneck, referring to [Hepa]
1.3 Interplay of foundational and main contributions
1.4 Web shop with exemplary extraction targets
1.5 Extraction rule generator approach
2.1 Strains of relevant related work
2.2 Market capitalization of Internet companies (USA), April 2013
2.3 Reduced Semantic Web technology stack relevant to this work, own representation based on [Ber00]
2.4 URI scheme, Berners-Lee, Fielding, and Masinter [BFM05]
2.5 URI scheme - example
2.6 Graph of the RDF example
2.7 Six effects of ontologies, based on [Hep08b]
2.8 Most important conceptual elements of the GoodRelations ontology
3.1 Research foundation
3.2 Effect of enabling structured data for an e-commerce system on product pages
3.3 Distribution of the number of product pages per shop software package
3.4 Supervised Machine Learning: General approach, based on [Kot07]
3.5 Overview of experimental design
3.6 Heat map of F1-all score for 18 feature / algorithm combinations
3.7 Heat map: time elapsed for 18 feature / algorithm combinations
3.8 GR-Notify - ping frequency
3.9 GR-Notify - top level domains
3.10 GR-Notify - submitting ECS
3.11 GR-Notify - submitting ECS pie chart
3.12 GR-Notify - submissions over time
3.13 GR-Notify - frequency world heat-map
3.14 Learning set generator overview
3.15 Implementation pipeline - sample analysis
3.16 Length analysis - name, unit: characters
3.17 Length analysis - description, unit: characters
3.18 Count analysis - eligibleRegions, unit: region codes
3.19 Count analysis - acceptedPaymentMethods
3.20 Count analysis - availableDeliveryMethods
3.21 Distribution of hasCurrency by ECS
3.22 Distribution of acceptedPaymentMethods by ECS
3.23 Distribution of availableDeliveryMethods by ECS
3.24 Distribution of valueAddedTaxIncluded by ECS
3.25 Distribution of validity statement duration by ECS
3.26 World map coloring
3.27 World map of the frequency of eligibleRegions - Magento
3.28 World map of the frequency of eligibleRegions - Oxid E-Commerce
3.29 World map of the frequency of eligibleRegions - Prestashop
3.30 World map of the frequency of eligibleRegions - Virtuemart
4.1 Extraction rule generator approach
4.2 Aggregated rule generation results
4.3 Final results - standard settings - precision
4.4 Impact of the stricter settings on the precision
4.5 Impact of the relaxed settings on the precision
4.6 Impact of a training set of 0.25 settings on the precision
4.7 Impact of a training set of 0.75 settings on the precision
4.8 Precision for early adopters
4.9 Precision for later adopters
4.10 Precision while omitting first class
4.11 Result differences - wild card - precision
4.12 Precision with a manually created dataset
4.13 Overview of the frontend functionality
5.1 Final results - standard settings - precision

List of Tables
2.1 Example XPaths
2.2 Overview of discussed work in Web Information Extraction
3.1 Consolidated list of search strings for the 56 e-commerce systems in regard
3.2 URIs found in e-commerce sitemaps from one million Alexa sites and product item estimate, results (absolute)
3.3 URIs found in sitemaps and product item estimate, results (relative)
3.4 Precision of the shop detection technique - Demandware - Prestashop
3.5 Precision of the shop detection technique - EC-SHOP - mean
3.6 Learning set instances by ECS
3.7 Remaining recall-base after white list filtering
3.8 F1-all-scores for 18 feature / algorithm combinations
3.9 Time elapsed (s) for 18 feature / algorithm combinations
3.10 Classification report of “class+id” / XTREE classifier on distinct ECS
3.11 Consolidated review of speed / performance of used algorithms
3.12 GR-Notify evaluation: Remaining recall after white-list application
3.13 GR-Notify evaluation: Classification report of “class+id” / XTREE classifier
3.14 Evaluation on targeted ECS reference shops - classification results
3.15 Evaluation on targeted ECS reference shops - precision, recall, F1-score
3.16 Evaluation on non-targeted ECS reference shops
3.17 GR-Notify - top level domains
3.18 Evaluation on non-targeted ECS reference shops
3.19 GR-Notify - frequency world - GEOIP analysis
3.20 Analyzed HTML pages and RDF offering graphs per ECS
3.21 GoodRelations properties attached per offer by ECS
3.22 Evaluation with crawl dataset, per offering
4.1 Comparison of shop extensions with our approach
4.2 Extraction targets
4.3 Overview of the extraction rules
4.4 HTML sample pages - all / training / evaluation from different ECS and sums
4.5 Ratio of valid data in the extracted raw data
4.6 Rule generation results, desc. property, rank 1 to 5, and score
4.7 Rule generation results, image property, rank 1 to 5, and score
4.8 Rule generation results, name property, rank 1 to 5, and score
4.9 Rule generation results, price property, rank 1 to 5, and score
4.10 Aggregated rule generation results - dataset
4.11 Final results - standard settings - precision
4.12 Strict evaluation
4.13 Impact of the stricter settings on the precision
4.14 More relaxed evaluation - settings
4.15 Impact of the relaxed settings on the precision
4.16 Impact of a training set of 0.25 settings on the precision
4.17 Impact of a training set of 0.75 settings on the precision
4.18 Precision for early adopters
4.19 Precision for later adopters
4.20 Precision while omitting first class
4.21 Result differences - wild card - precision
4.22 Precision with a manually created dataset
4.23 Precision, absolute with a manually created dataset
4.24 Output of a typical run of the WIE system, n=250 URIs

Listings
2.1 Fragment identifiers
2.2 Running example in Turtle notation
2.3 Running example in RDF/XML notation
2.4 Running example in RDFa notation
2.5 Running example in JSON-LD notation
2.6 Offering example in Turtle syntax
2.7 SPARQL query to find offerings
3.1 Parallelization with GNU parallel
3.2 Overview query
3.3 Property frequency analysis - Offering
3.4 Length analysis - name - description
3.5 Length analysis - eligibleRegions - acceptedPaymentMethods - availableDeliveryMethods
3.6 Multi-value analysis - hasCurrency
3.7 Multi-value analysis - acceptedPaymentMethods - availableDeliveryMethods
3.8 Multi-value analysis - valueAddedTaxIncluded
3.9 Multi-value analysis - validity statement
3.10 World heat map - eligibleRegions
4.1 Experimental design pseudocode overview
4.2 Dataset generation - source code
4.3 Extract provided data from offering pages - source code
4.4 Check extracted data quality - source code
4.5 Generate extraction rules - source code
4.6 Evaluation - source code
Abbreviations
API Application Programming Interface
APO Agent-Promise-Object
CMS Content Management System
CPU Central Processing Unit
CSS Cascading Style Sheets
DOM Document Object Model
ECS Electronic Commerce System
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
IP Internet Protocol
JSON JavaScript Object Notation
RAM Random Access Memory
RDF Resource Description Framework
REST Representational State Transfer
RQ Research Question
SEO Search Engine Optimization
SPARQL SPARQL Protocol and RDF Query Language
SW Semantic Web
TLD Top-Level Domain
UML Unified Modeling Language
URI Uniform Resource Identifier
WIE Web Information Extraction
XML Extensible Markup Language
XPath XML Path Language
1 Introduction

In the following sections, we (1) present the problem statement and hypothesis, (2) discuss the relevance, (3) highlight the contributions, (4) formulate research questions, (5) specify the experimental design, and (6) explain the organization of the thesis. We close with (7) a list of previously published work.
1.1 Problem Statement and Hypothesis
In recent years, e-commerce has grown massively and evolved into a main driver of technological innovation on the Web¹. The Semantic Web is a vision to advance the technological foundation of the Web so that computers are empowered to better extract and process information from Web content [AH04, p. 1f.]. A core principle of the Semantic Web is to augment Web markup with structured data suited for machine processing, instead of markup just suitable for rendering the information for human consumption [AH04, p. 1f.]. The application of the Semantic Web to e-commerce shows significant potential, in particular for the efficiency and precision of search, improving data quality, and raising market efficiency.

Despite a significant increase in adoption, the percentage of Web sites that provide data markup for e-commerce information is still limited and will likely remain limited for many years to come. Predominantly, the data is generated with shop software extension modules, covering only a small fraction of the Web. At the same time, automatic methods for Web Information Extraction are still not able to reconstruct the full amount of structured data behind Web content.

¹ In the course of the thesis, we use “Web” synonymously for “World Wide Web (WWW)”.

A structured data representation of offerings, fundamental for Semantic E-Commerce, is available only with relatively low market coverage. Predominantly, the data is generated with shop extensions, covering only a small fraction of the Web. At the same time, automatic methods that exploit Web Information Extraction to generate structured data are largely unexplored. Moreover, the existing structured data constitutes a significant learning set to drive Web Information Extraction.
In order to address this issue, we propose a novel method for Web Information Extraction, targeted to the e-commerce domain. The approach is based on the following observations:
1. A large share of Web shops is implemented on the basis of standard software packages, and the total number of popular software solutions is small.
2. The resulting HTML code of the published Web pages shows a significant amount of similarity for the same underlying software solution, despite the frequently very different visual appearance caused by customization efforts.
3. There is a significant amount of e-commerce Web pages with structured data markup, but the absolute market coverage and adoption are still limited.
This leads to our research hypothesis:
The existing structured e-commerce data on the Web, in combination with the market structure of e-commerce systems, and the similarity in the HTML patterns they expose, can be used as a lever to generate additional e-commerce data in significant quantity and quality, and thus increase the market coverage in the available data.
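The second observation can be illustrated with a small sketch. It assumes, purely for illustration, that template similarity can be approximated by the Jaccard overlap of the CSS class names used in two pages; this is a simplification and not the classifier developed later in the thesis. The helper names `class_tokens` and `jaccard`, as well as the three HTML fragments and their class names, are hypothetical.

```python
import re


def class_tokens(html):
    """Collect the set of CSS class names used in an HTML fragment."""
    tokens = set()
    for value in re.findall(r'class="([^"]*)"', html):
        tokens.update(value.split())
    return tokens


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical fragments: shop_a and shop_b share a shop software
# template but differ in theme and content; shop_c uses other software.
shop_a = '<div class="product-view"><span class="price-box regular-price">4.90</span></div>'
shop_b = '<div class="product-view custom-theme"><span class="price-box regular-price">12.00</span></div>'
shop_c = '<section class="item"><p class="cost">7.30</p></section>'

same_software = jaccard(class_tokens(shop_a), class_tokens(shop_b))
other_software = jaccard(class_tokens(shop_a), class_tokens(shop_c))
print(same_software, other_software)  # prints 0.75 0.0
```

In this toy setting, pages generated by the same software keep most of their class names despite customization, while pages from different software share none.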
We aim at using a small proportion of the e-commerce Web that is already equipped with structured data as a “blueprint” to generate extraction rules that can be applied to offering pages in the ordinary² Web of e-commerce.

The approach we suggest is shown in Figure 1.1. Given a set of e-commerce pages that contain structured data, we generate a set of extraction rules. If we apply these rules to e-commerce pages not containing structured data, we can generate structured data for those pages as well.

Figure 1.1: Approach (diagram labels: e-commerce pages containing structured data; extraction rules; e-commerce pages not containing structured data; e-commerce pages containing structured data)

² We use the term ordinary here to distinguish the Web in its current state from a Web that is equipped with structured data.
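The idea of learning extraction rules from pages with markup and applying them to pages without markup can be sketched as follows. This is a minimal illustration under stated assumptions, not the rule generator of this thesis: a rule is taken to be the path of `tag.class` steps that most often leads to a value already known from structured data. The names `TextByPath`, `learn_rule` and `apply_rule`, and all sample pages, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without an end tag; they never enclose text.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class TextByPath(HTMLParser):
    """Record (path, text) pairs; a path is a tuple of 'tag.class' steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_by_path(html):
    parser = TextByPath()
    parser.feed(html)
    parser.close()
    return parser.found


def learn_rule(training_pages):
    """training_pages: (html, known_value) pairs, where known_value is
    assumed to come from the structured data attached to the page."""
    votes = Counter()
    for html, known_value in training_pages:
        for path, text in text_by_path(html):
            if text == known_value:
                votes[path] += 1
    return votes.most_common(1)[0][0] if votes else None


def apply_rule(rule, html):
    """Return the first text found at the learned path, or None."""
    for path, text in text_by_path(html):
        if path == rule:
            return text
    return None


# Hypothetical pages with markup (price known) and one page without.
with_markup = [
    ('<div class="product"><h1 class="title">Green Tea</h1>'
     '<span class="price">4.90</span></div>', "4.90"),
    ('<div class="product"><h1 class="title">Black Tea</h1>'
     '<span class="price">3.50</span></div>', "3.50"),
]
without_markup = ('<div class="product"><h1 class="title">Oolong</h1>'
                  '<span class="price">6.20</span></div>')

price_rule = learn_rule(with_markup)
print(apply_rule(price_rule, without_markup))  # prints 6.20
```

The sketch relies on exact text matches and well-formed HTML; a realistic rule generator would need to tolerate formatting differences such as currency symbols and malformed markup.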
1.2 Relevance
In the following two sections, we describe the relevance of the work by showing the potential of Semantic Web approaches for e-commerce, and discuss existing Semantic E-Commerce data and limitations.
1.2.1 Potential of the Semantic Web for E-Commerce
The Web was originally designed for documents that are consumed by humans. This legacy poses significant limitations to automated data processing of the information published on the Web (e.g. [AH04, pp. 1-2], [Lac05, p. 4], cf. [Jür12; Cha+06]). Data on the Web is mostly unstructured, and usually integrated into Web pages that contain rendering directives. As extraction of data from Web pages into a well-structured form is a complex task, many popular Web applications, like search engines, operate mainly on the basis of indexing the textual content of documents on the Web. Thus, Web search is dominated by queries that run against massive corpora of textual Web documents, and does not easily allow sophisticated data-centric queries. For instance, it is difficult to search for biographies of German composers who were born in Munich, and died at the age of 80 or older. While this information is surely available on the Web, with the current Web search technology it is not straightforwardly accessible, and requires extensive manual effort.

Figure 1.2: Search engine bottleneck, referring to [Hepa]. (Diagram labels: granular offerings on Web shops; search engine bottleneck; granular demand.)
At the same time, e-commerce has matured into one of the central drivers of retail growth, representing, for instance, 11.2 % of all retail transactions of the German market in 2013 [Bun13]. As e-commerce is based on Web paradigms, it likewise suffers from the problem mentioned above. Web shops provide their offerings in the form of Web pages, complicating the extraction of initially well-structured data like product name, price, or image. Traditional search does not allow, for instance, querying for products that are manufactured in South Tyrol, and sold in Munich. Again, this leads to extensive manual effort required to obtain such data.
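To make the contrast with keyword search concrete, the following sketch shows how such a query reduces to a simple filter once offerings are available as structured records. The records, the field names `manufactured_in` and `available_in`, and the function `find_offerings` are hypothetical and serve only as an illustration.

```python
# Hypothetical structured offerings; in practice these fields would
# have to be extracted from Web pages or provided as data markup.
offerings = [
    {"name": "Smoked ham", "manufactured_in": "South Tyrol",
     "available_in": {"Munich", "Vienna"}},
    {"name": "Apple juice", "manufactured_in": "South Tyrol",
     "available_in": {"Bolzano"}},
    {"name": "Pretzel", "manufactured_in": "Bavaria",
     "available_in": {"Munich"}},
]


def find_offerings(records, manufactured_in, available_in):
    """Names of offerings matching both the origin and the sales location."""
    return [
        record["name"]
        for record in records
        if record["manufactured_in"] == manufactured_in
        and available_in in record["available_in"]
    ]


print(find_offerings(offerings, "South Tyrol", "Munich"))  # prints ['Smoked ham']
```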
Merchants are not able to articulate their value proposition in high fidelity, and customers are not able to search the market for highly specific goods. Detailed descriptions of the properties of companies (e.g. geo-position or contact information), of offerings (e.g. payment and delivery methods), and of products or services (product master data), are not available for elaborated queries, as they are hidden in text on Web pages.

As a result, many buying decisions are not based on the wealth of information theoretically available, which leads to suboptimal choices [Hepa]. Often, consumers initiate online buying decisions with general search engines [Saf13]. Those search engines reduce offering pages to a minimal preview. This preview represents only a fraction of the content of the original value proposition (cf. [KHL08]), limiting the merchant’s ability for granular signaling. At the same time, the consumers’ ability to screen the market for highly specific goods is limited, as the special features of a product are not accessible through the search engine (cf. [ES07]). By limiting the communication bandwidth between the market participants, search engines limit market efficiency. Fig. 1.2 visualizes the problem: The granular offerings on Web shops are boiled down to a minimal search engine preview (cf. [KHL08]) that may often not match the granularity of the customers’ demand.
In 2001, Tim Berners-Lee et al. proposed the Semantic Web as an extension of the existing Web [BHL01], consisting of two main additions [Tre+08]:

1. First, existing Web pages should be marked up with rich metadata. This metadata should describe the content of Web pages, for instance, by stating that a certain number is the price of an offering.

2. Second, this metadata should be expressed on the basis of ontologies that define a consensual understanding of a specific domain (e.g. [UG96]).
If realized to a great extent, these enhancements would render the whole Web into a giant database, well-suited for automated data processing (e.g. [Ber09]). This would mitigate the initially introduced problem, that the Web in its current state requires human intelligence to act upon. While the Semantic Web has matured significantly from a technological perspective, the initial vision has so far not been realized. One likely reason is that the semantic annotation of Web pages is tedious, and thus did not reach broad adoption (e.g. [SBH06]).
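As a brief illustration of the first addition, the following sketch builds a small piece of metadata which states that a certain number is the price of an offering. It assumes the schema.org vocabulary in JSON-LD notation; the function name `offer_as_jsonld` and the product values are hypothetical, and the example is a simplified stand-in rather than markup taken from a real shop.

```python
import json


def offer_as_jsonld(name, price, currency):
    """Describe a product and its price with schema.org terms."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": price,             # the number is declared to be a price
            "priceCurrency": currency,  # ISO 4217 code, e.g. EUR
        },
    }


# Hypothetical product values.
markup = offer_as_jsonld("Green Tea, 100 g", "4.90", "EUR")
print(json.dumps(markup, indent=2))
```

A consumer of this data no longer has to guess which number on the page is the price; the vocabulary states it explicitly.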
Since the major search engines Google, Yahoo, Bing and Yandex endorse the use of the schema.org vocabulary for granular data markup and promise better performance in search in return, the adoption rate has grown to about 30 % [web14]. However, as of today, only a fraction of Web information is available as structured data.

In e-commerce, realizing the Semantic Web vision would mitigate the aforementioned problems to a large extent. Most importantly, the information bottleneck between market participants would diminish significantly. We lay out three exemplary use cases to further underline the potential of the Semantic Web vision in e-commerce, which are bound to the perspective of certain market participants.
Merchants: Already today, integrating structured data into Web shop pages has the effect of enhancing search engine results, which in turn are expected to raise sales [Edm14]. Structured data also allows third parties like affiliate portals to propagate offerings of a Web shop efficiently without establishing a proprietary interface. If e-commerce data on the Web were available in a structured form, competition could be analyzed automatically, for instance in terms of the quality of product descriptions. Additionally, ordinary data quality approaches operate on the data of a single market participant and on proprietary data sources. Having the data of other market participants at hand on Web scale would allow, for instance, detecting price or product data errors. Those errors are known to be significant cost drivers in the enterprise context (e.g. [Red98]).
Customers: It can be assumed that the specificity of products and services grew over time. Specificity is defined as the trade-off between the usage of a good in its original intent, and the usage of the good in a way it was not intended for (e.g. [McG91]). Thus, goods that can be used without significant trade-off for multiple purposes have a low specificity, for instance water. A highly specific good would be a custom-made birthday present. Current search engines do not support the search for highly specific goods well. For instance, it is not possible to search for protein bars that do not contain peanuts³. While this information might be expressed in the product description, it is not easily accessible in search engines. Providing a granular Semantic Web representation of a product would preserve this information, which could be easily integrated into applications beneficial for customers. In this context, providing structured data in e-commerce could help finding products or services with a high asset specificity, and thus cater for the growing specificity in modern economies. Therefore, enhancing e-commerce pages with Semantic Web technology would extend search capabilities to match rising specificity. One could reasonably argue that generating structured data for e-commerce, which means raising the specificity of data, is an important response to rising asset specificity in modern economies. Asset specificity is largely relevant to the wealth of societies, and raising data specificity subsequently reflects the rising specificity in markets.
³ To prevent, for instance, allergic reactions.
Market research and authorities: Access to granular data about e-commerce with Semantic Web technologies would allow for on-demand economic statistics, for instance on consumer prices. Instead of collecting the data in the predominant extensive process, it would be immediately at hand. That would reduce the collection cost massively. Additionally, it would reduce the time span between the occurrence of a situation and its detection, in turn reducing reaction time. Moreover, the Semantic Web already provides a wealth of social or spatial data sources. As it facilitates data integration, sophisticated applications that combine newly generated market data with existing data sources could be built straightforwardly. This task is commonly considered highly complex with legacy technology. Examples are the strategic positioning of points of sale, or highly targeted marketing.
1.2.2 Existing Semantic E-Commerce Data and Limitations
The most prominent way of using structured data markup in e-commerce is the integration into Web shop pages. This is often realized with extensions for standardized e-commerce software solutions (ECS). We define ECS as a term that describes software systems that allow merchants to manage and provide Web shops. By now, there are at least seven popular extension modules for adding data markup to widely adopted systems [Hep13]. These amount to about 20,000 shop installations, generating structured data for about twenty million offerings⁴. Meanwhile, these figures only account for a relatively small share of Web shops and offerings on a global scale. We call this way of generating structured e-commerce data semi-automatic. That is, while the data is generated automatically, it needs a manual action performed by the shop owner to activate the process.
⁴ Precise figures are not available here. We will elaborate on that topic in Section 3.3.
As a means to expose structured data in e-commerce, the GoodRelations ontology has seen significant adoption [Hep12]. Initially launched in 2008, it provides a data model for e-commerce, building on the Semantic Web technology stack. GoodRelations is equipped with substantial tooling and comprehensive documentation to ease adoption [Hep+09]. It allows expressing a wide range of e-commerce scenarios by default, and can be easily extended to custom domains and use cases [Hepc]. Recently, major search engines have integrated GoodRelations into schema.org [sch12]. Schema.org is the attempt of Google, Yahoo, Bing and Yandex to promote a consolidated vocabulary for structured data [Sch13]. From the search engines’ perspective, the support of structured data is motivated by the significantly lower complexity of extracting meaningful content from Web pages. Structured data is also used to provide contextual content to the user. For example, Google has recently integrated its “Knowledge Graph”, a huge graph of factual information about objects and topics, into its Web search, based on Semantic Web technology⁵. By using GoodRelations data, a similar interface to e-commerce is within reach. As introduced above, in the short run, there are at least two factors that motivate merchants to integrate structured data in Web shops. First, as a result tangible today, search engines reward the integration with visually extended results, which in turn are expected to spur sales (cf. [Bru13]). Second, it facilitates data extraction for search engines, which is expected to influence search engine rankings in a positive way.
Meanwhile, the overall adoption of structured data in e-commerce has grown to only about 30 percent of the market in the last five years [web14]. One likely reason is that GoodRelations markup is usually deployed via extensions to e-commerce systems and not integrated into the default configuration of the shop software. Thus, structured data has to be turned on manually by the merchant, as introduced above.
⁵ http://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html
Many applications that operate on structured e-commerce data require significant market coverage. For instance, a feature comparison engine driven by Semantic E-Commerce data would need a significant coverage of offerings to be useful. Therefore, the low market coverage hinders the sophistication of applications on the basis of existing Semantic E-Commerce data.
1.3 Contributions
In this section, we first present an overview of the contributions, and continue with a more detailed discussion of those.
Foundational contributions:
1. An analysis of the impact of ECS market structures on the deployment of structured data.
2. A reliable machine-learning based method for detecting the ECS used for a Web shop.
3. A collection of sources and an analysis of structured e-commerce data on the Web that provide the basis for our experiment.
Main contribution:
The main contribution is a novel method for the extraction of structured data in the e-commerce domain that builds on the three foundational contributions: By having a certain amount of (3) existing structured data at hand, and being able to (2) identify the e-commerce system, we design a novel data extraction method that exploits (1) system-specific patterns. It generates extraction rules from an aggregated mapping between offering properties extracted from the GoodRelations data and the Web page elements.
Foundational Contributions
1. Impact of E-Commerce Systems on Structured Data
This first foundational contribution shows that only seven ECS generate more than 90 % of the product pages on the Web, which in turn generates a promising lever for the main contribution (see Fig. 1.3). By being able to craft extractors for those seven ECS, we could theoretically generate structured data for a major amount of product pages.
Determining that only a few ECS account for a majority of offering pages on the Web is a significant building block for the later course of the thesis. An equal distribution would have required constructing a large number of ECS-specific extractors, at the expense of the high leverage that emerges from regarding only a few ECS.
2. Identification of E-Commerce Systems
This contribution proposes a novel approach to automatically identify ECS. It is based on the machine learning field of supervised classification (e.g. [Kot07]), and exploits a filtered set of Web page properties.
It is capable of detecting six different e-commerce systems by analyzing only one random page of a Web shop, and shows an overall F1-score of 0.9, see Section 4.3. An extensive evaluation confirms the results.
This contribution provides a practical building block for the main contribution, as the latter requires the accurate and fast detection of e-commerce systems. At the same time, the viability of this approach shows that there exist structural patterns in the markup of different ECS, a premise that is used in the later course of the thesis as an assumption in the rule generator design.
3. Existing Structured E-Commerce Data on the Web
The third foundational contribution analyzes existing GoodRelations data on the Web. As this data is the learning set for our extraction rule generator, a detailed analysis of amount, properties, and quality is needed for the further course of the thesis.
Main Contribution: System-specific E-Commerce Extraction based on Structured Data
The foundational contributions become building blocks of the main contribution in the following form:
Impact: As more than 90 % of product detail pages are generated by seven e-commerce systems, our approach focuses on a relatively small number of ECS, while aiming for a high impact.
Patterns in product pages generated by e-commerce systems: Most Web shops are generated by standardized ECS. In this context, ECS are a subclass of Web content management systems. They usually generate Web pages by combining templates with database content. As templates are generally used for a broad range of similar entities on a Web shop, e.g. offerings or categories, it is possible to exploit the patterns generated by those templates to extract the underlying structured data. This building block is based on the foundational contribution 2, “Identification of e-commerce systems”, as the viability of the approach substantiates this observation.
Learning set: Supervised machine learning often suffers from the lack of a sufficient amount of labeled instances for training [DSW07, p. 37]. In the e-commerce case, a labeled instance would contain the locations of name, price or properties in an offering Web page. We focus on four distinct ECS, as GoodRelations data exists in a significant amount only for four different ECS.
Figure 1.3: Interplay of foundational and main contributions (figure labels: the foundational contributions “7 ECS generate > 90 % of product pages”, “ECS identification”, and “Existing structured data in e-commerce” correspond to the building blocks “Impact”, “Patterns in product pages generated by ECS”, and “Learning set” of the main contribution)
Figure 1.4: Web shop with exemplary extraction targets
These three phenomena are integrated into a novel approach and implemented into four ECS-specific extractors. We visualize the interplay of these phenomena and the main contribution in Fig. 1.3.
Extraction Rule Generator
The core of our approach is represented by the extraction rule generator. The extraction rule generator operates on extraction targets, which may be, for instance, offering name, description, or image. We provide a screenshot of a Web shop⁶ with exemplary extraction targets in Fig. 1.4.
For each different ECS, the extraction rule generator acts according to this high-level scheme, which we visualize in Fig. 1.5:
⁶ http://www.la-mousson.de/
Figure 1.5: Extraction rule generator approach (figure labels: 1. learning set property value extraction; 2. search for page elements containing GoodRelations values; 3. element property extraction; 4. cumulative occurrence ranking; example targets: name, image, price, description)
1. For each offering page in the GoodRelations data learning set, it extracts the “true” values of the extraction targets.
2. In the offering pages, it searches for elements containing the given values in the content.
3. It extracts the properties of the respective elements.
4. The extracted data is ranked according to cumulative occurrences over all offering pages belonging to a certain ECS.
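The four steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation discussed in Chapter 4: the class ElementFinder, the function generate_rules, the sample pages, and the choice of (tag, class, id) as element properties are all hypothetical, and only targets that appear as element text (not, for instance, image URLs in attributes) are handled.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements without a closing tag must not be kept on the stack
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementFinder(HTMLParser):
    """Step 2 and 3: find elements whose text contains a known value."""

    def __init__(self, targets):
        super().__init__()
        self.targets = targets  # step 1: "true" values, e.g. {"price": "4.90"}
        self.stack = []         # (tag, class, id) of currently open elements
        self.hits = []          # (target, element properties) found on the page

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        self.stack.append((tag, attributes.get("class"), attributes.get("id")))

    def handle_endtag(self, tag):
        # Close the most recent matching open element, if there is one
        for index in range(len(self.stack) - 1, -1, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        if not self.stack:
            return
        for target, value in self.targets.items():
            if value in data:
                self.hits.append((target, self.stack[-1]))


def generate_rules(learning_set):
    """Step 4: rank element properties by cumulative occurrence per target."""
    counts = defaultdict(Counter)
    for html, targets in learning_set:
        finder = ElementFinder(targets)
        finder.feed(html)
        finder.close()
        for target, properties in finder.hits:
            counts[target][properties] += 1
    return {target: counter.most_common() for target, counter in counts.items()}


# Hypothetical learning set: (page markup, values taken from structured data)
learning_set = [
    ('<div><h1 class="product-name">Green Tea</h1>'
     '<span class="price">4.90</span></div>',
     {"name": "Green Tea", "price": "4.90"}),
    ('<div><h1 class="product-name">Black Tea</h1>'
     '<span class="price">3.50</span><p>Black Tea from Assam</p></div>',
     {"name": "Black Tea", "price": "3.50"}),
]
rules = generate_rules(learning_set)
print(rules["name"][0])
```

In the second sample page, the name also occurs inside a description paragraph. The cumulative ranking still places the h1 element first, because it matches on both pages, which illustrates why aggregating over many pages of the same ECS helps to suppress accidental matches.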
1.4 Research Questions
The research questions are aligned to the aforementioned contributions.
RQ1: How can the combination of (1) the market domination of a few ECS, (2) an automated approach for detecting the ECS behind a Web site, (3) HTML template similarity, and (4) existing structured data be used to design a system that is able to extract structured data from e-commerce sites that do not contain data markup, with a level of granularity and data quality comparable to extraction from explicit data markup?
RQ2: What is the impact of ECS on the availability of structured data?
RQ3: Can we reliably detect the ECS behind an e-commerce Web site automatically by analyzing only a small number of pages from the site?
RQ4: How can we measure the current diffusion and quality of GoodRelations data?
RQ1 is the main research question, substantiating the main contribution. RQ2 to RQ4 are the foundational research questions that the main research question is built on.
1.5 Experimental Design
From a high-level view, our experimental design and the evaluation show the following layout:
1. We collect a sample of e-commerce offering pages that contain GoodRelations data, and split the resulting data set into a learning set and a test set of equal size, i.e. each contains 50 % of the original data.
2. On the learning set, we generate extraction rules from the GoodRelations data and the page element properties. These rules allow producing structured data for unlabeled offering pages.
3. We cross-validate these extraction rules on the test set.
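The split in step 1 can be illustrated with a short Python sketch. The random shuffling, the fixed seed, the function name split_learning_and_test, and the example URLs are assumptions made for illustration; how the sample is actually drawn is described in the later chapters.

```python
import random


def split_learning_and_test(pages, seed=0):
    """Shuffle the pages and split them into two halves.

    The fixed seed only makes the illustration reproducible.
    """
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Hypothetical offering page URLs that contain GoodRelations data
pages = [f"http://shop.example/offer/{number}" for number in range(10)]
learning_set, test_set = split_learning_and_test(pages)
print(len(learning_set), len(test_set))
```

Keeping the two halves disjoint ensures that the extraction rules are evaluated on pages that were not used to generate them.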
1.6 Organization of the Thesis
Chapter 2, Data: Fundamentals and Usage in the E-Commerce Domain, provides an overview of the related work and state of the art that supports the further chapters of the thesis. It begins with a section on (1) the Semantic Web, which emphasizes the seminal role of the (ordinary) Web in modern societies and introduces the vision of a Semantic Web, which, in the context of our work, stands out as a method to manually generate structured data for Web resources. A section on (2) Semantic Web-based E-Commerce introduces e-commerce technologies, and discusses the predominant GoodRelations Web vocabulary and its ecosystem. The last section of Chapter 2 is devoted to (3) Web Information Extraction, a research area that focuses on the automated generation of structured data from Web resources⁷.
Chapter 3, Foundational Building Blocks, provides the three foundational contributions introduced extensively above.
Chapter 4, Structured Data for Web Information Extraction in E-Commerce, provides the main contribution. We devote a section to the discussion of the properties of the (1) approach, with special regard to what is achievable in comparison to the current state of the art. We move on with a full discussion of the main part of the (2) implementation based on literate Python programming. We present the main (3) results of our experiment. We (4) evaluate our results with cross-validation, modify experimental settings extensively to assess their influence on the results, and evaluate over a manually generated dataset. We close the chapter with (5) a concise, yet pragmatic, use case.
Chapter 5, Conclusion, highlights our achievements, discusses the limitations of the main contributions, and provides an outlook on future work.
1.7 Previously Published Work
Parts of the work presented in the thesis have already been published in conference papers with permission:
1. Kurt Uwe Stoll, Mouzhi Ge and Martin Hepp: Understanding the impact of e-commerce software on the adoption of structured data on the Web. Business Information Systems (BIS 2013), Poznan, Poland.
2. Kurt Uwe Stoll and Martin Hepp: Detection of e-commerce systems with sparse features and supervised classification. 10th IEEE International Conference on E-Business Engineering (ICEBE 2013), Coventry, United Kingdom.
⁷ Definition for our context; related work may have slightly different views.
Paper (1) corresponds to “Impact of E-Commerce Systems on Structured Data”, and paper (2) to “Identification of E-Commerce Systems”, which are discussed in detail in Chapter 3.
2 Structured Data: Fundamentals and Usage for E-Commerce
This chapter summarizes work in fields of research related to the topic of this thesis.
Semantic Web: We will first introduce the Semantic Web vision. While its original vision has not been fully realized so far, it marks the current state of the art in Web science (cf. [SBH06]). In this section, we will underline (1) the paradigm-shifting character of the original Web, (2) its fundamental problems, and (3) the Semantic Web vision. We will thoroughly analyze (4) the Semantic Web technology stack, as it is a technological foundation of the further course of the thesis. We will briefly introduce (5) Linked Data, and conclude with (6) the Semantic Web adoption of leading search engines.
Semantic Web-based E-Commerce: The main aim of our research is to design a system that automatically generates structured e-commerce data for Web shops that originally do not provide it. The research in Semantic Web-based E-Commerce is highly relevant, as it specifically operates on such data. This section has six main parts. We begin with (1) the technological foundation of Semantic Web-based E-Commerce. We then introduce (2) the GoodRelations Web vocabulary, which combines a wide range of use cases with significant market adoption. We progress to (3) a short discussion of existing structured e-commerce data on the Web. We provide (4) a discussion of existing research on Semantic E-Commerce. We then go on with (5) a short overview of how respective data is used in commercial settings. We conclude with (6) a discussion of the economic implications of Semantic E-Commerce.
Web Information Extraction: The third part of the chapter provides related work in the field of Web Information Extraction (WIE), a research area this thesis also belongs to. In this context, the work is characterized (1) by a focus on e-commerce data, (2) by trying to augment the existing structured data on the Web as a main goal, and, as a novel approach, (3) by using existing structured data on the Web as a learning set. To the best of our knowledge, this specific approach has not been exploited in related research in the Web Information Extraction field.
This section has seven main parts. We start with (1) a discussion of relevant dimensions when classifying WIE approaches, and go on with (2) classic and (3) recent WIE approaches. We then introduce (4) e-commerce specific, and (5) ontology-based WIE approaches. We discuss (6) WIE approaches that combine those two subfields, and therefore constitute the best match to our work. We complement the section with (7) an introduction to Web mining, a related research area.
In that way, we strive to integrate two important areas in Artificial Intelligence research (Semantic Web and WIE), and apply the results to a practically highly relevant domain. Fig. 2.1 shows an overview of the relationship of these three areas.
2.1 Semi-Automated Structured Data Generation on the Semantic Web
In this section, we will mainly discuss the Semantic Web vision, which is, at its core, an extension to the existing Web that provides machine-readable, structured data.
Figure 2.1: Strains of relevant related work (figure labels: Artificial Intelligence research fields: Semantic Web (2.1), manually generated structured data, and Web Information Extraction (2.3), automatically generated structured data; domain: (Semantic) E-Commerce (2.2); contribution)
2.1.1 The Web
The Web is certainly among the most influential and dynamically evolving technologies of the late 20th century (e.g. [Hal11]). In less than 20 years since its introduction, it has influenced society from economic to social dimensions. In the following subsection, we choose those dimensions as examples for the many changes initiated by the Web. The following scenarios are non-exhaustive and aim to provide an introduction to the breadth of the influences.
2.1.1.1 Economical Dimensions
Companies that are mainly Web-driven like Google, Apple or Microsoft rank among the highest-valued enterprises in the US¹. We provide an additional overview of the market capitalization of Internet-driven companies as of April 2013² in Fig. 2.2.
¹ Market capitalization of US firms according to Google stock screener, https://www.google.com/finance#stockscreener, as of 07/07/2013.
² Following http://www.statista.com/statistics/209331/largest-us-internet-companies-by-market-cap/
Figure 2.2: Market capitalization of Internet companies (USA), April 2013
As we will elaborate in the Semantic Web-based E-Commerce section of this chapter, electronic commerce has become an important part of modern multi-channel marketing. It has seen tremendous growth in recent years. For instance, in Germany in 2013, e-commerce generated a turnover of nearly 50 billion euros, representing almost 11.2 % of all retail [Bun13].
Before the advent of the Web, procurement of highly specific³ goods was a complex process that required large amounts of human action. For instance, if an enterprise manufacturing extension cards for PCs had to procure a slot bracket in the late 1980s, an extensive process of finding low-priced manufacturers in, e.g., China would have been necessary. Today, there exist many platforms like Alibaba⁴ or Globalsources⁵ that allow efficient procurement of specific goods from the manufacturing country, and even general e-commerce companies like Ebay⁶ or Amazon⁷ now provide access to these types of goods.
Another economic outcome of the Web is the usage of crowds to solve tasks, a technique called crowdsourcing, or more specifically human computation, which describes methods and technologies that use human agents to solve batches of small problems that are hard to tackle for algorithms [QB11]. In market research, human computation platforms can be used to easily analyze customer preferences. Presenting two different product package designs to a large number of customers formerly required a serious amount of resources; such designs can now be evaluated in minutes. Crowdfunding, as another example, presents projects to a large number of small-scale investors on the Web [BLS14]. Currently, the platform Kickstarter⁸ is dominating the market, and has, for instance, produced nearly 50 fundings above one million dollars and attracted roughly five million small-scale investors⁹.
³ Introduced in Chapter 1 and Section 2.2.
⁴ http://www.alibaba.com
⁵ http://www.globalsources.com/
⁶ http://www.ebay.com
⁷ http://www.amazon.com
2.1.1.2 Social Dimensions
By its fundamental design, the Web allowed everyone who is capable of writing HTML¹⁰ and has access to a server to publish on the Web. In comparison to media that dominated before, like print, television or radio, that alone was a paradigm shift. It made it (1) relatively easy to publish for a world-wide audience. As (2) there was no controlling institution, freedom of speech could be realized to a large degree.
While these social properties have been part of the Web from the early days, we are now seeing the massive growth of social networks. Social networks originally gained their power from a further facilitation of Web content creation, or provided crowd-intelligence based tagging functionality to classify resources [HG06]. Early examples are flickr¹¹, an online photography community, and delicio.us¹², a service that allows users to publicly share and manage bookmarks.
In the last few years, the social networks Facebook and Twitter have seen massive growth and gained high importance in the Web economy. As of July 2013, Facebook reported 1.155 billion active users a month [CZ13], and Twitter is processing more than 400 million tweets a day [Wic13].
⁸ http://www.kickstarter.com
⁹ http://www.kickstarter.com/help/stats?ref=footer, accessed 10/23/2013.
¹⁰ HTML is the markup language for Web pages [RHJ99; Hic11].
¹¹ http://www.flickr.com/
¹² http://www.delicious.com/
Facebook offers a very broad range of services, from establishing connections to friends and acquaintances, through online chatting, to event organization and online gaming.
Twitter, on the other hand, executed a lean platform business model (cf. [Che07]). At its core, it only provides a platform to publish short messages¹³ in a micro-blog fashion.
These major social networks have shown significant beneficial outcomes. For example, Twitter has been successfully used to report emergencies [HP09]. Facebook has attracted significant attention in political science by acting as an organization platform for the opposition in the Arab Spring [How+11].
While Facebook has become an application so popular that it may be perceived by some as a replacement for the Web as a whole, it is clearly just a part of it. It is important to stress that, because Facebook (1) promotes proprietary standards, for instance in terms of structured data¹⁴, and (2) walls in the content that has been generated by users, its impact on Web culture has to be assessed critically (e.g. [Yeu+09]). The same holds true for Twitter, as it operates on a proprietary standard, and also walls in user content, making it hard to extract. These strategies fundamentally collide with the principles of a Web built on open standards, which we will discuss in the section below.
2.1.1.3 Design Principles of the Web
From the perspective of our work, the following three principles stand out:
• Documents can reside on servers all over the world.
This principle first covers decentralization [BHL01; Ber02]. On the Web, there is no central point of failure. The functionality of the Web is not harmed if some servers go down. Second, as basically everyone can put a server online, there is no central control of the content available on the Web. While federal legislation applies, this fundamentally allows for freedom of speech on the Web.
¹³ Twitter adhered to the limitation of 140 characters per tweet to be compatible with SMS.
¹⁴ https://developers.facebook.com/docs/opengraph/
• Documents can link to other documents.
This principle allows authors of Web documents to refer to other Web resources [Ber+04]. At its core, this principle resembles traditional citations contained in print documents. Meanwhile, a very powerful side-effect is that links between Web documents span a graph that can be analyzed automatically. This was the initial idea of Google’s PageRank algorithm for automatically rating the relevance of a page on the Web. This algorithm emphasizes documents that are linked by many other documents as important in Web search (cf. [RU12, p. 3], [HG08]). Therefore, the ability to link documents yields an efficient by-product that allows determining their importance.
• If a user clicks on a link, the linked document is automatically fetched from the server and presented to the user.
This principle covers the user perspective of the Web and integrates the two principles mentioned above. Seemingly trivial, as we are now accustomed to Web browsing, its initial proposal was revolutionary. For the first time, it allowed users to surf arbitrarily linked documents residing on servers all over the world, without even noticing it (cf. [ML07, p. 3]).
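The link-graph idea mentioned under the second principle can be sketched as follows. This is a simplified power-iteration variant of PageRank for illustration only; the function name, the damping factor of 0.85, the number of iterations, and the example link graph are assumptions and do not reflect any production search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share, independent of links
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly
                share = damping * rank[page] / len(pages)
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three documents link to c.html
links = {
    "a.html": ["c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

In this example, the document that is linked by most other documents receives the highest score, and a document linked by a highly ranked page scores higher than documents without incoming links.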
2.1.1.4 Fundamental Problems of the Web
While the Web has been an ingenious invention that gained massive adoption right from the start, it soon became obvious that it included fundamental limitations. As we discussed above, the Web is essentially a distributed, electronic mapping of the document-centric knowledge representation approach that had already existed for centuries. To a large extent, the Web contains unstructured textual data. While this is easy to consume for human agents, it leads to severe limitations for automated data processing (e.g. [AH04, pp. 1-2], [Lac05, p. 4], cf. [Jür12; Cha+06]).
While, for instance, it is relatively easy for a human agent to extract the important persons and locations in a newspaper article, this is hard to solve algorithmically. Generally, a structured form is essential for automated data processing. Meanwhile, Web documents are mostly generated by content management systems that intermingle structured database contents with layout directives [CS08; ZL07; FGS12; Gul+10]. This means in essence that the data originally available in a structured form becomes hard to extract. This finding is especially important, as it underlies the building block “Patterns in product pages generated by e-commerce systems”, which is based on the second foundational contribution of the thesis.
Gibson, Punera, and Tomkins [GPT05] state that 40 to 50 % of Web pages are generated in the discussed way with the help of templates. Due to the more repetitive nature of Web pages in the e-commerce domain, we expect the amount of pages generated with templates here to be even higher.
The lack of structured data on the legacy Web¹⁵ has the following adverse consequences:
Web search engines mainly operate on string search: General search engines are the main entry point into the Web [Saf13]. They operate on Web documents, which usually contain data covered in textual representations. Therefore, the engines are mostly based on string search, performing matches against the content of the document. From a semantic perspective, this is not very effective, as a search engine initially cannot distinguish, for instance, between the car brand “Jaguar” and the animal. This leads to a suboptimal user experience, as there might be a need to set the right context (cf. [Hit+08, p. 10], cf. [AH04, pp. 1-2]). In recent years, search engines have made significant progress towards understanding the actual meaning of documents, parts of documents, and entities referred to in the documents, see e.g. Section 2.1.2.4. Despite such advancements, however, Web search is still heavily influenced by the match between terminology in the query and the use of matching words in textual Web content.
¹⁵ The following paragraphs reflect the state before the establishment of the Semantic Web vision.
Information integration on the Web is difficult: As data might be spread across different Web sites, and again covered under layout directives, it is hard to integrate it. For instance, consider a digital camera buying decision. There are many product features on the manufacturer’s homepage. Additionally, the buyer has to query for prices. To make a well-founded decision, he or she would need to compile a spreadsheet with the different properties of the cameras, and subsequently weigh them according to personal preferences. Then, the buyer would need to consult a price-comparison site to find a merchant who matches his or her needs. With the current state of affairs, this is a highly extensive task, induced by the fundamental design principles of the Web (cf. [Hit+08], [Lac05, p. 5]). The core of the problem is that for processing information from the Web, computers are limited to automating the rendering of the published data and cannot support the human user in the process of interpreting and combining it.
In an information age that is mainly driven by growing automation, textual documents become a legacy form of knowledge representation. Algorithmic automation operates on data. Non-trivial algorithmic processing of information currently depends on structured data, i.e. granular data with unambiguous semantics. The Web in its still predominant stage mostly lacks such structured data, which complicates the further automation of information processing from Web pages. To date, the Web is a human-processable representation of structured database content [CS08]. In this context, the majority of Web data is so poorly structured that only humans are able to interpret it. At the same time, the sheer size of the aggregate data is so vast that only machines are suited to operate on it [SHB06]. In summary, the wealth of information available on the Web is not matched by efficient means for processing it (cf. [Fur+11]). Besides the inherent conceptual and syntactical heterogeneity of the underlying data, the root cause of this problem is that data structure and data semantics from the underlying databases of dynamic Web sites are stripped off in the process of publication on the Web.
2.1.2 Semantic Web
In this subsection, we will first introduce the vision of the Semantic Web, and then present the Semantic Web technology stack related to our work. The stack consists of the following components:
• Uniform Resource Identifiers: URIs [BFM05]
• Extensible Markup Language: XML [Bra+08]
• Resource Description Framework: RDF [CK04]
• Ontology languages, namely RDFS and OWL, the Web Ontology Language [BG14; Bec+04]
• SPARQL Query Language and Interface for RDF [PS08]
2.1.2.1 Vision
There are two fundamental approaches to the aforementioned problems of the Web (cf. [Hit+08, p. 11]).
1. The first approach is the Semantic Web vision, which was proposed in a 2001 article by Berners-Lee, Hendler, and Lassila [BHL01] as an extension to the Web.
At its core, it promotes the following two key ideas (e.g. [AH04; Tre+08; SCV07]):
• To enhance existing Web pages with machine-readable structured data.
• To express the data in a way that adheres to a commonly shared meaning.
This would ultimately create a database that contains all the information available on the Web (e.g. [Ber09]). Ideally, the machine-readable data should represent the important facts contained in a Web site in a granular way, so that each fact could be integrated elsewhere. For instance, having all biographies of classical composers available on the Semantic Web would make it straightforward to calculate their mean age. Since 1999, the Semantic Web idea has matured into a significant field of research and provides a solid technology stack ready to implement the vision, which will be discussed below.
2. The second approach aims at designing systems that are capable of automatically extracting data from the documents on the Web (cf. [Hit+08, p. 11]). We will elaborate on this approach extensively in Section 2.3 of this chapter. In essence, this approach aims at making computer processing more powerful so that it could process unstructured text as well as structured data (cf. [Hit+08, p. 11]).
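The composer example given for the first approach can be illustrated with a minimal sketch. It assumes that the relevant facts are already available in granular, machine-readable form; the triple layout and the property names birthYear and deathYear are hypothetical and chosen for illustration only. Ages are approximated as the difference between the year of death and the year of birth.

```python
from statistics import mean

# Hypothetical granular facts as (subject, property, value) triples.
# The property names are illustrative and not taken from any vocabulary.
facts = [
    ("Bach", "birthYear", 1685),
    ("Bach", "deathYear", 1750),
    ("Mozart", "birthYear", 1756),
    ("Mozart", "deathYear", 1791),
    ("Beethoven", "birthYear", 1770),
    ("Beethoven", "deathYear", 1827),
]


def lifespans(triples):
    """Group facts by subject and approximate each lifespan in years."""
    by_subject = {}
    for subject, prop, value in triples:
        by_subject.setdefault(subject, {})[prop] = value
    return {
        subject: props["deathYear"] - props["birthYear"]
        for subject, props in by_subject.items()
    }


spans = lifespans(facts)
mean_age = mean(list(spans.values()))
print(spans, round(mean_age, 1))
```

With facts in this form, the aggregation is a few lines of code. Obtaining the same figures from free-text biographies would first require extracting the dates, which is what the second approach addresses.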
Additionally, extending the heuristic known as Metcalfe’s Law [HG08], establishing the Semantic Web would lead to an explosion of the network value of the Web. Instead of documents, singular facts can be interconnected (cf. [HG08]). This is especially important, as Web platforms that exhibit strong network effects, for instance social networks, have proven to be of high value to their users16.
In this regard, a main aim of the Semantic Web is to liberate data from the application, or document, context [Rod09]. A currently very common pattern is that of Web companies gathering massive amounts of user data, as we introduced in the “Social dimensions of the Web” section above. This generates a Web made of walled gardens, in which each application locks in the generated data and exploits it for its own benefit. The Semantic Web aims at marking up the data in a way that lets any user integrate data from different applications.
At this point, we would like to emphasize that the approach of this thesis uses Web Information Extraction, which we will discuss below in 2.3, to generate Semantic Web data. Therefore, it aims at combining the two fundamental approaches to make available structured data representing the information on the Web.
16 Facebook.com currently ranks number 2 among the most popular websites according to [Ama13].
Query Language & Interface: SPARQL
Ontology Languages: RDFS & OWL
Data Model: RDF
Global Identifiers: URIs
Figure 2.3: Reduced Semantic Web technology stack relevant to this work, own representation based on [Ber00]
2.1.2.2 Semantic Web Technology Stack
Since its initial incubation by the World Wide Web Consortium, the Semantic Web community has released a substantial set of standards and technologies. In the following subsection, we discuss those that have the highest impact on our research.
There exist many versions of the Semantic Web technology stack (e.g. [Ber00; Sig05; Bra07]). We have decided to exclude technologies that are less important to our work, like rules or trust, resulting in a reduced Semantic Web technology stack that is shown in Fig. 2.3.
URIs: Uniform Resource Identifiers
URI Syntax: URIs are the most fundamental building block of the ordinary17 Web and the Semantic Web. The following description of the syntax of URIs has been excerpted from RFC 3986 [BFM05], which is the official document defining the URI standard. To describe the syntax of the distinct parts of URIs, RFC 3986 uses the Augmented Backus–Naur Form (ABNF), which itself is defined in RFC 5234 [CO08]. Originally, URIs were designed as identifiers for Web resources. They are subject to the syntax provided in Fig. 2.4.
The hier-part consists of authority and path; an example is provided in Fig. 2.5.
17 We use the term ordinary here to stress the difference between the original Web and a Semantic Web.
scheme ":" hier-part [ "?" query ] [ "#" fragment ]
Figure 2.4: URI scheme, Berners-Lee, Fielding, and Masinter [BFM05]
http://www.semantium.de:5984/research/machine-learning/extractor?target=price#currency
scheme: http
authority: www.semantium.de:5984
path: /research/machine-learning/extractor
query: target=price
fragment: currency
Figure 2.5: URI scheme - example
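As a brief illustration of the syntax in Fig. 2.4, the example URI from Fig. 2.5 can be decomposed programmatically. The following is a minimal sketch that uses urlsplit from Python's standard library; note that this library calls the authority component netloc. The dictionary name components is our own choice.

```python
from urllib.parse import urlsplit

# Example URI from Fig. 2.5.
uri = (
    "http://www.semantium.de:5984"
    "/research/machine-learning/extractor?target=price#currency"
)

parts = urlsplit(uri)

# Map the library's attribute names to the component names of RFC 3986.
components = {
    "scheme": parts.scheme,
    "authority": parts.netloc,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:10} {value}")
```

The authority can be split further into host and port, which the same result object exposes as parts.hostname and parts.port.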
URI Use in the Semantic Web: In the previous paragraph, we introduced URIs as identifiers of Web resources. The Semantic Web extends the usage of URIs to basically any thing, for instance persons, cities, universities, or abstract concepts like colors [Rod09]. Therefore, on the Semantic Web, URIs can reference both information resources (Web documents) that describe things, and the things themselves [SCV07]. Consequently, it is important to distinguish whether a reference targets an information resource or a thing. A common way to reference real-world or abstract things on the Semantic Web is to append a fragment to a given URI. In Listing 2.1, we provide an example that references the Universität der Bundeswehr München itself (the institution) and its location. A corresponding information resource would be the homepage available at “http://www.unibw.de”.
http://www.unibw.de/about#university
http://www.unibw.de/about#location
Listing 2.1: Fragment identifiers
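The relation between the two URIs in Listing 2.1 and the document that describes them can be sketched as follows. Under RFC 3986, the fragment is not part of the request sent to the server, so both identifiers lead to the same Web document while still naming two distinct things. The helper name split_thing_uri is hypothetical; urldefrag is part of Python's standard library.

```python
from urllib.parse import urldefrag

# The two fragment identifiers from Listing 2.1.
thing_uris = [
    "http://www.unibw.de/about#university",
    "http://www.unibw.de/about#location",
]


def split_thing_uri(uri):
    """Return (document URI, fragment) for a fragment-based thing URI."""
    document, fragment = urldefrag(uri)
    return document, fragment


resolved = [split_thing_uri(uri) for uri in thing_uris]

for document, fragment in resolved:
    print(f"thing '{fragment}' is described in {document}")
```

Both entries share the same document URI and differ only in the fragment, which is what allows a single information resource to describe several things.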
XML: Extensible Markup Language
Fundamentals: We did not include XML as a layer in our technology stack, as it does not play a vital role in the Semantic Web context. Nevertheless, we introduce it briefly, as our research makes use of this technology. XML is a metalanguage that allows data to be interchanged in a structured way through the definition of domain-specific grammars [Lac05, p. 62]. It is used to represent tree-based structures. The World Wide Web Consortium has been heavily involved in the definition of XML. A main advantage is that, with XML established as a syntax for arbitrary grammars, a software implementation of XML, e.g. a library in a certain programming language, ensures compatibility with many use cases. A disadvantage of XML is that it only allows the structure of a document to be defined, while providing no means to define the content [SHB06].
Features: XML uses tags to delimit elements, attributes, and content. Tags are set in angle brackets. Attributes, which are placed inside the tag (e.g. src=“uwe.jpg”), provide metadata about a tag [Bra+08], [Lac05, p. 61 ff.]. Being normally used with start-tags (e.g.