Kurt Uwe Stoll

Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce

Doctoral Thesis
Fakultät für Wirtschafts- und Organisationswissenschaften

Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce

Kurt Uwe Stoll

Univ.-Prof. Dr. Hans A. Wüthrich
Univ.-Prof. Dr. Martin Hepp

Univ.-Prof. Dr. Claudius Steinhardt
Univ.-Prof. Dr. Stephan Kaiser
Univ.-Prof. Dr. Karl Morasch

12.7.2016

Dr. rerum politicarum

(Dr. rer. pol.)

1. November 2016

Doctoral Thesis

Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce

Author: Kurt Uwe Stoll
Supervisor: Prof. Dr. Martin Hepp

A thesis submitted in partial fulfillment of the requirements for the degree of Dr. rer. pol.

at the

UNIVERSITÄT DER BUNDESWEHR MÜNCHEN

November 1, 2016

“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”

“But it was the Great Question! The Ultimate Question of Life, the Universe and Everything,” howled Loonquawl. “Yes,” said Deep Thought with the air of one who suffers fools gladly, “but what actually is it?”

A slow stupefied silence crept over the men as they stared at the computer and then at each other.

“Well, you know, it’s just Everything ... Everything ...” offered Phouchg weakly.

“Exactly!” said Deep Thought. “So once you know what the question actually is, you’ll know what the answer means.”

Douglas Adams - The Hitchhiker’s Guide to the Galaxy

Abstract

Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce

by Kurt Uwe Stoll

In recent years, e-commerce has grown massively and evolved into a main driver of technological innovation on the Web. The Semantic Web is a vision to advance the technological foundation of the Web so that computers are empowered to better extract and process information from Web content [AH04, p. 1f.]. A core principle of the Semantic Web is to augment Web markup by structured data suited for machine processing, instead of markup just suitable for rendering the information for human consumption [AH04, p. 1f.]. The application of the Semantic Web to e-commerce shows significant potential, in particular for the efficiency and precision of search, for improving data quality, and for raising market efficiency.

Despite a significant increase in adoption, the percentage of Web sites that provide data markup for e-commerce information is still limited and will likely remain limited for many years to come. Predominantly, the data is generated with shop extension modules, covering only a small fraction of the Web. At the same time, automatic methods for Web Information Extraction are still not able to reconstruct the full amount of structured data behind Web content.

In order to address this issue, we propose a novel method for Web Information Extraction, targeted to the e-commerce domain. The approach exploits (1) the market dominance of a small amount of e-commerce systems, (2) the patterns those systems expose in Web page generation, and (3) the existing structured data in e-commerce.

We evaluate our findings by splitting our dataset into a learning set and an evaluation set. Our results show that the approach is feasible for extracting structured data from e-commerce sites that do not include data markup, solely on the basis of template similarity and existing markup as training data.

The fundamental idea is to combine similarities in Web page templates, caused by the popularity of off-the-shelf shop software, with the use of data markup found in a subset of Web pages as training data for machine learning.

Kurzzusammenfassung

Existierende strukturierte Daten als Lernset für Webinformationsextraktion im Bereich E-Commerce

von Kurt Uwe Stoll

Der Wirtschaftsbereich E-Commerce ist in den letzten Jahren stark gewachsen und hat sich dabei als Triebfeder technischer Innovation im Web etabliert. Das semantische Web ist eine Vision, die technologischen Grundlagen des Webs so zu verbessern, dass Computer leichter Informationen aus Webinhalten extrahieren und verarbeiten können [AH04, p. 1f.]. Hierbei ist das Kernprinzip, Webseitencode, welcher ursprünglich für die Darstellung für Menschen entworfen wurde, mit strukturierten Daten anzureichern, welche maschinenlesbar sind [AH04, p. 1f.]. Im Zusammenhang mit E-Commerce birgt die Anwendung von Semantic-Web-Technologien bedeutende Potentiale, insbesondere Effizienz und Suchgenauigkeit, Verbesserung von Datenqualität und Verbesserung von Markteffizienz.

Trotz einer bedeutenden Zunahme in der Verwendung dieser Technologien ist der Anteil von Websites, die strukturierte Daten verwenden, nach wie vor begrenzt und wird dies aller Voraussicht nach in den nächsten Jahren bleiben. Die Daten werden vornehmlich durch Shop Extensions erzeugt. Gleichzeitig sind automatisierte Methoden aus dem Bereich Webinformationsextraktion noch nicht in der Lage, die Gesamtheit der in Webseiten enthaltenen Informationen als strukturierte Daten abzubilden.

Um dieses Problem zu lösen, wird eine neue Methode für Webinformationsextraktion für E-Commerce vorgeschlagen. Sie nutzt die marktbeherrschende Stellung weniger E-Commerce-Systeme, die Muster, welche die Systeme bei der Webseitengenerierung erzeugen, und die bestehenden strukturierten Daten aus dem semantischen E-Commerce.

Die Ergebnisse werden evaluiert, indem die zur Verfügung stehenden Daten in Trainingsdaten und Testdaten aufgeteilt werden. Unsere Ergebnisse zeigen, dass der Ansatz lediglich durch die Verwendung von Ähnlichkeiten in Templates und existierendem Markup zusätzliche strukturierte Daten erzeugen kann. Die grundlegende Idee besteht in der Kombination von Ähnlichkeiten in Webseitentemplates, welche durch die Popularität von Standard Shopsoftware entsteht, mit der Verwendung von strukturiertem Markup als Trainingsdaten für Machine Learning.

Acknowledgements

First of all, I would like to sincerely thank my supervisor, Prof. Dr. Martin Hepp, for his guidance, support and encouragement. Without his supervision and trust in my ideas, this thesis would have never existed. Working with him was a highly inspiring experience. Additionally, I want to thank Prof. Dr. Claudius Steinhardt for taking over the role of co-supervisor.

I want to thank my colleagues Dr. Mouzhi Ge, Andreas Radinger, Dr. Bene Rodriguez, Alex Stolz and Laszlo Török, for the inspiring discussions, and productive atmosphere at work. I owe progress in many critical points of this way to you. Many thanks also go to all my dear friends, without whom life would have never been so colorful.

Most of all, I want to thank my wife Nadine. You are the best thing that has ever happened to me. Without your love, I would have never come so far. Especially, I want to thank my family. In the rest of my life, I can never pay back the love and care I owe to my mother.

Finally, I want to thank Christopher David Ryan for the friendly provision of the title page graphic.

Last but not least, I would like to thank the Universität der Bundeswehr München, which funded this research for a significant period and provided a highly creative atmosphere.

Contents

Abstract iii

Kurzzusammenfassung iv

Acknowledgements v

List of Figures xi

List of Tables xiii

Listings xv

Abbreviations xvi

1 Introduction 1
1.1 Problem Statement and Hypothesis 1
1.2 Relevance 3
1.2.1 Potential of the Semantic Web for E-Commerce 3
1.2.2 Existing Semantic E-Commerce Data and Limitations 7
1.3 Contributions 9
1.4 Research Questions 13
1.5 Experimental Design 14
1.6 Organization of the Thesis 14
1.7 Previously Published Work 15

2 Structured Data: Fundamentals and Usage for E-Commerce 17
2.1 Semi-Automated Structured Data Generation on the Semantic Web 18
2.1.1 The Web 19
2.1.1.1 Economical Dimensions 19
2.1.1.2 Social Dimensions 21
2.1.1.3 Design Principles of the Web 22
2.1.1.4 Fundamental Problems of the Web 23
2.1.2 Semantic Web 26
2.1.2.1 Vision 26
2.1.2.2 Semantic Web Technology Stack 28
2.1.2.3 Linked Data 42


2.1.2.4 Schema.org, Google Semantic Web Tools and Google Knowledge Graph 43
2.1.3 Conclusion 45
2.2 Semantic E-Commerce 45
2.2.1 Technological Foundations of E-Commerce 45
2.2.2 The GoodRelations Web Ontology for E-Commerce 47
2.2.2.1 Goals and Design Principles 47
2.2.2.2 Data Model 48
2.2.2.3 Features, Documentation, and Ecosystem 50
2.2.2.4 Existing GoodRelations Data on the Web 55
2.2.3 Existing Research in Semantic E-Commerce 56
2.2.4 Real-World Usage of Structured E-Commerce Data 57
2.2.5 Economical Implications of Semantic E-Commerce 59
2.2.6 Conclusion 60
2.3 Automated Generation of Structured Data with Web Information Extraction 61
2.3.1 Research Strains in Web Information Extraction and Relation to Semantic Web Research 63
2.3.2 Classical Web Information Extraction Approaches 64
2.3.3 Recent Approaches to Web Information Extraction 65
2.3.4 Web Information Extraction Targeting the E-Commerce Domain 67
2.3.5 Ontology-Based Web Information Extraction 67
2.3.6 Semantic Web Information Extraction Approaches Targeting E-Commerce 69
2.3.7 Novelty of Our Approach 70
2.3.8 Related Field: Web Mining 70
2.4 Big Data and Validity of the Contribution 72

3 Foundational Building Blocks 74
3.1 Impact of E-Commerce Systems on the Availability of Structured Data in E-Commerce 75
3.1.1 Related Work 75
3.1.1.1 Market Studies 75
3.1.1.2 Functional Comparisons 76
3.1.2 Understanding the Impact of E-Commerce Software on the Adoption of Structured Data on the Web 76
3.1.3 Implementation 79
3.1.3.1 Obtaining a List of Relevant Site URIs 79
3.1.3.2 Counting Product Pages Based on XML Sitemaps 80
3.1.4 Results 81
3.1.4.1 Summary 82
3.1.4.2 Impact of E-Commerce Software on the Adoption of Structured Data 84
3.1.4.3 Site Popularity 84

3.1.5 Evaluation 85
3.1.6 Discussion and Limitations 86
3.1.7 Conclusion 87
3.2 E-Commerce System Identification Based on Sparse Features 87
3.2.1 Related Work 87
3.2.1.1 Web Page Classification 88
3.2.1.2 Supervised Classification 89
3.2.2 Methodology, Approach, and Implementation 91
3.2.2.1 Overview 91
3.2.2.2 Design Rationales 92
3.2.2.3 Generating Datasets and Preprocessing 92
3.2.2.4 Building a Classifier 94
3.2.2.5 Implementation 95
3.2.3 Results 96
3.2.3.1 Feature Set and Algorithm Performance 96
3.2.3.2 Speed 97
3.2.3.3 Performance on Different Clusters 98
3.2.3.4 Consolidated Algorithm Review 98
3.2.4 Evaluation 99
3.2.4.1 Evaluation on GR-Notify Dataset 99
3.2.4.2 Evaluation on Targeted ECS Reference Shops 100
3.2.4.3 Evaluation on Non-Targeted ECS Reference Shops 102
3.2.4.4 Evaluation on Non-Shop Sites 102
3.2.5 Limitations 103
3.2.6 Conclusion 104
3.3 Structured E-Commerce Data on the Web 105
3.3.1 Related Work 106
3.3.2 GR-Notify as a Registry for GoodRelations-enabled Shops 107
3.3.2.1 Approach 108
3.3.2.2 Implementation 108
3.3.3 Analysis of GR-Notify Data 109
3.3.3.1 Approach 109
3.3.3.2 Implementation 110
3.3.3.3 Results 110
3.3.4 Generating a Sample of GoodRelations Data on the Web 115
3.3.4.1 Approach 115
3.3.4.2 Implementation 116
3.3.5 Analysis of the Sample 117
3.3.5.1 Implementation 117
3.3.5.2 Results 118
3.3.6 Evaluation 133
3.3.7 Limitations 134
3.3.8 Conclusion 135

4 Structured Data for Web Information Extraction in E-Commerce 136
4.1 Approach 136
4.1.1 Fundamentals 137
4.1.1.1 Web Information Extraction in Comparison to Shop Extensions 137
4.1.1.2 Focussing on the Promise Part of GoodRelations’ APO Principle 138
4.1.2 Properties in Regard 139
4.1.2.1 Properties Used in the Approach 139
4.1.2.2 Additional Properties Regarded in the Use Case 141
4.1.2.3 Excluded Properties 142
4.1.3 Experimental Design 144
4.1.3.1 Evaluation 144
4.1.3.2 High-level Pseudocode Overview 145
4.1.4 Conclusion 147
4.2 Implementation 147
4.2.1 Python as Main Programming Language 148
4.2.2 Dataset Generation 150
4.2.3 Extraction of Provided Data from Offering Pages 152
4.2.4 Quality of the Extracted Data 156
4.2.5 Generation of Extraction Rules 157
4.2.6 Evaluation 163
4.2.7 Conclusion 166
4.3 Results 166
4.3.1 Dataset Generation 167
4.3.2 Extraction of Data from Offering Pages 167
4.3.3 Rule Generation 168
4.3.4 Conclusion 172
4.4 Evaluation 172
4.4.1 Standard Settings 174
4.4.2 Modified Evaluation 175
4.4.3 Modified Sample 178
4.4.4 Modified Rule Generation 182
4.4.5 Additional Dataset: Manually Labeled, n=20 per ECS 183
4.4.6 Conclusion 184
4.5 Use Case: Real-time E-Commerce Web Information Extraction System 186
4.5.1 Design 186
4.5.2 Implementation of the Extraction System 187
4.5.3 Implementation of the Frontend 188
4.5.4 Output of a Typical Run of the WIE System 189
4.5.5 Frontend Overview 189
4.5.6 Conclusion 190

5 Conclusion and Outlook 192
5.1 Contributions 192
5.1.1 Structured Data: Fundamentals and Usage in the E-Commerce Domain 192
5.1.2 Foundational Building Blocks 194
5.1.3 Structured Data for WIE in E-Commerce 195
5.2 Limitations and Future Work 197
5.2.1 Dataset 197
5.2.2 Approach 198
5.2.3 Evaluation 200
5.2.4 Use Cases 200
5.2.5 Scale 201
5.2.6 Scope 201
5.3 Outlook: On the Self-Replicating Nature of Structured Data 202

Bibliography 204

List of Figures

1.1 Approach 3
1.2 Search engine bottleneck, referring to [Hepa] 4
1.3 Interplay of foundational and main contributions 12
1.4 Web shop with exemplary extraction targets 12
1.5 Extraction rule generator approach 13

2.1 Strains of relevant related work 19
2.2 Market capitalization of Internet companies (USA), April 2013 20
2.3 Reduced Semantic Web technology stack relevant to this work, own representation based on [Ber00] 28
2.4 URI scheme, Berners-Lee, Fielding, and Masinter [BFM05] 29
2.5 URI scheme - example 29
2.6 Graph of the RDF example 32
2.7 Six effects of ontologies, based on [Hep08b] 37
2.8 Most important conceptual elements of the GoodRelations ontology 49

3.1 Research foundation 74
3.2 Effect of enabling structured data for an e-commerce system on product pages 77
3.3 Distribution of the number of product pages per shop software package 83
3.4 Supervised Machine Learning: General approach, based on [Kot07] 90
3.5 Overview of experimental design 95
3.6 Heat map of F1-all score for 18 feature / algorithm combinations 97
3.7 Heat map: time elapsed for 18 feature / algorithm combinations 98
3.8 GR-Notify - ping frequency 111
3.9 GR-Notify - top level domains 111
3.10 GR-Notify - submitting ECS 112
3.11 GR-Notify - submitting ECS pie chart 113
3.12 GR-Notify - submissions over time 113
3.13 GR-Notify - frequency world heat-map 114
3.14 Learning set generator overview 116
3.15 Implementation pipeline - sample analysis 118
3.16 Length analysis - name, unit: characters 122
3.17 Length analysis - description, unit: characters 122
3.18 Count analysis - eligibleRegions, unit: region codes 123
3.19 Count analysis - acceptedPaymentMethods 124
3.20 Count analysis - availableDeliveryMethods 124
3.21 Distribution of hasCurrency by ECS 125


3.22 Distribution of acceptedPaymentMethods by ECS 127
3.23 Distribution of availableDeliveryMethods by ECS 127
3.24 Distribution of valueAddedTaxIncluded by ECS 128
3.25 Distribution of validity statement duration by ECS 129
3.26 World map coloring 132
3.27 World map of the frequency of eligibleRegions - Magento 132
3.28 World map of the frequency of eligibleRegions - Oxid E-Commerce 132
3.29 World map of the frequency of eligibleRegions - Prestashop 133
3.30 World map of the frequency of eligibleRegions - Virtuemart 133

4.1 Extraction rule generator approach 148
4.2 Aggregated rule generation results 171
4.3 Final results - standard settings - precision 176
4.4 Impact of the stricter settings on the precision 177
4.5 Impact of the relaxed settings on the precision 178
4.6 Impact of a training set of 0.25 settings on the precision 179
4.7 Impact of a training set of 0.75 settings on the precision 180
4.8 Precision for early adopters 181
4.9 Precision for later adopters 181
4.10 Precision while omitting first class 182
4.11 Result differences - wild card - precision 183
4.12 Precision with a manually created dataset 185
4.13 Overview of the frontend functionality 191

5.1 Final results - standard settings - precision 196

List of Tables

2.1 Example XPaths 31
2.2 Overview of discussed work in Web Information Extraction 62

3.1 Consolidated list of search strings for the 56 e-commerce systems in regard 79
3.2 URIs found in e-commerce sitemaps from one million Alexa sites and product item estimate, results (absolute) 82
3.3 URIs found in sitemaps and product item estimate, results (relative) 83
3.4 Precision of the shop detection technique - Demandware - Prestashop 85
3.5 Precision of the shop detection technique - EC-SHOP - mean 86
3.6 Learning set instances by ECS 93
3.7 Remaining recall-base after white list filtering 94
3.8 F1-all-scores for 18 feature / algorithm combinations 97
3.9 Time elapsed (s) for 18 feature / algorithm combinations 97
3.10 Classification report of “class+id” / XTREE classifier on distinct ECS 98
3.11 Consolidated review of speed / performance of used algorithms 99
3.12 GR-Notify evaluation: Remaining recall after white-list application 99
3.13 GR-Notify evaluation: Classification report of “class+id” / XTREE classifier 100
3.14 Evaluation on targeted ECS reference shops - classification results 102
3.15 Evaluation on targeted ECS reference shops - precision, recall, F1-score 102
3.16 Evaluation on non-targeted ECS reference shops 103
3.17 GR-Notify - top level domains 111
3.18 Evaluation on non-targeted ECS reference shops 112
3.19 GR-Notify - frequency world - GEOIP analysis 115
3.20 Analyzed HTML pages and RDF offering graphs per ECS 119
3.21 GoodRelations properties attached per offer by ECS 121
3.22 Evaluation with crawl dataset, per offering 134

4.1 Comparison of shop extensions with our approach 138
4.2 Extraction targets 140
4.3 Overview of the extraction rules 162
4.4 HTML sample pages - all / training / evaluation from different ECS and sums 167
4.5 Ratio of valid data in the extracted raw data 168
4.6 Rule generation results, desc. property, rank 1 to 5, and score 170
4.7 Rule generation results, image property, rank 1 to 5, and score 171


4.8 Rule generation results, name property, rank 1 to 5, and score 172
4.9 Rule generation results, price property, rank 1 to 5, and score 173
4.10 Aggregated rule generation results - dataset 173
4.11 Final results - standard settings - precision 175
4.12 Strict evaluation 176
4.13 Impact of the stricter settings on the precision 176
4.14 More relaxed evaluation - settings 177
4.15 Impact of the relaxed settings on the precision 177
4.16 Impact of a training set of 0.25 settings on the precision 178
4.17 Impact of a training set of 0.75 settings on the precision 179
4.18 Precision for early adopters 180
4.19 Precision for later adopters 181
4.20 Precision while omitting first class 182
4.21 Result differences - wild card - precision 183
4.22 Precision with a manually created dataset 184
4.23 Precision, absolute with a manually created dataset 184
4.24 Output of a typical run of the WIE system, n=250 URIs 190

Listings

2.1 Fragment identifiers 29
2.2 Running example in Turtle notation 32
2.3 Running example in RDF/XML notation 34
2.4 Running example in RDFa notation 35
2.5 Running example in JSON-LD notation 35
2.6 Offering example in Turtle syntax 40
2.7 SPARQL query to find offerings 41

3.1 Parallelization with GNU parallel 80
3.2 Overview query 119
3.3 Property frequency analysis - Offering 119
3.4 Length analysis - name - description 122
3.5 Length analysis - eligibleRegions - acceptedPaymentMethods - availableDeliveryMethods 123
3.6 Multi-value analysis - hasCurrency 125
3.7 Multi-value analysis - acceptedPaymentMethods - availableDeliveryMethods 126
3.8 Multi-value analysis - valueAddedTaxIncluded 128
3.9 Multi-value analysis - validity statement 129
3.10 World heat map - eligibleRegions 130

4.1 Experimental design pseudocode overview 146
4.2 Dataset generation - source code 151
4.3 Extract provided data from offering pages - source code 154
4.4 Check extracted data quality - source code 156
4.5 Generate extraction rules - source code 157
4.6 Evaluation - source code 163

Abbreviations

API Application Programming Interface
APO Agent-Promise-Object
CMS Content Management System
CPU Central Processing Unit
CSS Cascading Style Sheets
DOM Document Object Model
ECS Electronic Commerce System
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
IP Internet Protocol
JSON JavaScript Object Notation
RAM Random Access Memory
RDF Resource Description Framework
REST Representational State Transfer
RQ Research Question
SEO Search Engine Optimization
SPARQL SPARQL Protocol and RDF Query Language
SW Semantic Web
TLD Top-Level Domain
UML Unified Modeling Language
URI Uniform Resource Identifier
WIE Web Information Extraction
XML Extensible Markup Language
XPath XML Path Language

1 Introduction

In the following sections, we (1) present the problem statement and hypothesis, (2) discuss the relevance, (3) highlight the contributions, (4) formulate research questions, (5) specify the experimental design, and (6) explain the organization of the thesis. We close with (7) a list of previously published work.

1.1 Problem Statement and Hypothesis

In recent years, e-commerce has grown massively and evolved into a main driver of technological innovation on the Web1. The Semantic Web is a vision to advance the technological foundation of the Web so that computers are empowered to better extract and process information from Web content [AH04, p. 1f.]. A core principle of the Semantic Web is to augment Web markup by structured data suited for machine processing, instead of markup just suitable for rendering the information for human consumption [AH04, p. 1f.]. The application of the Semantic Web to e-commerce shows significant potential, in particular for the efficiency and precision of search, for improving data quality, and for raising market efficiency.

Despite a significant increase in adoption, the percentage of Web sites that provide data markup for e-commerce information is still limited and will likely remain limited for many years to come. Predominantly, the data is generated with shop software extension modules, covering only a small fraction of the Web. At the same time, automatic methods for Web Information Extraction are still not able to reconstruct the full amount of structured data behind Web content.

1 In the course of the thesis, we use “Web” synonymously for “World Wide Web (WWW)”.

A structured data representation of offerings, fundamental for Semantic E-Commerce, is only available at a relatively low market coverage. Predominantly, the data is generated with shop extensions, covering only a small fraction of the Web. At the same time, automatic methods that exploit Web Information Extraction to generate structured data are largely unexplored. Moreover, the existing structured data constitutes a significant learning set to drive Web Information Extraction.

In order to address this issue, we propose a novel method for Web Information Extraction, targeted to the e-commerce domain. The approach is based on the following observations:

1. A large share of Web shops is implemented on the basis of standard software packages, and the total number of popular software solutions is small.

2. The resulting HTML code of the published Web pages shows a significant amount of similarity for the same underlying software solution, despite the frequently very different visual appearance caused by customization efforts.

3. There is a significant amount of e-commerce Web pages with structured data markup, but the absolute market coverage and adoption are still limited.

This leads to our research hypothesis:

The existing structured e-commerce data on the Web, in combination with the market structure of e-commerce systems, and the similarity in the HTML patterns they expose, can be used as a lever to generate additional e-commerce data in significant quantity and quality, and thus increase the market coverage in the available data.

We aim at using a small proportion of the e-commerce Web that is already equipped with structured data as a “blueprint” to generate extraction rules that can be applied to offering pages in the ordinary2 Web of e-commerce.

The approach we suggest is shown in Figure 1.1. Given a set of e-commerce pages that does contain structured data, we generate a set of extraction rules. If we apply

2 We use the term ordinary here to distinguish the Web in its current state from a Web that is equipped with structured data.


Figure 1.1: Approach

these rules to e-commerce pages not containing structured data, we can generate it for those.
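To make the second half of this loop concrete, the following is a minimal sketch, assuming the lxml library; the XPath rule and the page fragment are hypothetical stand-ins for rules actually produced by the generator described in Chapter 4.

from lxml import html

# A rule learned from pages that do carry structured data, e.g. "the offering
# name sits in an <h1> element with class 'product-name'" (hypothetical).
NAME_RULE = "//h1[@class='product-name']/text()"

# A hypothetical offering page without any data markup.
page = html.fromstring("""
<html><body>
  <h1 class="product-name">Green Tea, 100g</h1>
  <span class="price">4.90 EUR</span>
</body></html>
""")

# Applying the learned rule yields the structured value that explicit
# markup would otherwise have provided.
print(page.xpath(NAME_RULE))  # ['Green Tea, 100g']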

1.2 Relevance

In the following two sections, we describe the relevance of the work by showing the potential of Semantic Web approaches for e-commerce, and discuss existing Semantic E-Commerce data and limitations.

1.2.1 Potential of the Semantic Web for E-Commerce

The Web was originally designed for documents that are consumed by humans. This legacy poses significant limitations to automated data processing of the information published on the Web (e.g. [AH04, pp. 1-2], [Lac05, p. 4], cf. [Jür12; Cha+06]). Data on the Web is mostly unstructured, and usually integrated into Web pages that contain rendering directives. As extraction of data from Web pages into a well-structured form is a complex task, many popular Web applications, like search engines, operate mainly on the basis of indexing the textual content of documents on the Web. Thus, Web search is dominated by queries that run against massive corpora of textual Web documents, and does not easily allow sophisticated data-centric queries. For instance, it is difficult to search for biographies of German composers who were born in Munich, and died at the age of 80 or older. While this


Figure 1.2: Search engine bottleneck, referring to [Hepa].

information is surely available on the Web, with the current Web search technology it is not straightforwardly accessible, and requires extensive manual effort.

At the same time, e-commerce has matured into one of the central drivers of retail growth, representing, for instance, 11.2 % of all retail transactions of the German market in 2013 [Bun13]. As e-commerce is based on Web paradigms, it likewise suffers from the problem mentioned above. Web shops provide their offerings in the form of Web pages, complicating the extraction of initially well-structured data like product name, price, or image. Traditional search does not allow, for instance, querying for products that are manufactured in South Tirol, and sold in Munich. Again, this leads to extensive manual effort required to obtain such data.

Merchants are not able to articulate their value proposition in high fidelity, and customers are not able to search the market for highly specific goods. Detailed descriptions of the properties of companies (e.g. geo-position or contact information), of offerings (e.g. payment and delivery methods), and of products or services (product master data), are not available for elaborated queries, as they are hidden in text on Web pages.

This results in many buying decisions not based on the wealth of information theoretically available, subsequently leading to suboptimal choices [Hepa]. Often, consumers initiate online buying decisions with general search engines [Saf13]. Those search engines reduce offering pages to a minimal preview. This preview represents

only a fraction of the content of the original value proposition (cf. [KHL08]), limiting the merchant’s ability for granular signaling. At the same time, the consumers’ ability to screen the market for highly specific goods is limited, as the special features of a product are not accessible through the search engine (cf. [ES07]). By limiting the communication bandwidth between the market participants, they limit market efficiency. Fig. 1.2 visualizes the problem: The granular offerings on Web shops are boiled down to a minimal search engine preview (cf. [KHL08]) that may often not match the granularity of the customers’ demand.

In 2001, Tim Berners-Lee et al. proposed the Semantic Web as an extension of the existing Web [BHL01], consisting of two main additions [Tre+08]:

1. First, existing Web pages should be marked up with rich metadata. This metadata should describe the content of Web pages, stating, for instance, that a certain number is the price of an offering.

2. This metadata should be expressed on the basis of ontologies that define a consensual understanding of a specific domain (e.g. [UG96]).

If realized to a great extent, these enhancements would turn the whole Web into a giant database, well-suited for automated data processing (e.g. [Ber09]). This would mitigate the problem introduced above, namely that the Web in its current state requires human intelligence to act upon it. While the Semantic Web has matured significantly from a technological perspective, the initial vision has so far not been realized. One likely reason is that the semantic annotation of Web pages is tedious, and thus did not reach broad adoption (e.g. [SBH06]).

Since the major search engines Google, Yahoo, Bing and Yandex endorse the use of the schema.org vocabulary for granular data markup and, in turn, promise better performance in search, the adoption rate has grown to about 30 % [web14]. However, as of today, still only a fraction of Web information is available as structured data.

In e-commerce, realizing the Semantic Web vision would mitigate the aforementioned problems to a large extent. Most importantly, the information bottleneck

between market participants would diminish significantly. We lay out three exemplary use cases to further underline the potential of the Semantic Web vision in e-commerce, each bound to the perspective of a certain group of market participants.

Merchants: Already today, integrating structured data into Web shop pages has the effect of enhancing search engine results, which in turn are expected to raise sales [Edm14]. Structured data also allows third parties like affiliate portals to propagate offerings of a Web shop efficiently without establishing a proprietary interface. If e-commerce data on the Web was available in a structured form, competition could be analyzed automatically, for instance in terms of the quality of product descriptions. Additionally, ordinary data quality approaches operate on the data of a single market participant, and proprietary data sources. Having the data of other market participants at hand on Web scale would allow, for instance, to detect price or product data errors. Those errors are known to be significant cost drivers in the enterprise context (e.g. [Red98]).

Customers: It can be assumed that the specificity of products and services has grown over time. Specificity is defined as the trade-off between the usage of a good in its original intent, and the usage of the good in a way it was not intended for (e.g. [McG91]). Thus, goods that can be used without significant trade-off for multiple purposes have a low specificity, for instance water. A highly specific good would be a custom-made birthday present. Current search engines do not support the search for highly specific goods well. For instance, it is not possible to search for protein bars that do not contain peanuts3. While this information might be expressed in the product description, it is not easily accessible in search engines. Providing a granular Semantic Web representation of a product would preserve this information, which could be easily integrated into applications beneficial for customers. In this context, providing structured data in e-commerce could help finding products or services with a high asset specificity, and thus cater for the growing specificity in modern economies. Therefore, enhancing e-commerce pages with Semantic Web

3 To prevent, for instance, allergic reactions.

technology would extend search capabilities to match rising specificity. One could reasonably argue that generating structured data for e-commerce, which means raising the specificity of data, is an important response to rising asset specificity in modern economies. Asset specificity is highly relevant to the wealth of societies, and raising data specificity thus reflects the rising specificity in markets.

Market research and authorities: Access to granular data about e-commerce with Semantic Web technologies would allow for on-demand economic statistics, for instance on consumer prices. Instead of collecting the data in the extensive processes that predominate today, it would be immediately at hand. That would reduce the collection cost massively. Additionally, it would reduce the time span between the occurrence of a situation and its detection, in turn reducing reaction time. Moreover, the Semantic Web already provides a wealth of social or spatial data sources. As it facilitates data integration, sophisticated applications that combine newly generated market data with existing data sources could be built straightforwardly. This task is commonly considered highly complex with legacy technology. Examples are the strategic positioning of points of sale, or highly targeted marketing.

1.2.2 Existing Semantic E-Commerce Data and Limitations

The most prominent way of using structured data markup in e-commerce is the integration into Web shop pages. This is often realized with extensions for standardized e-commerce software solutions (ECS). We define ECS as a term that describes software systems that allow merchants to manage and provide Web shops. By now, there are at least seven popular extension modules for adding data markup to widely adopted systems [Hep13]. These amount to about 20,000 shop installations, generating structured data for about twenty million offerings4. Meanwhile, these figures only account for a relatively small share of Web shops and offerings

4 Precise figures are not available here. We will elaborate in Section 3.3 on that topic.

on a global scale. We call this way of generating structured e-commerce data semi-automatic: while the data is generated automatically, a manual action by the shop owner is needed to activate the process.

As a means to expose structured data in e-commerce, the GoodRelations ontology has seen a significant adoption [Hep12]. Initially launched in 2008, it provides a data model for e-commerce, building on the Semantic Web technology stack. GoodRelations is equipped with substantial tooling and comprehensive documentation to ease the adoption [Hep+09]. It allows expressing a wide range of e-commerce scenarios by default, and can be easily extended to custom domains and use cases [Hepc]. Recently, major search engines have integrated GoodRelations into schema.org [sch12]. Schema.org is the attempt of Google, Yahoo, Bing and Yandex to promote a consolidated vocabulary for structured data [Sch13]. From the search engines’ perspective, the support of structured data is motivated by significantly less complexity to extract meaningful content out of Web pages. Structured data is also used to provide contextual content to the user. Google has, for example, recently integrated its “Knowledge Graph”, a huge graph of factual information about objects and topics, into its Web search, based on Semantic Web technology5. By using GoodRelations data, a similar interface to e-commerce is within reach. As introduced above, in the short run, there are at least two factors that motivate the integration of structured data in Web shops for merchants. As a result tangible today, search engines reward the integration with visually extended results, which in turn are expected to spur sales (cf. [Bru13]). Additionally, it facilitates data extraction for search engines, which is expected to influence search engine rankings in a positive way.
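To illustrate the kind of data-centric access such markup enables, the following is a minimal sketch, assuming the rdflib library; the Turtle snippet describes a hypothetical offering modelled on GoodRelations and is not taken from an actual shop.

from rdflib import Graph

# Hypothetical GoodRelations offering, roughly as a shop extension would expose it.
TTL = """
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix ex: <http://example.org/shop#> .

ex:offer1 a gr:Offering ;
    gr:name "Green Tea, 100g" ;
    gr:hasPriceSpecification [
        a gr:UnitPriceSpecification ;
        gr:hasCurrency "EUR" ;
        gr:hasCurrencyValue "4.90"^^<http://www.w3.org/2001/XMLSchema#float>
    ] .
"""

g = Graph()
g.parse(data=TTL, format="turtle")

# A data-centric query over offerings, not feasible with plain keyword search.
QUERY = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
SELECT ?name ?value ?currency WHERE {
    ?offer a gr:Offering ;
           gr:name ?name ;
           gr:hasPriceSpecification ?ps .
    ?ps gr:hasCurrencyValue ?value ;
        gr:hasCurrency ?currency .
}
"""
for name, value, currency in g.query(QUERY):
    print(name, value, currency)  # Green Tea, 100g 4.90 EUR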

Meanwhile, the overall adoption of structured data in e-commerce has grown to only about 30 percent of the market in the last five years [web14]. One likely reason is that GoodRelations markup is usually deployed via extensions to e-commerce systems and not integrated into the default configuration of the shop software. Thus, structured data has to be turned on manually by the merchant, as introduced above.

5 http://googleblog.blogspot.de/2012/05/introducing-knowledge-graph-things-not.html

Many applications that operate on structured e-commerce data demand significant market coverage. For instance, a feature comparison engine driven by Semantic E-Commerce data would need a significant coverage of offerings to be useful. Therefore, the low market coverage hinders the sophistication of applications on the basis of existing Semantic E-Commerce data.

1.3 Contributions

In this section, we first present an overview of the contributions, and continue with a more detailed discussion of those.

Foundational contributions:

1. An analysis of the impact of ECS market structures on the deployment of structured data.

2. A reliable machine-learning based method for detecting the ECS used for a Web shop.

3. A collection of sources and an analysis of structured e-commerce data on the Web that provide the basis for our experiment.

Main contribution:

The main contribution is a novel method for the extraction of structured data in the e-commerce domain that builds on the three foundational contributions: By having a certain amount of (3) existing structured data at hand, and being able to (2) identify the e-commerce system, we design a novel data extraction method that exploits (1) system-specific patterns. It generates extraction rules out of an aggregated mapping between offering properties extracted from the GoodRelations data and the Web page elements.

Foundational Contributions

1. Impact of E-Commerce Systems on Structured Data

This first foundational contribution shows that only seven ECS generate more than 90 % of the product pages on the Web, which in turn provides a promising lever for the main contribution (see Fig. 1.3). By being able to craft extractors for those seven ECS, we could theoretically generate structured data for a large share of product pages.

Determining that only a few ECS account for a majority of offering pages on the Web is a significant building block for the later course of the thesis, as an equal distribution would have led to constructing a high number of ECS-specific extractors, at the expense of the high leverage that emerges from regarding only a few ECS.

2. Identification of E-Commerce Systems

This contribution proposes a novel approach to automatically identify ECS. It is based on supervised classification, a field of machine learning (e.g. [Kot07]), and exploits a filtered set of Web page properties.

It is capable of detecting six different e-commerce systems by analyzing only one random page of a Web shop, and shows an overall F1-score of 0.9, see Section 4.3. An extensive evaluation confirms the results.

This contribution provides a practical building block for the main contribution, as the latter requires the accurate and fast detection of e-commerce systems. At the same time, the viability of this approach proves that there exist structural patterns in the markup of different ECS, a premise that is used in the later course of the thesis as an assumption in the rule generator design.
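A minimal sketch of this identification idea is given below, assuming scikit-learn and lxml; the feature choice (tokens from class and id attributes) follows the “class+id” feature set referred to above, while the two tiny training fragments and their labels are purely illustrative.

from lxml import html
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

def class_id_tokens(page_source):
    """Concatenate all class and id attribute values found in a page."""
    tree = html.fromstring(page_source)
    return " ".join(tree.xpath("//@class") + tree.xpath("//@id"))

# Hypothetical labeled pages; the fragments stand in for full shop pages.
pages = [
    '<div class="product-view"><div id="product-options-wrapper"></div></div>',
    '<div id="center_column"><p class="product-name"></p></div>',
]
labels = ["magento", "prestashop"]

classifier = make_pipeline(CountVectorizer(),
                           ExtraTreesClassifier(n_estimators=100, random_state=0))
classifier.fit([class_id_tokens(p) for p in pages], labels)

# A page reusing Magento-style attribute names should typically be assigned to "magento".
print(classifier.predict([class_id_tokens('<div class="product-view"></div>')]))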

3. Existing Structured E-Commerce Data on the Web

The third foundational contribution analyzes existing GoodRelations data on the Web. As this data is the learning set for our extraction rule generator, a detailed analysis of amount, properties, and quality is needed for the further course of the thesis.

Main Contribution: System-specific E-Commerce Extraction based on Structured Data

The foundational contributions become building blocks of the main contribution in the following form:

Impact: As more than 90 % of product detail pages are generated by seven e-commerce systems, our approach focuses on a relatively small number of ECS, while aiming for a high impact.

Patterns in product pages generated by e-commerce systems: Most Web shops are generated by standardized ECS. In this context, ECS are a subclass of Web content management systems. They usually generate Web pages by combining templates with database content. As templates are generally used for a broad range of similar entities on a Web shop, e.g. offerings or categories, it is possible to exploit the patterns generated by those templates to extract the underlying structured data. This building block is based on the foundational contribution 2, “Identification of e-commerce systems”, as the viability of the approach substantiates this observation.

Learning set: Supervised machine learning often suffers from the lack of a sufficient amount of labeled instances for training [DSW07, p. 37]. In the e-commerce case, a labeled instance would contain the locations of name, price, or other properties in an offering Web page. We focus on four distinct ECS, as GoodRelations data exists only for four different ECS in a significant amount.


Figure 1.3: Interplay of foundational and main contributions

Figure 1.4: Web shop with exemplary extraction targets

These three phenomena are integrated into a novel approach and implemented as four ECS-specific extractors. We visualize the interplay of these phenomena and the main contribution in Fig. 1.3.

Extraction Rule Generator

The core of our approach is represented by the extraction rule generator. The extraction rule generator operates on extraction targets, which may be, for instance, the offering name, description, or image. We provide a screenshot of a Web shop6 with exemplary extraction targets in Fig. 1.4.

For each different ECS, the extraction rule generator acts according to this high-level scheme, which we visualize in Fig. 1.5 (a simplified code sketch follows the list):

6 http://www.la-mousson.de/


Figure 1.5: Extraction rule generator approach

1. For each offering page in the GoodRelations data learning set, it extracts the “true” values of the extraction targets.

2. In the offering pages, it searches for elements containing the given values in the content.

3. It extracts the properties of the respective elements.

4. The extracted data is ranked according to cumulative occurrences over all offering pages belonging to a certain ECS.
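The following is a minimal sketch of these four steps, assuming the lxml library; the rule format (tag name plus class attribute) and the two-page learning set are deliberate simplifications of the actual generator described in Chapter 4.

from collections import Counter
from lxml import html

def candidate_rules(page_source, true_value):
    """Steps 2 and 3: find elements whose text contains the known value and
    describe them by tag name and class attribute."""
    tree = html.fromstring(page_source)
    return [(el.tag, el.get("class", ""))
            for el in tree.iter()
            if el.text and true_value in el.text]

def generate_rules(learning_set):
    """Steps 1 and 4: take the true values provided by the GoodRelations markup
    and rank candidate element descriptions by cumulative occurrence across all
    offering pages of one ECS."""
    counter = Counter()
    for page_source, true_value in learning_set:
        counter.update(candidate_rules(page_source, true_value))
    return counter.most_common()

# Hypothetical learning set: (page source, offering name taken from the page's own markup).
learning_set = [
    ('<html><h1 class="product-name">Green Tea</h1></html>', "Green Tea"),
    ('<html><h1 class="product-name">Black Tea</h1></html>', "Black Tea"),
]
print(generate_rules(learning_set))  # [(('h1', 'product-name'), 2)]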

1.4 Research Questions

The research questions are aligned to the aforementioned contributions.

RQ1: How can the combination of (1) the market domination of a few ECS, (2) an automated approach for detecting the ECS behind a Web site, (3) HTML template similarity, and (4) existing structured data be used to design a system that is able to extract structured data from e-commerce sites that do not contain data markup, with a level of granularity and data quality comparable to extraction from explicit data markup?

RQ2: What is the impact of ECS on the availability of structured data?

RQ3: Can we reliably detect the ECS behind an e-commerce Web site automatically by analyzing only a small number of pages from the site?

RQ4: How can we measure the current diffusion and quality of GoodRelations data?

RQ1 is the main research question, substantiating the main contribution. RQ2 to RQ4 are the foundational research questions that the main research question is built on.

1.5 Experimental Design

From a high-level view, our experimental design and the evaluation show the following layout:

1. We collect a sample of e-commerce offering pages that contain GoodRelations data, and split the resulting data set into a learning set and a test set of equal size, i.e. each will contain 50 % of the original data.

2. On the learning set, out of the GoodRelations data and the page element properties, we generate extraction rules that allow us to produce structured data for unlabeled offering pages.

3. We cross-validate these extraction rules on the test set.
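A minimal sketch of this layout is given below; it assumes two callables, a generate_rules function as in the sketch of Section 1.3 and a hypothetical apply_rule helper that evaluates a rule against a page, and it simplifies precision to exact string matches.

import random

def split_and_evaluate(pages_with_markup, generate_rules, apply_rule, ratio=0.5):
    """pages_with_markup: (page source, true value) pairs taken from pages
    that already carry GoodRelations markup."""
    random.seed(0)
    random.shuffle(pages_with_markup)
    cut = int(len(pages_with_markup) * ratio)
    learning_set, test_set = pages_with_markup[:cut], pages_with_markup[cut:]

    # Step 2: generate rules on the learning half only and keep the top-ranked one.
    best_rule = generate_rules(learning_set)[0][0]

    # Step 3: on the held-out half, compare the rule-extracted value with the value
    # the page's own markup provides and report the share of exact matches.
    hits = sum(1 for page, truth in test_set if apply_rule(page, best_rule) == truth)
    return hits / len(test_set)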

1.6 Organization of the Thesis

Chapter 2, Structured Data: Fundamentals and Usage in the E-Commerce Domain, provides an overview of the related work and state of the art that supports the further chapters of the thesis. It consists of a section on (1) the Semantic Web, which emphasizes the seminal role of the (ordinary) Web in modern societies and introduces the vision of a Semantic Web, which, in the context of our work, stands out as a method to manually generate structured data for Web resources. A section on (2) Semantic Web-based E-Commerce introduces e-commerce technologies, and discusses the predominant GoodRelations Web vocabulary and its ecosystem. The last section

of Chapter 2 is devoted to (3) Web Information Extraction, a research area that focusses on the automated generation of structured data from Web resources7.

Chapter 3, Foundational Building Blocks, provides the three foundational contributions introduced extensively above.

Chapter 4, Structured Data for Web Information Extraction in E-Commerce, provides the main contribution. We devote a section to the discussion of the properties of the (1) approach, with special regard to what is achievable in comparison to the current state of the art. We move on with a full discussion of the main part of the (2) implementation based on literate Python programming. We present the main (3) results of our experiment. We (4) evaluate our results with cross-validation, modify experimental settings extensively to assess their influence on the results, and evaluate over a manually-generated dataset. We close the chapter with a concise, yet pragmatic, (5) use case.

Chapter 5, Conclusion, highlights our achievements, discusses the limitations of the main contributions, and provides an outlook on future work.

1.7 Previously Published Work

Parts of the work presented in the thesis have already been published in conference papers with permission:

1. Kurt Uwe Stoll, Mouzhi Ge and Martin Hepp: Understanding the impact of e-commerce software on the adoption of structured data on the Web. Business Information Systems (BIS 2013), Poznan, Poland.

2. Kurt Uwe Stoll and Martin Hepp: Detection of e-commerce systems with sparse features and supervised classification. 10th IEEE International Confer- ence on E-Business Engineering (ICEBE 2013), Coventry, United Kingdom.

7 Definition for our context; related work may have slightly different views.

Paper (1) corresponds to “Impact of E-Commerce Systems on Structured Data”, and paper (2) to “Detection of E-Commerce Systems”, which are discussed in detail in Chapter 3.

2 Structured Data: Fundamentals and Usage for E-Commerce

This chapter summarizes work in fields of research related to the topic of this thesis.

Semantic Web: We will first introduce the Semantic Web vision. While its original vision has not been fully realized so far, it marks the current state of the art in Web science (cf. [SBH06]). In this section, we will underline (1) the paradigm-shifting character of the original Web, its (2) fundamental problems, and (3) the Semantic Web vision. We will analyze thoroughly (4) the Semantic Web technology stack, as it is a technological foundation of the further course of the thesis. We will shortly introduce (5) Linked Data, and conclude with (6) Semantic Web adoption of leading search engines.

Semantic Web-based E-Commerce: The main aim of our research is to design a system that automatically generates structured e-commerce data for Web shops that originally do not provide it. The research in Semantic Web-based E-Commerce is highly relevant, as it specifically operates on such data. This section has six main parts. We begin with (1) the technological foundation of Semantic Web-based E-Commerce. We then introduce (2) the GoodRelations Web vocabulary, which combines a wide range of use cases with significant market adoption. We (3) progress to a short discussion of existing structured e-commerce data on the Web. We (4) provide a discussion of existing research on Semantic E-Commerce. We then (5) go on with a short overview of how respective data is used in commercial settings.


We (6) conclude with a discussion of the economic implications of Semantic E-Commerce.

Web Information Extraction: The third part of the chapter provides related work in the field of Web Information Extraction (WIE), a research area this thesis also belongs to. In this context, the work is characterized (1) by a focus on e-commerce data, (2) by trying to augment the existing structured data on the Web as a main goal, and, as a novel approach, (3) by using existing structured data on the Web as a learning set. To the best of our knowledge, this specific approach has not been exploited in related research in the Web Information Extraction field.

This section has seven main parts. We start with (1) a discussion of relevant dimensions when classifying WIE approaches, and go on with (2) classic and (3) recent WIE approaches. We then introduce (4) e-commerce specific, and (5) ontology-based WIE approaches. We discuss (6) WIE approaches that combine those two subfields, and therefore constitute the best match to our work. We complement the section with (7) an introduction to Web mining, a related research area.

In that way, we strive to integrate two important areas in Artificial Intelligence research (Semantic Web and WIE), and apply the results to a practically highly relevant domain. Fig. 2.1 shows an overview of the relationship of these three areas.

2.1 Semi-Automated Structured Data Generation on the Semantic Web

In this section, we will mainly discuss the Semantic Web vision, which is, at its core, an extension to the existing Web, providing machine-readable, structured data.


Figure 2.1: Strains of relevant related work

2.1.1 The Web

The Web is certainly among the most influential and dynamically evolving technologies of the late 20th century (e.g. [Hal11]). In less than 20 years since its introduction, it has influenced society in dimensions ranging from the economic to the social. In the following subsection, we choose these dimensions as examples for the many changes initiated by the Web. The following scenarios are non-exhaustive and aim to provide an introduction to the breadth of the influences.

2.1.1.1 Economical Dimensions

Companies that are mainly Web-driven like Google, Apple or Microsoft rank among the highest-valued enterprises in the US1. We provide an additional overview of the market capitalization of Internet-driven companies as of April 20132 in figure 2.2.

As we will elaborate in the Semantic Web-based E-Commerce section of this chapter, electronic commerce has become an important part of modern multi-channel

1 Market capitalization of US firms according to Google stock screener, https://www.google.com/finance#stockscreener, as of 07/07/2013
2 Following http://www.statista.com/statistics/209331/largest-us-internet-companies-by-market-cap/

Figure 2.2: Market capitalization of Internet companies (USA), April 2013

marketing. It has seen tremendous growth in recent years. For instance, regarding Germany in 2013, e-commerce generated a turnover of nearly 50 billion Euro, representing almost 11.2 % of all retail [Bun13].

Before the advent of the Web, procurement of highly specific3 goods was a complex process that required large amounts of human action. For instance, if an enterprise manufacturing extension cards for PCs had to procure a slot bracket in the late 80’s, an extensive process of finding low-priced manufacturers, e.g. in China, would have ensued. Today, there exist many platforms like Alibaba4 or Globalsources5 that allow specific goods to be procured efficiently from the manufacturing country, and even general e-commerce companies like Ebay6 or Amazon7 now provide access to this type of goods.

Another economic outcome of the Web is the usage of crowds to solve tasks, a technique called crowdsourcing, or more specifically human computation, which describes methods and technologies that use human agents to solve batches of small problems that are hard for algorithms to tackle [QB11]. In market research, human computation platforms can be used to easily analyze customer preferences. A presentation of two different product package designs to a large number of

3 Introduced in Chapter 1 and Section 2.2.
4 http://www.alibaba.com
5 http://www.globalsources.com/
6 http://www.ebay.com
7 http://www.amazon.com

customers, which formerly would have required a serious amount of resources, can now be evaluated in minutes. Crowdfunding, as another example, presents projects to a large number of small-scale investors on the Web [BLS14]. Currently, the platform Kickstarter8 is dominating the market, and has, for instance, generated nearly 50 fundings above one million dollars and attracted roughly five million small-scale investors9.

2.1.1.2 Social Dimensions

By its fundamental design, the Web allowed everyone capable of writing HTML10 and having access to a server to publish on the Web. In comparison to media that dominated before, like print, television or radio, that alone was a paradigm shift. It made it (1) relatively easy to publish for a world-wide audience. As (2) there was no controlling institution, freedom of speech could be realized to a large degree.

While these social properties have been included in the Web from the early days, we are now seeing the massive growth of social networks. Social networks originally gained their power from a further facilitation of Web content creation, or provided crowd-intelligence based tagging functionality to classify resources [HG06]. Early examples are flickr11, an online photography community, and delicio.us12, a service that allows users to publicly share and manage bookmarks.

In the last few years, the social networks Facebook and Twitter have seen massive growth and gained high importance in the Web economy. As of July 2013, Facebook reported 1.155 billion monthly active users [CZ13], and Twitter is processing more than 400 million tweets a day [Wic13].

8 http://www.kickstarter.com
9 http://www.kickstarter.com/help/stats?ref=footer, accessed 10/23/2013.
10 HTML is the markup language for Web pages [RHJ99; Hic11].
11 http://www.flickr.com/
12 http://www.delicious.com/

Facebook offers a very broad range of services, from establishing connections to friends and acquaintances, through online chatting, to event organization and online gaming.

Twitter, on the other hand, executed a lean platform business model (cf. [Che07]). At its core, it only provides a platform to publish short messages13 in a micro-blog fashion.

These major social networks have shown significant beneficial outcomes. For example, Twitter has been successfully used to report emergencies [HP09]. Facebook has gained significant attention in political science by acting as an organization platform for the opposition in the Arab spring [How+11].

While Facebook has become an application so popular that it may be perceived by some as a replacement for the Web as a whole, it is clearly just a part of it. It is important to stress that by (1) promoting proprietary standards, for instance in terms of structured data14, or (2) walling in the content that has been generated by users, Facebook's impact on Web culture has to be assessed critically (e.g. [Yeu+09]). The same holds true for Twitter, as it operates on a proprietary standard, and also walls in user content, making it hard to extract. These strategies fundamentally collide with the principles of a Web built on open standards, which we will discuss in the section below.

2.1.1.3 Design Principles of the Web

From the perspective of our work, the following three principles stand out:

• Documents can reside on servers all over the world.

This principle first covers decentralization [BHL01; Ber02]. On the Web, there is no central point of failure. The functionality of the Web is not harmed if some servers go down. Second, as basically everyone can put a server online, there is no central control of the content available on the Web. While federal legislation applies, this fundamentally allows for freedom of speech on the Web.

13 Twitter adhered to the limitation of 140 characters per tweet to be compatible with SMS.
14 https://developers.facebook.com/docs/opengraph/

• Documents can link to other documents.

This principle allows authors of Web documents to refer to other Web resources [Ber+04]. At its core, this principle resembles traditional citations contained in print documents. Meanwhile, a very powerful side-effect is that links between Web documents span a graph that can be analyzed automatically. This was the initial idea of Google's PageRank algorithm for automatically rating the relevance of a page on the Web. This algorithm emphasizes documents that are linked by many other documents as important in Web search (cf. [RU12, p. 3],[HG08]). Therefore, the ability to link documents yields an efficient by-product that allows to determine their importance; a simplified sketch of this idea follows after the list below.

• If a user clicks on a link, the linked document is automatically fetched from the server and presented to the user.

This principle covers the user perspective of the Web and integrates the two principles mentioned above. While it may seem trivial now that we are accustomed to Web browsing, its initial proposal was revolutionary. For the first time, it allowed users to surf across arbitrarily linked documents residing on servers all over the world, without even noticing it (cf. [ML07, p. 3]).
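To illustrate the link-analysis idea mentioned for the second principle, the following Python fragment sketches a simplified power iteration over a toy link graph. It is our own illustration, not the actual formulation used by Google; the page names, the damping factor, and the number of iterations are arbitrary assumptions.

# Simplified PageRank sketch over a toy link graph (illustration only).
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # power iteration until the values stabilize
    new_rank = {page: (1.0 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)  # each page spreads its rank over its links
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

print(sorted(rank.items(), key=lambda item: -item[1]))

After a few dozen iterations the values stabilize, and pages that are linked by many other pages, directly or indirectly, obtain the highest scores.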

2.1.1.4 Fundamental Problems of the Web

While the Web has been an ingenious invention that gained massive adoption right from the start, it soon became obvious that it had fundamental limitations. As we discussed above, the Web is essentially a distributed, electronic mapping of the document-centric knowledge representation approach that has existed for centuries. To a large extent, the Web contains unstructured textual data. While this is easy to consume for human agents, it leads to severe limitations for automated data processing (e.g. [AH04, pp. 1-2],[Lac05, p. 4], cf. [Jür12; Cha+06]).

While, for instance, it is relatively easy for a human agent to extract the important persons and locations in a newspaper article, this is hard to solve algorithmically. Generally, a structured form is essential for automated data processing. Meanwhile, Web documents are mostly generated by content management systems that intermingle structured database contents with layout directives [CS08; ZL07; FGS12; Gul+10]. This finding is especially important, as it covers the second foundational contribution of the thesis, “Patterns in product pages generated by e-commerce systems”. This means in essence that the data originally available in a structured form becomes hard to extract.

Gibson, Punera, and Tomkins [GPT05] state that 40 to 50 % of Web pages are generated in the discussed way with the help of templates. Due to the more repetitive nature of Web pages in the e-commerce domain, we expect the share of template-generated pages to be even higher there.

The lack of structured data on the legacy Web15 has the following adverse consequences:

Web search engines mainly operate on string search: General search engines are the main entry point into the Web [Saf13]. They operate on Web documents, which usually contain data covered in textual representations. Therefore, the engines are mostly based on string search, performing matches against the content of the document. From a semantic perspective, this is not very effective, as a search engine initially cannot distinguish, for instance, the difference between the car brand “Jaguar” and the animal. This leads to a suboptimal user experience, as there might be a need to set the right context (cf. [Hit+08, p. 10], cf. [AH04, pp. 1-2]). In recent years, search engines have made significant progress towards understanding the actual meaning of documents, parts of documents, and entities referred to in the documents, see e.g. 2.1.2.4. Despite such advancements, however, Web search is still heavily influenced by the match between terminology in the query and the use of matching words in textual Web content.

15 The following paragraphs reflect the state before the establishment of the Semantic Web vision.

Information integration on the Web is difficult: As data might be spread across different Web sites, and again covered under layout directives, it is hard to integrate it. For instance, consider a digital camera buying decision. There are many product features on the manufacturer's homepage. Additionally, the buyer has to query for prices. To make an informed decision, he or she would need to compile a spreadsheet with the different properties of the cameras, and subsequently weigh them according to personal preferences. Then, the buyer would need to consult a price-comparison site to find a merchant who matches his or her needs. With the current state of affairs, this is a highly laborious task, induced by the fundamental design principles of the Web (cf. [Hit+08],[Lac05, p. 5]). The core of the problem is that for processing information from the Web, computers are limited to automating the rendering of the published data and cannot support the human user in the process of interpreting and combining it.

In an information age that is mainly driven by growing automation, textual documents become a legacy form of knowledge representation. Algorithmic automation operates on data. Non-trivial algorithmic processing of information is currently dependent on structured data, i.e. granular data with unambiguous semantics. The Web in its still predominant stage mostly lacks such structured data, complicating the further automation of information processing from Web pages. To date, the Web is a human-processable representation of structured database content [CS08]. In this context, the majority of Web data is so poorly structured that only humans are able to interpret it. At the same time, the sheer size of the aggregate data is so vast that only machines are suited to operate on it [SHB06]. In summary, the wealth of information available on the Web is not matched by efficient means for processing it (cf. [Fur+11]). Besides the inherent conceptual and syntactical heterogeneity of the underlying data, the root cause for this problem is that data structure and data semantics from the underlying databases of dynamic Web sites are stripped off in the process of publication on the Web.

2.1.2 Semantic Web

In this subsection, we will first introduce the vision of the Semantic Web, and then present the Semantic Web technology stack related to our work. The stack consists of the following components:

• Uniform Resource Identifiers: URIs [BFM05]

• Extensible Markup Language: XML [Bra+08]

• Resource Description Framework: RDF [CK04]

• and ontology languages, namely RDFS and OWL, the Web Ontology Language [BG14; Bec+04]

• SPARQL Query Language and Interface for RDF [PS08]

2.1.2.1 Vision

There are two fundamental approaches to the aforementioned problems of the Web (cf. [Hit+08, p. 11]).

1. The first approach is the Semantic Web vision and has been proposed in a 2001 article by Berners-Lee, Hendler, and Lassila [BHL01], as an extension to the Web.

At its core, it promotes the following two key ideas (e.g. [AH04; Tre+08; SCV07]):

• To enhance existing Web pages with machine-readable structured data.

• To express the data in a way that adheres to a commonly shared meaning.

This would ultimately create a database that contains all the information available on the Web (e.g. [Ber09]). Ideally, the machine-readable data should represent the important facts contained in a Web site in a granular way, so that

each fact could be integrated elsewhere. For instance, having all biographies of classical composers available on the Semantic Web would make it straightforward to calculate their mean age. Since 1999, the Semantic Web idea has matured into a significant research field, and provides a solid technology stack ready to implement the vision, which will be discussed below.

2. The second approach aims at designing systems that are capable of automatically extracting data out of the documents on the Web (cf. [Hit+08, p. 11]). We will elaborate on this approach extensively in Section 2.3 of this chapter. In essence, this approach aims at making computer processing more powerful, so that it could process unstructured text as well as it processes structured data (cf. [Hit+08, p. 11]).

Additionally, extending the heuristic known as Metcalfe's Law [HG08], establishing the Semantic Web would lead to an explosion of the network value of the Web. Instead of documents, singular facts can be interconnected (cf. [HG08]). This is especially important, as Web platforms that exhibit strong network effects, for instance social networks, have proven to be of high value to their users16.

In this regard, a main aim of the Semantic Web is to liberate data from the application, or document, context [Rod09]. A currently very common pattern is that of Web companies gathering massive amounts of user data, as we introduced in the “Social dimensions of the Web” section above. This generates a Web made of walled gardens, in which each application locks in the generated data and exploits it to its own benefit. The Semantic Web aims at marking up the data in a way that any user can integrate data from different applications.

At this point, we would like to emphasize that the approach of this thesis uses Web Information Extraction, which we will discuss below in 2.3, to generate Semantic Web data. Therefore, it aims at combining the two fundamental approaches to make available structured data representing the information on the Web.

16 Facebook.com currently ranks number 2 among the most popular websites according to [Ama13].

[Figure content, from top to bottom: Query Language & Interface: SPARQL; Ontology Languages: RDFS & OWL; Data Model: RDF; Global Identifiers: URIs]

Figure 2.3: Reduced Semantic Web technology stack relevant to this work, own representation based on [Ber00]

2.1.2.2 Semantic Web Technology Stack

Since its initial incubation by the World Wide Web Consortium, the Semantic Web community has released a substantial set of standards and technologies. In the following subsection, we discuss those that have the highest impact on our research.

There exist many versions of the Semantic Web technology stack (e.g. [Ber00; Sig05; Bra07]). We have decided to exclude technologies that are less important to our work, like rules or trust, resulting in a reduced Semantic Web technology stack that is shown in Fig. 2.3.

URIs: Uniform Resource Identifiers

URI Syntax: URIs are the most fundamental building block of the ordinary17 Web and the Semantic Web. The following description of the syntax of URIs has been excerpted from RFC 3986 [BFM05], which is the official document defining the URI standard. To describe the syntax of the distinct parts of URIs, RFC 3986 uses the Augmented Backus–Naur Form (ABNF), which itself is defined in RFC 5234 [CO08]. Originally, URIs were identifiers for Web resources. They are subject to the syntax provided in Fig. 2.4.

The hier-part consists of authority and path, an example is provided in Fig. 2.5.

17 We use the term ordinary here to stress the difference between the original Web and a Semantic Web.

scheme ":" hier-part [ "?" query ] [ "#" fragment ]

Figure 2.4: URI scheme, Berners-Lee, Fielding, and Masinter [BFM05]

http://www.semantium.de:5984/research/machine-learning/extractor?target=price#currency

scheme authority path query fragment

Figure 2.5: URI scheme - example
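To make the decomposition shown in Fig. 2.5 concrete, the following short Python fragment splits the example URI into its components with the standard-library helper urlsplit. This is our own illustration; the helper is not part of the URI specification itself.

from urllib.parse import urlsplit

# Decompose the example URI from Fig. 2.5 into its components.
uri = ("http://www.semantium.de:5984/research/machine-learning/"
       "extractor?target=price#currency")
parts = urlsplit(uri)

print(parts.scheme)    # http
print(parts.netloc)    # www.semantium.de:5984 (the authority)
print(parts.path)      # /research/machine-learning/extractor
print(parts.query)     # target=price
print(parts.fragment)  # currency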

URI Use in the Semantic Web: In the former paragraph, we introduced URIs as identifiers of Web resources. The Semantic Web extends the usage of URIs to basically any thing, for instance persons, cities, universities or abstract concepts like colors [Rod09]. Therefore, on the Semantic Web, URIs can both reference information resources (Web documents) that describe things, and things themselves [SCV07]. In that way, it is important to distinguish whether a reference targets an information resource or a thing. A common way to reference real-world or abstract things on the Semantic Web is to use the URI fragment as extension to a given URI. In Listing 2.1, we provide an example that references the Universität der Bundeswehr München itself (the institution), and its location. A respective information resource would be the homepage available at “http://www.unibw.de”.

http://www.unibw.de/about#university
http://www.unibw.de/about#location

Listing 2.1: Fragment identifiers

XML: Extensible Markup Language:

Fundamentals: We did not include XML as a layer in our technology stack, as it does not play a vital role in the Semantic Web context. Meanwhile, we introduce it briefly, as our research makes use of this technology. XML is a metalanguage that

allows to interchange data in a structured way by the definition of domain-specific grammars [Lac05, p. 62]. It is used to represent tree-based structures. The World Wide Web Consortium has been heavily involved in the definition of XML. A main advantage is that by establishing XML as a syntax for arbitrary grammars, a software implementation of XML, e.g. a library in a certain programming language, ensures compatibility with many use cases. A disadvantage of XML is that it only allows to define the structure of a document, while not providing means to define the content [SHB06].

Features: XML uses tags to delimit elements, attributes, and content. Tags are set in angle brackets. Here, attributes, which are placed in the tag (e.g. src=“uwe.jpg”), provide meta-data about a tag [Bra+08],[Lac05, p. 61 f.]. Being normally used with start-tags (e.g. <offering>) and end-tags (e.g. </offering>), XML allows nested elements; for instance, multiple child elements might be included in the offering element. It is possible to self-close a tag that contains no content. A common example is the image tag in HTML (e.g. <img src=“uwe.jpg” />), which contains all data in properties and thus can be self-closed.

To provide an outlook, the approach discussed in this thesis uses the attribute meta-data that is employed to style Web pages as hints to discover the semantics of a given Web page element. The precise workings will be elaborated in Chapter 4.

XPath: The XML Path Language XPath is a mechanism to address the elements of XML and similar trees (e.g. [CD99]). As we exclusively use HTML files as data sources in our research, we can apply XPath as a language to formulate our extraction rules, since in the context of this research, HTML can be treated as if it were an XML grammar. We provide some examples to show the capabilities of XPath in Table 2.1; a short programmatic illustration follows the table. In the later course of the thesis, we will mainly use selectors that match certain tag properties (e.g. /div[@class=“price”]) to extract their contents. An encompassing introduction to the capabilities of XPath is out of the scope of this work.

Table 2.1: Example XPaths

XPath expression and function:
/body
  Selects the “body” node
/div[@class=“price”]
  Selects all “div” nodes that have the “class” property value “price”
/div/h1/span/text()
  Selects the content of all “span” nodes that are children of “h1” nodes that are children of “div” nodes
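As a brief programmatic illustration of the expressions in Table 2.1, the following Python sketch evaluates them against a minimal, hypothetical product-page fragment using the lxml library. The page content and the choice of library are our own assumptions for this example; the absolute paths are anchored at the html root of the parsed document.

from lxml import html

# Minimal, hypothetical product-page fragment.
page = """
<html>
  <body>
    <div class="product">
      <h1><span>Example Camera X100</span></h1>
      <div class="price">499.00 EUR</div>
    </div>
  </body>
</html>
"""

tree = html.fromstring(page)

body_nodes = tree.xpath("/html/body")                 # the "body" node
prices = tree.xpath("//div[@class='price']/text()")   # ['499.00 EUR']
names = tree.xpath("//div/h1/span/text()")            # ['Example Camera X100']

print(body_nodes, prices, names)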

RDF: Resource Description Framework

Basics: RDF is the fundamental data model of the Semantic Web (e.g. [Tre+08]), and constitutes the first layer that enables machines to interpret data on the Semantic Web [SHB06]. It is characterized by a subject-predicate-object structure (e.g. [Rod09]). Regarding RDF, there is a fundamental distinction between resources, properties, and statements (e.g. [AH04, pp. 67-68],[SHB06]). Subjects and objects of the statement are resources, whereas the predicate is a property. In RDF, URIs18 are used for resources and predicates (e.g. [Tre+08]). Web content like pages or sites can be resources, as well as real-world objects or abstract concepts. Properties allow to state attributes or characteristics of relationships between resources. A resource-predicate-value triple forms a statement. There, the value can be a literal, resource, or another statement (e.g. [SHB06]). The latter is called reification [Haa+04], but is not relevant for this thesis. It is important to remark that RDF, by using URIs to a large extent, fundamentally inherits the design principles of the Web [Swa02].

For instance, RDF allows to express statements like:

• Uwe lives in Munich.

• Uwe likes the Eisbach.

• Munich is the capital of Bavaria.

18 Precisely, blank nodes (e.g. [CK04]) can also be used for resources. We exclude their discussion as they are not critical for our argument.

[Figure content: Uwe --lives in--> Munich; Uwe --likes--> Eisbach; Munich --capital of--> Bavaria]

Figure 2.6: Graph of the RDF example

Multiple statements form a directed labeled graph (e.g. [SHB06]). We show the graph of the example above in Fig. 2.6.

As described above, we can easily see that through transitivity, we could reason19 that Uwe lives in Bavaria.

Much of the power of RDF comes from the repeated use of URIs in different statements (e.g. [Rod09]). For instance, another source might express that the Eisbach is a surfing spot. A possible application in an e-commerce context might be to deduce that Uwe also likes surfing, and thus promote related products. By generating more and more data in RDF, many aspects of the real world gain a digital representation and thus, the Semantic Web could become a world-wide repository of interlinked data20 (cf. [Rod09]).

Syntaxes and Notations: There exist multiple syntaxes or notations for RDF, well known ones are RDF/XML, Turtle and N-Triples [Rod09]. We provide an example that shows the case motivated above in Turtle syntax in Listing 2.2.

@prefix ex: <http://example.org/ns#> .
ex:uwe ex:livesin ex:munich ;
    ex:likes ex:eisbach .
ex:munich ex:capitalof ex:bavaria .

Listing 2.2: Running example in Turtle notation

19 Reasoning is an additional layer in many Semantic Web stacks, but we leave it out of our discussion here. See also Section 2.1.2.2.
20 From a philosophical point of view, we would like to extend that argument in the following way: The ordinary Web tries to build a human-readable representation of the world, and with its growth, this representation becomes more precise. At the same time, the Semantic Web tries to build a machine-readable representation of the world, and with its growth, the potential for automation grows.

In terms of conciseness, RDF introduces an abbreviating prefix mechanism that allows to spare the constant repetition of the base URI. In our example, “ex:” is defined as the prefix for the URI <http://example.org/ns#>. For instance, in this way, the full URI <http://example.org/ns#uwe> can be replaced with ex:uwe.

We can see that the Turtle syntax allows for a human-readable form of RDF. In addition to the raw subject-predicate-object form, it includes syntactical shortcuts for authors, e.g. it allows to express repetitive subjects with a semicolon.

The serialization RDF/XML is an XML-based syntax for RDF, but it is rather hard to write for human users and is suited more towards machine-based creation and reading. As RDF/XML expresses RDF statements in an XML syntax, it is compatible with a wide range of XML software, as introduced above (2.1.2.2). It is however important to understand that the XML tree of RDF/XML is different from the underlying RDF graph, that the same RDF graph can be represented in differing XML trees, and that thus the processing of RDF data with XML tools is error-prone.

Another important serialization is RDFa. RDFa allows RDF to be embedded into HTML Web pages by extending tags (e.g. [Adr+10]). There are two different approaches to RDFa integration: The in-line approach integrates RDFa content at the code position of the respective element in the original Web page. The piggybacking approach, which has been developed at the Universität der Bundeswehr München, uses a succinct snippet, positioning an aggregated part of RDFa at a specific point of the HTML document.

Both approaches have strengths and weaknesses. The in-line approach generates RDFa at the point in the HTML document where it originally belongs, but it is relatively hard to integrate into existing, non-trivial HTML templates. While the piggybacking approach diminishes this problem by its self-contained nature, it has the downside of repeating content and thus introducing redundancy. We would like to emphasize that the approach of this thesis, as it is capable of finding the locations of Web page elements with certain semantics21, could produce both

21 For instance, price.

types of RDFa. Meanwhile, we will not implement the in-line approach in the later course of the thesis. While it is basically simple, it comes at the cost of significant HTML parsing to ensure the validity of the output document.

A last serialization we would like to mention at this point, which has gained some attention recently, is JSON-LD. Introduced by Lanthaler and Gütl [LG12], it is based on JSON, the Javascript Object Notation, a data interchange format that has been established by the Javascript programming language community22. Many software systems, programming languages, and libraries support JSON. Therefore, using it promises a wide compatibility.

Below, we provide the full listings for the running example in the RDF/XML, RDFa, and JSON-LD serializations.

The conversions for the following RDF notations have been generated with the RDF Translator of Stolz, Rodriguez-Castro, and Hepp [SRH13a].

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/ns#">
  <rdf:Description rdf:about="http://example.org/ns#uwe">
    <ex:livesin rdf:resource="http://example.org/ns#munich"/>
    <ex:likes rdf:resource="http://example.org/ns#eisbach"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/ns#munich">
    <ex:capitalof rdf:resource="http://example.org/ns#bavaria"/>
  </rdf:Description>
</rdf:RDF>

Listing 2.3: Running example in RDF/XML notation

22 http://www.json.org/

<div xmlns:ex="http://example.org/ns#">
  <div about="http://example.org/ns#uwe">
    <span rel="ex:livesin" resource="http://example.org/ns#munich"></span>
    <span rel="ex:likes" resource="http://example.org/ns#eisbach"></span>
  </div>
  <div about="http://example.org/ns#munich">
    <span rel="ex:capitalof" resource="http://example.org/ns#bavaria"></span>
  </div>
</div>

Listing 2.4: Running example in RDFa notation

[
  {
    "@id": "http://example.org/ns#uwe",
    "http://example.org/ns#likes": {
      "@id": "http://example.org/ns#eisbach"
    },
    "http://example.org/ns#livesin": {
      "@id": "http://example.org/ns#munich"
    }
  },
  {
    "@id": "http://example.org/ns#munich",
    "http://example.org/ns#capitalof": {
      "@id": "http://example.org/ns#bavaria"
    }
  }
]

Listing 2.5: Running example in JSON-LD notation
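The conversions above were produced with the RDF Translator; as a hedged alternative, the same transformation can be scripted, for instance with the rdflib Python library. The following sketch is our own illustration and not the tooling used for the listings; note that JSON-LD output requires a sufficiently recent rdflib version (or the rdflib-jsonld plugin).

from rdflib import Graph

# The running example from Listing 2.2.
turtle_data = """
@prefix ex: <http://example.org/ns#> .
ex:uwe ex:livesin ex:munich ;
    ex:likes ex:eisbach .
ex:munich ex:capitalof ex:bavaria .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# Re-serialize the same graph in the notations discussed above.
print(g.serialize(format="xml"))      # RDF/XML
print(g.serialize(format="json-ld"))  # JSON-LD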

Ontologies: RDFS and OWL, the Web Ontology Language: An intermediary layer in the Semantic Web stack is RDFS, which allows to express the following

basic ontological23 features (e.g. [AH04, pp. 84-87],[SHB06]):

• subclass and property hierarchies, and

• domain and range, and

• instances of classes.

In this way, RDF vocabularies can be structured hierarchically with RDFS [SBH06]. We intentionally omit an elaborate discussion of RDFS at this point, as it does not play a significant role in the further course of the thesis.

While the RDF data model allows for a granular description of abstract and real- world scenarios, an essential part of the Semantic Web technology stack is the Web Ontology Language OWL. In comparison to RDFS, OWL introduces extended modelling possibilities, and provides formal semantics (e.g. [AH04; Hit+08]).

The GoodRelations vocabulary [Hep11a] is specified as an OWL ontology, and as already argued, it is central to our research. Therefore, we give a brief introduction to ontologies. The section consists of a paragraph discussing the fundamentals of ontologies from a more theoretical point of view, and a paragraph on the formal semantics of ontologies. We intentionally omit a further discussion of the OWL language or the dialects OWL-Lite, OWL-DL, and OWL-Full, because they are not essential to the further argument. The OWL specification is accessible at [Bec+04].

Fundamentals of Ontologies: The term ontology in computer science has originally been borrowed from philosophy, where it refers to the “science of the being” (e.g. [Lac05, p. 25]). Recently, it has become a popular term in computer science to describe artifacts that generally aim to improve interoperability by defining shared meaning (cf. [AH04, p. 197]).

There are several introductions to ontologies in the computer science context (e.g. Lacy [Lac05] and Colomb [Col07]). A popular definition has been provided by

23 Ontologies will be introduced just below.

1. Using philosophical notions as guidance for identifying stable and reusable conceptual elements: By using philosophically well-founded distinctions, modelling ontologies can create stable and lasting interoperability.

2. Unique identifiers for conceptual elements: In human communication, friction sometimes emerges regarding the semantic meaning of a term (e.g. “Jaguar” as animal / car manufacturer). The “controlled vocabulary effect” of ontologies describes their feature of providing unique identifiers for conceptual elements, often realized with URIs. Homonyms and synonyms are often motivating factors for the deployment of ontologies.

3. Excluding unwanted interpretations by means of informal semantics: Ontologies allow for the provision of granular textual descriptions, synonym sets, or multimedia objects that further define the meaning of identifiers. As those elements improve the exclusion of wrong interpretations, their authoring is a complex task.

4. Excluding unwanted interpretations by means of formal semantics: Axioms in formal logic allow rendering unwanted usages of conceptual elements logical contradictions, which can be detected automatically. Properties or classes may be disjoint, or having a certain property may imply that an element belongs to a certain class.

5. Inferring implicit facts automatically: Because a complete definition of all logical axioms of an ontology may be very extensive, it may be beneficial to derive implicit logical axioms from explicit ones automatically. This method is called reasoning. In the original Semantic Web technology stacks mentioned above, reasoning often is a layer above the ontology layer. We omit a detailed discussion of reasoning, as it has no practical implications for our work.

6. Spotting logical inconsistencies: The formal axioms of an ontology allow contradictions in the ontology, or in data using it, to be detected automatically.

Figure 2.7: Six eects of ontologies, based on [Hep08b]

Guarino and Giaretta [GG95], who define an ontology as a logical theory which gives an explicit, partial account of a conceptualization.

From a more practical perspective, Hepp [Hep08b] additionally defines six effects of ontologies, which we summarize in Fig. 2.7.

To set our work in the context of ontology research subfields, as we aim to generate data according to GoodRelations, we operate in the field of ontology population (e.g. [Tre+08],[DSW07, p. 37]). Additionally, as we discuss how GoodRelations is used in a real-world scenario in Section 3.3, a foundational contribution belongs to the field of ontology usage mining.

Ontology Semantics: We devote the following section to an introduction to ontology semantics following Lacy [Lac05, pp. 32-35] and Gómez-Pérez, Fernández-López, and Corcho [GFC04, pp. 11-12]. According to their work, ontologies provide

semantics by (1) information representation building blocks, which are classes, properties, and individuals, and (2) semantic relationships, which can be classified in those that relate building blocks to each other, and those that describe relationships.

We will provide details on the different entities below, beginning with the fundamental building blocks [Lac05]:

• The class concept can be compared to objects in object-oriented programming (e.g. [Arm06]) or tables in relational database management systems (e.g. [Cod70]). It relates to the notion that real-world objects form groups or sets with adjacent properties. In the same way, individuals can be members of classes. For instance, Angela Merkel is an individual, belonging to the class politicians. In this context, on the conceptual level, a class is a set with a specific overlap in the characteristics of its members. Classes can be ordered in hierarchies, and can be used for abstract and specific objects [GFC04].

• Properties link objects with values, which can be other objects or plain data values. A self-explanatory example is the object Angela Merkel, the attribute belongsToParty and the value CDU.

• Individuals constitute the members of classes at the instance-level (cf. [GFC04]). Often, it is non-trivial to decide whether an object should be a class or an instance. One could argue that love is felt by a couple as an instance of the romanticEmotions class. On the other hand, one might also argue that it is a class that has members that itself are classes, like romanticLove or platonicLove.

The core building blocks are aligned with a set of inter building block relationships (cf. [Lac05, pp. 34-36]), i.e. those that relate building blocks to each other:

• Membership in the ontology context allows to state that an individual is a member of a class. This relation is typically referred to as is an instance of.

• Ontologies allow to bind attributes to individuals. For instance, this relational feature allows modeling that Angela Merkel studied physics.

• Ontologies allow to define restrictions on objects. For instance, we could state that the politicians' class must have the attribute studied, hypothetically.

Furthermore, in addition to the inter building block relationships an ontology can have, we need to introduce intra building block relationships (cf. [Lac05, pp. 36-39]), i.e. those that describe relationships:

• The synonymy term comes from linguistics, and describes words with the same or similar meaning (e.g. [Löb13, p. 203]). In this context, the synonymy relation allows to define the similarity of objects [Lac05, p. 36]). For (an) instance, we could define that Martin Hepp in the context of the Universität der Bundeswehr homepage24 is the same person as Martin Hepp in the context of his research homepage25.

• Antonymy is the opposite of synonymy, and thus used to characterize the fact that multiple classes, properties, or instances are either different or mutually exclusive (disjoint). With a disjointness axiom, we could e.g. state that the classes Carnivores and Vegetarians are mutually exclusive and infer from membership in one of the classes the non-membership in the other [Lac05, p. 37].

• Hyponymy allows to introduce relationships that define hierarchical orderings of distinct classes, properties, or instances. It implies the generalization of sub-objects to an upper object, and the specialization of upper objects to sub-objects. A common feature of hyponymy is that specialized objects inherit the features of upper objects [Lac05, pp. 37-38].

• Meronymy allows to state that sub-objects compose upper objects, for instance a part-list of a cupboard [Lac05, p. 39]).

24 http://www.unibw.de
25 http://www.heppnetz.de

• Holonymy is the reverse relation of meronymy and introduces a possibility to state that a given upper object consists of given sub-objects [Lac05, p. 39]).

We have to note that many of the mentioned terms originally come from the studies of human language and that their direct application to formal logic in the context of ontologies in computer science is often not free from inconsistencies with their original meaning in linguistics.
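To make the mapping from these linguistic notions to concrete Semantic Web constructs explicit, the following small sketch expresses membership with rdf:type, hyponymy with rdfs:subClassOf, antonymy (disjointness) with owl:disjointWith, and synonymy of individuals with owl:sameAs. The mapping and the class and instance names are our own illustration, parsed here with the rdflib Python library.

from rdflib import Graph

ontology_snippet = """
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/ns#> .

ex:Politician   rdfs:subClassOf ex:Person .        # hyponymy: class hierarchy
ex:Carnivores   owl:disjointWith ex:Vegetarians .  # antonymy: disjoint classes
ex:angelaMerkel rdf:type ex:Politician .           # membership: instance of a class
ex:martinHepp1  owl:sameAs ex:martinHepp2 .        # synonymy: same individual
"""

g = Graph()
g.parse(data=ontology_snippet, format="turtle")
print(len(g), "statements parsed")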

SPARQL: Query Language and Interface for RDF

Introduction: SPARQL is a W3C standard that defines an HTTP-based interface and a language to query and manipulate RDF graphs (e.g. [PS08]). After being released as a W3C recommendation in 2008, the current version, released in 2013, is 1.1. SPARQL has a syntax that is similar to SQL [Tre+08]. We omit a detailed formal introduction to the SPARQL language and instead discuss the topic by example.

SPARQL Example in GoodRelations: Consider the GoodRelations offering in Turtle RDF provided in Listing 2.6.

@prefix s: <http://schema.org/> .
@prefix gr: <http://purl.org/goodrelations/v1#> .
# xsd: and foo: declarations added for completeness; the foo namespace URI is illustrative
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foo: <http://example.org/shop#> .

foo:offer1 a gr:Offering ;
    gr:name "Stoll Technology SPIMSoft 5" ;
    gr:category "Scientific PIM Software"^^xsd:string ;
    s:aggregateRating [ a s:AggregateRating ;
        s:ratingValue "4.9"^^xsd:float ;
        s:reviewCount 99 ] .
foo:offer2 a gr:Offering ;
    gr:name "John Doe Technology SCISoft 3" ;
    gr:category "Scientific PIM Software"^^xsd:string ;
    s:aggregateRating [ a s:AggregateRating ;
        s:ratingValue "3.9"^^xsd:float ;
        s:reviewCount 50 ] .

Listing 2.6: Offering example in Turtle syntax

There are two offerings, offer1 and offer2. Both belong to the category “Scientific PIM Software”. The name of the first offering is “Stoll Technology SPIMSoft 5”, and the name of the second offering is “John Doe Technology SCISoft 3”. The offerings differ regarding the rating. While offer1 has a rating value of 4.9, aggregated from 99 reviews, offer2 has a rating value of 3.9, aggregated from 50 reviews.

Now consider the following SPARQL query given in Listing 2.7:

PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?name
WHERE {
    ?off a gr:Offering .
    ?off gr:name ?name .
    ?off gr:category "Scientific PIM Software"^^xsd:string .
    ?off schema:aggregateRating ?rat .
    ?rat schema:ratingValue ?ratval .
    FILTER(?ratval > 4.5)
}

Listing 2.7: SPARQL query to find offerings

In the “WHERE” clause of the query, the following statements are expressed:

• It is defined that the variable “?off” may only be bound to resources of the type “gr:Offering”.

• The variable “?name” is bound to the “gr:name” values of the offerings.

• The offerings must have “gr:category” set to the string “Scientific PIM Software”.

• The variable “?rat” is bound to the “schema:aggregateRating” object property of the offerings.

• The variable “?ratval” is bound to the “schema:ratingValue” value of the rating.

The results are filtered to only allow matches that have a rating value higher than 4.5. This excludes offer2. The initial “SELECT ?name” triggers the output of the variable “?name” that has been bound beforehand, resulting in the query result: “Stoll Technology SPIMSoft 5”. This example shows a simplistic way of

using SPARQL. Due to space limitations, the initial RDF example was very short. Normally, SPARQL queries operate on larger datasets that may well include billions of triples. Therefore, while the restriction that the queried resource must be a “gr:Offering” seems trivial here, this feature of SPARQL is highly relevant in real-world scenarios.
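As a hedged sketch of how the query from Listing 2.7 can be executed programmatically against the data from Listing 2.6, the following Python fragment uses the rdflib library. The choice of library and the foo namespace URI are our own assumptions for the example.

from rdflib import Graph

offers = """
@prefix s:   <http://schema.org/> .
@prefix gr:  <http://purl.org/goodrelations/v1#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foo: <http://example.org/shop#> .

foo:offer1 a gr:Offering ;
    gr:name "Stoll Technology SPIMSoft 5" ;
    gr:category "Scientific PIM Software"^^xsd:string ;
    s:aggregateRating [ a s:AggregateRating ;
        s:ratingValue "4.9"^^xsd:float ;
        s:reviewCount 99 ] .
foo:offer2 a gr:Offering ;
    gr:name "John Doe Technology SCISoft 3" ;
    gr:category "Scientific PIM Software"^^xsd:string ;
    s:aggregateRating [ a s:AggregateRating ;
        s:ratingValue "3.9"^^xsd:float ;
        s:reviewCount 50 ] .
"""

query = """
PREFIX gr:     <http://purl.org/goodrelations/v1#>
PREFIX schema: <http://schema.org/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
SELECT ?name WHERE {
    ?off a gr:Offering .
    ?off gr:name ?name .
    ?off gr:category "Scientific PIM Software"^^xsd:string .
    ?off schema:aggregateRating ?rat .
    ?rat schema:ratingValue ?ratval .
    FILTER(?ratval > 4.5)
}
"""

g = Graph()
g.parse(data=offers, format="turtle")
for row in g.query(query):
    print(row.name)  # expected: Stoll Technology SPIMSoft 5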

2.1.2.3 Linked Data

A recent sub-development in the context of the Semantic Web vision is Linked Data. Linked Data establishes methods for the provision and the interlinking of data on the Web [BBH09; HB11]. We provide a direct citation of the four Linked Data principles that have originally been published by Tim Berners-Lee [Ber06]:

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).

4. Include links to other URIs, so that they can discover more things.

While the Linked Data movement has gained significant attention in the Semantic Web research community, there is no perfect overlap with our research. This is (1) because the Linked Data community stresses the importance of providing the data under an open license. While our approach assumes that the publication of a merchant's offerings is in its best interest, we are quite sure that at the same time, full transparency of merchant data is unrealistic, as it would expose pricing and assortment strategies too openly to competitors. Furthermore, the Linked Data community (2) maps its progress with the Linked Open Data Cloud. Unfortunately, the requirements to be mapped in this cloud exclude the existing

structured e-commerce data, as it does not feature enough incoming and outgoing links to other datasets. Therefore, a substantial part of data that fundamentally meets the Linked Data principles above is excluded. Furthermore, the emphasis of Linked Data is more on publishing existing monolithic data sources than on equipping major vertical parts of the Web with structured data, as the GoodRelations project aims to do.

2.1.2.4 Schema.org, Google Semantic Web Tools and Google Knowledge Graph

In this subsection, we analyze recent developments of the major Web search engines.

Schema.org: Schema.org is a joint initiative of the major search engines Google26, Bing27, Yahoo28, and Yandex29 that defines a Web vocabulary that allows to annotate a wide range of Web information with structured data. The aim of the project is to motivate Web site owners to mark up their sites with structured data, which significantly facilitates the indexing by search engines and allows for novel services. Promoted by these significant stakeholders, it has gained good momentum since its initial release in June 2011. The GoodRelations Web vocabulary, which plays a central role in our approach30, has been fully integrated into the schema.org vocabulary as of November 2012. This is a significant milestone in the dissemination process of Semantic Web-based E-Commerce. Additionally, it underlines the practical importance of the research provided in this thesis, as search engines might incorporate a production version of the proposed approach to add a new way of structured data generation.

26 http://www.google.com
27 http://www.bing.com
28 http://www.yahoo.com
29 http://www.yandex.ru
30 To be introduced in detail in the following section.

Google Rich Snippets Testing Tool and Enhanced Search Results: In this context, Google provides a testing tool31 for Web pages that are annotated with structured data. The tool displays the data Google has recognized on a given page, and provides a rudimentary preview of how search engine results will be displayed. While the tool recognizes the data that is produced by the shop extensions, the final display in the search results has emerged to depend on multiple factors that are not openly communicated by Google. For instance, we found that there is a fundamental correlation between search engine ranking and pricing compared to the competition.

Google Data Highlighter: Google Data Highlighter is a tool that was put into service in December 2012. Extended with e-commerce functionality in May 2013, it allows site owners to visually mark up structured data content on their pages, in order to facilitate the structured data extraction task for the search engine. The tool has gained remarkable attention in the search engine optimization community32. The release of the tool underlines the relevance of our research for the Web industry, as we aim to improve automation for this task.

Google Knowledge Graph: In May 2012, Google launched the Google Knowledge Graph as a semantic enhancement to its standard search engine [Sin12]. It is driven by a knowledge base that gathers its input from a variety of sources like Wikipedia or the CIA World Factbook, and is able to present additional relevant information to the user, provided the searched concept is included in the knowledge base [Sin12]. For example, given a certain book author, it provides important facts like date of birth and death, and the most important works in a list. The aim of this presentation is to provide the user with concise information that would normally require an extensive collection process33, as already introduced above.

31 http://www.google.com/webmasters/tools/richsnippets
32 275 likes on the social network Google Plus.
33 http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html

These examples may mitigate the problems we introduced in Chapter 1 and thus weaken the problem statement provided there. Meanwhile, search engines are exploiting very little of what Semantic Web technology may be capable of, or at least do not yet expose much of what they are able to do.

2.1.3 Conclusion

In this section, we introduced the Web as a central technology of our time, the Semantic Web vision to mitigate limitations of the Web, Linked Data as a novel approach to realize the Semantic Web vision, and the usage of Semantic Web technologies in search engines. In the next section, we will discuss Semantic Web-based E-Commerce, a field that combines the Semantic Web vision with e-commerce.

2.2 Semantic E-Commerce

This section is structured as follows: We (1) discuss selected technological foundations of e-commerce, and present (2) the GoodRelations Web ontology, as it represents the foundation of this thesis. We go on with (3) a short introduction of existing GoodRelations data on the Web, which will be elaborated extensively in Section 3.3. We subsequently present (4) scientific work on Semantic Web-based E-Commerce, and provide a discussion of (5) the real-world usage of structured e-commerce data. We finally provide an (6) outlook on the economic implications of Semantic Web-based E-Commerce.

2.2.1 Technological Foundations of E-Commerce

At this point of our discussion, we will introduce selected technological foundations of typical, shop-based e-commerce on the Web. A more detailed introduction to e-commerce technology, focussing on e-commerce systems (ECS), will be provided in Section 3.1.

We begin with the current market situation and its effects on technological aspects of e-commerce, and finally present the main traits of this technology that are relevant to our work.

In our context, it is helpful to split the B2C e-commerce market into two main groups:

1. First, there is a relatively small group of market leading companies like Amazon34, Sears35, or Walmart36 that offer hundreds of thousands of products. They commonly use proprietary software systems to provide Web shops to their customers.

2. A second group is constituted by a large amount of Web shops that use widespread e-commerce software systems (ECS) as a technology basis. Examples of such software systems include Magento, Oxid EC, Prestashop or Virtuemart [Mag; eSa; Pre; Vir].

We chose the term ECS, as it underlines the many functions such systems usually cover. The most important functions are:

• Presentation of category pages, product pages, and informational pages, as well as product search.

• Provision of cart, checkout, and payment functionality.

• Management of customers, products, and stock.

At this point, one might argue that a payment system like PayPal37 is also an e-commerce system. We do not agree, as we see payment systems only as parts of e-commerce systems. The essential point of e-commerce systems, from our point of view, is that they integrate a significant part of the E-Commerce value chain.

34 http://www.amazon.com
35 http://www.sears.com/
36 http://www.walmart.com/
37 http://www.paypal.com

When we distinguish category pages and product pages in the e-commerce domain, the focus of our research lies on the product pages, as they can be seen as equivalent to the offering class that will be introduced in the following Section 2.2.2.2.

2.2.2 The GoodRelations Web Ontology for E-Commerce

This subsection is structured as follows: We (1) discuss goals and design principles of GoodRelations, (2) go on with its data model, and finally introduce its features, documentation, and ecosystem.

2.2.2.1 Goals and Design Principles

The GoodRelations ontology for e-commerce is a central basis for the work presented in this thesis. GoodRelations is a generic, industry-neutral conceptual model for e-commerce information, specified as an OWL 1 DL ontology [Hep08a]. It provides highly specific means to represent e-commerce information based on the Semantic Web technology stack, and thus caters to the rising need for specificity38 in economies.

Here, the goals of GoodRelations are [Hep11a]:

• to be applicable across several vertical industries,

• to work with different stages of the value chain,

• to be syntax-neutral.

It is important to stress that GoodRelations has been designed in a way that is generic enough to be extended to a wide variety of domains, at the same time being specific enough to cover a large amount of e-commerce cases. Defined as a data model and not in a specific representation, it is adaptable to future technologies.

38 Discussed in detail in 1.1.1 and 2.2.5.

GoodRelations is subject to a Creative Commons license39, which allows to use, share, and modify the work as long as an attribution is provided [Var]. GoodRelations tries to “keep simple things simple and make complex things possible”40 (cf. e.g. [Bak13]).

2.2.2.2 Data Model

The conceptual elements of GoodRelations are a central design paradigm for our research, as our goal is to generate novel GoodRelations data. GoodRelations is built on the assumption that e-commerce information in the pre-transaction stage follows an agent-promise-object pattern [Hep11a]: An (A)gent (e.g. a person or organization), usually an enterprise or legal person, states a (P)romise of transferring a legal right regarding an (O)bject, which may be a product or service, in exchange for a compensation [Hep11a]. If relevant, the APO expression can be extended with a location [Hep11a].

The principle dominates wide areas of commerce and allows, despite its simplicity, to cover a wide range of use cases. For instance, using GoodRelations, it is possible to express a garage surfboard sale, as well as carbon dioxide disposal in space. Additionally, just by using the GoodRelations predicate “seeks” instead of “offers”, it allows to flip the semantics of the predicate, which covers tendering or demand [Hep11a]. Fig. 2.8 depicts the agent-promise part of the principle, and the most important attached properties. An extensive UML diagram of the GoodRelations Web vocabulary can be found online [Hep11b].

In the following paragraphs, we will provide details on the depicted elements. The paragraphs are based on the GoodRelations specification [Hep11a].

Agent: The agent represents the business entity, i.e. the person or institution issuing the APO statement. For the figure we have attached the foaf:page property,

39 http://creativecommons.org/licenses/by/3.0/
40 http://de.slideshare.net/mhepp/goodrelations-rdfa-for-deep-comparison-shopping-on-a-web-scale
41 Own visualization, based on http://www.flickr.com/photos/heppresearch/5683265098/

[Figure content: a Business Entity (Agent, with Company URI via foaf:page) gr:offers or gr:seeks an Offering (Promise), which gr:includes a Product or Service (Object, with item page via foaf:page, item image via foaf:depiction, and product features via gr:Feature); key properties of the Offering are gr:hasBusinessFunction (e.g. gr:Sell), gr:name (item name), gr:description (item text), gr:hasEAN_UCC-13 (EAN/UCC/GTIN-13), and gr:hasUnitPriceSpecification pointing to a gr:UnitPriceSpecification with gr:hasCurrencyValue (price) and gr:hasCurrency (currency code).]

Figure 2.8: Most important conceptual elements of the GoodRelations ontology41

as it will be used in the extraction process later to specify the agent. Another important property42, from the perspective of this work, is gr:legalName, which allows to express the legal name of the business entity.

The gr:offers or gr:seeks property is used to link the agent to the promise.

Promise: In the following part, we discuss the outgoing properties of the offering, which will later be used as targets of the extraction process. The offering itself is, as described above, represented by the product detail page on the Web shop.

• gr:includes attaches the object, which can be a product or a service, to the offering. The product or service can be further defined, for instance by which manufacturer a product was made. We omitted some possible properties of the object at this point, as we did not use them in our extraction approach. We will elaborate further on that topic in Section 4.1. An object can have

42 Not depicted.

a set of features, expressed by the gr:feature property, or product features defined in GoodRelations-compliant product type ontologies.

• gr:hasBusinessFunction, for instance, links gr:Sell to the offering. That is important, as GoodRelations also supports other (self-explanatory) business functions, namely gr:ConstructionInstallation, gr:Dispose, gr:LeaseOut, gr:Maintain, gr:ProvideService, gr:Repair, gr:Sell and gr:Buy. The business functions are essentially bundles of rights43.

• gr:name represents the offering name, which is a highly differentiating property in our context.

• gr:description allows to attach the offering description.

• gr:hasEAN_UCC-13 is another important property for this work, as it allows to attach a strong identifier to the offering, i.e. a number or code adhering to a scheme issued by an authority, which allows a precise distinction between products or services.

• gr:hasUnitPriceSpecification is the last property we would like to point out as central to the GoodRelations data model. It is important to stress that it separates price and currency with the properties gr:hasCurrencyValue and gr:hasCurrency. That allows a straightforward data integration of offerings expressed in different currencies [SH13a]. A compact data example combining the properties discussed above is sketched below.
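To consolidate the elements discussed in this subsection, the following sketch expresses a minimal agent-promise-object statement in Turtle and parses it with the rdflib Python library. All URIs, names, and values are hypothetical; the price is attached via gr:hasPriceSpecification pointing to a gr:UnitPriceSpecification node, in line with the GoodRelations specification.

from rdflib import Graph

apo_example = """
@prefix gr:   <http://purl.org/goodrelations/v1#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://shop.example.com/#> .

ex:company a gr:BusinessEntity ;                     # Agent
    gr:legalName "Example Shop GmbH" ;
    foaf:page <http://shop.example.com/> ;
    gr:offers ex:offer .

ex:offer a gr:Offering ;                             # Promise
    gr:name "Example Camera X100" ;
    gr:description "Compact digital camera." ;
    gr:hasBusinessFunction gr:Sell ;
    gr:includes ex:camera ;
    gr:hasPriceSpecification [ a gr:UnitPriceSpecification ;
        gr:hasCurrency "EUR"^^xsd:string ;
        gr:hasCurrencyValue "499.00"^^xsd:float ] .

ex:camera a gr:SomeItems ;                           # Object (product or service)
    gr:name "Example Camera X100" ;
    gr:hasEAN_UCC-13 "4012345678901"^^xsd:string .
"""

g = Graph()
g.parse(data=apo_example, format="turtle")
print(len(g), "triples in the offering description")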

2.2.2.3 Features, Documentation, and Ecosystem

GoodRelations features extensive documentation. The ontology specification of Hepp [Hep11a] can be seen as state of the art in the field, as for each element of the ontology, it provides code examples in different RDF notations, and thorough integrations of social media, as well as currently leading technology discussion forums. We would like to adhere to the structure of the Wiki44 to further elaborate on the features of GoodRelations.

43 http://www.heppnetz.de/ontologies/goodrelations/v1.html#BusinessFunction
44 http://wiki.goodrelations-vocabulary.org/Main_Page

Tutorials and Cookbooks Concerning GoodRelations Implementation: While GoodRelations is basically a data model that has been designed in the tradition of Occam's Razor (e.g. [Bak13]), the underlying complexity of Semantic Web technology makes it considerably hard to roll out for Web designers. Therefore, GoodRelations is equipped with an extensive set of tutorials and cookbooks to facilitate the integration into Web shops [Hep+09; Hepb; Hepd].

Information on How to Consume GoodRelations Data: As we have argued before, the establishment of sufficient market coverage of GoodRelations is a very important task for the progress of Semantic Web-based E-Commerce. At the same time, showcasing means of consuming data can spur early adoption. GoodRelations already features an advanced set of technologies that simplifies the integration into legacy technology. Two notable projects are GR2RSS by Stolz and Hepp [SH13b], which allows for a straightforward production of RSS feeds (e.g. [Boa12]), which are supported by a wide range of Web software, and GR4PHP, an API in the popular programming language PHP, which facilitates access to GoodRelations data [SGH12].

Case Studies: While GoodRelations is used by (as of 07/2013) about 20.000 Web shops with a wide range of sizes and domains, some use cases are especially qualified to show the potential of the technology.

• Volkswagen UK Car configurator: The increasing importance of customization in the automotive industry has led to configuration options that can amount to over one million different combinations. GoodRelations has been successfully used by Volkswagen UK to cope with that problem45.

• Ravensburg, Germany mobile app: Data integration and provision is a critical task for the expanding mobile sector, which is underlined by the advent of dedicated start-ups like cloudbase.io46. GoodRelations has been successfully used to drive a mobile application providing information about

45 http://wiki.goodrelations-vocabulary.org/Case_studies/Volkswagen
46 http://cloudbase.io/

stores in the German town Ravensburg47. As a special feature, it provides the opening hours of stores in a convenient manner, rendering a distinctive benefit for the user as this feature is not commonly available in comparable applications.

Tools and Resources: GoodRelations features an extensive set of tools that facilitate the generation of GoodRelations data. Most noteworthy are the shop extensions, which allow merchants to add structured data to their shop systems easily. Currently, there are extensions for the following shop systems: Magento48, Drupal Commerce49, Prestashop50, xt:Commerce51, Oxid eSales52, and Joomla/Virtuemart53, of which the Magento extension is the most popular with more than 9.000 downloads as of May 2014. As the Magento extension has shown relatively high adoption rates, an extensive support landscape has been built, providing documentation, newsletters, and discussion groups54.

Other important tools are:

GR-Notify: GR-Notify55 is a Web service that allows merchants who have implemented GoodRelations to register their site, so that its existence can be signaled to Semantic Web applications. Originally designed in 2010 with the intent of learning how the different shop extensions are used and put into action, the service quickly matured into a central pillar in the GoodRelations software environment. The significance of GR-Notify stems from the fact that many applications, like the work at hand, or the crawler discussed below, need a collection of sites using GoodRelations for further analysis. As the data of GR-Notify allows to heuristically generate a sample of the existing structured e-commerce data on the Web, it plays

47 http://wiki.goodrelations-vocabulary.org/Case_studies/Ravensburg
48 http://www.magentocommerce.com/
49 http://www.drupalcommerce.org/
50 http://www.prestashop.com/de/
51 http://www.xt-commerce.com/
52 http://www.oxid-esales.com/
53 http://virtuemart.net/
54 http://www.msemantic.com
55 http://gr-notify.appspot.com/

a central role in the data generation process. Therefore, the design, implementation, and maintenance of the GR-Notify Web service can be seen as a further contribution of this thesis. We will provide an extensive analysis of the data collected by GR- Notify in Section 3.3 “Existing structured e-commerce data”.

Online Snippet Generator56: GR-Snippetgen allows users to generate GoodRelations data in the RDFa syntax via a convenient Web form. It provides a straightforward way to add GoodRelations markup for smaller businesses.

Converter from BMEcat: BMEcat is an electronic data interchange standard provided by the German Federal Association for Materials Management, Purchasing, and Logistics57 (e.g. [Sch+05]). It allows the granular modeling of product features in an extensive set of domains. Stolz, Rodriguez-Castro, and Hepp [SRH13b] have derived a method to convert BMEcat catalogues into GoodRelations data. The tool can be used to generate very rich product data based on typically available data from PIM or PDM systems [SRH13b].

GoodRelations Validator58: This tool provides a validation service for GoodRelations data, which is important to ensure data quality. As RDFa markup is not easy to debug for humans, it serves a critical role in the tooling landscape of GoodRelations.

GoodRelations Crawler: The GoodRelations crawler has been developed for the research project Intelligent Match59. As a focussed crawler (cf. [CBD99]), it visits the Web sites that are known to provide GoodRelations markup, and extracts this data. We provide related research in Section 3.3. In comparison to the crawler, which aims to generate a full replication of the known GoodRelations data on the Web, we analyze a sample only.

56 http://www.ebusiness-unibw.org/tools/grsnippetgen/
57 Bundesverband Materialwirtschaft, Einkauf und Logistik e.V. (BME).
58 http://www.ebusiness-unibw.org/tools/goodrelations-validator/
59 http://www.intelligent-match.de/

Domain-Specific Extensions: GoodRelations can be extended to suit specific domains better, allowing for a very granular modeling of product classes and properties. The following projects stand out:

• The Vehicle Sales Ontology VSO60

The Vehicle Sales Ontology allows for a very fine-grained modeling of e-commerce scenarios that contain vehicles, such as cars, bikes, or boats. In combination with the off-the-shelf capabilities of GoodRelations, it is possible to express the sale of a motorbike with a specific transmission, or a repair service for four-seated rickshaws, for instance.

• The Tickets Ontology TIO61

Designed for use with GoodRelations, the Tickets Ontology provides a sophisticated model to describe a wide range of scenarios like concert tickets, museum passes, and public transport or airfare tickets.

• The Product Types Ontology PTO62

The Product Types Ontology occupies a special place among the GoodRelations extensions, as it is not a classical ontology extending GoodRelations to a specific domain, but provides roughly 300.000 categories to classify the products attached as objects in a GoodRelations APO statement. The categories are derived from Wikipedia, which emerged as a valuable source to define an extensive set of product categories, for instance “laser printer” or “racing bicycle”. The Web interface of the Product Types Ontology provides a wide range of URI patterns catering for different use cases and RDF notations. Additionally, the service provides a bookmarklet to trigger the Product Types Ontology directly from Wikipedia. The Product Types Ontology has been developed in conjunction with the Intelligent Match project63.

• OPDM Ontologies

60 http://www.heppnetz.de/ontologies/vso/ns
61 http://www.heppnetz.de/ontologies/tio/ns
62 http://www.productontology.org/
63 http://www.intelligent-match.de/

OPDM64 is a European research project aiming at improving product data management on the Semantic Web of e-commerce. In the course of the project, 34 ontologies, each catering for a specific product class, have been designed. For instance, there are ontologies for garments, home hi-fi, or landline phones. Like the other ontologies, all OPDM ontologies are crafted in a way that they can easily be used as extensions to GoodRelations.

2.2.2.4 Existing GoodRelations Data on the Web

When we use the term structured e-commerce data, we mean data that can be extracted from annotated e-commerce Web pages. More specifically, we mean Web pages that are annotated with GoodRelations RDFa, as this is the scope of this thesis. There are other ways of adding structured data to e-commerce sites, such as Facebook Open Graph, microformats, or schema.org. Nevertheless, we have decided to focus on GoodRelations data for this research, as it is (1) an open standard conforming to the Semantic Web technology stack, and (2) it provides significant granularity to cater for arbitrary specificity.

As the existing structured GoodRelations data on the Web is one of the foundational pillars for our extraction approach, it seems natural to devote an extensive section to the topic (see Section 3.3).

At the same time, to improve the reading flow, we would like to introduce the most prominent data sources now. First, as of July 2013, there are about 20.000 Web shops using GoodRelations to annotate their offerings. Additionally, GoodRelations is used by major market participants like Best Buy or Sears, each providing hundreds of thousands of offerings. A precise number of offerings equipped with GoodRelations on Web scale is hard to provide, as it would require a complete, unfocussed Web crawl that is way out of reach. Meanwhile, we estimate that about twenty million offerings are annotated. The major part of GoodRelations available in Web shops is generated by extensions, i.e. software modules that enhance the

64 http://www.ebusiness-unibw.org/ontologies/opdm/

capabilities of Web shops by accessing a standardized interface. These extensions exploit the mechanism of generating Web pages discussed above (2.1.1.4), namely combining database content and templates, by injecting metadata sections into the templates [MP12]. We define this structured e-commerce data as semi-automatically generated. While, once installed, the generation of the data is automated, the installation itself has to be performed manually by the shop owner. In comparison to this approach, the research at hand strives to generate structured e-commerce data in a (fully) automatic way.
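To make the injection mechanism more tangible, the following minimal sketch illustrates the idea in Python with the Jinja2 templating library (the concrete template, field names, and library are assumptions for illustration; real extensions work inside the respective ECS): the same database fields that feed the visual template are rendered a second time into a GoodRelations RDFa block that is appended to the page.

from jinja2 import Template

# Purely presentational template of a product detail page (hypothetical).
page_template = Template(
    '<div class="product"><h1>{{ name }}</h1>'
    '<span class="price">{{ price }} {{ currency }}</span></div>')

# RDFa block injected by the shop extension; it reuses the same database
# fields that already feed the visual template (gr: prefix declaration omitted).
rdfa_template = Template(
    '<div typeof="gr:Offering">'
    '<span property="gr:name">{{ name }}</span>'
    '<div property="gr:hasPriceSpecification" typeof="gr:UnitPriceSpecification">'
    '<span property="gr:hasCurrencyValue">{{ price }}</span>'
    '<span property="gr:hasCurrency">{{ currency }}</span>'
    '</div></div>')

# Field values would normally come from the shop database.
product = {"name": "Racing bicycle", "price": "899.00", "currency": "EUR"}
html = page_template.render(**product) + rdfa_template.render(**product)
print(html)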

At this point, we would like to explain that the terms

1. Semantic E-Commerce data

2. GoodRelations data

3. Structured e-commerce data

can be used interchangeably, as they only describe slightly different notions of essentially the same thing. (1) Semantic E-Commerce data is e-commerce data enhanced with a layer of clear semantics, e.g. which property denotes the price of an offering. (2) GoodRelations is the de facto world standard to express e-commerce scenarios based on Semantic Web technologies; however, Semantic Web-based E-Commerce data might also be expressed according to another data model, as introduced above. (3) Structured data, semi-structured data, and unstructured data, concepts originating from Web Information Extraction, are types of data with a decreasing degree of data semantics. Therefore, while the research topics have significantly different meanings, the data they use is, on the bottom line, the same. As our research touches all three underlying topics, we do not adhere to a single terminology, but use each term where it fits the respective context.

2.2.3 Existing Research in Semantic E-Commerce

We would like to classify existing research in the field of Semantic Web-based E-Commerce in two main groups, separated by the release of the GoodRelations

Web vocabulary in 2008 [Hep08a].

The first group is extensively discussed by Hepp [Hep08a]. He identifies three waves of related work in the Semantic E-Commerce area.

1. The first wave is constituted by papers discussing the impact on allocation processes (e.g. [GQ02]), product data management (e.g. [Fen+01]), caveats of using ontologies (e.g. [OWL01]), ontology mapping (e.g. [GC01]), and standards transformation (e.g. [McG01; Kle02; CW03a]).

2. The second wave emphasized implementation aspects. Those include approach scalability, domain evolution, and standard integration. During this wave, e.g. Tolksdorf et al. [Tol+03] recognized the importance of realizing the Semantic Web vision in a B2C context for the first time.

3. In the third wave, a thorough transformation of the eClass standard was presented by Hepp [Hep06]. In the timespan this third wave covers (roughly 2006-2008), agents operating on Semantic Web data were expected to be of high importance (e.g. [Fas06]).

To conclude, at the time of its publication, GoodRelations was a notable contribu- tion to spur the Semantic Web vision in e-commerce.

The second group of work in the Semantic Web-based E-Commerce domain is constituted by research that appeared in the GoodRelations ecosystem. Selected work of that period has already been introduced above in Section 2.2.2.3.

2.2.4 Real-World Usage of Structured E-Commerce Data

In the previous sections, we have introduced academic work that relates to structured data in e-commerce. We want to complete this chapter with a short introduction to how the e-commerce industry currently uses structured data, excluding the Semantic Web-based E-Commerce case. In this context, we would like to emphasize that in this work, we focus on offering data, and disregard product model data.

Therefore, we exclude product model data from the discussion of this section as well.

An important criterion for a buying decision on the Web is obviously the price, a fact that is underlined by the success of price comparison sites that aggregate e-commerce offerings. For the German market, examples include geizhals.de65 and billiger.de66; a well-known US example is shopping.com67, to name only a few. From the merchants’ perspective, the listing on those sites requires the setup of a contract between the price comparison sites and the merchants, as those sites monetize based on sales commissions [LJC11]. From a technical perspective, for each comparison site a merchant wants to be listed on, he has to set up a proprietary data feed according to the specification of the respective comparison site. In this way, performing state-of-the-art multichannel marketing currently means providing different data formats for different channels.

A widespread usage of GoodRelations would mitigate that problem significantly. If GoodRelations could be established as a data interchange standard for this scenario, it would suffice to integrate GoodRelations for a merchant to ensure compatibility with a wide range of possible channels. At the same time, the price comparison sites operate on lock-in effects, and the exclusivity of their data, which may hamper the proliferation of GoodRelations for that case.

Recently, some ventures have been started that operate on the market data to provide competitive pricing advice to e-commerce merchants. Examples include Wisepricer68, which mainly operates in the US, and Beny Repricing, which operates in Germany and the UK69.

65 http://www.geizhals.de
66 http://www.billiger.de
67 http://www.shopping.com
68 http://www.wisepricer.com/
69 http://beny-repricing.com/

2.2.5 Economical Implications of Semantic E-Commerce

This section has two parts: We (1) elaborate on the importance of the specificity concept, and (2) provide an introduction to data marketing.

Asset Specificity: An important economic aspect that is influenced by the adoption of Semantic Web-based E-Commerce is asset specificity. Asset specificity is defined as the loss in value generated by not using a product for its originally intended purpose (e.g. [McG91]). In the last century, an explosion of specificity of goods could be observed. While in the 1920s, a German reference of goods70 listed 4000 different products that every proficient merchant should know, the number of different products available today is legion, most certainly in the millions. If we project the influence of an advent of Semantic Web-based E-Commerce on specificity, we think that it will lead to a further increase. Today, highly specific demand can often not be met due to technical restrictions, as we motivated with the “search engine bottleneck” in the introduction. This leads to consumers restricting the specificity of their demand to what seems available. Uncovering the highly specific goods with granular structured data would in turn allow customers to fulfill their demand, most likely generating even more specific demand.

Thus, we argue that extending the Semantic Web-based E-Commerce data-space would not only cater for, but even increase asset specificity. To set the concept of economic specificity into perspective, we would like to refer to the need of software for specificity (cf. [Lac05, p. 9]). In this context, specificity means that software needs to operate on specific, i.e. structured, data. Therefore, from a terminological perspective, there is already a strong notion that structured data goes along with economic specificity. Summing up, economic growth in specificity has not been matched by technology that is used to initiate market transactions. With Semantic Web-based E-Commerce, and especially with the granularity that the GoodRelations ontology is able to cover, there is a significant chance to close this gap.

70 Merck’s Warenlexikon.

Data Marketing: Besides the already introduced direct effects like search engine result extension and granular signaling, we project a decoupling of the Web of documents and offering data in e-commerce. Nowadays, the economic potential of a market participant in e-commerce is largely influenced by two factors. First, if we consider that time spent on search results diminishes exponentially from rank one ([GJG04]), we can argue that highly-listed shops sell best. If we assume here that Web shops have roughly the same conversion rate, this leads to winner-takes-all markets. We propose Web site design as the second most important factor influencing the buying decision in e-commerce (e.g. [LK10]). If the Semantic Web could be established to a large extent in e-commerce, we project the advent of data marketing. When the raw properties of an agent, promise, and object can be analyzed with high sophistication, the relevance of search engine rankings and Web site design would diminish, because neither would be needed to initiate a transaction. In turn, budgets that are spent nowadays on search engine optimization and Web site design would gradually be transferred to data marketing.

In conclusion, Semantic E-Commerce, in our opinion, has to be seen more as a key technology than as a killer application. Like the Web in general, or specifically the Semantic Web, it may take a long time to discover its potential and benefits, as well as its risks and caveats. From our point of view, it opens the door to many applications that show business potential, but at the same time requires novel ways of thinking and the abandonment of known paradigms.

2.2.6 Conclusion

We introduced Semantic Web-based E-Commerce with technological foundations, the GoodRelations Web vocabulary, a preview of existing GoodRelations data, which will be discussed more thoroughly in Section 3.3, scientific work, and real-world usage. Our discussion closed with economical implications of the research area. In the following section, we will provide an introduction to the field of Web Information Extraction, with an emphasis on work related to the fields discussed above.

2.3 Automated Generation of Structured Data with Web Information Extraction

An introduction to the following field is essential in the context of our work, as the research fields of Semantic Web and Web Information Extraction are increasingly related. We want to stress that in an elaborate Web Information Extraction system, there should be little need for human intervention to generate structured data for many Web shops. We again stress the comparison to the Semantic Web approach above (2.1), which needs significant human intervention, at least in the form of an enabling software installation.

The remainder of the section is structured as follows. We (1) clarify the ambiguous terminology in Web Information Extraction, then discuss (2) classical and (3) recent examples of work in the field, go on with (4) e-commerce-related and (5) ontology-based Web Information Extraction, and finally show approaches that combine both fields. In that way, we consecutively move from a broad view of the field to a narrow view on work that is closely related to our approach.

Table 2.2 provides an overview of the related work we will discuss in this section. We provide the classification according to the subsections, author and year, the system name if applicable, whether the system is supervised or unsupervised (i.e. expecting labels ex ante), whether it targets a specific domain, and its main approach.

Table 2.2: Overview of discussed work in Web Information Extraction

Category       | Authors (Year)             | System     | Approach
Classical      | Chawathe et al. (1994)     | Tsimmis    | Specific programming languages
Classical      | Muslea et al. (1999)       | STALKER    | Extraction rules as finite automata
Classical      | Hogue and Karger (2005)    | THRESHER   | GUI, tree edit distance
Classical      | Crescenzi et al. (2001)    | Roadrunner | Template diff
Recent         | Carlson et al. (2010)      | CSEAL      | Coupling
Recent         | Dalvi et al. (2011)        | Yahoo      | Noise tolerance
Recent         | Furche et al. (2012)       | DIADEM     | Domain-Knowledge
E-Commerce     | Jindal and Liu (2009)      | G-STM      | Tree matching, nested lists
E-Commerce     | Qiu and Yang (2010)        | -          | Page similarity measure, page clustering
Ontology-based | Ciravegna and Wilks (2003) | AMILCARE   | Adaptive annotation for the Semantic Web
Ontology-based | Popov et al. (2003)        | KIM        | Semantic Annotation
Ontology-based | Adrian et al. (2010)       | Epiphany   | Semantic Annotation
SW & EC        | Baumgartner et al. (2001)  | LIXTO      | Visual, interactive
SW & EC        | Pazienza et al. (2003)     | CROSSMARC  | NLP, Machine Learning, agents
SW & EC        | Svatek (2006)              | -          | Constraining extraction with ontologies
Our approach   | Stoll (2014)               | -          | ECS-based, templates, existing structured data

2.3.1 Research Strains in Web Information Extraction and Relation to Semantic Web Research

To provide an overview of related fields, we adhere to the fundamental ordering of Web extraction introduced by Furche, Gottlob, and Schallhart [FGS12]:

Web Information Extraction (WIE) in the narrow sense, also known as Open Information Extraction, mainly targets gaining information from textual resources on Web scale, and focusses on domain breadth and scalability, with a low recall. Ontologies can support Information Extraction at two stages: by defining the structure of the extraction targets, or by being the target data model itself [FGS12].

Web Data Extraction, also known as Wrapper Induction, on the other hand, focusses on the extraction of structured data, also taking into account the semi-structured nature of HTML (e.g. [RHJ99; Hic11]) documents [FGS12]. Furche, Gottlob, and Schallhart [FGS12] introduce two strains of WDE systems. There are (1) domain-independent systems that exploit templates used on the Web, yielding low accuracy, and (2) machine-learning based systems that are customized for sites, but yield high accuracy [FGS12]. From this point of view, our work belongs to the WDE strain exploiting templates. Often, work in this field is also subsumed under Web Information Extraction.

Web Mining fundamentally covers Web structure, Web usage, and Web content mining (e.g. [SHB06]). Fundamentally, Web content mining seems to be quite close to Web Information Extraction. From the application of the Semantic Web to these methods arise two important notions for our research. First, the initially disparate fields of content mining [SHB06] and structure mining [SHB06] become hard to separate, as Web content mining generates structured data, which can also be derived from the links between pages. Second, Semantic Web Mining provides a holistic view on the process of extracting existing Semantic Web data, using it as a learning set for content mining, and later piping the results back into the Semantic Web. From that point of view, our research also belongs to the field of Semantic Web Mining.

While in the narrow sense, our work should be classified into the Web Data Extraction subfield of Web Information Extraction, the majority of the strongly related literature does not adhere to this scheme. One could also argue that our approach is a special form of Wrapper Induction, i.e. the automated generation and configuration of software components for extracting information (see e.g. [KWD97; FK00]). In summary, we finally classify our work in the broader category of Web Information Extraction, and further discuss this field below.

2.3.2 Classical Web Information Extraction Approaches

For this overview of classical WIE approaches, we join Chang et al. [Cha+06] in dividing WIE systems into the following four types: (1) Manually-constructed systems, (2) supervised systems, (3) semi-supervised systems, and (4) unsupervised systems. The remainder of this section elaborates on the specific features of these approaches and describes selected examples.

Manually-constructed WIE systems are built by manually defining wrappers in programming languages. Generic programming languages are used as well as programming languages specific to the WIE task. As this puts a high demand on the expertise of the users who design those wrappers, these systems are expensive [Cha+06]. TSIMMIS [Cha+94] is one of the first systems that exploits this approach. It operates on a file that declares where the data of interest is located, and how this location matches the entities to extract. TSIMMIS outputs the extracted data in an object exchange model, a way to define the structure of the result [Cha+94].

Supervised WIE systems exploit initially labeled datasets, which can be provided or extended by labeling via graphical user interfaces. By lowering the bar of expertise for wrapper designers, this reduces the cost of building such systems [Cha+06]. STALKER by Muslea, Minton, and Knoblock [MMK99] is based on hierarchical extraction, allowing the extraction problem to be broken down into sub-problems. It uses the embedded catalogue formalism to represent the given semi-structured

document as a hierarchical model [MMK99]. Here, leaves are the attributes to be extracted, whereas internal nodes are lists of tuples [MMK99]. For ordinary nodes, the wrapper needs an extraction rule, and for list nodes, the wrapper needs an iteration rule transforming the list into individual tuples [MMK99].

Semi-supervised WIE systems differ from supervised systems in terms of the accuracy of the given data labels [Cha+06]. Since no extraction targets are specified ex ante, they have to be specified after the extraction, e.g. with a graphical user interface [Cha+06]. The THRESHER [HK05] system allows users to highlight and label elements of a Web page. Tree edit distance is applied to generate the wrapper [HK05]. The user then defines a mapping to RDF [HK05]. In this sense, THRESHER is also related to the “Ontology-based Web Information Extraction” Section (2.3.5) of this chapter.

Unsupervised WIE systems neither use labeled data, nor user interaction [Cha+06]. Unsupervised systems exploit database content and site templates [Cha+06]. By reversing the process of templating, i.e. by separating the templates from the data changing between pages, it is possible to construct wrappers [Cha+06]. If users get involved, it is only to select a schema that fits the given data [Cha+06] or to provide meaning to the extracted data elements [Cha+06]. RoadRunner [CMM02], for instance, considers the page production as an encoding of data in HTML elements, and approaches the extraction as a decoding problem [CMM02]. The decoding problem is addressed by defining a grammar for HTML [CMM02].
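The underlying intuition of reversing the templating process can be illustrated with a minimal sketch (not RoadRunner's actual grammar-based algorithm; the toy pages and the token-level alignment are simplifying assumptions): given two pages generated from the same template, tokens that are identical at aligned positions are treated as template, while mismatching tokens are treated as data slots.

from difflib import SequenceMatcher

def split_template_and_data(page_a, page_b):
    """Align two pages token-wise; shared tokens form the template,
    differing tokens are treated as data slots."""
    a, b = page_a.split(), page_b.split()
    template, data_a, data_b = [], [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":                       # identical tokens -> template
            template.extend(a[i1:i2])
        else:                                   # differing tokens -> data
            data_a.append(" ".join(a[i1:i2]))
            data_b.append(" ".join(b[j1:j2]))
    return template, data_a, data_b

page1 = "<h1> Blue Kettle </h1> <span> 19.90 EUR </span>"
page2 = "<h1> Red Toaster </h1> <span> 34.50 EUR </span>"
print(split_template_and_data(page1, page2))
# the markup tags and 'EUR' are shared; product names and prices end up as data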

2.3.3 Recent Approaches to Web Information Extraction

Recent approaches to WIE try to improve the results of former work from the following angles:

Carlson et al. [Car+10] provide a recent approach to semi-supervised learning. It operates on a broad range of domains. As only a few labeled samples are used, they underline that semi-supervised WIE often yields mediocre performance, as the learning task is under-constrained [Car+10]. To tackle this problem, they propose a

coupling of multiple extractors to raise the overall extraction performance [Car+10]. While the authors use an ontology as input for the evaluation, we decided not to classify the work in the respective group, as the ontology plays a subordinate role.

Dalvi, Kumar, and Soliman [DKS11] propose an automated framework to approach the extraction problem on Web scale that powers live applications at Yahoo!. Its main aim is to allow noise tolerance, i.e. the ability of the system to cope with noise in the labeled samples. It combines supervised wrapper induction, domain knowledge, and unsupervised grammar induction [DKS11]. It builds on a publication model of the Web and a probabilistic model of the noise that emerges in the labeling process [DKS11].

Another strain of recent research provides encompassing frameworks that try to address former shortcomings of WIE holistically. As an example, we pick the DIADEM [FGS12] framework, which aims at providing highly accurate, automated Web scale extraction by trading off scope, specifically targeting vertical domains. These domains are targeted by providing significant information on them and the nature of their objects [FGS12]. DIADEM tries to extract deep Web (e.g. [Ber01]) resources, i.e. Web resources that are hidden in databases and only accessible via forms, by a four-step approach [FGS12]: It (1) automatically detects search or browse interfaces, (2) recognizes instances and attributes on overview pages, (3) refines instance extraction on detail pages, and (4) performs cleaning and integration into a database. In comparison to related work, DIADEM covers the whole extraction process from query, over instance and attribute recognition, through data cleansing and persistence [FGS12]. From our point of view, the OXPATH71 extraction component [Fur+11] of DIADEM additionally stands out. OXPATH extends the above discussed XPath (see 2.1.2.2) technology with browser interaction, enhanced expressivity, and precision, as well as scalability [Fur+11]. We did not further explore OXPATH, as browser interaction is outside the scope of this work.

71 http://www.oxpath.org/

2.3.4 Web Information Extraction Targeting the E-Commerce Domain

In this subsection, we provide WIE research that specifically targets the e-commerce domain72.

Jindal and Liu [JL10] target the e-commerce domain with a tree matching approach. Tree matching analyzes the DOM structure of given HTML pages, and solves the extraction task by comparing characteristics of this structure [JL10]. The focus of this work lies on nested repeated patterns, like lists, as ordinary tree matching approaches cannot cope with these [JL10]. According to the evaluation, they outperform existing approaches, including RoadRunner, significantly [JL10].
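To illustrate the core idea of tree matching, the following sketch implements the classical Simple Tree Matching scheme on small tuple-encoded trees (a generic illustration, not the nested-list extension of [JL10]); structurally similar product pages yield high scores:

def simple_tree_matching(t1, t2):
    """Count the maximum number of matching nodes between two trees encoded
    as (tag, [children...]) tuples."""
    tag1, kids1 = t1
    tag2, kids2 = t2
    if tag1 != tag2:                     # differing roots cannot be matched
        return 0
    m, n = len(kids1), len(kids2)
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # dynamic programming over the child sequences
        for j in range(1, n + 1):
            w[i][j] = max(w[i - 1][j], w[i][j - 1],
                          w[i - 1][j - 1] + simple_tree_matching(kids1[i - 1], kids2[j - 1]))
    return w[m][n] + 1                   # +1 for the matching roots

page_a = ("div", [("h1", []), ("span", [("b", [])])])
page_b = ("div", [("h1", []), ("span", [])])
print(simple_tree_matching(page_a, page_b))  # 3 matching nodes: div, h1, span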

Qiu and Yang [QY10] present an approach for extracting e-commerce data based on a page similarity measure, page clustering, and wrapper generation. In this context, their work matches the first two dimensions introduced above, while the extraction target data model is XML [QY10]. They approach the data extraction problem by (1) crawling Web pages, (2) clustering them according to HTML structure, (3) extracting wrappers out of the clustered set, and (4) applying those wrappers to novel pages [QY10]. They do not provide information on whether the initial crawled set is focussed or broad (see 2.2.2.3). On the bottom line, they achieve an average precision and recall that is roughly 15 % better than RoadRunner [QY10].

2.3.5 Ontology-Based Web Information Extraction

In this context, we would like to introduce the definition of Ontology-based Infor- mation Extraction (OBIE) according to Wimalasuriya and Dou [WD10]:

“A system that processes unstructured or semi-structured natural language text through a mechanism guided by ontologies to extract certain types of information and presents the output using ontologies.”

72 This does not include research that uses Semantic Web techniques at this point.

Our work, in contrast, targets only semi-structured text (i.e. HTML Web pages). While not guided by ontologies, it presents the output according to the needs of the GoodRelations ontology. We would like to point out that there are many approaches that target Ontology-based Information Extraction (cf. [DSW07, pp. 40-44], [WD10]). A main difference of our work is that it uses a learning set that was generated based on Semantic Web data.

In the following paragraphs, we provide a discussion of OBIE systems with a Web focus.

An early example of research in the field of WIE that targets ontology output is AMILCARE [CW03b]. AMILCARE is “an adaptive IE system designed as support to document annotation in the SW framework” [CW03b]. It strives to reach the following goals: (1) suitability for users from layman to Information Extraction expert, (2) compatibility with a broad range of texts, (3) the ability to be integrated into existing annotation workflows, and (4) the ability to cope with a reduced training set [CW03b].

Another early example of Ontology-based Web Information Extraction is KIM [Pop+03]. It uses GATE [Cun+02], a natural language processing platform, together with a proprietary upper-level ontology to represent and manage the extracted data [Cun+02]. A main goal is named-entity (NE) generation, which links to classes and instances in a semantic repository [Pop+03]. The presented upper-level ontology strives to cover common entity types like person, organization, or location [Pop+03].

The Epiphany Semantic WIE approach strives to add Semantic Web annotations, specifically RDFa (2.1.2.2), to existing Web pages [Adr+10]. It performs the task by combining Ontology-based Information Extraction and an existing dataset (e.g. DBpedia [Aue+07]). In this context, the emphasis of Epiphany lies on the semantic enrichment of textual content on the Web.

2.3.6 Semantic Web Information Extraction Approaches Targeting E-Commerce

In this section, we present approaches that have the most in common with our research.

The LIXTO [BFG01] extraction framework was already presented in 2001. Originally, it was a system that allows users to design extraction rules on the basis of a visual rendering of Web pages. Since then, LIXTO has been spun off into a company73, focussing on online market intelligence [BGH09]. We have already introduced this field as an application of structured data in e-commerce in Section 2.2.4. Recently, interactive extraction on the target Web site and data cleansing have been added to the LIXTO framework, as well as the exploitation of cloud computing to cope with Web scale extraction tasks.

An approach to use ontologies for the Information Extraction problem has been presented by Svatek [Sva06]. The work specifically targets Web product catalogues. While, as introduced in Section 2.3.5, ontologies can be used successfully to structure the WIE task, they extend their usage to so-called presentation ontologies, which make it possible to address the gap between high-level formalizations of a domain, which ontologies usually are, and low-level extraction features like XPath (see 2.1.2.2).

Pazienza et al. [Paz+03] propose the CROSSMARC platform, a system that combines language processing methods and machine learning to perform Web Information Extraction. In addition, the system aims at integrating agents that strive to automatically execute tasks on behalf of the user [Paz+03]. A further goal of the CROSSMARC framework is to achieve compatibility with the changing Web environment [Paz+03]. Different existing WIE approaches have been employed to cater for different languages, like the Whisk algorithm [Sod99] for Italian, Boosted Wrapper Induction [Kus03] for English, and STALKER (2.3.2) for Greek.

73 http://www.lixto.com

2.3.7 Novelty of Our Approach

Now that we have introduced relevant related fields, and discussed major strains, we would like to emphasize three central properties / novelties of our work.

• Work on Web Information Extraction research in general, and specifically on Web Data Extraction / Wrapper Induction, fundamentally suffers from the lack of sufficient labeled examples for supervised approaches [DSW07, p. 37]. Therefore, much of the work is unsupervised. In this context, our work is novel as we use a new source of data, namely the structured e-commerce data already available on the Web. To the best of our knowledge, there is no work that mainly exploits this direction.

• Much of the work in the Web Information Extraction research area is domain independent. At the moment, our work is restricted to the e-commerce domain. This is a consequence of our research focus and the availability of data. In Chapter 5, we argue that our approach should be applicable to other domains.

• While many approaches work on the HTML DOM structure, there is no work that exploits template similarities on the ECS level. While this approach again seems domain-specific, we argue that an extension to content management systems could cater for a broader range of domains.

We elaborate in detail on limitations and future work in Chapter 5.

2.3.8 Related Field: Web Mining

Another topic related to our research is Web Mining, especially the sub-field Semantic Web Mining. In comparison to WIE, which is largely similar to Web content mining, Web Mining additionally targets Web usage mining and Web structure mining. This section summarizes the survey of Stumme, Hotho, and Berendt [SHB06].

In this context, Web mining can be separated into three strains:

1. Web structure mining operates on links between Web pages to generate novel data. A prominent example is the PageRank algorithm that drove the success of Google in the early days (cf. [RU12, p. 3],[HG08]).

2. Web usage mining analyzes the patterns of user behavior on Web pages, for example the navigation paths, or the time spent on different pages of a site. Web usage mining has become a very popular technique, mostly driven by the market dominating application Google Analytics74. By recording user data like country, referring Web page, or browser, it provides detailed data about which users interact with a given Web page and how. It has become a first-class tool for optimizing turnover in e-commerce, as it supports A-B testing (e.g. [KHS07]), a technique that tries to optimize design decisions by comparing the performance of two different variations (a minimal numerical sketch of such a comparison is given after this list).

3. Web content mining is quite similar to Web Information Extraction described above. The main focus is mining the textual content of detail pages [SHB06]. Web content mining also includes mining of multimedia content like images or videos [SHB06]. In comparison to ordinary content mining, Web content mining can exploit the semi-structured nature of HTML Web pages [SHB06]. Applications include the broad field of information retrieval, a common task for Web search engines [SHB06].
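To make the A-B testing comparison mentioned in item 2 concrete, a minimal sketch (with assumed toy visit and order counts, not real usage data) could test whether the conversion rates of two page variants differ significantly:

from scipy.stats import chi2_contingency

# Orders vs. non-converting visits per page variant (toy numbers).
contingency = [[100, 4900],   # variant A: 100 orders out of 5000 visits
               [160, 4840]]   # variant B: 160 orders out of 5000 visits
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(p_value < 0.05)  # True -> the difference in conversion rates is significant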

If the Semantic Web vision is applied to the domain, the foundational Web mining approaches evolve as follows:

1. As already introduced in the section “Economical implications of Semantic Web-based E-Commerce” (2.2.5), if Web documents are marked up semantically, Web structure mining can operate on a much richer data source compared to the ordinary Web. On the ordinary Web, links between pages are untyped and normally enriched only by the link label, which provides merely a semantically vague description of the content that is linked. In comparison, links on the Semantic Web, which are constituted by the predicates in RDF, have a precise definition. In addition to the strong semantics, by marking

74 http://www.google.com/analytics/

up Web page data in a granular way, the density of links is multiplied. As the Web content of the Web page is transformed into structured data by Semantic Web techniques, the fields of structure mining and content mining become heavily related in this context.

2. If Semantic Web techniques are combined with Web usage mining, the main goal is to map raw usage data to application events. For instance, from a Web shop owner’s perspective, it is highly relevant which factors might harm the conversion of search and navigation into orders. That information might be covered in server log files. Therefore, a semantic layer which provides a business point of view to those files could be very useful.

3. Web content mining could be improved by Semantic Web techniques in two ways. First, ontologies can be used as a framework to manage the mining task, for instance in order to define the different manifestations of mined data. Second, ontologies can be used as target representation for the mined data. We introduced these dimensions above (see Section 2.3.5).

Our research remains strongly rooted in Web Information Extraction; while that field only describes some parts of our approach, Semantic Web Mining research provides a holistic view on it. Specifically, to generate our learning set, we mine the existing Web of Semantic Web-based E-Commerce. From a Web mining terminological perspective, we close the loop by using this learning set to drive instance learning, or ontology population, for the Semantic Web.

2.4 Big Data and Validity of the Contribution

A research area that has recently emerged and gained massive traction is Big Data. At its core, it targets the usage of data that has very high volume, velocity, or variety [Jür12]. It was triggered by data storage becoming so cheap that all generated data could be saved, resulting in vast data piles building up, with an explosion of interoperability challenges. That has created a new demand for efficient data mining. Big data can be found in a wide range of domains. Typical use cases include precisely targeted marketing, sensor-driven production planning, or health applications based on mobile phone data.

In the Big Data community, Semantic Web technologies are not widely regarded. Meanwhile, graph computing and graph databases have gained some traction. This is surprising, as the fundamental data model of graph databases is quite similar to RDF (see 2.1.2.2). Aligned with our contribution, an important problem in Big Data is extracting structured data out of unstructured data [Jür12], for instance finding the important persons in a text75. Structured data is needed to perform computations, e.g. like finding trends in news. On the bottom line, our approach could also be relevant for Big Data, as it extracts structured e-commerce data from the Web.

75 Also a fundamental problem in Information Extraction, discussed in 2.3.

3 Foundational Building Blocks

This chapter describes the fundamental building blocks that will be exploited in the further course of the thesis. The related work that is relevant in the respective parts is discussed in-place for a better flow of argument. The chapter contains three main parts according to the research foundation we introduced in Chapter 1. Fig. 3.1 provides an overview of the three main parts of the groundwork, essentially the dominance of six ECS covering more than 90 % of the offering pages on the Web, the HTML pattern similarity these ECS expose, and the existing structured data on the Web of e-commerce.

Figure 3.1: Research foundation: the novel extraction approach for e-commerce combines existing structured data in e-commerce, six ECS generating more than 90 % of product pages, and patterns in product pages generated by ECS


3.1 Impact of E-Commerce Systems on the Availability of Structured Data in E-Commerce

This part of the groundwork shows that a few e-commerce systems (ECS) dominate the market in a way that a high share of product detail pages on the Web are generated by only six different systems1. This is a main building block for the further course of the research, as it allows us to cover a significant amount of the e-commerce Web by only constructing extractors for a few e-commerce systems.

3.1.1 Related Work

In this section, we summarize work related to this part of the groundwork, which can be grouped into (1) market studies and (2) functional comparisons:

3.1.1.1 Market Studies

Due to the very dynamic nature of the field, it is unfortunately inevitable to refer to Web resources for some figures. For instance, Raju [Raj12] estimates that in 2012, there were 90.500 shops in the US earning more than $12.000 a year. Since 2011, Robertshaw [Rob12] has been conducting a semi-annual analysis of the market shares of ECS. According to his results, the eleven biggest ECS account for more than 80 % of all sites. The service Builtwith.com provides ongoing reports of the popularity of a wide range of Web technologies, including e-commerce packages [bui13]. Unfortunately, the site only delivers relative market share data of the ten most popular e-commerce packages with respect to the top one million sites sample, which is of limited value for our research. Note that all these reports do not take into account the size of the deep link or product page part of shop sites but merely count the sites directly.

1 This section is based to a large extent on the previously published paper Stoll, Ge, and Hepp [SGH13], as declared in Section 1.7.

3.1.1.2 Functional Comparisons

Beside the market studies, there are many analyst publications targeting ECS, mostly aimed at corporate audiences. Those publications put a stronger focus on the comparison of features and strategic properties of the regarded systems than, e.g., on the number of deployments. For instance, in 2011, Alvarez et al. provided a report that maps different ECS into four clusters. The criteria for inclusion in the report are defined in a way that they exclude lower end solutions, which may account for a substantial amount of Web shops in the long tail. In 2012, Walker provided a similar report ranking different solutions in terms of offering and strategic position, also excluding lower end solutions [Wal12].

3.1.2 Understanding the Impact of E-Commerce Software on the Adoption of Structured Data on the Web

The basic rationale for this part of our research is the following: From our previous development of extension modules, we know that it is possible to modify respective shop software packages to automatically add the publication of structured data based on GoodRelations, in a manner that (1) requires only minimal configuration effort for the site owner, and (2) is tolerant with regard to modifications of the stylesheets, themes, and HTML templates, or the installation of other modules. Our next goal was to add respective functionality to the core codebase of popular e-commerce packages so that the adoption of structured data no longer depends on the manual installation of such extensions. If we succeeded with that, a large number of shop sites would automatically add GoodRelations markup once they are updated to the next version of the system. This promises to be a huge lever for the implementation of the Semantic Web vision for e-commerce. Given that we have limited resources for implementing the idea, we need to know (1) how many ECS we should target, and (2) which coverage we can achieve on the level of product detail pages. We expect the number of those pages to be Pareto-distributed, similar to, e.g., firm sizes. Thus, covering a minor share of systems with structured data may have a high impact on the overall coverage of the market.

Figure 3.2: Effect of enabling structured data for an e-commerce system on product pages (layers shown: global layer, e-commerce system layer with standard and custom systems, Web shop layer, product range layer)

There are four layers of interest in this context, as Fig. 3.2 illustrates.

• The global layer, which represents all shops on the Web.

• The e-commerce system layer, which is divided into Web shops running standard e-commerce packages, and custom ECS based on proprietary software. In our work, we focus on standardized ECS, such as Magento, ATG, or Prestashop.

• The third layer is the Web shop layer, constituted by the actual shop sites that run a specific e-commerce package, or a proprietary implementation.

• The fourth layer is the product range layer. It consists of all the product detail pages hosted by a particular shop site and system.

In the context of this thesis, the same analysis is essential for projecting the number of ECS for which we need to produce an automated extraction mechanism in order to extract structured data for a major share of detail pages of Web shops.

Our research approach consists of the following steps:

1. Obtaining a list of relevant site URIs: Since we cannot analyze the Web as a whole, we need a subset of URIs representing Web site main pages to start with. Roughly speaking, this is a list of Web sites, but not limited to

e-commerce sites. We will screen them for e-commerce functionality in the subsequent steps. For our analysis, we take the freely provided Alexa Top one million traffic rank [Ama13]. This gives us the URIs of the main pages of the one million most popular sites.

2. Defining the shop software packages to search for: As a second step, we need a list of relevant e-commerce software applications. For that purpose, we merged the top 40 list provided by Robertshaw [Rob12], the top 10 list from Builtwith.com [bui13], and the systems mentioned in the reports by Gartner [Alv+11] and Forrester [Wal12], resulting in a list of 56 search strings for ECS. This list is shown in Table 3.1.

3. Determining whether an URI represents a Web shop using one of the systems from our list: To get a hold of URIs that are run by specific ECS, we used the tool Whatweb [Hor13], originally a site profile scanner from the context of computer security. Whatweb is able to detect a wide range of properties of a Web site, including e-commerce functionality. We then matched the results against the list of search strings.

4. Counting product item pages based on sitemaps: Next, we need to estimate the number of product detail pages for each shop, which is a non-trivial challenge. As an approximation, we used XML sitemaps [GYC08] of the shop sites, if available. In this context, we assumed the remaining sites to be a sufficient sample of the base population. We then conducted a cluster analysis on the sitemap properties to find the ratio of product item pages on a Web shop. We could show that product item pages and overall sitemap pages correlate. Thus, we use the URI counts based on sitemaps in combination with the average share of product detail pages within a site as an approximation of the number of products per shop.

5. Extrapolation of the product item count to Web scale: In order to predict the impact of equipping ECS with structured data markup, we project our results on the total number of shops in the population.

Table 3.1: Consolidated list of search strings for the 56 e-commerce systems under consideration

E-Commerce Systems (ECS) magento, zen cart, virtuemart, oscommerce, prestashop, opencart, volusion, Yahoo! stores, interspire, ubercart, wp e-commerce, ecshop, actinic, miva merchant, shopify, cs-cart, ibm websphere commerce, xcart, oxid esales, 3dcart, atg, demandware, ejunkie, intershop, shopp, ablecommerce, nopcommerce, prostores, shopsite, foxycart, big cartel, ekmpowershop, gsi commerce, shopfactory, cubecart, romancart, tomatocart, drupal commerce, blucommerce, lemonstand, thefind upfront, google trusted store, cleverbridge, elastic path, icongo, jagged peak, marketlive, microsoft commerce server, netsuite, istore, venda, micros-retail, redprairie, digital river, sap e-commerce, xt-commerce

6. Evaluation of the e-commerce system detection: As the e-commerce system detection provided by Whatweb is a critical part of our analysis, we additionally evaluate its performance on a sample of n=550 URIs with human computation.

3.1.3 Implementation

In the following sections, we discuss the implementation details of this part of the thesis.

3.1.3.1 Obtaining a List of Relevant Site URIs

The initial input to our study is the top one million list of Alexa [Ama13]. Alexa analyzes Web site popularity. There is a monthly global ranking of top-level domains according to traffic (the “Top1m list”), provided for free in a CSV format. We worked on the 09/2012 release. To understand which e-commerce packages are used for the Web shops in the Top1m list, we employed the tool Whatweb [Hor13]. Whatweb is an open-source security scanner, written in the programming language Ruby. Among other site characteristics, it detects server software, content management systems, and ECS. Applying a tool like Whatweb on such a large amount of URIs is computationally expensive. Thus, we used cloud computing resources. While a common pattern is to

distribute the task on many cloud instances, we found that running it parallelized on a single powerful machine was sufficient, and resulted in the smallest overhead. We used the Amazon EC2 Cluster Compute Eight Extra Large cloud computing instance (cc2.8xlarge)2. We distributed four threads on each of the 16 cores of the machine using GNU parallel [Tan11] with one line of code, which can be found in Listing 3.1. Running the task took eight hours and 32 minutes, resulting in server costs of 19.20 $ for the given one million URIs.

cat 1m.csv | parallel --jobs 64 ruby whatWeb.rb > 1m.txt

Listing 3.1: Parallelization with GNU parallel

To get the subset of results related to the e-commerce packages of interest, we merged the top 40 list provided by Robertshaw, the top 10 of Builtwith.com and the leading systems from the Gartner and Forrester reports, as already mentioned. The merged list of 56 search strings is given in Table 3.1 above.

For consistency, in the tables and figures of the remainder of the section, we use the original lower case spellings of the search strings. We ran the list against the Whatweb results using a small script, matching the search strings against the Whatweb result file. It is important to stress that, to a certain degree, this approach is also able to detect shop systems even if there was no specific Whatweb plugin beforehand, as there are often strings hinting at shop systems in parts of the Whatweb results (e.g. cookies or HTTP headers) that were not targeted by the original server detection plugins.
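The matching step itself boils down to case-insensitive substring search; a minimal sketch of such a script could look as follows (the file name follows Listing 3.1 and the excerpt of search strings follows Table 3.1, but the exact script used in the experiment may differ):

from collections import Counter

SEARCH_STRINGS = ["magento", "zen cart", "virtuemart", "oscommerce",
                  "prestashop", "opencart", "cs-cart", "atg"]  # excerpt of Table 3.1

counts = Counter()
with open("1m.txt", encoding="utf-8", errors="ignore") as results:
    for line in results:                  # one Whatweb result line per site
        lowered = line.lower()
        for ecs in SEARCH_STRINGS:
            if ecs in lowered:            # detection string found anywhere in the output
                counts[ecs] += 1
print(counts.most_common())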

3.1.3.2 Counting Product Pages Based on XML Sitemaps

After fetching and parsing the sitemaps, we went on to get hold of product item pages. This number is important, as Web shops usually provide, besides the (1) product detail pages, also (2) category pages, (3) review pages, and (4) pages about payment and shipping options, to name only a few. In order to assess the number

2 http://aws.amazon.com/en/ec2/#instance

of product detail pages in a given Web shop, which hold the actually interesting information for our extraction task, we assumed that the product item page count should be correlated to the total URI count of the XML sitemap. To validate this, we conducted a k-means (e.g. [Mac+67]) cluster analysis on the properties of each entry of the sitemap of a sample of 7169 randomly selected shops (out of the shops detected in the Alexa population), using Scikit-learn [Ped+11]. We set the number of clusters to three, as we assumed there would be a cluster of product pages, category pages, and arbitrary pages. In pre-processing, we filtered URIs linking to images, and generated a property that indicates the existence of the string “product” in the server path. For the further analysis, we only took the product, priority, and lastmod properties into account. The resulting clusters were filtered to have a silhouette coefficient (e.g. [Rou87]) of at least 0.6, and the relative size of the biggest cluster had to be between 0.6 and 0.9 of the number of entries in the sitemap, as we considered only sitemaps matching these thresholds as valuable sitemaps matching our initial assumptions. The silhouette coefficient was used as the measure of how well the clusters separate the data. The relative size reflected that our product page cluster should be the biggest one. We then computed Pearson’s correlation between the biggest cluster and the total sitemap page count as a proxy of the relation between product pages and the total sitemap count. This resulted in a value of 0.879, indicating a strong correlation. Additionally, we computed a final correction factor that represents the mean difference between URIs found in a sitemap and its biggest cluster. The result is 0.774, with a 95 % confidence interval of 0.759 to 0.790. Thus, in 95 % of the cases, there will be between 759 and 790 product item pages per 1000 URIs in an XML sitemap.
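The following sketch outlines this pipeline with scikit-learn and SciPy for a single shop plus the final correlation step (a simplified illustration; the feature extraction from sitemap entries, the toy numbers, and the variable names are assumptions, not the original analysis script):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.stats import pearsonr

def biggest_cluster_size(features):
    """Cluster the entries of one shop's sitemap (product flag, priority,
    lastmod-derived value per URI) and return the size of the biggest cluster,
    or None if the sitemap does not match the filtering thresholds."""
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
    if silhouette_score(features, labels) < 0.6:   # clusters must be well separated
        return None
    biggest = np.bincount(labels).max()
    ratio = biggest / len(features)
    return biggest if 0.6 <= ratio <= 0.9 else None

# Collected over all accepted shops (toy values, not the real sample):
biggest_clusters = np.array([780, 1520, 430])
sitemap_sizes = np.array([1000, 2000, 560])
r, _ = pearsonr(biggest_clusters, sitemap_sizes)           # correlation analogous to 0.879
correction_factor = (biggest_clusters / sitemap_sizes).mean()  # analogous to 0.774
print(r, correction_factor)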

3.1.4 Results

In the following sections, we discuss the results.

3.1.4.1 Summary

Number of Products per Site: According to the correlation analysis conducted in 3.1.3.2, we can take the URIs listed in an XML sitemap as an estimate for the number of product detail pages. The analysis of the XML sitemaps gives preliminary hints that the market for ECS is Pareto-distributed with regard to the number of product detail pages, i.e. at the level of deep links. The six systems leading the URI count represent more than 90 % of all URIs. The respective results are shown in Table 3.2 and Table 3.3. Overall, 23.32 million URIs could be extracted from the XML sitemaps. If we apply the correction factor of 0.774 (see 3.1.3.2), this projects to roughly 18 million product item pages for the population of the one million sites from Alexa.
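As a small worked example of this projection step (the figures are taken from the analysis above and from Table 3.2), scaling the URI counts by the correction factor yields the projected product item page counts:

# Projection of product item pages from sitemap URI counts.
factor = 0.774                      # empirical share of product item pages

total_uris = 23_328_131             # URIs extracted from all e-commerce sitemaps
magento_uris = 12_610_254           # URIs found in Magento sitemaps (Table 3.2)

print(round(total_uris * factor))   # roughly 18.06 million projected product item pages
print(round(magento_uris * factor)) # 9.760.337, the Magento row of Table 3.2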

Table 3.2: URIs found in e-commerce sitemaps from one million Alexa sites and product item estimate, results (absolute)

Shop software   URIs         Lower boundary of     Projected # of product    Upper boundary of
                             95 % conf. interval   item pages (n * 0.774)    95 % conf. interval
Magento         12.610.254    9.721.324             9.760.337                 9.803.393
ATG              3.016.552    2.190.263             2.334.811                 2.522.813
PrestaShop       2.756.334    2.120.100             2.133.403                 2.150.217
osCommerce       1.597.558    1.221.927             1.236.510                 1.253.693
Zen Cart           769.947      583.229               595.939                   609.573
CS-Cart            524.695      400.594               406.114                   415.041
VirtueMart         508.310      384.969               393.432                   402.558
Others           1.544.481    1.183.660             1.195.428                 1.210.618
Total           23.328.131   17.806.066            18.055.974                18.367.906

Additionally, we visualized the findings using a box plot (e.g. [MTL78]), as shown in Fig. 3.3. For a higher expressiveness of the plot, we filtered the systems to product page counts within 0.5 standard deviations of each system’s distribution, i.e. we filtered out extremely large (and small) sites. The resulting set of eight systems stems from an additional filter considering only systems with more than 50 detected shops. The boxes show the 50 % quantile of the distributions after applying the filter above, with the line in the box marking the median. The lines above and below the boxes are the whiskers; they show the remaining upper and lower 25 % quantiles. Outliers are plotted as crosses. We can see that Demandware

Table 3.3: URIs found in sitemaps and product item estimate, results (relative)

Shop software   % of all products   Aggregated % of products
Magento         54.06               54.06
ATG             12.93               66.99
PrestaShop      11.82               78.81
osCommerce       6.85               85.66
Zen Cart         3.30               88.96
CS-Cart          2.25               91.21
VirtueMart       2.18               93.39
Others           6.62               100

Figure 3.3: Distribution of the number of product pages per shop software package

aggregates a high number of URIs and additionally has high top 25 % whiskers, whereas Virtuemart and Zen Cart do not. As the median of the distributions is mostly located considerably towards the bottom of the 50 % box, these systems show a positive skew towards a low number of products. This also matches our informal experience with maintaining various shop extensions.

Market Structure and Popularity: The initial Whatweb experiment resulted in 912.865 successful responses. Overall, 21.367 shops could be detected in the sample. The frequency count of the different ECS is shown in Table ??. Only five ECS cover more than 80 % of the regarded sample, and Magento leads the results with 35.48 %.

We have to stress that this data comes with two caveats. First, the Alexa dataset may contain a lower share of Web shops than the Web as a whole: among the top one million sites, there could be a few larger shops and many popular informational sites, like news, fora, etc., whereas the long tail of specialized shops might hardly be represented in that dataset. For instance, the GoodRelations project page is only ranked 344,917th, despite being prominently referenced by major search engines.

Second, it may be that some sites in the datasets are actually shops or contain shop functionality, but are not detected as such.

3.1.4.2 Impact of E-Commerce Software on the Adoption of Structured Data

Based on the tentative findings regarding the number of product detail pages and the market structure, we assume that adding structured data to the core codebases of only a handful of ECS would already augment nearly 90 % of the product detail pages found in the sample with structured data markup. The systems with the highest impact would be Magento, osCommerce, ATG, Zen Cart, Prestashop, EC-Shop, and Virtuemart. Except for ATG and EC-Shop, GoodRelations extension modules are available for all of these [Hep13]. However, only a small share of shops actually uses those extensions. The approach described in this thesis could help to bridge this gap without the need to increase the adoption of such extension modules.

3.1.4.3 Site Popularity

We analyzed the popularity of sites generated by specific ECS in terms of the Alexa traffic ranking. Here, it is of interest which shop systems tend to be more present among high-traffic sites, and which are not. To answer this question, we chose to use the mean of each shop system’s ranking distribution (AX-mean). To make the result more transparent, we provide an additional variable AX-factor, defined by dividing 500.000, i.e. the middle rank of the Alexa traffic ranking, by the mean of each shop system. A higher value means higher-ranking sites on average. Most shop systems

Table 3.4: Precision of the shop detection technique - Demandware - Prestashop

            Demandware   3DCart   ShopSite   Magento   Zen Cart   PrestaShop
Precision   1            0.99     0.98       0.97      0.95       0.95

scored an AX-factor below one, which means that most of them are positioned in the lower ranks of the Alexa Top 1M sites. Only Demandware, E-junkie, ekmPowershop, Intershop, and Ubercart yielded values above 1.25, indicating that they are used by highly popular shops. A possible explanation is that really large shop applications either use proprietary code or employ technology components like load balancers that make the detection of the underlying e-commerce system hard.

3.1.5 Evaluation

The e-commerce system detection is a critical part of our approach. We decided to assess the performance of the method using human computation via the service Crowdflower3, which is an intermediary providing access to a variety of human computation services through a standardized interface. We use precision (e.g. [MRS08, p. 155]) to evaluate the performance, as we cannot measure recall, because our approach is limited to the aforementioned list of systems. We set up a task for humans to decide whether a given URI is an e-commerce site or not. Thus, the experiment provides insight into whether the list of shop URIs actually contains shop sites. We ran the experiment for eleven ECS and presented a list of 50 randomly selected URIs to the human participants, which resulted in 550 items to be judged. According to the evaluation, the shop detection approach achieved a mean precision of 96 %, i.e. the shops detected by Whatweb are actually shops. The systems we analyzed yielded a precision between 92 % (e.g. ATG, Virtuemart) and 100 % (Demandware). We show the results in Table 3.4 and 3.5.

3 http://www.crowdflower.com

Table 3.5: Precision of the shop detection technique - EC-SHOP - mean

            EC-SHOP   CS-Cart   osCommerce   ATG    VirtueMart   Mean
Precision   0.95      0.95      0.94         0.92   0.92         0.96

3.1.6 Discussion and Limitations

This part of the groundwork is subject to the following limitations:

1. Using the Alexa Top1m as the basis for the data collection induces a bias towards popular sites. As future work, we plan to run Whatweb against the data of CommonCrawl [Com13], a public crawl of a substantial part of the Web. This would mitigate the bias towards popular sites and better represent the long tail of the Web.

2. We used Whatweb as it is, without additions to the plugins or constraining functionality. Improving the plugins could have resulted in higher performance in the site recognition process, but the overall result of our research does not depend on marginal performance improvements of the underlying data collection. Constraining the functionality of Whatweb in terms of excluding detection features would have resulted in a lower computational effort, but we would have lost additional data, which can be explored in future work.

3. E-commerce software missing in our initial links, additional components like load balancing tools, or weaknesses in the recall of our detection technique may account for a significant number of sites incorrectly excluded from our analysis. This may reflect a fundamental limitation of our quantitative results, unless the shop sites properly detected are a sufficiently representative sample of the overall situation.

4. Another shortcoming might be the reliability of the string search over the results of Whatweb in order to detect the different shop systems.

5. The approach of using XML sitemaps to estimate the number of deep product detail pages is a limited technique. Many sites do not provide XML sitemaps,

and the XML sitemaps provided may list only a subset of the actually available product item URIs. Alternative approaches for counting the number of product detail pages would be (1) deep crawling or (2) counting the pages indexed by Google.

3.1.7 Conclusion

In this section, we have provided evidence that standardized Web-shop software accounts for a large share of product detail pages, which justifies our approach of combining HTML template similarity with the detection of the underlying software of a shop site for automated information extraction.

3.2 E-Commerce System Identification Based on Sparse Features

This part of the groundwork plays two important roles for the argumentation4. First, it demonstrates that there are distinguishing patterns in the markup of Web offering pages that allow the identification of e-commerce systems (ECS). Those patterns are a fundamental premise for the further course of the thesis. Second, it acts as a key module for the final extraction system, as the ECS has to be identified before a suitable extractor can be applied.

3.2.1 Related Work

There are two dimensions of work related to this section. The problem is situated in Web page classification (see 2.3), whereas the methods we use are in the supervised classification subfield of Machine Learning (e.g. [Kot07]).

4 This section is based to a large extent on the already published paper [SH14], as declared in Section 1.7.

3.2.1.1 Web Page Classification

Fundamental Problems: According to Qi and Davison [QD09], who provide an encompassing survey on the topic, Web page classification aims at assigning predefined category labels to Web pages. Web page classification can be divided into three subtopics. (1) Subject classification aims at detecting the topic of a Web page, e.g. ’sports’, ’culture’, or ’business’. (2) Functional classification tries to identify the role of a Web page, e.g. ’Business home page’, ’Personal blog entry’, or ’Web shop product page’. (3) Sentiment classification targets opinions and attitudes conveyed with a page. As our approach does not fit into any of the aforementioned subtopics, we propose to classify it as (4) generator classification.

A further dimension that characterizes Web page classification is the number of classes [QD09]. As we aim to detect six different ECS, our research addresses a multi-class problem. As only one ECS can be assigned to a given page, we handle a single-label classification. As classification, in our case, allows instances to be either in a class or not, we operate in the field of hard classification. Finally, we operate on a flat classification problem, as the labels show no hierarchical order.

Qi and Davison regard Web directories, search engines, question answering systems, and focused crawlers as important applications of Web page classification, and mention Web content filtering, assisted Web browsing, and knowledge base construction as less important [QD09]. Our work is best assigned to knowledge base construction, as our final aim is to generate additional structured data for e-commerce. In structured data research, data expressed according to ontologies is also commonly referred to as a knowledge base (e.g. [Hep08b]). Finally, it can be argued that the work should rather be subsumed under Web site classification. We think the aforementioned criteria are suited nevertheless, as we execute the classification by operating on a single page and extend our findings to the whole Web site. In this context, we assume that a Web shop is always generated by a single ECS.

Web Page Classification in E-Commerce: Narrowed down to the e-commerce domain, there are two non-scientific sources that perform generator classification. (1) Since 2011, Robertshaw has conducted a semi-annual analysis of the market share of ECS [Rob12]. (2) Builtwith.com generates ECS statistics on a daily basis [bui13]. As both sell the resulting data, they do not disclose their methods of detection. We provide a scientific overview of ECS market share on a product page level, and its impact on the adoption of structured data on the Web, in Section 3.3. Beside the classification of ECS, the classification of different page types is a significant problem when trying to extract structured data in e-commerce. We identified (1) product pages, (2) category pages, and (3) arbitrary pages as fundamental categories in the Web shop domain in Section 3.1. To the best of our knowledge, there is no scientific work on automatically labeling shop pages with those categories.

3.2.1.2 Supervised Classification

Supervised Machine Learning operates on problems where a learning set with given labels is provided (e.g. [Kot07]). The discrete labeling problem is called classification, whereas the continuous labeling problem is called regression (e.g. [Kot07]). The learning problem in supervised classification is characterized by $i$ feature vectors $\vec{v}_i$ that are labeled by $j$ classes $c_j$. The goal is to label novel vectors correctly, yielding high precision and recall (cf. [MRS08, p. 155]).

We provide a general overview of supervised Machine Learning in Fig. 3.4, taken from Kotsiantis [Kot07].

Vectorization: Tf-idf Term Weighting: As categorical string features cannot be used directly in classification algorithms, they have to be transformed into numerical vectors. A common approach to this task is to count the frequency of terms.

[Figure 3.4 depicts the general supervised Machine Learning workflow as a flowchart: (1) problem, (2) identification of required data, (3) data pre-processing, (4) definition of training set, (5) algorithm selection, (6) training, (7) evaluation with test set, (8) check whether the result is acceptable; if not, (9) parameter tuning and renewed training, otherwise (10) the final classifier.]

Figure 3.4: Supervised Machine Learning: General approach, based on [Kot07]

Contrary to our aim, however, this would emphasize terms that occur often. Instead, we assume that high discriminative power emerges from the terms occurring rarely. This problem has been addressed by the term frequency - inverse document frequency (tf-idf) approach by Jones [Jon72], which we use to transform the string features into vectors. As the word count in an instance increases, the tf-idf value increases proportionally, but it is corrected by an offset reflecting the occurrence across the aggregated instances.
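As a minimal illustration of this vectorization step, the following sketch uses the scikit-learn tf-idf vectorizer (module paths follow a current scikit-learn release); the two documents are invented stand-ins for per-page strings of 'class'/'id' attribute values.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "product-name price-box add-to-cart",   # e.g. a Magento-like page
    "product_list cart_block our_price",    # e.g. a Prestashop-like page
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())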

Classification algorithms: In the next subsection, we briefly introduce the selected classification algorithms. We also introduce common abbreviations to refer to the algorithms.

1. NC In nearest centroid classifiers, the centroid reflects the vector average of the class members [MRS08, p. 292]. Bhatia and Vandana [BV10] provide a survey about k-nearest neighbors algorithms.

2. SGD Stochastic gradient descent improves the model by sequentially analyzing instances [Owe+11, p. 274]. It operates with special optimization methods that allow early convergence (e.g. [Owe+11, p. 274]). Recently, it has drawn more attention through a publication by Zhang [Zha04], which emphasized its power for large scale learning.

3. SVM Support-Vector Machines have been introduced by Cortes and Vapnik [CV95]. SVMs maximize the margin between a separating high-dimensional hyperplane and the given features.

4. DTREE Decision tree learning has originally been introduced by Quinlan [Qui86]. It generates class assignments by sorting features into trees [Kot07]. In comparison to other classification algorithms, which often act as black boxes, an important feature of decision trees is that they can be easily inspected by humans [Kot07].

5. RF Random forests have been introduced by Breiman [Bre01] and belong to ensemble methods. Ensemble methods employ majority votes of multiple classifiers to raise performance [Bre01].

6. XTREE Extremely randomized trees [GEW06] evolve the approach of random forests. They mainly differ in splitting nodes fully at random and in using the whole learning sample instead of bootstrap replicas [GEW06].

3.2.2 Methodology, Approach, and Implementation

In the following sections, we provide details regarding the methodology, approach, and implementation of the component for detecting the ECS behind a Web site on the basis of a supervised classification of Web pages from the site.

3.2.2.1 Overview

We want to use supervised classification for detecting the ECS behind a Web shop site, in order to be able to apply the proper extractor. We assume that certain

aspects of the HTML source code of a small number of randomly chosen pages from a given site might be sufficient for this task. We may also include additional signals from the site, like HTTP response header parameters. As a starting point, we strip off all content from the HTML document and obtain an empty tree of HTML elements and their attributes. We then decide which subset of this empty tree we use as features for the classification task, e.g. all or a subset of element names, sequences (n-grams) of elements, or values from element attributes, like @id.
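A sketch of reducing a page to such an "empty tree" of elements and attributes is given below, here realized with lxml; this is one possible realization under our own naming, not necessarily the original code.

from lxml import html

def empty_tree(raw_html):
    """Drop all text content, keeping only tags and their attributes."""
    tree = html.fromstring(raw_html)
    for element in tree.iter():
        element.text = None
        element.tail = None
    return html.tostring(tree, pretty_print=True).decode()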

3.2.2.2 Design Rationales

When using supervised machine-learning-based classification systems, an important task for humans is feature engineering. With the initial hypothesis of using the ’class’ and ’id’ properties of HTML tags, we ran into the curse of dimensionality of Machine Learning. For Machine Learning, the curse of dimensionality was first described by Hughes [Hug68]. It states that, for a given amount of samples, the performance of a classifier decreases with increasing dimensions [Hug68]. Regarding our data, the phenomenon is triggered by a high variety of ’class’ and ’id’ values. We first tried to reduce dimensions with manually crafted blacklists to exclude noisy terms, e.g. those related to document styling, and by automatically filtering terms of very high and very low frequency. While this yielded little improvement, it showed that features filtered by a short domain-specific white-list are (a) sufficient to achieve significant classification performance and (b) a good way to mitigate the curse of dimensionality.

3.2.2.3 Generating Datasets and Preprocessing

Classificator Design Dataset: To generate the learning data, we used the sitemaps [GYC08] downloaded in Section 3.1. We selected CS-Cart, Magento, Prestashop, Virtuemart, XT-Commerce, and Zencart, as they provided a significant amount of training data, and were the most important ECS in terms of product pages,

Table 3.6: Learning set instances by ECS

System        Instances
CS-Cart       460
Magento       234
Prestashop    2205
Virtuemart    431
XT-Commerce   684
Zen-Cart      458
Sum           4472

according to Section 3.1. We provide an overview of the instances per ECS in the learning set in Table 3.6. As a first step, of all available URIs, we eliminated those containing the string ’blog’, as they would dilute the data, and those containing ’.png’ / ’.jpg’, as images are out of scope. To maximize the entropy of the HTML pages, we randomly selected three URIs per site, downloaded them, and cleaned the resulting directory by excluding files smaller than 2 KB, since manual inspection showed these were most often error pages. A learning set of 4472 HTML files remained.
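A sketch of this pre-processing is shown below; the data structures, function names, and thresholds follow the text but are our own illustration.

import os
import random

def candidate_uris(uris):
    """Drop blog and image URIs before sampling."""
    return [u for u in uris
            if 'blog' not in u and not u.lower().endswith(('.png', '.jpg'))]

def sample_per_site(uris_by_site, k=3):
    """Randomly pick up to k URIs per site to maximize template variety."""
    picks = []
    for site, uris in uris_by_site.items():
        pool = candidate_uris(uris)
        picks.extend(random.sample(pool, min(k, len(pool))))
    return picks

def is_probably_error_page(path, min_bytes=2048):
    """Files below 2 KB were most often error pages in manual inspection."""
    return os.path.getsize(path) < min_bytes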

Additional Evaluation Dataset: In addition to the usual split of the data into training and test set, we evaluated our results on entirely different data, which was extracted from GR-Notify [Sto10]. GR-Notify is a Web service that allows Web shop owners to register their sites after they have implemented GoodRelations (see also Section 3.3). The shop extensions for Magento and Virtuemart automatically submit data. Additionally, a frontend to manually submit Web shops exists. By default, the generating ECS is included in the submission. Therefore, GR-Notify provides a reliable external evaluation set for the trained classifier. From the GR-Notify data, we could download 3461 HTML files labeled with Magento, Prestashop, and Virtuemart. Other systems were not included, as they were not significantly represented in the GR-Notify dataset.

3.2.2.4 Building a Classifier

Feature Generation and White-Listing: We assume that pages generated by distinct ECS show characteristic patterns in the values of the HTML tag attributes ’class’ and ’id’; for instance, we observed that Magento and Prestashop each emit characteristic attribute values on their offering pages. On the given data, however, this approach generates a feature set with a dimensionality in the lower 10^5 range. As introduced in Section 3.2.2.2, this results in limitations to the computability. There are automated methods to reduce the dimensions in a learning problem, e.g. PCA (e.g. [Pea01]). However, we chose a domain-specific white-list approach to mitigate the problem. We reduced the feature set of attributes to only those that contain, but do not exactly match, the strings ’price’, ’product’, and ’cart’. We generated the white-list by starting with a manual collection of sensible terms, iteratively reducing those, and reviewing the resulting classification performance. The final three terms were of high discriminatory power. In Table 3.7, we show the recall_base related to the 4472 downloaded files after applying the white-list to the feature sets ’class’, ’id’, and the combination ’class+id’, as well as the number of instances.
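The white-list rule can be sketched as follows; per the text, a value qualifies if it contains, but does not exactly match, one of the three terms (the function name is ours).

WHITE_LIST = ('price', 'product', 'cart')

def whitelist_filter(attribute_values):
    return [v for v in attribute_values
            if any(term in v.lower() and v.lower() != term for term in WHITE_LIST)]

# Example: 'product-name' and 'our_price' are kept, 'product' and 'header' are dropped.
print(whitelist_filter(['product-name', 'product', 'header', 'our_price']))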

Table 3.7: Remaining recall-base after white list filtering

              class   id      class+id
recall_base   0.849   0.811   0.875
Instances     3798    3628    3915

Classification Algorithms: We split the data into a 60 % learning set and a 40 % test set. We trained the six classification algorithms (NC, SGD, SVM, DTREE, RF, XTREE) discussed in Section 3.2.1.2. Combining them with the three different feature sets resulted in an experiment size of 18 different combinations. It is important to state that, for ECS detection, we analyzed only one HTML page per site. That is because we assumed that using multiple pages per site would not have generated additional variance, which generally is considered helpful for Machine Learning tasks. We compile our experimental design in Fig. 3.5.

[Figure 3.5 outlines the experimental design: from the learning set, the feature sets class, id, and class+id are generated, white-listed, and null-filtered (yielding the base recall); the remaining instances are split into a 60 % training set and a 40 % test set, the six classification algorithms (NC, SVM, SGD, RF, DTREE, XTREE) are fitted, and their F1-scores are combined with the base recall into F1_all.]

Figure 3.5: Overview of experimental design

Performance Metric: To assess the final performance of a feature set / classifier combination, we modified the common F1-score by integrating the loss incurred during feature generation.

\[ F1_{all} = 2 \cdot \frac{precision \cdot (recall_{base} \cdot recall_{classifier})}{precision + (recall_{base} \cdot recall_{classifier})} \]
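Transcribed directly into a small helper (our naming), the metric reads:

def f1_all(precision, recall_base, recall_classifier):
    """F1-score corrected by the recall lost during feature generation."""
    recall = recall_base * recall_classifier
    return 2 * precision * recall / (precision + recall)

# e.g. f1_all(0.97, 0.875, 0.97) is roughly 0.90, in line with the scores reported below.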

3.2.2.5 Implementation

The experiment was implemented in the Python5 programming language. The learning set was generated by a small script that drew sample URIs based on the sitemaps discussed above and downloaded those asynchronously with the library grequests6.

Vectorization, classificator application, and evaluation were realized with the library Scikit-learn [Ped+11]. The features were generated by applying regular expressions to the HTML files, yielding lists of all values of the attribute in regard. We then excluded the instances that yielded no features after white-list filtering. Before training and testing the algorithms, to prevent overfitting, each dataset was split 0.6 / 0.4 into a training and a testing set.
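A hedged sketch of this pipeline is given below: regex-based attribute extraction, tf-idf vectorization, the 60/40 split, and one of the tree ensembles. Module paths follow a current scikit-learn release, not necessarily the version used at the time, and the helper names are ours.

import re
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

ATTR_RE = re.compile(r'(?:class|id)="([^"]*)"')   # class+id feature set

def page_features(html_source):
    """All 'class'/'id' attribute values of a page, concatenated into one document."""
    return ' '.join(ATTR_RE.findall(html_source))

def train_ecs_classifier(html_sources, labels):
    docs = [page_features(s) for s in html_sources]
    X = TfidfVectorizer().fit_transform(docs)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4)
    clf = ExtraTreesClassifier().fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)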

5 http://www.python.org/
6 https://github.com/kennethreitz/grequests

The results have been visualized with Matplotlib [Hun07]. Apart from the learning set generator, which did not benefit from an interactive environment, we employed iPython notebook [PG07] to prototype in an agile manner. Additionally, the Pandas library has been used to manipulate matrix data [Mck11]. These technologies are discussed in more detail in Section 4.2.1.

The Web service is based on the class+id / XTREE classifier. We introduced a probability threshold of at least 0.6 to cut out predictions that were too ambiguous, providing the user of the service with a warning instead. To put the speed results discussed later into perspective, a 2012 Mac Mini equipped with a quad core 2.3 GHz Core i7 CPU, 8 GB of RAM, and an SSD hard disk was used, scoring a 32-bit geekbench7 of 10823.
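The threshold logic can be sketched as follows (our naming; the actual service code may differ):

import numpy as np

def classify_or_warn(clf, X, threshold=0.6):
    """Return the predicted ECS label, or None (i.e. a warning) if the top probability is below the threshold."""
    proba = clf.predict_proba(X)
    best = proba.argmax(axis=1)
    return [clf.classes_[i] if row[i] >= threshold else None
            for i, row in zip(best, proba)]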

3.2.3 Results

We provide the results of the experiment in the following sections.

3.2.3.1 Feature Set and Algorithm Performance

Table 3.8 shows the F1_all-score introduced above for the 18 feature / algorithm combinations. We additionally provide a heat map in Fig. 3.6.

We can see that the combination ’class+id’ performs best. That is mainly due to its having the highest base recall, shown in Table 3.7, as the top algorithms did not yield significantly different performance. Applied to ’class+id’, XTREE, RF, SGD, and DTREE perform similarly well around 0.9, with XTREE showing the best results. NC shows significantly worse results of 0.81, and SVM performs worst with 0.634.

7 http://www.primatelabs.com/geekbench/

Table 3.8: F1-all-scores for 18 feature / algorithm combinations

           XTREE   RF      SGD     DTREE   NC      SVM
class+id   0.902   0.897   0.893   0.886   0.81    0.634
class      0.88    0.868   0.859   0.865   0.809   0.643
id         0.856   0.854   0.852   0.849   0.747   0.639

Figure 3.6: Heat map of F1-all scores for 18 feature / algorithm combinations

Table 3.9: Time elapsed (s) for 18 feature / algorithm combinations

           NC      SGD     DTREE   RF      XTREE   SVM
id         0.119   0.131   0.596   1.461   1.893   13.093
class      0.155   0.16    0.859   2.173   3.069   18.499
class+id   0.23    0.252   1.158   3.086   3.928   30.687
mean       0.168   0.181   0.871   2.24    2.963   20.76

3.2.3.2 Speed

Additionally, we analyzed the computational complexity of the feature / algorithm combinations. The results in seconds are shown in Table 3.9. We measured the time for fitting and generating the scores of a combination.

Regarding elapsed time in terms of feature sets, we see that the bigger ones took considerably longer to compute. Regarding the algorithms, NC and SGD constitute a very fast group with means of 0.168s and 0.181s. While DTREE is also relatively fast with 0.871s, RF and XTREE form a second, slower group with 2.24s and 2.963s. SVM is about two orders of magnitude slower than the fastest group. We additionally provide a heat map for this analysis in Fig. 3.7. To adjust it to the same colors as above, we first normalized the elapsed times and then subtracted them from 1.

Figure 3.7: Heat map: time elapsed for 18 feature / algorithm combinations

Table 3.10: Classification report of “class+id” / XTREE classifier on distinct ECS

              precision   recall_classifier   F1_classifier
CS-Cart       1.00        0.99                1.00
Magento       0.86        0.94                0.90
Prestashop    0.99        0.99                0.99
Virtuemart    0.77        0.92                0.84
XT-Commerce   0.99        0.92                0.96
Zen-Cart      0.99        0.93                0.96
avg / total   0.97        0.97                0.97

3.2.3.3 Performance on Dierent Clusters

We additionally provide the classification report of the feature / algorithm combination with the highest performance, ’class+id’ / XTREE, based on recall_classifier in Table 3.10. In terms of precision, CS-Cart, Prestashop, XT-Commerce, and Zen Cart can be detected with results of 1.00/0.99. Magento can be detected significantly worse with 0.86, and Virtuemart worst with 0.77. In recall, CS-Cart and Prestashop again produce a very good score of 0.99, while the other ECS move ± 0.01 around 0.93. This results in four groups of F1-scores: CS-Cart and Prestashop with 1.00 and 0.99, XT-Commerce and Zen Cart with 0.96, Magento with 0.90, and Virtuemart with 0.84. An average F1-score of 0.97 could be measured.

3.2.3.4 Consolidated Algorithm Review

We conclude the results with a consolidated review of strengths and weaknesses of the different classification algorithms in Table 3.11, regarding performance and speed. NC is very fast, but does not provide competitive results. SGD and DTREE

Table 3.11: Consolidated review of speed / performance of used algorithms

              NC    SGD   DTREE   RF   XTREE   SVM
Speed         +++   ++    +       o    o       ---
Performance   -     ++    ++      ++   +++     --

Table 3.12: GR-Notify evaluation: Remaining recall after white-list application

                      class   id     class+id
recall_base           0.63    0.55   0.64
Number of instances   2831    2461   2864

are fast and perform well. RF and XTREE provide a little better performance, but are significantly slower. SVM seems to be a bad choice for our problem.

3.2.4 Evaluation

We evaluate the results of the experiment below.

3.2.4.1 Evaluation on GR-Notify Dataset

We evaluated the “class+id” / XTREE classifier discussed in Section 3.2.2.4 on the GR-Notify [Sto10] dataset, consisting of Magento, Prestashop, and Virtuemart instances. In Table 3.12, we provide the base recall after applying the white-list. We can see that the results are significantly worse than those on the learning set. This is because we only loaded the root URIs of the given sample, mostly excluding the product pages that the white-list was tailored to. A classification report is provided in Table 3.13. Magento and Prestashop performed well with F1-scores of 0.94 and 0.97, while Virtuemart only yielded 0.79. An average F1-score of 0.94 could be measured based on recall_classifier. The results show that the classifier performs nearly as well on a totally independent dataset, even when analyzing the root URI only. Based on the formula proposed in Section 3.2.2.4, the classifier yields an F1_all-score of 0.73.

Table 3.13: GR-Notify evaluation: Classification report of “class+id” / XTREE classifier

              precision   recall_classifier   F1_classifier
Magento       0.95        0.92                0.94
Presta        0.98        0.95                0.97
Virtuemart    0.85        0.74                0.79
avg / total   0.96        0.93                0.94

3.2.4.2 Evaluation on Targeted ECS Reference Shops

We provide an evaluation of the classification performance on reference shops of the ECS the system has been trained for. We discuss sources of reference shops and results in the following list. As described in Section 3.2.2.5, we treated results that yielded a probability below 0.6 as not belonging to the targeted ECS.

1. CS-Cart Acquiring the URIs of ten reference CS-Cart stores was possible by checking the portfolios of specialized agencies. All ten CS-Cart shops could be classified correctly.

2. Magento Magentoshopping.de8 is a German portal that listed 256 Magento shops as of 04/23/2013. We picked the ten shops that were listed to be the newest for the evaluation. Eight shops could be detected correctly. One shop had been erroneously submitted to the portal, as manual checking showed it was generated by ubercart9.

3. Prestashop To acquire the Prestashop reference sites we consulted the prestashop.com showcases10. Again, we extracted the ten URIs that were listed most recently and checked them with the classificator API. This resulted in seven correctly detected URIs and three URIs that did not meet the threshold.

4. Virtuemart Virtuemart provides a collection of live stores11. We picked again the ten shops listed most recently. Of those, four have been labeled

8 http://www.magentoshopping.de
9 http://www.ubercart.org/
10 http://www.prestashop.com/de/showcase
11 http://virtuemart.net/features/live-stores/16

correctly above the threshold. All others were below. We attribute the poor performance in this part of the evaluation to the weak technical impression the Virtuemart shops made, and to the non-curated character of the data source. As it is in principle possible to add arbitrary sites, we expect it to be polluted by sites submitted only for SEO benefits.

5. XT-Commerce Here, we used the URIs referred to by the official XT-Commerce Web site12. Out of ten extracted URIs, three could be detected correctly as XT-Commerce. Five URIs fell below the threshold, and the remaining two were classified wrongly. We interpret the poor results as rooted in severe overfitting for this specific ECS, and in the strong template customization of the showcased XT-Commerce shops.

6. Zen-Cart For Zen-Cart, we consulted the apparel subcategory of the official showcase site13. The results were the worst in this part of the evaluation. One URI could not be resolved at all, and the nine remaining URIs did not pass the threshold. We think this is due to the questionable impression most Zen-Cart shops left. The poor technical realization might affect the patterns the templates exhibit. Additionally, the low number of learning instances must have resulted in overfitting.

We attribute the mixed results of this part to two factors. First, shops that lack technical sophistication generally suffer from wrong classification. At the same time, we expect the labeled URIs of those ECS to be considerably less accurate, as the listings often are crowd-curated. Second, the differences between the results on the test set and on this data hint that our approach tends to overfit on specific clusters. We discuss this potential shortcoming in the limitations Section 3.2.5. We provide an overview of the results of this section in Table 3.14 and 3.15. Table 3.14 provides the relative instance frequency in each result group. Table 3.15 provides the achieved precision, recall, and F1-scores. Overall, a precision of 0.892, a recall of 0.608, and an F1-score of 0.708 could be measured.

12 http://www.xt-commerce.com/
13 http://www.zen-cart.com/showcase.php?do=showcat&catid=1

Table 3.14: Evaluation on targeted ECS reference shops - classification results

              True pos.   False pos.   False neg.   Errors
cs-cart       1.000       0.00         0.00         0.000
mage          0.800       0.00         0.20         0.000
presta        0.700       0.00         0.30         0.000
virtuemart    0.400       0.00         0.60         0.000
xt-commerce   0.300       0.20         0.50         0.000
zen-cart      0.300       0.10         0.50         0.100
mean          0.583       0.05         0.35         0.017

Table 3.15: Evaluation on targeted ECS reference shops - precision, recall, F1-score

              Precision   Recall   F1-score
cs-cart       1.000       1.000    1.000
mage          1.000       0.800    0.889
presta        1.000       0.700    0.824
virtuemart    1.000       0.400    0.571
xt-commerce   0.600       0.375    0.462
zen-cart      0.750       0.375    0.500
mean          0.892       0.608    0.708

3.2.4.3 Evaluation on Non-Targeted ECS Reference Shops

We additionally conducted a performance analysis for reference sites of ECS the classificator was not trained for, namely 3DCart, Oxid esales, and Volusion. By definition, in this experiment there are only true negatives, false positives, and errors (e.g. if a page could not be fetched). We cannot compute precision and recall based on those figures, but we can compute the true negative rate. The true negative rate for 3DCart is 0.6, for Oxid esales 0.875, and 1.0 for Volusion. Again, we interpret the result in line with the low-end impression 3DCart conveyed on the system’s home page and in the associated shops. We compile our results in Table 3.16.

3.2.4.4 Evaluation on Non-Shop Sites

We additionally evaluated the predictive performance of the system on non-shop sites. As test sample we used 20 randomly selected sites from the Alexa Top 1M

Table 3.16: Evaluation on non-targeted ECS reference shops

              True neg.   False pos.   Errors   True neg. rate
3DCart        0.600       0.400        0.000    0.600
Oxid esales   0.700       0.100        0.200    0.875
Volusion      1.000       0.000        0.000    1.000
mean          0.767       0.167        0.067    0.825

URIs list and manually filtered out shops. We then submitted the URIs to the class+id / XTREE classificator. Beside one 404, 19 URIs did not meet the threshold and were labeled correctly as true negatives. This result confirms that our approach also works well for non-shop sites.

3.2.5 Limitations

Limited Features: We did not exploit further features that might raise the performance, for instance HTTP headers [Fie+99] or tag frequency in the HTML document. In this context, we propose that computing graph properties of the HTML trees might be another promising way to design classificators. We do not think that our focus on the limited features is a problem, as the classificator matches our performance needs for the outlined extractor. From our point of view, gaining even a few additional percentage points of market coverage is already a contribution.

Biased Learning Set: A severe bias might have been introduced by choosing a learning set that has been labeled automatically. For future work, we aim at collecting a curated dataset of Web shop URIs labeled by ECS. We assume that this would raise the performance of the classificator significantly. A fundamental principle in Machine Learning is that having more data is of higher importance to the performance than the algorithm in use (e.g. [HNP09]). We think that this is not critical, as we achieve sufficient results for the projected use. Additionally, one might argue that a bias is introduced by the learning set not being evenly distributed across the different ECS (see Table 3.6). At the same time, taking into account all labeled pages available yielded the best overall classification performance.

Hyperparameter Tuning: Additional classifier performance could have been generated by tuning the hyperparameters of the algorithms. We decided not to perform hyperparameter tuning, as a preliminary tuning of the best classifier yielded only slightly better F1-scores. These get further diminished by the recall_base and thus did not seem critical for this prototype. Meanwhile, in a production system, parameter tuning should be performed as soon as the final algorithm is chosen, e.g. based on our heuristic. We assume, e.g., that the poor SVM results are due to missing parameter tuning.

White-List Generation: It is possible that the heuristically generated white-list may not be the optimal one. Formally approaching this problem could be future work.

Detecting ECS by only analyzing one HTML page: The performance of the classificator could be raised further by considering multiple HTML pages per Web site. Again, for the projected use case, the yield rate is satisfactory.

Overfitting: Section 3.2.4.2 showed that the system tends to overfit on some ECS. We think this problem could be addressed by additional learning data. The procurement of high-quality labeled ECS pages is a non-trivial task and could serve as future work.

Threshold: We heuristically set the threshold to invalidate predictions to 0.6. Future work based on the evaluation could provide formal methods to set the figure.

3.2.6 Conclusion

To extend the foundations of the main approach, we designed a system capable of ECS detection based on supervised classification and a filtered set of HTML attribute values. It detects six different e-commerce systems by analyzing only one random HTML page of a Web shop. Taking into account the loss in recall when no features can be generated, it shows an F1-score of 0.9. An extensive evaluation confirmed the results. We provided an analysis of the speed of the different algorithms, the performance on specific ECS, and a heuristic for choosing a classification algorithm for the task at hand.

3.3 Structured E-Commerce Data on the Web

In this subsection, we describe sources and analyze GoodRelations data on the Web with a two-fold approach. First, we analyze the data that has been gathered by GR-Notify14, a Web service that receives notifications from Web shops using GoodRelations. Second, we download a sample of HTML pages equipped with GoodRelations based on the data gathered by GR-Notify. In here, we only download pages from e-commerce systems (ECS) that have more than 100 submissions in GR-Notify. These boil down to Magento, Prestashop, Oxid, and Virtuemart. In Chapter 4, we will use the downloaded HTML pages as a learning set for automated extraction, the main contribution of the thesis.

There are two main strands of known structured e-commerce data on the Web: first, the data the GR-Notify service provides, and second, known deployments that are not included in GR-Notify. For this research, we target only the sites submitted to GR-Notify.

Additionally, we exclude large deployments, for instance Overstock15 or Bestbuy16, which are collected in the GoodRelations Wiki17, because they do not add much discriminatory power to our approach. The approach works by detecting the HTML patterns of existing Web shops using structured e-commerce data. Therefore, we strive to generate data from as many different shops as possible, not from as many pages

14 http://gr-notify.appspot.com
15 http://www.overstock.com
16 http://www.bestbuy.com
17 http://wiki.goodrelations-vocabulary.org/References

as possible. A huge shop with millions of offerings, each generated by the same template, would contribute little to the generalization ability of our system.

The most prominent data resource for this section are product pages that are generated with shop extensions for popular ECS, as they provide a multitude of different Web shops in the long tail. As of August 2013, there are nearly 13.000 distinct URIs in GR-Notify.

To give an aggregated overview of the size of the GoodRelations ecosystem, we need to add to this resource several big rollouts of the GoodRelations vocabulary at large enterprise level. We estimate the total number of Web shops using GoodRelations to be about 20.000, as of October 2013. For instance, Rakuten, a German e-commerce mall, alone aggregates 6500 merchants with 16.1 million product pages. We estimate the total number of product pages equipped with GoodRelations, as of October 2013, to be about sixty million. The sites unknown to GR-Notify represent the ’dark’ part of structured e-commerce data on the Web. We omitted these sources in our research, as discovering them would require extensive unfocused crawling, which is out of the scope of this thesis.

3.3.1 Related Work

Ashraf et al. [Ash+11] present an encompassing analysis of 105 Web shops using GoodRelations. Their findings are partially outdated, as the number of Web shops using GoodRelations has grown significantly, and GoodRelations has been adopted by search engines into the schema.org venture (see 2.1.4). They collect data about namespaces that are used together with GoodRelations, frequency of the usage of vocabulary elements, and the usage of annotation properties. They analyze concept coverage, use cases, and axioms for reasoning. In comparison, our approach in this chapter targets submission data in GR-Notify, and attribute distribution in the properties relevant to the research at hand.

Mühleisen and Bizer [MB12] and Bizer et al. [Biz+13] propose Web Data Commons, an analysis of structured data based on the Common Crawl18 project. Common Crawl crawls large parts of the Web and provides the results for free. Mühleisen and Bizer [MB12] discuss the usage of syntax and types of data. Bizer et al. [Biz+13] analyze the data set in greater detail, split into overall results and a discussion of the structured data standards RDFa, Microdata, and Microformats. While the fundamental idea behind using a free crawl to gain insight into structured data usage sounds tempting, we question the validity of the presented work. As their data regarding the GoodRelations vocabulary is far off the figures we estimate, we expect the underlying Common Crawl data to be strongly biased.

3.3.2 GR-Notify as a Registry for GoodRelations-enabled Shops

By the end of 2010, a substantial amount of shop extensions had been developed for GoodRelations, producing an increasing amount of data (see 2.2.1). Meanwhile, the extensions were mostly distributed through the extension platforms of the respective ECS, like Magento Connect19. As these platforms do not provide download numbers, there was basically no way of knowing which merchants had installed the extensions, and thus which shops were equipped with GoodRelations. This led to the development of GR-Notify in late 2010. GR-Notify is a Web service that allows submitting a URI (see 2.1.2) and the generating ECS to state that a given Web shop is using GoodRelations. URIs can be submitted automatically by shop extensions, or manually entered through an online form. In the following sections, we describe the design and implementation of GR-Notify, and the data it has collected in the last three years. At this point, we would like to stress the importance of GR-Notify for the GoodRelations ecosystem. Knowing which shops are using GoodRelations enables a wide range of further research, e.g. the GoodRelations Crawler (see 2.2.2.3), global shopping histories [TH14], or this thesis. Alternative

18 http://commoncrawl.org
19 http://www.magentocommerce.com/magento-connect/

methods of discovering GoodRelations data in the wild, like unfocused crawls, are out of scope.

3.3.2.1 Approach

GR-Notify was designed as a Web service that is able to receive signals from shops about the existence of GoodRelations data. The service saves a URI and, optionally, a contact e-mail address and the generating ECS. It was initially intended to further distribute this data to Semantic Web search engines, but as these emerged to be rather unreliable, its main use became generating statistics about the GoodRelations data available on the Web. There are two ways to use the service:

• A Web form allows submitting data directly through a well-established human-computer-interface pattern.

• A REST-like [Fie00] interface allows for an automated submission.

Therefore, besides a manual submission through the form, it was possible to integrate submission functionality into the shop extensions. GR-Notify allows multiple submissions of the same URI, but not more than one a day. If an existing URI is submitted again, a ’last modified’ field is updated in the database.
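A hedged sketch of such an automated submission is given below; the endpoint path and parameter names are assumptions for illustration, as the actual GR-Notify interface is not reproduced here.

import requests

def notify_gr(shop_uri, agent, email=None):
    """Submit a GoodRelations-enabled shop URI to GR-Notify (hypothetical endpoint)."""
    payload = {'uri': shop_uri, 'agent': agent}
    if email:
        payload['email'] = email
    return requests.get('http://gr-notify.appspot.com/notify', params=payload)

# e.g. notify_gr('http://example-shop.com/', agent='MSemantic')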

3.3.2.2 Implementation

GR-Notify was implemented in the Python programming language (see 4.2.1) on the platform-as-a-service (PAAS) Google App Engine20 (e.g. [Ciu09]). The main advantage of the PAAS model is that the servers are managed centrally by the provider, which raises developer productivity by limiting security and deployment issues. For instance, it is possible to have a database-powered Web app deployed to Google App Engine in less than five minutes, which would not have been possible with traditional deployment methods. While using Google App Engine seemed a wise decision at first, several drawbacks have emerged since:

20 https://cloud.google.com/products/app-engine

• Initially, Google provided a generous free quota that allowed us to operate the service basically at no cost. While the service has grown, the free quota has at the same time been reduced, forcing us to cut back the service’s functionality.

• Using the specific functions of Google App Engine initially reduced the development effort, but at the same time generated a lock-in effect. If, some day, we decided to self-host GR-Notify, we would need to refactor the code significantly.

• Another major lock-in effect emerges from the usage of the domain ’gr-notify.appspot.com’, which was provided for free by Google App Engine. As all shop extensions are now configured to use this domain for submissions, when migrating to a new one, we would be forced to run the old one in parallel in order to avoid losing submissions.

On top of Google App Engine, we used the Python Web micro framework Flask21. Flask itself builds on established Python libraries that provide benefits in engineering efficiency and security.

3.3.3 Analysis of GR-Notify Data

In the following sections, we provide an analysis of the GR-Notify data.

3.3.3.1 Approach

The analysis approach regarding the GR-Notify data is comparatively straightforward. We compute distributions and frequencies, and show plots of the data GR-Notify recorded in the respective time range. We analyze base URIs, e-mail address provisions, pings / submissions, top-level domains, submitted ECS, and submissions over time, and finally provide a world heat-map regarding the countries of origin of the submissions.

21 http://flask.pocoo.org/

3.3.3.2 Implementation

Google App Engine exposes a ’bulk loader’ tool that allows downloading data. We used this to download the URI database in comma-separated values (CSV) format [Sha05]. We then loaded the CSV file into the Python Pandas library and performed the analysis in an iPython notebook. Both are introduced in Section 4.2.1. Additionally, we used the freely available ’GEOIP’ database [LLC07] to map the Internet Protocol numbers (IPs) of the submissions to countries.
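The basic Pandas steps can be sketched as follows (Python 3); the file and column names are assumptions about the CSV export, not the original notebook.

import pandas as pd
from urllib.parse import urlparse

df = pd.read_csv('gr_notify_export.csv')
df['base_uri'] = df['uri'].map(lambda u: urlparse(u).scheme + '://' + urlparse(u).netloc + '/')
base = df.drop_duplicates('base_uri').copy()        # reduced sample of base URIs

base['tld'] = base['base_uri'].map(lambda u: urlparse(u).netloc.rsplit('.', 1)[-1])
print(base['tld'].value_counts().head(10))           # top-level domain frequencies
print(df['agent'].str[:9].value_counts().head(25))   # submitting ECS, first nine characters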

3.3.3.3 Results

Fundamentals: The CSV export of GR-Notify contained 18.318 entries as of 07/25/2013. We could compute 12.772 different URIs (base URIs), which represent 69.72 % of the base sample. In the remainder of this analysis, we use this reduced sample.

Out of the 12.772 base URIs, we could gather 7.147 e-mail addresses, which represent 55.96 % of the base sample.

Ping Frequency: If an URI is submitted multiple times to the service, a database value is incremented. We show the frequency of the 25 most common values for ping in Fig. 3.8 with logarithmic y-scale. We can see that the distribution fundamentally follows a power-law (e.g. [RU12, p. 13]), with 10.526 instances showing one ping, 1.624 instances showing two pings, and 718 instances showing three pings. This function continues until it reaches a long tail at about one-hundred instances for ping values bigger than ten. We additionally calculated the distribution of hours between pings, with a mean of 511.31, and a standard deviation of 1119.63.

Top Level Domains: We analyzed the top level domains (e.g. ’.com’, ’.de’) of the submitted URIs. Again, we show the frequency with a logarithmically scaled y-axis (Fig. 3.9). For this analysis, we used the distinct base URIs computed above, to exclude repeated submissions. We can see that .com has the highest share with 5915

Figure 3.8: GR-Notify - ping frequency

Figure 3.9: GR-Notify - top level domains

Table 3.17: GR-Notify - top level domains

TLD     com     de     nl     uk     fr     net    org    pl     es     it     Rest
Freq.   5915    1061   658    522    469    442    279    271    270    245    2154
%       46.31   8.31   5.15   4.09   3.67   3.46   2.18   2.12   2.11   1.92   16.87

instances, followed by .de with 1061 instances. Many European top-level domains, as well as .net and .org, show values between 100 and 1000. We additionally provide an overview of the ten most popular top-level domains in Table 3.17.

Submitting ECS: We furthermore provide a histogram with the same properties as above for the agent value that has been saved in the database. The agent value was introduced to the Web service to capture the generating ECS. Before computing the 25 most common values, we limited the length of the values to nine characters to combine different versions of the shop extensions. The five manifestations with a frequency higher than 1000 are “gr4presta” (the Prestashop GoodRelations extension), “form_mage” (the form submission of Magento), “goodrelat” (the extension for Joomla/Virtuemart), “locsubmit” (programmatic submissions that erroneously used the example agent value), and “MSemantic” (the Magento extension). As “form_mage” and “MSemantic” both are Magento submissions, they represent the biggest group, with about 5586 instances. Below 1000 are “grome-rdf” and

Figure 3.10: GR-Notify - submitting ECS

Table 3.18: GR-Notify - submitting ECS

        Magento   Prestashop   Joomla   OxidEC   Rest
Freq.   4658      4257         2117     689      1051
%       36.47     33.33        16.58    5.39     8.23

“grome-bot”, which represent a browser extension for Chrome that allows manual submission of Web pages if GoodRelations markup could be detected, “Oxid”, and “grsnippet”, which represents the GoodRelations snippets generator22. “Mozilla/5” shows about two hundred instances and represents all manual submissions that have been performed with browsers based on the Mozilla rendering engine of Firefox. The frequency of all other agents is below 100. We chose this as a threshold for ECS to qualify for the extraction process, which results in Magento, Prestashop, Oxid, and Virtuemart for further research.

We additionally provide a pie chart in Fig. 3.11, and an additional Table 3.18 that shows the share of those four ECS and the rest of the submissions. We can see that Magento (aggregated from the “form_mage” and “MSemantic” agent strings) accounts for the largest share with 36.47 %. Prestashop is responsible for 33.33 %, Joomla yields 16.58 %, and Oxid 5.39 %. The rest represents 8.23 %. That means that for the following research, we take into account 91.77 % of the known GoodRelations data in GR-Notify. Regarding the form interface mentioned above, 3402 submissions of the sample of 12772 contained the string ’form’, resulting in 26.64 % of the submissions.

22 http://www.ebusiness-unibw.org/tools/grsnippetgen/

Figure 3.11: GR-Notify - submitting ECS pie chart

Figure 3.12: GR-Notify - submissions over time

Submissions over time: In Fig. 3.12 we show the number of submissions over time. Again, we use the sample filtered by base URIs. We can see that after a relatively slow start in 2011, the submissions have grown by about 8.000 submissions a year since April 2012.

Countries of Origin of the Submissions: For anonymity, we blanked out the last digits of the Internet Protocol addresses (IPs) of the submissions. Meanwhile, the remaining addresses (e.g. 123.123.123.000) could be partially translated into country-codes with a freely available ’GEOIP’ database [LLC07]. We provide the 10 countries with the highest frequency in Table 3.19, and a world heat-map in Fig. 3.13.

We can distinguish four groups:


Figure 3.13: GR-Notify - frequency world heat-map

• The first group are the countries that did not see any submissions at all. Examples include Greenland, Bolivia and Paraguay, large parts of Africa, as well as the Arabian Peninsula with Syria and Iraq.

• The second group, colored in dark blue, marks the low end of frequency of GR-Notify submissions by country. Examples include Mexico, Western South America, a small part of the African countries, like Morocco, Egypt and South Africa. Additional examples include Scandinavia, Iran, and Pakistan, Kazakhstan and China, South East Asian countries as well as Japan.

• A third group is constituted by countries colored in lighter blue to turquoise. These include Spain, the Netherlands, France, and Romania.

• The fourth group with the most submissions is colored in yellow, orange, and red and consists of the United Kingdom, Germany, and the United States.

Regarding the results, it is important to state that 434 instances could not be geo-located, because the GEOIP database could not generate a result based on the IP saved in GR-Notify. We expect many of the grey countries that show no submission to be hidden there.

Table 3.19: GR-Notify - frequency world - GEOIP analysis

Country   US      DE      FR     NL     GB     ES     CH     IT     RO     PL     Rest
Freq.     3439    2304    1177   845    788    441    344    321    316    197    2154
%         26.93   18.04   9.22   6.62   6.17   3.45   2.69   2.51   2.47   1.54   16.87

3.3.4 Generating a Sample of GoodRelations Data on the Web

In the following section, we provide details about the sample data generation.

3.3.4.1 Approach

The sample and learning set generator is built on the GR-Notify data introduced in Section 3.3. Fig. 3.14 provides an overview of its functionality. It downloads a sample of offering pages from Web shops that contain structured data, and sorts them by the e-commerce system (ECS) behind the shop.

1. First, we filter submissions of Magento, Prestashop, Oxid E-Commerce, and Virtuemart out of the GR-Notify data. If a URI contains path data, it is omitted, leaving only the base URI. Duplicates are filtered.

2. For each given base URI, the generator tries to download a sitemap [GYC08]. In this context, sitemaps are lists in XML syntax that enumerate every page a Web site serves and provide additional information, like last change or priority. Sitemaps are a technology intended to facilitate Web crawling for robots, and are mainly used by search engines.

3. If the sitemap download was successful, five random URIs are selected from each sitemap. We chose to fetch only this number of URIs, as our approach works best if we have a high number of offering pages from different shops. As a premise of our approach was that shops usually employ templates to generate Web pages belonging to a certain group, like offering or category pages, fetching additional pages would only add redundancy to the learning dataset, providing no additional discriminatory power. Additionally, Section 3.1.3.2

(Figure 3.14 shows the pipeline from the GR-Notify data over base URIs and sitemaps to the five example files per shop: download sitemaps, pick five files, fetch and save files, extract data if GoodRelations markup is present.)

Figure 3.14: Learning set generator overview

showed that about 78 % of pages in Web shops are offering pages. Therefore, as the probability of randomly picking no product page at all is roughly 0.05 % (0.22^5), we argue that five randomly picked pages suffice.

4. Those five randomly picked URIs are downloaded, and saved if they contain GoodRelations structured data (a simplified sketch of steps 2-4 is given below).
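
The following is a simplified sketch of steps 2-4, assuming a helper fetch() that returns the body of a URL or None; error handling, parallelization, and the file layout of the production scripts are omitted.

import random
import re
import requests
from xml.etree import ElementTree as ET

# Regular expression used to detect GoodRelations markup (cf. Section 3.3.4.2)
GR_PRICE = re.compile(r'property="gr:hasCurrencyValue"\s+content="(.*)"')
NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def fetch(uri):
    try:
        r = requests.get(uri, timeout=10)
        return r.text if r.status_code == 200 else None
    except requests.RequestException:
        return None

def sample_offering_pages(base_uri, n=5):
    sitemap = fetch(base_uri.rstrip('/') + '/sitemap.xml')            # step 2
    if sitemap is None:
        return []
    urls = [loc.text for loc in ET.fromstring(sitemap.encode('utf-8')).iter(NS + 'loc')]
    pages = []
    for uri in random.sample(urls, min(n, len(urls))):                # step 3
        html = fetch(uri)
        if html and GR_PRICE.search(html):                            # step 4
            pages.append((uri, html))
    return pages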

3.3.4.2 Implementation

The learning set generator was implemented with the following Python scripts:

1. The first script parsed the URI dataset from GR-Notify. Google App Engine allows exporting comma-separated value (CSV) (e.g. [Sha05]) files via the Datastore API23. The CSV file was subsequently loaded into a Pandas (see 4.2.1) data structure. We then filtered out all entries containing the agent string "Mozilla", as this hinted at a browser submission without an ECS specification, and filtered agent strings containing "msem"24 or "mage" for Magento, "presta" for Prestashop, "oxid" for Oxid E-Commerce, and "joomla"25 for Virtuemart. We reduced the remaining URIs to base URIs26 and saved the URIs to a file for every ECS.

2. Given a list of URIs of GR-Notify belonging to an ECS and showing structured data, we then tried to fetch the XML sitemaps of the shops. We did

23https://developers.google.com/appengine/docs/python/datastore/
24The GoodRelations extension for Magento is called "MSemantic", see http://www.msemantic.com
25The GoodRelations extension for Virtuemart is called "GoodRelations for Joomla", see https://code.google.com/p/goodrelations-for-joomla/
26e.g. http://ex.org/1.html to http://ex.org/

that by first parsing the robots.txt file for a sitemap definition. If this yielded a result, we tried to fetch the respective file; otherwise, we tried to fetch the standard sitemap path "/sitemap.xml". This approach is largely similar to the one in Section 3.1.

3. Having these sitemap files sorted by the four different ECS, we randomly picked five URIs of each sitemap and saved the results in a file.

4. The fourth script downloaded the HTML files behind the URIs and saved them, if they contained GoodRelations markup. The presence of the markup was tested by this regular expression:

'property="gr:hasCurrencyValue"\\s+content="(.*)"'

We chose to use the "hasCurrencyValue" property of GoodRelations, as we argue the price definition is the most essential part of an offering. The files were saved with a leading string to identify the ECS, and an unambiguous string (hash) calculated from the URI. A simplified sketch of the agent-string filtering of the first script and of this file naming is given below.
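
The sketch below compresses the agent-string filtering of the first script and the hash-based file naming of the fourth script; the column names of the GR-Notify export and the file names are illustrative assumptions, not the original schema (Python 2 style, in line with the remaining code of the thesis).

import hashlib
from urlparse import urlparse
import pandas as pd

df = pd.read_csv('grnotify-export.csv')        # Datastore CSV export (assumed file name)

patterns = {'mage': 'msem|mage', 'presta': 'presta',
            'oxid': 'oxid', 'virtue': 'joomla'}

# Drop browser submissions and reduce the submitted URIs to base URIs
df = df[~df.agent.str.contains('Mozilla', na=False)]
df['baseuri'] = df.uri.apply(lambda u: urlparse(u).netloc)

for ecs, pat in patterns.items():
    base = df[df.agent.str.contains(pat, case=False, na=False)].baseuri.drop_duplicates()
    base.to_csv(ecs + '-baseuris.csv', index=False)

def filename(ecs, uri):
    # leading ECS string plus an unambiguous hash of the URI
    return ecs + '-' + hashlib.md5(uri).hexdigest() + '.html'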

3.3.5 Analysis of the Sample

In the following sections, we provide an in-depth analysis of the structured data that we will use in the further course of the thesis. As a foundation, we used the HTML files equipped with GoodRelations markup, generated as described in the section above.

3.3.5.1 Implementation

The analysis of the HTML files was implemented in an iPython notebook (see 4.2.1). We first read the content of the HTML files from the disk into a Python data structure, resulting in 2430 Magento files, 1347 Prestashop files, 1708 Virtuemart files, and 103 Oxid files.

(Figure 3.15 pipeline: HTML files with RDFa → rdflib extraction → Turtle RDF files → Stardog triplestore → SPARQL queries → JSON results → Pandas DataFrame → Matplotlib visualization in an iPython notebook)

Figure 3.15: Implementation pipeline - sample analysis

We then extracted the structured data out of the downloaded Web pages with rdflib27 and serialized the resulting graph into Turtle (see 2.1.2). In the serialization process, graphs belonging to Oxid E-Commerce, Prestashop, and Virtuemart were extended with an additional triple denoting the generating ECS, as those did not provide this information in the original Web page RDFa. Magento, on the other hand, did provide it. Before loading the RDF data into a triple store, we used a script that (1) corrected a small error in the Turtle serializer of rdflib and (2) corrected erroneous date/time literals in the Prestashop data. We then loaded the resulting Turtle files into the Stardog triplestore28. The different analyses were performed with a set of SPARQL queries. The SPARQL results in the JSON format were loaded into a Pandas DataFrame and visualized with Matplotlib. We introduce these important Python libraries in Section 4.2.1. To reduce the complexity of the SPARQL queries, we partly executed the analyses of the different ECS separately. A combined analysis is generally possible in SPARQL; however, combining the results outside the triplestore is far more efficient. To execute this, we ran the queries with substituted placeholders for the ECS and combined the results in a Pandas DataFrame for further analysis and visualization. While we introduced SPARQL in Section 2.1.2.2 only briefly, we now provide more detailed examples of the power of the query language. We provide an overview of the implementation pipeline in Fig. 3.15.
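
As a minimal sketch of the extraction step, the following shows how embedded RDFa can be pulled from a downloaded page with rdflib and serialized to Turtle, assuming an rdflib 4.x installation whose structured-data plugins (requiring html5lib) provide the 'rdfa' parser; the actual script additionally adds the missing ECS triples and applies the two corrections described above.

import rdflib

def html_to_turtle(html_string, base_uri):
    g = rdflib.Graph()
    # parse the RDFa embedded in the page markup
    g.parse(data=html_string, format='rdfa', publicID=base_uri)
    return g.serialize(format='turtle')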

3.3.5.2 Results

Fundamentals: Table 3.20 provides an overview of the amount of HTML files downloaded and how many offering graphs could be loaded into the triplestore. This result is generated by the query in Listing 3.2.

27https://github.com/RDFLib/rdflib
28http://stardog.com/

Table 3.20: Analyzed HTML pages and RDF oering graphs per ECS

ECS     OxidEC   Prestashop   Magento   Virtuemart
HTML    103.00   1347.00      2430.00   1708.00
RDF      99.00   1310.00      2267.00   1594.00
%        96.12     97.25        93.29     93.33

select (count(?s) as ?$sys)
where {?s ?p }
group by ?p

Listing 3.2: Overview query

We proceed with an analysis of the frequency of different properties in the dataset by ECS.

Table 3.21 shows the relative frequencies of properties that are attached to the GoodRelations offerings. This result was generated by Listing 3.3. The values in the table represent the mean frequency of properties attached to the offering.

SELECT ?prop (?p1/?p2 as ?rel)
WHERE
{{SELECT ?prop (count(?prop) as ?p1)
  WHERE
  {?o ?p ?ob.
   ?o foaf:maker .
   ?o gr:includesObject ? ?tqn .
   ?tqn ?prop ?ob2 .
   FILTER(STRSTARTS(STR(?prop), "http://purl.org/goodrelations/v1#"))}
  GROUP BY ?prop}
 {SELECT (COUNT(?s) AS ?p2)
  WHERE {?s ?pr2 } } }
GROUP BY ?prop ?p1 ?p2
ORDER BY desc(?rel)

Listing 3.3: Property frequency analysis - Offering

From the table, four result groups emerge:

• The first group is constituted by values that have a mean above 1, like eligibleRegions, acceptedPaymentMethods, and hasPriceSpecification. Those represent elements that have a cardinality (e.g. [Hub07]) of 1:1..*. We can see that the different ECS show largely different values; for instance, Magento has a mean of 84.646 regarding valid shipping regions, while the value of Virtuemart is below 1. As we expect no differences of this magnitude between the true values in the different ECS groups, we argue that the figures are largely a result of the design of (1) the shop systems, and / or (2) the default settings in the ECS or extensions. We will elaborate on this finding in the conclusion of this section.

• The second group is formed by properties with a value in the neighborhood of 1. These represent properties with a cardinality of 1:1. All shop extensions implement those properties correctly.

• The third group is formed by properties that show values of 0.75 or 0.5. These figures imply that some extensions did implement a 1:1-cardinality property correctly, and some did not, resulting in these typical means.

• The fourth group is formed by properties with values significantly below 0.5. These values are implemented only by some extensions, and not broadly used.

For Listings 3.2 and 3.3, we used the substitution of the "$sys" placeholder to execute the queries separately for each ECS, and combined them in an iPython notebook. An important pattern subsequently used in the queries below is first shown in Listing 3.3. We use a combination of two queries: the first one29 computes the frequency of the property in question, and the second one30 computes the total frequency of the ECS. In the surrounding query, we divide the frequency of the property by the total frequency, resulting in a relative frequency.
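
The substitution pattern itself can be sketched as follows. The SPARQLWrapper client and the local Stardog endpoint URL are assumptions for illustration; the thesis does not prescribe a particular SPARQL client, and the query shown is a simplified variant of the count query from Listing 3.2.

from string import Template
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

ENDPOINT = "http://localhost:5820/offers/query"   # assumed local Stardog endpoint

QUERY = Template("""
SELECT (count(?s) AS ?$sys)
WHERE { ?s ?p ?o }
""")

def run_query(query):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [{k: v["value"] for k, v in row.items()} for row in bindings]

# Run the query once per ECS and combine the results outside the triplestore
results = {}
for sys in ["mage", "oxid", "presta", "virtue"]:
    results[sys] = run_query(QUERY.substitute(sys=sys))

combined = pd.concat({sys: pd.DataFrame(rows) for sys, rows in results.items()})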

Length and count: We additionally performed a frequency distribution analysis of the length of suitable properties. Those can be separated into two groups:

29select ?prop (count(?prop) as ?p1..
30select (count(?s) as ?p2)..

Table 3.21: GoodRelations properties attached per oer by ECS

GR property          Magento   OxidEC   Prestashop   Virtuemart   Mean
eligibleRegions       84.646   21.333      0.379        1.883     27.06025
acceptedPaymentM.      4.040    4.747      2.892        0.354      3.00825
hasPriceSpecific.      0.981    4.859      0.966        0.999      1.95125
availableDeliver.      1.434    2.939      1.001        0.120      1.37350
validThrough           0.995    1.000      1.000        0.999      0.99850
validFrom              0.995    1.000      1.000        0.999      0.99850
name                   0.980    0.990      1.002        0.999      0.99275
description            0.994    0.929      1.002        0.999      0.98100
hasBusinessFunc.       0.983    0.939      1.000        0.999      0.98025
eligibleCustome.       0.044    2.020      0.998        0.144      0.80150
hasStockKeeping.       0.000    1.000      1.001        0.997      0.74950
hasInventoryLev.       0.568    0.242      0.233        0.979      0.50550
includes               0.000    1.000      0.998        0.000      0.49950
includesObject         0.979    0.000      0.002        0.999      0.49500
hasMPN                 0.000    0.465      0.000        0.000      0.11625
availableAtOrFro.      0.307    0.000      0.002        0.002      0.07775
hasEAN_UCC-13          0.000    0.000      0.118        0.000      0.02950
hasWarrantyPromi.      0.000    0.000      0.000        0.040      0.01000
BusinessEntity         0.000    0.000      0.003        0.000      0.00075

1. String typed properties such as gr:name and gr:description, where we analysed the string length.

2. Properties with a cardinality of 1:1..*, such as gr:eligibleRegions, gr:availableDeliveryMethods, and gr:acceptedPaymentMethods, where we analysed the number of given values.

We provide box plots that show the properties of the distributions for the four different ECS. The plots show a box that represents the boundaries of the lower and upper quartiles (25 % to 75 %) of the distribution. The line inside the box shows the median. The borders of the black dashed lines, the whiskers, are defined by a factor multiplied with the interquartile range. We use the Matplotlib standard setting31 of 1.5. Crosses show outliers.
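
As an illustration, a box plot of this kind can be produced with Matplotlib as follows; the value lists are dummy placeholders, not the actual distributions.

import matplotlib.pyplot as plt

# Dummy name-length distributions per ECS (placeholder values only)
name_lengths = {
    'Magento':    [28, 31, 45, 52, 160],
    'OxidEC':     [22, 30, 41, 44],
    'Prestashop': [25, 33, 47, 50],
    'Virtuemart': [27, 29, 44, 46],
}

fig, ax = plt.subplots()
ax.boxplot(list(name_lengths.values()), labels=list(name_lengths.keys()),
           whis=1.5, sym='+')    # whiskers at 1.5 * IQR, crosses mark outliers
ax.set_ylabel('characters')
plt.show()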

The length analyses of the properties gr:name and gr:description have been generated by Listing 3.4.

31http://matplotlib.org/api/pyplot_api.html

Figure 3.16: Length analysis - name, unit: characters

Figure 3.17: Length analysis - description, unit: characters

SELECT ?maker (GROUP_CONCAT(?c; SEPARATOR = " ") AS ?val)
{SELECT ?o (strlen(?name) as ?c) ?maker
 WHERE { ?o $prop ?name .
         ?o foaf:maker ?maker
         FILTER(regex(str(?maker), "[A-Z]"))}
 GROUP BY ?o ?maker ?name}
GROUP BY ?maker

Listing 3.4: Length analysis - name - description

Name: Excluding the outliers, all four ECS show similar distributions in the name box plot. The median is about thirty, the lower quartile is about twenty, and the upper quartile is about forty-five. Magento shows slightly higher figures and many outliers, with a long tail spanning up to 160 characters length. The box plot for the name distribution is provided in Fig. 3.16.

Description: The description box plot shows similar figures for Magento, Oxid EC, and Prestashop, with a median in the lower hundreds and boxes spanning from 50 to 500. Virtuemart shows only a very small amount of gr:descriptions. That is rooted in the usage of rdfs:comment instead of the gr:description property for product descriptions. We omit an analysis of the distribution of this border case. The box plot for the description distribution is provided in Fig. 3.17.

Figure 3.18: Count analysis - eligibleRegions, unit: region codes

The properties eligibleRegions, acceptedPaymentMethods, and availableDeliveryMethods have been generated with Listing 3.5. In addition to the substitution of the "$sys" variable, in Listings 3.4 and 3.5 we used the "$prop" variable to pass the GoodRelations property in question. That allowed us to use the query for multiple properties.

SELECT ?maker (GROUP_CONCAT(?c; SEPARATOR = " ") AS ?val)
{SELECT ?o (count(?pm) AS ?c) ?maker
 WHERE { ?o gr:$prop ?pm .
         ?o foaf:maker ?maker
         FILTER(regex(str(?maker), "[A-Z]"))}
 GROUP BY ?o ?maker
} GROUP BY ?maker

Listing 3.5: Length analysis - eligibleRegions - acceptedPaymentMethods - availableDeliveryMethods

eligibleRegions: The count of eligible regions shows significantly different distributions. Magento has a box spanning from 1 to 250, with a median of 12. Oxid EC, as already seen in the property analysis above, has a relatively small box ranging from 0 to 25, and an upper whisker of about 40. Prestashop shows a large box spanning from one to 125 with a median of 1. This represents three classes of shops: (1) shops shipping only to their country of residence, (2) shops shipping to the surrounding countries, like the European Union, and (3) shops shipping world-wide. As there is no shortcut to express world-wide shipping in GoodRelations, all countries need to be explicitly stated in this case. Virtuemart shows an even more extraordinary distribution with no box, a median of two, and some outliers. This is rooted in the very rare existence of this property in Virtuemart-generated offerings. The box plot for the eligibleRegions distribution is provided in Fig. 3.18.

Figure 3.19: Count analysis - acceptedPaymentMethods

Figure 3.20: Count analysis - availableDeliveryMethods

acceptedPaymentMethods: The plots of acceptedPaymentMethods for the different ECS are relatively similar. All show an outlier at 0, the lower whisker starts at 1, and the box starts at 3. Magento has a median of 5, and the box ranges up to 6. The upper whisker of Magento is 10. OxidEC has a median of 5, and the box ranges up to 7. The upper whisker ends at 11. Prestashop has a median of 4, and the box ranges up to 5. The upper whisker ends at 8. Virtuemart has a median of 5, and the box ranges up to 5 as well. The upper whisker ends at 8. All plots except Prestashop have outliers above their whiskers. The box plot for the acceptedPaymentMethods distribution is provided in Fig. 3.19.

availableDeliveryMethods: Magento has a box spanning from 1 to 3, and a median of 2, regarding the number of available delivery methods. The whiskers span from 0 to 6. Prestashop shows the same plot as Magento, with outliers at 7. Oxid EC shows a slightly different plot, with a box up to 4, the median at 3, and the same whiskers. The plot of Virtuemart stands out. There is no box, only a median at 2, and two outliers at 0 and 3. The box plot for the availableDeliveryMethods distribution is provided in Fig. 3.20.

Frequency Distribution of Multi-Value Elements by ECS: In the following paragraphs, we show the relative frequencies of properties with a 1:1..* cardinality. It is important to state that, as we show the relative frequency, sorting the distribution by its mean obscures the largely different sizes of the base samples.

Figure 3.21: Distribution of hasCurrency by ECS

hasCurrency: The gr:hasCurrency property exposes the currency of a price of an offering. We show only the leading three currencies EUR, USD, and GBP, as all other currencies did not yield results with a mean above 5 %. We can see that EUR leads the distribution, with OxidEC and Virtuemart having over 80 % of all offerings in this currency. The plot for the hasCurrency distribution is provided in Fig. 3.21. The data for the hasCurrency plot was generated by the query shown in Listing 3.6.

SELECT ?key (?c/?p2 AS ?val)
WHERE
{
  {SELECT ?key (count(?key) AS ?c)
   WHERE
   {?o a gr:Offering .
    ?o gr:hasPriceSpecification ?ps .
    ?ps gr:$prop ?key .
    ?o foaf:maker }
   GROUP BY ?key
   ORDER BY desc(?c)}
  {select (count(?o2) AS ?p2)
   where {?o2 foaf:maker }
  }
}

Listing 3.6: Multi-value analysis - hasCurrency

The data for the acceptedPaymentMethods and availableDeliveryMethods plots was generated by Listing 3.7.

SELECT ?key (?c/?p2 AS ?val)
WHERE
{
  {SELECT ?key (count(?key) AS ?c)
   WHERE
   {?o a gr:Offering .
    ?o $prop ?key .
    ?o foaf:maker
    FILTER(STRSTARTS(STR(?key), "http://purl.org/goodrelations/v1#"))}
   GROUP BY ?key
   ORDER BY desc(?c)}
  {SELECT (count(?o2) AS ?p2)
   WHERE {?o2 foaf:maker }
  }
}

Listing 3.7: Multi-value analysis - acceptedPaymentMethods - availableDeliveryMethods

acceptedPaymentMethods: The gr:acceptedPaymentMethods property provides the payment methods accepted by a Web shop. Fig. 3.22 shows the relative frequency by ECS. Again, the payment methods are sorted by mean occurrence. Fundamental differences in frequency, which e.g. exist between Magento and Virtuemart, are due to the different base rates discussed before. For instance, Virtuemart rarely exposes payment methods at all, and thus shows low figures in the plot. We can see that Paypal, Visa, MasterCard, and ByBankTransferInAdvance are the leading payment methods in GoodRelations-equipped Web shops, with a mean above 0.35 and figures between 0.5 and 0.6 for Magento and Prestashop. A second, middle group is constituted by ByInvoice, Cash, AmericanExpress, COD, and CheckInAdvance, with a mean between 0.1 and 0.2. We see a third cluster emerging from the rest of the payment methods, having a lower prevalence. If we exclude the base line bias, Magento and Prestashop show similar distributions. At the same time, the leading payment methods of OxidEC are ByBankTransferInAdvance, then Paypal, ByInvoice, and MasterCard and Visa at roughly the same level. The plot for the acceptedPaymentMethods distribution is provided in Fig. 3.22.

Figure 3.22: Distribution of acceptedPaymentMethods by ECS

Figure 3.23: Distribution of availableDeliveryMethods by ECS

availableDeliveryMethods: The gr:availableDeliveryMethods property provides the delivery methods a merchant offers. We applied the same visualization method as above. Regarding the mean frequency, DHL, DeliveryModePickup, DeliveryModeFreight, and DeliveryModeMail rank at about 0.15. DeliveryModeOwnFleet shows a mean frequency of 0.12, which is largely influenced by a 0.15 score of OxidEC. We doubt that many Web shops deliver via their own fleet, so we expect this data to be rooted in an erroneous interpretation of the extension configuration. The delivery methods UPS, FederalExpress, and DeliveryModeDirectDownload rank below 10 %. The plot for the availableDeliveryMethods distribution is provided in Fig. 3.23.

valueAddedTaxIncluded: Regarding whether VAT is included in the offerings, three groups emerge. More than 80 % of the offerings of OxidEC, Prestashop, and Virtuemart show the value as 'true'. The figure for Magento is relatively equally distributed between tax included or not, with values of about 0.5 and 0.4, and 0.1 without a specification. We executed the analysis with the help of Listing 3.8, which

Figure 3.24: Distribution of valueAddedTaxIncluded by ECS

provides the manifestations of tax values by offerings separated by ECS. The plot for the valueAddedTaxIncluded distribution is provided in Fig. 3.24.

SELECT ?key (?c/?p2 AS ?val)
WHERE
{
  {select ?key (count(?key) AS ?c)
   WHERE
   {?o a gr:Offering .
    ?o gr:hasPriceSpecification ?ps .
    ?ps gr:valueAddedTaxIncluded ?key .
    ?o foaf:maker }
   GROUP BY ?key
   ORDER BY desc(?c)}
  {SELECT (count(?o2) AS ?p2)
   WHERE {?o2 foaf:maker }
  }}

Listing 3.8: Multi-value analysis - valueAddedTaxIncluded

Validity Statement: We further analyze the distribution of the validity statement for the different ECS. GoodRelations allows defining the duration of an offering, which allows e.g. time-specific rebates, like happy hours. In this case, we computed the difference between the validThrough and validFrom properties. The analysis was executed with the help of Listing 3.9, which provides the distribution of values for the validity in days by ECS.

The distribution of Magento has a mean at 1, and some outliers at 365 days. Oxid EC has a very wide range of different validity durations, with a median of about 50,

Figure 3.25: Distribution of validity statement duration by ECS

and outliers scattered up to 10,000 days. Prestashop has a box spanning from 2 to 30 with a median of 7 days. Virtuemart shows no box, with a median of 1 day, and outliers of 15 and 29 days. The box plot for the validity distribution is provided in Fig. 3.25. To keep the analysis lean at this point, we excluded the border cases of sites only using validThrough while omitting validFrom, and validity statements that are attached directly to offers.

SELECT ?maker (GROUP_CONCAT(?days; SEPARATOR = " ") AS ?val)
WHERE
{{SELECT ?maker (?fd+(30*?fm)+(365*?fy) AS ?days)
  WHERE
  {?o a gr:Offering .
   ?o gr:hasPriceSpecification ?ps .
   ?ps gr:validFrom ?fro .
   ?ps gr:validThrough ?tro .
   BIND (day(?tro) - day(?fro) AS ?fd)
   BIND (month(?tro) - month(?fro) AS ?fm)
   BIND (year(?tro) - year(?fro) AS ?fy)
   ?o foaf:maker ?maker
   FILTER(regex(str(?maker), "[A-Z]")) }}}
GROUP BY ?maker

Listing 3.9: Multi-value analysis - validity statement

Eligible Regions: We finalize the analysis with a world heat map of eligible regions, i.e. the countries the shops deliver to. We provide the maps for all four

ECS.

We used the query shown in Listing 3.10 to get the results for each ECS. It uses only the ECS substitution pattern. The query computes the relative frequency of a country, e.g. a key-value pair of "DE"-"0.6" for Magento means that 60 % of all Magento shops list Germany as an eligible region.

SELECT ?key (?c/?p2 AS ?val)
WHERE
{{select ?key (count(?key) AS ?c)
  WHERE
  {?o a gr:Offering .
   ?o gr:eligibleRegions ?key .
   ?o foaf:maker }
  GROUP BY ?key
  ORDER BY desc(?c)}
 {SELECT (count(?o2) AS ?p2)
  WHERE {?o2 foaf:maker } }}

Listing 3.10: World heat map - eligibleRegions

We pick the Magento heat map for detailed discussion, as the Magento extension provides the largest amount of data. To plot it as a world map, we had to transform the initial float values into discrete values between 0 and 255. The color range spans from minimal values in blue, over low (green) and middle (yellow) values, to high (red).
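
A minimal sketch of this discretization is given below; the frequency values are placeholders.

# Map relative frequencies (floats) linearly to the integer range 0-255
def to_color_value(freq, max_freq):
    if max_freq == 0:
        return 0
    return int(round(255 * freq / max_freq))

freqs = {'DE': 0.6, 'US': 0.45, 'FR': 0.2}          # placeholder values
max_freq = max(freqs.values())
colors = {cc: to_color_value(f, max_freq) for cc, f in freqs.items()}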

Going over the globe from Hawaii in the very West to New Zealand in the very East of our plot, we can see that the United States ranks relatively high with dark orange. Shown in mid yellow, Canada has significantly fewer GoodRelations-equipped shops. Regarding South America, all the countries on the Western side show light green colors, hinting at low GoodRelations adoption. Only Brazil stands out significantly higher, and French Guyana is an outlier, as it belongs to France. Regarding Europe, we distinguish three groups of countries. Germany is leading with dark red; Italy, France, Spain, Austria, Switzerland, and the Benelux countries form a second leading group, whereas Portugal, Poland, the Czech Republic, and the Scandinavian countries form a third, still strong, group similar to the United States. The

ex-Yugoslavian countries, as well as Turkey, generally show weak figures in relation to the rest of Europe. Africa is relatively weakly represented in light green, with South Sudan as the sole country in the world that has no GoodRelations data in our dataset. South Africa is relatively well represented in comparison to the rest of the continent. Regarding Asia, light green dominates, with India and Thailand standing out slightly. Australia is similar to Canada. The world map for Magento is provided in Fig. 3.27.

The Prestashop world map is largely colored in mid-blue; only Poland stands out with dark red. We derive from this map that the extension for the system does not implement eligible regions in a sophisticated way, resulting in many merchants stating world-wide delivery. The world map for Prestashop is provided in Fig. 3.29.

Oxid E-Commerce is largely colored in dark blue, while the European Union stands out in turquoise. Furthermore, Austria ranks higher and is colored in yellow. Germany significantly stands out in dark red. The world map for Oxid E-Commerce is provided in Fig. 3.28.

Virtuemart shows a relatively low worldwide spread, with only the United States, Canada, India, and Burma colored in dark blue, as well as major parts of the European Union, excluding France. Germany stands out in a lighter blue, and the Netherlands show a dark red. The world map for Virtuemart is provided in Fig. 3.30.

To summarize, we can see that GoodRelations' Magento deployments are mainly used in first-world countries, and are especially prevalent in Germany, where GoodRelations was originally invented. The map remarkably resembles the map of the Human Development Index32.

We provide a legend of the coloring of the world maps in Fig. 3.26.

32http://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index

(Color scale: low to high frequency of the eligibleRegions property)

Figure 3.26: World map coloring

Figure 3.27: World map of the frequency of eligibleRegions - Magento

Figure 3.28: World map of the frequency of eligibleRegions - Oxid E-Commerce

Figure 3.29: World map of the frequency of eligibleRegions - Prestashop

Figure 3.30: World map of the frequency of eligibleRegions - Virtuemart

3.3.6 Evaluation

We evaluated the analysis of the sample by running the initial Listing 3.2 on the GoodRelations crawl introduced in Section 2.2.2.3. We provide the results in Table 3.22. We computed a Pearson's correlation between the result values of the crawl and the sample with a result of 0.863924, which is in line with what we can derive from the table. Given that our sample is biased by the restriction to only four ECS, we argue that this result validates our approach.

Table 3.22: Evaluation with crawl dataset, per offering

Property             Crawl       Sample
eligibleRegions      11.126953   37.463202
acceptedPaymentM.     7.134712    2.653263
hasPriceSpecific.     1.753126    1.055577
name                  1.021910    0.991654
description           1.016381    0.996396
hasBusinessFunct.     1.016373    0.991275
validThrough          0.963063    0.997724
hasStockKeepingU.     0.726073    0.569423
includes              0.700882    0.266882
hasEAN_UCC-13         0.438602    0.029211
hasInventoryLeve.     0.422009    0.602807
availableDeliver.     0.417845    0.957511
hasBrand              0.410875    n/a
validFrom             0.374725    0.997724
hasMPN                0.293408    0.008725
includesObject        0.253914    0.723824
eligibleCustome.      0.183842    0.348634
availableAtOrFr.      0.081306    0.132777
hasCurrency           0.045662    n/a
BusinessEntity        0.045226    0.000759
condition             0.017113    n/a
hasWarrantyProm.      0.014877    0.011950

3.3.7 Limitations

A central limitation of this section is the underlying dataset. We are only aware of GoodRelations-enabled Web shops that have been submitted to GR-Notify. At the moment, this reduces our data to only four ECS, albeit with a substantial amount of samples. In this context, our approach does not take into account Web shops that developed the structured data markup independently, or failed in the GR-Notify submission process. While we expect the number of those to be substantial, we currently see no viable way of finding them. A broad crawl could help here, but is out of reach with the resources at hand, as argued above.

Furthermore, we limited our research to GoodRelations data. We expect to find significant relations between, e.g., the validity of the delivered Web documents and

the quality of the structured data. Additionally, the patterns of the HTTP headers could provide additional insight. Both would qualify for future work.

3.3.8 Conclusion

We provided a thorough analysis of the data generated by GR-Notify as well as of a sample of real-world GoodRelations data.

From an aggregated point of view, our results show that there are massive differences in the prevalence and usage of features in the GoodRelations ecosystem. While we acknowledge a bias introduced by the different target groups of the ECS33, many of the results can only be explained with data quality problems, introduced by the design of the shop system, or the extensions, or bad data at the source. An extraordinary example is the result for the valueAddedTaxIncluded property, where all other extensions produced values over 0.8 for included tax, and only Magento showed a roughly equal distribution of included / not included of 0.5/0.4. We therefore underline that the central contribution of this part, besides paving the ground for the remainder of this thesis, is the insight that the data outcome of Web shop extensions is highly dependent on their design and standard settings. So, at the bottom line, the data generated by a broad range of differently designed extensions shows significant noise. We hope that the approach to follow in the further course of the thesis is able to significantly mitigate this problem, as by fundamental design it should not suffer from it.

33We mentioned in 2.1 that Magento is used by professional shops, and Prestashop by less professional ones.

4 Structured Data for Web Information Extraction in E-Commerce

This chapter provides the main contribution. We begin with a discussion of the properties of the approach, with special regard to what is achievable in comparison to the current state of the art. We move on with a full discussion of a prototypical implementation based on Literate Programming [Knu84] in Python. We then present the main results of our experiment. We evaluate our results with cross-validation, vary experimental settings extensively in order to understand the sensitivity of our approach, and evaluate the approach on the basis of a manually generated dataset. We close the chapter with a concise, yet pragmatic, use case.

4.1 Approach

In the following section, we discuss the details of our approach. We start with a summary of the key aspects of the approach and compare it with the currently popular approach of adding structured e-commerce data to the Web by the means of extension modules for shop software packages. We then discuss the selection of data structures and features and describe our experimental design.


4.1.1 Fundamentals

In the following section, we provide the fundamentals of the approach, with a comparison of traditional Web Information Extraction with shop extensions, and the focus on the promise part of GoodRelations’ APO principle.

4.1.1.1 Web Information Extraction in Comparison to Shop Extensions

In Section 3.3, we discussed the prevailing approach of generating structured data with extension modules for Web shop software packages. Our approach differs from the use of extension modules for shop software in the following ways. First, we expect that we will not be able to achieve the data quality in terms of granularity and reliability of the shop extension approach, as extensions directly generate structured data based on the database of the shops. At the same time, as our approach is mainly limited by the computing power available, we expect a higher coverage of Web sites and pages. Our main expectation is to be able to increase the coverage of relevant Web sites and page content from those sites in a largely automated fashion with a degree of data quality and data granularity that is sufficient for practical purposes. A qualitative comparison of strengths and weaknesses in comparison to the shop extension approach is provided in Table 4.1.

One main bottleneck of the shop extension approach is to provide sufficient incentives for shop owners to install the extensions. In practice, this incentive is given by the SEO benefits of the use of structured data. In comparison to that, provided the novel WIE approach works with a tolerable accuracy, its main needs in order to grow are computing power and the amount and quality of training data.

Regarding market coverage, while the shop extension approach has generated some significant uptake, the overall share of relevant Web pages with data markup remains at 30 % of the Web (cf. [web14]).

As our approach targets Web shops with a base market share of nearly 70 % (Magento: 54.05, Prestashop: 11.81, Virtuemart: 2.18, Table 3.2), we could potentially

Table 4.1: Comparison of shop extensions with our approach

                  Shop extensions    WIE approach
Main need         Incentive          Computing power
Market coverage   Relatively low     Potentially high
Data Quality      Inherently high    Relatively low

get a significant additional coverage in terms of available structured data, even if we were able to extract only from a relatively small number of shop software packages.

At the same time, the data quality of the shop extension approach is inherently high, because if there are no user-induced errors, the extension output directly reflects the database contents.

4.1.1.2 Focussing on the Promise Part of GoodRelations’ APO Principle

The GoodRelations Web vocabulary for e-commerce is fundamentally constituted by the APO principle, where A represents an agent, i.e. the business entity, P represents a promise1 this agent makes, and O the object, most often a product or service, the promise refers to [Hep11a]. For this research, we focus on the P / Promise part of this principle.

This has two main reasons. First, the promise is arguably the part of the GoodRelations data that has the highest business value. Fundamental business data like name, location, or contact details rarely change, and are therefore relatively easy to cover with automated extraction technologies. Product data, in turn, is usually dependent on manufacturers' data sources, and is curated by specialized service providers like GfK Etilize2. The offering data, in contrast, is often gathered by proprietary parties, like price comparison engines, or market places, and is not publicly available. Therefore, an automated method to generate this data qualifies, from our point of view, as a research contribution, more than an extraction of business entity or product model data.

1Most often offerings, but may be tenders, too.
2http://www.etilize.com/

4.1.2 Properties in Regard

In the following sections, we describe the GoodRelations-related data properties that have been extracted in the research part of the thesis, in the use case, and by the shop extensions, respectively. Basically, we use a reduced property set for the research part, as this suffices to state the main arguments, as well as an extended set in the use case that matches the important properties the shop extensions provide.

An overview of the features and their reflection in the different parts of the thesis and in the shop extension approach is provided in Table 4.2. The labels R/U/S (Research, Use case, Shop extensions) show properties regarded in the respective parts / methods, the label B shows properties that are more relevant from a business point of view, and the label L shows properties of limited importance or accessibility. For clarity, we additionally mark the respective property sections with those labels.

We chose to show the approach only on the limited property set, because the extended set regarded in the use case would cause additional engineering effort, exceeding a tolerable level of detail for the implementation part (4.2) of this chapter. The motivation is to keep things clear for the scientific part, and to achieve real-world relevance in the use case.

4.1.2.1 Properties Used in the Approach

To determine the target properties of our extraction, we analyze what is commonly produced by shop extensions, and explain why we included this information into our extractor, or not. Designing a production-ready system that adapts to a wide range of exotic cases is not the aim of our research. The central claims of this thesis are not affected by the limitation to a subset of targets.

Table 4.2: Extraction targets

Property                     Research   Use case   Shop extensions
Name                         R          U          S
Description                  R          U          S
Image                        R          U          S
Price                        R          U          S
Currency                     -          U          S
Features                     -          U          -
Validity                     -          -          B
Payment & Shipping methods   -          -          B
Eligible regions             -          -          B
Tax inclusion                -          -          B
Condition                    -          -          B
Inventory Level              -          -          L
Amount of products           -          -          L
Category                     -          -          L

The goal is to prove that structured data can be used as a learning set to extract further structured data, which can be shown with a limited set of data properties. In the following section, we introduce these properties in detail.

Name (R/U/S): The name of the offering is essential from our point of view. First, from a business perspective, the product or service name has a marketing / branding function and often communicates the main utility of the offering. Second, from an attention perspective in Web shops, the product name is usually dominant in the design of the page. As a central data property of an offering, we included the name property in the research part.

Description (R/U/S): While the offering name can be seen as a feature shared by many offerings of a certain product or service on the Web, the description is usually specific to a shop, and often generated by shop owners. As emphasizing certain aspects of the product has an important influence on the buying decision, we selected the description property as a further core extraction target. A potential problem with the extraction of description texts is that shop owners do not

automatically agree with the appearance of expensively generated product descriptions elsewhere than in their own Web shop. Because of its importance, we included the description property in the research part.

Price (R/U/S): The price is another central extraction target in the research part. From our point of view, it reflects the first-class part of a promise from a business perspective, while additional offering properties, like e.g. payment or delivery methods, are secondary. While in stationary retail the location of a merchant is quite important for buying decisions, it is not in e-commerce, as products are usually delivered to the customer. Therefore, in the e-commerce domain, the price is a highly important criterion in buying decisions. We omitted discounts in the use case (using the lower price if two similarly positioned prices were found) as well as tier prices, as the latter are usually hidden behind JavaScript browser logic.

Image (R/U/S): Multimedia objects, like product images or videos, provide exceptional value for the buying decision. At the same time, as they are usually subject to copyrights, they pose similar legal problems as the product description. We would like to emphasize at this point that the legal dimension is out of the scope of this research, and has to be addressed in a commercial setting. We included the image extraction in the research part because of its high importance.

4.1.2.2 Additional Properties Regarded in the Use Case

The following properties are not reflected in the scientific part of this chapter, but covered in the use case.

Currency (U/S): The extraction of currency information is relatively complex, as there are often multiple versions of symbols for each currency. Therefore, the problem is omitted for the research part. Meanwhile, the topic is addressed in the use case part of this chapter (4.2). Fundamentally, extracting the currency

of an offering is worthwhile, as it allows, for instance, for cross-currency price comparisons (see e.g. [SH13a]).

Features (U): While shop extensions do not yet generate product feature markup, in the course of the development of the use case it turned out that product features are accessible for extraction methods in a significant number of cases. As we expect product features to possess high discriminatory value in buying decisions, we decided to extract them in the use case.

4.1.2.3 Excluded Properties

The following properties have been excluded from the research and use case parts for two reasons. First, as we limited our scope to the promise part of the APO principle of GoodRelations, properties that are prevalent throughout a whole Web shop are more aligned to the business entity, and can therefore be excluded (label B). Second, there are properties that cannot be used due to limitations of the approach or due to their limited value (label L).

Validity of statement (B): While the GoodRelations vocabulary allows for expressing the duration of an offering, this feature is not used widely in its original intent. As we could show in the analysis of existing structured e-commerce data on the Web (Section 3.3), shop owners typically simply use the date of the HTTP request plus a fixed amount of time, typically between 24 h and several weeks. Additionally, as we have not encountered a shop that states the validity of offerings in the visible HTML code, it is out of reach for our extraction method. As this information generates relatively little value, we decided to omit this extraction target. The missing validity data could also likely be estimated on the basis of HTTP caching directives. Meanwhile, specific shops exist that operate solely on offerings with a short validity, usually called "deals". We did not include this special case in our research.

Payment and shipping methods, and eligible regions (B): While the GoodRelations vocabulary proposes to integrate payment and shipping methods and eligible regions in the offering, and the shop extensions generate such markup attached to the offering, we omitted these properties from the extractor design. That is because this information, in real-world scenarios, prevails throughout Web shops at the business entity level. As this is, from our point of view, more related to the agent (A) part of the introduced GoodRelations data model that we excluded above, we excluded it from our research and use case.

Tax inclusion in price (B): While shop extensions usually provide this information, we omitted it based on two premises. First, this property is true in the vast majority of cases, as Section 3.3 showed. Additionally, to extract it with high precision, we would have to design a highly capable natural language processing based module covering multiple languages. This seemed disproportionate for the additional benefit at stake.

Condition (B): Shop extensions additionally provide the condition of products contained in offerings as a standard property expressed in structured markup. In addition to the massive dominance of new items in most Web shops, extensive multilingual natural language processing would have been needed.

Inventory level (L): Shop extensions provide the amount of items left in stock of the merchant. We omitted this feature, as to the best of our knowledge, the provision of this information to customers is quite rare. That is mainly because detailed stock information can be exploited by competitors, e.g. for yield management.

Amount of products included in the offering (L): This is another property that is generated by the extensions, but is "one" in the vast majority of cases. As it provides little additional value from our point of view, it has been omitted.

Category (L): From a customer point of view, categories provide a tree-like structure that facilitates the navigation in stores, groups products, and allows for

serendipitous product discovery while skimming through. Additionally, from a merchant's perspective, category information is useful when analyzing assortment strategies on an aggregated level, and for the mapping to existing categorization standards. GoodRelations is extensible with category information according to Eclass-Owl [SRH13b], which could be linked with the extracted information. We excluded category information because only the Magento extension supports this feature, so there is not enough data available.

4.1.3 Experimental Design

Our experimental design is dominated by two choices. First, as argued before, we limit the research part to only four data properties, which reduces the complexity of the program code so that it is more understandable. Second, instead of evaluating the approach by comparing it to manually extracted properties, we use the learning and test set methodology of cross-validation [Koh95].

4.1.3.1 Evaluation

A fundamental problem of our research is to evaluate the approach on real-world Web shops, i.e. those that do not provide structured data in the first place. A viable, but costly method would be the manual extraction of property values and the subsequent comparison to automatically extracted ones. We follow this method by creating an additional dataset in Section 4.4.5.

As this does not scale, because it involves a significant amount of manual, non-automatable labour, we decided to pursue the established cross-validation method of splitting the data set into a learning set and a test set [Koh95], each randomized and comprising 50 % of the base sample. The approach is then trained / applied only to the learning set, and evaluated on the test set. This method allows for a rigorous evaluation, which would not have been possible with a manual data generation approach. A positive side-effect is that the evaluation operates on local data, and does not suffer from numerous sources of error and bias introduced by crawling the Web.
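
Conceptually, the split can be sketched as follows; the actual implementation in Section 4.2.2 operates on a Pandas DataFrame of HTML files and response metadata, and the file names here are placeholders.

import random

def split_learn_test(samples, train_fraction=0.5, seed=42):
    # Randomize the order, then cut the sample list in half
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

learn, test = split_learn_test(['page-%d.html' % i for i in range(10)])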

Throughout the further course of the chapter, we use the common definitions of precision and recall, widely used in Information Retrieval and Machine Learning [Pow11].

Both are defined on the basis of four characteristic subsets of results3:

• True positives (TP): Results that are true and have been labeled true.

• False positives (FP): Results that are false and have been labeled true.

• False negatives (FN): Results that are true and have been labeled false.

• True negatives (TN): Results that are false and have been labeled false.

Based on that, precision and recall are defined as (e.g. [Pow11]):

Precision = TP/(TP+FP)

Recall = TP/(TP+FN)

In the evaluation part of this chapter, we only measure the precision of the experiments. As we produce a prediction for every sample, recall is always 100 %.
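
For a single property and ECS, the precision measurement therefore reduces to the share of correctly extracted values, as in the following sketch; the value lists are placeholders.

def precision(extracted, true_values):
    # Every sample receives a prediction, so each one is either a TP or an FP
    tp = sum(1 for e, t in zip(extracted, true_values) if e == t)
    fp = len(extracted) - tp
    return float(tp) / (tp + fp) if extracted else 0.0

print(precision(['29.90', '12.00', '5.50'], ['29.90', '11.50', '5.50']))   # 0.66...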

4.1.3.2 High-level Pseudocode Overview

To illustrate the following implementation part, we provide a high-level pseudocode version of our experiment. It provides the essential blueprint of the implementation of the approach of this thesis. We would like to underline that this section is extremely reduced and simplified for clarity.

3In our case, the truthiness of a result could map to a website element being the price or not, for instance.

#Section 1: ECS / data property definition
systems = ["mage", "virtue", "oxid", "presta"]
data_properties = ["name", "desc", "price", "img"]

for ecs in systems:
    #Section 2: Generate split dataset (learn/test)
    html = allhtml(ecs)
    html_learn, html_test = split(html)

    for prop in data_properties:
        #Section 3: Extract respective property from offering pages
        data_learn, data_test = extract_data(html_learn, html_test, prop)

        #Section 4: Generate extraction rules
        ex_rules = get_extraction_rules(html_learn, data_learn)

        #Section 5: Apply extraction rules on test set
        data_eval = apply_extraction_rules(html_test, ex_rules)

        #Section 6: Evaluate extracted data vs true data
        precision = evaluate(data_test, data_eval)

Listing 4.1: Experimental design pseudocode overview

1. Section 1: ECS / data property definition: First, the four ECS and four data properties are declared, as respective parts of the learning data will be processed separately below.

2. Section 2: Generate split data set (learn/test)4: According to the evaluation approach, a function split randomly separates the HTML files into a 50 % learning set, and a 50 % test set (html_learn, html_test).

3. Section 3: Extract respective property from offering pages5: A function extract_data extracts the respective property values for each instance of the learning and test set (data_learn, data_test).

4Steps 2-6 are executed for each ECS.
5Steps 3-6 are executed for each data property.

4. Section 4: Generate extraction rules: Based on the input HTML and the learning set, a function get_extraction_rules generates the extraction rules for the respective ECS / property combination (ex_rules).

5. Section 5: Apply extraction rules on test set: The extraction rules are applied with the function apply_extraction_rules to the HTML test set, resulting in data for the evaluation (data_eval).

6. Section 6: Evaluate extracted data vs. true data: A function evaluate compares the data extracted with the rules against the true data, computing the result of the approach in the respective ECS / property combination (precision).

4.1.4 Conclusion

We provided an overview of the approach, discussing fundamentals rooted in the target domain, data properties considered, and the experimental design. In the next section, we will provide a full discussion of implementation details of the approach.

4.2 Implementation

In the following sections, we discuss the implementation details of the approach. We chose to provide a detailed discussion of the most important parts of the code, as it is compact, and thus adds value for understanding the approach, without interrupting the reading flow too much. Python code reads almost like pseudocode; therefore, we hope this section provides insight into the inner workings of our experiment.

A fundamental problem of our extraction task is that a very high reliability depends on a very extensive implementation, catering for the ambiguous nature of the Web document / HTML sources. Regarding this problem, we chose not to over-implement for the sake of conciseness, knowing that we might not gain the ultimate performance this way. Instead, we provide a limited implementation that clearly

(Figure 4.1 shows the four steps of the extraction rule generator, applied per property (name, image, price, description): 1. learning set property value extraction, 2. search for page elements containing the GoodRelations values, 3. element property extraction, 4. cumulative occurrence ranking.)

Figure 4.1: Extraction rule generator approach

backs the fundamental hypothesis, i.e. that structured / GoodRelations data can be used as a learning set for Web Information Extraction. The goal of this experiment is to provide a proof-of-concept. It is thus clear that the actual reliability of the extraction will be limited and can be improved in a production environment.

We discuss (1) the generation of the split data set and data loading, (2) the extraction of the data provided from offering pages, (3) a basic sanity check for data quality, (4) the generation of the extraction rules, and close with (5) a discussion of the implementation of the evaluation. However, we skip the code parts that are used to print and visualize the results, and significant parts of the evaluation, as they are not critical to the approach.

For the experiment, we used the same data as in Section 3.3. Regarding the base data, for each ECS the data directory contains a folder with the HTML files, and a CSV file with HTTP response data.

The implementation is fundamentally coherent with the method blueprinted in the introduction, and shown in Fig. 4.1. Additionally, we have included the implementation of the evaluation section here. While step 1 is discussed in the "dataset generation" section below, steps 2-4 belong to "Generate Extraction Rules".

4.2.1 Python as Main Programming Language

The implementation of the extractor component was completely done with the programming language Python (e.g. [Alc10]). Python has several main benefits:

• It has a very clear syntax without brackets, making code easily understandable for non-technical people. We therefore omit providing the discussion of algorithms in pseudo-code, in favor of showing parts of the implementation in literate programming style (e.g. [Knu84]). Literate programming aims at making code easy to understand for humans by providing extensive inline documentation.

• Commonly referred to as "batteries included" [NA13], Python offers a wide range of high quality libraries. For instance, throughout the thesis we made use of the excellent Pandas [Mck11] library, which will be discussed below.

In the course of the thesis, we made use of the following Python libraries, which provide an example of the “batteries included” concept introduced before.

• Pandas is a library that allows working with structured datasets in the tradition of the statistical software R (e.g. [Tea+11]). It significantly facilitates the handling and manipulation of data in serial, tabular, and panel forms at high performance. Pandas [Mck11] has been used in the learning set and extraction rule generator parts of Chapter 4, as well as in Chapter 3.

• grequests is a library that allows performing parallel HTTP (e.g. [Fie+99]) requests. It combines (1) requests6, a library that provides a straightforward API for HTTP interaction, and (2) gevent7, an asynchronous networking library. grequests was used for the learning set generator (a short usage sketch is given after this list).

• BeautifulSoup is an XML/HTML parser that supports the extraction and manipulation of Web documents8. For example, it allows for an easy extraction of all div elements of a Web page that match a certain property (see the sketch after this list). Therefore, it is well-suited for the extraction rule generator.

• iPython [PG07] is a library for interactive computing with Python. It allows for agile implementation in a browser window with the iPython notebook component, as well as easy parallelization. The power of iPython can be

6http://docs.python-requests.org/en/latest/
7http://www.gevent.org/
8http://www.crummy.com/software/BeautifulSoup/bs4/doc/

grasped best by considering the iPython notebook gallery9. As iPython allows for a mixture of explanation and multimedia with code, it is very well suited for literate programming.

• matplotlib [Hun07] is a two-dimensional graphic plotting library for Python. It is capable of producing a wealth of different graphs and visualizations and integrates seamlessly into iPython notebooks10.

• RDFlib is a Python library that allows working with RDF. It can handle many RDF syntaxes, and was therefore e.g. used for the RDF Translator, a Web service that converts arbitrary RDF syntaxes into others11, developed by my colleague Alex Stolz.

• gensim is a natural language processing library for Python12. Besides general features for handling text corpora, it provides Latent Semantic Indexing [Dee+90] and Latent Dirichlet Allocation [Ble+03] algorithms.

• lxml13 is a Python wrapper for the C programming language libraries libxml2 and libxslt, which allows a performant handling of XML data, and provides XSLT functionality. To the best of our knowledge, lxml is the fastest parser for HTML that meets certain structural requirements.
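
A short, illustrative usage sketch for two of these libraries is given below; the URLs and the CSS class name are assumptions for illustration, not taken from the actual extractor.

import grequests
from bs4 import BeautifulSoup

urls = ['http://example.org/offer-1.html', 'http://example.org/offer-2.html']

# grequests: issue the HTTP requests in parallel
responses = grequests.map(grequests.get(u) for u in urls)

# BeautifulSoup: pull all div elements carrying a given class from each page
for r in responses:
    if r is None:
        continue
    soup = BeautifulSoup(r.text, 'html.parser')
    for div in soup.find_all('div', class_='price'):
        print(div.get_text(strip=True))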

In the following sections, we describe the most important code snippets of the implementation in detail.

4.2.2 Dataset Generation

The import and pre-processing of the data is described in more detail in Annex A, Section ??.

9https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks
10http://matplotlib.org/index.html
11http://rdf-translator.appspot.com/
12http://radimrehurek.com/gensim/index.html
13http://lxml.de/

1  def load_data(path):
2      def getfilename(row):
3          return path+row['ecs']+"-html/"+row['uuid']+".html"
4
5      def getnetloc(uri):
6          return urlparse(uri).netloc
7
8      cols = "uuid enc data uri redir ecs".split()
9      d = [pd.read_csv(path+system+"-response.csv", names=cols) for system in ecs]
10     df1 = pd.concat(d)
11     df1['filename'] = df1.apply(getfilename, axis=1)
12     df1['html'] = df1.filename.apply(loadfile)
13     df1['baseuri'] = df1.uri.apply(getnetloc)
14     return df1
15
16
17 d1 = load_data('data/')
18 d2 = load_data('data2/')
19 d3 = pd.concat([d1, d2])
20
21
22 training_size = 0.5
23 d5 = pd.DataFrame()
24 for e in ecs:
25     d4 = d3[d3.ecs==e]
26     d4 = d4.reset_index(drop=True)
27     d4['test'] = d4.index > (int(len(d4)*training_size))-1
28     print e, len(d4[d4.test==True]), len(d4[d4.test==False])
29     d5 = pd.concat([d5, d4])
30 d3 = d5
31 d3 = d3.sort(columns=['baseuri'])
32 d3 = d3.reset_index(drop=True)
33
34
35 timebaseuri = pd.read_csv('csv/time-baseuri-grn02-2014.csv', \
36     parse_dates=['created'])
37 timebase = dict(zip(timebaseuri.base, timebaseuri.created))
38 d3['created'] = d3.baseuri.apply(lambda x: timebase.get(x, False))

Listing 4.2: Dataset generation - source code

• Line 1-14: A function load_data with a path parameter is defined. For each ECS, the CSV file containing the HTTP responses is read into a new Pandas DataFrame with the columns uuid, encoding, response data, uri, redirection, and ecs, and collected in a list.

The resulting list is concatenated into an aggregated DataFrame df1. In df1, another column filename is introduced, which is set to the full path of the HTML file associated with the respective response row. In addition, the HTML content itself and the base URI (network location) of each page are loaded into the columns html and baseuri.

• Line 17-19: The load_data function is executed for the two datasets, and a third dataset is generated by concatenating the first two.

• Line 22-32: For each ECS, this section sets the test column of the dataset to true or false, depending on the training size, here 0.5. The results are sorted by base URIs to prevent time-dependent effects.

• Line 35-38: From an external file, which holds a mapping between base URIs and the creation date from the GR-Notify data, a new column created is introduced, which adds a temporal dimension to our samples. This will be used in the evaluation in Section 4.4.

4.2.3 Extraction of Provided Data from Offering Pages

The extraction of GoodRelations data out of HTML files is non-trivial, with a conflict of goals between precision and speed. In the course of the thesis, we developed four approaches to solve the problem.

1. The first approach used the Python RDFlib library and extracted the property values with SPARQL queries. While the approach was highly accurate, first generating an RDF graph out of the data and then querying it was computationally expensive.

2. We developed an approach with custom, precise regular expressions to extract the properties. While the approach was very fast and the implementation highly concise, it missed some properties due to variations in the HTML code. Coping with these would have required highly complex regular expressions. Therefore, we did not pursue this direction further.

3. Another approach was extracting the properties with the BeautifulSoup library. BeautifulSoup parses HTML files and provides sound programmatic access to the DOM content in Python. Similar to the RDFlib approach, while yielding highly precise results, this approach again suffered from high computing cost.

4. We finally settled on an approach that fundamentally uses regular expressions to extract relevant parts of the DOM (here: div elements), and then iteratively filters the elements to generate the data in regard. We provide code and discussion of this approach below. For our purpose, this approach showed the best tradeoff between completeness of the extracted data and speed.

 1 rex = re.compile('<div (.*?)>',re.DOTALL)  # pattern partially garbled in the source; reconstructed per the description below
 2
 3 def ex_rdfa(html):
 4     data = re.findall(rex,html)
 5     data = map(lambda x: x.split('"')[:-1],data)
 6     data = [map(lambda x: x.replace("=","").strip(),data) for data in data]
 7     data = map(lambda x: dict(zip(x[0::2], x[1::2])),data)
 8     result = {}
 9     for d in data:
10         if "property" in d.keys() and "content" in d.keys():
11             result[d['property']] = d['content']
12         if "rel" in d.keys() and "resource" in d.keys():
13             if d['rel'] == "foaf:depiction v:image":
14                 d['rel'] = "foaf:depiction"
15             if d['rel'] not in result.keys():
16                 result[d['rel']] = []
17             result[d['rel']].append(d['resource'])
18     return result
19
20
21 def ex_data(df):
22     df['data'] = df.html.apply(ex_rdfa)
23
24     for c in "name description hasCurrencyValue".split():
25         df[c] = df.data.apply(lambda x: \
26             x["gr:"+c] if "gr:"+c in x.keys() else None)
27
28     df['hasCurrencyValue'] = df.hasCurrencyValue.apply(savetofloat)
29
30     pic = "foaf:depiction"
31     df[pic] = df.data.apply(lambda x: x[pic] if pic in x.keys() else None)
32     df[pic] = df[pic].apply(lambda x: x[0] if x else None)
33     return df
34
35 d3 = ex_data(d3)

Listing 4.3: Extract provided data from offering pages - source code

• Line 1: A simple regular expression is defined that extracts the attribute text inside the opening tag of div elements. The shop extensions in regard use the div tag attributes to express the structured GoodRelations data in RDFa.

• Line 3-18: A function ex_rdfa is declared with the HTML file content as parameter. First, the regular expression defined above is executed on the content, and the matches are saved in a list. On the list elements, an inline function (lambda) is applied that splits the elements at the quotation mark character. In the next step, in all results of the split function, the equal sign is removed, together with leading and trailing whitespace. In the last important step of this section, the property-value tuples are converted into a list of dictionaries. The last line in this section initializes a dictionary for the result data.

For every dictionary d in the list generated above (data), the following steps are executed. If the strings "property" and "content" are in the keys of the dictionary, the respective key / value combination is added to the result dictionary. If the strings "rel" and "resource" are in the keys of the dictionary, the value of the resource key is appended to a list stored under the rel key in the result. This caters for properties that have a resource instead of a literal as their value. A small code part handles rare cases that express the image property with the deprecated "v:image" property instead of "foaf:depiction". Finally, the result is returned.

• Line 21-33: A function ex_data is declared that executes the data extraction process for the DataFrame passed as parameter. First, the ex_rdfa function is applied to the html column and written to a column data. For the name, description, and price (gr:hasCurrencyValue) properties, a column is generated in the DataFrame and filled with the respective values from the data column. The values in the price column are converted into values of the datatype float. The same as above is done for foaf:depiction, taking the different namespace (gr: vs. foaf:) into account. Additionally, only the first result is taken here, as a pragmatic heuristic to return a single value.

• Line 35: Finally, the function is applied to the dataset.

4.2.4 Quality of the Extracted Data

In the following part of the code, we check the quality of the extracted properties by applying simple validation rules. The results are provided and discussed in the evaluation Section 4.4 of this chapter.

 1 validation_rules = {"name":lambda x: isinstance(x,str) and len(x) > 4,
 2     "description":lambda x: isinstance(x,str) and len(x) > 15,
 3     "hasCurrencyValue":lambda x: isinstance(x,float),
 4     "foaf:depiction":lambda x: re.search(regex_uri,x) != None}
 5
 6 def evalu(x,rule):
 7     if not x:
 8         return False
 9     if rule(x):
10         return rule(x)
11
12
13 def gen_val(df):
14     val_keys = []
15     for k,vr in validation_rules.items():
16         val_key = k+"_val"
17         val_keys.append(val_key)
18         df[val_key] = df[k].apply(lambda x: evalu(x,vr))
19     return df,val_keys
20
21
22 d3,val_keys = gen_val(d3)
23
24
25 print len(d3),
26 d3 = d3[(d3.hasCurrencyValue_val)&(d3['foaf:depiction_val'])& \
27     (d3.name_val)&(d3.description_val)]
28 len(d3)

Listing 4.4: Check extracted data quality - source code

• Line 1-4: We define the validation rules for the four properties. The rule for the name property checks if the given extracted data is a string and if the length is at least 5 characters. The rule for the property description also checks for a string, and requires a minimal length of 15 characters. The price / hasCurrencyValue rule checks if the given data is a float value. Last, the image value is required to be a valid URI.
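The regex_uri expression used in the image rule is defined in the accompanying helper library and not reproduced in the listing; a simple stand-in along the following lines would be sufficient for the check (this is an assumption, not the original definition):

import re

# Assumed stand-in for regex_uri: accept absolute http(s) URIs without whitespace.
regex_uri = re.compile(r"^https?://\S+$")

print(bool(regex_uri.search("http://example.com/img/offer.jpg")))  # True
print(bool(regex_uri.search("not a uri")))                         # False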

• Line 6-10: A small function wraps the validation rules, returning False if no argument to check was passed.

• Line 13-19: A function gen_val is declared, taking the respective dataset as parameter. First, a list val_keys is initialized. For every key and validation rule in the validation rules, a string val_key is generated by appending "_val" to the key. This string is then added to the val_keys list. Finally, in the DataFrame df, a column with this key is introduced and set to the result of the respective validation rule. The DataFrame and the val_keys are returned.

• Line 22: As above, the gen_val function is applied to the dataset. Additionally, the returned validation keys are saved in a variable.

• Line 25-28: We reduce the main dataset to samples with positive validation results. We argue that given the size of our dataset, we can afford to remove a whole sample if it is missing a property.

4.2.5 Generation of Extraction Rules

The following section is the main part of the implementation of our approach.

 1 d3['tree'] = d3['html'].apply(fromstring)
 2
 3
 4 def preproc_desc(xp):
 5     xp = simple_preprocess(xp,min_len=7)[::4]
 6     xp = ".*?".join(map(any2unicode,xp))
 7     return "//*[re:match(text(), '"+xp+"')]"
 8
 9
10 ns = {'re': 'http://exslt.org/regular-expressions'}
11 def get_xrules(data):
12     try:
13         tree,prop,xp,dataselect = data
14         parent = tree.xpath(xp(prop),namespaces=ns)
15         grandparent = tree.xpath(xp(prop)+"/..",namespaces=ns)
16         r = [[(i.tag,i.attrib.get("class")) for i in e] \
17             for e in zip(grandparent,parent)]
18         r = map(lambda x: flatten(x),r)
19         pdct = tree.xpath(xp(prop)+dataselect,namespaces=ns)
20         zipper = zip(r,[levenshtein(p.strip(),prop) / \
21             float(len(prop)) for p in pdct])
22         zipper = filter(lambda x: x[1] < 0.3,zipper)
23         r = getn(zipper,0)
24     except:
25         return False
26     else:
27         if 0 < len(r):
28             return r[:10]    # lines 28-32 are garbled in the source; reconstructed
29         else:                # from the accompanying description (return the first
30             return False     # ten rule candidates, or False if none were found)
31
32
33 default = "/descendant-or-self::*/text()"
34 params = \
35 [
36     ("price1","hasCurrencyValue",
37         lambda x: '//*[contains(text(),"'+x+'")]',default),
38
39     ("price2","hasCurrencyValue",
40         lambda x: '//*[contains(text(),"'+flip_point(x)+'")]',default),
41
42     ("name1","name",
43         lambda x: '//*[text()="'+x+'"]',default),
44
45     ("name2","name",
46         lambda x: '//*[contains(text(),"'+x+'")]',default),
47
48     ("pic1","foaf:depiction",
49         lambda x: '//img[@src="'+x+'"]',"//@src"),
50
51     ("pic2","foaf:depiction",
52         lambda x: '//img[contains(@src,"'+x.split('/')[-1]+'")]',"//@src"),
53
54     ("desc","description",
55         lambda x: preproc_desc(x),default)
56 ]
57
58
59 def gen_rules(df,par):
60     print len(df),
61     di = df[(df.test==False)].dropna().copy()
62     shp = len(di)
63     print shp
64     for p,prop,rule,getter in par:
65         print p
66         di[p+'_data'] = zip(di.tree,map(to_unicode, \
67             map(str,di[prop])),[rule]*shp,[getter]*shp)
68         di[p+'_rule'] = di[p+'_data'].apply(get_xrules)
69     return di
70
71
72 d3r = gen_rules(d3,params)
73
74 mc = 5
75 rules3,rules_x3,eval_ru3 = get_rules(d3r,params,ecs,mc,True)
76
77
78 mc = 100
79 rules3,rules_x3,eval_ru3 = get_rules(d3r,params,ecs,mc,"normal")
80 kmap = {'price1':'price','desc':'description','name1':'name','pic1':'image'}
81 for k,v in rules_x3.items():
82     kfn = kmap[k]
83     for k2,v2 in v.items():
84         fn = k2+"_"+kfn+".txt"
85         f = open('rules/'+fn,"w")
86         f.write("\n".join(v2))
87         f.close()

Listing 4.5: Generate extraction rules - source code

• Line 1: A column “tree” is introduced, containing the lxml parsing trees of the HTML documents of the respective sample.

• Line 4-7: A function to pre-process the description data is defined. The gensim simple_preprocess function is called, splitting the description string into a list of words and discarding those shorter than seven characters. To keep the XPath expression lean, we only use every fourth word of the original description as search trigger. The next line creates a regular expression after applying a unicode conversion to the list elements. The list elements are concatenated with ".*?", a non-greedy matcher in regular expressions that allows arbitrary characters in between. This results in a regular expression that consists of long words and matches arbitrary surroundings. Finally, the regular expression is inserted into an XPath expression that matches DOM element contents against it.
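For illustration, the effect of this pre-processing on a description string can be sketched as follows; the description text is made up, and the sketch omits the unicode conversion step of the original function:

from gensim.utils import simple_preprocess

desc = ("Handcrafted leather wallet with multiple compartments, "
        "available in brown and black, shipped within two business days.")

tokens = simple_preprocess(desc, min_len=7)[::4]   # long words only, every fourth one
pattern = ".*?".join(tokens)
xpath = "//*[re:match(text(), '" + pattern + "')]"
print(xpath)  # e.g. //*[re:match(text(), 'handcrafted.*?available')]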

• Line 10-30: The following function get_xrules can be seen as a central component of the implementation. Given extracted GoodRelations property values, it generates candidate DOM elements containing these values. It fundamentally operates by searching for the content in regard with specific XPath expressions, and finally returns HTML tags and class properties for parent and grandparent DOM elements. The ns variable declaration provides the namespace that the regular-expression XPath function needs.

First, from a packaged data variable, the variables containing the lxml parsing tree, the property in regard, the XPath expression, and a data select variable are unpacked. The next two steps apply the parent and grandparent XPath expressions to the lxml parse tree. Then, a list comprehension extracts the respective tags and class properties, e.g. "h1, product, span, product-name". On the resulting list r, the flatten function is applied, which combines multiple lists into one.

Then, the rule is executed, and its extraction results are compared by Levenshtein [Lev66] distance with the true value. If the distance is higher than 30 % of the length of the true value, the rule is discarded (a minimal sketch of this check is given below). This has been implemented to filter rules generated by XPath expressions based on the "contains" function, which could match much more than intended.

Finally, if the resulting list has at least one element, its first 10 elements are returned. We chose to limit the rule size in this way to keep the system lean. This caters for rare cases generated by the description rule, which yielded an overly high number of rules. Otherwise, "False" is returned, to easily count a malfunction.
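The levenshtein helper used for this check comes from the accompanying library; with the python-Levenshtein package, an equivalent filter might read as follows (the candidate values are illustrative):

import Levenshtein  # pip install python-Levenshtein

true_value = "19.99"
candidates = ["19,99 EUR incl. VAT", "19,99", "Add to cart"]

# Keep only candidates whose edit distance to the true value is below
# 30 % of the true value's length.
kept = [c for c in candidates
        if Levenshtein.distance(c.strip(), true_value) / float(len(true_value)) < 0.3]
print(kept)  # ['19,99']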

• Line 33-56: These lines define rule generation parameters, which are wrapped around the data the get_xrules function expects. The parameters are a set of Python objects with the following content:

1. A string that provides a human-readable name for the rule.

2. The property / data column of the DataFrame, i.e. the value that should be searched for in the HTML file.

3. A function that wraps the search value with an XPath expression, specific to the respective property.

4. A dataselect XPath expression that is appended to the search XPath, for later extraction.

To gain higher precision, we designed two slightly different rules for price, name, and picture, respectively, while one rule seemed sufficient for the description property. The rules are presented in detail below and displayed in Table 4.3 for a better overview.

1. The first rule price1 aims to extract the DOM elements containing the price by a respective XPath “contains” expression.

2. The second rule price2 is very similar, but has an additional pre-processing function that switches the dots in prices for commas and vice versa (a possible implementation of this flip_point helper is sketched after this list). This simple measure has turned out to be highly effective, as price values in GoodRelations are commonly expressed as floating point values with a dot, but are often rendered with commas in the visible content of Web shops.

3. The rule name1 aims for a perfect match of the name property as content of a DOM element.

4. The rule name2 also targets the offering name, but matches DOM elements containing the name string.

5. The rule pic1 tries to find an exact match of the offering image URI. As content selector, in comparison to the other rules, it uses "//@src".

6. The second rule to extract images, pic2, allows a broader range of candidates, matching "img" DOM elements that contain the image file name.

Table 4.3: Overview of the extraction rules

Name    Property          XPath                                            Data-Sel.
price1  hasCurrencyValue  '//*[contains(text(),"'+x+'")]'                  default
price2  hasCurrencyValue  '//*[contains(text(),"'+flip_point(x)+'")]'      default
name1   name              '//*[text()="'+x+'"]'                            default
name2   name              '//*[contains(text(),"'+x+'")]'                  default
pic1    foaf:depiction    '//img[@src="'+x+'"]'                            //@src
pic2    foaf:depiction    '//img[contains(@src,"'+x.split('/')[-1]+'")]'   //@src
desc    description       preproc_desc(x)                                  default

7. Finally, the description extraction rule desc uses the pre-processor introduced above. All rules except the two image rules use the default setting as content selector, which extracts the visible content of the tag itself and its children.
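The flip_point helper used by the price2 rule is part of the accompanying library and not reproduced in Listing 4.5; a minimal sketch that swaps decimal points and commas, which is all the rule requires, could look like this (assumed implementation):

def flip_point(value):
    # Swap "." and "," so that a GoodRelations price like "19.99" also matches
    # shop pages that render it as "19,99" (and vice versa).
    return value.replace(".", "@").replace(",", ".").replace("@", ",")

print(flip_point("19.99"))     # 19,99
print(flip_point("1,299.00"))  # 1.299,00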

• Line 59-69: A function gen_rules is declared, with parameters for the respective DataFrame and the extraction parameters described above. First, a copy of the working DataFrame is generated, selecting only the learning set rows (test==False). Rows with empty elements are dropped to ensure that the following functions get the needed data. A variable shp, containing the length of the DataFrame, is instantiated. Then, for all rules in the parameter set described above, the following two steps are executed. First, an additional column with the key rulename+"_data" is generated, containing a zipped list of the respective variables. Second, the get_xrules function is applied to this column and written to a rulename+"_rule" column. The intermediary step may seem superfluous, but is needed because Pandas does not allow functions that receive multiple columns to return composed values. Finally, the resulting DataFrame is returned.

• Line 72: Similar to above, the gen_rules function is applied to the dataset. New variable names are used here, as we execute the process only on the learning set and therefore want to keep the base dataset intact.

• Line 74-75: This part of the implementation ends with a function call for all datasets that generates the final rules as executable XPaths and evaluates the rules at the same time. As parameters, the function expects the DataFrame, the extraction parameters defined above, the ECS list, and the limit of the most common rules to finally return. This function is extensive, and as it is not central to the approach, we put it into the library loaded above. For every property / ECS combination, it aggregates the extracted rules and combines the resulting lists, if needed, of the two different extraction rules, e.g. price1 and price2. It filters occurrences of "script" DOM elements out of the results, as those lowered the final performance. Additionally, only rules that contain at least one class property, either in the first DOM element or in the second, are allowed. It returns the rules in the original format, an XPath version, and the evaluation scores of the rules. Concerning the rule evaluation scores, the function computes the relative extraction success of each property / ECS combination; e.g. 0.5 for "name" / "Magento" means that the rules name1 and name2 yielded rules for Magento in 50 % of the cases. It also includes a parameter to modify the rule generation process, which will be used in the evaluation.

• Line 78-87: Additionally, the XPath rules are written to 16 files, one per property / ECS combination, for later use in Section 4.5. The most-common-rules parameter is set to 100, as the use case is able to cope with many rules.

4.2.6 Evaluation

The results of this subsection are presented in Section 4.4, Evaluation.

 1 def eval_gen(ecs, true, tree, ru, rulename):
 2     ru = ru[rulename][ecs]
 3     if true:
 4         true = decode_htmlentities(to_unicode(true).replace('"','"').lower())  # replace arguments garbled in the source
 5         r = [tree.xpath(r) for r in ru]
 6         r = flatten(filter(lambda x: x,r))
 7         r = map(lambda x: x.replace("\r","").replace("\n","").strip(),r)
 8         r = filter(lambda x: len(x),r)
 9         r = map(lambda x: to_unicode(x.lower()),r)
10         if r:
11             pdct = Counter(r).most_common(5)
12             score = float(levenshtein(true,pdct[0][0]))/len(true)
13             return score,pdct[0][0]
14         else:
15             return (-1,"")
16
17
18 def eval_price(ecs, true, tree, ru, rulename):
19     ru = ru[rulename][ecs]
20     try:
21         true = float(true)
22         r = map(tuple,filter(None,[tree.xpath(r) for r in ru]))
23         pdct = Counter(r).most_common(1)[0][0][0]
24         pdct = pdct.strip()
25         rex = re.match(".*?(\d*)[,.](\d*).*?",pdct)
26         if rex:
27             vk,nk = rex.groups()
28
29             if nk:
30                 if len(nk) == 2:
31                     pdct = vk+"."+nk
32                 if len(nk) == 3:
33                     if nk != "0":
34                         pdct = vk+nk
35             else:
36                 pdct = nk
37         else:
38             rex = re.match(".*?(\d*).*?",pdct).groups()[0]
39         pdct = float(pdct)
40         score = (abs(true-pdct))/100
41     except Exception as e:
42         return (-1,"")
43     else:
44         return (score,pdct)
45
46
47 def test(dftt,col,evalfunc,ecs,ru,rn):
48     print rn,
49     f2 = dftt[dftt.test==True].filter(["ecs",col,"tree"])
50     f2['res'] = f2.apply(lambda row: \
51         evalfunc(row['ecs'],row[col],row['tree'],ru,rn),axis=1)
52     return f2
53
54
55 test_params = [("name",eval_gen,"name1"),
56     ("description",eval_gen,"desc"),
57     ("foaf:depiction",eval_gen,"pic1"),
58     ("hasCurrencyValue",eval_price,"price1")]
59
60
61 res_dfe3 = pd.DataFrame({p:test(d3,p,f,ecs,rules_x3,t) for p,f,t in test_params})
62
63 thresholds = {"hasCurrencyValue":0.4,"foaf:depiction":0.3, \
64     "name":0.3,"description":0.5}
65 finalresults = getfinalresults(res_dfe3,ecs,thresholds)

Listing 4.6: Evaluation - source code

• Line 1-15: The function eval_gen is defined. It caters for the description, image, and name properties. It expects as parameters the ECS, the true value (i.e. the value extracted from the GoodRelations data), the lxml (see 4.2.1) tree of the respective HTML document, the ruleset of the respective dataset, and the rule name.

Regarding the selection of the correct extracted element, if there are multiple results after applying the rules, we decided again not to over-engineer and applied a very simple heuristic, namely taking the element that appears most often. Again, this is a tribute to keeping the implementation as concise as possible. We expect that much performance could be gained here with an elaborate implementation in a production setting.

The function returns the predicted value and the score; in that form, evaluating the scores with different thresholds outside this procedure is possible. We will exploit this in the evaluation in Section 4.4.

• Line 18-44: The price evaluation function is a bit more extensive, as it includes basic handling of different price notations. In a production setting, elaborate handling of prices is complicated, as there are many different notations, e.g. for thousands separators or values without decimals. Here, the score reflects how much the predicted price differs from the true value; the threshold caters for price reductions. The function additionally projects the scores to the same range as in the generic evaluation function above.

• Line 47-52: A function test is defined, expecting as parameters the DataFrame to test, the property / DataFrame column to operate on, the evaluation function, the ECS, the respective set of XPath rules, and the rule name. First, the function selects the rows labeled "True" in the test column and keeps only the ECS, the respective property / DataFrame column, and the tree variables. Then, the evaluation function is applied to all these rows, with the parameters as introduced above.

• Line 55-58: A list of test parameters is declared, consisting of the four properties in regard, the respective evaluation function, and the respective rule name.

• Line 61: Similar to above, for the datasets, the extraction rules are evaluated, as defined by the environment just introduced.

• Line 63-65: In this last section of the implementation, we first define evaluation thresholds for the four properties. The getfinalresults function is provided in the lib.py module, as it is of subordinate importance.

4.2.7 Conclusion

In this section, we documented the implementation details of our approach, relying on the near-pseudocode characteristics of the Python programming language. While some parts of the implementation could be elaborated further to reach higher performance, we have tried to keep the complexity as low as possible while backing our hypothesis. In the next section, we will provide the results of the rule generation process.

4.3 Results

In this section, we present the results of our experiment. It is relatively short, as we only provide the results of the intermediary steps in the extraction rule generation process, while the results of the evaluation will be provided in Section 4.4.

Table 4.4: HTML sample pages - all / training / evaluation from different ECS and sums

             All     Training  Evaluation
Magento      3.103   1.552     1.551
Oxid         240     120       120
Prestashop   1.585   793       792
Virtuemart   2.358   1.179     1.179
Sum          7.286   3.644     3.642


Here, we follow the structure that we provided in the implementation in Section 4.2. We first show the results of the data acquisition process, then proceed with an analysis of the data that could be extracted, and finally present the results of the rule generation process.

4.3.1 Dataset Generation

Table 4.4 provides an overview of the results of the data loading, cleansing and training / evaluation set generation. Regarding the aggregated samples of all ECS, we can see that there are 7.278 HTML page samples in the dataset. Magento has the highest share with 3.101 samples, followed by Virtuemart with 2.356 samples, and Prestashop with 1.583 samples. Oxid E-Commerce accounts for only 238 samples. All samples are split into nearly equal 50 % training and evaluation sets, as described in the implementation part.

4.3.2 Extraction of Data from Offering Pages

The second result of our experiment is provided by the data quality analysis implemented after the offering page extraction, and shown in Table 4.5. We can see that with a mean of nearly 95 %, our extraction method yielded valid results according to the data quality check. We operate only on the validated samples below. This reduces the overall sample size from 7.218 to 6.105.

Table 4.5: Ratio of valid data in the extracted raw data

             Description  Image  Price  Name   Mean
Prestashop   0.894        0.883  0.946  0.976  0.925
Oxid EC      0.908        0.992  0.954  0.988  0.960
Magento      0.885        0.937  0.937  0.958  0.929
Virtuemart   0.948        0.954  0.968  0.986  0.964
Mean         0.909        0.941  0.951  0.977  0.945

4.3.3 Rule Generation

In the following subsection, we discuss the results of the rule generation process. We provide details regarding the four properties we extracted, and conclude with aggregated results. While the details show the relative occurrence per sample of a single rule on a given ECS by property, the aggregated results combine the relative occurrences into a single figure, providing a better overview. High scores mean that a certain rule could be extracted very often and works for a large share of the samples, whereas low scores mean the opposite. The aggregate values can exceed 1 because, e.g. for the name and image properties, multiple occurrences of the element in regard could be extracted. The properties can differ significantly within a single ECS, because the templates used by the shops may differ significantly in their usage of tags and classes that include the properties in regard.
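To make the aggregation explicit, the following pandas sketch shows how per-rule occurrence scores of this kind can be combined into aggregated figures; the column names and the two example rows (taken from Table 4.8) are purely illustrative:

import pandas as pd

# Illustrative per-rule occurrence scores (ecs, property, rule, score).
rule_scores = pd.DataFrame([
    ("Magento", "name", "(li, product, strong, None)",   1.403),
    ("Magento", "name", "(div, product-name, h1, None)", 1.258),
    # ... one row per rule / ECS / property combination
], columns=["ecs", "prop", "rule", "score"])

# Sum the scores of the five most frequent rules per ECS and property.
top5 = (rule_scores.sort_values("score", ascending=False)
                   .groupby(["ecs", "prop"], sort=False)
                   .head(5))
aggregated = top5.groupby(["ecs", "prop"])["score"].sum().unstack()
print(aggregated)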

Description Property

Table 4.6 shows the five most common rules for the description property and the share of samples from which they could be extracted14. Regarding the extracted rules, the string "std" is prevalent, which is most likely an abbreviation for "short-description". It is remarkable that this string is used in the same way throughout the different ECS. The description rules did not yield scores as high as those of the other properties, with an overall maximum of 0.6 distributed over the first three ranks.

14The values can be above 1, as one sample can yield more than one rule prevalent in the five most common rules.

Image Property

Table 4.7 provides the five most common rules for the image property and the share of samples from which they could be extracted. All ECS except Magento rank significantly above one for the first rule, while Magento shows values below 20 % for its first rules. Interestingly, the string "cloud-zoom" appears quite often across ECS. We attribute this to an image extension that is offered for all four ECS.

Name Property

Table 4.8 shows the five most common rules for the name property and the share of samples from which they could be extracted. Regarding the extracted rules, the string "product-name" is relatively frequent. Additionally, we can see some obvious false positives, like rank 4 of Prestashop, which clearly is a rule targeting a description. We decided not to filter the rules manually at this point, as this would invalidate the evaluation of our approach.

Price Property

Table 4.9 presents the five most common rules for the price property and the share of samples from which they could be extracted. We can see that the extraction process yielded relatively stable rules containing "price" strings. Regarding the scores, the price extraction was not very successful, with only Magento and Oxid EC yielding over 20 % for the first ranks.

Aggregated Results

We provide an overview of the aggregated results of the extraction rule generation process in Table 4.10 and Fig. 4.2. The aggregated results are the sum of the scores of the n=5 most frequent rules.

Table 4.6: Rule generation results, desc. property, rank 1 to 5, and score

ECS          Rank  Description - HTML Path for the Property            Score
Magento      0     (div, short-description, div, std)                  0.032
             1     (div, product-tabs-content, div, std)               0.023
             2     (div, std, p, None)                                 0.016
             3     (div, None, div, std)                               0.014
             4     (div, box-collateral box-description, div, std)     0.012
Oxid_EC      0     (div, panel, div, std)                              0.018
             1     (a, collapsible-wrapper, input, checkbox relat...   0.009
             2     (div, footer_top, a, None)                          0.009
             3     (div, product-tabs-content, div, std)               0.009
             4     (li, level1 nav-2-6 collapsible, a, product-im...   0.009
Prestashop   0     (div, rte, p, None)                                 0.053
             1     (div, rte align_justify, p, None)                   0.026
             2     (div, sheets align_justify, div, rte)               0.013
             3     (div, descripcion, p, None)                         0.005
             4     (div, panel, div, std)                              0.003
Virtuemart   0     (div, jwts_tabbertab, p, None)                      0.048
             1     (div, yagendoo_vm_fly1_inner, p, None)              0.033
             2     (div, std, p, None)                                 0.005
             3     (div, None, div, std)                               0.004
             4     (a, None, span, tl)                                 0.004

• The aggregated score for the Description property is quite low for all ECS, with 0.097 for Magento, 0.054 for Oxid E-Commerce, 0.1 for Prestashop, and 0.094 for Virtuemart. That is rooted in the highly complex nature of the description extraction task, as described in Section 4.2.

• The aggregated score for the Image property is generally high, with Oxid E-Commerce yielding a score of 1.447, Prestashop 2.479, and Virtuemart 1.876. Only Magento is a significant outlier with 0.62.

• The aggregated score for the Name shows a relatively sparse distribution, with Magento scoring 2.905, Oxid E-Commerce 2.213, Prestashop 1.115, and Virtuemart 1.306.

• The aggregated score for the Price is relatively low, with 0.286 for Magento, 0.294 for Oxid EC, and 0.176 for Prestashop. Virtuemart is a significant outlier with 0.061.

Table 4.7: Rule generation results, image property, rank 1 to 5, and score

ECS          Rank  Image - HTML Path for the Property                  Score
Magento      0     (a, cloud-zoom, img, None)                          0.215
             1     (p, product-image product-image-zoom, img, None)    0.188
             2     (a, cloud-zoom-gallery, img, None)                  0.142
             3     (a, ig_lightbox2, img, None)                        0.039
             4     (a, fancybox, img, None)                            0.036
Oxid_EC      0     (a, cloud-zoom, img, None)                          0.714
             1     (div, zoomed, img, None)                            0.277
             2     (span, artIcon, img, None)                          0.188
             3     (a, cloud-zoom-gallery, img, None)                  0.152
             4     (p, product-image product-image-zoom, img, None)    0.116
Prestashop   0     (a, thickbox shown, img, None)                      1.435
             1     (a, thickbox , img, None)                           0.849
             2     (div, None, img, jqzoom)                            0.099
             3     (a, kujjukZoom, img, None)                          0.050
             4     (a, cloud-zoom-gallery, img, None)                  0.046
Virtuemart   0     (a, modal, img, None)                               1.368
             1     (div, yagendoo_gallery_item, img, None)             0.347
             2     (a, cloud-zoom, img, None)                          0.077
             3     (a, cloud-zoom-gallery, img, None)                  0.043
             4     (p, product-image product-image-zoom, img, None)    0.041

Figure 4.2: Aggregated rule generation results

Table 4.8: Rule generation results, name property, rank 1 to 5, and score

ECS          Rank  Name - HTML Path for the Property                   Score
Magento      0     (li, product, strong, None)                         1.403
             1     (div, product-name, h1, None)                       1.258
             2     (div, product-name, h2, None)                       0.113
             3     (div, short-description, div, std)                  0.107
             4     (div, page-title, h1, None)                         0.024
Oxid_EC      0     (div, product-name, h1, None)                       0.973
             1     (li, product, strong, None)                         0.893
             2     (div, None, h2, pageHead)                           0.205
             3     (div, short-description, div, std)                  0.071
             4     (li, product, span, None)                           0.071
Prestashop   0     (div, clearfix, h1, None)                           0.417
             1     (div, product-name, h1, None)                       0.258
             2     (li, product, strong, None)                         0.248
             3     (div, breadcrumb, span, navigation_end)             0.161
             4     (div, short-description, div, std)                  0.031
Virtuemart   0     (center, None, h1, yagendoo_vm_fly_prod_title)      0.628
             1     (div, product-name, h1, None)                       0.322
             2     (li, product, strong, None)                         0.305
             3     (div, product-name, h2, None)                       0.033
             4     (div, jwts_tabbertab, p, None)                      0.018

4.3.4 Conclusion

In this section, we have shown how HTML path information that indicates the position of relevant information can be generated on the basis of HTML pages that contain explicit data markup in RDFa syntax.

The next section will provide an evaluation of the experiment.

4.4 Evaluation

In Section 4.1, we introduced our main evaluation setup.

• In Section 4.3.1, we split the data into a training set and an evaluation set, each containing 50 % of the samples, separated by ECS.

Table 4.9: Rule generation results, price property, rank 1 to 5, and score

ECS          Rank  Price - HTML Path for the Property                  Score
Magento      0     (span, regular-price, span, price)                  0.209
             1     (p, special-price, span, price)                     0.060
             2     (div, block-content , span, price)                  0.008
             3     (div, popular_final_price, span, price)             0.005
             4     (div, product-shop, span, price)                    0.004
Oxid_EC      0     (span, regular-price, span, price)                  0.196
             1     (p, special-price, span, price)                     0.098
             2     0                                                   0.000
             3     0                                                   0.000
             4     0                                                   0.000
Prestashop   0     (span, our_price_display, span, None)               0.055
             1     (span, regular-price, span, price)                  0.053
             2     (p, our_price_display, span, None)                  0.026
             3     (p, special-price, span, price)                     0.022
             4     (a, None, span, price)                              0.020
Virtuemart   0     (span, regular-price, span, price)                  0.039
             1     (div, yagendoo_productPrice, span, yagendoo_pr...   0.012
             2     (strong, None, span, price)                         0.004
             3     (strong, None, span, size)                          0.003
             4     (td, None, span, productPrice)                      0.003

Table 4.10: Aggregated rule generation results - dataset

             Desc.  Image  Name   Price
Magento      0.077  0.614  2.603  0.269
Oxid_EC      0.050  2.980  1.095  0.260
Prestashop   0.082  2.824  0.626  0.122
Virtuemart   0.115  2.229  0.924  0.028

• We performed the rule generation process on the training set only.

• We evaluated the performance of the approach by applying the extraction rules to the evaluation set only. As it is already labeled by the structured data / GoodRelations markup, gaining the “true” data is relatively simple compared to manual approaches, and scales well.

On this main dataset, after discussing the results generated with the standard settings, we evaluate four modifications of the experiment. These are (1) stricter / more relaxed evaluation settings, modifications of the base sample regarding (2) training / evaluation set size and (3) temporal criteria, and finally (4) a modification of the rule generation process.

In addition to the main dataset based on GR-Notify, we generated a second dataset by manually labeling n=20 offering pages per ECS. We evaluate the performance on this dataset in the same manner as the modifications of the experiment above.

The basic hypothesis of this evaluation section is that existing structured data can be used to drive a Web Information Extraction system in e-commerce. If we achieve significant precision using the implementation described before, we argue that structured data is fundamentally helpful for Web Information Extraction in this domain.

In the following sections, we always provide the precision of the experiments. We argued above that recall, owing to the design of the experiment, is always 100 %.

4.4.1 Standard Settings

We provide the combined results of the main evaluation in Table 4.11. The score presents the relative precision of the evaluation functions applied to the respective test sets. If the score is 0.7, it means that in 70 % of the cases, our approach extracted the correct value for the given property / ECS combination.

• Regarding the description property, we can see that the experiment yielded positive results in only 14.2 % of the cases. We have already introduced the complexity of the extraction of the description property in Section 4.2. Offering descriptions often make full use of HTML markup, i.e. text parts are set in special typography, related content is linked, or even structured content like lists or tables is embedded. As the different shop extensions do not follow a determined way of converting this content into a string that qualifies as a literal in the GoodRelations markup, it is quite hard to extract these values correctly in an automated fashion. Therefore, we think that the achieved score already shows the basic workings of our approach.

Table 4.11: Final results - standard settings - precision

             Desc.  Image  Name   Price
Magento      0.250  0.473  0.839  0.820
Oxid_EC      0.131  0.477  0.925  0.710
Prestashop   0.113  0.633  0.155  0.552
Virtuemart   0.074  0.410  0.286  0.415
Mean         0.142  0.498  0.551  0.624

• The image property shows an overall score of 49.8 %, which, from our point of view, is solid evidence that our approach is valid. While all other ECS score around 40 %, Prestashop is a positive outlier with 63.3 %. We expect a stable pattern regarding this property to be at work for this ECS.

• The name property again indicates that our approach is viable. It yields an aggregated score of 55.1 %. Magento and Oxid EC score highly with 83.9 % and 92.5 %. As Prestashop and Virtuemart score low with 15.5 % and 28.6 %, we performed a manual inspection. It turned out that with standard settings, these shops combine the manufacturer name with the product name in a single DOM element, which explains the low score. As already described in Section 4.2, we did not implement a workaround for these cases in order not to invalidate the evaluation.

• The price property also supports the positive assessment with an aggregated score of 62.4 %. Magento and Oxid EC score well with 82 % and 71 %, while Prestashop and Virtuemart yield only 55.2 % and 41.5 %. We attribute the low scores in this case also to the aforementioned low technical sophistication we encountered in these ECS.

4.4.2 Modified Evaluation

In the following sections, we show the differences in comparison to the standard settings.

Figure 4.3: Final results - standard settings - precision

Table 4.12: Strict evaluation

                    Description  Image  Name  Price
Standard settings   0.5          0.3    0.3   0.4
Modifications       -0.1         -0.1   -0.2  -0.3
Modified settings   0.4          0.2    0.1   0.1

Table 4.13: Impact of the stricter settings on the precision

             Desc.   Image   Name    Price
Magento      -0.014  -0.028  -0.010  -0.032
Oxid_EC      -0.019  -0.000  -0.018  -0.018
Prestashop   -0.013  -0.001  -0.014  -0.022
Virtuemart   -0.012  -0.000  -0.000  0.000
Mean         -0.014  -0.008  -0.011  -0.018

Strict Settings

We first used stricter thresholds for the evaluation. Standard settings, modifications, and modified settings are presented in Table 4.12, while the results are presented in Table 4.13 and Fig. 4.4. Overall, the modification yielded only slightly worse results. The description property suffered a performance loss of -0.014, image -0.008, price -0.018, and name -0.011. The small changes for Virtuemart / name are attributed to the low baseline score.

Figure 4.4: Impact of the stricter settings on the precision

Table 4.14: More relaxed evaluation - settings

                    Desc.  Image  Name  Price
Standard settings   0.5    0.3    0.3   0.4
Modifications       +0.1   +0.1   +0.2  +0.3
Modified settings   0.6    0.4    0.5   0.7

Table 4.15: Impact of the relaxed settings on the precision

             Desc.  Image   Name   Price
Magento      0.032  0.011   0.005  0.013
Oxid_EC      0.009  -0.000  0.000  0.000
Prestashop   0.011  0.002   0.014  0.003
Virtuemart   0.014  -0.000  0.001  0.000
Mean         0.017  0.003   0.005  0.004

More Relaxed Settings

We additionally evaluated symmetrically designed, more relaxed evaluation settings. Standard settings, modifications, and modified settings are presented in Table 4.14, while the results are presented in Table 4.15 and Fig. 4.5. In comparison to the stricter settings, the changes are even smaller, with mean deltas below 0.005 for the image, name, and price properties. Only the description property sees changes of up to 0.032 (Magento), with a mean of 0.017. We have already discussed the complexity of description extraction above.

Figure 4.5: Impact of the relaxed settings on the precision

Table 4.16: Impact of a training set of 0.25 settings on the precision

             Desc.   Image   Name    Price
Magento      -0.008  -0.079  -0.014  -0.022
Oxid_EC      -0.093  0.227   -0.051  -0.282
Prestashop   0.039   0.152   0.533   0.185
Virtuemart   0.168   0.361   0.387   0.431
Mean         0.026   0.166   0.214   0.078

4.4.3 Modified Sample

Training / Evaluation Size 0.25/0.75

As a next modification of our experiment, we limited the training set to 25 % of the sample, with an evaluation set of 75 %. The results are shown in Table 4.16 and Fig. 4.6. From the results of this setting, we can see how well the approach generalizes given a limited training set. Unexpectedly, on average all properties profited from this change. Meanwhile, we can see that especially those ECS / property combinations that yielded very low scores before grew exceptionally, e.g. Prestashop and Virtuemart for the name property. We attribute these results to the base dataset, as they hint that the first quarter of samples contains much of the data needed. Future work should analyze the dataset in this regard more deeply.

Figure 4.6: Impact of a training set of 0.25 settings on the precision

Table 4.17: Impact of a training set of 0.75 settings on the precision

             Desc.   Image   Name    Price
Magento      -0.014  -0.038  0.007   -0.027
Oxid_EC      0.061   0.446   -0.098  -0.152
Prestashop   0.070   0.154   0.553   0.245
Virtuemart   0.142   0.363   0.620   0.489
Mean         0.065   0.231   0.270   0.139

Training / Evaluation Size 0.75/0.25

We additionally evaluated the opposite setting of a 75 % training set and a 25 % evaluation set, to understand how our approach can be impeded by overfitting. The results are shown in Table 4.17 and Fig. 4.7. Regarding the mean, all properties profited, with description +0.065, image +0.231, name +0.270, and price +0.139. These results fit our intuition that more data should improve the results of our approach. Again, we see massive gains in the name property for Prestashop and Virtuemart, which scored weakly before. This is a hint that the most valuable samples were not included in the even (50/50) split for these ECS. Methods to address this problem would be promising future work.

Figure 4.7: Impact of a training set of 0.75 settings on the precision

Table 4.18: Precision for early adopters

             Desc.   Image   Name    Price
Magento      -0.101  -0.151  -0.039  -0.046
Oxid_EC      0.021   0.523   -0.410  -0.710
Prestashop   0.069   0.309   0.481   0.201
Virtuemart   0.138   0.440   -0.045  0.442
Mean         0.031   0.280   -0.003  -0.028

Limiting the Experiment on Early Adopters

Next, we split our sample by the date a URI was submitted for the first time, resulting in a first dataset spanning from 2011-07-14 to 2012-12-25, and a second dataset spanning from 2012-12-25 to 2014-02-16. This modification of the base sample shows how much the approach profits from recently submitted Web shops. The results are shown in Table 4.18 and Fig. 4.8. Only the image property profited significantly from this setting, with a mean gain of 0.28. This hints at stable patterns for this property in this time frame.
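Given the created column added during dataset generation, such a split is a one-line filter; the following sketch uses a toy stand-in for the d3 DataFrame of Listing 4.2 and assumes that the created column holds valid timestamps:

import pandas as pd

# Toy stand-in for d3; only the created column matters here.
d3 = pd.DataFrame({
    "baseuri": ["shop-a.example", "shop-b.example"],
    "created": pd.to_datetime(["2011-09-01", "2013-05-17"]),
})

cutoff = pd.Timestamp("2012-12-25")
early = d3[d3.created < cutoff]   # first period: 2011-07-14 to 2012-12-25
later = d3[d3.created >= cutoff]  # second period: 2012-12-25 to 2014-02-16
print(len(early), len(later))     # 1 1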

Limiting the Experiment on Later Adopters

The results of the second period are shown in Table 4.19 and Fig. 4.9. Except for the description property (+0.013), all other properties could yield significantly better results in the second period. We see this rooted in a higher stability of patterns in the second period.

Figure 4.8: Precision for early adopters

Table 4.19: Precision for later adopters

             Desc.   Image   Name    Price
Magento      -0.054  0.029   -0.013  -0.006
Oxid_EC      -0.083  0.217   -0.296  -0.033
Prestashop   -0.018  -0.026  0.546   0.319
Virtuemart   0.208   0.268   0.602   0.364
Mean         0.013   0.122   0.210   0.161

Figure 4.9: Precision for later adopters

Table 4.20: Precision while omitting first class

             Desc.   Image   Name    Price
Magento      -0.080  -0.410  -0.549  -0.335
Oxid_EC      0.061   -0.447  -0.612  -0.276
Prestashop   0.012   -0.476  -0.058  -0.263
Virtuemart   -0.070  -0.410  0.369   0.246
Mean         -0.019  -0.436  -0.212  -0.157

Figure 4.10: Precision while omitting first class

4.4.4 Modified Rule Generation

Omitting First Class

For this experiment, in the rule generation process, we disabled the output of the class value of the grandparent element. The results of this setting are shown in Table 4.20 and Fig. 4.10, respectively. We can see that overall, this resulted in a significant loss of performance. The image (-0.436), name (-0.212), and price (-0.157) properties suffered the most, while the description property only saw a slight performance loss of -0.019. We attribute this to the poor baseline performance of this property.

Replacing DOM Tags with Wildcard Matchers

We additionally evaluate a rule generation variant that, instead of using the specific DOM tags (e.g. div, h1), uses XPath wildcard matchers. This results in rules that are only defined by the values of the two class properties.

Table 4.21: Result differences - wild card - precision

             Desc.   Image   Name    Price
Magento      -0.036  -0.016  -0.293  -0.087
Oxid_EC      0.010   0.311   -0.369  -0.185
Prestashop   -0.026  0.135   -0.075  -0.076
Virtuemart   -0.043  0.218   0.110   0.268
Mean         -0.024  0.162   -0.157  -0.020

Figure 4.11: Result differences - wild card - precision

The results of this setting are shown in Table 4.21 and Fig. 4.11, respectively. In comparison to the last setting, we see relatively small losses. Surprisingly, the image property profited from the change, with +0.162 on average. We attribute this to the vendor-specific class properties spotted in Section 4.3, which do not seem to depend heavily on HTML tags.

4.4.5 Additional Dataset: Manually Labeled, n=20 per ECS

We finally evaluate the performance of our approach on an independent dataset. Instead of relying on the data of GR-Notify, we used Web shops that do not employ structured data.

For Magento, Prestashop, and Virtuemart, we used a sample of the Web shops detected in Section 3.2. For Oxid EC, we used a sample of the Web shops provided as reference sites15.

Table 4.22: Precision with a manually created dataset

             Desc.   Image   Name    Price
Magento      -0.139  -0.195  -0.117  -0.209
Oxid_EC      -0.131  -0.277  -0.925  -0.710
Prestashop   -0.002  0.145   -0.044  0.337
Virtuemart   -0.074  -0.360  -0.236  -0.165
Mean         -0.086  -0.172  -0.330  -0.187

Table 4.23: Precision, absolute with a manually created dataset

             Desc.  Image  Name   Price
Magento      0.111  0.278  0.722  0.611
Oxid_EC      0      0.17   0      0
Prestashop   0.111  0.488  0.111  0.215
Virtuemart   0      0.05   0.05   0.25
Mean         0.056  0.326  0.221  0.437

For each shop, we manually determined n=20 offering pages and extracted the data manually, instead of acquiring it automatically from the structured data as before. The relative results of this setting are shown in Table 4.22 and Fig. 4.12, the absolute results additionally in Table 4.23.

Considering that we operate on a non-representative sample with these datasets, only relatively small losses emerged. On average, the description property lost -0.086, the image property -0.172, the name property -0.330, and the price property -0.187. The biggest loss could be seen for Oxid EC, which has a comparatively small number of samples in our learning dataset.

4.4.6 Conclusion

We provided an extensive evaluation of our approach. We modified the evaluation setting, the base sample, and the rule generation process. Finally, we applied our approach to an independent dataset.

15http://www.oxid-esales.com/

Figure 4.12: Precision with a manually created dataset

The overall results show that the approach is feasible for selected ECS / property combinations, e.g. Prestashop / image, Magento / name, Oxid EC / name, Magento / price, and Oxid EC / price, albeit with a limited absolute performance in the prototypical implementation without optimizations. We therefore argue that our main hypothesis,

The existing structured data generated by the GoodRelations ecosystem, in combination with the market structure of ECS and the patterns ECS show, can be used as a lever to generate new e-commerce data with significant precision,

is valid.

One major limitation of our work is that we could only test it on a relatively small set of simple properties, which represent just a small fraction of the relevant information on e-commerce Web sites. This limitation was caused by two effects. First of all, our training data in the form of Web shops with markup was limited to the set of properties that are relevant for tangible effects in major search engines like Google. More advanced properties are only recently gaining relevance for mainstream search engines. Second, the effort for implementing and validating the approach over a larger set of features would have been too time-consuming.

The modifications of the experiment settings partially evoked the expected changes. Examples include stricter and more relaxed evaluation settings, increasing the relative size of the training set, and the exclusion of the first class. Meanwhile, unexpected / noisy changes could also be observed, for instance when reducing the training set, and in the early / later adopter settings. We attribute these to the overall small size of our sample, which may influence the behavior. Additionally, e.g. regarding the time experiment, we expect the dataset to be biased in submission frequency depending on the different ECS. While this may influence the results significantly, we did not elaborate further on this issue.

The experiment on the independent, manually gathered dataset generally showed satisfactory results. We argue that especially in this environment, the implementation should be improved to become more resilient to additional variation and noise in the input data. We provide a preview of such a system in the use case in Section 4.5.

4.5 Use Case: Real-time E-Commerce Web Information Extraction System

In this section, as a use case for our approach, we develop a Web Information Extraction system that extracts specific offering pages from live Web shops, with the intention to show the basic architecture of a production system based on our approach.

4.5.1 Design

In comparison to the prototypical implementation discussed before, which operates on prefetched offering HTML pages, we build a system that crawls Web shops live and shows the results in real time.

The use case consists of two main parts:

• A WIE component that is able to download URIs and perform the extraction process. It combines the rules generated in Section 4.2 with heuristics to gain a wider range of offering properties. In addition to the description, image, name, and price properties, we extract currencies, features, and categories (see Section 4.2).

• A frontend that displays the extracted data in real-time.

4.5.2 Implementation of the Extraction System

We used the Scrapy library for the WIE system [Scra]. While the learning set and the main experiment have been implemented from the ground up, we decided to use an established framework here. Web scraping and crawling are complex tasks that are solved quite well by commodity solutions, so covering new ground in this domain is out of the focus of this work.

Scrapy is promoted by its developers as being fast and powerful, extensible, and portable [Scra]. As it is built on the Python programming language, it is a good match for our environment. Regarding speed, it uses the event-driven Twisted networking library16, which enables parallel downloading while processing multiple URIs.

Scrapy is fundamentally built around the idea of providing specific extraction rules for each Web site [Scrb]. As this did not suit our approach of generating rules that should work at ECS level, we customized the system accordingly. We leave out an in-detail discussion of the Scrapy architecture and point to the comprehensive documentation available online17.

From a meta level, the extraction system works as follows:

1. First, a set of XPath expressions is exported from the iPython notebook of the main experiment (see Section 4.2).

2. These expressions are loaded per ECS in Scrapy.

16https://twistedmatrix.com/trac/
17http://doc.scrapy.org/en/latest/index.html

3. The expressions are executed against the respective HTML file.

4. The results are processed by input processors, which do data cleansing.

5. Output processors finally choose the predicted value.

For each offering page, the results are sent to MongoDB, a database which is connected to the frontend.
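The following sketch illustrates the basic shape of such a spider: the XPath rules written per property / ECS combination in Listing 4.5 (rules/<ECS>_<property>.txt) are loaded, applied to each downloaded offering page, and the first non-empty match per property is kept. The class name, seed URI, hard-coded ECS, and the simplistic output processing are illustrative assumptions, not the exact production code:

import scrapy


def load_rules(ecs, prop):
    # One XPath expression per line, in the files written in Listing 4.5,
    # e.g. rules/Magento_price.txt
    with open("rules/%s_%s.txt" % (ecs, prop)) as f:
        return [line.strip() for line in f if line.strip()]


class OfferingSpider(scrapy.Spider):
    name = "offerings"
    start_urls = ["http://example-shop.example/product/1"]  # placeholder seed URI
    ecs = "Magento"  # in the use case, the ECS is detected per shop

    def parse(self, response):
        item = {"uri": response.url}
        for prop in ("name", "price", "description", "image"):
            for xpath in load_rules(self.ecs, prop):
                values = response.xpath(xpath).extract()
                if values:
                    # crude output processing: keep the first cleaned match
                    item[prop] = values[0].strip()
                    break
        # only yield offerings that contain at least name and price
        if "name" in item and "price" in item:
            yield item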

4.5.3 Implementation of the Frontend

The main requirement for the frontend was to present the extraction results in a human-readable way, to enable fast iterations for improvement. Secondary requirements were (1) to keep the implementation as lean as possible, as this is only a very small part of the overall contribution, and (2) to use a system that reflects the results of the extraction process / database changes in real time, if possible.

Traditionally, database-driven Web applications are built around server-side generated HTML pages, which trigger diverse database requests via specific URIs [Gar05]. At the same time, seeing new results requires a full page reload, which does not meet the real-time requirement.

Meanwhile, in the last years, Web application development has moved away from server-side generation towards client-side generation. Here, page changes are asynchronously loaded from the server, mainly by JavaScript programs running on the client side [Gar05]. While this approach had to be coded manually in the beginning, many frameworks facilitating the development process have emerged in recent years.

For our frontend, we use the Javascript framework Meteor [Met]. Compared to other Javascript frameworks, it emphasizes simplicity and real-time capabilities. In this context, it allows us to develop a fully featured display of the extraction results with only a minimal codebase.

4.5.4 Output of a Typical Run of the WIE System

We ran the extraction system with n=250 Magento seed URIs. Table 4.24 shows the most important results of this run. The first five properties show errors on the network level, summing up to 77 exceptions. The run issued 84 KB of requests and received 5.4 MB of responses in total. The request count of 353, which exceeds the experiment size of 250, can have two causes. First, the extractor may retry a URI if no response was received on the first attempt. Second, HTTP status codes in the 300 range redirect requests to other resources [Fie+99]. Therefore, the total number of requests may be higher than the number of URIs in the queue.

Overall, 143 documents could be downloaded successfully (HTTP status code 200). The run took 1 minute and 35 seconds (0.38 s per URI). During the extraction process, only those offerings that contained at least the name and price properties were yielded. This resulted in 39 dropped offerings. Overall, we could extract 99 offerings.

4.5.5 Frontend Overview

Fig. 4.13 shows the frontend of the use case. It consists of two main sections. The upper part of the page shows the product details, while the lower part shows the result list of the offerings generated by the extraction system.

The detail section provides the product name, the base URI of the shop, the description, the price, and the features. The offerings list provides the base URI, the name, the price and currency separately, indicators for which properties could be extracted ((f)eatures, (i)mage, (d)escription), and a link to the online version of the document. If an entry in the offering list is clicked, the respective details are shown above.

We can see that, while most offerings contain a description and an image, feature extraction was relatively rare. Price and currency were extracted correctly in all but three offerings in the figure.

Table 4.24: Output of a typical run of the WIE system, n=250 URIs

Property                                     Value
downloader...DNSLookupError                  57
downloader...NoRouteError                    6
downloader...ResponseNeverReceived           3
downloader...TimeoutError                    11
downloader/exception_count                   77
downloader/request_bytes                     84069
downloader/request_count                     353
downloader/request_method_count/GET          353
downloader/response_bytes                    5460211
downloader/response_count                    276
downloader/response_status_count/200         143
downloader/response_status_count/301         26
downloader/response_status_count/302         32
downloader/response_status_count/401         1
downloader/response_status_count/404         62
downloader/response_status_count/503         12
response_received_count                      210
start_time                                   16:54:16
finish_time                                  16:55:51
item_dropped_count                           39
item_scraped_count                           99

4.5.6 Conclusion

We provided a use case of our approach, consisting of two parts. First, we developed an extraction system that is able to extract offerings from given URIs. Second, we developed a Web interface that shows the results of the extraction process in real-time.

In comparison to the aforementioned experiment, the use case shows the basic architecture of a production system, and its viability. At the same time, the results show that for use in a commercial setting, the different parts of the system would require optimizations in terms of the reliability of the output and the computational effort per site.

Figure 4.13: Overview of the frontend functionality

5 Conclusion and Outlook

This chapter consists of three sections. First, we summarize the achievements of the thesis, with special regard to the research questions. Second, we discuss the limitations of the main contribution and provide an outlook on future work. We conclude with closing remarks.

5.1 Contributions

We start this chapter by reviewing what has been achieved in the respective parts of the thesis.

5.1.1 Structured Data: Fundamentals and Usage in the E-Commerce Domain

The second chapter provided an overview of the three research areas this thesis belongs to. As already discussed in Chapter 2, we strove to connect two important areas of Artificial Intelligence research (Semantic Web and Information Extraction), and applied the results to a domain of high practical relevance.

Semantic Web

In this section, we underlined (1) the paradigm-shifting character of the original Web, (2) its fundamental problems, and (3) the Semantic Web vision. We analyzed (4) the Semantic Web technology stack, as it is a technological foundation for the further course of the thesis. We briefly introduced (5) Linked Data, and concluded with (6) a description of the adoption of Semantic Web techniques by major search engines.

Semantic E-Commerce

This section was structured as follows: We began (1) with the technological foundation of Semantic E-Commerce. We then introduced (2) the GoodRelations Web vocabulary, which covers a wide range of use cases and shows significant market adoption. We continued with (3) a short discussion of existing structured e-commerce data on the Web, which is highly relevant to this research, as it provides one of its three main foundations. We provided (4) a discussion of existing research on Semantic E-Commerce. We (5) went on with a short overview of how structured data is used in commercial settings. We finally (6) concluded with a discussion of the economic implications of Semantic E-Commerce, a largely unexplored topic.

Web Information Extraction (WIE)

This section had eight main parts, which elaborated on the domain in order of increasing detail. We started with (1) a discussion of relevant dimensions for classifying WIE approaches, and went on with (2) classic and (3) recent WIE approaches. We then introduced (4) e-commerce-specific and (5) ontology-based WIE approaches. We discussed (6) WIE approaches that combine those two subfields, and therefore constitute the best match to the work at hand. After discussing (7) the novelty of our approach, we rounded off the section with (8) an introduction to Web mining, a related research field.

5.1.2 Foundational Building Blocks

In the foundations chapter, we provided three basic contributions that were needed to achieve our overall goal. It provided background on the fundamental building blocks that were exploited in the further course of the thesis. The chapter contains three main parts, according to the research foundation we introduced in Chapter 1.

The first section of the chapter answered Research Question 2,

How can we assess the impact of ECS on the availability of structured data?

The section provided a market overview of e-commerce systems (ECS). It provided preliminary indications that only six ECS dominate the market to a degree that they generate more than 90 % of the offering pages on the Web. This was a main building block for the further course of the research, as it allows covering a significant share of the e-commerce Web by constructing extractors for only a few ECS.

The second section of the chapter answered Research Question 3,

How can we identify important ECS by looking at only one Web page with high accuracy?

The section presented a Machine-Learning-driven ECS identification system. We designed a system capable of ECS detection based on supervised classification and a filtered set of HTML attribute values. It detects six different ECS by analyzing only one random HTML page of a Web shop. Taking into account the loss in recall when no features can be generated, it shows an F1-score of 0.9. An extensive evaluation confirmed the results. We provided an analysis of the speed of the different algorithms, the performance on specific ECS, and a heuristic for choosing a classification algorithm for the task at hand.
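As an illustration of this building block, the following sketch shows how a page could be classified by the ECS behind it, using a simple bag of HTML attribute values and scikit-learn [Ped+11]. The attribute_values helper, the RandomForestClassifier choice, and the label names are assumptions for illustration; the feature filtering and algorithm selection actually used are described in Chapter 3.

from lxml import html
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

def attribute_values(page_source):
    """Collect the HTML attribute values (class, id, ...) of one shop page."""
    tree = html.fromstring(page_source)
    values = []
    for element in tree.iter():
        values.extend(str(v) for v in element.attrib.values())
    return " ".join(values)

def train_ecs_classifier(pages, labels):
    # pages: raw HTML strings; labels: the ECS behind each page
    # (e.g. "magento", "oxid", ...) -- placeholders for real training data.
    vectorizer = CountVectorizer(token_pattern=r"[\w-]+")
    X = vectorizer.fit_transform(attribute_values(p) for p in pages)
    clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
    return vectorizer, clf

def predict_ecs(vectorizer, clf, page_source):
    """Predict the ECS from a single random HTML page of a Web shop."""
    return clf.predict(vectorizer.transform([attribute_values(page_source)]))[0]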

The third section of the chapter answered Research Question 4,

How can we measure the current diffusion and quality of GoodRelations data?

Here, we described sources and analyzed GoodRelations data on the Web with a twofold approach. First, we analyzed the data that had been gathered by GR-Notify1, a Web service that receives notifications from Web shops using GoodRelations. Second, we downloaded a sample of HTML pages equipped with GoodRelations, based on the data gathered by GR-Notify, and conducted an extensive analysis.

5.1.3 Structured Data for WIE in E-Commerce

Chapter four answered the main Research Question 1,

How can the combination of (1) the market domination of a few ECS, (2) an automated approach for detecting the ECS behind a Web site, (3) HTML template similarity, and (4) existing structured data be used to design a system that is able to extract structured data from e-commerce sites that do not contain data markup, with a level of granularity and data quality comparable to extraction from explicit data markup?

and the derived main hypothesis

The existing structured e-commerce data on the Web, in combination with the market structure of e-commerce systems and the similarity in the HTML patterns they expose, can be used as a lever to generate additional e-commerce data in significant quantity and quality, and thus increase the market coverage of the available data.

The hypothesis has been backed by the central results of our evaluation, which are again shown in Fig. 5.1.

In detail, the following has been achieved in the sub-contributions:

Approach: In this section, we discussed the details of the thesis approach. We presented (1) its fundamental properties in comparison to the prevailing shop extension approach, discussed (2) the choice of data properties, and finally (3) led over to the implementation section with a description of our experimental design.

1 http://gr-notify.appspot.com

Figure 5.1: Final results - standard settings - precision

Implementation: In this section, we discussed the implementation details of the approach. We chose to provide a detailed discussion of the most important parts of the code.

Results: In this section, we first provided the results of the data acquisition, proceeded with an analysis of the data that could be extracted, and finally provided the results of the rule generation process.

Evaluation: We provided a detailed evaluation of our approach. We modified the evaluation settings, the base sample, and the rule generation process in order to understand how sensitive our approach is to changes in the available data and parameter settings. Finally, we applied our approach to an independent dataset. We could show that our approach yielded a precision of 14.2 % for the description property, 49.8 % for the image property, 55.1 % for the name property, and 62.4 % for the price property. Therefore, we argue that our main approach has been shown to be feasible.

Use Case: In this section, as a use case for our approach, we developed a Web Information Extraction system that extracts specific offering pages from live Web shops, with the intention of showing the basic architecture of a production system. We provided a Web frontend that shows the results in real-time.

5.2 Limitations and Future Work

In the following section, we discuss the limitations of the main contribution. The limitations of the foundational contributions have already been discussed in the respective sections of Chapter 3. Where appropriate, we discuss possible future work after the respective limitations.

5.2.1 Dataset

Our dataset exhibits two fundamental flaws: noise and bias.

The data source for our experiment was Web shops. Since standards on the Web are, by design, not enforced and rather have the character of guidelines, much Web content contains errors. This requires sorting out erroneous samples, so a substantial share of the originally gathered data cannot be used in the experiment. Sophisticated Web software such as browsers (e.g. Chrome, Internet Explorer, Safari), in contrast, is able to operate on erroneous Web data. Future work could include integrating the advanced HTML parsing capabilities of browsers into the experiment.
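A minimal sketch of such a lenient parsing step, assuming BeautifulSoup with the html5lib parser as an illustrative stand-in for browser-grade parsing (neither was used in the experiment): strict parsing is tried first, and documents that are rejected outright are repaired with the browser-like parser instead of being discarded.

from bs4 import BeautifulSoup
from lxml import etree, html

def parse_leniently(page_source):
    """Try strict parsing first; fall back to a browser-like parser on failure."""
    try:
        return html.fromstring(page_source)
    except etree.ParserError:
        # html5lib implements the WHATWG parsing algorithm used by browsers
        # and therefore tolerates most real-world markup errors.
        repaired = BeautifulSoup(page_source, "html5lib").prettify()
        return html.fromstring(repaired)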

A second shortcoming of our dataset is its bias. We mainly used the data generated by the GR-Notify service. Therefore, by provenance, our dataset is limited to the Web shops that submitted their URI through a shop extension. Given a market containing at least 50 relevant ECS, an experiment that only operates on four ECS is most likely not representative. At the same time, there is no alternative source of GoodRelations deployments. Future work could use a broad crawl to discover novel GoodRelations sites and operate on this extended dataset.

Additionally, we see two fundamentally different datasets that could be used for future work. First, there is the GoodRelations crawler, which was introduced in Section 2.2.2.3. At the moment, it only extracts RDF data from the Web shops and discards the underlying HTML pages. A modified version could save the HTML documents, which would make the crawler a viable data source for our approach. We tried to represent the results of a full crawl of the GR-Notify data with our n=5 sample dataset, and argued that this is sufficient for our approach (see 3.3.4.1). Nevertheless, we expect the results to improve slightly when a full crawl is used as the learning set.

Another potential data source might be Common Crawl, a publicly available broad crawl of the Web by Commoncrawl.org [Com13] (see 3.3). A main benefit would be its very large size, currently containing 2.8 billion Web pages2, and its availability in the cloud. However, we already argued above that Common Crawl data is likely biased, as quantitative properties of structured e-commerce data could not be reproduced. A more elaborate exploration of this issue, and of a possible use for our approach, would be valuable future work.

5.2.2 Approach

We see two main limitations regarding our approach. First, its performance is limited, as it is only a prototypical implementation. A second potential weakness is the strict limitation to the fundamental approach, disregarding potential alternative solutions.

Performance

To keep the approach as simple as possible, we developed a relatively naive solution. As discussed thoroughly in Section 4.2, to obtain the extraction rules, we simply counted the most common XPaths. While this sufficed to clearly validate our hypothesis, it is obvious that this is far from optimal performance. For future work, a more elaborate version could feed the property values into, for example, a Natural Language Processing algorithm. At the same time, we argue that the provided approach is sufficient for our experiment, as we do not strive for a production solution. The main insight, that structured data can be used to generate novel structured data, could be demonstrated.

2 http://commoncrawl.org/march-2014-crawl-data-now-available/
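A simplified sketch of this naive rule generation, assuming that the property values from the existing data markup are matched against element text and that the most frequent XPath across the learning set becomes the rule; the function names and the structure of the learning set are illustrative, and the actual implementation in Section 4.2 differs in detail.

from collections import Counter
from lxml import html

def xpaths_of_value(page_source, value):
    """Return the XPaths of all elements whose text content equals `value`."""
    tree = html.fromstring(page_source)
    root = tree.getroottree()
    return [root.getpath(el) for el in tree.iter()
            if el.text and el.text.strip() == value]

def most_common_xpath(learning_set, prop):
    """learning_set: list of (page_source, markup_values) pairs, where
    markup_values holds the property values taken from the existing RDFa."""
    counter = Counter()
    for page_source, markup_values in learning_set:
        counter.update(xpaths_of_value(page_source, markup_values[prop]))
    # The most frequent XPath across the learning set becomes the extraction rule.
    return counter.most_common(1)[0][0] if counter else None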

Hybrid Methods

In this context, we expect that hybrid approaches, using different methods for different problems, might show the best overall performance. The ideas discussed below are related to existing work in WIE, discussed in Section 2.3. We see three methods that could extend our approach in a complementary way:

• During the course of the thesis, we occasionally stumbled upon heuristics that might work quite well for extracting specific properties. For instance, in SEO-optimized Web shops, the URI often contains the product name. This could be exploited with string processing (see the sketch after this list).

• Another direction that might increase the performance is the automated analysis of the visual rendering of Web pages. So far, our approach disregards the rendering directives attached via CSS styles. Meanwhile, Web shops commonly use those to emphasize different parts of the page, e.g. smaller font sizes for the description and bigger font sizes for name and price. Future work could use these additional properties to improve the results of the extraction.

• Another promising direction for extracting data from the Web is to use human computation [QB11]. For a human agent, it is fairly trivial to mark important elements of a Web page. Future work should show how automated approaches, like the one at hand, and manual approaches can be combined optimally.
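As announced in the first bullet point, a minimal sketch of the URI heuristic: the last path segment of an SEO-friendly offering URI is split on hyphens to obtain a product-name candidate. The function name and the example URI are purely illustrative.

from urllib.parse import urlparse

def product_name_from_uri(uri):
    """Heuristic product-name candidate from an SEO-friendly offering URI."""
    path = urlparse(uri).path
    slug = path.rstrip("/").rsplit("/", 1)[-1]   # last path segment
    slug = slug.rsplit(".", 1)[0]                # drop a trailing .html etc.
    return " ".join(part.capitalize() for part in slug.split("-") if part)

# e.g. product_name_from_uri("http://shop.example/blue-cotton-shirt.html")
#      -> "Blue Cotton Shirt"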

A hybrid solution would use the results of the thesis approach as well as the aforementioned methods and existing approaches to increase the overall performance. For this thesis, however, we chose a basic approach in order to obtain results that are not influenced by these techniques, which do not depend on existing structured data.

5.2.3 Evaluation

Regarding the evaluation, we chose the established cross-validation approach with a 50 % learning set and a 50 % evaluation set. We modified many experiment settings to show the validity of this approach. Additionally, we generated an n=80 dataset that was manually labeled. Future work could include a large-scale crowd-sourced evaluation dataset. The two additional datasets mentioned above could also be used for an evaluation of our approach in future work.
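A minimal sketch of such a 50/50 split, assuming scikit-learn's train_test_split as an illustrative implementation of a single random holdout split; the split procedure actually used in the experiment may differ.

from sklearn.model_selection import train_test_split

def split_fifty_fifty(pages, seed=42):
    """Split the annotated offering pages into a 50 % learning and a 50 % evaluation set."""
    return train_test_split(pages, test_size=0.5, random_state=seed)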

Regarding the settings of the evaluation, we tried to back our threshold choices by modifying them carefully and monitoring the impact on the results. Future work could include an algorithmic optimization of these settings. This also applies to the overall experimental settings. Again, however, we argue that this would have exceeded the scope of the thesis.

In comparison to related work (Section 2.3), there are two limitations regarding the evaluation of our approach. First, we cannot use existing datasets as a benchmark for our work, as they lack the structured data embedded in Web pages that our approach builds on. At the same time, our main goal is to provide evidence for the usefulness of the data at hand as a learning set, and not to strive for higher performance. Therefore, we argue that our evaluation approach is sufficient. Second, related work mostly provides precision and recall in the evaluation, and derives the F1-score as their harmonic mean. As our dataset already consists only of relevant samples, recall is always 100 %. Therefore, we omitted this score and the derived F1-score. Future work should provide controlled datasets that match our approach, for a further improvement of its performance.

5.2.4 Use Cases

Extended use cases would demand a more sophisticated implementation of our approach. Here, a generic Web service that delivers structured e-commerce data on request would be highly attractive. Instead of generating the data per Web shop, which has the downside of requiring a manual installation of shop extensions and suffers from their fast decay, the data could be generated externally on a central service that is fed only with the respective Web pages. A related use case based on this central Web service, which is able to generate structured data from arbitrary shop pages, would be a Javascript plugin that injects structured data.
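To make this envisioned central service concrete, a minimal sketch follows, assuming Flask as an illustrative Web framework; the /extract endpoint, the uri parameter, and the extract_offering placeholder are hypothetical and not part of the thesis implementation.

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

def extract_offering(page_source):
    """Placeholder for the ECS detection and XPath-based extraction described above."""
    return {"name": None, "price": None, "currency": None}  # hypothetical result

@app.route("/extract")
def extract():
    # A client (e.g. a Javascript plugin that injects structured data) submits
    # the URI of an offering page and receives the extracted properties as JSON.
    uri = request.args.get("uri")
    page = requests.get(uri, timeout=10).text
    return jsonify(extract_offering(page))

if __name__ == "__main__":
    app.run()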

5.2.5 Scale

In its current implementation, the system is not designed to scale to production settings. To gain an encompassing overview of the e-commerce Web, a broad crawl would be needed, which alone is a highly complex task. Additionally, as the price data contained in the offerings is highly relevant for the buying decision, the crawl would need sophisticated prioritization strategies to stay up to date. These problems are currently solved only by industry-leading search engines like Google, Bing, and Yandex. In the use case in Section 4.5, we provided an outline of how such a solution might work, and a glimpse of its performance. Nevertheless, we argue that it is out of the scope of this thesis to further address these issues.

5.2.6 Scope

Regarding the scope of the thesis, there are two central limitations. We limited our experiment (1) to the RDFa syntax and (2) to the e-commerce domain.

We chose the syntax limitation because of the dominance of RDFa for GoodRelations data. As our data source was shops that used GoodRelations in RDFa, this was an obvious choice for the experiment. Meanwhile, the schema.org initiative of Google has integrated GoodRelations, which will result in more Web shops using the Microdata syntax to express structured data. Therefore, future work should include additional data sources containing schema.org / Microdata markup. As mentioned above, data-driven approaches fundamentally work better with more data (e.g. [HNP09]). We expect this to apply to our approach as well.

The limitation to the e-commerce domain was natural given the research environment of the thesis. Nevertheless, we expect the fundamental findings regarding the replicative nature of structured data to hold for many more domains. If these qualify similarly regarding the foundational findings (dominance of a few systems, template patterns, availability of structured data), the approach should be transferable. For instance, the domain of Internet news, specifically blogs, is dominated by a few blogging systems, like Wordpress, Drupal, or Blogger. Therefore, we expect that future work could show the applicability of the approach to additional domains.

5.3 Outlook: On the Self-Replicating Nature of Structured Data

Research can either focus on solving a problem or on gaining an understanding of a certain phenomenon. Derived from the research question, this thesis focused on showing a novel way to generate structured data in e-commerce. In the final stage of the thesis, however, we became aware that it also provides an understanding of the fundamental character of structured data. As this strand of discussion is rather experimental and not rigorously backed, but provides a meta perspective on the topic, we chose it for these final remarks.

Here, our main claim is that structured data on the Web has a self-replicating nature. We understand our approach metaphorically as a fertilizer that, applied to the existing structured data, can catalyze the generation of novel data. As we pointed out in the introduction, a main problem of structured data right now is that it only covers a small share of the Web. In this context, the remaining gap might be partially closed with our approach.

We expect the growth process to be governed by two factors. First, there is the share of the Web covered with structured data in the first place. For the process to become viable, we expect a certain tipping point or “critical mass” of structured data to be needed. If this amount is passed, the process should show self-reinforcing properties. Second, the process depends on the performance of the fertilizer, i.e. the extraction system. While we could show that our approach fundamentally supports the hypothesis, we underlined that there is significant room for improvement.

If our approach could be extended to a wider scope and deployed at Web scale, it would likely be able to cover a large share of the Web with structured data.

Bibliography

[Adr+10] Benjamin Adrian et al. “Epiphany: Adaptable RDFa Generation Link- ing the Web of Documents to the Web of Data”. In: Proceedings of the 17th International Conference on Knowledge Engineering and Manage- ment by the Masses. 2010, pp. 178–192.

[AH04] Grigoris Antoniou and Frank van Harmelen. A Semantic Web Primer. MIT Press, 2004.

[Alc10] M. Alchin. Pro Python. Apress, 2010.

[Alv+11] Gene Alvarez et al. Magic Quadrant for E-Commerce. https://www. gartner.com/doc/1839418/magic-quadrant-ecommerce. [Online; ac- cessed 09/13/2015].

[Ama13] Amazon.com. Alexa Top 1 Million Sites by Traffic Rank as CSV. http://s3.amazonaws.com/alexa-static/top-1m.csv.zip. [Online; accessed 01/10/2013].

[Arm06] Deborah J. Armstrong. “The Quarks of Object-Oriented Development”. In: Communications of the ACM 49.2 (2006), pp. 123–128.

[Ash+11] Jamshaid Ashraf et al. “Open Ebusiness Ontology Usage: Investigating Community Implementation of GoodRelations”. In: Proceedings of the WWW2011 Workshop on Linked Data on the Web LDOW2011. 2011, pages n.a.

[Aue+07] Sören Auer et al. “DBpedia: A Nucleus for a Web of Open Data”. In: ISWC/ASWC. 2007, pp. 722–735.


[Bak13] Alan Baker. “Simplicity”. In: The Stanford Encyclopedia of Philosophy. 2013, pages n.a.

[BBH09] Tim Berners-Lee, Christian Bizer, and Tom Heath. “Linked Data - The Story so far”. In: International Journal on Semantic Web and Information Systems 5.3 (2009), pp. 1–22.

[Bec+04] Sean Bechhofer et al. OWL Web Ontology Language Reference. http: //www.w3.org/TR/owl-ref/. [Online; accessed 09/13/2015].

[Ber+04] Tim Berners-Lee et al. Architecture of the World Wide Web, Volume One, 3. Interaction. http://www.w3.org/TR/webarch/#interaction. [Online; accessed 01/10/2013].

[Ber00] Tim Berners-Lee. Semantic Web - XML 2000. http : / / www . w3 . org/2000/Talks/1206-xml2k-tbl/slide10-0.html. [Online; accessed 03/20/2013].

[Ber01] M.K. Bergman. “The Deep Web: Surfacing Hidden Value”. In: Journal of Electronic Publishing 7.1 (2001), pages n.a.

[Ber02] Tim Berners-Lee. Principles of Design. http : / / www . w3 . org / DesignIssues/Principles.html. [Online; accessed 01/10/2013].

[Ber06] Tim Berners-Lee. Linked Data. http://www.w3.org/DesignIssues/ LinkedData.html. [Online; accessed 10/04/2013].

[Ber09] Tim Berners-Lee. Design Issues: Architectural and Philosophical Points. http://www.w3.org/DesignIssues/. [Online; accessed 09/13/2015].

[BFG01] Robert Baumgartner, Sergio Flesca, and Georg Gottlob. “Visual Web Information Extraction with Lixto”. In: Proceedings of the 27th Inter- national Conference on Very Large Data Bases. 2001, pp. 119–128.

[BFM05] Tim Berners-Lee, Roy Fielding, and L. Masinter. Uniform Resource Identifier URI: Generic Syntax. http://tools.ietf.org/html/rfc3986. [Online, accessed 09/15/2015].

[BG14] D. Brickley and R. V. Guha. RDF Schema 1.1. http://www.w3.org/TR/rdf-schema/. [Online, accessed 09/15/2013].

[BGH09] Robert Baumgartner, Georg Gottlob, and Marcus Herzog. “Scalable Web Data Extraction for Online Market Intelligence”. In: Proceedings of the VLDB Endowment. 2. 2009, pp. 1512–1523.

[BHL01] Tim Berners-Lee, James Hendler, and Ora Lassila. “The Semantic Web”. In: Scientific American 284.5 (2001), pp. 28–37.

[Biz+13] Christian Bizer et al. “Deployment of RDFa, Microdata, and Microfor- mats on the Web - A Quantitative Analysis”. In: ISWC. 2013, pp. 17– 32.

[Ble+03] David M. Blei et al. “Latent Dirichlet Allocation”. In: Journal of Machine Learning Research 3 (2003), pp. 993–1022.

[BLS14] Paul Belleflamme, Thomas Lambert, and Armin Schwienbacher. “Crowdfunding: Tapping the Right Crowd”. In: Journal of Business Venturing 29.5 (2014), pp. 585–609.

[Boa12] RSS Advisory Board. RSS 2.0 Specification Version 2.0.11. http:// www.rssboard.org/rss-specification. [Online; accessed 10/04/2013].

[Bra+08] Tim Bray et al. Extensible Markup Language XML 1.0 Fifth Edition. http://www.w3.org/TR/xml/. [Online; accessed 04/20/2013].

[Bra07] Steve Bratt. Semantic Web, and Other Technologies to Watch. http: //www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/24. [Online; accessed 10/04/2013].

[Bre01] Leo Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32.

[Bru13] Paul Bruemmer. How To Get A 30% Increase In CTR With Structured Markup. http://searchengineland.com/how-to-get-a-30-increase-in- ctr-with-structured-markup-105830. [Online; accessed 07/29/2013].

[bui13] builtwith.com. Ecommerce Technology Web Usage Statistics. http://trends.builtwith.com/shop. [Online; accessed 03/20/2013].

[Bun13] Bundesverband des deutschen Versandhandels. Der Interaktive Handel wächst - Steigender Anteil des Interaktiven Handels am Einzelhandel 2009-2013. http://www.bevh.org/typo3temp/pics/053ae6b887.jpg. [Online; accessed 10/04/2013].

[BV10] Nitin Bhatia and Vandana. “Survey of Nearest Neighbor Techniques”. In: International Journal of Computer Science and Information Secu- rity 8.2 (2010), pp. 302–305.

[Car+10] Andrew Carlson et al. “Coupled Semi-Supervised Learning for Infor- mation Extraction”. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. 2010, pp. 101–110.

[CBD99] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery”. In: Computer Networks 31.11–16 (1999), pp. 1623–1640.

[CD99] James Clark and Steve DeRose. XML Path Language XPath Version 1.0. http://www.w3.org/TR/xpath/. [Online, accessed 09/15/2015].

[Cha+06] Chia Hui Chang et al. “A Survey of Web Information Extraction Systems”. In: IEEE Transactions on Knowledge and Data Engineering 18.10 (2006), pp. 1411–1428.

[Cha+94] Sudarshan Chawathe et al. “The TSIMMIS Project: Integration of Het- erogeneous Information Sources”. In: Proceedings of the 16th Meeting of the Information Processing Society of Japan. 1994, pages n.a.

[Che07] Henry Chesbrough. “Business Model Innovation: It’s not just about Technology anymore”. In: Strategy & Leadership 35.6 (2007), pp. 12– 17.

[Ciu09] Eugene Ciurana. Developing with Google App Engine. Apress, 2009.

[CK04] Jeremy J. Carroll and Graham Klyne. Resource Description Framework RDF: Concepts and Abstract Syntax. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/. [Online, accessed 09/15/2015].

[CMM02] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. “Road- Runner: Towards Automatic Data Extraction from Large Web Sites”. In: Proceedings of the 27th VLDB Conference. 2002, pp. 109–118.

[CO08] David H. Crocker and Paul Overell. Augmented BNF for Syntax Specifi- cations: ABNF. https://tools.ietf.org/html/rfc5234. [Online, accessed 09/15/2015].

[Cod70] E. F. Codd. “A Relational Model of Data for Large Shared Data Banks”. In: Communications of the ACM 13.6 (1970), pp. 377–387.

[Col07] Robert M. Colomb. Ontology and the Semantic Web. IOS Press, 2007.

[Com13] Commoncrawl.org. Common Crawl. http://commoncrawl.org/. [On- line; accessed 10/04/2013].

[CS08] Andrew Carlson and Charles Schafer. “Bootstrapping Information Ex- traction from Semi-Structured Web Pages”. In: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I. 2008, pp. 195–210.

[Cun+02] H. Cunningham et al. “GATE: A Framework and Graphical Devel- opment Environment for Robust NLP Tools and Applications”. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics. 2002, pp. 223–254.

[CV95] Corinna Cortes and Vladimir Vapnik. “Support-Vector Networks”. In: Machine Learning. 1995, pp. 273–297.

[CW03a] Bizer C. and J. Wolk. RDF Version of the eClass 4.1 Product Clas- sification Schema. http : / / www . wiwiss . fu - berlin . de / suhl / bizer / ecommerce/eClass-4.1.rdf. [Online; accessed 03/20/2013].

[CW03b] Fabio Ciravegna and Yorick Wilks. “Designing Adaptive Information Extraction for the Semantic Web in Amilcare”. In: Annotation for the Semantic Web, Frontiers in Artificial Intelligence and Applications. 2003, pages n.a.

[CZ13] Deborah Crawford and Ashley Zandy. Facebook Reports, Second Quar- ter, 2013 Results. http : / / investor . fb . com / releasedetail . cfm ? ReleaseID=780093. [Online; accessed 07/29/2013].

[Dee+90] S. Deerwester et al. “Indexing by Latent Semantic Analysis”. In: Jour- nal of the American Society for Information Science 41 41.6 (1990), pp. 391–407.

[DKS11] Nilesh N. Dalvi, Ravi Kumar, and Mohamed A. Soliman. “Automatic Wrappers for Large Scale Web Extraction”. In: Proceedings of the VLDB Endowment. 4. 2011, pp. 219–230.

[DSW07] John Davies, Rudi Studer, and Paul Warren. Semantic Web Technolo- gies: Trends and Research in Ontology-based Systems. Wiley, 2007.

[Edm14] Derek Edmond. 20 Ways B2B SEOs Can Leverage Schema.org Markup. http://searchengineland.com/20-ways-b2b-seos-can-leverage-schema- markup-online-marketing-208712. [Online; accessed 10/09/2015].

[ES07] Mathias Erlei and Andreas Szczutkowski. Adverse Selection. http : //wirtschaftslexikon.gabler.de/Archiv/922/adverse-selection-v8.html. [Online; accessed 04/20/2013].

[eSa] OXID eSales. OXID eSales. http://www.oxid-esales.com/. [Online; accessed 01/11/2014].

[Fas06] Maria Fasli. “Shopbots: A Syntactic Present, A Semantic Future”. In: IEEE Internet Computing 10.6 (2006), pp. 69–75.

[Fen+01] Dieter Fensel et al. “Product Data Integration in B2B E-Commerce”. In: IEEE Intelligent Systems 16.4 (2001), pp. 54–59.

[FGS12] Tim Furche, Georg Gottlob, and Christian Schallhart. “DIADEM: Do- mains to Databases”. In: Database and Expert Systems Applications. 2012, pp. 1–8.

[Fie+99] R. Fielding et al. Hypertext Transfer Protocol – HTTP/1.1. http://www.ietf.org/rfc/rfc2616.txt. [Online; accessed 03/20/2013].

[Fie00] Roy Thomas Fielding. “REST: Architectural Styles and the Design of Network-Based Software Architectures”. University of California, 2000.

[FK00] Dayne Freitag and Nicholas Kushmerick. “Boosted Wrapper Induc- tion”. In: AAAI/IAAI. 2000, pp. 577–583.

[Fur+11] Tim Furche et al. “Oxpath: A Language for Scalable, Memory-Efficient Data Extraction From Web Applications”. In: Proceedings of the VLDB Endowment. 11. 2011, pp. 1016–1027.

[Gar05] Jesse James Garrett. Ajax: A New Approach to Web Applications. https://courses.cs.washington.edu/courses/cse490h/07sp/readings/ ajax_adaptive_path.pdf. [Online; accessed 01/10/2013].

[GC01] Asunción Gómez-Pérez and Óscar Corcho. “Solving Integration Prob- lems of E-Commerce Standards and Initiatives Through Ontological Mappings”. In: International Journal of Intelligent Systems 16.16 (2001), pages n.a.

[GEW06] Pierre Geurts, Damien Ernst, and Louis Wehenkel. “Extremely Ran- domized Trees”. In: Machine Learning 63.1 (2006), pp. 3–42.

[GFC04] Asunción Gómez-Pérez, Mariano Fernández-López, and Óscar Corcho. Ontological Engineering. Springer Verlag, 2004.

[GG95] N. Guarino and P. Giaretta. “Ontologies and Knowledge Bases: To- wards a Terminological Clarification”. In: Towards Very Large Knowl- edge Bases: Knowledge Building & Knowledge Sharing. 1995, pp. 25– 32.

[GJG04] Laura A. Granka, Thorsten Joachims, and Geri Gay. “Eye-Tracking Analysis of User Behavior in WWW Search”. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2004, pp. 478–479.

[GPT05] David Gibson, Kunal Punera, and Andrew Tomkins. “The Volume and Evolution of Web Page Templates”. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. 2005, pp. 830–839.

[GQ02] Tanya Gupta and Abir Qasem. “Reduction of Price Dispersion through Semantic E-Commerce: A Position Paper”. In: Proceedings of the Se- mantic Web Workshop. 2002, pp. 1–2.

[Gul+10] Pankaj Gulhane et al. “Exploiting Content Redundancy for Web In- formation Extraction”. In: Proceedings of the 19th International Con- ference on World Wide Web. 2010, pp. 1105–1106.

[GYC08] Inc. Google, Inc. Yahoo, and Microsoft Corporation. Sitemaps.org - Protocol. http://www.sitemaps.org/protocol.html. [Online; accessed 03/20/2013].

[Haa+04] Peter Haase et al. “A Comparison of RDF Query Languages”. In: The Semantic Web–ISWC 2004. 2004, pp. 502–517.

[Hal11] Wendy Hall. “The Ever Evolving Web: The Power of Networks”. In: International Journal of Communication 5 (2011), pp. 651–664.

[HB11] Tom Heath and Christian Bizer. “Linked Data: Evolving the Web into a Global Data Space”. In: Synthesis Lectures on the Semantic Web: Theory and Technology 1.1 (2011), pp. 1–136.

[Hepa] Martin Hepp. Advertising with Linked Data in Web Content. http: //de.slideshare.net/mhepp/advertising- with- linked- data- in- web- content. [Online, accessed 10/11/2014].

[Hepb] Martin Hepp. The GoodRelations CookBook. http : / / wiki . goodrelations - vocabulary . org / Cookbook. [Online; accessed 04/20/2015].

[Hepc] Martin Hepp. The GoodRelations User’s Guide. http://wiki.goodrelations-vocabulary.org/Documentation. [Online; accessed 04/20/2015].

[Hepd] Martin Hepp. Tools - GoodRelations Wiki. http://wiki.goodrelations- vocabulary.org/Tools. [Online; accessed 04/20/2015].

[Hep+09] M. Hepp et al. “GoodRelations Tools and Applications”. In: Poster and Demo Proceedings of the 8th International Semantic Web Conference ISWC 2009 , Washington , DC , USA. 2009, pages n.a.

[Hep06] Martin Hepp. “Products and Services Ontologies: A Methodology for Deriving OWL Ontologies from Industrial Categorization Standards”. In: International Journal on Semantic Web and Information Systems 2.1 (2006), pp. 72–99.

[Hep08a] Martin Hepp. “GoodRelations: An Ontology for Describing Products and Services Offers on the Web”. In: Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW2008). 2008, pp. 332–347.

[Hep08b] Martin Hepp. “Ontologies: State of the Art, Business Potential, and Grand Challenges”. In: Ontology Management: Semantic Web, Seman- tic Web Services, and Business Applications. 2008, pp. 3–22.

[Hep11a] Martin Hepp. GoodRelations Language Reference. http : / / www . heppnetz . de / ontologies / goodrelations / v1. [Online; accessed 04/20/2013].

[Hep11b] Martin Hepp. GoodRelations UML Class Diagram. http : / / www . heppnetz.de/ontologies/goodrelations/goodrelations-UML.pdf. [On- line; accessed 01/10/2013].

[Hep12] Martin Hepp. “The Web of Data for E-Commerce in Brief”. English. In: Web Engineering. Vol. 7387. 2012, pp. 510–511. doi: 10.1007/978- 3-642-31753-8_58.

[Hep13] Martin Hepp. Shop Extensions - GoodRelations Wiki. http://wiki.goodrelations-vocabulary.org/Shop-extensions. [Online; accessed 03/20/2013].

[HG06] Paul Heymann and Hector Garcia-Molina. Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems. http: //ilpubs.stanford.edu:8090/775/. [Online, accessed 09/15/2015].

[HG08] James A. Hendler and Jennifer Golbeck. “Metcalfe’s Law, Web 2.0, and the Semantic Web”. In: Journal of Web Semantics 6.1 (2008), pp. 14–20.

[Hic11] Ian Hickson. HTML Living Standard. http://www.whatwg.org/specs/ web-apps/current-work/multipage/. [Online; accessed 04/20/2013].

[Hit+08] Pascal Hitzler et al. Semantic Web. Springer, 2008.

[HK05] Andrew Hogue and David Karger. “Thresher: Automating the Unwrap- ping of Semantic Content from the World Wide Web”. In: Proceedings of the Fourteenth International World Wide Web Conference. 2005, pp. 86–95.

[HNP09] A. Halevy, P. Norvig, and F. Pereira. “The Unreasonable Effectiveness of Data”. In: IEEE Intelligent Systems 24.2 (2009), pp. 8–12.

[Hor13] Andrew Horton. WhatWeb. http://www.morningstarsecurity.com/ research/whatweb. [Online; accessed 03/20/2013].

[How+11] Philip N. Howard et al. Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring? http://pitpi.org/index.php/ 2011/09/11/opening-closed-regimes-what-was-the-role-of-social- media-during-the-arab-spring/. [Online; accessed 01/10/2013].

[HP09] A.L. Hughes and L. Palen. “Twitter Adoption and Use in Mass Conver- gence and Emergency Events”. In: International Journal of Emergency Management 6.3 (2009), pp. 248–260.

[Hub07] Peter Hubwieser. “Datenmodellierung und Datenbanken”. German. In: Didaktik der Informatik. 2007, pp. 155–176.

[Hug68] G. Hughes. “On the Mean Accuracy of Statistical Pattern Recognizers”. In: IEEE Transactions on Information Theory 14.1 (1968), pp. 55–63.

[Hun07] J. D. Hunter. “Matplotlib: A 2D Graphics Environment”. In: Comput- ing In Science & Engineering 9.3 (2007), pp. 90–95.

[JL10] Nitin Jindal and Bing Liu. “A Generalized Tree Matching Algorithm Considering Nested Lists for Web Data Extraction”. In: The SIAM International Conference on Data Mining. 2010, pp. 930–941.

[Jon72] Karen Spärck Jones. “A Statistical Interpretation of Term Specificity and its Application in Retrieval”. In: Journal of Documentation 28.1 (1972), pp. 11–21.

[Jür12] Jürgen Urbanski and Mathias Weber. Big Data im Praxiseinsatz – Szenarien, Beispiele, Effekte. https://www.bitkom.org/Bitkom/Publikationen/Publikation_4232.html. [Online, accessed 09/15/2015].

[KHL08] Michael Kaisser, Marti A Hearst, and John B Lowe. “Improving Search Results Quality by Customizing Summary Lengths”. In: ACL. 2008, pp. 701–709.

[KHS07] Ron Kohavi, Randal M. Henne, and Dan Sommerfield. “Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the Hippo”. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007, pp. 959– 967.

[Kle02] M. Klein. DAML+OIL and RDF Schema Representation of UN- SPSC. http://www.cs.vu.nl/~mcaklein/unspsc/. [Online; accessed 03/20/2013].

[Knu84] Donald Ervin Knuth. “Literate Programming”. In: The Computer Journal 27.2 (1984), pp. 97–111.

[Koh95] Ron Kohavi. “A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection”. In: Proceedings of the 14th Inter- national Joint Conference on Artificial Intelligence - Volume 2. 1995, pp. 1137–1143.

[Kot07] S. B. Kotsiantis. “Supervised Machine Learning: A Review of Classification Techniques”. In: Informatica. 2007, pp. 249–268.

[Kus03] Nicholas Kushmerick. “Finite-State Approaches to Web Information Extraction”. In: Information Extraction in the Web Era. 2003, pp. 77– 91.

[KWD97] N. Kushmerick, D. Weld, and R. Doorenbos. “Wrapper Induction for Information Extraction”. In: Proceedings of 15th International Con- cerence on Artificial Intelligence (1997), pp. 729–735.

[Lac05] Lee W. Lacy. OWL: Representing Information Using the Web Ontology Language. Trafford, 2005.

[Lev66] Vladimir Iosifovich Levenshtein. “Binary Codes Capable of Correcting Deletions, Insertions and Reversals”. In: Soviet Physics Doklady. 1966, p. 707.

[LG12] Markus Lanthaler and Christian Gütl. “On Using JSON-LD to Create Evolvable RESTful Services”. In: Proceedings of the Third Interna- tional Workshop on RESTful Design. 2012, pp. 25–32.

[LJC11] Hyung Seok Lee, Dong Soo Jin, and Jemi Choi. “Effects of Information Intermediary Functions of Comparison Shopping Sites on Customer Loyalty”. In: Journal of Internet Banking & Commerce 16.2 (2011), pages n.a.

[LK10] Sangwon Lee and Richard J. Koubek. “The Effects of Usability and Web Design Attributes on User Preference for E-Commerce Web Sites”. In: Computers in Industry 61.4 (2010), pp. 329–341.

[LLC07] MaxMind LLC. GeoLite Free Downloadable Databases. http://dev. maxmind.com/geoip/legacy/geolite/. [Online; accessed 10/04/2013].

[Löb13] S. Löbner. Understanding Semantics, 2nd Edition. Taylor & Francis, 2013.

[Mac+67] James MacQueen et al. “Some Methods for Classification and Analysis of Multivariate Observations”. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. 1967, p. 14.

[Mag] Magento. Magento eCommerce Software and Platform Magento. http://www.magentocommerce.com/. [Online; accessed 10/04/2013].

[MB12] Hannes Mühleisen and Christian Bizer. “Web Data Commons - Ex- tracting Structured Data from Two Large Web Corpora”. In: 5th Linked Data on the Web Workshop. 2012, pages n.a.

[McG01] D.L. McGuinness. DAML version of UNSPSC - Universal Standard Products and Services Classification Code. http://www.ksl.stanford. edu/projects/DAML/UNSPSC.daml. [Online; accessed 03/20/2013].

[McG91] T. McGuinness. “Markets and Managerial Hierarchies”. In: Markets, Hierarchies and Networks: The Coordination of Social Life. 1991, pp. 66–81.

[Mck11] Wes Mckinney. “Pandas: A Foundational Python Library for Data Analysis and Statistics”. In: Proceedings of 2011 International Con- ference for High Performance Computing, Networking, Storage and Analysis. 2011, pp. 273–297.

[Met] Meteor. Meteor. http://www.meteor.com/. [Online; accessed 10/04/2013].

[ML07] Zdravko Markov and Daniel T. Larose. Data Mining the Web - Un- covering Patterns in Web Content, Structure, and Usage. Wiley, 2007.

[MMK99] Ion Muslea, Steve Minton, and Craig A. Knoblock. “A Hierarchical Ap- proach to Wrapper Induction”. In: Proceedings of the 3rd International Conference on Autonomous Agents. 1999, pp. 190–197.

[MP12] Peter Mika and Tim Potter. “Metadata Statistics for a Large Web Corpus”. In: Linked Data on the Web (LDOW2012). 2012, pages n.a.

[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[MTL78] Robert McGill, John W. Tukey, and Wayne A. Larsen. “Variations of Box Plots”. In: The American Statistician 32.1 (1978), pp. 12–16.

[NA13] N/A. Python 2: Brief Tour of The Standard Library. https://docs.python.org/2/tutorial/stdlib.html. [Online; accessed 01/10/2013].

[Owe+11] Sean Owen et al. Mahout in Action. Manning Publications, 2011.

[OWL01] Leo Obrst, Robert E. Wray, and Howard Liu. “Ontological Engineering for B2B E-Commerce”. In: Proceedings of the International Conference on Formal Ontology in Information Systems. 2001, pp. 117–126.

[Paz+03] Maria Teresa Pazienza et al. “Combining Ontological Knowledge and Wrapper Induction Techniques into an E-Retail System”. In: Workshop on Adaptive Text Extraction and Mining ATEM03 held with ECML/P- KDD. 2003, pp. 50–57.

[Pea01] K. Pearson. “On Lines and Planes of Closest Fit to Systems of Points in Space”. In: Philosophical Magazine 2.6 (1901), pp. 559–572.

[Ped+11] Fabian Pedregosa et al. “Scikit-Learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[PG07] Fernando Perez and Brian E. Granger. “IPython: A System for Inter- active Scientific Computing”. In: Computing in Science & Engineering 9.3 (2007), pp. 21–29.

[Pop+03] Borislav Popov et al. “Towards Semantic Web Information Extraction”. In: Human Language Technologies Workshop at the 2nd International Semantic Web Conference. 2003, pages n.a.

[Pow11] D. M. W. Powers. “Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation”. In: Journal of Machine Learning Technologies 2.1 (2011), pp. 37–63.

[Pre] PrestaShop. PrestaShop. http://www.prestashop.com/en/. [Online; accessed 10/04/2013].

[PS08] Eric Prudhommeaux and Andy Seaborne. SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query. [Online, accessed 09/15/2015].

[QB11] Alexander J. Quinn and Benjamin B. Bederson. “Human Computation: A Survey and Taxonomy of a Growing Field”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2011, pp. 1403–1412.

[QD09] Xiaoguang Qi and Brian D. Davison. “Web Page Classification: Fea- tures and Algorithms”. In: ACM Computing Surveys 41.2 (2009), pp. 1– 31.

[Qui86] J. R. Quinlan. “Induction of Decision Trees”. In: Machine Learning 1.1 (1986), pp. 81–106.

[QY10] Taofen Qiu and Tianqi Yang. “Automatic Information Extraction from E-Commerce Web Sites”. In: International Conference on E-Business and E-Government ICEE. 2010, pp. 1399–1402.

[Raj12] Dinesh Raju. How Many Online Stores Are there in the U.S.? http: //blog.referralcandy.com/2012/08/14/how-many-online-stores-are- there-in-the-u-s/. [Online; accessed 10/04/2013].

[Red98] T.C. Redman. “The Impact of Poor Data Quality on the Typical Enterprise”. In: Communications of the ACM 41.2 (1998), pp. 79–82.

[RHJ99] Dave Raggett, Arnaud Le Hors, and Ian Jacobs. HTML 4.01 Spec- ification. http : / / www . w3 . org / TR / html401/. [Online; accessed 04/20/2013].

[Rob12] Tom Robertshaw. October 2012 eCommerce Survey. http : / / tomrobertshaw.net/2012/11/october-2012-ecommerce-survey/. [On- line; accessed 01/10/2013].

[Rod09] Marko A. Rodriguez. “A Reflection on the Structure and Process of the Web of Data”. In: Bulletin of the American Society for Information Science and Technology 35.6 (2009), pp. 38–43.

[Rou87] Peter Rousseeuw. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis”. In: Journal of Computational and Applied Mathematics 20.1 (1987), pp. 53–65.

[RU12] A. Rajaraman and J.D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2012.

[Saf13] Nathan Safran. 310 Million Visits: Nearly Half of All Web Site Traffic Comes From Natural Search. http://www.conductor.com/blog/2013/06/data-310-million-visits-nearly-half-of-all-web-site-traffic-comes-from-natural-search/. [Online; accessed 10/04/2015].

[SBH06] Nigel Shadbolt, Tim Berners-Lee, and Wendy Hall. “The Semantic Web Revisited”. In: IEEE Intelligent Systems 21.3 (2006), pp. 96–101.

[Sch+05] Volker Schmitz et al. Spezifikation des BMECat 2005. http : / / www . schrack . at / fileadmin / f / at / downloadcenter / Tools_ - _Kundenschnittstellen/Anwendung_BMECat.pdf. [Online, accessed 09/15/2015].

[sch12] schema.org. Schema Blog: GoodRelations and Schema.org. http://blog. schema.org/2012/11/good-relations-and-schemaorg.html. [Online; accessed 01/10/2013].

[Sch13] Schema.org. Home - schema.org. http://schema.org/. [Online; accessed 01/10/2013].

[Scra] Scrapy. Scrapy. http://scrapy.org/. [Online; accessed 09/13/2015].

[Scrb] Scrapy. Scrapy Documentation - Scrapy at a Glance. http://doc.scrapy. org/en/latest/intro/overview.html. [Online; accessed 10/04/2013].

[SCV07] Leo Sauermann, Richard Cyganiak, and Max Voelkel. Cool URIs for the Semantic Web. http://www.w3.org/TR/cooluris/. [Online, accessed 09/15/2015].

[SGH12] Alex Stolz, Mouzhi Ge, and Martin Hepp. “GR4PHP: A Programming API for Consuming E-Commerce Data from the Semantic Web”. In: Proceedings of the First Workshop on Programming the Semantic Web. 2012, pages n.a.

[SGH13] Uwe Stoll, Mouzhi Ge, and Martin Hepp. “Understanding the Impact of E-Commerce Software on the Adoption of Structured Data on the Web”. In: Business Information Systems. 2013, pp. 100–112.

[SH13a] Alex Stolz and Martin Hepp. “Currency Conversion the Linked Data Way”. In: SALAD@ESWC. 2013, pp. 44–55.

[SH13b] Alex Stolz and Martin Hepp. “From RDF to RSS and Atom: Con- tent Syndication with Linked Data”. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media. 2013, pp. 236–241.

[SH14] Uwe Stoll and Martin Hepp. “Detection of E-Commerce Systems with Sparse Features and Supervised Classification”. In: 10th IEEE Inter- national Conference on e-Business Engineering (ICEBE 2013). 2014, pages n.a.

[Sha05] Y. Shafranovich. Common Format and MIME Type for Comma- Separated Values CSV Files. https://tools.ietf.org/html/rfc4180. [Online, accessed 09/15/2015].

[SHB06] Gerd Stumme, Andreas Hotho, and Bettina Berendt. “Semantic Web Mining: State of the Art and Future Directions”. In: Web Seman- tics: Science, Services and Agents on the World Wide Web 4.2 (2006), pp. 124–143.

[Sig05] Oreste Signore. Representing Knowledge in the Semantic Web. http: / / www . w3c . it / talks / 2005 / openCulture / slide7 - 0 . html. [Online; accessed 10/04/2013].

[Sin12] Amit Singhal. Introducing the Knowledge Graph: Things, not Strings. https://googleblog.blogspot.co.uk/2012/05/introducing-knowledge- graph-things-not.html. [Online; accessed 10/04/2015].

[Sod99] Stephen Soderland. “Learning Information Extraction Rules for Semi- Structured and Free Text”. In: Machine Learning 34.1-3 (1999), pp. 233–272.

[SRH13a] Alex Stolz, Bene Rodriguez-Castro, and Martin Hepp. RDF Translator: A RESTful Multi-Format Syntax Converter for the Semantic Web. http://www.stalsoft.com/publications/rdf-translator-TR.pdf. [Online, accessed 09/15/2015].

[SRH13b] Alex Stolz, Benedicto Rodriguez-Castro, and Martin Hepp. “Using BMEcat Catalogs as a Lever for Product Master Data on the Semantic Web”. In: The Semantic Web: Semantics and Big Data. 2013, pp. 623– 638.

[Sto10] Uwe Stoll. GR-Notify. http://gr-notify.appspot.com. [Online; accessed 03/20/2013].

[Sva06] Vojtech Svatek. “On the Design and Exploitation of Presentation On- tologies for Information Extraction”. In: Proceedings of the Workshop on Mastering the Gap, From Information Extraction to Semantic Rep- resentation, held in conjunction with the European Semantic Web Con- ference 2006. 2006, pages n.a.

[Swa02] A. Swartz. application/rdf+xml Media Type Registration. http://www. aaronsw.com/2002/draft- w3c- rdfcore- rdfxml- mediatype- 01.html. [Online; accessed 04/20/2013].

[Tan11] Ole Tange. “GNU Parallel - the Command-Line Power Tool”. In: login: The USENIX Magazine 36.1 (2011), pp. 42–47.

[Tea+11] R Developement Core Team et al. R: A Language Environment for Statistical Computing. http://www.gbif.org/resource/81287. [Online, accessed 09/15/2015].

[TH14] László Török and Martin Hepp. “Towards Portable Shopping Histo- ries: Using GoodRelations to Expose Ownership Information to E- Commerce Sites”. In: The Semantic Web: Trends and Challenges: 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. 2014, pp. 691–705. doi: 10.1007/978-3-319-07443-6_46.

[Tol+03] Robert Tolksdorf et al. “Business to Consumer Markets on the Seman- tic Web”. In: OTM Workshops. 2003, pp. 816–828.

[Tre+08] Volker Tresp et al. “Towards Machine Learning on the Semantic Web”. In: Uncertainty Reasoning for the Semantic Web I. 2008, pp. 282–314.

[UG96] Mike Uschold and Michael Gruninger. “Ontologies: Principles, Methods and Applications”. In: Knowledge Engineering Review 11.2 (1996), pp. 93–136.

[Var] Various. Creative Commons Attribution 3.0 Unported (CC BY 3.0). http://creativecommons.org/licenses/by/3.0/legalcode. [Online, accessed 09/15/2015].

[Vir] Virtuemart. Virtuemart. http://virtuemart.de/. [Online, accessed 04/20/2015].

[Wal12] Brian K. Walker. The Forrester Wave™: B2C Commerce Suites, Q3 2012. https://www.forrester.com/The+Forrester+Wave+B2C+ Commerce+Suites+Q3+2012/fulltext/-/E-RES80141. [Online, ac- cessed 09/15/2015].

[WD10] Daya C. Wimalasuriya and Dejing Dou. “Components for Information Extraction: Ontology-Based Information Extractors and Generic Plat- forms”. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 2010, pp. 9–18.

[web14] webdatacommons.org. Web Data Commons - RDFa, Microdata, and Microformats Data Sets - December 2014. http://webdatacommons.org/structureddata/. [Online, accessed 09/15/2015].

[Wic13] Karen Wickre. Celebrating Twitter 7. https://blog.twitter.com/2013/ celebrating-twitter7. [Online; accessed 01/10/2013].

[Yeu+09] Ching-man Au Yeung et al. “Decentralization: The Future of Online Social Networking”. In: W3C Workshop on the Future of Social Net- working Position Papers. 2009, pages n.a.

[Zha04] Tong Zhang. “Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms”. In: Proceedings of the Twenty-first International Conference on Machine Learning. 2004, p. 116.

[ZL07] Yanhong Zhai and Bing Liu. “Extracting Web Data Using Instance- Based Learning”. In: World Wide Web 10.2 (2007), pp. 113–132.