Kurt Uwe Stoll Using Existing Structured Data As a Learning Set
Kurt Uwe Stoll

Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce

Doctoral Thesis

Fakultät für Wirtschafts- und Organisationswissenschaften

Univ.-Prof. Dr. Hans A. Wüthrich
Univ.-Prof. Dr. Martin Hepp
Univ.-Prof. Dr. Claudius Steinhardt
Univ.-Prof. Dr. Stephan Kaiser
Univ.-Prof. Dr. Karl Morasch
12.7.2016
Dr. rerum politicarum (Dr. rer. pol.)
1. November 2016

Author: Kurt Uwe Stoll
Supervisor: Prof. Dr. Martin Hepp

A thesis submitted in partial fulfillment of the requirements for the degree of Dr. rer. pol. at the UNIVERSITÄT DER BUNDESWEHR MÜNCHEN

November 1, 2016

“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
“But it was the Great Question! The Ultimate Question of Life, the Universe and Everything,” howled Loonquawl.
“Yes,” said Deep Thought with the air of one who suffers fools gladly, “but what actually is it?”
A slow stupefied silence crept over the men as they stared at the computer and then at each other.
“Well, you know, it’s just Everything ... Everything ...” offered Phouchg weakly.
“Exactly!” said Deep Thought. “So once you know what the question actually is, you’ll know what the answer means.”
Douglas Adams - The Hitchhiker’s Guide to the Galaxy

Abstract

Using Existing Structured Data as a Learning Set for Web Information Extraction in E-Commerce
by Kurt Uwe Stoll

In recent years, e-commerce has grown massively and evolved into a major driver of technological innovation on the Web.
The Semantic Web is a vision to advance the technological foundation of the Web so that computers are empowered to better extract and process information from Web content [AH04, p. 1f.]. A core principle of the Semantic Web is to augment Web markup with structured data suited for machine processing, instead of markup that is only suitable for rendering the information for human consumption [AH04, p. 1f.]. Applying the Semantic Web to e-commerce shows significant potential, in particular for increasing the efficiency and precision of search, improving data quality, and raising market efficiency. Despite a significant increase in adoption, the percentage of Web sites that provide data markup for e-commerce information is still limited and will likely remain limited for many years to come. The data is predominantly generated by shop software extension modules, which cover only a small fraction of the Web. At the same time, automatic methods for Web Information Extraction are still not able to reconstruct the full amount of structured data behind Web content.

To address this issue, we propose a novel method for Web Information Extraction, targeted at the e-commerce domain. The approach exploits (1) the market dominance of a small number of e-commerce systems, (2) the patterns those systems expose in Web page generation, and (3) the existing structured data in e-commerce. We evaluate our findings by splitting our dataset into a learning set and an evaluation set. Our results show that it is feasible to extract structured data from e-commerce sites that do not include data markup, solely on the basis of template similarity and existing markup as training data. The fundamental idea is to combine similarities in Web page templates, caused by the popularity of off-the-shelf shop software, with the use of data markup found in a subset of Web pages as training data for machine learning.
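The idea summarized above can be illustrated with a minimal sketch. It is not the method developed in this thesis, only a toy instance of the principle under loudly stated assumptions: pages carry microdata-style `itemprop` attributes as existing markup, the template is characterized by tag names and CSS classes, and the names `PathCollector`, `collect`, `learn`, `extract`, and `page` as well as the sample template are hypothetical. Marked-up pages serve as the learning set, and the learned structural paths are then applied to a page from the same template that has no markup.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class PathCollector(HTMLParser):
    """Collects (structural path, itemprop or None, text) for every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (path step, itemprop) for each open element
        self.records = []  # (path, itemprop, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Template feature: tag name plus CSS class; the markup itself is left out.
        step = tag + "." + (attrs.get("class") or "")
        self.stack.append((step, attrs.get("itemprop")))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(step for step, _ in self.stack)
            self.records.append((path, self.stack[-1][1], text))


def collect(html):
    parser = PathCollector()
    parser.feed(html)
    return parser.records


def learn(marked_pages):
    """Map each structural path to the property its existing markup declares."""
    rules = {}
    for html in marked_pages:
        for path, prop, _ in collect(html):
            if prop:
                rules[path] = prop
    return rules


def extract(rules, html):
    """Apply learned paths to a page; any markup on the page is ignored."""
    return {rules[path]: text for path, _, text in collect(html) if path in rules}


# Hypothetical template, standing in for pages generated by one shop system.
TEMPLATE = ('<div class="product"><h1 class="name"{n}>{name}</h1>'
            '<span class="price"{p}>{price}</span></div>')


def page(name, price, marked):
    n = ' itemprop="name"' if marked else ""
    p = ' itemprop="price"' if marked else ""
    return TEMPLATE.format(n=n, p=p, name=name, price=price)


learning_set = [page("Lamp", "19.90", True), page("Desk", "149.00", True)]
evaluation_set = [page("Chair", "59.00", True)]

rules = learn(learning_set)

# Evaluation: compare extracted values with the markup held back as ground truth.
correct = total = 0
for html in evaluation_set:
    truth = {prop: text for _, prop, text in collect(html) if prop}
    predicted = extract(rules, html)
    correct += sum(predicted.get(prop) == text for prop, text in truth.items())
    total += len(truth)
accuracy = correct / total
print(accuracy)  # → 1.0

# A page from the same template without any data markup.
unmarked = page("Shelf", "35.50", False)
print(extract(rules, unmarked))  # → {'name': 'Shelf', 'price': '35.50'}
```

Exact path matching only works here because the toy pages share an identical template. Real shop pages vary, so a practical system would presumably need a similarity measure over templates and a learned model rather than a lookup table.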
Kurzzusammenfassung

Existierende strukturierte Daten als Lernset für Webinformationsextraktion im Bereich E-Commerce
by Kurt Uwe Stoll

The e-commerce sector has grown strongly in recent years and has established itself as a driving force of technical innovation on the Web. The Semantic Web is a vision to improve the technological foundations of the Web so that computers can more easily extract and process information from Web content [AH04, p. 1f.]. Its core principle is to enrich Web page code, which was originally designed for presentation to humans, with structured data that is machine-readable [AH04, p. 1f.]. In the context of e-commerce, the application of Semantic Web technologies holds significant potential, in particular for efficiency and search precision, for improving data quality, and for improving market efficiency. Despite a significant increase in the use of these technologies, the share of Web sites that use structured data remains limited and will in all likelihood stay limited in the coming years. The data is predominantly generated by shop extensions. At the same time, automated methods from the field of Web Information Extraction are not yet able to represent the entirety of the information contained in Web pages as structured data.

To solve this problem, a new method for Web Information Extraction for e-commerce is proposed. It exploits the market-dominating position of a few e-commerce systems, the patterns these systems produce when generating Web pages, and the existing structured data from semantic e-commerce. The results are evaluated by splitting the available data into training data and test data. Our results show that the approach can generate additional structured data solely by using similarities in templates and existing markup. The fundamental idea is to combine similarities in Web page templates, which arise from the popularity of standard shop software, with the use of structured markup as training data for machine learning.

Acknowledgements

First of all, I would like to sincerely thank my supervisor, Prof. Dr. Martin Hepp, for his guidance, support and encouragement. Without his supervision and trust in my ideas, this thesis would never have existed. Working with him was a highly inspiring experience. Additionally, I want to thank Prof. Dr. Claudius Steinhardt for taking over the role of co-supervisor.

I want to thank my colleagues Dr. Mouzhi Ge, Andreas Radinger, Dr. Bene Rodriguez, Alex Stolz and Laszlo Török for the inspiring discussions and the productive atmosphere at work. I owe progress at many critical points along this way to you.

Many thanks also go to all my dear friends, without whom life would never have been so colorful.

Most of all, I want to thank my wife Nadine. You are the best thing that has ever happened to me. Without your love, I would never have come so far.

Especially, I want to thank my family. In the rest of my life, I can never pay back the love and care I owe to my mother.

Finally, I want to thank Christopher David Ryan for the friendly provision of the title page graphic.

Last but not least, I would like to thank the Universität der Bundeswehr München, which funded this research for a significant period and provided a highly creative atmosphere.

Contents

Abstract
Kurzzusammenfassung
Acknowledgements
List of Figures
List of Tables
Listings
Abbreviations
1 Introduction
  1.1 Problem Statement and Hypothesis
  1.2 Relevance
    1.2.1 Potential of the Semantic Web for E-Commerce
    1.2.2 Existing Semantic E-Commerce Data and Limitations
  1.3 Contributions
  1.4 Research Questions
  1.5 Experimental Design
  1.6 Organization of the Thesis
  1.7 Previously Published Work
2 Structured Data: Fundamentals and Usage for E-Commerce
  2.1 Semi-Automated Structured Data Generation on the Semantic Web
    2.1.1 The Web
      2.1.1.1 Economical Dimensions
      2.1.1.2 Social Dimensions
      2.1.1.3 Design Principles of the Web
      2.1.1.4 Fundamental Problems of the Web
    2.1.2 Semantic Web
      2.1.2.1 Vision
      2.1.2.2 Semantic Web Technology Stack
      2.1.2.3 Linked Data
      2.1.2.4 Schema.org, Google Semantic Web Tools and Google Knowledge Graph
    2.1.3 Conclusion
  2.2 Semantic E-Commerce
    2.2.1 Technological Foundations of E-Commerce
    2.2.2 The GoodRelations Web Ontology for E-Commerce
      2.2.2.1 Goals and Design Principles
      2.2.2.2 Data Model
      2.2.2.3 Features, Documentation, and Ecosystem
      2.2.2.4 Existing GoodRelations Data on the Web
    2.2.3 Existing Research in Semantic E-Commerce
    2.2.4 Real-World Usage of Structured E-Commerce Data
    2.2.5 Economical Implications of Semantic E-Commerce
    2.2.6 Conclusion
  2.3 Automated Generation of Structured Data with Web Information Extraction
    2.3.1 Research Strains in Web Information Extraction and Relation to Semantic Web Research
    2.3.2 Classical Web Information Extraction Approaches
    2.3.3 Recent Approaches to Web Information