Web Observations: Analysing Web Data through Automated Data Extraction

Inaugural dissertation for the attainment of the degree of Doctor of Philosophy, submitted to the Faculty of Science (Philosophisch-Naturwissenschaftliche Fakultät) of the University of Basel

by

Alexander Olivier Gröflin from Basel, Switzerland

Basel, 2020

Original document stored on the document server of the University of Basel: edoc.unibas.ch

This work is protected by copyright under the Creative Commons licence "Attribution-NonCommercial-NoDerivatives 4.0 International". The complete licence agreement can be viewed at https://creativecommons.org/licenses/by-nc-nd/4.0/.

Approved by the Faculty of Science on the recommendation of

Prof. Dr. Helmar Burkhart, University of Basel, and Prof. Dr. Liz Bacon, University of Greenwich

Basel, 11 December 2018

Prof. Dr. Martin Spiess, Dean

To my parents, Gertraud and Dr. med. Urs Beat Gröflin-Glück, my sister Lilian and my brothers Stefan and Fabian.

Acknowledgments

Writing a thesis is a lifetime task, and thanks to the strong support of certain people I was able to complete it successfully. First and foremost, I want to express my gratitude to my supervisors, Prof. Helmar Burkhart and Prof. Liz Bacon. Their suggestions and inputs were crucial for making my thesis the best it can possibly be. I am also very grateful for the trust given by Prof. Helmar Burkhart over the time I was allowed to be part of his now dissolved research group, "High Performance and Web Computing (HPWC)". I wish to express my sincere gratitude to Prof. Sabine Gless, who not only provided financial support for the last year of my employment but also gave me a priceless insight into the law and its growing influence on data and computer science. The confidence she placed in me and the experience at the Faculty of Law at the University of Basel were crucial for turning my work into an interdisciplinary thesis.

I would also like to thank my co-workers, friends, and fellow students Danilo Guerrera, Antonio Maffia, Dominic Bosch, Bas Kin, Robert Frank, Yvonne Wegmüller, Patricia Krattiger, Mario Weber, Christian Frei and Joëlle Simonet at the Department of Computer Science and Mathematics, Carl Jauslin, Laura Macula, Dario Stagno, Christine Möhrke-Sobolewski, Nadine Zurkinden, Lia Börlin, Armand Kurath, Quirin Meier and Claudine Abt at the Faculty of Law, and Yun Seok Lee at the London School of Economics and Political Science. The constant critical discussions on the various aspects of conducting research for a thesis were very helpful and of great support to me. My special thanks go to Martin Guggisberg, who was particularly influential to the progress and development of the thesis. His constant support and invaluable guidance paved the way to this very thesis. Additionally, the above-mentioned colleagues have turned into dear friends who not only supported me on a working level but built me up whenever I needed it. This is an experience very dear to me, and I am glad I could share these last years with such positive, uplifting and supportive people.

I am also very grateful to Pauline Pfirter and Bernhard Egger for thoroughly reading my thesis and polishing my writing and structure to perfection. Additionally, I would like to express my thanks to the Freie Akademische Gesellschaft (FAG) and Prof. Beat Schöneberger, who provided me with the financial support required to finish this thesis. Furthermore, I would like to express my gratitude to the Swiss National Science Foundation for partially funding this project (SNF NRP 75 Big Data 167182 "Legal Challenges in Big Data. Allocating benefits. Averting risks.").

Finally, I could not have finished this project without the immense support of my family and closest friends, who would cheer me up and push me forward whenever necessary. Thanks to my mother, Gertraud Gröflin-Glück, who made me go to the very first class of the Computer Science High School, I could obtain a Swiss Federal Certificate of Computer Science, which was only the starting point of my academic career. Later on, also thanks to my father, Dr. med. Urs Beat Gröflin, I could continue my education in Switzerland, Italy and the United Kingdom.

Abstract

In this thesis, a generic architecture for Web observations is introduced. Beginning with fundamental data aspects and technologies for building Web observations, requirements and architectural designs are outlined. Because Web observations are basic tools to collect information from any Web resource, legal perspectives are discussed in order to give an understanding of recent regulations, e.g. the General Data Protection Regulation (GDPR). The general idea of Web observatories, their concepts, and experiments are presented to identify the best solution for Web data collection and, based thereon, visualisation from any kind of Web resource. With the help of several Web observation scenarios, data sets were collected, analysed and eventually published in a machine-readable or visual form for users to interpret.

The main research goal was to create a Web observation based on an architecture that is able to collect information from any given Web resource in order to make sense of a broad range of yet untapped information sources. To find this generally applicable architectural structure, several research projects with different designs were conducted. Eventually, the container-based building block architecture emerged from these initial designs as the most flexible architectural structure. Thanks to these considerations and architectural designs, a flexible and easily adaptable architecture was created that is able to collect data from all kinds of Web resources. Thanks to such broad Web data collections, users can gain a more comprehensive understanding of and insight into real-life problems and the efficiency and profitability of services, as well as valuable information on the changes of a Web resource.

Keywords: Web Observatory, Web Observation, Web Data Collection, Web Architecture, Law & Data, Regulatory

Contents

Acknowledgments
Abstract
Contents
Introduction

I Fundamentals

1 Data Aspects
  1.1 It is all about Data
    1.1.1 Definitions of Data
    1.1.2 Web Data
    1.1.3 Data Sets
    1.1.4 Open Data
    1.1.5 Big Data

2 Technologies for Building Web Observatories
  2.1 Observation of Web Data
    2.1.1 Web Mashups
    2.1.2 Data Flow Systems
    2.1.3 Condition Action Systems
  2.2 Web Observatory or Web Observations
    2.2.1 Publish and Subscribe Pattern
    2.2.2 Push Principle
    2.2.3 Poll Principle
    2.2.4 Optimal Time Interval
    2.2.5 Operating-System-Level Virtualisation
  2.3 Data Mining
    2.3.1 Mining Information from the Web
    2.3.2 Extraction Methods
    2.3.3 Web Mining Algorithms

II Methodology

3 Research Plan
  3.1 Research Contribution
  3.2 Research Problem
  3.3 Final Research Questions

III Law & Data

4 Legal and Political Perspectives
  4.1 Introduction
  4.2 Legal Issues
    4.2.1 Switzerland
    4.2.2 European Union
    4.2.3 United States
    4.2.4 United Kingdom
  4.3 Policy and Political Issues
    4.3.1 Open Data Access
    4.3.2 Predictive Policing
  4.4 Conclusion and Outlook
    4.4.1 Recommendations

IV Creating a Web Observation

5 Considerations for Web Observations
  5.1 Observation of States
  5.2 Awareness of Data Collections
  5.3 Examination of Web Resources
    5.3.1 Conceptual and Technical Design
    5.3.2 Definition of Input/Output
    5.3.3 Selection of Data Structure

6 Architectural Designs
  6.1 Event Condition Action (ECA) System
    6.1.1 Architecture
    6.1.2 Conclusion
  6.2 Building Block Architecture
    6.2.1 Building Blocks
    6.2.2 Web Resource and Web Client Blocks
    6.2.3 Web Observation Block
    6.2.4 Server Block
    6.2.5 Container System for Linking Building Blocks
    6.2.6 Conclusion

V Results from Web Observations

7 Measurements
  7.1 Purpose of Web Observations
    7.1.1 Latency-Driven Web
  7.2 Experiments
    7.2.1 Setup
    7.2.2 Results
    7.2.3 Conclusion

8 Observation Scenarios
  8.1 Catch a Car
    8.1.1 Web Observation Task
    8.1.2 Verifying the Results
    8.1.3 Interpretation of Data
  8.2 Public Transportation Data
    8.2.1 Web Observation Task
    8.2.2 Verifying the Results
    8.2.3 Interpretation of Data
  8.3 News Article Comments
    8.3.1 Web Observation Task
    8.3.2 Verifying the Results
    8.3.3 Interpretation of Data
  8.4 Price Observation of EasyJet
    8.4.1 Web Observation Task
    8.4.2 Verifying the Results
    8.4.3 Interpretation of Data
  8.5 WhatsApp Meta Data
    8.5.1 Web Observation Task
    8.5.2 Verifying the Results
    8.5.3 Interpretation of Data

VI Conclusions & Outlook

9 Conclusion and Outlook
  9.1 Contributions
    9.1.1 Summary of Contributions
  9.2 Conclusion for Web Observations
    9.2.1 Web Data Evaporation
  9.3 Limitations and Future Work

Bibliography
List of Figures
Appendices
  A Glossary
  B Programming Languages
  Alexander Olivier Gröflin

Introduction

It doesn’t matter how much junk there is out there because: you don’t have to read it. — Sir Timothy John Berners-Lee, inventor of the World Wide Web (born June 8, 1955)

About three decades ago, Tim Berners-Lee started a project that has changed the world as we know it today. The World Wide Web is the biggest information and knowledge network ever known to humankind. And while the Web provides content for every single question one could think of, it is at the same time a massive information space to which even more information is added over time. Now, what if a person is searching for a specific kind or type of information? What if that person finds loads and piles of data on the Web and cannot make sense out of it? What is the difference between data and information? How does that person know what part of this data contains relevant information and what is just junk and waste?

Like an observatory reaching for the stars, Web scientists are the telescopes that help people filter and extract the "right" information from the structured, semi-structured and mainly unstructured data available on the Web. For example, Web scientists observe and analyse human behaviour and current events. There seems to be a growing need for further research in the area of Web science. With all this in mind, it is the Web scientist's task to create a technical structure for the collection of relevant information. So-called Web architectures are able to focus on the observation of the Web, specifically on any changes in information of a Web resource. If properly designed, such a Web data collection architecture – academics also call it a Web observation – is able to collect vast amounts of data, changes to the data as well as new additions to the data with respect to the time scale.

The notion of Web observatories, in which a system gathers and links data on the Web for the advancement of economic and social prosperity, has been outlined by the Web Science Trust. Furthermore, it has been defined as "a system that locates and describes existing data sets [on the Web] around the world" [1]. In contrast, a Web observation architecture aims to harvest and collect data from the Web. However, the proper design of a Web architecture for Web observations is a challenging task. In the design process, the goal is to create a technically sound architecture that is able to observe and collect data from any kind of Web resource. This work outlines purposive collection algorithms, a comprehensive event condition action system that allows events on the Web to be detected and processed, and a task-specific Web observation system for all sorts of Web data collections. Predominantly, it is a computer scientist's job to consider the technical aspects of a Web resource, possible future changes to its structure as well as user interactions and usage analytics of a Web page.

This thesis has taken on the task of creating a generally functioning Web architecture that is capable of collecting Web data from any given Web resource with little effort. Guided by the research questions, the author has created several Web architectures to test and find the most adaptive solution for all kinds of Web resources. The final Web architecture consists of a container solution that is flexible enough to observe desired Web resources with the technical means necessary. This thesis analysed and answered the following research questions:

• Is it legally allowed to systematically collect Web data without anybody knowing about it?

• How can a flexible architecture be designed that is able to collect data from desired Web resources over a long period of time?

• What kind of Web data may be the most interesting for a Web observation?

To answer these questions, the thesis is structured in six parts covering interdisciplinary aspects of data collection, the law, and some influential social science aspects. First, the fundamentals of data and Web science are outlined. This initial part mainly focuses on the everyday questions raised by the use and application of the Web. These fundamentals sketch out the environment in which this thesis and the research questions are set.

Whilst the early Web was mostly free from national legal regulations beyond already existing law, legislators now try to take on the legal questions raised by the Web. These new regulations are currently heavily discussed and account for the growing importance of the Web in society and in our daily life. Especially issues like data privacy, fake news and governmental data collections are current hot topics all around the world. Legislators and judiciaries try to influence the future development of the Web.

In the second part, the methodology of this thesis will be presented, in which the research questions are refined according to the research made in the first part. In the third part, recent legislative projects and political issues in regard to data protection and collection, the Web and its future are presented. Over the last couple of months, new legislation on data protection and data collection entered into force in the United States and the European Union. These new laws will extensively influence the processing of Web data access and data collections.

Fourth, the technical issues concerning "how" to create and design data-centred Web architectures are discussed. In this part, the architectural approaches behind a Web observation will be discussed prior to giving a broad insight into the three architectural approaches created over the time of the study.

The fifth part will focus on "what" kind of data can be collected from Web resources. It outlines "where" interesting data is found and "why" it makes sense to collect it while presenting the results of each Web observation project. Furthermore, it clarifies what is meant by the "interesting data" mentioned in research question three. From a merely subjective perspective, interesting data can be described as meaningful data that contains value for someone or something, or simply, data that provides an advantage in knowledge by collecting it over time. Based on the results of the data collection, a Web architecture will be proposed which is able to perform Web observations that collect data from any given resource on the Web. Moreover, a benchmark will indicate the boundaries of the proposed solution.

In the final, sixth part, a conclusion will be drawn that highlights the most important findings, lessons learned, and possible future research directions.

Part I

Fundamentals

Chapter 1

Data Aspects

In those days, there was different information on different computers, but you had to log on to different computers to get at it. [...] Often it was just easier to go and ask people when they were having coffee... — Sir Timothy John Berners-Lee, inventor of the World Wide Web (born June 8, 1955)

This chapter outlines data aspects that are relevant for Web observations and clarifies definitions of major data terms.

1.1 It is all about Data

The term data is used ambiguously in many disciplines all around the world. The diversity of data itself and its usage in the field of Information Technology (IT) fuels this confusion. This chapter will give a fundamental understanding of data and its conception. Ultimately, Web data can be collected through Web observations.

IT digitally facilitates the processing, storing and transportation of data. This enables us to pile up more data every year than in all previous years combined [2]. Shortly after the millennium, humankind started to create data at exponential rates. IT paved the pathway for this data explosion into the information age, in which our society has become almost completely dependent on data. In particular, data exchanges on the Web play a key role in e-commerce for goods and services. This technological change also shows its effects in daily life, ranging from commuting and working to how we spend our leisure time.

Data has become some sort of oil that fuels the digital economy, "an immensely, untapped valuable asset" [3]. Furthermore, it is considered "the most valuable resource" in this economy [4]. However, this comparison is perhaps more than vague, after considering the fact that oil is a finite commodity while data encapsulates infinite economic value that can be unlocked through analysis. Therefore, to compare a raw material with the manifold appearances of digital data is too simple. Oil and data differ not only in attributes but also in the way they are consumed. Oil has an energy value; it is irretrievably consumed and naturally gone, while data provides information based on bits and bytes that "can be duplicated, shared and reused" [5].

These simplifications of IT echo the notion that Nicholas G. Carr expressed in his Harvard Business Review article "IT Doesn't Matter" [6]. Carr portrayed IT as a dominant cost factor and drew parallels to commodities such as electricity and water. But is it really true that we are able to turn IT on and off like a water tap? New technology companies differentiate themselves by "service, product feature, and cost structure" [7]. It is clear that IT-based innovations come with risks, but they also hold key advantages. Being the first mover needs courage, and justifying these costs to superiors is even more difficult. In the end, the customers decide whether an IT-based product is useful or not. And if more and more customers think it is, then one has gained a competitive advantage over competitors.

1.1.1 Definitions of Data

First, it is necessary to clarify what exactly is meant by the term "data" and how it is defined. The English term "data" is derived from Latin and is the plural of "datum" [8]. After the Second World War, "data" was used for the first time in connection with IT as "transmittable and storable computer information" [9]. In 1954, IBM used the term "data processing" for its IBM 704 computer. It was the first mass-produced computer that was able to perform floating-point arithmetic. [10]

Generally, data is considered to be a superset of information or, vice versa, information is a subset of data which consists of values or findings [11]. Checkland and Holwell (1998) define data in relation to information, meaning and learning. They divide disciplines or fields accordingly, in which Computer Science, Sociology, Philosophy, and Educational Science are brought into play with information and meaning [12]. This seems surprisingly interesting since Sociology, Philosophy, and Educational Science are all considered to be social sciences.

According to Schauer (2005), information is the meaning conveyed by a message which is sent from a source to a destination. Information is created from data, while data in turn is composed of individual characters, an alphabet and a syntax. Codes facilitate the mapping of characters of one alphabet into characters of another alphabet. Ultimately, data can be stored or transmitted with the help of physical signals. Such data signals actually exist in reality, e.g. as optical signals, and they can also be observed and measured. In contrast to data represented by these signals, information is an abstract interpretation of physical data signals. [13]

With his landmark paper "A Mathematical Theory of Communication", Shannon (1948) established "information theory" as a new research field, although not even Shannon initially considered his paper to be of such importance. Yet, it soon turned into a major reference and is cited to this day. Information theory outlines mathematical methods for measuring characteristics of data. Under information theory, the definition of information is not directly related to meaning or knowledge. Information-theoretical methods try to quantify messages in which information is encapsulated. Irrespective of what code a message contains, the information content depends solely on the probability with which the receiver expects such a message. [14]

Hicks (1993), in contrast, defines data in more detail with a view to data transmission: "A representation of facts, concepts or instructions in a formalised manner suitable for communication, interpretation, or processing by humans or by automatic means." [15]. In essence, Hicks describes data as fit for communication and interpretation by humans and machines.

Many scholars hold the view that the meaning of this term has been broadened in recent years. For Checkland and Holwell (1998), data does not only cover these kinds of communication; data is considered an enclosure of three different viewpoints:

• Objective view on data Fact records which are considered under the same parameters by every person, e.g. temperature measurement;

• Subjective view on data Unproven concepts which are considered manifoldly interpretable by individuals, e.g. an opinion;

• Intersubjective view on data Agreed composition of information ready for communication, e.g. language syntax. [12]

The objective view on data enables accurate interpretation of data. Automation of data collections and their manipulation can transform them into meaningful information, which is also known as data processing [16]. In contrast to the objective view, the subjective view on data is based on interpretation, whereby meaning is given by a human counterpart. It is considered that subjective data needs to be heavily structured in order to be stored within a data structure, e.g. person A thinks 20 degrees Celsius is cold, while person B thinks it is hot. With the help of pre-shared definitions, the intersubjective view on data enables processing of data by either a computer or a human.

Data is Omnipresent

In a broader sense, data might just be all around us. According to a special report on managing information in The Economist (2010), it is believed that the term data stretches much further than just texts, tables, and figures [17]. It just has to be collected and stored in a machine-readable way for further interpretation. For example, a completed survey form or a handwritten note can be considered data because it contains the very information for further analysis. In her book "Raw Data Is an Oxymoron", Gitelman (2010) describes data as a matter of disciplines rather than something related to Computer Science. She might be partly right, as long as it can be assumed that, for example, pen craft is considered an artistic skill set. Yet, even if we are talking about Goethe's "The Sorrows of Young Werther", we are not talking about the data within the masterpiece, for obvious reasons. The main topic of conversation is the idea behind the story and the consequent interpretation. One implication thereof is that data must be interpreted in order to make sense of it. Nevertheless, this idea will continue to evolve across disciplines. Gitelman concludes that "[t]he subject of data is bound to alienate students and scholars in disciplines within the humanities particularly" [18].

Data is Versatile

Data exists in many forms, which makes its appearance diverse and versatile. Data may constitute a single entity or a set of entities, where such a set is also called a data set. Basically, it is a set of values in qualitative or quantitative terms. According to Kitchin (2014), data is mainly used within these three fields:

• Scientific research E.g. data collection in the field, data cleaning, data visualisation;

• Businesses and non-governmental organisations E.g. sales data, revenue and profit;

• Government E.g. unemployment rates, birth rates and age of citizens. [19]

The most basic form of data is data which has been collected automatically or manually from any given source. This kind of data is also called raw or unprocessed data [20]. Logically, information ought to be the output of an information system, in contrast to raw, unprocessed, collected data, which can be assessed as the input [12]. Furthermore, it can be said that raw data is considered processed data as soon as it has been cleaned (e.g. of obviously false or duplicate entries and of entry errors) and is in the right format, ready for further analysis. The following model can be used to identify what type of data is involved:

Figure 1.1: Data must be transformed or processed before interpretation is possible. There is no clear boundary between the entities of measurement, raw, and processed data [18]. (Diagram showing the overlapping stages measurement data, raw data and processed data.)

Compared to raw data, measurement data is used in the context of empirical research, e.g. direct and indirect observation or experience [18]. Depending on the scope, data that has already been processed may be either raw or measurement data, which explains the ambiguous understanding of these data terms.
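To make the distinction between raw and processed data more concrete, the following minimal sketch in Python (an illustration, not one of the observation projects of this thesis) cleans a small set of hypothetical temperature records: duplicate entries and obviously false values are removed so that the remaining records are in a format ready for further analysis. The field names and the plausibility range are assumptions chosen for the example.

# Minimal sketch (illustrative): turning raw records into processed data by
# dropping duplicates and obviously false entries. Field names and the
# plausibility range are invented for this example.
raw_data = [
    {"timestamp": "2018-07-01T12:00", "temperature_c": "21.5"},
    {"timestamp": "2018-07-01T12:00", "temperature_c": "21.5"},  # duplicate entry
    {"timestamp": "2018-07-01T12:10", "temperature_c": "999"},   # obvious entry error
    {"timestamp": "2018-07-01T12:20", "temperature_c": "22.1"},
]

def process(records, lower=-60.0, upper=60.0):
    """Deduplicate the records and keep only plausible temperature values."""
    seen, processed = set(), []
    for record in records:
        key = (record["timestamp"], record["temperature_c"])
        if key in seen:
            continue                      # drop duplicate entries
        seen.add(key)
        value = float(record["temperature_c"])
        if lower <= value <= upper:       # drop obviously false measurements
            processed.append({"timestamp": record["timestamp"], "temperature_c": value})
    return processed

print(process(raw_data))  # two records remain as processed data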

Data is Knowledge

Data is the key essence of our modern age; it enables innovation, which in turn creates new pathways for value creation. The collection, storage, processing, and interpretation of data has taken on a dominant role throughout all industries [21]. Rowley (2007) examines the hierarchy of data, information and wisdom and tries to postulate a distinct definition for them. It seems certain that information is defined by data, knowledge by information, and wisdom by knowledge [22]. The key question is whether one is able to produce information from data automatically and whether a significant difference between stored and new data can be noticed [12]. Data holds information or knowledge in a compressed form, such as a code, that is often easier to process [19]. Many academics have described data as the basis for all further information processing [22, 23]. Ackoff (1989) categorises the human mind into five denominations which outline what the outcome of data can be:

• Data Consists of actual symbols;

• Information Processed data that gives answers to “who”, “what”, “where”, and “when” questions;

• Knowledge Application of data and information that answers "how" questions;

• Understanding Comprehension of “why”;

• Wisdom Evaluated understanding. [24]

Whereas the first four categories are related to the past and what is known, the last category, "wisdom", tries to foresee or predict the future with the help of past and present data. Thus, to reach the point of wisdom, every other point has to be thoroughly adhered to [24]. This coincides with the field of knowledge management, which is the process of creating, sharing, using, and managing knowledge and information [25]. In contrast, information management is an activity to get information from a source to those who need it, including the storing and destruction of information [26]. On the basis of these definitions, a pyramid with five separate parts can be elaborated in connection with data, information, knowledge, and wisdom:

Figure 1.2: Information is typically defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge [22, 23, 24]. (Pyramid diagram with the layers data, information, knowledge, understanding and wisdom, annotated with information management and knowledge management.)

Nevertheless, we are constantly bombarded with data. Therefore, we have to distinguish between important and unimportant data. The Web has become a massive information source and has great potential for information gathering. By removing the "noise" through the application of an architecture, it is possible to harvest and analyse the flow of information in order to gain a quicker understanding. Thereby, recognition of major changes can be achieved almost in real time by observing what is actually happening on the Web. The combination of several data sources instead of only one makes the removal of noise without losing accuracy a delicate matter. [27]

1.1.2 Web Data

In 1991, the first people outside the European Organisation for Nuclear Research (CERN) were invited to join the Web [28]. Sir Tim Berners-Lee's original proposal on "Information Management" became reality through the creation of the World Wide Web [29]. From the moment the public was invited to collaborate with the Web community, millions of people came together by using fundamental Web technologies that are still in use today. As mentioned in section 1.1.1 "Definitions of Data", data comes in many forms and types. The Web encapsulates all kinds of data and enables access for everybody. Whether written text within HTML tags or linked documents, all linked files accessible through this global data space can therefore be called Web data.

It is clear that the Web has become one of the most influential technological innovations for the global economy, education and private life. The first decade of the Web was characterised by a massive proliferation throughout the public and private sectors, including access by individuals. Since the turn of the millennium, the Web has transformed into a diverse information space. Proposed service-oriented architectures (SOA) tried to design Web services independently of Web technologies [30]. Approaches range from Web services based on the Simple Object Access Protocol (SOAP) and the Extensible Markup Language (XML) to RESTful HTTP services [31]. Automation, customisation or personalisation are Web technologies that play a role in the continuous improvement of the Web and have proven to be commercially successful, e.g. auto-fill, which completes forms automatically [32].

Observation Data is Unstructured

Besides ordinary HTML pages and text files, Web data provides images, documents and mail messages. Web data is by nature unstructured, meaning it does not appear in a predefined pattern. HTML as a markup language is used for rendering or presenting information, but it is limited in describing the semantics – meaning and logic – of its content within the HTML elements [33]. Strategies to enhance the meaning of Web content might involve extensive use of meta tags, in which data about data is written. Meta tags are not visibly displayed but make it easier for Web applications to categorise the contents. Linked documents usually have a predefined structure; however, when these formats are mixed together on the Web, they still form a vast pool of unstructured Web data [34]. The reason for this lies in predominantly untagged and file-based deployment. In other words, data on the Web most often lacks proper attributes and self-describing structures. As a result, observation and interpretation of Web data is very difficult to achieve without additional technologies. [35]
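Because meta tags are one of the few self-describing parts of an otherwise unstructured HTML page, they are comparatively easy to read automatically. The following minimal sketch, which uses only Python's standard library and an invented sample page, extracts the name/content pairs of meta tags; it is merely an illustration of the idea and not the extraction method developed in this thesis.

# Minimal sketch (illustrative): reading the "data about data" carried by meta
# tags from an otherwise unstructured HTML page. The sample page is invented.
from html.parser import HTMLParser

SAMPLE_HTML = """
<html><head>
  <title>Example page</title>
  <meta name="description" content="A short description of the page.">
  <meta name="keywords" content="web, observation, data">
</head><body><p>Visible content...</p></body></html>
"""

class MetaTagExtractor(HTMLParser):
    """Collects the name/content pairs of <meta> elements."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"]] = attributes["content"]

parser = MetaTagExtractor()
parser.feed(SAMPLE_HTML)
print(parser.meta)  # {'description': '...', 'keywords': 'web, observation, data'}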

Web of Data

Ideas towards more structured and self-describing data on the Web have been put forward; however, these approaches did not spread as quickly as hoped [36]. The World Wide Web Consortium (W3C), founded and led by Tim Berners-Lee [37], formalised a standard called the "Semantic Web" as an extension of the World Wide Web that delivers a "framework that allows data to be shared and reused across application, enterprise, and community boundaries" [38]. The term "Web of Data" often refers to the Semantic Web, which is based on the concept of transforming the Web from distributed file servers into distributed databases. The basic principle of the Semantic Web includes interoperability and re-usability of available Web data. However, there is still a lot of data on the Web that is not linked together into a web of data. The Semantic Web aims to link data in other places in order to achieve a web of data instead of just having data available on the Web. [39]

The technology behind this concept is the Resource Description Framework (RDF). Specified by the W3C, RDF "extends the linking structure of the Web" by describing relationships between Web resources. It consists of a subject, an object, and a predicate or property which describes the relationship of a resource. These three parts, identified by corresponding URIs, are also called a triple: subject, predicate/property, and object. The ability to combine semi-structured or structured Web data with triples enables Web users to create a mixture of data that is suitable for many applications. [40]
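To illustrate the triple structure described above, the following minimal sketch builds two triples from invented example URIs and prints them in an N-Triples-like notation. In practice an RDF library would typically be used; the sketch only shows the subject, predicate/property and object roles.

# Minimal sketch (illustrative): RDF-style triples as plain Python tuples,
# printed in an N-Triples-like syntax. The URIs are invented examples.
triples = [
    ("http://example.org/basel",              # subject
     "http://example.org/vocab/locatedIn",    # predicate/property
     "http://example.org/switzerland"),       # object (a resource)
    ("http://example.org/basel",
     "http://example.org/vocab/label",
     '"Basel"'),                              # object (a literal)
]

for subject, predicate, obj in triples:
    rendered_object = obj if obj.startswith('"') else f"<{obj}>"
    print(f"<{subject}> <{predicate}> {rendered_object} .")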

Science of the Web

Web science is an emergent research field that studies socio-technical systems such as the Web. Its origin can be found in the article "Science of the Web" by Tim Berners-Lee et al. (2006) in the journal "Science", in which the term "Web Science" was apparently used for the first time. The article points out the necessity of a new interdisciplinary field, in both engineering and societal terms, which should study the Web as a socio-technical system. The most significant aspect of Web science is the notion of the Web as a massive information space. It is seen as a technical construct for information and data based on specific languages and protocols. [41]

The impact of the Web on our society is very broad since we are all socially embedded in the Web. Implications in commercial, social, political, and legal spheres are profound and may also alter the behaviour of society. Trust or reputation, privacy, and copyright are issues that increasingly come under pressure from Web technologies. Such technologies are used in everyday communication and thus store all sorts of information. Web science tries to assess the procedures of Web communities and resources, e.g. correlations and relationships, by observing these interactions. Shneiderman (2007) describes Web science as the processing of information that is freely available on the Web. It can therefore be assumed that Web observations may also be a part of Web science. [42]

Hendler et al. (2008) have outlined that the Web is changing every day, which makes it additionally challenging to make observations. Extensive research is necessary to understand the mechanisms of dynamic and changing Web applications. The Web has become a multi-user and social environment that has a strong repercussion on "social structures, political systems, commercial organizations, and educational institutions". Studying these interactions requires quantitative and qualitative research from the corresponding disciplines. It is a widely held view that the Web has changed the world forever, and that the world shapes the Web. [43]

Difficulties arise, however, when an attempt is made to study Web resources and their underlying communities. The Web is in constant change; contents might disappear, they evaporate. They may still be stored somewhere but are no longer available on the Web. Therefore, constant observation must be applied in order to keep track. On the one hand, Web science deals with the structures and classification of the Web, which also includes its Web resources and contents. On the other hand, Web science is a colourful mix of academic disciplines and is not limited to the natural and social sciences and the humanities. Moreover, it can be seen as an applied science which focuses on development. As such, it also includes the relationships and interactions among humans through technology, e.g. between Web page creators and Web users. This leads to an ever further expansion of the construct. Therefore, Web science involves disciplines such as philosophy, economics, law, sociology, and computer science. Furthermore, it examines the development, impact, meaning, and social interactions of the Web in order to produce new findings by means of beneficial methods or models. [41]

1.1.3 Data Sets

Data sets are commonly referred to as an aggregation or a combined collection of data in terms of size and amount. They often come in purpose-specific file formats, whereas the two most distinct formats are either common data files or data streams. Within data sets, delimiters such as commas or rows separate units of data, which often resembles a database table. Depending on the intended use of the file contents and its aggregation, data usually comes from many different sources. The content can be either a single table, a data matrix, values or variables, each of which matches a defined entry within the data set [27]. There are numerous formats which are considered to be digitally endurable for future usage. Such formats have to follow guidelines in order to endure over time. Arms and Fleischhauer (2005) call these parameters necessary sustainability factors:

• Disclosure Degree to which specifications and tools exist and are accessible;

• Adoption Degree to which the format is used;

• Transparency Degree to which digital representation is open to direct analysis;

• Self-documentation Digital objects that contain basic descriptive meta data;

• External dependencies Degree to which a format depends on particular hardware, operating system, or software;

• Impact of patents Degree to which patents prevent content in a format from being archived;

• Technical protection mechanisms Implementation of mechanisms, e.g. encryption, that stop the preservation of content. [44]

Based on the evaluation of these factors, it can be estimated how sustainable a format is and, in particular, whether it can still be read and processed by machines in the near future. Therefore, if data shall be preserved for the public, it makes sense to adhere to digitally endurable formats. Furthermore, the Open Data Handbook provides brief descriptions and a list of sustainable file formats which form the data structure used for data sets [45]. Naturally, data sets may either contain values and variables, or they can loosely be described as a collection of tables describing an event or experiment. If data sets become so large that ordinary applications have difficulties processing the actual data, they are viewed as big data [27].
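As a small illustration of such a delimiter-separated data set, the following sketch reads a comma-separated table with Python's standard csv module. The station and temperature columns are invented and merely stand in for the kind of machine-readable, digitally endurable format discussed above.

# Minimal sketch (illustrative): a comma-delimited data set read into rows
# that resemble a database table. The columns and values are invented.
import csv
import io

CSV_TEXT = """station,date,temperature_c
Basel,2018-07-01,27.4
Basel,2018-07-02,29.1
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)                  # each row maps column names to values
print(rows[0]["temperature_c"])      # "27.4"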

1.1.4 Open Data

Open data is a concept in which data shall be freely available to everyone. It is thought that open data predominantly uses open file formats and no restrictions such as patents or technical hurdles which could hinder its usage [46]. Generally, all data can be transformed into "open" or accessible data; it usually involves larger data sets [47]. The term "open data" has been well established by government initiatives such as https://data.gov.uk. There are many different movements driven by many different stakeholders, e.g. open source, open government, or open science. Open data may hold a fresh perspective for the general public and the private sector, research, industry and society: "Open data and content can be freely used, modified, and shared by anyone for any purpose" [48].

Open data may also contain interesting information that can be used for Web observations. Good examples thereof are public weather stations that measure and publish data sets regarding the weather, e.g. temperature, sunshine, humidity, etc. Their content may also be used for Web data extraction. Open data already exists in a reusable form that is easy to process and therefore makes the extraction of meaningful information far more convenient. Usually, Web observations create open data as an output in order to have transparent results. Please see also section 4.3.1 "Open Data Access".

The importance of open government data for modern democracies has been outlined by 30 open government advocates [49]. In essence, governments applying these principles are "more effective, transparent, and relevant to our lives" [50]. In 2007, the attendees of the Open Government Working Group defined the original eight open data principles that still have significance today (see also [51]):

1. Complete: All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.

2. Primary: Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.

3. Timely: Data is made available as quickly as necessary to preserve the value of the data.

4. Accessible: Data is available to the widest range of users for the widest range of purposes.

5. Machine processable: Data is reasonably structured to allow automated processing.

6. Non-discriminatory: Data is available to anyone, with no requirement of registration.

7. Non-proprietary: Data is available in a format over which no entity has exclusive control.

8. License-free: Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restric- tions may be allowed. [50]

There are at least three major benefits of open data. Open data not only makes the public actions of a government transparent through the disclosure of data; it also increases the accountability of a government and additionally has economic benefits for the general public. [52]

Linked Data

Linked data is a technique to formalise and publish structured data. Open data is often confused with linked data and/or combined into linked open data. Linked data creates an interlinked information space that can be accessed through meaningful semantic queries. Whereas the term "Semantic Web" commonly refers to formats and technologies, linked data facilitates the automatic finding of related data for Web users and machines [53]. An "openness rating" that advocates linked open data has been set out by Tim Berners-Lee. It indicates how data sets can be transformed into linked open data [54]. The scale is based on a scheme of a maximum of five stars which point out the usefulness of data sets:

Figure 1.3: OL: open licence; RE: machine readable; OF: open format; URI: RDF standards; LD: linked data. [55]

The goal of this scale is to achieve the state of Linked Open Data (LOD), which not only allows humans to read the data but enables machines to identify, read, and link data for further analysis. The steps towards LOD are described as follows:

• Open licence (★) Data is available on the Web under an open licence in any format;

• Machine readable (★★) Data is available on the Web under an open licence in machine-readable form, e.g. an Excel file;

• Open format (★★★) All of the above plus the available data is in a non-proprietary format, e.g. a CSV file;

• RDF standards (★★★★) All of the above plus open standards as defined by the W3C, such as RDF, in order to identify things;

• Linked Data (★★★★★) All of the above plus linking of data to other data in order to provide context. [53]

In conclusion, linked data is the tool to connect to the Semantic Web. Common sense determines whether a link has to be made within the provided five-star scheme. In terms of Web observations, it usually does not matter whether data complies with this scheme. In this thesis, Web observation scenarios gather insights from Web resources that are not labelled as open data.

1.1.5 Big Data

The term big data was coined by John R. Mashey (1998), who predicted an explosion of widely accessible data. Creating, understanding, storing, moving it, and more will put infrastructure under stress – "InfraStress" – or in his words: "Drown in Wave of Infrastructure Stress" [56]. Its meaning has since shifted to the use of data sets that are too big to process in a traditional IT environment.

Whereas Gartner defines big data as "[...] information assets [...] that enable enhanced insight, decision making, and process automation." [57], McKinsey focuses on the value of big data: "Distilling value and driving productivity from mountains of data" [58]. Web data may also be a source of untapped value. The sheer size of the Web is very much a big data problem in variety and complexity. In addition, Franks (2015) suggests that Web data is the original big data [59]. As a result, it is reasonable to capture states of the Web and thereafter interpret these data sets. Distilling value from the Web still remains a challenge. A feasible way may be the refinement of Web data into smaller chunks in order to make the content more transparent. Furthermore, ordinary IT tools are not capable of making use of such large data sets. Conclusively, De Mauro et al. (2016) define big data as follows: "Big Data is the Information asset characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value." [60].

According to the extensive literature review by De Mauro et al., the main emphases of big data are information, technology, methods and impact [60]. The data dimensions were first introduced by Laney (2001) and have been extended ever since. According to many in the field, the foundation of the "three V's" gave rise to the now well-established "five V's" of big data:

• Volume Refers to the sheer size of data that makes it too large to store;

• Velocity Refers to the speed in which data is created or transferred;

• Variety Refers to the many types of data, e.g. structured and unstructured; [61]

• Veracity Refers to the reliability or accuracy of data, whereas volume often compensates for the lack of quality [62];

• Value Refers to the value created by the big data analysis [63].

Taken together, these points outline that big data is characterised by high volume, velocity, and variety. The relatively recently added points veracity and value extend the understanding of big data in terms of economic value and accuracy of results [62, 63]. The first three V's – "Volume, Velocity, and Variety" – may be grouped together as the characteristics of data or information as it exists. "Veracity" calls for technological and analytical methods and means for accurate and precise results, whereas "Value" is dedicated to the economic utility of having big data transformed into valuable information.

Unfortunately, there are quite a few misconceptions about big data. Much can be understood and the terminology is broad. However, the scientific literature regularly criticises that the expectations placed in big data are too high. It seems that big data is often anticipated as a solution for all sorts of problems. The self-appointed big data guru Buchan (2016) put it straight: "I think the biggest misconception is that big data is the answer to everything, and that bigger data will always lead to a better answer." [64]. Another weakness is that there is too much focus on data. "The bigger the better" is another common misconception of big data. In some cases, more data may indeed reduce the discovery power. That is the reason why it is important that there is an even balance between the amount of data and the information sought [64].

Big data problems with respect to the five V's may also occur with Web observations when harvesting a lot of Web data over time. However, the collection process is limited by the Internet connection and the hardware used. Network latency highlights the importance of collecting only necessary data instead of as much data as possible. The ATLAS experiment at CERN with its Large Hadron Collider (LHC) uses an equivalent approach in much bigger dimensions by reducing the amount of data over several layers [65].

Characteristics of Big Data

As outlined before, the initial three V's can be grouped together by their data characteristics. "Volume" clearly addresses the amount of data that is being processed, whereas the size is often undefined and rather depends on its usage. Because of its size, data is typically stored on distributed systems, and the amounts of bytes reach from terabytes to zettabytes [66]. One terabyte, for example, is likely to store roughly 1,000 copies of the 2010 edition of the Encyclopædia Britannica, 32 volumes without images [67].

"Velocity" addresses the input/output speed of data. Due to the massive influx of data in particular IT environments, storage of data is only partially possible. Existing IT infrastructures are at their limits and have to reduce data sets into subsets which are easier to work with. Similar to the ATLAS experiment, vast amounts of data must be reduced in order to be stored and processed. Before this reduction, these measurements accumulate enormous amounts of data within seconds, although this data is most likely not meant for storage. Therefore, real-time data can also be treated as big data. [65]

In a perfect world, only structured data would exist. Data would always exist in well-organised and well-defined structures which seamlessly fit together. This utopian idea is quite contrary to the unstructured data resources which are omnipresent. "Variety" points out the diversity of data in the real world, in which the arrangement of this variety of data is a very time-consuming task. There are three different types of data:

• Structured E.g. databases, data warehouses, enterprise systems;

• Unstructured E.g. text, images, videos;

• Semi-Structured E.g. JSON, XML, HTML.

It is clear that structured data in tables, rows, and columns which use relational keys is far easier to deal with. Machine-structured data is easily processed, which simplifies the management of data because of its structured appearance, as for example in a database. Nevertheless, most of the data on the Web and in organisations is unstructured data, although regularly used. It includes text and multimedia files but also mail messages [68]. Even if such files have a file-format structure, they are still considered unstructured since they do not easily integrate with a database architecture. Semi-structured data lies in between structured and unstructured data; it is not stored within a relational database but comes with properties that may be used for the transformation into structured data.
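The following minimal sketch illustrates this middle ground: a small, invented JSON document is semi-structured, and its self-describing properties make it straightforward to flatten into structured rows that could be stored in a relational table.

# Minimal sketch (illustrative): flattening a semi-structured JSON document
# (invented example) into structured rows with fixed columns.
import json

SEMI_STRUCTURED = """
{"articles": [
  {"title": "First article",  "comments": [{"user": "a", "text": "Nice"}]},
  {"title": "Second article", "comments": []}
]}
"""

document = json.loads(SEMI_STRUCTURED)

rows = []
for article in document["articles"]:
    # reduce the nested structure to fixed columns
    rows.append({"title": article["title"],
                 "comment_count": len(article["comments"])})

print(rows)  # [{'title': 'First article', 'comment_count': 1}, ...]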

Data Analytics

Basically, analytics makes use of mathematics and statistics. Facts and figures are calculated to make better decisions. There are three distinct subdivisions of analytics (a small illustrative sketch follows the list below):

• Descriptive analytics Analysis of data in the past;

• Predictive analytics Analysis of data in the past to infer the future;

• Prescriptive analytics Analysis of data in order to optimise processes. [69]
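The following minimal sketch illustrates the first two subdivisions with invented observations: descriptive analytics summarises the past, while predictive analytics infers a possible next value, here with a deliberately naive moving average rather than any method used later in this thesis.

# Minimal sketch (illustrative): descriptive vs. predictive analytics on
# invented past observations.
from statistics import mean

observed_prices = [120.0, 122.5, 119.0, 121.0, 124.5]  # past data (invented)

# Descriptive analytics: what happened?
print("mean:", mean(observed_prices),
      "min:", min(observed_prices),
      "max:", max(observed_prices))

# Predictive analytics: what might happen next? (naive 3-value moving average)
forecast = mean(observed_prices[-3:])
print("naive forecast for the next observation:", forecast)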

According to Klous and Wielaard (2017), we ourselves have become big data in many aspects ranging from society to technology. Although data may revolutionise businesses, it cannot be 100 percent accurate in identifying the next big market disruptions. Therefore, Klous and Wielaard argue that we, the users of software and hardware, are the drivers of big data. [70] Data-driven business models try to make use of data; improving data quality, for example, is one option to do so. Therefore, reliability is important in order to differentiate between correlation and causality or quantity and quality. However, this leads to the conclusion that the impact of big data is far broader than it might be anticipated by academia and businesses. The main challenge of big data is the solution of multiple issues at the same time. Far-reaching big data analytics involves data analysis, data discovery, data mining, data sharing, data storage, and data visualisation. Fairly new matters to be considered in big data are data privacy and the corresponding legal perspectives [71]. For a more in-depth analysis of these issues, please see chapter 4 "Legal and Political Perspectives".

Predominantly, managerial pressure is responsible for data-driven business intelligence. Peter Drucker recognised the significance of "knowledge workers" or "knowledge" as the "most valuable asset of a 21st-century institution [...]" [72]. It seems likely that these findings have encouraged executives to get more data-based insights about the business itself in order to identify sources of value. Unsurprisingly, data-driven analytics made its way to the executive floors. [73] Public and private companies seek value within available data from many different sources. Even more, analytics offers real-time capabilities with "speed" and "impact" [74]. Mashing up structured, unstructured, and semi-structured data with the help of "Analytics 3.0" methods will perhaps give "[...] radically more [information] about their businesses [...]" [73]. McAfee and Brynjolfsson (2012) have outlined this move as "a management revolution" and emphasise big data analytics with a quote attributed to both W. Edwards Deming and Peter Drucker: "You can't manage what you don't measure". This might be one explanation why big data is so significant for managers. [75]

Small Data

In his book, Lindstrom (2016) casts doubt on the general assumption that big data is always the answer for knowing the next big thing. The alternative to big data is a more human-centric approach to getting the desired information. It consists of "small data" observations made by spending time with people in their usual environment to possibly discover the next trends. These individually collected findings hold key information about human behaviour, whereas allegedly insignificant incidents may contain precious information for marketing purposes. [76, 77]

One of the presented cases was about the toy maker Lego, where "analytics" took a wrong turn. After the millennium the company was in serious trouble. Sales and profit were in a steady decline and future prospects were poor. A possible explanation for this might be that electronic games are a huge competitor for classic toy makers. Even worse, market data suggested that the average playing time with Lego bricks correlated with the decline in sales and profit. This data was interpreted to mean that less playing time of children would lead to fewer products being sold. [77]

Therefore, the first action taken was to increase the size of Lego bricks. The idea behind this was to enable kids to build Lego worlds in less time. After introducing the bigger blocks, sales fell even more. The management could not explain this additional decrease. Interestingly, the prediction was quite right, but the action taken was wrong and left the brick company Lego in a crisis. In a last effort, Lego management visited children's homes to get to know what the 21st-century child requested from a toy maker. The management was baffled by one child's answer when it was asked to tell them what it was most proud of. The child was most proud of a pair of old sneakers and not, as expected, the obvious gaming console or mobile device. It gave the management a lesson in regard to the importance of key observations in the real world in comparison to a data analysis conducted in an office. In the end, the management decided to keep its original brick size. Conclusively, strategies to enhance small data observations might also involve big data analysis. [77]

This chapter dealt with the different definitions of data and related terms and distinguished what is meant by them. The next chapter, "Technologies for Building Web Observatories", will deal with the technical solutions behind Web observations.

Chapter 2

Technologies for Building Web Observatories

Maybe I’m an idiot, but I have no idea what anyone is talking about [cloud computing]. What is it? It’s complete gibberish. It’s insane. When is this idiocy going to stop? — Larry Ellison, CTO of Oracle Corporation (born August 17, 1944)

This chapter describes the technologies behind Web observations and outlines the different approaches to Web data extraction. Furthermore, it clarifies the differences between Web observatories and Web observations, provides an overview of key Web mining techniques and illustrates how a basic Web observation works.

2.1 Observation of Web Data

Web technologies have been in constant evolution since the early days. The first decade saw the proliferation of home and business websites including web-shops, whereas the last decade transformed the Web into a large-scale information space. Business services turned out to be economically successful; however, the Web still has unused potential in terms of programmable automation, customisation and personalisation.

Web users often find themselves mashing up data and functionality from different Web resources and services, e.g. manually posting a Tweet as a reaction to a mail with certain information. In other words, Web users surf the Web without automation. With human input, execution times are considerably longer than with rule-based systems; real-time orchestration of Web services is still rather limited. Great value would be added for Web users if specific interactions could be achieved automatically, e.g. detecting and reacting to certain information. This would require an instrument for the identification and processing of changes, the adoption of time constraints, and the composition and final deployment of the user’s preferred outcome. [78]
A Web observation focuses on specific data resources on the Web in order to create meaningful data from Web resources. An interesting parameter to observe on the Web is, for example, the development of prices. Without constant observation, the data of a price at a particular date and time would be lost. This phenomenon has been recognised as the evaporation of Web data, which is also one justification for the importance of conducting Web observations [79].

2.1.1 Web Mashups
Web applications that use Web content from more than just one source are known as mashups. Usually a new service is created by mixing up and displaying content from two or more sources. An example of a Web mashup is Google Maps, which is also called a map mashup. [80]
The first ever mashup, created in early 2005, was HousingMaps [81]. It collected houses from more than 10 different cities across the United States that had been posted on Craigslist and combined them with Google Maps. With the simple idea of combining two Web resources, a new Web service was created. This notion of borrowing pieces from existing Web resources without asking for permission inspired many others to follow up with their own mashups. These efforts eventually resulted in the launch of the Google Maps API by the end of 2005, which still offers mashup functionalities to this day [82]. Nowadays, access to APIs such as Google’s is free to a certain extent; however, after a certain number of requests API access may become chargeable [83]. Access to social media APIs, e.g. Twitter and Facebook, is controlled by usage costs depending on which features are accessed. In addition, regulations of usage apply through platform policies, which have been further tightened more recently in regard to privacy concerns and new legislation. [84, 85]
Furthermore, mashups led to a variety of Web services that tried to achieve individual composition of existing Web capabilities. One example thereof is “Syndicate”, a service for the personal customisation and integration of Web content and functionalities. [86] To achieve reactivity, some mashup platforms enable Web users to automate tedious tasks and to react to certain events in a predefined way. There have been several attempts to create helpful tools and platforms in order to tackle basic automation [87]. For example, IFTTT (https://ifttt.com/) or Zapier (https://zapier.com) focus on one-to-one connections between Web services, e.g. a change in the weather triggers a Twitter tweet. However, a shortcoming of these platforms is that events which have occurred in several distributed systems cannot be detected as part of one rule. It is also not possible to create custom event listeners or actions. For additional customisation, these service providers have to set up such a service connection for each Web resource themselves. Furthermore, selecting more than one action for one single trigger is not possible. They rather create a tunnel from one Web service to another than creating reactivity between several resources. A variety of languages for Web service orchestration has been reported in recent years [88, 89]. For example, the Web Services Description Language (WSDL) addresses workflows, business collaborations and long-running interactions for well-defined processes. [90, 91]
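As a small illustration of this idea, the following sketch combines data from two hypothetical JSON endpoints into one enriched result, roughly in the spirit of a map mashup. The URLs and field names are assumptions made purely for illustration and do not refer to existing services.

#Python -- minimal Web mashup sketch combining two hypothetical JSON sources
import json
import urllib.request

def fetch_json(url):
    #download and decode a JSON resource
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))

#hypothetical endpoints: housing listings and geocoding information
listings = fetch_json('https://api.example.org/listings')   #e.g. [{"id": 1, "city": "Basel"}, ...]
locations = fetch_json('https://geo.example.org/cities')    #e.g. {"Basel": {"lat": 47.56, "lon": 7.59}}

#the mashup: enrich each listing with coordinates taken from the second source
for listing in listings:
    listing['coordinates'] = locations.get(listing['city'])

print(json.dumps(listings, indent=2))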

Figure 2.1: Scheme of a system in which a user aggregates Web resources. Changes originating from Web resources create events that are processed in a reactivity entity. The output is an action that controls Web resources. Personalised settings allow a user-specific orchestration of Web resources, e.g. Facebook, Google Mail, and Twitter etc. [91]

2.1.2 Data Flow Systems
Another option to guide data through Web services are data flow systems. In recent years, Web services have enabled Web users not only to create reactive behaviour but also to let data flow through a predefined setting. One of the earliest attempts was “Yahoo! Pipes”, which defined itself as a data mashup tool [92], in contrast to Blackstock and Lea (2014), who categorise such tools as data flow systems [90]. The main goal of Yahoo! Pipes was to aggregate content from many different feeds, whereby desired data sources are selected and represented in a final RSS feed. This also means that the field of application is very limited because it relies on RSS. True mashup tools, for example IFTTT or Zapier, focus on directing the data flow from one Web service to another. By orchestrating Web application programming interfaces (APIs), real reactivity is created [93]. Such an approach opens up possibilities of automation between different Web services. However, the key problem with this concept is that these tools do not offer generic access to any arbitrary Web resource. Unfortunately, not every Web resource has an accessible API. This means that these platforms have to provide the interface to the desired Web services themselves. Another challenge of Web mashups like IFTTT is the restricted programmability. IFTTT does not allow Boolean operators, which hinders solving expressive tasks. Sometimes it would be useful to react to more complex rules than the simple “If-This-Then-That” scheme. In conclusion, data flow systems interconnect existing services and remove the need for any coding. The need for more functionalities inevitably leads to Web systems that want to achieve more than just directing the flow of data. [90, 91]

2.1.3 Condition Action Systems
An increasing amount of literature has emerged on reactivity related to events on the Web [94, 95, 96]. Condition action systems not only direct data flow but emit actions under certain conditions. The rationale behind these studies is an event-based approach, which in turn relies on event condition action (ECA) rules. Business decisions in particular are based on such a scheme (see also section 2.3 “Data Mining”). ECA rules consist of three different parts:

• Event An identifier which detects events;

• Condition An expression which determines whether an action should be triggered or not;

• Action A set of instructions which define the reactive behaviour.

Event-based techniques allow users to make use of loosely coupled Web services, which significantly improves scalability in information exchange and distributed workflows. It semantically decouples space, time and synchronisation between event producer and consumer [97]. Thus, the goal is to remove “explicit dependencies between the interacting participants” [98]. Simple events occur at a single point in time, e.g. a website update. ECA systems may react to simple events, but complex situations among different Web resources can also be detected. A composition of events reflects a complex event, which is multi-layered and has a distinct duration. It would enable Web users to detect meaningful situations consisting of events and to react to them. To have full programmability, experienced users tend to use their own self-coded scripts or applications. Another option is open source software such as Huginn, a tool that performs automated tasks in a programmable environment. Such agents are able to download and recognise data from the Web and to detect events in order to take actions. Similar to “Yahoo! Pipes”, a graph acts as a screenplay for the orchestration of events. [99, 91]
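To make the three parts of an ECA rule more tangible, the following sketch shows a small worker loop evaluating one rule in Python; the rule structure, the observed URL and the helper function are illustrative assumptions and not part of an existing ECA system.

#Python -- minimal ECA rule sketch (illustrative, not a production system)
import urllib.request

RULE = {
    'event': {'uri': 'https://www.example.org/status'},            #what to watch (hypothetical URL)
    'condition': lambda old, new: old is not None and old != new,  #trigger on any change
    'action': lambda new: print('content changed, new length:', len(new)),
}

def fetch_state(uri):
    #event part: obtain the current state of the observed Web resource
    with urllib.request.urlopen(uri) as response:
        return response.read()

previous = None
for _ in range(3):                              #a real worker would run indefinitely or be scheduled
    current = fetch_state(RULE['event']['uri'])
    if RULE['condition'](previous, current):    #condition part
        RULE['action'](current)                 #action part
    previous = current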

2.2 Web Observatory or Web Observations

The term “Web observation” first emerged in a research paper by Tiropanis, Hall, Shadbolt, De Roure, Contractor and Hendler (2013). They conclude that “we need to observe the Web at scale across space and time”, which makes it possible to “understand and enable the evolution of Web to help address grand societal challenges” [100]. Tinati, Wang, Tiropanis and Hall (2015) describe in their paper “Building a Real-Time Web Observatory” the notion of a platform that provides data sets and purposeful visualisations for the general public [101]. Open or private instances are both common; however, in the science community it is essential to exchange data and information with the help of repositories, in this context called Web observatories. The openness of research within Web observatories acts as a catalyst for more Web science research. Therefore, all data sets in this thesis – except those with privacy concerns – are available on GitHub [102]. Not only does it promote the development of the Internet, it also outlines the state of the Web with regard to resources, links, and activities. Additionally, it involves the examination of Web resource uptimes, relationships of Web resources and social graphs. The objective is to present models that have a profound impact in the real world and our society. [103]
It is important to differentiate a Web observatory from Web observations. A Web observatory offers the location and description of data sets and applications thereon, while a Web observation provides the means of extracting Web data for further analysis. The idea of a Web observatory is to spread the collection of data sets to interested people in order to promote its use in other research areas. A Web observation captures and stores data sets containing meaningful information from the perspective of the observer. It focuses on the needs of the observer who wants to gain more information from one or more Web resources. In contrast to a Web observatory, a Web observation collects all states of a selected Web resource within a given observation time. Selection bias is a potential concern because the selection of such a Web resource depends on the interest of the author; finding a Web resource with potentially exciting data is a subjective matter. Web observations may outline and visualise large data sets from multiple sources for a distinct purpose. In the end, both provide data sets, yet each is specialised in a certain manner. The work of this thesis focuses on the needs of the observer who wants to understand the mechanism behind the Web resource. [1]
Harnessing the potential of Web data remains a challenge. Exponential growth and the fact that Web data distribution is unstructured and chaotic make useful utilisation more and more sophisticated. If neither data sets nor interfaces are available in any form, e.g. Representational State Transfer (REST), users without technical skills are kept in the dark. In contrast to the concept of the Open Definition [104], such Web content evaporates after a few minutes and is irrevocably lost. The idea of conservation would prevent losing such Web data and would perhaps be of great benefit for understanding the nature of the Web. Extensive research has been conducted into Web intelligence, analytics, and Web 2.0-based crowd-sourcing systems [105, 106]. Usual suspects in the public eye are data sets from companies, industries, products and customers.
These data sets can be gathered from the Web with various Web mining techniques and may soon afterwards be visualised, e.g. through Google Analytics [107]. There are a number of Web services that provide Web monitoring for change detection and notifications. The first tool for that purpose, introduced in 1996 by the company NetMind, was “Mind-it”. It spawned quite a few tools that wanted to revolutionise information discovery and retrieval for Web users. “Google Alerts” still offers monitoring services of Web content with the help of selected keywords. These tools prove very helpful whenever a change of content must be notified as fast as possible. Usually, a history of these changes is not available.
One key problem with Web resources is that Web users cannot verify the quality of a Web service. With the help of measurement by observation, the quality of a service can be assessed. Such a method could check whether the service gives a response and thus is reachable. Another method would focus on the content and highlight relevant content changes. As a result, a Web observation may also take random samples and assess the quality of Web services, e.g. uptime or response time. Furthermore, other aspects such as accuracy may be evaluated. [108] The Internet Archive stores plenty of Web data in the form of states. There are also tools available, e.g. HistoryTracker, that make use of this archive in order to observe social phenomena through Web observatories. [109]

2.2.1 Publish and Subscribe Pattern
The publish and subscribe pattern consists of data communication from a sender to a receiver that is not explicitly programmed. The name of this architecture is derived from the setting; typically, the sender is called publisher and the receiver subscriber. The publisher does not necessarily know how many subscribers there are, and therefore publisher and subscribers are decoupled. [98]
Adopted by the W3C for recommendation is the WebSub protocol, formerly PubSubHubbub [110], which basically is an open protocol for publish-subscribe transmissions. The protocol’s main use are real-time or pushed HTTP notifications. It was originally designed for data feeds such as RSS; however, the WebSub protocol supports all sorts of data types that are on the Web, e.g. HTML, pictures, or audio. It may also be used to update other subscribers with content. The key advantage of WebSub is that, instead of periodically polling for changes, publisher and subscriber spend fewer resources on spreading information. Furthermore, WebSub enables reactive real-time systems that trigger and aggregate events. [111]
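A minimal in-process sketch may help to illustrate the decoupling of publisher and subscribers; it is not an implementation of the WebSub protocol, and all names used are illustrative.

#Python -- minimal publish/subscribe sketch (in-process, illustrative only)
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)   #topic -> list of callbacks

    def subscribe(self, topic, callback):
        #a subscriber registers interest in a topic, not in a concrete publisher
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        #the publisher does not know how many subscribers exist
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
broker.subscribe('price-change', lambda msg: print('subscriber A received:', msg))
broker.subscribe('price-change', lambda msg: print('subscriber B received:', msg))
broker.publish('price-change', {'product': 'example', 'price': 19.90})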

2.2.2 Push Principle
A push notification’s main purpose is to couple clients to a service by means of a service worker, a type of Web worker that runs within its own context and allows offline capacity. In most cases, a third-party server receives sent or “pushed” messages and distributes them to all available sessions for receiving data. The client receives the message directly from the server and does not have to regularly check for an update. One of the biggest advantages is the real-time ability for fast information distribution, which costs less traffic and enables one message to spread among all receivers. Large messages should not be sent over push services, because the server might not be able to handle the volume for all clients and may exceed traffic limits. Push notifications are fairly common on mobile devices to keep clients up to date for re-engagement with the service. For Web clients, push notifications are a rather new phenomenon; they allow website creators to inform Web users within an opt-in scheme about in-browser updates. In contrast, mobile device users are automatically served with push notifications. Depending on the mobile device, these push notifications can be switched on or off. Nevertheless, content can be sent to clients within seconds. [112]

Webhook REST Service
A webhook is a REST service that receives HTTP POST requests. Webhooks do not listen for requests; they accept data from a source in a push manner [113]. Thus, they allow real-time multiplication of events. Existing manifestations of webhooks are available for server-to-browser communication, such as Comet [114] or server-sent events, see also: http://dev.w3.org/html5/eventsource/. The instant delivery of data streams as soon as they become available makes this type of data distribution interesting for real-time notifications.

https://api.webhook-example.com/v2/data.json?
    [email protected]&webhookUrl=
    http://www.myServer.com/MyServiceUrl&apiKey=XYZ

Listing 2.1: Webhook example.

By entering the URL in a browser, the website “api.webhook-example.com” would send a JSON data stream to the specified URL “www.myServer.com”. In principle, webhooks consist of a URL, which points to a Web resource, and a sender. The sender forwards data as soon as it is available within the publish and subscribe pattern [98]. Whenever an update occurs, new data is delivered via the webhook, which passes it along to the specified URL. Therefore, webhooks use Web services as callbacks on remote Web APIs. This makes webhooks advantageous for event-based data flows, e.g. a website update triggers a mail. [115]
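To make the receiving side of such a transmission concrete, the following sketch shows a minimal webhook endpoint, assuming the Flask micro-framework; the route name, port and payload handling are illustrative assumptions and not part of the example service above.

#Python -- minimal webhook receiver sketch using Flask (illustrative)
from flask import Flask, request

app = Flask(__name__)

@app.route('/MyServiceUrl', methods=['POST'])
def receive_webhook():
    #the sender pushes data as soon as it is available;
    #no polling is required on the receiver side
    payload = request.get_json(silent=True)
    print('received update:', payload)
    return '', 204   #acknowledge receipt

if __name__ == '__main__':
    app.run(port=8080)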

2.2.3 Poll Principle
A poll periodically checks a Web resource to find out whether an update of that particular resource has occurred. As soon as the “polled” resource has been modified, for example in size or content, the updated Web resource will be highlighted as such. There are multiple parameters to identify whether a resource has been altered, e.g. attributes of the file such as the modification date, or file contents such as text or images. This process produces traffic every time the periodic check takes place, even if no new data has been found. Poll applications are created fairly quickly and are often the very first solution in an existing environment. The key problem with polling is that it produces excessive traffic, which in turn could sooner or later provoke a ban by the “polled” Web resource. Another problem is the maintenance of the poll, which has to match the specified resource. If the resource changes, the application must be adapted because the resource contents are used as the parameter to extract data. This raises the question of whether polling is expedient for scaling efforts and desirable in large Web architectures; it perhaps makes sense with checks and balances in place. Polling is predominantly used when no API is available for the interaction with the Web resource. Polling is not really a real-time interaction; however, when one knows the behaviour of a resource, a rough estimation of the appropriate interval can be made. A good example would be a price comparison in which a particular product is polled once a day. So every morning at a defined time, the poll would check several online shops for product prices. [116]
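As a sketch of the principle, the following fragment polls a Web resource at a fixed interval and compares a hash of the content to detect changes; the URL and the interval are illustrative assumptions.

#Python -- minimal polling sketch with change detection (illustrative)
import hashlib
import time
import urllib.request

URL = 'https://www.example.org/product-page'   #hypothetical polled resource
INTERVAL = 60 * 60 * 24                        #poll once a day (in seconds)

def content_hash(url):
    #download the resource and reduce it to a fingerprint for comparison
    with urllib.request.urlopen(url) as response:
        return hashlib.sha256(response.read()).hexdigest()

last_hash = None
while True:
    current_hash = content_hash(URL)
    if last_hash is not None and current_hash != last_hash:
        print('resource has changed')          #here an action could be triggered
    last_hash = current_hash
    time.sleep(INTERVAL)                       #traffic is produced on every check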

2.2.4 Optimal Time Interval
We are confronted with information all over the Web. Unfortunately, the general user is often unaware of changes made to and new information uploaded on a Web resource. Instead, many Web users spend a long time surfing the Web until they realise that new information is available. The challenge remains to know about these updates of Web resources as quickly as possible. Manual capturing of information implies a massive delay in the reaction. Great value would be added for academics and businesses, but also for Web users, if desired Web content and its information could be automatically retrieved. It would require a new instrument for detecting and reacting to Web content changes. Such a tool would enable Web users to capture, aggregate, and present information in the users’ preferred way and would come close to a real-time environment. Users may then be aware of new content on the Web within seconds. More frequently, updates of resources and new information remain undetected for longer periods of time, e.g. hours if not days. To capture such kinds of information, a system may provide mechanisms regarding time intervals with the help of polling (see also section 2.2.3 “Poll Principle”).

Figure 2.2: Reaction times of Web observations usually range from seconds to days (red boundary). An interval must be set for each Web observation. Depending on the area of scope, a meaningful interval is chosen, e.g. for reports, weather forecasts, or press releases.

Detection Rule
Resources on the Web are in a constant state of flux. Some content might change within minutes, for example news headlines, while other content will be available for a longer period of time, such as entries in encyclopaedias like Wikipedia. Content is very much dependent on its owner, who is able to make changes at any time. To detect these changes, a Web application must keep track of the document’s history. A Web worker monitors particular Web resources and, once differences have appeared, an action is triggered. This may be only a minor measure to inform about the change, e.g. a notification by email. Consequently, a Web user is free to define rules which check whether changes are related to a certain category of interest or not. Moreover, such rules could highlight certain areas of interest and compare the values that changed. Whenever an article, for example on Wikipedia, has been modified, e.g. by more than 10% of its original contents, a notification is sent by the detection system to inform the person in charge. Such a system is able to directly follow resources in various information spaces.
Therefore, Web detection rules are set in place to monitor Web resources. The idea is to determine whether an information source holds new information. The purpose is to keep track of information on various resources. As soon as an update occurs, users should be informed about new content. Such a detection system can largely control the data communication in case new information has been detected. Web workers operate on the basis of rules that control what shall be observed. As shown in the listing below, the rule is encapsulated in the JSON file format. It aims to identify updates and changes on a specific website “URI” containing a specific Document Object Model (DOM) element. When a Web browser loads a Web resource which contains an HTML file, the browser generates a DOM.

This model creates a tree structure of HTML tags that forms a website. The rule specifies the URI and the HTML tag/path within the DOM.

{
  "id": "Web Detection Rule",
  "event": {
    "DetectWebsiteChange": {
      "uri": "http://www.website.com",
      "dom": "DOM element"
    }
  },
  "actions": {
    "OnWebsiteChange->writemail": {
      "recipient": "[email protected]",
      "text": "Content has changed!"
    }
  }
}

Listing 2.2: Example of a Web detection rule in pseudo code.

In case the Web worker detects a change of content, an email will be sent to the provided mail address. The JSON code instructs Web workers to seek out new information on any given Web resource. New information is brought forward for processing; in such a case, a predefined action is executed and, therefore, a mail is sent. [91]
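A minimal sketch of how a Web worker might evaluate such a rule is given below; the rule file, the element lookup via a CSS selector and the mail dispatch over a local SMTP server are illustrative assumptions and not the implementation used in this thesis.

#Python -- minimal sketch of a worker evaluating a Web detection rule (illustrative)
import json
import smtplib
import urllib.request
from email.message import EmailMessage
from bs4 import BeautifulSoup

with open('rule.json') as rulefile:               #hypothetical rule file, structured as in Listing 2.2
    rule = json.load(rulefile)
event = rule['event']['DetectWebsiteChange']
action = rule['actions']['OnWebsiteChange->writemail']

def current_element_text():
    #fetch the resource and extract the observed DOM element
    html = urllib.request.urlopen(event['uri']).read()
    element = BeautifulSoup(html, 'html.parser').select_one(event['dom'])
    return element.get_text() if element else ''

previous = current_element_text()
latest = current_element_text()                   #in practice this check runs on an interval
if latest != previous:
    #execute the action part of the rule: send a notification mail
    message = EmailMessage()
    message['From'] = '[email protected]'         #hypothetical sender address
    message['To'] = action['recipient']
    message['Subject'] = action['text']
    message.set_content(latest)
    smtplib.SMTP('localhost').send_message(message)   #assumes a local mail server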

2.2.5 Operating-System-Level Virtualisation
Operating-system-level virtualisation, which is also known as containerisation, arranges the resources of a computer in isolated instances. These instances are called containers, in which applications run on top of the same kernel and libraries as the host computer, which makes sharing of the host’s resources possible. Depending on the settings, a container system may share data among other containers for CRUD operations. The vital enabling factor of operating-system-level virtualisation is the operability of many isolated instances in parallel on the very same machine without fully exhausting the machine’s resources. For example, if ten containers are running simultaneously, the computer is still able to execute all containers with its resources. Ordinary virtualisation requires all of the allocated resources of a virtual machine even if they are not used. The reason for this big difference lies in the resources allocated to a virtual machine, which are more strictly isolated and therefore more resource-intensive. Container systems relax this isolation by using containers that require fewer resources. In practice, hundreds of containers could be run for Web observations without running into the problem of too little hardware resources. This gives containerisation the attributes lightweight and highly scalable. In comparison, a virtual machine needs up to a minute to start up while a container system needs seconds. [117]

Advantages and Disadvantages
Of course there are a few disadvantages to the container system approach. In case full isolation of an OS is required, containers do not make sense. Containers are reasonable for the isolation of processes, whereas ordinary virtualisation guarantees the assigned resources. In contrast to virtual machines, containers use only those libraries that are necessary for the application, which are put together in a container. With the use of an image and a few commands on the command line, the environment is ready. Furthermore, container systems provide basic automation and configuration settings, e.g. Dockerfiles, which contain the instructions and settings for building images. [118]
A container system can support Web observations through its flexibility in the choice of tools to mine Web data. Depending on the Web resource, a Web observation is more feasible with a certain programming language. Containerisation may hold the key advantage of bundling only the necessary elements for a Web observation.

2.3 Data Mining

Data mining is the generic term for the process of discovering patterns or meaningful information. For observing the Web, the methods of Web mining are applied (please see section 2.3.1 “Mining Information from the Web”). Quantitative methods are used on data, which corresponds to the field of knowledge discovery (KD). Most of the methods in data mining are derived from statistics and are generally fitted for the purpose. In this work, the focus lies on experimental methods for practical use on a trial and error basis. In the field of database systems and indexed structures, data mining plays a dominant role in reducing complexity, e.g. in nearest neighbour search. Another field that benefits from data mining is information retrieval (IR). IR is used for obtaining information from a collection of data, e.g. indexing, document information, and meta-data search. [21]
Many types of data can be obtained from different sources. Data mining not only refers to relational databases, but also to many kinds of versatile data. Good examples of other data types, forms or applications for data mining are sequence data, e.g. time series, data streams, continuous sensor data [119], and spatial data such as maps. In economics, a practical use of data mining is the analysis of sales figures in order to estimate how the revenue of a company will develop.

Another use of data mining is the stock exchange, where stock prices are mined in order to assess when to buy, sell, or hold a certain commercial paper. The idea is to draw conclusions about the future from historical data. The key problem with this approach is that even if mined data suggests a price increase, many other economic factors are able to turn the tables. Therefore, it is not reasonable to blindly follow historical data. [21] Data mining involves predefined steps to process data. Depending on the application and data availability, each step can be performed in a different order until the desired information emerges; however, the contents of the tasks remain the same (a short sketch of the first steps follows the list below):

• Pre-processing Combination of multiple data sources (data integration) and removal of inconsistent data/noise (data cleaning/cleansing);

• Selection Retrieval of relevant data of the task for the analysis;

• Transformation Transformation or consolidation of data into forms appropriate for data mining;

• Data mining Primary task in which quantitative methods are applied on data in order to extract information;

• Interpretation or evaluation

– Evaluation of patterns
Identification of meaningful patterns or information based on the original data set;
– Presentation of information
Methods for visualisation and representation of information ready for presentation. [21, 120]
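To make the pre-processing, selection and transformation steps more concrete, the following is a small sketch using the pandas library; the sample records and column names are assumptions made purely for illustration.

#Python -- minimal sketch of data integration, cleaning, selection and transformation (illustrative)
import pandas as pd

#data integration: combine records from two hypothetical sources
source_a = pd.DataFrame({'product': ['A', 'B'], 'price': [19.9, None]})
source_b = pd.DataFrame({'product': ['B', 'C'], 'price': [24.5, 9.9]})
combined = pd.concat([source_a, source_b], ignore_index=True)

#data cleaning: remove records with missing values (noise)
cleaned = combined.dropna()

#selection and transformation: keep the relevant columns and bring prices into a common form
selected = cleaned[['product', 'price']]
selected = selected.assign(price_cents=(selected['price'] * 100).round().astype(int))

print(selected)   #the result is ready for the actual data mining step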

An additional option for the evaluation of patterns is “exploratory data analysis” (EDA). In his book from 1977, John W. Tukey emphasised the exploration of data instead of statistical hypothesis testing. EDA may reduce systematic bias by using real data for setting up a hypothesis. [121]

The main objectives of EDA are as follows:

• Suggestion of hypothesis E.g. observed phenomena;

• Assessment of assumptions E.g. statistical inference;

• Support of selection E.g. statistical techniques;

• Providing foundation E.g. future data collections. [122]

In addition, visualisations still have a strong standing in EDA. They can also be utilised for the formulation and confirmation of data models, the visual assessment of data composition and the identification of occasional rogue results. The “de facto standard” for practitioners in data mining and KD is the cross-industry standard process for data mining (CRISP-DM) [123]. It basically breaks data mining into six separate tasks between which back and forth iteration is acceptable [124].
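As a small illustration of visual EDA, the following sketch plots a histogram and a box plot of a collected value series in order to spot unusual distributions and occasional rogue results; the use of matplotlib and the sample values are assumptions for illustration.

#Python -- minimal exploratory data analysis sketch (illustrative)
import matplotlib.pyplot as plt

#hypothetical observed values, e.g. daily prices collected by a Web observation
values = [19.9, 20.1, 19.8, 20.0, 20.2, 48.0, 19.7, 20.3, 19.9, 20.1]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(values, bins=5)      #distribution of the observed values
ax1.set_title('Histogram')
ax2.boxplot(values)           #outliers ("rogue results") become visible
ax2.set_title('Box plot')
plt.show()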

2.3.1 Mining Information from the Web
Web mining, often referred to as Web scraping, applies data mining methods in order to discover and extract meaningful information from the Web. Automated applications are used for the extraction of Web data. As described in section 1.1.2 “Web Data”, the vast information space of the Web is fairly unstructured. Therefore, Web mining might give clues about page contents, browser sessions, log-files (browser and server), link structures and other available resources. The mixture of hypertext and media files makes Web mining rather challenging. Text, image, video, and audio files have to be extracted by useful means for the individual task. Many algorithms are created for the sole purpose of mining the Web. Web observations make use of Web mining techniques to gather the states of Web resources at the observation time. [125]
Han, Kamber and Pei (2012) considered Web mining to be a future problem that still needs to be addressed; this work tries to outline an architectural solution for Web observations. Web mining ought to give answers about the distribution of information on the Web and the classification of Web pages. It might uncover the mechanics of the Web and tries to discover behaviour and relationships of “Web pages, users, communities, and activities on the Web”. [21] There are three distinct types of Web mining:

• Web usage mining Information from interactions, e.g. browser and server log-files;

• Web content mining Information from unstructured, semi-structured and structured data, e.g. text and hypertext documents;

• Web structure mining Information from link structure of hypertext documents. [126]

Technical Obstacles
Technical obstacles are put in place to prevent automated Web data extraction. To separate humans from machines, it is sometimes acceptable to hinder automation, e.g. to prevent spam. Yet, precisely these kinds of mechanisms make the extraction and collection of Web data rather difficult. A popular method to prevent machines from getting to the data is the “Completely Automated Public Turing test to tell Computers and Humans Apart” (CAPTCHA). [127] After all, these measures do not stop automated extraction, but they usually cost effort and time until a method is found to get to the desired Web data. Potential circumventions of technical obstacles are, for example, DOM parsing, natural language processing, or the imitation of human-like browsing behaviour.

2.3.2 Extraction Methods
Extracting data from the Web can be undertaken either by human copy-and-paste actions or via software. Most often, Web scraping is used in an automated manner with software that is able to access the Web with the help of HTTP. Moreover, this facilitates large-scale Web data extraction. After the HTML code and its content have been downloaded, a parser is used for processing the content. Web scraping is often used for gathering mail addresses for mailings, also known as junk or spam. Other applications are, for example, price comparisons, Web content change detection, and Web mashups. [128]
For example, there are businesses which exist exclusively online. An online shop inevitably results in highly comparable prices, e.g. for the value of goods and services. When looking for the best deal, it has become rather difficult to follow all offers manually. With the help of a monitoring system, a potential customer may have the necessary support for finding the best offer.
There are several ways to access Web data. In most cases, text is encapsulated within HTML tags; however, Web pages are created for Web users and not for software by design. Therefore, Web scraping software makes it possible to extract desired Web content for another purpose. Domain-specific solutions are fairly widespread in science and technology. A different approach is used by regular expressions, which use a sequence of characters to define a particular search pattern. This allows automated data extraction without human interaction. Many methods make use of the structure of resources in order to identify useful information, while other approaches derive from the field of machine learning with supervised or semi-supervised learning abilities. [107] According to Ferrara et al. (2014), Web data extraction techniques can be divided into three major methods:

• Path/graph/tree-based Structure is known, e.g. XPath or DOM parsing;

• Web wrappers Part of content is known, e.g. regular expressions;

• Hybrid systems Mixture of above, e.g. template-based matching. [128]

Perhaps the most serious disadvantage of Web wrappers is that they are tightly attached to the Web content. Whenever a website changes its layout significantly, the Web wrapper will no longer work. A possible solution to this problem may be rule- or learning-based approaches that are able to extract the desired data independently. Nevertheless, Web data extraction software should be able to get, store, analyse, transform and visualise Web data. The idea thereof is to set up a collection of data, facts and figures which must remain comprehensible, even over a long period of time. Periodically extracting one simple value from a website may not be a big challenge. Yet, problems may appear with more distributed data sources or Web resources that do not have API capabilities. [79, 129]
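As an illustration of such a wrapper, the following sketch uses a regular expression to extract a price from an HTML fragment; the markup and the pattern are assumptions, and the example also shows why a wrapper breaks as soon as the layout changes.

#Python -- minimal Web wrapper sketch based on a regular expression (illustrative)
import re

#hypothetical HTML fragment of a product page
html = '<div class="product"><span class="price">CHF 19.90</span></div>'

#the wrapper is tightly attached to the markup: it expects a span with class "price"
pattern = re.compile(r'<span class="price">CHF\s+([\d.]+)</span>')

match = pattern.search(html)
if match:
    print('extracted price:', float(match.group(1)))   #-> 19.9
else:
    #a layout change (e.g. a renamed CSS class) silently breaks the wrapper
    print('wrapper failed: pattern not found')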

2.3.3 Web Mining Algorithms
To access the data of Web resources, many Web mining algorithms can be used. Web mining techniques are key components within the constraints of the programming language or library chosen for this task.

Basic Web Observation
A server with access to the Internet can be used for all kinds of Web observations. The shift towards server-side applications provides the flexibility of different environments and manifold tasks. The final choice of components is up to the creator: programming language (Code), database software (Storage), file format, e.g. RDF, (Data Sets), and libraries for graphical representation (Visualisation). The Web client plays an essential role in exploring a Web resource. The goal is to find meaningful or interesting pieces of information. In many cases it might be data that is in constant flux and has a certain value to the viewer. Web observations want to make use of content from the Web. A feasible basic architecture for a Web observation is as follows:

Figure 2.3: Basic Web observation architecture in which all components (Code, Storage, Data Sets, Visualisation) are stored on a server. [79]

However, the server hosts the key components that define a Web observation. It may consist of actual programming code (Code), a database system (Storage), a file format (Data Sets), and graphical representation (Visualisation). The code contains the actual Web observation, which then stores its information in a database. From this database, data sets may be automatically generated, which in turn can be accessed from a Web client for further analysis of the data. It is common sense to run a Web mining application from a server that sorts out computing resources, e.g. database storage, run times, and other matters, in order to properly collect Web data. The idea is to include all the necessary means on a single server: programming code, storage solution, data sets and visualisations thereon.
The implementation of a Web mining algorithm varies considerably depending on the language and libraries used. Already available libraries and existing code snippets make programming tasks far easier than starting from scratch. Although the functionalities used may differ, the result remains the same. However, special characters may need to be escaped. Each library has different dependencies on other libraries that have to be pre-installed.
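To make the interplay of these components concrete, the following sketch collects one value from a Web resource, stores it in a local database and exports a small data set; the observed URL, table layout and file names are illustrative assumptions.

#Python -- minimal Web observation pipeline sketch: code, storage, data set (illustrative)
import json
import sqlite3
import time
import urllib.request

URL = 'https://www.example.org/observed-page'   #hypothetical Web resource

#Code: collect one state of the Web resource
with urllib.request.urlopen(URL) as response:
    content_length = len(response.read())

#Storage: persist the observation in a local database
connection = sqlite3.connect('observation.db')
connection.execute('CREATE TABLE IF NOT EXISTS states (timestamp REAL, size INTEGER)')
connection.execute('INSERT INTO states VALUES (?, ?)', (time.time(), content_length))
connection.commit()

#Data sets: export all collected states as a JSON file for later visualisation
rows = connection.execute('SELECT timestamp, size FROM states').fetchall()
with open('dataset.json', 'w') as outfile:
    json.dump([{'timestamp': t, 'size': s} for t, s in rows], outfile)
connection.close()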

BeautifulSoup in Python
The lines of code below provide a basic Web mining algorithm that outlines the functionality in Python:

#Python

#import library 'urllib.request'
import urllib.request
#import the Beautiful Soup functions for parsing
from bs4 import BeautifulSoup
#specified URL stored in variable 'wikiurl'
wikiurl = 'https://en.wikipedia.org/wiki/Liz_Bacon'
#query the wikiurl and return the HTML code to the variable 'html'
html = urllib.request.urlopen(wikiurl)
#parse the data returned from the website and store it in Beautiful Soup format
soup = BeautifulSoup(html, 'html.parser')
#print the result of Beautiful Soup
print(soup.prettify())

Listing 2.3: Python Web mining example.

These few lines of Python code facilitate Web mining of Wikipedia contents. The output is the whole HTML code of the specified Wikipedia page. Additional code such as “list(soup.children)” could parse the content further and would simplify the selection of the desired content.
Long-time Web observations seek information of high significance. Collecting data that points to facts in the real world might be a good idea. Meta data such as moving geolocations are particularly interesting to observe. It is therefore a good idea to capture movements of things in the real world. These movements are often stored within HTML or XML source code and have to be sought and found. The library “Beautiful Soup”, or short “Soup”, makes this task rather easy; it composes precise content strings from long HTML code according to given parameters. The following code snippet accesses the child elements “geo”, including “latitude” and “longitude”:

#Python

#get the first element 'geo' and return it to the variable 'geo'
geo = soup.find('geo')
#get latitude from the 'geo' element and print it out
print(geo.latitude.get_text(strip=True))
#get longitude from the 'geo' element and print it out
print(geo.longitude.get_text(strip=True))

Listing 2.4: Access child elements ’geo’, ’latitude’, and ’longitude’ in Python.

Soup is also able to eliminate whitespace from the beginning and end of each text. Therefore, the method is extended with “(strip=True)”. This Python example outlines which section of the Wikipedia content should be the focus of the data collection.

Needle & cheerio in JavaScript
There are a number of important differences between Python and JavaScript. JavaScript may have a similar syntax; however, the two are distinct in regard to event-driven and asynchronous capabilities. Code execution continues with further lines even if a called function has not yet returned [130]. Node.js provides a powerful runtime engine for JavaScript which fits perfectly for Web observation tasks. The following basic Web mining script outlines the functionality in JavaScript:

1: //JavaScript
2:
3: //import libraries 'needle' and 'cheerio'
4: var needle = require('needle');
5: var cheerio = require('cheerio');
6: //store Google url in variable 'googleurl'
7: var googleurl = 'https://www.google.com/search?num=100&q=';
8: //specified search string stored in variable 'searchString'
9: var searchString = 'Alexander Groeflin';
10: //query to Google plus searchString with the library 'needle'
11: //using callback
12: needle.get(googleurl + searchString, function (error, response) {
13:   //if HTTP status code '200' OK
14:   if (!error && response.statusCode == 200) {
15:     //return HTML body 'response.body'
16:     console.log(response.body);
17:   }
18: });
19:
20: doSomething(); //further code executes without waiting for the callback above

Listing 2.5: JavaScript Web mining example with needle callback.

The function “doSomething()” on line 20 does not have to wait for results from the callback function “needle.get()” on line 12.

Moreover, asynchronous JavaScript is very common in Web environments. It brings many advantages for expensive and/or time-consuming operations. In contrast to synchronous code, asynchronous code keeps on running further lines of code and thus will possibly not make a Web server unresponsive. It simply does not have to wait for the return of a called function and proceeds with the execution of code. [130]
A useful library for Web mining tasks is needle. In the example above, needle requests the particular string “Alexander Groeflin” in the search engine Google. For that reason the string is used in the needle callback method “needle.get()”. Needle facilitates HTTP requests in the Node.js environment, e.g. for APIs, data streams, and event handling. With a callback, the response body is buffered and written to “response.body”; the callback function only starts when all data has been collected and properly processed. For specific file retrieval tasks, no callbacks are used. In the case of documents and files, a stream of data is the preferred choice, in which no buffering functionalities are necessary. [131, 132] The following needle algorithm uses stream functionalities:

//JavaScript

//import libraries 'fs' and 'needle'
var fs = require('fs');
var needle = require('needle');
//name of file 'easyjet_BSL-LGW.json' stored in variable 'fileout'
var fileout = fs.createWriteStream('easyjet_BSL-LGW.json');
//query to GitHub with the library 'needle' using streams
needle.get('https://github.com/WebObservatoryUnibas/lab/raw/master/easyjet_BSL-LGW.json')
  .pipe(fileout)
  .on('finish', function () {
    //signalling end of task
    console.log('stream finished!');
  });

Listing 2.6: JavaScript needle stream.

As described before, parsing libraries such as BeautifulSoup make a programmer’s life easier. The library cheerio offers parsing functionalities for the composition of markup language structures. To achieve the objectives of the Web mining task, simple tag selectors such as CSS selectors, CSS paths, or XPath can be very helpful. Selectors are capable of filtering out content of a Web resource, for example the headlines of a Google search, and thus make it possible to observe the page rank. The code snippet is as follows:

//JavaScript

//query to Google incl. searchString with the library 'needle' using callback
needle.get(googleurl + searchString, function (error, response) {
  //if HTTP status code '200' OK
  if (!error && response.statusCode == 200) {
    //load HTML body 'response.body' in cheerio and return it to variable 'html'
    var html = cheerio.load(response.body);
    //select all tags 'h3>a' and return them to the variable 'searchResultsHeadlines'
    var searchResultsHeadlines = html('h3>a').text();
    //print out variable 'searchResultsHeadlines'
    console.log(searchResultsHeadlines);
  }
});

Listing 2.7: JavaScript needle and cheerio combination.

These kinds of algorithms run in the command line/terminal once executed. Repeated execution of such a Web mining task is exactly what a Web observation aims to achieve. Consequently, a resilient method must be used for collection and storage. One available option is the method setInterval(), which calls a function at specified intervals (in milliseconds). The method will continue to run until the method clearInterval() is called. In conclusion, JavaScript facilitates event timing with the help of time intervals as follows:

//JavaScript

//function containing the Web mining task
function doSomething() {
  //code is executed on startup and once every 4 hours
}

//execute once on startup
doSomething();

//milliseconds: once every 4 hours (14400000 milliseconds)
setInterval(doSomething, 1000*60*60*4);

Listing 2.8: JavaScript setInterval() method.

As shown above, the method “clearInterval()” is not part of the code. This means that the code would run continuously until further user input is given. For the use of interval methods, a continuously running process is required, e.g. in a Linux environment. A major problem with this kind of application is unexpected errors in the Web mining process which may crash the whole program, for example a JavaScript TypeError. Sometimes different data types are collected, e.g. a string instead of a number data type, which may also stop the program and result in the error “NaN”, Not-a-Number. These likely events may accidentally invalidate a Web observation because of a lack of data. Such errors have to be handled explicitly.
The JavaScript interval method is a simple and proven option to repeat Web mining tasks. Account should be taken of the maximum workload of a collection algorithm. Larger data collection problems that occur in Web observations with many different values should not be handled with these kinds of methods. Complex applications with many lines of source code become hard to manage.

Cron Jobs A good alternative is cron jobs which is a utility for Unix operating systems. It schedules commands and collection algorithms within the operating system which makes maintenance less time expensive. Cron jobs are usually used for periodic execution but also on fixed events such as a reboot. The JavaScript library “cron” enhances the node.js environment with cron job functionalities [133]. This chapter described technologies behind Web observations and clarified the difference between Web observations and Web observatories. The next chapter “Research Plan” will provide the applied research methodologies. Part II

Methodology
Chapter 3

Research Plan

Plans are of little importance, but planning is essential. — The Right Honourable Sir Winston Churchill, former British Prime Minister (born November 30, 1874 – died January 24, 1965)

This chapter provides an overview of the research methodologies applied in this thesis. After highlighting the research contributions, a refined research methodology including the research problem and research questions is given.

3.1 Research Contribution

The objective of the author’s research is to develop Web observations that gather data of interest from the Web. Moreover, an architecture shall be created that gives the user a reusable structure for conducting Web observations. In addition, a focal point is the actual Web observation task, which should have the flexibility to switch between programming languages, software environments and operating systems (OS). It is important to allow this flexibility of choice because of the diversity of the Web with its manifold appearances. From the operating system to the actual programming code, components should be arrangeable in a self-defined order. The architecture should be able to bind software together like building blocks.

For this reason, the process of examining the actual Web resource of interest shall be described and outlined. Furthermore, implemented architectural approaches will be outlined in order to determine which possibilities fit the given research questions best. The main planned research contributions of this thesis include:

• Literature review of data aspects, technologies for Web observations, legal and political perspectives, as outlined in the chapters before;

• Definition of inspection tasks undertaken before a Web observation may actually begin;

• Recommendation for an architecture for Web observations.

The goal is to create an architecture that groups together software packages and programming languages based on a modular approach. For collecting and storing data from Web resources, there is a need for a toolbox with reusable parts. No general algorithm currently exists to collect data from any given Web resource. Therefore, Web mining tasks have to be adaptable to the Web resource. Essentially, the intention is to facilitate Web observations by providing reusable components for the process of collecting Web data, which makes it possible to measure and monitor events on the Web.

3.2 Research Problem

The objective of this thesis is to facilitate Web observations of Web resources. After having assessed the available literature, it can be said that current state-of-the-art services and tools are not able to collect Web data in all cases. Various Web services and systems are available but do not have the ability to gather the desired Web data. Current services do not fully deliver this functionality for Web data extraction. Available Web mashups did enable some Web observations, but with growing complexity they cannot deliver the required data extraction capabilities. Condition action systems are a promising approach that is tackled in this work. That is the reason why the author was forced to conduct his own research and development in this area. Thus, the author defined a non-exhaustive list of the kinds of data a Web observation might usefully collect:

• Web service events or Web change detection;

• Measurement of values, e.g. number of movements, data, and availability;

• Verification of data and prediction, e.g. verify weather forecast;

• Web resource uptime, e.g. meta data and service performance.

For a technical solution that is able to collect all these different kinds of data from the Web, a flexible building block architecture must be developed. Based thereon, meaningful data can possibly be collected over a longer period of time. Real-life data such as temperatures or geolocation-based services are of particular interest. So far, however, no Web resource has been selected as a useful source. Therefore, it is necessary to find a fitting head start for a Web observation. The numerous data sources on the Web make it a difficult choice to identify use cases for Web observations. The criteria for using Web observations are often unclear. Initially, it was the preparation for a conference that provided the author with a useful first scenario: knowing exactly when deadlines have been updated. With this in mind, Web observations gained momentum for this particular use case. While monitoring flight tickets for a cost-effective offer, the author became determined to apply Web observations to other Web resources. However, interesting Web resources have yet to be found, which is another important aspect to address.

3.3 Final Research Questions

Based on the literature review outlined beforehand, the original three research questions illustrated in the introduction are still considered highly relevant. However, the last research question did not meet expectations in terms of precision: “What kind of Web data may be the most interesting for a Web observation?”. It is extremely vague to determine what is considered “most interesting”. The question has therefore been adjusted in order to find scenarios that are particularly suitable for Web observations. The final research questions, including further remarks, are as follows:

• Is it legally allowed to systematically collect Web data without anybody knowing about it?

• How to design a flexible architecture that is able to collect data from desired Web resources over a long period of time?

• What kind of Web resource scenarios are particularly suitable for Web observations?

The first research question deals with the legal challenges arising with data collections. The current situation regarding international data movements and data ownership will be addressed. Furthermore, the new legal framework GDPR will be introduced into the subject of Web observations.
The second research question outlines the development of an architecture that is able to give a choice between different software packages. This can be answered by describing the essential tasks for the extraction of Web data. Moreover, an architecture has to work for all sorts of Web resources on the Web. A flexible architecture is able to extract data from all sorts of environments. It is clear that Web mining code must be adjusted to the needs of the environment. In the end, a robust Web observation solution must be able to work over a longer period of time, so it can gather the different states of a Web resource and its changes.
The third research question points to scenarios that must be identified and conducted for Web observations. For example, the Web changes every millisecond, but which events and values are particularly interesting for the given research? These different scenarios have to be developed, tested, and described to answer this question. When designing an architecture, it is essential to test and evaluate it. Part of such an evaluation could be an in-depth analysis of the results to prove that it really can provide useful data. Therefore, Web observation architectures must be used for real scenarios. The scenarios in turn collect and create data sets that may give insights into what happened to a Web resource over time and perhaps explain what mechanisms play a role within a Web resource. Furthermore, the collected data sets may hold valuable information for Web users.
This chapter described the research methodologies applied, the expected research contribution, the research problem and the final research questions. The next chapter “Legal and Political Perspectives” will provide an overview of legal issues and challenges.
Part III

Law & Data
Chapter 4

Legal and Political Perspectives

Any big Internet companies in Europe? No, why? They worry too much. Oh, privacy, oh, security, oh, rules and laws. Before they do that, they just bring all the worries inside. [...] But you never solve the problem and then go do it... — Jack Ma, Founder and Executive Chairman of Alibaba Group (born September 10, 1964)

This chapter gives an overview of legal issues and challenges in Switzerland, the European Union, the United States, and the United Kingdom. From an economic perspective, the European Union and the United States are very important trade partners for Switzerland. The United Kingdom has already started to implement regulations deviating from the European Union, which will be of interest as soon as the United Kingdom evolves into an independent trade partner. Moreover, this chapter outlines political issues such as open data access and predictive policing. Perhaps even more interesting, it tries to offer a general recommendation in a field that has much trouble understanding technical issues but still wants to govern them.

4.1 Introduction

Modern day legislation is faced with the difficult task of fitting fast-changing technological concepts into legislative acts or ordinances. However, both national legislators and courts are often faced with questions that are difficult to understand for lawyers or other non-IT professionals but still need to be decided or governed. Therefore, several national and international legislative bodies are working on revisions or new acts to regulate data, its processing and transfers, or the Internet as such. Yet, the legislative approaches vary widely from nation to nation. Often, private law issues are handled differently from data collections in a criminal investigation. Additionally, with companies growing bigger and bigger, it is not only the general protection of data that worries legislators. There are big conglomerates and multi-national companies without clear structure that make data protection even more challenging, considering the growing difficulties in anti-trust regulations. [134] In the following, a short overview of current political positions, legislative projects and case law from Switzerland, the European Union, the United States, and the United Kingdom shall illustrate the different solutions chosen for a problem that is still difficult to grasp for legal experts.

4.2 Legal Issues

Modern data protection laws have to deal with two major but diverging interests. While the protection of sensitive personal data is the main goal of protection settings in both private and public law, there is a strong interest of investigatory agencies to know who can be held responsible for harmful actions. Countries have chosen different approaches to deal with these issues: while the European Union has a strong focus on securing a person’s private information through governmental regulations, the United States highly values privacy in the sense that people prefer to be left alone rather than having governmental interventions imposed on their private life. Considering that, cross-border data transfers cannot only lead to a violation of a national law, they could also cause international issues due to different standards in regard to the handling of data. Such discrepancies should have been avoided through the so-called “Safe Harbour” practice between the European Union and the United States. However, after 15 years in which American companies could register to be considered as properly applying data protection regulations also recognised in the European Union, the European Court of Justice invalidated this agreement. [135]

Although a new EU-US Privacy Shield agreement had been established shortly thereafter, the European General Data Protection Regulation (GDPR, Regulation (EU) 2016/679) as well as two major cases before the United States Supreme Court (Carpenter v. United States and United States v. Microsoft) will definitely influence the future international handling of data transfers and data processing. [136]

4.2.1 Switzerland

In Swiss law, there is no clear definition of what sort of information falls under “data”. According to the Federal Act on Data Protection (FADP, SR 235.1), data is considered to include all personal information of an identified or identifiable person (Art. 2 lit. a). Yet, data cannot only be found in situations where a private or an administrative body is processing information (Art. 1). Data is also a crucial part of criminal proceedings, and while the disclosure of data collected in a private law setting may be unpleasant for the respective person, in a criminal investigation the access to a person’s data can – if disclosed – cause even more harm than just personal discomfort. [137] The technological changes of recent years gave way to several revision projects by the Swiss legislator. In 2011, the very first Federal Act on Criminal Procedure was introduced in Switzerland. Beforehand, and owing to the federal system of the country, each canton had its own Criminal Procedure Code. After having realised that modern day law enforcement issues can partially be addressed through uniform procedural legislation, the Swiss Criminal Procedure Code (CrimPC, SR 312.0) was enacted. However, like with any new legislative project, only practice could tell the Code’s strengths and weaknesses. Considering the latest global events – specifically the strong use of digital means in crimes like drug trafficking, human trafficking (mainly done over the Darknet) and terrorism (e.g. propaganda on social media) – the CrimPC has recently been revised. [138] As of March 1, 2018, the CrimPC was expanded with specific regulations on how and when law enforcement may access a cell phone or a computer to conduct covert surveillance. According to these new Articles 269bis, 269ter and 269quater, surveillance is now allowed through special technical devices or specific observation software to collect both content and meta data. More precisely, law enforcement agencies may now broadly listen to and record conversations or intercept and recover data such as location information. However, these rules are considered to be the ultima ratio of an investigation. Only in very specific cases may law enforcement go as far as fully observing a suspect under these new regulations. Therefore, these measures may only be ordered if (i) all other investigative techniques have failed so far, (ii) there is probable cause against the suspect, (iii) the crime to be investigated is a major offence and (iv) the police were issued a written warrant specifying the duration and scope of each respective interception. [139] Yet, the additions to the CrimPC were not the only revisions done by the Swiss legislator to adapt its laws to the technological changes. Also as of March 1, 2018, the fully revised Federal Act on the Surveillance of Postal and Telecommunication Services entered into force (SR 780.1). As a complement to the CrimPC, the Federal Act on the Surveillance of Telecommunication Services and its respective ordinance list all the acceptable technical means that can be used for the surveillance of postal and telecommunication services. Strikingly, while there are about forty different means to intercept a telecommunications device, the act only offers three acceptable means for the surveillance of postal services.
[140] The new regulations mainly govern the acceptable way of requesting disclosure from third-party service providers, real-time and retrospective surveillance techniques, distress searches and tracing. Especially the new rules on real-time surveillance now adopt certain standards set by the Council of Europe in its Cybercrime Convention. [141] Also, the new rules include a right to request disclosure of information on participants in an open WiFi. However, for a service provider to know who participated in its service, open WiFis will in future be subject to registration requirements that allow the service provider to collect at least some information on its participants. Interestingly enough, if a private person offers access to his or her open WiFi to their friends, they might also be considered a third-party service provider that would need to disclose information on the network’s participants. Yet, only time will tell whether a private person will ever be in a position to disclose such data. Also, due to the relatively short period of time since these new legislative acts entered into force, it is too early to tell how important the single regulations will be in the daily business of law enforcement and judicial agencies. [142]

4.2.2 European Union

Fast technological changes coupled with new threats of privacy intrusion and cybercrime were just some of the reasons for the vast legislative actions on data by the European Union. In 2016, the European Union enacted several legislative acts to generally improve data protection and facilitate the transfer of data for police and judicial agencies in the Union. Specifically, (i) the Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation); (ii) the Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data; and (iii) the Directive (EU) 2016/681 of the European Parliament and of the Council of 27 April 2016 on the use of passenger name record (PNR) data for the prevention, detection, investigation and prosecution of terrorist offences and serious crimes shall constitute the new legislative framework on data handling inside and outside the European Union. Each of these legislative acts will be presented in the following.

General Data Protection Regulation

May 25, 2018 is an important date in international data protection legislation. From this day on, the General Data Protection Regulation (GDPR) of the European Union (EU 2016/679) is applicable to all private bodies inside and outside the European Union that collect, hold, transfer and/or process data. With this new Regulation, the European Union has created legislation which reaches outside the Union’s borders. Therefore, in the future, it will not matter anymore whether a (legal or natural) person is located inside the European Union to fall under the GDPR. Rather, the Regulation will target anybody who – even if only remotely – deals with data located inside the Union. This mainly includes companies that trade with customers in the European Union. That is the reason why many companies started to update their privacy policies and/or terms of service in 2018. A cascade of updates has recently been noticed by many Web service users:

------ Forwarded message ------
From: The Turnitin Team
Date: 2 May 2018 at 19:31
Subject: Important updates to our Privacy Centre and Terms of Service
To: [email protected]

We've updated our privacy policy

To support upcoming changes to European data protection law, we've focused our efforts on refreshing our processes around how we use your data. [...]

Listing 4.1: Mail from Turnitin. [143]

Another example of this cascade of customer contact in relation to the GDPR is the airline company easyJet. It sent out a mailing to all customers who had a flight booked in the near future:

------ Forwarded message ------
From: easyJet
Date: 3 May 2018 at 19:49
Subject: Your booking*******: Alexander, you're always in control - on your upcoming trip to[.] and beyond
To: [email protected]

SAY YES TO OUR PROMISE, NOW AND ALWAYS

Hi Alexander,

At easyJet we believe in being upfront with our customers.

We use your personal data to keep your flights safe, easy and affordable. We collect and store it securely, and we'll only use it with your permission.

[...]

Listing 4.2: Mail from easyJet. [144]

The Regulation not only introduces an extraterritorial reach; it also fully regulates any action taken by a targeted person, no matter whether that action is connected to the handling of data from inside or outside the European Union. [145, 146] To clarify the above, a practical example from a pharmaceutical company in Switzerland shall shed light on the Regulation’s scope. A pharmaceutical company normally deals with a lot of data that falls under the category of sensitive personal data, meaning all data that can be used to create a personality profile and that includes information such as a person’s state of physical and mental health, wealth and daily habits. In general, no such personalised sensitive data may be collected or processed without the consent of the respective person. Consent to the collection and handling is only valid if the respective person has been properly informed about issues like the “why”, “how much” and “where to” of the data collection and if, based thereon, the person has freely consented to the disclosure of the data (so-called “informed consent”). If the data has been lawfully collected, it may then be further processed and transferred.

This means that a person acting as a trial subject does not only need to be fully informed about what pharmaceutical treatment he or she will receive. The person will also have to be informed about what kind of data will be collected from them, where that data will be transferred to and what will happen to the data. Mainly, these questions will be part of the consent form that the trial subject will sign before any study. Such informed consent will be of utmost importance for future proper data handling under the GDPR. [147] Assuming that the pharmaceutical company complied with the Regulation’s rules on proper data collection and transfer, it further needs to ensure that the data will be properly handled after its correct collection. This includes organisational changes like the creation of the position of a Data Protection Officer (DPO) in every processing body, the general safeguarding of any sensitive personal data as well as the introduction of systems that ensure data safety and the immediate discovery of any data breach. As mentioned above, if a company falls under the GDPR due to having customers or other trading relationships with the European Union, this also means that the pharmaceutical company in our example has to generally ensure that its SOPs, contractual relationships and the actual daily business are always in line with the European data protection rules. Therefore, the company will also need to ensure that its relations to any American company are in compliance with the GDPR. If a company is in violation of the GDPR – be this e.g. through improper handling of data or a general data breach – this can have severe consequences for that company. Depending on the severity of the breach, fines of either (i) €10 million or 2 percent of the global annual revenue or (ii) €20 million or 4 percent of the global annual revenue can be imposed, whichever amount is higher. While this new Regulation puts a lot of pressure on companies to be compliant, it gives individuals new rights. Based on the GDPR, a person may request the disclosure of any and all data that was collected by a data collector (be this a legal or natural person). Additionally, a person may always request that any data collected shall be corrected or deleted, meaning that there shall be a “right to be forgotten”. In general, the European Union tried to create a broad scope for data protection, which is specifically supported by civil rights advocates all over the world arguing that only under the rules of the GDPR can full security of personal data be ensured. However, the very strict rules and especially the high fines did make the implementation of the Regulation difficult and will be a constant future threat in regard to the upholding of compliance. [148, 149, 150, 151]
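The “whichever amount is higher” rule can be illustrated with a minimal sketch; the revenue figure used below is purely hypothetical and serves only to show how the applicable upper bound is determined.

def gdpr_max_fine(global_annual_revenue_eur: float, severe: bool) -> float:
    """Upper bound of a GDPR fine: a fixed cap or a share of global
    annual revenue, whichever amount is higher."""
    fixed_cap = 20_000_000 if severe else 10_000_000
    revenue_share = (0.04 if severe else 0.02) * global_annual_revenue_eur
    return max(fixed_cap, revenue_share)

# Hypothetical company with EUR 3 billion global annual revenue:
# for a severe breach the cap is 4% of revenue (EUR 120 million),
# not the fixed EUR 20 million.
print(gdpr_max_fine(3_000_000_000, severe=True))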

A privacy campaigner, Max Schrems, and his lobby group noyb (“None of Your Business”) filed complaints against Google, Facebook, and their European subsidiaries on the first day of applicability of the GDPR. Schrems, who had already filed suits against the Irish data protection commission which led to the revocation of the Safe Harbour rules between the European Union and the United States, is now eager to take on big tech firms under the GDPR. His claims were filed in Austria, Belgium, and Germany but will eventually be transferred to Ireland due to the GDPR’s rule on the so-called “one stop shop mechanism”, which creates jurisdiction at the European seat of incorporation of a company. His activism will test European interaction and the value of the GDPR from its very beginning. [152, 153]

Influence on Swiss Legislative Projects

The European Union, in addition to physically surrounding Switzerland, is the biggest trading partner of the Swiss Confederation. Therefore, it is of utmost importance to the Swiss government to be considered a state with a comparable level of data protection regulation. In accordance therewith, and also in consideration of the enormous technological changes of recent years, the Swiss legislator took on a full revision of the Federal Act on Data Protection. The revision considers both the GDPR and the Directive (EU) 2016/680. Specifically, similar rules in regard to transparency in data collection shall be implemented. However, while an alignment of rules and the provision of clear information on any data collection is important for future trading relations with the European Union, the Swiss legislator did not include fines of the same magnitude as the GDPR. Violations of the new Federal Act on Data Protection might still be expensive for some organisations, but with a maximum amount of CHF 250,000.– the fines are far from the millions of euros that can be imposed under the GDPR. In September 2017, the Federal Council published the official draft of the revised Federal Act on Data Protection together with its official communication. Next, the draft will be discussed in the Swiss Parliament; however, no date has yet been set for the discussions. Although the GDPR is now in full force and applicable, the Swiss alignment – also known as “Swiss finish” – will still need some time to enter into force. [154, 155]

Influence on Swiss Companies

As discussed in the practical example in regard to the extraterritorial reach, Swiss companies are especially under pressure to align their course of business with the data protection rules in the GDPR. A Swiss entity incorporated in Switzerland will need to review whether its SOPs are sufficient to fulfil the requirements set by the GDPR. While this might not be of utmost importance to a small business in Switzerland solely working and trading locally, almost any company that is located close to the border, irrespective of its size, can possibly fall under the GDPR. It is already sufficient that an employee is a cross-border commuter, because already then the company will collect and process information from a party living inside the European Union. Swiss companies will similarly need to create the position of a data protection officer as well as review all their legal and working relationships to be in line with the GDPR. And while the revised Federal Act on Data Protection might still be years away from enforcement, the GDPR and its enormous fines are sufficient to make companies want to comply. Switzerland is therefore the perfect example of how the European Union extended its legislation across its borders and into foreign territory, ignoring all basic international principles and rules of state sovereignty and territorial powers. While it might be difficult for a small jurisdiction like Switzerland to do anything other than comply – the fear of being fined is incentive enough – it is still questionable whether the future shall be governed solely by big jurisdictions, no matter how good their intentions. [156]

Directives (EU) 2016/680 and (EU) 2016/681

While the GDPR is part of every major headline at the moment, there is another piece of European legislation that needs to be mentioned in regard to the future handling of data. The Directive (EU) 2016/680 regulates the handling of a natural person’s data in regard to data processing by authorities in the investigation, detection and prevention of crimes. [157] In comparison to the GDPR, which was enacted in 2016 in the form of a regulation – meaning the rules therein are directly applicable in any Member State – here the form of a directive was chosen. This means that the rules formulated therein first have to be transposed into national law by each Member State to be applicable, while the Directive itself only provides the framework. A directive leaves room for national considerations, which is of utmost importance in regard to legislation that affects the police as the main enforcing power of a state. The Directive (EU) 2016/680 was created as a public law regulation in answer to the Paris terrorist attacks in 2015, and while it shall provide a general European standard of how data shall be handled in police investigations, it shall still allow the Member States to keep their respective procedures within their law enforcement agencies. The Directive tries to incorporate the latest technological changes into the data handling done by law enforcement. Therefore, it shall mainly facilitate the free movement of data processed in the prevention, detection and investigation of crimes. Furthermore, in comparison to the GDPR, the Directive does not allow the same broad access to collected data. While there is a legitimate interest to allow people full access to their collected data in a private law setting, there is also a legitimate interest to keep people from fully accessing data that is handled by law enforcement agencies in investigative matters. In regard to general transfers of data between Member States and non-EU states, such transfers are permitted under the Directive. However, transfers can only be made to states that are accepted by the European Commission to provide the same level of protection of sensitive personal data as provided under European law. This consideration is a direct outflow of the decision of the European Court of Justice in the case of Schrems v. Data Protection Commissioner (Case C-362/14). Another addition to a general European framework for data handling by law enforcement is provided in the Directive on European Passenger Name Records (PNR). According to this Directive (EU) 2016/681, data must be transferred from air carriers flying from outside the European Union into the European Union to the respective Member State. [158] Although the Directives are considered to create a general framework for data protection in law enforcement, experts have yet to confirm such an outcome. While the adoption of a directive can be postponed to a certain extent, it should further be noted that the Directives cover only police and justice actions. Any data transfers done by an administrative body or by the Union’s institutions might possibly fall neither under the GDPR nor under the Directive (EU) 2016/680. [159]

4.2.3 United States

Currently, the US Supreme Court has to decide two cases that will be leading precedents for how data will be handled in future investigatory processes, be this in national or international investigations.

Carpenter v. United States

The first case, Carpenter v. United States, is considered by some scholars as the most important case to be decided in this decade. Carpenter was a criminal who could be connected to several armed robberies in two different states thanks to the location data from his cell phone. Investigatory agencies had served warrants to several cell phone service providers to hand over location data of Carpenter’s cell phone for the specific days on which the robberies had been committed. Thanks to the data disclosed by the service providers, the investigatory agencies could prove that Carpenter had been present on each day of a robbery in each of the cities and always close to the crime scene. Based on this data and other evidence, Carpenter had been convicted and given a prison sentence of 116 years. Carpenter appealed his case all the way up to the US Supreme Court, arguing that the service provider should not have disclosed his data because it should be considered part of his personal papers, which would be protected by the Fourth Amendment. Therefore, his constitutional right to be free of any unlawful search and seizure had been violated. According to Carpenter, cell phones are crucial to take part in modern day society; therefore, the meta data collected by his service provider cannot be considered as voluntarily given to the service provider. Thus, the service provider should have asked for Carpenter’s consent before handing over his location data to the investigatory agencies. Based on this argument, the location data should have been excluded from his initial process and no conviction should have been made. [160]

Figure 4.1: Law enforcement agent requests data disclosure from a cloud service provider (CSP) safekeeping information of its subscriber in its electronic communication or remote computing service.

While Carpenter argues for a violation of the Fourth Amendment, the Solicitor General of the United States considers this case to be in accordance with legal precedents. Specifically, the Solicitor General points out that according to Smith v. Maryland (1979), meta data collected from a service provider with a pen trap does not constitute a violation of a person’s Fourth Amendment rights if the service provider had been served a proper warrant. At the time, the US Supreme Court held that pen traps might be able to collect information such as the phone number a person had been calling from and the number it had been calling to, as well as the time and length of a call. However, because no content data could be collected by a pen trap, the collection of meta data was considered to be in line with the Fourth Amendment. The Court had also pointed out that a person was generally aware that a telephone company would be collecting a certain amount of data and that employees sitting at the switchboard of a service provider could even listen to phone calls. While the US Solicitor General might be right with his understanding of Carpenter’s case, it is noteworthy that several Justices during oral arguments in November 2017 had asked their questions in a way to keep open the possibility of a narrow decision of the case. The US Supreme Court would then follow its latest case in regard to searches of electronic devices, Riley v. California (2014). In Riley, the Court had to decide whether the search of a cell phone found in a car was acceptable under precedents which would allow warrantless searches of containers in cars. In 2014, the Court decided to take a narrow decision, supporting Riley in his argument that cell phones provide much more information on a person than a grocery bag stored in a car. Therefore, the search of a cell phone found in a car cannot be conducted without having a proper warrant for that search. If done anyway, this is considered a violation of the cell phone owner’s Fourth Amendment rights. [161] Carpenter v. United States is a case that could go either way. The Justices have not given hints whether they want to take a narrow decision similar to Riley v. California or whether they want to stick to the Court’s own precedent, Smith v. Maryland. A narrow decision would only give a legal answer to the case in question but would not provide a rule of law to dictate future cases. The decision is expected by the end of 2018 and no matter how the Court will decide, it will definitely influence how data is collected by US investigatory agencies in the future.

United States v. Microsoft

While the decision in Carpenter will mainly affect how information collection from third parties will be conducted in the future, United States v. Microsoft is all about data collections abroad. In 2013, Microsoft was served with a warrant to disclose data on its users. Included in this warrant was a request for its Ireland affiliate to also disclose data on its servers. Microsoft got the warrant quashed and appealed against the Government’s actions. Besides Carpenter, this case is the second major question on data handling that should have been decided by the US Supreme Court in this term. The main question the Court should have decided is the legality of a US warrant served to a subject located abroad. This case was of special interest to international companies that have their headquarters in the United States but have many affiliates with big data centres all over the world; e.g. Microsoft owns 100 data centres in 40 different countries. About twenty amici briefs from all over the world were entered in support of Microsoft and proved the international scope of this case. In general, a country may not act through its law enforcement agencies outside its own territory. This would constitute a major violation of international law because, according to the general principle of sovereignty, every nation is free to act within its own borders but may not conduct any actions on foreign territory. The only exception thereof is an acceptable reaction to an act of war. Now, a US warrant issued by a US investigatory agency is considered a legal action by a government that is acceptable if served within its own territory. If the United States had wanted to access information from a subject located outside its borders, such as Microsoft Ireland that is located on the green island, the official way of mutual legal assistance in investigatory matters should have been chosen. Based on a Mutual Legal Assistance Treaty (MLAT), a country can officially request a state action from another country within that other country’s territory. Because the United States did not choose to follow this official path, the 2d Circuit Court in New York considered the actions taken against Microsoft as illegal and therefore Microsoft was not required to disclose any data that was located outside the United States. However, the Government did not take this decision well and therefore appealed to the US Supreme Court. [162] Yet, very recently, Congress took on this issue of overseas access to information. While the decision of the US Supreme Court could go either way – oral arguments and questions by the Justices have not given a clear hint in regard to whether the Justices wanted to support the 2d Circuit in its reasoning or not – a law issued by Congress can be made to fit the issue. Congress has done just that with the so-called CLOUD Act (Clarifying Lawful Overseas Use of Data Act). Based thereon, US investigatory agencies shall be able to access user data of US companies anywhere in the world. At the same time, the act opens the possibility of access by foreign countries to user data located in the United States after entering into a bilateral agreement. While this new act is the most favourable solution for the United States government, it could question the applicability of the EU-US Privacy Shield in regard to cross-border data transfers. Also, the CLOUD Act is a general bypass of the mutual legal assistance process.
Nevertheless, on April 17, 2018, the US Supreme Court declared the case moot due to Microsoft having accepted that a new warrant under the CLOUD Act should be issued and that, based thereon, Microsoft will disclose the requested data. Therefore, the main legal question of the case became irrelevant and the Court decided, based on its long-established practice in such cases, not to take a final decision but to give the case back to the 2d Circuit Court on remand to also declare the appeal moot. [163] When the CLOUD Act was signed into law, several technology companies including Microsoft were in support thereof. This may sound contradictory considering the fact that Microsoft had appealed against the Government because it did not want to disclose information.

Figure 4.2: Law enforcement agent requests access to data located in another country from a cloud service provider (CSP). Personal data is protected by human rights and the rule of law required by mutual legal assistance treaties. The CLOUD Act shall facilitate the extraterritorial access to data located in a foreign country.

However, considering the fact that under a law there will be clear-cut rules on the “how” and “what” of disclosure, supporting such legislation is the safest way to go rather than hoping that the Supreme Court might decide in one’s favour. Especially in regard to the current, more conservative composition of the Court, chances might have been that the Court would have supported the Government nonetheless. [164]

4.2.4 United Kingdom

In regard to the future of data protection regulation, the United Kingdom may not be left unmentioned. Whether its current position on data and its handling is influenced by Brexit, or whether this too can be considered an opposing opinion, shall be left open. However, it is noteworthy that of all the different official bodies that entered amici briefs in the case United States v. Microsoft, the one entered by the Government of the United Kingdom of Great Britain and Northern Ireland was written in support of neither Microsoft nor the United States. Yet, in several arguments the British government followed arguments by the United States government and gave at least some support to national security reasoning that should allow access to data independent of its location in the world. With its reasoning, the British government was the only non-US writer for this case that did not strongly oppose the argument of the United States. [165] Also, the United States and the United Kingdom have already started negotiations for a bilateral agreement allowing the United Kingdom access to user data located in the United States under the CLOUD Act. As stated by a Downing Street spokesperson, the CLOUD Act as well as any bilateral agreement between the US and the UK will support investigations of major crimes such as terrorism, human trafficking and sexual abuse of children. [166] There is perhaps one big winner in the aftermath of Brexit in regard to the legislative and business development of data. While the UK tends to support a US position on data privacy and access, Ireland is a strong supporter of the European Union’s GDPR. Irish businesses are working hard on the implementation of the GDPR. This is done in the hope of gaining business from all those companies that will be leaving the United Kingdom due to Brexit, but which still want to be part of the trading community and have close ties to Continental Europe. For years already, Ireland has been the most successful place for data centres in Europe (please see also United States v. Microsoft Corp.). The latest news reports suggest that Ireland will keep this position for years to come. [167]

4.3 Policy and Political Issues

Not only legislators have to consider questions on how to handle data. Companies, too, have to decide on the quality and quantity of data they want to produce, use and publish while still being in compliance with national and international legal standards and generating value from the data. In comparison thereto, law enforcement agencies around the globe have started to use the immense amount of data available in their databases to cope with increased criminality and new forms of crime on the one hand and the lack of personnel on the other.

4.3.1 Open Data Access

While some organisations make data available to everybody, some do not. That is why, first, it needs to be considered what reasons stand behind such publications or non-disclosures of data sets. One reason might be the wish to protect one’s own information against use by a competitor. Also, not all companies see an obvious benefit in publishing their data. However, especially monopolistic public companies tend to keep their data secret although they would not need to fear a competitor. Nonetheless, publicly owned companies might consider full transparency unfortunate for their services, because if a citizen were able to analyse such company data, predictions and evaluations of the effectiveness of a service could be made. On the contrary, academics often consider transparency as a positive factor of IT in the public sector [168, 169]. Thanks to the increasing use of technical means, the public sector is offered a chance to create more transparent working processes. Bovens and Zouridis (2002) define transparency as an “important principle of the constitutional state in the information society”. Mainly, they underline the change of procedures from street-level via screen-level to system-level bureaucracies. According to Bovens and Zouridis (2002), clear-cut rules, observable decision-making and access to information are major requirements for citizens to verify the accountability and efficiency of public sector actions. Considering this, transparency is the key factor for citizens to assess the performance of public sector work and, based thereon, the proper use of tax money. For a proper observation of public work, the publication of data is therefore of utmost importance. Without this measure, citizens cannot verify and measure services. Applying the above to a public transportation system would offer citizens the possibility to assess the operational capability of the system. Especially, a user could identify whether a delay is based on an isolated incident or whether the transportation service provider does not put its best efforts into fulfilling its duties. A Web observatory might be able to get to this data. Without doubt, open source access can sometimes be very beneficial for the private sector. Companies can identify many drivers for open source adaptations in organisations; it has commonly been assumed that innovation is the key driver. In conclusion, open source data may advance technological development and progress society. [170] Ultimately, publishing data of a public sector organisation depends only on the internal or external political will. No matter how simple the technology to publish data might be, without the political will to do so, there will be no transparency in public sector work. However, Bertot et al. (2010) indicate that ICT “can in fact create an atmosphere of openness that identifies and stems corrupt behaviour”. Nevertheless, it seems very unlikely that transparency develops spontaneously. Besides constitutional legality, transparency has become a new constitutional ideal which fundamentally changes how citizens perceive the public sector [168]. [79]

4.3.2 Predictive Policing

Already as early as 1994, the New York Police Commissioner at the time had realised how valuable information and the processing thereof can be for the investigation and detection of crimes. William Bratton, a high-profile policeman who had served in Boston, had gone on to reform not only the NYPD (1994-1996 and 2014-2016), but had also directed the LAPD into a less violent and less corrupt future (2002-2009). One of Bratton’s main achievements was the introduction of the use of data and statistics in everyday policing. [171] CompStat (short for “Compare Statistics”) is a program used by the NYPD since the early 90s. Already in the first year after its introduction, thanks to the use of current and historic data on crimes in New York City, the murder rate had dropped to a level previously unknown to the city. CompStat started off in the form of weekly meetings between different sections of the NYPD and the transit police. Thanks to the analysis of their current numbers of incidents, crimes and arrests, predictions of future incidents could be made. Specifically, resources could be better directed, and in addition to a very strict deterrence of even misdemeanours, a general improvement in law obedience could be achieved. It was the beginning of predictive policing, where numbers were key to creating an early prevention system against criminal activities. While CompStat is often mistaken for a piece of software, it is mainly the procedure of maintaining a regular meeting schedule to compare and analyse numbers and crime statistics. However, the use of software was one key factor in the evaluation of crime statistics. After the successful implementation of CompStat procedures in New York, other cities in the United States followed suit, e.g. Los Angeles, which implemented similar procedures while Bratton was serving as Police Chief of the LAPD. [172, 173] Currently, Chicago is implementing the newest form of predictive policing. Based on an algorithm, scores are awarded to Chicago’s citizens indicating whether they are likely to be involved in a crime or not. Thanks to the enormous amount of data available in Chicago’s own databases as well as other national crime databases, the algorithm determines the chances of being either victim or perpetrator of gun violence. While presumably a person’s former criminal history is one major factor to be considered by the algorithm, so are gang memberships, gang relations of family members, place of residence and employment status. The so-called “Heat List” was widely criticised. Numbers have yet to prove that the Heat List supports the prevention and reduction of crime. What is already clear to many Chicago residents: either the algorithm or the data fed to the algorithm is racially biased. Black citizens have been awarded a high score on the Heat List disproportionately more often than their fellow white citizens. This may be connected to the fact that black people are stopped by the police more often than white citizens, independent of whether they are criminals or not, which would then support the argument that the data is generally biased. However, the Chicago police refuses to publish the algorithm to prove critics wrong. [174, 175] While the effective use of big data in policing may still be at the beginning, the future will make police rely more and more on predictive data analysis.
The pressure to cut costs and the reduction of police squads will leave law enforcement no other option than predicting their work based on big data if they still want to cope with the increased threats to modern day society. [176] Similar projects are currently being evaluated in other countries such as Germany and the United Kingdom. [177, 178, 179] Another development of big data analytics is a social credit system. This sort of mass surveillance has recently been brought to attention by the Chinese government. It still remains a mystery what the scores and the impact of such a system run by the Chinese government will be on its citizens. However, it may be a first step towards restricting the personal freedom of citizens with low scores, e.g. their access to public goods or Internet usage. [180]

4.4 Conclusion and Outlook

Even legal experts are at the moment still uncertain about the future of data protection and access to data. One main issue which might lead to difficulties in international relations in the near future are the two very different approaches taken by the United States and the European Union. Whereas the United States is currently propagating an extraterritorial reach for data collection, the European Union is very much in favour of an extraterritorial reach for data protection. While both legislative acts are fairly new, it is expected that these differences will complicate trading and relationships between these two big jurisdictions. Considering this, other countries such as Switzerland might also be influenced by these conflicting regulations, especially because compliance with both the European Union and the United States might be difficult to implement. [147] Additionally, Gless (2018) outlines that there is as yet no clear definition of the scope of data crimes or cybercrimes. Although the legislators are working hard to include as much undesirable conduct in their criminal provisions as possible, constant technological changes make it difficult for non-experts to define what should be criminalised and how. Also, the strong influence and feasibility of data collection and analysis by private players might possibly lead to closer working relationships between law enforcement and/or administrative agencies and private companies to collect relevant information for the prevention, investigation and detection of whatever will be considered a data crime, cybercrime or information crime by then. [181] At this point in time, a prediction of the impact of the legislative projects undertaken is difficult to make, no matter how well written the algorithm might be. Only practical use and rulings of national supreme courts can provide answers to the proper applicability of these acts. However, we can be sure of one thing: by the time a supreme court has decided a controversial question on the national or international relation to data and its use, technology will not have stopped evolving and any judicial “solution” to a problem might be outdated. There might be clashes with the law, specifically in regard to privacy laws, when using a Web observation. If a person is observed through a Web observation, or data is collected that can be put together into a personality profile, for example meta data, this may raise privacy concerns. However, such infringement can be circumvented if a person is informed of the purpose and the scope of a Web observation and that person gives their informed consent.

4.4.1 Recommendations

The current legal situation is difficult from the point of view of a computer scientist. First, it is challenging to give a clear forecast on the outcome of legal issues. In law, it always “depends” on the specific circumstances; there are very few universally applicable formulas. Second, the current legislative changes in the US and the EU fully redraft the data protection landscape. Therefore, it is even more challenging to give a prognosis in any way. In general, anyone who plans to process personal data is always better off asking for consent in regard to his or her collection activities. Yet, it is the courts that will have the last word on the applicability of the new data laws to a case. If data is transferred across national borders, things can become even more complex. A violation of a data protection act may also cause a violation of human rights or even cause friction in international relations. However, scientists have more freedom than private persons and legal entities when conducting scientific research. Many national constitutions as well as international conventions like the UN Charter and the European Convention on Human Rights include provisions on the freedom of science. But until now there are no court decisions in regard to the acceptable scope of data collections by a computer scientist under the freedom of science. Therefore, a scientist, too, is better off asking for consent when collecting personal data for research purposes. Concluding the above, the author is unable to offer a universal formula to his colleagues due to the variability of the law. It always depends on the specific details of a case, in which the facts must be weighed against each other. This chapter described the legal issues and challenges in Switzerland, the European Union, the United States, and the United Kingdom. It also tries to answer the first research question. The next chapter, “Considerations for Web Observations”, will discuss the process of examining Web resources.

Part IV

Creating a Web Observation

Chapter 5

Considerations for Web Observations

Being able to talk to people over long distances, to transmit images, flying, accessing vast amounts of data like an oracle. These are all things that would have been considered magic a few hundred years ago. — Elon Musk, CEO of SpaceX, Tesla, Inc., Neuralink, and The Boring Company (born June 28, 1971)

This chapter discusses the awareness of Web observations and clarifies the states that are captured with the help of Web mining algorithms. Furthermore, it points out the thorough process of examining a Web resource before the actual Web observation begins.

5.1 Observation of States

In ancient times, astronomers observed celestial bodies and their movements. To gain an understanding of what happens in the heavens, they extensively collected data from their observations. More and more observatories, equipped with what were then state-of-the-art telescopes, were created in the attempt to understand the system of the stars. With the data sets of Tycho Brahe, Kepler was able to formulate Kepler’s laws of planetary motion in the 17th century. Kepler could interpret the data sets and elaborate basic laws of physics. [182]

In a classical observatory, the research subject is a physical phenomenon. In a Web observatory, by contrast, the research subject is Web data (cyber) reflecting a real-life phenomenon. Web observatories are a collection of Web observations that collect data from the Web. In contrast to the astronomers of the past, Web observations can automatically collect data from any given Web resource. Please see also section 2.3.1 “Mining Information from the Web”. The idea of a Web observation is to collect data from Web resources that represent a state at a given time within the process. It is not a physical state, but it preserves values at a certain time of measurement. Nevertheless, it remains a challenge to find a balance between effort spent and information gained. Some data sources comprise information value which only unravels after a certain investment. Decisions on the respective software architecture used for collection are crucial and hard to change in a later stage of the process. [183] The analysis of Web resources over time may involve methods that focus on information value only. This means an application has to funnel information such that new information is identified and stored. Therefore, information must be checked according to its novelty value. The novelty value of information is its worth derived from being new and previously unknown. A time interval ∆t in which information is collected may be an indicator but is not a final assessment of the novelty value. Here, s_n denotes a state, i.e. a copy of a Web resource. As shown in the figure below, information is repeatedly collected:

Figure 5.1: Observing the Web needs an architecture that is capable of collecting and storing data over time. It must recognise whether s_n is new information. ∆t determines the time interval of the observation depending on the content.
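As a rough illustration of such a periodic observation loop, the following sketch collects a state of a Web resource every ∆t seconds. It is a minimal example assuming a hypothetical URL and the widely used requests library; the thesis' actual architecture is discussed in the following chapters.

import time
import requests

URL = "https://example.org/resource"   # hypothetical Web resource
DELTA_T = 3600                          # observation interval in seconds

def collect_state(url: str) -> bytes:
    """Fetch one state s_n of the Web resource at the current time."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content

while True:
    state = collect_state(URL)
    # A real Web observation would now decide whether the state is new
    # before storing it (see the incremental update strategy below).
    print(time.time(), len(state), "bytes collected")
    time.sleep(DELTA_T)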

Periodic collection of data alone is not very efficient. Information gains are created when new information is stored. However, saving all states would be a waste of data storage and Web traffic; therefore, only the information that differs is actually stored, e.g. s_1, s_2, ..., s_n. Web observations should seek out new information. Freshly collected information can be matched against already collected information (data sets s_n) in terms of attributes and content. In case attributes or content are the same as found in the already collected data set, the information can be considered old information. Therefore, the information should not be stored in the data collection again. Similar information must be dropped in order to avoid duplicates and redundant data sets (incremental update strategy). Information that cannot be found in the data collection, in contrast, is treated by the application as new information and stored in the data collection. For this purpose, new information must be processed as follows:

Figure 5.2: Web observations need an architecture that is capable of collecting and storing data over time and thus must recognise whether s_n is new information. The figure shows two states s_1 (5 KB) and s_2 (7 KB) collected at times 1 and 2 and their difference ∆s_1,2.

Information s_1 (green) is compared with s_2 (red), and attributes and content are matched accordingly. With an incremental update strategy, only dissimilar and therefore new information is then saved and stored in the data set, in this illustration within a tree structure. This ensures that the new information, s_2 (red), is saved and stored within the data set. With this method all different states are collected and stored (∆s). Furthermore, the data set provides information in machine-readable form, which ensures automatic processing. With this kind of differentiation scheme it is rather easy to classify information according to its novelty value.
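A minimal sketch of such an incremental update strategy is given below. It assumes that a state can be reduced to its raw content and uses a content hash to decide whether a freshly collected state differs from all previously stored states; the choice of hash and storage is purely illustrative.

import hashlib

class IncrementalStore:
    """Stores only states that differ from previously collected ones."""

    def __init__(self):
        self.states = []        # stored states s_1, s_2, ..., s_n
        self.seen_hashes = set()

    def add_if_new(self, content: bytes) -> bool:
        """Store the state only if its content has not been seen before."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.seen_hashes:
            return False        # old information, dropped to avoid duplicates
        self.seen_hashes.add(digest)
        self.states.append(content)
        return True             # new information, kept in the data set

store = IncrementalStore()
print(store.add_if_new(b"<html>price: 5 CHF</html>"))  # True, s_1 is new
print(store.add_if_new(b"<html>price: 5 CHF</html>"))  # False, unchanged
print(store.add_if_new(b"<html>price: 7 CHF</html>"))  # True, s_2 differs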

5.2 Awareness of Data Collections

Web observations make it possible to collect states of information from the Web. For example, when a business sells goods on the Web, its specifications and prices are automatically retrievable. Ordinary Web users do not have the means of assessing the structures behind Web resources; they usually see what the Web resource owner wants them to see. For non-technical people, collecting states of Web resources is a manual task. Skilled computer scientists, however, have the ability to collect Web data automatically. An initial effort is needed to get to the interesting information of a Web resource. The effort starts with the first inspection of Web traffic and ends with an automatic collection method. Such a consideration can take minutes to several hours, for more complex Web observations even days, and it ultimately determines the outcome of the whole observation application. Usually, website owners upload data to Web resources in order to show Web users content. Every Web resource owner is aware of this copy of data because it was deliberately uploaded to a Web server. What the owner does not perceive, however, are the possible information gathering through Web observations and its consequences. Weber (2017) has introduced an awareness model which outlines the levels of awareness of Web resource owners. An adaptation of the model can be represented as follows:

Figure 5.3: The pyramid symbolises the decreasing awareness of website owners. The layers, from bottom to top, are “Manual Download of Web Content”, “Web Scraping Method”, “Targeted Request” and “Web Observation”. The overall awareness of data collections decreases with each step from the bottom towards the top. While website owners understand the concept of manually downloading Web content, automated Web observations and their information gain are perhaps not known. [184]

The first, bottom layer of the pyramid, “Manual Download of Web Content”, represents the understanding that content is accessible. There are many ways of getting to the Web content of a Web resource, e.g. right click and “save as”. The next layer, “Web Scraping Method”, focuses on the data collection of desired content of a Web page. It should only retrieve specific data, e.g. a file or parts of the Web resource’s content. More sophisticated measures are “targeted requests”, where the Web resource is sent crafted requests, e.g. HTTP requests, in order to get data from it. This can significantly increase the information value gained from a Web resource. At the top, Web resource owners are most likely completely unaware that someone has started to accumulate information about the Web resource with a “Web Observation”. Automated intervals and content states including meta data help to determine how the Web resource evolved. Web resources that are under the scrutiny of a Web observation can reveal more about their content than is intended. The pyramid describes the idea that the vast majority of Web resource owners are unaware of Web observations and the ability to systematically collect Web data.
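To illustrate the difference between the “Manual Download of Web Content” layer and the “Web Scraping Method” layer, the following sketch extracts only a single value from a downloaded page. The URL and the CSS selector are hypothetical, and the widely used requests and BeautifulSoup libraries are assumed; any HTML parser would serve the same purpose.

import requests
from bs4 import BeautifulSoup

# Manual download: a user would save the whole page via "save as".
html = requests.get("https://shop.example.org/item/42", timeout=30).text

# Web scraping: only the desired part of the content is retrieved,
# here a price element identified by a (hypothetical) CSS selector.
soup = BeautifulSoup(html, "html.parser")
price = soup.select_one("span.price")
print(price.get_text(strip=True) if price else "price element not found")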

5.3 Examination of Web Resources

There is a need for a defined process that clarifies the goal of a Web observation. A Web observation only makes sense when a definition is in place in regard to what information should be collected. Functionalities such as sorting, data cleansing, and visualisations help to analyse the collected Web data. Such clarifications include a comprehensive analysis of the Web resource in order to find a method for Web data mining. In conclusion, this entails the specification of the conceptual and technical design, the input/output, and the final data structure.

5.3.1 Conceptual and Technical Design

Web resources provide content in various ways. A conceptual clarification describes where the important data is located and which parts are used for the Web observation. The technical design outlines how the desired data can be retrieved. For example, targeted requests can be used for data retrieval, which depends on the technologies used by the Web server. The conceptual and technical design should also emphasise feasibility. The analysis comprises finding a method to automatically mine the desired content over a longer period of time. The following list is part of the conceptual and technical design of a Web observation.

• Exact data location Common places for desired Web data can be within the Web resource, e.g. an HTML file. It can also be stored in other files and places. Minimum parameters are the URI and the section of content of the Web resource. It may also be that content is spread over multiple Web resources.

• Targeted requests Web data is often only revealed once a request has been sent. Imitating the behaviour of a website for the purpose of Web data mining often involves HTTP requests, but also requests against a database, e.g. through an API or HTML forms (a minimal request sketch follows after this list). This can significantly increase the information value gained from a Web resource.

• Time and costs Most of the data on the Web can be collected with a Web observation. Some websites are forearmed against automated data access: they detect bots and block them before data can be gathered. From a technical perspective, all content that is displayed in a Web browser can also be mined. It can be said that the higher the difficulty of mining the Web data, the higher the costs. Available libraries can help to tackle these kinds of problems but are also time consuming. Other cost factors are the server environment, computing times, and Web traffic. Care should be given to balancing time requirements and costs in order to avoid a growing mismatch.
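As a minimal sketch of such a targeted request, the following Node.js snippet queries a hypothetical JSON endpoint discovered while inspecting the Web traffic of a resource; the needle library (also used for the measurements in chapter 7) and the endpoint URL are assumptions for illustration only:

//Hypothetical targeted request to a JSON endpoint of a Web resource
var needle = require('needle');

//endpoint and query parameters found by inspecting the network traffic (placeholder)
var endpoint = 'https://www.example.org/api/products?category=books&page=1';

needle.get(endpoint, function (error, response) {
    if (!error && response.statusCode === 200) {
        //needle parses JSON responses automatically
        console.log(response.body);
    } else {
        console.error('Request failed:', error || response.statusCode);
    }
});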

5.3.2 Definition of Input/Output

After the desired content has been identified and located, the input and output of a Web observation must be specified. The specifications from the conceptual and technical design with regard to the contents should be considered as input. It might be useful to repeatedly examine a resource in case there is hidden information within the source code. The definition of input data makes it clear what data the observation is looking for. A brief look at the Web resource indicates the data sought for the data collection. The output is particularly important because it will be used for fulfilling the goal of the Web observation. Usually, the output corresponds to the input, which is further processed into the preferred data structure.

• Syntax analysis/parsing Analysing a syntax that consists of consecutive strings and symbols is also called parsing. Parsing makes use of defined rules and formal grammar to dismantle content into its components. The outcome is often a parse tree that may outline syntactic relations or semantics. Libraries are often used for parsing tasks, which makes a programmer's life much easier (a minimal parsing sketch follows after this list).

• Data processing Input and output data must be processed according to the actual need of the observation. Create, read, update, and delete (CRUD) operations are executed in order to process data as required [185]. The final display of the data must also be considered, and certain visualisations require a specific file format.

• Long-term performance The hardware and software environment has to endure a long runtime. It is very much dependent on the infrastructure, e.g. server software, free hard disk space, and bandwidth. Long-term Web observations should also follow proper programming standards in order to avoid unwanted effects such as memory leaks. A memory leak is a particularly unwelcome bug in an application because it consumes more and more memory without releasing it, which finally results in a crash of the application.
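A minimal parsing sketch is given below; it uses the cheerio library (also used for the measurements in chapter 7) and assumes, purely for illustration, that the desired value sits in an element with the class "price":

//Minimal parsing sketch with cheerio (jQuery-like HTML parsing)
var cheerio = require('cheerio');

//HTML as it could be returned by a request to the Web resource (illustrative)
var html = '<html><body><span class="price">42.50 CHF</span></body></html>';
var $ = cheerio.load(html);

//extract and clean the desired text node
var price = $('span.price').text().trim();
console.log(price); //prints "42.50 CHF"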

5.3.3 Selection of Data Structure

As described in the previous section, the output data corresponds with the input data. For the purpose of a Web observation it makes sense to adopt a data structure that reflects the observation architecture. Ideally, the data structure is suitable for the Web observation and does not need a lot of maintenance effort. The processing of large data sets has to be considered when a Web observation will run over longer periods of time.

• File format and data structure There are many file formats and corresponding data structures that can be used for Web observations. Data should be well arranged for storing valuable information, e.g. in a suitable file format or database. To prevent data loss and increase accuracy, adequate data formats should be used. Time formats like “2018-05-01 08:00 AM” should follow an underlying convention, e.g. ISO 8601 for date and time representations (a small sketch follows below). For the transformation of data into written language, habits, time zones and standards must be taken into account. Please see also section 1.1.3 “Data Sets” for sustainable file formats.
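As a small sketch of such a convention, the following Node.js snippet normalises an observation timestamp to ISO 8601 in UTC, so that time zones and locale habits do not distort the data set; the example date is arbitrary:

//Normalise an observation timestamp to ISO 8601 (UTC)
var observedAt = new Date(2018, 4, 1, 8, 0); //May 1, 2018, 08:00 local time

//toISOString() always produces an unambiguous UTC timestamp,
//e.g. "2018-05-01T06:00:00.000Z" on a machine set to Central European Summer Time
console.log(observedAt.toISOString());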

There are multiple aspects to consider before starting a Web observation. The most important points are those mentioned above; however, this list is non-exhaustive. Depending on the observation goals, it can be extended further to create even larger Web data collections. Nevertheless, this comprehensive analysis illustrates the essential requirements for most Web observations. This chapter discussed the awareness of data collections and clarified the states of data collections. Furthermore, it pointed out the process of examining a Web resource. The next chapter “Architectural Designs” will outline two architectural approaches.

Chapter 6

Architectural Designs

If you’re doing anything interesting in the world, you’re going to have critics. If you absolutely can’t tolerate critics, then don’t do anything new or interesting. — Jeffrey Preston Bezos, Chairman, President & CEO, Amazon.com, Inc. (born January 12, 1964)

This chapter outlines two distinct architectures for Web observations that have been developed in order to answer the second research question. Starting with reusable Web observation scripts, the “Event Condition Action (ECA) System” is predominantly used for reactive Web applications and the orchestration of events on the Web. It may also be combined with all sorts of Web servers as well as Web observations. While the ECA System was specifically built for event-driven interaction, the “Building Block Architecture” emerged for the sole purpose of Web observations. In turn, the building block architecture can be composed of different software packages and programming languages. Each block can be adapted to the user's needs in order to create all sorts of Web observations. Both systems will be further discussed in this chapter.

6.1 Event Condition Action (ECA) System

The Event Condition Action (ECA) System is an application framework that provides a standard structure of software, designed to support software programmers in core functionalities such as data and session management. The use of application frameworks drastically reduces the amount of time needed for the creation of Web applications; the benefit, however, strongly varies with the purpose of use. [186] Bosch (2014) originally outlined the idea of an application framework with reactive features that is considered an ECA System. His work “Towards Reactive Information Systems and their Services” led to the creation of an event-based application framework programmed in JavaScript [187]. This notion has been further improved with a focus on real-time reaction, using inbuilt coding functionality for the reactivity between Web applications and resources. The goal is to know as fast as possible when an event has occurred. The architectural design must therefore be flexible enough to handle real-time data. The development of such a prototype fulfils the needs for reactive actions on the Web. Its focus is not primarily on Web observations but on all sorts of API interactions. Naturally, Web communication is latency-driven: a certain delay occurs before the actual transfer of data begins, followed by transfer instructions. Thus, asynchronous communication and scalability mechanisms are core features of a Web application framework. Another key aspect is the architectural model that enables the programmability of events. It enables inbuilt coding in which JavaScript can be used. The underlying programming code is shown within the user interface and can be directly modified.

6.1.1 Architecture

The basic architecture of the ECA System is based on either an event or a data flow. Both data (blue) and events (orange) can be retrieved from a “Web Service”. The system keeps feeding itself through the “Event Trigger” and “Event Listener”, in which “Webhooks” can be used. Depending on the external input sources, events can also be aggregated. For further actions, stored “ECA Rules” are applied in order to process the two input streams in the “Rule Engine”. Afterwards, actions are dispatched in the “Action Dispatcher”, and the action itself may also be used as input for another “Web Service”. Overall, the ECA System is a Web application framework that accepts all sorts of requests in order to provide all possible interactions. It foremost wants to achieve reactivity on the Web.

Figure 6.1: The ECA System interconnects Web services and Web resources through the “Event Trigger”, “Event Listener”, “Event Aggregation”, “Rule Engine”, and “Action Dispatcher”. Web services often deliver data through an API in which the “Event Trigger” anticipates the data. It may also directly interact through events in the “Event Listener”. The “Rule Engine” works according to the stored “ECA Rules” and instructs the “Action Dispatcher”. [91]

In detail, the ECA System consists of five main modules. The Web is used as a resource for events and for actions thereon, which enables access to data and events on the Web. For the purpose of processing information, three core entities control the flow of events and actions. In the top layer, the “Poller” (active events) detects and polls Web data for desired information, while the “Event Listener” is in place for the passive retrieval of events from Web services. The “Rule Engine” dispatches predefined actions according to a rule set which can be adjusted to the user's needs. For Web observations, the rule set is in place to gather new information and drop old information. Rules are elaborated by an administrator for certain types of events, for example a Web service. In the middle layer, databases ensure the storage and execution of user input. The lower part of the system architecture consists of the system management, where CRUD functions can be accessed.

The full ECA System architecture is outlined as follows:

Figure 6.2: Core functionalities are shown in rectangles and data storages in cylinders. The “Poller” and “Event Listener” forward events to the “Event Queue”, whereas the “Rules Engine” evaluates events for actions. Configuration settings are stored within four databases (“Event Triggers”, “Webhooks”, “Rules”, and “Action Dispatchers”) that are managed through the “User Request Handler”. This kind of user input is managed via the user interface. [91]

If an event or a data flow of interest occurs, the information is forwarded to the “Event Queue”. Whenever an event rule matches the input, the “Poller” and the “Event Listener” enqueue their information into the “Event Queue”, in which events are stored for the “Rules Engine” to be processed. Actions are then dispatched according to the ECA rules. The main functions within the ECA System architecture can be explained as follows:

• Poller The “Poller” loads stored “Event Triggers” for new and updated rule entries and enqueues them to the “Event Queue” in case the detection rule matches; thereupon an event will be dispatched. “Event Triggers” may actively detect or poll changes in the Web, which are transformed into events;

• Event Listener “Event Listener” listens for data input on APIs for the “passive retrieval of Events” or loads “Webhooks” on start-up including the related URI. Events are forwarded to the “Event Queue” if the detection rule is matching;

• Event Queue The “Event Queue” acts as a queue for events from “Poller” or “Event Listener”;

• Rules Engine Events are processed in the “Rules Engine” from the “Event Queue” when a match with a detection rule is found;

• User Request Handler Interaction with users and administrators is handled in the “User Request Handler”, which provides four different interfaces for CRUD actions: Event Triggers, Webhooks, Rules and Action Dispatchers. [91]

All these functionalities are based on action and event objects. Action objects target URIs with the help of detection rule configurations, while event objects consist of a fixed time and a source attribute. Whenever a condition is met, an action is processed and an event is created from the given parameters, as sketched below.
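Purely as an illustration of this idea, the following snippet sketches how an event object and a rule with an action object could be represented; the field names and the notification URI are assumptions and do not reproduce the actual data model of the ECA System:

//Illustrative event object with a fixed time and a source attribute
var event = {
    time: new Date().toISOString(),   //point in time of the event
    source: 'weather-api',            //Web service the event originates from (hypothetical)
    payload: { temperature: 31 }
};

//Illustrative rule: a detection condition and an action object targeting a URI
var rule = {
    condition: function (e) {
        return e.source === 'weather-api' && e.payload.temperature > 30;
    },
    action: { targetUri: 'https://example.org/notify', method: 'POST' } //placeholder URI
};

//dispatch the action if the condition is met
if (rule.condition(event)) {
    console.log('Dispatch action to', rule.action.targetUri);
}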

6.1.2 Conclusion

The ECA System is mainly built for detecting and dispatching events on the Web. Although it provides many functionalities, such as the event listener, the event poller, and a sophisticated rule engine for the orchestration of Web events, it is not an ideal system for conducting Web observations. However, the architecture of the ECA System seems to be very helpful in reacting to all kinds of events on the Web. Therefore, the system ought to be very valuable in orchestrating tedious tasks of Web users. The ECA System wants to encourage Web users to program their own applications with their preferred language to provide a more reactive Web.

With this hand-coded architecture, the goal was to make use of Web events with the following options:

• Inbuilt programming window An input mask for programmers that allows direct input but also “hot swap” or “hot deploy” of programming code. With this concept, code is taken into the application without the need to reload the whole page;

• Usage of event rules The use of already available event rules for removing tedious tasks. For that reason, regular Web users can make use of the ECA System.

Existing Web applications and services in this field are limited to the offered functionalities, while the created ECA System is extendible without limitations. Another issue encountered was the maintenance of Web observations in the ECA System. The creators of the ECA System did not focus on the use of the data collected from Web observations, because new information must be processed immediately and more emphasis was put on event detection on the Web. For that reason, the value of novelties was rated higher than data structures. Under best circumstances, real-time updates are the preferred means of communication, as the value of novelties degrades over time. Web change detection was the main practical use case in which the system proved its usefulness. The reactivity is generated, for example, every time Web users are notified by mail or when a tweet is posted on Twitter. Output emerges from Web changes, which is to some extent a reactive behaviour. This may help Web users to orchestrate Web services and underlying events for their purposes. Difficulties arise, however, when an attempt is made to store data by using the ECA System's architecture. The system does not support data file formats; only strings are passed along. In addition, most of the system users were actually not deploying code from the ECA System but from their own command line. In conclusion, the system did fulfil its purpose but did not provide full Web observation capabilities and was therefore phased out at the end of 2016. [91]

6.2 Building Block Architecture

An architecture consisting of building blocks is the logical consequence of having all necessary functionalities combined in order to create powerful Web observations. Weber (2017) introduced the concept of an architecture that consists of customisable and reusable blocks. His work “Component Based Web-Scraping Strategies” was the starting point of this architecture [184]. By adapting and enhancing this notion and incorporating it into Web observations, this work elaborates a valuable Web observation architecture that is flexible thanks to the numerous choices for each building block. Splitting the actual Web observation task into small building blocks makes this architecture exceptionally flexible. The Web observation process remains the same, as do the framework, the requests and the individual parsing. However, the programming code can be reused and a problem-specific solution can be found. This enables the collection of Web data from any Web resource, no matter what Web technology has been used. The idea is to run all processes in a chain of blocks that can be customised for the Web observation task, just like a construction kit.

6.2.1 Building Blocks

Building blocks have a well-defined system boundary, and each solves one particular problem of the Web observation task. A building block has to be created only once; the Web observation application then uses the best fitting blocks for the observation. One building block is a basic solution component that can be used for many different problems.

What follows is a full outline of all used building blocks that could be chosen and exchanged for the purpose of the Web observation:

Figure 6.3: Single building blocks form the basic components of the building block architecture: the “Web Observation” (Language, Framework, Requests, Parsing) and the “Server” (OS, Web Server, Database, Data Format, Visualisation), situated between the “Web Resource” and the “Web Client” of the user. Each block outlines the interchangeability of each design decision. One block contains many different design choices that must be adjustable for Web observations.

The “Web Resource” and the “Web Client” as such are not part of the overall architecture (grey). They represent the target and the end user who wants to create a Web observation. By using each building block with the feature that best fits a given Web observation task, a choice can be made for each of the components (green and red). For example, after careful examination, a Web observation can be defined according to the needs of the Web resource. In case seek time is critical when querying data, an in-memory database would be the preferred choice. For that reason, the “Database” determines which “OS” should be used. From the “Database”, data sets can be extracted in a particular “Data Format”. The format often depends on the “Database”. For visualising data of a Web observation, a “Web Server” displays the data, which also depends on which libraries are used in the “Visualisation”. It often comes down to the decision of which libraries are used for the visualisation, e.g. D3.js. The flexibility of choice takes all available software, frameworks, libraries, etc. into consideration, and not just one particular option. The same is applicable to the “Web Observation” block. The programming “Language” that best fits the Web observation task is taken, which comes with a “Framework”, e.g. Node.js, including packages. With the help of the programming “Language” and the “Framework”, automated targeted “Requests” can be created to the desired “Web Resource”. The resulting data is then further processed for the “Server” building block. With the “Parsing”, the desired data of a Web observation is stored in a format that meets the volume of data, which is usually stored in a “Database”.

• Web Client With the help of a “Web Client” the best approach for gathering the data of a “Web Resource” can be identified. It is the first point of contact with the “Web Resource”, in which a user is able to obtain the structure of a website. Depending on this examination, the conceptual and technical design (server resources), input/output definition and data structure selection (Web observation creation) can be undertaken.

• Server The server environment contains distinct features such as an operating system “OS”, a software for the “Web Server”, a data storage solution which most often consists of a “Database”, which also correlates with a pre-defined “Data Format”, and the resulting “Visualisation”, which is highly dependent on the building blocks in the “Web Observation”. The choice of each building block must be well thought through according to the requirements. It must be taken into consideration that large amounts of complex data sets might be collected.

• Web Observation The Web observation should be the best solution for collecting data of a specific “Web Resource”. The Web resource has a big impact on how the Web observation is constructed. The setup of building blocks has to be adjusted to the needs of the Web resource. The application code is determined by the chosen programming “Language”. Some languages work within or make use of a “Framework”. Different HTTP “Requests” are sent to the “Web Resource” in order to get to the encapsulated data. Finally, a parser for proper “Parsing” is used to store data for the “Server” building block.

• Web Resource The “Web Resource” is the actual target for the data collection process of the Web observation. Whether it is HTML, XML or another file format, the “Web Resource” must be thoroughly analysed for the best solution of the “Web Observation”. After an analysis, a “Web Resource” may turn out to have an open API or another data source to make use of.

The modules outlined in each block represent the flexibility and the large number of options, which in turn depend on the Web observation task of collecting data from a Web resource.

Container Architecture

The resulting final container architecture provides the solution that delivers Web observations for all sorts of Web resources. For each Web resource, the developer chooses the blocks that fit a Web observation task best. Since all containers run on one single host server, communication between the chosen components is possible. Moreover, the linking of containers facilitates the sharing of containers between different Web observation solutions. Linked components are isolated but can also be made available to other containers for further processing. Web observations can use different containers for their tasks. A composition of containers in combination with Web mining programming code makes up a Web observation. The host server runs Ubuntu 16.04 with the container software “Docker Community Edition 17.12.0” pre-installed and running.

The modularity of this system allows greatest flexibility for the implementation of Web observations. The following pseudo-code template uses modular blocks for a Web observation task:

//pseudo code

//server block

//defines container with operating system, database, webserver, and visualisation library
use container_A-OS-database-webserver-library_visualisation

//web observation block

//library import
import library_A
import library_B

//settings
webresource = "https://www.web-resource.com/data_API"
databaseAccess = {
    username: "username"
    password: "password"
}

//web observation interval
timeInterval = every 2 hours

//data collection
function collectData() {
    //request to webresource
    response = doRequest(webresource)

    //parsing process
    parsedData = doParsing(response)

    //database access and data storage
    connectDatabase(databaseAccess)
    storeData(parsedData)
}

//interval
startWebObservation(timeInterval, collectData)

Listing 6.1: Pseudo-code template of a Web observation in the building block architecture. The server block defines the operating system, database, additional Web server software and libraries for visualisations. The Web observation block consists of libraries for the actual Web observation application, including settings for data storage, the observation interval, and the actual target Web resource.

The following figure illustrates the final architecture:

Figure 6.4: Final container architecture using a container software. On the main server, the operating system and the container engine host several container images, each running a Web observation that targets its own Web resource. A “Server Management” component and a “Web Client” allow the user to create and edit containers, run and stop Web observations, and read or download the collected data.

A “Server Management” building block is set in place to streamline administration tasks such as the orchestration of containers and Web observations. Essential for this building block architecture is the enabling factor of a container software. It paves the way for an environment that provides greatest flexibility and links all building blocks together.

6.2.2 Web Resource and Web Client Blocks

The main activity in finding a suitable Web data extraction algorithm is analysing the “Web Resource” with the help of a “Web Client”. The outcome of this analysis has implications for all other blocks of the Web observation. The data source has to be revealed in order to set up a proper Web mining process. Any difficulties, e.g. data that cannot be retrieved automatically, must be resolved within these blocks; otherwise no proper solution can be achieved. The challenge is to find the right and relevant data for the Web observation within the “Web Resource”, which can be a time-consuming task. Unless the costs of getting to the data are too high, a prototype solution shall be the outcome. In conclusion, the “Web Resource” is the actual data source, and the “Web Client” or Web browser is used for the purpose of reverse engineering. Both form two external building blocks that are static parts of the architecture, whereas the Web observation needs only one accessible Web resource.

Web Development Tools

“Web Clients” provide the functionality of web development tools such as code debugging. A so-called Web developer tool does not foster the creation of websites; it actually gives the developer the necessary means for testing. Development tools are usually built into modern Web browsers. Web development tools pave the way for the versatile Web applications that browsers have to make use of. The most basic feature is the code inspector that allows an instant look into the website code. Another important feature for detecting data flows of a “Web Resource” is the network tab that highlights all sorts of network traffic. Moreover, it outlines the full communication between the “Web Resource” and the “Web Client”, for example XMLHttpRequest (XHR), JavaScript (JS), and Cascading Style Sheets (CSS). [188]

6.2.3 Web Observation Block

The Web observation consists of many blocks that make up the actual data observatory. In essence, the Web data collection task is an automated solution for the data collection. For each Web resource on the Web, a specific solution has to be set up in order to fulfil the Web observation task. There is no silver-bullet solution that would achieve the perfect data collection in every situation. That is the reason why a building block solution is perhaps the best way to facilitate a Web observation. The choice of the programming “Language” is determined by the structure of the “Web Resource”, which also influences the “Framework” used. In the end, it is the personal preference of the programmer that determines the selection. While the “Requests” stay the same, the code for creating these requests and for “Parsing” meaningful information for the “Server” may differ from solution to solution.

Language, Framework, Requests and Parsing Blocks

The choice of the programming “Language” depends on the “Web Resource”; however, the basic components may consist of equal parts. It is clear that the most basic code snippets can be reused. Still, each programming language has its own coding syntax and “Frameworks” in place. All these programming languages must be able to emit requests to “Web Resources”, which are then parsed into a machine-readable “Data Format”. Based on these decisions, the storage of data in the “Database” and the resulting data sets in a “Data Format” have to be dealt with. Time intervals also play a key role for the maintenance of a Web observation.

6.2.4 Server Block

The server block contains all server-specific software that is used for the Web observation. The server block must support different databases and Web servers, but also different operating systems.

Operating System (OS) Block

The OS block provides different operating systems with different functionalities for the “Web Observation”, always depending on which “Database” and which “Web Server” is running for the representation of the data in the “Visualisation”. Another option would be the use of cloud services that have similar capabilities. Yet, the costs of cloud services can hinder the operation. A common choice as the base OS are Linux operating systems.

Database and Data Format Blocks

The variety of database software is diverse. Generally, databases can be classified in three ways: first, by the type of content, second, by the application area, and third, by technical aspects such as their structure. Thus, there are both advantages and disadvantages to using one particular database software. Nevertheless, it must be able to store the data from the Web observation task. Predominantly, the “Web Resource” determines the “Database” software choice, which is derived from the mined “Data Format”. Complex requests in a Web observation may result in a more complex database. NoSQL databases perform better on large data sets, but relational databases have more functionalities to unravel data with complex queries (see also section 1.1.3 “Data Sets”).
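As a minimal sketch of the “Database” block, assuming a NoSQL document store such as MongoDB (used as an example in section 6.2.5) and its official Node.js driver, a parsed observation state could be stored as follows; database, collection and field names are hypothetical:

//Sketch: store one observation snapshot in a MongoDB document database
var MongoClient = require('mongodb').MongoClient;

function storeSnapshot(parsedData) {
    return MongoClient.connect('mongodb://localhost:27017').then(function (client) {
        var collection = client.db('webobservation').collection('snapshots');
        return collection.insertOne({
            observedAt: new Date(),   //timestamp of the observation state
            data: parsedData          //parsed content of the Web resource
        }).then(function () {
            return client.close();
        });
    });
}

storeSnapshot({ price: '42.50 CHF' }).catch(console.error);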

Visualisation Block

Data sets do not speak for themselves; it is necessary to visualise data in a certain way to express its meaning. In particular, data that holds valuable new information must be filtered out in order to be visually recognised. Dashboards in which data is seen in real time are common practice in business. The “Data Format” for visualisations has to match the programming language, e.g. when a JSON file format is expected. Visualisations may also need a “Web Server” for live data that is directly shown on a URL. This also has implications for which “Web Server” software must be installed in order to achieve real-time applications. Another advantage is the direct access to the visualisations for any Web user equipped with a browser.

6.2.5 Container System for Linking Building Blocks

The final challenge is the linking of each building block with its unique solution for a particular problem. A generic architecture must be found for the interconnection of blocks and for making quick changes of components in each block. The notion is to have the choice between different software, programming languages, frameworks, etc. and the ability to take the best solution for the Web observation task. A solution must be found that allows multiple software choices with different configurations. A possible solution for coupling all building blocks into a flexible setup is provided by a container system, e.g. Docker. Docker uses operating-system-level virtualisation to create containers. Containers are in an isolated state but share parts of the operating system, binaries and libraries. Docker itself advertises its containers as “freedom of choice” for “agile operations”. It is the flexibility to make use of different software and even legacy systems that makes Docker particularly interesting. In contrast to classical virtual machines, Docker does not need a fresh installation of an operating system. It rather isolates the necessary configurations of an application from the operating system. So most parts of a container run separately from their host. [118]

For the building block architecture it is useful to run multiple containers with pre-defined images. For a properly working container image the following points must be respected:

1. Definition of building block The definition of the building block helps to identify what should be achieved by the container. In case the Web observation receives a JSON file format, it makes sense to use a database with a similar document scheme, e.g. MongoDB.

2. Official images For common databases there are usually official images available. The hyperlink “https://hub.docker.com//mongo/” provides a list of all available MongoDB images. Perhaps more importantly, there is no need to always update the Web observation because of software updates; legacy software can be easily used.

3. Adjustment of settings file The settings file “Dockerfile” can be adjusted to the needs of each Web observation.

6.2.6 Conclusion

The architecture provides a general solution for Web observations. Considerable emphasis was placed on flexibility for all sorts of Web observation tasks. A building block architecture provides the extensibility needed for such a tool. While the ECA System focussed on detecting and dispatching events, the building block architecture is a viable solution for Web observations. By dividing the problem into building blocks, solutions can be reached bit by bit. It is even possible to add a new building block with distinct tasks. While design decisions are still important, the outlined architecture will possibly be more adaptive than other rock-solid solutions. As discussed in section 5.3 “Examination of Web Resources”, it all comes down to finding the exact location of data within a Web resource. HTTP is the enabler to get to the data, whereas the cost factors, time and level of difficulty determine the overall effort of a Web observation. The variety of Web technologies makes the choices even harder. Therefore, container-based and linked building blocks provide the flexibility of Web programming. The cost-benefit ratio of the infrastructure is another economic factor; open source software contributes to a cost-efficient architecture. Finally, the introduced architecture has to stand the test in further Web observations. This chapter described two distinct architectural approaches: the ECA System and the building block architecture. It also answers the second research question. The next chapter “Measurements” will indicate the limits of the building block architecture in terms of thresholds and ideal interval frequencies.

Part V

Results from Web Observations

Chapter 7

Measurements

[...] when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind [...] — William Thomson, 1st Baron Kelvin, Scots-Irish mathematical physicist and engineer (born June 26, 1824 – died December 17, 1907)

This chapter evaluates the building block architecture in terms of thresholds and interval frequencies of Web observations. These benchmark measurements will indicate the limit values of the building block system. Furthermore, the evaluation will determine the maximum interval frequency on the Web and its boundaries with the help of measurements from both the Intranet and the Internet.

7.1 Purpose of Web Observations

By using the outlined architectures, data sets can be obtained from the Web. Depending on the architectural design and the need of the Web observation, machine-readable data is generated for further analysis. From data sets, it is possible to derive conclusions about the real world that have significant importance to people. Web observations by themselves do not automatically produce new knowledge; it must first be extracted from the collected data. The final goal of every Web observation is to discover new information from the respective collected Web data. Furthermore, it is adequate to publish these data sets as long as no infringement of third-party rights occurs, e.g. privacy intrusion or personality violations (see also chapter 4 “Legal and Political Perspectives”). Based on the data set and its extracted knowledge, interested parties can draw their own conclusions. Specifically, Web observations can be used for the independent assessment of a Web service. One key problem that remains in any service is the inability of customers to verify the quality of a service. However, such verification can be provided by a Web observation. By searching for predefined HTTP responses, e.g. “200 OK”, exceptions within a service can be caught and also stored in the data set. In addition, a response time-out can easily be configured in order to make sure the website is reachable. Request tools provide such functions, e.g. a “response timeout”. As a result, the performance of a service can be measured by the availability and the transmission delay, determining the service uptime. Since many Web resources change over time, it can sometimes be important to catch a snapshot of the current state of the Web resource. With these saved states of a Web resource it is possible to outline the development of a Web resource and which part of the information is actually new information. Some information changes in a way that Web users do not understand, and they do not have the means to investigate the Web resource's behaviour. For example, news portals want to hook readers to their website and therefore specifically publish news articles sequentially. Getting more people to read the follow-up articles would possibly maximise the advertisement revenue generated from the additional page visits. Another example of information changes collected through Web observations would be the development of an encyclopaedia entry that is of particular interest. Thanks to the Web observation, changes in the content can be gathered and a message with the changes can be created. Consequently, by observing and making snapshots of the Web and its content, the analysis of these data sets may hold valuable insights about a Web resource and its progression.
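A minimal sketch of such an availability check is given below; it assumes needle's timeout options (open_timeout and response_timeout) and uses a placeholder URL:

//Sketch of a service availability check with a response timeout
var needle = require('needle');

var options = {
    open_timeout: 5000,      //abort if no connection is established within 5 s
    response_timeout: 5000   //abort if no response arrives within 5 s
};

needle.get('https://www.example.org/', options, function (error, response) {
    if (error) {
        console.log('Service unreachable or timed out:', error.message);
    } else if (response.statusCode === 200) {
        console.log('Service up, responded with 200 OK');
    } else {
        console.log('Service responded with status', response.statusCode);
    }
});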

7.1.1 Latency-Driven Web

For Web observations, it does not matter how often they have to iterate in order to gather Web data. The frequency is specifically adapted to the respective Web resource to get maximum output with minimal input. The actual time interval must be estimated before a frequency can be set. The minimum interval of a Web observation is usually determined by the Internet connection and its latency from the Web client to the actual Web server. The connection logically crosses multiple router hops and is also protected against distributed denial-of-service (DDoS) attacks. This protection and other aspects of the Internet, such as the current flow of data, may prevent a Web observation with very low intervals from performing successfully. In the Intranet, by contrast, very few network hops make Web observations with very low intervals possible.

7.2 Experiments

Given the latency of the Web, the interval frequency will reach the limits of the Internet. Nevertheless, the reaction times of the building block architecture remain unclear. Different interval frequencies should cause varying reaction times of Web observations. For that reason, two experiments have been created to obtain the necessary measurements. The first experiment is set in the Intranet, with one switch between the Web resource and the Web observation that uses the building block architecture. The second experiment takes place on the Internet between a Web resource and the Web observation, again using the building block architecture.

7.2.1 Setup

For the experiments, a Web observation and an HTTP server (both in Node.js) have been created and made available on the Intranet and the Internet. The scenario has been chosen so as to have control over both sides, the server and the Web observation. Issues on the server side, such as blocked requests, were excluded, so the experiment could focus on the speed of the transport routes. The Web observation uses the building block architecture to perform a fixed amount of 550 HTTP requests with various interval frequencies. The fixed number of 550 requests per Web observation has been selected for practical reasons because all interval frequencies (500 to 0.05 milliseconds) can be handled in under 10 minutes. Initial tests have revealed that this fixed number did not significantly alter the results. By using the building block Web observation architecture, request and response times are added up and measured together.
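A sketch of how such a run could be driven is shown below; it uses setInterval with a counter, a placeholder URL, and the needle library. Note that setInterval cannot resolve sub-millisecond frequencies, so this is only an illustration of the principle and not the exact driver used for the measurements:

//Sketch: issue a fixed number of requests at a configurable interval frequency
var needle = require('needle');

var url = 'http://web-resource.example/'; //placeholder URL
var intervalMs = 30;     //interval frequency under test
var maxRequests = 550;   //fixed number of requests per run
var sent = 0;

var timer = setInterval(function () {
    if (++sent > maxRequests) {
        clearInterval(timer);
        return;
    }
    var t1 = Date.now();
    needle.get(url, function (error, response) {
        if (!error && response.statusCode === 200) {
            console.log(Date.now() - t1); //request/response time in milliseconds
        }
    });
}, intervalMs);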

The two experiments are illustrated as follows:

Figure 7.1: Experiment 1, Web observation over the Intranet: HTTP requests are sent from the Web observation to the Web resource, which answers with HTTP responses.

Figure 7.2: Experiment 2, Web observation over the Internet: the same exchange of HTTP requests and responses between the Web observation and the Web resource takes place over the Internet.

For both experiments, the building block architecture is used for the evaluation. The speed can be benchmarked in a best-case scenario over the Intranet with almost no traffic and in a real-life scenario over the Internet with traffic and several routes. The output from these measurements will be a limit value or threshold from which a system boundary can be derived.

The building blocks are structured as follows (see also section 6.2 “Building Block Architecture”):

Figure 7.3: Detailed view of the building block architecture, which represents the actual Web observation (green: Language, Framework, Requests, Parsing) and the Web resource on the server (red: OS, Web Server, Database, Data Format, Visualisation).

For the evaluation, various Web observation interval frequencies indicate at what interval frequency the building block architecture will still be able to handle requests and responses. These measurements will outline the performance of the system on the Intranet, but also on the Internet with many possible network routes and hops. This indicates the performance boundaries of the system. All interval frequencies from 500 down to 0.05 milliseconds will be tested and measured. The resulting time corresponds to the duration from start to end of the Web observation. In both experiments, the whole process is measured from the initial gathering of information from the Web resource (HTTP request) to the storage of that respective information on the local machine (HTTP response).

The following code snippet shows the JavaScript application implemented within the building block architecture in both experiments:

//JavaScript code snippet

//library imports
var needle = require('needle');
var cheerio = require('cheerio');

//target Web resource of the measurement (placeholder URL)
var url = 'http://web-resource.example/';

//counter variable
var i = 0;

//store current system time in variable 't1'
var currenttime = new Date();
var t1 = currenttime;

//HTTP request
needle.get(url, function (error, response) {

    //if HTTP status code '200' OK
    if (!error && response.statusCode == 200) {

        //load HTML body 'response.body' in cheerio and
        //return it to variable 'htmlContent'
        var htmlContent = cheerio.load(response.body);

        //Web observation data
        var WebObservationData = htmlContent('body').text().trim();

        //if WebObservationData is not empty
        if (WebObservationData !== '') {

            //store current system time in variable 't2'
            var aftertime = new Date();
            var t2 = aftertime;

            //output the time difference
            console.log(t2 - t1);

        } else {

            //mark the empty result with the iteration number
            WebObservationData = 'empty: ' + i;
        }

        //exit after 550 iterations
        if (i > 550) {
            process.exit();
        }

        //increase counter variable
        i++;
    }
});

Listing 7.1: Code snippet of the Web observation for the measurements. The application measures the time difference from the moment a request is issued (t1) until a response is received (t2). This results in the actual time cost until Web data can be gathered through the building block architecture.

7.2.2 Results

From the given measurements, no matter what time interval has been set, excessive requests in the Intranet do not extend the time of requests and responses. Measured times keep up the pace even with higher interval frequencies. As outlined in the figure below, the interval frequencies of 250, 90, and 1 milliseconds show the following result in the box plot (maximum, third quartile, median, first quartile, minimum):

Figure 7.4: Web observation interval frequencies of 250, 90, and 1 milliseconds on the Intranet. The measurement outlines a time range between 2 and 35 milliseconds.

By testing all interval frequencies from 500 down to 0.05 milliseconds in the Intranet, it can be stated that no significant change in the times could be observed.

The Web observation task performed similarly, even if the Web observation interval is set to 0.05 milliseconds:

Figure 7.5: Web observation interval frequencies from 500 to 0.05 milliseconds on the Intranet. All frequencies could finish their Web observation task and could successfully deploy and process 550 requests.

In the Intranet, the duration of a Web observation varies from 2 to 38 milliseconds. Interestingly, all requests were successfully processed. Responses could be triggered even with an interval frequency as low as 0.05 milliseconds. Yet, considering the fact that Web observations are generally thought to be applied to a Web resource on the World Wide Web, it is obvious that the results on the Internet must be different. Logically, a Web observation that is conducted with the building block architecture over the Internet performs with significantly higher request and response times than over the Intranet.

As outlined in the following figure, intervals below 90 milliseconds significantly increase the total time consumption of the Web observation:

Figure 7.6: Web observation interval frequencies of 90, 30, and 1 milliseconds on the Internet. A high jump between 30 milliseconds and 1 millisecond can be observed.

One possible reason for this jump in times could be the latency of the Web, which includes additional Web traffic. In contrast to the Intranet experiment, the lower interval frequencies do not behave alike. From these measurements, it can be inferred that beyond a certain interval frequency, measured times increase or runs finish without proper results. The HTTP request has been sent but no HTTP response is received (time-out). It seems that high request frequencies are blocked on many network devices of the Internet for obvious reasons, e.g. DDoS protection. A sharp drop of performance in total time consumption can be observed from 30 to 1 milliseconds.

As shown in the figure below, the Web observation task runs well up to an interval frequency of 30 milliseconds:

Figure 7.7: Web observation interval frequencies from 500 to 0.05 milliseconds on the Internet. The frequencies of 1, 0.5, 0.1, and 0.05 milliseconds could not finish the task and timed out without notice before 550 requests had been deployed.

Interval frequencies below 30 milliseconds increase the total time consumption to 3500 milliseconds. Even worse, most of the requests time out. Most importantly, the architecture of the Internet prevents interval frequencies below 30 milliseconds. Such tasks time out without any error message from the Web resource itself. One could expect an HTTP status code 429 “Too Many Requests”, but this did not happen in the test environment. Most likely, network devices block excessive requests, which prevents an HTTP response. Therefore, no data could be collected. It seems that high interval frequencies are instantly blocked by the Web server or the network devices when using the building block architecture over the Internet.

7.2.3 Conclusion

From the results of these benchmarks, it can be deduced that the interval for Web observations must be higher than 30 milliseconds. By adding a safety margin of 20 milliseconds, a frequency of 50 milliseconds is the maximum for a Web observation interval frequency to be successfully deployed with the building block architecture. In a perfect environment on the Intranet, better results can be achieved. The delay in the data flow between the Web server and the Web observation, also called latency, and other factors of the Internet do not allow more requests. It is clear that multi-player online games are not in the scope of Web observations. Players feel distracted by a network latency between client and server above 25 milliseconds. Therefore, the Web observation building block system is not suitable for supporting measures in online games. This chapter outlined an evaluation of the building block architecture with the intention to indicate limits or maximum interval frequencies. The next chapter “Observation Scenarios” will illustrate five Web observation scenarios.

Chapter 8

Observation Scenarios

It’s not about charisma and personality, it’s about results and products and those very bedrock things that are why people [...] are getting more excited about the company and what Apple stands for and what its potential is to contribute to the industry. — Steve Jobs, Co-founder, Chairman, and CEO of Apple Inc. (born February 24, 1955 – died October 5, 2011)

This chapter illustrates five different Web observation scenarios in order to evaluate the flexibility of either the ECA System or the building block architecture. The scenarios were specifically selected so that meaningful data could be collected, e.g. movement of geolocations, time differences in schedules, and new information. The diversity of scenarios demonstrates the flexibility of the architectures under scrutiny. The scenarios outline gathered Web data that would otherwise be irrevocably lost. They are perfect examples of what kind of information a Web observation can collect. Many other observations have been undertaken; this selection consists of the most promising projects. First, data of a free-floating car sharing service is presented. Second, a public transport operator's performance is shown. Third, news articles and the comments thereon are in the scope of the Web observation. Fourth, price changes of an airline over the time until the flight day are of interest. Fifth, the usage of an online messenger might give far more insights into a personal life than expected, all solely based on Web data that would otherwise evaporate.

8.1 Catch a Car

Catch a Car is a free-floating car sharing service that was launched in Basel in the summer of 2014. Since November 2016, Catch a Car is also available in Geneva. The idea of the service is to provide cars within a specified zone of a city or region for car sharing purposes. The term “free-floating” defines cars that can be left at any given end point of a journey (in a car park) if that position is within the pre-defined perimeter. [189]

Infobox — Architecture: Building Block System; Data format: XML; Filtering: Yes; Frequency: 7.5 min; Login: No; Uptime since: May 19, 2016; Volume: 7,315.2 KB per day; # of requests: 192 per day

For the Web observation of this service, the first steps included the identification of the Web resource (www.catch-a-car.ch) from which information should be collected and the kind of data that should be gathered. A thorough analysis of the Web resource as described in chapter 5 “Considerations for Web Observations” will most likely lead to the data sources relevant to the Web observation. Fortunately, Frei (2016) developed a method to extract data from the Web resource of Catch a Car, which has been adapted by the author to fit this Web observation into the building block architecture. [190]

8.1.1 Web Observation Task

The eye-catching characteristic of the Catch a Car service is the provision of a map with all available cars visibly attached. For the user base this feature is beneficial because it signals all cars that are ready for rental. Thanks to this broad disclosure of location information, it is technically possible to record all movements of cars and determine the usage rate of the service.

Figure 8.1: Screenshot “www.catch-a-car.ch”, the map outlines the perimeter in which cars can be parked on public ground. Car rentals must start and end within this zone. [191]

This is an example in which data is shared for a user experience, however, the shared data may tell much more than desired. An even more detailed view can be found by clicking on a specific car on the map. This reveals not only the exact street name and number where the car has been parked, but also the serial number of the car and its remaining level of gasoline. These are useful data fields to get a better understanding of the service.

Figure 8.2: Screenshot “www.catch-a-car.ch”, detailed view of one car of the service. [191]

From the picture above, the fuel level can be determined; it is indicated as “48%”. From the specifications, it can be learned that the car model “VW up!” has a standard 35 litre fuel tank [192]. A simple calculation reveals that the car has 16.8 litres in its tank and that 18.2 litres have already been consumed by car rental customers. However, more questions arise in regard to the service usage and its efficiency and profitability. A Web observation shall give answers to the following aspects:

• Usage of car sharing;

• Number of car rentals;

• Time used by each car rental;

• Full and single distance driven by each car rental;

• Fuel consumption of single rental and all rentals taken together;

• Total car rental time;

• Distribution of rental cars within the perimeter;

• Hot and cold car rental spots.

Data of Catch a Car is freely accessible on the Web with a Web client. Access to these data sets can be gained, with some technical understanding, from the HTML code of Catch a Car. As long as data is shown in the Web client, data can be collected and stored for other purposes. This makes it possible to gather states of the service in a data set from which the actual usage of Catch a Car may be interpreted.

8.1.2 Verifying the Results

A Web observation may collect all data shown in the browser. Assuming that the collected data set is correct, calculations can be made about the car sharing service. It is thus possible to provide real-time data from the respective latest data collection, e.g. the number of current car rentals, available rental cars and the total number of available rental cars.

Geographic Coordinate System

In case the rental cars are not visible on the map, they are currently rented out or are most likely in maintenance. With the help of the geographic coordinate system, the geolocations of each car and their exact parking locations can be made visible. Based on two precise values, a location can be determined. The coordinates 47°33'35.4" N, 7°35'58.8" E point to the city of Basel and are collected from one of the Catch a Car data sets. The author could verify that the data point corresponds to an actual car parked in the city: “X: 47.559833, Y: 7.599665”. To figure out which decimal digits of the coordinates are relevant, a visualisation might be helpful:

Figure 8.3: Examination of geolocations. A detailed view of a car park overlaid with the satellite view. [193]

In the screenshot above, the satellite view is overlaid on the coordinates and a parked car is indicated in red. Based on the coordinates, it can be said that the location is actually a car park and therefore seems to be a correct value of the Web observation. Since the cars are spread across the whole city of Basel, it matters where exactly cars are parked. On examining the supplied coordinates, the following points frame the car in the car park:

1. Point 47.5598, 7.59962;

2. Point 47.55983, 7.5996;

3. Point 47.55987, 7.5997;

4. Point 47.55984, 7.59971.

The X and Y values of these coordinates suggest that it is reasonable to round the decimal digits; doing so still clearly identifies a vehicle within a car park. Looking at the X values, they all share the same first four decimal digits: “47.5598”. Thus, the X coordinates can be rounded to four decimal digits. For the Y values, in contrast, the fourth decimal digit already takes two different values. However, using either “7.5996” or “7.5997” the car park would still be exactly identifiable. If absolute certainty is preferred, the Y coordinates can be rounded to the fifth decimal digit.

Figure 8.4: Rounded data set. X values rounded to 4 decimals and Y values to 5 decimals.

From this examination, it can be deduced that every available car of the car sharing service corresponds to both the collected Web observation data sets and a physical car park. Rounding the X coordinates to 4 decimals and the Y coordinates to 5 decimals provides sufficient information for identifying the location, e.g. for “X: 47.55983321, Y: 7.59966535”.
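A minimal sketch of this rounding step, using the example coordinates from the text:

def round_location(x, y, x_digits=4, y_digits=5):
    """Round a coordinate pair so that it still identifies a single car park."""
    return round(x, x_digits), round(y, y_digits)

# Example value from the text above.
print(round_location(47.55983321, 7.59966535))  # (47.5598, 7.59967)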

Test Drives

The resulting data set of the Web observation must be validated. In this particular case it is possible to rent a car and then search the data set for the corresponding entry. This verifies whether the Web observation collects all states of the car sharing service. In self-experiments carried out on August 10, 2017, and thereafter, the author undertook several test drives and identified both the rental car used and the specific ride afterwards in the data set:

Figure 8.5: Self experiment of the Web Observation with the car number “050” on August 10, 2017.

Figure 8.6: Rental drive from A to B of car number “050” on August 10, 2017.

The Web observation can be considered complete only if the test drives are actually stored in the collected data sets. More than 20 test drives were conducted and all of them could be matched to information found in the data sets collected by the Web observation. Therefore, the Web observation captured all necessary states of the free-floating car sharing service “catch-a-car.ch”. Collections of observation states are made four times per hour: at minutes 0, 15, 30, and 45.

Hence, it is possible that some rental drives are not collected because of the observation interval of 15 minutes; in such a case two drives are collected as one. For example, if a rental drive ends at minute 58 of an hour and the car is taken over for a new rental drive at minute 59, these two rentals would be collected as a single drive. It would take another 15 minutes to recognise that the car has been moved, and the fact that there were two rental drives would be lost. However, since all performed test drives could be found in the collected data, the likelihood of such an occurrence is fairly small. A first look at the data shows that the cars’ rest periods are much longer. Even if a rental drive took only 10 minutes, it takes other customers at least 5 minutes to learn that a new car is available in their area and, in addition, a considerable amount of time to walk to the car. So 15 minutes seems to be a reasonable interval frequency for this kind of service. In addition, the mileage driven corresponded to the observed values with a discrepancy of approximately 0.3 miles (0.5 kilometres). The reason for this difference is that the distance is computed as the shortest way from point A to B given by the Google Maps API, whereas the actual path chosen also includes a turn around the block for placing the rental car in a car park. This leads to the discrepancy in the test drives observed by the author. The real-world values must therefore be assessed carefully and may only serve as an indicator.
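The derivation of rental drives from the 15-minute snapshots can be sketched as follows; the column names are assumptions about the collected data set, and a drive is simply defined as a change of a car's coordinates between two consecutive observations (which is why two back-to-back rentals within one interval collapse into one drive):

import pandas as pd

def detect_drives(df: pd.DataFrame) -> pd.DataFrame:
    """Derive rental drives from snapshots with columns observed_at, car_id, lat, lon."""
    df = df.sort_values(["car_id", "observed_at"]).copy()
    # A car is considered moved if its coordinates changed by more than roughly 10 m.
    moved = df.groupby("car_id")[["lat", "lon"]].diff().abs().sum(axis=1) > 1e-4
    drives = df[moved].copy()
    drives["start"] = df.groupby("car_id")["observed_at"].shift()[moved]
    drives = drives.rename(columns={"observed_at": "end"})
    return drives[["car_id", "start", "end", "lat", "lon"]]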

8.1.3 Interpretation of Data

To make sense of the data collected by the building block architecture from the Web resource of Catch a Car, several methods of analysis may be applied. After having randomly tested several rides, it can be assumed that the data set is correct. It should be noted that the data set is composed of the start (A) and end point (B) of a rental car; it therefore basically describes the shortest path. From this data, it cannot be clearly identified which roads the driver took or who used the rental car at the time. However, the fuel level indicates the usage, allows assumptions on statistical outliers and enables an overall calculation of the car rental business. The number of kilometres driven should correspond to the fuel consumption. Users are free to fill up the car but are not required to do so themselves. From the data set, all cars belonging to the Basel section of the service have to be identified first: the service also operates in Geneva based on the same architecture, so the Web observation collected data from the Geneva service as well.

Mapping the geolocations for Basel shows that the service made use of a total of 123 cars. Plotting one month of data for car number “001” results in the following figure:

Figure 8.7: All 44 rental drives in January 2017, car number “001”. X-axis corresponds to longitude and y-axis to latitude.

One limitation of the figure above is that it does not show the actual position of a car in relation to the map. Adding a Google Maps layer significantly increases readability, and the system perimeter of Catch a Car (in turquoise) is very helpful for a first insight into the car rental movements:

Figure 8.8: All 44 rental drives in January 2017, car number “001”, overlaid map layer. X-axis corresponds to longitude and y-axis to latitude.

This sample of the rental car “001” shows the start and end points of the car’s journeys. A more sophisticated calculation will now be outlined for the whole year of 2017. This will provide further information on how the service is actually used over a longer period of time. Since the data set contains a start and an end point per drive, the number of entries has to be divided by the factor 2:

# Python
# Number of total car sharing rides
RentalCount = 0
for i in CarID:
    if len(df_car[i]) == 0:
        # Car without any recorded entries
        print('Number of Car Rentals ' + i)
        print(round(len(df_car[i]) / 2))
    else:
        RentalCount += round(len(df_car[i]) / 2)

Listing 8.1: Calculating total car rentals.

For 2017, facts and figures of the Web observation can be calculated from all rental drives. The total rental time is the time during which a car was rented. Given the observation interval of 15 minutes, the calculations indicate an estimated 40,451 hours of total rental time for all cars in Basel; this value is an estimate based on the observation interval.

Total number of car rentals:   55,473
Average rentals per car:       451
Total km driven:               79,937 km
Average km per car:            649.89 km
Average km per ride:           1.44 km
Total fuel consumption:        4,796.2 l
Total gasoline costs:          8,393.40 CHF
Total rental time:             2,427,063 minutes
Total rental revenue (min):    857,225 CHF
Total rental revenue (max):    995,095 CHF

Listing 8.2: Main output of the analysis.

The total number of kilometres driven reduces the fuel level, which is also stored in the data set, and it additionally correlates with the data field “EstimatedDistance” created by the car sharing service itself. This cross-check makes the author more confident that the calculation is accurate. Based on all car rentals, it can be said that cars are predominantly used for short distances. Perhaps Catch a Car counts two consecutive rentals of the same car by the same person as only one rental drive. This could to some extent explain the mismatch between the officially stated distance (7 km) and the value collected by the Web observation (1.5 km) per rental drive [194]. However, this explanation does not fully close the gap of 5.5 km. It would therefore be beneficial if the service Catch a Car published their data to explain this difference. Overall, the statement of Catch a Car raises serious doubts with regard to the average distance of each drive. Moreover, these values allow an estimated revenue to be calculated. Based on these values and the given assumptions, a maximal revenue of CHF 1 million can be achieved. The 55,473 movements constitute a robust result of the Web observation. From the movements, an average rental time per car can be calculated, which results in an average usage of approximately 30 minutes per day.

Average rental time per car:
2,427,063 min / 123 = 19,732 min = 328.87 h
328.87 h / 649.89 km = 0.50 h = 30.36 min

Listing 8.3: Average rental time per car.
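The revenue estimate mentioned above can be reproduced with a short sketch. The per-minute tariffs used here are assumptions introduced purely to illustrate the calculation; they are not the official pricing of Catch a Car and do not exactly reproduce the figures of Listing 8.2.

# Total rental time observed in 2017, in minutes (Listing 8.2).
total_rental_minutes = 2_427_063

# Hypothetical lower and upper per-minute tariffs in CHF (assumption).
tariff_low, tariff_high = 0.35, 0.41

revenue_min = total_rental_minutes * tariff_low
revenue_max = total_rental_minutes * tariff_high
print(f"Estimated yearly revenue: CHF {revenue_min:,.0f} - CHF {revenue_max:,.0f}")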

The analysis does not exclude maintenance and refuelling trips by Catch a Car – these are recorded as rentals. It might be possible to identify refuelling with the help of the fuel level, by excluding drives that end with 100% of fuel in the tank. Additionally, the data also shows at what times the service is actually being used. The measured values clearly highlight service peaks between 6 and 9 pm and indicate the nature of use:

Figure 8.9: Average usage of the service during the day.

These percentages outline the usage of the car rental service throughout a day. They highlight that the service tends to be used more often during the usual leisure time (6-12 pm), which accounts for more than a third of all car rentals. By plotting all parking times on top of each other in a heat map, the colouring indicates different parking periods per car: if a car is parked for more than a day it is coloured red, and if it is parked for less than a day it is coloured green. Interestingly, vehicles parked in the outskirts of the city are much more likely to be left for more than a day than cars parked closer to the city centre.

Figure 8.10: All rental drives in 2017, coloured by the lowest standing period (green) and longest standing period (red).
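The colouring rule of the heat map can be expressed as a short sketch; the column names of the drives table are assumptions, and the last drive of each car, which has no follow-up rental, is simply treated as green here.

import pandas as pd

def standing_periods(drives: pd.DataFrame) -> pd.DataFrame:
    """Compute standing time between two drives and the heat-map colour per entry."""
    drives = drives.sort_values(["car_id", "start"]).copy()
    drives["next_start"] = drives.groupby("car_id")["start"].shift(-1)
    drives["standing_h"] = (drives["next_start"] - drives["end"]).dt.total_seconds() / 3600
    # Red if the car stood for more than a day before the next rental, green otherwise.
    drives["colour"] = ["red" if h > 24 else "green" for h in drives["standing_h"]]
    return drives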

The data set may provide even more analyses and answers regarding the usage of the free-floating car rental service; it evidently holds further valuable information. The full data sets are available for download in the author’s GitHub repository [102]. The data gained from Catch a Car is very insightful: it not only provides the movements of cars but also enables assumptions about the course of business, e.g. the estimated total revenue. The key to analysing this service is the long-term data gathering. Moreover, from this very example real-time data such as moving rental cars can be highlighted. Thanks to the cron jobs, which schedule commands on the server, and the building block architecture, the Web observation was able to run smoothly for over a year.

8.2 Public Transportation Data

Infobox
Architecture: Building Block System
Data format: HTML
Filtering: No
Frequency: 30 sec
Login: No
Uptime: Mar 31 - May 18, 2017
Volume: 20,736 KB per day
# of requests: 2,880 per day

Transport operators all over the world have started to publish data of their services in real-time on the Web, e.g. the San Francisco Metropolitan Transportation Commission [195] or Transport for London [196]. These data sets include arrival, departure and delay times for all stations and hold useful insights for a better understanding of how transportation systems work. In other words, they may hold valuable information for commuting within the public transport system. Unfortunately, not all transportation operators openly publish their data, even if they are publicly funded. Although unstructured data is sometimes available on the Web, making use of it tends to be more sophisticated. If neither data sets nor interfaces are available, users without technical skills are kept in the dark. Such Web content evaporates after a few minutes and is irrevocably lost. To prevent this, a Web observation may help to preserve the data of transport operators from evaporation.

8.2.1 Web Observation Task

The local transport operator in Basel, Basler Verkehrsbetriebe (BVB), is a perfect example of a public company whose Web service does not provide collected data sets, but only an online interface for customers with live data. The operator serves a small, dense transportation network of 113 miles (181 km) if all bus and tram lines are taken together [197]. The Web resource of the transport operator outlines information about all stations, including the information boards of all buses and trams that move in and out of a station. However, it would take an enormous amount of effort to check all this information manually. No history or statistics are kept available for the public; departure times of connections are gone as soon as time has passed. Ultimately, the data, in a metaphorical sense, evaporates and is lost, without a possibility to obtain it from the Web page at a later stage. Thus, it would be beneficial to observe the delay times of trams.

8.2.2 Verifying the Results

A Web observation may collect all data shown in the browser. By assuming the collected data set is complete, statements can be made about the transport operator’s service. Real-time data can thus be provided from the latest data collection, e.g. current delays at a station, the performance of the network and perhaps possibilities for optimising the schedule. For verification purposes, approximately 100 tram passings were personally observed by the author at the station “Barfüsserplatz” and compared with the data collected by the Web observation. Based on this verification, it can be deduced that the Web observation matches the actual station board with an accuracy of 95%. The missing 5% are due to the collection interval, which was set between 1 and 3 minutes: time differences of less than a minute are not caught. Five trams did not match the station board’s indication at all, and one tram actually departed earlier than recorded. Each tram arriving at the station “Barfüsserplatz” with at least one minute of delay is observable as a state; it is therefore collected and stored in the actual Web observation data set. For a duration of approximately three weeks, the Web observation extensively stored states of trams passing this particular station.

Criticism

The Web observation collects data from the local transport operator BVB that is shown to passengers on station boards and is also available online. This entails weaknesses in the method built into this Web observation application. Even if a delay of 2 minutes is indicated on the station board or on the Web, the actual delay may be anything above 1.5 minutes that was rounded for practical reasons by the operator. Since the rounding threshold is unknown, a rounding error margin may distort the result. Yet, the transport operator BVB has kindly disclosed their own data for the Web observation period, which enables a comparison between our data and the official data (April 26 - May 18, 2017). There are two separate systems for measuring incidents. The transport operator provides information on station boards about passing trams to inform passengers at the station; data from the station boards is based on rounded minutes. Additionally, an on-board information system in each vehicle provides measurements of the tram’s journey which are far more precise and recorded on a per-second basis. The transport operator defines a delay as more than 2 minutes behind the scheduled operation.

For an exact on-board measurement, the station “Barfüsserplatz” has four distinct measurement points [198]:

Figure 8.11: BVB measuring points of station boards in yellow at the station “Barfüsserplatz”.

The measurement points determine the direction of the tram: BA [1] towards “Marketplace”, BA [2] towards “Heuwaage”, BA [3] towards “University of Basel”, and BA [4] towards “Bankverein”. The disclosed data set reveals the measurement points of each tram line and also indicates the time delays. Using a delay threshold of more than one minute (> 1 min), a total of 576 hours of delay results from the raw data. Compared to the 437 hours of delayed trams recorded by the Web observation within the measurement time, the Web observation is thus more conservative in registering incidents. From the information disclosed by the BVB, it is not clear whether the station boards or the on-board measurement is correct. However, assuming that the on-board measurement is correct, the Web observation deviates by more than 10 percent and falls significantly short of the internal data of the transport operator BVB. One can therefore assume that the magnitude of incidents is correct, but that the observation tends to miss longer delay times. Even when defining a delay as more than 2 minutes (> 2 min), the delays measured on board of a tram would still amount to about 380 hours. [198]
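The threshold comparison described above can be sketched as follows; the column name delay_seconds is an assumption about the structure of the disclosed data set.

import pandas as pd

def delayed_hours(df: pd.DataFrame, threshold_min: float = 1.0) -> float:
    """Sum all delays above the given threshold (in minutes) and return hours."""
    delays = df.loc[df["delay_seconds"] > threshold_min * 60, "delay_seconds"]
    return delays.sum() / 3600

# delayed_hours(bvb_df, 1.0)  -> about 576 h according to the disclosed on-board data
# delayed_hours(bvb_df, 2.0)  -> about 380 h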

8.2.3 Interpretation of Data

With the help of a Web observation, a data set of station board states including arrival and departure times could be obtained. From the data, interpretations can be drawn, e.g. whether a tram line has more incidents than other lines. Conclusions can be drawn from this single station board to the entire network: if trams are delayed due to congestion, the chosen station “Basel, Barfüsserplatz”, as one of the major junctions in the city, will most likely be affected. Unsurprisingly, about half of all passing trams were registered with at least one minute of delay. As shown in the table below, 16,417 passing trams could be extracted over a time period of three weeks:

tram line   # delayed trams   %
1                    12       0.1
3                 1,870      12
6                 3,530      22
8                 2,163      14
11                1,930      12
14                2,611      17
15                1,238       8
16                1,700      11
17                  723       4

Table 8.1: Total delayed trams at the station “Basel, Barf¨usserplatz”.

The data for tram line 8 is conspicuous. It has the third highest incident rate with fairly long delay times and repeatedly shows delays of more than 15 minutes, which pile up in the afternoon and on Fridays and Saturdays. Tram line 8 is used for cross-border transportation to Germany; delays of this line may therefore be caused by the high number of people using it for cross-border shopping, especially on weekends. It might be stated that when many Swiss residents decide to do their shopping in Germany, tram line 8 is most likely delayed, both because of the number of customers to be transported and because it has to share the road with people crossing the border by car. This could be further investigated by evaluating trams that crossed the border against trams that switched direction and stayed in Switzerland. In addition to sharing the road, tram line 8 must cross the road twice at the border crossing “Weil am Rhein”, Germany. This results in delays due to congestion and is the most plausible reason why this particular tram line shows massive delays of, e.g., 55 minutes.

The longest delay observed was registered for line number 14, which had been delayed for 51 minutes when heading towards the outskirts of Basel; this was due to an accident on the same day that blocked the tram tracks. Interestingly, tram line 6 had the most incidents with more than 3,500 delays. Together with tram line 14, both lines are prone to frequent minor delays. Regarding the low delay percentages of tram lines 1 and 17, an explanation may be given by the return of trams to the depot in the case of line 1 (it only stops at the station when returning to the depot) and rush-hour-only operation in the case of line 17. Overall, tram lines 3, 11, 15 and 16 perform fairly well. All tram line delay entries taken together amounted to 26,253 minutes in only three weeks, or almost 437 hours of delay time, which corresponds to about 40% of all trams operating on the transportation network of Basel during the observation period. Conversely, this leads to the conclusion that 60% of all passing trams are on time and therefore not delayed.

Figure 8.12: The scatter-plot outlines incidents of tram lines in the observation period (March 31 – April 21). A major disturbance can be identified on April 5 (Tue 05.04). Depending on the selected line, performance is highlighted with a delay bubble. Best case scenario would be no colours at all. Colour-scale from green (delay = 1min) to pink (delay >15min).

Disturbances of the network traffic of line 3 are visualised for 21 consecutive days from 0:00 to 23:59. The plot accumulates delays at different times of day and weekdays; the reasons for these delays tend to be rush-hour related. A detailed monitoring of delays may give schedule planners the possibility to directly measure the outcome of their work. [79] Data of the local transport operator is of high interest to citizens, as it helps to understand current traffic movements on the public transport network. Due to a renewal of the Web resource, the Web observation had to be adjusted, which resulted in data gaps. One finding from this research project is that it is vital for long-term measurements to have a stable system in place that ensures data is collected and alerts the user if the collection of Web data fails.

8.3 News Article Comments

Infobox
Architecture: ECA System
Data format: HTML
Filtering: No
Frequency: 2 min
Login: No
Uptime: Sept 4 - 18, 2015
Volume: 43,200 KB per article
∅ # of requests: 720 per day

IT has a strong influence on the media: news is increasingly consumed online and the print media is losing readers. A Web observation could possibly give an insight into how journalists update their stories on news portals. Following this change in personal news consumption from print to online, online news portals are among the most frequented Swiss websites [199]. Online newspapers encourage readers to leave a comment. This option of expressing thoughts and opinions is widely used and can lead to online discussions. It can commonly be assumed that a reader leaves a comment on a topic that seems important to him or her or arouses his or her interest. These comments of Web users could serve as a basis for identifying hot topics that predominantly occupy the readers of newspapers.

8.3.1 Web Observation Task

News portals provide commenting functions in which users may leave a comment. Since these comments are available in clear text, a Web observation can observe them as soon as they are published. Questions arise as to whether there is a publishing pattern. A successful Web observation therefore gives answers to:

• Number of headline stories;

• Number of comments per news section;

• Number of comments per article;

• Number of articles blocked from commenting;

• Total number of articles;

• Topics of interest.

The comments may also serve as a basis for finding the topics that predominantly occupy the readers of a paper. Simonet (2015) has already assessed a method to extract data from the Web resource of a Swiss news portal, which has been adapted and used for the author’s Web observation [200].

8.3.2 Verifying the Results

Verifying the results is a fairly easy task. A simple alignment of the data set with the news portal reveals whether all comments have been collected. Interestingly, the Web observation gathered more comments than are actually available online. One explanation might be that controversial comments are removed from time to time even though they had been published before.
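This alignment can be sketched as a simple set comparison; the comment identifiers are assumed to be extracted both from the collected data set and from the portal at verification time.

def compare_comments(collected: set, online: set) -> None:
    """Compare collected comment identifiers with those currently visible online."""
    removed = collected - online   # published earlier but no longer visible
    missed = online - collected    # visible online but never collected
    print(f"collected: {len(collected)}, online: {len(online)}")
    print(f"removed after publication: {len(removed)}, missed: {len(missed)}")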

8.3.3 Interpretation of Data

In a first step, a Web observation was set up to observe the headlines of 6 news portals in Switzerland. By using a GitHub colour scheme, the level of headline changes could be visualised in the following figure:

Figure 8.13: Headline changes of the Swiss newspaper “Tagesanzeiger”. Top figure corresponds to the hours 0-23 per square starting from left to bottom, bottom figure corresponds to the hours 0-23 per square starting from top to bottom. Colours: less than 2 items; 2 and 4 items; 4 and 6 items; 6 and 8 items; more than 8 items.

Although the figure indicates hours with many headline changes, the individual headline contents are not yet known. These need to be examined to find out what led to the accumulation of headlines. The content of the headlines makes it possible to explain the figure as follows:

Figure 8.14: Hours 0-23 per square starting from top to bottom. Colours: less than 2 items; 2 and 4 items; 4 and 6 items; 6 and 8 items; more than 8 items.

Over a period of 20 days, comments were collected from the news portal “20 Minutes”. Across all sections mentioned, the news portal published more than 1,000 articles. For 594 articles the commenting function was activated. In total, around 50,000 comments were published by active users.

Section           # articles   comments allowed   # comments   ∅ comments/article
Switzerland           253          124              24,378          197
Foreign affairs       176           19               1,281           67
Economy               123           68               6,854          101
Sports                148          137               7,871           58
People                 68           37               1,315           36
Entertainment          92           50               2,326           47
Digital                74           67               2,574           38
Knowledge              83           64               2,182           34
Life                   53           28               1,070           15
Total               1,070          594              49,851           65

Table 8.2: Comments collected from categories on the news portal “20 Min- utes”. [200]

Most articles were published in the section “Switzerland”, followed by the sections “Foreign affairs”, “Sports”, and “Economy”, while fewer than 100 articles were published in each of the remaining sections. Interestingly, the articles with an activated commenting function show an unexpected picture. In the section “Switzerland” about 50% of all articles were open for comments, while only about 10% of the articles in the section “Foreign affairs” were; people also seem to comment less on foreign affairs. In the section “Sports” almost all articles were open for comments. In the remaining sections, the percentage of articles with an activated commenting function is higher than 50%, and remarkably high in the section “Digital”, in which 90% of the articles were open for comments. About half of the almost 50,000 comments were written on articles in the Switzerland section. The sections “Sports” and “Economy” follow with 7,871 and 6,854 comments respectively. The remaining 20% of comments are distributed among the six other categories with numbers between 1,070 and 2,326. Thus, the Switzerland section, with an average of about 200 comments per article with an activated comment function, is clearly the most commented category and therefore probably also the most read section. Here too, the categories “Economy”, “Foreign affairs” and “Sports” follow with around 100, 67 and 58 comments per article. Very few comments are published on articles in the

“Lifestyle” section. On average, the category only reaches about 15 comments per article.

Figure 8.15: Articles per section; commenting function on (blue), commenting function off (red). [200]

The Switzerland section is far ahead in terms of numbers: almost 25,000 comments were saved in this category. The most commented article, with 953 comments, is about commuting with the Swiss Federal Railways becoming more expensive in the future; the article describes a proposal for a new pricing concept designed to relieve public transport during rush hours. One possible explanation for the high participation is that many commuters read the newspaper on their smartphone during their commute. Also popular are articles about “drinking and driving”, “home ownership”, the “popularity of the Swiss Armed Forces”, and the weather forecast in general. Only about one fifth of the articles that can be commented on are assigned to the section “Switzerland”. Another fifth of the articles open to comments belongs to the section “Sports”, which counts almost 8,000 comments. The third most commented area is the section “Economy” with a total of about 7,000 comments. Accordingly, these three categories account for most of the comments over the time of the Web observation. For the Sports section it has to be considered that the articles with the most comments collected during the observation dealt with a scandal of the football association FIFA. Incidents of this kind are not commonplace and are not associated with traditional sports journalism. The results could therefore differ greatly from the usual numbers and should be treated with caution. In all other sections, one can assume that the numbers correspond to the usual user behaviour. From the collected data, three peak times per day can be identified in which many comments are published. The first peak time is roughly between 6 and 9 o’clock in the morning, followed by a second peak at noon and a third in the late afternoon and evening. Most comments are written in the first four hours after an article is published. Thereafter, the commenting activity flattens and remains at a constant level until 24 hours after publication. In the following twelve hours, the number decreases or, in some cases, the commenting function is even turned off by the news portal. After 36 hours, only a few new comments can be expected. Yet, the various discussions about the football association FIFA tended to be long-lasting: after about 20 hours, new comments were still being published. Perhaps the high participation in the discussion is a consequence of comments made on comments. Since a comment is moderated before the news portal publishes it, the timestamp on the comment does not correspond to the actual time it became available on the news portal. Generally, comments are published after a short period of time: around 65% of the comments were published within an hour, another 20% between one and six hours, and the remaining 15% up to 24 hours after the comment was written. On average, a comment is published after two hours and 40 minutes. For certain topics, however, comments seem to be published at a faster rate, for example in the sections “Switzerland” and “Economy”, where comments are normally published after two hours. In the sections “Digital”, “Lifestyle”, “Entertainment”, “People”, and “Knowledge”, it takes the news portal more than five hours to process the comments for publication. The results of the evaluation show that the majority of readers comment on articles in the section “Switzerland”.
The majority of comments are written within the first hours after the publication of an article. After two days, the commenting function is deactivated. In conclusion, the collection of comments generates a larger amount of data than expected: it is not only the comment itself that has to be stored, but also numerous flags attached to it. Likes, dislikes, and comments on comments are just a few of the additional data fields from which meaningful insights can be deduced. However, comments often allow various interpretations, which makes it hard to maintain an unbiased point of view.
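The publication delays reported above can be derived from two timestamps per comment: the time shown on the comment itself and the observation time at which it first appeared on the portal. The column names below are assumptions about the collected data set.

import pandas as pd

def publication_lag(comments: pd.DataFrame) -> pd.Series:
    """Return the lag in hours between writing a comment and its publication."""
    lag_h = (comments["first_seen"] - comments["written_at"]).dt.total_seconds() / 3600
    within_1h = (lag_h <= 1).mean() * 100
    print(f"published within one hour: {within_1h:.0f}%, average lag: {lag_h.mean():.2f} h")
    return lag_h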

8.4 Price Observation of EasyJet

Infobox
Architecture: ECA System
Data format: HTML
Filtering: Yes
Frequency: 6 h
Login: Yes
Uptime: July 10 - Dec 17, 2015
Volume: 256 KB per day
# of requests: 4 per day

EasyJet is an airline company flying to many destinations in Europe. The promoted price model of the company is based on the principle that the earlier a booking is made, the cheaper the tickets will be. But is that really so? If a Web observation is conducted, it should test whether prices indeed increase the closer the flight date comes.

8.4.1 Web Observation Task

Holidays or business trips are usually planned months ahead of the actual flight. Therefore, it seems beneficial to know when it is best to book flights; best in terms of cheapest. The examined study object EasyJet only has an online presence; it does not operate any kind of physical travel agency. Often, during booking it is highlighted that the ticket price will increase because “only a few seats are left”.

8.4.2 Verifying the Results

The Web observation may collect all data shown in the browser. It is assumed that the collected data set is complete; therefore, calculations can be made about the EasyJet price model. However, the collected prices cannot be verified against the service, since prices are constantly changing. Although EasyJet provides a tool to find low fares online, this tool only covers prospective flights and thus cannot be used for dates that lie in the past.

8.4.3 Interpretation of Data

The data set collected by the Web observation is based on several flights selected prior to the collection. The Web observation focused on flights sold between July 10, 2015, and December 17, 2015. Plotting all prices of these flights per day results in the following figure:

Figure 8.16: Observed prices from July 10 to December 17 (flight date). Red: flight Basel (BSL) to London Gatwick (LGW), blue: flight London Gatwick (LGW) to Basel (BSL). X-axis corresponds to the time while Y-axis corresponds to price in CHF.

Interestingly, there is no constantly rising trend for the flight Basel to London (red). The price development seemed arbitrary, and although certain trends could be made out in the data, prices fell back several times before rising again. A key observation is that flights particularly increased in price on Fridays and weekends. This may be explained by higher sales or clicks during these days; from the observation itself, however, no explanation can be given. Even more interestingly, prices tend to fall back to a certain level on Mondays and Tuesdays. The reason for this behaviour is not yet clear, but an explanation might be a manual reset to drive further sales. In contrast, for the flight back to Basel (blue), prices fell until December 5, when a sharp increase could be observed. In the days just before a flight, strongly increasing prices could be observed; the same observation could be made for several other flights. EasyJet warns customers that prices may change with two separate warnings: a “price alert” and an “available seat warning”. One would expect prices to increase after such warnings. Yet, in a monthly observation, a total of 426 price warnings were posted, but in 30 cases a price decrease could be measured. This raises questions about the actual price model of EasyJet and whether these warnings are in place to drive sales. In any case, the warning is an instrument to nudge customers into buying tickets, and it provides no actual information on the true number of seats taken on a specific flight. Ticket prices do have a tendency to rise substantially the closer the travel date comes. Additionally, and based on the observed fluctuations in the data, it can be concluded that prices rise significantly more on Fridays and weekends than on other days of the week. This gives an insight into EasyJet’s pricing model but may also reflect user habits.
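Checking how prices behave after a warning can be sketched as follows; the column names (observed_at, price_chf, warning) are assumptions about the collected data set.

import pandas as pd

def price_change_after_warning(prices: pd.DataFrame) -> pd.DataFrame:
    """Return all warnings that were followed by a price decrease."""
    prices = prices.sort_values("observed_at").copy()
    prices["next_price"] = prices["price_chf"].shift(-1)
    warned = prices[prices["warning"]].dropna(subset=["next_price"])
    decreased = warned[warned["next_price"] < warned["price_chf"]]
    print(f"warnings: {len(warned)}, followed by a price decrease: {len(decreased)}")
    return decreased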

8.5 WhatsApp Meta Data

Infobox
Architecture: Building Block System
Data format: HTML
Filtering: Yes
Frequency: 1 min
Login: Yes
Uptime: May 7 - 28, 2018
Volume: 94,320 KB per day
# of requests: 1,440 per day

The idea of taking a step towards communication applications was inspired by the book “Mining the Social Web” [201]. In the case of the messenger service WhatsApp, the only data available about other users are status messages such as “last seen”, “typing...”, and “online”. Otherwise, the information shared over WhatsApp is encrypted end to end between the respective users. To make use of the service, only a valid phone number and a minimum age of 16 are required (the age limit is an outgrowth of the European Union’s GDPR but can easily be circumvented by entering another birth date). The following is a screenshot of the WhatsApp Web page:

Figure 8.17: Screenshot “web.whatsapp.com”, WhatsApp Web interface.

8.5.1 Web Observation Task

WhatsApp provides a Web interface for sending and receiving messages. Since it is on the Web, a Web observation may collect data from the service. Questions arise as to what conclusions can be drawn from the data that is available on everybody who has a WhatsApp account about a real-life person. Perhaps the service has privacy issues in terms of personal data. Can a Web observation obtain sleeping times, daily routines and deviations therefrom, and perhaps identify communication patterns and other users? The following is a screenshot of the authentication Web page of WhatsApp:

Figure 8.18: Screenshot “web.whatsapp.com”, WhatsApp Web login page.

8.5.2 Verifying the Results

Without the informed consent of a subject regarding the collection of his or her messenger data, it is difficult to identify a plausible verification of the collected data. Since this Web observation shall not constitute a privacy infringement, the author observed only his personal WhatsApp account. With this knowledge, it is fairly easy to assess the quality of the collected data. For this purpose, a prepaid card with a Swiss mobile number was used to observe the author’s states such as “last seen”, “typing...”, and “online”. When knowingly collecting meta data of one’s own account, it needs to be considered that this consciously or unconsciously affects the author’s behaviour. With regard to privacy, this choice was made to avoid conflicts with the law. However, no obvious pattern change could be observed, which also led to the decision not to disclose this kind of data to the public in great detail. While observing his own account, the author at the same time also observed the communication flow between himself and his communication partners, which would also allow the indirect observation of all his communication partners. If such communication partner data were included in the collection, it would not matter whether the partner used the Web interface or the smartphone application.

8.5.3 Interpretation of Data

The collected data set might reveal online times and other behaviour, depending on the personal usage of the messenger. From the plain data it can be derived that the author used the service throughout the day with a slight increase in the evenings. With regard to the time invested in the usage of the application, it can be assumed that the author spent more time with the messenger service after he left work, which further supports the assumption that the author is employed in an office job and does not work in shifts. Approximate sleeping times could be clearly determined during night hours when no activity could be observed and no other communication partner was actively sending messages anymore. This approximation fails to identify the exact sleeping hours of the author; it merely indicates that the person is not active during these times. Nonetheless, daily routines and deviations thereof correlate with the usage. Even if the security setting “hide last seen” is enabled, the current “online” status is still visible. By monitoring the online states, the “last seen” status can easily be reproduced from the observation timestamps. [202, 203] Due to limited access to meta data, it is difficult to detect communication partners, but if the observer has a presumption regarding their online behaviour, an inference might be possible. Since the author and any other person with technical understanding could use this method to extract meta data on their own or on another person of their choice, questions about privacy arise.
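Reconstructing a hidden “last seen” value from the collected states can be sketched as follows; the column names are assumptions about the observation data set.

import pandas as pd

def last_online(states: pd.DataFrame) -> pd.Timestamp:
    """Approximate the hidden "last seen" value by the latest observed online state."""
    online = states[states["status"].isin(["online", "typing..."])]
    return online["observed_at"].max()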

The following figure shows the average of the author’s WhatsApp usage:

Figure 8.19: Average usage per minutes of a person per hour of the day. The Web observation has been conducted from May 7 – 28, 2018.

Daily behaviours and certain routines could easily be determined from this Web observation. An observation period of less than a month provided much more information than initially expected. Just from analysing this small amount of meta data, the author could start to outline his own behaviour patterns. If this Web observation were applied on a broader scale, even more would be possible: basic personality profiles could be determined without much effort, yet their publication could lead to severe privacy issues, e.g. intrusion into private behaviour. Therefore, the author refrains from fully disclosing the data sets collected through this Web observation and limits his review to this written reproduction of the work. To create a reliable daily schedule and personality profile of a user’s habits, a particularly long-term Web observation would have to be carried out so that all fluctuations due to weekends and holidays are included in a way that does not falsify the data. This allows the online times to be connected to the schedule and deviations to be determined. Conclusively, the measurement must collect data over a longer period of time; interruptions would weaken the results. This chapter illustrated five Web observation scenarios and proved their applicability to different settings. Moreover, it answered the third research question. The next chapter “Conclusion and Outlook” will outline the contributions of this work and draw a final conclusion.

Part VI

Conclusions & Outlook Chapter 9

Conclusion and Outlook

Basically, our goal is to organise the world’s information and to make it universally accessible and useful. Lawrence Edward Page, Co-founder and CEO of Google, (born March 26, 1973)

In this final chapter, the answers to the research questions and the relevant contributions are presented. Finally, a conclusion, its limitations and future work are discussed.

9.1 Contributions

This thesis presents research undertaken to facilitate Web observations. Chapters 1 and 2 provide a literature review of the topic, covering data aspects and the technologies that are made use of; these first two chapters form the theoretical and practical foundation of this thesis. Chapter 3 defines the research contribution and the research problem and refines the research questions based on the findings of the literature review. Chapter 4 outlines legal and political perspectives on data in order to answer the first research question. Furthermore, this thesis outlined the most important current legal and political perspectives that play a crucial role in the age of the Internet. Data collections have recently come to the attention of the European

Union, which led to the introduction of the GDPR. Thanks to its extraterritorial reach, it also has to be regarded by Switzerland and other non-member states of the EU, which is one of the reasons why many websites now openly inform their users about cookies. Chapter 5 discusses the issues that have to be addressed before data can be collected by a Web observation. The collection of states of Web resources is one of the key aspects of a Web observation. The steps towards a flexible and generally applicable Web observation are defined in an awareness model. As outlined in this chapter, the most focal point is the examination of the Web resource: with this in mind, a thorough understanding can be gained of what must be known and taken into consideration about the Web resource when programming the actual Web observation, in order to collect useful data. In chapter 6, two architectural and technical solutions for Web observations are presented, including the programming code of two distinct systems. The first is the Event Condition Action (ECA) System, which aims at reactive event detection but also includes Web observation capabilities; it enabled the programmability of events for planned reactions by using event rules, e.g. a push notification to a Web service or the polling of Web resources. From the ECA System it could be learned that the storage of data was a major problem, and both Web services with APIs and ordinary Web resources were tackled. With the intention of having a flexible architecture for the sole purpose of Web observations, a new architectural approach was needed. Container based software enables software packages from different operating systems to run together and provides a flexibility of choice that other environments cannot achieve. Eventually, this led to the development of the building block system based on containerisation, which indeed has the flexibility to deal with versatile Web resources. Conclusively, the building block system is recommended as the most flexible solution for versatile problems. Chapter 6 thereby answers the second research question. Chapter 7 presents benchmark measurements that reveal the limits of the building block system. These measurements showed the latency of the Web and proved that the Intranet can handle higher observation interval frequencies than the Internet; moreover, they indicate the maximal interval frequency of the Web. Chapter 8 presents five distinct examples of Web observations, their collected data sets and visualisations.

The following five scenarios are presented in detail:

1. Free floating car sharing service and the movements of cars;

2. Public transportation data of vehicles;

3. News article commenting of users;

4. Price observation of flights;

5. Meta data of the online messenger WhatsApp.

These scenarios highlight and describe data that is in constant flux. It is therefore interesting for a certain audience to sift through this data. The scenarios represent fine examples of suitable Web observations: meaningful insights can be gained from them and, in some cases, one’s own behaviour can be adapted to the identified patterns, e.g. when booking flights. Thus, the conclusion in chapter 9 answers the third research question. Finally, this thesis deals with developments in law and computer science up to August 2018.

9.1.1 Summary of Contributions

In summary, the main research contributions of this thesis are as follows:

• A literature review of data aspects, technologies for Web observations, legal and political perspectives;

• A definition of analysis tasks for the examination of Web resources that should be undertaken before a Web observation may actually begin;

• An overview of architectural and technical solutions used for various Web observations;

• A recommendation for an architecture for Web observations applicable over a longer period of time;

• A benchmark measurement of Web observations in order to determine their limits;

• A diverse set of Web observation scenarios based on two architectural approaches;

• Results from Web observations and its data sets, interpretation and visual representations thereof;

• A list of suitable Web resource scenarios for Web observations.

9.2 Conclusion for Web Observations

The creation of a Web observation involves many architectural design decisions that need to be considered for proper use. Several research projects with different designs have been conducted, and considering all the decisions taken, the building block architecture was created as the final research product. This architecture also reflects the exploratory path from single algorithms to the ECA System to a dedicated Web observation system, and ultimately to the final flexible building block architecture. Several scenarios were tested that gathered data from many different Web resources. Many data sets have been collected, evaluated and visualised to extract meaning; in the end, the data sets were published on GitHub [102]. Based on all the steps taken over the time of the study, the author was finally able to present a flexible and versatile architecture for Web observations that is able to perform data extractions from Web resources. The author considers the container based architecture to be the most reliable and flexible design for conducting Web observations, in contrast to single scripts and the ECA System. The main benefit of a container based architecture is its composition of single building blocks that allow an easy exchange of each block with regard to the Web observation project. This architecture was used in the research project on public transportation, whose collected data sets were published [79]. During this project, the author could easily adapt the building block architecture to the Web resource. This flexible adaptation of the architecture, which allows switching from one software to another, led to the conclusion that a container based building block architecture is the most suitable for a flexible and versatile Web observation. Using the container based Web observation, several results can be drawn from the collected data; however, before data sets and results can be published, their quality has to be verified. The container architecture is able to collect data from any given Web resource. In short, the building block architecture collects states from Web resources for which only the content creator can explain why updates occurred.

9.2.1 Web Data Evaporation

During the study, the author encountered several occasions on which the collection of data from a Web resource might be rather unwelcome. For example, Catch a Car does not want to publish any kind of data at all, claiming it to be internal business information. In the case of WhatsApp, the collected data might possibly infringe the privacy rights of a person being observed by a Web observation. That is the reason why the author decided against the publication of data sets that could infringe privacy rights. Nonetheless, from the outlined scenarios it can be stated that many more calculations can be made with the collected data, and many more conclusions about businesses and people can be drawn from it. Web technologies empower Web users to collect data from the Web, while the awareness of people for data collectors is still quite low. The collected data had not been available to the general public before, nor was it shown in real-time. With the container based system, a Web observation is now able to extract and collect a broader amount of Web data than possibly ever before. Moreover, the author was able to collect (real-time) Web data before its evaporation: on many Web resources, data is not included in a state to last, but evaporates over time and is irrevocably lost for Web users. The following scenarios are particularly suitable for Web observations and therefore answer the third research question. A non-exhaustive list can be found below:

• Price developments (up, down, stable) E.g. hotels, flights, goods and services;

• Position based on geo location E.g. car sharing, flight tracker, ship traffic;

• Public transportation / traffic flow E.g. train, bus, tram;

• Online Messengers E.g. Facebook, Skype, WhatsApp;

• Page rank positions E.g. Google, Bing, DuckDuckGo.

From all these data sources, a container based architecture could collect numerous data sets that provide a broad insight into the efficiency and workability of these services. While a Web observation can be an interesting working tool for a data scientist, the collected knowledge may also help a user to identify meaningful information and services which can support everyday choices.

9.3 Limitations and Future Work

This work can be meaningful for researchers, citizens, the media and others who want to understand data sources on the Web. The provided architectures illustrate numerous applications of Web observations that were able to collect meaningful data and thereby give interesting insights into the content providers of the Web resources. Nevertheless, the building block architecture must be applied to further scenarios in order to test the architectural design more thoroughly. This work also comes with several limitations. A considerable amount of effort was invested in a scenario for evaluating price movements of goods in several online shops. Depending on the Web resource, the price movements behaved very differently. Some online shops obtain all kinds of information from the browser and use it to determine whether a Web user is likely to have more buying power than others (higher price). This work could not provide hard evidence for such factors; the only thing that could be established was the assumption that online shop operators seemed to adjust their prices to their customers. It might therefore be interesting for further research to analyse customised pricing strategies in depth. Such an analysis could unravel the mechanisms of pricing models adapted to usage behaviour. Another limitation encountered is that the building block architecture requires a certain programming skill-set in order to be able to use the system. The programming of code for a Web observation that collects meaningful information from a Web resource is still a manual task. Code can be reused, but it has to be written specifically for a Web resource. The user focus is therefore on IT professionals and not on the general public. However, this could be mitigated by introducing pre-existing code that could be easily applied within a graphical user interface; this could be implemented in the future as a next milestone. Finally, it is realistic to say that in the near future such Web observatories will be available, either for free or as a paid service, for everybody on the Web. Therefore, the regulation of data collection will become more and more important. A data provider will have to consider its users, the data’s privacy and safekeeping, and what kind of information is shared for what purpose. Perhaps soon such data sets will be analysed by artificial intelligence without any further need for a specific Web observation to be set up. With more data, a prediction model might be applied to some scenarios to improve the system. Also, with more resources available, more data collections could be executed. From the already vast amounts of data existing on the Web, the same amount could be recreated through a Web observation. One of the main advantages of this immense collection will be to make use of an untapped potential of knowledge which would otherwise be lost or undetectable. Therefore, data scientists who help to make sense of the collected knowledge will be of utmost importance.

Bibliography

[1] WebScience Trust, “Introduction - WebScience Trust”, The Web Observatory: A global data resource for the advancement of economic & social prosperity. [Online]. Available: http://www.webscience.org/web-observatory/about/introduction/. [Accessed: 15-Aug-2018]. [cited at p. 2, 29]

[2] M. Hilbert and P. López, “The World’s Technological Capacity to Store, Communicate, and Compute Information”, Science, vol. 332, no. 6025, pp. 60-65, Apr. 2011. [cited at p. 5]

[3] J. Toonders, “Data Is the New Oil of the Digital Economy”, WIRED, 01-Aug-2014. [Online]. Available: https://www.wired.com/insights/2014/07/data-new-oil-digital-economy/. [Accessed: 15-Sep-2017]. [cited at p. 6]

[4] J. Vanian, “Data Is The New Oil — Fortune.com”, Data Is The New Oil — Fortune.com, 11-Jul-2016. [Online]. Available: http://fortune.com/2016/07/11/data-oil-brainstorm-tech/. [Accessed: 25-Sep-2017]. [cited at p. 6]

[5] M. Mandel, “The Economic Impact of Data: Why Data Is Not Like Oil”, Progressive Policy Institute, 11-Jul-2017. [Online]. Available: http://www.progressivepolicy.org/publications/economic-impact-data-data-not-like-oil/. [Accessed: 25-Sep-2017]. [cited at p. 6]

[6] N. G. Carr, “IT Doesn’t Matter”, Harvard Business Review, 01-May-2003. [Online]. Available: https://hbr.org/2003/05/it-doesnt-matter. [Accessed: 25-Sep-2017]. [cited at p. 6]

[7] F. W. McFarlan and R. L. Nolan, “Why IT Does Matter”, HBS Working Knowledge, 25-Aug-2003. [Online]. Available:

http://hbswk.hbs.edu/item/why-it-does-matter. [Accessed: 29-Sep-2017]. [cited at p. 6]

[8] “data — Definition of data in English by Oxford Dictionaries”, Oxford Dictionaries — English. [Online]. Available: https://en.oxforddictionaries.com/definition/data. [Accessed: 10-Oct-2017]. [cited at p. 6]

[9] “data — Origin and meaning of data by Online Etymology Dictionary”. [Online]. Available: http://www.etymonline.com/word/data. [Accessed: 12- Oct-2017]. [cited at p. 6]

[10] IBM, “IBM Archives: 704 Data Processing Sys- tem”, 23-Jan-2003. [Online]. Available: http://www- 03.ibm.com/ibm/history/exhibits/mainframe/mainframe PP704.html. [Accessed: 22-May-2018]. [cited at p. 6]

[11] I. Becerra-Fernandez and R. Sabherwal, “Knowledge management: sys- tems and processes”, Armonk, N.Y: M.E. Sharpe, 2010. [cited at p. 6]

[12] P. Checkland and S. Holwell, “Information, Systems and Information Sys- tems: Making Sense of the Field”, John Wiley & Sons, Inc., New York, NY, USA, 1998. [cited at p. 7, 8, 9, 10]

[13] H. Schauer, “Information – Basic Concepts”, 03-Nov-2005. [Online]. Avail- able: http://www.ifi.uzh.ch/ee/fileadmin/user upload/teaching/hs09/L1 Information Text.pdf. [Accessed: 22-May-2018]. [cited at p. 7]

[14] C. Shannon and W. Weaver, “The Mathematical Theory of Communica- tion”, Aspects of information theory, Sept. 1949. [cited at p. 7]

[15] J. O. Hicks, “Management Information Systems: A User Perspective”, Subsequent edition, Minneapolis/St. Paul: West Group, 1993. [cited at p. 7]

[16] C. French, “Data Processing and Information Technology”, Cengage Learning EMEA, 1996. [cited at p. 8]

[17] “Data, data everywhere”, A special report on managing information, The Economist, 25-Feb-2010. [cited at p. 8]

[18] L. Gitelman, “Raw Data Is an Oxymoron”, MIT Press, 2013. [cited at p. 8, 9, 10, 171]

[19] R. Kitchin, “The Data Revolution: Big Data, Open Data, Data Infras- tructures and Their Consequences”, 1st edition. Los Angeles, Calif.: SAGE Publications Ltd, 2014. [cited at p. 9, 10]

[20] J. J. Hox and H. R. Boeije, “Data collection, primary versus secondary”, 2005. [Online]. Available: http://dspace.library.uu.nl/handle/1874/23634. [Accessed: 10-Oct-2017]. [cited at p. 9]

[21] J. Han, M. Kamber, and J. Pei, “Data Mining: Concepts and Techniques”, in Data Mining (Third Edition), Third Edition, Boston: Morgan Kaufmann, pp. 1-38, 2012. [cited at p. 10, 35, 36, 38]

[22] J. Rowley, “The wisdom hierarchy: representations of the DIKW hierar- chy”, Journal of Information Science, vol. 33, no. 2, pp. 163-180, Apr. 2007. [cited at p. 10, 11, 171]

[23] C. Zins, “Conceptual approaches for defining data, information, and knowl- edge”, Journal of the American Society for Information Science and Tech- nology, vol. 58, no. 4, pp. 479-493, Feb. 2007. [cited at p. 10, 11, 171]

[24] R. L. Ackoff, “From data to wisdom”, Journal of applied systems analysis, vol. 16, no. 1, pp. 3-9, 1989. [cited at p. 10, 11, 171]

[25] J. Girard and J. Girard, “Defining knowledge management: Toward an applied compendium”, Online Journal of Applied Knowledge Management, vol. 3, no. 1, pp. 1-20, 2015. [cited at p. 11]

[26] A. Bytheway, “Investing in Information: The Information Management Body of Knowledge”. Springer, 2014. [cited at p. 11]

[27] C. Snijders, U. Matzat, and U.-D. Reips, “’Big Data’: Big Gaps of Knowl- edge in the Field of Internet Science”, International Journal of Internet Sci- ence, vol. 7, no. 1, pp. 1-5, 2012. [cited at p. 11, 15]

[28] “History of the Web”, World Wide Web Foundation. [Online]. Avail- able: https://webfoundation.org/about/vision/history-of-the-web/. [Ac- cessed: 09-Jan-2018]. [cited at p. 11, 179]

[29] T. J. Berners-Lee, “Information management: A proposal”, 1989. [cited at p. 11]

[30] R. Perrey and M. Lycett, “Service-oriented architecture”, in Applications and the Internet Workshops, Proceedings, 2003, pp. 116-119. [cited at p. 12]

[31] E. Newcomer and G. Lomow, “Understanding SOA with Web services”, Addison-Wesley, 2005. [cited at p. 12]

[32] B. Mobasher, R. Cooley, and J. Srivastava, “Automatic Personalization Based on Web Usage Mining”, Commun. ACM, vol. 43, no. 8, pp. 142-151, Aug. 2000. [cited at p. 12]

[33] I. H. Witten, M. Gori, and T. Numerico, “Web Dragons: Inside the Myths of Search Engine Technology”. Elsevier, 2010. [cited at p. 12]

[34] EMC News, “New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World’s Data is Analyzed”, Dec-2012. [Online]. Avail- able: https://www.emc.com/about/news/press/2012/20121211-01.htm. [Accessed: 25-Jan-2018]. [cited at p. 12]

[35] T. Heath and C. Bizer, “Linked Data: Evolving the Web into a Global Data Space”, Synthesis Lectures on the Semantic Web: Theory and Technology, vol. 1, no. 1, pp. 1-136, Feb. 2011. [cited at p. 12]

[36] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web”, Scientific american, vol. 284, no. 5, pp. 28-37, 2001. [cited at p. 13]

[37] “Facts About W3C”. [Online]. Available: https://www.w3.org/Consortium/facts. [Accessed: 26-Jan-2018]. [cited at p. 13]

[38] “W3C Semantic Web Activity Homepage”. [Online]. Available: https://www.w3.org/2001/sw/. [Accessed: 26-Jan-2018]. [cited at p. 13]

[39] N. Spivack, “The Semantic Web, Collective Intelligence and Hyperdata — Nova Spivack”, 18-Sep-2007. [Online]. Available: http://www.novaspivack.com/technology/the-semantic-web-collective- intelligence-and-hyperdata. [Accessed: 26-Jan-2018]. [cited at p. 13]

[40] “RDF - Semantic Web Standards”. [Online]. Available: https://www.w3.org/RDF/. [Accessed: 26-Jan-2018]. [cited at p. 13]

[41] T. Berners-Lee, W. Hall, J. Hendler, N. Shadbolt, and D. J. Weitzner, “Creating a Science of the Web”, Science, vol. 313, no. 5788, pp. 769-771, Aug. 2006. [cited at p. 13, 14]

[42] B. Shneiderman, “Web Science: A Provocative Invitation to Computer Science”, Commun. ACM, vol. 50, no. 6, pp. 25-27, Jun. 2007. [cited at p. 14]

[43] J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, and D. Weitzner, “Web Science: An Interdisciplinary Approach to Understanding the Web”, Com- mun. ACM, vol. 51, no. 7, pp. 60-69, Jul. 2008. [cited at p. 14]

[44] C. Arms and C. Fleischhauer, “Digital Formats: Factors for Sustainability, Functionality, and Quality”, Archiving Conference, vol. 2005, no. 1, pp. 222- 227, Jan. 2005. [cited at p. 15]

[45] Open Data Handbook, “File Formats”. [Online]. Available: http://opendatahandbook.org/guide/en/appendices/file-formats/. [Ac- cessed: 26-Jan-2018]. [cited at p. 15, 178]

[46] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, “DBpedia: A Nucleus for a Web of Open Data”, in The Semantic Web, Springer, Berlin, Heidelberg, 2007, pp. 722-735. [cited at p. 16]

[47] D. Kyburz, “Open Data”, Explora. [Online]. Available: https://www.explora.ethz.ch/en/s/open-data/. [Accessed: 02-Feb-2018]. [cited at p. 16]

[48] “Open Knowledge International”. [Online]. Available: https://okfn.org. [Accessed: 02-Feb-2018]. [cited at p. 16]

[49] “Memorandum”, Open Government Working Group, 22-Oct-2007. [On- line]. Available: https://public.resource.org/open government meeting.html. [Accessed: 27-Apr-2018]. [cited at p. 16]

[50] “8 Principles of Open Government Data”, Request for Com- ments, Open Government Data Principles, 07-Dec-2007. [Online]. Avail- able: https://public.resource.org/8 principles.html. [Accessed: 27-Apr- 2018]. [cited at p. 16, 17]

[51] J. Tauberer, “Open Government Data: The Book”, 2nd Edition, 6-Oct- 2014. [cited at p. 16]

[52] M. Chui, D. Farrell, and K. Jackson, “How government can promote open data and help unleash over $3 trillion in economic value”, open data, Oct. 2013. [cited at p. 17]

[53] T. Berners-Lee, “Linked Data - Design Issues”, Linked Data, 18-Jun-2009. [Online]. Available: https://www.w3.org/DesignIssues/LinkedData.html. [Accessed: 03-Feb-2018]. [cited at p. 17, 18]

[54] “5 Star Linked Data - Government Linked Data (GLD) Working Group Wiki”, 15-Mar-2013. [Online]. Available: https://www.w3.org/2011/gld/wiki/5 Star Linked Data. [Accessed: 03- Feb-2018]. [cited at p. 17]

[55] M. Hausenblas, “5-star Open Data”, 31-Aug-2015. [Online]. Available: http://5stardata.info/en/. [Accessed: 02-Feb-2018]. [cited at p. 17, 171]

[56] J. R. Mashey, “Big Data... and the Next Wave of InfraStress”, Apr. 1998. [cited at p. 18]

[57] “What Is Big Data? - Gartner IT Glossary - Big Data”. [Online]. Avail- able: https://www.gartner.com/it-glossary/big-data/. [Accessed: 10-Feb- 2018]. [cited at p. 18]

[58] J. Manyika, M. Chui, B. Brown, J. Bughin, C. Roxburgh, and A. Hung Byers, “Big data: The next frontier for innovation, competition, and productivity — McKinsey & Company”, May-2011. [Online]. Avail- able: https://www.mckinsey.com/business-functions/digital-mckinsey/our- insights/big-data-the-next-frontier-for-innovation. [Accessed: 10-Feb-2018]. [cited at p. 19]

[59] B. Franks, “Web Data: The Original Big Data”, in Taming the Big Data Tidal Wave, Wiley-Blackwell, pp. 2951, 03-10-2015. [cited at p. 19]

[60] A. De Mauro, M. Greco, and M. Grimaldi, “A formal definition of Big Data based on its essential features”, Library Review, vol. 65, no. 3, pp. 122-135, Apr. 2016. [cited at p. 19]

[61] D. Laney, “3D Data Management: Controlling Data Volume, Velocity, and Variety”, META Group, Feb. 2001. [Online]. Avail- able: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data- Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. [Accessed: 25-Nov-2017]. [cited at p. 19]

[62] M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and P. Tufano, “Analytics: The Real-World Use of Big Data”, IBM Institute for Business Value, Said Business School, New York, NY, Oct. 2012. [cited at p. 19]

[63] J.-P. Dijcks, “Oracle: Big Data for the Enterprise”, An Oracle White Paper, Oracle Corporation, Redwood Shores, CA, Jun. 2013. [cited at p. 19]

[64] D. Kalra, I. Buchan, and N. Paton, “Three Gu- rus of Big Data”, Nov-2016. [Online]. Available: https://thetranslationalscientist.com/issues/0816/three-gurus-of-big-data/. [Accessed: 10-Feb-2018]. [cited at p. 20]

[65] ATLAS Experiment, CERN, “ATLAS Fact Sheet: To raise awareness of the ATLAS detector and collaboration on the LHC”, 2010. [cited at p. 20]

[66] IBM, “Big Data Analytics — IBM Analytics”. [Online]. Available: https://www.ibm.com/analytics/hadoop/big-data-analytics. [Accessed: 06- Feb-2018]. [cited at p. 20]

[67] J. Bosman, “After 244 Years, Encyclopædia Britannica Stops the Presses”, Media Decoder, The New York Times, 13-Mar-2012. [Online]. Available: //mediadecoder..nytimes.com/2012/03/13/after-244-years-encyclopaedia-britannica-stops-the-presses/. [Accessed: 06-Feb-2018]. [cited at p. 20]

[68] J. Madhavan, A. Y. Halevy, S. Cohen, X. L. Dong, S. R. Jeffery, D. Ko, and C. Yu, “Structured data meets the Web: a few observations.”, IEEE Data Eng. Bull., vol. 29, no. 4, pp. 19-26, 2006. [cited at p. 21]

[69] S. Ransbotham, D. Kiron, and P. K. Prentice, “Minding the analytics gap”, MIT Sloan Management Review, vol. 56, no. 3, p. 63, 2015. [cited at p. 21]

[70] S. Klous and N. Wielaard, “We are Big Data: The Future of the Informa- tion Society”. Atlantis Press, 2016. [cited at p. 22]

[71] P. Jain, M. Gyanchandani, and N. Khare, “Big data privacy: a technolog- ical perspective and review”, Journal of Big Data, vol. 3, p. 25, Nov. 2016. [cited at p. 22]

[72] P. F. Drucker, “The Landmarks of Tomorrow”, 1st edition. New York, Harper and Row, 1959. [cited at p. 22]

[73] A. McAfee and E. Brynjolfsson, “Big Data: The Management Revolution”, Harvard Business Review, Oct. 2012. [Online]. Avail- able: https://hbr.org/2012/10/big-data-the-management-revolution. [Ac- cessed: 25-Aug-2016]. [cited at p. 22]

[74] T. H. Davenport, “Big data. The management revolution.”, Harvard Bus Rev, 90(10), pp. 61-67, Oct. 2012. [cited at p. 22]

[75] W. E. Deming, “The new economics: for industry, government, educa- tion”. Cambridge, Mass.: MIT Press, 2000. [cited at p. 22]

[76] IBM, “Taming Big Data: Small Data vs. Big Data — IBM Big Data & Analytics Hub”, Aug. 2013. [Online]. Available: http://www.ibmbigdatahub.com/infographic/taming-big-data-small-data- vs-big-data. [Accessed: 25-Aug-2017]. [cited at p. 23]

[77] M. Lindstrom, “Small Data: The Tiny Clues That Uncover Huge Trends”, John Murray Learning, 2016. [cited at p. 23]

[78] P. J. W. Windley, “The Live Web: Building Event-Based Connections in the Cloud”, Cengage Learning, 2012. [cited at p. 25]

[79] A. Gröflin, M. Weber, M. Guggisberg, and H. Burkhart, “Traffic flow measurement of a public transport system through automated Web observation”, in 2017 11th International Conference on Research Challenges in Information Science (RCIS), pp. 156-161, 2017. [cited at p. 25, 39, 40, 67, 127, 143, 171]

[80] “What is a Mashup? - Definition from Techopedia”, Techopedia.com. [Online]. Available: https://www.techopedia.com/definition/5373/mashup. [Accessed: 20-Mar-2018]. [cited at p. 25]

[81] P. Rademacher, “HousingMaps”. [Online]. Available: http://www.housingmaps.com/. [Accessed: 29-May-2018]. [cited at p. 25]

[82] W. Roush, “Killer Maps – Google, Microsoft, and Yahoo are racing to transform online maps into full-blown browsers, organizing informa- tion – and, of course, ads – according to geography. The likely win- ner? You.”, MIT Technology Review, 01-Oct-2005. [Online]. Available: https://www.technologyreview.com/s/404705/killer-maps/. [Accessed: 30- May-2018]. [cited at p. 25]

[83] Google, Inc., “FAQ — Google Maps Platform”, Google Developers. [On- line]. Available: https://developers.google.com/maps/faq. [Accessed: 29- May-2018]. [cited at p. 25]

[84] Twitter, Inc., “Pricing”. [Online]. Available: https://developer.twitter.com/en/pricing.html. [Accessed: 29-May-2018]. [cited at p. 25]

[85] Facebook, Inc., “Platform Policy”, Facebook for Developers. [Online]. Available: https://developers.facebook.com/policy/. [Accessed: 29-May-2018]. [cited at p. 25]

[86] S. Rizzotti, “Syndicate – Individual Service Composition in the Web-Age”. Jan. 2008. [cited at p. 26]

[87] Z. Akbar, J. M. Garca, I. Toma, and D. Fensel, “On Using Semantically- Aware Rules for Efficient Online Communication”, in Rules on the Web, From Theory to Applications, vol. 8620, Eds. Cham: Springer International Publishing, pp. 37-51, 2014. [cited at p. 26]

[88] R. M. Dijkman, M. Dumas, and C. Ouyang, “Semantics and Analysis of Business Process Models in BPMN”, Inf. Softw. Technol., vol. 50, no. 12, pp. 1281-1294, Nov. 2008. [cited at p. 26]

[89] P. Wohed, W. M. P. van der Aalst, M. Dumas, and A. H. M. ter Hofstede, “Analysis of Web Services Composition Languages: The Case of BPEL4WS”, in Conceptual Modeling - ER 2003, I.-Y. Song, S. W. Liddle, T.-W. Ling, and P. Scheuermann, Eds. Springer Berlin Heidelberg, pp. 200-215, 2003. [cited at p. 26]

[90] M. Blackstock and R. Lea, “Toward a Distributed Data Flow Platform for the Web of Things (Distributed Node-RED)”, in Proceedings of the 5th International Workshop on Web of Things, New York, NY, USA, pp. 34-39, 2014. [cited at p. 26, 27]

[91] A. Gröflin, D. Bosch, M. Guggisberg, and H. Burkhart, “Facilitating the Reactive Web – A Condition Action System using Node.js”, presented at the 11th International Conference on Web Information Systems and Technologies, pp. 89-95, 2015. [cited at p. 26, 27, 28, 34, 83, 84, 85, 86, 171, 172]

[92] “Yahoo Pipes Blog - Pipes End-of-life An- nouncement”, 04-Jun-2015. [Online]. Available: https://web.archive.org/web/20150604181928/http://pipes.yqlblog.net/ post/120705592639/pipes-end-of-life-announcement. [Accessed: 20-Mar- 2018]. [cited at p. 27]

[93] S. Ovadia, “Automate the Internet With ’If This Then That’ (IFTTT)”, Behavioral & Social Sciences Librarian, vol. 33, no. 4, pp. 208-211, Oct. 2014. [cited at p. 27]

[94] A. Paschke, “Reaction RuleML 1.0 for Rules, Events and Actions in Semantic Complex Event Processing”, in Rules on the Web. From Theory to Applications, pp. 1-21, 2014. [cited at p. 27]

[95] S. Hausmann and F. Bry, “Towards Complex Actions for Complex Event Processing”, in Proceedings of the 7th ACM International Conference on Distributed Event-based Systems, New York, NY, USA, pp. 135-146, 2013. [cited at p. 27]

[96] A. Paschke, H. Boley, Z. Zhao, K. Teymourian, T. Athan, “Reaction RuleML 1.0: Standardized Semantic Reaction Rules”. in A. Bikakis, A. Giurca (eds) Rules on the Web: Research and Applications. RuleML. Lecture Notes in Computer Science, vol 7438. Springer, Berlin, Heidelberg, 2012. [cited at p. 27]

[97] S. Hasan, S. O’Riain, and E. Curry, “Approximate Semantic Matching of Heterogeneous Events”, in Proceedings of the 6th ACM International Confer- ence on Distributed Event-Based Systems, New York, NY, USA, pp. 252-263, 2012. [cited at p. 28]

[98] P. T. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermarrec, “The Many Faces of Publish/Subscribe”, ACM Comput. Surv., vol. 35, no. 2, pp. 114-131, Jun. 2003. [cited at p. 28, 30, 31]

[99] “huginn: Create agents that monitor and act on your be- half. Your agents are standing by!”, 20-Mar-2018. [Online]. Available: https://github.com/huginn/huginn. [Accessed: 20-Mar-2018]. [cited at p. 28]

[100] T. Tiropanis, W. Hall, N. Shadbolt, D. De Roure, N. Contractor, and J. Hendler, “The web science observatory”, IEEE Intelligent Systems, vol. 28, no. 2, pp. 100-104, 2013. [cited at p. 28]

[101] R. Tinati, X. Wang, T. Tiropanis, and W. Hall, “Building a Real-Time Web Observatory”, IEEE Internet Computing, vol. 19, no. 6, pp. 36-45, Nov. 2015. [cited at p. 28]

[102] A. Gröflin, “Web Observatory University of Basel”, GitHub repository, 17-May-2018. [Online]. Available: https://github.com/WebObservatoryUnibas/lab. [Accessed: 01-Jun-2018]. [cited at p. 28, 121, 143]

[103] S. Price, W. Hall, G. Earl, T. Tiropanis, R. Tinati, X. Wang, E. Gandolfi, J. Gatewood, R. Boateng, D. Denemark, A. Groflin, B. Loader, M. Schmidt, M. Billings, G. Spanakis, H. Suleman, K. Tsoi, B. Wessels, J. Xu, and M. Birkin, “Worldwide Universities Network (WUN) Web Observatory: Applying Lessons from the Web to Transform the Research Data Ecosystem”, in Proceedings of the 26th International Conference on World Wide Web Companion, Republic and Canton of Geneva, Switzerland, pp. 1665-1667, 2017. [cited at p. 29]

[104] Open Knowledge International, “Open Definition 2.1”. [Online]. Avail- able: http://opendefinition.org/od/2.1/en/. [Accessed: 10-Jan-2017]. [cited at p. 29]

[105] A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web”, Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011. [cited at p. 29]

[106] T. O’Reilly, “What Is Web 2.0? Design Patterns and Busi- ness Models for the Next Generation of Software”, Sep. 2005. [On- line]. Available: http://www.oreilly.com/pub/a/web2/archive/what-is-web- 20.html. [Accessed: 25-Aug-2016]. [cited at p. 29]

[107] H. Chen, R. H. Chiang, and V. C. Storey, “Business Intelligence and Analytics: From Big Data to Big Impact.”, MIS quarterly, vol. 36, no. 4, pp. 1165-1188, 2012. [cited at p. 29, 39]

[108] A. C. M. Fong, S. C. Hui, and H. L. Vu, “Effective techniques for auto- matic extraction of Web publications”, Online Information Review, vol. 26, no. 1, pp. 4-18, Feb. 2002. [cited at p. 30]

[109] M. S. Weber, “Observing the Web by Understanding the Past: Archival Internet Research”, in Proceedings of the 23rd International Conference on World Wide Web, New York, NY, USA, pp. 1031-1036, 2014. [cited at p. 30]

[110] PubSubHubbub, “The PubSubHubbub proto- col specification”, 29-Mar-2018. [Online]. Available: https://github.com/pubsubhubbub/PubSubHubbub. [Accessed: 03- Apr-2018]. [cited at p. 30]

[111] J. Genestoux, B. Fitzpatrick, B. Slatkin, and M. Atkins, “WebSub”, 19-Dec-2017. [Online]. Available: https://www.w3.org/TR/websub/. [Ac- cessed: 03-Apr-2018]. [cited at p. 30]

[112] M. Thomson, E. Damaggio, and B. Raymor, “Generic Event Delivery Using HTTP Push”, draft-ietf-webpush-protocol-12, Internet Engineering Task Force (IETF), 25-Apr-2017. [Online]. Available: https://tools.ietf.org/html/draft-ietf-webpush-protocol-12. [Accessed: 06-Apr-2018]. [cited at p. 31]

[113] V. Trifa, D. Guinard, V. Davidovski, A. Kamilaris, and I. Delchev, “Web Messaging for Open and Scalable Distributed Sensing Applications”, in Web Engineering, pp. 129-143, 2010. [cited at p. 31]

[114] S. Duquennoy, G. Grimaud, and J. J. Vandewalle, “Consistency and scal- ability in event notification for embedded Web applications”, in 2009 11th IEEE International Symposium on Web Systems Evolution, pp. 89-98, 2009. [cited at p. 31]

[115] D. Benslimane, S. Dustdar, and A. Sheth, “Services Mashups: The New Generation of Web Applications”, IEEE Internet Computing, vol. 12, no. 5, pp. 13-15, Sep. 2008. [cited at p. 31]

[116] V. Pimentel and B. G. Nickerson, “Communicating and Displaying Real- Time Data with WebSocket”, IEEE Internet Computing, vol. 16, no. 4, pp. 45-53, Jul. 2012. [cited at p. 32]

[117] S. Hogg, “Software Containers: Used More Frequently than Most Realize”, Network World, 26-May-2014. [Online]. Available: https://www.networkworld.com/article/2226996/cisco-subnet/software- containers–used-more-frequently-than-most-realize.html. [Accessed: 10-Jul- 2018]. [cited at p. 35]

[118] Docker Inc., “Why Docker?”, 05-Mar-2018. [Online]. Available: https://www.docker.com/enterprise-edition. [Accessed: 08-May-2018]. [cited at p. 35, 95]

[119] A. Bifet, J. Zhang, W. Fan, C. He, J. Zhang, J. Qian, G. Holmes, and B. Pfahringer, “Extremely Fast Decision Tree Mining for Evolving Data Streams”, in Proceedings of the 23rd ACM SIGKDD International Confer- ence on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 1733-1742, 2017. [cited at p. 35]

[120] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From Data Mining to Knowledge Discovery in Databases”, American Association for Artificial Intelligence, AI MAGAZINE, p. 18, Jul. 1997. [cited at p. 36]

[121] J. W. Tukey, “Exploratory Data Analysis”, Addison-Wesley Publishing Company, 1977. [cited at p. 36]

[122] J. T. Behrens, “Principles and Procedures of Exploratory Data Analysis”, Psychological Methods, vol. 2, no. 2, pp. 131-160, 1997. [cited at p. 37]

[123] O. Marban, G. Mariscal, and J. Segovia, “A Data Mining & Knowledge Discovery Process Model”, in Data Mining and Knowledge Discovery in Real Life Applications, InTech, Rijek, Crotia, pp. 1-16, Jan. 2009. [cited at p. 37]

[124] C. Shearer, “The CRISP-DM Model: The New Blueprint for Data Mining”, Journal of Data Warehousing, vol. 5, no. 4, pp. 13-22, 2000. [cited at p. 37]

[125] B. Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data”, Second Edition, Berlin, Heidelberg: Springer, 2011. [cited at p. 37]

[126] B. A. Galitsky, G. Dobrocsi, J. L. de la Rosa, and S. O. Kuznetsov, “Using Generalization of Syntactic Parse Trees for Taxonomy Capture on the Web”, in Conceptual Structures for Discovering Knowledge, pp. 104-117, 2011. [cited at p. 38]

[127] L. V. Ahn, M. Blum, N. J. Hopper, and J. Langford, “CAPTCHA: Using Hard AI Problems for Security”, in Proceedings of the 22nd International Conference on Theory and Applications of Cryptographic Techniques, Berlin, Heidelberg, 2003, pp. 294-311. [cited at p. 38]

[128] E. Ferrara, P. De Meo, G. Fiumara, and R. Baumgartner, “Web data ex- traction, applications and techniques: A survey”, Knowledge-Based Systems, vol. 70, pp. 301-323, Nov. 2014. [cited at p. 38, 39]

[129] M. Beyer, “Gartner Says Solving ’Big Data’ Challenge Involves More Than Just Managing Volumes of Data” Gartner, 2011. [Online]. Available: http://www.gartner.com/newsroom/id/1731916. [Accessed: 25-Aug-2016]. [cited at p. 39]

[130] D. Parker, “JavaScript with Promises: Managing Asynchronous Code”, Second Release, O’Reilly Media, Inc., 17-July-2015. [cited at p. 42, 43]

[131] “needle”, npm. [Online]. Available: https://www.npmjs.com/package/needle. [Accessed: 15-Apr-2018]. [cited at p. 43, 184]

[132] “cheerio: Fast, flexible, and lean implementation of core jQuery designed specifically for the server”, 03-Sep-2018. [Online]. Available: https://github.com/cheeriojs/cheerio. [Accessed: 15-Apr-2018]. [cited at p. 43, 184]

[133] L. Merancia, “node-cron: Cron for NodeJS”, node-cron - npm, 04-May-2018. [Online]. Available: https://github.com/kelektiv/node-cron. [Accessed: 05-May-2018]. [cited at p. 45]

[134] Leaders Section, The data economy demands a new ap- proach to antitrust rules: “The world’s most valuable resource is no longer oil, but data”, The Economist. [Online]. Available: http://www.economist.com/news/leaders/21721656-data-economy- demands-new-approach-antitrust-rules-worlds-most-valuable-resource. [Accessed: 15-Sep-2017]. [cited at p. 53]

[135] “Maximillian Schrems v Data Protection Commissioner”, C362/14, EUR-Lex, 6-Oct-2015. [Online]. Available: https://eur-lex.europa.eu/legal- content/EN/TXT/?uri=CELEX%3A62014CJ0362. [Accessed: 15-Apr- 2018]. [cited at p. 53]

[136] M. A. Weiss and K. Archick, “U.S.-EU Data Privacy: From Safe Harbor to Privacy Shield”, Congressional Research Service, 19-May-2016. [cited at p. 54]

[137] The Federal Council, “CC 235.1 Federal Act of 19 June 1992 on Data Protection (FADP)”, 01-Jan-2014. [On- line]. Available: https://www.admin.ch/opc/en/classified- compilation/19920153/index.html. [Accessed: 30-Apr-2018]. [cited at p. 54]

[138] The Federal Assembly, “CC 312.0 Swiss Criminal Procedure Code of 5 October 2007 (Criminal Procedure Code, CPC)”. [Online]. Available: https://www.admin.ch/opc/en/classified- compilation/20052319/index.html. [Accessed: 30-Apr-2018]. [cited at p. 54]

[139] The Federal Council, “SR 780.1 Bundesgesetz vom 18. März 2016 betreffend die Überwachung des Post- und Fernmeldeverkehrs (BÜPF)”. [Online]. Available: https://www.admin.ch/opc/de/classified-compilation/20122728/index.html. [Accessed: 30-Apr-2018]. [cited at p. 55]

[140] “Die neuen Auskunfts- und Überwachungstypen”, 10-Apr-2018. [Online]. Available: https://www.li.admin.ch/sites/default/files/2018-04/Die%20neuen%20Auskunfts-%20und%20%C3%9Cberwachungstypen.pdf. [Accessed: 30-Apr-2018]. [cited at p. 55]

[141] Council of Europe, “Convention on Cybercrime”, Details of Treaty No. 185, 23-Nov-2001. [Online]. Available: https://www.coe.int/en/web/conventions/full-list. [Accessed: 30-Apr-2018]. [cited at p. 55]

[142] The Federal Council, “SR 780.11 Verordnung vom 15. November 2017 über die Überwachung des Post- und Fernmeldeverkehrs (VÜPF)”, 02-May-2018. [Online]. Available: https://www.admin.ch/opc/de/classified-compilation/20172173/index.html. [Accessed: 04-May-2018]. [cited at p. 55]

[143] [email protected], The Turnitin Team, “Important up- dates to our Privacy Centre and Terms of Service”, Turnitin UK, 7.31 PM, 02-May-2018. [cited at p. 57]

[144] [email protected], easyJet, “Your booking *******: Alexander, you’re always in control - on your upcoming trip to [.] and beyond”, easyJet Airline Company Limited, 7.49 PM, 03-May-2018. [cited at p. 57]

[145] European Union, “EUR-Lex - 32016R0679 - EN - EUR-Lex”, 27-Apr- 2016. [Online]. Available: https://eur-lex.europa.eu/eli/reg/2016/679/oj. [Accessed: 29-Apr-2018]. [cited at p. 57]

[146] European Commission, “EU GDPR Information Portal”, EU GDPR Por- tal. [Online]. Available: http://eugdpr.org/eugdpr.org.html. [Accessed: 29- Apr-2018]. [cited at p. 57]

[147] P. Pfirter, “Modern communication in the investigatory process: What kind of data can the government collect without violation of procedural and privacy laws?”, Accountability of Criminal Justice Systems, Pre-Conference on the 10th Anniversary of Future of Adversarial and Inquisitorial Sys- tems 2018, Faculty of Law, University of Basel, Switzerland, 25-April-2018. [cited at p. 58, 69]

[148] W. Malcolm, “Getting ready for Europe’s new data pro- tection rules”, Google, 08-Aug-2017. [Online]. Available: https://www.blog.google/topics/google-europe/gdpr-europe-data- protection-rules/. [Accessed: 29-Apr-2018]. [cited at p. 58]

[149] C. Coleman, “Are you ready for a data privacy shake-up?”, BBC News, 20-Apr-2018. [Online]. Available: http://www.bbc.com/news/technology- 43657546. [Accessed: 29-Apr-2018]. [cited at p. 58]

[150] O. Solon, “How Europe’s ’breakthrough’ privacy law takes on Facebook and Google”, The Guardian, 19-Apr-2018. [Online]. Available: http://www.theguardian.com/technology/2018/apr/19/gdpr-facebook-google-amazon-data-privacy-regulation. [Accessed: 29-Apr-2018]. [cited at p. 58]

[151] J. E. Webber, D. Stephens, “Google vs the right to be forgotten: Chips with Everything podcast”, The Guardian, 20-Apr-2018. [Online]. Available: http://www.theguardian.com/technology/audio/2018/apr/20/google- vs-the-right-to-be-forgotten-chips-with-everything-podcast. [Accessed: 29-Apr-2018]. [cited at p. 58]

[152] noyb, “GDPR: noyb.eu filed four complaints over ’forced consent’ against Google, Instagram, WhatsApp and Face- book”, 25-May-2018. [Online]. Available: https://noyb.eu/wp- content/uploads/2018/05/pa forcedconsent en.pdf. [Accessed: 27-May- 2018]. [cited at p. 59]

[153] C. Foxx, “Google and Facebook face GDPR complaints”, BBC News, 25-May-2018. [Online]. Available: http://www.bbc.com/news/technology- 44252327. [Accessed: 27-May-2018]. [cited at p. 59]

[154] Bundesamt für Justiz, “Stärkung des Datenschutzes”, 15-Sep-2017. [Online]. Available: https://www.bj.admin.ch/bj/de/home/staat/gesetzgebung/datenschutzstaerkung.html. [Accessed: 29-Apr-2018]. [cited at p. 59]

[155] The Federal Assembly - The Swiss Parliament, “17.059 — Datenschutzgesetz. Totalrevision und Änderung weiterer Erlasse zum Datenschutz — Geschäft — Das Schweizer Parlament”, 15-Sep-2017. [Online]. Available: https://www.parlament.ch/de/ratsbetrieb/suche-curia-vista/geschaeft?AffairId=20170059. [Accessed: 29-Apr-2018]. [cited at p. 59]

[156] G. V. Müller, “Schweizer Firmen sind gezwungen, Daten ihrer Kunden besser zu schützen – sonst drohen drakonische Strafen — NZZ”, Neue Zürcher Zeitung, 14-Jul-2017. [Online]. Available: https://www.nzz.ch/wirtschaft/in-knapp-einem-jahr-gilt-es-ernst-die-schweiz-wird-eu-datenschutz-konform-ld.1305383. [Accessed: 29-Apr-2018]. [cited at p. 60]

[157] European Union, “Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data, and repealing Council Framework Decision 2008/977/JHA”, vol. OJ L, 2016. [cited at p. 60]

[158] European Union, “EUR-Lex - 32016L0681 - EN - EUR-Lex”. [Online]. Available: https://eur-lex.europa.eu/eli/dir/2016/681/oj. [Accessed: 29- Apr-2018]. [cited at p. 61]

[159] C. Di Francesco Maesa, “Balance between Security and Fundamental Rights Protection: An Analysis of the Directive 2016/680 for data protection in the police and justice sectors and the Directive 2016/681 on the use of passenger name record (PNR)”, Eurojus, Giurisprudenza di Diritto dell’Unione Europea – Casi Scelti – seconda edizione, 24-May-2016. [Online]. Available: http://rivista.eurojus.it/balance-between-security-and-fundamental-rights-protection-an-analysis-of-the-directive-2016680-for-data-protection-in-the-police-and-justice-sectors-and-the-directive-2016681-on-the-use-of-passen/. [Accessed: 29-Apr-2018]. [cited at p. 61]

[160] “Carpenter v. United States”, SCOTUSblog. [Online]. Available: http://www.scotusblog.com/case-files/cases/carpenter-v-united-states-2/. [Accessed: 21-Apr-2018]. [cited at p. 62]

[161] A. D. Sorkin, “In Carpenter Case, Justice Sotomayor Tries to Picture the Smartphone Future”, The New Yorker, 30-Nov-2017. [Online]. Avail- able: https://www.newyorker.com/news/our-columnists/carpenter-justice- sotomayor-tries-to-picture-smartphone-future. [Accessed: 21-Apr-2018]. [cited at p. 63]

[162] Harvard Law Review, “Microsoft Corp. v. United States”, 09-Dec-2016. [Online]. Available: https://harvardlawreview.org/2016/12/microsoft-corp- v-united-states/. [Accessed: 22-Apr-2018]. [cited at p. 64]

[163] “17-2 United States v. Microsoft Corp.”, Per Curiam, p. 3, 17-Apr-2018. [cited at p. 65]

[164] B. Smith, “Microsoft statement on the inclusion of the CLOUD Act in the Omnibus funding bill”, Microsoft on the Issues, 21-Mar-2018. [Online]. Avail- able: https://blogs.microsoft.com/on-the-issues/2018/03/21/microsoft- statement-on-the-inclusion-of-the-cloud-act-in-the-omnibus-funding-bill/. [Accessed: 22-Apr-2018]. [cited at p. 65]

[165] Sir I. Macleod and D. I. Baker, “Brief amici curiae of Government of the United Kingdom of Great Britain and Northern Ireland in support of neither party filed.”, 13-Dec-2017. [cited at p. 66]

[166] L. Bell, “Theresa May supports Trump’s CLOUD Act to extend US power over overseas data — V3”, http://www.v3.co.uk, 07-Feb-2018. [Online]. Available: https://www.v3.co.uk/v3-uk/news/3026215/theresa- may-supports-trumps-cloud-act-to-extend-us-power-over-overseas-data. [Ac- cessed: 22-Apr-2018]. [cited at p. 66]

[167] “Ireland’s Brexit Challenge, Global Business - BBC World Service”, BBC, 14-Apr-2018. [Online]. Available: http://www.bbc.co.uk/programmes/w3cswjxk. [Accessed: 15-Apr-2018]. [cited at p. 66]

[168] M. Bovens and S. Zouridis, “From Street-Level to System-Level Bureau- cracies: How Information and Communication Technology is Transforming Administrative Discretion and Constitutional Control”, Public Administration Review, vol. 62, no. 2, pp. 174-184, Jan. 2002. [cited at p. 67]

[169] J. C. Bertot, P. T. Jaeger, and J. M. Grimes, “Using ICTs to create a culture of transparency: E-government and social media as openness and anti-corruption tools for societies”, Government Information Quarterly, vol. 27, no. 3, pp. 264-271, Jul. 2010. [cited at p. 67]

[170] Raconteur, “Open source: sharing patents to speed up inno- vation”, Business / Intellectual Property 2018, 20-Apr-2018. [On- line]. Available: https://www.raconteur.net/business/open-source-sharing- patents-speed-innovation. [Accessed: 04-May-2018]. [cited at p. 67]

[171] J. D. Goodman and A. Baker, “William Bratton, New York’s Influential Police Commissioner, Is Stepping Down”, The New York Times, 02-Aug-2016. [Online]. Available: https://www.nytimes.com/2016/08/03/nyregion/bill-bratton-nypd- commissioner.html. [Accessed: 30-Apr-2018]. [cited at p. 68]

[172] J. H. Burch, K. Rose, and W. Bratton, “Perspectives in Law Enforce- ment – The Concept of Predictive Policing: An Interview With Chief William Bratton”, First Predictive Policing Symposium, National Institute of Justice, Los Angeles, California, USA, 20-Nov-2009. [cited at p. 68]

[173] W. Perry, B. McInnis, C. Price, S. Smith, and J. Hollywood, “Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations”, RAND Corporation, 2013. [cited at p. 68]

[174] A. G. Ferguson, “The Police Are Using Computer Algorithms to Tell if You’re a Threat”, Time, 03-Oct-2017. [Online]. Avail- able: http://time.com/4966125/police-departments-algorithms-chicago/. [Accessed: 30-Apr-2018]. [cited at p. 69]

[175] A. G. Ferguson, “The Rise of Big Data Policing: Surveillance, Race, and the Future of Law Enforcement”, NYU Press, 3-Oct-2017. [cited at p. 69]

[176] Z. Friend, “Predictive Policing: Using Technology to Reduce Crime”, FBI: Law Enforcement Bulletin, 09-Apr-2013. [Online]. Avail- able: https://leb.fbi.gov/articles/featured-articles/predictive-policing-using- technology-to-reduce-crime. [Accessed: 04-May-2018]. [cited at p. 69]

[177] D. Gerstner and H. Straub, “Predictive Policing”, Evaluation of the Pilot Project P4 in Baden-Württemberg/Germany, Max Planck Institute for Foreign and International Criminal Law, Freiburg i. Br., Germany, 03-Jan-2018. [Online]. Available: https://www.mpicc.de/en/forschung/forschungsarbeit/kriminologie/predictive policing p4.html. [Accessed: 04-May-2018]. [cited at p. 69]

[178] L. Dearden, “How big data can now be used to predict where crime will happen”, The Independent, London, United Kingdom, 23-Sep-2017. [Online]. Available: http://www.independent.co.uk/news/uk/home- news/police-big-data-technology-predict-crime-hotspot-mapping-rusi- report-research-minority-report-a7963706.html. [Accessed: 04-May-2018]. [cited at p. 69]

[179] H. Chloe, “Can Computers Predict Crimes That Haven’t Happened Yet?”, The Inquiry – BBC World Service, BBC, 31-May-2018. [Online]. Available: https://www.bbc.co.uk/programmes/w3cswqt9. [Accessed: 01- Jun-2018]. [cited at p. 69]

[180] C. Hatton, “China sets up huge social ’credit system”’, BBC News, 26-Oct-2015. [Online]. Available: https://www.bbc.com/news/world-asia- china-34592186. [Accessed: 12-Sep-2018]. [cited at p. 69]

[181] S. Gless, “Closing Remarks”, Conference on the 10th Anniversary of Fu- ture of Adversarial and Inquisitorial Systems 2018, Faculty of Law, University of Basel, Switzerland, 27-April-2018. [cited at p. 70]

[182] J. Kepler, T. Brahe, emperor of G. Rudolph II, J. S. Bute, and donor D. Burndy Library, “Astronomia nova aitiologetos” [romanized]: sev physica coelestis, tradita commentariis de motibvs stellæ Martis, ex observationibus G. V. Tychonis Brahe; jussu & sumptibus Rvdolphi II. [Heidelberg: G. Voegelinus], 1609. [cited at p. 73]

[183] M. Fowler, “Design – Who needs an architect?”, IEEE Software, vol. 20, no. 5, pp. 11-13, Sep. 2003. [cited at p. 74]

[184] M. Weber, “Component Based Web-Scraping Strategies”, Master Thesis, University of Basel, 7-June-2017. [cited at p. 77, 87, 172]

[185] J. Martin, “Managing the Data Base Environment”, Pearson Education, Limited, 1983. [cited at p. 79]

[186] M. Schwartz, “Web application framework – DocForge”, 23-Jul-2015. [Online]. Available: https://web.archive.org/web/20150723163302/http://docforge.com/wiki/ Web application framework. [Accessed: 05-May-2018]. [cited at p. 82]

[187] D. Bosch, “Towards Reactive Information Systems and their Services”, Master Thesis, University of Basel, 20-June-2014. [cited at p. 82]

[188] Mozilla, “Developer Tools & Resources — Mozilla”, Developer tools, resources, videos and more. [Online]. Available: https://www.mozilla.org/en- US/developer/. [Accessed: 08-May-2018]. [cited at p. 93]

[189] Catch a Car, “About”. [Online]. Available: https://www.catch-a- car.ch/en/about/. [Accessed: 13-May-2018]. [cited at p. 111]

[190] C. P. G. Frei, “Track a Car: Untersuchung von Methoden zur Gewinnung von Echtzeitdaten am Beispiel Catch a Car”, Bachelor Thesis, University of Basel, 02-Mai-2016. [cited at p. 111]

[191] Catch a Car, “Basel”, Map screenshot. [Online]. Available: https://www.catch-a-car.ch/en/basel/. [Accessed: 13-May-2018]. [cited at p. 112, 173]

[192] Volkswagen, “VW up! — Volkswa- gen Deutschland”, 2018. [Online]. Available: https://www.volkswagen.de/content/vw pkw/importers/de/de/models/up. html. [Accessed: 13-May-2018]. [cited at p. 113]

[193] Google Inc., “One Car”, Screenshot, Google My Maps. [Online]. Available: https://www.google.com/maps/d/edit?mid=1HfsjHZBN0hn4OTY12 lO3 Gv9VV4. [Accessed: 15-May-2018]. [cited at p. 114, 173]

[194] M. Regenass, “Die Kunden wünschen Elektroautos – Ab Herbst bietet das Carsharing-Unternehmen Catch a Car zusätzliche 30 Stromer an”, Basler Zeitung, Basel, p. 19, 02-May-2018. [cited at p. 119]

[195] “511 SF BAY – Home”. [Online]. Available: http://511.org/. [Accessed: 01-June-2018]. [cited at p. 122]

[196] “Transport for London — Keeping London moving”. [Online]. Available: https://www.tfl.gov.uk/. [Accessed: 01-June-2018]. [cited at p. 122]

[197] Basler Verkehrs-Betriebe (BVB), “All about us – Basler Verkehrs-Betriebe” [Online]. Available: https://www.bvb.ch/en/unternehmen/portraet. [Accessed: 25-May-2018]. [cited at p. 122]

[198] S. Eichkorn, “Verspätungen bei den BVB – Jedes zweite Tram hat am Barfüsserplatz Verspätung”, Eine neue Untersuchung der Universität Basel zeigt, dass Fahrgäste in der Basler Innenstadt viele Verspätungen hinnehmen müssen, Schweizer Radio und Fernsehen (SRF), 26-Jul-2017. [Online]. Available: https://www.srf.ch/news/regional/basel-baselland/jedes-zweite-tram-hat-am-barfuesserplatz-verspaetung. [Accessed: 01-Jun-2018]. [cited at p. 124]

[199] Alexa Internet, Inc., “Top Sites in Switzerland – Alexa”. [Online]. Avail- able: https://www.alexa.com/topsites/countries/CH. [Accessed: 17-May- 2018]. [cited at p. 127]

[200] J. Simonet, “Leserkommentare des Newsportals 20 Minuten – Analyse von Webinhalten mithilfe des Condition Action Tools WebAPI ECA-Engine”, Bachelor Thesis, University of Basel, 2015. [cited at p. 128, 130, 131, 174]

[201] M. A. Russell, “Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More”, O’Reilly Media, Inc., 2013. [cited at p. 135]

[202] S. Schrittwieser, P. Fruhwirt, P. Kieseberg, M. Leithner, M. Mulazzani, M. Huber, E. Weippl, “Guess Who’s Texting You? Evaluating the Security of Smartphone Messaging Applications”, 2012. [cited at p. 137]

[203] Y. Cheng, L. Ying, S. Jiao, P. Su, and D. Feng, “Bind Your Phone Number with Caution: Automated User Profiling Through Address Book Matching on Smartphone”, in Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, New York, NY, USA, pp. 335-340, 2013. [cited at p. 137]

[204] Y. Shafranovich, “Common Format and MIME Type for Comma- Separated Values (CSV) Files”, RFC 4180, Internet Engineering Task Force (IETF), Oct. 2005. [Online]. Available: https://tools.ietf.org/html/rfc4180. [Accessed: 26-Jan-2018]. [cited at p. 176]

[205] S. Faulkner, A. Eicholz, T. Leithead, A. Danilo, and S. Moon, “HTML 5.2”, 14-Dec-2017. [Online]. Available: https://www.w3.org/TR/html/. [Accessed: 02-Feb-2018]. [cited at p. 176]

[206] T. Bray, “The JavaScript Object Notation (JSON) Data Interchange Format”, RFC 7159, Internet Engineering Task Force (IETF), Mar. 2015. [Online]. Available: https://tools.ietf.org/html/rfc7159. [Accessed: 28-Jan- 2018]. [cited at p. 177]

[207] Open Definition, “Open Definition 2.1 - Open Definition - Defin- ing Open in Open Data, Open Content and Open Knowledge”. [Online]. Available: http://opendefinition.org/od/2.1/en/. [Accessed: 02-Feb-2018]. [cited at p. 177]

[208] N. B. Dale and J. Lewis, “Computer Science Illuminated”, Jones & Bartlett Learning, 2007. [cited at p. 177]

[209] Stanford University, “Best practices for file formats”, Stanford Libraries. [Online]. Available: http://library.stanford.edu/research/data-management- services/data-best-practices/best-practices-file-formats. [Accessed: 02-Feb- 2018]. [cited at p. 178]

[210] O. Lassila and R. R. Swick, “Resource Description Framework (RDF) Model and Syntax Specification”, 05-Jan-1999. [Online]. Avail- able: https://www.w3.org/TR/PR-rdf-syntax/. [Accessed: 31-Jan-2018]. [cited at p. 178]

[211] “data.gov.uk”. [Online]. Available: https://data.gov.uk/. [Accessed: 31- Jan-2018]. [cited at p. 178]

[212] archiveteam.org, “Scientific Data formats - Just Solve the File Format Problem”, 27-Jan-2018. [Online]. Available: http://fileformats.archiveteam.org/wiki/Scientific Data formats. [Accessed: 02-Feb-2018]. [cited at p. 178]

[213] T. Reschenhofer and F. Matthes, “An Empirical Study on Spreadsheet Shortcomings from an Information Systems Perspective”, in Business Information Systems, 2015, pp. 50-61. [cited at p. 178]

[214] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau, “Extensible Markup Language (XML) 1.0 (Fifth Edition)”, 26-Nov-2008. [Online]. Available: https://www.w3.org/TR/REC-xml/. [Accessed: 31-Jan- 2018]. [cited at p. 179]

[215] T. Berners-Lee, R. Fielding, U. C. Irvine, and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax”, Aug. 1998. [Online]. Available: https://tools.ietf.org/html/rfc2396. [Accessed: 02-Mar-2018]. [cited at p. 179]

[216] T. Berners-Lee, R. T. Fielding, and L. Masinter, “Uniform Re- source Identifier (URI): Generic Syntax”, Jan-2005. [Online]. Available: https://tools.ietf.org/html/rfc3986#section-3.2. [Accessed: 09-Mar-2018]. [cited at p. 180]

[217] S. Lohr, “The Web’s Inventor Regrets One Small Thing”, Bits Blog, 12- Oct-2009. [Online]. Available: //bits.blogs.nytimes.com/2009/10/12/the- webs-inventor-regrets-one-small-thing/. [Accessed: 09-Mar-2018]. [cited at p. 180]

[218] B. Lavoie and H. F. Nielsen, “Web Characterization Terminology & Definitions Sheet”, W3C Working Draft, 24-05-1999. [Online]. Avail- able: https://www.w3.org/1999/05/WCA-terms/#Resource3. [Accessed: 02-Mar-2018]. [cited at p. 180, 181, 182]

[219] I. Jacobs and N. Walsh, “Architecture of the World Wide Web, Volume One”, W3C Working Draft, 15-Dec-2004. [Online]. Available: https://www.w3.org/TR/webarch/#id-resources. [Accessed: 06-Mar-2018]. [cited at p. 181]

[220] R. Fielding and J. Reschke, “Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing”, June 2014. [Online]. Available: https://tools.ietf.org/html/rfc7230. [Accessed: 28-May-2018]. [cited at p. 182]

[221] M. Arlitt, “Characterizing Web User Sessions”, SIGMETRICS Perform. Eval. Rev., vol. 28, no. 2, pp. 50-63, Sep. 2000. [cited at p. 182]

[222] A. Soltani, A. Peterson, and B. Gellman, “NSA uses Google cookies to pinpoint targets for hacking”, Washington Post, 10-Dec-2013. Available: https://www.washingtonpost.com/news/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking/. [Accessed: 08-Mar-2018]. [cited at p. 182]

[223] GitHub, “GitHub Octoverse 2017”, GitHub Octoverse 2017 — Highlights from the last twelve months. [Online]. Available: https://octoverse.github.com/. [Accessed: 15-Apr-2018]. [cited at p. 183]

[224] Mozilla, “JavaScript”, JavaScript — MDN Web Docs, 2016. [On- line]. Available: https://developer.mozilla.org/bm/docs/Web/JavaScript. [Accessed: 15-Apr-2018]. [cited at p. 183]

[225] Node.js Foundation, “About — Node.js”. [Online]. Available: https://nodejs.org/en/about/. [Accessed: 15-Apr-2018]. [cited at p. 183]

[226] “Welcome to Python.org”, Python.org. [Online]. Available: https://www.python.org/about/. [Accessed: 15-Apr-2018]. [cited at p. 184]

[227] “PEP 20 – The Zen of Python — Python.org”, Python.org, 19-Aug- 2004. [Online]. Available: https://www.python.org/dev/peps/pep-0020/. [Accessed: 15-Apr-2018]. [cited at p. 184]

[228] R. Prediger and R. Winzinger, “Node.js: Professionell hochperformante Software entwickeln”, Carl Hanser Verlag GmbH Co KG, 2015. [cited at p. 184]

[229] “pyquery: A jquery-like library for python”, 01-Nov-2018. [Online]. Available: https://github.com/gawel/pyquery. [Accessed: 15-Apr-2018]. [cited at p. 184]

[230] “node-cron”, npm. [Online]. Available: https://www.npmjs.com/package/node-cron. [Accessed: 15-Apr-2018]. [cited at p. 184]

List of Figures

1.1 Data must be transformed or processed before interpretation is possible. There is no clear boundary between the entities of measurement, raw, and processed data. [18]
1.2 Information is typically defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge. [22, 23, 24]
1.3 OL: open licence; RE: machine readable; OF: open format; URI: RDF standards; LD: linked data. [55]

2.1 Scheme of a system in which a user aggregates Web resources. Changes originating from Web resources create events that are processed in a reactivity entity. The output is an action that controls Web resources. Personalised settings allow a user-specific orchestration of Web resources, e.g. Facebook, Google Mail, and Twitter. [91]
2.2 Reaction times of Web observations usually range from seconds to days (red boundary). An interval must be placed for each Web observation. Depending on the area of scope a meaningful interval is given, e.g. reports, weather forecasts, or press releases.
2.3 Basic Web observation architecture in which all components are stored on a server. [79]

4.1 Law enforcement agent requests data disclosure from cloud service provider (CSP) safekeeping information of its subscriber in its electronic communication or remote computing service.
4.2 Law enforcement agent requests access to data located in another country from a cloud service provider (CSP). Personal data is protected by human rights and the rule of law required by mutual legal assistance treaties. The CLOUD Act shall facilitate the extraterritorial access to data located in a foreign country.

5.1 Observing the Web needs an architecture that is capable of collecting and storing data over time. It must recognise whether s_n is new information. ∆t determines the time interval of the observation depending on the content.
5.2 Web observations need an architecture that is capable to collect and store data over time and thus must recognise whether s_n is new information.
5.3 The pyramid symbolises the decreasing awareness of website owners. The overall awareness of data collections decreases from the bottom with each step towards the top. While website owners understand the concept of a manual download of Web content, automated Web observations and their information gain are perhaps not known. [184]

6.1 ECA System interconnects Web services and Web resources. Web services often deliver data through an API in which the “Event Trigger” anticipates the data. It may also directly interact through events in the “Event Listener”. The “Rule Engine” works according to the stored “ECA Rules” and instructs the “Action Dispatcher”. [91]
6.2 Core functionalities are shown in rectangles and data storages in cylinders. “Poller” and “Event Listener” forward events to the “Event Queue” whereas the “Rules Engine” evaluates events for actions. Configuration settings are stored within 4 databases that consist of the “User Request Handler”. This kind of user input is managed via the user interface. [91]
6.3 Single building blocks form the basic components of the building block architecture. Each block outlines the interchangeability of each design decision. One block contains many different design choices that must be adjustable for Web observations.
6.4 Final container architecture using a container software.

7.1 Experiment 1, Web observation over the Intranet.
7.2 Experiment 2, Web observation over the Internet.
7.3 Detailed view of the building block architecture which represents the actual Web observation (green) and the Web resources on the server (red).
7.4 Web observation interval frequencies 250, 90, and 1 milliseconds on the Intranet. The measurement outlines a time range between 2 and 35 milliseconds.

7.5 Web observation interval frequencies from 500 to 0.05 milliseconds on the Intranet. All frequencies could finish their Web observation task and could successfully deploy and process 550 requests.
7.6 Web observation interval frequencies of 90, 30, and 1 milliseconds on the Internet. A high jump between 30 milliseconds and 1 millisecond can be observed.
7.7 Web observation interval frequencies from 500 to 0.05 milliseconds on the Internet. The frequencies of 1, 0.5, 0.1, and 0.05 milliseconds could not finish the task and timed out without notice before 550 requests had been deployed.

8.1 Screenshot “www.catch-a-car.ch”, the map outlines the perimeter in which cars can be parked on public ground. Car rentals must start and end within this zone. [191]
8.2 Screenshot “www.catch-a-car.ch”, detailed view of one car of the service. [191]
8.3 Examination of geolocations. A detailed view of a car park overlaid with the satellite view. [193]
8.4 Rounded data set. X values rounded to 4 decimals and Y values to 5 decimals.
8.5 Self experiment of the Web observation with the car number “050” on August 10, 2017.
8.6 Rental drive from A to B of car number “050” on August 10, 2017.
8.7 All 44 rental drives in January 2017, car number “001”. X-axis corresponds to longitude and y-axis to latitude.
8.8 All 44 rental drives in January 2017, car number “001”, overlaid map layer. X-axis corresponds to longitude and y-axis to latitude.
8.9 Average usage of the service during the day.
8.10 All rental drives in 2017, coloured by the lowest standing period (green) and longest standing period (red).
8.11 BVB measuring points of station boards in yellow at the station “Barfüsserplatz”.
8.12 The scatter-plot outlines incidents of tram lines in the observation period (March 31 – April 21). A major disturbance can be identified on April 5 (Tue 05.04). Depending on the selected line, performance is highlighted with a delay bubble. Best case scenario would be no colours at all. Colour-scale from green (delay = 1 min) to pink (delay > 15 min).

8.13 Headline changes of the Swiss newspaper “Tagesanzeiger”. Top figure corresponds to the hours 0-23 per square starting from left to bottom, bottom figure corresponds to the hours 0-23 per square starting from top to bottom. Colours: less than 2 items; 2–4 items; 4–6 items; 6–8 items; more than 8 items.
8.14 Hours 0-23 per square starting from top to bottom. Colours: less than 2 items; 2–4 items; 4–6 items; 6–8 items; more than 8 items.
8.15 Articles per section; commenting function on (blue), commenting function off (red). [200]
8.16 Observed prices from July 10 to December 17 (flight date). Red: flight Basel (BSL) to London Gatwick (LGW), blue: flight London Gatwick (LGW) to Basel (BSL). X-axis corresponds to the time while Y-axis corresponds to the price in CHF.
8.17 Screenshot “web.whatsapp.com”, WhatsApp Web interface.
8.18 Screenshot “web.whatsapp.com”, WhatsApp Web login page.
8.19 Average usage in minutes of a person per hour of the day. The Web observation has been conducted from May 7 – 28, 2018.

Appendices

Appendix A

Glossary

File Formats

Comma-Separated Files

The Comma-Separated Value (CSV) file format is widely used for the transfer of large and homogeneous data. Data saved in the CSV format is believed to be of little use without comprehensible documentation, which is why it is important to document the file contents and data fields. Such documentation makes the contained data easier to understand and use. According to the specification, the structure of the file must be strictly respected; otherwise reading errors will most likely occur, due to the strict syntax processing. [204]
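As a brief illustration of this strict syntax, the following minimal Node.js sketch parses a small, well-documented CSV snippet. The column names and values are purely illustrative, and real RFC 4180 data with quoted fields would need a proper CSV parser rather than a naive split.

```javascript
// Minimal sketch: parsing a small, documented CSV snippet in Node.js.
// A naive split is only safe because no field contains commas, quotes,
// or line breaks; quoted RFC 4180 fields require a real CSV parser.
const csv = 'station,line,delay_min\nBarfuesserplatz,8,2\nMarktplatz,11,0';

const [headerLine, ...rows] = csv.split('\n');
const header = headerLine.split(',');

const records = rows.map((row) => {
  const fields = row.split(',');
  // Build one object per row, keyed by the documented column names.
  return Object.fromEntries(header.map((name, i) => [name, fields[i]]));
});

console.log(records);
// [ { station: 'Barfuesserplatz', line: '8', delay_min: '2' }, ... ]
```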

HTML

A vast amount of data is encapsulated in the HTML format on many Web pages. HTML may be used for stable content with a limited area of application. Modifying HTML is sometimes difficult to achieve; however, it is still a convenient way to display data or to refer to a Web page that contains data [205].

JSON

JavaScript Object Notation (JSON) is a text-based format for language-independent data interchange. According to the corresponding “Request for Comments” (RFC) document series, JSON is a lightweight file format based on the ECMAScript Programming Language Standard. It uses a small set of formatting rules in order to transfer data arrays and objects [206]. JSON provides two types of compositions:

• Name and value pairs, e.g. an object, record, or hash table;

• Ordered lists of values, e.g. an array, list, or sequence. [206]

Furthermore, it is a widely held view that JSON offers humans an easy-to-read structure while machines are still able to process it.
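The following minimal sketch illustrates both compositions with Node's built-in JSON object; the field names and values are illustrative only.

```javascript
// Minimal sketch: the two JSON compositions described above, handled with
// the built-in JSON object in Node.js (no extra libraries required).
const text = `{
  "resource": "https://example.org/timetable",
  "observedAt": "2018-05-13T10:15:00Z",
  "departures": [ { "line": "8", "delayMin": 2 }, { "line": "11", "delayMin": 0 } ]
}`;

const observation = JSON.parse(text);        // text -> objects and arrays
console.log(observation.departures[0].line); // "8"

const serialised = JSON.stringify(observation, null, 2); // objects -> text
console.log(serialised);
```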

Open File Formats

Even if data is stored in a machine-readable form, issues with the chosen file format may still remain. Open file formats, or open formats, resolve the most common proprietary issues. They are defined by the Open Definition 2.1 as follows: “An open format is one which places no restrictions, monetary or otherwise, upon its use and can be fully processed with at least one free/libre/open-source software tool.” [207]. In contrast to a “closed” file format, the specifications of an open file format must be available to everybody free of charge and without limitations on re-use. Closed file formats, by contrast, do not publicly outline their specifications and therefore limit their re-use.

Plain Text

Plain text files (TXT) involve structured sequences of lines for the storage of electronic text. Two denominations are commonly used: text files and binary files. A text file stores bytes of data as characters by using the American Standard Code for Information Interchange (ASCII) or, for bigger character sets, Unicode. In contrast to a text file, a binary file requires a specific interpretation of each bit of the data within the file. No meta data can be written into a text document; however, string operations for parsing may be used for interpretation. [208]
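A minimal Node.js sketch may help to illustrate the difference: the same bytes can be read either as decoded text or as a raw binary buffer. The file name is purely illustrative.

```javascript
// Minimal sketch: the same bytes read as text versus as a binary buffer,
// using Node's built-in fs module. The file name is purely illustrative.
const fs = require('fs');

fs.writeFileSync('note.txt', 'Web observation log\n');   // stored as UTF-8 text

const asText = fs.readFileSync('note.txt', 'utf8');      // decoded to a string
const asBinary = fs.readFileSync('note.txt');            // raw Buffer of bytes

console.log(typeof asText, asText.trim());               // string Web observation log
console.log(Buffer.isBuffer(asBinary), asBinary[0]);     // true 87 ('W' in ASCII)
```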

Proprietary File Formats
File formats with secret encoding schemes are considered proprietary. Often designed by a company or organisation, proprietary file formats are only usable with the help of particular software or hardware. The licence holder therefore controls the specification of the data encoding and possibly its patents. These restrictions are intended to hinder third parties from using the format for their own purposes. For sustainability reasons, data should be stored in non-proprietary (open) file formats whenever possible. Otherwise data loss may occur when a software manufacturer no longer supports the file format [209].

RDF
The Resource Description Framework (RDF) provides the basis for processing meta data and plays a key role in the Semantic Web. It enables multiple data sources to be combined through the representation of data and facilitates the automatic processing of Web resources. It further enhances the interoperability between Web applications [210]. JSON and XML file formats may be used for RDF storage. Open Data initiatives such as that of the British government use RDF to interconnect open datasets [211].

Scientific File Formats
Scientific data formats such as the Hierarchical Data Format (HDF) are commonly used to store large amounts of data. There are many different file formats available for specific scientific applications. HDF, for example, is supported by applications and languages such as MATLAB, Python, and Julia. Depending on the actual field of science, many different formats can be applied. Many commercial and non-commercial formats are in use for research purposes [212].

Spreadsheet
The spreadsheet is still believed to be the most common information medium in both governmental agencies and businesses [213]. Rows, columns, inherent descriptions, and sheets most often qualify for publication because the data is immediately available. However, macros and formulas may make it more difficult to comprehend the underlying calculations. According to the Open Data Handbook, it is recommended to document formulas and macros for better guidance [45].

Text Document
Text document file formats such as DOC, ODF, or PDF are sometimes good enough to store various types of data. Data that is more or less stable, e.g. mailing lists, is typically stored within text documents [45]. A major downside of text documents is the lack of support for a consistent structure. Depending on which text editing software is used, difficulties arise when the data must be processed automatically. Besides, it is a widely held view that the DOC format should not be used if the data exists in a different format.

XML
The goal of the Extensible Markup Language (XML) is to provide a simple, general, and widely supported file format for data exchange. It makes it possible to keep the structure within the data and, at the same time, allows the documentation of the data to be expanded. [214]

Technologies

Resource
According to the URI specification, a resource “can be anything that has identity”; it is thus not limited to things accessible through the Web, and a book that is not accessible through the Web is also a resource [215]. The term resource has become a common term for things that have an identity, whether they are on the Web or not, e.g. Web pages or human beings. The focus here lies on the subset associated with the Web: Web resources, which form the actual Web content. Uniform Resource Identifiers (URIs) are unique strings of characters used to identify resources of all sorts; a specific type of URI, the Uniform Resource Locator (URL), is most often used for Web addresses. [28] The notion of addressable documents and files has changed ever since. The understanding is far broader nowadays, and Web resources are considered to be all things that can be obtained through the Internet. In contrast to resources governed by the operating system, Web resources simultaneously constitute Web content.

Web Page or Web Site
A Web page is nothing less than a Web document that can be accessed with a browser from the Internet, but also within an intranet. Web pages usually consist of structured or unstructured text in which multimedia elements and images are put in place. Web pages are the main sources of information; each is identifiable through one particular URL, and they are interconnected with hyperlinks.

Authority and Path
A host page identifies a page with a distinct Web address, which may also contain an “authority” and a “path”. The authority is the clearly defined Web address that can be extended with a path or a directory; it may also be empty, without any character. The authority starts right after “http:” and a double slash [216]. Interestingly, Tim Berners-Lee considers the double slash a mistake that he used for no purpose; he would certainly not introduce a double slash again [217]. To count how many people have accessed a Web page, a page view is counted when a page has completely loaded. Page views are a good measuring unit for quantifying Web client traffic. [218]
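As a minimal illustration, the WHATWG URL class built into Node.js decomposes a Web address into these parts; the address used here is a placeholder.

    // Minimal sketch: decomposing a Web address into authority and path
    // with the URL class built into Node.js. The address is a made-up example.
    const address = new URL('https://www.example.org:8080/observations/prices?day=2018-05-07');

    console.log(address.protocol);   // 'https:' -> precedes the double slash
    console.log(address.host);       // 'www.example.org:8080' -> the authority
    console.log(address.pathname);   // '/observations/prices' -> the path
    console.log(address.search);     // '?day=2018-05-07'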

Request
The HyperText Transfer Protocol (HTTP), for example, enables the retrieval of linked hypertext documents from across the Web. HTTP requests are used by HTTP clients such as the browser to trigger a response from an HTTP server. An HTTP request consists of a method, a URL, and the request header. Each request is initiated by a specific method to inform the server what to do with it. The most common HTTP methods are “GET” and “POST”. Following the method, the syntax demands the URL and the HTTP version used. [218] A Web request is a subset of a request which is mainly used by Web clients. There are two distinct forms of Web requests:

• Explicit request: a manual, user-initiated request;

• Implicit request: initiated by the Web client. [218]

While explicit and implicit Web requests are initiated by Web users or Web clients respectively, two further distinctions in terms of Web resource interaction need to be clarified:

• Embedded request: a request encapsulated within a Web resource, e.g. a hyperlink;

• User-input request: a direct request from the Web user to the Web client, e.g. a click on a bookmarked hyperlink.

When a Web user clicks on a hyperlink in an HTML document, this is an explicit but embedded Web request, whereas the manual input of a Web address taken from an advertisement is considered an explicit user-input Web request. [218]

Response
The HTTP response, for example, is the answer of an HTTP server to the client. The HTTP response consists of the HTTP version, the status code of the response, and the plain text message of the status code. Following the status code, the syntax demands the header and a HyperText Markup Language (HTML) file. HTML is the standard markup/formatting language for the Web. [218]
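The following sketch shows an explicit GET request and the handling of its response using the built-in Node.js https module; the host example.org is a placeholder.

    // Minimal sketch of an HTTP request and its response with the built-in
    // Node.js https module. The target host is a placeholder.
    const https = require('https');

    const req = https.request(
      { hostname: 'example.org', path: '/', method: 'GET' },  // method + URL parts
      res => {
        console.log(res.statusCode);               // e.g. 200
        console.log(res.headers['content-type']);  // a response header

        let body = '';
        res.on('data', chunk => { body += chunk; });
        res.on('end', () => console.log(body.length, 'bytes of HTML received'));
      }
    );

    req.on('error', err => console.error(err.message));
    req.end();   // send the request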

Web Client
HTTP or Web clients facilitate access to Web resources by sending requests and processing responses, which involves the rendering of Web resources. Clients control the process of sending requests to Web servers and receiving responses. A Web client is also called a browser. Browsers come in two distinct designs: obsolete line browsers and state-of-the-art graphical browsers. Graphical browsers are considered one of the enabling factors of the success of the Internet. Hyperlinks are part of this technology and refer to other Web resources. A hyperlink points to another Web resource (for example an HTML document) and enhances navigation, e.g. by a click. [218] Thus, Web clients send requests to a Web server, and the Web server sends back an HTML document with the help of HTTP. Web clients interpret the HTML instructions from Web server responses and present them to the user. [219]

Web Server
An HTTP server, or Web server, provides Web clients with access to Web resources. A Web server sends a Web response to a Web client. The response contains the Web response header, which comprises information about the Web server. The actual data displayed in the browser is in the Web response body. [218]

User Session
On the Web, communication is not a persistent flow between the client and a server connection. Suitable information is therefore necessary for the identification of unique visitors. User sessions help Web clients and Web servers to associate data with a specific Web user. For that reason, sessions must be implemented on the application layer, because an IP address or a client ID may be the same for two Web users. Once a user is online, the session identifies his or her activities. [218]

Since HTTP is a stateless communication protocol, the sender of packets does not expect acknowledgement of receipt by that protocol. This means that the internal state of the server is not known and information about the session itself is not kept on either side. Requests can only be associated if further information is available. Therefore, it is sometimes necessary to include additional information in requests that has to be interpreted by the server. [220]

A new session is started every time a new visitor who has not yet been assigned a session accesses a respective Web page. Depending on the transmission method of the session, a session may be terminated without a logout. Sessions are implemented by means of a cookie in the HTTP header (client side) or on the server side. Client-side sessions are more comfortable for the user if sessions should not expire for a duration of several days or weeks. Such session information is stored either on the client or the server side and is regularly used by the “shopping carts” of online shops. That is why, in certain online shops, a returning visitor can still see the items put into the shopping cart, although he or she might not have proceeded to check out before exiting the session. [221]

For Web observations, sessions play a vital role in the process of collecting Web data. Depending on the structure of a Web page, a session ID – most often a string – or a cookie is sometimes attached to the Web client. The client needs this information to get a proper response from the server. If a request is imitated without the session ID or cookie, no answer from the server can be expected.

Cookie
As mentioned before, cookies can be described as unique session identifiers. They store the user session on the client or server side in order to remember the client's interactions in further requests. Cookies store essential information about a Web user's interaction, e.g. the selection of goods in an online shop. Cookies may also be used for caching purposes, e.g. if a mobile connection to the server is not always available. [218] Tracking cookies are used for “tagging” visitors in order to recognise Web users. Ordinary cookies are used on Web sites where the visitor must log in with his or her credentials. A username, password, and email address will possibly identify a user for a session. Tracking cookies, in contrast, are used to observe a user's browsing behaviour over a longer period of time. [222]
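A minimal sketch of replaying such session information, as a Web observation often has to do, using the built-in Node.js https module; the host and both paths are placeholders.

    // Minimal sketch: capturing a session cookie from a first response and
    // replaying it in a second request. Host and paths are placeholders.
    const https = require('https');

    https.get({ hostname: 'example.org', path: '/login' }, first => {
      // "set-cookie" carries the session identifier assigned by the server.
      const cookies = (first.headers['set-cookie'] || []).map(c => c.split(';')[0]);
      first.resume();   // discard the first body

      https.get(
        { hostname: 'example.org', path: '/data', headers: { Cookie: cookies.join('; ') } },
        second => {
          console.log(second.statusCode);  // without the cookie, many servers refuse this request
          second.resume();
        }
      );
    });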

Appendix B

Programming Languages

There are numerous programming languages with plenty of libraries that offer functionality for creating a Web observation. When faced with this choice, it might be a good idea to take a look at the Web hosting service GitHub. The service is home to many open source projects and is used by many programmers to store the code of their own projects. According to GitHub, the most commonly used programming language is JavaScript, followed by Python, which displaced Java from second to third place in 2017. [223] From this, it may be inferred that the use of JavaScript and Python is fairly widespread. The larger the programming language community, the more code is available as reference for similar work. Moreover, the likelihood that someone else has run into a similar problem is rather high. However, the choice of a programming language may also be a matter of personal taste. The two most used programming languages on GitHub are:

• JavaScript JavaScript, or short JS, is a high-level programming language which tends to be easier to use than a lower-level language. Besides the characteristics of an asynchronous, dynamic, and interpreted programming language, it supports event-driven functionality in order to perform actions in response to, e.g., user input. JavaScript has the power to modify the HTML DOM structure, e.g. elements, attributes, and styles. Furthermore, it can interact with the HTML events of a website. [224] Node.js, an open-source JavaScript run-time environment, has extended JavaScript from client-side to server-side scripting in order to create dynamic websites and thus enables scalable Web applications. [225]

• Python Python is also a high-level programming language and is a general-purpose language used in a wide variety of application domains [226]. The advantage of Python is its philosophy of code readability, enforced by the off-side rule which manifests itself in indented blocks [227].

Both programming languages are platform independent and can be run on multiple operating systems.

Libraries
A comprehensive range of libraries is available for many purposes. Libraries are reusable collections of configuration data, templates, pre-written code, subroutines, classes, values, and types. There are static and dynamic libraries; static libraries are merged with the program code and are usually faster, in contrast to dynamic libraries, which are linked to the code at start-up of the program and are thus reusable. Through the use of such libraries, the difficulty of programming is drastically reduced. [228] Well-known JavaScript libraries on the Web include:

• Cheerio Cheerio parses markup languages, e.g. HTML, for manipulating the data structure of a Web resource. It uses jQuery-style selectors to select tags of the DOM structure of a document. Achieving the same with string operations such as split(), substring(), or replace() would be extremely time-consuming; a jQuery selector reduces the length of the code to a single line. In the event of a change of the Web resource structure, the selector must also be updated. [132] pyQuery offers a similar jQuery-like approach for Python [229].
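A minimal sketch of such a selector; the HTML fragment and the selector are made-up examples.

    // Minimal sketch of selecting elements with Cheerio.
    const cheerio = require('cheerio');

    const html = '<ul id="headlines"><li>Headline A</li><li>Headline B</li></ul>';
    const $ = cheerio.load(html);

    // One jQuery-style selector instead of many manual string operations.
    const headlines = $('#headlines li').map((i, el) => $(el).text()).get();

    console.log(headlines);   // [ 'Headline A', 'Headline B' ]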

• Needle Needle provides HTTP/HTTPS requests in Node.js for interacting with Web data. It may also handle authentication, file uploads, streaming, and XML and JSON parsing. Needle essentially enables the communication with Web servers. [131]
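A minimal sketch of a GET request with Needle; the URL is a placeholder.

    // Minimal sketch of an HTTP GET with Needle.
    const needle = require('needle');

    needle.get('https://example.org/data.json', (err, res) => {
      if (err) return console.error(err.message);
      console.log(res.statusCode);
      // For JSON responses, res.body is typically already parsed into an object.
      console.log(res.body);
    });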

• Cron Cron jobs are important for repetitive tasks such as a Web observation. Implementing similar functionality in plain JavaScript may run into problems with asynchronous functions, time-outs, and callbacks, which are sometimes difficult to avoid. Cron reduces this complexity by executing applications on a defined time schedule. [230]
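A minimal sketch of an hourly observation schedule, assuming the npm package cron and a hypothetical fetchPage() function that stands in for the actual observation code.

    // Minimal sketch: triggering an observation every hour with the "cron" package.
    const { CronJob } = require('cron');

    function fetchPage() {
      console.log('observation run at', new Date().toISOString());
      // ... request the Web resource and store the extracted data here ...
    }

    // Six-field pattern: second minute hour day-of-month month day-of-week.
    const job = new CronJob('0 0 * * * *', fetchPage);
    job.start();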

Alexander Olivier Gröflin

Particulars

Address: PO Box 156, 4009 Basel, Switzerland
Date of birth: May 25, 1985
Nationality: Swiss

Education

12.12.2018 PhD examination, University of Basel, Switzerland
09.2013 – 12.2018 PhD candidate in Computer Science, University of Basel, Switzerland
10.2012 – 09.2013 MSc in Management of Information Systems and Innovation, London School of Economics and Political Science (LSE), United Kingdom
09.2011 – 09.2012 MSc in Computer Science, University of Camerino (UNICAM), Italy
09.2007 – 09.2012 MSc and BSc in Business Information Systems, University of Applied Sciences Northwestern Switzerland (FHNW)
08.2002 – 08.2006 Professional Maturity, Swiss Federal Certificate in Computer Science, High School for Information Technology, WG/WMS Basel, Switzerland

Publications

A. Gröflin, M. Weber, M. Guggisberg, and H. Burkhart (2017). Traffic Flow Measurement of a Public Transport System through automated Web Observation. IEEE Eleventh Int. Conf. on Research Challenges in Information Science (RCIS), Brighton, UK, 2017, pages 156–161. DOI: 10.1109/RCIS.2017.7956532

S. Price, W. Hall, G. Earl, [and 17 others, including A. Gröflin] (2017). Worldwide Universities Network (WUN) Web Observatory: Applying Lessons from the Web to Transform the Research Data Ecosystem. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion), pages 1665–1667. DOI: 10.1145/3041021.3051691

A. Gröflin (2016). Poster session, Real-Time Web Observations: Data Science meets HPC, poster presentation at the 45th SPEEDUP workshop.

A. Gröflin, D. Bosch, M. Guggisberg, and H. Burkhart (2015). Facilitating the Reactive Web – A Condition Action System using Node.js. In Proceedings of the 11th International Conference on Web Information Systems and Technologies, ISBN 978-989-758-106-9, pages 89–95. DOI: 10.5220/0005446800890095

Talks

A. Gröflin (September 26, 2018). Web Observationen – Automatische Sammlung von Echtzeit-Daten aus dem Internet zur quantitativen und qualitativen Analyse. Talk at connect Dreiländereck, Datensicherheit und Dateneigentum in Zeiten vernetzter Systeme, Meine Daten gehören mir!, Lörrach, Germany.

A. Gröflin (November 17, 2017). Web Observatory – Zooming in or Spying out? Collecting Web Data. PhD Seminar, organised by the Faculty of Law at the University of Basel and the European Criminal Law Academic Network (ECLAN), Basel, Switzerland.

A. Gröflin (February 16, 2017). Architecture of a Real-Time Web Observation. 3rd International Winter School on Big Data “BigData 2017”, Università degli Studi di Bari Aldo Moro & Universitat Rovira i Virgili, Bari, Italy.

A. Gröflin & D. Bosch (October 27, 2016). Real-time visualisations for criminal data analysis. Talk at the conference of cantonal prosecution authorities, Staatsanwaltschaft Basel-Stadt, Switzerland.

A. Gröflin, C. Frei, and D. Bosch (September 23, 2016). Daten über das Auto – Was kann ein Web Observatory liefern?. Intelligenter Verkehr – Rechtsfragen im Kontext, Landgut Castelen, Augst, Switzerland.

A. Gröflin (October 2015). Web data extraction at the University of Basel. Talk at the Worldwide Universities Network (WUN) Web Observatory Project, University of Southampton, United Kingdom.

Thesis Co-Advisor

M. Weber (2017). Component Based Web-Scraping Strategies, Master Thesis.
C. Frei (2016). Track a Car: Untersuchung von Methoden zur Gewinnung von Echtzeitdaten am Beispiel Catch a Car, Bachelor Thesis.
M. Weber (2015). Visualisation of Real-Time Web Data [of a local transport provider], Bachelor Thesis.
J. Simonet (2015). Leserkommentare des Newsportals 20 Minuten; Analyse von Webinhalten mithilfe des Condition Action Tools WebAPI, Bachelor Thesis.
E. Hodzic (2015). Programmierbare Webassistenten: Technologien und Nutzungsszenarien, Bachelor Thesis.

Teaching Assistant and Examiner

10851-01 – Vorlesung: Werkzeuge der Informatik, Prof. Dr. Helmar Burkhart (HS 2014, HS 2015).
10906-01 – Vorlesung: Algorithmen und Datenstrukturen, Prof. Dr. Helmar Burkhart (FS 2014, FS 2015, FS 2016).
28518-01 – Vorlesung: Web Data Management, Prof. Dr. Helmar Burkhart & Prof. Dr. Heiko Schuldt (HS 2016).