Long-Term Location-Independent Research Data Dissemination Using Persistent Identiﬁers

Long-Term Location-Independent Research Data Dissemination Using Persistent Identifiers Dissertation zur Erlangung des Doktorgrades Doktor rerum naturalium“ ” der Mathematisch-Naturwissenschaftlichen Fakultaten¨ der Georg-August-Universitat¨ zu Gottingen¨ im PhD Programme in Computer Science (PCS) der Georg-August University School of Science (GAUSS) vorgelegt von Oliver Wannenwetsch (geb. Schmitt) aus Stuttgart Gottingen,¨ 2016 Betreuungsausschuss: Prof. Dr. Ramin Yahyapour Gesellschaft fur¨ wissenschaftliche Datenverarbeitung Gottingen¨ mbH (GWDG), Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Prof. Dr. Jens Grabowski Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Dr. Lena Wiese Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Prufungskommission:¨ Referent: Prof. Dr. Ramin Yahyapour Gesellschaft fur¨ wissenschaftliche Datenverarbeitung Gottingen¨ mbH (GWDG), Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Korreferenten: Prof. Dr. Jens Grabowski Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Weitere Mitglieder Dr. Lena Wiese der Prufungskommission:¨ Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Prof. Dr. Xiaoming Fu Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Prof. Dr. Caroline Sporleder Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Jun.-Prof. Dr. Marcus Baum Institut fur¨ Informatik Georg-August-Universitat¨ Gottingen¨ Tag der mundlichen¨ Prufung:¨ 11. Januar 2017 Abstract Research data occurs in all scientific experiments, computer simulations, observations or as a derivation from other datasets, literature or publications. As a subset of the general concept of digital data, it is classified through its distinct state and its origin. Enriched with descriptive metadata, research data serves as a foundation for discoveries and publishing results in various formats. For citing and linking specific research datasets and publications, unique and persistent identification is necessary. Today, this is realized by Persistent Identifier (PID) systems that provide stable identification for digital entities and an optional annotation by descriptive metadata. Moreover, PID systems abstract the current network location of data in order to anticipate changes in its network location, owed to alternating Uniform Resource Locators (URL) on the World Wide Web (WWW). Applying these concepts, PID systems have tagged billions of research datasets and publications over the past 20 years. On these foundations, the Handle PID system, known from the Digital Object Identifier (DOI) system, provides reliable access to digital publications and research data to the whole scientific community. While the architecture of the Handle system itself, which depends on fixed network locations, was designed with farsightedness, additional end-user services for PID resolution and management have introduced critical weak spots that can be discovered by comprehensively reviewing the current state-of-the-art. This thesis focuses on the adaption of location-independent network paradigms which have shown encouraging results when applied to several problems in the domain of decentralized network infrastructures in PID systems. Our first approaches aim at evolving the Handle system design into a self-adjusting system for all major infrastructure services that does not depend on fixed network locations. We tackle this by incorporating strategies and techniques from location-independent network paradigms originating from the current research branch of Named Data Networking (NDN). By this, major weak spots can be eliminated in the Handle PID system and it becomes robust against core infrastructure outages, sudden network topology changes, packet loss and heavy load situations. The second goal of the thesis is the integration of next generation data dissemination technologies based on location-independent network paradigms into the domain of persistent identifier systems. Therefore, we propose to employ the Handle system for citing research datasets which are disseminated by location-independent technologies based on BitTorrent and NDN. To tackle the trust challenges of dynamic data locations, we create a novel approach for trusted data dissemination in location-independent networks that ensures the authenticity of data as well as the attribution to data issuers. This is done by incorporating the foundations of the Handle PID system and a further format for exchanging complex access information in PIDs. Acknowledgements The work on this thesis has been conducted in the very interesting environment of the eScience group of the GWDG and the Institute for Computer Science at the Georg-August-Universitat¨ Gottingen.¨ First, I would like to express my great thanks and gratitude to my advisor Ramin Yahyapour for his supervision, his encouraging support, the inspiring discussions and the feedback he provided to me. I am also very thankful to Jens Grabowski and Lena Wiese for being my second and third supervisor and their interest in my scientific activities, Jens Grabowski additionally for being the second referee of this thesis. I also like to thank Tim Majchrzak, Sven Bingert, and Philipp Wieder for their advice and support during the thesis. Furthermore, I want to thank my colleagues of the eScience group and especially the data management section, who provided input and challenges for creating new research questions. This thesis would not have been possible without the financial support of the “Deutsche Forschungsgemeinschaft (DFG)” in the collaborative research center 963 “Astrophysical Flow Instabilities and Turbulence” and the support of the “Digital Humanities Research Collaboration” founded by the “Niedersachsisches¨ Vorab der Volkswagen Stiftung”. Additionally, I am very grateful for the numerous possibilities to travel to conferences in Germany, Europe and the United States. My special thanks go to my family and friends for their understanding and support. Finally, I want to thank my wife for her endless love, patience and sharp-witted advice. Contents 1 Introduction 1 1.1 Motivation . .2 1.2 Scope of Thesis . .4 1.3 Goals and Contributions . .5 1.4 Impact . .8 1.5 Structure of Thesis . .9 2 Foundations 11 2.1 Research Data Management . 11 2.2 Digital Data Repositories . 14 2.3 Persistent Identifiers . 16 2.4 Handle System . 18 2.5 Magnet Links . 24 2.6 Overlay Networks with BitTorrent . 25 2.6.1 General Principles . 26 2.6.2 Network Organization . 26 2.6.3 Data Organization . 29 2.7 Information Centric Networks with Named Data Networking . 31 2.7.1 Differentiation between CCN and NDN . 31 2.7.2 General Principles . 32 2.7.3 Naming Data . 33 2.7.4 Packet Types . 33 2.7.5 Node Design . 34 2.7.6 Routing . 37 2.7.7 Data Transport and Flow Control . 40 2.7.8 Content Validation and Content Protection . 41 2.8 Cryptography . 42 2.8.1 Symmetric Encryption . 42 2.8.2 Asymmetric Encryption . 43 2.8.3 Digital Signatures . 44 2.8.4 Symmetric Authentication . 45 3 Problem Statements 47 Contents 4 Related Work 53 4.1 Research Data Dissemination With Overlay Networks . 53 4.2 Research Data Dissemination With Named Data Networking . 55 4.3 Persistent Identifier in Named Data Networking . 57 4.4 Naming Schemes for Archive Data Access . 61 4.5 Running Legacy Network Applications in NDN . 61 4.5.1 Location-based Network Protocols over NDN . 62 4.5.2 Application Protocol Adaption for NDN . 63 4.5.3 Communication Application Interfaces Adaption . 65 4.5.4 Transparent Proxies . 66 4.6 Summary and Research Delta . 67 5 Location-Independent Persistent Identifiers 73 5.1 Persistent Identifier in Location-Independent Networks . 73 5.2 Improvements and Benefits . 74 5.3 Approach . 75 5.3.1 General Principles . 76 5.3.2 PID NDN Namespace Convergence . 76 5.3.3 Access Models . 78 5.3.4 Interoperability Model . 98 5.4 Implementation . 108 5.4.1 NDN-Enabled Handle Server . 109 5.4.2 Handle Library Modification for NDN Connectivity . 111 5.4.3 Native Handle Protocol Transport With the NDNInterface . 113 5.4.4 PID Publishing Subsystem . 119 5.5 Evaluation . 122 5.5.1 Simulator Environment . 122 5.5.2 Evaluation Input Data Preparation . 123 5.5.3 Native Handle Communication Using NDN PID Push . 127 5.5.4 PID Publishing using NDN PID Pull . 132 6 Location-Independent Data Access using Persistent Identifiers 137 6.1 Improvements and Benefits . 138 6.2 Distribution of PID Maintenance Efforts . 139 6.3 Approach . 140 6.3.1 Magnet URI Scheme Extension for NDN . 141 6.3.2 Magnet URI Scheme Extension for Trusted Data Access . 142 6.3.3 Embedding Magnet Links into Handle PID . 146 6.3.4 Data Access Service Chain . 146 6.3.5 Creation and Maintenance of PIDs . 148 6.3.6 Data Access from Location-Dependent Networks . 150 6.4 Implementation . 153 6.4.1 Server Side . 153 viii Contents 6.4.2 Client Side . 155 6.5 Evaluation . 156 6.5.1 PID Size Increase . 156 6.5.2 Data Access Duration . 164 7 Discussion 167 7.1 Answers to Research Questions Concerning Location-Independent Persistent Identifiers . 167 7.2 Limitations Of Location-Independent Persistent Identifiers . 169 7.3 Answers to Research Questions Concerning Location-Independent Data Access Using Persistent Identifiers . 171 7.4 Limitations Of Location-Independent Data Access Using Persistent Identifiers172 8 Conclusion 175 8.1 Summary . 175 8.2 Outlook . 177 Bibliography 179 List of Acronyms 195 List of Symbols 201 List of Definitions 203 List of Figures 205 List of Listings 209 List of Tables 211 Appendix 213 A.1 Handle Source Code Remarks . 213 A.1.1 Removal of URN Data Type Support . 213 A.2 Handle Source Code Patches and Additions . 215 A.2.1 Patches for NDN-enabled Native Handle Communication Using NDN PID Push . 215 A.2.2 Additions for NDN-enabled Native Handle Communication Using NDN PID Push . 233 A.2.3 Patches for PID Publishing Using NDN PID Pull . 244 A.2.4 Additions for PID Publishing Using NDN PID Pull (Server) . 245 A.2.5 Additions for PID Publishing Using NDN PID Pull (Client) . 253 A.3 Simulation Environment . 258 A.3.1 PID Resolution Request Classification By Handle Prefixes . 258 ix Contents A.3.2 Collecting Primary Handle Site Data .

Load more