Describing Data Patterns. a General Deconstruction of Metadata Standards
Total Page:16
File Type:pdf, Size:1020Kb
Describing Data Patterns A general deconstruction of metadata standards Dissertation in support of the degree of Doctor philosophiae (Dr. phil.) by Jakob Voß submitted at January 7th 2013 defended at May 31st 2013 at the Faculty of Philosophy I Humboldt-University Berlin Berlin School of Library and Information Science Reviewer: Prof. Dr. Stefan Gradman Prof. Dr. Felix Sasaki Prof. William Honig, PhD This document is licensed under the terms of the Creative Commons Attribution- ShareAlike license (CC-BY-SA). Feel free to reuse any parts of it as long as attribution is given to Jakob Voß and the result is licensed under CC-BY-SA as well. The full source code of this document, its variants and corrections are available at https://github.com/jakobib/phdthesis2013. Selected parts and additional content are made available at http://aboutdata.org A digital copy of this thesis (with same pagination but larger margins to fit A4 paper format) is archived at http://edoc.hu-berlin.de/. A printed version is published through CreateSpace and available by Amazon and selected distributors. ISBN-13: 978-1-4909-3186-9 ISBN-10: 1-4909-3186-4 Cover: the Arecibo message, sent into empty space in 1974 (image CC-BY-SA Arne Nordmann, http://commons.wikimedia.org/wiki/File:Arecibo_message.svg) CC-BY-SA by Widder (2010) Abstract Many methods, technologies, standards, and languages exist to structure and de- scribe data. The aim of this thesis is to find common features in these methods to determine how data is actually structured and described. Existing studies are limited to notions of data as recorded observations and facts, or they require given structures to build on, such as the concept of a record or the concept of a schema. These presumed concepts have been deconstructed in this thesis from a semiotic point of view. This was done by analysing data as signs, communicated in form of digital documents. The study was conducted by a phenomenological research method. Conceptual properties of data structuring and description were first col- lected and experienced critically. Examples of such properties include encodings, identifiers, formats, schemas, and models. The analysis resulted in six prototypes to categorize data methods by their primary purpose. The study further revealed five basic paradigms that deeply shape how data is structured and described in practice. The third result consists of a pattern language of data structuring. The patterns show problems and solutions which occur over and over again in data, independent from particular technologies. Twenty general patterns were identified and described, each with its benefits, consequences, pitfalls, and relations to other patterns. The results can help to better understand data and its actual forms, both for consumption and creation of data. Particular domains of application include data archaeology and data literacy. iv Zusammenfassung Diese Arbeit behandelt die Frage, wie Daten grundsatzlich¨ strukturiert und be- schrieben sind. Im Gegensatz zu vorhandenen Auseinandersetzungen mit Daten im Sinne von gespeicherten Beobachtungen oder Sachverhalten, werden Daten hierbei semiotisch als Zeichen aufgefasst. Diese Zeichen werden in Form von digitalen Dokumenten kommuniziert und sind mittels zahlreicher Standards, Formate, Spra- chen, Kodierungen, Schemata, Techniken etc. strukturiert und beschrieben. Diese Vielfalt von Mitteln wird erstmals in ihrer Gesamtheit mit Hilfe der phenomenologi- schen Forschungsmethode analysiert. Ziel ist es dabei, durch eine genaue Erfahrung und Beschreibung von Mitteln zur Strukturierung und Beschreibung von Daten zum allgemeinen Wesen der Datenstrukturierung und -beschreibung vorzudrin- gen. Die Ergebnisse dieser Arbeit bestehen aus drei Teilen. Erstens ergeben sich sechs Prototypen, die die beschriebenen Mittel nach ihrem Hauptanwendungszweck kategorisieren. Zweitens gibt es funf¨ Paradigmen, die das Verstandnis¨ und die An- wendung von Mitteln zur Strukturierung und Beschreibung von Daten grundlegend beeinflussen. Drittens legt diese Arbeit eine Mustersprache der Datenstrukturierung vor. In zwanzig Mustern werden typische Probleme und Losungen¨ dokumentiert, die bei der Strukturierung und Beschreibung von Daten unabhangig¨ von konkre- ten Techniken immer wieder auftreten. Die Ergebnisse dieser Arbeit konnen¨ dazu beitragen, das Verstandnis¨ von Daten — das heisst digitalen Dokumente und ihre Metadaten in allen ihren Formen — zu verbessern. Spezielle Anwendungsgebiete liegen unter Anderem in den Bereichen Datenarchaologie¨ und Daten-Literacy. v Contents Abstract ....................................... iv Zusammenfassung .................................v List of Examples .................................. vii List of Figures ................................... vii List of Tables .................................... vii 1. Introduction 1 1.1. Motivation ..................................1 1.2. Background .................................4 1.3. Method and scope ..............................6 1.4. Related work .................................7 2. Foundations 11 2.1. Mathematics ................................. 12 2.1.1. Logic ................................. 12 2.1.2. Set theory .............................. 15 2.1.3. Tuples and relations ........................ 17 2.1.4. Graph theory ............................ 18 2.2. Computer science .............................. 23 2.2.1. Formal languages and computation ................ 24 2.2.2. Data types .............................. 29 2.2.3. Data modeling ............................ 32 2.3. Library and information science ...................... 36 2.3.1. Background of the discipline ................... 36 2.3.2. Documents .............................. 37 2.3.3. Metadata ............................... 39 2.4. Philosophy .................................. 41 2.4.1. Philosophy of Information ..................... 41 2.4.2. Philosophy of Data ......................... 42 2.4.3. Philosophy of Technology and Design .............. 44 2.5. Semiotics ................................... 45 2.6. Patterns and pattern languages ...................... 50 3. Methods of data structuring 53 3.1. Character and number encodings ..................... 54 3.1.1. Unicode ............................... 54 3.1.2. Number encodings ......................... 58 vii Contents 3.2. Identifiers .................................. 59 3.2.1. Basic principles ........................... 59 3.2.2. Namespaces and qualifiers ..................... 62 3.2.3. Identifier Systems .......................... 63 3.2.4. Descriptive identifiers ....................... 67 3.2.5. Ordered identifiers ......................... 68 3.2.6. Hash codes .............................. 69 3.3. File systems ................................. 72 3.3.1. Origins and evolution ........................ 72 3.3.2. Components and properties .................... 73 3.3.3. Wrapping file systems ....................... 77 3.4. Databases ................................... 79 3.4.1. Record databases .......................... 79 3.4.2. Hierarchical databases ....................... 83 3.4.3. Network databases ......................... 84 3.4.4. Relational databases ........................ 85 3.4.5. Object database ........................... 89 3.4.6. NoSQL databases .......................... 90 3.5. Data structuring languages ......................... 92 3.5.1. Data binding languages ...................... 92 3.5.2. INI, CSV, and S-Expressions .................... 93 3.5.3. JSON ................................. 95 3.5.4. YAML ................................. 97 3.5.5. XML ................................. 98 3.5.6. RDF .................................. 103 3.6. Markup languages .............................. 112 3.6.1. General markup types ....................... 112 3.6.2. Text markup languages ....................... 114 3.7. Schema languages .............................. 119 3.7.1. Regular Expressions and Backus-Naur Form ........... 120 3.7.2. XML schema languages ....................... 122 3.7.3. RDF schema languages ....................... 131 3.7.4. SQL schemas ............................. 134 3.7.5. Other schema languages ...................... 135 3.8. Conceptual modeling languages ...................... 138 3.8.1. Entity-relationship model ..................... 138 3.8.2. Object Role Modeling ........................ 142 3.8.3. Unified Modeling language .................... 145 3.8.4. Domain specific modeling and metamodeling .......... 147 3.9. Conceptual diagrams ............................ 149 3.9.1. Diagram types ............................ 150 3.9.2. Diagram properties ......................... 155 3.10. Query languages and APIs ......................... 156 viii Contents 4. Findings 159 4.1. Categorization of methods ......................... 159 4.2. Paradigms in data structuring ....................... 160 4.2.1. Documents and objects ....................... 161 4.2.2. Standards and rules ......................... 164 4.2.3. Collections, types, and sameness ................. 170 4.2.4. Entities and connections ...................... 173 4.2.5. Levels of abstraction ........................ 176 5. Patterns in data structuring 181 5.1. Organization ................................. 181 5.2. Basic patterns ................................ 182 5.2.1. Label pattern ............................ 183 5.2.2. Atomicity pattern .......................... 184 5.2.3. Size pattern ............................. 186 5.2.4. Optionality pattern ......................... 187 5.2.5.