Extensions to Metadata Languages for Environmental Primary Data: the Forest Cloud
By Luis Maria Ibanez de Garayo

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Environmental Science, Policy, and Management in the Graduate Division of the University of California, Berkeley

Committee in Charge:
Professor Greg Biging, Chair
Professor Ray Larson
Professor Dennis Baldocchi

Fall 2011

To my great loves, for their support and understanding. For Barbara, Maite, and Roberto, for being by my side and for all the strength and love you have given me. I love you.

Abstract

During the last decade, environmental scientists and managers have found in the Internet a new communication venue that has improved their productivity by allowing them to share data and knowledge more fluently. Since its invention, the eXtensible Markup Language (XML) has drawn the attention of many researchers seeking to improve communication among people working with data-centric documents, because XML was expected to be the right approach to standardizing data. During this decade, many papers have been published on the premise that XML is the new paradigmatic standard for sharing data, even though no study has shown that XML languages have actually been used by researchers or managers working with data-centric documents during this period. By examining the spaces where XML documents could be found, this thesis shows that, on the contrary, more than a decade after its invention XML is still not used by the large scientific community that works with data-centric documents, which continues to use data archives in legacy formats. Therefore, if data standardization is difficult to attain and facilitating data sharing remains the goal, the other clear avenue toward that goal is metadata.
However, metadata languages such as the Ecological Metadata Language (EML) lack intrinsic features to describe completely and directly the information conveyed in many important types of data-centric documents used by environmental scientists. By carefully studying the nature of data-centric archives and the process of metadata creation, this thesis shows that any data archive can be easily described using an "a posteriori" approach, in which the lexical descriptors of the physical data in a data-centric file are developed by inspecting the file rather than by following the specification of the file format. In addition, following the principles of the Linked Open Data project, the lexical tree is mapped into a simple logical model with semantic annotations from controlled vocabularies, which can be easily serialized for data exchange or data syndication. With the metadata extensions researched in this thesis, metadata languages such as EML can be improved by increasing their expressive power. Environmental scientists and researchers can use these extensions to exchange data-centric documents, and multidisciplinary projects can easily syndicate data from different authors in different formats in a data-centric cloud.

Preface

Ten years ago, while learning about renewable forest resources and having both witnessed first hand and read about unscrupulous forest operations, I had a dream of a new forest regime in which information about forest operations could travel through the Internet free of any impedance, as verified and truthful information. With this thesis, I have tried to pave the road for a better Internet. However, this thesis could not have materialized without the support of my wife, who for eighteen long years has supported me and our family every single day of the trip. My thanks also go to all the people whose writings and tools have enlightened my ideas and work.
Special thanks to my professors for their infinite patience, and my apologies to them for my lack of communication skills. I thank Prof. Larson for teaching me the power of databases and for his willingness to listen to my ideas every time I stepped into his office; Prof. Baldocchi for showing me his enthusiasm, the multidisciplinary nature of environmental sciences, and the ubiquitous need to share primary datasets for the good of people and their environment; Prof. Tim Berners-Lee and Prof. Goodchild for taking the time to read my spam-like email without knowing me, supporting my ideas, and giving me interesting feedback; and, last but not least, Prof. Biging for believing in me and taking me under his wing to the very, very, very last day. To all of you, I thank you again from the bottom of my heart. Gracias!

Contents

1 The XML is Dead
  1.1 Introduction
  1.2 Networking
    1.2.1 The Paleozoic-network Era: 1880's to 1900's
    1.2.2 The Mesozoic-network Era: 1900's to 1950's
    1.2.3 The Tertiary-Network Period: 1950's to 1980's
    1.2.4 The Quaternary-Network Period: 1980's to 1990's
    1.2.5 The Contemporary Period: 1990's to 2000s
    1.2.6 The Semantic Web Period: since 2000
      1.2.6.1 The Web Portal
      1.2.6.2 The GRID
      1.2.6.3 The Cloud
  1.3 Environmental Collaboration
  1.4 Environmental Data Exchange
  1.5 XML Spaces
    1.5.1 Bibliography
    1.5.2 Public Web Space
    1.5.3 Private Spaces
  1.6 Conclusions
2 Long Live Metadata
  2.1 Introduction
  2.2 Environmental Archives
  2.3 Environmental Primary Data
  2.4 Metadata about Data-centric Documents
    2.4.1 Archival Metadata
    2.4.2 Ecological Metadata Language (EML)
    2.4.3 Earth Science Markup Language (ESML)
    2.4.4 Data Format Description Language (DFDL)
    2.4.5 PADS
    2.4.6 Extensible Scientific Interchange Language (XSIL)
    2.4.7 Binary Format Description (BFD)
  2.5 Metadata Comes after Data
  2.6 Proof of Concept for 'A Posteriori' Metadata to Describe Primary Data
  2.7 Hierarchical Engine for Metadata Processing: HEMP
    2.7.1 Hierarchical Scanner: Lexical Description
      2.7.1.1 Pointers
      2.7.1.2 Parallel Vs Sequential Scanning
      2.7.1.3 Stream Concatenation
      2.7.1.4 Lexical Description Examples of Common Data Files
    2.7.2 My Scientific Object Notation (MySON): Logical Structure
      2.7.2.1 My Scientific Object Notation: MySON
      2.7.2.2 Serialization of MySON Objects
    2.7.3 Controlled Vocabularies: Semantic Mapping
    2.7.4 HEMP: Processing description
      2.7.4.1 Containers
      2.7.4.2 Lexical Functions
      2.7.4.3 Arithmetic Functions
      2.7.4.4 Logical Functions
      2.7.4.5 Engine Description
      2.7.4.6 Example: Metadata File for a Compressed Shapefile with Tree Data
  2.8 Conclusions
3 The Forest Cloud: Link between Local and Global Regimes
  3.1 Introduction
  3.2 Local forest governance
  3.3 Global Forest Governance
  3.4 International Governance
  3.5 The Tragedy of the Forest
  3.6 Tree Identification
    3.6.1 Where Is It From?
    3.6.2 Which Species Is It?
    3.6.3 Which One Is It?
  3.7 The Forest Cloud
    3.7.1 Cloud Concept
      3.7.1.1 Software on Demand
      3.7.1.2 Computational Power
      3.7.1.3 Illimitable Data Store
      3.7.1.4 Layer of Common Communications
    3.7.2 Forest Cloud Philosophy
    3.7.3 Forest Cloud Infrastructure
      3.7.3.1 Metadata Templates
      3.7.3.2 Data Store
      3.7.3.3 The Forest Cloud Services
  3.8 The Future of The Forest Cloud

List of Figures

1 Authors per Article From Different Disciplines during Last Decade
2 Authors per Article across Decades in Ecology
3 Normalized View of the Number of Scientific Publications sorted by year and Keywords
4 Number of XML Documents Retrieved by Google Search Engine by Language
5 Estimated Lifespan of XML Languages
6 XML vs Legacy Formats
7 Logic Equivalence between a Table and a List of Triplets
8 Physical, Logical, and Semantical View of a Data Stream
9 Injective Mapping between the Physical Stream and Logical Object
10 Synoptic Tiff file Representation with non-Contiguous Data
11 Synoptic Tiff file Representation with Contiguous Data
12 Railroad Diagram of MySON Object
13 Railroad Diagram for the Self-BAr Value object of MySON
14 Semantic Layering
15 Tree Location Map in Cornell Heights at Ithaca, NY
16 Street Map of Cornell Heights at Ithaca, NY
17 Logical Description of a Compressed Shapefile with Tree Data
18 Tim Berners-Lee's Layered Graph of the Semantic Web
19 Modified Layered Graph of the Semantic Web for Data-centric Documents
20 Walls in the Cloud created by private competitive communication protocols
21 General Physical Description for Containers of an Excel Dataset
22 Physical Description for Containers of an Excel Dataset with regular structure
23 Physical Description for Containers of an Excel Dataset with mixed regular structures
24 The Forest Cloud partial view of trees from research project in central California
25 General Physical Description for Containers of a Shapefile with point data
26 General Physical Description for Containers of a DBase file
27 The Forest Cloud partial view of trees from Cornell University
28 The Forest Cloud partial view of singular trees from Cadiz province, Spain
29 The Forest Cloud partial view of urban trees from Philadelphia
30 Data Store Entity-Relationship diagram
1 The XML is Dead

1.1 Introduction

Since almost the beginning of telecommunications, environmental data such as weather observations have been encoded and transported over wired and wireless media. During the last decade, access to the Internet has modified research patterns among scientists and managers working with natural resources. Along with rapid improvements in data gathering and analysis, the epistemological grounds of the empirical sciences have moved in new directions.