Data Formats (I) – HDF5 and RDF

COSC 6397 Big Data Analytics
Edgar Gabriel
Spring 2017

Scientific data libraries
• Handle data on a higher level
• Provide additional information typically not available in flat data files (metadata)
  – Size and type of data structure
  – Data format
  – Name
  – Units
• Two widely used libraries available
  – NetCDF
  – HDF-5

HDF-5
• Hierarchical Data Format (HDF) developed since 1988 at NCSA (University of Illinois)
  – http://hdf.ncsa.uiuc.edu/HDF5/
• Has gone through a long history of changes; the current major version, HDF-5, has been available since 1999
• HDF-5 supports
  – Very large files
  – Parallel I/O interface
  – Fortran, C, Java bindings

HDF-5 dataset
• Multi-dimensional array of basic data elements
• A dataset consists of
  – Header + data
• Header consists of
  – Name
  – Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes
  – Dataspace: defines size and shape of a multidimensional array. Dimensions can be fixed or unlimited.
  – Storage layout: defines how multidimensional arrays are stored in the file. Can be contiguous or chunked.

Example of an HDF-5 file
    HDF5 "tempseries.h5" {
      GROUP "/" {
        GROUP "tempseries" {
          DATASET "height" {
            DATATYPE { "H5T_STD_I32BE" }
            DATASPACE { ARRAY (4) (4) }
            DATA { 0, 50, 100, 150 }
            ATTRIBUTES "units" {
              DATATYPE { "undefined string" }
              DATASPACE { ARRAY (0) (0) }
              DATA { unable to print }
            }
          }
          DATASET "temperature" {
            DATATYPE { "H5T_IEEE_F32BE" }
            DATASPACE { ARRAY (3, 8, 4) (H5S_UNLIMITED, 8, 4) }
            DATA {…}
          }
        }
      }
    }

Storage layout: contiguous vs. chunked
Storage order of an 8 x 8 array (each number gives the position of the element in the file):

    contiguous                     chunked (4 x 4 chunks)
     1  2  3  4  5  6  7  8         1  2  3  4 17 18 19 20
     9 10 11 12 13 14 15 16         5  6  7  8 21 22 23 24
    17 18 19 20 21 22 23 24         9 10 11 12 25 26 27 28
    25 26 27 28 29 30 31 32        13 14 15 16 29 30 31 32
    33 34 35 36 37 38 39 40        33 34 35 36 49 50 51 52
    41 42 43 44 45 46 47 48        37 38 39 40 53 54 55 56
    49 50 51 52 53 54 55 56        41 42 43 44 57 58 59 60
    57 58 59 60 61 62 63 64        45 46 47 48 61 62 63 64

Advantages and disadvantages of chunking
• Accessing rows and columns requires the same number of accesses
• Data can be extended into all dimensions
• Efficient storage of sparse arrays
• Can improve caching

HDF-5 API
• HDF-5 naming convention
  – All API functions start with H5
  – The next character identifies the category of functions
    • H5F: functions handling files
    • H5G: functions handling groups
    • H5D: functions handling datasets
    • H5S: functions handling dataspaces
    • H5A: functions handling attributes
• An HDF-5 group is a collection of datasets
  – Comparable to a directory in a UNIX-like file system

Writing a sequential HDF-5 file
1. Create the file
    h5file = H5Fcreate(…)
2. Create a group (opt.)
    group = H5Gcreate(h5file, …)
3. Define a dataspace
    tspace = H5Screate_simple(ndims, dims, maxdims);
4. Define datatype
    ttype = H5T_IEEE_F32BE;
5. Create dataset
    tset = H5Dcreate(group, "testset", ttype, tspace, …);
6. Add attributes
    tattr = H5Acreate(tset, "units", H5T_C_S1, …)
    H5Awrite(tattr, H5T_C_S1, "meter");
7. Write data
    H5Dwrite(tset, H5T_IEEE_F32BE, …, data);
8. Close all objects

Reading an HDF-5 file – structure of the file known
1. Open the file
    h5file = H5Fopen(…)
2. Open the group
    group = H5Gopen(h5file, "tempseries")
3. Open dataset in the group
    tset = H5Dopen(group, "temperature");
4. Look up dimensions
    tspace = H5Dget_space(tset);
    H5Sget_simple_extent_dims(tspace, dims, …);
5. Read data
    H5Dread(tset, H5T_IEEE_F32BE, ttype, tspace, …, buffer);
6. Read attributes
    tattr = H5Aopen_name(tset, "units");
    attrtype = H5Aget_type(tattr);
    H5Aread(tattr, attrtype, attr);
7. Read comments
8. Close all objects

Compound Datatypes
• Abstraction for user structures
  – Has a fixed size
  – Each member has its own name, datatype, reference, and byte offset

    h5type = H5Tcreate(H5T_class_t class, size_t size);
    H5Tinsert(h5type, const char *name, size_t offset, hid_t field_id);

Hyperslab
• A hyperslab is a portion of a dataset

    H5Sselect_hyperslab(hid_t space_id, H5S_seloper_t operator,
                        const hsize_t *start, const hsize_t *stride,
                        const hsize_t *count, const hsize_t *block);

  – Operator: H5S_SELECT_SET, H5S_SELECT_OR
  – Start: array determining the starting coordinates of the hyperslab
  – Stride: array indicating which elements along a dimension are to be selected
  – Count: array determining how many blocks to select in each dimension
  – Block: array determining the size of each selected block of elements

Example using hyperslabs
    /* Define hyperslab in the dataset. */
    offset[0] = 1;  count[0] = NX_SUB;
    offset[1] = 2;  count[1] = NY_SUB;
    status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset, NULL, count, NULL);

    /* Read data from hyperslab in file into hyperslab in memory */
    status = H5Dread(dataset, H5T_NATIVE_INT, memspace, dataspace, H5P_DEFAULT, data_out);

[Figure: hyperslab inside the dataset, annotated with offset[0], count[0], offset[1], count[1]]
Examples taken from the HDF-5 webpage

RDF - Resource Description Framework
• A framework for describing resources – not a language
• Model for data
• Syntax to allow exchange and use of information stored in various locations
• Facilitates reading and correct use of information by computers, not necessarily by people
  – Not intended to be directly parsed by an application
• Recommended short read:
  http://www.andrew.cmu.edu/user/mm6/95-733/PowerPoint/rdftutorial.pdf
  http://what.csc.villanova.edu/~cassel/9010SemanticWeb/RDF.ppt

Identification and description
• RDF identifies resources with URIs
  – Often, though not always, the same as a URL
  – Anything that can have a URI is a RESOURCE
• RDF describes resources with properties and property values
  – A property is a resource that has a name
    • Ex. Author, Book, Address, Client, Product
  – A property value is the value of the property
    • Ex. "Joanna Santillo", http://www.someplace.com/, etc.
    • A property value can be another resource, allowing nested descriptions.

Statements
• Resource, Property, Property Value
  – Also referred to as subject, predicate, object of a statement
• Predicates are not the same as English language verbs
  – Specify a relationship between the subject and the object

Examples
• Statement: "The author of http://www.w3schools.com/RDF is Jan Egil Refsnes".
  – Subject: http://www.w3schools.com/RDF
  – Predicate: author
  – Object: Jan Egil Refsnes
• Statement: "The homepage of http://www.w3schools.com/RDF is http://www.w3schools.com"
  – Subject: http://www.w3schools.com/RDF
  – Predicate: homepage
  – Object: http://www.w3schools.com

Binary predicates
• RDF offers only binary predicates.
• Result represents the truth or falsehood of some condition.
• Think of them as P(x,y) where P is the relationship between the objects x and y.
• From the example,
  • X = http://www.w3schools.com/RDF
  • Y = Jan Egil Refsnes
  • P = author
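The P(x,y) view of a statement can be sketched in C as a lookup over stored triples. This is only an illustration, not part of any RDF library: the `Triple` struct, the `EXAMPLE_STORE` array and the `triple_holds` function are hypothetical names introduced here, and the store simply holds the two example statements from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One RDF statement: subject (x), predicate (P), object (y). */
typedef struct {
    const char *subject;
    const char *predicate;
    const char *object;
} Triple;

/* Hypothetical in-memory store with the two example statements. */
static const Triple EXAMPLE_STORE[] = {
    { "http://www.w3schools.com/RDF", "author",   "Jan Egil Refsnes" },
    { "http://www.w3schools.com/RDF", "homepage", "http://www.w3schools.com" },
};
enum { EXAMPLE_STORE_LEN = sizeof EXAMPLE_STORE / sizeof EXAMPLE_STORE[0] };

/* Evaluate the binary predicate P(x, y) against the store:
 * returns 1 if the statement is present, 0 otherwise. */
int triple_holds(const Triple *store, size_t n,
                 const char *p, const char *x, const char *y)
{
    assert(store != NULL && p != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(store[i].predicate, p) == 0 &&
            strcmp(store[i].subject, x) == 0 &&
            strcmp(store[i].object, y) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Under these assumptions, `triple_holds(EXAMPLE_STORE, EXAMPLE_STORE_LEN, "author", "http://www.w3schools.com/RDF", "Jan Egil Refsnes")` evaluates to 1, while the same call with any other object string evaluates to 0. A real triplestore would index the triples rather than scan them linearly.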
[Figure: graph with an arc labeled "author" from http://www.w3schools.com/RDF to Jan Egil Refsnes]

Storing RDF files
• Serialization: saving RDF as a string of bytes
• Multiple serializations defined
  – RDF/XML
  – Turtle
  – N3 (Notation 3)
• Can also be stored in databases
  – Non-SQL, typically triplestores

Root element of RDF documents
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:cd="http://www.recshop.fake/cd#">
      <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
        <cd:artist>Bob Dylan</cd:artist>
        <cd:country>USA</cd:country>
        <cd:company>Columbia</cd:company>
        <cd:price>10.90</cd:price>
        <cd:year>1985</cd:year>
      </rdf:Description>
      <rdf:Description rdf:about="http://www.recshop.fake/cd/Hide your heart">
        <cd:artist>Bonnie Tyler</cd:artist>
        <cd:country>UK</cd:country>
        <cd:company>CBS Records</cd:company>
        <cd:price>9.90</cd:price>
        <cd:year>1988</cd:year>
      </rdf:Description>
    </rdf:RDF>

Annotations on the listing:
  – xmlns:rdf: source of namespace for elements with rdf prefix
  – xmlns:cd: source of namespace for elements with cd prefix
  – The rdf:Description element describes the resource identified by the rdf:about attribute.
  – cd:country etc. are properties of the resource.

• The element containing all triples must be an RDF element from the http://www.w3.org/1999/02/22-rdf-syntax-ns# namespace
• The subject of each triple is identified in the rdf:about attribute of an rdf:Description element
• The example could have used a separate rdf:Description element for each triple, but it expresses several triples about the same resource by putting several child elements inside one rdf:Description element: cd:artist, cd:country, etc.
• The objects are expressed as plain text between start and end tags

Turtle serialization
• Derived from N3
• Allows for shortcuts for prefixes
• Shortcuts for describing multiple facts about the same subject, separated by a semicolon
• Final fact about a subject terminated by a period

Turtle serialization
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix d: <http://learningsparql.com/ns/data#> .
    @prefix dm: <http://learningsparql.com/ns/demo#> .

    d:item342 dm:shipped "2011-02-14"^^xsd:date .
    d:item342 dm:quantity "4"^^xsd:integer .
    d:item342 dm:invoiced "false"^^xsd:boolean .
    d:item342 dm:costPerItem "3.50"^^xsd:decimal .

RDF datatyping
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dm="http://learningsparql.com/ns/demo#"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
      <rdf:Description rdf:about="http://learningsparql.com/ns/demo#item342">
        <dm:shipped rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2011-02-14</dm:shipped>
        <dm:quantity rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">4</dm:quantity>
        <dm:invoiced rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</dm:invoiced>
        <dm:costPerItem rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">3.50</dm:costPerItem>
      </rdf:Description>
    </rdf:RDF>

RDF Schema
• Extension to RDF to allow definition of application-specific classes and properties
  – Does not define the classes,
