COSC 6397 Big Data Analytics
Data Formats (I) – HDF5 and RDF
Edgar Gabriel Spring 2017
Scientific data libraries
• Handle data on a higher level
• Provide additional information typically not available in flat data files (metadata)
  – Size and type of data structure
  – Data format
  – Name
  – Units
• Two widely used libraries available
  – NetCDF
  – HDF-5
HDF-5
• Hierarchical Data Format (HDF), developed since 1988 at NCSA (University of Illinois)
  – http://hdf.ncsa.uiuc.edu/HDF5/
• Has gone through a long history of changes; the current version, HDF-5, has been available since 1999
• HDF-5 supports
  – Very large files
  – Parallel I/O interface
  – Fortran, C, Java bindings
HDF-5 dataset
• Multi-dimensional array of basic data elements
• A dataset consists of
  – Header + data
• Header consists of
  – Name
  – Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes
  – Dataspace: defines size and shape of a multidimensional array. Dimensions can be fixed or unlimited.
  – Storage layout: defines how multidimensional arrays are stored in the file. Can be contiguous or chunked.
Example of an HDF-5 file
HDF5 "tempseries.h5" {
  GROUP "/" {
    GROUP "tempseries" {
      DATASET "height" {
        DATATYPE { "H5T_STD_I32BE" }
        DATASPACE { ARRAY (4) (4) }
        DATA { 0, 50, 100, 150 }
        ATTRIBUTES "units" {
          DATATYPE { "undefined string" }
          DATASPACE { ARRAY (0) (0) }
          DATA { unable to print }
        }
      }
      DATASET "temperature" {
        DATATYPE { "H5T_IEEE_F32BE" }
        DATASPACE { ARRAY (3, 8, 4) (H5S_UNLIMITED, 8, 4) }
        DATA { … }
      }
    }
  }
}
Storage layout: contiguous vs. chunked

contiguous
   1  2  3  4  5  6  7  8
   9 10 11 12 13 14 15 16
  17 18 19 20 21 22 23 24
  25 26 27 28 29 30 31 32
  33 34 35 36 37 38 39 40
  41 42 43 44 45 46 47 48
  49 50 51 52 53 54 55 56
  57 58 59 60 61 62 63 64

chunked
   1  2  3  4 17 18 19 20
   5  6  7  8 21 22 23 24
   9 10 11 12 25 26 27 28
  13 14 15 16 29 30 31 32
  33 34 35 36 49 50 51 52
  37 38 39 40 53 54 55 56
  41 42 43 44 57 58 59 60
  45 46 47 48 61 62 63 64
Advantages and disadvantages of chunking
• Accessing rows and columns requires the same number of accesses
• Data can be extended into all dimensions
• Efficient storage of sparse arrays
• Can improve caching
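The numbering in the storage layout figure can be reproduced with a little index arithmetic. The following is a minimal sketch in plain C, not HDF-5 code: the function names are hypothetical, and it assumes that chunks are laid out one after another in row-major order, as the figure draws them. In an actual HDF-5 file, chunks are located through an index and need not be stored in this order.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based position of element (row, col) in a contiguous, row-major
   layout of an array with ncols columns. */
size_t contiguous_index(size_t row, size_t col, size_t ncols)
{
    assert(col < ncols);
    return row * ncols + col;
}

/* 0-based position of element (row, col) when the array is cut into
   chunk_rows x chunk_cols chunks. Assumptions of this sketch: chunks follow
   each other in row-major order, elements inside a chunk are row-major,
   and ncols is a multiple of chunk_cols. */
size_t chunked_index(size_t row, size_t col, size_t ncols,
                     size_t chunk_rows, size_t chunk_cols)
{
    assert(chunk_rows > 0 && chunk_cols > 0);
    assert(col < ncols && ncols % chunk_cols == 0);
    size_t chunks_per_row = ncols / chunk_cols;
    /* which chunk the element falls into */
    size_t chunk_id = (row / chunk_rows) * chunks_per_row + col / chunk_cols;
    /* position of the element inside that chunk */
    size_t in_chunk = (row % chunk_rows) * chunk_cols + col % chunk_cols;
    return chunk_id * chunk_rows * chunk_cols + in_chunk;
}
```

For the 8 x 8 array with 4 x 4 chunks, chunked_index(0, 4, 8, 4, 4) gives 16, which is the element labeled 17 in the figure (the labels are 1-based). Row 0 lies in chunks 0 and 1 and column 0 lies in chunks 0 and 2, so either access touches two chunks.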
HDF-5 API
• HDF-5 naming convention
  – All API functions start with H5
  – The next character identifies the category of functions
    • H5F: functions handling files
    • H5G: functions handling groups
    • H5D: functions handling datasets
    • H5S: functions handling dataspaces
    • H5A: functions handling attributes
• An HDF-5 group is a collection of datasets
  – Comparable to a directory in a UNIX-like file system
Writing a sequential HDF-5 file
1. Create the file
     h5file = H5Fcreate(…)
2. Create a group (opt.)
     group = H5Gcreate (h5file,…)
3. Define a dataspace
     tspace = H5Screate_simple(ndims, dims, maxdims );
4. Define datatype
     ttype = H5T_IEEE_F32BE;
5. Create dataset
     tset = H5Dcreate (group, "testset", ttype, tspace, …);
6. Add attributes
     tattr = H5Acreate (tset, "units", H5T_C_S1, …)
     H5Awrite (tattr, H5T_C_S1, "meter");
7. Write data
     H5Dwrite(tset,H5T_IEEE_F32BE,…,data);
8. Close all objects
Reading an HDF-5 file – structure of the file known
1. Open the file
     h5file = H5Fopen(…)
2. Open the group
     group = H5Gopen(h5file, "tempseries")
3. Open dataset in the group
     tset = H5Dopen(group, "temperature");
4. Look up dimensions
     tspace = H5Dget_space( tset );
     H5Sget_simple_extent_dims (tspace, dims, …);
5. Read data
     H5Dread(tset, H5T_IEEE_F32BE, H5S_ALL, tspace, …, buffer);
6. Read attributes
     tattr = H5Aopen_name(tset, "units");
     attrtype = H5Aget_type ( tattr );
     H5Aread(tattr, attrtype, attr);
7. Read comments
8. Close all objects
Compound Datatypes
• Abstraction for user structures
  – Has a fixed size
  – Each member has its own name, datatype, reference, and byte offset
h5type = H5Tcreate (H5T_class_t class, size_t size);
H5Tinsert (h5type, const char *name, size_t offset, hid_t field_id);
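The size and the byte offsets that a compound datatype needs come straight from the C structure it describes. The following is a minimal sketch in plain C without any HDF-5 calls; sample_t, member_desc, sample_members and members_fit are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user structure, used only for illustration. */
typedef struct {
    int    id;
    double temperature;
    float  height;
} sample_t;

/* Hypothetical descriptor for one member of a compound type. */
typedef struct {
    const char *name;
    size_t      offset;  /* byte offset of the member inside sample_t */
    size_t      size;    /* size of the member in bytes */
} member_desc;

/* offsetof accounts for any padding the compiler inserts between members. */
static const member_desc sample_members[] = {
    { "id",          offsetof(sample_t, id),          sizeof(int)    },
    { "temperature", offsetof(sample_t, temperature), sizeof(double) },
    { "height",      offsetof(sample_t, height),      sizeof(float)  },
};

/* Returns 1 if every member lies entirely inside a structure of
   total_size bytes, 0 otherwise. */
int members_fit(const member_desc *m, size_t n, size_t total_size)
{
    for (size_t i = 0; i < n; i++) {
        if (m[i].offset + m[i].size > total_size)
            return 0;
    }
    return 1;
}
```

With HDF-5, sizeof(sample_t) would be the size given to H5Tcreate, and each name and offset pair would go into one H5Tinsert call. The HDF-5 headers provide the HOFFSET macro as a wrapper around offsetof for this purpose.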
Hyperslab
• A hyperslab is a portion of a dataset
H5Sselect_hyperslab (hid_t space_id, H5S_seloper_t operator,
                     const hssize_t *start, const hsize_t *stride,
                     const hsize_t *count, const hsize_t *block);
  – Operator: H5S_SELECT_SET, H5S_SELECT_OR
  – Start: array determining the starting coordinates of the hyperslab
  – Stride: array giving the distance between the starts of consecutive blocks in each dimension (NULL means 1)
  – Count: array determining how many blocks to select in each dimension
  – Block: array determining the size of each block in each dimension (NULL means a single element)
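How start, stride, count and block combine is easiest to see along a single dimension. The following is a minimal sketch in plain C that lists the selected indices; hyperslab_indices_1d is a hypothetical helper, not part of the HDF-5 API, and it uses 0 in place of the NULL default for stride and block.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the indices selected along one dimension into out[] and returns
   how many were written (count * block). A stride or block of 0 stands in
   for the NULL default and is treated as 1. out must have room for
   count * block entries. */
size_t hyperslab_indices_1d(size_t start, size_t stride, size_t count,
                            size_t block, size_t *out)
{
    if (stride == 0) stride = 1;
    if (block == 0)  block = 1;
    /* this sketch rejects selections whose blocks would overlap */
    assert(count <= 1 || stride >= block);
    size_t n = 0;
    for (size_t i = 0; i < count; i++)       /* one iteration per block */
        for (size_t b = 0; b < block; b++)   /* elements inside the block */
            out[n++] = start + i * stride + b;
    return n;
}
```

With start = 1, stride = 3, count = 2 and block = 2, the selection is the indices 1, 2, 4, 5: two blocks of two elements whose starts are three apart. A multi-dimensional hyperslab applies the same rule independently in every dimension.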
Example using hyperslabs
/* Define hyperslab in the dataset. */
offset[0] = 1;
offset[1] = 2;
count[0] = NX_SUB;
count[1] = NY_SUB;
status = H5Sselect_hyperslab (dataspace, H5S_SELECT_SET, offset, NULL,
                              count, NULL);

/* Read data from hyperslab in file into hyperslab in memory */
status = H5Dread (dataset, H5T_NATIVE_INT, memspace, dataspace,
                  H5P_DEFAULT, data_out);
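[Figure: two-dimensional dataset with the selected hyperslab; offset[0] and count[0] mark its position and extent along the first dimension, offset[1] and count[1] along the second]
Examples taken from HDF-5 webpage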
RDF - Resource Description Framework
• A framework for describing resources – not a language
• Model for data
• Syntax to allow exchange and use of information stored in various locations
• Facilitates reading and correct use of information by computers, not necessarily by people
  – Not intended to be directly parsed by an application
• Recommended short read: http://www.andrew.cmu.edu/user/mm6/95-733/PowerPoint/rdftutorial.pdf
http://what.csc.villanova.edu/~cassel/9010SemanticWeb/RDF.ppt
Identification and description
• RDF identifies resources with URIs
  – Often, though not always, the same as a URL
  – Anything that can have a URI is a RESOURCE
• RDF describes resources with properties and property values
  – A property is a resource that has a name
    • Ex. Author, Book, Address, Client, Product
  – A property value is the value of the property
    • Ex. "Joanna Santillo", http://www.someplace.com/, etc.
    • A property value can be another resource, allowing nested descriptions.
Statements
• Resource, Property, Property Value
  – Also referred to as subject, predicate, object of a statement
• Predicates are not the same as English language verbs
  – Specify a relationship between the subject and the object
Examples
• Statement: "The author of http://www.w3schools.com/RDF is Jan Egil Refsnes".
  – Subject: http://www.w3schools.com/RDF
  – Predicate: author
  – Object: Jan Egil Refsnes
• Statement: "The homepage of http://www.w3schools.com/RDF is http://www.w3schools.com".
  – Subject: http://www.w3schools.com/RDF
  – Predicate: homepage
  – Object: http://www.w3schools.com
Binary predicates
• RDF offers only binary predicates.
• Result represents the truth or falsehood of some condition.
• Think of them as P(x,y) where P is the relationship between the objects x and y.
• From the example,
  • X = http://www.w3schools.com/RDF
  • Y = Jan Egil Refsnes
  • P = author

[Figure: graph with an edge labeled "author" from http://www.w3schools.com/RDF to Jan Egil Refsnes]
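The P(x,y) view maps directly onto a lookup over stored statements. The following is a minimal sketch in plain C, assuming a tiny in-memory array that holds the two statements from the examples; triple, store, store_len and holds are hypothetical names, and a real triplestore would index its data rather than scan it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One RDF statement: predicate(subject, object). */
typedef struct {
    const char *subject;
    const char *predicate;
    const char *object;
} triple;

/* The two statements from the examples. */
static const triple store[] = {
    { "http://www.w3schools.com/RDF", "author",   "Jan Egil Refsnes" },
    { "http://www.w3schools.com/RDF", "homepage", "http://www.w3schools.com" },
};
static const size_t store_len = sizeof store / sizeof store[0];

/* Binary predicate P(x, y): returns 1 if the statement is in the store,
   0 if it is not. */
int holds(const triple *t, size_t n,
          const char *p, const char *x, const char *y)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(t[i].predicate, p) == 0 &&
            strcmp(t[i].subject, x) == 0 &&
            strcmp(t[i].object, y) == 0)
            return 1;
    }
    return 0;
}
```

Here holds(store, store_len, "author", "http://www.w3schools.com/RDF", "Jan Egil Refsnes") returns 1, while the same call with "homepage" as the predicate returns 0 because that statement is not in the store.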
Storing RDF files
• Serialization: saving RDF as a string of bytes
• Multiple serializations defined
  – RDF/XML
  – Turtle
  – N3 (Notation 3)
• Can also be stored in databases
  – NoSQL, typically triplestores
Root element of RDF documents
• The element containing all triples must be an RDF element from the http://www.w3.org/1999/02/22-rdf-syntax-ns# namespace
• The subject of each triple is identified in the rdf:about attribute of an rdf:Description element
• The example could have used a separate rdf:Description element for each triple, but it expresses several triples about the same resource by putting several child elements inside one rdf:Description element: a cd:artist, a cd:country, etc.
• The objects are expressed as plain text between start and end tags
Turtle serialization
• Derived from N3
• Allows shortcuts for prefixes
• Multiple facts about the same subject can be written together, separated by semicolons
• The final fact about a subject is terminated by a period
Turtle serialization
@prefix xsd:
d:item342 dm:shipped "2011-02-14"^^xsd:date .
d:item342 dm:quantity "4"^^xsd:integer .
d:item342 dm:invoiced "false"^^xsd:boolean .
d:item342 dm:costPerItem "3.50"^^xsd:decimal .
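The four statements above repeat the subject on every line. The semicolon shortcut groups them under one subject, and producing that form is a small string-building exercise. The following is a minimal sketch in plain C; fact and turtle_group are hypothetical names, and the predicates and objects are assumed to be already written in Turtle syntax.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One predicate/object pair, already written in Turtle syntax. */
typedef struct {
    const char *predicate;
    const char *object;
} fact;

/* Writes all facts about one subject into buf, separated by semicolons and
   terminated by a period. Returns the number of characters written, or -1
   if buf is too small or there are no facts. */
int turtle_group(char *buf, size_t cap, const char *subject,
                 const fact *facts, size_t n)
{
    if (n == 0) return -1;
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        /* the subject appears once; later lines are indented instead */
        int w = snprintf(buf + used, cap - used, "%s %s %s %s\n",
                         i == 0 ? subject : "   ",
                         facts[i].predicate, facts[i].object,
                         i + 1 < n ? ";" : ".");
        if (w < 0 || (size_t)w >= cap - used) return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

Called with the subject d:item342 and the dm:shipped and dm:quantity facts, it writes the subject once, ends the first line with a semicolon, and ends the second with a period.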
RDF datatyping
RDF Schema
• Extension to RDF to allow definition of application-specific classes and properties
  – Does not define the classes and properties itself
  – Provides a framework to describe them
• Best practices:
  – A subject is meaningless in itself; its main purpose is to be a unique identifier
  – The RDF Schema (RDFS) namespace can be used to assign an rdfs:label value to a resource, which is supposed to be human readable
RDF Schema example
Dublin Core
• RDF is metadata: data about data
• Dublin Core is a set of properties for describing documents
• See www.dublincore.org for details
• 15 basic elements:
  • Contributor, coverage, creator, format, date, description, identifier, language, publisher, relation, rights, source, subject, title, type
RDF Summary
• RDF allows data to be described in a machine-readable form
• RDF Schema provides common definitions within a topic domain
• RDF data is typically queried with a query language such as SPARQL
• As of 2012, over 31 billion triples had been published on the internet using RDF