COSC 6397 Big Data Analytics

Data Formats (I) – HDF5 and RDF

Edgar Gabriel Spring 2017

Scientific data libraries

• Handle data on a higher level
• Provide additional information typically not available in flat data files (metadata)
  – Size and type of data structure
  – Data format
  – Name
  – Units
• Two widely used libraries available
  – NetCDF
  – HDF-5

HDF-5

• Hierarchical Data Format (HDF), developed since 1988 at NCSA (University of Illinois)
  – http://hdf.ncsa.uiuc.edu/HDF5/
• Has gone through a long history of changes; the most recent version, HDF-5, has been available since 1999
• HDF-5 supports
  – Very large files
  – Parallel I/O interface
  – Language bindings, including Java

HDF-5 dataset

• Multi-dimensional array of basic data elements
• A dataset consists of
  – Header + data
• Header consists of
  – Name
  – Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes
  – Dataspace: defines the size and shape of a multidimensional array. Dimensions can be fixed or unlimited.
  – Storage layout: defines how multidimensional arrays are stored in the file. Can be contiguous or chunked.

Example of an HDF-5 file

HDF5 "tempseries.h5" {
  GROUP "/" {
    GROUP "tempseries" {
      DATASET "height" {
        DATATYPE { "H5T_STD_I32BE" }
        DATASPACE { ARRAY (4) (4) }
        DATA { 0, 50, 100, 150 }
        ATTRIBUTES "units" {
          DATATYPE { "undefined string" }
          DATASPACE { ARRAY (0) (0) }
          DATA { unable to print }
        }
      }
      DATASET "temperature" {
        DATATYPE { "H5T_IEEE_F32BE" }
        DATASPACE { ARRAY (3, 8, 4) (H5S_UNLIMITED, 8, 4) }
        DATA {…}

Storage layout: contiguous vs. chunked

Contiguous (numbers give the storage order of the elements of an 8 x 8 array):

   1  2  3  4  5  6  7  8
   9 10 11 12 13 14 15 16
  17 18 19 20 21 22 23 24
  25 26 27 28 29 30 31 32
  33 34 35 36 37 38 39 40
  41 42 43 44 45 46 47 48
  49 50 51 52 53 54 55 56
  57 58 59 60 61 62 63 64

Chunked (same array, stored as four 4 x 4 chunks):

   1  2  3  4 | 17 18 19 20
   5  6  7  8 | 21 22 23 24
   9 10 11 12 | 25 26 27 28
  13 14 15 16 | 29 30 31 32
  ------------+------------
  33 34 35 36 | 49 50 51 52
  37 38 39 40 | 53 54 55 56
  41 42 43 44 | 57 58 59 60
  45 46 47 48 | 61 62 63 64
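The two storage orders can be written as index calculations. The following is a minimal sketch in plain C, not HDF-5 code: the function names are hypothetical, and it assumes that chunks are laid out one after another in row-major order, which is a simplification (an actual HDF-5 file locates its chunks through an index).

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based storage position of element (row, col) in a contiguous,
 * row-major layout with ncols columns. */
size_t contiguous_index(size_t row, size_t col, size_t ncols)
{
    return row * ncols + col;
}

/* Zero-based storage position of element (row, col) when the array is
 * split into chunk_rows x chunk_cols chunks. Assumption for this sketch:
 * chunks are stored back to back in row-major order, and elements are
 * row-major inside each chunk. ncols must be a multiple of chunk_cols. */
size_t chunked_index(size_t row, size_t col, size_t ncols,
                     size_t chunk_rows, size_t chunk_cols)
{
    assert(chunk_rows > 0 && chunk_cols > 0 && ncols % chunk_cols == 0);
    size_t chunks_per_row = ncols / chunk_cols;
    size_t chunk_id = (row / chunk_rows) * chunks_per_row + col / chunk_cols;
    size_t within   = (row % chunk_rows) * chunk_cols + col % chunk_cols;
    return chunk_id * (chunk_rows * chunk_cols) + within;
}
```

For the 8 x 8 array with 4 x 4 chunks, chunked_index(0, 4, 8, 4, 4) yields 16, i.e. the 17th stored element, while contiguous_index(0, 4, 8) yields 4. A full row and a full column each touch two chunks in the chunked layout, whereas in the contiguous layout a column is spread over eight separate rows.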

• Advantages and disadvantages of chunking
  – Accessing rows and columns requires the same number of accesses
  – Data can be extended in all dimensions
  – Efficient storage of sparse arrays
  – Can improve caching

HDF-5 API

• HDF-5 naming convention
  – All API functions start with H5
  – The next character identifies the category of functions
    • H5F: functions handling files
    • H5G: functions handling groups
    • H5D: functions handling datasets
    • H5S: functions handling dataspaces
    • H5A: functions handling attributes

• An HDF-5 group is a collection of datasets
  – Comparable to a directory in a UNIX-like file system

Writing a sequential HDF-5 file

1. Create the file
     h5file = H5Fcreate(…)
2. Create a group (opt.)
     group = H5Gcreate(h5file, …)
3. Define a dataspace
     tspace = H5Screate_simple(ndims, dims, maxdims);
4. Define datatype
     ttype = H5T_IEEE_F32BE;
5. Create dataset
     tset = H5Dcreate(group, "testset", ttype, tspace, …);
6. Add attributes
     tattr = H5Acreate(tset, "units", H5T_C_S1, …)
     H5Awrite(tattr, H5T_C_S1, "meter");
7. Write data
     H5Dwrite(tset, H5T_IEEE_F32BE, …, data);
8. Close all objects

Reading an HDF-5 file – structure of the file known

1. Open the file
     h5file = H5Fopen(…)
2. Open the group
     group = H5Gopen(h5file, "tempseries")
3. Open dataset in the group
     tset = H5Dopen(group, "temperature");
4. Look up dimensions
     tspace = H5Dget_space(tset);
     H5Sget_simple_extent_dims(tspace, dims, …);
5. Read data (arguments: dataset, memory datatype, memory dataspace, file dataspace, transfer properties, buffer)
     H5Dread(tset, H5T_IEEE_F32BE, H5S_ALL, tspace, …, buffer);
6. Read attributes
     tattr = H5Aopen_name(tset, "units");
     attrtype = H5Aget_type(tattr);
     H5Aread(tattr, attrtype, attr);
7. Read comments
8. Close all objects

Compound Datatypes

• Abstraction for user structures
  – Has a fixed size
  – Each member has its own name, datatype, reference, and byte offset

h5type = H5Tcreate(H5T_class_t class, size_t size);
H5Tinsert(h5type, const char *name, size_t offset, hid_t field_id);
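The information a compound datatype needs (total size, plus name and byte offset per member) can be derived from an ordinary C struct. The sketch below uses only the standard library; struct sample, struct member_info and sample_members are hypothetical names chosen for illustration. With HDF-5, sizeof(struct sample) would be the size passed to H5Tcreate, and each name and offset would go into one H5Tinsert call together with a matching member datatype.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user structure describing one measurement. */
struct sample {
    int    id;
    float  temperature;
    double pressure;
};

/* What has to be known about each member of the structure. */
struct member_info {
    const char *name;
    size_t      offset;  /* byte offset inside struct sample */
    size_t      size;    /* size of the member in bytes */
};

const struct member_info sample_members[] = {
    { "id",          offsetof(struct sample, id),          sizeof(int)    },
    { "temperature", offsetof(struct sample, temperature), sizeof(float)  },
    { "pressure",    offsetof(struct sample, pressure),    sizeof(double) },
};

/* Returns 1 if every member lies completely inside a structure of
 * total_size bytes, 0 otherwise. */
int members_fit(const struct member_info *m, size_t n, size_t total_size)
{
    for (size_t i = 0; i < n; i++)
        if (m[i].offset + m[i].size > total_size)
            return 0;
    return 1;
}
```

Using offsetof instead of hand-computed offsets keeps the description correct even when the compiler inserts padding between members.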

Hyperslab

• A hyperslab is a portion of a dataset

H5Sselect_hyperslab(hid_t space_id, H5S_seloper_t op,
                    const hsize_t *start, const hsize_t *stride,
                    const hsize_t *count, const hsize_t *block);

  – Operator (op): H5S_SELECT_SET, H5S_SELECT_OR
  – Start: array with the starting coordinates of the hyperslab
  – Stride: array with the distance between the starts of consecutive blocks in each dimension
  – Count: array with the number of blocks to select in each dimension
  – Block: array with the size of each block, in elements, in each dimension
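How start, stride, count and block interact is easiest to see in one dimension. The following sketch is plain C rather than a call into HDF-5; hyperslab_indices_1d is a hypothetical helper that lists the element indices such a selection covers along a single dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Lists the indices selected along ONE dimension by
 * (start, stride, count, block). Writes at most cap indices to out and
 * returns the total number of selected elements (count * block). */
size_t hyperslab_indices_1d(size_t start, size_t stride, size_t count,
                            size_t block, size_t *out, size_t cap)
{
    assert(block >= 1 && stride >= block);   /* blocks must not overlap */
    size_t n = 0;
    for (size_t b = 0; b < count; b++) {     /* b-th block */
        for (size_t e = 0; e < block; e++) { /* e-th element of the block */
            if (n < cap)
                out[n] = start + b * stride + e;
            n++;
        }
    }
    return n;
}
```

With start = 1, stride = 3, count = 2 and block = 2, the selection consists of the indices 1, 2, 4 and 5: two blocks of two elements whose starting points are three elements apart.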

Example using hyperslabs

/* Define hyperslab in the dataset. */
offset[0] = 1;
offset[1] = 2;
count[0] = NX_SUB;
count[1] = NY_SUB;
status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset, NULL, count, NULL);

/* Read data from hyperslab in file into hyperslab in memory. */
status = H5Dread(dataset, H5T_NATIVE_INT, memspace, dataspace, H5P_DEFAULT, data_out);

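[Figure: the selected hyperslab inside the dataset; offset[0] and count[0] give its position and extent along the first dimension, offset[1] and count[1] along the second.]

Examples taken from HDF-5 webpage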
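The effect of an offset/count selection with stride and block left as NULL (both default to 1) can be mimicked on an in-memory array. This is a sketch in plain C under that assumption; copy_hyperslab_2d is a hypothetical helper, not an HDF-5 function.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the count[0] x count[1] region that starts at
 * (offset[0], offset[1]) out of a row-major array with ncols columns.
 * The result in out is row-major with count[1] columns. */
void copy_hyperslab_2d(const int *data, size_t ncols,
                       const size_t offset[2], const size_t count[2],
                       int *out)
{
    assert(data != NULL && out != NULL);
    for (size_t i = 0; i < count[0]; i++)
        for (size_t j = 0; j < count[1]; j++)
            out[i * count[1] + j] =
                data[(offset[0] + i) * ncols + (offset[1] + j)];
}
```

For a 4 x 5 array filled with 0, 1, 2, …, 19, offset = {1, 2} and count = {2, 2} select the values 7, 8, 12 and 13.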

RDF - Resource Description Framework

• A framework for describing resources – not a language
• Model for data
• Syntax to allow exchange and use of information stored in various locations
• Facilitates reading and correct use of information by computers, not necessarily by people
  – Not intended to be directly parsed by an application

• Recommended short read: http://www.andrew.cmu.edu/user/mm6/95-733/PowerPoint/rdftutorial.pdf

http://what.csc.villanova.edu/~cassel/9010SemanticWeb/RDF.ppt

Identification and description

• RDF identifies resources with URIs
  – Often, though not always, the same as a URL
  – Anything that can have a URI is a RESOURCE
• RDF describes resources with properties and property values
  – A property is a resource that has a name
    • Ex. Author, Book, Address, Client, Product
  – A property value is the value of the property
    • Ex. "Joanna Santillo", http://www.someplace.com/, etc.
    • A property value can be another resource, allowing nested descriptions.

Statements

• Resource, Property, Property Value
  – Also referred to as subject, predicate, object of a statement
• Predicates are not the same as English language verbs
  – Specify a relationship between the subject and the object

Examples

• Statement: "The author of http://www.w3schools.com/RDF is Jan Egil Refsnes".
  – Subject: http://www.w3schools.com/RDF
  – Predicate: author
  – Object: Jan Egil Refsnes

• Statement: "The homepage of http://www.w3schools.com/RDF is http://www.w3schools.com".
  – Subject: http://www.w3schools.com/RDF
  – Predicate: homepage
  – Object: http://www.w3schools.com

Binary predicates

• RDF offers only binary predicates.
• Result represents the truth or falsehood of some condition.
• Think of them as P(x,y) where P is the relationship between the objects x and y.
• From the example,
  – X = http://www.w3schools.com/RDF
  – Y = Jan Egil Refsnes
  – P = author

[Graph: http://www.w3schools.com/RDF --author--> Jan Egil Refsnes]
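The view of a statement as a binary predicate P(x, y) can be sketched as a lookup in a list of triples. The code below is plain C with hypothetical names (struct triple, statements, holds); it stores the two example statements and answers whether P(x, y) holds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One RDF statement: subject, predicate, object. */
struct triple {
    const char *subject;
    const char *predicate;
    const char *object;
};

/* The two example statements. */
const struct triple statements[] = {
    { "http://www.w3schools.com/RDF", "author",   "Jan Egil Refsnes" },
    { "http://www.w3schools.com/RDF", "homepage", "http://www.w3schools.com" },
};
const size_t n_statements = sizeof statements / sizeof statements[0];

/* P(x, y): returns 1 if the triple (x, p, y) is in the store, else 0. */
int holds(const struct triple *store, size_t n,
          const char *p, const char *x, const char *y)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(store[i].predicate, p) == 0 &&
            strcmp(store[i].subject, x) == 0 &&
            strcmp(store[i].object, y) == 0)
            return 1;
    return 0;
}
```

Here holds(statements, n_statements, "author", "http://www.w3schools.com/RDF", "Jan Egil Refsnes") returns 1, while the same call with any other object returns 0. A real triplestore would index the triples instead of scanning them linearly.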

Storing RDF files

• Serialization: saving RDF as a string of bytes
• Multiple serializations defined
  – RDF/XML
  – Turtle
  – N3 (Notation 3)
• Can also be stored in databases
  – Non-SQL, typically triplestores

Root element of RDF documents

• Example data: two rdf:Description elements
  – Bob Dylan, USA, Columbia, 10.90, 1985
  – Bonnie Tyler, UK, CBS Records, 9.90, 1988
• Namespace declarations: source of the namespaces for elements with the rdf prefix and for elements with the cd prefix
• The Description element describes the resource identified by the rdf:about attribute.
• cd:country etc. are properties of the resource.

• The element containing all triples must be an RDF element from the http://www.w3.org/1999/02/22-rdf-syntax-ns# namespace
• The subject of each triple is identified in the rdf:about attribute of an rdf:Description element
• The example could have used a separate rdf:Description element for each triple, but it expresses several triples about the same resource by putting several child elements inside one rdf:Description element: a cd:artist, a cd:country, etc.
• The objects are expressed as plain text between start and end tags

Turtle serialization

• Derived from N3
• Allows for shortcuts for prefixes
• Shortcut for describing multiple facts about the same subject: facts are separated by a semicolon
• Final fact about a subject is terminated by a period

Turtle serialization

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix d: <…> .
@prefix dm: <…> .

d:item342 dm:shipped "2011-02-14"^^xsd:date .
d:item342 dm:quantity "4"^^xsd:integer .
d:item342 dm:invoiced "false"^^xsd:boolean .
d:item342 dm:costPerItem "3.50"^^xsd:decimal .
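The four statements share one subject, so they could also be written with the semicolon shortcut. The sketch below generates that grouped form in plain C; struct fact and turtle_group are hypothetical names, and the objects are assumed to be passed in already in Turtle syntax.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One predicate/object pair about a subject. */
struct fact {
    const char *predicate;
    const char *object;   /* already in Turtle syntax, e.g. "\"4\"^^xsd:integer" */
};

/* Writes all facts about one subject as a single Turtle group: facts are
 * separated by ';' and the last one is terminated by '.'.
 * Returns 0 on success, -1 if n is 0 or the buffer is too small. */
int turtle_group(char *out, size_t cap, const char *subject,
                 const struct fact *facts, size_t n)
{
    if (n == 0 || cap == 0)
        return -1;
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        /* The subject is written once; later lines are indented instead. */
        int w = snprintf(out + used, cap - used, "%s %s %s %s\n",
                         i == 0 ? subject : "   ",
                         facts[i].predicate, facts[i].object,
                         i + 1 == n ? "." : ";");
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

Called with the subject d:item342 and the dm:shipped and dm:quantity facts, the function writes the subject once, ends the first line with a semicolon, and ends the second, indented line with a period.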

RDF datatyping

• 2011-02-14 (xsd:date)
• 4 (xsd:integer)
• false (xsd:boolean)
• 3.50 (xsd:decimal)

RDF Schema

• Extension to RDF to allow definition of application-specific classes and properties
  – Does not define the classes and properties itself
  – Provides a framework to describe them
• Best practices:
  – The subject is meaningless in itself; its main purpose is to be a unique identifier
  – The RDF Schema (RDFS) namespace can be used to assign an rdfs:label value to a resource, which is meant to be human-readable

RDF Schema example

Dublin Core

• RDF is metadata -- data about data
• Dublin Core is a set of properties for describing documents
• See www.dublincore.org for details
• 15 basic elements:
  – Contributor, coverage, creator, format, date, description, identifier, language, publisher, relation, rights, source, subject, title, type

RDF Summary

• RDF allows data to be described in a machine-readable form
• RDF Schema provides common definitions within a topic domain
• RDF data is typically accessed through a query language such as SPARQL
• As of 2012, over 31 billion triples were described over the internet using RDF
