BUILDING THE ONLINE DATA OBSERVATORY

ENABLING DYNAMIC INTEROPERABLE SCIENTIFIC BIG DATA ANALYTICS AT NKN

Luke Sheneman, Ph.D.
Technology and Data Services Manager
Northwest Knowledge Network (NKN)

Online Data Observatory

• Data: The whole of your accessible data is greater than the sum of its parts.

• Data Interoperability: Enable investigators to easily analyze large, heterogeneous datasets without struggling with file formats, unit conversion, manual subsetting, spatio-temporal scale harmonization, or variable mapping.

• Leverage: New science with old data

• Tools: Desktop, mobile, HPC, and web-enabled analytical and visualization tools dynamically connected to data
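
The unit conversion and variable mapping chores named under Data Interoperability can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the source variable names, the shared names, the canonical units, and the `harmonize` helper are hypothetical and are not taken from NKN's services.

```python
from dataclasses import dataclass

# Hypothetical mapping from source-specific variable names to one shared name.
VARIABLE_MAP = {
    "tmax": "air_temperature_max",
    "TMAX_F": "air_temperature_max",
    "pr": "precipitation",
}

# Hypothetical canonical units for each shared variable name.
CANONICAL_UNITS = {"air_temperature_max": "K", "precipitation": "mm"}

# Converters keyed by (source units, target units).
UNIT_CONVERTERS = {
    ("degC", "K"): lambda v: v + 273.15,
    ("degF", "K"): lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15,
    ("K", "K"): lambda v: v,
    ("in", "mm"): lambda v: v * 25.4,
    ("mm", "mm"): lambda v: v,
}


@dataclass(frozen=True)
class Observation:
    variable: str
    value: float
    units: str


def harmonize(obs: Observation) -> Observation:
    """Map the variable to its shared name and convert the value to canonical units."""
    name = VARIABLE_MAP[obs.variable]
    target = CANONICAL_UNITS[name]
    convert = UNIT_CONVERTERS[(obs.units, target)]
    return Observation(name, convert(obs.value), target)


# Two sources report the same quantity under different names and units.
records = [Observation("tmax", 0.0, "degC"), Observation("TMAX_F", 32.0, "degF")]
harmonized = [harmonize(r) for r in records]
```

After harmonization both records carry the same variable name and units, so they can be compared or aggregated directly.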

Data Observatory Components

Rich Metadata

  Content
    • Discovery
    • Description
    • Variables
    • Space-time coverage
  Syntax
    • Data structures
    • Data types
    • Data Format
  Semantics
    • Ontologies
    • Linkages
    • Context

Data Resources

  • Internal Data Catalog
  • Remote Data Metadata Harvesting
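
One way to picture the Content, Syntax, and Semantics split is as a small record type that a catalog could search. The sketch below is only an assumption about how such a record might look: the class names, field names, bounding box, and ontology URI are hypothetical, and only the 1950-2099 time span comes from the climate example in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    description: str
    variables: list
    bbox: tuple   # (west, south, east, north) in degrees; hypothetical values below
    years: tuple  # (first_year, last_year)


@dataclass
class Syntax:
    data_format: str
    data_types: dict  # variable name -> type name


@dataclass
class Semantics:
    ontology_terms: dict = field(default_factory=dict)  # variable name -> term URI
    linkages: list = field(default_factory=list)        # related dataset identifiers


@dataclass
class MetadataRecord:
    content: Content
    syntax: Syntax
    semantics: Semantics


def covers(record: MetadataRecord, lon: float, lat: float, year: int) -> bool:
    """Discovery check: does the record's space-time coverage include this point?"""
    west, south, east, north = record.content.bbox
    first, last = record.content.years
    return west <= lon <= east and south <= lat <= north and first <= year <= last


record = MetadataRecord(
    content=Content(
        description="Downscaled climate projections (illustrative record)",
        variables=["tasmax"],
        bbox=(-125.0, 31.0, -102.0, 49.0),  # hypothetical western US box
        years=(1950, 2099),
    ),
    syntax=Syntax(data_format="NetCDF4", data_types={"tasmax": "float32"}),
    semantics=Semantics(ontology_terms={"tasmax": "http://example.org/term/tasmax"}),
)
```

A catalog built this way could answer discovery questions from the content block alone, while tools rely on the syntax block to read the data and on the semantics block to relate variables across datasets.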

Data Representation

  Explicit Data Model
    • ODM/ODM2
    • Database Schemata
  Self-Describing Formats
    • HDF
    • GRIB
    • NetCDF

Web Service APIs

  Functions
    • Subsetting
    • Aggregation
    • Machine-to-Machine
    • Efficient
  Examples
    • OPeNDAP
    • WaterOneFlow
    • GIS / OGC: WMS, WCS, WFS

Tools

  Tools access web APIs - not files
    • Web tools
    • …, MATLAB, SAS
    • GIS
    • Viz: VTK, IDL

A Big Data Example: Climate Science

Downscaled Climate Data
  • 100TB, replicated to INL
  • 4km Western US and CONUS
  • Multivariate Adaptive Constructed Analogs (MACA)
  • Historical and Projected 1950-2099
  • 20 Models X Several Variables

Web Service API
  • OPeNDAP via THREDDS
  • ncISO
  • OGC: WMS, WCS
  • Aggregation: Virtual datasets
  • Subsetting: Dynamic space-time
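
Dynamic space-time subsetting over OPeNDAP means the client asks the server for an index range per dimension instead of downloading whole files. The sketch below builds a DAP2-style constraint expression of the form `variable[start:stride:stop]...`. The base URL, the variable name `tasmax`, and the (time, lat, lon) dimension order are assumptions for illustration; they do not describe the actual MACA endpoints.

```python
def hyperslab(start: int, stop: int, stride: int = 1) -> str:
    """Format one dimension's index range as a DAP2 hyperslab; stop is inclusive."""
    if start < 0 or stop < start or stride < 1:
        raise ValueError("need 0 <= start <= stop and stride >= 1")
    return f"[{start}:{stride}:{stop}]"


def subset_url(base_url: str, variable: str, time: tuple, lat: tuple, lon: tuple) -> str:
    """Append a constraint expression selecting a space-time block of one variable.

    Assumes the variable's dimensions are ordered (time, lat, lon).
    Each of time, lat, lon is a (start_index, stop_index) pair.
    """
    constraint = variable + "".join(hyperslab(*rng) for rng in (time, lat, lon))
    return f"{base_url}?{constraint}"


# Hypothetical THREDDS-style OPeNDAP endpoint, not a real dataset path.
BASE = "https://example.org/thredds/dodsC/hypothetical/dataset.nc"

url = subset_url(BASE, "tasmax", time=(0, 30), lat=(100, 200), lon=(50, 80))
```

Only the requested block crosses the network, which is what makes files in the 50MB to 5GB range practical to explore interactively. In practice, clients such as the netCDF4 and xarray Python libraries can open an OPeNDAP URL and issue these range requests when an array is sliced.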

Metadata
  • Rich ISO 19115-2 Collection & Granule Metadata
  • Additional Embedded Metadata in NetCDF Files
  • Metadata Harvested and Exposed via NKN Catalog
  • Metadata also Exposed through THREDDS

Tools
  • Instant connectivity for: ArcGIS, R, Python, MATLAB, more
  • Interactive collaborative analysis: IPython Notebooks
  • Web interfaces: maca.northwestknowledge.net

Data Representation
  • Gridded NetCDF4 (HDF5)
  • Thousands of files (between 50MB and 5GB each)
  • Self-describing, machine-readable, embedded metadata

NKN as Idaho’s Online Data Observatory

[Diagram: two mirrored stacks, each labeled Science DMZ, 500TB Storage, Web Services, and Metadata Catalog. Other labels: University, 10Gbps, Investigators, INL HPC, and R, Python, MATLAB, GIS.]

DataONE – Federation of Observatories

• NKN as Tier-4 DataONE Member Node

• DataONE Implemented as Web Service API

• Investigator Toolkit:

Challenges and Opportunities

• Data Interoperability Research
  ◦ Observation and Scientific Data Models
    ▪ e.g. ODM2
  ◦ Increasingly Expressive Underlying File Formats
    ▪ e.g. NetCDF/HDF and successors
  ◦ More Efficient Web Services
  ◦ Data and Model Integration
    ▪ CSDMS, Statistical Issues, Spatio-Temporal Harmonization

• Real-time, on-demand HPC powering NKN web analytic tools and services

Thank You

“The goal is to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.

Lots of new tools are needed to make this happen.”

- Jim Gray, Microsoft Research