HDF5 Overview.Pptx
The HDF Group
HDF5 Overview
Elena Pourmal
[email protected]
The HDF Group
ICALEPCS 2015, 10/17/15
www.hdfgroup.org

Outline
• The HDF Group company
• Products and services
• Overview of HDF5
• What is coming in HDF5 1.10.0 release?
• Future directions

THE HDF GROUP COMPANY

Champaign, Illinois, USA

The HDF Group
• Not-for-profit company (since 2006), ex-NCSA at University of Illinois
• Offices in 5 states
• About 40 employees (more than 50% growth in the past 9 years)
  - Core software developers
  - Domain specialists
  - Documentation team
  - Technical support
• Mission-driven

The HDF Group Mission
To ensure long-term accessibility of HDF data through sustainable development and support of HDF technologies.

The HDF Group philosophy
• Committed to Open Source
  • HDF software is free
  • BSD type of license
• Community involvement
  • Testing
  • Patches
  • New features (e.g., CMake support)
• Serving diverse user base
  • Remote sensing, HPC, non-destructive testing, medical records, scientific modeling, etc.

Revenue by Source
[Pie chart: revenue by source, 2014. Labels as extracted: Light Sources; Earth science; Finance; General; NASA, NOAA; National Labs HPC; Oil & gas; Particle science. Slice values as extracted: 3%, 0%, 4%, 28%, 62%, 3%, 0%.]

Revenue by Project Type
[Pie chart: revenues by type of project. R&D 22%; Development 24%; Premium support 1%; Enterprise support 45%; Consulting and Training and other outreach account for the remaining 8% and 0% slices.]

PRODUCTS AND SERVICES

The HDF Group products
• Main product: HDF Technology Suite
  - For managing high-volume, complex, heterogeneous data
  - Flagship: HDF5 data store
  - Flexible and efficient storage and I/O
  - Portable
  - Highly customizable
  - Misc. tools
  - Specialized software and tools (e.g., JPSS)

Data challenges addressed by HDF5
HDF5 IN 5 MINUTES

HDF5 Technology Platform
• HDF5 Abstract Data Model
  • Defines the “building blocks” for data organization and specification
  • Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces
• HDF5 Software
  • Tools
  • Language Interfaces (C, Fortran, C++, Java)
  • HDF5 Library
• HDF5 Binary File Format
  • Bit-level organization of HDF5 file
  • Defined by HDF5 File Format Specification
• HDF5 Ecosystem
  • Tools and services (h5py, MATLAB, IDL, OPeNDAP, etc.)
  • Communities (Earth Sciences, medical imaging, modeling and visualization)
  • Community standards (NeXus, HDF-EOS5, h5part, CGNS)
  • Institutional support and endorsement (NASA, NOAA, DOE)

Members of the HDF community

Success stories
• Petabytes of NASA remote sensing data in HDF4 and HDF5 file formats
• New NASA/JPSS missions chose HDF5 format for data archiving
[Figure text:]
Need to organize complex collections of data
Long term data preservation

  lat | lon | temp
  ----|-----|-----
   12 |  23 |  3.1
   15 |  24 |  4.2
   17 |  21 |  3.6

Efficient, scalable storage and access

Success story: Trillion Particle Simulation
• Physics plasma simulation at NERSC Cray XE6
• Simulation ran on 120,000 cores using
  - 80% of computing resources
  - 90% of available memory
  - 50% of Lustre scratch system
• Wrote 10 one-trillion-particle dumps of 30-42 TB in HDF5 files; sustained ~27 GB/sec; total 350 TB in HDF5

The HDF Group services
• Helpdesk and mailing lists
  - [email protected]
  - [email protected]
  - Open to all users of HDF
• HDF5 Documentation
  https://www.hdfgroup.org/HDF5/doc/index.html
• HDF Examples (C, Fortran, C++, Java, Python, MATLAB)
  https://www.hdfgroup.org/HDF5/examples/
The HDF Group services
• Standard support
  • Assistance in general areas of HDF usage
• Premium support
  • Access to our consulting and training resources
  • Limited consulting hours are included
• Enterprise support
  • Help with developing common strategies for managing HDF data within organization
  • Organization shares consulting/troubleshooting services
• Training
• Consulting, custom development and support

New Upcoming Features
HDF5 1.10.0 RELEASE

Reusing free file space in a file
PERSISTENT FILE FREE SPACE TRACKING

Unused space in HDF5 file
• HDF5 library currently only tracks free space while file is open
  • Space from deleted objects
  • Space from resized compressed chunks
• Free space in the file is “lost” after file is closed
  • h5repack is used to remove “holes” in the file
• New function H5Pset_file_space
  • Sets a property to track free space in the file that can be reused when file is reopened
  • Allows fine tuning space tracking

Improving performance and saving space
SCALABLE CHUNK INDEXING

Optimizing chunking storage and performance
• HDF5 has an ability to add more data to existing datasets (data arrays)
  • Special storage mechanism: chunked storage
• B-trees are used to index chunks in the file
  • O(log n) lookup time
• HDF5 takes advantage of the access pattern and properties of the datasets
  • O(1) lookup time
  • File space savings when storing HDF5 metadata

Optimizing chunking storage and performance
• B-tree implementation was reworked to use less space in the file
  • Used for datasets with more than one unlimited dimension
• New indexing structures were introduced to achieve O(1) performance and storage savings in special cases
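To make the O(1) special case concrete: when a dataset has a fixed size and every chunk occupies the same number of bytes, the position of a chunk can be computed from its coordinates instead of being searched for in an index. The following is a minimal sketch of that idea only, not the HDF5 library's implementation. The function names, the row-major ordering of chunks, and the `base_addr` parameter are assumptions made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an HDF5 API): number of chunks needed to
 * cover `dim` elements with chunks of `chunk` elements (rounds up). */
size_t chunks_along(size_t dim, size_t chunk)
{
    return (dim + chunk - 1) / chunk;
}

/* Hypothetical helper: row-major linear index of the chunk that holds
 * the element at `coord`. The cost depends only on the rank, not on
 * how many chunks the dataset has. */
size_t chunk_linear_index(size_t rank, const size_t *dims,
                          const size_t *chunk, const size_t *coord)
{
    size_t index = 0;
    for (size_t d = 0; d < rank; d++) {
        /* Shift by the chunk count of this dimension, then add the
         * chunk coordinate along it. */
        index = index * chunks_along(dims[d], chunk[d])
              + coord[d] / chunk[d];
    }
    return index;
}

/* Hypothetical helper: byte address of a chunk, assuming all chunks
 * have the same size and are stored back to back from `base_addr`. */
uint64_t chunk_address(uint64_t base_addr, uint64_t chunk_bytes,
                       size_t linear_index)
{
    return base_addr + chunk_bytes * (uint64_t)linear_index;
}
```

For example, for a 10 x 20 dataset with 4 x 5 chunks, the element at (9, 17) lies in chunk (2, 3), and with 4 chunks along the second dimension its linear index is 2 * 4 + 3 = 11. Once compression filters are applied the chunks no longer have equal sizes on disk, so a computed address like this is no longer enough and some per-chunk record of addresses is needed.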
Optimizing chunking storage and performance
• Examples of O(1) lookup access:
  • Fixed-size chunked dataset with no compression filters
    • Algorithmic lookup
  • Fixed-size chunked dataset with compression filters
    • Array to index chunks
  • Fixed-size dataset stored in one chunk (i.e., we now allow compression for contiguous dataset)
    • No index
  • Dataset with one unlimited dimension
    • Extensible array to index chunks

CONCURRENCY: SINGLE-WRITER/MULTIPLE-READER

Concurrent Access to Data
[Diagram: Writer, HDF5 File, Reader]
New data elements are added to a dataset in the file, which can be read by a reader, with no IPC necessary.

Managing data stored across HDF5 files
VIRTUAL DATASET (VDS)

VDS Use Case with NPP satellite data
• 4 granules in 9 GMODO-SVM07… files
• Visualization with IDV

VDS Use Case with NPP satellite data
• One virtual dataset with 36 granules stored in one file
• Visualization with IDV

VDS use case: Percival detector
[Diagram: four writers produce series of images in Dataset A, Dataset B, Dataset C and Dataset D, stored in a.h5, b.h5, c.h5 and d.h5. A reader opens a Virtual Dataset in VDS.h5. The VDS has images A, B, C and D interleaved along the time axis (t1, t2, t3, t4, ..., t1+4k, t3+4k).]

VDS: Conceptual View

Performance boost when opening and closing HDF5 files
METADATA CACHE IMAGE

Problem: Metadata Cache Image
• HDF5 metadata is typically small and scattered throughout the file.
• The resulting many small I/Os are a major problem for parallel file systems.
• The metadata cache minimizes this during normal operation, but it must still populate the cache on file open and flush it on file close.
• This is a problem if files are opened and closed often.
Solution: Metadata Cache Image
• Store the contents of the metadata cache in a single block at file close, and then populate the cache with the stored entries on file open.
• If the access pattern is similar over close and reopen, this should save a significant number of small I/O operations.
• This solution is implemented in the metadata cache image feature.

Metadata Cache Image
• To enable, set cache image FAPL property on file create or open:

    /* Cache image configuration, as given on the slide */
    H5AC_cache_image_config_t cache_image_config =
        {H5AC__CURR_CACHE_IMAGE_CONFIG_VERSION, TRUE, 0};

    /* File access property list (FAPL) carrying the setting */
    fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    H5Pset_mdc_image_config(fapl_id, &cache_image_config);

• Then create or open file as usual.

Metadata Cache Image
• Metadata cache image is read and deleted automatically on file open.
• Must set cache image FAPL property again if a new cache image is desired on file close.
• Earlier versions of HDF5 that don't understand the cache image will refuse to open the file.
• A lightweight utility can remove the caching info, making the file compatible with 1.8.
• Prototype implementation showed an order-of-magnitude speedup on parallel systems.

Performance improvements
DATA AGGREGATION AND PAGE BUFFERING

Page buffering / Data aggregation
Aggregate and align metadata and small data, perform I/O in aligned pages

Data and Metadata Aggregators
The new aggregators pack small raw data and metadata allocations into aligned