ADIOS/Tutorial

Dresden, Germany 10/11/16 – 10/13/16 S. A. Klasky ORNL, GT, UTK Too many contributors to mention, but including • Norbert Podhorszki • Viraj Bhat • Chuck Atkins • Qing Liu • Rob Ross • Axel Huebl • Matt Wolf • Garth Gibson • Tahsin Kurc • Karsten Schwan • Chen Jin • Joel Saltz • Manish Parashar • Roselyne Tchoua • Jai Dayal • Ciprian Docan • C. S. Chang • Qian Sun • Hasan Abbasi • Stephane Ethier • Mark Ainsworth • Fang Zheng • Michael Bussmann • William Tang • Fan Zhan • Wei Xue • Jeroen Tromp • Tong Jin • Nagiza Samatova • Weikuan Yu • Jong Choi • Bing Xie • Dave Pugmire • Jeremy Logan • George Ostrouchov • Yuan Tian • James Kress • Arie Shoshani • Eric Suchyta • John Wu

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] Where is ORNL?

[email protected] A Little About ORNL…

[email protected] DOE’s Office of Science Computation User Facilities

• DOE is a leader in open High-Performance Computing • Provide the world’s most powerful computational tools for open science • Access is free to researchers who publish • Boost US competitiveness • Attract the best and brightest researchers • NERSC: Edison is 2.57 PF • ALCF: Mira is 10 PF • OLCF: Titan is 27 PF [email protected] What is the Leadership Computing Facility (LCF)?

• Collaborative DOE Office of Science user facility program at ORNL and ANL • Mission: Provide the computational and data resources required to solve the most challenging problems. • 2 centers / 2 architectures to address diverse and growing computational needs of the scientific community • Highly competitive user allocation programs (INCITE, ALCC). • Projects receive 10x to 100x more resource than at other generally available centers. • LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts).

[email protected] Three primary user programs for access to LCF Distribution of allocable hours

• 10% Director’s Discretionary • 30% ALCC (ASCR Leadership Computing Challenge) • 60% INCITE

[email protected] Our Science requires that we continue to advance our computational capability over the next decade on the roadmap to Exascale.

Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 cores. Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.

Jaguar: 2.3 PF, multi-core CPU, 7 MW (2010). Titan: 27 PF, hybrid GPU/CPU, 9 MW (2013). Summit: 5-10x Titan, hybrid GPU/CPU, 10 MW, CORAL system (2017). OLCF5: 5-10x Summit, ~20 MW (2022).

[email protected] What is ADIOS?

• An extendable framework that allows developers to plug in:
  • File Formats: HDF5, netcdf, …
  • Stream Format: ADIOS-BP
  • Plug-ins: Analytic, Visualization
  • Indexing: FastBit, ISABELLA-QA
• Incorporates the "best" practices in the I/O middleware layer
• Incorporates self-describing data streams and files
• Released twice a year, now 1.9, under the completely free BSD license
• https://www.olcf.ornl.gov/center-projects/adios/, https://github.com/ornladios/ADIOS
• Available at ALCF, OLCF, NERSC, CSCS, Tianhe-1,2, Pawsey SC, Ostrava
• Applications are supported through OLCF INCITE program
• Outreach via on-line manuals, and live tutorials
(Figure: the ADIOS framework. An interface to apps for description of data sits on top of data management services (feedback, buffering, scheduling) with pluggable I/O methods (Aggregate, POSIX, MPI), multi-resolution, compression/decompression and data indexing (FastBit) methods, plus plugins to the hybrid staging area: provenance, workflow engine, runtime engine, data movement, analysis and visualization plugins. Output goes as ADIOS-BP, IDX, HDF5, pnetcdf, "raw" data or image data to the parallel and distributed file system and viz. clients.)
[email protected] ADIOS applications
1. Accelerator: PIConGPU, Warp
2. Astronomy: SKA
3. Astrophysics: Chimera
4. Combustion: S3D
5. CFD: FINE/Turbo, OpenFoam
6. Fusion: XGC, GTC, GTC-P, M3D, M3D-C1, M3D-K, Pixie3D
7. Geoscience: SPECFEM3D_GLOBE, AWP-ODC, RTM
8. Materials Science: QMCPack, LAMMPS
9. Medical Imaging: Cancer pathology
10. Quantum Turbulence: QLG2Q
11. Relativity: Maya
12. Weather: GRAPES
13. Visualization: Paraview, Visit, VTK, ITK, OpenCV, VTKm
(LCF/NERSC codes in red)
Impact on Industry:
• NUMECA (FINE/Turbo) – Allowed time-varying interaction of turbomachinery-related aerodynamic phenomena
• TOTAL (RTM) – Allowed running of higher fidelity seismic simulations
• FM Global (OpenFoam) – Allowed running higher fidelity fire propagation simulations
Over 1B LCF hours from ADIOS-enabled Apps in 2015; over 1,500 citations
[email protected] Impact from running large scale applications at scale
Typical Applications on LCF machines observed a 10X performance improvement in I/O

[email protected] Impact at the HPC User facilities
• ALCF • OLCF • NERSC • Tianhe-1A • Tianhe-2 • Bluelight • Singapore • KAIST • Ostrava • Dresden • ERDC • CSCS • Blue Waters • EPFL • Barcelona Supercomputing Center
[email protected] How to use ADIOS
• ADIOS is provided as a library to users; use it like other I/O libraries, except
• ADIOS has a declarative approach for I/O
  • User defines in application source code: "what" and "when"
  • Every process defines what data and when to output or read
  • ADIOS takes care of the "how"
• Biggest hurdle for users:
  • Forget all of your manual tricks to gain I/O performance on your particular target system and target scale and just say what you want to write/read
  • Trust ADIOS to deliver the performance
• Performance Portability:
  • Write once, perform well anywhere
  • It comes naturally with ADIOS
  • ADIOS has many different I/O methods (strategies)
• Predictable performance
  • Staging, I/O throttling: Allow scientists to use different computational technologies to achieve good, predictable performance
[email protected] ADIOS project goals and current status
• Utilize the best practices for ALL of the platforms DOE researchers utilize
  • File System, Topology, Memory Optimizations
• Domain Specific Optimizations
  • PIC simulations, Monte Carlo simulations, Data Driven (Visualization, Medical Images, …)
• Predictable Performance Optimizations
  • Hybrid Staging Techniques, I/O Throttling
• Reduce I/O load
  • In situ indexing, queries, executable code, data refactoring
• In situ infrastructure for code coupling and in situ/in-transit processing
  • Hybrid Staging, Burst Buffers, learning techniques for caching
• Usability and Sustainability
  • Partnering with Kitware, along with user surveys, to create better software for our users

[email protected] ADIOS Research-Development-Production cycle

(Figure: applications drive the ADIOS Research–Development–Production cycle: Research (ORNL, universities) → Development (ORNL) → Production (ORNL, Kitware), alongside influencers, technology, and collaborators such as HDF5.)

[email protected] R&D necessary to support petascale Apps • I/O for checkpoint restart files, for large writes per process • Small writes at high velocities • Reading for different IO patterns • Domain Specific methods • IO variability

[email protected] Key ideas for good performance of ADIOS for large writes

• Avoid latency (of small writes)
  • Buffer data for large bursts
• Avoid global communication with each other
  • ADIOS has that for metadata only, which can even be postponed for post-processing
• Later: Topology-aware data movement that takes advantage of topology
  • Find the closest I/O node to each writer
  • Minimize data movement across racks/mid-planes (on BlueGene/Q)
ADIOS-BP stream/file format:
• Allows data from each node to be written independently, with metadata
• Ability to create a separate metadata file when "sub-files" are generated
• Allows variables to be individually compressed
• Has a schema to introspect the information, on each process
• Has workflows embedded into the data streams
• Format is for "data-in-motion" and "data-at-rest"
[email protected] Checkpoint Restart File Writes Solved for Chimera & GTC

J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, C. Jin, Flexible io and integration for scientific codes through the adaptable io system (adios) in Proceedings of the 6th international workshop on Challenges of large applications in distributed environments, ACM, pp. 15–24. J. Lofstead, F. Zheng, S. Klasky, K. Schwan, Input/output apis and data organization for high performance scientific computing in Petascale Data Storage Workshop, 2008. PDSW’08. 3rd, IEEE, pp. 1–6. J. Lofstead, F. Zheng, S. Klasky, K. Schwan, Adaptable, metadata rich IO methods for portable high performance IO in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, IEEE, pp. 1–10. Z. Lin, Y. Xiao, I. Holod, W. Zhang, W. Deng, S. Klasky, J. Lofstead, C. Kamath, N. Wichmann. Advanced simulation of electron heat transport in fusion plasmas. Journal of Physics: Conference Series 2009, 180, 012059. C. Chang, S. Ku, P. Diamond, M. Adams, R. Barreto, Y. Chen, J. Cummings, E. D’Azevedo, G. Dif-Pradalier, S. Ethier, et al.. Whole-volume integrated gyrokinetic simulation of plasma turbulence in realistic diverted- tokamak geometry. Journal of Physics: Conference Series 2009, 180, 012057. [email protected] Impact to other LCF applications http://www.sciencedaily.com/releases/2016/01/160126130823.htm • Accelerators – PIConGPU • M. Bussmann, et al. - HZDR • Study laser-driven acceleration of ion beams and its use for therapy of cancer • Computational laboratory for real-time processing for optimizing parameters of the laser • Over 184 GB/s on 16K nodes on Titan • Seismic Imaging – RTM by Total Inc. • Pierre-Yves Aquilanti, TOTAL E&P PH5 in of a CRADA http://rice2016oghpc.rice.edu/program/ • TBs as inputs, outputs PBs of results along with intermediate data PH5 • Company conducted comparison tests among several I/O solutions. ADIOS is their choice for other codes: FWI, Kirchoff

[email protected] Small Writes From Many Cores at high velocities • Small writes

• Overheads are bigger than the cost of outputting small amount of bytes (http://www.hpcwire.com/2009/10/29/adios_ignites_combustion_simulations/)
• Avoid accessing a file system target from many processes at once
  • Aggregate to a small number of actual writers: O(file system targets)
• Avoid lock contention
  • by striping correctly & writing to subfiles
• Reading many small data blocks
  • Data reorganization to merge data into bigger blocks
• Reading with different access patterns than the data was written
  • Data reorganization to optimize overall read access
Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, et al. Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks. Concurrency and Computation: Practice and Experience 2014, 26, 1453–1473.
[email protected] Further impact: OpenFOAM CFD simulations (Yi Wang, Karl Meredith – FM Global; S. Klasky, N. Podhorszki – ORNL)
Challenges:
• Open C++ framework for all CFD simulations
• FM Global: fireFOAM application for fire simulations
• Output is stream-based, no central IO class
• Hundreds of objects contribute to the output writing
• Many small files with even more tiny write operations, per processor and simulation variable
• Add a scalable and fast IO solution to OpenFOAM
Approach:
• Use ADIOS to write a single file per timestep
• Use OpenFOAM functionObject approach for defining extra functionality (used to extend OpenFOAM with new computations, I/O, diagnostics, etc.)
• User specifies the output objects in a configuration file (in a usual way for OpenFOAM)
Research Products/Artifacts:
• Prototype implementation of ADIOS IO in OpenFoam 2.2 source as extension
• https://github.com/pnorbert/IOADIOSWrite
• Tests performed by FM Global on local cluster with NFS file system as well as at OLCF with Lustre file system
Select Results: "ADIOS was able to achieve a 12X performance improvement from the original IO" https://www.olcf.ornl.gov/2016/01/05/fighting-fire-with-firefoam/ (Figure caption: stacking commodities on wood pallets slows horizontal fire spread, versus absence of pallets.)
[email protected] Reading Impact on LCF systems
• Focus on data organization for "typical" scientific applications

• M. Polte, J. Lofstead, J. Bent, G. Gibson, S. A. Klasky, Q. Liu, M. Parashar, N. Podhorszki, K. Schwan, M. Wingate, et al., ... and eat it too: high read performance in write-optimized HPC I/O middleware file formats in Proceedings of the 4th Annual Workshop on Petascale Data Storage, ACM, pp. 21–25. • Y. Tian, S. Klasky, H. Abbasi, J. Lofstead, R. Grout, N. Podhorszki, Q. Liu, Y. Wang, W. Yu, EDO: improving read performance for scientific applications through elastic data organization in Cluster Computing (CLUSTER), 2011 IEEE International Conference on, IEEE, pp. 93–102. • Y. Tian, S. Klasky, W. Yu, H. Abbasi, B. Wang, N. Podhorszki, R. Grout, M. Wolf, A system-aware optimized data organization for efficient scientific analytics in Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing, ACM, pp. 125–126. First Place, ACM Student Research Competition Grand Finals 2012 • J. Y. Choi, H. Abbasi, D. Pugmire, N. Podhorszki, S. Klasky, C. Capdevila, M. Parashar, M. Wolf, J. Qiu, G. Fox, Mining hidden mixture context with ADIOS-P to improve predictive pre-fetcher accuracy in E-Science (e-Science), 2012 IEEE 8th International Conference on, IEEE. [email protected] Space Filling Curve Reordering for Concurrency

• Linear placement of data leads to hotspots on storage nodes (reading 2D planes)
  • Can't leverage aggregated bandwidth, poor scalability
• Distribute data chunks on storage targets along the Hilbert curve ordering
  • Does not change the data organization within each chunk
  • Achieving near-optimal concurrency for any access pattern
• Good and balanced read performance; 37X speedup on Jaguar for S3D (peak planar read performance)
• First Place, ACM Student Research Competition Grand Finals 2012
(Figure: Hilbert curve reordering of a 2D array with 16 small chunks (64KB/chunk) across 4 storage targets, versus linear ordering.)
[email protected] I/O Variability on LCF systems
• Static assumptions about I/O provisioning are sensitive to point contention on Titan
  • Aggregation with write-behind strategy
  • Stripe alignment: to avoid contention
• Slowdown on a single node can stall I/O

• J. Lofstead, F. Zheng, Q. Liu, S. Klasky, R. Oldfield, T. Kordenbrock, K. Schwan, M. Wolf, Managing variability in the IO performance of petascale storage systems in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society, pp. 1–12. • Q. Liu, N. Podhorszki, J. Logan, S. Klasky, Runtime I/O ReRouting + Throttling on HPC Storage in 5th USENIX Workshop on Hot Topics in Storage and File Systems, USENIX, Berkeley, CA. https://www.usenix.org/conference/hotstorage13/workshop-program/presentation/Liu. [email protected] Staging to address I/O performance/variability issue • Simplistic approach to staging • Decouple application performance from storage performance (burst buffer) • Move data directly to remote memory in a “staging” area • Write data to disk from staging • Built on past work with threaded buffered I/O • Buffered asynchronous data movement with a single memory copy for networks which support RDMA • Application blocks for a very short time to copy data to outbound buffer • Data is moved asynchronously using server-directed remote reads • Exploits network hardware for fast data transfer to remote memory

[email protected] XGC I/O Variability • Staging of XGC output • Observed high I/O variations with files on XGC check-pointing • Research on Staging I/O with Rutgers and GA Tech: Decoupling app and I/O • Reduced I/O variations with staging methods as well as high I/O throughput • Creates new research opportunities with XGC I/O variations on checkpoint writing. Compared File I/O and Staging I/O on Titan. BurstBuffer (ADIOS + Staging + BurstBuffer) File • Impact: I/O performance is much more XGC ADIOS Staging predictable BurstBuffer [email protected] What do we need to address for the CORAL machines • Data Compression, Decompression • Data rerouting for I/O variability • In situ integration with hybrid data staging • Data models for “plug-ins” for analytics and visualization • Burst Buffers • Campaign Storage

[email protected] ISOBAR Lossless Compression
• Constraints from the storage system (size, bandwidth) require us to look at data compression
• ISOBAR (In-Situ Orthogonal Byte Aggregate Reduction) Compression is a preconditioner-based, high-throughput lossless compression technique for hard-to-compress scientific datasets
ISOBAR compared to best standard lossless alternative (zlib or bzip2):
Dataset   % improvement on compression   Speedup, compression   Speedup, decompression
S3D       32.6                           31                     63
GTS       10.2                           8                      5
XGC1      14.1                           21                     52
FLASH     17.2                           36                     14
• S. Lakshminarasimhan, N. Shah, S. Ethier, S. Klasky, R. Latham, R. Ross, N. F. Samatova, Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data in Euro-Par 2011 Parallel Processing, Springer Berlin Heidelberg, 2011, pp. 366–379.
• J. Jenkins, E. R. Schendel, S. Lakshminarasimhan, D. A. Boyuka, T. Rogers, S. Ethier, R. Ross, S. Klasky, N. F. Samatova, Byte-precision level of detail processing for variable precision analytics in High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, IEEE, pp. 1–11.
• D. A. Boyuka, S. Lakshminarasimham, X. Zou, Z. Gong, J. Jenkins, E. R. Schendel, N. Podhorszki, Q. Liu, S. Klasky, N. F. Samatova, Transparent in Situ Data Transformations in ADIOS in Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, IEEE, pp. 256–266.
[email protected] Transformation layer
• Designed for data conversions, compression, and transformation
  • zlib, bzip2, szip, ISOBAR, ALACRITY, FastBit, …
  • that can transform local data on each processor
• Transparent for users
  • User code reads/writes the original untransformed data
• Applications
  • Compressed output
  • Automatically indexed data
  • Local Data Reorganization
  • Data Reduction
• Released in ADIOS 1.6 in 2013 with compression transformations
(Figure: user application variables pass through read/write plugins in the ADIOS transform layer before the I/O transport layer writes regular or transformed variables to a BP file, staging area, etc.)

[email protected] ADIOS + Indexing/Queries • GOAL: reduce the time to read data by understanding what to read • Designed for querying data in multi-dimensional arrays • Unified handling of conditions on coordinates and values • Support parallel queries through explicit data partitioning • Key Challenges • Organize indexing data structures efficiently • Reading efficiently • Impact • S3D data run on Jaguar can be queried + read much faster than reading entire dataset • E.g. S3D data on a 1100x1080x1048 mesh [email protected] M. Bussmann, A. Huebl, R. Widera - HZDR S. Klasky, Q. Liu, N. Podhorszki, M. Wolf– ORNL G. Eisenhauer, K. Schwan – GT PIConGPU + ADIOS transforms M. Parashar – Rutgers N. Samatova - NCSU • PIConGPU has particle data that can be compressed well • They use the transform layer and existing compression transformation to decrease the size of output files • Performance penalty of compression however is too high for the application • Each CPU spends time to compress local data (~2.5GB uncompressed) • PIConGPU, a GPU application could use the full CPU for compression • PIConGPU developers implement a new parallel compression method in ADIOS for their own data using brotli from Google • Multi-threaded implementation to compress a block faster in parallel • Other techniques discussed in the Sirius presentation

[email protected] Staging works but we need techniques for the last mile Write Performance • Static assumptions about I/O provisioning don’t work • Techniques that achieved high performance I/O • Aggregation with write-behind strategy • Stripe alignment: to avoid contention • Possible solution

• Dynamically re-route I/O requests away from congested storage devices

(Figure: the I/O re-routing framework. A global coordinator (GC) and per-group sub-coordinators (SCi) sit above the processors (Pi), which reach the storage devices (SDi) and their files over the interconnection network (shown for Hopper). Protocol messages: 1 WRITE_IDLE, 2 RE-ROUTE REQ, 3 RE-ROUTE ACK, 4 WRITE_MORE, 5 WRITE_REROUTE.)

[email protected] I/O Pipelines in Staging
• Use the staging nodes and create a workflow in the staging nodes
• Prepare the data for future analytics
• Mitigate performance impact of I/O by using asynchronous data movement
(Figure: staging pipeline: the particle array feeds a BP writer (BP file), a Sort step (sorted array), Bitmap Indexing (index file), and Histogram / 2D Histogram Plotters.)
• Total simulation runtime improved by 2.7%
• Online processing of 260GB of data from 16K cores in 40s
[email protected] DataSpaces
• Goal: Seamlessly support in situ processing for coupled simulations using hybrid staging
• Support for applications usage modes such as IO offloading, in-situ and in-transit analytics, tight/loose coupling, fan-out workflows, etc.
• Autonomic data placement and movement leveraging deep storage and memory hierarchies
• Results: Performance improvement for fusion workflow using ADIOS + DataSpaces
• Impact
  • Loose coupling: XGC1 + M3D + Viz
  • Tight coupling: XGC1 + XGCa
  • Loose coupling: S3D + analytics
  • Tight coupling: LES + DNS
[email protected] ADIOS + Burst Buffers
• Goal: Extend ADIOS to leverage Burst Buffers and effectively support I/O & In-Situ use cases
  • Staged Writes (for post-processing)
  • Checkpoint-Restart
  • In-Situ/In-Transit Analytics/Visualization
  • Accelerated Reads (Prefetching)
  • Out-of-core methods
  • Application/Workflow Coupling/Exchange
• Current Status: Initial explorations and prototyping
  • Initial explorations with burst buffer technologies (Datawarp, IME)
  • Initial exploration with emerging platforms (e.g., Cori)
• Research Challenges: Resource Allocation Priority, Access Control, Persistence, Coordination

[email protected] Performance Portability • ADIOS has methods optimized for different platforms • BGQ-GPFS, Cray XK-Lustre, IB clusters • Different methods can be changed for R/W optimizations • E.g. Quantum Physics QLG2Q code Performance Island navigation with ADIOS

• BGQ • Aggregate • Posix • Dataspaces [email protected] What do we need to do for the exascale machines • Hybrid data staging • I/O and skeleton workflow MiniApps • Non traditional (Ensemble) I/O for big data processing • Data models for “plug-ins” for analytics and visualization • Burst Buffers • Multi-tier I/O and storage management (discussed in my 2nd presentation)

[email protected] Data Management Tradeoffs at Exascale to hybrid staging • Balance of memory size and speed • Feedback for node designs with NVRAM, larger memory, on-chip NIC • Network throughput and latency impact on SDMA tasks • Placement of operations in concert with solver and network topology

Explore node design choices for data management

[email protected] In-situ and In-transit Partitioning of Workflows • Primary resources execute the main simulation and in situ computations • Secondary resources provide a staging area whose cores act as buckets for in transit computations

• 4896 cores total (4480 simulation/in situ; 256 in transit; 160 task scheduling/data movement) • Simulation size: 1600x1372x430 • All measurements are per simulation time step [email protected]

Data Staging considering Idle CPU Resources

• MPI simulations with OpenMP show idle resources during data movement
(Figure: breakdown of main loop time into OpenMP, MPI, Other and Sequential portions for GTC, GTS, GROMACS (D.DPPC, ADH_Cubic), LAMMPS (LJ, EAM, Chain, Rhodopsin), BT-MZ and SP-MZ (Classes C, D, E) at 1536 and 3072 cores; plus the distribution of idle-period lengths in milliseconds.)
• Challenges
  • Interference between simulation and analytics due to contention on memory hierarchy
  • Select suitable idle periods to amortize scheduling costs
• Approach: Harvest Idle Resources for In-Situ Analytics
  • Dynamically predict idle resource availability
  • Reduce interference with execution throttling
• Impact: GTS simulation with parallel coordinate visualization
  • Improve time to solution and resource efficiency
  • Indicate that co-scheduling analytics services on underutilized resources can improve in situ processing of data
  • Scale to up to 12288 cores on Hopper Cray XE6
[email protected] In Situ/Transit/LAN/WAN Data Staging for EOD data

• ICEE – Using EVPath package (GATech) – Support uniform network interface for TCP/IP and RDMA – Allows us to stage data over the LAN/WAN – Used for EOD data in the fusion community.
• Dataspaces (with sockets) – Support TCP/IP and RDMA
• Select only areas of interest and send (e.g., blobs)
• Reduce payload on average by about 5X
(Figure: an image divided into sub-chunks; only the areas of interest are sent, the rest is filtered out.)
[email protected] Creating living I/O MiniApps/skeletons to study these workflows

• Extended BP metadata to support automatic replay, by saving provenance • Allows new MiniApps to be created from any previous simulation • Incorporates performance information for post-hoc performance understanding

• Workflow Model provides input and output data for each component, uses a DAG to describe data flow • Code generators are built using templates [email protected] Data Driven/ensemble calculations R&D • Utilize ADIOS in image analysis calculations to reduce the impact of I/O and data movement, and reduce the time to knowledge • Problems: • I/O from ADIOS, … are meant for “traditional” simulations and are often synchronized at certain times, which causes I/O overheads for ensemble I/O • No data model for ADIOS which is common in different software ecosystems, requiring multiple data copies, translations, making the I/O impact larger • New Research to solve these problems • New research for I/O for data driven applications • New data organization strategies • New mechanisms for coordinating data and computations • Integration into ADIOS into • ITK, OpenCV, VTK, VTK-M, Catalyst [email protected] ADIOS and VTKm • VTKm and XViz are efforts to prepare for the increasing complexity of extreme-scale visualization • Addresses: Emerging Processors, In Situ Visualization, and Usability • Minimizes resources required for analysis and visualization • Processor/Accelerator awareness provides execution flexibility • Idea is to incorporate VTK-M into the ADIOS software eco-system

[email protected] Data Lifecycle management • Exascale data means larger volumes of data generated from the simulation DATA • Requires in situ data reduction to store as much knowledge is necessary with minimal excess data • Data = Knowledge + Code + Workflow 2 + Visualizations + O(data) Knowledge Code Workflow Visualization

• R&D for exascale systems include • Data morphing for changing the layout of data as it moves to different systems and transforms itself from data to knowledge • Learning techniques to pre-fetch, pre-calculate data from semantically described intentions • Better data reduction techniques (next talk) [email protected] Future ADIOS research for the ECP Key challenges to exascale we want to address • Burst Buffers • Hybrid Staging • Combine in-memory and inter-node data exchanges to optimize data transfer between applications • WAN Staging • Connect simulation data with experimental data • Enable near-real time decision making on remote simulations/experiments • Code coupling frameworks • Enable building exascale applications flexibly from multiple separate codes, including the in-situ data processing pipeline; integration to workflow systems • Data Refactoring • Reduce the amount of data while retain most of the information written to permanent storage • Optimize access to most frequently used data • Creation on living MiniApps & workflow Skeletons • Data Lifecycle • Data + Code + workflow + index + additional metadata [email protected] Ties to identified requirements of applications and other software components? • Kitware and ORNL + x are working together on the data software eco-system • Exascale Applications (CAAR apps for example) have I/O and in situ data processing requirements and we work with many of them • Post processing visualization/analysis drives I/O development • Technology (Burst Buffers, HBM) drive • Requirements: ADIOS + HDF5 can R/W from each other file format • Burst Buffer APIs need to be integrated into ADIOS so users don’t pay attention to this • Software eco-system • ADIOS can work with analytics/viz: e.g. pbdR, OpenCV, Python, Matlab, ITK, Visit, Paraview, … • ADIOS can work with applications: code coupling, workflow data mover • Campaign Storage, …. Integrating the storage hierarchy so apps go through ADIOS and don’t worry about these technologies. • How does ADIOS work with other programming models? Is this important now, in the future? [email protected] Making ADIOS work for the exascale • Need I/O frameworks nimble to support new methods by SDM researchers for next- generation Exascale applications • Better techniques for hybrid staging methods to allow easy methods to plug-in Knowledge Discovery (Analytics/ Visualization) tools and technologies • Better testing and development tools for easier integration of research ideas into a production platform, including better “living” documentation • Easy ways for users to incorporate ADIOS into their code (analysis, visualization post processing codes as well) so that they can standardize on this one way to deal with data, and it will work with data efficiently on the exascale platform as well as the smaller size machines

[email protected] XSSA: eXtreme Scale Service Architecture • Philosophy based on Service-Oriented Architecture • System management • Changing requirements • Evolving target platforms • Diverse, distributed teams • Applications built by assembling services • Universal view of functionality • Well defined API • Implementations can be easily modified and assembled • Manage complexity while maintaining performance, scalability • Scientific problems and codes • Underlying disruptive infrastructure • Coordination across codes and research teams • End-to-end workflows

[email protected] ADIOS 2

[email protected] 52 Main components • Community: new C++ version to allow easy integration • Sustainability: Kitware is our commercial partner to provide nightly testing, CDASH, CMAKE. • Technology: Take our Research artifacts and harden them for next generation technology – Phase change memory, HBM, NVRAM, campaign storage, object stores • Applications: New types of I/O patterns (e.g. multiple applications competing for resources, in situ write patterns, data driven output, code-coupling frameworks) • Interoperability: Write or Read uniformly with HDF5, pnetCDF, common schemas to allow for in situ plugins, • Resilience: Incorporation of community routines SCR, … as new plugins to Checkpoint different services in the framework

[email protected] ADIOS 2

[email protected] Staging: Petascale Staging to Exascale Staging
• Use compute and deep-memory hierarchies to optimize overall workflow for power vs. performance tradeoffs
• Abstract complex/deep memory hierarchy access
• Placement of analysis and visualization tasks in a complex system
• Impact of network data movement compared to memory movement
• Abstraction allows staging
  – On-same core
  – On different cores
  – On different nodes
  – On different machines
  – Through the storage system
[email protected] Evolution of data to Information

[email protected] ADIOS roadmap to Exascale 2017 staging • Create test harness • Living workflow • Create a clearer, more modular layering • Support for new programming models of application interfaces, data • WAN staging abstractions, and runtime components 2019 • Burst buffer support • EOD integration for validation workflows • New methods for CORAL optimizations • Ensemble workflow optimizations 2018 • Data Model support for software • Code coupling support with hybrid ecosystem Beyond 2019 • Container Support • Data reduction models and methods • Data Lifecycle • Data refactoring method • Workflow Support • Support for new storage hierarchy • Support for HBM [email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] 59 Code examples GitHub: adiosvm https://github.com/pnorbert/adiosvm.git

On the VM ~/Tutorial

[email protected] ADIOS Approach • I/O calls are of declarative nature in ADIOS • which process writes what • add a local array into a global space (virtually) • adios_close() indicates that the user is done declaring all pieces that go into the particular dataset in that timestep • I/O strategy is separated from the user code • aggregation, number of subfiles, target filesystem hacks, and final file format not expressed at the code level • This allows users • to choose the best method available on a system • without modifying the source code • This allows developers • to create a new method that’s immediately available to applications • to push data to other applications, remote systems or cloud storage instead of a local filesystem

[email protected] Writing setup/cleanup API
• Initialize/cleanup: #include "adios.h"
• adios_init ("config.xml", comm)
  • parse XML file on each process
  • setup transport methods
  • MPI_Comm
    • only one process reads the XML file
    • some staging methods can connect to staging server
• adios_finalize (rank)
  • give each transport method opportunity to cleanup
  • particularly important for asynchronous methods to make sure they have completed before exiting
  • call just before MPI_Finalize()
• adios_init_noxml (comm)
  • Use instead of adios_init() when there is no XML configuration file
  • Extra APIs allow for defining variables
Fortran
use adios_write_mod
call adios_init ("config.xml", comm, err)
call adios_init_noxml (comm, err)
call adios_finalize (rank, err)

[email protected] 62 API for writing 1/3 • Open for writing • adios_open (fh, “group name”, “file name”, mode, comm) • int64_t fh handle used for subsequent calls for write/close • “group name” matches an entry in the XML • identifies the set of variables and attributes that will be written • Mode is one of ‘w’ (write) or ‘a’ (append) • Communicator tells ADIOS what processes in the application will perform I/O on this file • Close • adios_close (fh) • handle from open Fortran integer*8 :: fd call adios_open (fd, group_name, filename, mode, comm, err) call adios_close (fd, err)

[email protected] 63 API for writing 2/3 • Write • adios_write (fh, “varname”, data) • fh is the handle from open • Name of variable in XML for this group • Data is the reference/pointer to the actual data • NOTE: with a XML configuration file, adios can build Fortran or C code that contains all of the write calls for all variables defined for a group • Must specify one call per variable written

Fortran call adios_write (fd, varname, data, err)

[email protected] 64 Important notes about the write calls • adios_write() • usually does not write to the final target (e.g. file) • most of the time it only buffers data locally • when the call returns, the application can re-use the variable’s memory • adios_close() • takes care of getting all data to the final target • usually the buffered data is written at this time
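Putting the calls above together, here is a minimal C sketch of one output step. The configuration file name, the group name ("restart") and the variables nx and t are illustrative assumptions, not from the tutorial code, and the adios_group_size() call shown here is normally produced for you by the code that gpp.py generates from the XML (shown later in this tutorial).

#include <stdint.h>
#include <mpi.h>
#include "adios.h"

int main (int argc, char *argv[])
{
    int      rank, i, nx = 10;
    double   t[10];
    int64_t  fh;                                    /* ADIOS file handle                     */
    uint64_t groupsize, totalsize;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    for (i = 0; i < nx; i++) t[i] = rank*nx + i;

    adios_init ("config.xml", MPI_COMM_WORLD);      /* parse XML, set up transport methods   */

    adios_open (&fh, "restart", "restart.bp", "w", MPI_COMM_WORLD);
    groupsize = sizeof(int) + nx*sizeof(double);    /* bytes this process will write         */
    adios_group_size (fh, groupsize, &totalsize);
    adios_write (fh, "nx", &nx);                    /* names must match the XML group        */
    adios_write (fh, "t", t);                       /* buffered; t can be reused on return   */
    adios_close (fh);                               /* data actually reaches the target here */

    adios_finalize (rank);                          /* just before MPI_Finalize()            */
    MPI_Finalize ();
    return 0;
}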

[email protected] 65 ADIOS non-XML APIs • Limitation of the XML approach • Must have all variables defined in advance • Approach of non-XML APIs • Similar operations to what happens internally in adios_init • Define variables and attributes before opening a file for writing • The writing steps are the same as XML APIs • open file • write variables • close file

[email protected] 66 Non-XML API functions • Initialization • init adios, allocate buffer, declare groups and select write methods for each group. adios_init_noxml (); adios_allocate_buffer (ADIOS_BUFFER_ALLOC_NOW, 10); • when and how much buffer to allocate (in MB) adios_declare_group (&group, "restart", "iter", adios_flag_yes); • group with name and optional timestep indicator (iter) and whether statistics should be generated and stored adios_select_method (group, "MPI", "", ""); • with optional parameter list, and base path string for output files

[email protected] 67 Non-XML API functions • Definition int64_t adios_define_var (group, “name”, “path”, type, “local_dims”, “global_dims”, “offsets”) • Similar to how we define a variable in the XML file • returns a handle to the specific definition • Dimensions/offsets can be defined with • scalars (as in the XML version) • id = adios_define_var (g, “xy”, “”, adios_double, “nx_local, ny_local”, “nx_global, ny_global”, “offs_x,offs_y”) • need to define and write several scalars along with the array • actual numbers • id = adios_define_var (g, “xy”, “”, adios_double, “20,20”, “100,100”, “0,40”)
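A C sketch of the non-XML sequence end to end, reusing the 20x20 block of the 100x100 global array from the numbers above. The group/file names, comm and rank are illustrative assumptions:

int64_t  g, fh, varid;
uint64_t totalsize;
double   xy[20][20];                                 /* local block of the global array  */

adios_init_noxml (comm);
adios_allocate_buffer (ADIOS_BUFFER_ALLOC_NOW, 10);  /* 10 MB of buffer                  */
adios_declare_group (&g, "restart", "iter", adios_flag_yes);
adios_select_method (g, "MPI", "", "");
varid = adios_define_var (g, "xy", "", adios_double,
                          "20,20", "100,100", "0,40"); /* definition handle (unused here) */

adios_open (&fh, "restart", "data.bp", "w", comm);
adios_group_size (fh, sizeof(xy), &totalsize);
adios_write (fh, "xy", xy);                          /* writing is the same as with XML  */
adios_close (fh);
adios_finalize (rank);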

[email protected] 68 Multiple blocks per process • AMR codes and load balancing strategies may want to write multiple pieces of a global variable from a single process • ADIOS allows one to do that but • one has to write the scalars defining the local dimensions and offsets for each block, and • group size should be calculated accordingly • This works with the XML API, too, but because of the group size issue, pre- generated write code cannot be used (should do the adios_write() calls manually) • Array definition with the actual sizes as numbers saves from writing a lot of scalars (and writing source code)
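As a minimal sketch of the multi-block case (the names and the two-block split are illustrative): the offset and local-dimension scalars are simply written again before every block of the same variable within one open/close, and the group size must account for all blocks.

void write_two_blocks (int64_t fh, int ldx, int ldy, double *block0, double *block1)
{
    int     ox[2]  = {0, ldx};              /* example: two blocks side by side in x */
    int     oy[2]  = {0, 0};
    double *blk[2] = {block0, block1};
    int     b;

    for (b = 0; b < 2; b++) {
        adios_write (fh, "ox",  &ox[b]);    /* per-block offset scalars               */
        adios_write (fh, "oy",  &oy[b]);
        adios_write (fh, "ldx", &ldx);      /* per-block local dimensions             */
        adios_write (fh, "ldy", &ldy);
        adios_write (fh, "T",   blk[b]);    /* same global variable, written twice    */
    }
}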

[email protected] Goals of the ADIOS Read API design • Works well with files (scalable I/O) • Staging I/O • Insulate the scalable application from the variability inherent in the file system • Enable the utilization of in-situ and in-transit analytics and visualization • Same API for reading data from files and from staging • Allow for read optimizations: • Multiple read operations can be scheduled before performing them • Allow for blocking and non-blocking reads • Use generic selections in the read statements instead of describing a bounding box • Option to let ADIOS deliver data in chunks, with memory allocated inside ADIOS not in user-space

[email protected] Selections ADIOS_SELECTION * adios_selection_boundingbox (int ndim, uint64_t * offsets, uint64_t * readsize) adios_selection_points (uint64_t ndim, uint64_t npoints, uint64_t *points) adios_selection_writeblock (int index) adios_selection_auto (char * hints)

(Figure: a bounding-box selection of size nx x ny at an offset from the array origin (0,0), alongside writeblock and auto selections.)
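For illustration, a small C sketch creating one selection of each kind listed above; the sizes, point coordinates and writer index are made-up values:

uint64_t nx = 100, ny = 100;                 /* illustrative global sizes         */
uint64_t offs[2]  = {0, 0};
uint64_t count[2] = {nx, ny};
uint64_t pts[6]   = {0,0, 5,5, 9,9};         /* three 2D points, flattened        */

ADIOS_SELECTION *box    = adios_selection_boundingbox (2, offs, count);
ADIOS_SELECTION *points = adios_selection_points (2, 3, pts);
ADIOS_SELECTION *block  = adios_selection_writeblock (0);   /* block of writer 0  */
ADIOS_SELECTION *any    = adios_selection_auto ("");        /* let the method decide */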

[email protected] Read API basics • Common API for reading files and streams (with staging) • In staging, one must process data step-by-step • Files allow for accessing all steps at once • Schedule/perform reads in bulk, instead of single reads • Allows for optimizing multiple reads together • Selections • bounding boxes, list of points, selected blocks and automatic • Chunking (optional) • receive and process pieces of the requested data concurrently • staging delivers data from many producers to a reader over a certain amount of time, which can be used to process the first chunks

[email protected] Read API basics • Step • A dataset written within one adios_open/…/adios_close • Stream • A file containing a series of steps of the same dataset • Read API is designed to read data from one step at a time, then advance forward • alternative API allows for reading all steps at once from a file

[email protected] Read API
• Initialization/Finalization
  • One call per each read method used in an application
  • Staging methods usually connect to a staging server / other application at init, and disconnect at finalize.
int adios_read_init_method (enum ADIOS_READ_METHOD method, MPI_Comm comm, const char * parameters)
int adios_read_finalize_method (enum ADIOS_READ_METHOD method)
Fortran
use adios_read_mod
call adios_read_init_method (method, comm, parameters, err)
call adios_read_finalize_method (method, err)

[email protected] 74 Read API

• For files: seeing all timesteps at once ADIOS_FILE * adios_read_open_file ( const char * fname, enum ADIOS_READ_METHOD method, MPI_Comm comm)

• Close int adios_read_close (ADIOS_FILE *fp) Fortran use adios_read_mod call adios_read_open_file (fp, fname, method, comm, err) call adios_read_close (fp, err)
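Because adios_read_open_file() exposes all steps at once, several steps can be pulled in a single scheduled read. A C sketch of this (the variable name and sizes follow the heat_transfer example used later; comm is assumed to be the reader's communicator):

ADIOS_FILE *fp = adios_read_open_file ("heat.bp", ADIOS_READ_METHOD_BP, comm);

uint64_t offs[2]  = {0, 0};
uint64_t count[2] = {150, 160};              /* whole 2D array, in C ordering       */
ADIOS_SELECTION *sel = adios_selection_boundingbox (2, offs, count);

double *T = (double*) malloc (2 * count[0] * count[1] * sizeof(double));
adios_schedule_read (fp, sel, "T", 0, 2, T); /* steps 0 and 1 in one request        */
adios_perform_reads (fp, 1);                 /* 1: blocking                         */

adios_read_close (fp);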

[email protected] 75 Inquire about a variable (no extra I/O)

• ADIOS_VARINFO * adios_inq_var (ADIOS_FILE *fp, const char * varname)
• allocates memory for ADIOS_VARINFO, free resources later with adios_free_varinfo().
• ADIOS_VARINFO variables
  • int ndim Number of dimensions
  • uint64_t *dims Size of each dimension
  • int nsteps Number of steps of the variable in file. Streams: always 1
  • void *value Value (for scalar variables only)
  • int nblocks Number of blocks that comprise this variable in a step
  • void *gmin, *gmax, gavg, gstd_dev Statistical values of the global arrays
• int adios_inq_var_stat (ADIOS_FILE *fp, ADIOS_VARINFO * varinfo, int per_step_stat, int per_block_stat)
• int adios_inq_var_blockinfo (ADIOS_FILE *fp, ADIOS_VARINFO * varinfo)
  • Get writing layout of an array variable (bounding boxes of each writer)
Fortran
call adios_get_scalar (fp, varname, data, err)
call adios_inq_file (fp, vars_count, attrs_count, current_step, last_step, err)
call adios_inq_varnames (fp, vnamelist, err)
call adios_inq_var (fp, varname, vartype, nsteps, ndim, dims, err)
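A short C sketch of the inquiry path, used for sizing a read buffer before reading; the file and variable names come from the later heat_transfer example:

ADIOS_FILE    *fp = adios_read_open_file ("heat.bp", ADIOS_READ_METHOD_BP, comm);
ADIOS_VARINFO *vi = adios_inq_var (fp, "T");

printf ("T: %d dimensions, %d steps\n", vi->ndim, vi->nsteps);
for (int d = 0; d < vi->ndim; d++)
    printf ("  dim %d = %llu\n", d, (unsigned long long) vi->dims[d]);

adios_free_varinfo (vi);                     /* free the ADIOS_VARINFO resources */
adios_read_close (fp);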

[email protected] 76 Read an attribute (no extra I/O) • int adios_get_attr (ADIOS_FILE * fp, const char * attrname, enum ADIOS_DATATYPES * type, int * size, void ** data) • Attributes are stored in metadata, read in during open operation. • It allocates memory for the content, so you need to free it later with free().

Fortran call adios_inq_attr (fp, attrname, attrtype, attrsize, err)
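A C sketch of reading one attribute; the attribute name is hypothetical, the value is assumed to be a string, and fp is an already open ADIOS_FILE handle:

enum ADIOS_DATATYPES atype;
int   asize;
void *adata = NULL;

if (adios_get_attr (fp, "/info/description", &atype, &asize, &adata) == 0) {
    if (atype == adios_string)
        printf ("description = %s\n", (char *) adata);
    free (adata);                            /* the caller frees the returned buffer */
}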

[email protected] 77 Select a bounding box • In our example we need to define the selection area of what to read from an array.

(Figure: a bounding box of size Readsize(nx,ny) at offset (Offset(x), Offset(y)) within the global array, origin (0,0).)
• ADIOS_SELECTION * adios_selection_boundingbox (int ndim, uint64_t * offsets, uint64_t * readsize)

Fortran call adios_selection_boundingbox (integer*8 sel, -- return value integer ndim, integer*8 offsets(:), integer*8 readsize(:))

[email protected] Reading data
• Read is a scheduled request, …
int64_t adios_schedule_read (const ADIOS_FILE * fp, const ADIOS_SELECTION * selection, const char * varname, int from_steps, int nsteps, void * data)
  • in streaming mode, only one step is available
• …executed later with other requests together
int adios_perform_reads (const ADIOS_FILE *fp, int blocking)
Fortran
call adios_schedule_read (fp, selection, varname, from_steps, nsteps, data, err)
call adios_perform_reads (fp, err)

[email protected] Design choices for reading API • One output step at a time • One step is seen at once after writer completes a whole output step • streaming is not byte streaming here • reader has access to all data in one output step • as long as the reader does not release the step, it can read it • potentially blocking the writer • Advancing in the stream means • get access to another output step of the writer, • while losing the access to the current step forever.

[email protected] Recall read API
• Step
  • A dataset written within one adios_open/…/adios_close
• Stream
  • A file containing a series of steps of the same dataset
• Open as a stream or as a file
  • for step-by-step reading (both staged data and files)
ADIOS_FILE * adios_read_open (const char * fname, enum ADIOS_READ_METHOD method, MPI_Comm comm, enum ADIOS_LOCKMODE lock_mode, float timeout_sec)
• Close
int adios_read_close (ADIOS_FILE *fp)

[email protected] Locking options
• Locking modes: ALL, CURRENT, NONE
• ALL ("next" := current+1)
  • lock current and all future steps in staging
  • ensures that reader can read all data
  • reader's priority, it can block the writer
• CURRENT ("next" := next available)
  • lock the current step only
  • future steps can disappear if the writer pushes more newer steps and staging needs more space
  • writer's priority
  • reader must handle skipped steps
• NONE
  • no assumptions, anything can disappear between two read operations
  • be ready to process errors

[email protected] 82 Advancing a stream • One step is accessible in streams, advancing is only forward int adios_advance_step (ADIOS_FILE *fp, int last, float timeout_sec) • last: advance to “next” or to latest available • “next” or “next available” depends on the locking mode • locking = all: go to the next step, return error if that does not exist anymore • locking = current or none: give the next available step after the current one • timeout_sec: block for this long if no new steps are available • Release a step if not needed anymore • optimization to allow the staging method to deliver new steps if available int adios_release_step (ADIOS_FILE *fp)

[email protected] 83 Example of Read API: open a stream fp = adios_read_open ("myfile.bp", ADIOS_READ_METHOD_BP, comm, ADIOS_LOCKMODE_CURRENT, 60.0);

// error possibilities (check adios_errno) // err_file_not_found – stream not yet available // err_end_of_stream – stream has been gone before we tried to open // (fp == NULL) – some other error happened (print adios_errmsg())

// process steps here… ...

adios_read_close (fp);
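A small C sketch of acting on the error cases listed in the comments above; adios_errno and adios_errmsg() are the error reporting already referenced there:

if (fp == NULL) {
    if (adios_errno == err_file_not_found)
        fprintf (stderr, "stream not yet available\n");
    else if (adios_errno == err_end_of_stream)
        fprintf (stderr, "stream ended before it could be opened\n");
    else
        fprintf (stderr, "open failed: %s\n", adios_errmsg());
}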

[email protected] 84 Example of Read API: read a variable step-by-step

uint64_t offs[]  = {5,5,5};
uint64_t count[] = {10,10,10};
P = (double*) malloc (sizeof(double) * count[0] * count[1] * count[2]);
Q = (double*) malloc (sizeof(double) * count[0] * count[1] * count[2]);
ADIOS_SELECTION *sel = adios_selection_boundingbox (3, offs, count);
while (fp != NULL) {
    adios_schedule_read (fp, sel, "P", 0, 1, P);
    adios_schedule_read (fp, sel, "Q", 0, 1, Q);
    adios_perform_reads (fp, 1);          // 1: blocking read
    // P and Q contain the data at this point
    adios_release_step (fp);              // staging method can release this step
    // ... process P and Q, then advance the step
    adios_advance_step (fp, 0, 60.0);     // 60 sec blocking wait for the next available step
    if (adios_errno == err_end_of_stream)
        break;                            // no more steps in the stream
}
// free ADIOS resources
adios_free_selection (sel);

[email protected] Queries

[email protected] 86 Scenarios for scientific variables • Stored as separate arrays in an ADIOS file double T 200*{10000, 9000} double P 200*{10000, 9000} double H 200*{10000, 9000}

• Stored as columns of a table in an ADIOS file • e.g. particle data is usually in this form (fusion, material science) • QMCPack walkers double traces {21571123, 4168} • Evaluate conditions on some columns and • 1. get values of another column • 2. get the whole matching rows

• See C examples in ADIOS source • examples/C/query/ [email protected] 87 Basic scenario • Have 3 variables (T,P,H), same global array size • Get the values of T where P > 80.0 and H <= 50.0 We need a single bounding box selection everywhere

ADIOS_SELECTION* box = adios_selection_boundingbox(...);
ADIOS_QUERY* q1 = adios_query_create(f, box, "P", ADIOS_GT, "80.0");
ADIOS_QUERY* q2 = adios_query_create(f, box, "H", ADIOS_LTEQ, "50.0");
ADIOS_QUERY* q  = adios_query_combine(q1, ADIOS_QUERY_OP_AND, q2);
ADIOS_SELECTION *hits;
adios_query_evaluate(q, box, 0, batchSize, &hits);
...
adios_schedule_read (f, hits, "T", 1, 1, data);
adios_perform_reads (f, 1);

[email protected] 88 ADIOS Query API: Evaluation explained • ADIOS_SELECTION* outputBoundary • Query evaluation is relaxed • Evaluate the conditions on selections with the same shape • Typical corner case: evaluate the conditions on the same area of each variable

[email protected] 89 Python/Numpy Wrapper • Two different modules for serial and parallel version • Serial version: Use “import adios”, MPI version: Use “import adios_mpi” • Dependencies • Numpy: Data to write and read will be represented by using Numpy array • MPI4Py: Need only for MPI version, adios_mpi. Most MPI Comm related parameters are set with default values and can be omitted. • Consist of the following components • A set of write APIs. Both XML and No-XML APIs and Read APIs: read_init and read_finalize • Two read related classes, file and variable • Enumeration classes to represent constant parameters: DATATYPE, FLAGS, BUFFER_ALLOC_WHEN • A set of utility functions: • np2adiostype (NUMPY_TYPE): returns Adios DATATYPE • readvar (FILENAME, VARNAME): simple variable read function • bpls (FILENAME): show BP file contents • Examples: • See example codes (bpls.py and ncdf2bp.py) in wrapper/numpy/example/utils, Tests in wrapper/numpy/tests

[email protected] Python/Numpy Read Class
• File class
  • Constructor
    • file (PATH, [COMM], [is_stream], [lock_mode], [timeout_sec])
  • Members
    • var: Python dict object contains {variable name: variable descriptor} pairs
    • Other variables for num variables, current steps, last steps, version, file sizes, etc.
  • Methods
    • close () : close file
    • printself () : print contents for debugging purpose
    • advance (LAST, TIMEOUT_SEC): advance steps for streaming
• Variable class
  • Constructor
    • var (FILECLASS, VARNAME)
  • Members
    • Variable related parameters: name, varid, type, ndim, dims, nsteps
  • Methods
    • read ([OFFSET], [COUNT], [FROM_STEPS=0], [NSTEPS=1]): read as numpy array
    • close () : close and free variable class
    • printself () : print contents for debugging purpose

[email protected] 91 Typical BP Read by Python/Numpy Wrapper

import adios as ad ## import python modules import numpy as np

f = ad.file("heat.bp") ## Call file class constructor v = f.var['T'] ## Get variable val = v.read() ## Read as Numpy array … do computation … f.close() ## Close file

import adios_mpi as ad ## import MPI modules import numpy as np from mpi4py import MPI

comm = MPI.COMM_WORLD

f = ad.file("heat.bp", comm) ## Add comm info …

[email protected] 92 Python/Numpy Write API functions • Init init (PATH, [COMM]) • PATH: configuration file • COMM: MPI communicator (default=MPI.COMM_WORLD). Can be omitted for serial version • Open f = open (GROUPNAME, FILENAME, MODE, [COMM]) • Return file descriptor • Write write (FILEP, VARNAME, VAL) • FILEP: file descriptor. Return value of open() • VAL: numpy array type. Dimension and data type are automatically detected. • Close close (FILEP) • FILEP: file pointer. Return value of open()

[email protected] 93 Python/Numpy Write No-XML API functions • Init No-XML Init_noxml ( [COMM]) • COMM: MPI communicator (default=MPI.COMM_WORLD). Can be omitted for serial version • Allocate_buffer allocate_buffer (WHEN, BUFFER_SIZE) • WHEN: Defined as an enum class, BUFFER_ALLOC_WHEN. Use one of BUFFER_ALLOC_WHEN.[UNKNOWN, NOW, LATER] values. • Declare Group g = declare_group (GROUPNAME, [TIMEINDEX]) • Return group descriptor • Declare Var define_var(GROUPID, VARNAME, TYPE, [DIM], [GLOBALDIM], [OFFSET]) • GROUPID: group descriptor. Return value of declare_group() • TYPE: Defined as an enum class, DATATYPE • Select Method select_method(GROUPID, METHOD, [PARAMETERS], [BASEPATH]) • GROUPID: group descriptor. Return value of declare_group() • Example: wrapper/numpy/example/utils/ncdf2bp.py

[email protected] Typical BP Write by Python/Numpy Wrapper
import adios_mpi as ad
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

NX = 10
t = np.array (range(NX), dtype=np.float64) + rank*NX

ad.init("config_mpi.xml", comm)
fd = ad.open("temperature", "adios_test_mpi.bp", "w", comm)
ad.write_int (fd, "NX", NX)
ad.write_int (fd, "rank", rank)
ad.write_int (fd, "size", size)
ad.write (fd, "temperature", t)
ad.close (fd)
ad.finalize()

See ADIOS source wrappers/numpy/tests/test_adios_mpi.py or the sequential version: test_adios.py

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 30 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] 96 Write Example • In this example we start with a 2D code which writes data of a 2D array, with a 2D domain decomposition, as shown in the figure. • Heat transfer example with heating the edges • We write out 5 time-steps, into a single file. • For simplicity, we work on only 12 cores, arranged in a 4 x 3 arrangement.

• Each processor works on 40x50 subsets (T and dT).
• The total size of the output array = 4*40 x 3*50.
(Figure: the 4 x 3 process grid, P0..P3 on the bottom row through P8..P11 on the top, with x horizontal and y vertical.)
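A minimal C sketch of the decomposition arithmetic, assuming ranks are numbered row by row across the 4 x 3 grid; the offsets computed here are the offx/offy values written alongside T in the next slides:

int npx = 4,  npy = 3;          /* process grid                     */
int ndx = 40, ndy = 50;         /* local array size per process     */
int posx = rank % npx;          /* column of this rank in the grid  */
int posy = rank / npx;          /* row of this rank                 */
int offx = posx * ndx;          /* offset of this block in x        */
int offy = posy * ndy;          /* offset of this block in y        */
int gndx = npx * ndx;           /* 160                              */
int gndy = npy * ndy;           /* 150                              */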

[email protected] 97 The ADIOS XML configuration file • Describe each IO grouping. • Maps a variable in the code, to a variable in a file. • Map an IO grouping to transport method(s). • Define buffering allowance • “XML-free” API are also provided

[email protected] 98 XML Overview • heat_transfer.xml describes the output variables in the code gndx ndx

gndy (0,0)

[email protected] 99 XML overview (global array) • We want to read in T from an arbitrary number of processors, so we need to write this as a global array. • Need 2 more variables, to define the offset in the global domain • • Need to define the T variable as a global array • Place this around the lines defining T in the XML file. •

[email protected] 100 XML Overview • Need to define the method, we will use MPI. • • Need to define the buffer • • Can use any size, but if the buffer > amount to write, the I/O to disk will be faster. • Need to define the host language (C or Fortran ordering of arrays). • • Set the XML version • • And end the configuration file •

[email protected] 101 The final XML file


[email protected] 102 gpp.py • Converts the XML file into F90 (or C) code. • > gpp.py writer.xml • > cat gwrite_writer.fh call adios_write (adios_handle, "gndx", gndx, adios_err) call adios_write (adios_handle, "gndy", gndy, adios_err) call adios_write (adios_handle, "offx", offx, adios_err) call adios_write (adios_handle, "offy", offy, adios_err) call adios_write (adios_handle, "ndx", ndx, adios_err) call adios_write (adios_handle, "ndy", ndy, adios_err) call adios_write (adios_handle, "T", T(1:ndx,1:ndy,curr), adios_err)

[email protected] 103 Writing with ADIOS I/O (simplest form) call adios_init ("heat_transfer.xml", comm, adios_err) … call adios_open (adios_handle, "heat", trim(filename), "w", comm, adios_err) #include "gwrite_writer.fh" call adios_close (adios_handle, adios_err) … call adios_finalize (rank, adios_err)

Source file extension should be .F90 (instead of .f90) to enforce macro preprocessing at compile time [email protected] 104 Compile ADIOS codes • Makefile • use adios_config tool to get compile and link options ADIOS_DIR = /opt/adios/1.10.0 ADIOS_INC = $(shell ${ADIOS_DIR}/bin/adios_config -c -f) ADIOS_FLIB = $(shell ${ADIOS_DIR}/bin/adios_config -l -f) ADIOSREAD_FLIB := $(shell ${ADIOS_DIR}/bin/adios_config -l -f –r)

• Codes that write and read
heat_transfer_adios: gwrite_heat.fh heat_vars.F90 heat_transfer.F90 io_adios.F90
	${FC} -g -c -o heat_vars.o ${ADIOS_INC} heat_vars.F90
	${FC} -g -c -o heat_transfer.o ${ADIOS_INC} heat_transfer.F90
	${FC} -g -c -o io_adios.o ${ADIOS_INC} io_adios.F90
	${FC} -g -o heat_transfer_adios heat_vars.o heat_transfer.o io_adios.o ${ADIOS_FLIB}

gwrite_heat.fh: heat_transfer.xml
	${ADIOS_DIR}/bin/gpp.py heat_transfer.xml

[email protected] 105 Compile and run the code VM $ cd ~/Tutorial/heat_transfer $ make adios1 $ mpirun -np 12 ./heat_transfer_adios1 heat 4 3 40 50 6 500 Process number : 4 x 3 Array size per process at first step: 40 x 50 Step 1: Step 2: Step 3: Step 4: Step 5: Step 6:

$ du -hs *.bp 2.4M heat.bp

[email protected] 106 ADIOS Componentization • ADIOS can allow many different I/O methods • POSIX • MPI • MPI_AGGREGATE: needs num_ost, and num_aggregators num_aggregators=4;num_ost=2 • PHDF5: • Limited functionality, but will be improved in future releases. • Must remove attributes and PATHS for this to work • NC4: same expectations as PHDF5 (NOT INSTALLED IN VM) • Rule of thumb: • Try Posix, then move to MPI, then MPI_AGGREGATE [email protected] 0 7 bpls

$ bpls -l heat.bp

integer gndx 6*scalar = 160 / 160 / 160 / 0 integer gndy 6*scalar = 150 / 150 / 150 / 0 integer /info/nproc 6*scalar = 12 / 12 / 12 / 0 integer /info/npx 6*scalar = 4 / 4 / 4 / 0 integer /info/npy 6*scalar = 3 / 3 / 3 / 0 integer offx 6*scalar = 0 / 120 / 60 / 44.7214 integer offy 6*scalar = 0 / 100 / 50 / 40.8248 integer ndx 6*scalar = 40 / 40 / 40 / 0 integer ndy 6*scalar = 50 / 50 / 50 / 0 integer step 6*scalar = 1 / 6 / 3.5 / 1.70783 integer iterations 6*scalar = 500 / 500 / 500 / 0 double T 6*{150, 160} = 0.000/999.47/442.536/318.995 double dT 6*{150, 160} = 0.000/0.720/0.135/0.115 y x • bpls is a C program • dimensions are reported in C order

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 30 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] 0 9 bpls

$ bpls -l heat.bp

integer gndx 6*scalar = 160 / 160 / 160 / 0 integer gndy 6*scalar = 150 / 150 / 150 / 0 integer /info/nproc 6*scalar = 12 / 12 / 12 / 0 integer /info/npx 6*scalar = 4 / 4 / 4 / 0 integer /info/npy 6*scalar = 3 / 3 / 3 / 0 integer offx 6*scalar = 0 / 120 / 60 / 44.7214 integer offy 6*scalar = 0 / 100 / 50 / 40.8248 integer ndx 6*scalar = 40 / 40 / 40 / 0 integer ndy 6*scalar = 50 / 50 / 50 / 0 integer step 6*scalar = 1 / 6 / 3.5 / 1.70783 integer iterations 6*scalar = 500 / 500 / 500 / 0 double T 6*{150, 160} = 0.000/999.47/442.536/318.995 double dT 6*{150, 160} = 0.000/0.720/0.135/0.115 y x • bpls is a C program • dimensions are reported in C order

[email protected] 1 0 bpls (to show the mapping)

$ bpls -D heat.bp T double T 6*{150, 160} step 0: block 0: [ 0: 49, 0: 39] block 1: [ 0: 49, 40: 79] block 2: [ 0: 49, 80:119] block 3: [ 0: 49, 120:159] block 4: [ 50: 99, 0: 39] block 5: [ 50: 99, 40: 79] block 6: [ 50: 99, 80:119] block 7: [ 50: 99, 120:159] block 8: [100:149, 0: 39] block 9: [100:149, 40: 79] block 10: [100:149, 80:119] block 11: [100:149, 120:159] step 1: ...

[email protected] 1 1 bpls to dump: 2x2 read with bpls • Use bpls to read in a 2D slice of the first output step

$ bpls heat.bp -d T -s "0,49,39" -c "1,2,2" -n 2 double T 6*{150, 160} slice (0:0, 49:50, 39:40)

(0,49,39) 5.0916 4.15414 P8 P9 P10 P11 (0,50,39) 4.99562 4.05808 • Note: bpls handles time as an extra dimension y P4 P5 P6 P7 • -s starting offset • first offset is the timestep • -c size in each dimension P0 P1 P2 P3 • first value is how many steps x • -n how many values to print in one line [email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] Visualization Schema • Why do we need a schema? • Bridge the gap between application and viz. • Keep it consistent • Current status: application/data format specific reader in visualization tools

App Viz. tool Display BP file App. specific coding

App Viz. tool Display BP file Schema Read API [email protected] heat transfer has simple uniform mesh • Uniform mesh: rectangular space with equidistant stepping • Described by a few scalar values • size in each dimension • stepping in each dimension • origin (point in N-dimensional space) • ADIOS XML

A uniform mesh ...

[email protected] Compile and run the code $ cd ~/Tutorial/heat_transfer $ diff heat_transfer.xml schema/heat_transfer.xml study the difference and add the schema definition to heat_transfer_xml $ mpirun -np 12 ./heat_transfer_adios2 heat 4 3 40 50 6 500 $ bpls -m heat.bp Mesh info: uniformmesh type: uniform dimensions: {150, 160} origins: {-100, 100} spacings: {2, 2} maximums: {198, 418} time varying: no ... [email protected] heat transfer and VisIt • Without schema and with schema (origin, spacing different)

[email protected] Further schema examples • For more examples, look • ~/Tutorial/schema • uniform, rectilinear, structured and unstructured (triangular) mesh • ADIOS source: • examples/C/schema • examples/Fortran/schema • an example where the mesh is stored separately from data • examples/Fortran/schema/tri2d_noxml_seperate.F90

[email protected] Supported meshes • uniform • rectilinear • structured • unstructured with cell types • 2D point, line, triangle, quad • 3D hex, tetrahedron, prism and pyramid

• Note: VisIt supports ADIOS with schema

[email protected] Schema: unstructured example

[email protected] Schema is embedded in dataset

$ bpls -lm tri2d.bp integer nproc scalar = 12 Mesh info: integer npoints scalar = 144 trimesh integer num_cells scalar = 240 type: unstructured integer nx_global scalar = 16 npoints: 144 integer ny_global scalar = 9 points: single-var: integer offs_x scalar = 0 "points" integer offs_y scalar = 0 ncsets: 1 integer nx_local scalar = 4 cell set 0: integer ny_local scalar = 3 cell type: 3 integer lp scalar = 12 ncells: 240 integer op scalar = 0 cells var: "cells" integer lc scalar = 24 nspaces: 2 integer oc scalar = 0 time varying: no double N {144} = 0 / 11 double C {240} = 0 / 11 double points {144, 2} = 0 / 25.6667 integer cells {240, 3} = 0 / 143

[email protected] Running the schema examples • Uniform • mpirun -np 12 ./uniform2d 4 3 • Rectilinear • mpirun -np 12 ./rectilinear2d 4 3 • Structured • mpirun -np 12 ./structured2d 4 3 • Unstructured • mpirun -np 12 ./tri2d 4 3

[email protected] VisIt + ADIOS Schema: unstructured triangular mesh

# let’s visualize this in VisIt $ visit -o tri2d.bp

In main window, in Plots, Click on Add button, Select Mesh menu Select the item trimesh

Then click on Draw button [email protected] VisIt + ADIOS Schema: unstructured mesh

In main window, in Plots, Click on Add button, Select Pseudocolor menu Select the variable C Then click on Draw button [email protected] Schema Read API

• List of meshes

ADIOS_FILE * adios_open(…) returns the list of meshes

• Get the mesh for a given variable int adios_inq_var_meshinfo (ADIOS_FILE *fp, ADIOS_VARINFO * varinfo) • Inquires about a mesh. This function allocates memory for the ADIOS_MESH struct and content ADIOS_MESH * adios_inq_mesh_byid (ADIOS_FILE *fp, int meshid) • Free memory used by an ADIOS_MESH struct void adios_free_meshinfo (ADIOS_MESH *meshinfo)
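A minimal C sketch of how these calls could be combined for the tri2d example. It uses only the functions listed above plus adios_inq_var/adios_free_varinfo from the reading API; the ADIOS_FILE field names for the mesh list (nmeshes, mesh_namelist) are assumptions, not taken from the tutorial.

ADIOS_FILE *fp = adios_read_open_file ("tri2d.bp", ADIOS_READ_METHOD_BP, comm);
/* list the meshes found in the file (field names assumed) */
for (int i = 0; i < fp->nmeshes; i++) {
    ADIOS_MESH *m = adios_inq_mesh_byid (fp, i);   /* allocates the ADIOS_MESH struct */
    printf ("mesh %d: %s\n", i, fp->mesh_namelist[i]);
    adios_free_meshinfo (m);                       /* free it when done */
}
/* attach the mesh information to a variable defined on the mesh */
ADIOS_VARINFO *vi = adios_inq_var (fp, "C");
adios_inq_var_meshinfo (fp, vi);
adios_free_varinfo (vi);
adios_read_close (fp);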

[email protected] Schema Read API Related Structures

adios_inq_mesh_byid ()

adios_open() adios_inq_var_meshinfo () [email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] plotter: our tool for quick 1D/2D plots • reads ADIOS BP/NetCDF/HDF5 arrays • any slice from any multi-dimensional array • Use xmgrace to make X-Y plots • Use VTK for 2D graphics • We use this tool to generate lots of plots in workflows or manually to check output GTC S3D GEM

Chimera

XGC

[email protected] Plotter for 1D Options • -f file • -v variable (multiple allowed, regexp allowed), • -o output filename w/o extension (will have .png, .jpg or .txt) • -multiplot • plot multiple variables (with same size) on one plot otherwise it generates one image per variable • -multiplot-time • plot all timesteps of a variable in a single plot • It can iterate over a dimension as time, and plot data in another dimension [email protected] Plotter for 1D Subselection of data is similar to bpls • -start | -s "spec" • starting indexes for each dimension (default "0") • 'tN' in one dimension = plotter loops over that dimension and produces as many images as the count for that dimension is given. 'tN:S' : loop from N with a stepping of S as many times as specified in -count • -count | -c "spec" • counting for each dimension (default "-1") • -1 denotes 'until end' of array • Examples • -s "0,0" -c "1,99": read 100 elements (of the 2nd dimension) to plot. • -s "0,0" -c ”1,-1": read the whole 2nd dimension to plot. • -s "t5,0" -c ”10,-1": read all elements in the 2nd dimension to plot, 10 times for the first dimension indices 5..14 and produce 10 plots.

[email protected] Tutorial/02_noxml multi-block example (C)

$ bpls -l adios_global_no_xml.bp -D blocks • adios_global_no_xml.c double blocks {1200} = 1 / 1001 / • 1D array output: temperature step 0: block 0: [ 0: 99] = 1 / 1/ 1/ 0 • each process writes 3 blocks, 100 block 1: [ 100: 199] = 51 / 51/ 51/ 0 block 2: [ 200: 299] = 101 / 101/ 101/ 0 elements each block 3: [ 300: 399] = 301 / 301/ 301/ 0 block 4: [ 400: 499] = 351 / 351/ 351/ 0 • rank 1 writes its three blocks after block 5: [ 500: 599] = 401 / 401/ 401/ 0 rank 0’s three blocks, and so on block 6: [ 600: 699] = 601 / 601/ 601/ 0 block 7: [ 700: 799] = 651 / 651/ 651/ 0 • mpirun -np 4 block 8: [ 800: 899] = 701 / 701/ 701/ 0 block 9: [ 900: 999] = 901 / 901/ 901/ 0 ./adios_global_no_xml block 10: [1000:1099] = 951 / 951/ 951/ 0 block 11: [1100:1199] = 1001 / 1001/ 1001/ 0 $ bpls -l adios_global_no_xml.bp integer NX scalar = 100 integer Global_bounds scalar = 1200 integer Offsets scalar = 0 double temperature {1200} = 0 / 1199 / 599.5 / 346.41 double blocks {1200} = 1 / 1001 / 501 / 337.886 130 [email protected] We forgot to include xmgrace on the vm… • sudo apt install grace

[email protected] Plotter for 1D
plotter -f adios_global_no_xml.bp -v temperature -v blocks -multiplot -title="Multiplot" -o temperature -imgsize 720 600

temperature.png

[email protected] Plotter for 1D -xvar optional name of variable to use for X axis values -xfile if xvar is in another file

-v ion__density

-xvar psi

[email protected] Plotter for 2D plotter2d • -f file • -v variable (multiple allowed, regexp allowed), • -o output filename w/o extension (will have .png, .jpg or .txt) • It can iterate over a dimension as time, and plot data in another dimensions like in 1D • -start and -count like in 1D • -xvar, -yvar • variables to use for X and Y axis values • or as mesh variables

[email protected] Plotter for 2D plotter2d S3D • --colormap • select a colormap instead of the default blue-red • RedBlue, BlueRed, Gray, XGC, XGCLog, HotDesaturated

• --contour[= #_of_contours] HotDesaturated • make a contour plot from the 2D array colormap GEM

Contour plot

[email protected] Plotter for 2D plotter2d meshes • by default a 2D array is plotted as it had a uniform mesh • --triangle-mesh • plot an unstructured triangle mesh • -v is the values array (1D array of N values) • -xvar is the vertices list (Nx2 or Nx3 array), • -yvar is the triangle list (Mx3 array) Triangle mesh • --rectilinear • plot an array in a rectilinear grid • -v is the values array (NxM array) • -xvar is the x coords (N array), Rectilinear mesh • -yvar is the y coords (M array) • --polar[=[x|y]] • use x and y as radius/angle for a polar plot Polar plot

[email protected] Tutorial/schema Rectilinear example • $ mpirun -np 12 ./rectilinear2d 4 3

$ bpls rectilinear2d.bp X Y data double X {260} double Y {387} double data {260, 387}

• $ plotter2d -f rectilinear2d.bp -v data -x X -y Y -rectilinear -o r • $ gpicview

[email protected] Plotter for 2D

plotter2d -f xgc.fieldp.0250.bp -v "/node_data\[15\]/values" \
    -triangle-mesh -xfile xgc.mesh.bp -xvar "/coordinates/values" -yvar "/cell_set[0]/node_connect_list" \
    -min -150.0 -max 150.0 -colormap XGCLog \
    -o potential

[email protected] Building plotter • Dependencies: • https://github.com/pnorbert/adiosvm/tree/master/plotterpackages • xmgrace • libmotif-common libmotif-dev • vtk-5 • Mesa • libglu1-mesa-dev libxt-dev • Instructions in README.txt, section V. Build Plotter • https://github.com/pnorbert/adiosvm/blob/master/README.txt

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] ADIOS Transforms

• ADIOS allows users to transparently apply transformations to data, using code that looks like it is still using the original, untransformed data
• Transformations can be swapped in/out at runtime (vs. compile time)
• Plugin based, enabling easy expansion
• Focus on compression today
(Figure: the ADIOS transform plugin layer sits between the user application's variables and the I/O transport layer; regular and transformed variables end up in a BP file, staging area, etc.)
[email protected] Installing Transformations • Most of the transformations require external software • Use the --with-whatever configure options, pointing to where the include/ and lib/ directories for the software live

$ cd ~/Software/ADIOS && ./configure --help ... --with-zlib=DIR Location of ZLIB library --with-bzip2=DIR Location of BZIP2 library --with-szip=DIR Location of SZIP library --with-isobar=DIR Location of ISOBAR library ...

[email protected] Checking what is installed $ adios_config -m ... Available data transformation methods (in XML transform tags in elements): "none" : No data transform "identity" : Identity transform "zlib" : zlib compression "bzip2" : bzip2 compression "szip" : szip compression "isobar" : ISOBAR compression "alacrity" : ALACRITY indexing "zfp" : zfp compression ...

“none” and “identity” for placeholding/testing • None: does not run the transformation methods in ADIOS • Identity: Runs the transformation methods, but transformed data is same as original [email protected] Examples… • Edit the xml file to try all of the transforms for the following run • mpirun -np 12 ./heat_transfer_adios1 data/heat_t_none 4 3 1024 512 10 1 • none 961M • zlib:9 824M • bzip2:9 835M • szip:9 698M • isobar:9 716M • zfp:accuracy=.001 49M • zfp:accuracy=.0000001 152M • zfp:accuracy=.0000000000001 360M

[email protected] Selecting Transforms (no XML)
• There is a set_transform method, which takes the variable id and a settings string as arguments
• More on the no-XML API tomorrow

MPI_Comm comm = MPI_COMM_WORLD;
int64_t groupid, varid, fh;
char outfile[64]   = "demo-example.bp";
char groupname[64] = "demo";
char vname[64]     = "test";
double data[100];

/* Compute data */

adios_init_noxml (comm);
adios_declare_group (&groupid, groupname, "", adios_flag_yes);
adios_select_method (groupid, "MPI", "", "");
varid = adios_define_var (groupid, vname, "", adios_double, "100", "", "");
adios_set_transform (varid, "bzip2:5");
adios_open (&fh, groupname, outfile, "w", comm);
adios_write (fh, vname, data);
adios_close (fh);

[email protected] Selecting Transforms (XML)
• Only need to add a transform="method:setting" attribute within the variable's <var> tag
$ cd ~/Tutorial/heat_transfer
heat_transfer.xml:

<!-- Characteristic examples:
<var name="T" type="real*8" dimensions="ndx,ndy" transform="none"/>
<var name="T" type="real*8" dimensions="ndx,ndy" transform="bzip2:5"/>
<var name="T" type="real*8" dimensions="ndx,ndy" transform="zfp:accuracy=0.001"/>
-->
...

[email protected] Run the heat transfer example with/without compression • Edit ~/Tutorial/heat_transfer/heat_transfer.xml

• Run 1: transform="none"
# edit xml file
$ cd ~/Tutorial/heat_transfer
$ make adios1
$ mpirun -np 12 ./heat_transfer_adios1 heat-none 4 3 128 128 30 40
$ du -h heat-none.bp
91M heat-none.bp

• Run 2: transform="bzip2:5"
# edit xml file
$ cd ~/Tutorial/heat_transfer
$ mpirun -np 12 ./heat_transfer_adios1 heat-bzip2 4 3 128 128 30 40
$ du -h heat-bzip2.bp
82M heat-bzip2.bp

[email protected] Simple visualization • Edit ~/Tutorial/heat_transfer/plot_heat.sh to plot T or dT
Plot T:
$ ./plot_heat.sh heat-none.bp
$ mkdir images_none
$ mv T.00* images_none/
$ gpicview images_none/T.00*
Plot dT:
$ ./plot_heat.sh heat-none.bp
$ mkdir images_none
$ mv dT.00* images_none/
$ gpicview images_none/dT.00*

(Showing time step 0)

bzip2 version is identical • Check for yourself

[email protected] Now try ZFP compression
• Edit ~/Tutorial/heat_transfer/heat_transfer.xml and set transform="zfp:accuracy=0.1"
$ mpirun -np 12 ./heat_transfer_adios1 heat-zfp 4 3 128 128 30 40
$ ./plot_heat.sh
$ gpicview dT.0000.png
• No longer identical, because ZFP is lossy (compare the ZFP image with the "none" image)

(dT time step 1)

[email protected] More differences in the data directory on my machine

[email protected] Compressors Overview
• Included in the VM: bzip2, zlib, szip, ISOBAR (lossless) and ZFP (lossy)
• The lossless compressors all take an integer from 1-9 as an argument: 1 = fastest, lowest compression; 9 = slowest, highest compression; 5 = default
  • e.g. "szip:5", "szip", "isobar:2"
• Make sure to benchmark: reduced size comes with extra processing time
• Coming soon: SZ
• Add your own

[email protected] bzip2
• Compresses files using the Burrows-Wheeler block-sorting text compression algorithm and Huffman coding
• In simpler language: rearranges bytes into an order that is easier to compress, based on repetitions
Wikipedia's pseudocode:
function BWT (string s)
    create a table, rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)
zlib
• Abstraction of the DEFLATE algorithm (used by gzip, ZIP)
• Usage of both is like gzip
[email protected] szip
• Part of the HDF software products
• Implementation of the extended-Rice lossless compression algorithm
• Highly used by the NASA Earth Observing System (EOS)
• Compression ratios up to ~3 (good for lossless; combination of environmental/climate data)
Pen-Shu Y et al. "Implementation of CCSDS Lossless Data Compression in HDF," Earth Science Technology Conference 2002, 11-13 June 2002, Pasadena, California.

[email protected] ISOBAR (In-Situ Orthogonal Byte Aggregate Reduction Compression) • First preprocess: analyze the data and try to rearrange it to make more compressible (time and size) • Select best (lossless) compressor, from some set

Compression ratio improvement

E. R. Schendel et al., "ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression," 2012 IEEE 28th International Conference on Data Engineering, Washington, DC, 2012, pp. 138-149.

[email protected] ZFP: algorithm (2D data) • Works on blocks of 4 elements in each dimension • A family of spatially-decorrelating transforms (e.g. the discrete cosine transform) can be written as one parameterized transform; ZFP chooses the member of that family that can be implemented in a highly optimized fashion • Smoothness assumption: zig-zag ordering "sorts" the coefficients • Coefficients are arranged by bit plane; how many bits or bit planes to keep is an adjustable parameter • Fast! Almost correct summary: a fast, piece-wise FFT, with tunable precision for the Fourier coefficients

[email protected] ZFP: ADIOS settings (the error norm is bounded within each 4^d block) Three operation modes: • Accuracy (float): absolute l2-norm bound in each 4^d block • Precision (int): number of bit planes to output, i.e. 2^-p relative precision • Rate (float): bits per scalar to keep
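For reference, a sketch of how the three modes could be requested through the transform attribute. Only the accuracy spelling appears elsewhere in this tutorial; the precision and rate spellings below are assumptions.

<var name="T" type="real*8" dimensions="ndx,ndy" transform="zfp:accuracy=0.001"/>
<var name="T" type="real*8" dimensions="ndx,ndy" transform="zfp:precision=16"/>  <!-- assumed spelling -->
<var name="T" type="real*8" dimensions="ndx,ndy" transform="zfp:rate=8"/>        <!-- assumed spelling -->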

[email protected] SZ: Curve Fitting++ Compression

• Data is replaced with the lowest-error cubic, linear, or constant fit over the previous 3, 2, or 1 points in the data array, if the model is within an adjustable l∞-norm error • Otherwise: save the original value using fewer bits • Very fast decompression • Coming to ADIOS 1.11

[email protected] Adding New Transformations • Transformation layer expects a certain set of methods to be implemented for reading and writing the transformed data • Template files exist to start from, and easy to extrapolate from transformations already in places (…says a scientist new to ADIOS) • Basic idea: replace “_template_” with “_your-method-name_” and update the methods • See the transform plugins on Github • The ADIOS Developer’s manual is also useful to refer to

[email protected] Basic Write Paradigm (adios_transform_template_write.c)
(Screenshot of the template source, with callouts:)
• How you access the raw untransformed data
• Shared buffer is the usual case; the write happens on close()
• Here is where you call your do_everything_write() method (zfp shown as an example)

[email protected] Basic Read Paradigm (adios_transform_template_read.c)
(Screenshot of the template source, with callouts:)
• This is what happens when one MPI processor's block has been fully read. (Simplest implementations work in terms of these full blocks, but partial block management is possible too)
• Here is where you call your do_everything_read() method (zfp shown as an example)

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] 162 Goals of this talk and hands-on • Understand: • Design of the Read API • Staging • How can ADIOS help with data intensive processing • How to program to read with ADIOS • How to use advanced data staging • Please Ask questions

[email protected] Vision: building scientific collaborative applications

[email protected] Workflow building • When writing codes to be used in a workflow, the order of action is • determine the placement first • determine the connection between two tasks • WAN/Cloud, LAN, HW-specific communication layer, shared memory, inline scheduling • write code that implements the communication specific to the actual placement • Goal here is to switch this order • implement task then do placement dynamically • Well, many workflow systems do this • but using files as common interface for data transfer

[email protected] Goals of the ADIOS Read API design • Works well with files (scalable I/O) • Staging I/O • Insulate the scalable application from the variability inherent in the file system • Enable the utilization of in-situ and in-transit analytics and visualization • Same API for reading data from files and from staging • Allow for read optimizations: • Multiple read operations can be scheduled before performing them • Allow for blocking and non-blocking reads • Use generic selections in the read statements instead of describing a bounding box • Option to let ADIOS deliver data in chunks, with memory allocated inside ADIOS not in user-space

[email protected] Selections ADIOS_SELECTION * adios_selection_boundingbox (int ndim, uint64_t * offsets, uint64_t * readsize) adios_selection_points (uint64_t ndim, uint64_t npoints, uint64_t *points) adios_selection_writeblock (int index) adios_selection_auto (char * hints)
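A small sketch showing how the two most common selection types from this list are created and released; the offsets, sizes and the rank variable are only illustrative.

uint64_t offsets[2]  = {0, 0};     /* start of the subarray to read */
uint64_t readsize[2] = {10, 20};   /* number of elements per dimension */
ADIOS_SELECTION *box = adios_selection_boundingbox (2, offsets, readsize);
ADIOS_SELECTION *blk = adios_selection_writeblock (rank);  /* the block written by one writer */
/* ... pass the selections to adios_schedule_read() calls ... */
adios_selection_delete (box);
adios_selection_delete (blk);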

(Figure: a bounding box selection described by its offsets and nx x ny read sizes from origin (0,0); writeblock and auto selections are also illustrated)
[email protected] 167 Read API basics • Common API for reading files and streams (with staging) • In staging, one must process data step-by-step • Files allow for accessing all steps at once • Schedule/perform reads in bulk, instead of single reads • Allows for optimizing multiple reads together • Selections • bounding boxes, list of points, selected blocks and automatic • Chunking (optional) • receive and process pieces of the requested data concurrently • staging delivers data from many producers to a reader over a certain amount of time, which can be used to process the first chunks

[email protected] 168 Read API basics • Step • A dataset written within one adios_open/…/adios_close • Stream • A file containing a series of steps of the same dataset • The Read API is designed to read data from one step at a time, then advance forward • an alternative API allows for reading all steps at once from a file

[email protected] 169 Read API • Initialization/Finalization • One call per each read method used in an application • Staging methods usually connect to a staging server / other application at init, and disconnect at finalize.
int adios_read_init_method (enum ADIOS_READ_METHOD method, MPI_Comm comm, const char * parameters)
int adios_read_finalize_method (enum ADIOS_READ_METHOD method)
Fortran
use adios_read_mod
call adios_read_init_method (method, comm, parameters, err)
call adios_read_finalize_method (method, err)

[email protected] 170 Read API

• For files: seeing all timesteps at once ADIOS_FILE * adios_read_open_file ( const char * fname, enum ADIOS_READ_METHOD method, MPI_Comm comm)

• Close int adios_read_close (ADIOS_FILE *fp) Fortran use adios_read_mod call adios_read_open_file (fp, fname, method, comm, err) call adios_read_close (fp, err)

[email protected] 171 Inquire about a variable (no extra I/O)
• ADIOS_VARINFO * adios_inq_var (ADIOS_FILE *fp, const char * varname)
  • allocates memory for ADIOS_VARINFO; free the resources later with adios_free_varinfo()
• ADIOS_VARINFO fields
  • int ndim                 Number of dimensions
  • uint64_t *dims           Size of each dimension
  • int nsteps               Number of steps of the variable in the file. Streams: always 1
  • void *value              Value (for scalar variables only)
  • int nblocks              Number of blocks that comprise this variable in a step
  • void *gmin, *gmax, gavg, gstd_dev   Statistical values of the global arrays
• int adios_inq_var_stat (ADIOS_FILE *fp, ADIOS_VARINFO * varinfo, int per_step_stat, int per_block_stat)
• int adios_inq_var_blockinfo (ADIOS_FILE *fp, ADIOS_VARINFO * varinfo)
  • Get the writing layout of an array variable (bounding boxes of each writer)
Fortran
call adios_get_scalar (fp, varname, data, err)
call adios_inq_file (fp, vars_count, attrs_count, current_step, last_step, err)
call adios_inq_varnames (fp, vnamelist, err)
call adios_inq_var (fp, varname, vartype, nsteps, ndim, dims, err)
[email protected] 172 Read an attribute (no extra I/O)
• int adios_get_attr (ADIOS_FILE * fp, const char * attrname, enum ADIOS_DATATYPES * type, int * size, void ** data)
• Attributes are stored in metadata, read in during the open operation.
• It allocates memory for the content, so you need to free it later with free().

Fortran call adios_inq_attr (fp, attrname, attrtype, attrsize, err)
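A short C sketch of the inquiry pattern above, using only the calls and fields listed on the previous slide; fp is assumed to be an ADIOS_FILE* from an earlier adios_read_open_file() call.

ADIOS_VARINFO *vi = adios_inq_var (fp, "T");
printf ("T: %d dimensions, %d steps\n", vi->ndim, vi->nsteps);
for (int i = 0; i < vi->ndim; i++)
    printf ("  dim %d = %llu\n", i, (unsigned long long) vi->dims[i]);
adios_inq_var_blockinfo (fp, vi);   /* per-writer bounding boxes, if needed */
adios_free_varinfo (vi);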

[email protected] 173 Select a bounding box • In our example we need to define the selection area of what to read from an array.

(Figure: bounding box inside the global array, described by Offset(x), Offset(y) and Readsize(nx,ny), with origin (0,0))

• ADIOS_SELECTION * adios_selection_boundingbox (int ndim, uint64_t * offsets, uint64_t * readsize)
Fortran
call adios_selection_boundingbox (sel, ndim, offsets, readsize)
   integer*8 sel   -- return value
   integer ndim, integer*8 offsets(:), integer*8 readsize(:)

[email protected] 174 Reading data • Read is a scheduled request, …
int64_t adios_schedule_read ( const ADIOS_FILE * fp, const ADIOS_SELECTION * selection, const char * varname, int from_steps, int nsteps, void * data)
  • in streaming mode, only one step is available
• …executed later, together with other requests
int adios_perform_reads (const ADIOS_FILE *fp, int blocking)
Fortran
call adios_schedule_read (fp, selection, varname, from_steps, nsteps, data, err)
call adios_perform_reads (fp, err)

[email protected] adios_advance_step( ADIOS_FILE *fp, int last, float timeout_sec); • Can only advance to the next timestep • We must assume that failures can occur • It depends on the locking method, if advancing to the next step advances to the next immediate step (ADIOS_LOCKMODE_ALL) or to the next available step (ADIOS_LOCKMODE_CURRENT) • If the reading method in use does not support locking all steps, advancing to the ‘next’ step may fail if that step is not available anymore, and then return an error • Advancing to step N informs the read method that all steps before N can be removed if space is needed. There is no way to go back to the previous steps.
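A compact C sketch of the advance loop implied by these rules; the 60-second timeout is arbitrary, and the error constants are the ones used in the Fortran example later in this tutorial.

int err = 0;
while (err == 0) {
    /* schedule + perform reads on the currently visible step here ... */
    adios_release_step (fp);                 /* allow staging to recycle this step */
    err = adios_advance_step (fp, 0, 60.0);  /* 0: next available step, wait up to 60 s */
    if (err == err_end_of_stream) break;     /* writer terminated the stream */
    if (err == err_step_notready) break;     /* no new step arrived within the timeout */
}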

[email protected] 176 ADIOS Demo: Read • Goals • Learn how to read in data from an arbitrary number of processors.

[email protected] 7 7 Compile and run the read code

• We can read in data from an arbitrary number of processors with a 1D domain decomposition
$ cd ~/Tutorial/heat_transfer/read
$ make adios
$ mpirun -n 1 ./heat_read_adios
$ ls fort.*
fort.100
$ less -S fort.100
rank=0 size=150x160 offsets=0:0 step=3
time row columns 0...159
0 1 2 ...
(Each line contains: timestep, x (global), y (global), xy)

$ mpirun -n 7 ./heat_read_adios $ ls fort.* fort.100 fort.101 fort.102 fort.103 fort.104 fort.105 fort.106 $ less -S fort.105 rank=5 size=150x22 offsets=0:110 step=3 time row columns 110...131 110 111 112 113 ...

[email protected] Dissecting the code
program reader
  use adios_read_mod
  use heat_print
  implicit none
  include 'mpif.h'
  character(len=256) :: filename, errmsg
  integer :: timesteps    ! number of times to read data
  integer :: nproc        ! number of processors
  real*8, dimension(:,:), allocatable :: T, dT
  ! MPI variables
  integer :: group_comm
  integer :: rank
  integer :: ierr
  integer :: ts=0         ! actual timestep
  integer :: i,j
  ! ADIOS related variables
  integer*8 :: fh         ! File handle
  integer*8 :: sel        ! ADIOS selection object
  ! Variable information
  integer :: vartype, nsteps, ndim
  integer*8, dimension(2) :: dims
  ! Offsets and sizes
  integer :: gndx, gndy
  integer*8, dimension(2) :: offset=0, readsize=1

  call MPI_Init (ierr)
  call MPI_Comm_dup (MPI_COMM_WORLD, group_comm, ierr)
  call MPI_Comm_rank (MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size (group_comm, nproc, ierr)

[email protected] Dissecting the code -2
  call adios_read_init_method (ADIOS_READ_METHOD_BP, group_comm, "verbose=3", ierr)
  write(filename,'("../heat.bp")')
  call adios_read_open_file (fh, filename, ADIOS_READ_METHOD_BP, group_comm, ierr)

  ! adios_get_scalar() gets the value from metadata in memory
  call adios_get_scalar(fh, "gndx", gndx, ierr)
  call adios_get_scalar(fh, "gndy", gndy, ierr)
  readsize(1) = gndx
  readsize(2) = gndy / nproc

  ! We can also inquire the dimensions, type and number of steps
  ! of a variable directly
  call adios_inq_var (fh, "T", vartype, nsteps, ndim, dims, ierr)
  ts = nsteps-1             ! Let's read the last timestep

  offset(1) = 0
  offset(2) = rank * readsize(2)
  if (rank == nproc-1) then ! last process should read all the rest of columns
      readsize(2) = gndy - readsize(2)*(nproc-1)
  endif

  allocate( T(readsize(1), readsize(2)) )

  ! Create a 2D selection for the subset
  call adios_selection_boundingbox (sel, 2, offset, readsize)

  ! Arrays are read by scheduling one or more of them
  ! and performing the reads at once
  call adios_schedule_read (fh, sel, "T", ts, 1, T, ierr)
  call adios_perform_reads (fh, ierr)

  call print_array (T, offset, rank, ts)

  call adios_read_close (fh, ierr)
  call adios_selection_delete (sel)
  call adios_read_finalize_method (ADIOS_READ_METHOD_BP, ierr)
  call MPI_Finalize (ierr)
end program reader

[email protected] Dissecting the step code
! The declarations are the same as in the reader above (fh, sel, vartype, nsteps, ndim, dims,
! gndx, gndy, offset, readsize), followed by the same MPI_Init / MPI_Comm_dup /
! MPI_Comm_rank / MPI_Comm_size calls.
[email protected] Dissecting the step code -2
  call adios_read_init_method (ADIOS_READ_METHOD_BP, group_comm, "verbose=3", ierr)
  write(filename,'("../heat.bp")')
  call adios_read_open (fh, filename, ADIOS_READ_METHOD_BP, group_comm, &
                        ADIOS_LOCKMODE_CURRENT, 180.0, ierr)
  if (ierr .ne. 0) then
      print '(" Failed to open stream: ",a)', filename
      print '(" open stream ierr=: ",i0)', ierr
      call exit(1)
  endif

  ts = 0
  do while (ierr==0)
      print '(" Process step: ",i0)', ts
      if (ts==0) then
          call adios_get_scalar(fh, "gndx", gndx, ierr)
          call adios_get_scalar(fh, "gndy", gndy, ierr)
          readsize(1) = gndx
          readsize(2) = gndy / nproc
          offset(1) = 0
          offset(2) = rank * readsize(2)
          if (rank == nproc-1) then ! last process should read all the rest of columns
              readsize(2) = gndy - readsize(2)*(nproc-1)
          endif
          allocate( T(readsize(1), readsize(2)) )
          ! Create a 2D selection for the subset
          call adios_selection_boundingbox (sel, 2, offset, readsize)
      endif

      ! Arrays are read by scheduling one or more of them and performing the reads at once.
      ! In streaming mode, always step 0 is the one available for read.
      call adios_schedule_read (fh, sel, "T", 0, 1, T, ierr)
      call adios_perform_reads (fh, ierr)

      ! Release resources to this step
      call adios_release_step (fh, ierr)

      call print_array (T, offset, rank, ts)

      ! Advance to the next available step
      call MPI_Barrier (group_comm, ierr)
      call adios_advance_step (fh, 0, 5.0, ierr)
      if (ierr==err_end_of_stream .and. rank==0) then
          print *, " Stream has terminated. Quit reader"
      elseif (ierr==err_step_notready .and. rank==0) then
          print *, " Next step has not arrived for a while. Assume termination"
      endif
      ts = ts+1
  enddo
  call adios_read_close (fh, ierr)

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes
[email protected] A Common Question • I used ADIOS with my code "foobar", but my I/O performance sucked. Can you make the I/O performance work better? • Our reply • Where did you run the code? • What does foobar write out? • Which method did you use? • What was your file system? • … • They give us some answers, but this repeats itself for a long time!

• How do we make this process work much better?

[email protected] Skel 1: Replaying the I/O part of an application

Source files .bp data file

Skel Makefile .yaml Submit scripts • Skeletal applications perform the same I/O operations as an application, but eliminate computation and communication • Created from an ADIOS output file • Easy to create, run, and update, compared to instrumented apps or I/O kernels 186 [email protected] 187 Using SKEL

$ cd ~/Tutorial/heat_transfer Run a single timestep $ mpirun -np 12 ./heat_transfer_adios1 heat_1 3 4 50 40 1 500 $ mkdir skel $ skeldump heat_1.bp >& skel/heat.yaml

$ cd skel $ skel replay heat -y heat.yaml $ mpirun -np 12 ./heat_skel_group1 $ bpls -lva out_group1_heat_1

[email protected] Skel 2: Automatic generation of I/O Skeletal Benchmarks

Source files app.xml

Skel Makefile

params.xml Submit scripts

• Skeletal applications perform the same I/O operations as an application, but eliminate computation and communication • Created from ADIOS XML file and a handful of additional parameters, does not require a full running application • Easy to create, run, and update, compared to instrumented apps or I/O kernels

188 [email protected] Skel: Users and Impact • Extensible collection of relevant I/O benchmarks • Focus less on measuring and more on improving I/O performance • Create a comprehensive and relevant set of I/O benchmarks for new or existing hardware platforms • Demonstrated to work well with these applications (so far): • S3D (Combustion) • GTS (Fusion) • CHIMERA (Astrophysics) • GRAPES (Weather) • GEOS-5 (Climate) 1 app, 2 machines, 3 methods 189 [email protected] The Exascale world is changing • More routines are being run in situ • SKEL can be extended to take into account many of the ADIOS services, to reproduce the workflow • SKEL is becoming a Domain Specific Markup Language (DSML) for creating the in situ workflows, and allow us to test the workflow without running the codes

[email protected] Interpreting Adios timing information • When writing with multiple timesteps, some Adios methods record timing information in the output file • In the heat transfer example, each file contains timing information for the previous timestep $ cd ~/Tutorial/heat_transfer number of writing processes $ bpls heat.bp -d "/__adios__/timers” double /__adios__/timers_1 {12, 8} number of available timers ( 0,0) 0.0995553 0.00518584 0 3.60012e-05 0.0143201 1.90735e-06 ( 0,6) 0 0.105091 0.0501909 0 0 0 ( 1,4) 0.0034349 0 0 0.0502429 0.0498557 0 ( 2,2) 0 0 0.00271893 9.53674e-07 0 0.0499089 ... • The number and types of timers depend on which method was used for writing

[email protected] Interpreting Adios timing information • Use bpls to see the names of timers in a bp file • Need to dump the values from the labels array • Use -S to interpret the byte arrays as strings

$ bpls heat.bp -Sd "/__adios__/timer_labels_1” byte /__adios__/timer_labels_1 {8, 17} (0, 0) "Communication “ (1, 0) "I/O “ (2, 0) "Local metadata “ (3, 0) "Global metadata “ (4, 0) "adios_open() “ (5, 0) "adios_write() “ (6, 0) "adios_overflow()“ (7, 0) "adios_close() "

[email protected] Interpreting Adios timing information • You can use the slicing feature of bpls to see measurements for a particular timer # View the ADIOS_TIMER_COMM measurements $ bpls heat.bp -d "/__adios__/timers_1” -s "0,0" -c "-1,1" double /__adios__/timers_1 {12, 8} ( 0,0) 0.0995553 0.0501909 0.0498557 0.0980701 0.0699418 0.0915856 ( 6,0) 0.107233 0.0559721 0.0439298 0.0867393 0.076993 0.0778482...

• This shows the time spent by each process on communication

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NOXML 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] 195 ADIOS non-XML APIs • Limitation of the XML approach • Must have all variables defined in advance • Approach of non-XML APIs • Similar operations to what happens internally in adios_init • Define variables and attributes before opening a file for writing • The writing steps are the same as XML APIs • open file • write variables • close file

[email protected] 196 Non-XML API functions • Initialization • init adios, allocate buffer, declare groups and select write methods for each group. adios_init_noxml (); adios_allocate_buffer (ADIOS_BUFFER_ALLOC_NOW, 10); • when and how much buffer to allocate (in MB) adios_declare_group (&group, "restart", "iter", adios_flag_yes); • group with name and optional timestep indicator (iter) and whether statistics should be generated and stored adios_select_method (group, "MPI", "", ""); • with optional parameter list, and base path string for output files

[email protected] 197 Non-XML API functions • Definition int64_t adios_define_var (group, “name”, “path”, type, “local_dims”, “global_dims”, “offsets”) • Similar to how we define a variable in the XML file • returns a handle to the specific definition • Dimensions/offsets can be defined with • scalars (as in the XML version) • id = adios_define_var (g, “xy”, “”, adios_double, “nx_local, ny_local”, “nx_global, ny_global”, “offs_x,offs_y”) • need to define and write several scalars along with the array • actual numbers • id = adios_define_var (g, “xy”, “”, adios_double, “20,20”, “100,100”, “0,40”)

[email protected] 198 Multiple blocks per process • AMR codes and load balancing strategies may want to write multiple pieces of a global variable from a single process • ADIOS allows one to do that but • one has to write the scalars defining the local dimensions and offsets for each block, and • group size should be calculated accordingly • This works with the XML API, too, but because of the group size issue, pre- generated write code cannot be used (should do the adios_write() calls manually) • Array definition with the actual sizes as numbers saves from writing a lot of scalars (and writing source code)
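A hedged C sketch of the idea: two blocks of one global 1D array written by a single process, with the local sizes and offsets given as numbers. The sizes and the output file name are made up for illustration, g is a group handle from adios_declare_group(), and adios_write_byid is assumed to be available to tie each write to its own definition.

/* define the same global array twice, once per local block, with numeric dims/offsets */
int64_t id0 = adios_define_var (g, "xy", "", adios_double, "100", "1200", "0");
int64_t id1 = adios_define_var (g, "xy", "", adios_double, "100", "1200", "100");

/* note: as the slide above warns, the group size must account for both blocks */
adios_open (&fd, "restart", "blocks.bp", "w", comm);
adios_write_byid (fd, id0, block0);   /* first 100 elements written by this rank */
adios_write_byid (fd, id1, block1);   /* next 100 elements written by this rank */
adios_close (fd);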

[email protected] 199 ADIOS Demo: Non-XML Write API • Goals • Use Non-XML API of ADIOS • How to write multiple blocks of a variable from a process

[email protected] 200 Compile and run the code

$ cd ~/Tutorial/heat_transfer $ make noxml $ mpirun -np 12 ./heat_transfer_noxml heat 4 3 40 50 6 500 $ du -hs *.bp 2.3M heat.bp $ bpls -l heat.bp integer gndx 6*scalar = 150 / 150 / 150 / 0 integer gndy 6*scalar = 160 / 160 / 160 / 0 double T 6*{160, 150} = 0.0002/999.47/442.536/318.995 double dT 6*{160, 150} = 6.27403e-06/0.720 /0.135/0.115

• study the source io_adios_noxml.F90

[email protected] 201 Multi-block example (Fortran) • This is similar to 01_write_read tutorial example, except that each process writes 2 blocks (2nd block shifted in the X offset). • 12 cores, arranged in a 4 x 3 arrangement. • 24 data blocks • Each processor allocates an 4x3 array (xy) P11 P11 • we write the same array to P6 P6 two places in the output y • Total size of the array P0 P0 • 2*4*4 x 3*3 • Use the non-XML API x • define and write xy array without scalars [email protected] 202 No-XML example $ cd ../02_noxml/ $ make $ mpirun -np 12 ./writer_noxml ts= 0 ts= 1 $ ls -l *.bp -rw-rw-r-- 1 adios adios 11196 Jul 7 14:10 writer00.bp -rw-rw-r-- 1 adios adios 11451 Jul 7 14:10 writer01.bp $ plot_writer.sh $ eog . &

[email protected] Agenda • Day 1 • Reading 40 minutes • Introduction 55 minutes • Python 30 minutes • ADIOS APIs 20 minutes • Exercises 15 minutes • Break 15 minutes • Day 3 • Writing-XML 60 minutes • Review 10 minutes • BPLS 15 minutes • SKEL 20 minutes • Exercises 15 minutes • Write-NO 45 minutes • Day 2 • Python 30 minutes • Review 10 minutes • Break 15 minutes • Schema 20 minutes • Staging 60 minutes • Plotter 15 minutes • Visit 60 minutes • Transforms 45 minutes • Break 15 minutes [email protected] Python/Numpy Wrapper • Two different modules for serial and parallel version • Serial version: Use “import adios” • MPI version: Use “import adios_mpi” • Dependencies • Numpy: Data to write and read will be represented by using Numpy array • MPI4Py: Need only for MPI version, adios_mpi. Most MPI Comm related parameters are set with default values and can be omitted. • Consist of the following components • A set of write APIs. Both XML and No-XML APIs • Read APIs: read_init and read_finalize • Two read related classes, file and variable • Enumeration classes to represent constant parameters: DATATYPE, FLAGS, BUFFER_ALLOC_WHEN • A set of utility functions: • np2adiostype (NUMPY_TYPE): returns Adios DATATYPE • readvar (FILENAME, VARNAME): simple variable read function • bpls (FILENAME): show BP file contents • Examples: • See example codes (bpls.py and ncdf2bp.py) in wrapper/numpy/example/utils • Test codes in wrapper/numpy/tests [email protected] Python/Numpy Read Class • File class • Constructor • file (PATH, [COMM], [is_stream], [lock_mode], [timeout_sec]) • Members • var: Python dict object contains {variable name: variable descriptor} pairs • Other variables for num variables, current steps, last steps, version, file sizes, etc. • Methods • close () : close file • printself () : print contents for debugging purpose • advance (LAST, TIMEOUT_SEC): advance steps for streaming • Variable class • Constructor • var (FILECLASS, VARNAME) • Members • Variable related parameters: name, varid, type, ndim, dims, nsteps • Methods • read ([OFFSET], [COUNT], [FROM_STEPS=0], [NSTEPS=1]): read as numpy array • close () : close and free variable class • printself () : print contents for debugging purpose

205 [email protected] Pseudocode BP Read by Python/Numpy Wrapper

import adios as ad ## import python modules import numpy as np

f = ad.file("heat.bp") ## Call file class constructor v = f.var['T'] ## Get variable val = v.read() ## Read as Numpy array … do computation … f.close() ## Close file

import adios_mpi as ad ## import MPI modules import numpy as np from mpi4py import MPI

comm = MPI.COMM_WORLD

f = ad.file("heat.bp", comm) ## Add comm info … [email protected] Heat Transfer example python reading

$ cd ~/Tutorial/heat_transfer/python $ python bpls.py ../heat_zfp.bp File info: of variables: 13 time steps: 0 - 499 file size: 85940906 bp version: 3

long integer /info/nproc 500*() long integer /info/npx 500*() long integer /info/npy 500*() double precision T 500*(384L, 512L) double precision dT 500*(384L, 512L) long integer gndx 500*() long integer gndy 500*() long integer iterations 500*() long integer ndx 500*() long integer ndy 500*() long integer offx 500*() long integer offy 500*() long integer step 500*()

207 [email protected] bpls.py
#!/usr/bin/env python
"""
Example:
$ python ./bpls.py bpfile
"""
import adios as ad
import numpy as np
import getopt, sys

def usage():
    print "USAGE: %s filename" % sys.argv[0]

def main():
    fname = ""
    if len(sys.argv) < 2:
        usage()
        sys.exit(1)
    else:
        fname = sys.argv[1]

    f = ad.file(fname)
    print "File info:"
    print "  %-18s %d" % ("of variables:", f.nvars)
    print "  %-18s %d - %d" % ("time steps:", f.current_step, f.last_step)
    print "  %-18s %d" % ("file size:", f.file_size)
    print "  %-18s %d" % ("bp version:", f.version)
    print ""

    for k in sorted(f.var.keys()):
        v = f.var[k]
        print "  %-17s  %-12s  %d*%s" % (np.typename(np.sctype2char(v.dtype)),
                                         v.name, v.nsteps, v.dims)

if __name__ == "__main__":
    main()

[email protected] Heat Transfer example python reading

$ cd ~/Tutorial/heat_transfer/python $ python heat_read.py ../heat_zfp.bp dT >>> Read full data >>> name: dT >>> shape: (500, 384, 512) >>> values: [[[ 0.00633766 0.01261227 0.01876219 ..., 0.01876219 0.01261227 0.00633766] [ 0.01261227 0.02509891 0.03733712 ..., 0.03733712 0.02509891 0.01261227] [ 0.01876219 0.03733712 0.05554169 ..., 0.05554169 0.03733712 0.01876219] ... >>> Read step by step >>> step: 0 >>> name: dT >>> shape: (384L, 512L) >>> values: ...

209 [email protected] heat_read.py
#!/usr/bin/env python
"""
Example:
$ python ./heat_transfer.py bpfile varname [varname [...]]
"""
import adios as ad
import numpy as np
import sys

def usage():
    print "USAGE: %s bpfile varname" % sys.argv[0]

fname = ""
if len(sys.argv) < 3:
    usage()
    sys.exit(1)
else:
    fname = sys.argv[1]
    varnames = sys.argv[2:]

"""
Read all
"""
print ">>> Read full data"
for vname in varnames:
    v = ad.readvar(fname, vname)
    print ">>> name:", vname
    print ">>> shape:", v.shape
    print ">>> values:"
    print v

"""
Read step by step
"""
print ""
print ">>> Read step by step"
f = ad.file(fname)
for i in range(f.current_step, f.last_step):
    print ">>> step:", i
    for vname in varnames:
        v = f.var[vname]
        val = v.read(from_steps=i)
        print ">>> name:", vname
        print ">>> shape:", v.dims
        print ">>> values:"
        print val
f.close()

[email protected] example: 2x2 read with bpls (in first timestep) • Use bpls to read in a 2D slice
$ bpls heat_zfp.bp -d T -s "0,49,39" -c "1,2,2" -n 2
double T 500*{384, 512}
  slice (0:0, 49:50, 39:40)

(0,49,39) 5.0916 4.15414
(0,50,39) 4.99562 4.05808
• How do we do the same in Python?
• Note: bpls handles time as an extra dimension; the python wrapper handles time separately

211 [email protected] Heat Transfer example python reading $ cd ~/Tutorial/heat_transfer/python $ python >>> import adios as ad >>> import numpy as np >>> f=ad.file("../heat.bp") >>> f.printself() >>> v=f.var['T'] >>> v.printself() >>> T=v.read(offset=(49,39),count=(2,2),from_steps=0,nsteps=1) >>> T array([[ 5.09159701, 4.15414135], [ 4.99562066, 4.05807836]])

or just >>> T=v.read((49,39), (2,2), 0, 1) >>> quit()

212 [email protected] Python/Numpy Write API functions • Init init (PATH, [COMM]) • PATH: configuration file • COMM: MPI communicator (default=MPI.COMM_WORLD). Can be omitted for serial version • Open f = open (GROUPNAME, FILENAME, MODE, [COMM]) • Return file descriptor • Write write (FILEP, VARNAME, VAL) • FILEP: file descriptor. Return value of open() • VAL: numpy array type. Dimension and data type are automatically detected. • Close close (FILEP) • FILEP: file pointer. Return value of open()

213 [email protected] Python/Numpy Write No-XML API functions • Init No-XML Init_noxml ( [COMM]) • COMM: MPI communicator (default=MPI.COMM_WORLD). Can be omitted for serial version • Allocate_buffer allocate_buffer (WHEN, BUFFER_SIZE) • WHEN: Defined as an enum class, BUFFER_ALLOC_WHEN. Use one of BUFFER_ALLOC_WHEN.[UNKNOWN, NOW, LATER] values. • Declare Group g = declare_group (GROUPNAME, [TIMEINDEX]) • Return group descriptor • Declare Var define_var(GROUPID, VARNAME, TYPE, [DIM], [GLOBALDIM], [OFFSET]) • GROUPID: group descriptor. Return value of declare_group() • TYPE: Defined as an enum class, DATATYPE • Select Method select_method(GROUPID, METHOD, [PARAMETERS], [BASEPATH]) • GROUPID: group descriptor. Return value of declare_group() • Example: wrapper/numpy/example/utils/ncdf2bp.py

214 [email protected] Typical BP Write by Python/Numpy Wrapper
import adios_mpi as ad
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

NX = 10
t = np.array (range(NX), dtype=np.float64) + rank*NX

ad.init ("config_mpi.xml", comm)
fd = ad.open ("temperature", "adios_test_mpi.bp", "w", comm)
ad.write_int (fd, "NX", NX)
ad.write_int (fd, "rank", rank)
ad.write_int (fd, "size", size)
ad.write (fd, "temperature", t)
ad.close (fd)

ad.finalize ()

See ADIOS source wrappers/numpy/tests/test_adios_mpi.py or the sequential version: test_adios.py

215 [email protected] Interactive Analysis with iPython Notebook

• Run $ cd ~/Tutorial/heat_transfer/python $ ipython notebook • Then, create an interactive notebook

Click to create a new notebook

[email protected] iPython Notebook • Basic operations • Notebook consists of multiple cells • Each cell can be a code section or markdown for text • Execute code cell: Shift + Enter • Will display (*) while executing • Start over: • Kernel > Restart • Numpy basic • Access metadata: x.shape, x.dtype • Slicing • Range: x[:10] or x[0:10] • Full range: x[:] or x[…] [email protected] • Load necessary modules and set plot settings import adios as ad ## Adios module import numpy as np ## Numpy module import matplotlib.pyplot as plt ## Plotting import pprint ## Pretty print %matplotlib inline

• Check files: ls -alh ../old_runs/heat_*.bp -rw-rw-r-- 1 adios adios 1.5G Sep 28 13:13 ../heat_none.bp -rw-rw-r-- 1 adios adios 82M Sep 28 12:54 ../heat_zfp.bp

[email protected] Adios File • Open adios file f = ad.file('../old_runs/heat_none.bp') • Check contents (meta data: path, vars, attrs, steps, size, etc) f AdiosFile (path='../heat_none.bp', nvars=13, vars=['gndy', 'gndx', 'ndx', '/info/npy', '/info/npx', 'ndy', 'offy', 'step', 'T', 'iterations', 'offx', 'dT', '/info/nproc'], nattrs=5, attrs=['T/description', '/info/nproc/description', 'dT/description', '/info/npx/description', '/info/npy/description'], current_step=0, last_step=499, file_size=1581566394) • Variables and attributes are organized as "dict" f.vars f.attrs

[email protected] Adios Variable • Check variable metadata (type, dims, steps, etc) v = f['T'] ## Get a variable v AdiosVar (varid=11, name='T', dtype=dtype('float64'), ndim=2, dims=(384L, 512L), nsteps=500, attrs=['description']) • Read values T = v[...] ## Read all and save as a numpy array T ## Print the contents (numpy object) print T.shape, T.dtype ## Print shape and type • Subset reading v[0,...] # Read only 1st time step v[:10,...] # Read only the first 10 time steps v[...,0,0] # Read only all values at x=0, y=0 [email protected] Python Plotting • Plot 1D array (vector) plt.plot(T[:,100,100]) • Plot 2D array (matrix) plt.plot(T[0,:,:]) ## Read 2D matrix at timestep=0 plt.colorbar()

[email protected] Heat transfer example with Python • Read 'T' variable from heat_none.bp and heat_zfp.bp T = ad.file('../data/heat_none.bp')['T'][...] Tz = ad.file('../data/heat_zfp.bp')['T'][...]

• Plot abs difference between T and zfp-compressed T plt.imshow(abs(T[10,...] - Tz[10,...])); plt.colorbar();

[email protected] Compare Derivatives • First derivative: dT = (T[2:500,...] - T[0:498,...])/2 dTz = (Tz[2:500,...] - Tz[0:498,...])/2 ## Plot plt.imshow(dT[10,...]); plt.colorbar(); ## Plot abs difference plt.imshow(dT[10,…]-dTz[10,...]); plt.colorbar(); • Second derivative: ddT = (T[2:500,...] + T[0:498,...] - 2*T[1:499,...]) ddTz = (Tz[2:500,...] + Tz[0:498,...] -2*Tz[1:499,...]) ## Plot plt.imshow(ddT[10,...]); plt.colorbar(); plt.imshow(ddT[10,...]); plt.colorbar(); ## Plot abs difference plt.imshow(ddT[10,...]-ddTz[10,...]); plt.colorbar(); [email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] Introduction to Staging • Simplistic approach to staging • Decouple application performance from storage performance (burst buffer) • Move data directly to remote memory in a “staging” area • Write data to disk from staging • Built on past work with threaded buffered I/O • Buffered asynchronous data movement with a single memory copy for networks which support RDMA • Application blocks for a very short time to copy data to outbound buffer • Data is moved asynchronously using server-directed remote reads

• Create a “DataSpace” • Exploits network hardware for fast data transfer to remote memory Released with ADIOS 1.4.0 [email protected] Vision: building scientific collaborative applications

[email protected] Evolution of Staging to support on line analytics

• Use of staging for common data management tasks
• Staging applications were initially monolithic programs
• Multiple staging applications can be combined for a pipeline approach
(Figure: application feeding two staging processes, each writing to storage)
• Partition large applications into multiple services in the staging area
• Each plugin uses the ADIOS API to read/write
• Plugins can communicate memory-to-memory or through files
• Workflow can have many branches
• Resource allocations managed by ADIOS
(Figure: application connected to a staging area hosting plugins A-E)
[email protected] 228 Currently available staging methods

ADIOS + DataSpaces, ADIOS + DIMES, ADIOS + FLEXPATH
+ asynchronous communication
+ easy, commonly-used APIs
+ fast and scalable data movement
+ not affected by parallel I/O performance
(Figure: producer/consumer pairs coupled step-by-step through DataSpaces staging servers, DIMES and FLEXPATH RDMA pulls, instead of exchanging record.bp files)

[email protected] Design choices for reading API • One output step at a time • One step is seen at once after writer completes a whole output step • streaming is not byte streaming here • reader has access to all data in one output step • as long as the reader does not release the step, it can read it • potentially blocking the writer • Advancing in the stream means • get access to another output step of the writer, • while losing the access to the current step forever.

[email protected] 230 Recall read API • Step • A dataset written within one adios_open/…/adios_close • Stream • A file containing a series of steps of the same dataset • Open as a stream or as a file • for step-by-step reading (both staged data and files)
ADIOS_FILE * adios_read_open ( const char * fname, enum ADIOS_READ_METHOD method, MPI_Comm comm, enum ADIOS_LOCKMODE lock_mode, float timeout_sec)
• Close
int adios_read_close (ADIOS_FILE *fp)

[email protected] 231 Locking options

Locking modes
• The meaning of “next”:
 – CURRENT (or NONE): “next” := next available step
 – ALL: “next” := current+1
• ALL
 – locks the current and all future steps in staging
 – ensures that the reader can read all data
 – reader’s priority: it can block the writer
• CURRENT
 – locks the current step only
 – future steps can disappear if the writer pushes newer steps and staging needs more space
 – writer’s priority: the reader must handle skipped steps
• NONE
 – no assumptions; anything can disappear between two read operations
 – be ready to process errors

[email protected] 232 Advancing a stream
• One step is accessible in a stream; advancing is forward only
    int adios_advance_step (ADIOS_FILE *fp, int last, float timeout_sec)
• last: advance to “next” or to the latest available step
 – “next” or “next available” depends on the locking mode
 – locking = all: go to the next step; return an error if it no longer exists
 – locking = current or none: give the next available step after the current one
• timeout_sec: block for this long if no new steps are available
• Release a step when it is not needed anymore
 – an optimization that allows the staging method to deliver new steps when available
    int adios_release_step (ADIOS_FILE *fp)
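Putting these calls together, a minimal sketch of the canonical streaming-reader loop (the variable name "T", the array sizes, and the FLEXPATH method are placeholders; the concrete tutorial example follows on the next slides):

    #include <stdint.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include "adios_read.h"

    int main (int argc, char **argv)
    {
        MPI_Init (&argc, &argv);
        MPI_Comm comm = MPI_COMM_WORLD;
        adios_read_init_method (ADIOS_READ_METHOD_FLEXPATH, comm, "");

        /* open the stream, waiting up to 60 s for it to appear */
        ADIOS_FILE *fp = adios_read_open ("heat.bp", ADIOS_READ_METHOD_FLEXPATH,
                                          comm, ADIOS_LOCKMODE_CURRENT, 60.0);

        uint64_t offs[2]  = {0, 0};
        uint64_t count[2] = {150, 160};                       /* placeholder sizes */
        double *data = malloc (count[0] * count[1] * sizeof(double));
        ADIOS_SELECTION *sel = adios_selection_boundingbox (2, offs, count);

        while (fp != NULL && adios_errno != err_end_of_stream) {
            adios_schedule_read (fp, sel, "T", 0, 1, data);   /* current step only */
            adios_perform_reads (fp, 1);                      /* 1 = blocking */
            adios_release_step  (fp);                         /* let staging recycle the step */
            /* ... process data ... */
            adios_advance_step  (fp, 0, 60.0);                /* wait up to 60 s for the next step */
        }
        if (fp) adios_read_close (fp);
        adios_read_finalize_method (ADIOS_READ_METHOD_FLEXPATH);
        free (data);
        MPI_Finalize ();
        return 0;
    }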

[email protected] 233 Example of Read API: open a stream fp = adios_read_open ("myfile.bp", ADIOS_READ_METHOD_BP, comm, ADIOS_LOCKMODE_CURRENT, 60.0);

// error possibilities (check adios_errno)
// err_file_not_found – the stream is not yet available
// err_end_of_stream – the stream ended before we tried to open it
// (fp == NULL) – some other error happened (print adios_errmsg())

// process steps here… ...

adios_read_close (fp);

[email protected] 234 Example of Read API: read a variable step-by-step

uint64_t offs[]  = {5,5,5};
uint64_t count[] = {10,10,10};
P = (double*) malloc (sizeof(double) * count[0] * count[1] * count[2]);
Q = (double*) malloc (sizeof(double) * count[0] * count[1] * count[2]);
ADIOS_SELECTION *sel = adios_selection_boundingbox (3, offs, count);

while (fp != NULL && adios_errno != err_end_of_stream) {
    adios_schedule_read (fp, sel, "P", 0, 1, P);
    adios_schedule_read (fp, sel, "Q", 0, 1, Q);
    adios_perform_reads (fp, 1);          // 1: blocking read
    // P and Q contain the data at this point
    adios_release_step (fp);              // staging method can release this step
    // ... process P and Q, then advance the step
    adios_advance_step (fp, 0, 60.0);     // wait up to 60 s for the next available step
}
// free ADIOS resources
adios_free_selection (sel);

[email protected] 235 heat transfer example with staging
• Staged reading code
 – ~/Tutorial/heat_transfer/stage_write/stage_write.c
 – same as in the ADIOS repository: examples/staging/stage_write
• This code
 – reads an ADIOS dataset using an ADIOS read method, step-by-step
 – writes out each step using an ADIOS write method
• Use cases
 – Staged write: asynchronous I/O using extra compute nodes, a.k.a. burst buffer
 – Reorganize data from N-process output to M-process output (from many subfiles to fewer, bigger subfiles, or to one big file)
 – Convert to other formats (e.g. GRIB2)
• Assumptions
 – The list of variables and the global dimensions of the arrays DO NOT CHANGE across steps
 – Otherwise see ADIOS examples/staging/stage_write_varying

[email protected] 236 heat transfer example with staging
$ cd ~/Tutorial/heat_transfer
  edit heat_transfer.xml (vi, gedit) and set the method to FLEXPATH with parameters QUEUE_SIZE=4;verbose=3
$ cd stage_write
$ make
$ cd ..
$ mpirun -np 4 ./heat_transfer_adios2 heat 2 2 300 300 10 600

In another terminal:
$ cd ~/Tutorial/heat_transfer
$ mpirun -np 2 stage_write/stage_write heat.bp staged.bp FLEXPATH "" MPI "" 2
Input stream            = heat.bp
Output stream           = staged.bp
Read method             = FLEXPATH (id=5)
Read method parameters  = "max_chunk_size=100; app_id =32767; verbose= 3;poll_interval = 100;"
Write method            = MPI
Write method parameters = ""
Waiting to open stream heat.bp...
$ bpls -l staged.bp
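For reference, the line being edited in heat_transfer.xml is the method element of the output group. A minimal sketch of the two variants used in this tutorial, following the usual ADIOS XML convention (the group name "heat" is an assumption and should be checked against the actual file):

    <!-- staged output through FLEXPATH, as used above -->
    <method group="heat" method="FLEXPATH">QUEUE_SIZE=4;verbose=3</method>

    <!-- plain parallel file output, as used in the file-based runs -->
    <method group="heat" method="MPI"/>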

[email protected] 237 N to M reorganization with stage_write • heat transfer + stage_write running together • Write out 6 time-steps. • Write from 12 cores, arranged in a 4 x 3 arrangement. • Read from 3 cores, arranged as 1x3

P8 P9 P10 P11

y P4 P5 P6 P7

P0 P1 P2 P3

x [email protected] 238 N to M reorganization with stage_write $ cd ~/Tutorial/heat_transfer edit heat_transfer.xml (vi, gedit) set method to MPI

$ mpirun -np 12 ./heat_transfer_adios1 heat 4 3 40 50 6 500 $ bpls -D heat.bp T double T 6*{150, 160} step 0: block 0: [ 0: 49, 0: 39] block 1: [ 0: 49, 40: 79] ... block 11: [100:149, 120:159]

$ mpirun -np 3 stage_write/stage_write heat.bp h_3.bp BP "" MPI "" 3 $ bpls -D h_3.bp T double T 6*{150, 160} step 0: block 0: [ 0:149, 0: 52] block 1: [ 0:149, 53:105] block 2: [ 0:149, 106:159] [email protected] Agenda • Day 1 • Reading 40 minutes • Introduction 55 minutes • Python 30 minutes • ADIOS APIs 20 minutes • Exercises 15 minutes • Break 15 minutes • Day 3 • Writing-XML 60 minutes • Review 10 minutes • BPLS 15 minutes • SKEL 20 minutes • Exercises 15 minutes • Write-NO 45 minutes • Day 2 • Python 30 minutes • Review 10 minutes • Queries 30 minutes • Schema 20 minutes • Break 15 minutes • Plotter 15 minutes • Staging 60 minutes • Transforms 45 minutes • Visit 60 minutes • Break 15 minutes [email protected] 240 Goal • Speed up reading time by finding the values of interest in the dataset

1.0 <= x < 2.0

[email protected] 241 Scenarios for scientific variables • Stored as separate arrays in an ADIOS file double T 200*{10000, 9000} double P 200*{10000, 9000} double H 200*{10000, 9000}

• Stored as columns of a table in an ADIOS file • e.g. particle data is usually in this form (fusion, material science) • QMCPack walkers double traces {21571123, 4168} • Evaluate conditions on some columns and • 1. get values of another column • 2. get the whole matching rows

• See C examples in ADIOS source • examples/C/query/ [email protected] 242 Basic scenario • Have 3 variables (T,P,H), same global array size • Get the values of T where P > 80.0 and H <= 50.0 We need a single bounding box selection everywhere

ADIOS_SELECTION* box = adios_selection_boundingbox(...);
ADIOS_QUERY* q1 = adios_query_create(f, box, "P", ADIOS_GT,   "80.0");
ADIOS_QUERY* q2 = adios_query_create(f, box, "H", ADIOS_LTEQ, "50.0");
ADIOS_QUERY* q  = adios_query_combine(q1, ADIOS_QUERY_OP_AND, q2);
ADIOS_SELECTION *hits;
adios_query_evaluate(q, box, 0, batchSize, &hits);
...
adios_schedule_read (f, hits, "T", 1, 1, data);
adios_perform_reads (f, 1);

[email protected]
int do_queries (int step)
{
    q1 = adios_query_create (f, boxsel, v1name, ADIOS_GTEQ, minstr);
    q2 = adios_query_create (f, boxsel, v1name, ADIOS_LTEQ, maxstr);
    q  = adios_query_combine (q1, ADIOS_QUERY_OP_AND, q2);
    if (q == NULL) {
        print ("rank %d: ERROR: Query creation failed: %s\n", rank, adios_errmsg());
        return 1;
    }

    // We can call this with unknown too, just testing the default behavior here
    if (query_method != ADIOS_QUERY_METHOD_UNKNOWN)
        adios_query_set_method (q, query_method);

    int64_t batchSize = adios_query_estimate (q, 0);
    print ("rank %d: set upper limit to batch size. Number of total elements in array = %lld\n",
           rank, batchSize);

    query_result = adios_query_evaluate (q, boxsel, 0, batchSize);

[email protected]
int read_vars (int step)
{
    // initialize v1/v2 array
    // Read the data into v1, but only those portions that satisfied the query
    int timestep = 0;
    adios_query_read_boundingbox (f, q, v1name, timestep,
                                  query_result->nselections, query_result->selections, boxsel, v1);
    adios_query_read_boundingbox (f, q, v2name, timestep,
                                  query_result->nselections, query_result->selections, boxsel, v2);

    // read the original v1 completely and do manual evaluation
    v1m = (double *) malloc (ldims[0]*ldims[1]*sizeof(double));
    adios_schedule_read (f, boxsel, v1name, 0, 1, v1m);
    adios_perform_reads (f, 1);
    …

[email protected] 245 ADIOS Query API: Evaluation explained • ADIOS_SELECTION* outputBoundary • Query evaluation is relaxed • Evaluate the conditions on selections with the same shape • Typical corner case: evaluate the conditions on the same area of each variable

[email protected] Query on heat transfer example $ cd ~/Tutorial/heat_transfer edit heat_transfer.xml (vi, gedit) and set method to MPI $ mpirun -np 12 ./heat_transfer_adios1 heat 4 3 40 50 6 500 $ adios_index_fastbit heat.bp ==> index file is at: heat.idx - can also try this without creating the fastbit indexing… $ ls -l heat.* -rw-rw-r-- 1 adios adios 2436048 Dec 26 19:03 heat.bp -rw-rw-r-- 1 adios adios 3415173 Dec 26 19:16 heat.idx

$ cd query $ make $ mpirun -np 1 ./test_range ../heat.bp q.bp 150 250 NAN fastbit 1 1 $ bpls -l q.bp double T 6*{150, 160} = 150.012 / 249.949 / 197.987 / 28.5502 double dT 6*{150, 160} = 0.10170 / 0.66569 / 0.20220 / 0.12307 $ ./plot_query.sh q.bp $ eog .

[email protected] Try working with fastbit and the default of minmax

[email protected] mpirun -np 1 ./test_range ../heat.bp q.bp 2 9 NAN fastbit 1

[email protected] 249 Query on heat transfer example

[Figure: plots of T (T.0004) and dT (dT.0004) for the query 100 <= T <= 200]

[email protected] 250 Parallelism in query • BTW, test_range is a parallel code • compare previous sequential run with a parallel one

$ mpirun -np 3 ./test_range ../heat.bp q3.bp 150 250 NAN fastbit 3 1

• Each process uses a separate bounding box on which the query is evaluated

[email protected] Agenda • Day 1 • Reading 60 minutes • Introduction 55 minutes • Exercises 15 minutes • ADIOS APIs 20 minutes • Day 3 • Break 15 minutes • Review 10 minutes • Writing-XML 60 minutes • SKEL 20 minutes • BPLS 15 minutes • Write-NO 45 minutes • Exercises 15 minutes • Python 30 minutes • Day 2 • Break 15 minutes • Review 10 minutes • Staging 60 minutes • Schema 20 minutes • Visit 60 minutes • Plotter 15 minutes • Transforms 45 minutes • Break 15 minutes [email protected] Tutorial Overview

• Tutorial Information: http://visitusers.org/index.php?title=VisIt_Tutorial • Tutorial Prep: http://visitusers.org/index.php?title=Tutorial_Preparation • Example Datasets: http://visitusers.org/index.php?title=Tutorial_Data

[email protected] 253 VisIt Setup • VisIt binary: • /opt/visit/bin/visit • Tutorial data: • ADIOS files from tutorial: • /home/adios/Tutorial/heat_transfer/heat.bp • /home/adios/Tutorial/visit/data • For convenience, you might want to copy these to your home directory • Python scripts: • /home/adios/Tutorial/visit/scripts

[email protected] Project Introduction

• Production end-user tool supporting scientific and engineering applications. • Provides an infrastructure for parallel post-processing that scales from desktops to massive HPC clusters.

• Source released under a BSD Density Isovolume of a style license. 3K^3 (27 billion cell) dataset [email protected] Project Introduction VisIt supports a wide range of use cases.

Use cases: Data Exploration • Comparative Analysis • Quantitative Analysis • Visual Presentation • Graphics • Visual Debugging
[email protected] Project Introduction

Streamlines Vector / Tensor Glyphs Pseudocolor Rendering

Volume Rendering Molecular Visualization Parallel Coordinates [email protected] Project Introduction

Full Dataset 3072 sub-grids (27 billion total cells) (each 192x129x256 cells)

[email protected] Project Introduction

• The VisIt project started in 2000 to support LLNL’s large scale ASC physics codes. • The project grew beyond LLNL and ASC with research and development from DOE SciDAC and other efforts. • VisIt is now supported by multiple organizations: – LLNL, LBNL, ORNL, UC Davis, Univ of Utah, Intelligent Light, … • Over 75 person years of effort, 1.5+ million lines of code.

Timeline:
2000 – Project started
2003 – LLNL ASC users transitioned to VisIt
2005 – R&D 100 award
2006 – VACET funded
2008 – Transition to public SW repo
2010 – VisIt 2.0 release
2012 – 2017
[email protected] Project Introduction VisIt scales well on current HPC platforms.

Machine    Architecture   Problem Size           # of Cores
Graph      X86_64         20,001³ (8 T cells)    12K
Dawn       BG/P           15,871³ (4 T cells)    64K
Franklin   Cray XT4       12,596³ (2 T cells)    32K
JaguarPF   Cray XT5       12,596³ (2 T cells)    32K
Juno       X86_64         10,000³ (1 T cells)    16K
Franklin   Cray XT4       10,000³ (1 T cells)    16K
Ranger     Sun            10,000³ (1 T cells)    16K
Purple     IBM P5         8,000³ (0.5 T cells)   8K

Scaling Studies of Isosurface Extraction and Volume Rendering (2009)

VisIt is also used daily by domain scientists. [email protected] Project Introduction

• Mesh Types: – Point, Curve, 2D/3D Rectilinear, Curvilinear, Unstructured – Domain Decomposed, AMR – Time Varying • Fields: – Scalar, Vector, Tensor, Material volume fractions, Species

[email protected] Project Introduction

• Regular releases (~ 6 / year) – Executables for all major platforms – End-to-end build process script ``build_visit’’ • Customer Support and Training – visitusers.org, wiki for users and developers – Email lists: visit-users, visit-developers – Beginner and advanced tutorials – VisIt class with detailed exercises • Documentation – “Getting data into VisIt” manual – Python interface manual Slides from the VisIt class – Users reference manual [email protected] VisIt employs a parallelized client-server architecture. Local Components Parallel Cluster

[Diagram: local VisIt clients (GUI, CLI, Python, Java) talk to the VisIt Viewer; over a network connection, parallel VisIt Engines with Data Plugins on the cluster read the data (files or a simulation) and process it through a data-flow network of filters, communicating via MPI]
[email protected] Project Introduction

[Diagram: Tasks 1–4 render in parallel and their images are composited into the final composited image]
• Rendering Modes:
 – Local (hardware)
 – Remote (software or hardware)
• Beyond surfaces:
 – VisIt also provides scalable volume rendering.
[email protected] VisIt’s infrastructure provides a flexible platform for custom workflows.
• C++ Plugin Architecture
 – Custom File formats, Plots, Operators
 – Interface for custom GUIs in Python, C++ and Java
• Python Interfaces
 – Python scripting and batch processing
 – Data analysis via Python Expressions and Queries.
• Libsim library

 – Enables coupling of simulation codes to VisIt for in situ visualization.
[Diagram: the simulation, adaptor code, and libsim exchange data and control with the VisIt runtime]
• Research into ADIOS Staging
 – Enables coupling with simulation codes and data staging
[email protected] Project Introduction VisIt is used as a platform to deploy visualization research.

• Research Collaborations: 2006 – 2011
 – Scaling research: scaling to 10Ks of cores and trillions of cells.
 – Algorithms research: how to efficiently calculate particle paths in parallel.
 – Algorithms research: reconstructing material interfaces for visualization.
 – Methods research: how to incorporate statistics into visualization.
• Research Focus: 2012 – 2017
 – Next Generation Architectures
 – Parallel Algorithms
 – In-Situ Processing
[email protected] Project Introduction

• Everything works at scale • Robust, usable tool • Features that span the “power of visualization”: – Data Exploration – Confirmation – Communication • Features for different kinds of users: – Visualization Experts – Code Developers – Code Consumers

[email protected] Plotting Techniques Terminology

• Meshes: discretization of physical space • Contains “zones” / “cells” / “elements” • Contains “nodes” / “points” / “vertices” • VisIt speak: zone & node • Fields: variables stored on a mesh • Scalar: 1 value per zone/node • Example: pressure, density, temperature • Vector: 3 values per zone/node (direction) • Example: velocity • Note: 2 values for 2D, 3 values for 3D • More fields discussed later…

[email protected] Plotting Techniques Pseudocolor

• Maps scalar fields (e.g., density, pressure, temperature) to colors.


[email protected] Plotting Techniques Contour / Isosurface

[email protected] Plotting Techniques Volume rendering

[Diagram: rays cast from an emitter through the volume onto the film/image plane]
VisIt can combine volume rendering and opaque geometry
[email protected] Plotting Techniques Streamlines

[email protected] Data representation for mesh-based HPC simulations Meshes

Mesh Types

Curve • Rectilinear • Curvilinear • Unstructured • Points • Molecular
• All data in VisIt lives on a mesh
• Discretizes space into points and cells
• (1D, 2D, 3D) + time
• Mesh dimension need not match spatial dimension (e.g. 2D surface in 3D space)
• Provides a place for data to be located
• Defines how data is interpolated

[email protected] Data representation for mesh-based HPC simulations Variables

• Scalars, Vectors, Tensors
• Associated with points or cells of a mesh
 – Points: linear interpolation
 – Cells: piecewise constant
• Can have different dimensionality than the mesh (e.g. 3D vector data on a 2D mesh)
[Images: Cell Data, Point Data, Vector Data, Tensor Data]

[email protected] Data representation for mesh-based HPC simulations Materials

• Describes disjoint spatial regions at a sub-grid level • Volume/area fractions • VisIt will do high-quality sub-grid material interface reconstruction

[email protected] Data representation for mesh-based HPC simulations Species • Similar to materials, describes sub-grid variable composition

• Example: Material “Air” is made of species “N2”, “O2”, “Ar”, “CO2”, etc. • Used for mass fractions • Generally used to weight other scalars (e.g. partial pressure)

[email protected] Data representation for mesh-based HPC simulations Parallel Meshes

• Provides aggregation for meshes • A mesh may be composed of large numbers of mesh “blocks” • Allows data parallelism

[email protected] Data representation for mesh-based HPC simulations AMR meshes

• Mesh blocks can be associated with patches and levels • Allows for aggregation of meshes into AMR hierarchy levels

[email protected] Data representation for mesh-based HPC simulations

[email protected] Data representation for mesh-based HPC simulations

[email protected] VisIt’s Core Abstractions

[email protected] Example VisIt Pipelines VisIt’s core abstractions

• Databases: How datasets are read

• Plots: How you render data

• Operators: How you manipulate data

• Expressions: Mechanism for generating derived quantities

• Queries: How to access quantitative information

[email protected] Example VisIt Pipelines Examples of VisIt Pipelines

• Databases: how you read data
• Plots: how you render data
• Operators: how you transform/manipulate data
• Expressions: how you create new fields
• Queries: how you pull out quantitative information

Pipeline example:
 Database – Open a database, which reads from a file (example: open file1.hdf5)
 Plot – Make a plot of a variable in the database (example: Volume plot)

[email protected] Example VisIt Pipelines Examples of VisIt Pipelines

• Databases: how you read data
• Plots: how you render data
• Operators: how you transform/manipulate data
• Expressions: how you create new fields
• Queries: how you pull out quantitative information

Pipeline example:
 Database – Open a database, which reads from a file (example: open file1.hdf5)
 Operator – Apply an operator to transform the data (example: Slice operator)
 Plot – Plot a variable in the database (example: Pseudocolor plot)
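A minimal sketch of driving this Database → Operator → Plot pipeline from the VisIt Python CLI; the file path is the tutorial copy mentioned earlier, and the variable name "T" and the Slice operator mirror the slides and should be adapted to the actual data:

    # Launch with: /opt/visit/bin/visit -cli   (or -nowin -cli -s thisscript.py)
    OpenDatabase("/home/adios/Tutorial/heat_transfer/heat.bp")   # Database: read the file
    AddPlot("Pseudocolor", "T")                                  # Plot: render a variable
    AddOperator("Slice")                                         # Operator: transform the data
    DrawPlots()                                                  # execute the pipeline
    SaveWindow()                                                 # write the rendered image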

[email protected] 285

[email protected] 286

Navigate to ADIOS Data

[email protected] 287

[email protected] 288

[email protected] 289

• Time Slider

[email protected] 290

Change Plot Attributes: • Double click plot

[email protected] 291

Change Plot Attributes: • Double click plot • Modify color table • Modify min/max

[email protected] 292

• Change Variable

[email protected] Gradient Expression

gradient(T)[0]
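The same derived quantity can be defined from the VisIt Python CLI; a minimal sketch (the expression name dTdx is an arbitrary placeholder):

    # Derived scalar: x-component of the temperature gradient, as on the slide
    DefineScalarExpression("dTdx", "gradient(T)[0]")
    AddPlot("Pseudocolor", "dTdx")
    DrawPlots()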

[email protected] dx comparison: heat_none vs. heat_zfp

[email protected] 295 Comparison Expressions

[email protected] 296 [Screenshot: numbered steps for building a conn_cmfe(…) comparison expression in the Expressions window]

[email protected] 297

• This variable can now be used

[email protected] 298 Creating Expressions
[Screenshot: numbered steps for defining the expression T – zfp_T in the Expressions window]
[email protected] 299
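The comparison expression from the screenshots can likewise be created in the Python CLI; a minimal sketch, assuming the compressed dataset's temperature is exposed as zfp_T alongside T as on the slides (the name "diff" is arbitrary):

    # Difference between original and zfp-compressed/decompressed temperature
    DefineScalarExpression("diff", "T - zfp_T")
    AddPlot("Pseudocolor", "diff")
    DrawPlots()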

[email protected] [email protected] Services

• Execution Environments • On simulation nodes • Shared resources with simulation • Dedicated resources • On dedicated nodes (staging) • Post hoc services • Data Models • Codes are different • Limit data copy and conversion • Support data reductions • Integration with SIRIUS

[email protected] VTKm

[Diagram: VTK-m brings together the EAVL, DAX, and PISTON efforts]

[email protected]

Performance Portability

Architecture CPU GPU Phi ????

Algorithm Isosurface Threshold Streamline Render

[email protected] Performance Portability

Architecture CPU GPU Phi ????

VTK-m

Algorithm Isosurface Threshold Streamline Render

[email protected] VTKm Framework

[email protected] Limited Data Models in VTK

                              Point Arrangement
Cells          Coordinates    Explicit             Logical             Implicit
Structured     Strided        Structured Grid      Rectilinear Grid    Image Data
Structured     Separated      –                    –                   –
Unstructured   Strided        Unstructured Grid    –                   –
Unstructured   Separated      –                    –                   –

[email protected] Arbitrary Data Models in VTKm

[Diagram: the same cells / coordinates / point-arrangement matrix as on the previous slide, with every combination available – VTK-m is not limited to the fixed VTK grid types]

[email protected] Other Data Model Gaps Addressed in VTK-m

• Low/high dimensional data (7D GenASiS)
• Multiple cell groups in one mesh
• Multiple coordinate systems (lat/lon + XY)
• Non-physical data (graphs, sensor, etc.)
• Novel and hybrid mesh types (quadtree grid from MADNESS)
• Mixed topology (atoms+bonds)
[email protected] Data Models and Efficiency

Threshold operation

35 < density < 45

[email protected] Example: Memory and Algorithmic Efficiency

Threshold regular grid: 35 < density < 45

Traditional Data Model: fully unstructured grid (explicit points, explicit cells)
VTK-m Data Model: hybrid implicit/explicit grid (implicit points, explicit cells)
[email protected] Memory and Algorithmic Efficiency

VTK-m: ~7X memory advantage and ~5X performance advantage over the traditional data model (threshold example)

[email protected] VTK-m Services: Ray Casting

[email protected] VTK-m Services: Direct Volume Rendering

• 432x432x432 astrophysics data • 1024x1024 image

[email protected] VTK-m Services: Isosurface

[email protected] VTK-m Services: Surface Simplification

• “Lucy” dataset from Stanford • 1024x1024 image

[email protected]