Dr. Robert W. Wisniewski
Chief Software Architect, Extreme Scale Computing, Intel
November 15, 2017

Session Agenda and Objective

• OpenHPC
  • Value
  • Goals
  • Background
  • Update
• Intel's component work planned for submission to OpenHPC

Value from Community

A Shared Repository
• Stable HPC system software that:
  • Fuels a vibrant and efficient HPC software ecosystem
  • Removes duplication of effort throughout the community
  • Simplifies the installation, configuration, and ongoing maintenance of a custom software stack
  • Takes advantage of hardware innovation and drives revolutionary technologies
  • Eases traditional HPC application development and testing at scale
  • Provides a development environment for new workloads (ML, analytics, big data, cloud)

OpenHPC - Mission and Vision

Mission: to provide a reference collection of open-source HPC software components and best practices, lowering barriers to deployment, advancement, and use of modern HPC methods and tools.

Vision: OpenHPC components and best practices will enable and accelerate innovation and discoveries by broadening access to state-of-the-art, open-source HPC methods and tools in a consistent environment, supported by a collaborative, worldwide community of HPC users, developers, researchers, administrators, and vendors.

Recent article by Adrian Reber, member of the OpenHPC TSC and Red Hat:

https://opensource.com/article/17/11/openhpc

OpenHPC: A Brief History

• ISC'15 (June 2015): BoF on the merits of/interest in a community effort
• SC'15 (Nov 2015): Initial v1.0 release; gather interested parties to work with the Linux Foundation
• ISC'16 (June 2016): v1.1.1 release; Foundation announces technical leadership, founding members, and governance
• SC'16 (Nov 2016): v1.2 release; BoF
• ISC'17 (June 2017): v1.3.1 release; BoF
• SC'17 (Nov 2017): v1.3.3 release; BoF

Current Project Members

Mixture of academics, labs, and industry

WWW.OpenHPC.Community
Member participation interest? Please contact Jeff ErnstFriedman.

OpenHPC Stack Overview

OpenHPC v1.3.3 - Current S/W Components

Functional Areas and Components:
• Base OS: CentOS 7.4, SLES12 SP3
• Architecture: aarch64, x86_64
• Administrative Tools: Conman, Ganglia, Lmod, LosF, Nagios, pdsh, pdsh-mod-slurm, prun, EasyBuild, ClusterShell, mrsh, Genders, Shine, Spack, test-suite
• Provisioning: Warewulf, xCAT
• Resource Mgmt.: SLURM, Munge, PBS Professional, PMIx
• Runtimes: OpenMP, OCR, Singularity
• I/O Services: Lustre client, BeeGFS client
• Numerical/Scientific Libraries: Boost, GSL, FFTW, Hypre, Metis, Mumps, OpenBLAS, PETSc, PLASMA, ScaLAPACK, Scotch, SLEPc, SuperLU, SuperLU_Dist, Trilinos
• I/O Libraries: HDF5 (pHDF5), NetCDF/pNetCDF (including C++ and Fortran interfaces), Adios
• Compiler Families: GNU (gcc, g++, gfortran), Clang/LLVM
• MPI Families: MVAPICH2, OpenMPI, MPICH
• Development Tools: Autotools, cmake, hwloc, mpi4py, R, SciPy/NumPy, Valgrind
• Performance Tools: PAPI, IMB, mpiP, pdtoolkit, TAU, Scalasca, Score-P, SIONlib

Basic Cluster Install Example

• Starting install guide/recipe targeted for a flat hierarchy
• Leverages an image-based provisioner: Warewulf or xCAT
• PXE boot (stateful or stateless)
• Optionally connect an external Lustre* or BeeGFS parallel file system
• Needs hardware-specific information to support (remote) bare-metal provisioning

Target System Design

Large systems have a considerable number of Service Nodes (SNs):
– SMS – System Management Server
– Row/rack controllers
– I/O Nodes (IONs)
– Specialized servers for the Fabric Manager (FM), Workload Manager/Resource Manager (RM), and Database (DB)

For flexibility, the control system will target a "pool" of service nodes:
– i.e., the control system has a mechanism to prefer service nodes for efficiency, affinity, or for necessary characteristics when assigning a particular function
– But the control system has the flexibility to spread work over many nodes for performance and resiliency

For maximum application performance, Compute Nodes (CNs) are avoided for execution of control system software:
– Reduces noise; reduces footprint overhead on CNs
– Can leverage CNs when noise is acceptable or expected: job control operations (start, kill, etc.) and activity between jobs

This approach encourages Scalable Unit packaging concepts.

Focus on Core System Software Components

• mOS – Scalable operating systems

• Unified Control System – Unified, Productive (single pane of glass), Reliable

• DAOS – Distributed Asynchronous Object Store

• MPI – Scalable, high performance, topology optimized

• GEOPM – Global Extensible Open Power Manager

• PMIx – Process management with “Instant On”

mOS High-Level Architecture

• LWK performance for HPC applications
• Nimble to adapt to new technology
• Linux compatibility
• Better contained containers

Unified Control System

• Provide a comprehensive system view
• Advance the state of the art
• Build upon existing work
• Support the system lifecycle

[Architecture diagram: a Data Access Interface (DAI) framework (SCON, data-tier access, daemon management, resilient work items, etc.) with an Operator Interface (Web & REST) and providers for Inventory, RAS, Alerts, Environmental data (temperature, voltage, current, coolant flow), Configuration, Security, Service/Operations, Workload Manager, Fabric Manager, Provisioner, Monitor, and Low-Level Control, backed by an online tier (VoltDB), a nearline tier (PostgreSQL), and an offline tier for system data (ELK/Splunk). Underlying components (SLURM, Warewulf, OPA FM, Actsys, Sensys) are guided by DAI providers, but generally operate autonomously.]

DAOS – Distributed Asynchronous Object Store

• Scale-out object store designed from the ground up for next-gen storage & fabric technologies
  – High throughput/IOPS at arbitrary alignment/size
  – Byte addressable for better application scalability: no read-modify-write or false block sharing
  – OS bypass with lightweight client/server
  – Small memory footprint & low CPU usage
• Advanced storage API
  – New scalable storage model suitable for both structured & unstructured data
  – Non-blocking data & metadata operations
  – Metadata & data query/indexing
• Software-defined storage platform
  – Flexible storage provisioning
  – Predictable performance/capacity
  – Ease of management
• Seamless integration with Lustre
  – Single unified namespace
• Open source (Apache 2.0 license)

[Architecture diagram: traditional HPC and Big Data & AI applications (NetCDF, HDF5, SCR, FTI, VeloC, POSIX I/O, MPI-IO, DataSpaces, Spark RDD, Apache Arrow, (No)SQL) layered over DAOS, which targets NVRAM/NVMe/HDD storage and builds on Argobots for its thread model and Mercury for function shipping.]

MPICH-OFI

• Open-source implementation based on MPICH
• Uses the new CH4 infrastructure
• Co-designed with the MPICH community
• Targets existing and new fabrics via the next-gen Open Fabrics Interface (OFI)
  • Ethernet/sockets, Intel® Omni-Path, Cray Aries*, IBM BG/Q*, InfiniBand*
• Improving performance
  • Topology-aware collectives and process mapping
  • Optimizations for newer networks
  • Thread performance enhancements
• Support for automated configuration
• Added support for latest libfabric 1.5 features
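Since several MPI families ship in the stack, a generic smoke test is a useful reference point for the launch, wireup, and collective paths that MPICH-OFI (like OpenMPI and MVAPICH2) must provide. The sketch below uses only standard MPI calls and is not MPICH-specific.

```c
/* Minimal MPI smoke test: init, rank/size query, one collective, finalize.
 * Generic MPI code; works with any of the MPI families in the stack
 * (MPICH/MPICH-OFI, OpenMPI, MVAPICH2). Build with e.g.: mpicc hello_mpi.c
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simple collective: sum of ranks, reduced to rank 0. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ran with %d ranks, sum of ranks = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}
```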

GEOPM – Global Extensible Open Power Manager

• Power-aware runtime for in-band power management and optimization
• On-the-fly monitoring of hardware counters and application profiling
• Feedback-guided optimization of hardware control knob settings
• Open-source software (flexible BSD three-clause license)
• Extensible and portable through a plugin architecture
• Enables portability beyond x86 architectures (truly open)
• Enables rapid prototyping of new power management strategies
• Accommodates the reality that different sites have different constraints and preferences for performance vs. energy savings
• Designed for holistic optimization across a whole job
• Job-wide global optimization of HW control knob settings
• Application awareness for maximum speedup or energy savings
• Scalable via a distributed tree-hierarchical design and algorithms
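A minimal sketch of how an application might expose phases to this runtime, assuming the geopm_prof_* markup calls and the GEOPM_REGION_HINT_COMPUTE constant from the GEOPM C API of this era (geopm.h); the region name, loop counts, and solver kernel are placeholders, and the exact header and constant names should be checked against the installed GEOPM release.

```c
/* Hedged sketch: marking application regions for GEOPM's runtime.
 * Assumes the geopm_prof_* profiling API (region create/enter/exit plus
 * an epoch marker per outer iteration); verify names against the
 * installed GEOPM version.
 */
#include <geopm.h>   /* assumed header for the geopm_prof_* markup API */
#include <mpi.h>
#include <math.h>
#include <stdint.h>

/* Stand-in compute kernel so the example is self-contained. */
static double solver_step(double x)
{
    for (int i = 0; i < 1000000; ++i)
        x = sqrt(x + 1.0);
    return x;
}

int main(int argc, char **argv)
{
    uint64_t rid = 0;
    double x = 2.0;

    MPI_Init(&argc, &argv);

    /* Register a named region with a compute hint (assumed constant name). */
    geopm_prof_region("solver", GEOPM_REGION_HINT_COMPUTE, &rid);

    for (int step = 0; step < 100; ++step) {
        geopm_prof_epoch();      /* mark one outer iteration of the run */
        geopm_prof_enter(rid);   /* time/energy attributed to "solver" */
        x = solver_step(x);
        geopm_prof_exit(rid);
    }

    MPI_Finalize();
    return (x > 0.0) ? 0 : 1;    /* keep the compiler from optimizing away */
}
```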

[Architecture diagram: a GEOPM Root below the RM/scheduler, with GEOPM Aggregators and GEOPM Leaves beneath it, one leaf per group of MPI ranks/processors; the GEOPM controller communicates over an MPI comms overlay and a shared-memory (SHM) region, and accesses MSRs via msr-safe (or other drivers for non-x86 platforms).]

Project URL: http://geopm.github.io/geopm
Contact: [email protected]

What is PMIx?

[Timeline, 2015-2017: PMI-1/PMI-2 provided wireup support, dynamic spawn, and keyval publish/lookup; years go by, exascale systems appear on the horizon, launch times are long, and new paradigms emerge; PMIx v1.2 and then PMIx v2.x are adopted across resource managers (SLURM, ALPS, JSM, others) and programming models (MPICH, OMPI, Spectrum, OSHMEM, SOS, PGAS, others), targeting exascale launch in < 30s and then < 10s, plus orchestration.]
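The key-value exchange that PMIx standardizes for wireup looks roughly like the client sketch below, using calls named in the PMIx standard (PMIx_Init/Put/Commit/Fence/Get/Finalize). The key "my.endpoint" and the uint64 payload are illustrative stand-ins, and struct/macro details can differ between PMIx v1.2 and v2.x.

```c
/* Hedged sketch of the PMIx business-card exchange (wireup) pattern:
 * each process Puts a key, Commits, Fences, then Gets a peer's key.
 * Error handling and real endpoint packing are abbreviated.
 */
#include <pmix.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *rval = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0))
        return 1;

    /* Publish this rank's "address" under an illustrative job-local key. */
    val.type = PMIX_UINT64;
    val.data.uint64 = 0x1234;                 /* stand-in for endpoint info */
    PMIx_Put(PMIX_GLOBAL, "my.endpoint", &val);
    PMIx_Commit();

    /* Collective exchange across the caller's namespace. */
    PMIx_Fence(NULL, 0, NULL, 0);

    /* Look up a peer's key (rank 0 here, for illustration). */
    PMIX_PROC_CONSTRUCT(&peer);
    strncpy(peer.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    peer.rank = 0;
    if (PMIX_SUCCESS == PMIx_Get(&peer, "my.endpoint", NULL, 0, &rval)) {
        printf("rank %u got 0x%llx from rank 0\n",
               (unsigned)myproc.rank,
               (unsigned long long)rval->data.uint64);
        PMIX_VALUE_RELEASE(rval);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```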

Three Distinct Entities

• PMIx Standard
  – Defined set of APIs and attribute strings
  – Nothing about implementation
• PMIx Reference Library
  – A full-featured implementation of the Standard
  – Intended to ease adoption
• PMIx Reference Server
  – Full-featured "shim" to a non-PMIx RM

Backup

• Hierarchical Overlay for OpenHPC software

OpenHPC Development Infrastructure

• The 'usual' software engineering stuff:
  • GitHub (SCM and issue tracking/planning)
  • Continuous integration (CI) testing (Jenkins)
  • Documentation (LaTeX)

• Capable build/packaging system
  • At present, we target a common delivery/access mechanism that leverages Linux sysadmin familiarity
• Requires a flexible system to manage builds
  • A system using Open Build Service (OBS) supported by a back-end git

OpenHPC Build System: OBS

• Manages the build process
• Drives builds for multiple repositories
• Repeatable builds
• Generates binary and source RPMs
• Publishes corresponding package repositories
• Client/server architecture supports distributed build slaves and multiple architectures

OpenHPC Integration/Testing/Validation

• Install recipes
• Cross-package interaction
• Development environment
• Mimic use cases common in HPC deployments
• Upgrade mechanism

OpenHPC Integration/Test/Validation

• Standalone integration test infrastructure
• Families of tests that could be used during:
  • The initial install process (can we build a system?)
  • The post-install process (does it work?)
• Developing tests that touch all of the major components (can we compile against 3rd-party libraries, will they execute under the resource manager, etc.?)
• Expectation is that each new component included will need corresponding integration test collateral (an illustrative example follows below)
• Integration tests are included in the GitHub repo
• The global testing harness includes a number of embedded subcomponents
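For illustration only: component test collateral can be as small as a program that compiles against a stack-provided third-party library and asserts a known value. The GSL check below is a hypothetical example of that pattern, not a test taken from the actual OpenHPC suite; under the resource manager it would simply be launched as a job and its exit code checked.

```c
/* Hypothetical integration-test collateral: link a stack-provided
 * third-party library (GSL here) and verify a known reference value.
 * Exit status 0 = PASS, non-zero = FAIL, as a test harness would expect.
 */
#include <gsl/gsl_sf_bessel.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* J0(5.0) reference value (as quoted in the GSL manual example). */
    const double expected = -0.1775967713143383;
    double got = gsl_sf_bessel_J0(5.0);

    if (fabs(got - expected) > 1e-12) {
        fprintf(stderr, "FAIL: J0(5.0) = %.17g\n", got);
        return 1;
    }
    printf("PASS: gsl_sf_bessel_J0(5.0) = %.17g\n", got);
    return 0;
}
```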