Oak Ridge National Laboratory Computing and Computational Sciences Directorate

Balancing Performance and Portability with Containers in HPC: An OpenSHMEM Example

Thomas Naughton, Lawrence Sorrillo, Adam Simpson, and Neena Imam
Oak Ridge National Laboratory

August 8, 2017

OpenSHMEM 2017 Workshop, Annapolis, MD, USA

ORNL is managed by UT-Battelle for the US Department of Energy

Introduction

• Growing interest in methods to support user-driven software customization in an HPC environment
  – Leverage isolation mechanisms & container tools

• Reasons
  – Improve productivity of users
  – Improve reproducibility of computational experiments

Containers

• What’s a Container?
  – “Knobs” to tailor the classic fork/exec model
  – A method for encapsulating an execution environment
  – Leverages (OS) isolation mechanisms
    • Resource namespaces & control groups (cgroups)

    • Example: a process gets an “isolated” view of running processes (PIDs), and a control group that restricts it to 50% of the CPU (see the sketch below)
• Container Runtime
  – Coordinates OS mechanisms
  – Provides streamlined (simplified) access for the user
  – Handles initialization of the “container” & interfaces to attach/interact
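
As a rough illustration (not from the original slide), the shell sketch below pairs a PID namespace with a cgroup v2 CPU cap of 50%; the cgroup name "demo" and the cgroup v2 layout are assumptions, and the commands require root.

# Minimal sketch, assuming a root shell, util-linux 'unshare', and the cpu controller
# enabled under a cgroup v2 mount at /sys/fs/cgroup.
mkdir -p /sys/fs/cgroup/demo
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max   # 50 ms of CPU per 100 ms period
echo $$ > /sys/fs/cgroup/demo/cgroup.procs          # place this shell (and its children) in the group
unshare --pid --fork --mount-proc /bin/bash         # child gets an isolated PID view via its own /proc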

Container Motivations

• Compute mobility
  – Build the application with its software stack and carry it to the compute resource
  – Customize the execution environment to end-user preferences

• Reproducibility
  – Run experiments with the same software & data
  – Archive the experimental setup for reuse (by self or others)

• Packaging & Distribution
  – Configure applications with possibly complex software dependencies using the packages that are “best” for the user
  – Benefits user and developer productivity

Challenges

• Integration into the HPC ecosystem
  – Tools & container runtimes for HPC

• Accessing high-performance resources
  – Take advantage of advanced hardware capabilities
  – Typically exposed as device drivers / user-level interfaces
    • Example: HPC NIC and its associated communication libraries

• Methodology and best practices
  – Methods for balancing portability & performance
  – Establish “best practices” for building, assembling, and using containers

Docker

• Container technology comprised of several parts
  – Engine: user interface, storage drivers, network overlays, container runtime
  – Registry: image server (public & private)
  – Swarm: machine and control (daemon) overlay
  – Compose: orchestration interface/specification
  (https://www.docker.com)
• Tools & specifications
  – Same specification format for all layers (consistency)
  – Rich set of tools for image management/specification
    • e.g., DockerHub, Dockerfile, Compose, etc. (a workflow sketch follows)
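
For context (not on the original slide), a minimal Docker workflow over these parts might look like the following; the image name, tag, and the /benchmarks path inside the image are hypothetical.

# Build an image from a Dockerfile in the current directory (hypothetical name/tag).
docker build -t myuser/graph500-oshmem:latest .
# Run a quick check inside the image (assumes the image populates /benchmarks).
docker run --rm myuser/graph500-oshmem:latest ls /benchmarks
# Push to a registry such as DockerHub (requires 'docker login').
docker push myuser/graph500-oshmem:latest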

Docker Container Runtime

• runc & containerd
  – Low-level runtime for container creation/management
  – Donated as the reference implementation for the OCI (2015)
    • Open Container Initiative: runtime & image specifications
    • Formerly libcontainer in Docker
  – Used with containerd as of Docker 1.12
  (a bundle sketch follows the diagram below)

[Diagram: containerd drives runc to create a Container Instance from a Container Image (rootfs + container.json) composed of image layers x, y, and z]
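
As a hedged illustration of the OCI bundle model these components implement (not from the slides), runc can launch a container directly from a rootfs plus a generated config.json; the bundle and container names below are hypothetical.

# Assemble a minimal OCI bundle; the rootfs here is exported from a Docker container.
mkdir -p mybundle/rootfs
docker export $(docker create ubuntu:17.04) | tar -C mybundle/rootfs -xf -
cd mybundle
runc spec                      # writes a default config.json into the bundle
sudo runc run demo-container   # creates and runs a container instance from the bundle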

Singularity Container Runtime

• Container runtime for “compute mobility”
  – Code developed “from scratch” by Greg Kurtzer
  – Started in early 2016 (http://singularity.lbl.gov)

• Created with HPC site/use awareness
  – Expected practices for HPC resources (e.g., network, launch)
  – Expected behavior for user permissions

• Adding features to gain parity with Docker
  – Importing Docker images (see the import sketch below)
  – Creating a Singularity-Hub
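
A hedged sketch of that Docker-import path, using the Singularity 2.x commands current at the time of this work (subcommands differ in later releases); the image name is hypothetical.

# Create an empty Singularity image (size in MiB) and import a Docker image into it.
sudo singularity create --size 2048 graph500-oshmem.img
singularity import graph500-oshmem.img docker://ubuntu:17.04
# Run a command inside the resulting image.
singularity exec graph500-oshmem.img cat /etc/os-release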

Docker vs. Singularity

• Software stack
  – Docker: several components (dockerd, swarm, registry, etc.); rich set of tools & capabilities
  – Singularity: small code base; follows more traditional HPC patterns (user permissions, existing HPC launchers)
• Networking
  – Docker: supports full network isolation
  – Singularity: no network isolation; supports pass-through of the “host” network
• Storage
  – Docker: full storage abstraction (e.g., devicemapper, AUFS)
  – Singularity: no storage abstraction

Docker vs. Singularity (continued)

• Images
  – Docker: advanced read/write capabilities (layers); copy-on-write (COW) for the “container layer”
  – Singularity: single image “file” (or ‘rootfs’ directory); no COW or container layers
• Image creation
  – Docker: specification file (Dockerfile)
  – Singularity: specification file; supports Docker image import
• Image sharing
  – Docker: DockerHub is the central location for sharing images
  – Singularity: SingularityHub emerging as the basis to build/share images

Example: MPI Application

• MPI – Docker
  – Run an SSH daemon in the containers
  – Span nodes using Docker networking (“virtual networking”)
  – Fully abstracted from the host
  – MPI entirely within the container

• MPI – Singularity
  – Run the SSH daemon on the host
  – Span nodes using the host network (no network isolation)
  – Support the network at the host/MPI-RTE layer
  – MPI split between host and container
  (see the launch sketch below)
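
A hedged sketch contrasting the two launch styles (the overlay network, host file, image, and application paths are hypothetical; the Singularity form mirrors the invocations shown later).

# Docker style: ranks live entirely inside containers that talk over a Docker
# overlay network (requires swarm mode and an sshd inside the image).
docker network create --driver overlay --attachable mpinet
docker run -d --network mpinet --name rank0 myuser/mpi-app:latest /usr/sbin/sshd -D

# Singularity style: the host MPI runtime launches ranks, and each rank starts the container.
mpirun -np 128 --hostfile hosts.txt \
    singularity exec graph500-oshmem.img /benchmarks/app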

Evaluation

• Objective: evaluate the viability of using the same image on a “developer” system and then on a “production” system
  – Use an existing OpenSHMEM benchmark

• Container image
  – Ubuntu 17.04
    • Recent release & not available on the production system
  – Open MPI’s implementation of OpenSHMEM
    • Directly available from Ubuntu (see the package sketch below)
  – Graph500 as the demonstration application
    • Using the OpenSHMEM port
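
For example, the OpenSHMEM-enabled Open MPI stack can be pulled straight from the Ubuntu archive inside the image; the package names below are an assumption about Ubuntu 17.04 and worth verifying.

# Inside the Ubuntu 17.04 image (package names assumed).
apt-get update
apt-get install -y build-essential git openmpi-bin libopenmpi-dev
which oshcc oshrun   # OSHMEM compiler/launcher wrappers provided by Open MPI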

Adapting Images

• Two general approaches
  a) Customize the image with the requisite software
  b) Dynamically load the requisite software at runtime

• Pro/Con
  – Over-customization may break portability
  – Loading at runtime may not be viable in all cases

Image Construction Procedure

• Create a Docker image for Graph500
  – Useful for initial testing on the development machine
  – Docker is not available on the production systems
• Create a Singularity image for Graph500
  – Directly import the Docker image, or
  – Bootstrap the image from a Singularity definition file
• Customize for production
  – We later found we also had to add a few directories for bind mounts on the production machine
  – Add a few environment-variable changes via the “/environment” file in Singularity 2.2.1
  – Add the ‘munge’ library for authentication
  (see the build sketch below)
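
A hedged end-to-end sketch of that procedure in Singularity 2.x syntax; the Dockerfile, image names, and the 'graph500.def' definition file contents are hypothetical.

# 1) Build and test the Docker image on the development machine, then push it
#    to a registry so it can be imported (or bootstrapped from) later.
docker build -t myuser/graph500-oshmem:latest .
docker push myuser/graph500-oshmem:latest

# 2) Alternatively, bootstrap a Singularity image from a definition file.
cat > graph500.def <<'EOF'
BootStrap: docker
From: ubuntu:17.04

%post
    apt-get update
    apt-get install -y build-essential git openmpi-bin libopenmpi-dev libmunge2 munge
    # Build the Graph500 OpenSHMEM port under /benchmarks here (source location omitted).
EOF
sudo singularity create --size 2048 graph500-oshmem.img
sudo singularity bootstrap graph500-oshmem.img graph500.def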

Cray Titan Image Customizations

• (Required) Directories for runtime bind mounts
  – /opt/cray
  – /var/spool/alps
  – /var/opt/cray
  – /lustre/atlas
  – /lustre/atlas1
  – /lustre/atlas2
• Other customizations for our tests
  – apt-get install libmunge2 munge
  – mkdir -p /ccs/home/$MY_TITAN_USERNAME
  – Edits to the “/environment” file (next slide)
  (the sketch below applies these edits to an existing image)
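
A hedged sketch of applying these customizations to the already-built image in place (Singularity 2.x; requires root on the build machine, and MY_TITAN_USERNAME is the placeholder from the slide).

# Modify the existing image in place; note that every such edit means re-uploading the full image.
sudo singularity exec --writable graph500-oshmem.img /bin/sh -c "
    mkdir -p /opt/cray /var/spool/alps /var/opt/cray /lustre/atlas /lustre/atlas1 /lustre/atlas2 &&
    apt-get update && apt-get install -y libmunge2 munge &&
    mkdir -p /ccs/home/$MY_TITAN_USERNAME
"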

Singularity /environment

# Appended to /environment file in container image

# On Cray, extend LD_LIBRARY_PATH with the host Cray libs
if test -n "$CRAY_LD_LIBRARY_PATH"; then
    export PATH=$PATH:/usr/local/cuda/bin
    # Add Cray-specific library paths
    export LD_LIBRARY_PATH=${CRAY_LD_LIBRARY_PATH}:/opt/cray/sysutils/1.0-1.0502.60492.1.1.gem/lib64:/opt/cray/wlm_detect/1.0-1.0502.64649.2.2.gem/lib64:/usr/local/lib:/lib64:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
fi

# On Cray, also add the host OMPI libs
if test -n "$CRAY_OMPI_LD_LIBRARY_PATH"; then
    # Add OpenMPI/2.0.2 libraries
    export LD_LIBRARY_PATH=${CRAY_OMPI_LD_LIBRARY_PATH}:${LD_LIBRARY_PATH}
    # Apparently these are needed on Titan
    export OMPI_MCA_mpi_leave_pinned=0
    export OMPI_MCA_mpi_leave_pinned_pipeline=0
fi

Setup

• Software
  – Graph500-OSHMEM (D’Azevedo 2015 [1])
  – Singularity 2.2.1
  – Ubuntu 17.04
  – Open MPI v2.0.2 with OpenSHMEM enabled (using oshmem: spml/yoda)
  – GCC 5.4.3
• Testbed Hardware
  – 4-node, 64-core testbed
  – 10GigE
• Production Hardware
  – Titan @ OLCF
    • 16 cores per node
    • Used 2-64 nodes in tests
  – Gemini network
  – Cray Linux Environment (CLE) 5.2 (Linux 3.0.x)

[1] E. D'Azevedo and N. Imam, Graph 500 in OpenSHMEM, OpenSHMEM Workshop 2015, http://www.csm.ornl.gov/workshops/openshmem2015/documents/talk5_paper_graph500.pdf

Test Cases

1. “Native” – the non-container case
   • Establishes a Graph500 baseline without any container layer

2. “Singularity” – container configured to use the host network
   • Leverages host communication libraries (Cray Gemini)
   • Injects host communication libraries into the container via LD_LIBRARY_PATH

3. “Singularity-NoHost” – case #2 minus the host libraries
   • Runs the standard container (self-contained communication libraries)
   • Unlikely to include high-performance communication libraries, since the containers are built on commodity machines (i.e., no Cray Gemini headers/libraries in the container)
   (one way to realize case 3 is sketched below)
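
Cases 1 and 2 correspond to the invocations on the next slide. One hedged way to realize case 3 is to keep the /environment logic (shown earlier) from injecting host libraries, for example by clearing the Cray variables before launch; the exact mechanism used in the study is not shown on the slides.

# Launch the container without host-library injection: with the Cray variables unset,
# the /environment tests fail and no host paths are appended to LD_LIBRARY_PATH.
env -u CRAY_LD_LIBRARY_PATH -u CRAY_OMPI_LD_LIBRARY_PATH \
    oshrun -np 128 --map-by ppr:2:node --bind-to core \
    singularity exec $IMAGE_FILE $EXE 20 16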

Example Invocations

Native (no-container)

EXE=$PROJ_DIR/naughton/graph500/mpi/graph500_shmem_one_sided

oshrun -np 128 --map-by ppr:2:node --bind-to core \
    $EXE 20 16

Singularity (container)

IMAGE_FILE=$PROJ_DIR/naughton/images-singularity/graph500-oshmem.img
EXE=/benchmarks/src/graph500-oshmem/mpi/graph500_shmem_one_sided

oshrun -np 128 --map-by ppr:2:node --bind-to core \
    singularity exec $IMAGE_FILE $EXE 20 16

Development Cluster – Graph500 BFS

[Chart: Development cluster (UB4), 2 hosts: Graph500 BFS mean_time (scale=20, edges=16, skip-validate) for 1-16 processes (1-8 per host); series ub4-native and ub4-singularity track closely, decreasing from roughly 30 s at 1 process to roughly 5 s at 16 processes]

• Roughly same performance
  – Native (no container) & Singularity
  – 2 hosts @ scale=20

Titan – Graph500 BFS

[Charts: Titan Graph500 BFS mean_time with the uGNI BTL, skip-validate. Left: scale=16, 2 hosts, 1-16 processes (1-8 per host). Right: scale=20, 64-128 processes on 16-64 hosts. Series titan-native and titan-singularity track closely in both panels]

• Roughly same performance
  – Native (no container) & Singularity
  – Left: 2 hosts @ scale=16 with the uGNI BTL
  – Right: 16 hosts & 64 hosts @ scale=20 with the uGNI BTL

Titan – Graph Construction & Generation

[Charts: Titan Graph500 construction_time (left, roughly 2.6-3.0 s) and graph_generation time (right, roughly 22.5-25.8 s), scale=20, edges=16, skip-validate, uGNI; 64 and 128 processes on 64 hosts. Series: titan-native, titan-singularity, titan-singularityNoHost; all three are roughly equal]

• Graph construction & generation times roughly the same
  – Native (no container), Singularity, and Singularity-NoHost
  – 64 hosts @ scale=20 with the uGNI BTL

Titan – Graph500 BFS mean_time

[Chart: Titan Graph500 BFS mean_time, scale=20, edges=16, skip-validate, uGNI; 64 and 128 processes on 64 hosts. Series: titan-native (roughly 23.2 / 9.0 s), titan-singularity (roughly 23.9 / 8.7 s), titan-singularityNoHost (roughly 1.1 / 1.1 s)]

• Much worse performance when using the host library (uGNI)
  – Native (no container), Singularity (with uGNI), Singularity-NoHost
  – 64 hosts @ scale=20 with the uGNI BTL

Titan – Native comparison

[Chart: Titan Graph500 BFS mean_time, scale=20, edges=16, skip-validate; 64 and 128 processes on 64 hosts. Series: titan-native (OMPI oshmem, roughly 23.2 / 9.0 s) vs. crayshmem-native (roughly 6.9 / 4.0 s)]

• OMPI-oshmem with uGNI (spml/yoda) performs worse than Cray SHMEM
  – All native (no containers)
  – Suggests something is wrong with our OMPI-oshmem uGNI setup

Titan – Graph500 BFS mean_time

[Chart: Titan Graph500 BFS mean_time, scale=20, edges=16, skip-validate, TCP (“no uGNI”); 64 and 128 processes on 64 hosts. Series: titan-native-NOugni, titan-singularity-NOugni, titan-singularityNoHost-NOugni; all three are roughly 0.9-1.2 s]

• Roughly the same when the uGNI BTL is disabled
  – Native (no uGNI), Singularity (no uGNI), Singularity-NoHost
  – 64 hosts @ scale=20 without the uGNI BTL
  (see the MCA sketch below)
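
As a hedged illustration (the slide does not list the exact parameters used), Open MPI components can be excluded via MCA parameters, so disabling the uGNI BTL for the container case would look roughly like:

# Exclude the uGNI BTL so communication falls back to TCP (and shared memory on-node);
# $IMAGE_FILE and $EXE follow the earlier example invocation.
oshrun -np 128 --map-by ppr:2:node --bind-to core \
    --mca btl ^ugni \
    singularity exec $IMAGE_FILE $EXE 20 16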

Observations (1)

• Graph500
  – Native & Singularity had roughly the same performance
    • With both the uGNI & TCP BTLs
  – Singularity-NoHost had better performance in our tests, due to problems with OMPI 2.0.2’s OSHMEM over uGNI

• Open MPI’s (v2.0.2) OpenSHMEM
  – Good: possibly the only OSHMEM included in a Linux distribution
  – Good: the general TCP BTL was stable for testing and showed decent performance in our Graph500 tests at scale=20 on 64 nodes
  – Bad: the Cray Gemini (uGNI) BTL is not stable with the OSHMEM interface (using MCA spml/yoda)

Observations (2)

• Singularity in production
  – Performance was consistent between native & Singularity
  – Required customizing the image with directories for bind mounts
  – Inconvenient to push the full image for every edit to the image
    • Example: changing the ‘/environment’ file required a full re-upload
    • Note: Singularity 2.3 may have improved this, e.g., no ‘sudo’ needed for “copy” and environment variables can be set via ‘SINGULARITYENV_xxx’
  – User-defined bind mounts were disabled on the older CLE kernel
  – Not able to use Cray SHMEM for the container case
    • Note: the image cannot be created on the system, and doing so would defeat the purpose of a portable container (it would not run on the development cluster)

Future Work

• Run with a revised OpenSHMEM configuration
  – Use a different component in Open MPI’s SPML framework (see the sketch after this list)
    • Disable Yoda (spml/yoda)
    • Enable UCX (spml/ucx)
  – Determine the root cause of the unexpected performance with uGNI

• Scale-up tests on the production system
  – Perform larger node/core count MPI and OSHMEM tests on Titan
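
A hedged sketch of the planned SPML selection via MCA parameters (assumes an Open MPI build with UCX support; not an invocation from the slides):

# Force the UCX SPML instead of Yoda for OpenSHMEM communication;
# $IMAGE_FILE and $EXE follow the earlier example invocation.
oshrun -np 128 --map-by ppr:2:node --bind-to core \
    --mca spml ucx \
    singularity exec $IMAGE_FILE $EXE 20 16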

Summary

• Containers in HPC
  – Motivations
    • Productivity of users
    • Reproducibility of experiments
  – Overview of two key container runtimes
    • Singularity & Docker
• Evaluation on Titan
  – Used the Graph500 benchmark to investigate the performance of an OpenSHMEM application with Singularity-based containers on a production machine
  – Identified a basic set of image edits for use on Titan
  – Consistent performance between Native and Singularity
    • Note: identified unexpected (unexplained) slowness when using the uGNI BTL

Acknowledgements

This work was supported by the United States Department of Defense (DoD) and used resources of the Computational Research and Development Programs at Oak Ridge National Laboratory.

Questions?
