ECP Continuous Integration Startup Tutorial

2020 ECP Annual Meeting

Houston, Texas, February 4, 2020

Goals

What do we hope you walk away with?

1. An understanding of current and planned GitLab CI capabilities

2. Facility points of contact and resources to get started today

3. Introduction to GitLab CI concepts

Agenda

● Introduction to HPC CI and GitLab
● Tutorial: Hello Environment [hands-on]
● GitLab Runner Enhancements
● Break
● HDF5 Group - CI Experience and Feedback
● Current Project Status and Ongoing Efforts
● Tutorial: Heat Equation [hands-on]
● Federation and Zero Trust Networking
● Secure Mirroring
● Recap and Path Forward
● Tutorial: Expanded Heat Equation [hands-on]

Introduction to HPC CI and GitLab

ECP Project: WBS 2.4.4 Deployment at Facilities - Software Integration

The Software Integration effort was established to bridge the ECP ST software development effort with the Exascale hardware and software environments deployed at the Facilities.

● Continuous Integration (CI) - Provide the ability to continuously test AD/ST software on facility hardware resources with software environments established at the Facility. This is key for software development teams targeting the systems being deployed, since development depends on an agile feedback loop.

Our Team

Site GitLab capabilities | CI runners on HPC resources | Account authentication | Test resources / support

● Ryan Adamson ● Noah Ginsburg ● David Nicholaeff ● Anthony Agelastos ● Don Maghrak ● Neil O’Neill ● Paul Bryant ● Mario Melara ● Rebel Powell ● Tiffany Connors ● Thomas Mendoza ● Ryan Prout ● Darel Finkbeiner ● Mark Miller ● Mahesh Rajan ● Jim Galarowicz ● Dave Montoya ● Kyle Shaver ● Todd Gamblin ● Jeff Neel ● Lance Vowell ● Chris West

Continuous Integration (CI) Refresher

• CI is the process of building / testing code for your project, best done before code is integrated into the repository or released for general usage.
  – https://about.gitlab.com/product/continuous-integration/
• As developers we most likely expect:
  – Customizable triggers (e.g. specific changes, schedules, targets, …)
  – Easy to set up and capable of using existing tests
  – Feedback (dashboard) identifying pass/fail cases
  – Access to required test resources
  – Resources that can handle code build requirements
• Many may already use tools such as Travis, Jenkins, or even cron jobs

Source: https://about.gitlab.com/product/continuous-integration/

CI Solution for Large-Scale HPC Facilities

Need: HPC centers need new security features within Continuous Integration systems to serve thousands of users with unique hardware and security requirements.

Response: Using GitLab as a basis, extend security, permissions, and auditing features while improving intra-Lab and inter-Lab accessibility to CI pipelines.

Automated testing is an enormous productivity multiplier

Projects cannot scale to accept many contributions without automation

[Figure: contribution workflow, with labels: External contributors; Code Repository; Free cloud testing and CI services; Automated testing on unique HPC resources; Code Review; Accepted Contributions]

• Cloud testing is a boon to open source projects
  – Travis, CircleCI, GitLab, AppVeyor, Azure, etc. provide free cycles!
• So-called Continuous Integration (CI) has been hard to implement at HPC centers due to security restrictions
  – Strict internal rules, lax security in most off-the-shelf products
• Lack of testing on key platforms will lead to less robust HPC software.

GitLab

• GitLab is a single platform for the entire DevOps lifecycle
  – Plan, create, verify, and release
• Core functionality (free)
  – Version control, collaboration, CI/CD, and documentation tools
• Open source and capable of self management
  – MIT License
• Additional tiers of functionality are only available via license
  – https://about.gitlab.com/pricing/self-managed/feature-comparison/

Source: https://docs.gitlab.com/ee/

GitLab - Features and Roadmap

GitLab is a continually developing project, with new features being added. However, it is important to note that not everything is available in the “core” (free) tier.

GitLab CI/CD

• GitLab offers robust support for CI/CD (Continuous Delivery)
• Primarily controlled via a ‘.gitlab-ci.yml’ file defined in your project repository
  – Defines the jobs that will be run, their structure, and what will trigger their potential execution
  – https://docs.gitlab.com/ee/ci/yaml/
• Terms:
  – Jobs: Core elements of CI, where commands are defined for execution
  – Stages: Defined order for job execution within a pipeline
  – Pipelines: Collection of jobs
• Pipelines can be triggered through a number of mechanisms, including: commits, manually, API, scheduling, merge requests, upstream projects, and more.
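
As a rough illustration of the terms above, a minimal `.gitlab-ci.yml` might look like the following sketch. The stage names, job names, and `make` commands are assumptions made for illustration; they are not taken from the tutorial repository.

```yaml
# Minimal .gitlab-ci.yml sketch (job names and commands are hypothetical).
stages:            # Stages run in the order listed here.
  - build
  - test

build-code:        # A job: the commands under `script` are what gets executed.
  stage: build
  script:
    - make         # Assumes the project builds with a Makefile.

run-tests:
  stage: test      # Starts only after every job in `build` has succeeded.
  script:
    - make test    # Assumes a `test` target exists.
```

Together the two jobs form one pipeline; by default, a push to the repository is enough to trigger it.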

GitLab Runner

• “GitLab Runner is the open source project that is used to run your CI/CD jobs and send the results back to GitLab”
• Using GitLab CI/CD requires management of a runner:
  – Application supports the execution and reporting of CI jobs on target systems
  – Installed on target test resources and registered to a GitLab instance
  – Responsible for the execution of scripts whose creation is dictated by both the user’s CI file as well as administrative configuration

Our Vision

Delivering a secure, easy-to-use CI solution to support Software Product testing against E6 HPC environments.

Tutorial: Hello Environment [hands-on]

Tutorial Resources

• Tutorial resources available for the hands-on sections
• Please review and sign the provided agreement
  – If you don’t have a login or have any questions please let us know
• Remember that resources may be limited in comparison to the number of attendees; please refrain from running jobs that would take a disproportionate amount of compute cycles.

Hello Environment - Results

GitLab CI Enhancements

Security Challenges and Use Cases

• All GitLab CI jobs run as the single gitlab-runner user
• Achieving any level of isolation between CI jobs requires:
  – 1) Targeting specific runners such as Docker or VMs
  – 2) Having each user/team manage their own runner
  – Neither of these is ideal
• Use cases we aim to support
  – Runner data isolation
  – Account specific resources
  – Sensitive data sets
  – Administrator management

Enhanced GitLab Functionality

• Developing and supporting enhancements to GitLab applications to ensure targeted HPC testing resources can be available to code teams
  – Integrating facility feedback to improve technical aspects and support logistical challenges.
• Runner capable of setuid, enforcing local permissions
  – Jobs are executed under a user’s local account
• Batch executor added to interface with Cobalt, LSF, and Slurm
  – A runner side configuration that dictates how a CI job will be executed.
• Federated cross site continuous integration
  – Previously auth endpoints were only used at login; now they have been expanded and integrated into the CI process as well.
• Existing GitLab functionality will continue to be supported
  – For instance, continued upstream development efforts and improvements to GitLab’s core CI/CD capabilities can be used with HPC focused enhancements

Batch Executor

• Execute the user defined script (.gitlab-ci.yml) at the job level by submitting it to the underlying scheduler
  – Comparable function to “Shell executor” and leverages setuid functionality
  – 1 Job = 1 Submission
• Able to provide parameters for the job similar to a command line interface
  – SCHEDULER_PARAMETERS: variable used to pass job submission arguments
  – Follows: https://docs.gitlab.com/ee/ci/variables/#priority-of-variables
• Simply leverage the underlying scheduler
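
A job might pass submission arguments through `SCHEDULER_PARAMETERS` roughly as sketched below. The job name, the runner tag, and the particular Slurm options are assumptions for illustration; substitute whatever your site's runner and scheduler expect.

```yaml
# Sketch of a job for a batch executor runner (names and values are hypothetical).
batch-hello:
  tags:
    - slurm                   # Hypothetical tag; sites choose their own runner tags.
  variables:
    # Arguments handed to the scheduler when the job is submitted.
    # These are standard sbatch options; adjust them for Cobalt or LSF.
    SCHEDULER_PARAMETERS: "--nodes=1 --time=00:05:00"
  script:
    - srun hostname           # Runs on the allocated compute resource.
```

Because this is an ordinary CI variable, a value set in the project's CI/CD settings or supplied when a pipeline is triggered takes precedence over the value written in the file, per the variable priority rules referenced above.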

Batch Executor with Setuid Enabled

[Screenshots: .gitlab-ci.yml and the resulting job output (GitLab web UI)]

Batch Executor - Behavior

Script element | Cobalt: where script runs | Cobalt: qrun ok? | LSF: where script runs | LSF: brun ok? | Slurm: where script runs | Slurm: srun ok?
before_script | launch/mom node | yes | launch node | yes | batch node | yes
script | launch/mom node | yes | launch node | yes | batch node | yes
after_script | login node | no | login node | no | login node | no

“Scripts specified in before_script are concatenated with any scripts specified in the main script, and executed together in a single shell.” - GitLab documentation
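
To make the behavior listed above concrete, the sketch below annotates where each script element of a job would land on a Slurm-backed batch executor. The job name, runner tag, and `./heat` executable are hypothetical, and the comments simply restate the Slurm entries above.

```yaml
# Sketch for a Slurm-backed batch executor (names are hypothetical).
heat-run:
  tags:
    - slurm                     # Hypothetical runner tag.
  before_script:
    - make                      # Batch node, same shell as `script`; assumes a Makefile that builds ./heat.
  script:
    - srun -n 4 ./heat          # Batch node; srun is usable because the script runs inside the allocation.
  after_script:
    - echo "job finished"       # Login node; do not launch srun from here.
```

Since `before_script` and `script` share one shell, anything exported in `before_script` is still visible to `script`, while `after_script` starts in a separate shell.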

What is the GitLab Runner Doing?

• The GitLab runner polls the server to identify available CI jobs
  – Will only begin executing a job once it has been scheduled by the server
• Jobs local to the runner result in generated Bash scripts based upon the contents of the .gitlab-ci.yml
  – Works similarly to upstream GitLab (https://docs.gitlab.com/runner/shells/index.html)
  – Scripts are executed by piping them into a non-interactive Bash login shell.
• Every step in a CI job is executed by a valid local user account
  – This is a key change with a setuid enabled runner
• Each user is provided with a clean login shell
  – Ensures runner environment does not compromise subsequent jobs
  – Users experience an environment similar to what they would see on a login node
  – Depending on your environment this may affect the results of your CI
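
A quick way to see which account and environment a job actually receives is a job that only reports on itself, in the spirit of the Hello Environment exercise. The sketch below is an assumption about what such a job could contain, not the tutorial's actual file.

```yaml
# Sketch of an environment-reporting job (not the tutorial's actual file).
hello-environment:
  script:
    - whoami        # With a setuid enabled runner this should be your local account.
    - hostname      # Shows which node the script is running on.
    - env | sort    # Environment as provided by the clean login shell.
```

Comparing this output with an interactive login on the same system is one way to spot environment differences before they affect a build.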

Setuid and GitLab Server Requirements

• Setuid feature enables GitLab to run bare-metal CI jobs securely on large, multi-tenant systems
  – All CI job actions are executed under the local user’s account with appropriate privileges observed
• Running a setuid enabled runner requires:
  – Root privileges - The runners require elevated privileges
  – Trust - GitLab server and the hosts of any setuid enabled runners trust each other
  – HTTPS - Ensure that communication is secure
• Trust between server and runner
  – We are requiring that the username on both systems align
  – This can be done for instance with GitLab’s support for LDAP
  – https://docs.gitlab.com/ee/administration/auth/ldap.html
• Though setuid is an important element, it’s not the only solution as you’ll see with Federation

Federation

• Goal of allowing cross site CI in a secure manner while still empowering site admins with the tools to ensure policy is enforced.
  – Site Identity Providers are used and also linked to specific runners
  – Valid site credentials must be present before a job can be run at any specific site.
• Enhanced identity model that establishes a connection between a site runner, auth provider, and the user’s session.
• GitLab OmniAuth as a first-class citizen:
  – “...Users can choose to sign in using any of the configured mechanisms.”
  – I.e. LDAP and standard GitLab login can still function, but many providers can be added
• Ongoing communication directly with GitLab to upstream enhancements.
• More details on zero-trust and the federated model will be covered in a later section

Spack pipeline leveraging GitLab CI

https://spack.readthedocs.io/en/latest/pipelines.html#pipelines

Break

3:30 - 4:00

ECP GitLab CI “friendly user” Experiences and Feedback

The HDF Group Larry Knox, Scot Breitenfeld, Allen Byrne, Elena Pourmal

ECP Annual Meeting February 3 – 7, 2020; Houston, TX

The HDF Group ECP Goals and Partners

• Enhance HDF5 to handle extreme scalability
  – Develop features to take advantage of available and coming system features like memory hierarchy and object store
• Provide support for HDF5 ECP applications
  – Help with I/O performance tuning
  – Recommend newly developed features
  – Resolve issues in HDF5 software
• Maintain HDF5 software
  – Regression testing and deployment of HDF5
  – CI testing on exascale systems
  – Integration of the features developed for ECP, bug fixes
• Partners: ExaIO, UnifyFS, DataLib, ADIOS2, …

The HDF Group Continuous Integration (CI) Testing Goals

• Improve current processes for routine CI testing of the HDF5 library across software configurations, expanding testing to new platforms, compilers, MPIs, and file systems.
• Improve reporting of metrics and identification of defects in public releases of HDF5 library code and in code destined for public releases.
• Coordinate CI testing efforts with ECP and other similar efforts to minimize duplication of effort.
• Identify changes in test plans and design that would improve testing processes and effectiveness.

CI/CD: Current status (2019-now)

● Run regular daily regression tests on several DOE Office of Science and NNSA lab production systems (e.g., Cori, Serrano, Eclipse, Quartz, Ray, Lassen, Chama, …)
● Use a single bash script, nnsa.sh
  ○ Uses CMake and CTest, and posts results on CDash, tabulated for each system, at: https://cdash.hdfgroup.org/index.php?project=HDF5
● Learned about using HDF5 with GitLab
  ○ Got access to OSTI GitLab in October 2019
  ○ Set up locally at THG
  ○ Status:
    – Runs on NMC p9
    – Runs on NERSC Cori
    – Setting up on ANL Theta

The HDF Group feedback: challenges and questions

• NMC p9
  – Tests run on NFS: slow, tests time out
  – No parallel FS: local file system and PFS are needed
• Cori
  – Need to make a choice of hardware for a successful run
  – Time for the “knl” test to run the 28 parallel tests is 10m 37s against 4m 29s for the “haswell” test
  – Unknown turnaround time on haswell

The HDF Group feedback: challenges and questions

• Debugging failures to run commands in the yml file is difficult
  – No access to build files
  – No access to system environment
  – A version of the yml file was added just to discover system information (e.g., available modules)
  – A tool to report system information would be very helpful (e.g., module information, FS information)
• We need to answer the following questions:
  – We have a pull mirror of our HDF5 repository in the OSTI GitLab server. If we create a trigger there, can that server launch tests on runners at NERSC, ANL and other labs?
  – Currently HDF5 has a .gitlab-ci.yml file in a branch of the HDF5 repository. In GitLab at NERSC, which cannot do pull mirroring, .gitlab-ci.yml is the only file in the repository, and it clones the HDF5 source directly from The HDF Group Bitbucket repository. It is expected that the clone can also be done from the OSTI mirror. In this setup every other GitLab server could have its own .gitlab-ci.yml file. Is this a good setup? The question of triggering also applies here if pulling from OSTI.

The HDF Group CI/CD goals for 2020

● Run with a single repository at OSTI: full build, serial and parallel runs on systems at NMC, NERSC, ANL and ORNL using different configurations (compilers and MPI I/O libraries) using Spack.
  ○ Run daily (if there is code change)
  ○ Run with different sets of compilers
  ○ Make results public on CDash (or TBD)

● Add HDF5 regression “performance tests” MACSio and IOR (ECP proxy miniapps).
  ○ When HDF5 snapshots are available (weekly)
  ○ At scale (TBD)
  ○ Make results public on CDash (or TBD)

● Test with a target set of ECP applications
  ○ When HDF5 snapshots are available (monthly)
  ○ Make results public on CDash (or TBD)
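
One way to approach the daily-run goal above is a job restricted to scheduled pipelines, with the schedule itself (for example, once a day) defined in the project's CI/CD settings. The sketch below assumes a hypothetical `ci/run_tests.sh` wrapper script and a hypothetical job name; it does not by itself skip runs when no code has changed.

```yaml
# Sketch of a job that runs only in scheduled pipelines (names and paths are hypothetical).
nightly-regression:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # True only for scheduled pipelines.
  script:
    - ./ci/run_tests.sh                          # Hypothetical wrapper around the build and test run.
```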

Thank you!

Current Project Status and Ongoing Efforts

Documentation

• Need to address ongoing requests for organized documentation
• https://ecp-ci.gitlab.io
  – Using GitLab Pages
  – Ongoing effort to provide valuable, up-to-date information
• Goal is not to replace upstream documentation but to highlight HPC focused enhancements and best practices.
• Site specific information organized using Confluence
  – https://confluence.exascaleproject.org/display/HISD/Getting+Started+with+CI

CI Integration and AD/ST Project Teams

• Effort includes the on-boarding of ECP AD/ST and E6 software projects onto the ECP/E6 CI infrastructure.
  – Derives capability from the CI Optimization effort and the CI Test Resources effort. It is focused on supporting software teams and helping them port to and utilize the ECP CI infrastructure.
• Goals:
  – Helping individual teams on-board
  – Developing documentation and training
  – Support development of tools to support CI teams
  – Identify limitations of CI infrastructure / resources and areas of potential growth
• Ready to support more projects at sites in a preview environment
  – We want to help teams get running locally at the sites
  – Assist in transition to the OSTI model down the road, if desired

Supported Deployment Models

• Enhancements to GitLab mean we are capable of supporting new workflows that aim to provide substantial value to development teams.
• Taking into account all currently supported enhancements, we can offer solutions that work not only internal to sites but also across instances
  – In our target federated instance, we will leverage the OSTI hosted central GitLab
• Ideal testing tiers defined to help plan level of support
  – We are not aiming to replace existing capabilities used by teams, especially those with strong community efforts, only to offer another potential avenue for testing.
• Though a central GitLab instance may not be openly available to projects yet, we want to encourage teams to use resources that are available internally at sites.

Deployment Models

Federated - Centralized (OSTI) Server

• Single, centrally managed GitLab Server instance provided and administered by the OSTI team.
• Federated runners at sites
• Multi-site run capabilities

Site - Local Site Server

• Separate and isolated GitLab Server provided and administered by the site’s Operations team.
• Local systems/resources capacity

Deployment Models - Contd.

Federated with Site Local - Hybrid

• Most likely to be the typical deployment model for DOE facilities
• Federated capabilities
• Control over Site deployment

CI Readiness

Site readiness for OSTI hosted CI with federated runners

Local CI MOU Auth Endpoint Federated(OSTI) CI

ORNL Available In-process In-process: 2/2020

NERSC Available In-process Complete In-process

ANL Available In-process In-process: 2/2020

LANL Available In-process In-process

LLNL Available In-process In-process

SNL Available In-process In-process

NMC In-process In-process: 1/2020 Complete

What’s available and Where?

Facility | CI Resources | Local Preview Environments
OLCF | Internal Open science GitLab EE; ECP Runners - HPC cluster | Ascent (Summit-like)
ALCF | Internal GitLab CE server VM; ECP Runners - HPC systems | Theta (users), Iota (test)
NERSC | Internal GitLab server VM EE; ECP Runners - HPC systems | Edison/Cori
LLNL | Internal GitLab server (accessible to LLNL HPC users); ECP Runners | Quartz, butte (P9 cluster)
LANL | Internal GitLab | CCS
OSTI | External GitLab EE for Federation |
Other | Internal GitLab CE; ECP Runners - x86, ARM, P9 systems |

https://confluence.exascaleproject.org/display/HISD/CI+At+the+Facilities

Allocations for Continuous Integration

● Build any CI needs into your allocation request
  ○ We want to avoid managing multiple allocations
  ○ Be prepared to defend your CI needs during the allocation request phase

● Typically, teams have scoped 5% to 25% of production allocation for CI
  ○ AD teams may need less than ST
  ○ Please pay attention to the cycle / value proposition on these scientific instruments

● Resources may or may not be allocated (Cori and Theta are allocatable, Ascent is not)

Tutorial: Heat Equation [hands-on]

Heat Equation - Results

Federation and Zero Trust Networking

Federation

• https://gitlab.com/gitlab-org/gitlab/issues/33665
  – Working directly with GitLab to upstream functionality
  – Community involvement encouraged

• https://youtube.com/watch?v=5mOnp26Ingo
  – San Francisco, 2020

Challenges of automation and security

• CI jobs require running arbitrary code
• CI jobs are typically run in Docker containers in multi-cloud environments, but…
• HPC requires bare metal performance
• HPC requires secure enclaves for CI jobs
• All pipelines must run as some user
  – I.e., zero trust
• Urgency of automation self-evident for multi-cloud environments

ECP CI Roles and Security Model

Technical challenges to federation

• How do we refine user access control for CI?
  – Make OmniAuth a first-class citizen in GitLab
  – Leverage credentials in the auth hash schema
  – Authorization code flow
  – Digital Identity Guidelines (NIST SP 800-63-3)
• How do we directly leverage HPC resources?
  – setUID and batch custom executors
• How do we handle the three-body administration problem?
  – Server, runner, identity provider administration
  – Federation audit events

OmniAuth and the auth hash schema

• “OmniAuth is a library that standardizes multi-provider authentication for web applications”
• GitLab OmniAuth as a first-class citizen:
  – “...Users can choose to sign in using any of the configured mechanisms.”
  – I.e. LDAP and standard GitLab login can still function, but many providers can be added

Digital Identity Guidelines

• NIST 800-63-3
  – https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-63-3.pdf
• Establishing a user is who they claim to be
  – Used to be a single Level of Assurance (LOA)
• Identity Assurance Level (IAL)
  – IAL1: No requirement to link the applicant to a specific real-life identity
  – IAL2: Verify applicant is associated with real-world identity
  – IAL3: Verified by an authorized and trained representative of the CSP
• Authenticator Assurance Level (AAL)
  – AAL1: Either single-factor or multi-factor authentication
  – AAL2: Two distinct authentication factors, through secure protocol
  – AAL3: Possession of a key through a cryptographic protocol

Federation Audit Events

• Taps directly into GitLab’s existing Audit Events infrastructure, via a Federation Events sidecar service
  – https://docs.gitlab.com/ee/administration/audit_events.html

FY20 Development - Path Forward

• Server
  – Upstream federation enhancements: https://gitlab.com/gitlab-org/gitlab/issues/33665
  – Current federation model addresses facility concerns; next step is to expand efforts to improve user (AD/ST project team) experience
  – Continue management of forked GitLab until key changes are upstreamed and sites migrated
• Runner
  – Address open support issues for current runner deployment
  – Implementation of custom executor
  – Identify and support upstreaming efforts wherever possible
• Concurrent Efforts
  – Generating and maintaining documentation for users and admins
  – Code scanning at the OSTI level
  – Commit-as-assurance to provide trusted signed commits

Recap and Path Forward

CI Readiness

Site readiness for OSTI hosted CI with federated runners

Local CI MOU Auth Endpoint Federated(OSTI) CI

ORNL Available In-process In-process: 2/2020

NERSC Available In-process Complete In-process

ANL Available In-process In-process: 2/2020

LANL Available In-process In-process

LLNL Available In-process In-process

SNL Available In-process In-process

NMC In-process In-process: 1/2020 Complete

Enhanced Runners

• Setuid capabilities to support security requirements
  – GitLab user triggering the job is associated with a local account
  – Execute CI pipelines under user’s account as opposed to privileged or system account.
• Batch executor introduced to interact with HPC hardware
  – Execute the user defined script (.gitlab-ci.yml) at the job level by submitting it as a job script to the underlying batch system
  – Leverages SetUID functionality
  – 1 CI Job == 1 Submission
• Federation, allowing cross site CI in a secure manner while still empowering site admins
  – Site Identity Providers are used and also linked to specific runners
  – Valid site credentials must be present before a job can be run at any specific site.

[Logos: Cobalt; Slurm (source: www.slurm.schedmd.com); LSF (source: community.ibm.com)]

Continued Development and Support

• Improvements to available documentation over the coming quarter
  – https://ecp-ci.gitlab.io
• Development and upstreaming efforts now spearheaded by the ECP CI team
  – Ongoing server and runner support
  – Migration to better support custom executor
• Future efforts include commit-as-assurance and code scanning
• Support for CI integration of AD/ST teams
  – Helping individual teams on-board
  – Developing documentation and training
  – Support development of tools to support CI teams
  – Identify limitations of CI infrastructure / resources and areas of potential growth

Points of Contact

• Collaboration across facilities and groups to offer assistance with:
  – Site internal GitLab capabilities
  – CI runners on HPC resources
  – Test resources / support
  – Development efforts
• Please get in touch!

Tutorial: Expanded Heat Equation [hands-on]

Expanded Heat Equation - Results

Discussion

Appendix

Setuid

• Setuid feature enables GitLab to run bare-metal CI jobs securely on large, multi-tenant systems
  – All CI job actions are executed under the local user’s account with appropriate privileges observed
• Process
  – 1. At start of job gather information on GitLab user
  – 2. Validate user’s account exists & can run on system
  – 3. Create local directories, enforcing proper ownership / rights
  – 4. Fork runner process so child process belongs to user
  – 5. Execute CI

[Screenshot: Job output (GitLab web UI)]

Batch Executor

• Fundamentally, the batch executor seeks to do the following:
  – Request allocation to execute build script
  – Monitor ongoing job state
  – Pipe output back to the GitLab server
  – If required, cancel job
• Execute the user defined script (.gitlab-ci.yml) at the job level by submitting it as a job script(s) to the available batch system
  – Comparable function to “Shell executor” and leverages setuid functionality
  – 1 Job = 1 Submission
• Goal is to ensure all upstream GitLab CI functionality can also be leveraged by the Batch Executor

Batch Executor - Cobalt

• All jobs are submitted to the queue manager with qsub
  – qsub: https://trac.mcs.anl.gov/projects/cobalt/wiki/qsub.1.html
• Both the output (--output) and error (--error) logs will be monitored
• For its duration, the build script has access to the requested allocation but runs on the mom node
  – Remember qrun must be used to execute on compute resources

Batch Executor - LSF

source: community.ibm.com

• Jobs are submitted to LSF using bsub
  – bsub: https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.2/lsf_command_ref/bsub.1.html
• All jobs are submitted as interactive (bsub -I)
  – Must use brun in your script to execute on compute resources
• Output is captured by simply piping stdout/stderr from the shell

Batch Executor - Slurm

source: www.slurm.schedmd.com

• Slurm integration revolves around submitting the job with sbatch
  – sbatch: https://slurm.schedmd.com/sbatch.html
• The runner will monitor the stdout/stderr via log file (--output)
• Entire build script is executed on the compute resource
  – Major difference from the other supported schedulers

Batch Executor - Behavior

Script element | Cobalt: where script runs | Cobalt: qrun ok? | LSF: where script runs | LSF: brun ok? | Slurm: where script runs | Slurm: srun ok?
before_script | launch/mom node | yes | launch node | yes | batch node | yes
script | launch/mom node | yes | launch node | yes | batch node | yes
after_script | login node | no | login node | no | login node | no

“Scripts specified in before_script are concatenated with any scripts specified in the main script, and executed together in a single shell.” - GitLab documentation
