ECP Continuous Integration Startup Tutorial
2020 ECP Annual Meeting
Houston, Texas February 4, 2020 Goals
What do we hope you walk away with?
1. An understanding of current and planned GitLab CI capabilities
2. Facility points of contact and resources to get started today
3. Introduction to GitLab CI concepts
2 Agenda
● Introduction to HPC CI and GitLab ● Tutorial: Hello Environment [hands-on] ● GitLab Runner Enhancements ● Break ● HDF5 Group - CI Experience and Feedback ● Current Project Status and Ongoing Efforts ● Tutorial: Heat Equation [hands-on] ● Federation and Zero Trust Networking ● Secure Mirroring ● Recap and Path Forward ● Tutorial: Expanded Heat Equation [hands-on]
3 Introduction to HPC CI and GitLab ECP Project: WBS 2.4.4 Software Deployment at Facilities - Software Integration
The Software Integration effort was established to bridge the ECP ST software development effort with the Exascale hardware and software environments deployed at the Facilities.
● Continuous Integration (CI) - Provide the ability to continuously test AD/ST software on facility hardware resources with software environments established at the Facility. Key for software development teams targeting systems being deployed agile feedback loop is key for development
5 Our Team
Site GitLab capabilities | CI runners on HPC resources | Account authentication | Test resources / support
● Ryan Adamson ● Noah Ginsburg ● David Nicholaeff ● Anthony Agelastos ● Don Maghrak ● Neil O’Neill ● Paul Bryant ● Mario Melara ● Rebel Powell ● Tiffany Connors ● Thomas Mendoza ● Ryan Prout ● Darel Finkbeiner ● Mark Miller ● Mahesh Rajan ● Jim Galarowicz ● Dave Montoya ● Kyle Shaver ● Todd Gamblin ● Jeff Neel ● Lance Vowell ● Chris West
6 Continuous Integration (CI) Refresher
• CI is the process of building / testing code for your project, best done before code is integrated into the repository or released for general usage. – https://about.gitlab.com/product/continuous-integration/ • As developers we most likely expect: – Customizable triggers (e.g. specific changes, schedules, targets, …) – Easy to set up and capable of using existing tests – Feedback (dashboard) identifying pass/fail cases – Access to required test resources – Resources than can handle code build requirements • Many may already use tools such as Travis, Jenkins, or even cron jobs
Source: https://about.gitlab.com/product/continuous-integration/
7 CI Solution for Large-Scale HPC Facilities
Need: HPC centers need new security Response: Using GitLab as a basis, extend features within Continuous Integration security, permissions, and auditing features systems to serve thousands of users with while improving intra-Lab and inter-Lab unique hardware and security requirements. accessibility to CI pipelines.
8 Automated testing is an enormous productivity multiplier Projects cannot scale to accept many contributions without automation
External contributors Code Review Code Free cloud testing and CI services
Automated testing on Accepted unique HPC resources Contributions
• Cloud testing is a boon to open source projects – Travis, CircleCI, GitLab, AppVeyor, Azure, etc. provide free cycles!
• So-called Continuous Integration (CI) has been hard to implement at HPC centers due to security restrictions – Strict internal rules, lax security in most off-the-shelf products Code Repository
• Lack of testing on key platforms will lead to less robust HPC software.
9 GitLab
• GitLab is a single platform for the entire DevOps lifecycle – Plan, create, verify, and release • Core functionality (free) – Version control, collaboration, CI/CD, and documentation tools • Open source and capable of self management – MIT License Source: https://docs.gitlab.com/ee/ • Additional tiers of functionality are only available via license – https://about.gitlab.com/pricing/self-manag ed/feature-comparison/
10 Gitlab - Features and Roadmap
GitLab is a continually developing project, with new features being added. However, it is important to note that not everything is available in the “core” (free) tier.
11 GitLab CI/CD
• GitLab offers robust support for CI/CD (Continues Delivery) • Primarily controlled via a ‘.gitlab-ci.yml’ file defined in your project repository – Defines the jobs that will be run, their structure, and what will trigger their potential execution – https://docs.gitlab.com/ee/ci/yaml/ • Terms: – Jobs: Core elements of CI, where commands are defined for execution – Stages: Defined order for job execution within a pipeline – Pipelines: Collection of jobs • Pipelines can be triggered through a number of mechanisms, including: commits, manually, api, scheduling, merge requests, upstream projects, and more...
12 GitLab Runner
• “GitLab Runner is the open source project that is used to run your CI/CD jobs and send the results back to GitLab” • Using GitLab CI/CD requires management of a runner:
– Application supports the execution and reporting of CI jobs on target systems – Installed on target test resources and registered to a GitLab instance – Responsible for the execution of scripts whose creation is dictated by both the user’s CI file as well as administrative configuration
13 Our Vision
Delivering a secure, easy-to-use CI solution to support Software
Product testing against E6 HPC environments.
14 15 Tutorial: Hello Environment
[hands-on] Tutorial Resources
• Tutorial resources available for the hands-on sections • Please review and sign the provided agreement – If you don’t have a login or have any questions please let us know • Remember resources may be limited in comparison to the number of attendees, refrain to running jobs that would take a disproportionate amount of compute cycles.
17 Hello Environment - Results
18 Hello Environment - Results
19 Hello Environment - Results
20 Hello Environment - Results
21 Hello Environment - Results
22 Hello Environment - Results
23 Hello Environment - Results
24 GitLab CI Enhancements Security Challenges and Use Cases
• All GitLab CI jobs run as the single gitlab-runner user • Achieving any level of isolation between CI jobs requires:
– 1) Targeting specific runners such as Docker or VMs – 2) Having each user/team manage their own runner – Neither of these are ideal • Use cases we aim to support
– Runner data isolation – Account specific resources – Sensitive data sets – Administrator management
26 Enhanced GitLab Functionality
• Developing and supporting enhancements to GitLab applications to ensure targeted HPC testing resources can be available to code teams – Integrating facility feedback to improve technical aspects and support logistical challenges. • Runner capable of setuid, enforcing local permissions – Jobs are executed under a users local account • Batch executor added to interface with Cobalt, LSF, and Slurm – A runner side configuration that dictates how a CI job will be executed. • Federated cross site continuous integration – Previously auth endpoints only used at login, now they have been expanded and integrated into the CI process as well. • Existing GitLab functionality will continue to be supported – For instance continued upstream development efforts and improved to GitLab’s core CI/CD capabilities can be used with HPC focused enhancements
27 Batch Executor
• Execute the user defined script (.gitlab-ci.yml) at the job level by submitting it the underlying scheduler – Comparable function to “Shell executor” and leverages setuid functionality – 1 Job = 1 Submission • Able to provide parameters for the job similar to a command line interface – SCHEDULER_PARAMETERS • Simply leverage the underlying Variable used to pass job submission argument scheduler Follows: https://docs.gitlab.com/ee/ci/variables/#priority-of-variables
28 Batch Executor with Setuid Enabled
.gitlab-ci.yml Job output (GitLab web UI)
29 Batch Executor - Behavior
Cobalt LSF Slurm
Script Location where qrun Location where brun Location where srun ok? Element script will run? ok? script will run? ok? script will run?
before_script launch/mom node yes launch node yes batch node yes
script launch/mom node yes launch node yes batch node yes
after_script login node no login node no login node no
“Scripts specified in before_script are concatenated with any scripts specified in the main script, and executed together in a single shell.” - GitLab documentation
30 What is the GitLab Runner Doing?
• The GitLab runner polls the server to identify available CI jobs – Will only begin executing a job once it has been scheduled by the server • Job local to runner result in generated Bash scripts based upon the contents of the .gitlab-ci.yml – Works similar to upstream GitLab (https://docs.gitlab.com/runner/shells/index.html) – Scripts are executed by piping them into a non-interactive Bash login shell. • Every step in a CI job is executed by a valid local user account – This is a key change with setuid enabled runner • Each user is provided with a clean login shell – Ensures runner environment does not compromise subsequent jobs – Users experience an environment similar to what they would see on a login node – Depending on your environment this may affect the results of your CI
31 Setuid and GitLab Server Requirements
• Setuid feature enables GitLab to run bare-metal CI jobs securely on large, multi-tenant systems – All CI job actions are executed under the local user’s account with appropriate privileges observed • Running a setuid enabled runner requires: – Root privileges - The runners require elevated privileges – Trust - GitLab server and the hosts of any setuid enabled runners trust each other – HTTPS - Ensure that communication is secure • Trust between server and runner – We are requiring that the username on both systems align – This can be done for instance with GitLab’s support for LDAP – https://docs.gitlab.com/ee/administration/auth/ldap.html • Though setuid is an important element, it’s not the only solution as you’ll see with Federation
32 Federation
• Goal of allowing cross site CI in a secure manner while still empowering site admins with the tools to ensure policy is enforced. – Site Identify Providers are used and also linked to specific runners – Valid site credentials must be present before a job could be run at any specific site. • Enhanced identity model that establishes a connection between a site runner, auth provider, and the user’s session. • Gitlab Omniauth as a first class citizen: – “...Users can choose to sign in using any of the configured mechanisms.” – I.e. LDAP and standard Gitlab login can still function, but many providers can be added • Ongoing communication directly with GitLab to upstream enhancements. • More details on zero-trust and the federated model will be detailed in a later section
33 34 Spack pipeline leveraging GitLab CI
https://spack.readthedocs.io/en/latest/pipelines.html#pipelines 35 Spack pipeline leveraging GitLab CI
https://spack.readthedocs.io/en/latest/pipelines.html#pipelines 36 Break
3:30 - 4:00 ECP GitLab CI “friendly user” Experiences and Feedback
The HDF Group Larry Knox, Scot Breitenfeld, Allen Byrne, Elena Pourmal
ECP Annual Meeting February 3 – 7, 2020; Houston, TX
exascaleproject.org The HDF Group ECP Goals and Partners Enhance HDF5 to handle extreme scalability Develop features to take advantage of available and coming system features like memory hierarchy and object store Provide support for HDF5 ECP applications Help with I/O performance tuning Recommend newly developed features Resolve issues in HDF5 software Maintain HDF5 software Regression testing and deployment of HDF5 CI testing on exascale systems Integration of the features developed for ECP, bug fixes Partners: ExaIO, UnifyFS, DataLib, ADIOS2, …
39 The HDF Group Continuous Integration (CI) Testing Goals Improve current processes for routine Cl testing of HDF5 library in software configurations expanding testing of new platforms, compilers, MPls, and file systems. Improve reporting of metrics and identification of defects in public releases of HDF5 library code and in code destined for public releases. Coordinate Cl testing efforts with ECP and other similar efforts to minimize duplication of effort Identify changes in test plans and design that would improve testing processes and effectiveness.
40 CI/CD: Current status (2019-now)
● Run regular daily regression tests on several DOE Office-of-science & NNSA-labs production system ( e.g: cori, Serrano, Eclipse, Quartz, Ray, Lassen, Chama, … ) ● Use a single bash script nnsa.sh – uses CMake, Ctest and posts results on Cdash tabulated for each system at: https://cdash.hdfgroup.org/index.php?project=HDF5 ● Learned about using HDF5 with GitLab ● Got access to OSTI Gitlab in October 2019 ● Set up locally at THG ● Status: ● Runs on NMC p9 ● Runs on NERSC Cori ● Setting up on ANL Theta
41 The HDF Group feedback: challenges and questions
NMC p9 Tests run on NFS Slow Tests time out No parallel FS Local file system and PFS are needed Cori Need to make a choice of hardware for successful run Time for the “knl” test to run the 28 parallel tests is 10m 37s against 4m 29 sec for the “haswell” test Unknown turnaround time on haswell
42 The HDF Group feedback: challenges and questions Debugging failures to run commands in yml file is difficult No access to build files No access to system environment Version of the yml file was added just to discover system information (e.g., available modules) The tool to report system information would be very helpful (e.g., module information, FS information) We need to answer the following questions: We have a pull mirror of our HDF5 bitbucket repository in the OSTI GitLab server. If we create a trigger there, can that server launch tests on runners at NERSC, ANL and other labs? Currently HDF5 has a .gitlab-ci.yml file in a branch of the HDF5 repository. In GitLab at NERSC which cannot do pull mirroring, .gitlab-ci.yml is the only file in the repository, and it clones the HDF5 source directly from The HDF Group bitbucket repository. It is expected that the clone can also be done from the OSTI mirror. In this setup every other GitLab server could have its own .gitlab-ci.yml file. Is this a good setup? The question of triggering also applies here if pulling from OSTI. 43 The HDF Group CI/CD goals for 2020
● Run with a single repository at OSTI, full build, serial and parallel runs at systems at NMC, NERSC, ANL and ORNL using different configurations (compilers and MPI I/O libraries) using Spack. ● Run daily (if there is code change) ● Run with different set of compilers ● Make results public on CDash (or TBD)
● Add HDF5 regression “performance tests” MACSio and IOR ( ECP proxy miniapps). ● When HDF5 snapshots are available (weekly) ● At scale (TBD) ● Make results public on CDash (or TBD)
● Test with a target set of ECP applications ● When HDF5 snapshots are available (monthly) ● Make results public on CDash (or TBD)
44 Thank you!
?
45 Current Project Status and Ongoing Efforts Documentation
• Need to address ongoing requests for organized documentation • https://ecp-ci.gitlab.io – Using GitLab Pages – Ongoing effort to provide value up to date information • Goal is not to replace upstream documentation but highlight HPC focused enhancements and best-practices. • Site specific information organized using Confluence – https://confluence.exascaleproject.org/display/HI SD/Getting+Started+with+CI
47 CI Integration and AD/ST Project Teams
• Effort includes the on-boarding of ECP AD/ST and E6 software projects onto the ECP/E6 CI infrastructure. – Derives capability from the CI Optimization effort and the CI Test Resources effort. It is focused on supporting software teams and helping them port and utilize the ECP CI infrastructure. • Goals: – Helping individual teams on-board and teams – Developing documentation and training – Support development of tools to support CI teams – Identify limitations of CI infrastructure / resources and areas of potential growth • Ready to support more projects at sites in a preview environment – We want to help teams to get running locally at the sites – Assist in transition to the OSTI model down the road, if desired
48 Supported Deployment Models
• Enhancements to GitLab mean we are capable of supporting new workflows that aim to provide substantial value to development teams. • Taking into account all currently supported enhancements we can offer solutions that work not only internal to sites but also across instance – In our target federated instance, we will leverage the OSTI hosted central GitLab • Ideal testing tiers defined to help plan level of support – We are not aiming to replace existing capabilities by teams, especially those with strong community efforts, only to offer another potential avenue for testing. • Though a central GitLab instance may not be available to projects openly yet we want to encourage teams to use resources that are available internally at sites.
49 Deployment Models
Federated - Centralized (OSTI) Sever Site - Local Site Server
• Single, centrally managed Gitlab Server instance provided and • Separate and isolated Gitlab Server provided and administered by the OSTI team. administered by the site’s Operations team. • Federated runners at sites • Local systems/resources capacity • Multi-site run capabilities
50 Deployment Models - Contd.
FederatedDeployment with Site Models - Contd. Local - Hybrid
• Most likely to be the typical deployment model for DOE facilities • Federated capabilities • Control over Site deployment
51 CI Readiness
Site readiness for OSTI hosted CI with federated runners
Local CI MOU Auth Endpoint Federated(OSTI) CI
ORNL Available In-process In-process: 2/2020
NERSC Available In-process Complete In-process
ANL Available In-process In-process: 2/2020
LANL Available In-process In-process
LLNL Available In-process In-process
SNL Available In-process In-process
NMC In-process In-process: 1/2020 Complete
52 What’s available and Where?
Facility CI Resources Local Preview Environments Internal Open science Gitlab EE OLCF ECP Runners - HPC cluster Ascent (summit like)
Internal GitLab CE server VM ALCF ECP Runners – HPC systems Theta(users), Iota(test)
Internal GitLab server VM EE NERSC ECP Runners - HPC systems Edison/Cori
Internal Gitlab server (accessible to LLNL HPC users) LLNL ECP Runners – Quartz, butte (P9 cluster)
LANL Internal GitLab CCS
OSTI External Gitlab EE for Federation Internal Gitlab CE Other ECP Runners - x86, ARM, P9 systems https://confluence.exascaleproject.org/display/HISD/CI+At+the+Facilities
53 Allocations for Continuous Integration
● Build any CI needs into your allocation quest ○ We want to avoid managing multiple allocations ○ Be prepared to defend your CI needs during the allocation request phase
● Typically, teams have scoped 5% to 25% of production allocation for CI ○ AD teams may need less than ST ○ Please pay attention to the cycle / value proposition on these scientific instruments
● Resources may or may not be allocated (Cori and Theta are allocatable, Ascent is not)
54 Tutorial: Heat Equation
[hands-on] Heat Equation - Results
56 Heat Equation - Results
57 Heat Equation - Results
58 Heat Equation - Results
59 Federation and Zero Trust Networking Federation
• https://gitlab.com/gitlab-org/gitlab/issues/33665 – Working directly with GitLab to upstream functionality – Community involvement encouraged
• https://youtube.com/watch?v=5mOnp26Ingo – San Francisco, 2020
61 Challenges of automation and security
• CI jobs require running arbitrary code • CI jobs are typically run in Docker containers in multi-cloud environments, but… • HPC requires bare metal performance • HPC requires secure enclaves for CI jobs • All pipelines must run as some user • I.e., zero trust • Urgency of automation self-evident for multi-cloud environments
62 ECP CI Roles and Security Model
63 Technical challenges to federation
• How do we refine user access control for CI? – Make OmniAuth a first-class citizen in GitLab – Leverage credentials in the auth hash schema – Authorization code flow – Digital Identity Guidelines (NIST SP 800-63-3) • How do we directly leverage HPC resources? – setUID and batch custom executors • How do we handle the three body administration problem? – Server, runner, identity provider administration – Federation audit events
64 OmniAuth and the auth hash schema
• “OmniAuth is a library that standardizes multi-provider authentication for web applications” • GitLab OmniAuth as a first-class citizen: – “...Users can choose to sign in using any of the configured mechanisms.” – I.e. LDAP and standard GitLab login can still function, but many providers can be added
65 66 Digital Identity Guidelines
• NIST 800-63-3 – https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-63-3.pdf • Establishing a user is who they claim to be – Used to be a single Level of Assurance (LOA) • Identity Assurance Level (IAL) – IAL1: No requirement to link the applicant to a specific real-life identity – IAL2: Verify applicant is associated with real-world identity – IAL3: Verified by an authorized and trained representative of the CSP • Authenticator Assurance Level (AAL) – AAL1: Either single-factor or multi-factor authentication – AAL2: Two distinct authentication factors, through secure protocol – AAL3: Possession of a key through a cryptographic protocol
67 Federation Audit Events
• Taps in directly to GitLab’s existing Audit Events infrastructure, via • Federation Events sidecar service • https://docs.gitlab.com/ee/adminis tration/audit_events.html
68 69 FY20 Development - Path Forward
• Server – Upstream federation enhancements: https://gitlab.com/gitlab-org/gitlab/issues/33665 – Current federation model address facility concerns, next step is to expand efforts to improve user (AD/ST project team) experience – Continue management of forked GitLab until key changes are upstreamed and sites migrated • Runner – Address open support issues for current runner deployment – Implementation of customer executor – Identify and support upstreaming efforts wherever possible • Concurrent Efforts – Generating and maintaining documentation for users and admins – Code scanning at the OSTI level – Commit-as-assurance to provide trusted signed commits
70 71 72 73 74 75 76 77 Recap and Path Forward CI Readiness
Site readiness for OSTI hosted CI with federated runners
Local CI MOU Auth Endpoint Federated(OSTI) CI
ORNL Available In-process In-process: 2/2020
NERSC Available In-process Complete In-process
ANL Available In-process In-process: 2/2020
LANL Available In-process In-process
LLNL Available In-process In-process
SNL Available In-process In-process
NMC In-process In-process: 1/2020 Complete
79 Enhanced Runners
• Setuid capabilities to support security requirements 27 Co – GitLab user triggering the job is associated with a local account Cobalt – Execute CI pipelines under user’s account as opposed to 58.933 privileged or system account. • Batch executor introduced to interact with HPC hardware – Execute the user defined script (.gitlab-ci.yml) at the job level by submitting it as a job script to the underlying batch system source: – Leverages SetUID functionality www.slurm.schedmd.com – 1 CI Job == 1 Submission • Federation, allowing cross site CI in a secure manner while still
empowering site admins source: community.ibm.com – Site Identify Providers are used and also linked to specific runners – Valid site credentials must be present before a job could be run at any specific site.
80 Continued Development and Support
• Improvements to available documentation over the coming quarter – https://ecp-ci.gitlab.io • Development and upstreaming efforts now spear headed by the ECP CI team – Ongoing server and runner support – Migration to better support custom executor • Future efforts include; Commit-as-assurance and code scanning • Support for CI integration of AD/ST teams – Helping individual teams on-board and teams – Developing documentation and training – Support development of tools to support CI teams – Identify limitations of CI infrastructure / resources and areas of potential growth
81 Points of Contact
• Collaboration across facilities and groups to offer assistance with:
– Site internal GitLab capabilities – CI runners on HPC resources – Test resources / support – Development efforts • Please get in touch!
82 Tutorial: Expanded Heat Equation
[hands-on] Expanded Heat Equation - Results
84 Expanded Heat Equation - Results
85 Expanded Heat Equation - Results
86 Expanded Heat Equation - Results
87 Expanded Heat Equation - Results
88 Expanded Heat Equation - Results
89 Expanded Heat Equation - Results
90 Expanded Heat Equation - Results
91 Discussion
92 Appendix Setuid
• Setuid feature enables GitLab to run bare-metal CI jobs securely on large, multi-tenant systems – All CI job actions are executed under the local user’s account with appropriate privileges observed • Process – 1. At start of job gather information on GitLab user – 2. Validate user’s account exists & can run on system – 3. Create local directories, enforcing proper ownership / rights – 4. Fork runner process so child process Job output (GitLab web UI) belongs to user – 5. Execute CI 94 Batch Executor
• Fundamelty the batch executors seeks to do the following: – Request allocation to execute build script – Monitor ongoing job state – Pipe output back to the GitLab server – If required, cancel job • Execute the user defined script (.gitlab-ci.yml) at the job level by submitting it as a job script(s) to the available batch system – Comparable function to “Shell executor” and leverages setuid functionality – 1 Job = 1 Submission • Goal is to ensure all upstream GitLab CI functionality can also be leveraged by the Batch Executor
95 27 Batch Executor - Cobalt Co
Cobalt 58.933
• All Jobs are submitted to the queue manage with qsub – qsub: https://trac.mcs.anl.gov/projects/cobalt/wiki/qsub.1.html • Both the output (--output) as well as error (--error) logs will be monitored • Duration of the build script will have access to the request allocation but ran on mom node – Remember qrun must be used to execute on compute resources
96 Batch Executor - LSF
source: community.ibm.com
• Jobs a submited to LSF using bsub – BSUB: https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.2/lsf_command_ref/bsub.1.html • All jobs are submitted as interactive (bsub -I) – Must use brun in your script to execute on compute resources • Simply piping stdout/stderr from the shell
97 Batch Executor - Slurm
source: www.slurm.schedmd.com
• Slurm integration revolves around submitting the job to sbatch – sbatch: https://slurm.schedmd.com/sbatch.html • The runner will monitor the stdout/stderr via log file (--output) • Entire build script is executed on the computer resource – Major differentiation between other supported schedulers
98 Batch Executor - Behavior
Cobalt LSF Slurm
Script Location where qrun Location where brun Location where srun ok? Element script will run? ok? script will run? ok? script will run?
before_script launch/mom node yes launch node yes batch node yes
script launch/mom node yes launch node yes batch node yes
after_script login node no login node no login node no
99 Batch Executor - Behavior
Cobalt LSF Slurm
Script Location where qrun Location where brun Location where srun ok? Element script will run? ok? script will run? ok? script will run?
before_script launch/mom node yes launch node yes batch node yes
script launch/mom node yes launch node yes batch node yes
after_script login node no login node no login node no
“Scripts specified in before_script are concatenated with any scripts specified in the main script, and executed together in a single shell.” - GitLab documentation
100