Merlin Documentation
Release 1.7.7
MLSI
Jan 07, 2021

Contents

1 Merlin Overview
    1.1 Tutorial
    1.2 Getting Started
    1.3 FAQ
    1.4 Command line
    1.5 Workflows
    1.6 Workflow Specification
    1.7 Configuration
    1.8 Variables
    1.9 Celery
    1.10 Virtual environments
    1.11 Spack
    1.12 Contributing
    1.13 Docker

Merlin is a tool for running machine learning based workflows. The goal of Merlin is to make it easy to build, run, and process the kinds of large-scale HPC workflows needed for cognitive simulation.

CHAPTER 1 Merlin Overview

Merlin is a distributed task queuing system, designed to allow complex HPC workflows to scale to large numbers of simulations (we’ve done 100 million on the Sierra supercomputer). Why would you want to run that many simulations? To become your own Big Data generator.

Data sets of this size can be large enough to train deep neural networks that can mimic your HPC application, to be used for such things as design optimization, uncertainty quantification and statistical experimental inference. Merlin has been used to study inertial confinement fusion, extreme ultraviolet light generation, structural mechanics and atomic physics, to name a few.

How does it work?
In essence, Merlin coordinates complex workflows through a persistent external queue server that lives outside of your HPC systems, but that can talk to nodes on your cluster(s). As jobs spin up across your ecosystem, workers on those allocations pull work from the central server, which coordinates the task dependencies for your workflow. Since this coordination happens via direct connections to the workers (i.e. not through a file system), your workflow can scale to very large numbers of workers, which means a very large number of simulations with very little overhead.

Furthermore, since the workers pull their instructions from the central server, you can do a lot of other neat things, like having multiple batch allocations contribute to the same work (think surge computing), or specializing workers to different machines (think CPU workers for your application and GPU workers that train your neural network). Another neat feature is that these workers can add more work back to the central server, which enables a variety of dynamic workflows, such as those needed for intelligent sampling of design spaces or reinforcement learning tasks.

Merlin does all of this by leveraging some key HPC and cloud computing technologies, building off open source components. It uses maestro to provide an interface for describing workflows, as well as for defining workflow task dependencies. It translates those dependencies into concrete tasks via celery, which can be configured for a variety of backend technologies (rabbitmq and redis are currently supported). Although not a hard dependency, we encourage the use of flux for interfacing with HPC batch systems, since it can scale to a very large number of jobs. The integrated system looks a little something like this:

[Figure: diagram of the integrated Merlin system]

For more details, check out the rest of the documentation.

Need help?
[email protected]

1.1 Tutorial

Estimated time

• 3 hours

Grab your laptop and coffee, and dive into this 7-module tutorial to become a Merlin expert. This hands-on tutorial introduces Merlin through some example workflows. In it, you will install Merlin on your local machine, stand up a virtual server, and run both a simple workflow and a quasi-real-life physicsy simulation that couples a physics application with visualization and machine learning. You’ll also learn how to use some advanced features and help make Merlin better. Finally, we offer some tips and tricks for porting and scaling up your application.

1.1.1 0. Before you start

It will be helpful to have these steps already completed before you start the tutorial modules:

• Make sure you have python 3.6 or newer.
• Make sure you have GNU make tools and compilers.
• Install docker.
• Download the OpenFOAM image with:

    docker pull cfdengine/openfoam

• Download the redis image with:

    docker pull redis

1.1.2 Introduction

This module introduces you to Merlin, some of the technology behind it, and how it works.

Prerequisites

• Curiosity

Estimated time

• 20 minutes

You will learn

• What Merlin is and why you might consider it
• Why it was built and what some target use cases are
• How it is designed and what the underlying tech is

Table of Contents:

• What is Merlin?
• Why Merlin? What’s the need?
• How can Merlin run so many simulations?
• So what exactly does Merlin do?
• How is it designed?
• What is in this Tutorial?

What is Merlin?

Summary

Merlin is a toolkit designed to enable HPC-focused simulation workflows with distributed cloud compute technologies. This helps simulation workflows push to immense scale. (Like 100 million.)

At its core, Merlin translates a text-based, command-line focused workflow description into a set of discrete tasks. These tasks live on a centralized broker (e.g. a separate server) that persists outside of your HPC batch allocation.
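To make the idea of a "text-based workflow description" concrete, here is a minimal sketch in the maestro-style YAML that Merlin consumes. Treat the step and study names below as illustrative assumptions, not a definitive reference; the real schema is covered in the Workflow Specification section.

```yaml
description:
    name: hello_sketch
    description: A toy study that runs one shell command per step.

study:
    - name: run_sim
      description: Run one (hypothetical) simulation instance.
      run:
          cmd: echo "running simulation"
```

Each entry under study becomes a discrete task on the broker, which is what lets workers elsewhere pick the work up asynchronously.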
Autonomous workers in different allocations (even on different machines) can then connect to this server, pulling and executing these tasks asynchronously.

Why Merlin? What’s the need?

That sounds complicated. Why would you care to do this?

The short answer: machine learning.

The longer answer: machine learning and data science are becoming an integral part of scientific inquiry. The problem is that machine learning models are data hungry: it takes lots and lots of simulations to train machine learning models on their outputs. Unfortunately, HPC systems were designed to execute a few large hero simulations, not many smaller simulations. Naively pushing standard HPC workflow tools to hundreds of thousands or millions of simulations can lead to some serious problems.

Workflows, applications and machines are becoming more complex, but subject matter experts need to devote time and attention to their applications and often require fine command-line level control. Furthermore, they rarely have the time to devote to learning workflow systems.

With the expansion of data-driven computing, the HPC scientist needs to be able to run more simulations through complex multi-component workflows. Merlin targets HPC workflows that require many simulations.
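The pull model described above can be sketched, purely conceptually, with a plain producer/consumer queue. This toy uses only the Python standard library and is not Merlin's actual API (Merlin builds on celery with a rabbitmq or redis broker); all names here are hypothetical.

```python
import queue
import threading

# Stand-in for the central broker: tasks wait here until a worker pulls them.
broker = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    """Pull tasks until the broker hands back a shutdown sentinel."""
    while True:
        task = broker.get()
        if task is None:  # sentinel: no more work for this worker
            broker.task_done()
            break
        with results_lock:
            results.append(f"ran simulation {task}")
        broker.task_done()

# Producer side: queue up four "simulations".
for sample in range(4):
    broker.put(sample)

# Two workers drain the queue concurrently, like workers on two allocations.
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for _ in threads:
    broker.put(None)  # one sentinel per worker
for t in threads:
    t.join()

print(len(results))  # prints 4: every queued task was pulled by some worker
```

The point of the sketch is that neither worker knows or cares which tasks it gets; capacity scales by adding workers, exactly the property Merlin exploits across batch allocations.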
Merlin’s targeted use cases include:

Table 1: Merlin Targeted Use Cases

• Emulator building: Running enough simulations to build an emulator (or “surrogate model”) of an expensive computer code, such as is needed for uncertainty quantification
• Iterative sampling: Executing some simulations and then choosing new ones to run based on the results obtained thus far
• Active learning: Iterative sampling coupled with emulator building to efficiently train a machine learning model
• Design optimization: Using a computer code to optimize a model design, perhaps robustly or under uncertainty
• Reinforcement learning: Building a machine learning model by successively exposing it to lots of trials, giving it a reward/penalty for the outcomes of those trials
• Hierarchical simulation: Running low-fidelity simulations to inform which higher-fidelity simulations to execute
• Heterogeneous workflows: Workflows that require different steps to execute on different hardware and/or systems

Many scientific and engineering problems require running lots of simulations. But accomplishing these tasks effectively in an unstable, bleeding-edge HPC environment can be dicey. The tricks that work for 100 simulations won’t work for 10 thousand, let alone 100 million. We made Merlin to make high-frequency, extreme-scale computing easy.

How can Merlin run so many simulations?

The good news is that distributed cloud compute technology has really pushed the frontier of scalability. Merlin helps bring this tech to traditional scientific HPC.

Traditionally, HPC workflow systems tie workflow steps to HPC resources and coordinate the execution of tasks and management of resources in one of two ways:
Table 2: Traditional HPC Workflow Philosophies

External Coordination
• Separate batch jobs for each task
• External daemon tracks dependencies and jobs
• Progress monitored with periodic polling (of files or batch system)

Internal Coordination
• Multiple tasks bundled into larger batch jobs
• Internal daemon tracks dependencies and resources
• Progress monitored via polling (of filesystem or message passing)

External coordination ties together independent batch jobs, each executing workflow sub-tasks, with an external monitor. This monitor could be a daemon or a human that watches either the batch or file system via periodic polling and orchestrates task launch dependencies.

External coordination can tailor the resources to the task, but cannot easily run lots of concurrent simulations (since