DRMAA2 – An Open Standard for Job Submission and Cluster Monitoring

DANIEL GRUBER
DGRUBER@.COM

INTRODUCTION

DRMAA in a Nutshell

Why a Standard for Cluster Scheduler Access?

• Generic and Simple Interface
• No Vendor Lock-in
• Stable Interface
• Well Documented
• Clearly defined Semantic
• Simplify
• Protect Investments
• Portable Code
• Future Save
• Simple Migration
• Community

Command Line versus Standardized API

• CLI offers most flexibility but is tied to:
  - DRM vendor
  - DRM version
• CLI cons:
  - CLI is slow:
    - Creates a process per request
    - Establishes communication channel, authentication, shutdown
  - No syntax checking / problems with outdated scripts
  - CLI output is hard to parse (error code / different formats / requires parser)
• API:
  - Well-defined (simple) functions and output
  - Efficient: usually the same connection is used during runtime

DESIGN OF DRMAA2

...a little bit of History

• 2006: first implementation of DRMAA API (DRMAA1) in Sun Grid Engine and Condor
• Implementations available (not only for):
  - PBS, Torque, LoadLeveler, Moab, Apple Xgrid, ...
• 2009: work on DRMAA2 started (ISC), based on public survey, customer feedback, experiences, ...
• 2012: initial version of DRMAA2 finalized (IDL): GFD 194
• 2012: C language binding finalized: GFD 198

DRMAA versus DRMAA2

DRMAA Version 1                                | DRMAA Version 2
Simple API (~40 functions)                     | Rich API (~100 functions)
Job submission / job workflow support          | Job submission / job workflow support
One job session (volatile) per application     | Multiple, concurrent, persistent job sessions per application
(only native specification for job submission) | Extensible objects
-                                              | Advance reservations
-                                              | Cluster monitoring (machines, queues, non-DRMAA jobs)
-                                              | Notion of queues, slots, machines, job classes...
Several language bindings                      | ANSI C API standardized, Go available – others in progress
Widely adopted                                 | New interface

Basic Structure of DRMAA2

Design Goals

• Minimum set of functions which are supported by all major cluster schedulers:
  - Functionality which is not available everywhere or has different semantics is optional
  - Example: deadline time
  - Semantics of queues not defined, but queues are available, etc.
• Relationship to other OGF standards: OCCI-DRMAA2 mapping, SAGA, GLUE 2.0
• Definition in abstract interface definition language (IDL):
  - Scope of functions / grouping of functions
  - Return values / error conditions
  - Clear semantics of functions

Job Session – Working with Jobs

• Create named job session (persistent)
• Destroy named job session (does not affect jobs)
• Open existing job session / close job session (connection setup)
• Get all jobs of session (filter as argument)
• Job state:
  - Job object: Get State
  - Job info object: Get State
• Waiting for job state:
  - WaitAnyStarted
  - WaitAnyTerminated

Job Session – Working with Jobs

• Job submission like in DRMAA1:
  - Allocate job template
  - Run jobs / run bulk jobs using job template
• With a job template you can define (incomplete):
  - Job: remoteCommand, args, and slots
  - Submission options: rerunnable, submitAsHold, workingDirectory, jobEnvironment, jobName, queueName, startTime, accountingId
  - Host selection: candidateMachines, minPhysMemory, machineOS, machineArch
  - Email options: email, emailOnStarted, emailOnTerminated
  - Limits: resourceLimits
  - Extensible!

Job Session – Working with Jobs

• Information about jobs:
  - Job object
  - Jobinfo object
• Job information object as filter

• Values of job info struct used as filter:
  - Job ID
  - Exit status
  - Termination signal
  - Job state
  - Job owner
  - Slots
  - Queue
  - Resource usage, …

• Methods defined on jobs:
  - drmaa2_j_get_id()
  - drmaa2_j_get_jt()
  - drmaa2_j_suspend() / resume() / hold() / release() / terminate()
  - drmaa2_j_get_state()
  - drmaa2_j_get_info()
  - drmaa2_j_wait_started()
  - drmaa2_j_wait_terminated()

Job Session – Working with Jobs

• Example: JobInfo as filter
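The idea of a job info structure as a filter can be sketched without a DRMAA2 library: fields left unset are ignored, and a job is returned only if every set field matches. The following is a minimal sketch under that assumption. All `sk_` names, the reduced field set, and the "unset" markers are hypothetical stand-ins for illustration and are not the types or constants from drmaa2.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; the real types live in drmaa2.h. */
#define SK_UNSET_INT (-1)

typedef enum { SK_UNSET_STATE = -1, SK_QUEUED, SK_RUNNING, SK_DONE, SK_FAILED } sk_jstate;

typedef struct {
    const char *job_id;      /* NULL means "unset" */
    const char *job_owner;   /* NULL means "unset" */
    sk_jstate   job_state;   /* SK_UNSET_STATE means "unset" */
    int         exit_status; /* SK_UNSET_INT means "unset" */
} sk_jinfo;

/* A string field matches when the filter leaves it unset or both values are equal. */
static bool sk_str_matches(const char *want, const char *have) {
    if (want == NULL) return true;
    return have != NULL && strcmp(want, have) == 0;
}

/* Returns true when every field that is set in the filter equals the job's value. */
bool sk_jinfo_matches(const sk_jinfo *filter, const sk_jinfo *job) {
    assert(job != NULL);
    if (filter == NULL) return true; /* no filter: every job matches */
    if (!sk_str_matches(filter->job_id, job->job_id)) return false;
    if (!sk_str_matches(filter->job_owner, job->job_owner)) return false;
    if (filter->job_state != SK_UNSET_STATE && filter->job_state != job->job_state) return false;
    if (filter->exit_status != SK_UNSET_INT && filter->exit_status != job->exit_status) return false;
    return true;
}

/* Stores pointers to the matching jobs in out (room for n entries) and returns the count. */
size_t sk_filter_jobs(const sk_jinfo *filter, const sk_jinfo *jobs, size_t n,
                      const sk_jinfo **out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (sk_jinfo_matches(filter, &jobs[i])) out[count++] = &jobs[i];
    }
    return count;
}
```

A filter that sets only `job_owner` selects all jobs of that user; additionally setting `job_state` narrows the result further, and passing no filter returns every job.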

Monitoring Session

• Open / close (no create / destroy)
• Get all jobs (if allowed by DRM security settings) with filter:
  - No job manipulation
  - Job information / job state
• Get all machines:
  - Machine object contains sockets, cores, hw. threads, load, memory
  - Extensible!
• Get all queues:
  - Queue object contains name
  - Extensible!

Error Handling

• Defined as exceptions in IDL
• In C: functions return an error code or NULL in case of an error. Errors are stored in thread-local storage to avoid issues in multi-threaded applications.
  - int drmaa2_lasterror(void)
  - drmaa2_string drmaa2_lasterror_text(void)
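The "last error in thread-local storage" pattern can be illustrated in isolation. The following is a minimal sketch using C11 `_Thread_local`. The `sk_` functions, the error codes, and the choice to return a heap copy of the message are assumptions made for illustration; they only mirror the shape of the two functions listed above and are not how any particular DRMAA2 library implements them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical error codes for this sketch; the real enum is defined in drmaa2.h. */
typedef enum { SK_SUCCESS = 0, SK_INVALID_ARGUMENT = 1 } sk_error;

/* One slot per thread, so concurrent callers never overwrite each other's error. */
static _Thread_local sk_error sk_last_code = SK_SUCCESS;
static _Thread_local char sk_last_text[128] = "";

/* Records the outcome of the most recent call made by the current thread. */
static void sk_set_error(sk_error code, const char *text) {
    assert(text != NULL);
    sk_last_code = code;
    snprintf(sk_last_text, sizeof sk_last_text, "%s", text);
}

/* Returns the error code of the last call made by the calling thread. */
sk_error sk_lasterror(void) { return sk_last_code; }

/* Returns a heap copy of the last error text (assumption: the caller frees it). */
char *sk_lasterror_text(void) {
    size_t len = strlen(sk_last_text) + 1;
    char *copy = malloc(len);
    if (copy != NULL) memcpy(copy, sk_last_text, len);
    return copy;
}

/* Example of a call that signals failure by returning NULL. */
const char *sk_open_session(const char *name) {
    if (name == NULL || name[0] == '\0') {
        sk_set_error(SK_INVALID_ARGUMENT, "session name must not be empty");
        return NULL;
    }
    sk_set_error(SK_SUCCESS, "");
    return name;
}
```

A caller checks the return value first and queries `sk_lasterror()` or `sk_lasterror_text()` only after a failure. Because the storage is per thread, an error raised in one thread is not visible in another.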

Dealing with Enhancements - Optional

• Optional functionality: use the DrmaaCapability interface (drmaa2_supports())

Dealing with Enhancements - Extensions

• Check and use data structure enhancements with the DrmaaReflective interface

• Set instance value

• Get instance value

LANGUAGE BINDINGS

C Language Binding

• OGF GFD-198
• Short (4 pages + C header): all semantics in IDL
• Adds high-level data structures:
  - Lists and Dictionaries
• Errata finalized at ISC:
  - Issue tracking: http://redmine.ogf.org
  - Naming inconsistency (jtemplate vs jt vs job_template)
  - Dict keys are strings…
  - Unset values are part of enum
  - Finalize function

Go Language Binding

• Go (#golang): easy, compiled, fast, garbage collector, coroutines, closures, …
• Uses cgo to access DRMAA2 C binding
• Not yet a finalized standard – feedback welcome:
  - https://redmine.ogf.org/projects/drmaav2-go-binding/repository/revisions/master/raw/drmaa2-go.pdf
• Open source implementation (Apache license):
  - https://github.com/dgruber/drmaa2
• Example applications:
  - Simple multi-clustering tool (implements minimalistic web service API): http://github.com/dgruber/ubercluster

OUTLOOK

Give it a try ...

• Example implementation (wrapping OS calls) of DRMAA2:
  - https://github.com/troeger/drmaav2-mock
• DRMAA2 included in 48 core limited free downloadable version (www.univa.com):
  - Contains all man pages and some examples
  - For compatibility and feature checks
  - Vagrant installation recipe available: https://github.com/dgruber/vagrantGridEngine

DRMAA Working Group – Your Input is Required

• Hard work is done (syntax and semantics)
• Now we need your DRMAA2 implementation!
  - or create other language bindings based on the C binding!

• OGF: http://www.ogf.org
• Working group: http://www.drmaa.org
• Join (low traffic) mailing list: https://www.ogf.org/mailman/listinfo/drmaa-wg

Thank you very much for your attention!

Next event: Meet us at ISC in Frankfurt

Feel free to contact me here at HEPiX or at [email protected]