DRMAA2 – An Open Standard for Job Submission and Cluster Monitoring
Daniel Gruber (dgruber@univa.com)

INTRODUCTION

DRMAA in a Nutshell

Why a Standard for Cluster Scheduler Access?
• Generic and simple interface
• No vendor lock-in
• Stable interface
• Well documented
• Clearly defined semantics
• Simplify
• Protect investments
• Portable code
• Future-safe
• Simple migration
• Community

Command Line versus Standardized API
• The CLI offers the most flexibility but is tied to:
  • DRM vendor
  • DRM version
• CLI cons:
  • The CLI is slow:
    • Creates a process per request
    • Establishes communication channel, authentication, shutdown
  • No syntax checking / problems with outdated scripts
  • CLI output is hard to parse (error codes / different formats / requires a parser)
• API:
  • Well-defined (simple) functions and output
  • Efficient: usually the same connection is used during runtime

DESIGN OF DRMAA2

...a little bit of History
• 2006: first implementation of the DRMAA API (DRMAA1) in Sun Grid Engine and Condor
• Implementations available (not only for):
  • PBS, Torque, LoadLeveler, Moab, Apple Xgrid, ...
• 2009: work on DRMAA2 started (ISC), based on a public survey, customer feedback, experiences, ...
• 2012: initial version of DRMAA2 finalized (IDL): GFD 194
• 2012: C language binding finalized: GFD 198

DRMAA versus DRMAA2
DRMAA Version 1                                | DRMAA Version 2
Simple API (~40 functions)                     | Rich API (~100 functions)
Job submission / job workflow support          | Job submission / job workflow support
One job session (volatile) per application     | Multiple, concurrent, persistent job sessions per application
(only native specification for job submission) | Extensible objects
                                               | Advance reservations
                                               | Cluster monitoring (machines, queues, non-DRMAA jobs)
                                               | Notion of queues, slots, machines, job classes...
Several language bindings                      | ANSI C API standardized, Go available – others in progress
Widely adopted                                 | New interface

Basic Structure of DRMAA2

Design Goals
• Minimum set of functions which are supported by all major cluster schedulers:
  • Functionality which is not available everywhere or has different semantics is optional
  • Example: deadline time
  • Semantics of queues are not defined, but queues are available, etc.
• Relationship to other OGF standards: OCCI-DRMAA2 mapping, SAGA, GLUE 2.0
• Definition in abstract interface definition language (IDL):
  • Scope of functions / grouping of functions
  • Return values / error conditions
  • Clear semantics of functions

Job Session – Working with Jobs
• Create named job session ← persistent
• Destroy named job session ← does not affect jobs
• Open existing job session / close job session ← connection setup
• Get all jobs of session (filter as argument)
• Job state:
  • Job object: Get State
  • Job info object: Get State
• Waiting for job state:
  • WaitAnyStarted
  • WaitAnyTerminated

Job Session – Working with Jobs
• Job submission like in DRMAA1:
  • Allocate job template
  • Run jobs / run bulk jobs using job template
• With a job template you can define (incomplete):
  • Job: remoteCommand, args, and slots
  • Submission options: rerunnable, submitAsHold, workingDirectory, jobEnvironment, jobName, queueName, startTime, accountingId
  • Host selection: candidateMachines, minPhysMemory, machineOS, machineArch
  • Email options: email, emailOnStarted, emailOnTerminated
  • Limits: resourceLimits
  • Extensible!
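The template idea above can be pictured with a small, self-contained C model. This is only a sketch and not the DRMAA2 C binding: the type `sketch_jtemplate` and its helper functions are hypothetical stand-ins, and only the attribute names (remoteCommand, args, jobName, queueName, submitAsHold) are taken from the list above. It assumes that an unset attribute is represented by NULL or zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_ARGS 8

/* Hypothetical stand-in for a job template; NOT the real drmaa2_jtemplate. */
typedef struct {
    const char *remoteCommand;          /* NULL means unset */
    const char *args[SKETCH_MAX_ARGS];  /* fixed-size argument list */
    size_t      num_args;
    const char *jobName;                /* NULL means unset */
    const char *queueName;              /* NULL means unset */
    bool        submitAsHold;
} sketch_jtemplate;

/* Return a template with every attribute unset. */
sketch_jtemplate sketch_jtemplate_init(void) {
    sketch_jtemplate jt = {0};
    return jt;
}

/* Append one argument; returns false when the fixed-size list is full. */
bool sketch_jtemplate_add_arg(sketch_jtemplate *jt, const char *arg) {
    assert(jt != NULL);
    if (jt->num_args >= SKETCH_MAX_ARGS) {
        return false;
    }
    jt->args[jt->num_args++] = arg;
    return true;
}

/* In this sketch a template can be submitted only if a command is set. */
bool sketch_jtemplate_is_valid(const sketch_jtemplate *jt) {
    assert(jt != NULL);
    return jt->remoteCommand != NULL && jt->remoteCommand[0] != '\0';
}
```

A caller would initialize a template, set `remoteCommand`, add arguments, and check validity before handing it to a (here omitted) submission call. In a real DRMAA2 program the template, the job session, and the submission call all come from the vendor's library.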
Job Session – Working with Jobs
• Information about jobs:
  • Job object
  • JobInfo object
• Job information object as filter:
  • Values of job info struct used as filter:
    • Job ID
    • Exit status
    • Termination signal
    • Job state
    • Job owner
    • Slots
    • Queue
    • Resource usage, …
• Methods defined on jobs:
  • drmaa2_j_get_id()
  • drmaa2_j_get_jt()
  • drmaa2_j_suspend() / resume() / hold() / release() / terminate()
  • drmaa2_j_get_state()
  • drmaa2_j_get_info()
  • drmaa2_j_wait_started()
  • drmaa2_j_wait_terminated()
Job Session – Working with Jobs
• Example: JobInfo as filter
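A minimal sketch of the filter idea in self-contained C, assuming that every unset field of the filter acts as a wildcard and every set field must match. The types `sketch_jinfo` and `sketch_jstate` and all function names are hypothetical and much smaller than the real job info structure; they only model the matching logic, not the DRMAA2 API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical job states; SKETCH_STATE_UNSET means "not part of the filter". */
typedef enum {
    SKETCH_STATE_UNSET = -1,
    SKETCH_STATE_QUEUED,
    SKETCH_STATE_RUNNING,
    SKETCH_STATE_DONE,
    SKETCH_STATE_FAILED
} sketch_jstate;

/* Hypothetical, reduced job info record.
 * NULL strings and a negative exit status mean "unset". */
typedef struct {
    const char   *jobId;
    const char   *jobOwner;
    const char   *queueName;
    sketch_jstate jobState;
    int           exitStatus;
} sketch_jinfo;

/* Return a record with every field unset, i.e. a filter matching all jobs. */
sketch_jinfo sketch_jinfo_unset(void) {
    sketch_jinfo ji = { NULL, NULL, NULL, SKETCH_STATE_UNSET, -1 };
    return ji;
}

/* An unset (NULL) filter string matches anything. */
static bool sketch_str_matches(const char *want, const char *have) {
    if (want == NULL) {
        return true;
    }
    return have != NULL && strcmp(want, have) == 0;
}

/* True if the job agrees with every field that is set in the filter. */
bool sketch_jinfo_matches(const sketch_jinfo *filter, const sketch_jinfo *job) {
    assert(filter != NULL && job != NULL);
    return sketch_str_matches(filter->jobId, job->jobId)
        && sketch_str_matches(filter->jobOwner, job->jobOwner)
        && sketch_str_matches(filter->queueName, job->queueName)
        && (filter->jobState == SKETCH_STATE_UNSET
            || filter->jobState == job->jobState)
        && (filter->exitStatus < 0
            || filter->exitStatus == job->exitStatus);
}

/* Count the jobs in an array that pass the filter. */
size_t sketch_count_matches(const sketch_jinfo *filter,
                            const sketch_jinfo *jobs, size_t n) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_jinfo_matches(filter, &jobs[i])) {
            hits++;
        }
    }
    return hits;
}
```

Starting from `sketch_jinfo_unset()` and setting only `jobOwner` and `jobState` would select, for example, all running jobs of one user. In the real binding the filter would be passed to the session call that returns the job list.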
Monitoring Session
• Open / close (no create / destroy)
• Get all jobs (if allowed by DRM security settings) with filter:
  • No job manipulation
  • Job information / job state
• Get all machines:
  • Machine object contains sockets, cores, hw. threads, load, memory
  • Extensible!
• Get all queues:
  • Queue object contains name
  • Extensible!

Error Handling
• Defined as exceptions in IDL
• In C: functions return an error code or NULL in case of an error. Errors are stored in thread-local storage to avoid issues in multi-threaded applications.
  • int drmaa2_lasterror(void)
  • drmaa2_string drmaa2_lasterror_text(void)
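The thread-local error convention described above can be illustrated with a short, self-contained C sketch. It assumes a C11 compiler (for `_Thread_local`). All `sketch_` names, the error codes, and the example function `sketch_lookup_queue` are hypothetical; they only mirror the pattern behind drmaa2_lasterror() and drmaa2_lasterror_text() and are not part of the binding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error codes for this sketch; not the DRMAA2 error enumeration. */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_INVALID_ARGUMENT,
    SKETCH_INTERNAL
} sketch_error;

/* One error slot per thread, so concurrent callers cannot overwrite
 * each other's last error. */
static _Thread_local sketch_error sketch_last_error = SKETCH_SUCCESS;
static _Thread_local char sketch_last_text[128] = "";

/* Record code and message for the calling thread (message is truncated
 * if it does not fit into the buffer). */
static void sketch_set_error(sketch_error code, const char *text) {
    assert(text != NULL);
    sketch_last_error = code;
    snprintf(sketch_last_text, sizeof sketch_last_text, "%s", text);
}

/* Accessors in the spirit of drmaa2_lasterror() / drmaa2_lasterror_text(). */
sketch_error sketch_lasterror(void) {
    return sketch_last_error;
}

const char *sketch_lasterror_text(void) {
    return sketch_last_text;
}

/* Example function following the convention: returns NULL on failure and
 * leaves the details in the calling thread's error slot. */
const char *sketch_lookup_queue(const char *name) {
    if (name == NULL || strlen(name) == 0) {
        sketch_set_error(SKETCH_INVALID_ARGUMENT, "queue name must not be empty");
        return NULL;
    }
    sketch_set_error(SKETCH_SUCCESS, "");
    return name;
}
```

A caller checks the return value first and queries the error accessors only after a NULL result. Because each thread has its own slot, no locking is needed around the error state.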
Dealing with Enhancements - Optional
• Optional functionality: use the DrmaaCapability interface (drmaa2_supports())

Dealing with Enhancements - Extensions
• Check and use data structure enhancements with the DrmaaReflective interface
• Set instance value
• Get instance value

LANGUAGE BINDINGS

C Language Binding
• OGF GFD-198
• Short (4 pages + C header) ← all semantics in IDL
• Adds high-level data structures:
  • Lists and dictionaries
• Errata finalized at ISC:
  • Issue tracking: http://redmine.ogf.org
  • Naming inconsistency (jtemplate vs jt vs job_template)
  • Dict keys are strings…
  • Unset values are part of enum
  • Finalize function

Go Language Binding
• Go (#golang): easy, compiled, fast, garbage collector, coroutines, closures, …
• Uses cgo to access the DRMAA2 C binding
• Not yet a finalized standard – feedback welcome:
  • https://redmine.ogf.org/projects/drmaav2-go-binding/repository/revisions/master/raw/drmaa2-go.pdf
• Open source implementation (Apache license):
  • https://github.com/dgruber/drmaa2
• Example applications:
  • Simple multi-clustering tool (implements a minimalistic web service API): http://github.com/dgruber/ubercluster

OUTLOOK

Give it a try ...
• Example implementation (wrapping OS calls) of DRMAA2:
  • https://github.com/troeger/drmaav2-mock
• DRMAA2 is included in the free downloadable version of Univa Grid Engine, which is limited to 48 cores (www.univa.com):
  • Contains all man pages and some examples
  • For compatibility and feature checks
  • Vagrant installation recipe available: https://github.com/dgruber/vagrantGridEngine

DRMAA Working Group – Your Input is Required
• The hard work is done (syntax and semantics)
• Now we need your DRMAA2 implementation!
  • Or create other language bindings based on the C binding!
• OGF: http://www.ogf.org
• Working group: http://www.drmaa.org
• Join the (low traffic) mailing list: https://www.ogf.org/mailman/listinfo/drmaa-wg
Thank you very much for your attention!
Next event: Meet us at ISC in Frankfurt
Feel free to contact me here at HEPiX or at dgruber@univa.com