
Logic Networks on the Grid: Handling 15 Million Jobs

Jan Bot, Delft Bioinformatics Lab 07-06-10

Overview

• Explanation of the application
• Challenges for the grid
• Custom grid solution design & implementation
• More challenges (aka problems)
• Adding a desktop cluster
• Errors and statistics
• Discussion

2 But first...

Does anybody not know what these are?
• Life Science Grid
• Grid middleware
• ToPoS
• ORM (Object Relational Mapper)

3 The application: overview

Input data: ~100 mouse tumors

4 Grid pipeline

• Prepare inputs: prepare data for future grid runs
• Multiple parameter settings are tested; the output of these tests contains the 'real' data
• Choose the best parameter settings; it should still be feasible to do at least 100 permutations
• Do permutations, 10 permutations per run

5 Run properties

• Number of jobs per run is fairly large: 24 * 6228 = 149472
• Run time is, due to the optimization algorithm, unpredictable: jobs can take anywhere between 2 seconds and 14 hours
• Outputs are small, both for the real runs and for the permutations

6 Middleware problems

• Scheduling:
  – This amount of jobs cannot be scheduled using the normal (gLite) middleware
  – The overhead of scheduling could outweigh the run time
• Bookkeeping:
  – No method of tracking this amount of jobs
• Output handling:
  – No grid resource can store large amounts of small files (dCache is not an option)
  – Other solutions (such as ToPoS) are slow when retrieving output

7 Scheduling jobs with ToPoS

• ToPoS takes care of the first two categories of problems but presents some new challenges:
  – ToPoS does not scale beyond 10,000 jobs per pool
  – There is no client software which facilitates spreading the tokens over multiple pools

8 Python ToPoS clients

To deal with the limitations of ToPoS, two clients were implemented:
• Grid client:
  – uses the most basic Python httplib module
  – can fetch, lock and delete tokens
  – has a generator to transparently handle tokens in multiple pools
• Local client:
  – uses the more advanced urllib2 module
  – can create and delete pools, spread tokens over multiple pools, delete all locks in a pool, gather ToPoS statistics, etc.
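A minimal sketch of the grid client side in Python 2 with plain httplib. The host name, URL paths and response header below are illustrative placeholders, not the real ToPoS REST interface; only the fetch/lock/delete/generator structure is what the bullet points describe.

import httplib

TOPOS_HOST = 'topos.example.org'   # placeholder, not the real ToPoS host

def next_token(pool):
    # Ask the pool for its next token; the endpoint layout is assumed,
    # with a timeout so the token stays locked while we work on it.
    conn = httplib.HTTPSConnection(TOPOS_HOST)
    conn.request('GET', '/pools/%s/nextToken?timeout=3600' % pool)
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    if resp.status != 200:
        return None                # pool empty (or error): move on
    # Assumed header carrying the token name; adapt to the real service.
    token_id = resp.getheader('x-token-name', 'unknown')
    return token_id, body

def delete_token(pool, token_id):
    # Remove a finished token so no other pilot job picks it up again.
    conn = httplib.HTTPSConnection(TOPOS_HOST)
    conn.request('DELETE', '/pools/%s/tokens/%s' % (pool, token_id))
    conn.getresponse().read()
    conn.close()

def tokens(pools):
    # Generator hiding the pool boundaries: drain one pool, then the next,
    # so the caller sees a single stream of work across all pools of a run.
    for pool in pools:
        while True:
            tok = next_token(pool)
            if tok is None:
                break
            yield pool, tok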

9 Dealing with the outputs

Outputs:
• are small and well defined
• why not just flush them to a database?

Proposed solution:
• Python as language
• SQLAlchemy as ORM
• XML-RPC as communication channel
• MySQL (for now) as database
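A rough sketch of how these pieces could fit together (Python 2). The table, column and method names are made up for illustration, and SQLite stands in here for the MySQL backend:

from SimpleXMLRPCServer import SimpleXMLRPCServer
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Result(Base):
    __tablename__ = 'results'
    id = Column(Integer, primary_key=True)
    token = Column(String(64))     # which unit of work produced this output
    output = Column(Text)          # the (small) job output itself

engine = create_engine('sqlite:///results.db')   # MySQL URL in production
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def store_result(token, output):
    # XML-RPC method called by the worker nodes after each finished job.
    session = Session()
    session.add(Result(token=token, output=output))
    session.commit()
    session.close()
    return True                    # keep the return value XML-RPC friendly

if __name__ == '__main__':
    server = SimpleXMLRPCServer(('0.0.0.0', 8000), allow_none=True)
    server.register_function(store_result)
    server.serve_forever()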

10 Client design

Bash script (started on the worker node):
• Set environment variables
• Fetch input data
• Make binaries executable
• Load modules
• Start Python script

Python script:
• Loop:
  – Fetch token from ToPoS
  – Call Matlab (MCR)
  – Parse output & send results to result server

Matlab (MCR):
• Perform algorithm
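The Python part of this chain might look roughly like the sketch below (Python 2). The result-server URL, the run_matlab.sh wrapper name and the ToPoS helper functions passed in are all assumptions for illustration, corresponding to the earlier client sketch.

import subprocess
import xmlrpclib

# Hypothetical result server exposing the store_result() call.
server = xmlrpclib.ServerProxy('http://resultserver.example.org:8000')

def run_one(parameters):
    # Call the compiled Matlab (MCR) binary for one token and return its
    # stdout, which holds the (small) result of the optimisation.
    proc = subprocess.Popen(['./run_matlab.sh', parameters],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError('Matlab failed: %s' % err)
    return out

def main_loop(pool, fetch_token, remove_token):
    # fetch_token / remove_token are the ToPoS helpers from the earlier sketch.
    while True:
        tok = fetch_token(pool)
        if tok is None:
            break                       # pool empty: the pilot job exits
        token_id, parameters = tok
        output = run_one(parameters)
        server.store_result(token_id, output)
        remove_token(pool, token_id)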

11 Application design

Design overview: Clients → XML-RPC → App → ORM → DB

Server side (main):
• XML-RPC server (listen loop: listen for incoming calls)
• DB code / thread model (flush loop: flush once every minute)
• DB

Client side:
• Fetch token
• Do work
• Upload output
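A minimal sketch of the flush loop in this design (Python 2): results arriving over XML-RPC are parked in memory and a background thread hands them to the database layer once a minute, so the listen loop never blocks on the database. Names are illustrative, not the actual implementation.

import threading
import time

results = []                      # results received but not yet in the DB
lock = threading.Lock()

def enqueue(result):
    # Called from the XML-RPC handler: never touch the DB in the listen loop.
    with lock:
        results.append(result)

def flush_loop(flush_to_db, interval=60):
    # Background thread: every `interval` seconds push the buffered results
    # to the database layer (e.g. a SQLAlchemy session) in one batch.
    while True:
        time.sleep(interval)
        with lock:
            batch = results[:]
            del results[:]
        if batch:
            flush_to_db(batch)

# flusher = threading.Thread(target=flush_loop, args=(write_batch,))
# flusher.setDaemon(True)
# flusher.start()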

12 Implementation & the weakest link

• Implemented in Python
• Hosted on a P4 in a broom closet in our department

• On power failure: everything collapses (but that's not very likely, right?)

13 Getting ready to run: data replication

• Getting the data from one (remote) site is expensive
• Use data replication across all sites to minimize external traffic and divide the load over multiple SRMs
• Data replication can be done easily with the V-Browser
• Manual approach:

Register file:
lcg-cr -l lfn:///grid/lsgrid/jridder/MGtest/MG_Perm2_5_Datapack.zip MG_Perm2_5_Datapack.zip

Replicate file (in this example to Nikhef):
lcg-rep --vo lsgrid -d tbn18.nikhef.nl srm://gb-se-tud.ewi.tudelft.nl/dpm/ewi.tudelft.nl/home/lsgrid/generated/2010-05-26/file006bff9b-49ef-46bd-80cd-5b8110171557

On a WN, retrieve a local copy:
DATAPACK=lfn:/grid/lsgrid/jridder/MGtest/MG_Perm2_5_Datapack.zip
echo $VO_LSGRID_DEFAULT_SE
TDATA=`lcg-lr --vo lsgrid $DATAPACK | grep $VO_LSGRID_DEFAULT_SE`
lcg-cp --verbose $TDATA $DATAPACK

14 Adding a desktop cluster

• Practical (student) PCs are not doing anything at night
• Use these computers to increase computation power
• Compute at night & on weekends

• Our scenario (using ToPoS and an external output server) is ideal for testing such a cluster
• Use Condor to manage the work

15 Desktop cluster locations

• Two locations:
  – Drebbelweg: 250 practical PCs
  – Mekelweg: 50-100 PCs distributed throughout the building
• Different locations mean different VLANs: use two Condor queues

16 Problems during run

• Many jobs seemed to quit prematurely, while most of them ran fine
• Errors could be traced back to Deimos and Nikhef
• The middleware doesn't really provide statistics to the end-user
• Output files cannot always be retrieved

17 Gathering statistics

• Add run information (e.g. start & end times) to the job output
• Add an additional XML-RPC method to capture error information
• Uploading error info is easy:
  – Use the return status of the external program
  – Use Python's internal error handling capabilities
  – All error messages (of the entire job) are located in one text file
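A hedged sketch of that extra error path (Python 2): the wrapper catches both failures of the external program (surfaced as exceptions by the Matlab wrapper shown earlier) and Python exceptions, and ships one consolidated text blob per job to a hypothetical report_error XML-RPC method.

import traceback
import xmlrpclib

# Hypothetical result server with an extra report_error() method.
server = xmlrpclib.ServerProxy('http://resultserver.example.org:8000')

def guarded(token_id, work):
    # Run `work()` for one token; on any failure upload the error text
    # (including the full traceback) instead of a result.
    try:
        return work()
    except Exception:
        server.report_error(token_id, traceback.format_exc())
        return None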

18 Job running times (1)

19 Job running times (2)

One permutation run (10 permutations) takes:
• 415140369 seconds
• 115316.77 hours
• 4804.87 days
• 13.16 years

• Now, repeat 9 times (yes, that's a century)

20 Work done per site

21 Nikhef and Deimos mortality

22 Gathering error info

• Gathering error information on the grid is prone to error

• Again, work around the middleware:
  – Implement an additional XML-RPC call to gather error information

23 Error & Fix

• Jobs failed due to one error: "Could not access the MCR component cache"
• Fix: export MCR_CACHE_ROOT=$( mktemp -d )
  This basically tells the MCR to store all temporary information in a new tmpdir
• Will be included in the next POC environment
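The same fix applied from the Python wrapper instead of the bash script (a sketch; MCR_CACHE_ROOT is the real MCR variable, the rest is illustrative):

import os
import tempfile

# Give every job its own MCR component cache so concurrent jobs on one
# worker node never trip over a shared cache directory.
os.environ['MCR_CACHE_ROOT'] = tempfile.mkdtemp(prefix='mcr_cache_')
# ...then start the Matlab (MCR) binary via subprocess as before.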

24 Mortality after fix

25 Discussion

• We can schedule millions of jobs and capture their outputs on the grid; it just takes a custom solution
• Other fields (such as pattern recognition) can benefit from this solution
• Is there similar work being done?
• If not, can we design and implement a generic solution which does the same?

26 Thanks

Jeroen de Ridder Jeroen Engelberts Jan Just Keijzer Roeland van Ochten Pieter van Beek Marcel Reinders Evert Lammerts

27 Life Science Grid

Site       CPUs
SARA       2000
NIKHEF     5000
Philips    1500
RUG         160
Erasmus      32
Keygene      32
TU Delft     32
RUG          32
AMS          32
NKI          16
AMC          16
LUMC         16
WUR          16
UU           16
KUN          16
Total      8900

28 Grid Middleware

• The glue (or spaghetti) that unifies job management across clusters

Middleware on top of heterogeneous compute resources: different sites run different job scheduling applications, for example:
• gina: Condor
• keygene: PBS
• TU Delft: LSF
• RUG: SGE
• ...

29 ToPoS: Token Pool Server

ToPoS: a pilot job framework. The user adds work (tokens) to a pool and submits pilot jobs; each pilot fetches work from ToPoS and repeats:
1. Get token
2. Do work: translate token, call function, upload output
3. Delete token

Why is ToPoS needed? Problems with the grid middleware:
• Inability to deal with large amounts of jobs
• Failing jobs
• Job accounting
• Etc.

Pilot job: one job / thread which keeps running until all the work has been done.

A 'token' represents one unit of work. Tokens can be locked to prevent other jobs from doing the same work twice.

30 ORM: Object-Relational Mapper

• Mapper for persistent storage of objects in a database
• Saves you from having to write any DB code yourself
• Examples:
  – Python: SQLAlchemy, Storm
  – Java: Hibernate, Cayenne
  – Ruby: ActiveRecord
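A tiny illustration of the idea, using SQLAlchemy's declarative style; the class and column names are made up for the example.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Sample(Base):
    # One Python class maps to one table; no SQL written by hand.
    __tablename__ = 'samples'
    id = Column(Integer, primary_key=True)
    name = Column(String(32))

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)        # emits the CREATE TABLE for us
session = sessionmaker(bind=engine)()

session.add(Sample(name='tumor_001'))   # becomes an INSERT on commit
session.commit()
print session.query(Sample).count()     # becomes a SELECT COUNT(*)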

31 Why not Molgenis?

• Familiar with Python, which already has all the tools to make this
• The design cycle (XML-ify, generate, roll out) is too cumbersome

32