
Activities of the COST D37 GridChem Computational Workflow Group

EGEE'07 Conference Budapest 01.10.2007

Thomas Steinke, Zuse Institute Berlin (ZIB)

Partners in the CCWF Working Group

- Thomas Steinke, Tim Clark (DE)
- Hans-Peter Lüthi, Martin Brändle (CH)
- Peter Murray-Rust, Henry Rzepa (UK)
- Antonio Márquez (ES)
- Kurt Mikkelsen (DK)
- CSCS (Manno, CH)
- ZIB (Berlin, DE)

Map locations: København, Cambridge, Berlin, London, Erlangen, Zürich, Manno, Sevilla

“Traditional” Workflow in Computational Chemistry

Workflows have a long tradition in the CC domain.

start → knowledge base (DB search) → automated/manually edited molecular structures → molecular simulations (method / program A, method / program B, …) → properties → primary visualization / quality control → analysis / archival / DB storage → new insights?

Databases: Computational protocol (T. Clark, 1998)

- Complete protocol runs automatically with less than 0.5% failure rate.
  - Cleanup
  - 2D → 3D conversion
  - VAMP optimization
  - Calculate properties
- ~3,000 compounds per processor day (3 GHz Xeon)

Enhanced 3D-Databases: A Fully Electrostatic Database of AM1-Optimized Structures. B. Beck, A. Horn, J. E. Carpenter, and T. Clark, J. Chem. Inf. Comput. Sci. 1998, 38, 1214-1217.

source: Tim Clark, Uni Erlangen

Distributed Computing Environment in the 90’s

QM packages

Distributed Computing Environment in the 90’s

Example: UniChem, a distributed environment for quantum-chemical simulations (Cray Research Inc., 1991-(2004))

CCWF Chemical Illustrator Applications

- Molecular design of functionalised enzymes
  Hans-Peter Lüthi, Martin Brändle, Zürich; Peter Murray-Rust, Cambridge; Henry Rzepa, London

- Quantum chemical based QSAR/QSPR
  Tim Clark, Erlangen; Jon Essex, Southampton

- High-order dynamic and static electrostatic molecular properties
  Kurt Mikkelsen, Copenhagen

- Computational heterogeneous
  Antonio M. Márquez Cruz, Javier Fdez. Sanz, Sevilla

Molecular Design Workflow (Enzyme Design)

Steps:
- Generation and Application
- Archiving of data
- Extraction
- Statistical Analysis

Diagram labels: QC Input, QC, QC Output, Parser, XML, XPath queries, XPath Query, DB, XSLT, Input, Statistical Analysis, Output

source: Hans-Peter Lüthi, ETH Zürich
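The extraction step (XPath queries against XML produced by the output parser) can be illustrated with a minimal sketch. The XML layout, element names and attribute names below are hypothetical and are not the schema used in the project; the sketch relies only on the limited XPath subset in Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML as an output parser might produce it; all names are illustrative.
SAMPLE = """
<calculations>
  <calculation id="mol-001" method="AM1">
    <property name="total_energy" units="eV">-1234.5</property>
    <property name="dipole" units="Debye">1.85</property>
  </calculation>
  <calculation id="mol-002" method="AM1">
    <property name="total_energy" units="eV">-987.6</property>
  </calculation>
</calculations>
"""


def extract_property(xml_text, name):
    """Return {calculation id: value} for every property with the given name."""
    root = ET.fromstring(xml_text)
    values = {}
    for calc in root.findall("./calculation"):
        # XPath predicate: select the <property> child whose name attribute matches.
        node = calc.find(f"./property[@name='{name}']")
        if node is not None:
            values[calc.get("id")] = float(node.text)
    return values


energies = extract_property(SAMPLE, "total_energy")
dipoles = extract_property(SAMPLE, "dipole")
```

Calculations that lack the requested property are simply skipped, so the result can be passed directly to a statistical analysis step.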

Quantum Chemical Based QSAR and QSPR

- generate structures, conformations and protonation states
- semiempirical MO geometry optimization and electron density
- generate isodensity surfaces, spherical-harmonic fits and local properties
- apply models

Diagram labels: 2D-Database, 2D → 3D, Conformations, Tautomers, VAMP, ParaSurf, Molecular Info, QSPR, Materials Design, Virtual Screening, Multiscale Modeling, ADME/Tox., Property Optimization, Pharmacokinetics

source: Tim Clark, Uni Erlangen

Properties: Free Energies of Hydration

[Figure: calculated vs. experimental ΔGsolv(H2O) in kcal mol^-1, both axes from -14 to 4. N = 362, MUE = 0.85 kcal mol^-1, RMSD = 1.09 kcal mol^-1, r2 = 0.88, q2 = 0.83]
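The error statistics quoted for this comparison (MUE, RMSD, r2) can be computed from paired calculated and experimental values as sketched below. The numbers are made up for illustration and are not the data behind the figure. r2 is taken here as the squared Pearson correlation, which is an assumption about the definition used; q2 requires cross-validation and is omitted.

```python
import math


def mue(calc, expt):
    """Mean unsigned error between calculated and experimental values."""
    return sum(abs(c - e) for c, e in zip(calc, expt)) / len(calc)


def rmsd(calc, expt):
    """Root-mean-square deviation between calculated and experimental values."""
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, expt)) / len(calc))


def r_squared(calc, expt):
    """Squared Pearson correlation coefficient (assumed definition of r2)."""
    n = len(calc)
    mean_c = sum(calc) / n
    mean_e = sum(expt) / n
    cov = sum((c - mean_c) * (e - mean_e) for c, e in zip(calc, expt))
    var_c = sum((c - mean_c) ** 2 for c in calc)
    var_e = sum((e - mean_e) ** 2 for e in expt)
    return cov ** 2 / (var_c * var_e)


# Illustrative values only (kcal/mol), not taken from the talk.
experimental = [-6.0, -4.0, -2.0, 0.0]
calculated = [-5.0, -4.5, -2.5, 1.0]

stats = {
    "MUE": mue(calculated, experimental),
    "RMSD": rmsd(calculated, experimental),
    "r2": r_squared(calculated, experimental),
}
```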

source: Tim Clark, Uni Erlangen

Computing the NCI database (P. Murray-Rust, ’05)

MOPAC PM5

Workflow built with Taverna

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Times to run jobs

[Figure: job run time (time / s, 0 to 120,000) plotted against (n basis functions)^4 (0 to 1×10^9)]
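The x-axis suggests that run time was examined against the fourth power of the number of basis functions. A minimal sketch of fitting such a relation, t ≈ k·n^4, by least squares through the origin follows. Proportionality is an assumption here, and the timings are synthetic, not the measurements from the NCI runs.

```python
def fit_quartic_prefactor(n_basis, times):
    """Least-squares k for t = k * n**4 (a line through the origin in x = n**4)."""
    xs = [n ** 4 for n in n_basis]
    return sum(x * t for x, t in zip(xs, times)) / sum(x * x for x in xs)


def predict_time(k, n):
    """Predicted run time in seconds for n basis functions."""
    return k * n ** 4


# Synthetic timings generated from an assumed prefactor of 1.0e-4 s.
n_basis = [50, 100, 150]
times = [1.0e-4 * n ** 4 for n in n_basis]

k = fit_quartic_prefactor(n_basis, times)
estimate_for_200 = predict_time(k, 200)
```

A fitted prefactor of this kind gives a rough run-time estimate for scheduling; it would need re-fitting for each method and machine.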

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

[Diagram labels: Protocol, Unsuitable Data, Program Crashes, Pathological Behaviour, Inform Developer, System Crashes, Log Files, Parse, Statistics, Science Errors, Analysis, Other Science, Disseminate Results]

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Conclusions from NCI “Experiment” (2005)

- Protocols can be automated

- Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider

- Computational programs can provide high quality “experimental” molecular properties

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Motivation

- The orchestration of complex workflow scenarios is on today’s agenda.
  - complex scientific solution paths
  - linking in-house and (commercial) legacy codes

→ Transformation of scientific ventures into a scientifically validated protocol
  - allowing highly (semi-)automated data generation (pre-processing) and data processing steps

Goals of the CCWF Working Group

- implementation of workflow environments for QC by adapting standard (Grid) technologies

- fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format, to ensure application program interoperability and to support efficient access to chemical information based on a CC ontology

- implementation of illustrator scenarios to demonstrate the applicability of our approach

Generic Workflow

1. Automatic generation + validation of input data

2. Submission, monitoring, and gathering of output data of simulation jobs

3. Integration of results (primary data) into project database

4. Data mining and visualization techniques to reduce complexity

5. Knowledge generation by applying methods of statistical analysis and pattern recognition.

6. On-line publication and archiving of valuable scientific data.
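The six steps can be sketched as a chain of small functions. This is only an illustration of the structure: every function name is hypothetical, the "simulation" is a placeholder that does no chemistry, and the project database is a plain dictionary.

```python
from statistics import mean


def generate_inputs(molecules):
    """Step 1: build and validate inputs (here: reject blank names)."""
    return {m: f"job for {m}" for m in molecules if m.strip()}


def run_jobs(inputs, simulate):
    """Step 2: submit jobs and gather outputs; `simulate` stands in for a QC code."""
    return {name: simulate(name) for name in inputs}


def store_results(outputs, database):
    """Step 3: integrate primary data into the project database (a dict here)."""
    database.update(outputs)
    return database


def reduce_data(database, threshold):
    """Step 4: reduce complexity by keeping entries at or below a threshold."""
    return {k: v for k, v in database.items() if v <= threshold}


def analyse(selected):
    """Step 5: statistical analysis (here: just the mean)."""
    return mean(selected.values()) if selected else None


def publish(summary):
    """Step 6: publication/archiving (here: format one report line)."""
    return f"mean property of selected molecules: {summary}"


def fake_simulation(name):
    # Placeholder, not chemistry: returns the length of the molecule name.
    return float(len(name))


db = store_results(run_jobs(generate_inputs(["water", "methane", " "]), fake_simulation), {})
selected = reduce_data(db, threshold=6.0)
report = publish(analyse(selected))
```

Keeping each step as a separate function mirrors how a workflow system would treat them as separate tasks that can be monitored and restarted individually.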

Challenges

- Diversity: Molecular properties derived from state functions obtained with electronic-structure methods.
  - ab-initio, semi-empirical, DFT, approximate potentials
  - Gaussian, …, COLUMBUS, Dalton, Turbomole, MOPAC, Vamp, CPMD, …

- Data formats: How to implement seamless data export/import?
  - ~80 relevant formats known in CC: XYZ, MDL, SDF, PDB, …
  → OpenBABEL
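To illustrate what import/export of even the simplest format involves, the sketch below reads and writes XYZ files using only the Python standard library. It does not use the OpenBabel API and covers none of the other formats; in practice a conversion tool would be used instead. The water geometry is illustrative.

```python
def parse_xyz(text):
    """Parse XYZ text into (comment, [(symbol, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1] if len(lines) > 1 else ""
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


def write_xyz(comment, atoms):
    """Serialise (comment, atoms) back into XYZ text."""
    body = [f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, x, y, z in atoms]
    return "\n".join([str(len(atoms)), comment, *body]) + "\n"


# Illustrative geometry, coordinates in Angstrom.
WATER = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

comment, atoms = parse_xyz(WATER)
round_trip = parse_xyz(write_xyz(comment, atoms))
```

The atom-count check is the kind of validation that matters in automated campaigns, where a truncated file should fail loudly rather than propagate.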

Challenges (cont.)

- Scaling, Robustness, Load Balancing: I can handle O(10) jobs by hand but… what about campaigns of O(1000) jobs?
  - workflow system
  - computational resources → distributed computing
  - persistence, automated failure recovery, …
  - long simulation times, sometimes unpredictable

- Acceptance:
  - ease of use, GUI + CLI
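The persistence and automated failure recovery mentioned under scaling and robustness can be sketched as follows. This is a minimal illustration, not how any of the workflow systems discussed here implement it: job status is written to a JSON file after each job, finished jobs are skipped on restart, and failed jobs are retried a limited number of times. The function names and the simulated crash are hypothetical.

```python
import json
import os
import tempfile


def run_campaign(job_ids, run_job, state_path, max_attempts=3):
    """Run jobs with retries; persist status so a restart skips finished jobs."""
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as fh:
            state = json.load(fh)
    for job_id in job_ids:
        if state.get(job_id) == "done":
            continue  # already finished in an earlier run
        for attempt in range(1, max_attempts + 1):
            try:
                run_job(job_id)
                state[job_id] = "done"
                break
            except RuntimeError:
                state[job_id] = f"failed after {attempt} attempt(s)"
        # Persist after every job so a crash of the driver loses little work.
        with open(state_path, "w") as fh:
            json.dump(state, fh)
    return state


calls = {}


def flaky_job(job_id):
    # Stand-in for a simulation: "job-2" crashes on its first attempt only.
    calls[job_id] = calls.get(job_id, 0) + 1
    if job_id == "job-2" and calls[job_id] == 1:
        raise RuntimeError("simulated crash")


state_file = os.path.join(tempfile.mkdtemp(), "campaign_state.json")
jobs = [f"job-{i}" for i in range(1, 4)]
first = run_campaign(jobs, flaky_job, state_file)
second = run_campaign(jobs, flaky_job, state_file)  # restart: nothing is re-run
```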

What I Want…

- ease of use:
  - workflow orchestration
  - usage
  - installation / maintenance

- sharing of workflow descriptions with my colleagues
  - standard languages

- support in a heterogeneous environment
  - laptop – server – cluster – supercomputer – grid

Which Workflow System?

… to be spoilt for choice?

Some Assessment Criteria

- workflows in distributed systems
- robustness, stability
- Grid environment
- supported batch systems: PBS (, LSF)
- open source
- support for managing large files
- recovery / backup
- restart/stop/debugging
- user/installation base
- quality of the documentation
- status & exception handling
- customizability
- legacy codes and Web services
- PKI / security
- project development activity
- required installation effort
- GUI
- Web interface
- WF language

TRIANA Experiences (2005/06)

+ workflow orchestration
+ integration of web services
+ semantic check of WSDL files
+ support for self-written Triana modules
+ negligible control logic overhead
+ pre-requisite for migration to Grid environments
- proprietary workflow description language in TRIANA (BPEL is announced)
- GUI robustness for very complex workflow definitions

GWES Experiences (MediGRID, since 2006)

+ integration of web services and legacy codes
+ monitoring + debugging support
+ Grid environments
+ under active development (A. Hoheisel et al./FhG FIRST)
- workflow orchestration (WF GUI builder in preparation)
- proprietary workflow description language

OMII Server: Attractive Features

Workflows
- language: BPEL (Active BPEL)
- WF editor (Eclipse)
- Web Services customization

Jobs
- submission & monitoring via WS
- job manager API
  - persistent (job recovery), in-memory (via Hibernate)
- GridSAM
- Distributed Resource Management (DRM)
  - Condor-G, Globus GRAM
  - SSH-exec
  - your own plug-ins, e.g. PBS

Data
- file staging support
- within job (JSDL): file stage in/out
- Apache Virtual File System library (vfs)
  - FTP, local files, http, https, sftp
  - zip, jar, tar, bzip2, gzip
  - ram - data in memory
- GridFTP

OMII/Active BPEL Experiences (3 months)

+ workflow orchestration (Eclipse plugin)
+ standardized WF language (BPEL)
+ monitoring support
+ Grid environments
+ security features: https + signed messages (X.509 cert.)
+ active development (UK eScience)
- deployment requires manual workarounds
- learning barrier
- BPEL editor not fully mature (validation of BPEL workflows)

Summary

- there are a couple of workflow systems available
  - design/development of workflow systems is still ongoing research
  - not yet decided for our working group
- barriers: ease of use vs. robustness
- middleware stack: more complicated Grid environments vs. script-based approaches on clusters
- standards vs. proprietary but powerful/sufficient WF languages
  - BPEL has a high chance to survive

Acknowledgement

Core members of D37 CCWF working group
- Hans-Peter Lüthi, ETH Zurich
- Tim Clark, CCC, Uni Erlangen
- J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, Uni Cambridge/Unilever Inst.

- developers of workflow systems mentioned in this talk

QUESTIONS?
