Activities of the COST D37 GridChem Computational Chemistry Workflow Group
EGEE'07 Conference Budapest 01.10.2007
Thomas Steinke Zuse Institute Berlin (ZIB)
Thomas Steinke, Tim Clark (DE)
Hans-Peter Lüthi, Martin Brändle (CH)
Peter Murray-Rust, Henry Rzepa (UK)
Antonio Márquez (ES)
Kurt Mikkelsen (DK)
- CSCS (Manno, CH)
- ZIB (Berlin, DE)
Sites on map: København, Cambridge, London, Berlin, Erlangen, Zürich, Manno, Sevilla
2 “Traditional” Workflow in Computational Chemistry
Workflows have a long tradition in the CC domain.
start
→ knowledge base (DB search)
→ automated/manually edited molecular structures
→ molecular simulations: method / program A, method / program B, …
→ properties
→ primary visualization / quality control
→ analysis / archival / DB storage
→ new insights?
3 Databases: Computational protocol (T. Clark, 1998)
Complete protocol runs automatically with less than 0.5% failure rate.
Steps: cleanup; 2D → 3D conversion; VAMP optimization; calculate properties
~3,000 compounds per processor day (3 GHz Xeon)
Enhanced 3D-Databases: A Fully Electrostatic Database of AM1-Optimized Structures. B. Beck, A. Horn, J. E. Carpenter, and T. Clark, J. Chem. Inf. Comput. Sci. 1998, 38, 1214-1217.
source: Tim Clark, Uni Erlangen
4 Distributed Computing Environment in the 90’s
QM packages
5 Distributed Computing Environment in the 90’s
Example: UniChem, a distributed environment for quantum-chemical simulations, Cray Research Inc. 1991-(2004)
6 CCWF Chemical Illustrator Applications
Molecular design of functionalised enzymes: Hans-Peter Lüthi, Martin Brändle, Zürich; Peter Murray-Rust, Cambridge; Henry Rzepa, London
Quantum chemical based QSAR/QSPR: Tim Clark, Erlangen; Jon Essex, Southampton
High-order dynamic and static electrostatic molecular properties: Kurt Mikkelsen, Copenhagen
Computational heterogeneous catalysis: Antonio M. Márquez Cruz, Javier Fdez. Sanz, Sevilla
7 Molecular Design Workflow (Enzyme Design)
Steps:
QC input generation and QC application
Archiving of data
QC output parser
Extraction (XPath), XML DB
Query (XPath queries)
XSLT: statistical analysis input
Statistical analysis, output
source: Hans-Peter Lüthi, ETH Zürich
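The extraction and query steps above rely on XPath over XML. A minimal sketch of that idea, using only the XPath subset of Python's standard library: the XML layout, the element and attribute names, and the numbers are invented for illustration and are not taken from the ETH workflow or from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical parser output: tag names, attributes and values are made up.
PARSED_OUTPUT = """
<results>
  <molecule id="mol1">
    <property name="total_energy" units="hartree">-76.0268</property>
    <property name="dipole" units="debye">1.85</property>
  </molecule>
  <molecule id="mol2">
    <property name="total_energy" units="hartree">-40.2159</property>
  </molecule>
</results>
"""


def extract_property(xml_text, name):
    """Return {molecule id: value} for every molecule carrying the property."""
    root = ET.fromstring(xml_text)
    values = {}
    for molecule in root.findall("./molecule"):
        # XPath predicate: select the child whose name attribute matches.
        node = molecule.find(f"./property[@name='{name}']")
        if node is not None:
            values[molecule.get("id")] = float(node.text)
    return values


energies = extract_property(PARSED_OUTPUT, "total_energy")
dipoles = extract_property(PARSED_OUTPUT, "dipole")
print(energies)
print(dipoles)
```

Molecules lacking the requested property are simply skipped, which is one possible way to keep a later statistical step free of missing values.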
8 Quantum Chemical Based QSAR and QSPR
generate structures, conformations and protonation states (2D-Database, 2D → 3D; Conformations, Tautomers)
semiempirical MO geometry optimization and electron density (VAMP)
generate isodensity surfaces, spherical-harmonic fits and local properties (ParaSurf)
apply models (QSPR)
Applications: Materials Design, Virtual Screening, Multiscale Modeling, ADME/Tox., Property Optimization, Pharmacokinetics
source: Tim Clark, Uni Erlangen
Molecular Info
9 Properties: Free Energies of Hydration
[Plot: calculated vs. experimental ΔGsolv(H2O) (kcal mol-1), both axes from -14 to 4]
N = 362
MUE = 0.85 kcal mol-1
RMSD = 1.09 kcal mol-1
r2 = 0.88
q2 = 0.83
source: Tim Clark, Uni Erlangen
10 Computing the NCI database (P. Murray-Rust, ’05)
MOPAC PM5
Workflow built with Taverna
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
11 Times to run jobs
[Plot: time / s (0 to 120,000) against (n basis functions)^4 (0 to 1.E+09)]
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
12
[Diagram labels: Unsuitable Data; Program Crashes; Pathological Behaviour; Inform Developer; Protocol; System Crashes; Log Files; Statistics; Science Errors; Parse; Analysis; Other Science; Disseminate Results]
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
13 Conclusions from NCI “Experiment” (2005)
Protocols can be automated
Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider
Computational programs can provide high quality “experimental” molecular properties
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
14 Motivation
The orchestration of complex workflow scenarios is on today’s agenda:
complex scientific solution paths
linking in-house and (commercial) legacy codes
→ Transformation of scientific ventures into scientifically validated protocols that allow highly (semi-)automated data generation (pre-processing) and data processing steps.
15 Goals of the CCWF Working Group
implementation of workflow environments for QC by adapting standard (Grid) technologies
fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format, to ensure interoperability of application programs and to support efficient access to chemical information based on a CC ontology
implementation of computational chemistry illustrator scenarios to demonstrate the applicability of our approach
16 Generic Workflow
1. Automatic generation + validation of input data
2. Submission, monitoring, and gathering of output data of simulation jobs
3. Integration of results (primary data) into project database
4. Data mining and visualization techniques to reduce complexity
5. Knowledge generation by applying methods of statistical analysis and pattern recognition.
6. On-line publication and archiving of valuable scientific data.
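The six steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: all function names are hypothetical, the "simulation" is a stand-in that returns a made-up number, and the project database is a plain dictionary rather than a real DB.

```python
import json
import statistics


def generate_inputs(names):
    """Step 1: build input records and validate them (non-empty names only)."""
    inputs = [{"molecule": n.strip()} for n in names]
    return [rec for rec in inputs if rec["molecule"]]


def run_jobs(inputs, simulate):
    """Step 2: submit each job and gather its output."""
    return [{**rec, "energy": simulate(rec["molecule"])} for rec in inputs]


def store_results(results, database):
    """Step 3: integrate primary data into the project database (a dict here)."""
    for rec in results:
        database[rec["molecule"]] = rec["energy"]
    return database


def reduce_data(database, threshold):
    """Step 4: reduce complexity by keeping entries below an energy threshold."""
    return {k: v for k, v in database.items() if v < threshold}


def analyse(selection):
    """Step 5: a very simple statistical summary of the selected data."""
    values = list(selection.values())
    return {"n": len(values), "mean": statistics.mean(values)}


def publish(summary):
    """Step 6: serialise the summary for archiving or on-line publication."""
    return json.dumps(summary, sort_keys=True)


def fake_simulation(name):
    # Stand-in for a real QC program: a made-up value, not a physical energy.
    return -float(len(name))


db = {}
inputs = generate_inputs(["water", "methane", " ", "ammonia"])
store_results(run_jobs(inputs, fake_simulation), db)
report = publish(analyse(reduce_data(db, threshold=-5.5)))
print(report)
```

Keeping each step behind its own function is one way to let a workflow system replace a single stage, for example swapping the stand-in simulation for a batch submission, without touching the others.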
17 Challenges
Diversity: Molecular properties derived from state functions obtained with electronic-structure methods.
ab-initio, semi-empirical, DFT, approximate potentials
Gaussian, COLUMBUS, Dalton, Turbomole, MOPAC, Vamp, CPMD…
Data formats: How to implement seamless data export/import?
~80 relevant formats known in CC: XYZ, MDL, SDF, PDB, …
→ OpenBABEL
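To make the export/import problem concrete, here is a minimal sketch of a reader and writer for the simplest of the formats listed, XYZ (atom count, comment line, then one "element x y z" line per atom). It is a hand-written illustration, not Open Babel code; in practice a conversion library would be the sensible choice. The function names and the water geometry are illustrative assumptions.

```python
def read_xyz(text):
    """Parse XYZ text into (comment, [(element, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])          # first line: number of atoms
    comment = lines[1]               # second line: free-text comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match header")
    return comment, atoms


def write_xyz(comment, atoms):
    """Serialise atoms back to XYZ text with six decimal places."""
    rows = [str(len(atoms)), comment]
    rows += [f"{el} {x:.6f} {y:.6f} {z:.6f}" for el, x, y, z in atoms]
    return "\n".join(rows) + "\n"


# Illustrative geometry only; the coordinates are not taken from the talk.
WATER = """3
water, illustrative geometry
O 0.000000 0.000000 0.117300
H 0.000000 0.757200 -0.469200
H 0.000000 -0.757200 -0.469200
"""

comment, atoms = read_xyz(WATER)
round_trip = write_xyz(comment, atoms)
print(round_trip)
```

A round trip that reproduces the input is a cheap sanity check for any converter placed between two workflow stages.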
18 Challenges (cont.)
Scaling, Robustness, Load Balancing:
I can handle O(10) jobs by hand but… what about campaigns of O(1000) jobs?
workflow system
computational resources → distributed computing
persistence, automated failure recovery, …
long simulation times, sometimes unpredictable
Acceptance: ease of use, GUI + CLI
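Persistence and automated failure recovery for a job campaign can be sketched as follows. This is a minimal illustration, not the behaviour of any workflow system discussed in this talk: the function names, the JSON checkpoint file, and the deliberately flaky job are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_campaign(job_ids, run_job, state_path, max_attempts=3):
    """Run every job, retrying failures and checkpointing after each job."""
    # Reload earlier results so that a restarted campaign skips finished jobs.
    if os.path.exists(state_path):
        with open(state_path) as fh:
            done = json.load(fh)
    else:
        done = {}
    failed = []
    for job_id in job_ids:
        if job_id in done:
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                done[job_id] = run_job(job_id)
                break
            except RuntimeError:
                if attempt == max_attempts:
                    failed.append(job_id)
        # Persist state after every job so a crash loses at most one result.
        with open(state_path, "w") as fh:
            json.dump(done, fh)
    return done, failed


attempts = {}


def flaky_job(job_id):
    # Hypothetical job: every third job fails on its first attempt,
    # and "job-7" never succeeds.
    attempts[job_id] = attempts.get(job_id, 0) + 1
    number = int(job_id.split("-")[1])
    if job_id == "job-7" or (number % 3 == 0 and attempts[job_id] == 1):
        raise RuntimeError(f"{job_id} crashed")
    return number * number


job_ids = [f"job-{i}" for i in range(10)]
state_path = os.path.join(tempfile.mkdtemp(), "campaign.json")
done, failed = run_campaign(job_ids, flaky_job, state_path)
# A second call reads the checkpoint and only retries what is still missing.
done_again, failed_again = run_campaign(job_ids, flaky_job, state_path)
print(len(done), failed)
```

With O(1000) jobs the same loop would typically hand each job to a batch system or Grid middleware instead of calling a local function; the checkpoint logic stays the same.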
19 What I Want…
ease of use: workflow orchestration, usage, installation / maintenance
sharing of workflow descriptions with my colleagues: standard languages
support in a heterogeneous environment: laptop – server – cluster – supercomputer – grid
20 Which Workflow System?
… to be spoilt for choice?
21 Some Assessment Criteria
workflows in distributed systems
robustness, stability
Grid environment
supported batch systems: PBS (LSF)
open source
support for managing large files
recovery / backup
restart/stop/debugging
user/installation base
quality of the documentation
status & exception handling
customizability
legacy codes and Web services
PKI / security
project development activity
required installation effort
GUI
Web interface
WF language
22 TRIANA Experiences (2005/06)
✓ workflow orchestration
✓ integration of web services
✓ semantic check of WSDL files
✓ support for self-written Triana modules
✓ negligible control logic overhead
✓ pre-requisite for migration to Grid environments
- proprietary workflow description language in TRIANA (BPEL is announced)
- GUI robustness for very complex workflow definitions
23 GWES Experiences (MediGRID, since 2006)
✓ integration of web services and legacy codes
✓ monitoring + debugging support
✓ Grid environments
✓ under active development (A. Hoheisel et al./FhG FIRST)
- workflow orchestration (WF GUI builder in preparation)
- proprietary workflow description language
25 OMII Server: Attractive Features
Workflow language: BPEL (Active BPEL)
WF editor (Eclipse)
Web Services customization
Job submission & monitoring via WS job manager API
Data persistent (job recovery), in-memory (via Hibernate)
Distributed Resource Management (DRM): Condor-G, Globus Gram, SSH-exec, your own plug-ins, e.g. PBS
GridSAM file staging support within job (JSDL): file stage in/out
Apache Virtual File System library (vfs): FTP, local files, http, https, sftp; zip, jar, tar, bzip2, gzip; ram (data in memory); GridFTP
26 OMII/Active BPEL Experiences (3 months)
✓ workflow orchestration (Eclipse plugin)
✓ standardized WF language (BPEL)
✓ monitoring support
✓ Grid environments
✓ security features: https + signed messages (X.509 cert.)
✓ active development (UK eScience)
- deployment requires manual workarounds
- learning barrier
- BPEL editor not fully mature (validation of BPEL workflows)
27 Summary
a number of workflow systems are available
design/development of workflow systems is still on-going research
choice of system not yet decided for our working group
barriers:
ease of use vs. robustness
middleware stack: more complicated Grid environments vs. script-based approaches on clusters
standards vs. proprietary but powerful/sufficient WF languages
BPEL has a high chance to survive
28 Acknowledgement
Core members of D37 CCWF working group:
Hans-Peter Lüthi, ETH Zurich
Tim Clark, CCC Uni Erlangen
J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, Uni Cambridge/Unilever Inst.
developers of the workflow systems mentioned in this talk
29 QUESTIONS?