Activities of the COST D37 GridChem Computational Chemistry Workflow Group
EGEE'07 Conference, Budapest, 01.10.2007
Thomas Steinke
Zuse Institute Berlin (ZIB) <www.zib.de>

Partners in the CCWF Working Group
• Thomas Steinke, Tim Clark (DE)
• Hans-Peter Lüthi, Martin Brändle (CH)
• Peter Murray-Rust, Henry Rzepa (UK)
• Antonio Márquez (ES)
• Kurt Mikkelsen (DK)
- CSCS (Manno, CH)
- ZIB (Berlin, DE)
Map locations: København, Cambridge, Berlin, London, Erlangen, Zürich, Manno, Sevilla

"Traditional" Workflow in Computational Chemistry
Workflows have a long tradition in the CC domain.
start → knowledge base (DB search) → automated/manually edited molecular structures → molecular simulations (method / program A, method / program B, …) → properties → primary visualization / quality control → analysis / archival / DB storage → new insights?

Databases: Computational protocol (T. Clark, 1998)
Complete protocol runs automatically with less than 0.5% failure rate.
Cleanup → 2D → 3D conversion → VAMP optimization → Calculate properties
~3,000 compounds per processor day (3 GHz Xeon)
Enhanced 3D-Databases: A Fully Electrostatic Database of AM1-Optimized Structures. B. Beck, A. Horn, J. E. Carpenter, and T. Clark, J. Chem. Inf. Comput. Sci. 1998, 38, 1214-1217.
source: Tim Clark, Uni Erlangen

Distributed Computing Environment in the 90's
QM packages

Distributed Computing Environment in the 90's
Example: UniChem, a distributed environment for quantum-chemical simulations
Cray Research Inc. 1991-(2004)

CCWF Chemical Illustrator Applications
• Molecular design of functionalised enzynes
  Hans-Peter Lüthi, Martin Brändle, Zürich; Peter Murray-Rust, Cambridge; Henry Rzepa, London
• Quantum chemical based QSAR/QSPR
  Tim Clark, Erlangen; Jon Essex, Southampton
• High-order dynamic and static electrostatic molecular properties
  Kurt Mikkelsen, Copenhagen
• Computational heterogeneous catalysis
  Antonio M. Márquez Cruz, Javier Fdez. Sanz, Sevilla

Molecular Design Workflow (Enzyne Design)
Steps: generation and application, archiving of data, extraction, query, statistical analysis
Diagram elements: QC input, QC output, parser, XML, DB, XPath queries, XSLT, statistical analysis input, statistical analysis, output
source: Hans-Peter Lüthi, ETH Zürich

Quantum Chemical Based QSAR and QSPR
• generate structures, conformations and protonation states
• semiempirical MO geometry optimization and electron density
• generate isodensity surfaces, spherical-harmonic fits and local properties
• apply models
Pipeline labels: 2D-Database; 2D → 3D; Conformations, Tautomers; VAMP; ParaSurf; QSPR
Applications: Materials Design, Virtual Screening, Multiscale Modeling, ADME/Tox., Property Optimization, Pharmacokinetics, Molecular Info
source: Tim Clark, Uni Erlangen

Properties: Free Energies of Hydration
Plot: calculated vs. experimental ΔG_solv(H2O), both in kcal mol^-1
N = 362
MUE = 0.85 kcal mol^-1
RMSD = 1.09 kcal mol^-1
r2 = 0.88
q2 = 0.83
source: Tim Clark, Uni Erlangen

Computing the NCI database (P. Murray-Rust, '05)
• MOPAC PM5
• Workflow built with Taverna
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Times to run jobs
Plot: time / s versus (n basis functions)^4
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Diagram labels: Protocol; Unsuitable Data; Program Crashes; Pathological Behaviour; System Crashes; Errors; Inform Developer; Log Files; Parse; Statistics; Analysis; Science; Other Science; Disseminate Results
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Conclusions from NCI "Experiment" (2005)
• Protocols can be automated
• Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider
• Computational programs can provide high quality "experimental" molecular properties
source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Motivation
The orchestration of complex workflow scenarios is on today's agenda.
• complex scientific solution paths
• linking in-house and (commercial) legacy codes
→ Transformation of scientific ventures into a scientifically validated protocol allowing highly (semi-)automated data generation (pre-processing) and data processing steps.

Goals of the CCWF Working Group
• implementation of workflow environments for QC by adapting standard (Grid) technologies
• fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format, to ensure application program interoperability and to support efficient access to chemical information based on a CC ontology
• implementation of computational chemistry illustrator scenarios to demonstrate the applicability of our approach

Generic Workflow
1. Automatic generation + validation of input data
2. Submission, monitoring, and gathering of output data of simulation jobs
3. Integration of results (primary data) into project database
4. Data mining and visualization techniques to reduce complexity
5. Knowledge generation by applying methods of statistical analysis and pattern recognition
6. On-line publication and archiving of valuable scientific data

Challenges
Diversity: Molecular properties derived from state functions obtained with electronic-structure methods.
• ab-initio, semi-empirical, DFT, approximate potentials
• Gaussian, COLUMBUS, Dalton, Turbomole, MOPAC, Vamp, CPMD…
Data formats: How to implement seamless data export/import?
• ~80 relevant formats known in CC: XYZ, MDL, SDF, PDB, …
→ OpenBABEL

Challenges (cont.)
Scaling, Robustness, Load Balancing:
I can handle O(10) jobs by hand but… what about campaigns of O(1000) jobs?
• workflow system
• computational resources → distributed computing
• persistence, automated failure recovery, …
• long simulation times, sometimes unpredictable
Acceptance: ease of use, GUI + CLI

What I Want…
• ease of use: workflow orchestration, usage, installation / maintenance
• sharing of workflow descriptions with my colleagues
• standard languages
• support in a heterogeneous environment: laptop – server – cluster – supercomputer – grid

Which Workflow System?
… to be spoilt for choice?
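
Note on the scaling and robustness challenge (campaigns of O(1000) jobs, persistence, automated failure recovery): the following is a minimal Python sketch of the idea only, not the working group's implementation and not tied to any of the workflow systems discussed in this talk. The names run_campaign, flaky_job, and the JSON state file are hypothetical and introduced purely for illustration; a real job would call a QC package through a batch or Grid interface.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical call counter used only by the demo job below.
calls: Dict[str, int] = {}


def flaky_job(job_id: str) -> str:
    """Stand-in for a simulation job: fails on its first call, then succeeds."""
    calls[job_id] = calls.get(job_id, 0) + 1
    if calls[job_id] == 1:
        raise RuntimeError("simulated crash")
    return f"result-for-{job_id}"


def run_campaign(
    jobs: List[str],
    run_job: Callable[[str], str],
    state_file: Path,
    max_attempts: int = 3,
) -> Dict[str, dict]:
    """Run each job, retry failures, and persist the state after every attempt."""
    # Reload earlier progress so a restarted campaign skips finished jobs.
    state: Dict[str, dict] = {}
    if state_file.exists():
        state = json.loads(state_file.read_text())

    for job_id in jobs:
        record = state.setdefault(job_id, {"status": "pending", "attempts": 0})
        while record["status"] != "done" and record["attempts"] < max_attempts:
            record["attempts"] += 1
            try:
                record["result"] = run_job(job_id)
                record["status"] = "done"
                record.pop("error", None)
            except Exception as exc:  # a real system would classify errors
                record["status"] = "failed"
                record["error"] = str(exc)
            # Persist after every attempt so a crash loses at most one attempt.
            state_file.write_text(json.dumps(state, indent=2))
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        final = run_campaign(
            [f"mol{i}" for i in range(50)], flaky_job, Path(tmp) / "campaign.json"
        )
        print(sum(r["status"] == "done" for r in final.values()), "jobs done")
```

Calling run_campaign a second time with the same state file re-runs nothing that is already marked "done", which is the restart behaviour the persistence requirement asks for. Jobs run sequentially here; distributing them over resources is exactly what a workflow system would add.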
Some Assessment Criteria
• workflows in distributed systems
• robustness, stability
• Grid environment
• supported batch systems: PBS (, LSF)
• open source
• support for managing large files
• recovery / backup
• restart/stop/debugging
• user/installation base
• quality of the documentation
• status & exception handling
• customizability
• legacy codes and Web services
• PKI / security
• project development activity
• required installation effort
• GUI
• Web interface
• WF language

TRIANA Experiences (2005/06)
✓ workflow orchestration
✓ integration of web services
✓ semantic check of WSDL files
✓ support for self-written Triana modules
✓ negligible control logic overhead
✓ pre-requisite for migration to Grid environments
- proprietary workflow description language in TRIANA (BPEL is announced)
- GUI robustness for very complex workflow definitions

GWES Experiences (MediGRID, since 2006)
✓ integration of web services and legacy codes
✓ monitoring + debugging support
✓ Grid environments
✓ under active development (A. Hoheisel et al./FhG FIRST)
- workflow orchestration (WF GUI builder in preparation)
- proprietary workflow description language

OMII Server: Attracting Features
Workflows
• language: BPEL (Active BPEL)
• WF editor (Eclipse)
• Web Services customization
Jobs
• submission & monitoring via WS job manager API
Data
• persistent (job recovery), in-memory (via Hibernate)
GridSAM: Distributed Resource Management (DRM)
• Condor-G, Globus GRAM
• SSH-exec
• your own plug-ins, e.g. PBS
GridSAM: file staging support
• within job (JSDL): file stage in/out
• Apache Virtual File System library (vfs)
  FTP, local files, http, https, sftp
  zip, jar, tar, bzip2, gzip
  ram - data in memory
• GridFTP

OMII/Active BPEL Experiences (3 months)
✓ workflow orchestration (Eclipse plugin)
✓ standardized WF language (BPEL)
✓ monitoring support
✓ Grid environments
✓ security features: https + signed messages (X.509 cert.)
✓ active development (UK eScience)
- deployment requires manual workarounds
- learning barrier
- BPEL editor not fully mature (validation of BPEL workflows)

Summary
• a couple of workflow systems are available
• design/development of workflow systems is still ongoing research
• not yet decided for our working group
barriers:
• ease of use vs. robustness
• middleware stack: more complicated Grid environments vs. script-based approaches on clusters
• standards vs. proprietary but powerful/sufficient WF languages
BPEL has a high chance to survive

Acknowledgement
Core members of D37 CCWF working group
• Hans-Peter Lüthi, ETH Zurich
• Tim Clark, CCC Uni Erlangen
• J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, Uni Cambridge/Unilever Inst.
developers of the workflow systems mentioned in this talk

QUESTIONS?