Automated Optimization Methods for Scientific Workflows in E-Science Infrastructures
Total Page:16
File Type:pdf, Size:1020Kb
Scientific workflows have emerged as a key technology that assists scientists with the design, management, execution, sharing and reuse of in silico experiments. Workflow management systems simplify the management of scientific workflows by providing graphical interfaces for their development, monitoring and analysis. Nowadays, e-Science combines such workflow man- agement systems with large-scale data and computing resources into complex research infra- structures. For instance, e-Science allows the conveyance of best practice research in collab- orations by providing workflow repositories, which facilitate the sharing and reuse of scientific workflows. However, scientists are still faced with different limitations while reusing workflows. One of the most common challenges they meet is the need to select appropriate applications and their individual execution parameters. If scientists do not want to rely on default or experi- ence-based parameters, the best-effort option is to test different workflow set-ups using either trial and error approaches or parameter sweeps. Both methods may be inefficient or time con- suming respectively, especially when tuning a large number of parameters. Therefore, scientists require an effective and efficient mechanism that automatically tests different workflow set-ups in an intelligent way and will help them to improve their scientific results. This thesis addresses the limitation described above by defining and implementing an approach for the optimization of scientific workflows. In the course of this work, scientists’ needs are investigated and requirements are formulated resulting in an appropriate optimization concept. This concept is prototypically implemented by extending a workflow management system with an Workflows of Scientific Optimization Automated optimization framework. This implementation and therewith the general approach of workflow optimization is experimentally verified by four use cases in the life science domain. Finally, a new collaboration-based approach is introduced that harnesses optimization provenance to make optimization faster and more robust in the future. This publication was written at the Jülich Supercomputing Centre (JSC) which is an integral part of the Institute for Advanced Simulation (IAS). The IAS combines the Jülich simulation sciences and the supercomputer facility in one organizational unit. It includes those parts of the scientific institutes at Forschungszentrum Jülich which use simulation on supercomputers as their main Sonja Holl research methodology. IAS Series IAS Series Automated Optimization Methods for Scientific Workflows in e-Science Infrastructures Sonja Holl IAS Series Volume 24 ISBN 978-3-89336-949-2 24 Member of the Helmholtz Association Member of the Automated Optimization Methods for Scientific Workflows in e-Science Infrastructures Sonja Holl Member of the Helmholtz Association Member of the Schriften des Forschungszentrums Jülich IAS Series Volume 24 Forschungszentrum Jülich GmbH Institute for Advanced Simulation (IAS) Jülich Supercomputing Centre (JSC) Automated Optimization Methods for Scientific Workflows in e-Science Infrastructures Sonja Holl Schriften des Forschungszentrums Jülich IAS Series Volume 24 ISSN 1868-8489 ISBN 978-3-89336-949-2 Bibliographic information published by the Deutsche Nationalbibliothek. The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de. Publisher and Forschungszentrum Jülich GmbH Distributor: Zentralbibliothek 52425 Jülich Phone +49 (0) 24 61 61-53 68 · Fax +49 (0) 24 61 61-61 03 e-mail: [email protected] Internet: http://www.fz-juelich.de/zb Cover Design: Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH Printer: Grafische Medien, Forschungszentrum Jülich GmbH Copyright: Forschungszentrum Jülich 2014 Schriften des Forschungszentrums Jülich IAS Series Volume 24 D 5 (Diss., Bonn, Univ., 2014) ISSN 1868-8489 ISBN 978-3-89336-949-2 Persistent Identifier: urn:nbn:de:0001-2014022000 Resolving URL: http://www.persistent-identifier.de/?link=610 Neither this book nor any part of it may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher. Abstract Scientific workflows have emerged as a key technology that assists scientists with the design, management, execution, sharing and reuse of in silico experiments. Workflow man- agement systems simplify the management of scientific workflows by providing graphical interfaces for their development, monitoring and analysis. Nowadays, e-Science combines such workflow management systems with large-scale data and computing resources into complex research infrastructures. For instance, e-Science allows the conveyance of best practice research in collaborations by providing workflow repositories, which facilitate the sharing and reuse of scientific workflows. However, scientists are still faced with different limitations while reusing workflows. One of the most common challenges they meet is the need to select appropriate applications and their individual execution parameters. If scien- tists do not want to rely on default or experience-based parameters, the best-effort option is to test different workflow set-ups using either trial and error approaches or parameter sweeps. Both methods may be inefficient or time consuming respectively, especially when tuning a large number of parameters. Therefore, scientists require an effective and efficient mechanism that automatically tests different workflow set-ups in an intelligent way and will help them to improve their scientific results. This thesis addresses the limitation described above by defining and implementing an approach for the optimization of scientific workflows. In the course of this work, scien- tists’ needs are investigated and requirements are formulated resulting in an appropriate optimization concept. In a following step, this concept is prototypically implemented by extending a workflow management system with an optimization framework, including general mechanisms required to conduct workflow optimization. As optimization is an ongoing research topic, different algorithms are provided by pluggable extensions (plugins) that can be loosely coupled with the framework, resulting in a generic and quickly extend- able system. In this thesis, an exemplary plugin is introduced which applies a Genetic Algorithm for parameter optimization. In order to accelerate and therefore make workflow optimization feasible at all, e-Science infrastructures are utilized for the parallel execution of scientific workflows. This is empowered by additional extensions enabling the execution of applications and workflows on distributed computing resources. The actual implementation and therewith the general approach of workflow optimization is experimentally verified by four use cases in the life science domain. All workflows were significantly improved, which demonstrates the advantage of the proposed workflow optimization. Finally, a new collaboration-based approach is introduced that harnesses optimization provenance to make optimization faster and more robust in the future. Acknowledgements I would like to express my gratitude to all the people who supported me in any way during my PhD project. First and foremost, I would like to thank Prof. Martin Hofmann-Apitius for his advice and guidance though such a versatile research project. Visiting his research department was a wonderful working experience, especially collaborating with such a friendly and open minded group. I would like to thank Prof. Thomas Lippert and Daniel Mallmann for the opportunity to conduct my PhD research within the Federated Systems and Data group at the Jülich Supercomputing Center in the Forschungszentrum Jülich. It was a great workplace offering both an excellent technical infrastructure and social environment. I am also very thankful to my supervisor Olav Zimmermann for his continuous support, fruitful discussions and numerous comments helping me sharpen my doctoral studies. Moreover, I thank Prof. Heiko Schoof to act as a very excited co-referee as well as further referees, Prof. Rainer Manthey and Prof. Diana Imhof. I would like to acknowledge all my colleagues at the Jülich Supercomputing Center for providing technical background during the implementation, enormous efforts to create a stable and reliable infrastructure, proof-reading parts of this thesis, and offering a warm and cheerful office atmosphere in the coldest office in JSC. Additional thanks go to all the Studium Universale members offering friendly meetings as a welcome alternation to the PhD work. A special thanks goes to Prof. Carole Goble for hosting me at the University of Manchester and her team, in person Dr. Khalid Belhajjame, Alan Williams, Stian Soiland- Reyes, Dr. Alexandra Nenadic and Dr. Katy Wolstencroft for providing excellent support and assistance during the implementation phase on extensions to the Taverna Workflow Management System and Research Objects. Many thanks are due to my collaboration partners namely Prof. Magnus Palmblad, Dr. Yassene Mohammed, Daniel Garijo, Dr. Matthias Obst, Renato De Giovanni, Shweta Bagewadi, and Dr. Philipp Senger for their fruitful discussions and great assistance with biological, medical or workflow provenance issues. Finally, I am deeply grateful to