Development of an Integrated Database Architecture for a Runtime Environment for Simulation Workflows
Total Page:16
File Type:pdf, Size:1020Kb
Institute of Architecture of Application Systems University of Stuttgart Universitätsstraße 38 D–70569 Stuttgart Diploma Thesis No. 2984 Development of an Integrated Database Architecture for a Runtime Environment for Simulation Workflows Christoph Marian Müller Course of Study: Software Engineering Examiner: Prof. Dr.-Ing. Dimka Karastoyanova Supervisors: Dipl.-Inf. Peter Reimann, Dipl.-Math. Michael Reiter Commenced: July 29, 2009 Completed: February 5, 2010 CR-Classification: H.3.4, H.4.1, I.6.7 Contents 1 Introduction 1 1.1 Motivation and Task of this Thesis...........................2 1.2 Conventions for this Document............................3 1.3 Legal Statements....................................4 1.4 Structure of this Document..............................4 2 Background 7 2.1 Basics of Web Services.................................7 2.2 Basics of Workflows................................... 10 2.3 Scientific Workflows and Simulations......................... 16 2.4 Database Issues..................................... 35 2.5 Tools and Frameworks................................. 38 2.6 Database Workflow Integration............................ 44 3 Requirements Analysis 47 3.1 Requirements by the Task of the Thesis........................ 48 3.2 Requirements due to the Simulation Workflow Environment............ 50 3.3 Requirements due to Potential Database Applications in Simulations........ 51 3.4 Workflow Scenarios and Use Cases.......................... 54 3.5 Summary of the Application Fields.......................... 69 4 Simulation Workflow Database Architecture 71 4.1 Logical Architecture.................................. 71 4.2 Components of the Architecture............................ 72 4.3 Components of the Prototype Architecture...................... 76 4.4 Portability to Other Workflow Environments..................... 78 5 Database Evaluation 81 5.1 Evaluation Schema................................... 81 5.2 Evaluation Infrastructure................................ 91 5.3 Database presentation................................. 97 5.4 Database Evaluation and Results........................... 106 5.5 Selection........................................ 128 6 Implementation and Prototypes 131 6.1 Used Software and Frameworks............................ 131 6.2 Installation and Configuration of the Environment.................. 132 iii Contents 6.3 IDARES Web Services.................................. 136 6.4 Extension of the Web Service Interface........................ 142 6.5 Development notes................................... 144 6.6 Tests and Prototypes.................................. 145 6.7 Runtime Environment and its Virtualization..................... 149 7 Summary and Future Work 153 A Appendix 157 A.1 Abbreviations...................................... 157 A.2 List of Requirements.................................. 160 A.3 Performance Queries.................................. 162 A.4 IDARES Web Services.................................. 174 A.5 Extensions to the Web Service Interface........................ 197 A.6 Performance Results.................................. 201 List of Figures 205 List of Tables 206 Bibliography 207 iv 1 Introduction Today, computer simulations are one of the cornerstones of scientific research. Computer simulations are utilized for various problems in physics, climate modeling, genomics, and earthquake prediction ([TDGS06], [GKC+09], [Har05], etc.). On the one hand, the methodology of simulations is applied to investigate theoretical hypotheses (cf. [Roh90]). On the other hand, simulations are utilized as a complement to experiments, or even as a substitute, e. g., car crash simulations. Besides the steady increase of computing capabilities, scientific research faces a huge amount of data that has to be managed ([DG06], [GKC+09]). Furthermore, the scientific community has always tend to build problem specific solutions. Therefore, the reuse of simulation models and data in other projects, or even disciplines, is an issue. One way to tackle these challenges are scientific workflows and – especially considered in this thesis – simulation workflows. The workflow technology originates from the business process world, offers modular loosely-coupled components, and facilitates service integration ([LR00], [BG06], [MSK+95], [WCL+05]). Moreover, modern workflow systems provide monitoring and administration functions for workflows as well as fault and compensation handling. Thereby, workflow management systems simplify the life-cylcle management of workflows essentially ([GDE+07]). The scientific community has perceived the potential of workflow technology, since it allows the automation and repeatability of long running scientific processes. The workflow language BPEL (Business Process Execution Language) has been investigated and suggested to be applied for scientific purposes ([AMA06], [TDGS06]). In addition to that, multiple projects have begun to develop more or less generic environments for scientific workflows, such as Kepler1, Taverna2, and Microsoft Trident3. 1https://kepler-project.org/ 2http://www.taverna.org.uk/ 3http://research.microsoft.com/en-us/collaboration/tools/trident.aspx 1 1 Introduction 1.1 Motivation and Task of this Thesis In the context of simulation workflows, data management is an important issue. Such workflows do not only have to cope with huge amounts of data, they also have to integrate heterogeneous data sources ([PHP+06]). [HG07] state that simulations spent a lot of time in pre-processing and post-processing of complex models. [Feh10] agrees to this and confirms that a lot of work on modeling and performing simulations is spent in data conversion tasks. Database technology has been applied in some projects to address data conversion problems and ease data management for simulation models (e. g., [HG07]). Moreover, databases facilitate rich query languages through standardized interfaces and provide transactional access for concurrent simulation activities. The Cluster of Excellence “Simulation Technology” (SimTech)4 at the University of Stuttgart conducts research for computer simulations to tackle common issues in this domain. Especially the Project Network 8 of SimTech investigates how workflow technology and integrated data management can support scientists in the development of simulation workflows. An example for the research of the project network is a solution how monolithic traditional simulation software can be integrated into a runtime environment for simulation workflows. This thesis is situated in the SimTech context and is concerned with data management issues in control flow-oriented simulation workflows. The thesis investigates how database technology can be utilized to simplify typical data provisioning tasks in simulation workflows. Suggestions are collected and developed how a control flow-oriented workflow environment would benefit of database systems that are tightly integrated into the environment and accessible by work- flow tasks. Accompanied by additional services, these database systems form an integrated database architecture. The thesis further suggests application fields that highlight how the integrated database architecture simplifies tasks related to data management or improves their performance. The ideas developed in this thesis are applied to a runtime environment for simulation workflows based on BPEL as the workflow language and Apache ODE as the workflow engine. The integrated database architecture is conceptualized in general and developed for its application in this environment. The database architecture consists of multiple database systems to support the following requirements: • Storage of the process information of Apache ODE. • Storage of the runtime information of Apache ODE, such as events and auditing information. • Import of external data into the database architecture in – relational and – XML based formats. 4http://www.simtech.uni-stuttgart.de/ 2 1.2 Conventions for this Document • Storage of relational and XML based data in – external storage based databases and in – main memory based databases. The main memory based databases are supported by the database architecture in order to provide a fast local storage for data-intensive simulation tasks and hence to improve their performance. [Rei08] already considered a local cache for performance improvements in BPEL workflows. This thesis prepares a possible realization of this approach. In order to provide necessary database systems for the integrated database architecture, candidates are collected, analyzed, and selected for their applicability during an evaluation. The integrated database architecture respects the design of the student project SIMPL5, which, amongst others, provides a framework to enrich the workflow language BPEL with data manage- ment activities. Since the SIMPL project is yet to be finished, the database architecture cannot rely on their implementation. Thus, the thesis provides an own approach to access databases from BPEL workflows. In order to demonstrate the practical application of the integrated database architecture, this thesis extends example workflows provided by [Rut09]. The workflows utilize a developed Web Service framework, which abstracts the access to traditional simulation software. The extended example workflows show how to orchestrate the steps of traditional simulation software and how their input and output data can be provisioned. This data provisioning is important because it facilitates the seamless integration of such simulation software into simulation