Development of an Integrated Database Architecture for a Runtime Environment for Simulation Workflows

Total Page:16

File Type:pdf, Size:1020Kb

Development of an Integrated Database Architecture for a Runtime Environment for Simulation Workflows Institute of Architecture of Application Systems University of Stuttgart Universitätsstraße 38 D–70569 Stuttgart Diploma Thesis No. 2984 Development of an Integrated Database Architecture for a Runtime Environment for Simulation Workflows Christoph Marian Müller Course of Study: Software Engineering Examiner: Prof. Dr.-Ing. Dimka Karastoyanova Supervisors: Dipl.-Inf. Peter Reimann, Dipl.-Math. Michael Reiter Commenced: July 29, 2009 Completed: February 5, 2010 CR-Classification: H.3.4, H.4.1, I.6.7 Contents 1 Introduction 1 1.1 Motivation and Task of this Thesis...........................2 1.2 Conventions for this Document............................3 1.3 Legal Statements....................................4 1.4 Structure of this Document..............................4 2 Background 7 2.1 Basics of Web Services.................................7 2.2 Basics of Workflows................................... 10 2.3 Scientific Workflows and Simulations......................... 16 2.4 Database Issues..................................... 35 2.5 Tools and Frameworks................................. 38 2.6 Database Workflow Integration............................ 44 3 Requirements Analysis 47 3.1 Requirements by the Task of the Thesis........................ 48 3.2 Requirements due to the Simulation Workflow Environment............ 50 3.3 Requirements due to Potential Database Applications in Simulations........ 51 3.4 Workflow Scenarios and Use Cases.......................... 54 3.5 Summary of the Application Fields.......................... 69 4 Simulation Workflow Database Architecture 71 4.1 Logical Architecture.................................. 71 4.2 Components of the Architecture............................ 72 4.3 Components of the Prototype Architecture...................... 76 4.4 Portability to Other Workflow Environments..................... 78 5 Database Evaluation 81 5.1 Evaluation Schema................................... 81 5.2 Evaluation Infrastructure................................ 91 5.3 Database presentation................................. 97 5.4 Database Evaluation and Results........................... 106 5.5 Selection........................................ 128 6 Implementation and Prototypes 131 6.1 Used Software and Frameworks............................ 131 6.2 Installation and Configuration of the Environment.................. 132 iii Contents 6.3 IDARES Web Services.................................. 136 6.4 Extension of the Web Service Interface........................ 142 6.5 Development notes................................... 144 6.6 Tests and Prototypes.................................. 145 6.7 Runtime Environment and its Virtualization..................... 149 7 Summary and Future Work 153 A Appendix 157 A.1 Abbreviations...................................... 157 A.2 List of Requirements.................................. 160 A.3 Performance Queries.................................. 162 A.4 IDARES Web Services.................................. 174 A.5 Extensions to the Web Service Interface........................ 197 A.6 Performance Results.................................. 201 List of Figures 205 List of Tables 206 Bibliography 207 iv 1 Introduction Today, computer simulations are one of the cornerstones of scientific research. Computer simulations are utilized for various problems in physics, climate modeling, genomics, and earthquake prediction ([TDGS06], [GKC+09], [Har05], etc.). On the one hand, the methodology of simulations is applied to investigate theoretical hypotheses (cf. [Roh90]). On the other hand, simulations are utilized as a complement to experiments, or even as a substitute, e. g., car crash simulations. Besides the steady increase of computing capabilities, scientific research faces a huge amount of data that has to be managed ([DG06], [GKC+09]). Furthermore, the scientific community has always tend to build problem specific solutions. Therefore, the reuse of simulation models and data in other projects, or even disciplines, is an issue. One way to tackle these challenges are scientific workflows and – especially considered in this thesis – simulation workflows. The workflow technology originates from the business process world, offers modular loosely-coupled components, and facilitates service integration ([LR00], [BG06], [MSK+95], [WCL+05]). Moreover, modern workflow systems provide monitoring and administration functions for workflows as well as fault and compensation handling. Thereby, workflow management systems simplify the life-cylcle management of workflows essentially ([GDE+07]). The scientific community has perceived the potential of workflow technology, since it allows the automation and repeatability of long running scientific processes. The workflow language BPEL (Business Process Execution Language) has been investigated and suggested to be applied for scientific purposes ([AMA06], [TDGS06]). In addition to that, multiple projects have begun to develop more or less generic environments for scientific workflows, such as Kepler1, Taverna2, and Microsoft Trident3. 1https://kepler-project.org/ 2http://www.taverna.org.uk/ 3http://research.microsoft.com/en-us/collaboration/tools/trident.aspx 1 1 Introduction 1.1 Motivation and Task of this Thesis In the context of simulation workflows, data management is an important issue. Such workflows do not only have to cope with huge amounts of data, they also have to integrate heterogeneous data sources ([PHP+06]). [HG07] state that simulations spent a lot of time in pre-processing and post-processing of complex models. [Feh10] agrees to this and confirms that a lot of work on modeling and performing simulations is spent in data conversion tasks. Database technology has been applied in some projects to address data conversion problems and ease data management for simulation models (e. g., [HG07]). Moreover, databases facilitate rich query languages through standardized interfaces and provide transactional access for concurrent simulation activities. The Cluster of Excellence “Simulation Technology” (SimTech)4 at the University of Stuttgart conducts research for computer simulations to tackle common issues in this domain. Especially the Project Network 8 of SimTech investigates how workflow technology and integrated data management can support scientists in the development of simulation workflows. An example for the research of the project network is a solution how monolithic traditional simulation software can be integrated into a runtime environment for simulation workflows. This thesis is situated in the SimTech context and is concerned with data management issues in control flow-oriented simulation workflows. The thesis investigates how database technology can be utilized to simplify typical data provisioning tasks in simulation workflows. Suggestions are collected and developed how a control flow-oriented workflow environment would benefit of database systems that are tightly integrated into the environment and accessible by work- flow tasks. Accompanied by additional services, these database systems form an integrated database architecture. The thesis further suggests application fields that highlight how the integrated database architecture simplifies tasks related to data management or improves their performance. The ideas developed in this thesis are applied to a runtime environment for simulation workflows based on BPEL as the workflow language and Apache ODE as the workflow engine. The integrated database architecture is conceptualized in general and developed for its application in this environment. The database architecture consists of multiple database systems to support the following requirements: • Storage of the process information of Apache ODE. • Storage of the runtime information of Apache ODE, such as events and auditing information. • Import of external data into the database architecture in – relational and – XML based formats. 4http://www.simtech.uni-stuttgart.de/ 2 1.2 Conventions for this Document • Storage of relational and XML based data in – external storage based databases and in – main memory based databases. The main memory based databases are supported by the database architecture in order to provide a fast local storage for data-intensive simulation tasks and hence to improve their performance. [Rei08] already considered a local cache for performance improvements in BPEL workflows. This thesis prepares a possible realization of this approach. In order to provide necessary database systems for the integrated database architecture, candidates are collected, analyzed, and selected for their applicability during an evaluation. The integrated database architecture respects the design of the student project SIMPL5, which, amongst others, provides a framework to enrich the workflow language BPEL with data manage- ment activities. Since the SIMPL project is yet to be finished, the database architecture cannot rely on their implementation. Thus, the thesis provides an own approach to access databases from BPEL workflows. In order to demonstrate the practical application of the integrated database architecture, this thesis extends example workflows provided by [Rut09]. The workflows utilize a developed Web Service framework, which abstracts the access to traditional simulation software. The extended example workflows show how to orchestrate the steps of traditional simulation software and how their input and output data can be provisioned. This data provisioning is important because it facilitates the seamless integration of such simulation software into simulation
Recommended publications
  • Fun Factor: Coding with Xquery a Conversation with Jason Hunter by Ivan Pedruzzi, Senior Product Architect for Stylus Studio
    Fun Factor: Coding With XQuery A Conversation with Jason Hunter by Ivan Pedruzzi, Senior Product Architect for Stylus Studio Jason Hunter is the author of Java Servlet Programming and co-author of Java Enterprise Best Practices (both O'Reilly Media), an original contributor to Apache Tomcat, and a member of the expert groups responsible for Servlet, JSP, JAXP, and XQJ (XQuery API for Java) development. Jason is an Apache Member, and as Apache's representative to the Java Community Process Executive Committee he established a landmark agreement for open source Java. He co-created the open source JDOM library to enable optimized Java and XML integration. More recently, Jason's work has focused on XQuery technologies. In addition to helping on XQJ, he co-created BumbleBee, an XQuery test harness, and started XQuery.com, a popular XQuery development portal. Jason presently works as a Senior Engineer with Mark Logic, maker of Content Interaction Server, an XQuery-enabled database for content. Jason is interviewed by Ivan Pedruzzi, Senior Product Architect for Stylus Studio. Stylus Studio is the leading XML IDE for XML data integration, featuring advanced support for XQuery development, including XQuery editing, mapping, debugging and performance profiling. Ivan Pedruzzi: Hi, Jason. Thanks for taking the time to XQuery behind a J2EE server (I have a JSP tag library talk with The Stylus Scoop today. Most of our readers for this) but I’ve found it easier to simply put XQuery are familiar with your past work on Java Servlets; but directly on the web. What you do is put .xqy files on the could you tell us what was behind your more recent server that are XQuery scripts to produce XHTML interest and work in XQuery technologies? pages.
    [Show full text]
  • The State of the Feather
    The State of the Feather An Overview and Year In Review of The Apache Software Foundation The Overview • Not a replacement for “Behind the Scenes...” • To appreciate where we are - • Need to understand how we got here 2 In the beginning... • There was The Apache Group • But we needed a more formal and legal entity • Thus was born: The Apache Software Foundation (April/June 1999) • A non-profit, 501(c)3 Corporation • Governed by members - member based entity 3 “Hierarchies” Development Administrative PMC Members Committers Board Contributors Officers Patchers/Buggers Members Users 4 At the start • There were only 21 members • And 2 “projects”: httpd and Concom • All servers and services were donated 5 Today... • We have 227 members... • ~54 TLPs • ~25 Incubator podlings • Tons of committers (literally) 6 The only constant... • Has been Change (and Growth!) • Over the years, the ASF has adjusted to handle the increasing “administrative” aspects of the foundation • While remaining true to our goals and our beginnings 7 Handling growth • ASF dedicated to providing the infrastructure resources needed • Volunteers supplemented by contracted out SysAdmin • Paperwork handling supplemented by contracted out SecAssist • Accounting services as needed • Using pro-bono legal services 8 Staying true • Policy still firmly in the hands of the ASF • Use outsourced help where needed – Help volunteers, not replace them – Only for administrative efforts • Infrastructure itself is a service provided by the ASF • Board/Infra/etc exists so projects and people
    [Show full text]
  • Preview HSQLDB Tutorial (PDF Version)
    About the Tutorial HyperSQL Database is a modern relational database manager that conforms closely to the SQL:2011 standard and JDBC 4 specifications. It supports all core features and RDBMS. HSQLDB is used for the development, testing, and deployment of database applications. In this tutorial, we will look closely at HSQLDB, which is one of the best open-source, multi-model, next generation NoSQL product. Audience This tutorial is designed for Software Professionals who are willing to learn HSQL Database in simple and easy steps. It will give you a great understanding on HSQLDB concepts. Prerequisites Before you start practicing the various types of examples given in this tutorial, we assume you are already aware of the concepts of database, especially RDBMS. Disclaimer & Copyright Copyright 2016 by Tutorials Point (I) Pvt. Ltd. All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or republish any contents or a part of contents of this e-book in any manner without written consent of the publisher. We strive to update the contents of our website and tutorials as timely and as precisely as possible, however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at [email protected]. i Table of Contents About the Tutorial ...................................................................................................................................
    [Show full text]
  • Getting Started with Derby Version 10.14
    Getting Started with Derby Version 10.14 Derby Document build: April 6, 2018, 6:13:12 PM (PDT) Version 10.14 Getting Started with Derby Contents Copyright................................................................................................................................3 License................................................................................................................................... 4 Introduction to Derby........................................................................................................... 8 Deployment options...................................................................................................8 System requirements.................................................................................................8 Product documentation for Derby........................................................................... 9 Installing and configuring Derby.......................................................................................10 Installing Derby........................................................................................................ 10 Setting up your environment..................................................................................10 Choosing a method to run the Derby tools and startup utilities...........................11 Setting the environment variables.......................................................................12 Syntax for the derbyrun.jar file............................................................................13
    [Show full text]
  • Operational Database Offload
    Operational Database Offload Partner Brief Facing increased data growth and cost pressures, scale‐out technology has become very popular as more businesses become frustrated with their costly “Our partnership with Hortonworks is able to scale‐up RDBMSs. With Hadoop emerging as the de facto scale‐out file system, a deliver to our clients 5‐10x faster performance Hadoop RDBMS is a natural choice to replace traditional relational databases and over 75% reduction in TCO over traditional scale‐up databases. With Splice like Oracle and IBM DB2, which struggle with cost or scaling issues. Machine’s SQL‐based transactional processing Designed to meet the needs of real‐time, data‐driven businesses, Splice engine, our clients are able to migrate their legacy database applications without Machine is the only Hadoop RDBMS. Splice Machine offers an ANSI‐SQL application rewrites” database with support for ACID transactions on the distributed computing Monte Zweben infrastructure of Hadoop. Like Oracle and MySQL, it is an operational database Chief Executive Office that can handle operational (OLTP) or analytical (OLAP) workloads, while scaling Splice Machine out cost‐effectively from terabytes to petabytes on inexpensive commodity servers. Splice Machine, a technology partner with Hortonworks, chose HBase and Hadoop as its scale‐out architecture because of their proven auto‐sharding, replication, and failover technology. This partnership now allows businesses the best of all worlds: a standard SQL database, the proven scale‐out of Hadoop, and the ability to leverage current staff, operations, and applications without specialized hardware or significant application modifications. What Business Challenges are Solved? Leverage Existing SQL Tools Cost Effective Scaling Real‐Time Updates Leveraging the proven SQL processing of Splice Machine leverages the proven Splice Machine provides full ACID Apache Derby, Splice Machine is a true ANSI auto‐sharding of HBase to scale with transactions across rows and tables by using SQL database on Hadoop.
    [Show full text]
  • Oracle® Big Data Connectors User's Guide
    Oracle® Big Data Connectors User's Guide Release 4 (4.12) E93614-03 July 2018 Oracle Big Data Connectors User's Guide, Release 4 (4.12) E93614-03 Copyright © 2011, 2018, Oracle and/or its affiliates. All rights reserved. Primary Author: Frederick Kush This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency- specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs.
    [Show full text]
  • Tuning Derby Version 10.14
    Tuning Derby Version 10.14 Derby Document build: April 6, 2018, 6:14:42 PM (PDT) Version 10.14 Tuning Derby Contents Copyright................................................................................................................................4 License................................................................................................................................... 5 About this guide....................................................................................................................9 Purpose of this guide................................................................................................9 Audience..................................................................................................................... 9 How this guide is organized.....................................................................................9 Performance tips and tricks.............................................................................................. 10 Use prepared statements with substitution parameters......................................10 Create indexes, and make sure they are being used...........................................10 Ensure that table statistics are accurate.............................................................. 10 Increase the size of the data page cache............................................................. 11 Tune the size of database pages...........................................................................11 Performance trade-offs of large pages..............................................................
    [Show full text]
  • Resume of Dr. Michael J. Bisconti
    Table of Contents This file contains, in order: Time Savers Experience Matrix Resume _________________________ 1 Time Savers There are a number of things we can do to save everyone’s time. In addition to resume information there are a number of common questions that employers and recruiters have. Here is an FAQ that addresses these questions. (We may expand this FAQ over time.) Frequently Asked Questions 1099 Multiple Interviewers Severance Pay Contract End Date Multiple Interviews Technical Exam Contract Job Need/Skill Assessment Interview Temporary Vs. Permanent Contract Rate Payment Due Dates U.S. Citizenship Drug Testing Permanent Job W2 Face-to-face Interview Phone Interview Word Resume Job Hunt Progress Salary Are you a U.S. citizen? Yes. Do you have a Word resume? Yes, and I also have an Adobe PDF resume. Do you prefer temporary (contract) or permanent employment? Neither, since, in the end, they are equivalent. Will you take a drug test? 13 drug tests taken and passed. Do you work 1099? Yes, but I give W2 payers preference. Do you work W2? Yes, and I work 1099 as well but I give W2 payers preference. How is your job search going? See 1.2 Job Hunt Progress. What contract rate do you expect? $65 to $85/hr. W2 and see the 2.5 Quick Rates Guide. What salary do you expect? 120k to 130k/yr. and see the 2.5 Quick Rates Guide. When do you expect to be paid? Weekly or biweekly and weekly payers will be given preference. Will you do a face-to-face interview? Yes, but I prefer a Skype or equivalent interview because gas is so expensive and time is so valuable.
    [Show full text]
  • Base Handbook Copyright
    Version 4.0 Base Handbook Copyright This document is Copyright © 2013 by its contributors as listed below. You may distribute it and/or modify it under the terms of either the GNU General Public License (http://www.gnu.org/licenses/gpl.html), version 3 or later, or the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), version 3.0 or later. All trademarks within this guide belong to their legitimate owners. Contributors Jochen Schiffers Robert Großkopf Jost Lange Hazel Russman Martin Fox Andrew Pitonyak Dan Lewis Jean Hollis Weber Acknowledgments This book is based on an original German document, which was translated by Hazel Russman and Martin Fox. Feedback Please direct any comments or suggestions about this document to: [email protected] Publication date and software version Published 3 July 2013. Based on LibreOffice 4.0. Documentation for LibreOffice is available at http://www.libreoffice.org/get-help/documentation Contents Copyright..................................................................................................................................... 2 Contributors.............................................................................................................................2 Feedback................................................................................................................................ 2 Acknowledgments................................................................................................................... 2 Publication
    [Show full text]
  • Unravel Data Systems Version 4.5
    UNRAVEL DATA SYSTEMS VERSION 4.5 Component name Component version name License names jQuery 1.8.2 MIT License Apache Tomcat 5.5.23 Apache License 2.0 Tachyon Project POM 0.8.2 Apache License 2.0 Apache Directory LDAP API Model 1.0.0-M20 Apache License 2.0 apache/incubator-heron 0.16.5.1 Apache License 2.0 Maven Plugin API 3.0.4 Apache License 2.0 ApacheDS Authentication Interceptor 2.0.0-M15 Apache License 2.0 Apache Directory LDAP API Extras ACI 1.0.0-M20 Apache License 2.0 Apache HttpComponents Core 4.3.3 Apache License 2.0 Spark Project Tags 2.0.0-preview Apache License 2.0 Curator Testing 3.3.0 Apache License 2.0 Apache HttpComponents Core 4.4.5 Apache License 2.0 Apache Commons Daemon 1.0.15 Apache License 2.0 classworlds 2.4 Apache License 2.0 abego TreeLayout Core 1.0.1 BSD 3-clause "New" or "Revised" License jackson-core 2.8.6 Apache License 2.0 Lucene Join 6.6.1 Apache License 2.0 Apache Commons CLI 1.3-cloudera-pre-r1439998 Apache License 2.0 hive-apache 0.5 Apache License 2.0 scala-parser-combinators 1.0.4 BSD 3-clause "New" or "Revised" License com.springsource.javax.xml.bind 2.1.7 Common Development and Distribution License 1.0 SnakeYAML 1.15 Apache License 2.0 JUnit 4.12 Common Public License 1.0 ApacheDS Protocol Kerberos 2.0.0-M12 Apache License 2.0 Apache Groovy 2.4.6 Apache License 2.0 JGraphT - Core 1.2.0 (GNU Lesser General Public License v2.1 or later AND Eclipse Public License 1.0) chill-java 0.5.0 Apache License 2.0 Apache Commons Logging 1.2 Apache License 2.0 OpenCensus 0.12.3 Apache License 2.0 ApacheDS Protocol
    [Show full text]
  • LIST of NOSQL DATABASES [Currently 150]
    Your Ultimate Guide to the Non - Relational Universe! [the best selected nosql link Archive in the web] ...never miss a conceptual article again... News Feed covering all changes here! NoSQL DEFINITION: Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above. [based on 7 sources, 14 constructive feedback emails (thanks!) and 1 disliking comment . Agree / Disagree? Tell me so! By the way: this is a strong definition and it is out there here since 2009!] LIST OF NOSQL DATABASES [currently 150] Core NoSQL Systems: [Mostly originated out of a Web 2.0 need] Wide Column Store / Column Families Hadoop / HBase API: Java / any writer, Protocol: any write call, Query Method: MapReduce Java / any exec, Replication: HDFS Replication, Written in: Java, Concurrency: ?, Misc: Links: 3 Books [1, 2, 3] Cassandra massively scalable, partitioned row store, masterless architecture, linear scale performance, no single points of failure, read/write support across multiple data centers & cloud availability zones. API / Query Method: CQL and Thrift, replication: peer-to-peer, written in: Java, Concurrency: tunable consistency, Misc: built-in data compression, MapReduce support, primary/secondary indexes, security features.
    [Show full text]
  • Return of Organization Exempt from Income
    OMB No. 1545-0047 Return of Organization Exempt From Income Tax Form 990 Under section 501(c), 527, or 4947(a)(1) of the Internal Revenue Code (except black lung benefit trust or private foundation) Open to Public Department of the Treasury Internal Revenue Service The organization may have to use a copy of this return to satisfy state reporting requirements. Inspection A For the 2011 calendar year, or tax year beginning 5/1/2011 , and ending 4/30/2012 B Check if applicable: C Name of organization The Apache Software Foundation D Employer identification number Address change Doing Business As 47-0825376 Name change Number and street (or P.O. box if mail is not delivered to street address) Room/suite E Telephone number Initial return 1901 Munsey Drive (909) 374-9776 Terminated City or town, state or country, and ZIP + 4 Amended return Forest Hill MD 21050-2747 G Gross receipts $ 554,439 Application pending F Name and address of principal officer: H(a) Is this a group return for affiliates? Yes X No Jim Jagielski 1901 Munsey Drive, Forest Hill, MD 21050-2747 H(b) Are all affiliates included? Yes No I Tax-exempt status: X 501(c)(3) 501(c) ( ) (insert no.) 4947(a)(1) or 527 If "No," attach a list. (see instructions) J Website: http://www.apache.org/ H(c) Group exemption number K Form of organization: X Corporation Trust Association Other L Year of formation: 1999 M State of legal domicile: MD Part I Summary 1 Briefly describe the organization's mission or most significant activities: to provide open source software to the public that we sponsor free of charge 2 Check this box if the organization discontinued its operations or disposed of more than 25% of its net assets.
    [Show full text]