Coercion Approach to the Shimming Problem in Scientific Workflows

Andrey Kashlev, Shiyong Lu
Department of Computer Science, Wayne State University

Artem Chebotko
Department of Computer Science, University of Texas – Pan American

Abstract: When designing scientific workflows, users often face the so-called shimming problem when connecting two related but incompatible components. The problem is addressed by inserting special adaptors, called shims, that perform appropriate data transformations to resolve data type inconsistencies. However, existing shimming techniques provide limited automation and burden users with having to define ontological mappings, generate data transformations, and even manually write shimming code. In addition, these approaches insert many visible shims that clutter the workflow design and distract users' attention from the functional components of the workflow. To address these issues, we 1) reduce the shimming problem to a runtime coercion problem in the theory of type systems, 2) propose a scientific workflow model and define the notion of well-typed workflows, 3) develop three algorithms to typecheck workflows by first translating them into equivalent lambda expressions, 4) design two functions that together insert "invisible shims", or runtime coercions, into workflows, thereby solving the shimming problem for any well-typed workflow, and 5) implement our automated shimming technique, including all the proposed algorithms, lambda calculus, type system, and translation functions, in our VIEW system and present a case study to validate the proposed approach.

Keywords: shim; shimming problem; scientific workflows

I. INTRODUCTION

Scientific workflows are becoming increasingly important to integrate, structure, and orchestrate a variety of heterogeneous services and applications into complex computational processes to enable and facilitate scientific discovery. Oftentimes, composing autonomous third-party services and applications into workflows requires using intermediate components, called shims, to mediate syntactic and semantic incompatibilities between different heterogeneous components.

Consider a workflow $W_a$ in Fig. 1 comprised of two web services, Not and Increment. Because the output of Not is a boolean value (true or false) while Increment is designed to process integer arguments, to execute the workflow we need to find and insert a shim that will resolve this incompatibility. Determining where the shim is needed, finding an appropriate shim, and inserting it is known as the shimming problem, whose significance has been widely recognized by the scientific workflow community [3-8].

Existing approaches to the shimming problem have the following limitations.

First, existing techniques are not automated and burden users by requiring them to generate transformation scripts, define mappings to and from domain ontologies, and even write shimming code [10-11, 18]. We believe these requirements are difficult and make workflow design counterproductive for non-technical users.

Second, current approaches produce cluttered workflows with many visible shims that distract users from the main workflow components that perform useful work. Furthermore, recent workflow studies [3, 20] show that the percentage of shim components in workflows registered in the myExperiment portal (www.myexperiment.org) has grown from 30% in 2009 [20] to 38% in 2012 [3]. These numbers indicate that such explicit shimming tends to make workflows even messier over time, which further diminishes the usefulness of these techniques.

Third, many shimming techniques only apply under a particular set of circumstances that are hard to guarantee or even predict. Some approaches (e.g., [9-12]) apply only when all the right shims are supplied by web service providers and are properly annotated beforehand, and/or when required shims can be generated by automated agents (e.g., XQuery-based shims [12]), which cannot be guaranteed for any practical class of workflows. Such uncertainty makes these techniques unreliable in the eyes of end users (domain scientists) who need assurance that their workflows will run.

Finally, these efforts focus on shims for scientific data of a particular type, such as XML [10-12] or relational schemas [13], and cannot be generalized to handle all structured data types, let alone primitive types such as String or Double.

To address these issues, we
1. reduce the shimming problem to a runtime coercion problem in the theory of type systems,
2. propose a scientific workflow model and define the notion of well-typed workflows,
3. develop three algorithms to typecheck workflows by translating them into equivalent lambda expressions,
4. design two functions that together insert "invisible shims" (coercions) into workflows, thereby solving the shimming problem for any well-typed workflow,
5. implement our automated shimming technique and present a case study to validate the proposed approach.

To the best of our knowledge, this work is the first to reduce the shimming problem to the coercion problem and to propose a fully automated solution with no human involvement. Moreover, our technique does not insert shims in the workflow design, but instead performs implicit shimming by dynamically injecting coercions during workflow execution. While this paper focuses on primitive types, such as Int and Float, our general approach equally applies to structured data types, as we explain in Section 6.

Fig. 1. Examples of scientific workflows ($W_a$, $W_b$, …, $W_g$).
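The idea of implicit shimming in the Not and Increment example can be illustrated with a small Python sketch. It is not the authors' VIEW implementation: the coercion table, all function names, and the mapping of true to 1 and false to 0 are assumptions made purely for illustration.

```python
# Hypothetical coercion table: (source type, target type) -> conversion.
# The Bool -> Int entry mirrors the Not -> Increment example; the concrete
# mapping (True -> 1, False -> 0) and the Int -> Float entry are assumptions.
COERCIONS = {
    ("Bool", "Int"): lambda b: 1 if b else 0,
    ("Int", "Float"): float,
}


def coerce(value, source_type, target_type):
    """Convert a value at runtime, or fail if no coercion is known."""
    if source_type == target_type:
        return value
    converter = COERCIONS.get((source_type, target_type))
    if converter is None:
        raise TypeError(f"no coercion from {source_type} to {target_type}")
    return converter(value)


def not_service(x):
    """Stand-in for the Not web service (Bool -> Bool)."""
    return not x


def increment_service(n):
    """Stand-in for the Increment web service (Int -> Int)."""
    return n + 1


def run_wa(x):
    """Run Not -> Increment, coercing on the connecting data channel."""
    intermediate = not_service(x)                 # Bool output of Not
    as_int = coerce(intermediate, "Bool", "Int")  # the "invisible shim"
    return increment_service(as_int)
```

Under these assumptions, `run_wa(True)` evaluates Not to false, coerces it to 0, and returns 1. The user-visible workflow still contains only Not and Increment; the conversion happens on the channel between them at execution time.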
II. SCIENTIFIC WORKFLOW MODEL

Scientific workflows consist of one or more computational components connected to each other and possibly to some input data products. Each of these components can be viewed as a black box with well-defined input and output ports. Every component is itself another workflow, either primitive or composite. Primitive workflows are bound to executable components, such as web services, scripts, or high performance computing (HPC) services, and can be viewed as atomic entities. Composite workflows consist of multiple building blocks connected to one another via data channels. Each of these building blocks can be either a workflow or a data product. In the following we formalize the scientific workflow model used in this paper.

Definition 2.1 (Port). A port is a pair (id, type) consisting of a unique identifier and a data type associated with this port. We denote input and output ports as $ip_i : T_i$ and $op_j : T_j$, respectively, where $ip_i$ and $op_j$ are identifiers, and $T_i$ and $T_j$ are port types.

Definition 2.2 (Data Product). A data product is a triple (id, value, type) consisting of a unique identifier, a value, and a type associated with this data product. We denote each data product as $dp_i : T_i$, where $dp_i$ is the identifier and $T_i$ is the type of the data product.

Given a workflow $W$ and the set of its constituent workflows $W^*$, we use $W.p_j$ to denote port $p_j$ of $W$ (be it an input or an output port) and $W.W^*.IP$ ($W.W^*.OP$) to represent the union of the sets of input (output) ports of all constituent workflows of $W$. Whenever it is clear from the context, we omit the leading "$W.$". Formally,

$W^*.IP = \{ ip_i \mid ip_i \in W_j.IP,\ W_j \in W^* \}$
$W^*.OP = \{ op_k \mid op_k \in W_l.OP,\ W_l \in W^* \}$

Definition 2.3 (Scientific workflow). A scientific workflow $W$ is a 9-tuple $(id, IP, OP, W^*, DP, DC_{in}, DC_{out}, DC_{mid}, DC_{idp})$, where
1. $id$ is a unique identifier,
2. $IP = \{ip_0, ip_1, \ldots, ip_n\}$ is an ordered set of input ports,
3. $OP = \{op_0, op_1, \ldots, op_m\}$ is an ordered set of output ports,
4. $W^* = \{W_0, W_1, \ldots, W_p\}$ is a set of constituent workflows used in $W$. Each $W_i \in W^*$ is another 9-tuple,
5. …
6. … $DC_{in} \subseteq \{ (ip_i, (W_j, ip_k)) \mid ip_i \in IP,\ W_j \in W^*,\ ip_k \in W_j.IP \}$. That is, each pair in $DC_{in}$ represents a data channel connecting input port $ip_i \in IP$ to an input port $ip_k$ of some component $W_j \in W^*$.
7. $DC_{out} : W^* \times W^*.OP \to OP$ is an inverse-functional one-to-many mapping. $DC_{out}$ is a set of ordered pairs: $DC_{out} \subseteq \{ ((W_i, op_j), op_k) \mid W_i \in W^*,\ op_j \in W_i.OP,\ op_k \in OP \}$. That is, each pair in $DC_{out}$ represents a data channel connecting output port $op_j$ of some component $W_i \in W^*$ to an output port $op_k \in OP$.
8. $DC_{mid} : W^* \times W^*.OP \to W^* \times W^*.IP$ is an inverse-functional one-to-many mapping. $DC_{mid}$ is a set of ordered pairs: $DC_{mid} \subseteq \{ ((W_i, op_j), (W_k, ip_n)) \mid W_i, W_k \in W^*,\ op_j \in W_i.OP,\ ip_n \in W_k.IP \}$. That is, each pair in $DC_{mid}$ represents a data channel connecting an output port $op_j$ of some component $W_i \in W^*$ with an input port $ip_n$ of some other component $W_k \in W^*$.
9. $DC_{idp} : DP \to W^* \times W^*.IP$ is an inverse-functional one-to-many mapping. $DC_{idp}$ is a set of ordered pairs: $DC_{idp} \subseteq \{ (dp_i, (W_j, ip_k)) \mid dp_i \in DP,\ W_j \in W^*,\ ip_k \in W_j.IP \}$. That is, each pair in $DC_{idp}$ represents a data channel that connects a data product $dp_i \in DP$ to the input port $ip_k$ of some component $W_j \in W^*$.

Fig. 1 shows seven workflows that we will reference in this paper as $W_a$, $W_b$, $W_c$, $W_d$, $W_e$, $W_f$, and $W_g$, respectively. These seven workflows use other workflows as their building blocks. Such constituent workflows are shown as blue boxes with their ids written inside each box. Ports appear as red pins pointing right (input port) or left (output port). Finally, data products are visualized as yellow boxes with their values placed inside (e.g., "true" in $W_a$ in Fig. 1).

Because the order of input arguments of a workflow matters (e.g., the Divide workflow in $W_f$ in Fig. 1), we use the ordered set $IP$ to store a list of input ports. We use the term data channel to refer to any element of the set $DC_{in} \cup DC_{mid} \cup DC_{out} \cup DC_{idp}$.

As we show in later sections, a workflow is represented as a lambda expression. To simplify lambda expressions, we focus on workflows with a single output port. We are currently extending our approach to allow the set $OP$ to have a cardinality greater than one. Our definition requires that …
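The workflow model of Definitions 2.1 to 2.3 can be sketched as plain Python data structures. This is only an illustration under stated assumptions: the class and field names, the choice to store channels as tuples of identifiers, the port identifiers (`i1`, `o1`, `op0`, `dp0`), and the exact wiring of $W_a$ (Fig. 1 is not reproduced here) are all hypothetical. The two helper functions check the inverse-functional property stated for the channel mappings and locate channels whose endpoint types differ, which is where a shim would be needed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Port:
    """Definition 2.1: a port is a pair (id, type)."""
    id: str
    type: str


@dataclass(frozen=True)
class DataProduct:
    """Definition 2.2: a data product is a triple (id, value, type)."""
    id: str
    value: object
    type: str


@dataclass
class Workflow:
    """Definition 2.3: the 9-tuple; channels hold identifier tuples."""
    id: str
    IP: list = field(default_factory=list)      # ordered input ports
    OP: list = field(default_factory=list)      # ordered output ports
    W_star: dict = field(default_factory=dict)  # id -> constituent Workflow
    DP: dict = field(default_factory=dict)      # id -> DataProduct
    DC_in: set = field(default_factory=set)     # (ip_i, (W_j, ip_k))
    DC_out: set = field(default_factory=set)    # ((W_i, op_j), op_k)
    DC_mid: set = field(default_factory=set)    # ((W_i, op_j), (W_k, ip_n))
    DC_idp: set = field(default_factory=set)    # (dp_i, (W_j, ip_k))


def is_inverse_functional(channels):
    """True if every channel target is fed by at most one source."""
    targets = [target for _source, target in channels]
    return len(targets) == len(set(targets))


def find_port(ports, port_id):
    """Return the port with the given id from an ordered port list."""
    return next(p for p in ports if p.id == port_id)


def mismatched_mid_channels(w):
    """List DC_mid channels whose source and target port types differ."""
    result = []
    for (wi, opj), (wk, ipn) in sorted(w.DC_mid):
        source = find_port(w.W_star[wi].OP, opj)
        target = find_port(w.W_star[wk].IP, ipn)
        if source.type != target.type:
            result.append(((wi, opj), (wk, ipn)))
    return result


# Assumed encoding of W_a: data product "true" -> Not -> Increment -> output.
not_wf = Workflow("Not", IP=[Port("i1", "Bool")], OP=[Port("o1", "Bool")])
inc_wf = Workflow("Increment", IP=[Port("i1", "Int")], OP=[Port("o1", "Int")])
wa = Workflow(
    "Wa",
    OP=[Port("op0", "Int")],
    W_star={"Not": not_wf, "Increment": inc_wf},
    DP={"dp0": DataProduct("dp0", True, "Bool")},
    DC_idp={("dp0", ("Not", "i1"))},
    DC_mid={(("Not", "o1"), ("Increment", "i1"))},
    DC_out={(("Increment", "o1"), "op0")},
)
```

With this encoding, `mismatched_mid_channels(wa)` reports the single channel from Not to Increment, the Bool-to-Int mismatch discussed in the introduction. Storing channels as sets of identifier tuples keeps the sketch close to the set-of-ordered-pairs notation used in the definition.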
