
Coercion Approach to the Shimming Problem in Scientific Workflows

Andrey Kashlev, Shiyong Lu
Department of Computer Science, Wayne State University

Artem Chebotko
Department of Computer Science, University of Texas – Pan American

Abstract—When designing scientific workflows, users often face the so-called shimming problem when connecting two related but incompatible components. The problem is addressed by inserting a special kind of adaptors, called shims, that perform appropriate data transformations to resolve data type inconsistencies. However, existing shimming techniques provide limited automation and burden users with having to define ontological mappings, generate data transformations, and even manually write shimming code. In addition, these approaches insert many visible shims that clutter the workflow design and distract users' attention from the functional components of the workflow. To address these issues, we 1) reduce the shimming problem to a runtime coercion problem in the theory of type systems, 2) propose a scientific workflow model and define the notion of well-typed workflows, 3) develop three algorithms to typecheck workflows by first translating them into equivalent lambda expressions, 4) design two functions that together insert "invisible shims", or runtime coercions, into workflows, thereby solving the shimming problem for any well-typed workflow, 5) implement our automated shimming technique, including all the proposed algorithms, the type system, and the translation functions, in our VIEW system and present a case study to validate the proposed approach.

Keywords-shim; shimming problem; scientific workflows;

I. INTRODUCTION

Scientific workflows are becoming increasingly important to integrate, structure, and orchestrate a variety of heterogeneous services and applications into complex computational processes to enable and facilitate scientific discovery. Oftentimes, composing autonomous third-party services and applications into workflows requires using intermediate components, called shims, to mediate syntactic and semantic incompatibilities between different heterogeneous components.

Consider a workflow Wa in Fig. 1 comprised of two web services, Not and Increment. Because the output of Not is a boolean value (true or false) while Increment is designed to process integer arguments, to execute the workflow we need to find and insert a shim that will resolve this incompatibility. Determining where the shim is needed, finding an appropriate shim, and inserting it is known as the shimming problem, whose significance has been widely recognized by the scientific workflow community [3-8]. Existing approaches to the shimming problem have the following limitations.

First, existing techniques are not automated and burden users by requiring them to generate transformation scripts, define mappings to and from domain ontologies, and even write shimming code [10-11, 18]. We believe these requirements are difficult and make workflow design counterproductive for non-technical users.

Second, current approaches produce cluttered workflows with many visible shims that distract users from the main workflow components that perform useful work. Furthermore, recent workflow studies [3, 20] show that the percentage of shim components in workflows registered in the myExperiment portal (www.myexperiment.org) has grown from 30% in 2009 [20] to 38% in 2012 [3]. These numbers indicate that such explicit shimming tends to make workflows even messier over time, which further diminishes the usefulness of these techniques.

Third, many shimming techniques only apply under a particular set of circumstances that are hard to guarantee or even predict. Some approaches (e.g., [9-12]) apply only when all the right shims are supplied by web service providers and are properly annotated beforehand, and/or when required shims can be generated by automated agents (e.g., XQuery-based shims [12]), which cannot be guaranteed for any practical class of workflows. Such uncertainty makes these techniques unreliable in the eyes of end users (domain scientists) who need assurance that their workflows will run.

Finally, these efforts focus on shims for scientific data of a particular type, such as XML [10-12] or relational schemas [13], and cannot be generalized to handle all structured data types, let alone primitive types such as String or Double.

To address these issues, we
1. reduce the shimming problem to a runtime coercion problem in the theory of type systems,
2. propose a scientific workflow model and define the notion of well-typed workflows,
3. develop three algorithms to typecheck workflows by translating them into equivalent lambda expressions,
4. design two functions that together insert "invisible shims" (coercions) into workflows, thereby solving the shimming problem for any well-typed workflow,
5. implement our automated shimming technique and present a case study to validate the proposed approach.

To the best of our knowledge, this work is the first to reduce the shimming problem to the coercion problem and to propose a fully automated solution with no human involvement. Moreover, our technique does not insert shims in the workflow design, but instead performs implicit shimming by dynamically injecting coercions during workflow execution. While this paper focuses on primitive types, such as Int and Float, our general approach equally applies to structured data types as we explain in Section 6.

Fig. 1. Examples of scientific workflows (Wa, Wb, …, Wg).

DCin {(ipi, (Wj, ipk)) | ipi  IP, Wj  W*, ipk  Wj.IP} II. SCIENTIFIC WORKFLOW MODEL That is, each pair in DCin represents a data channel Scientific workflows consist of one or more connecting input port ipi IP to an input port ipk of some computational components connected to each other and component Wj  W*. possibly to some input data products. Each of these 7. DCout : W*  W*.OP → OP is an inverse-functional components can be viewed as a black box with well defined one-to-many mapping. DCout is a set of ordered pairs: input and output ports. Every component is itself another DCout  {((Wi, opj), opk) | Wi W*, opj  Wi.OP, opk workflow, either primitive or composite. Primitive workflows OP}. That is, each pair in DCout represents a data channel are bound to executable components, such as web services, connecting output port opj of some component Wi W* to scripts, or high performance computing (HPC) services and an output port op OP. can be viewed as atomic entities. Composite workflows k 8. DCmid : W*  W*.OP → W*  W*.IP is an inverse- consist of multiple building blocks connected to one another functional one-to-many mapping. DC is a set of ordered via data channels. Each of these building blocks can be mid pairs: DC  {((W , op ), (W , ip )) | W , W  W*, op  either a workflow or a data product. In the following we mid i j k n i k j formalize the scientific workflow model used in this paper. Wi.OP, ipn  Wk.IP}. That is, each pair in DCmid represents a Definition 2.1 (Port). A port is a pair (id, type) data channel connecting an output port opj of some consisting of a unique identifier and a data type associated component Wi  W* with an input port ipn of some other component Wk  W*. with this port. We denote input and output ports as ipi:Ti and 9. DC : DP → W*  W*.IP is an inverse-functional opj:Tj, respectively, where ipi and opj are identifiers, and Ti idp one-to-many mapping. DC is a set of ordered pairs: and Tj are port types. 
idp Definition 2.2 (Data Product). A data product is a triple DCidp  {(dpi, (Wj, ipk)) | dpi  DP, Wj  W*, ipk  Wj (id, value, type) consisting of a unique identifier, a value and .IP}. That is, each pair in DCidp represents a data channel that a type associated with this data product. We denote each data connects a data product dpi  DP to the input port ipk of product as dpi:Ti, where dpi is the identifier, and Ti is the type some component Wj  W*. of the data product. Fig. 1 shows seven workflows that we will in Given a workflow W, and the set of its constituent this paper as Wa, Wb, Wc, Wd, We, Wf, and Wg respectively. workflows W*, we use W.pj to denote port pj of W (be it These seven workflows use other workflows as their input or output port) and W.W*.IP (W.W*.OP) to represent building blocks. Such constituent workflows are shown as the union of sets of input (output) ports of all constituent blue boxes with their ids written inside each box. Ports workflows of W. Whenever it is clear from the we appear as red pins pointing right (input port) or left (output omit the leading “W.”. Formally, port). Finally, data products are visualized as yellow boxes W*.IP = {ip | ip W .IP, W W*} with their values placed inside (e.g., “true” in Wa in Fig. 1). i i j j Because the order of input arguments of a workflow W*.OP = {opk | opk  Wl.OP, Wl W*} matters (e.g., Divide workflow in Wf in Fig. 1), we use Definition 2.3 (Scientific workflow). A scientific ordered set IP to store a list of input ports. We use the term workflow W is a 9-tuple (id, IP, OP, W*, DP, DCin, DCout, data channel to refer to any entity from the set {DCin  DCmid, DCidp), where DCmid  DCout  DCidp}. 1. id is a unique identifier, As we show in later sections, a workflow is represented 2. IP = {ip0, ip1, …, ipn} is an ordered set of input ports, as a lambda expression. To simplify lambda expressions, we 3. 
OP = {op0, op1,…, opm} is ordered set of output ports, focus on workflows with a single output port. We are 4. W* = {W0, W1, …, Wp} is a set of constituent currently extending our approach to allow set OP with a workflows used in W. Each Wi  W* is another 9-tuple, cardinality greater than one. Our definition requires that 5. DP = {dp0, dp1, …, dpq} is a set of data products, every workflow and every data product has a unique id. For 6. DCin : IP → W*  W*.IP is an inverse-functional one- simplicity we also require that for any workflow W, all ports to-many mapping. DCin is a set of ordered pairs: of W and all ports of all workflows in W* have unique ids.
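The 9-tuple of Definition 2.3 maps naturally onto a small record type. The following Python sketch is only an illustration of that model, not the VIEW implementation: the class and field names (Workflow, W_star, DC_in, and so on), the helper has_unique_port_ids, and the output port id "op5" chosen for Wa are all assumptions made for this example. Data channels are stored as pairs of identifiers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Port:
    id: str
    type: str


@dataclass(frozen=True)
class DataProduct:
    id: str
    value: object
    type: str


@dataclass
class Workflow:
    id: str
    IP: list = field(default_factory=list)      # ordered input ports
    OP: list = field(default_factory=list)      # ordered output ports
    W_star: list = field(default_factory=list)  # constituent workflows
    DP: list = field(default_factory=list)      # data products
    DC_in: set = field(default_factory=set)     # {(ip_id, (wf_id, ip_id))}
    DC_out: set = field(default_factory=set)    # {((wf_id, op_id), op_id)}
    DC_mid: set = field(default_factory=set)    # {((wf_id, op_id), (wf_id, ip_id))}
    DC_idp: set = field(default_factory=set)    # {(dp_id, (wf_id, ip_id))}

    def data_channels(self):
        """Union of the four channel sets, i.e. every data channel of W."""
        return self.DC_in | self.DC_mid | self.DC_out | self.DC_idp


def has_unique_port_ids(w):
    """Check the requirement that ports of W and of all workflows in W* have unique ids."""
    ports = list(w.IP) + list(w.OP)
    for c in w.W_star:
        ports += list(c.IP) + list(c.OP)
    ids = [p.id for p in ports]
    return len(ids) == len(set(ids))


# Workflow Wa: data product "true" feeds Not, whose output feeds Increment.
# The output port id "op5" is hypothetical.
not_wf = Workflow("Not", IP=[Port("ip1", "Bool")], OP=[Port("op2", "Bool")])
inc_wf = Workflow("Increment", IP=[Port("ip3", "Int")], OP=[Port("op4", "Int")])
wa = Workflow(
    "Wa",
    OP=[Port("op5", "Int")],
    W_star=[not_wf, inc_wf],
    DP=[DataProduct("dp0", True, "Bool")],
    DC_mid={(("Not", "op2"), ("Increment", "ip3"))},
    DC_idp={("dp0", ("Not", "ip1"))},
)
```

Storing channels as identifier pairs keeps them hashable, so the four channel sets can be combined with ordinary set union.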

We model workflow Wd in Fig. 1 as a 9-tuple, where id = "Wd", IP = Ø, OP = {(op9, Float)}, W* = {Mean, Sqrt}, DP = {(dp0, 3, Int), (dp1, 5, Int), (dp2, 4, Int)}, DCin = DCout = Ø, DCmid = {((Mean, op6), (Sqrt, ip7))}, DCidp = {(dp0, (Mean, ip3)), (dp1, (Mean, ip4)), (dp2, (Mean, ip5))}. Workflow We, on the other hand, does not have concrete input data products connected to its inputs. We model it using a 9-tuple with id = "We", IP = {(ip0, Int), (ip1, Int), (ip2, Int)}, OP = {(op9, Double)}, W* = {Mean, Sqrt}, DP = Ø, DCin = {(ip0, (Mean, ip3)), (ip1, (Mean, ip4)), (ip2, (Mean, ip5))}, DCout = {((Sqrt, op8), op9)}, DCmid = {((Mean, op6), (Sqrt, ip7))}, DCidp = Ø.

Definition 2.4 (Primitive workflow). A workflow W is primitive if and only if it has both input ports and output ports, and W has neither constituent components, nor data products, nor data channels. Formally, W is primitive iff
W.IP ≠ Ø ∧ W.OP ≠ Ø ∧ W.W* = W.DP = W.DCin = W.DCout = W.DCmid = W.DCidp = Ø.
We use isPrimitive(W) to denote the above predicate.

Intuitively, a primitive workflow is a black box with inputs and outputs that represents an atomic component such as a web service. Primitive workflows are used by other workflows as building blocks. Workflows such as Not, Increment, Decrement, Sqrt, Square, Mean, and Divide in Fig. 1 are examples of primitive workflows.

Definition 2.5 (Composite workflow). A workflow W is composite if and only if it contains at least one reusable component (i.e. W.W* ≠ Ø) connected to ports and/or data products. Formally, W is composite iff
(W.W* ≠ Ø ∧ W.IP ≠ Ø ∧ W.OP ≠ Ø ∧ W.DCin ≠ Ø ∧ W.DCout ≠ Ø) ∨ (W.W* ≠ Ø ∧ W.OP ≠ Ø ∧ W.DP ≠ Ø ∧ W.DCidp ≠ Ø)
We use isComposite(W) to denote the above predicate. All workflows Wa, Wb, …, Wg in Fig. 1 are composite.

Intuitively, reusable workflows are primitive or composite tasks that can be reused as building blocks of more complex workflows. They are not executable, as at least some of their input ports are not bound. Workflows Wb, We, and Wg in Fig. 1 are reusable. Workflow Wb is reused inside Wc. Executable workflows, on the other hand, have all the input data needed to perform the computation. Workflows Wa, Wc, Wd, and Wf in Fig. 1 are executable. Each executable workflow must contain at least one component and one data product connected to it. Thus, every executable workflow is composite. The opposite is not true, as a composite workflow may be reusable (e.g., Wb), i.e. have input port(s) instead of concrete data product(s).

III. WORKFLOW EXPRESSIONS

We rely on simply typed lambda calculus [2] enriched with a set of primitive types as a formal framework to reason about the behavior of workflows. For example, the expression "λx:Int. Increment x" is a function, or abstraction, that takes one argument of type Int and returns its value increased by 1. Here x is the abstraction name and "Increment x" is the expression of this abstraction. The expression "Increment 3" is an application, which evaluates to 4.

Definition 3.1 (Workflow expression). Given a workflow W, its expression expr is a lambda expression that represents the computation performed by W. If W is reusable, expr is an abstraction, and if W is executable, expr is an application.

In this paper, we present our translateWorkflow function, outlined in Algorithm 1, that, given a workflow W, translates it into an equivalent lambda expression which performs the same computations and produces the same result as W. For simplicity we assume that workflow diagrams are displayed horizontally with data flowing from left to right (see Fig. 1). Given a workflow W, our translateWorkflow algorithm translates all components in W into lambda functions and builds an expression whose structure corresponds to the composition of components in W. Each connection between two components becomes an application in the equivalent lambda expression.

We accommodate composite workflows (containing sub-workflows) nested inside each other to an arbitrary degree by making recursive calls to the translateWorkflow function, which translates all sub-workflows at each level of nesting (depth-wise translation). We also accommodate arbitrary workflow compositions within the same level of nesting (flat compositions) by recursively calling our getInputExpression function, outlined in Algorithm 2, that iterates over and translates all the connected components by backtracking along the data channels from right to left (breadth-wise translation). Thus, our two algorithms together cover the full range of possible workflow compositions. We now provide a walk-through example of how workflow Wd is translated into an equivalent lambda expression.

Example 3.2 (Translating workflow Wd into an equivalent lambda expression). Consider workflow Wd in Fig. 1. When the function translateWorkflow(Wd) is called, it first checks whether Wd is primitive, and because it is not, the else clause is executed (lines 7-35). translateWorkflow first determines that the component producing the final result of the entire workflow Wd is Sqrt and stores it in the componentProducingFinalRes variable (line 16). Next, because Sqrt has a single input, the for loop in lines 21-23 executes once, calling the function getInputExpression(Wd, Sqrt, ip7), whose output "(Mean dp0 dp1 dp2)" is stored into the string listOfArguments. Next, translateWorkflow checks whether Wd is reusable (line 24), and because it is not, it returns the application of the workflow expression for the Sqrt component to the list of arguments obtained earlier (line 34). Since Sqrt is a primitive workflow, translateWorkflow(Sqrt) returns its name "Sqrt". Thus, the final result of the translation is "Sqrt (Mean dp0 dp1 dp2)".

Example 3.3 (Lambda expressions for workflows Wa, Wb, …, Wg). We provide the lambda expressions obtained by calling our translateWorkflow algorithm on each workflow in Fig. 1:
Wa : Increment (Not dp0)
Wb : λx0:Bool. Increment (Not x0)
Wc : Wb dp0 = (λx0:Bool. Increment (Not x0)) dp0
Wd : Sqrt (Mean dp0 dp1 dp2)
We : λx0:Int. λx1:Int. λx2:Int. (Sqrt (Mean x0 x1 x2))
Wf : Divide (Increment (Square dp0)) (Decrement (Square dp0))
Wg : λx0:Int. Divide (Increment (Square x0)) (Decrement (Square x0))
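The translation walked through in Example 3.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the Workflow class is a simplified stand-in for the 9-tuple, the port types given to Mean and Sqrt are guesses, var_name assumes that port "ip0" becomes variable "x0" as in Example 3.3, arguments are joined with single spaces, and errors are raised as exceptions.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    id: str
    IP: list = field(default_factory=list)      # [(port_id, type)], ordered
    OP: list = field(default_factory=list)      # [(port_id, type)]
    W_star: list = field(default_factory=list)  # constituent Workflow objects
    DP: list = field(default_factory=list)      # [(dp_id, value, type)]
    DC_in: list = field(default_factory=list)   # [(ip_id, (wf_id, ip_id))]
    DC_out: list = field(default_factory=list)  # [((wf_id, op_id), op_id)]
    DC_mid: list = field(default_factory=list)  # [((wf_id, op_id), (wf_id, ip_id))]
    DC_idp: list = field(default_factory=list)  # [(dp_id, (wf_id, ip_id))]


def is_primitive(w):
    # Definition 2.4: ports on both sides, nothing inside.
    return bool(w.IP) and bool(w.OP) and not (
        w.W_star or w.DP or w.DC_in or w.DC_out or w.DC_mid or w.DC_idp)


def var_name(ip_id):
    # Assumption: port "ip0" is rendered as variable "x0".
    return "x" + (ip_id[2:] if ip_id.startswith("ip") else ip_id)


def get_input_expression(w, c, ip_id):
    """Expression feeding input port ip_id of component c inside w."""
    for dp_id, (wf_id, port) in w.DC_idp:        # fed by a data product
        if wf_id == c.id and port == ip_id:
            return dp_id
    for src_ip, (wf_id, port) in w.DC_in:        # fed by an input port of w
        if wf_id == c.id and port == ip_id:
            return var_name(src_ip)
    for (src_id, _op), (wf_id, port) in w.DC_mid:  # fed by another component
        if wf_id == c.id and port == ip_id:
            src = next(x for x in w.W_star if x.id == src_id)
            args = [get_input_expression(w, src, p) for p, _t in src.IP]
            return "(" + " ".join([translate_workflow(src)] + args) + ")"
    raise ValueError("cannot obtain input expression")


def translate_workflow(w):
    if is_primitive(w):
        return w.id
    # The final component is the one whose outputs feed no other component.
    fed_onward = {op for (_wf, op), _dst in w.DC_mid}
    finals = [c for c in w.W_star
              if not ({op for op, _t in c.OP} <= fed_onward)]
    final = finals[-1]
    args = [get_input_expression(w, final, p) for p, _t in final.IP]
    body = " ".join([translate_workflow(final)] + args)
    if w.DC_in:  # reusable workflow: wrap the body in lambda abstractions
        names = "".join(f"λ{var_name(p)}:{t}. " for p, t in w.IP)
        return "(" + names + body + ")"
    return body  # executable workflow: a plain application


ints = [("ip3", "Int"), ("ip4", "Int"), ("ip5", "Int")]
mean = Workflow("Mean", IP=ints, OP=[("op6", "Double")])          # types assumed
sqrt = Workflow("Sqrt", IP=[("ip7", "Double")], OP=[("op8", "Double")])
mid = [(("Mean", "op6"), ("Sqrt", "ip7"))]

wd = Workflow("Wd", OP=[("op9", "Float")], W_star=[mean, sqrt],
              DP=[("dp0", 3, "Int"), ("dp1", 5, "Int"), ("dp2", 4, "Int")],
              DC_mid=mid,
              DC_idp=[("dp0", ("Mean", "ip3")), ("dp1", ("Mean", "ip4")),
                      ("dp2", ("Mean", "ip5"))])
we = Workflow("We", IP=[("ip0", "Int"), ("ip1", "Int"), ("ip2", "Int")],
              OP=[("op9", "Double")], W_star=[mean, sqrt],
              DC_in=[("ip0", ("Mean", "ip3")), ("ip1", ("Mean", "ip4")),
                     ("ip2", ("Mean", "ip5"))],
              DC_out=[(("Sqrt", "op8"), "op9")], DC_mid=mid)

print(translate_workflow(wd))  # Sqrt (Mean dp0 dp1 dp2)
print(translate_workflow(we))  # (λx0:Int. λx1:Int. λx2:Int. Sqrt (Mean x0 x1 x2))
```

The result for We is parenthesized slightly differently from the form printed in Example 3.3, but it denotes the same abstraction.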

Algorithm 1. Translating workflows to lambda expressions
1: function translateWorkflow
2: input: workflow W
3: output: lambda expression for W
4: if isPrimitive(W)
5:    /* If W is primitive, return its id */
6:    then return W.id
7: else
8:    /* Otherwise, W is composite (reusable or executable), translate it recursively into lambda expression: */
9:    /* First, find component in W.W* that performs the very last computational step (componentProducingFinalRes): */
10:    let OutputPortsOfDCmid be an empty set
11:    for each ((wj, opj), (wk, ipl)) ∈ W.DCmid do
12:       add opj to OutputPortsOfDCmid
13:    end for
14:    for each W' ∈ W.W* do
15:       if W'.OP ⊄ OutputPortsOfDCmid
16:          then componentProducingFinalRes = W'
17:       end if
18:    end for
19:    /* Build the list of expressions that serve as arguments for componentProducingFinalRes: */
20:    listOfArguments = ""
21:    for each (idi, typei) ∈ componentProducingFinalRes.IP do
22:       listOfArguments += getInputExpression(W, componentProducingFinalRes, idi) + " "
23:    end for
24:    if W is reusable // |W.DCin| > 0
25:       /* translate it into lambda abstraction: */
26:       then
27:          listOfNames = ""
28:          for each (idi, typei) ∈ W.IP do
29:             listOfNames += "λx" + idi + ":" + typei + ". "
30:          end for
31:          return "(" + listOfNames + translateWorkflow(componentProducingFinalRes) + " " + listOfArguments + ")"
32:    else
33:       /* W is executable, thus translate it into a lambda application: */
34:       return translateWorkflow(componentProducingFinalRes) + " " + listOfArguments;
35:    end if
36: end if
37: end function

Algorithm 2. Algorithm for obtaining lambda expressions representing inputs at certain workflow ports
function getInputExpression
input: workflow W, constituent component c, input port ipm
output: lambda expression that serves as input argument of port ipm
/* first, if there is a data product dpi in W.DP connected to port ipm, return dpi.id of that data product: */
for each (dpi, (wj, ipk)) ∈ W.DCidp do
   if wj.id = c.id and ipk.id = ipm.id
      then return dpi.id
   end if
end for
/* if there is an input port ipj in W.IP connected to port ipm, return variable named "x" + ipj.id */
for each (ipj, (wk, ipl)) ∈ W.DCin do
   if wk.id = c.id and ipl.id = ipm.id
      then return "x" + ipj.id
   end if
end for
/* if there is another constituent workflow wi whose output is connected to ipm, construct the list of input arguments (expressions) of wi and return application of these arguments to wi: */
listOfArguments = ""
for each ((wi, opj), (wk, ipl)) ∈ W.DCmid do
   if wk.id = c.id and ipl.id = ipm.id then
      for ipq ∈ wi.IP do
         listOfArguments += getInputExpression(W, wi, ipq) + " "
      end for
      return "(" + translateWorkflow(wi) + " " + listOfArguments + ")"
   end if
end for
return "error - cannot obtain input expression"
end function

Note that executable workflows (Wa, Wc, Wd, Wf) are translated into lambda applications, whereas reusable ones (Wb, We, Wg) are translated into lambda abstractions. Ports are translated into variables, e.g. port ip0 appears as x0 in the corresponding expression. We require that the workflow expression is flat, i.e. constituent components' ids are replaced with their translations (see the expression for Wc above). Thus, a workflow expression only contains port variables, names of primitive workflows, and data products.

IV. TYPE SYSTEM FOR SCIENTIFIC WORKFLOWS

For interoperability, we adopt the type system defined in the XML Schema language specification [1]. While our approach can accommodate all types defined in [1], due to the space limit, in this paper we focus on the following types, which are most relevant to the scientific workflow domain.

T ::= String | Decimal | Integer | NonPositiveInteger | NegativeInteger | NonNegativeInteger | UnsignedLong | UnsignedInt | UnsignedShort | UnsignedByte | Double | PositiveInteger | Float | Long | Int | Short | Byte | Bool | T→T

The type constructor → is right-associative, i.e. the expression T1→T2→T3 is equivalent to T1→(T2→T3). This type constructor is useful in defining types of reusable workflows. For example, the workflow Wb has type Bool→Int, because it expects a boolean value as input and produces an integer value as output. Workflow We has the type Int→Int→Int→Double. The type of an executable workflow is simply the type of its output, e.g., the type of Wa is Int.

We now introduce the notion of subtyping, which is based on the fact that some types are more descriptive than others. For example, any value described by type Int can also be described by Decimal. That is, the set of values associated with the type Int is a subset of the values associated with the type Decimal, or, in other words, Decimal is a more descriptive type than Int. Therefore, it is safe to pass an integer argument to a workflow expecting a decimal number as input. Such a view of subtyping, based on the subset semantics, is also called the principle of safe substitution. Workflows Wa, Wb, and Wc in Fig. 1 are composed by this principle.

We formalize the subtype relation as a set of inference rules used to derive statements of the form Ti <: Tj, pronounced "Ti is a subtype of Tj", or "Tj is a supertype of

Fig. 2. Subtyping inference rules.

Ti", or "Tj subsumes Ti", where Ti and Tj are two types. As shown in Fig. 2, the first two rules (S-REFL and S-TRANS) state that the subtype relation is reflexive and transitive. They are then followed by an incomplete set of rules for primitive data types (collectively labeled S-Prim) derived from the hierarchy presented in [1]. As the Bool type is less descriptive than Int (true and false can be mapped to 1 and 0, a subset of Int), we consider Bool to be a subtype of Int. We also include the rule for records (S-R), which is a structured type. A record is a labeled n-tuple that has a type, e.g. r1 = {a:1, b:true, c:2.25} is a 3-field record, whose type is t1 = {a:Int, b:Bool, c:Float}. We denote an n-field record and its type as {li=vi}:{li:Ti}, i ∈ 1…n, where li, vi and Ti denote labels, values and types respectively. Given two record types T1 and T2, T1 is a subtype of T2 if T1's fields form a superset of T2's fields. For example, t1 is a subtype of {a:Int, b:Bool}. Indeed, it is safe to pass a record r1 where a record of type {a:Int, b:Bool} is expected, since r1 provides all the necessary information (and even some extra) needed by the workflow.

Definition 4.1 (Subtype relation). A subtype relation is a binary relation between types, Ti <: Tj, that satisfies all instances of the inference rules in Fig. 2.

Due to the small number of primitive types, the algorithm to check whether Ti <: Tj is straightforward. We assume the function subtype(Ti, Tj) that returns true iff Ti <: Tj.

V. TYPECHECKING FOR SCIENTIFIC WORKFLOWS

To determine whether a given workflow can execute successfully, we need to check whether connections between its components are consistent, i.e. each component receives input data in the format it expects. The expected format is constrained by a type declared in the component's specification. We formalize such consistency of connections through the notion of workflow well-typedness. We check whether a workflow is well-typed by attempting to find its type. Intuitively, we can derive the type of a workflow expression if we know the types of the primitive workflows and data products involved in it. For example, it is easy to see that the expression (Increment dp0) has the type Int, assuming Increment expects an integer argument and returns an integer (formally, Increment:Int→Int) and dp0 is Int. In other words, we can derive a workflow type given a set of assumptions. Typing derivation is done according to the following inference rules (see Fig. 3) for variables (T-VAR), abstractions (T-ABS), records (T-R), and applications (T-APP), as well as the rule for application with substitution (T-APPS) that provides a bridge between typing and subtyping rules. Our inference rules for typing and subtyping are based on those from the classical theory of type systems [2], although modified to suit the scientific workflow domain and to ensure determinism of the typechecking algorithm presented later in this section.

Fig. 3. Workflow typing rules.

Here, variable x represents a primitive object such as a primitive workflow, port or data product; t, targ and tf are lambda expressions; and T, T1, T2, Tin and Tout denote types. The set Γ = {x0:Tp0, x1:Tp1, …, xn:Tpn} is a typing context, i.e. a set of assumptions about primitive objects and their types. The first rule (T-VAR) states that variable x has the type assumed about it in Γ. The second rule (T-ABS) is used to derive types of expressions representing reusable workflows. It states that if the type of the expression with x plugged in is T2, then the type of the abstraction with the name x and expression t is T1→T2. The third rule (T-APP) is used to derive types of applications, which represent data channels in workflows. The next rule (T-APPS) is necessary to typecheck workflows with subtyping connections (shown dashed in Fig. 1). We call such compositions workflows with subtyping. The last rule is used to typecheck records. We show a concrete type derivation that uses the above rules in Example 5.3.

Definition 5.1 (Workflow context). Given a workflow W, a workflow context Z is a set of all data products and primitive workflows used inside W (at all levels of nesting) and their respective types.

Definition 5.2 (Well-typed workflow). A workflow W is well-typed, or typable, if and only if for some T, there exists a typing derivation that satisfies all the inference rules in Fig. 3, and whose conclusion is Z ⊢ W : T, where Z is a workflow context for W.

Example 5.3 (Typing derivation for workflow Wa). Consider the workflow Wa shown in Fig. 1. Its workflow expression is Increment (Not dp0). Wa's workflow context Z is the set {Increment:Int→Int, Not:Bool→Bool, dp0:Bool}. The typing derivation tree for this workflow is shown in Fig. 4. Each step is labeled with the corresponding inference rule. The derivation holds for Γ = Z. According to Definition 5.2, the existence of a typing derivation with the conclusion {Increment:Int→Int, Not:Bool→Bool, dp0:Bool} ⊢ Increment (Not dp0) : Int proves that Wa is well-typed.

We now introduce the generation lemma that we use to design our typechecking function. The generation lemma captures three observations about how to typecheck a given expression. Each entry is read as "if a workflow expression has the type T, then its subexpressions must have types of these forms". Each observation inverts the corresponding rule in Fig. 3 by stating it "from bottom to top". Note that for T-ABS we add the variable-type pair for name x, which is given explicitly in the abstraction.
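The subtype(Ti, Tj) function assumed in Section IV could be sketched as below. This is an illustration under stated assumptions, not the rule set of Fig. 2: the PARENT table encodes the built-in derivation chain of the XML Schema numeric types plus the Bool <: Int rule adopted in this paper, record types are modeled as Python dicts, field types are also compared with subtype (an assumption beyond the width rule stated in the text), and arrow types are not handled.

```python
# Direct supertype of each primitive type (assumed table, see lead-in).
PARENT = {
    "Integer": "Decimal",
    "NonPositiveInteger": "Integer",
    "NegativeInteger": "NonPositiveInteger",
    "Long": "Integer",
    "Int": "Long",
    "Short": "Int",
    "Byte": "Short",
    "NonNegativeInteger": "Integer",
    "UnsignedLong": "NonNegativeInteger",
    "UnsignedInt": "UnsignedLong",
    "UnsignedShort": "UnsignedInt",
    "UnsignedByte": "UnsignedShort",
    "PositiveInteger": "NonNegativeInteger",
    "Bool": "Int",  # extra rule adopted in this paper
}


def subtype(t1, t2):
    """Return True iff t1 <: t2 under the rules sketched above."""
    # Records: t1 must provide every field that t2 asks for.
    if isinstance(t1, dict) and isinstance(t2, dict):
        return all(label in t1 and subtype(t1[label], t2[label])
                   for label in t2)
    if isinstance(t1, dict) or isinstance(t2, dict):
        return False
    # Primitives: walk up the supertype chain (reflexivity + transitivity).
    while t1 is not None:
        if t1 == t2:
            return True
        t1 = PARENT.get(t1)
    return False


t1 = {"a": "Int", "b": "Bool", "c": "Float"}
print(subtype("Int", "Decimal"))              # True
print(subtype("Bool", "Int"))                 # True
print(subtype("Decimal", "Int"))              # False
print(subtype(t1, {"a": "Int", "b": "Bool"})) # True
```

Because each primitive type has at most one direct supertype in this table, walking up the chain is enough to decide the reflexive and transitive closure.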

Fig. 4. Typing derivation for workflow Wa.

Lemma 5.4 (Generation lemma).
1. Γ ⊢ x:T ⇒ x:T ∈ Γ /* inverts T-VAR */
2. Γ ⊢ (λx:T1. t) : T ⇒ ∃T1 ∃T2 (T = T1 → T2 ∧ (Γ ∪ {x:T1} ⊢ t:T2)) /* inverts T-ABS */
3. Γ ⊢ tf targ : T ⇒ ∃Tin ∃Tout ((Γ ⊢ tf : Tin → T) ∧ ((Γ ⊢ targ:Tin) ∨ ∃T1 (Γ ⊢ targ:T1 ∧ T1 <: Tin))) /* inverts T-APP and T-APPS */

Proof: Part 1, by contradiction. Assume Γ ⊢ x:T and x:T ∉ Γ. Since Γ ⊢ x:T, there must be a typing derivation satisfying the inference rules in Fig. 3 with the conclusion Γ ⊢ x:T. Rules T-ABS, T-APP and T-APPS cannot be used to derive the type of x, since none of them deduces the type of a primitive object. The rule T-VAR is also not applicable since x:T ∈ Γ is false. Thus, there exists no derivation with the conclusion Γ ⊢ x:T, and hence Γ ⊢ x:T cannot be true, which is a contradiction. Parts 2 and 3 can be proved similarly by contradiction. □

In practice, to reason about workflow behavior we need a deterministic algorithm to derive the type of W. To this end, we now present the typecheckWorkflow function outlined in Algorithm 3. Given a workflow W, it derives W's type from the primitive objects inside W according to the typing rules in Fig. 3. This function is a transcription of the generation lemma (Lemma 5.4) that performs backward reasoning on the inference rules. Each recursive call of typecheckWorkflow is made according to the corresponding entry of the generation lemma. We assume that the methods Γ.getBinding(name) and Γ.addBinding(name, type) get the type of a given variable and add the variable-type pair to the context Γ respectively, and that abstraction.name, abstraction.nameType and abstraction.expression return the name, the type of the name variable, and the expression of the given abstraction respectively.

Algorithm 3. Typechecking of scientific workflows
function typecheckWorkflow
input: workflow expression expr, context Γ
output: type of W
if expr is primitive object /* i.e. variable representing port, primitive workflow or data product */ then
   return Γ.getBinding(expr)
else if expr is Abstraction then
   let Γ' = Γ
   Γ'.addBinding(expr.name, expr.nameType)
   typeOfExpr = typecheckWorkflow(expr.expression, Γ')
   return expr.nameType → typeOfExpr
else if expr is Application then
   typeOfF = typecheckWorkflow(expr.f, Γ)
   typeOfN = typecheckWorkflow(expr.n, Γ)
   if typeOfF is of the form T0 → T1 → … → Tn, where n > 0, then
      if subtype(typeOfN, T0)
         then return T1 → … → Tn
      else
         return "error: parameter type mismatch"
      end if
   else
      return "arrow type expected"
   end if
end if
end function

VI. AUTOMATIC COERCION IN WORKFLOWS

Our approach not only makes it possible to determine workflow well-typedness, but also ensures the correct execution of well-typed workflows. Consider the workflow Wa shown in Fig. 1. Although the Bool type is a subtype of Int, data products of these two types may have entirely different physical representations in workflow management systems. In particular, the workflow engine may use two different classes BoolDP and IntDP to wrap data products of types Bool and Int. If neither of the two classes is a subclass of the other, casting BoolDP to IntDP is impossible, and hence using BoolDP in place of IntDP will result in a runtime error during workflow execution.

Thus, to ensure successful evaluation, we adopt the so-called coercion semantics for workflows, in which we replace subtyping with runtime coercions that change the physical representation of data products to their target types. We express the coercion semantics for workflows as a function translateT that translates workflow expressions with subtyping into those without subtyping. In this paper, we use C :: T1 <: T2 to denote a subtyping derivation tree whose conclusion is T1 <: T2. Similarly, we use D :: Γ ⊢ t:T to denote a typing derivation whose conclusion is Γ ⊢ t:T.

Given a subtyping derivation C :: T1 <: T2, function translateS(C) returns a coercion (lambda expression) that converts data products of type T1 into data products of type T2. We denote function translateS(C) as [[C]] and define it in a case-by-case form:

[Case definitions of translateS are not recoverable from the source.]

The last case defines a function producing a record by extracting a subset of fields (1…n) from an input record. Given a typing derivation D :: Γ ⊢ t:T, function translateT(D) produces an expression similar to t but in which subtyping is replaced with coercions. We also denote translateT(D) as [[D]]. From the context, it will be clear which one is used. Similarly, we define translateT by cases:

[Case definitions of translateT are not recoverable from the source.]
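The typechecking procedure of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the expression classes (Var, Abs, App), the Arrow type class, and the small PARENT table used by subtype are hypothetical names introduced for this example, and errors are raised as exceptions rather than returned as strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arrow:      # function type arg -> res
    arg: object
    res: object


@dataclass(frozen=True)
class Var:        # port variable, primitive workflow, or data product
    name: str


@dataclass(frozen=True)
class Abs:        # λname:name_type. body
    name: str
    name_type: object
    body: object


@dataclass(frozen=True)
class App:        # f applied to n
    f: object
    n: object


# Assumed fragment of the primitive subtype hierarchy.
PARENT = {"Bool": "Int", "Int": "Long", "Long": "Integer", "Integer": "Decimal"}


def subtype(t1, t2):
    while t1 is not None:
        if t1 == t2:
            return True
        t1 = PARENT.get(t1) if isinstance(t1, str) else None
    return False


def typecheck(expr, ctx):
    """Derive the type of expr under typing context ctx (a dict name -> type)."""
    if isinstance(expr, Var):                      # inverts T-VAR
        return ctx[expr.name]
    if isinstance(expr, Abs):                      # inverts T-ABS
        inner = dict(ctx)
        inner[expr.name] = expr.name_type
        return Arrow(expr.name_type, typecheck(expr.body, inner))
    if isinstance(expr, App):                      # inverts T-APP / T-APPS
        type_f = typecheck(expr.f, ctx)
        type_n = typecheck(expr.n, ctx)
        if not isinstance(type_f, Arrow):
            raise TypeError("arrow type expected")
        if not subtype(type_n, type_f.arg):
            raise TypeError("parameter type mismatch")
        return type_f.res
    raise TypeError("unknown expression")


# Workflow context Z and expression of Wa from Example 5.3.
Z = {"Increment": Arrow("Int", "Int"), "Not": Arrow("Bool", "Bool"), "dp0": "Bool"}
wa = App(Var("Increment"), App(Var("Not"), Var("dp0")))
wb = Abs("x0", "Bool", App(Var("Increment"), App(Var("Not"), Var("x0"))))

print(typecheck(wa, Z))  # Int
print(typecheck(wb, Z))  # Arrow(arg='Bool', res='Int')
```

Copying the context before adding a binding keeps the caller's context unchanged, which mirrors the "let Γ' = Γ" step of Algorithm 3.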

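The idea behind the coercion semantics of this section can be illustrated with a simplified stand-in for translateS and translateT. The sketch below does not operate on derivation trees as the paper's functions do; instead it typechecks an expression and, wherever an argument type is a strict subtype of the expected parameter type, wraps the argument in a coercion. The tuple encoding of expressions, the helper names coerce and show, the PARENT table, and the naming pattern "<From>2<To>" for coercions other than Bool2Int are all assumptions made for this example.

```python
# Assumed fragment of the primitive subtype hierarchy.
PARENT = {"Bool": "Int", "Int": "Long", "Long": "Integer", "Integer": "Decimal"}


def subtype(t1, t2):
    while t1 is not None:
        if t1 == t2:
            return True
        t1 = PARENT.get(t1) if isinstance(t1, str) else None
    return False


def coerce(expr, ctx):
    """Return (type, expression with explicit coercions).

    Expressions: ("var", name), ("abs", name, type, body), ("app", f, n).
    Types: a string, or ("->", arg_type, result_type).
    """
    kind = expr[0]
    if kind == "var":
        return ctx[expr[1]], expr
    if kind == "abs":
        _, name, name_type, body = expr
        body_type, new_body = coerce(body, {**ctx, name: name_type})
        return ("->", name_type, body_type), ("abs", name, name_type, new_body)
    if kind == "app":
        f_type, new_f = coerce(expr[1], ctx)
        n_type, new_n = coerce(expr[2], ctx)
        if not (isinstance(f_type, tuple) and f_type[0] == "->"):
            raise TypeError("arrow type expected")
        _, t_in, t_out = f_type
        if not subtype(n_type, t_in):
            raise TypeError("parameter type mismatch")
        if n_type != t_in:
            # Subsumption was needed here: wrap the argument in a coercion.
            new_n = ("app", ("var", f"{n_type}2{t_in}"), new_n)
        return t_out, ("app", new_f, new_n)
    raise TypeError("unknown expression")


def show(expr):
    """Render an expression in the notation used for workflow expressions."""
    kind = expr[0]
    if kind == "var":
        return expr[1]
    if kind == "abs":
        return f"λ{expr[1]}:{expr[2]}. {show(expr[3])}"
    f, n = expr[1], expr[2]
    head = f"({show(f)})" if f[0] == "abs" else show(f)
    arg = f"({show(n)})" if n[0] != "var" else show(n)
    return f"{head} {arg}"


Z = {"Increment": ("->", "Int", "Int"), "Not": ("->", "Bool", "Bool"), "dp0": "Bool"}
wa = ("app", ("var", "Increment"), ("app", ("var", "Not"), ("var", "dp0")))

wa_type, wa_coerced = coerce(wa, Z)
print(wa_type)           # Int
print(show(wa_coerced))  # Increment (Bool2Int (Not dp0))
```

For Wa this places Bool2Int between Not and Increment, which is consistent with the shim placement the paper reports for this workflow.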
Note that in the case of the T-APPS rule, translateT calls translateS(C) to retrieve the appropriate coercion and inserts it into the application where subsumption (i.e., the T-APPS rule) is used. In the last case, translateT translates a typing derivation for records by recursively calling itself on the typing derivations of individual fields.

Example 6.1 (Automatic coercion injection). Consider the workflow Wa in Fig. 1. To inject coercions into it, we call function translateT on its typing derivation shown in Fig. 4. The function evaluates as follows. The resulting expression contains coercion Bool2Int that converts boolean data products to integer data products. Note that coercion Bool2Int (implemented as a primitive workflow) is inserted dynamically at runtime and is transparent to the user.

VII. IMPLEMENTATION AND CASE STUDY

We now present the new version of our VIEW system [19], in which we implement our proposed workflow model, Algorithms 1, 2, and 3, simply typed lambda calculus, and our translation functions translateS and translateT. We give a walk-through explanation of our automated shimming technique using workflow Wa in Fig. 1.

Our new version of VIEW is web-based, with no installation required. Scientists access VIEW through a browser and compose scientific workflows from web services, scripts, local applications, etc. A workflow structure is stored in a specification document written in our XML-based workflow language SWL. Fig. 5 displays the workflow Wa from earlier examples, and a screenshot of the VIEW system dialog window showing Wa’s SWL (top left part of the dialog). The composed workflow is executed by pressing the “Run” button in the browser. First, using Algorithms 1 and 2, our system translates the workflow into typed lambda calculus with subtyping (bottom left section of the dialog in Fig. 5), and typechecks it using Algorithm 3. If the workflow is well-typed, using the translateS and translateT functions, VIEW inserts coercions (primitive workflows performing type conversion) into the workflow expression by translating it into lambda calculus without subtyping. For example, coercion Bool2Int is inserted in the expression for the Wa workflow (bottom right part of the dialog). Finally, the latter expression is translated back into a runtime version of SWL, which contains the necessary shims and is supplied to the workflow engine for execution. Note that all these steps are fully automated and hidden from the user, who sees the results of workflow execution upon pressing the “Run” button.

[Figure: workflow Wa with data products dp0:Bool and dp5:Int, constant true, component Not (ports ip1:Bool, op2:Bool), and component Increment (ports ip3:Int, op4:Int), together with the VIEW dialog. Step 1: translate SWL into the corresponding expression. Step 2: replace subtyping with runtime coercions. Step 3: translate the workflow expression into SWL. Result: shim Bool2Int has been automatically inserted at runtime between the Not and Increment workflow components.]
Fig. 5. Automatically inserting shims in scientific workflow using the VIEW system.

RELATED WORK

The significance of the shimming problem has been widely recognized by the scientific workflow community [3-8]. Much work addressing the shimming problem has focused on transforming XML documents whose elements are associated with domain models (e.g., expressed using OWL) [10-12]. The common limitations of these approaches are: (1) they all focus on translating syntactically different XML documents, whereas other data types, including primitive or structured types (e.g., record, relational schema), are not supported; (2) they all require services to be semantically annotated and hence cannot compose arbitrary (not annotated) web services, let alone other kinds of executable components (e.g., scripts, local applications, or HPC jobs).

Sellami et al. [9] address the shimming problem by using semantic annotations of web services to find shims. Besides requiring composed web services to be semantically annotated, this approach also expects web service providers

to supply all the necessary shims that are also annotated. Ambite and Kapoor [13] present a planning approach to the shimming problem that focuses on relational data types and does not apply to primitive types or other non-relational structured types (e.g., record). Existing scientific workflow systems [14, 16, 17, 22] provide limited support for the shimming problem, i.e., shimming is explicit or requires additional workflow configuration.

None of the above approaches (1) guarantees an automated solution with no human involvement, (2) makes shims invisible in the workflow specification, (3) provides a solution for arbitrary workflows (even within some well-defined class), or (4) applies to both primitive and structured types. Our approach addresses all four issues.

To address these issues, in [4], we presented a primitive workflow model and a workflow specification language that allowed hiding shims inside task specifications. This paper improves our earlier work by proposing an approach that determines where a shim needs to be placed in the workflow, and inserts the appropriate coercion in the workflow expression. Specifically, we choose typed lambda calculus [2] to represent workflows, which is naturally suitable for dataflow modeling due to its functional characteristics [7]. While recognizing the importance of shims, [7] does not address the shimming problem. We formalize coercion in scientific workflows with type-theoretic rigor [2, 15]. Existing typechecking techniques apply in contexts other than scientific workflows, e.g., the Hindley-Milner algorithm [21] requires a typed prefix to typecheck expressions with polymorphic types (not used in our model) and therefore cannot be directly applied to typecheck workflow expressions. We present a concrete, fully algorithmic solution and demonstrate its application to the specific workflow type system with primitive and structured types.

To the best of our knowledge, this work is the first to reduce the shimming problem to the coercion problem and to propose a fully automated solution.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we first reduced the shimming problem to the runtime coercion problem in the theory of type systems. Secondly, we proposed a scientific workflow model and defined the notion of a well-typed workflow. Thirdly, we developed three algorithms to typecheck workflows by first translating them into equivalent lambda expressions. Fourthly, we designed two functions that together insert “invisible shims”, or runtime coercions, in workflows, thereby solving the shimming problem for any well-typed workflow. Finally, we implemented our automated shimming technique, including all the proposed algorithms, lambda calculus, type system, and translation functions, in our VIEW system and presented a case study to validate the proposed approach. In the future, we plan to extend our technique to mediate structured data types such as relational schema, and to develop real-world scientific workflows relying on our implicit shimming approach.

REFERENCES

[1] “W3C XML schema definition language (XSD) 1.1 Part 2: datatypes,” W3C Recommendation, http://www.w3.org/TR/xmlschema11-2/
[2] B. Pierce, Types and Programming Languages, MIT Press, 2002.
[3] R. Littauer, K. Ram, B. Ludäscher, W. Michener, R. Koskela, “Trends in use of scientific workflows: insights from a public repository and recommendations for best practice,” International Journal of Digital Curation, vol. 7, no. 2, pp. 92-100, 2012.
[4] C. Lin, S. Lu, X. Fei, D. Pai, J. Hua, “A task abstraction and mapping approach to the shimming problem in scientific workflows,” in Proc. of SCC, 2009, pp. 284-291.
[5] B. Ludäscher, S. Bowers, T. McPhillips, and N. Podhorszki, “Scientific Workflows: More e-science mileage from cyberinfrastructure,” in Proc. of e-Science, p. 145, 2006.
[6] U. Radetzki, U. Leser, S. C. Schulze-Rauschenbach, J. Zimmermann, J. Lüssem, T. Bode, and A. B. Cremers, “Adapters, shims, and glue – service interoperability for in silico experiments,” Bioinformatics, vol. 22, no. 9, pp. 1137-1143, 2006.
[7] P. Kelly, P. Coddington, and A. Wendelborn, “Lambda calculus as a workflow model,” Concurrency and Computation: Practice and Experience, vol. 21, no. 16, pp. 1999-2017, 2009.
[8] D. Hull, R. Stevens, P. Lord, C. Wroe, and C. Goble, “Treating shimantic web syndrome with ontologies,” in Proc. of AKT-SWS04, 2004.
[9] M. Sellami, W. Gaaloul, B. Defude, “Data Mapping Web Services for Composite DaaS Mediation,” in Proc. of WETICE, 2012.
[10] M. Szomszor, T. Payne, L. Moreau, “Automated syntactic mediation for web service integration,” in Proc. of ICWS, 2006, pp. 127-136.
[11] M. Nagarajan, K. Verma, A. Sheth, and J. Miller, “Ontology driven data mediation in web services,” International Journal of Web Services Research, vol. 4, no. 4, pp. 104-126, 2007.
[12] S. Bowers, B. Ludäscher, “Ontology-driven framework for data transformation in scientific workflows,” in Proc. of DILS, 2004, pp. 11-16.
[13] J. Ambite, D. Kapoor, “Automatically composing data workflows with relational descriptions and shim services,” in Proc. of ISWC/ASWC, 2007, pp. 15-29.
[14] J. Sroka, J. Hidders, P. Missier, C. Goble, “Taverna 2 workflow model,” Journal of Computer and System Sciences, vol. 76, no. 6, pp. 490-508, 2010.
[15] V. Tannen, T. Coquand, C. Gunter, A. Scedrov, “Inheritance as implicit coercion,” Information and Computation, vol. 91, no. 1, pp. 172-221, 1991.
[16] L. Dou, D. Zinn, T. McPhillips, S. Köhler, S. Riddle, S. Bowers, B. Ludäscher, “Scientific workflow design 2.0: demonstrating streaming data collections in Kepler,” in Proc. of ICDE, 2011, pp. 1296-1299.
[17] J. Freire, C. Silva, “Making Computations and Publications Reproducible with VisTrails,” Computing in Science and Engineering, vol. 14, no. 4, pp. 18-25, 2012.
[18] C. Hérault, G. Thomas, and P. Lalanda, “A distributed service-oriented mediation tool,” in Proc. of SCC, 2007, pp. 403-409.
[19] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua, “A Reference Architecture for Scientific Workflow Management Systems and the VIEW SOA Solution,” IEEE Transactions on Services Computing, vol. 2, no. 1, pp. 79-92, 2009.
[20] I. H. C. Wassink, P. v. d. Vet, K. Wolstencroft, P. Neerincx, M. Roos, H. Rauwerda, and T. Breit, “Analysing Scientific Workflows: Why Workflows Not Only Connect Web Services,” in Proc. of SERVICES I, 2009, pp. 314-321.
[21] R. Milner, “A Theory of type polymorphism in programming,” Journal of Computer and System Sciences, vol. 17, no. 3, pp. 348-375, 1978.
[22] D. Zinn, S. Bowers, T. McPhillips, B. Ludäscher, “Scientific workflow design with data assembly lines,” in Proc. of SC-WORKS, 2009.
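As an illustration of the coercion injection described in Example 6.1, the following is a minimal Python sketch and not the VIEW implementation. It assumes a toy term language with constants and unary applications only, and it folds typechecking and coercion insertion into one function, whereas the paper's translateT works on a typing derivation. The names inject, show, Const, App, COERCIONS, and PRIMITIVES are hypothetical. Only the components Not and Increment, the coercion Bool2Int, and the port types come from the text.

```python
from dataclasses import dataclass

# Assumed subtype table: (subtype, supertype) -> name of the coercion.
# Only Bool <: Int with coercion Bool2Int is taken from the text.
COERCIONS = {("Bool", "Int"): "Bool2Int"}

# Assumed primitive workflows: name -> (input type, output type).
PRIMITIVES = {
    "Not": ("Bool", "Bool"),
    "Increment": ("Int", "Int"),
    "Bool2Int": ("Bool", "Int"),
}


@dataclass(frozen=True)
class Const:
    value: object
    type: str


@dataclass(frozen=True)
class App:
    func: str
    arg: object


def inject(term):
    """Return (term with coercions inserted, its type).

    Raises TypeError when the term is not well-typed.
    """
    if isinstance(term, Const):
        return term, term.type
    param, result = PRIMITIVES[term.func]
    arg, arg_type = inject(term.arg)
    if arg_type != param:
        # Subsumption would be needed here, so wrap the argument
        # in the coercion that witnesses arg_type <: param.
        if (arg_type, param) not in COERCIONS:
            raise TypeError(f"{arg_type} is not a subtype of {param}")
        arg = App(COERCIONS[(arg_type, param)], arg)
    return App(term.func, arg), result


def show(term):
    """Render a term as a nested application string."""
    if isinstance(term, Const):
        return str(term.value).lower()
    return f"{term.func}({show(term.arg)})"


# Workflow Wa: the output of Not (Bool) feeds Increment (Int).
wa = App("Increment", App("Not", Const(True, "Bool")))
coerced, wa_type = inject(wa)
print(show(coerced), ":", wa_type)
```

Under these assumptions, the sketch places Bool2Int between Not and Increment, which mirrors the result shown in Fig. 5. A term that needs a coercion missing from the table, such as Not applied to an Int constant, is rejected as not well-typed.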