Compiling First-Order Functions to Session-Typed Parallel Code

David Castro-Perez (Imperial College London, London, UK, [email protected])
Nobuko Yoshida (Imperial College London, London, UK, [email protected])

Abstract
Building correct and efficient message-passing parallel programs still poses many challenges. The incorrect use of message-passing constructs can introduce deadlocks, and a bad task decomposition will not achieve good speedups. Current approaches focus either on correctness or on efficiency, but limited work has been done on ensuring both. In this paper, we propose a new parallel programming framework, PAlg, which is a first-order language with participant annotations that ensures deadlock-freedom by construction. PAlg programs are coupled with an abstraction of their communication structure, a global type from the theory of multiparty session types (MPST). This global type serves as an output for the programmer to assess the efficiency of their achieved parallelisation. PAlg is implemented as an EDSL in Haskell, from which we: 1. compile to low-level message-passing code; 2. compile to sequential C code, or interpret as sequential Haskell functions; and 3. infer the communication protocol followed by the compiled message-passing program. We use the properties of global types to perform message reordering optimisations on the compiled C code. We prove the extensional equivalence of the compiled code, as well as protocol compliance. We achieve linear speedups on a shared-memory 12-core machine, and a speedup of 16 on a 2-node, 24-core NUMA machine.

Keywords: multiparty session types, parallelism, arrows

CC '20, February 22–23, 2020, San Diego, CA, USA. ACM ISBN 978-1-4503-7120-9/20/02. https://doi.org/10.1145/3377555.3377889

1 Introduction
Structured parallel programming is a technique for parallel programming that requires the use of high-level parallel constructs, rather than low-level send/receive operations [52; 62]. A popular approach to structured parallelism is the use of algorithmic skeletons [20; 36], i.e. higher-order functions that implement common patterns of parallelism. Programming in terms of high-level constructs rather than low-level send/receive operations is a successful way to avoid common concurrency bugs by construction [38]. One limitation of structured parallelism is that it restricts programmers to a set of fixed, predefined parallel constructs. This is problematic if a function does not match one of the available parallel constructs, or if a program needs to be ported to an architecture where some of the skeletons have not been implemented. Unlike previous structured parallelism approaches, we do not require the existence of an underlying library or implementation of common patterns of parallelism.

In this paper, we propose a structured parallel programming framework whose front-end language is a first-order language based on the algebra of programming [2; 3]. The algebra of programming is a mathematical framework that codifies the basic laws of algorithmics, and it has been successfully applied to, e.g., program calculation techniques [4], datatype-generic programming [35], and parallel computing [66]. Our framework produces message-passing parallel code from program specifications written in the front-end language. The programmer controls how the program is parallelised by annotating the code with participant identifiers. To make sure that the achieved parallelisation is satisfactory, we produce as an output a formal description of the communication protocol achieved by a particular parallelisation. This formal description is a global type, introduced by Honda et al. [42] in the theory of Multiparty Session Types (MPST). We prove that the parallelisation, and any optimisation performed on the low-level code, respects the inferred protocol. The properties of global types justify the message reordering done by our back-end. In particular, we permute send and receive operations whenever sending does not depend on the values received. This is called asynchronous optimisation [57], and it removes unnecessary synchronisation while remaining communication-safe.

1.1 Overview
[Figure 1. Overview — PAlg (§3) is the input; code generation produces Parallel Code (§5) and protocol inference produces an MPST global type (§4); typability relates Parallel Code and MPST, and the global type drives the optimise step.]

Our framework has three layers: (1) Parallel Algebraic Language (PAlg), a point-free first-order language with participant annotations, which describe which process is in charge of executing which part of the computation; (2) Message Passing Monad (Mp), a monadic language that represents low-level message-passing parallel code, from which we generate parallel C code; and (3) global types (from MPST), a formal description of the protocol followed by the output Mp code. Fig. 1 shows how these layers interact. PAlg, highlighted in green, is the input to our framework; Mp and global types (MPST), highlighted in yellow, are the outputs. We prove that the generated code behaves as prescribed by the global type, and any low-level optimisation performed on the generated code must respect the protocol. As an example, we show below a parallel mergesort.

  msort :: (CVal a, CAlg f) => Int -> f [a] [a]
  msort n = fix n $ \ms x -> vlet (vsize x) $ \sz ->
    if sz <= 1 then x
    else vlet (sz / 2) $ \sz2 ->
         vlet (par ms $ vtake sz2 x) $ \xl ->
         vlet (par ms $ vdrop sz2 x) $ \xr ->
         app merge $ pair (sz, pair (xl, xr))

The return type of msort, f [a] [a], is the type of first-order programs that take lists of values [a], and return [a]. Constraint CAlg restricts the kind of operations that are allowed in the function definition. The integer parameter to function fix is used for rewriting the input programs, limiting the depth of recursion unrolling. par is used to annotate the functions that we want to run at different processes, and function app is used to run functions at the same participant as their inputs. In case this input comes from different participants, all values are first gathered at any of them, and then the function is applied. We can instantiate f either as a sequential program, as a parallel program, or as an MPST protocol. We prove that the sequential program and the output parallel programs are extensionally equal, and that the output parallel program complies with the inferred protocol. For example, interpreting msort 1 as a parallel program produces C code that is extensionally equal to its sequential interpretation, and behaves as the following protocol:

[Diagram: p1 performs a choice (⊕) and interacts with p2 and p3, which reply to p1.]

This is a depth-1 divide-and-conquer, where p1 divides the task, sends the sub-tasks to p2 and p3, and combines the results. If the input is small, p1 produces the result directly. Our prototype implementation is a tagless-final encoding [9] in Haskell of a point-free language. Constraint CAlg is a first-order form of arrows [45; 61], with a syntactic sugar layer that allows us to write code closer to (point-wise) idiomatic Haskell. The remainder of the paper focuses on the language underlying CAlg.

Why Multiparty Session Types There are both practical and theoretical advantages. On the theoretical side, the theory of multiparty session types ensures deadlock-freedom and protocol compliance. The MPST theory guarantees that the code that we generate complies with the inferred protocol (Theorem 5.2), which greatly simplifies the proof of extensional equivalence (Theorem 5.3), by allowing us to focus on representative traces, instead of all possible interleavings of actions. On the practical side, we perform message reordering optimisations based on the global types [57]. Moreover, an explicit representation of the communication protocol is a valuable output for programmers, since it can be used to assess a parallelisation (Fig. 4).

1.2 Outline and Contributions
§2 defines the Algebraic Functional Language (Alg), a language inspired by the algebra of programming, that we use as a basis for our work; §3 proposes the Parallel Algebraic Language (PAlg), our front-end language, as an extension of Alg with participant annotations; §4 introduces a protocol inference relation that associates PAlg expressions with MPST protocols, specified as global types. We prove that the inferred protocols are deadlock-free, i.e. every send has a matching receive. Moreover, we use the global types to justify message reordering optimisations, while preserving communication safety; §5 develops a translation scheme which generates message-passing code from PAlg, which we prove to preserve the extensionality of the input programs; §6 demonstrates our approach using a number of examples. We will provide as an artifact our working prototype implementation, and the examples that we used in §6, with instructions on how to replicate our experiments.

2 Algebraic Functional Language
This section describes the Algebraic Functional Language (Alg) and its combinators. In functional programming languages, it is common to provide these combinators as abstractions defined in a base language. For example, one such combinator is the split function (△), also known as fanout, or (&&&), in the arrow literature [45] and the Control.Arrow Haskell package [61]. Programming in terms of these combinators, avoiding explicit mention of variables, is known as point-free programming. Another approach is to translate code written in a pointed style, i.e. with explicit use of variables, to a point-free style [23; 44]. This translation can be fully automated [23; 29]. In our approach, we define common point-free combinators as syntactic constructs of Alg, and require programs to be implemented in this style. Our implementation provides a layer of syntactic sugar for programmers to refer to variables explicitly, as shown in msort in §1, but it builds internally a point-free representation.

2.1 Syntax
  F1, F2 ::= I | Ka | F1 + F2 | F1 × F2
  a, b   ::= 1 | int | ... | a → b | a + b | a × b | F a | µF
  e1, e2 ::= f | v | const e | id | e1 ◦ e2 | πi | e1 △ e2 | ιi | e1 ▽ e2
           | F e | inF | outF | recF e1 e2
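To make the point-free style concrete before studying this syntax in detail, the following plain-Haskell sketch (our own; not part of the paper's EDSL) renders split (△) and case (▽) at the function arrow, using (&&&) from Control.Arrow as mentioned above:

  import Control.Arrow ((&&&), (>>>))

  -- Alg's split (△): apply two functions to the same input.
  split :: (a -> b) -> (a -> c) -> a -> (b, c)
  split f g = f &&& g                       -- (f △ g) x = (f x, g x)

  -- Alg's case (▽), with coproducts encoded as Either.
  caseOf :: (a -> c) -> (b -> c) -> Either a b -> c
  caseOf = either                           -- (f1 ▽ f2) (ιi x) = fi x

  -- A small point-free program in this style: the mean of a list.
  mean :: [Double] -> Double
  mean = (sum &&& (fromIntegral . length)) >>> uncurry (/)

Restricting programs to first-order compositions of such combinators is what makes the dataflow, and hence the communication structure, explicit.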

In our syntax, f1, f2, ... capture atomic functions, which are functions of which we only know their types; v1, v2, ... are values of primitive types (e.g. integer and boolean); e1, e2, ... represent expressions; F1, F2, ... are functors; and a, b, ... are types. The syntax and semantics are standard [34; 53].

Constant and identity functions, and function composition, are const, id and ◦ respectively. Products are represented using the standard pair notation: if x : a and y : b, then (x, y) : a × b. The functions on product types are πi and △; they represent, respectively, the projections and the split operation: (f △ g)(x) = (f x, g x). Coproducts have two constructors, the injections ιi, which build values of type a + b. The ▽ combinator is the case operation: (f1 ▽ f2)(ιi x) = fi x. Products and coproducts can be generalised to multiple arguments: a × b × c is isomorphic to a × (b × c), and to (a × b) × c. We use Π_{i∈[1,n]} ai as notation for the product of more than two types; similarly, we use Σ for coproducts. This notation binds tighter than any other construct. Whenever ∀i, j ∈ I, ai = aj = a, we use the notation Π_n a as a synonym for Π_{i∈[1,n]} ai.

Functors are objects that take types to types, and functions to functions, such that identities and compositions are preserved. In this work, we focus on polynomial functors [31], which are defined inductively: I is the identity functor, and takes a type a to itself; Kb is the constant functor, and takes any type to b; F1 × F2 is the product functor, and takes a type a to F1 a × F2 a; F1 + F2 is the coproduct functor, and takes a type to a coproduct type. A term F e behaves as mapping the term e over the I positions in F. For example, if F = Ka × I × I, then applying F e to (x, y, z) yields (x, e y, e z).

Recursion is captured by the combinators in, out, rec, and the type µF. We use standard isorecursive types [31; 47; 53], where µF is isomorphic to F µF, and the isomorphism is given by the combinators inF (roll) and outF (unroll). For any polynomial functor F, the type µF and strict functions inF and outF are guaranteed to exist. In our implementation, inF is just a constructor (like inji). Recursion is recF e1 e2, and it is known as a hylomorphism [53]. A hylomorphism captures a divide-and-conquer algorithm, with a structure described by F, where e1 is the conquer term and e2 the divide term. Using hylomorphisms requires us to work in a semantic interpretation with algebraic compactness, i.e. one in which the carriers of initial F-algebras and terminal F-coalgebras coincide (or are isomorphic). Hylomorphisms and exponentials ap : (a → b) × a → b allow the definition of a general fixpoint operator [54]. Working with hylomorphisms implies that our input programs may not terminate. We guarantee that, given a terminating input program, we will not produce a non-terminating parallelisation (Theorem 5.3).

Example 2.1 (MergeSort in Alg). Assume a type Ls of lists of elements of type a. The functor T = K(Ls) + I × I captures the recursive structure of ms : Ls → Ls. When splitting some l : Ls, we may find one of the two cases described by T: an empty or singleton list, Ls, or a list of size ≥ 2, which can be split in two halves Ls × Ls. Assume a function spl : Ls → T Ls, and a function mrg : Ls × Ls → Ls. We define ms = recT (id ▽ mrg) spl. By the definition of rec:

  ms = recT (id ▽ mrg) spl
     = (id ▽ mrg) ◦ T (recT (id ▽ mrg) spl) ◦ spl
     = (id ▽ mrg) ◦ (id + ms × ms) ◦ spl
     = (id ▽ (mrg ◦ (ms × ms))) ◦ spl

Function ms first applies spl. Then, if the list was empty or a singleton, it returns the input unmodified. Otherwise, ms applies recursively to the first and second halves. Finally, mrg merges the resulting pair of sorted lists.

3 Parallel Algebraic Language
In the previous section we introduced Alg, a point-free functional language. In this section, we extend this language with participant annotations. Annotations occur both at the type and expression levels: at the type level, annotations represent where the data of the respective type is; at the expression level, they represent by whom the computation is performed. This language extension is called PAlg.

The implicit dataflow of the Alg (or PAlg) constructs determines which interactions must take place to evaluate an annotated program. To illustrate this, we use the Cooley-Tukey Fast Fourier Transform algorithm [21]. The Cooley-Tukey algorithm is based on the observation that an FFT of size n, fftn, can be described as the combination of two FFTs of size n/2. We focus on its high-level structure:

  (add@p1 △ sub@p2) ◦ ((fft_{n/2}@p3 ◦ π1) △ ((exp ◦ fft_{n/2})@p4 ◦ π2))

Assume that the input is a pair of vectors that contain the deinterleaved input, i.e. elements at even positions on the left, and odd positions on the right. We first compute the FFTs of size n/2 of the even and odd elements, at p3 and p4 respectively. Then, the first half of the output is produced by adding the results pairwise (at p1), and the second half by subtracting them (at p2). In order to evaluate this expression, we need to know where the input data is. This is specified by the programmer as an annotated type, which we call an interface. Suppose that the interface specifies that the even elements are at p, and the odd elements at p′. The interface that represents this scenario is (vec × vec)@(p × p′), i.e. an annotated pair of vectors, with the first component at p, and the second component at p′. By keeping track of the locations of the data, we obtain the type (vec × vec)@(p1 × p2), which is the output (or codomain) interface of the PAlg expression. We also refer to the annotations (e.g. p1 × p2) as interfaces, whenever there is no ambiguity. We write fftn : (vec × vec)@(p × p′) → (vec × vec)@(p1 × p2) to represent the input and output interfaces of fftn.

Consider now e1@p1 ▽ e2@p2. The output interface of this expression is either p1 or p2, depending on whether the input is the result of applying ι1 or ι2. We represent such interfaces using unions: e1@p1 ▽ e2@p2 : (a + b)@p → c@(p1 ∪ p2). Since p contains a value of a sum type a + b, p is responsible for notifying both p1 and p2 which branch needs to be taken in the control flow.
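As a concrete rendering of Example 2.1 and the rec combinator of §2, the following executable Haskell sketch (our own; the paper's recF works for any polynomial functor, here specialised to T = K(Ls) + I × I) shows the hylomorphism and the mergesort built from it:

  -- T = K(Ls) + I × I, and its functor action "T e".
  data T a r = Done [a] | Split r r

  mapT :: (r -> s) -> T a r -> T a s
  mapT _ (Done xs)   = Done xs
  mapT f (Split l r) = Split (f l) (f r)

  -- Hylomorphism: recT conquer divide = conquer ∘ T (recT conquer divide) ∘ divide.
  recT :: (T a b -> b) -> ([a] -> T a [a]) -> [a] -> b
  recT conquer divide = conquer . mapT (recT conquer divide) . divide

  spl :: [a] -> T a [a]                 -- divide: empty/singleton, or two halves
  spl xs | length xs <= 1 = Done xs
         | otherwise      = uncurry Split (splitAt (length xs `div` 2) xs)

  conq :: Ord a => T a [a] -> [a]       -- conquer: id ▽ mrg
  conq (Done xs)   = xs
  conq (Split l r) = mrg (l, r)

  mrg :: Ord a => ([a], [a]) -> [a]     -- merge two sorted lists
  mrg ([], ys) = ys
  mrg (xs, []) = xs
  mrg (x:xs, y:ys)
    | x <= y    = x : mrg (xs, y:ys)
    | otherwise = y : mrg (x:xs, ys)

  ms :: Ord a => [a] -> [a]             -- ms = recT (id ▽ mrg) spl
  ms = recT conq spl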

Incorrectly notifying the necessary participants will produce incorrect parallelisations that might deadlock. For example, consider the expression e0@p0 ◦ (e1@p1 ▽ e2@p2). Assuming that the input is at p, p needs to notify p0; otherwise p0 will be stuck. To avoid such cases, and to compute the interfaces of an expression, we define a type system for PAlg.

3.1 Syntax of PAlg
  I ::= p | ιi I | I × I        R ::= I | R ∪pì R        P ::= R → R
  e ::= e@p | [p ⊕ pì] | id | e ◦ e | πi | e △ e | ιi | e ▽ e

The syntax of PAlg is that of Alg, extended with participant annotations. Note that certain Alg constructs can only occur under annotations (e@p), e.g. in, out and rec. This implies that recursive functions need to be annotated at a single participant. To parallelise recursive functions, they first need to be rewritten into a suitable form, and the resulting expression then annotated. At the moment, we support automatic recursion unrolling up to a user-specified depth. We provide an overview of the main syntactic constructs of PAlg: annotations, interfaces, and annotated functions.

Annotations are ranged over by R, R′, ... We define them in two layers: I, or simple annotations, which cannot contain choices (∪); and R. This way, we ensure that choices only occur at the topmost level. Simple annotations are: participant ids p, which identify processes; products of interfaces I1 × I2; and tagged interfaces ιi I, which keep track of the branch of the choice that led to I. A choice R1 ∪pì R2 describes a scenario that is the result of a branch in the control flow, where a value can be found at either R1 or R2. Here, pì = p1 ··· pn are the participants whose behaviour depends on the path taken in the control flow. Finally, arrows P of the form R1 → R2 represent the input/output annotations of a parallel program.

Interfaces are annotated types. They range over A, B, ..., and are of the form a@R, which means that values of type a are distributed across R. We require annotated types to be well-formed, WF(a@R), which implies that the structure of a matches that of R. We write I to represent one-hole contexts for interfaces, with I[p] representing the interface that results from placing p at the hole in I.

Annotated functions are ranged over by e, e′. Annotations are introduced using e@p, where e is an unannotated Alg expression, and p is a single participant identifier. These annotations need to be set by the programmer, but their introduction can also be automated. Additionally, we introduce the choice point annotations [p ⊕ pì]. This annotation specifies that p performs a choice, and notifies pì. Choice points can be introduced fully automatically by collecting all participants whose behaviour depends on the value of a sum type.

3.2 Interfaces
An interface represents a state in a concurrent system: the set of participants, and the types of the values that they contain. We use mappings from participants to values to represent such states: V ::= [p ↦ v]_{p∈P}. The programmer, in addition to writing an Alg (PAlg) expression, needs to provide an input interface, i.e. where the input to the parallel program is. Consider, for example, the interface int@pi. Given a concurrent system with participants p0 ··· pn, we know that pi contains a value of type int: [··· pi ↦ 42 ···]. An interface with a product of participants, (a × b)@(p1 × p2), represents a state in which p1 contains an element of type a, and p2 an element of type b; e.g. a possible state represented by (int × vec)@(p1 × p2) is [··· p1 ↦ 42 ··· p2 ↦ [1, 1, 2, ...] ···]. An interface ιi I represents the same state as interface I, but we statically know that this state was reached after an i-th injection. Then, if a participant requires the value at I, this participant will apply the necessary injections to the received values. Finally, an interface a@(R1 ∪pì R2) means that the state might be either R1 or R2, and that all participants pì should be notified of the state.

Well-formedness The above examples are well-formed interfaces: int@pi, (int × vec)@(p1 × p2). Well-formedness ensures that interfaces represent valid states. Generally, a@R is well-formed if a matches the structure of R. For example, int@(p1 × p2) is ill-formed, since a single integer cannot be at two different participants. An interface a@(R1 ∪ R2) requires that both a@R1 and a@R2 are well-formed. So, (vec × vec)@((p1 × p2) ∪ p3) is well-formed, because we can have vec@p1 and vec@p2, or (vec × vec)@p3. However, int@((p1 × p2) ∪ p3) is ill-formed, because int@(p1 × p2) is ill-formed.

3.3 Typing of Parallel Algebraic Language
We introduce a relation that associates Alg expressions with potential parallelisations (PAlg expressions) and their interfaces. This relation can be seen as a type system for both Alg and PAlg. As a type system for PAlg, this relation provides a way to check or infer the output interface of some e. By using this relation as a type system for Alg, we can explore potential parallelisations of some input expression e. Additionally, the type system ensures that all choice point annotations contain every participant that depends on each particular choice.

Typing Rules A judgement of the form ⊢ e ⇒ e : A → B means that the PAlg expression e is one potential parallelisation of the Alg expression e, with domain interface A and codomain interface B. The intuition of a judgement ⊢ e ⇒ e : a@R1 → b@R2 is that the participants in e collectively apply the computation e to the value of type a distributed across R1, and produce a value of type b distributed across R2. We sometimes omit e and write ⊢ e : A → B. We ensure that, given any e and e typeable against interfaces a@Ra → b@Rb, e must have type a → b.

Lemma 3.1. If ⊢ e ⇒ e : a@Ra → b@Rb, then e : a → b.

The typing rules (Fig. 2) must ensure that the participants involved in a choice are notified, and that Alg expressions are correctly expanded.
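The well-formedness check of §3.2 is purely structural, and can be sketched in a few lines of Haskell. The representation below is hypothetical (our own constructor names; the pì annotation on choices is elided, since WF does not inspect it):

  type Pid = Int
  data Ty  = TInt | TVec | TPair Ty Ty | TSum Ty Ty
  data Ann = At Pid                -- p
           | Tag Int Ann           -- ιi I
           | AnnPair Ann Ann       -- I × I
           | Choice Ann Ann        -- R1 ∪pì R2

  wf :: Ty -> Ann -> Bool
  wf _           (At _)        = True              -- any type can sit at one participant
  wf (TPair a b) (AnnPair r s) = wf a r && wf b s  -- structure must match
  wf (TSum a _)  (Tag 1 r)     = wf a r
  wf (TSum _ b)  (Tag 2 r)     = wf b r
  wf t           (Choice r s)  = wf t r && wf t s  -- both branches must be WF
  wf _           _             = False             -- e.g. int@(p1 × p2) is rejected

  -- wf (TPair TVec TVec) (Choice (AnnPair (At 1) (At 2)) (At 3)) == True
  -- wf TInt (Choice (AnnPair (At 1) (At 2)) (At 3))              == False

The two comments at the end mirror the paper's examples: (vec × vec)@((p1 × p2) ∪ p3) is well-formed, while int@((p1 × p2) ∪ p3) is not.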

Rule Choice specifies that a choice point may be introduced at any point when a participant contains a value of a sum type. In such cases, p sends the tag of the sum-type value to any other participant whose behaviour depends on it. After the choice point, the interface is I[ι1 p] ∪pì I[ι2 p], with the constraint that the participants in I[p] must be in pì. Rule Alt specifies that e must be the parallelisation of e, considering both A1 and A2 as input interfaces. The output interface is the union of B1 and B2. Any participant in e must be notified of the choice (pids(e) ⊆ pì), to make sure that they perform the interactions that correspond to the correct Ai. Rule Alg specifies that, given any e and participant p, e@p is a valid parallelisation, with output interface b@p. Finally, rule Ext is crucial for exploring potential parallelisations. It states that if e is the parallelisation of e2, and e2 is extensionally equal to e1, then e is also a parallelisation of e1. The undecidability of this rule requires that the programmer specify rewriting strategies, both for checking and for inference.

  Alg:    ⊢ e : a → b   ⟹   ⊢ e ⇒ e@p : a@I → b@p
  Ext:    ⊢ e2 ⇒ e : a@I → B   and   e1 =ext e2   ⟹   ⊢ e1 ⇒ e : a@I → B
  Alt:    ⊢ e ⇒ e : A1 → B1,   ⊢ e ⇒ e : A2 → B2,   A1 ≠ A2,   pids(e) ⊆ pì
          ⟹   ⊢ e ⇒ e : A1 ∪pì A2 → B1 ∪pì B2
  Choice: pids(I[p]) ⊆ pì,   WF(a@I[ιi p]) for i ∈ [1, 2],   ⊢ e ⇒ e : a@(I[ι1 p] ∪pì I[ι2 p]) → B
          ⟹   ⊢ e ⇒ e ◦ [p ⊕ pì] : a@I[p] → B

  Figure 2. Typing rules of PAlg (selected)

Rewriting and Annotation Strategies We use rewriting strategies when exploring potential parallelisations of functions; this is inference problem (2) below. Let ?i be metavariables. The two inference problems that we are interested in are: 1. solving ⊢ e ⇒ e : A → ?0, which obtains the output interface for e, with input interface A; and 2. solving ⊢ e ⇒ ?0 : A → ?1, whose solutions are potential parallelisations of e, together with their output interfaces. Solving (1) is straightforward. Problem (2) requires deciding how to introduce role annotations (rule Alg), how to perform rewritings (rule Ext), and where to introduce choice points (rule Choice). Introducing choice points is straightforward: we introduce them as early as possible, as soon as an input interface contains a sum type at a participant. For introducing annotations and doing Alg rewritings, the programmer has to specify annotation and rewriting strategies. At the moment, our tool allows the developer to introduce annotations explicitly, or to select sub-expressions that will be annotated with fresh new participants. The rewriting strategies that our current implementation supports are unrollings of recursive definitions. However, our tool is extensible: the equivalences used in the rewritings are a parameter.

Example 3.2 (Mergesort). Consider the mergesort definition ms = recT (id ▽ mrg) spl. Solutions to the inference problem ⊢ ms ⇒ ?0 : Ls@p0 → ?1 provide the alternative parallelisations of ms. By choosing a rewriting strategy that unrolls ms once, and annotates any remaining instances of ms at fresh new participants, we produce the following PAlg expression:

  ⊢ (id ▽ (mrg ◦ (ms × ms))) ◦ spl
  ⇒ (id ▽ (mrg@p1 ◦ ((ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)))) ◦ [p1 ⊕ p1p2p3] ◦ spl@p1
  : Ls@p0 → Ls@p1 ∪p1p2p3 Ls@p1

4 Multiparty Session Types for PAlg
The dataflow of the PAlg constructs determines the communication protocol of the annotated expression. However, it is hard to manually check what this communication structure is. Recall the mergesort PAlg expression of §3, ms, and suppose that we want to produce a parallelisation for a 32-core machine. Then, we might be interested in using a 5-unfolding of ms, so that we have ms executing concurrently on all of the cores. How do we know, in such cases, that we produced a sensible parallelisation? As an example, suppose we use an annotation strategy that produces the following code:

  (id ▽ (mrg@p1 ◦ ((ms@p2 ◦ π1@p1) △ (ms@p2 ◦ π2@p1)))) ◦ [p1 ⊕ p1p2] ◦ spl@p1
  : Ls@p0 → Ls@p1 ∪p1p2 Ls@p1

Notice that this example will run correctly, and produce the expected result. However, the achieved PAlg expression is not parallel! If we represent the implicit dataflow of this expression as explicit communication, the reason becomes apparent. We use global types from multiparty session types to provide an explicit representation of the communication structure of the program:

  p0 → p1 : Ls. p1 → p2 : {ι1. end;
                           ι2. p1 → p2 : Ls. p1 → p2 : Ls. p2 → p1 : Ls × Ls. end}

This global type represents the following protocol: 1. participant p0 sends a list to p1; 2. p1 sends to p2 either ι1 or ι2, and if the label is ι1, the protocol ends; 3. if p1 sent ι2, then p1 sends to p2 two lists, in two different interactions; and 4. p2 replies with a message to p1 with a pair of lists. It is clear from this protocol that p1 and p2 depend on each other's messages, and that p2 cannot perform any computation in parallel. The larger the expression, the harder avoiding such wrong annotations becomes. By changing the annotation strategy, we produce the following parallel structure, where p2 and p3 can operate in parallel:

  p0 → p1 : Ls. p1 → {p2 p3} : {ι1. end;
                                ι2. p1 → p2 : Ls. p1 → p3 : Ls. p2 → p1 : Ls. p3 → p1 : Ls. end}

This abstraction of the communication protocol of an achieved parallelisation is therefore useful as an output for the programmer. Additionally, these global types are a contract that can be enforced on the generated code.
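In terms of the recT sketch from §2, the one-step unrolling used in Example 3.2 is just the hylomorphism law applied once; the two remaining recursive occurrences are then the sub-terms one annotates with fresh participants (reusing recT, mapT, conq and spl from that sketch):

  -- ms = (id ▽ (mrg ∘ (ms × ms))) ∘ spl, unrolled once but extensionally equal:
  msUnrolled :: Ord a => [a] -> [a]
  msUnrolled = conq . mapT (recT conq spl) . spl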

We use this both for proving that our back-end is correct, and for applying low-level code optimisations (e.g. message reordering) guided by the global type, ensuring that they do not introduce any run-time errors. For example, when we find p1 → p2. p2 → p3 in a global type, we mark the send/receive actions of p2 as a point of potential optimisation. If the messages exchanged do not depend on each other, we permute them, performing the send action first, so that p2 is not blocked by a receive action. This is known as asynchronous optimisation [57].

4.1 Multiparty Session Types
Our global types are based on the most commonly used in the literature [22]. We start with a set of participant identifiers, p1, p2, ..., and a set of labels, ι1, ι2, .... These are considered as natural numbers: participant identifiers uniquely identify an independent unit of computation, e.g. thread or process ids; and labels are tags that differentiate branches in the data/control flow. The syntax of global (G) and local (L) types in MPST is given as:

  G ::= p1 → p2 : a. G | p1 → {pj}_{j∈[2,n]} : {ιi. Gi}_{i∈I} | µX. G | X | end
  L ::= p!⟨a⟩. L | p?(a). L | p & {ιi. Li}_{i∈I} | {pj}_{j∈[2,n]} ⊕ {ιi. Li}_{i∈I} | µX. L | X | end

The global type p1 → p2 : a. G denotes a data interaction from p1 to p2 with a value of type a. Branching is represented by p1 → {pj}_{j∈[2,n]} : {ιi. Gi}_{i∈I}, with actions ιi from p1 to all pj, j ∈ [2, n]. end represents the termination of the protocol. µX. G represents a recursive protocol, which is equivalent to [µX. G/X]G. We assume recursive types are guarded.

Each participant in G represents a different participant in a parallel process. Local session types represent the communication actions performed by each participant, i.e. the role of the participant. Since each participant has a unique role, we sometimes refer to them interchangeably. The send type p!⟨a⟩. L expresses the action of sending a value of type a to p, followed by the interactions specified by L. The receive type p?(a). L is its dual, where a value of type a is received from p. The selection type represents the transmission to all pj of a label ιi chosen from the set of labels (i ∈ I), followed by Li. The branching type is its dual. pids(G)/pids(L) denote the set of participants that occur in G/L.

Projection We use a standard definition of projection that uses the full merging operator [24; 27], which allows more well-formed global types than the original projection rules [42]. We write G ↾ p for the projection of G onto the role of p. We illustrate projection with an interaction p0 → p1 : a. G. The projection onto p0 is p1!⟨a⟩. (G ↾ p0), the projection onto p1 is p0?(a). (G ↾ p1), and the projection onto any other role p is G ↾ p. Projection on choices is similar, with the difference that, whenever the role is not at the receiving or sending end of the choice, the different branches must be merged. Two local types can be merged when they are the same, or when they branch on the same role and their continuations can be merged.

We use a standard definition of well-formedness, which states that a global type is well-formed if its projection on all of its roles is defined. We denote: WF(G) = ∀p ∈ pids(G), ∃L. G ↾ p = L.

4.2 Protocol Relation
We now introduce the set of rules that associate a PAlg expression and a domain interface with their global type (Fig. 3). We extend the syntax of global types with G1 ∪pì G2 to represent external choices, i.e. the Gi are the continuations for both branches of a previous choice that affects pì. We also extend the local types, the projection rules ((G1 ∪pì G2) ↾ p = G1 ↾ p ∪pì G2 ↾ p), and the notion of well-formedness. We say that an external choice is well-formed, WF(G1 ∪pì G2), if WF(G1), WF(G2), and, for all p ∉ pì, G1 ↾ p = G2 ↾ p. We omit the annotation of the participants involved in the choice whenever it is not needed. The relation ⊨ e ⇐ A ∼ (G, B) specifies that the parallel code for e and input interface A will behave as the global type G, with output interface B (Fig. 3). The rules are similar to the typing rules of PAlg.

  Choice: ⊨ [p ⊕ pì] ⇐ a[b + c]@I[p] ∼ p → {pì \ p} : {ι1. end; ι2. end}
  Alt:    ⊨ e ⇐ A1 ∼ G1   and   ⊨ e ⇐ A2 ∼ G2   ⟹   ⊨ e ⇐ A1 ∪pì A2 ∼ G1 ∪pì G2
  Alg:    ⊢ e : a → b   ⟹   ⊨ e@p ⇐ a@I ∼ [a@I ⇝ p]
  where:  [a@p1 ⇝ p2] = p1 → p2 : a. end, if p1 ≠ p2;   [a@p ⇝ p] = end;
          [(a × b)@(Ia × Ib) ⇝ p] = [a@Ia ⇝ p] # [b@Ib ⇝ p];   [(a1 + a2)@(ιi I) ⇝ p] = [ai@I ⇝ p]

  Figure 3. Protocol Relation (selected)

Example 4.1 (Mergesort Protocol). The protocol for Example 3.2 is obtained by solving:
⊨ (id ▽ (mrg@p1 ◦ ((ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)))) ◦ [p1 ⊕ p1p2p3] ◦ spl@p1 ⇐ Ls@p1 ∼ ?0.

  p1 → {p2 p3} : {ι1. end;
                  ι2. p1 → p2 : Ls. p1 → p3 : Ls. end}
  # (end ∪^{p1 ⊕ p2 p3} (p2 → p1 : Ls. p3 → p1 : Ls. end))
  = p1 → {p2 p3} : {ι1. end;
                    ι2. p1 → p2 : Ls. p1 → p3 : Ls. p2 → p1 : Ls. p3 → p1 : Ls. end}

4.3 Correctness
We guarantee that, for any e s.t. ⊢ e ⇒ e : A → B, with A and B well-formed, there exists a protocol G, and that it is well-formed and deadlock-free.
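As an illustration of §4.1, the following Haskell sketch gives a simplified projection function. Recursion (µX. G) is omitted, and full merging is approximated by merging only branches whose projections coincide (our simplification, strictly weaker than the merging operator of [24; 27]):

  type Role  = String
  type Label = Int

  data GTy = GMsg Role Role String GTy          -- p1 → p2 : a. G
           | GBra Role [Role] [(Label, GTy)]    -- p → {pj} : {ιi. Gi}
           | GEnd

  data LTy = LSend Role String LTy              -- p!⟨a⟩. L
           | LRecv Role String LTy              -- p?(a). L
           | LSel [Role] [(Label, LTy)]         -- {pj} ⊕ {ιi. Li}
           | LBrn Role [(Label, LTy)]           -- p & {ιi. Li}
           | LEnd
           deriving Eq

  project :: GTy -> Role -> Maybe LTy
  project GEnd _ = Just LEnd
  project (GMsg p q a g) r
    | r == p    = LSend q a <$> project g r     -- sender: emit a send
    | r == q    = LRecv p a <$> project g r     -- receiver: emit a receive
    | otherwise = project g r                   -- uninvolved: skip
  project (GBra p qs bs) r
    | r == p      = LSel qs <$> traverse proj bs
    | r `elem` qs = LBrn p <$> traverse proj bs
    | otherwise   = case traverse (\(_, g) -> project g r) bs of
        Just (l : ls) | all (== l) ls -> Just l -- merge: identical branches only
        _                             -> Nothing
    where
      proj (l, g) = (,) l <$> project g r

Under this reading, WF(G) amounts to project returning a defined local type for every role in pids(G).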

Lemma 4.2 (Existence of Associated Global Type). For all WF(A), if ⊢ e : A → B, then there exists G s.t. ⊨ e ⇐ A ∼ G.

Lemma 4.3 (Protocol Deadlock-Freedom). For all WF(A), if ⊢ e : A → B and ⊨ e ⇐ A ∼ G, then WF(G).

Remark. Since local types abstract the behaviour of multiparty typed processes, a well-formed global type ensures that the end-point processes (programs) typed by that global type are guaranteed to satisfy the properties (such as safety and deadlock-freedom) of local types [27; 43].

5 Code Generation
This section addresses the problem of generating low-level parallel code from PAlg expressions. We prove that the generated code complies with its inferred protocol, which has several implications: (1) code generation does not introduce any concurrency errors, and the parallel code is therefore deadlock-free; and (2) we can prove that the generated code is extensionally equal to the input expression by considering only a representative trace, since any valid interleaving of actions must respect this protocol. The target language of our tool is an indexed monad, the Message Passing Monad (Mp). From Mp, we implement our low-level C backend. We implement an untyped version of Mp as a deep embedding in Haskell, and session typing on top of it. This is suitable for code generation: we only generate parallel code if the monadic actions are typeable against the respective local types. Our definition of Mp has significant differences to other embeddings of session types in Haskell, such as the Session monad by Neubauer and Thiemann [58]. First, our Mp monad is deeply embedded in Haskell; and second, we use type indices instead of an encoding of session types in terms of type classes. Our approach is better suited for compilation, since we manipulate session types, and postpone session typing until code generation.

5.1 Message Passing Monad
Mp comprises four basic operations: send, receive, choice and branching, with a standard (asynchronous) semantics. Additionally, for composing actions that depend on the same choice, we introduce case expressions. Our definition of Mp is based on the free monad construction:

  v ::= x | (v, v) | ιi v | ··· | e v
  m ::= ret v | send p v m | recv p a f | sel pì v f1 f2 | brn p m1 m2 | case f1 f2
  f ::= λx. m

Values v are either primitive values, tagged values ιi v, pairs of values, or the result of applying an Alg expression e to a value. We use standard notation for the monadic unit (ret), bind (≫=) and Kleisli composition: f1 >=> f2 = λx. f1 x ≫= f2. The message-passing constructs are standard, except sel, brn and case, which are used for performing choices and composing actions that depend on the same choice.

Each monadic computation f or m has a type m : Mp L a, where a is the return type of m, and L is the type index of Mp, representing the local type that corresponds to the behaviour of the term m. There is almost a one-to-one correspondence between the terms L and the monadic actions m, so we omit the full definition. The types of the constructs that deal with choices use a new type, ⊎, that is isomorphic to sum types, but that can only be constructed and eliminated by using the corresponding monadic constructs:

  sel pì : a + b → (a → Mp L1 c1) → (b → Mp L2 c2) → Mp (pì ⊕ {ι1. L1; ι2. L2}) (c1 ⊎ c2)
  brn p  : Mp L1 a1 → Mp L2 a2 → Mp (p & {ι1. L1; ι2. L2}) (a1 ⊎ a2)
  case   : (a → Mp L1 c) → (b → Mp L2 d) → a ⊎ b → Mp (L1 ∪ L2) (c ⊎ d)

These constructs ensure that the tag used to build a ⊎ b indeed corresponds to the correct branch of the right choice. We use case to compose actions that depend on a previous choice. While this treatment of ⊎ leads to unnecessary code duplication, our back-end easily optimises cases of the form case f f to avoid code duplication.

Parallel programs We define the basic constructs of PAlg in a bottom-up way by manipulating parallel programs. Parallel programs are mappings from participants to their monadic actions: E ::= [pi ↦ mi]_{i∈I}. If mi : Mp Li ai for all i ∈ I, then we write [pi ↦ mi]_{i∈I} : Mp [pi ↦ Li]_{i∈I} [pi ↦ ai]_{i∈I}. The semantics of both local types and monadic actions is defined in terms of such collections of actions or local types, and shared queues of values W, or queues of types Q; e.g. ⟨E, W⟩ →ℓ ⟨E′, W′⟩ is a transition from E to E′, and from shared queues W to W′, with observable action ℓ. We prove a standard safety theorem (Theorem 5.1 below) that guarantees that if a participant does a transition with some observable action, then so does the type index.

Theorem 5.1 (Soundness). Assume E : Mp C A, m : Mp L a and W : Q. Suppose ⟨E[r ↦ m], W⟩ →ℓ ⟨E[r ↦ m′], W′⟩. Then there exists ⟨C[r ↦ L], Q⟩ →ℓ ⟨C[r ↦ L′], Q′⟩ such that W′ : Q′ and m′ : Mp L′ a.

Mp code generation The translation scheme for Mp code generation works recursively on the structure of PAlg expressions. It takes a PAlg expression e and an interface A, and produces a mapping from all participants in e and A to their respective monadic continuations. We write ⟦e⟧(A), and guarantee that ⟦e⟧(A) : A → Mp G B if ⊨ e ⇐ A ∼ (G, B). This means that if e induces protocol G with interfaces A → B, then the generated code behaves as G, with interfaces A and B. Code generation follows a similar structure to global type inference, and is defined by building PAlg constructs as Mp parallel programs. For example, the translation of e@p : a@I → B requires defining the interactions from an interface I that gather a value of type a at p: ⦅a@I ⇝ p⦆ : a@I → Mp [a@I ⇝ p] (a@p).
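The untyped deep embedding of Mp can be sketched as an ordinary free monad in Haskell. The following is our approximation (Val and the constructor names are our rendering of the grammar above, and the session-type index that the paper layers on top is omitted); as a usage example, we include the behaviour of mergesort's participant p1 from Example 5.4 below, with splV, mrgV and msV standing for Val-level versions of spl, mrg and ms:

  import Control.Monad (ap, liftM, (>=>))

  type Pid = Int
  data Val = VInt Int | VPair Val Val | VTag Int Val deriving Show

  data Mp a = Ret a
            | Send Pid Val (Mp a)        -- send p v m
            | Recv Pid (Val -> Mp a)     -- recv p a f
            | Sel [Pid] Int (Mp a)       -- sel: broadcast the chosen label
            | Brn Pid (Mp a) (Mp a)      -- brn: branch on p's label

  instance Functor Mp     where fmap = liftM
  instance Applicative Mp where pure = Ret; (<*>) = ap
  instance Monad Mp where
    Ret x       >>= k = k x
    Send p v m  >>= k = Send p v (m >>= k)
    Recv p f    >>= k = Recv p (f >=> k)
    Sel ps l m  >>= k = Sel ps l (m >>= k)
    Brn p m1 m2 >>= k = Brn p (m1 >>= k) (m2 >>= k)

  -- Participant p1 of the mergesort: choose, scatter, gather, merge.
  p1Act :: (Val -> Val) -> (Val -> Val) -> Val -> Mp Val
  p1Act splV mrgV x = case splV x of
    VTag 1 v           -> Sel [2, 3] 1 (Ret v)        -- base case: no further comms
    VTag _ (VPair l r) -> Sel [2, 3] 2 $ Send 2 l $ Send 3 r $
                          Recv 2 $ \l' -> Recv 3 $ \r' ->
                          Ret (mrgV (VPair l' r'))
    v                  -> Ret v

  -- Participants p2 and p3: branch on p1's choice; in the ι2 branch, sort and reply.
  workerAct :: (Val -> Val) -> Val -> Mp Val
  workerAct msV x = Brn 1 (Ret x)
                          (Recv 1 $ \v -> Send 1 (msV v) (Ret v))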

The definition is analogous to that of [a@I ⇝ p]. The remainder of the translation is straightforward, so we skip the details.

We prove two main correctness results. We guarantee that the generated code behaves as its inferred protocol (Theorem 5.2). We also guarantee that, regardless of the annotations and interfaces chosen for e, the parallel code always produces the same result as the sequential implementation (Theorem 5.3).

Theorem 5.2 (Protocol Conformance of the Generated Code). If ⊨ e ⇐ A ∼ G, then ⟦e⟧(A) complies with protocol G.

Theorem 5.3 (Extensionality). Assume ⊢ e ⇒ e : a@p → b@R and x : a initially at p. If e x = y, then the execution of ⟦e⟧(p) also produces y, distributed across R.

Example 5.4 (MergeSort Code Generation). We show below the code generation for ms (Example 3.2), with p1 as the domain interface:

  p1   ↦ λx. sel {p2, p3} (spl x) (λx. ret x)
             (λx. send p2 (π1 x) ≫= λ_. send p3 (π2 x) ≫= λ_.
                  recv p2 Ls ≫= λx. recv p3 Ls ≫= λy. ret (mrg (x, y)))
  p2,3 ↦ λx. brn p1 (ret x) (recv p1 Ls ≫= λx. send p1 (ms x))

6 Parallel Algorithms and Evaluation
We evaluate our approach using a number of parallel algorithms derived from Alg expressions, and the speedups achieved. The purpose of this is twofold: (i) showing that our approach achieves speedups for an input sequential algorithm, with naïve annotation strategies and limited optimisations (Fig. 5); and (ii) illustrating the practical value of providing a global type that describes the parallel strategy achieved by a particular annotation strategy (Fig. 4). We run all our experiments on 2 NUMA nodes, with 12 cores per node and 62GB of memory, using Intel Xeon CPU E5-2650 v4 @ 2.20GHz chips. We first run our experiments restricting the execution to a single node, to avoid NUMA effects, and then on the 2 NUMA nodes.

6.1 Benchmarks
Mergesort Mergesort is the usual divide-and-conquer algorithm, using a tree-like parallel reduce.

Cooley-Tukey FFT We use a recursive Cooley-Tukey algorithm. The algorithm starts by splitting the elements of the list into those at even and those at odd positions. Then, it recursively computes the FFTs of both, and finally combines the results. To generate a butterfly pattern, we use: products of size n, to store the results of the subsequent interleavings; product associativity, to produce a perfect tree; and asynchronous optimisations.

Dot Product The dot product algorithm zips the inputs, multiplies them pairwise, and then adds them by folding the result. We use products of size n to derive a scatter-gather.

  Mergesort: p0 → {p1, p2} : {ι1. p0 → p1 : 1 + int. p1 → p0 : µL. end;
                              ι2. p0 → p1 : µL. p0 → p2 : µL. p2 → p1 : µL. p1 → p0 : µL. end}
  FFT:       p0 → p1 : µL. p0 → p2 : µL. p0 → p3 : µL. p0 → p4 : µL.
             p1 → p3 : µL. p3 → p1 : µL. p2 → p4 : µL. p4 → p2 : µL.
             p2 → p1 : µL. p1 → p2 : µL. p3 → p4 : µL. p4 → p3 : µL. end
  Dot Prod.: p0 → p1 : µL. p0 → p2 : µL. p0 → p3 : µL.
             p2 → p3 : int. p1 → p3 : int. p3 → p0 : int. end

  Figure 4. Benchmarks: potential parallelisations.

Additional Algorithms We implemented scalar prod, which recursively splits a matrix into sub-matrices, distributes them to different workers, and then multiplies their elements by a scalar; and quicksort, with a divide-and-conquer structure.
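For reference, the sequential structure of the dot product benchmark is the classic zip–multiply–fold, which can be written point-free (our sketch, not the benchmark's actual source):

  import Control.Arrow ((>>>))

  -- Zip the inputs, multiply pairwise, and fold the results with (+).
  dotProd :: ([Double], [Double]) -> Double
  dotProd = uncurry zip >>> map (uncurry (*)) >>> sum

In the parallelisation of Fig. 4, p0 scatters chunks of the zipped input, the partial sums are combined at p3, and the result is returned to p0.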

6.2 Evaluation
We translate Mp monadic actions to C using pthreads and shared buffers for communication, and we have a preliminary compilation of the first-order sequential terms to C. We compile the generated C code using gcc version 4.8.5. We take the average of 50 repetitions for each benchmark. Our benchmarks achieve reasonable speedups against the sequential C implementations. Fig. 5 presents the speedups against the number of participants for different input sizes, and Fig. 6 presents a summary of our speedups for large inputs of size > 10^9. We show below an analysis of these results, by plotting the speedups against two factors: 1. the number of participants (threads) produced by a particular annotation and recursion unrolling, named K; and 2. the input size, e.g. the number of elements in the input list.

[Figure 5. Benchmark speedups, run in 2 NUMA nodes with 12 cores each. The X-axis is the number of workers (K) of the parallel program generated from a set of annotations and recursion unrolling. We show the results for 4 different input sizes. Panels: DotProd (inputs 2^24–2^30), FFT (inputs 2^20–2^26) and Mergesort (inputs 2^24–2^30); one row of plots on 12 cores (speedups up to 12), one on 2 × 12 cores (speedups up to 18).]

Increasing the number of threads (parameter K) increases the speedups obtained, up to a certain value that depends on the number of available cores and the input size. For benchmarks that work better with dynamic task creation, our tool does not currently achieve good performance (e.g. quicksort). For FFT, our tool produces the usual butterfly pattern from a straightforward recursive definition, with which we achieve a speedup of 12 when running on a single shared-memory node. The rest of the examples are limited either by Amdahl's law (justified by their global types in Fig. 4), or by the overhead of the communication and pthread creation with respect to the cost of the computations, but they still achieve speedups of up to 7 and 8 on 12 cores. We can observe that there is a slowdown after creating a much larger number of participants than required. This usually depends on how evenly we can distribute the data amongst workers, and on whether the workers can be evenly scheduled to different cores. We observe that we can achieve further speedups when running our benchmarks on the 2 NUMA nodes. Overall, we observe that our annotation strategies enable good speedups over the sequential implementation, with relatively little effort.

[Figure 6. Achieved speedups: bar chart comparing 12 cores ("12 C") against 2 nodes ("2 x 12 C") for dot, fft, ms, qs and scalar.]

Global types can be used to detect optimisation opportunities that yield efficient parallelisations, such as the butterfly topology in Fig. 4. Without message reordering based on the session types, FFT participant p3 would need to wait for p1's message before sending its own part to p1, i.e. p3's local type would be p1?(µL). p1!⟨µL⟩. ...; this means that p3's local computation would only become available to p1 after p1 finishes its own local computation, thus sequentialising the code. Asynchronous permutations [16; 57] allow us to permute such actions, and still have communication safety, i.e. p1!⟨µL⟩. p1?(µL). .... Global types capture the structure of the parallelisation, which can in some cases be used to justify the achieved speedups. For example, we can observe that the mergesort global type contains a part that needs to happen sequentially (p0 and the last merging point in p1), and this will prevent us from achieving linear speedups.

7 Related Work
López et al. [50] develop a verification framework for MPI/C inspired by MPST, by translating parameterised protocol specifications to protocols in VCC [19]. They focus on verification, not on code or protocol generation. Ng et al. [59; 60] use parameterised MPST [25] to generate an MPI backbone in C that encapsulates the whole protocol (i.e., every endpoint), and merge it with user-supplied computation kernels. Several authors (e.g. [10]) generate skeleton APIs from extensions of Scribble (www.scribble.org). Their approach requires the protocol to be specified beforehand; it is not extracted from sequential code. Unlike ours, none of the above work formally defines code generation or proves its correctness.

Structured parallelism includes the use of high-level constructs in languages with implicit/data parallelism [5; 12–15; 46; 64], algorithmic skeletons [1; 18; 20; 36; 48], and DSLs/APIs that compile to parallel code [8; 11; 28; 63; 69]. Besides safety, such approaches are often highly optimised. However, most rely on a fixed, predetermined range of patterns, typically by design with respect to their application domains. By contrast, our work only relies on send/receive operations, which makes it highly portable, and it can easily be extended to support further parallel structures by extending the annotation strategies. Optimisations for structured parallel approaches also require studying and defining sets of equivalences between patterns [6; 7; 41]. In contrast, our approach does not require the definition of new sets of equivalences, since these are derived from program equivalences.

Lift is a new language for portable parallel code generation, based on a small set of expressive parallel primitives [40; 67; 68]. Currently, its backend focuses on generating high-performance OpenCL code, while our approach focuses on placing computations on different participants of a concurrent/distributed system. Both approaches could be combined: annotations can be used to generate a high-level message-passing layer that distributes tasks to multiple nodes in a GPU cluster, using the global type to minimise communication costs; then, the code at each participant can be compiled to high-performance GPU code using Lift.

Elliott exploits the idea of giving functional programs multiple interpretations in different categories, and shows examples of applications to multiple domains, including parallelism [29; 30]. Our approach is similar in the sense that we allow the specifications of first-order functional programs to have multiple different interpretations, but we focus on generating parallel code, and provide finer-grained control over the parallelisations by adding participant annotations. There is a large body of literature on using program equivalences to derive parallel implementations, e.g. [17; 32; 37; 39; 49; 51; 55; 56; 65; 66]. Our framework is orthogonal, in that we focus on tying a low-level C back-end to global types. Our front-end, however, supports some basic forms of rewriting, and we plan to extend it in the future with more interesting ones from the literature.

8 Conclusions and Future Work
We have presented a novel approach to protocol inference and code generation. By using this approach, we can reason about the extensionality of parallel programs, and about alternative mappings of computations to participants. We produce the parallel program's global type, i.e. its communication protocol, which acts as a contract for the low-level code, and can be used to pin-point potential optimisations, or to assess the suitability of a parallelisation. This approach has several benefits: 1. our message-passing code is deadlock-free by construction, since it follows the data-flow of the program, and the optimisations must respect the global type; 2. we prove that our parallelisations are extensionally equivalent to the input function. Additionally, PAlg code could be used for further purposes, such as parallel GPU/FPGA code generation, by combining our approach with other state-of-the-art code generation techniques. We will study this in future work.

Though our approach can already generate representative parallel protocols, our framework is extensible. E.g., we can extend our framework with dynamic participants to handle dynamic task generation [26], and we plan to use this to capture a wider range of communication patterns for parallel computing, such as load-balancing or work-stealing. We plan to study the extension of our back-end to heterogeneous architectures, e.g. GPU clusters or FPGAs. While our prototype generates code that can achieve speedups against sequential implementations, the optimisations that we support are very basic, and our generated code can be very large. We plan to introduce optimisations that reduce the number of messages exchanged, further message reorderings guided by the global type, and optimisations of the size of the generated code. Finally, we plan to study the instrumentation of global types to estimate statically the speedups of different parallelisations, and to optimise communication costs.

Acknowledgements
We thank Shuhao Zhang for his contributions to the C backend, described in [70]. We thank Francisco Ferreira for the helpful discussions in the early stages of this work. This work was supported in part by EPSRC projects EP/K011715/1, EP/K034413/1, EP/L00058X/1, EP/N027833/1, EP/N028201/1, and EP/T006544/1.

References
[1] Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, and Massimo Torquati. 2011. Accelerating Code on Multi-cores with FastFlow. In Proc. of 17th International Euro-Par Conference (Euro-Par 2011) (LNCS), Vol. 6853. Springer, 170–181.
[2] John Backus. 1978. Can Programming Be Liberated from the Von Neumann Style?: A Functional Style and Its Algebra of Programs. Commun. ACM 21, 8 (Aug. 1978), 613–641.
[3] Richard Bird and Oege De Moor. 1996. The algebra of programming. In NATO ASI DPD. 167–203.
[4] R. S. Bird. 1989. Algebraic Identities for Program Calculation. Comput. J. 32, 2 (1989), 122–126.
[5] Guy E. Blelloch. 1996. Programming Parallel Algorithms. Commun. ACM 39, 3 (1996), 85–97.
[6] Christopher Brown, Marco Danelutto, Kevin Hammond, Peter Kilpatrick, and Archibald Elliott. 2014. Cost-Directed Refactoring for Parallel Erlang Programs. International Journal of Parallel Programming 42, 4 (Aug 2014), 564–582.
[7] Christopher Brown, Kevin Hammond, Marco Danelutto, and Peter Kilpatrick. 2012. A Language-independent Parallel Refactoring Framework. In Proc. of the 5th Workshop on Refactoring Tools (WRT '12). ACM, New York, NY, USA, 54–58.
[8] Kevin J. Brown, Arvind K. Sujeeth, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2011. A Heterogeneous Parallel Framework for Domain-Specific Languages. In Proc. of 2011 Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT'11). IEEE, 89–100.
[9] Jacques Carette, Oleg Kiselyov, and Chung-chieh Shan. 2009. Finally Tagless, Partially Evaluated: Tagless Staged Interpreters for Simpler Typed Languages. J. Funct. Program. 19, 5 (2009), 509–543.

[10] David Castro, Raymond Hu, Sung-Shik Jongmans, Nicholas Ng, and Nobuko Yoshida. 2019. Distributed Programming Using Role Parametric Session Types in Go. In 46th Symp. on Principles of Programming Languages (POPL'19), Vol. 3. ACM, 29:1–29:30.
[11] Hassan Chafi, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Anand R. Atreya, and Kunle Olukotun. 2011. A Domain-Specific Approach to Heterogeneous Parallelism. In Proc. of the 16th Symp. on Principles and Practice of Parallel Programming (PPoPP'11). ACM, 35–46.
[12] Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon L. Peyton Jones, Gabriele Keller, and Simon Marlow. 2007. Data Parallel Haskell: a Status Report. In Proc. of the POPL 2007 Workshop on Declarative Aspects of Multicore Programming (DAMP'07). ACM, 10–18.
[13] Bradford L. Chamberlain, David Callahan, and Hans P. Zima. 2007. Parallel Programmability and the Chapel Language. IJHPCA 21, 3 (2007), 291–312.
[14] Rohit Chandra. 2001. Parallel Programming in OpenMP. Morgan Kaufmann.
[15] Philippe Charles, Christian Grothoff, Vijay A. Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. 2005. X10: an Object-Oriented Approach to Non-Uniform Cluster Computing. In Proc. of the 20th Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'05). ACM, San Diego, CA, USA, 519–538.
[16] Tzu-chun Chen, Mariangiola Dezani-Ciancaglini, Alceste Scalas, and Nobuko Yoshida. 2017. On the Preciseness of Subtyping in Session Types. Logical Methods in Computer Science 13, 2 (June 2017).
[17] Yun-Yan Chi and Shin-Cheng Mu. 2011. Constructing List Homomorphisms from Proofs. In Proc. APLAS '11: Asian Symposium on Programming Languages & Systems. 74–88.
[18] Philipp Ciechanowicz and Herbert Kuchen. 2010. Enhancing Muesli's Data Parallel Skeletons for Multi-core Computer Architectures. In 12th Intl. Conf. on High Performance Computing and Communications (HPCC'10). IEEE, 108–113.
[19] Ernie Cohen, Markus Dahlweid, Mark A. Hillebrand, Dirk Leinenbach, Michal Moskal, Thomas Santen, Wolfram Schulte, and Stephan Tobies. 2009. VCC: A Practical System for Verifying Concurrent C. In Proc. of the 22nd Intl. Conf. on Theorem Proving in Higher Order Logics (TPHOLs 2009) (LNCS), Vol. 5674. Springer, 23–42.
[20] Murray Cole. 1988. Algorithmic skeletons: a structured approach to the management of parallel computation. Ph.D. Dissertation. University of Edinburgh, UK.
[21] James W. Cooley and John W. Tukey. 1965. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation 19, 90 (1965), 297–301.
[22] Mario Coppo, Mariangiola Dezani-Ciancaglini, Luca Padovani, and Nobuko Yoshida. 2015. A Gentle Introduction to Multiparty Asynchronous Session Types. In 15th International School on Formal Methods for the Design of Computer, Communication and Software Systems: Multicore Programming (LNCS), Vol. 9104. Springer, 146–178.
[23] Alcino Cunha, Jorge Sousa Pinto, and José Proença. 2006. A Framework for Point-Free Program Transformation. In Proc. of 17th Intl. Workshop on Implementation and Application of Functional Languages (IFL 2005). Springer, 1–18.
[24] Romain Demangeon and Kohei Honda. 2012. Nested Protocols in Session Types. In CONCUR 2012 – Concurrency Theory. Springer, 272–286.
[25] Pierre-Malo Deniélou, Nobuko Yoshida, Andi Bejleri, and Raymond Hu. 2012. Parameterised Multiparty Session Types. Logical Methods in Computer Science 8, 4 (2012).
[26] Pierre-Malo Deniélou and Nobuko Yoshida. 2011. Dynamic Multirole Session Types. In POPL'11. ACM, New York, NY, USA, 435–446.
[27] Pierre-Malo Deniélou and Nobuko Yoshida. 2013. Multiparty Compatibility in Communicating Automata: Characterisation and Synthesis of Global Session Types. In 40th International Colloquium on Automata, Languages and Programming (LNCS), Vol. 7966. Springer, 174–186.
[28] Zach DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley, Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex Aiken, Karthik Duraisamy, Eric Darve, Juan Alonso, and Pat Hanrahan. 2011. Liszt: a Domain Specific Language for Building Portable Mesh-based PDE Solvers. In Conference on High Performance Computing Networking, Storage and Analysis, SC 2011. ACM, 9:1–9:12.
[29] Conal Elliott. 2017. Compiling to Categories. Proc. ACM Program. Lang. 1, ICFP, Article 27 (Aug. 2017), 27 pages.
[30] Conal Elliott. 2017. Generic functional parallel algorithms: Scan and FFT. Proc. ACM Program. Lang. 1, ICFP, Article 48 (Sept. 2017), 24 pages.
[31] Maarten M. Fokkinga and Erik Meijer. 1991. Program Calculation Properties of Continuous Algebras. Number CS-R9104 in Report / Department of Computer Science. CWI.
[32] Jeremy Gibbons. 1996. Computing Downwards Accumulations on Trees Quickly. Theoretical Computer Science 169, 1 (1996), 67–80.
[33] Jeremy Gibbons. 1996. The Third Homomorphism Theorem. Journal of Functional Programming (JFP) 6, 4 (1996), 657–665.
[34] Jeremy Gibbons. 2002. Calculating Functional Programs. In Algebraic and Coalgebraic Methods in the Mathematics of Program Construction. Springer, Chapter 5, 151–203.
[35] Jeremy Gibbons. 2007. Datatype-Generic Programming. In Datatype-Generic Programming. Springer, 1–71.
[36] Horacio González-Vélez and Mario Leyton. 2010. A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers. Softw., Pract. Exper. 40, 12 (2010), 1135–1160.
[37] Sergei Gorlatch. 1999. Extracting and Implementing List Homomorphisms in Parallel Program Development. Science of Computer Programming 33, 1 (1999), 1–27.
[38] Sergei Gorlatch. 2004. Send-receive Considered Harmful: Myths and Realities of Message Passing. ACM Trans. Program. Lang. Syst. 26, 1 (Jan. 2004), 47–56.
[39] Sergei Gorlatch and Christian Lengauer. 1995. Parallelization of Divide-and-Conquer in the Bird-Meertens Formalism. Formal Aspects of Computing 7, 6 (1995), 663–682.
[40] Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach. 2018. High Performance Stencil Code Generation with Lift. In Proc. of the 2018 Intl. Symp. on Code Generation and Optimization (CGO 2018). ACM, New York, NY, USA, 100–112.
[41] Kevin Hammond, Marco Aldinucci, Christopher Brown, Francesco Cesarini, Marco Danelutto, Horacio González-Vélez, Peter Kilpatrick, Rainer Keller, Michael Rossbory, and Gilad Shainer. 2013. The ParaPhrase Project: Parallel Patterns for Adaptive Heterogeneous Multicore Systems. Springer, 218–236.
[42] Kohei Honda, Nobuko Yoshida, and Marco Carbone. 2008. Multiparty Asynchronous Session Types. In Proc. of 35th Symp. on Princ. of Prog. Lang. (POPL '08). ACM, New York, NY, USA, 273–284.
[43] Kohei Honda, Nobuko Yoshida, and Marco Carbone. 2016. Multiparty Asynchronous Session Types. J. ACM 63, 1 (2016), 9:1–9:67.
[44] Zhenjiang Hu, Hideya Iwasaki, and Masato Takeichi. 1996. Deriving Structural Hylomorphisms from Recursive Definitions. In Proc. ICFP '96: ACM Int. Conf. on Functional Programming (ICFP '96). ACM, New York, NY, USA, 73–82.
[45] John Hughes. 2000. Generalising monads to arrows. Science of Computer Programming 37, 1 (2000), 67–111.
[46] Guy L. Steele Jr. 2005. Parallel Programming and Parallel Abstractions in Fortress. In 14th International Conference on Parallel Architecture and Compilation Techniques (PACT 2005), St. Louis, MO, USA. IEEE Computer Society, 157.
[47] Daniel J. Lehmann and Michael B. Smyth. 1981. Algebraic Specification of Data Types: A Synthetic Approach. Mathematical Systems Theory 14, 1 (Dec. 1981), 97–139.

[48] Mario Leyton and José M. Piquer. 2010. Skandium: Multi-core Programming with Algorithmic Skeletons. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP 2010), Pisa, Italy. IEEE Computer Society, 289–296.
[49] Yu Liu, Zhenjiang Hu, and Kiminori Matsuzaki. 2011. Towards Systematic Parallel Programming over MapReduce. In Proc. Euro-Par 2011: European Conference on Parallelism. 39–50.
[50] Hugo A. López, Eduardo R. B. Marques, Francisco Martins, Nicholas Ng, César Santos, Vasco Thudichum Vasconcelos, and Nobuko Yoshida. 2015. Protocol-Based Verification of Message-Passing Parallel Programs. In Proc. of the 2015 Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'15). ACM, 280–298.
[51] Frédéric Loulergue, Wadoud Bousdira, and Julien Tesson. 2017. Calculating Parallel Programs in Coq Using List Homomorphisms. International Journal of Parallel Programming 45, 2 (Apr. 2017), 300–319.
[52] Michael McCool, Arch Robison, and James Reinders. 2012. Structured parallel programming: patterns for efficient computation. Elsevier.
[53] Erik Meijer, Maarten Fokkinga, and Ross Paterson. 1991. Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire. In Functional Programming Languages and Computer Architecture. Springer, 124–144.
[54] Erik Meijer and Graham Hutton. 1995. Bananas in Space: Extending Fold and Unfold to Exponential Types. In Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture (FPCA '95). ACM, New York, NY, USA, 324–333.
[55] Akimasa Morihata. 2013. A Short Cut to Parallelization Theorems. In Proc. ICFP 2013: 18th Int. Conf. on Functional Programming. 245–256.
[56] Akimasa Morihata and Kiminori Matsuzaki. 2010. Automatic Parallelization of Recursive Functions using Quantifier Elimination. In Proc. FLOPS '10: Functional and Logic Programming. 321–336.
[57] Dimitris Mostrous and Nobuko Yoshida. 2009. Session-Based Communication Optimisation for Higher-Order Mobile Processes. In Proc. of the 9th Intl. Conf. on Typed Lambda Calculi and Applications (LNCS), Vol. 5608. Springer, 203–218.
[58] Matthias Neubauer and Peter Thiemann. 2004. An Implementation of Session Types. In Practical Aspects of Declarative Languages, 6th International Symposium (PADL 2004), Dallas, TX, USA. 56–70.
[59] Nicholas Ng, José Gabriel de Figueiredo Coutinho, and Nobuko Yoshida. 2015. Protocols by Default - Safe MPI Code Generation Based on Session Types. In Proc. of the 24th Intl. Conf. on Compiler Construction (LNCS), Vol. 9031. Springer, 212–232.
[60] Nicholas Ng and Nobuko Yoshida. 2015. Pabble: parameterised Scribble. Service Oriented Computing and Applications 9, 3-4 (2015), 269–284.
[61] Ross Paterson. 2018. arrows: Arrow classes and transformers. http://hackage.haskell.org/package/arrows-0.4.4.2.
[62] Fethi A. Rabhi and Sergei Gorlatch (Eds.). 2003. Patterns and Skeletons for Parallel and Distributed Computing. Springer.
[63] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman P. Amarasinghe. 2013. Halide: a Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proc. of Conf. on Programming Language Design and Implementation (PLDI '13). ACM, 519–530.
[64] James Reinders. 2007. Intel threading building blocks - outfitting C++ for multi-core processor parallelism. O'Reilly.
[65] Rodrigo C. O. Rocha, Luís Fabrício Wanderley Góes, and Fernando Magno Quintão Pereira. 2016. An Algebraic Framework for Parallelizing Recurrence in Functional Programming. In Proc. of the 20th Brazilian Symposium on Programming Languages (SBLP 2016) (LNCS), Vol. 9889. Springer, 140–155.
[66] David B. Skillicorn. 1993. The Bird-Meertens Formalism as a Parallel Model. Springer.
[67] Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach. 2015. Generating Performance Portable Code Using Rewrite Rules. In Proc. ICFP 2015: 20th ACM Conf. on Functional Prog. Lang. and Comp. Arch. 205–217.
[68] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2017. Lift: a functional data-parallel IR for high-performance GPU code generation. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO 2017), Austin, TX, USA. ACM, 74–85.
[69] Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Tiark Rompf, Hassan Chafi, Michael Wu, Anand R. Atreya, Martin Odersky, and Kunle Olukotun. 2011. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In Proc. of the 28th Intl. Conf. on Machine Learning (ICML'11). Omnipress, 609–616.
[70] Shuhao Zhang. 2019. Session Arrows: A Session-Type Based Framework For Parallel Code Generation. Master's thesis. Imperial College London.

A Further Definitions
A.1 Algebraic Functional Language

  F1, F2 ::= I | Ka | F1 + F2 | F1 × F2
  a, b   ::= 1 | int | ... | a → b | a + b | a × b | F a | µF
  e1, e2 ::= f | v | const e | id | e1 ◦ e2 | πi | e1 △ e2 | ιi | e1 ▽ e2 | F e | inF | outF | recF e1 e2

  ⊢ f : a → b              if f : a → b ∈ Γ
  ⊢ const e : b → a        if ⊢ e : a
  ⊢ id : a → a
  ⊢ inF : F µF → µF        ⊢ outF : µF → F µF
  ⊢ e1 ◦ e2 : a → c        if ⊢ e1 : b → c and ⊢ e2 : a → b
  ⊢ πi : a1 × a2 → ai      (i ∈ [1, 2])
  ⊢ e1 △ e2 : a → b × c    if ⊢ e1 : a → b and ⊢ e2 : a → c
  ⊢ ιi : ai → a1 + a2      (i ∈ [1, 2])
  ⊢ e1 ▽ e2 : a + b → c    if ⊢ e1 : a → c and ⊢ e2 : b → c
  ⊢ F e : F a → F b        if ⊢ e : a → b
  ⊢ recF e1 e2 : a → b     if ⊢ e1 : F b → b and ⊢ e2 : a → F a

Figure 7. Syntax and types of Alg.

A.1.1 Properties of Alg Constructs
Alg constructs are characterised by well-known properties. Sum and product functions are uniquely determined by their universal properties. Composition and identity must satisfy the associativity and cancellation properties. These basic properties are summarised in Fig. 9. Functors preserve identities and composition, and rec satisfies the hylomorphism laws (Fig. 9b).
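The combinators of Fig. 7 can be read as ordinary functions. The following is a minimal Haskell sketch of the main ones (products, coproducts, and recF as a hylomorphism), illustrating the semantics given in Fig. 8 below; it is our own illustration, not the paper's actual EDSL, and all names are ours.

  -- Products: e1 △ e2 pairs the results of both functions;
  -- e1 × e2 acts componentwise, as in Fig. 8.
  split :: (a -> b) -> (a -> c) -> a -> (b, c)
  split e1 e2 x = (e1 x, e2 x)

  prod :: (a -> b) -> (c -> d) -> (a, c) -> (b, d)
  prod e1 e2 = split (e1 . fst) (e2 . snd)

  -- Coproducts: e1 ▽ e2 eliminates a sum; e1 + e2 maps over one.
  caseSum :: (a -> c) -> (b -> c) -> Either a b -> c
  caseSum = either

  coprod :: (a -> b) -> (c -> d) -> Either a c -> Either b d
  coprod e1 e2 = caseSum (Left . e1) (Right . e2)

  -- rec_F e1 e2 is the hylomorphism: the least f with f = e1 ◦ F f ◦ e2.
  hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
  hylo e1 e2 = e1 . fmap (hylo e1 e2) . e2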

The laws of hylomorphisms can be used to perform some common program optimisations. For example, the well-known deforestation transformation can be derived from the hylomorphism equation and the properties of functors; in particular, it is an instance of Equations A.11 and A.7 in Fig. 9. From the universal properties of △ and ▽ (Fig. 9a), a number of equivalences can be derived, e.g.: πi ◦ (e1 △ e2) = ei; (π1 ◦ e) △ (π2 ◦ e) = e; π1 △ π2 = id; (e1 × e2) ◦ (e3 △ e4) = (e1 ◦ e3) △ (e2 ◦ e4); (e1 ▽ e2) ◦ ιi = ei; (e ◦ ι1) ▽ (e ◦ ι2) = e; ι1 ▽ ι2 = id; and (e1 ▽ e2) ◦ (e3 + e4) = (e1 ◦ e3) ▽ (e2 ◦ e4), where i ∈ {1, 2}. The properties of the combinators provide a formal framework for equational reasoning that can be used as a basis for program transformations [31; 34; 53]. These properties have been used for parallelising functions, e.g. [17; 33; 55]. In this paper, we use =ext for the equations in Fig. 9, to distinguish them from syntactic equality (=).

Constant, Identity and Composition:
  const e = λx. e    id = λx. x    e1 ◦ e2 = λx. e1 (e2 x)
Products:
  πi = λ(x1, x2). xi, i ∈ [1, 2]    e1 △ e2 = λx. (e1 x, e2 x)    e1 × e2 = (e1 ◦ π1) △ (e2 ◦ π2)
Coproducts:
  ιi = λx. inji x    e1 ▽ e2 = λ(inji x). ei x    e1 + e2 = (ι1 ◦ e1) ▽ (ι2 ◦ e2)
Functors:
  I a = a    Ka b = a    (F1 † F2) a = F1 a † F2 a, † ∈ {+, ×}
  I e = e    Ka e = id    (F1 † F2) e = F1 e † F2 e
Recursion:
  inF = λx. inF x    outF = λ(inF x). x    recF e1 e2 = f where f = e1 ◦ F f ◦ e2

Figure 8. Semantics of Alg combinators.

A.2 Parallel Algebraic Language
A.2.1 Typing rules
Rules Id, Comp, Proji and Split are standard. The main feature of this type system is the use of eta-expanded sum-types and unions of interfaces to deal with choices. Rule Choice specifies that a choice point may be introduced at any point where a participant contains a value of a sum-type. In such cases, p sends the tag of the sum-type value to any other participant whose behaviour depends on it. After the choice point, the interface is I[ι1 p] ∪pì I[ι2 p], with the constraint that the participants in I[p] must be in pì. Rule Alt specifies that e must be the parallelisation of e, considering both A1 and A2 as input interfaces. The output interface is the union of B1 and B2. Any participant in e must be notified of the choice, pids(e) ⊆ pì, to make sure that they perform the interactions that correspond to the correct Ai. Rule Join is the same as rule Alt, but we do not require the participants in e to be notified of the choice, since the input interface is the same in both branches of the choice. Rule Inji is used to tag an interface with the i-th injection. Then, rule Casei specifies that if ei is the parallelisation of ei, then e1 ▽ e2 is a parallelisation of e1 ▽ e2, given the tagged input interface ιi I. Note that ej, with j ≠ i, is free in rule Casei. We solve these cases by unification (see rule Alt). If this is not possible, then no branch in the control flow will ever reach ej, and so we can set it to any arbitrary parallelisation of ej, or even optimise ej away. Rule Alg specifies that, given any e and participant p, e@p is a valid parallelisation, with output interface b@p. Finally, rule Ext is crucial for exploring potential parallelisations. It states that if e is the parallelisation of e2, and e2 is extensionally equal to e1, then e is also a parallelisation of e1. The undecidability of this rule requires the programmer to specify rewriting strategies, both for checking and for inference.
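Before the worked example below, the following sequential Haskell sketch fixes intuitions for ms = recT mrg spl, with base functor T a x = (1 + a) + x × x. The concrete definitions of spl and mrg are our own illustration, not code from the paper.

  {-# LANGUAGE DeriveFunctor #-}
  -- T a x = (1 + a) + x × x: empty and singleton lists are leaves;
  -- longer lists are split into two halves.
  data T a x = Leaf (Maybe a) | Node x x deriving Functor

  spl :: [a] -> T a [a]
  spl []  = Leaf Nothing
  spl [x] = Leaf (Just x)
  spl xs  = let (l, r) = splitAt (length xs `div` 2) xs in Node l r

  mrg :: Ord a => T a [a] -> [a]
  mrg (Leaf Nothing)  = []
  mrg (Leaf (Just x)) = [x]
  mrg (Node xs ys)    = merge xs ys
    where
      merge xs' []  = xs'
      merge [] ys'  = ys'
      merge (x:xs') (y:ys')
        | x <= y    = x : merge xs' (y:ys')
        | otherwise = y : merge (x:xs') ys'

  -- hylo as in the sketch after Fig. 7; unrolling once gives
  -- ms = mrg ◦ T ms ◦ spl = mrg ◦ (id + ms × ms) ◦ spl.
  hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
  hylo e1 e2 = e1 . fmap (hylo e1 e2) . e2

  ms :: Ord a => [a] -> [a]
  ms = hylo mrg spl

The participant annotations of the example below decide where spl, mrg and the recursive calls run; this sequential meaning is what Theorem 5.3 preserves.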

(a) Basic properties of combinators:
  id ◦ e = e ◦ id = e                                        (A.1)
  F (e1 ◦ e2) = F e1 ◦ F e2                                  (A.2)
  F id = id                                                  (A.3)
  e1 ◦ (e2 ◦ e3) = (e1 ◦ e2) ◦ e3                            (A.4)
  e = e1 △ e2 ⇔ π1 ◦ e = e1 ∧ π2 ◦ e = e2                    (A.5)
  e = e1 ▽ e2 ⇔ e ◦ ι1 = e1 ∧ e ◦ ι2 = e2                    (A.6)
(b) Hylomorphism laws:
  e3 ◦ e4 = id ⇒ (recF e1 e3) ◦ (recF e4 e2) = recF e1 e2    (A.7)
  η : F1 → F2 ⇒ recF1 (e1 ◦ η) e2 = recF2 e1 (η ◦ e2)        (A.8)
  recF in out = idµF                                         (A.9)
  e1, e2 strict ⇒ recF e1 e2 strict                          (A.10)
  e1 strict ∧ e1 ◦ e3 = e5 ◦ F e1 ∧ e4 ◦ e2 = F e2 ◦ e6 ⇒ e1 ◦ (recF e3 e4) ◦ e2 = recF e5 e6   (A.11)

Figure 9. Properties of point-free combinators.

Example A.1 (Rewriting and annotation strategies). We illustrate how rewriting and annotation strategies work using the mergesort (ms) example. Consider the mergesort definition ms = recT mrg spl. Solutions to the inference problem ⊢ ms ⇒ ?0 : Ls@p0 → ?1 provide the alternative parallelisations of ms. The only two rules that can be applied are Alg or Ext. By rule Alg, we can annotate ms at some p1: ⊢ ms ⇒ ms@p1 : Ls@p0 → Ls@p1. Alternatively, we can use the hylomorphism equation, and apply rule Ext:

  ms = recT mrg spl = mrg ◦ T ms ◦ spl = mrg ◦ (id + ms × ms) ◦ spl

We decide which of the rules to apply by querying a collection of rewriting hints, which we call a rewriting strategy. This collection of hints is of the form [e1 : rw1, ...], and must be specified by the programmer. The rewritings rwi are essentially proofs that ei =ext ei′, obtained by applying the equations in Fig. 9. Once a hint is used, it is removed from the collection of hints. For the mergesort example, if we use the rewriting strategy [ms : unroll 1], we apply rule Ext, unroll the hylomorphism equation once, and continue with an empty strategy []. That is, from ⊢ mrg ◦ (id + ms × ms) ◦ spl ⇒ ?0 : Ls@p0 → ?1 and ms =ext mrg ◦ (id + ms × ms) ◦ spl, rule Ext gives ⊢ ms ⇒ ?0 : Ls@p0 → ?1.

With an empty rewriting strategy, the only possibility once we find the atomic function spl is to use rule Alg. To select a participant, we query the annotation strategy. The annotation strategy is a collection of expressions that we require to place at distinct participants. Suppose that our annotation strategy is {spl}. Then, we would need to select a fresh participant p1: ⊢ spl ⇒ spl@p1 : Ls@p0 → ((1 + a) + Ls × Ls)@p1. If the annotation strategy does not contain spl, then we would select any participant from the input interface, to minimise the number of messages exchanged: ⊢ spl ⇒ spl@p0 : Ls@p0 → ((1 + a) + Ls × Ls)@p0. Suppose that the annotation strategy is {spl}. Then, after spl, we have a sum type at p1. This requires us to introduce a choice point. From ⊢ mrg ◦ (id + ms × ms) ⇒ ?0 : ((1 + a) + Ls × Ls)@(ι1 p1 ∪?3 ι2 p1) → ?1 ∪?3 ?2 and the constraint {p1} ⊆ ?3, rule Choice gives:

  ⊢ mrg ◦ (id + ms × ms) ⇒ ?0 ◦ [p1 ⊕ ?3] : ((1 + a) + Ls × Ls)@p1 → ?1 ∪?3 ?2

By collecting all constraints of the form {p1} ⊆ ?3, we can fully determine the minimum list of participants ?3 that we require. To conclude the example, we show the end result of applying the rest of the rules. The final structure follows a divide-and-conquer parallel structure, which may bypass p2 and p3 if the input list is empty or a singleton:

  ⊢ mrg ◦ (id + ms × ms) ◦ spl
  ⇒ mrg@p1 ◦ (id + (ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)) ◦ [p1 ⊕ p1p2p3] ◦ spl@p1
  : Ls@p0 → Ls@p1 ∪p1p2p3 Ls@p1

Join:   ⊢ e ⇒ e : A → B  implies  ⊢ e ⇒ e : A ∪pì A → B ∪pì B
Alt:    ⊢ e ⇒ e : A1 → B1, ⊢ e ⇒ e : A2 → B2, A1 ≠ A2 and pids(e) ⊆ pì  implies  ⊢ e ⇒ e : A1 ∪pì A2 → B1 ∪pì B2
Alg:    ⊢ e : a → b  implies  ⊢ e ⇒ e@p : a@I → b@p
Ext:    ⊢ e2 ⇒ e : a@I → B and e1 =ext e2  implies  ⊢ e1 ⇒ e : a@I → B
Inji:   ⊢ ιi ⇒ ιi : a@I → a@(ιi I)
Id:     ⊢ id ⇒ id : A → A
Proji:  i ∈ [1, 2]  implies  ⊢ πi ⇒ πi : (a1 × a2)@(I1 × I2) → ai@Ii
Comp:   ⊢ e1 ⇒ e1 : B → C and ⊢ e2 ⇒ e2 : A → B  implies  ⊢ e1 ◦ e2 ⇒ e1 ◦ e2 : A → C
Casei:  ⊢ ei ⇒ ei : ai@I → B  implies  ⊢ e1 ▽ e2 ⇒ e1 ▽ e2 : (a1 + a2)@(ιi I) → B
Split:  ⊢ e1 ⇒ e1 : a@I → B and ⊢ e2 ⇒ e2 : a@I → C  implies  ⊢ e1 △ e2 ⇒ e1 △ e2 : a@I → B × C
Choice: ⊢ e ⇒ e : a@(I[ι1 p] ∪pì I[ι2 p]) → B, pids(I[p]) ⊆ pì and tyAt(I, a) = b + c  implies  ⊢ e ⇒ e ◦ [p ⊕ pì] : a@I[p] → B

Figure 10. Typing rules of PAlg.

Definition A.2 (Product and Injection of Choice Interfaces). We sometimes write A × B to represent the product of interfaces that contain choices. We take the product of the respective interfaces, performing first the choices in A, and then the choices in B:

  (R1 ∪pì R2) × R3 = (R1 × R3) ∪pì (R2 × R3)
  I1 × (R1 ∪pì R2) = (I1 × R1) ∪pì (I1 × R2)

We also write injections of interfaces that contain choices:

  ιi (R1 ∪pì R2) = ιi R1 ∪pì ιi R2

Definition A.3 (Well-formedness of interfaces: WF(a@R)). Interface a@R is well formed if R matches the structure of a:

  WF(a@p)
  WF((a1 + a2)@(ιi I))   if WF(ai@I)
  WF((a × b)@(I1 × I2))  if WF(a@I1) and WF(b@I2)
  WF(a@(R1 ∪pì R2))      if WF(a@R1) and WF(a@R2)

For well-formed interfaces, we sometimes propagate the annotation down the type structure, e.g. a@R1 × b@R2 is notation for (a × b)@(R1 × R2). We also define sums of interfaces, a@R1 + b@R2, as notation for (a + b)@(ι1 R1 ∪pì ι2 R2).

Definition A.4 (Projection Rules (G ↾ p) and Merging (L1 ⊓ L2)). Projection defines how to obtain the local type L of a participant p in a global type G, written G ↾ p:

  (p1 → p2 : a. G) ↾ p = p2!⟨a⟩. (G ↾ p)   if p = p1 ≠ p2
                       = p1?(a). (G ↾ p)   if p = p2 ≠ p1
                       = G ↾ p             otherwise

  (p1 → p2 : {li. Gi}i∈I) ↾ p = p2 ⊕ {li. Gi ↾ p}   if p = p1
                              = p1 & {li. Gi ↾ p}   if p = p2
                              = ⊓i∈I (Gi ↾ p)       otherwise

  (µX. G) ↾ p = µX. (G ↾ p)   if G ↾ p ≠ X, for all X
              = end           otherwise
  X ↾ p = X
  end ↾ p = end
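The following Haskell sketch renders projection (Def. A.4) for global types restricted to binary choices and without recursion; Nothing models an undefined projection, and merging is simplified to the branchwise, binary-label case (a degenerate form of the label-set union defined next). The datatypes and names are our own, not the paper's implementation.

  type Role = String
  type Ty   = String

  data GT = GMsg Role Role Ty GT    -- p1 → p2 : a. G
          | GBra Role Role GT GT    -- p1 → p2 {ι1. G1, ι2. G2}
          | GEnd

  data LT = LSend Role Ty LT | LRecv Role Ty LT
          | LSel Role LT LT  | LBrn Role LT LT | LEnd
          deriving Eq

  -- G ↾ p: uninvolved participants of a choice must have mergeable
  -- projections of both branches.
  proj :: GT -> Role -> Maybe LT
  proj GEnd _ = Just LEnd
  proj (GMsg p1 p2 a g) p
    | p == p1   = LSend p2 a <$> proj g p
    | p == p2   = LRecv p1 a <$> proj g p
    | otherwise = proj g p
  proj (GBra p1 p2 g1 g2) p
    | p == p1   = LSel p2 <$> proj g1 p <*> proj g2 p
    | p == p2   = LBrn p1 <$> proj g1 p <*> proj g2 p
    | otherwise = do l1 <- proj g1 p
                     l2 <- proj g2 p
                     mergeL l1 l2

  -- L1 ⊓ L2: identical local types merge; branches on the same role
  -- merge branchwise.
  mergeL :: LT -> LT -> Maybe LT
  mergeL l1 l2 | l1 == l2 = Just l1
  mergeL (LBrn p l1 l2) (LBrn q m1 m2)
    | p == q = LBrn p <$> mergeL l1 m1 <*> mergeL l2 m2
  mergeL _ _ = Nothing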

Merging is defined as follows:

  p & {li. Li}i∈I ⊓ p & {lj. Lj′}j∈J = p & ({lk. Lk ⊓ Lk′}k∈I∩J ∪ {ll. Ll}l∈I\J ∪ {lm. Lm′}m∈J\I)
  µX. L1 ⊓ µX. L2 = µX. (L1 ⊓ L2)
  L ⊓ L = L

Projection onto a role is not necessarily defined. In particular, projecting p1 → p2 : {li. Gi} onto p, with p ≠ p1 and p ≠ p2, is only defined if the projections of all Gi onto p can be merged. Two local types can be merged only if they are the same, or if they branch on the same role and their continuations can be merged. For example, p3's local type for the global type µX. p1 → p2 {l1. p2 → p3 : l2. p1 → p3 : l3. X, l4. p2 → p3 : l5. p1 → p3 : l6. end} is µX. p2 & {l2. p1 & {l3. X}, l5. p1 & {l6. end}}.

Definition A.5 (LTS for Local Type Configurations). ⟨C, Q⟩ →ℓ ⟨C′, Q′⟩, where C = [pi ↦ Li]i∈I and Q = [pi pj ↦ w]i∈I,j∈I:

  ⟨C[p0 ↦ p1!⟨a⟩. L], Q[p0p1 ↦ w]⟩ →p0p1!⟨a⟩ ⟨C[p0 ↦ L], Q[p0p1 ↦ a · w]⟩
  ⟨C[p0 ↦ p1?(a). L], Q[p1p0 ↦ w · a]⟩ →p0p1?(a) ⟨C[p0 ↦ L], Q[p1p0 ↦ w]⟩
  ⟨C[p0 ↦ p1 ⊕ {ιi. Li}i∈I], Q[p0p1 ↦ w]⟩ →p0p1⊕li ⟨C[p0 ↦ Li], Q[p0p1 ↦ li · w]⟩
  ⟨C[p0 ↦ p1 & {ιi. Li}i∈I], Q[p1p0 ↦ w · li]⟩ →p0p1&li ⟨C[p0 ↦ Li], Q[p1p0 ↦ w]⟩

Definition A.6 (Interface projection: A ↾ p). The projection of interface A onto role p is the part of interface A that is located at p. We define the projection for a@R inductively on the structure of R:

  a@p0 ↾ p1 = a if p0 = p1, and 1 otherwise
  (a × b)@(R1 × R2) ↾ p = (a@R1 ↾ p) × (b@R2 ↾ p)
  (a1 + a2)@(ιi R) ↾ p = ai@R ↾ p
  a@(R1 ∪pì R2) ↾ p = (a@R1 ↾ p) ⊎ (a@R2 ↾ p)   if p ∈ pì
                    = a′                        if a′ = (a@R1) ↾ p = (a@R2) ↾ p

A.3 MPST
Definition A.7 (Label Broadcasting). We define a macro that represents the broadcasting of a label to multiple participants in a choice. We write:
• p → {pj}j∈[1,n] : {ιi. Gi}i∈I for p → p1{ιi. p → p2{ιi. ... p → pn{ιi. Gi} ...}}i∈I; and
• {pj}j∈[1,n] ⊕ {ιi. Li}i∈I for p1 ⊕ {ιi. p2 ⊕ {ιi. ... pn ⊕ {ιi. Li} ...}}i∈I.

It is straightforward to show that (p1 → {pj}j∈J : {ιi. Gi}i∈I) ↾ p2 = {pj}j∈J ⊕ {ιi. Gi ↾ p2}i∈I if p1 = p2, and (p1 → {pj}j∈J : {ιi. Gi}i∈I) ↾ p2 = p1 & {ιi. Gi ↾ p2}i∈I if p2 = pj for some j ∈ J.

The relation ⊨ e ⇐ A ∼ G associates e and A with a global type G (Fig. 11). Rules Id, Inji, Proji, and Casei are straightforward. Rule Comp associates two PAlg expressions with the sequencing of their respective global types, G1 # G2. The sequencing produces the global type that results from performing first G1, and then G2, taking into account branching and choices:

  end # G = G
  (p1 → p2 : a. G1) # G2 = p1 → p2 : a. (G1 # G2)
  (p1 → p2{ιi. G1}) # G2 = p1 → p2{ιi. G1 # G2}
  (p1 → p2{ι1. G1; ι2. G2}) # (G3 ∪ G4) = p1 → p2{ι1. G1 # G3; ι2. G2 # G4}
  (G1 ∪ G2) # (G3 ∪ G4) = (G1 # G3) ∪ (G2 # G4)

Rule Choice turns a choice point of p with dependencies pì into a global type choice: p notifies all participants in pì of the branch of the protocol that they need to take. Rule Alt associates e with two protocols, G1 and G2, whenever the input interface is a choice of A1 and A2. Each Gi is the continuation that corresponds to the i-th branch of the choice that led to interface Ai. Note that, contrary to the typing rules of Fig. 2, there is no rule Join. This is because Join was only used to avoid adding too many participants to a choice; at this point, these participants are known, and specified at the choice points. Rule Split associates a split of PAlg expressions with the sequencing of the respective interactions. The rule uses [G2/end]G1 instead of #, because if G1 contains a global type choice, then the interactions described by G2 must be performed after every branch in G1. Since both G1 and G2 start from the same input interface, any choice in either Gi must be independent of any choice in the other Gj. Finally, rule Alg specifies that if the input interface is a@I, and the expression is e@p, then the protocol comprises the sequence of interactions from all participants in I to participant p.
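On the union-free fragment, sequencing specialises to substituting G2 for end in every branch of G1, i.e. [G2/end]G1. A Haskell sketch, reusing the GT type from the projection sketch above (again our own names, not the paper's implementation):

  -- G1 # G2 without interface unions: G2 runs after every branch of G1.
  seqG :: GT -> GT -> GT
  seqG GEnd             g2 = g2
  seqG (GMsg p q a g)   g2 = GMsg p q a (seqG g g2)
  seqG (GBra p q gl gr) g2 = GBra p q (seqG gl g2) (seqG gr g2)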
Example A.8 (Mergesort Protocol). Recall the PAlg expression inferred for ms in Example A.1:

  ⊢ mrg ◦ (id + ms × ms) ◦ spl
  ⇒ mrg@p1 ◦ (id + (ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)) ◦ [p1 ⊕ p1p2p3] ◦ spl@p1
  : Ls@p0 → Ls@p1 ∪p1p2p3 Ls@p1

We need to solve ⊨ mrg@p1 ◦ (id + (ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)) ◦ [p1 ⊕ p1p2p3] ◦ spl@p1 ⇐ Ls@p0 ∼ ?0. The first step is a straightforward application of Comp. The case for [p1 ⊕ p1p2p3] is the result of applying rule Choice. To aid readability, we treat the left and the right branches of the choice separately:

  ⊨ [p1 ⊕ p1p2p3] ⇐ ((1 + a) + Ls × Ls)@p1 ∼ p1 → {p2p3}{ι1. end; ι2. end}

At this point, the input interface is ((1 + a) + Ls × Ls)@(ι1 p1) ∪p1p2p3 ((1 + a) + Ls × Ls)@(ι2 p1). This means that we need to obtain two sub-protocols, for the left and the right branches respectively. The left branch is solved by applying rule Inj1, while the right branch is solved by rules Case, Inj2, Split, Comp and Alg:

  ⊨ id + (ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1) ⇐ ((1 + a) + Ls × Ls)@(ι1 p1) ∼ end
  ⊨ id + (ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1) ⇐ ((1 + a) + Ls × Ls)@(ι2 p1) ∼ p1 → p2 : Ls. p1 → p3 : Ls. end

The interface at this point is ((1 + a) + Ls × Ls)@(ι1 p1 ∪p1p2p3 ι2 (p2 × p3)). For the last expression, mrg@p1, we produce the protocol end ∪ p2 → p1 : Ls. p3 → p1 : Ls. end. This is because, on the left branch, the input is still at p1, so no communication is required; on the right branch, the input is a product of lists, one at p2 and one at p3, so both need to communicate with p1. The final protocol is obtained by applying sequencing:

  p1 → {p2p3} {ι1. end; ι2. p1 → p2 : Ls. p1 → p3 : Ls. end}
  # (end ∪ (p2 → p1 : Ls. p3 → p1 : Ls. end))
  = p1 → {p2p3} {ι1. end; ι2. p1 → p2 : Ls. p1 → p3 : Ls. p2 → p1 : Ls. p3 → p1 : Ls. end}

Id:      ⊨ id ⇐ a@I ∼ end
Inji:    ⊨ ιi ⇐ ai@I ∼ end
Proji:   i ∈ [1, 2]  implies  ⊨ πi ⇐ (a1 × a2)@(I1 × I2) ∼ end
Choice:  ⊨ [p ⊕ pì] ⇐ a[b + c]@I[p] ∼ p → {pì \ p}{ι1. end; ι2. end}
Casei:   ⊨ ei ⇐ ai@I ∼ G  implies  ⊨ e1 ▽ e2 ⇐ (a1 + a2)@(ιi I) ∼ G
Alt:     ⊨ e ⇐ A1 ∼ G1 and ⊨ e ⇐ A2 ∼ G2  implies  ⊨ e ⇐ A1 ∪pì A2 ∼ G1 ∪pì G2
Alg:     ⊢ e : a → b  implies  ⊨ e@p ⇐ a@I ∼ [a@I ⇝ p]
Comp:    ⊨ e2 ⇐ A ∼ G2, ⊨ e1 ⇐ B ∼ G1 and ⊢ e2 : A → B  implies  ⊨ e1 ◦ e2 ⇐ A ∼ G2 # G1
Split:   ⊨ ei ⇐ a@I ∼ Gi (i = 1, 2)  implies  ⊨ e1 △ e2 ⇐ a@I ∼ [G2/end]G1

where [a@p1 ⇝ p2] = p1 → p2 : a. end if p1 ≠ p2;  [a@p ⇝ p] = end;
      [(a × b)@(Ia × Ib) ⇝ p] = [a@Ia ⇝ p] # [b@Ib ⇝ p];
      [(a1 + a2)@(ιi I) ⇝ p] = [ai@I ⇝ p].

Figure 11. Protocol relation.

A.4 Mp
Mp comprises four basic operations: send, receive, choice and branching, with a standard (asynchronous) semantics. Additionally, for composing actions that depend on the same choice, we introduce case expressions.

  v ::= x | (v, v) | ιi v | ··· | e v
  m ::= ret v | m ≫= f | send p v | recv p a | sel pì v f1 f2 | brn p m1 m2 | case f1 f2
  f ::= λx. m

Values v are either primitive values, tagged values ιi v, pairs of values, or the result of applying an Alg expression e to a value. We use standard notation for the monadic unit (ret) and bind (≫=). The term λx. m is a monadic continuation. We write λ_. m when the continuation discards the result of the previous monadic action. We use the standard Kleisli composition: f1 >=> f2 = λx. f1 x ≫= f2.

The message-passing constructs are standard, except sel, brn and case, which are used for performing choices, and for composing actions that depend on the same choice. We explain them in detail below. We include select and branching as syntactic constructs to simplify the typeability of parallel code against local types, but their semantics can be defined in terms of standard pattern matching, plus send and receive operations.

Each monadic computation f or m has a type m : Mp L a, where a is the return type of m, and L is the type index of Mp: it represents the local type that corresponds to the behaviour of the term m. There is an almost one-to-one correspondence between the terms L and the monadic actions m, so we refer the reader to Fig. 16 for the full definition.

Composing Choices. The types of the constructs that deal with choices use a new type, ⊎, that is isomorphic to sum types, but that can only be constructed and eliminated by using the following monadic constructs:

  sel pì : a + b → (a → Mp L1 c1) → (b → Mp L2 c2) → Mp (pì ⊕ {ι1. L1; ι2. L2}) (c1 ⊎ c2)
  brn p  : Mp L1 a1 → Mp L2 a2 → Mp (p & {ι1. L1; ι2. L2}) (a1 ⊎ a2)
  case   : (a → Mp L1 c) → (b → Mp L2 d) → a ⊎ b → Mp (L1 ∪ L2) (c ⊎ d)

These constructs ensure that the tag used to build a ⊎ b indeed corresponds to the correct branch of the right choice. We use case to compose actions that depend on a previous choice. It may seem that this treatment of ⊎ leads to unnecessary code duplication, e.g. the only possibility for composing a single action f after a branch is using case: brn p m1 m2 ≫= case f f. Our back-end easily optimises those cases to avoid code duplication.
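To give operational intuition for send, recv, sel and brn, here is an IO-based Haskell sketch over asynchronous channels. It drops the local-type index of Mp entirely, and every name in it is our own, not the paper's API.

  import Control.Concurrent.Chan

  type Role = String
  data Msg  = Val Int | Label Int          -- simplified payloads
  type Net  = (Role, Role) -> Chan Msg     -- one FIFO per ordered pair, like W

  sendTo :: Net -> Role -> Role -> Msg -> IO ()
  sendTo net from to = writeChan (net (from, to))

  recvFrom :: Net -> Role -> Role -> IO Msg
  recvFrom net from to = readChan (net (from, to))

  -- sel: broadcast the tag of a sum value to the dependent participants,
  -- then run the continuation of the chosen branch.
  sel :: Net -> Role -> [Role] -> Either a b
      -> (a -> IO c) -> (b -> IO c) -> IO c
  sel net me ps v f1 f2 = do
    mapM_ (\p -> sendTo net me p (Label (either (const 1) (const 2) v))) ps
    either f1 f2 v

  -- brn: wait for the tag from the chooser, then take that branch.
  brn :: Net -> Role -> Role -> IO c -> IO c -> IO c
  brn net me p m1 m2 = do
    t <- recvFrom net p me
    case t of
      Label 1 -> m1
      Label 2 -> m2
      _       -> ioError (userError "protocol violation")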

By the definition of the monadic bind, when we compose a branch or select with a case expression, the final local type cannot contain ∪. To illustrate this, consider m1 : Mp L1 (a ⊎ b) and f2 : a ⊎ b → Mp (L2 ∪ L3) (c ⊎ d). The local type of m1 ≫= f2 must be L1 # (L2 ∪ L3). But that is only defined if L1 contains a branch or select. Therefore, m1 ≫= f2 is only well-typed if f2 is a case expression on the tag introduced by the topmost branch or select of m1.

Parallel programs. We define the basic constructs of PAlg in a bottom-up way, by manipulating parallel programs. Parallel programs are mappings from participants to their monadic actions: E ::= [pi ↦ mi]i∈I. If mi : Mp Li ai for all i ∈ I, then we write [pi ↦ mi]i∈I : Mp [pi ↦ Li]i∈I [pi ↦ ai]i∈I.

The semantics of both local types and monadic actions is defined in terms of such collections of actions or local types, and shared queues of values W, or queues of types Q; e.g. ⟨E, W⟩ ⇝ℓ ⟨E′, W′⟩ is a transition from E to E′, and from shared queues W to W′, with observable action ℓ. We prove a standard safety theorem (Theorem 5.1 below), which guarantees that if a participant takes a transition with some observable action, then so does the type index.

Theorem 5.1 (Soundness). Assume E : Mp C A, m : Mp L a and W : Q. Suppose ⟨E[r ↦ m], W⟩ ⇝ℓ ⟨E[r ↦ m′], W′⟩. Then there exists ⟨C[r ↦ L], Q⟩ →ℓ ⟨C[r ↦ L′], Q′⟩ such that W′ : Q′ and m′ : Mp L′ a.

Notations and Operations for Parallel Programs. We simplify the notation for E when all Li are projections of the same global type, and the ai are projections of the same interface. We define the projection of an interface at a participant, A ↾ p, to be the part of A that is at p (Definition A.6). Whenever we have mp : Mp (G ↾ p) (A ↾ p) for all participants p ∈ G, we use the notation [p ↦ mp]p∈pids(G) : Mp G A. This means that the collection of all actions mp behaves as prescribed by G, and produces its result in interface A. Finally, if we have E = [p ↦ fp : A ↾ p → Mp (G ↾ p) (B ↾ p)]p∈pids(G), we write E : A → Mp G B.

Parallel programs have a default value for participants that are not in their domain. Unless otherwise specified, this default value is the identity. For example, E(p) = f if E = E′[p ↦ f], and E(p) = λx. ret x if p ∉ E. We specify the default value using the underscore character as a key in the mapping from participants to monadic actions: [_ ↦ f].

Distributed Values and Execution. We define the execution of a parallel program on a distributed value below. A distributed value V : a@R is a mapping from participants to the value that they hold in the respective interface: [pi ↦ (vi : (a@R) ↾ pi)]i∈I : a@R. Additionally, we require unit to be the default value, so if p ∉ pids(R), then V(p) = ().

Definition A.9 (Execution). Given E = [pi ↦ fi]i∈I and X = [pj ↦ xj]j∈J, we define E(X) = [pi ↦ fi X(pi)]i∈I, with X(pi) = xi if i ∈ J, and X(pi) = () otherwise. Given Y = [pk ↦ yk]k∈K, we say that E(X) executes to Y, written E(X) ⇝* Y, if there is a trace ⟨E(X), ∅⟩ ⇝* ⟨[pi ↦ ret Y(pi)]i∈I, ∅⟩. We write E(X) = Y whenever there is a unique Y s.t. for all Z, E(X) ⇝* Z implies that Z = Y.

Composition and Identity. Composition is the standard Kleisli composition, extended to parallel programs as follows: E1 >=> E2 = [p ↦ E1(p) >=> E2(p)]p∈E1∪E2. Then, E2 ◦ E1 = E1 >=> E2. Identity is simply the empty program with just the default value: id = [].

Split and Projection. The split operation is the participant-wise split, and the i-th projection is the program with the i-th projection as the default value:

  E1 △ E2 = [p ↦ λx. E1(p) x ≫= λy. E2(p) x ≫= λz. ret (y, z)]p∈E1∪E2
  πi = [_ ↦ λx. ret (πi x)]

Case and Injection. Case expressions never occur during code generation, since they are resolved by choices. Injections only tag a branch in the protocol, so we define them as the identity: ιi = [].

Choices. Choices are performed by the participant holding a value of a sum-type, and the tag is notified to the list of participants that depend on it. The definition uses the functions getI(x) and putI(y, x) to extract the value of a sum-type from the hole of a one-hole context I (§3.1), and to replace the value at the hole, respectively:

  [p0 ⊕ p0p1···pn] =
    [p0 ↦ λx. sel (p1···pn) (getI(x)) (λy. ret (putI(y, x))) (λy. ret (putI(y, x))),
     p1 ↦ λx. brn p0 (ret x) (ret x),
     ...,
     pn ↦ λx. brn p0 (ret x) (ret x)]

The presence of the type ⊎ means that we might need to perform a case expression to inspect the result of a previous choice: we define E1 ⊎pì E2 for this.

  E1 ⊎pì E2 = [p ↦ λx. case E1(p) E2(p) x]p∈pì ∪ (E1 ∪ E2)\{pì}

The definition of ⊎pì means that the participants involved in a choice perform a case expression to inspect which branch they need to take, while the rest of the participants continue as specified by either E1 or E2. Note that (E1 ∪ E2)(p) produces E2(p) if p ∈ E1 ∩ E2. This is not an issue during code generation: any participant that is not involved in a choice has the same continuation in both branches.

A.5 Mp code generation
The translation scheme for Mp code generation (Fig. 12) is defined recursively on the structure of PAlg expressions. It takes a PAlg expression e and an interface A, and produces a mapping from all participants in e and A to their respective monadic continuations. We write ⟦e⟧(A), and guarantee that ⟦e⟧(A) : A → Mp G B, if ⊨ e ⇐ A ∼ (G, B). This means that if e induces protocol G with interfaces A → B, then the generated code behaves as G, with interfaces A and B.
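Concretely, a parallel program can be represented as a finite map from participants to Kleisli arrows. The sketch below shows participant-wise composition and split in IO, ignoring the default-value convention and the session-type index; all names are ours.

  import Control.Monad ((>=>))
  import qualified Data.Map as M

  type Role = String
  type Prog a b = M.Map Role (a -> IO b)

  -- E1 >=> E2, assuming both programs mention the same participants
  -- (the paper completes missing entries with a default identity).
  composeP :: Prog a b -> Prog b c -> Prog a c
  composeP = M.intersectionWith (>=>)

  -- E1 △ E2: every participant runs both actions on its input and
  -- pairs the results, as in the definition above.
  splitP :: Prog a b -> Prog a c -> Prog a (b, c)
  splitP = M.intersectionWith (\f g x -> (,) <$> f x <*> g x)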

Code generation follows a similar structure to global type inference. For code generation, we require a partial function cod(e, A) that infers the codomain interface of e using Fig. 2. The translation to Mp requires defining the interactions from an interface I that gathers a type a at p: ⦅a@I ⇝ p⦆ : a@I → Mp [a@I ⇝ p] (a@p). The definition of ⦅·⦆ is analogous to that of [a@I ⇝ p]. The remainder of the translation is straightforward, built on top of the previous definitions.

  ⦅a@p1 ⇝ p0⦆ = [p1 ↦ λx. send p0 x, p0 ↦ λ_. recv p1 a]
  ⦅(a1 + a2)@(ιi I) ⇝ p⦆ = ⦅ai@I ⇝ p⦆ >=> [p ↦ λx. ret (ιi x)]
  ⦅(a × b)@(I1 × I2) ⇝ p⦆ = (⦅a@I1 ⇝ p⦆ × ⦅b@I2 ⇝ p⦆) >=> [pi ↦ λ_. ret ()]pi∈pids(I1×I2)\{p}

  ⟦id⟧(a@I) = []        ⟦ιi⟧(a@I) = []
  ⟦e@p⟧(a@I) = ⦅a@I ⇝ p⦆ >=> [p ↦ λx. ret (e x)]
  ⟦e1 △ e2⟧(a@I) = ⟦e1⟧(a@I) △ ⟦e2⟧(a@I)
  ⟦e1 ◦ e2⟧(A) = ⟦e2⟧(A) >=> ⟦e1⟧(cod(e2, A))
  ⟦e1 ▽ e2⟧((a1 + a2)@(ιi I)) = ⟦ei⟧(ai@I)
  ⟦πi⟧((a × b)@(I1 × I2)) = πi
  ⟦[p ⊕ pì]⟧(a@I[p]) = [p ⊕ pì]
  ⟦e⟧(a@(R1 ∪pì R2)) = ⟦e⟧(a@R1) ⊎pì ⟦e⟧(a@R2)

Figure 12. Translating PAlg to Mp code.

Protocol Compliance. Theorem 5.2 guarantees that the generated code follows the protocol inferred using the relation in Fig. 3. This fact is enough to guarantee that the generated code is deadlock-free. Moreover, we can use it to prove that the generated code is extensionally equal to the input Alg expression. We state this in Theorem 5.3.

Theorem 5.2 (Protocol Conformance of the Generated Code). If ⊨ e ⇐ A ∼ G, then ⟦e⟧(A) complies with protocol G.

This theorem is proved by induction on the structure of the derivation of ⊨, and by the definition of ↾. This result guarantees that the generated code corresponds to the protocol inferred from e. Since the protocol inferred from e is deadlock-free, so is the generated code. See Appendix B.3.

Extensional Equivalence. In addition to deadlock-freedom and protocol compliance, we prove that if e is the annotation of e, then running the code generated from e on x produces the same result as evaluating e on x. This guarantees that, regardless of the annotations and interfaces chosen for e, the parallel code always produces the same result as the sequential implementation. We state this below, in Theorem 5.3, and refer to Appendix C for the full proof.

We specify the extensionality theorem on runnable parallel programs, which are those with a single entry/exit point, i.e. a master worker pm that starts the computation and gathers the results. Given any e, we can guarantee that pm is the entry point by setting it to be the domain interface: ⊢ e : a@pm → b@R. To make it the exit point, we need to make sure that it is also the codomain interface. We can do this by forcing the participants in R to send their values to pm as follows: ⊢ id@pm ◦ e : a@pm → b@pm. However, due to the presence of choices, the codomain interface may contain ∪. For example, if e : a@pm → b@(I1 ∪pì I2), with I1 ≠ I2, then ⊢ id@pm ◦ e : a@pm → b@(pm ∪pì pm), with pm ∈ pì. This means that b@(pm ∪pì pm) ↾ pm = b ⊎ b. To obtain a single value of type b, we use the function join : a ⊎ a → a, which is equivalent to id ▽ id for regular sum-types. We lift it to a monadic action, join(R), to join all branches in R:

  join(I) = []
  join(R1 ∪pì R2) = (join(R1) ⊎pì join(R2)) ≫= [p ↦ λx. ret (join x)]p∈pì

Note that join(R1 ∪pì R2) is only defined if (a@R1 ↾ p) = (a@R2 ↾ p) for all roles. The runnable parallel program for e : a@pm → b@R, written J⟦e⟧(pm), is defined as follows:

  J⟦e⟧(pm) = ⟦id@pm ◦ e⟧(a@pm) ≫= [pm ↦ join(R)]

Our extensionality statement specifies that executing the runnable parallel program for e, with master pm and value x, produces value e x at pm.

Theorem 5.3 (Extensionality). Assume ⊢ e ⇒ e : a@p → b@R and x : a initially at p. If e x = y, then the execution of J⟦e⟧(p) also produces y, distributed across R.

Example A.10 (MergeSort Code Generation). We start with the annotated ms from Example A.1, and use p1 as the master role, to avoid the initial communication from p0 to p1:

  pms = mrg@p1 ◦ (id + (ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)) ◦ [p1 ⊕ p1p2p3] ◦ spl@p1
      : Ls@p1 → Ls@p1 ∪p1p2p3 Ls@p1

Note that ⟦id@p1 ◦ pms⟧ would be equivalent to ⟦pms⟧, because the output interface is already of the form p1 ∪ p1. Therefore, for simplicity, we show ⟦pms⟧, and use it to produce the runnable parallel program. Fig. 14 shows the code generation process as a table, where the i-th column is the current code for participant pi, and the last column shows the expression and input interface that are translated next.

Figure 14. Example translation. Each row lists the current code for p1, p2 and p3, followed by the expression and input interface that are translated next:

  (1) p1: λx. ret x    p2: λx. ret x    p3: λx. ret x
      next: ⟦spl@p1⟧(Ls@p1)
  (2) p1: λx. ret (spl x)    p2: λx. ret x    p3: λx. ret x
      next: ⟦[p1 ⊕ p1p2p3]⟧(((1 + a) + Ls × Ls)@p1)
  (3) p1: λx. ret (spl x) ≫= λx. sel {p2, p3} x (λx. ret x) (λx. ret x)
      p2: λx. brn p1 (ret x) (ret x)    p3: λx. brn p1 (ret x) (ret x)
      next: ⟦id + (ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)⟧(((1 + a) + Ls × Ls)@(ι1 p1 ∪p1p2p3 ι2 p1))

From this point, we need to produce the code for the two branches. The left branch is straightforward: it is simply λx. ret x for all participants. The result for the right branch is shown in Fig. 15.

Figure 15. Right branch for mergesort, ⟦(ms@p2 ◦ π1@p1) △ (ms@p3 ◦ π2@p1)⟧((Ls × Ls)@p1):

  p1: λx. ret (π1 x) ≫= λy. send p2 y ≫= λz. ret (π2 x) ≫= λt. send p3 t ≫= λr. ret (z, r)
  p2: λ_. recv p1 Ls ≫= λx. ret (ms x) ≫= λx. ret (x, ())
  p3: λ_. recv p1 Ls ≫= λx. ret (ms x) ≫= λx. ret ((), x)

Next, we combine both branches using case, and compose them with the previous result. To avoid unnecessary pattern-matching expressions such as case, we optimise the code using rules of the form sel pì f1 f2 ≫= case f3 f4 = sel pì (f1 ≫= f3) (f2 ≫= f4). Additionally, we optimise away all instances of e.g. (x, ()) using the fact that 1 × a ≅ a × 1 ≅ a. We show below the code for all pi, after composing with join(p1 ∪p1p2p3 p1) and applying these optimisations; the two alternatives passed to sel and brn correspond to the two branches of the protocol:

  p1 ↦ λx. sel {p2, p3} (spl x)
           (λx. ret (mrg (ι1 x)))
           (λx. send p2 (π1 x) ≫= λy. send p3 (π2 x) ≫= λ_.
                recv p2 Ls ≫= λx. recv p3 Ls ≫= λy. ret (mrg (ι2 (x, y))))
         ≫= λx. ret (join x)
  p2 ↦ λx. brn p1 (ret x) (recv p1 Ls ≫= λx. ret (ms x) ≫= λx. send p1 x) ≫= λx. ret (join x)
  p3 ↦ λx. brn p1 (ret x) (recv p1 Ls ≫= λx. ret (ms x) ≫= λx. send p1 x) ≫= λx. ret (join x)

We show in Figure 13 the step-by-step execution of this code on the distributed value [p1 ↦ v], with spl v = ι2 (v1, v2). The final result is equal to p1 applying ms directly to the input.

Figure 13. Step-by-step execution of the parallel code for ms. The input is [p1 ↦ v], with spl v = ι2 (v1, v2):

Step 1:
  p1 ↦ sel {p2, p3} (spl v) (λx. ret (mrg (ι1 x)))
         (λx. send p2 (π1 x) ≫= λy. send p3 (π2 x) ≫= λ_.
              recv p2 Ls ≫= λx. recv p3 Ls ≫= λy. ret (mrg (ι2 (x, y)))) ≫= λx. ret (join x)
  p2 ↦ brn p1 (ret x) (recv p1 Ls ≫= λx. ret (ms x) ≫= λx. send p1 x) ≫= λx. ret (join x)
  p3 ↦ brn p1 (ret x) (recv p1 Ls ≫= λx. ret (ms x) ≫= λx. send p1 x) ≫= λx. ret (join x)

Step 2 (p1 selects the right branch, since spl v = ι2 (v1, v2)):
  p1 ↦ send p2 (π1 (v1, v2)) ≫= λy. send p3 (π2 (v1, v2)) ≫= λ_.
         recv p2 Ls ≫= λx. recv p3 Ls ≫= λy. ret (br2 (mrg (ι2 (x, y)))) ≫= λx. ret (join x)
  p2 ↦ recv p1 Ls ≫= λx. ret (ms x) ≫= λx. send p1 x ≫= λx. ret (br2 x) ≫= λx. ret (join x)
  p3 ↦ recv p1 Ls ≫= λx. ret (ms x) ≫= λx. send p1 x ≫= λx. ret (br2 x) ≫= λx. ret (join x)

Step 3 (after the halves are delivered):
  p1 ↦ recv p2 Ls ≫= λx. recv p3 Ls ≫= λy. ret (br2 (mrg (ι2 (x, y)))) ≫= λx. ret (join x)
  p2 ↦ ret (ms v1) ≫= λx. send p1 x ≫= λx. ret (br2 x) ≫= λx. ret (join x)
  p3 ↦ ret (ms v2) ≫= λx. send p1 x ≫= λx. ret (br2 x) ≫= λx. ret (join x)

Step 4 (final):
  p1 ↦ ret (mrg (ι2 (ms v1, ms v2))) = ret (mrg ((id + ms × ms) (spl v))) = ret (ms v)
  p2 ↦ ret ()        p3 ↦ ret ()

Typing Mp against Local Types. We define a relation between Mp code and the local type that captures its communication behaviour (Fig. 16). We define a judgement of the form Γ ⊢ m : Mp L a, where Mp L a is the type of an Mp expression that conforms to protocol L and returns a value of type a. The types are parameterised by a variable l that represents a local type continuation. The rules in Fig. 16 are straightforward, since they relate in a one-to-one way to the constructs of local types.

Semantics. The operational semantics of Mp terms is standard, and mirrors that of the local type configurations in [27]. The operational semantics is defined as an LTS with transitions of the form ⟨[pi ↦ mi]i∈I, W⟩ ⇝ℓ ⟨[pi ↦ mi′]i∈I, W′⟩. Here, ℓ ::= p0p1!⟨a⟩ | p0p1?(a) | p0p1 ⊕ ιi | p0p1 & ιi is the observable action that takes place, and represents, respectively: p0 sends a value of type a to p1; p1 receives from p0; p0 sends label i to p1; and p1 receives label i from p0. We use the special symbol ϵ to represent that no communication took place. Finally, W is a mapping from ordered pairs of roles to unbounded buffers that contain the data sent between participants.

Definition A.11 (LTS for Mp Terms). ⟨P, W⟩ ⇝ℓ ⟨P′, W′⟩ denotes that P, W transitions to P′, W′ with action ℓ. The transition rules are defined in Fig. 17.

Similarly, we define ⟨C, Q⟩ →ℓ ⟨C′, Q′⟩ for the LTS of local type configurations (Definition A.5). Here, C is a collection of local types, C = [pi ↦ Li]i∈I, and Q is a mapping from ordered pairs of roles to unbounded buffers that contain the types of the data exchanged. We also say that W is compatible with Q, written W : Q, if for all pairs p1p2, whenever w1 ··· wn = W(p1p2), then a1 ··· an = Q(p1p2), and for all i, wi : ai.

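As a concrete rendering of a single transition of Def. A.11, here is a Haskell sketch of one asynchronous step for a send/receive fragment of Mp, with configurations of processes and queues as in the definition; the datatypes and names are our own simplification.

  import qualified Data.Map as M

  type Role = String
  data Act  = Ret Int | Send Role Int Act | Recv Role (Int -> Act)
  type Conf = (M.Map Role Act, M.Map (Role, Role) [Int])

  -- One step of participant p: a send enqueues and never blocks
  -- (asynchrony); a receive dequeues only when a value is available.
  step :: Role -> Conf -> Maybe Conf
  step p (procs, queues) = case M.lookup p procs of
    Just (Send q v k) ->
      Just ( M.insert p k procs
           , M.insertWith (\new old -> old ++ new) (p, q) [v] queues )
    Just (Recv q k) -> case M.findWithDefault [] (q, p) queues of
      (v:vs) -> Just (M.insert p (k v) procs, M.insert (q, p) vs queues)
      []     -> Nothing      -- blocked: the queue from q is empty
    _ -> Nothing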
Ret:    Γ ⊢ v : a  implies  Γ ⊢ ret v : Mp end a
Send:   Γ ⊢ v : a  implies  Γ ⊢ send r v : Mp (r!⟨a⟩. end) 1
Recv:   Γ ⊢ recv r a : Mp (r?(a). end) a
Abs:    Γ, x : a ⊢ m : Mp L b  implies  Γ ⊢ λx. m : a → Mp L b
Bind:   Γ ⊢ m : Mp L1 a and Γ ⊢ f : a → Mp L2 b  implies  Γ ⊢ m ≫= f : Mp (L1 # L2) b
Branch: Γ ⊢ m1 : Mp L1 a1 and Γ ⊢ m2 : Mp L2 a2  implies  Γ ⊢ brn r m1 m2 : Mp (r & {ι1. L1; ι2. L2}) (a1 ⊎ a2)
Select: Γ ⊢ v : a + b, Γ ⊢ f1 : a → Mp L1 c1 and Γ ⊢ f2 : b → Mp L2 c2  implies  Γ ⊢ sel v {rj}j∈J f1 f2 : Mp ({rj}j∈J ⊕ {ι1. L1; ι2. L2}) (c1 ⊎ c2)
Case:   Γ ⊢ f1 : a1 → Mp L1 b1 and Γ ⊢ f2 : a2 → Mp L2 b2  implies  Γ ⊢ case f1 f2 : a1 ⊎ a2 → Mp (L1 ∪ L2) (b1 ⊎ b2)

Figure 16. Typing rules for Mp code.

Whenever a role performs a choice, we have a type with one hole, a@I[p], with a sum-type at the hole, (b + c)@p. The code for p requires extracting the sum type from the type a, and setting the value at the hole pointed to by I. This is because p may occur deep in I[p], and therefore a@I[p] ↾ p may differ from b + c. We achieve this with the following families of functions, getI : a → typeAt(I, a) and putI : c × a → substTy(I, a, c), where typeAt and substTy get and set the type at the hole in I; their definitions are given in Fig. 18.

B Deadlock-Freedom
B.1 Proof of Lemma 4.2
Lemma 4.2 (Existence of Associated Global Type). For all WF(A), if ⊢ e : A → B, then there exists G s.t. ⊨ e ⇐ A ∼ G.

Proof. By induction on the structure of the derivation ⊢ e : A → B.

Case Join. ⊢ e ⇒ e : A ∪pì A → B ∪pì B. By the IH, ⊨ e ⇐ A ∼ G. By rule Alt, ⊨ e ⇐ A ∪pì A ∼ G ∪ G.

Case Alg. ⊢ e ⇒ e@p : a@I → b@p. By Alg, ⊨ e@p ⇐ a@I ∼ [a@I ⇝ p].

Case Inji. By Inji, ⊨ ιi ⇐ A ∼ end.

Case Alt. ⊢ e ⇒ e : A1 ∪pì A2 → B1 ∪pì B2, with pids(e) ⊆ pì. By the IH, ⊨ e ⇐ A1 ∼ G1 and ⊨ e ⇐ A2 ∼ G2. By Alt, ⊨ e ⇐ A1 ∪pì A2 ∼ G1 ∪pì G2.

Case Id. By ⊨ id ⇐ a@I ∼ end.

Case Choice. ⊢ e ⇒ [p ⊕ pì] : a@I[p] → a@(I[ι1 p] ∪pì I[ι2 p]). By rule Choice, ⊨ [p ⊕ pì] ⇐ a@I[p] ∼ p → {pì}{ι1. end; ι2. end}.

Case Proji. By rule Proji, ⊨ πi ⇐ A1 × A2 ∼ end.

Case Comp. ⊢ e1 ◦ e2 ⇒ e1 ◦ e2 : A → C. This implies that ⊢ e1 ⇒ e1 : B → C and ⊢ e2 ⇒ e2 : A → B. By the IH, ⊨ e2 ⇐ A ∼ G2 and ⊨ e1 ⇐ B ∼ G1. We proceed by induction on the size of B. In the base case, B = b@I, so G2 contains no unions and no choices; therefore G2 # G1 is defined, and equal to [G1/end]G2. If B = b@(R1 ∪pì R2), then G1 = G11 ∪ G12, with ⊨ e1 ⇐ b@Ri ∼ G1i. There are two cases: (1) G2 = G21 ∪ G22, or (2) there is a choice in G2, i.e. G2 = [p → pì{ι1. G21; ι2. G22}/end]G2′. In the first case, each G2i # G1i is defined by the IH, so (G21 ∪ G22) # (G11 ∪ G12) = (G21 # G11) ∪ (G22 # G12). In the second case, there must be a sub-expression e2′ of e2 following the choice point [p ⊕ pì], with e2′ : d@I[ιi p] → b@Ri and ⊨ e2′ ⇐ d@I[ιi p] ∼ G2i. By the IH, each G2i # G1i is defined, which implies that p → pì{ι1. G21; ι2. G22} # (G11 ∪ G12) is defined, and therefore G2 # G1 is also defined.

Case Casei. ⊢ e1 ▽ e2 ⇒ e1 ▽ e2 : ιi A → B. By inversion, ⊢ ei ⇒ ei : A → B. By the IH, ⊨ ei ⇐ A ∼ G. By Casei, ⊨ e1 ▽ e2 ⇐ ιi A ∼ G.

Case Split. ⊢ e1 △ e2 ⇒ e1 △ e2 : a@I → B × C. By inversion, ⊢ e1 ⇒ e1 : a@I → B and ⊢ e2 ⇒ e2 : a@I → C. By the IH, ⊨ e1 ⇐ a@I ∼ G1 and ⊨ e2 ⇐ a@I ∼ G2. Therefore, ⊨ e1 △ e2 ⇐ a@I ∼ [G2/end]G1. □

B.2 Proof of Lemma 4.3
Lemma 4.3 (Protocol Deadlock-Freedom). For all WF(A), if ⊢ e : A → B and ⊨ e ⇐ A ∼ G, then WF(G).

Proof. By induction on the structure of the derivation ⊢ e : A → B.

Case Join. ⊢ e ⇒ e : A ∪pì A → B ∪pì B. By the Join typing rule, ⊢ e ⇒ e : A → B. By case analysis, the only possibility for the protocol is ⊨ e ⇐ A ∪pì A ∼ G ∪ G, with ⊨ e ⇐ A ∼ G. By the IH, WF(G), therefore WF(G ∪ G).

Case Alg. ⊢ e ⇒ e@p : a@I → b@p. By case analysis, ⊨ e@p ⇐ a@I ∼ [a@I ⇝ p]. By definition, WF([a@I ⇝ p]).

Case Inji. ⊨ ιi ⇐ a@I ∼ end. Trivially, WF(end).

Case Alt. ⊢ e ⇒ e : A1 ∪pì A2 → B1 ∪pì B2, with pids(e) ⊆ pì, ⊢ e ⇒ e : A1 → B1, ⊢ e ⇒ e : A2 → B2, and A1 ≠ A2. We know, by a straightforward induction on the typing rules of PAlg, that if ⊢ e : A1 ∪pì A2 → B and A1 ≠ A2, then pids(Ai) ⊆ pì. By the Alt protocol rule, ⊨ e ⇐ A1 ∪pì A2 ∼ G1 ∪pì G2, with ⊨ e ⇐ A1 ∼ G1 and ⊨ e ⇐ A2 ∼ G2. By the IH, WF(G1) and WF(G2).

⟨P,W⟩ ⇝ℓ ⟨P′,W′⟩, where P = [pi ↦ mi]i∈I and W = [pi pj ↦ w]i∈I,j∈I:

  ⟨P[p ↦ m], W⟩ ⇝ℓ ⟨P[p ↦ m′], W′⟩  implies  ⟨P[p ↦ m ≫= f], W⟩ ⇝ℓ ⟨P[p ↦ m′ ≫= f], W′⟩
  ⟨P[p ↦ ret v ≫= f], W⟩ ⇝ϵ ⟨P[p ↦ f v], W⟩
  ⟨P[p1 ↦ send p2 (v : a)], W[p1p2 ↦ w]⟩ ⇝p1p2!⟨a⟩ ⟨P[p1 ↦ ret ()], W[p1p2 ↦ v · w]⟩
  ⟨P[p1 ↦ recv p2 a], W[p2p1 ↦ w · v]⟩ ⇝p2p1?(a) ⟨P[p1 ↦ ret v], W[p2p1 ↦ w]⟩
  ⟨P[p0 ↦ sel (ιi v) {} f1 f2], W⟩ ⇝ϵ ⟨P[p0 ↦ fi v ≫= λx. ret (bri x)], W⟩
  ⟨P[p0 ↦ sel (ιi v) {p1 ··· pn} f1 f2], W[p0p1 ↦ w]⟩ ⇝p0p1⊕ιi ⟨P[p0 ↦ sel (ιi v) {p2 ··· pn} f1 f2], W[p0p1 ↦ li · w]⟩
  ⟨P[p1 ↦ brn p2 m1 m2], W[p2p1 ↦ w · li]⟩ ⇝p2p1&ιi ⟨P[p1 ↦ mi ≫= λx. ret (bri x)], W[p2p1 ↦ w]⟩
  ⟨P[p1 ↦ case f1 f2 (bri v)], W⟩ ⇝ϵ ⟨P[p1 ↦ fi v], W⟩

Figure 17. Rules for the LTS of Mp terms.

  get[](x) = x      get(I×I′)(x) = getI(π1 x)      get(I′×I)(x) = getI(π2 x)      get(ιi I)(x) = getI(x)
  put[](x, y) = x   put(I×I′)(x, y) = (putI(x, π1 y), π2 y)   put(I′×I)(x, y) = (π1 y, putI(x, π2 y))   put(ιi I)(x, y) = putI(x, y)

Figure 18. Get/Set for Types from One-Hole Contexts.

Case Id. Trivial by WF(end).

Case Choice. ⊢ e ⇒ e ◦ [p ⊕ p⃗] : a@I[p] → B1 ∪p⃗ B2, where pids(e) ⊆ p⃗ and pids(I[p]) ⊆ p⃗. By the Choice and Alt typing rules, ⊢ e ⇒ e : a@I[ιi p] → Bi. By inversion, the protocol rule must also be Choice: ⊨ [p ⊕ p⃗] ⇐ a@I[p] ∼ p → p⃗ {ιi. Gi}i∈[1,2]. By the Choice protocol rule, ⊨ e ⇐ a@I[ιi p] ∼ Gi. By the IH, WF(Gi). Since pids(Gi) ⊆ pids(e) ∪ pids(I[p]) ⊆ p⃗, for all p′ ∈ pids(Gi), (p → p⃗ {ιi. Gi}i∈[1,2]) ↾ p′ must be defined. Therefore, WF(p → p⃗ {ιi. Gi}i∈[1,2]).

Case Proji. Trivial by WF(end).

Case Comp. ⊢ e1 ◦ e2 ⇒ e1 ◦ e2 : A → C, with ⊢ e1 ⇒ e1 : B → C and ⊢ e2 ⇒ e2 : A → B. By inversion, the only possible protocol rule is also Comp. Therefore, ⊨ e1 ◦ e2 ⇐ A ∼ G2 # G1, with ⊨ e2 ⇐ A ∼ G2 and ⊨ e1 ⇐ B ∼ G1. By the IH, WF(G1) and WF(G2). Also, by induction on the derivation of ⊢, we know that for A1 ∪p⃗ A2, if A1 ≠ A2, then pids(Ai) ⊆ p⃗. This implies that if G1 is G11 ∪ G12, then either the projection of G1i onto p is the same for both i, or p ∈ p⃗. By the Choice rule, G1 must be of the form G1′[p → p⃗ {ιi. Gi}i∈[1,2]]; therefore, for all p′ ∈ pids(G1i), the projection (G2 # (G11 ∪ G12)) ↾ p′ must be defined, which implies that G2 # G1 is defined.

Case Casei. ⊢ e1 ▽ e2 ⇒ e1 ▽ e2 : a@(ιi I) → B and ⊨ e1 ▽ e2 ⇐ a@(ιi I) ∼ G. By the Casei protocol and typing rules, ⊨ ei ⇐ A ∼ G and ⊢ ei ⇒ ei : A → B. We conclude by the IH that WF(G).

Case Split. ⊢ e1 △ e2 ⇒ e1 △ e2 : A → (b × c)@(R1 × R2) and ⊨ e1 △ e2 ⇐ A ∼ [G2/end]G1. By the IH, we know that WF(G1) and WF(G2). By straightforward induction on the structure of G1, if WF(Gi), then WF([G2/end]G1). □

B.3 Proof of Theorem 5.2

Theorem 5.2. [Protocol Conformance of the Generated Code] If ⊨ e ⇐ A ∼ G, then ⟦e⟧(A) complies with protocol G.

Proof. By induction on the structure of the derivation ⊨ e ⇐ A ∼ G.

Case Alt. ⊨ e ⇐ A1 ∪p⃗ A2 ∼ G1 ∪p⃗ G2, with ⊨ e ⇐ A1 ∼ G1 and ⊨ e ⇐ A2 ∼ G2. By the IH, ⟦e⟧(Ai) : (Ai ↾ p) → Mp (Gi ↾ p) (Bi ↾ p). Moreover, we know that if p ∉ p⃗, then ⟦e⟧(A1)(p) = ⟦e⟧(A2)(p) and G1 ↾ p = G2 ↾ p. Therefore, by the definition of E1 ⊎ E2, ⟦e⟧(A1) ⊎p⃗ ⟦e⟧(A2) : Mp (G1 ∪p⃗ G2) (B1 ∪p⃗ B2).

Case Id. ⊨ id ⇐ a@I ∼ end. By the definition of ⟦·⟧, ⟦id⟧(a@I) = [] : a@I → Mp end a@I.

Case Inji. ⊨ ιi ⇐ a@I ∼ end. By definition, ⟦ιi⟧(A) = [] : a@I → Mp end (a@(ιi I)).

Case Alg. ⊨ e@pe ⇐ a@I ∼ [a@I ↝ pe], with e : a → b. We prove by straightforward induction on the structure of I that f = ⟦a@I ↝ pe⟧ : a@I → Mp [a@I ↝ pe] (a@pe): if I = p, then f = [p ↦ λx. send pe x, pe ↦ λ_. recv p a], which clearly follows [a@p ↝ pe]; if I = I1 × I2, then a must be a1 × a2, and we have by the IH that ⟦ai@Ii ↝ pe⟧ : ai@Ii → Mp [ai@Ii ↝ pe] (ai@pe); and, finally, if I = ιi I′, then a = a1 + a2, and ⟦ai@I′ ↝ pe⟧ : Mp [ai@I′ ↝ pe] (ai@pe), which, composed with [pe ↦ λx. ret (ιi x)], has type Mp [(a1 + a2)@(ιi I′) ↝ pe] ((a1 + a2)@(ιi pe)).

Case Comp. ⊨ e1 ◦ e2 ⇐ A ∼ G2 # G1. By the Comp rule, ⊨ e2 ⇐ A ∼ G2 and ⊨ e1 ⇐ B ∼ G1. By the IH, ⟦e2⟧(A) : A → Mp G2 B and ⟦e1⟧(B) : B → Mp G1 C. Since G2 # G1 is well-formed, ⟦e2⟧(A) >=> ⟦e1⟧(B) : A → Mp (G2 # G1) C, since for all p, ⟦e2⟧(A) : A ↾ p → Mp G2 (B ↾ p) and ⟦e1⟧(B) : B ↾ p → Mp G1 (C ↾ p), so ⟦e2⟧(A) >=> ⟦e1⟧(B) : A ↾ p → Mp (G2 # G1) (C ↾ p).
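Figure 18 reads directly as a recursive program, and the getI/putI pair is exactly what the Choice case below manipulates. The following Haskell transcription is our own sketch, with types simplified to a single universe of values:

    -- One-hole contexts: [] | I × I' (hole left) | I' × I (hole right) | ιi I
    data Ctx = Hole | PL Ctx | PR Ctx | InjC Ctx

    data Val = Atom Int | Pair Val Val | Inj Int Val
      deriving Show

    getC :: Ctx -> Val -> Val
    getC Hole     x          = x          -- get_[](x) = x
    getC (PL i)   (Pair x _) = getC i x   -- get_(I×I')(x) = get_I (π1 x)
    getC (PR i)   (Pair _ y) = getC i y   -- get_(I'×I)(x) = get_I (π2 x)
    getC (InjC i) (Inj _ x)  = getC i x   -- get_(ιi I)(x) = get_I (x)
    getC _        _          = error "value does not match context"

    putC :: Ctx -> Val -> Val -> Val
    putC Hole     v _          = v                    -- put_[](v, y) = v
    putC (PL i)   v (Pair x y) = Pair (putC i v x) y  -- hole on the left
    putC (PR i)   v (Pair x y) = Pair x (putC i v y)  -- hole on the right
    putC (InjC i) v (Inj k x)  = Inj k (putC i v x)
    putC _        _ _          = error "value does not match context"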

Case Choice. ⊨ [p ⊕ p⃗] ⇐ a@I[p] ∼ p → p⃗ {ιi. end}i∈[1,2]. By the definition of ⟦[p ⊕ p⃗]⟧, p ↦ λx. sel {p⃗} (getI(x)) (λy. ret (putI(y, x))) (λy. ret (putI(y, x))), which has type a@I[p] ↾ p → Mp ({p⃗} ⊕ {ιi. end}i∈[1,2]), and, for p′ ∈ p⃗, p′ ↦ λx. brn p (ret x) (ret x), which has type a@I[p] ↾ p′ → Mp (p & {ιi. end}i∈[1,2]). Therefore, ⟦[p ⊕ p⃗]⟧ : a@I[p] → Mp (p → p⃗ {ιi. end}i∈[1,2]) a@(I[ι1 p] ∪p⃗ I[ι2 p]).

Case Casei. ⊨ e1 ▽ e2 ⇐ (a1 + a2)@(ιi I) ∼ G. Then, ⊨ ei ⇐ ai@I ∼ G. By the IH, ⟦ei⟧(ai@I) : (ai@I) → Mp G B. By the definition of ⟦·⟧, ⟦e1 ▽ e2⟧((a1 + a2)@(ιi I)) : ((a1 + a2)@(ιi I)) → Mp G B.

Case Proji. ⊨ πi ⇐ A1 × A2 ∼ end. By definition, ⟦πi⟧(A1 × A2) = [_ ↦ λx. ret (πi x)] : A1 × A2 → Mp end Ai.

Case Split. ⊨ e1 △ e2 ⇐ A ∼ [G2/end]G1. Then, ⊨ e1 ⇐ A ∼ G1 and ⊨ e2 ⇐ A ∼ G2. By the IH, ⟦e1⟧(A) : A → Mp G1 B and ⟦e2⟧(A) : A → Mp G2 C. By definition, ⟦e1 △ e2⟧(A) = ⟦e1⟧(A) △ ⟦e2⟧(A) : A → Mp ([G2/end]G1) (B × C). □

C Extensionality

Each monadic term m represents the code for an individual process. The parallel composition of the set of monadic actions generated from a PAlg expression e : A → B, each applied to the corresponding value of type v : A ↾ p, represents an execution of the parallel algorithm on an input of type a, if a@R = A. Recall from Sec. 5 that the transitions are of the form ⟨P,W⟩ ⇝ℓ ⟨P′,W′⟩, where P is an environment that contains the code executed by all roles that collaborate to compute the parallel algorithm, and W represents the shared unbounded buffers used by each pair of participants to communicate. We write w for such buffers, where ∅ is the empty buffer, v · w is the buffer w extended with value v at the leftmost position, and w · v is the buffer w extended with value v at the rightmost position.

Definition C.1 (Type buffers). We write Q = [pi pj ↦ q]i∈I,j∈I, where q is a buffer of types, which can be either ∅, a · q or q · a. Note that values include singleton types that represent labels: li : li. We say that a buffer w = v1 ··· vn contains types q = a1 ··· am, written w : q, if n = m and vi : ai for all i ∈ [1, n]. We say that W : Q if, for all pairs of roles pi pj, W(pi pj) : Q(pi pj).
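The buffer discipline just described (and used by the LTS of Figure 17) is plain FIFO: a send extends a buffer on the left, a receive consumes from the right. A small executable model, our own sketch with Data.Map standing in for W:

    import qualified Data.Map as M

    type Buf v = [v]                   -- head of the list = leftmost slot
    type W p v = M.Map (p, p) (Buf v)  -- one buffer per ordered pair p1 p2

    -- send from p1 to p2: v · w, i.e. extend the buffer on the left
    sendW :: Ord p => p -> p -> v -> W p v -> W p v
    sendW p1 p2 v = M.insertWith (++) (p1, p2) [v]

    -- recv at p1 from p2: take the rightmost (oldest) value of w · v
    recvW :: Ord p => p -> p -> W p v -> Maybe (v, W p v)
    recvW p2 p1 w = case M.findWithDefault [] (p2, p1) w of
      []  -> Nothing                   -- empty buffer: the receiver blocks
      buf -> Just (last buf, M.insert (p2, p1) (init buf) w)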
5 that the transitions are of the I1×I2 I1 I2 p∈pids(I1×I2) 2346 ℓ ′ ′ δ (ι x) = δ (x) 2401 form ⟨P,W ⟩ { ⟨P ,W ⟩, where P is an environment that ιi I i I 2347 2402 contains the code executed by all roles that collaborate to 2348 We proceed by induction on the structure of the derivation 2403 compute the parallel algorithm, andW represents the shared 2349 ⊢ e ⇒ e : A → B: 2404 unbounded buffers used by each pair of participants tocom- 2350 • Case Join. We have ⊢ e ⇒ e : A ∪pì A → B ∪pì B with ⊢ 2405 municate. We write w for such buffers, where ∅ is the empty 2351 e ⇒ e : A → B. By definition, e (A∪pìA) = e (A)⊎pì 2406 buffer, v · w is the buffer w extended with value v at the J K J K 2352 e ( ) i ·iì ( ) ( iì ( )) 2407 leftmost position, and w · v is the buffer w extended with A . We have that δ pì x = bfi1 δA x . Then, 2353 J K A∪ A 2408 value v at the rightmost position. ì 2354 by the induction hypothesis, there exists j s.t. 2409 ì 2355 e (A ∪pì A)(δ i ·i (x)) 2410 Definition C.1 (Type buffers). We write Q = [pi pj → A∪pìA 2356 J K ì pì ì 2411 q]i ∈I ,j ∈I , where q is a buffer of types, that can be either ∅, ( ( ) ⊎p ( )) ( i ( )) = e A e A bri δA x 2357 J pìK J Kì 2412 a · q or q · a. Note that values include singleton types that = e ( )( i ( )) 2358 bri A δA x 2413 represent labels: li : li . We say that a buffer w = v1 ···vn pì J jìK 2359 = br (δ ( e x)) 2414 contains types q = a1 ··· am, w : q if: n = m and vi : ai for i B ·ì J K 2360 ∈ [ ] = δ i j ( e x) 2415 all i 1, n . We say that W : Q if for all pairs of roles, pi pj , B∪pì B 2361 J K 2416 W (pi pj ) : Q(pi pj ). • ⊢ e ⇒ e A ∪pì A → B ∪pì B 2362 Case Alt. We have : 1 2 1 2 2417 with ⊢ e ⇒ e : A → B , ⊢ e ⇒ e : A → B and A 2363 Theorem 5.1. E Mp CA m Mp L a 1 1 2 2 1 , 2418 [Soundness] Assume : , : ì ℓ ′ ′ e ( ∪pì )( i ·i ( )) ( e ( ) ⊎pì 2364 and W : Q. Suppose ⟨E[r 7→ m],W ⟩ { ⟨E[r 7→ m ],W ⟩. A2. Then, A1 A2 δ pì x = A1 2419 J K A1∪ A2 J K 2365 22 2420 Compiling First-Order Functions to Session-Typed Parallel Code CC ’20, February 22–23, 2020, San Diego, CA, USA

• Case Alt. We have ⊢ e ⇒ e : A1 ∪p⃗ A2 → B1 ∪p⃗ B2 with ⊢ e1 ⇒ e1 : A1 → B1, ⊢ e2 ⇒ e2 : A2 → B2, and A1 ≠ A2. Then, ⟦e⟧(A1 ∪p⃗ A2)(δ_{A1∪p⃗A2}^{i·i⃗}(x)) = (⟦e⟧(A1) ⊎p⃗ ⟦e⟧(A2))(br_i(δ_{Ai}^i⃗(x))) = br_i(⟦e⟧(Ai)(δ_{Ai}^i⃗(x))). Finally, by the induction hypothesis, there exists j⃗ s.t. br_i(⟦e⟧(Ai)(δ_{Ai}^i⃗(x))) = br_i(δ_{Bi}^j⃗(⟦e⟧ x)) = δ_{B1∪p⃗B2}^{i·j⃗}(⟦e⟧ x).

• Case Alg. We have ⊢ e ⇒ e@p : a@I → b@p, with ⊢ e : a → b. Then, ⟦e@p⟧(a@I)(δI(x)) = [pi ↦ ⟦a@I ↝ p⟧] δI(x). By straightforward induction on I, there exists a trace ⟨([pi ↦ ⟦a@I ↝ p⟧(pi)] ≫= [p ↦ λx. ret (e x)]) δI(x), W⟩ ⇝ℓ1···ℓm ⟨[pi ↦ ret vi], W′⟩, with vi = () for all i s.t. pi ≠ p, and vj = ⟦e⟧ x for pj = p. By Theorem 5.2, the only possible interleavings of actions of ⟦e@p⟧ must follow the protocol [a@I ↝ p]. Since this implies that send/receive operations must happen respecting the data dependencies, any possible trace must yield the same result.

• Case Inj. ⊢ ιi ⇒ ιi : A → ιi A is straightforward, since ⟦ιi⟧(A)(δA(x)) = δ_{ιi A}(ιi x).

• Case Id. ⊢ id ⇒ id : A → A is straightforward, since ⟦id⟧(A)(δA(x)) = δA(id x).

• Case Proj. ⊢ πi ⇒ πi : A1 × A2 → Ai is straightforward, since ⟦πi⟧(A1 × A2)(δ_{A1×A2}(x)) = δ_{Ai}(πi x).

• Case Comp. ⊢ e1 ◦ e2 ⇒ e1 ◦ e2 : A → C with ⊢ e1 ⇒ e1 : B → C and ⊢ e2 ⇒ e2 : A → B. A straightforward consequence of Theorem 5.2 is that if E1 behaves as G1 and E2 as G2, then (E1 # E2)(X) = E2(E1(X)), since the permutations of actions of E1 # E2 must respect G1 # G2. Then, by the definition of ⟦·⟧, ⟦e1 ◦ e2⟧(A)(δ_A^i⃗(x)) = ⟦e1⟧(B)(⟦e2⟧(A)(δ_A^i⃗(x))). By the induction hypothesis: ⟦e1⟧(B)(⟦e2⟧(A)(δ_A^i⃗(x))) = ⟦e1⟧(B)(δ_B^j⃗(⟦e2⟧ x)) = δ_C^k⃗(⟦e1⟧ (⟦e2⟧ x)) = δ_C^k⃗(⟦e1 ◦ e2⟧ x).

• Case Casei. We have ⊢ e1 ▽ e2 ⇒ e1 ▽ e2 : ιi A → B, with ⊢ ei ⇒ ei : A → B. Note that δ_{ιi A}(x) is only defined if x = ιi x′. Then, by definition, ⟦e1 ▽ e2⟧(ιi A)(δ_{ιi A}(ιi x′)) = ⟦ei⟧(A)(δA(x′)). By the IH, ⟦ei⟧(A)(δA(x′)) = δ_B(⟦ei⟧ x′) = δ_B(⟦e1 ▽ e2⟧ (ιi x′)) = δ_B(⟦e1 ▽ e2⟧ x).

• Case Split. We have ⊢ e1 △ e2 ⇒ e1 △ e2 : A → B × C, with ⊢ e1 ⇒ e1 : A → B and ⊢ e2 ⇒ e2 : A → C. By definition, ⟦e1 △ e2⟧(A) = ⟦e1⟧(A) △ ⟦e2⟧(A). By Theorem 5.2, we know that this behaves as G1 # G2, if p1 ∼ G1 and p2 ∼ G2. Therefore, we assume again that the interleavings of the subtraces do not affect the data dependencies. Then, (⟦e1⟧(A)(δA(x))) △ (⟦e2⟧(A)(δA(x))) = δ_B^j⃗(⟦e1⟧ x) △ δ_C^k⃗(⟦e2⟧ x) = δ_{B×C}^{j⃗·k⃗}(⟦e1 △ e2⟧ x).

• Case Choice. We have ⊢ e ⇒ [p ⊕ p⃗] : a@I[p] → a@(I[ι1 p] ∪p⃗ I[ι2 p]). We have two cases:
  1. p ↦ λx. sel (getI(x)) {p⃗} (λy. putI(y, x)) (λy. putI(y, x))
  2. ∀p′, p′ ≠ p ∧ p′ ∈ p⃗, p′ ↦ λx. brn p (ret x) (ret x)
By case analysis, if getI(x) = ιi v, then we have:
  1. p ↦ bri(putI(v, x))
  2. ∀p′, p′ ≠ p ∧ p′ ∈ p⃗, p′ ↦ ret (bri x)
This is clearly ⟦[p ⊕ p⃗]⟧(δ_{I[p]}(x)) = bri^p⃗(δ_{I[ιi p]}(x)) = δ_{I[ι1 p]∪p⃗I[ι2 p]}^i(x). □

D Generated Code

We now show the generated code for the FFT, unrolling the recursive function once.

D.1 Input Alg expression

fftTree :: forall n. SINat n -> Int
        -> Tree n (D [Complex Double])
           :=> Tree n (D [Complex Double])
fftTree SZ w
  = lift (intlit SZ &&& (lit w &&& id)
          >>> prim "baseFFT")
fftTree (SS x) w
  = withCDict (cdictTree @(D [Complex Double]) x) $
    {- Recursive FFT to EVENS and ODDS -}
    (fftTree x w
     *** fftTree x (w + 2 ^ toInteger x))
    {- Multiply right side by exponential -}
    >>> id
    *** mapTree x (lit ps2x &&& id >>> mapExp)
    >>>
    {- zipWith add (swap arguments to force butterfly
       pattern) &&& zipWith sub -}
    zipTree x True lvl w addc
    &&& zipTree x False lvl (w + 2 ^ toInteger x) subc
  where
    lvl :: Int
    lvl = fromInteger (toInteger (SS x) + 1)
    ps2x :: Int
    ps2x = 2 ^ toInteger (SS x)

fft :: SINat n -> (D [Complex Double]) :=> D [Complex Double]
fft n =
  withCDict (cdictTree @(D [Complex Double]) n) $
  tsplit n deinterleave >>> fftTree n 0 >>> tfold n (append @@ 0)

fft5 :: D [Complex Double] :=> D [Complex Double]
fft5 = withSize 5 fft

Listing 1. Fragment of FFT.hs
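The generated header in §D.3 declares fft0 through fft8, which suggests that each fftK comes from the same definition instantiated at a different size. Hypothetically, following the pattern of fft5 above:

    -- Hypothetical companions to fft5 (not shown in the paper), assuming
    -- each fftK is fft instantiated at size K:
    fft3 :: D [Complex Double] :=> D [Complex Double]
    fft3 = withSize 3 fft

    fft8 :: D [Complex Double] :=> D [Complex Double]
    fft8 = withSize 8 fft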

D.2 Main C Code and Atomic Functions

These need to be implemented by the programmer.

#include "FFT.h"
/* The header names were lost in extraction; these are the usual suspects
   for the calls below (printf, malloc, memcpy, cexp, log2, time, ...). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <complex.h>
#include <time.h>

#define REPETITIONS 50

#define BENCHMARKSEQ(s, f) { \
  time = 0; \
  time_diff = 0; \
  time_old = 0; \
  var = 0; \
  for(int i=0; i<REPETITIONS; i++){ \
  /* ... remainder of the timing macro lost in extraction ... */

vec_cplx_t zip_add(
  pair_pair_int_int_pair_vec_cplx_vec_cplx_t in)
{
  int lvl = in.fst.fst;
  int wid = in.fst.snd;
  vec_cplx_t l = in.snd.fst;
  vec_cplx_t r = in.snd.snd;
  vec_cplx_t lout = stages[lvl][wid];
  for(int i = 0; i < l.size; i++){
    lout.elems[i] = l.elems[i] + r.elems[i];
  }
  return lout;
}

vec_cplx_t zip_sub(
  pair_pair_int_int_pair_vec_cplx_vec_cplx_t in)
{
  int lvl = in.fst.fst;
  int wid = in.fst.snd;
  vec_cplx_t l = in.snd.fst;
  vec_cplx_t r = in.snd.snd;
  vec_cplx_t lout = stages[lvl][wid];
  /* tail reconstructed by analogy with zip_add; the original was lost
     at a page break */
  for(int i = 0; i < l.size; i++){
    lout.elems[i] = l.elems[i] - r.elems[i];
  }
  return lout;
}

/* The opening of show was lost at a page break; reconstructed from the
   parallel definition of showstep below. */
void show(const char * s, vec_cplx_t in) {
  printf("%s", s);
  for (int i = 0; i < in.size; i++)
    if (!cimag(in.elems[i]))
      printf("%g ", creal(in.elems[i]));
    else
      printf("(%g, %g) ", creal(in.elems[i]), cimag(in.elems[i]));
  printf("\n");
}

void showstep(int stp, const char * s, vec_cplx_t in) {
  printf("%s", s);
  for (int i = 0; i < in.size; i += stp)
    if (!cimag(in.elems[i]))
      printf("%g ", creal(in.elems[i]));
    else
      printf("(%g, %g) ", creal(in.elems[i]), cimag(in.elems[i]));
  printf("\n");
}

vec_cplx_t baseFFT(pair_int_pair_int_vec_cplx_t in)
{
  int lvl = in.fst;
  int wid = in.snd.fst;
  cplx_t *buf = stages[lvl][wid].elems;
  int n = in.snd.snd.size;

  _fft(buf, in.snd.snd.elems, n, 1);

  return stages[lvl][wid];
}

vec_cplx_t seqfft(vec_cplx_t in)
{
  pair_int_pair_int_vec_cplx_t i = {1, {0, in}};
  return baseFFT(i);
}

vec_cplx_t map_exp(pair_int_pair_int_vec_cplx_t iv){
  int i = iv.snd.fst;
  int ps2x = iv.fst;
  vec_cplx_t in = iv.snd.snd;
  int step = i * in.size;
  for(int k = 0; k < in.size; k++){
    in.elems[k] = in.elems[k] * cexp(2 * -I * PI * (k + step) / (ps2x * in.size));
  }
  return in;
}

void free_fftvec(){
  for(int i = 0; i < num_stages; i++){
    free(stages[i][0].elems);
    free(stages[i]);
  }
  free(stages);
}

pair_vec_cplx_vec_cplx_t deinterleave(pair_int_int_t iin){
  int wl = iin.fst;
  int wr = iin.snd;

  int mid = stages[0][wl].size / 2;

  stages[1][wl].size = mid;
  stages[1][wr].size = mid;
  stages[1][wr].elems = stages[1][wl].elems + mid;

  for(int i = 0; i < stages[0][wl].size; i += 2){
    stages[1][wl].elems[i/2] = stages[0][wl].elems[i];
    stages[1][wr].elems[i/2] = stages[0][wl].elems[i+1];
  }
  memcpy(stages[0][wl].elems, stages[1][wl].elems,
         stages[0][wl].size * sizeof(cplx_t));
  stages[0][wr].elems = stages[0][wl].elems + mid;
  stages[0][wr].size = mid;

  for (int i = 2; i < num_stages; i++){
    memcpy(stages[i][wl].elems, stages[1][wl].elems,
           stages[0][wl].size * sizeof(cplx_t));
    stages[i][wl].size = mid;
    stages[i][wr].elems = stages[i][wl].elems + mid;
    stages[i][wr].size = mid;
  }
  stages[0][wl].size = mid;
  return (pair_vec_cplx_vec_cplx_t) { stages[0][wl], stages[0][wr] };
}

vec_cplx_t randvec(int depth, size_t s){
  num_workers = depth <= 1 ? 1 : 1 << (depth - 1);
  num_stages = depth <= 1 ? 2 : 1 + depth;
  stages = (vec_cplx_t **)malloc(num_stages * sizeof(vec_cplx_t *));
  for (int i = 0; i < num_stages; i++){
    stages[i] = (vec_cplx_t *)malloc(num_workers * sizeof(vec_cplx_t));
    stages[i][0].elems = (cplx_t *)calloc(s, sizeof(cplx_t));
  }
  stages[0][0].size = s;

  srand(time(NULL));

  for (int i = 0; i < s; i++) {
    double rand_r = (double)rand() / (double)RAND_MAX;
    double rand_i = (double)rand() / (double)RAND_MAX;
    stages[0][0].elems[i] = rand_r + rand_i * I;
  }

  for(int j = 1; j < num_stages; j++) {
    memcpy(stages[j][0].elems, stages[0][0].elems,
           s * sizeof(vec_cplx_t));
    stages[j][0].size = s / num_workers;
  }

  for(int i = 0; i < num_stages; i++) {
    for(int j = 1; j < num_workers; j++) {
      stages[i][j] = stages[i][j-1];
    }
  }

  return stages[0][0];
}

void usage(const char *nm){
  printf("Usage: %s <size>\n", nm);  /* argument placeholder reconstructed */
  exit(-1);
}

int main(int argc, const char *argv[]) {
  setbuf(stdout, NULL);
  if (argc <= 1) {
    usage(argv[0]);
  }
  char *endptr = NULL;
  errno = 0;
  size_t size = strtoimax(argv[1], &endptr, 10);
  size = (size_t) 1 << (long)ceil(log2(size));
  size = size < 256 ? 256 : size;
  if (errno != 0) {
    printf("%s", strerror(errno));
    usage(argv[0]);
  }
  if (endptr != NULL && *endptr != 0) {
    usage(argv[0]);
  }

  vec_cplx_t in, out;
  /* allocate memory */
  in = randvec(size, size);

  /* calling generated fft5 */
  out = fft5(in);
  show("Result: ", out);

  free_fftvec();
}

D.3 Automatically Generated C Code

#ifndef __FFT__
#define __FFT__

/* Header names lost in extraction; pthread.h, stdlib.h, stdio.h and
   complex.h match the declarations below. */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <complex.h>

typedef double _Complex cplx_t;

typedef struct vec_cplx {
  cplx_t * elems; size_t size;
} vec_cplx_t;

typedef struct q_vec_cplx {
  volatile unsigned int q_size;
  int q_head;
  int q_tail;
  pthread_mutex_t q_mutex;
  pthread_cond_t q_full;
  pthread_cond_t q_empty;
  vec_cplx_t q_mem[1];
} q_vec_cplx_t;

void q_vec_cplx_put(q_vec_cplx_t *, vec_cplx_t);

vec_cplx_t q_vec_cplx_get(q_vec_cplx_t *);

typedef enum unit {
  Unit
} unit_t;

typedef struct pair_int_vec_cplx {
  int fst; vec_cplx_t snd;
} pair_int_vec_cplx_t;

typedef struct pair_int_pair_int_vec_cplx {
  int fst; pair_int_vec_cplx_t snd;
} pair_int_pair_int_vec_cplx_t;

vec_cplx_t baseFFT(pair_int_pair_int_vec_cplx_t);

vec_cplx_t fft0(vec_cplx_t);

vec_cplx_t fft1(vec_cplx_t);

typedef struct pair_int_int {
  int fst; int snd;
} pair_int_int_t;

typedef struct pair_vec_cplx_vec_cplx {
  vec_cplx_t fst; vec_cplx_t snd;
} pair_vec_cplx_vec_cplx_t;

pair_vec_cplx_vec_cplx_t deinterleave(pair_int_int_t);

vec_cplx_t cat(pair_vec_cplx_vec_cplx_t);

typedef struct pair_pair_int_int_pair_vec_cplx_vec_cplx {
  pair_int_int_t fst;
  pair_vec_cplx_vec_cplx_t snd;
} pair_pair_int_int_pair_vec_cplx_vec_cplx_t;

vec_cplx_t zip_add(pair_pair_int_int_pair_vec_cplx_vec_cplx_t);

vec_cplx_t map_exp(pair_int_pair_int_vec_cplx_t);

vec_cplx_t zip_sub(pair_pair_int_int_pair_vec_cplx_vec_cplx_t);

vec_cplx_t fft2(vec_cplx_t);

vec_cplx_t fft3(vec_cplx_t);

vec_cplx_t fft4(vec_cplx_t);

vec_cplx_t fft5(vec_cplx_t);

vec_cplx_t fft6(vec_cplx_t);

vec_cplx_t fft7(vec_cplx_t);

vec_cplx_t fft8(vec_cplx_t);

#endif

Listing 2. Generated FFT.h

#include "FFT.h"

q_vec_cplx_t ch0 = { 0, 0, 0, { } };
q_vec_cplx_t ch1 = { 0, 0, 0, { } };  /* declaration missing from the
                                         extracted text; ch1 is used below */
q_vec_cplx_t ch2 = { 0, 0, 0, { } };
q_vec_cplx_t ch3 = { 0, 0, 0, { } };

vec_cplx_t fft2_part_0(vec_cplx_t v_s)
{
  pair_int_int_t v_t;
  v_t.fst = 0;
  v_t.snd = 1;
  pair_vec_cplx_vec_cplx_t v_u;
  v_u = deinterleave(v_t);
  vec_cplx_t v_v;
  v_v = v_u.fst;
  q_vec_cplx_put(&ch0, v_v);
  vec_cplx_t v_w;
  v_w = v_u.snd;
  q_vec_cplx_put(&ch2, v_w);
  vec_cplx_t v_x;
  v_x = q_vec_cplx_get(&ch1);
  vec_cplx_t v_y;
  v_y = q_vec_cplx_get(&ch3);
  pair_vec_cplx_vec_cplx_t v_z;
  v_z.fst = v_x;
  v_z.snd = v_y;
  vec_cplx_t v_aa;
  v_aa = cat(v_z);
  return v_aa;
}

q_vec_cplx_t ch4 = { 0, 0, 0, { } };

q_vec_cplx_t ch5 = { 0, 0, 0, { } };

unit_t fft2_part_1()
{
  vec_cplx_t v_ba;
  v_ba = q_vec_cplx_get(&ch0);
  pair_int_pair_int_vec_cplx_t v_ca;
  v_ca.fst = 1;
  pair_int_vec_cplx_t v_da;
  v_da.fst = 0;
  v_da.snd = v_ba;
  v_ca.snd = v_da;
  vec_cplx_t v_ea;
  v_ea = baseFFT(v_ca);
  q_vec_cplx_put(&ch4, v_ea);
  vec_cplx_t v_fa;
  v_fa = q_vec_cplx_get(&ch5);
  pair_pair_int_int_pair_vec_cplx_vec_cplx_t v_ga;
  pair_int_int_t v_ha;
  v_ha.fst = 2;
  v_ha.snd = 0;
  v_ga.fst = v_ha;
  pair_vec_cplx_vec_cplx_t v_ia;
  v_ia.fst = v_ea;
  v_ia.snd = v_fa;
  v_ga.snd = v_ia;

  vec_cplx_t v_ja;
  v_ja = zip_add(v_ga);
  q_vec_cplx_put(&ch1, v_ja);
  return Unit;
}

unit_t fft2_part_2()
{
  vec_cplx_t v_ka;
  v_ka = q_vec_cplx_get(&ch2);
  pair_int_pair_int_vec_cplx_t v_la;
  v_la.fst = 1;
  pair_int_vec_cplx_t v_ma;
  v_ma.fst = 1;
  v_ma.snd = v_ka;
  v_la.snd = v_ma;
  vec_cplx_t v_na;
  v_na = baseFFT(v_la);
  pair_int_pair_int_vec_cplx_t v_oa;
  v_oa.fst = 2;
  pair_int_vec_cplx_t v_pa;
  v_pa.fst = 0;
  v_pa.snd = v_na;
  v_oa.snd = v_pa;
  vec_cplx_t v_qa;
  v_qa = map_exp(v_oa);
  q_vec_cplx_put(&ch5, v_qa);
  vec_cplx_t v_ra;
  v_ra = q_vec_cplx_get(&ch4);
  pair_pair_int_int_pair_vec_cplx_vec_cplx_t v_sa;
  pair_int_int_t v_ta;
  v_ta.fst = 2;
  v_ta.snd = 1;
  v_sa.fst = v_ta;
  pair_vec_cplx_vec_cplx_t v_ua;
  v_ua.fst = v_ra;
  v_ua.snd = v_qa;
  v_sa.snd = v_ua;
  vec_cplx_t v_va;
  v_va = zip_sub(v_sa);
  q_vec_cplx_put(&ch3, v_va);
  return Unit;
}

void * fun_thread_1_1(void * arg)
{
  fft2_part_1();
  return NULL;
}

void * fun_thread_2(void * arg)
{
  fft2_part_2();
  return NULL;
}

vec_cplx_t fft2(vec_cplx_t v_wa)
{
  vec_cplx_t v_xa;
  pthread_t thread1;
  pthread_t thread2;
  pthread_create(&thread1, NULL, fun_thread_1_1, NULL);
  pthread_create(&thread2, NULL, fun_thread_2, NULL);
  v_xa = fft2_part_0(v_wa);
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);
  return v_xa;
}

Listing 3. Fragment of generated FFT.c

E Artifact Appendix

E.1 Abstract

This artifact provides a prototype implementation of PAlg, embedded in Haskell, along with a number of benchmarks used to test the scalability of our approach. We provide scripts to regenerate the execution time measurements that we used in our paper. This allows our results to be evaluated on any multi-core shared-memory architecture.

We also provide a small tutorial that is meant to guide a programmer, step by step, through the implementation of a message-passing parallel algorithm using our library. The tutorial includes a guide on how to visualise the global types that correspond to the achieved parallelisations, as well as any asynchronous optimisations applicable to the generated message-passing code.

E.2 Artifact check-list (meta-information)

• Algorithm: Message-passing C code generation from first-order Haskell functions. Global type inference of the communication protocol followed by the parallelisation.
• Program: Haskell libraries Language.CAlg, Language.CAlg.CSyn and dependencies, as well as session-arrc, to compile to C the Haskell functions built using such libraries.
• Compilation: GHC >= 8.6 && < 8.8, and a C compiler that supports C11.
• Transformations: Compilation to C, and an asynchronous optimisation pass.
• Binary: Source code and scripts included to generate the binaries from the sources.
• Data set: Original run-time measurements included for comparison.
• Hardware: We used a 12-core Intel Xeon CPU E5-2650 v4 @ 2.20GHz. We recommend a shared-memory architecture, with uniform access times, to measure the overheads of our approach, not message latencies.

• Execution: We include a script to run the benchmarks.
• Output: Benchmark execution times.
• Experiments: Small, representative benchmarks of common parallel algorithms.
• How much memory required (approximately)?: 64GB for the maximum benchmark input size.
• How much time is needed to complete experiments (approximately)?: 5 days on the hardware stated in §E.3.2.
• Publicly available?: Yes.
• Code licenses (if publicly available)?: BSD-3.

E.3 Description

E.3.1 How delivered

We provide a docker image with the necessary dependencies: https://imperialcollegelondon.box.com/v/cc20-artifact-p43. After downloading, the image can be loaded using:

$ sudo docker load -i cc20-artifact-p43.docker

To run the image, run the command:

$ sudo docker run -ti cc20-artifact-p43

File README.md inside the docker image contains additional instructions. Our benchmarks, source code and scripts are also publicly available on Github, at https://github.com/session-arr/session-arr.

E.3.2 Hardware dependencies

We used a 12-core Intel Xeon CPU E5-2650 v4 @ 2.20GHz. We recommend using a shared-memory architecture, with uniform access times, to measure the overheads of our approach, not message latencies.

E.3.3 Software dependencies

All our dependencies are listed in the Dockerfile in our public repository. We list them below. To compile our tool:
1. GHC >= 8.6 (not tested with GHC >= 8.8)
2. stack Version 1.9.1
To run our experiments:
1. C compiler that supports C11 (tested with GCC >= 4.8 && < 8.3)
2. glibc (tested with versions >= 2.17 && < 2.29)
3. numactl
To generate the graphs:
1. python (== 2.7)
2. python-matplotlib (== 2)
3. python-pint (== 0.7)

E.3.4 Data sets

We include as part of the artifact the raw data that we obtained for our benchmarks. These are included under benchmarks/<benchmark>/data/t_<cores>, where <cores> is either 12 or 24. There is additionally a file t_48, which uses all 24 cores plus hyperthreading. The structure of the files is:

size: <size>
K: seq
mean: <mean>
stddev: <stddev>
K: 1
mean: ...
stddev: ...
...

Keyword size denotes the size of the inputs for the particular benchmark. Keyword mean is the average execution time. Keyword stddev is the standard deviation. We write K: <n> to denote the number of recursion unfoldings used to produce the parallel version.

Examples of global types for each benchmark are under benchmarks/<benchmark>/protocol/<function>_<K>.mpst, where <function> is the function name in <benchmark>.hs that corresponds to this protocol.

E.4 Installation

Note: this section can be omitted if using our docker image. We recommend using Stack (https://docs.haskellstack.org/en/stable/README/#how-to-install). To build our tool:

$ git clone \
    https://github.com/session-arr/session-arr
$ cd session-arr
$ stack build

There is no need to install the tool. However, to install it, run:

$ stack install

This will copy the binary session-arrc to a local directory, usually ${HOME}/.local/bin.

Manual compilation and installation using GHC is also possible, but we discourage it. Read session-arr/package.yaml to find out which Haskell packages are required.

E.5 Experiment workflow

E.5.1 Automatic

We include the script session-arr/benchmark.sh to compile and run all the benchmarks used in the paper. To customise the number of cores, the number of repetitions per experiment and the maximum input size, run:

$ CORES=<c> REPETITIONS=<r> \
  MAXSIZE=<m> ./benchmark.sh

The defaults are:
1. CORES: number of physical cores on your machine
2. REPETITIONS: 50
3. MAXSIZE: 30

The script requires that MAXSIZE ≥ 15.

Note: using MAXSIZE=30 requires a machine with a large amount of memory. We ran our experiments on a machine with 64GB of memory.
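For post-processing outside plot.py, the t_<cores> layout described in §E.3.4 is easy to parse. A throwaway Haskell sketch (ours, not part of the artifact) that groups the records by input size:

    import Data.List (stripPrefix, isPrefixOf)

    -- Parse a t_<cores> file into (size, [(K, mean, stddev)]) records.
    parseFile :: String -> [(Integer, [(String, Double, Double)])]
    parseFile = go . lines
      where
        go [] = []
        go (l : ls) = case stripPrefix "size: " l of
          Just s ->
            let (body, rest) = break ("size: " `isPrefixOf`) ls
            in (read s, triples body) : go rest
          Nothing -> go ls
        -- each run is three consecutive lines: K, mean, stddev
        triples (k : m : d : xs)
          | Just kv <- stripPrefix "K: " k
          , Just mv <- stripPrefix "mean: " m
          , Just dv <- stripPrefix "stddev: " d
          = (kv, read mv, read dv) : triples xs
        triples _ = []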

E.5.2 Manual

Clone and build the repository:

$ git clone \
    https://github.com/session-arr/session-arr
$ cd session-arr
$ stack build

Navigate to one of the benchmarks:

$ cd examples/FFT

Here, there should be two files: FFT.hs and main.c.

$ ls
FFT.hs main.c run.sh

To run our tool, run session-arrc using stack, with the .hs file as input:

$ stack exec session-arrc -- FFT.hs

The tool should output the list of functions found in module FFT.hs that are going to be compiled to C, and produce two files, FFT.c and FFT.h. The interface file contains the type definitions and function signatures of the functions in FFT.c. Finally, compile main.c:

$ gcc FFT.c main.c -o bench -lpthread -lm

To configure the number of repetitions, recompile the benchmark as follows:

$ gcc FFT.c main.c -DREPETITIONS=<r> \
    -o bench -lpthread -lm

You may use run.sh to run the benchmark on a range of inputs. The usage is:

$ ./run.sh <num_sizes> <max_size>

For example, ./run.sh 2 10 will run the benchmark with sizes 2^9 and 2^10. The maximum size must be > 9. To generate the graphs, you need measurements for at least 7 different sizes, i.e. the maximum size must be > 14.

Running each benchmark manually. Pass a valid input size to bench; the output looks as follows (run on a 4-core machine):

$ ./bench $((2**17))
K: seq
mean: 0.039446
stddev: 0.000713
...
K: 4
mean: 0.011952
stddev: 0.000636
...

Save all execution times to files with the format described in §E.3.4, as follows:

$ mkdir data
$ echo "size: <size1>" >> data/t_<cores>
$ ./bench <size1> >> data/t_<cores>
$ ...
$ echo "size: <size2>" >> data/t_<cores>
$ ./bench <size2> >> data/t_<cores>

Ensure that there are measurements for at least N > 14 sizes.

Plotting the speedups. Navigate to examples/. The speedups can be plotted using the scripts plotall.sh and plot.py; these will regenerate the graphs used in our paper. The usage is ./plotall.sh BENCHMARK_DIR CORES, where CORES is the number of cores used for the experimental workflow. For example:

$ ./plotall.sh FFT 4

This will generate the graphs for FFT run on 4 cores under examples/plots.

E.6 Evaluation and expected result

If you followed the experiment workflow, you should find under examples/plots a series of graphs with the speedups for each benchmark. To visualise them, we recommend copying them to a local directory, by running docker cp from outside the docker container:

$ docker cp \
    <container>:/home/cc20-artifact/session-arr/examples/plots \
    <dest>

Here, <container> is the container name obtained via docker ps -a, and <dest> is the destination path.

Outcome. When run on hardware similar to the one that we describe in the paper, following our workflow, speedups and scalability comparable to the ones that we reported in the paper should be observed.

Note: for more reliable results, execution should be done outside the docker container. Use the container to generate all C code, then copy it, along with the necessary scripts (run.sh), by running docker cp from outside the container, and proceed locally. If you decide to run the experiments locally, please check §E.3.3 and ensure that you have all required software.

E.7 Experiment customization

Several aspects can be customised in the benchmark source code, execution scripts and compilation options:

• Annotating functions differently in the .hs files should produce different parallelisations.
• Files main.c can be compiled with different numbers of repetitions using -DREPETITIONS=<r>.
• The script benchmark.sh can be run with a given number of repetitions, to reduce execution times. The maximum input size for the benchmarks and the number of cores can also be customised:

$ cd session-arr
$ REPETITIONS=<r> CORES=<c> \
  MAXSIZE=<m> ./benchmark.sh

E.8 Methodology

Submission, reviewing and badging methodology:
• http://cTuning.org/ae/submission-20190109.html
• http://cTuning.org/ae/reviewing-20190109.html
• https://www.acm.org/publications/policies/artifact-review-badging