Retrofitting Effect Handlers onto OCaml

KC Sivaramakrishnan, IIT Madras, Chennai, India
Stephen Dolan, OCaml Labs, Cambridge, UK
Leo White, Jane Street, London, UK
Sadiq Jaffer, Opsian and OCaml Labs, Cambridge, UK
Tom Kelly, OCaml Labs, Cambridge, UK
Anil Madhavapeddy, University of Cambridge and OCaml Labs, Cambridge, UK

Abstract

Effect handlers have been gathering momentum as a mechanism for modular programming with user-defined effects. Effect handlers allow for non-local control flow mechanisms such as generators, async/await, lightweight threads and coroutines to be composably expressed. We present a design and evaluate a full-fledged efficient implementation of effect handlers for OCaml, an industrial-strength multi-paradigm programming language. Our implementation strives to maintain the backwards compatibility and performance profile of existing OCaml code. Retrofitting effect handlers onto OCaml is challenging since OCaml does not currently have any non-local control flow mechanisms other than exceptions. Our implementation of effect handlers for OCaml: (i) imposes a mean 1% overhead on a comprehensive macro benchmark suite that does not use effect handlers; (ii) remains compatible with program analysis tools that inspect the stack; and (iii) is efficient for new code that makes use of effect handlers.

CCS Concepts: • Software and its engineering → Runtime environments; Concurrent programming structures; Control structures; Parallel programming languages; Concurrent programming languages.

Keywords: Effect handlers, Backwards compatibility, Fibers, Continuations, Backtraces

ACM Reference Format:
KC Sivaramakrishnan, Stephen Dolan, Leo White, Sadiq Jaffer, Tom Kelly, and Anil Madhavapeddy. 2021. Retrofitting Effect Handlers onto OCaml. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI '21), June 20–25, 2021, Virtual, UK. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3453483.3454039

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2021 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-8391-2/21/06.

1 Introduction

Effect handlers [46] provide a modular foundation for user-defined effects. The key idea is to separate the definition of the effectful operations from their interpretations, which are given by handlers of the effects. For example,

    effect In_line : in_channel -> string

declares an effect In_line, which is parameterised with an input channel of type in_channel, which when performed returns a string value. A computation can perform the In_line effect without knowing how the In_line effect is implemented. This computation may be enclosed by different handlers that handle In_line differently. For example, In_line may be implemented by performing a blocking read on the input channel or performing the read asynchronously by offloading it to an event loop such as libuv, without changing the computation.

Thanks to the separation of effectful operations from their implementation, effect handlers enable new approaches to modular programming. Effect handlers are a generalisation of exception handlers, where, in addition to the effect being handled, the handler is provided with the delimited continuation [15] of the perform site. This continuation may be used to resume the suspended computation later. This enables non-local control-flow mechanisms such as resumable exceptions, lightweight threads, coroutines, generators and asynchronous I/O to be composably expressed.

One of the primary motivations to extend OCaml with effect handlers is to natively support asynchronous I/O in order to express highly scalable concurrent applications such as web servers in direct style (as opposed to using callbacks). Many programming languages, including OCaml, require non-local changes to source code in order to support asynchronous I/O, often leading to a dichotomy between synchronous and asynchronous code [11]. For asynchronous I/O, OCaml developers typically use libraries such as Lwt [54] and Async [41, §18], where asynchronous functions are represented as monadic computations. In these libraries, while asynchronous functions can call synchronous functions directly, the converse is not true. In particular, any function that calls an asynchronous function will also have to be marked as asynchronous. As a result, large parts of the applications using these libraries end up being in monadic form.
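To make the separation between an effect and its interpretation concrete, here is a minimal sketch of the In_line example. It is written against the Effect module of the OCaml 5 standard library, which is an assumption on our part: the paper's `effect In_line : ...` declaration is Multicore OCaml syntax. The names greeting, with_blocking_io and with_canned_input are hypothetical and exist only for this illustration.

```ocaml
open Effect
open Effect.Deep

(* OCaml 5 spelling of the paper's [effect In_line : in_channel -> string]. *)
type _ Effect.t += In_line : in_channel -> string Effect.t

(* The computation performs In_line without knowing how it is implemented. *)
let greeting ic = "hello, " ^ perform (In_line ic)

(* Interpretation 1: a blocking read on the channel. *)
let with_blocking_io f =
  try_with f ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | In_line ic ->
          Some (fun (k : (a, _) continuation) -> continue k (input_line ic))
        | _ -> None) }

(* Interpretation 2 (hypothetical): answer every read with a fixed string,
   ignoring the channel. Useful for testing the same computation. *)
let with_canned_input line f =
  try_with f ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | In_line _ ->
          Some (fun (k : (a, _) continuation) -> continue k line)
        | _ -> None) }

(* The same computation runs unchanged under the second handler. *)
let () =
  print_endline (with_canned_input "world" (fun () -> greeting stdin))
```

Only the handler changes between the two interpretations; greeting itself stays in direct style.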

Languages such as GHC Haskell and Go provide lightweight threads, which avoids the dichotomy between synchronous and asynchronous code. However, these languages bake-in the lightweight thread implementation into the runtime system. With effect handlers, asynchronous I/O can be implemented directly in OCaml as a library without imposing a monadic form on the users.

There are many research languages and libraries built around effect handlers [4, 7, 8, 12, 27, 35]. Unlike these efforts, our goal is to retrofit effect handlers onto the OCaml programming language, which has been in continuous use for the past 25 years in large codebases including verification tools [5, 13], mission critical software systems [40] and latency sensitive networked applications [39]. OCaml is particularly favoured for its competitive yet predictable performance, with a fast foreign-function interface (FFI). It has excellent compatibility with program analysis tools such as debuggers and profilers that utilise DWARF stack unwind tables [19] to obtain a backtrace.

OCaml currently does not support any non-local control flow mechanisms other than exceptions. This makes it particularly challenging to implement the delimited continuations necessary for effect handlers without sacrificing the desirable properties of OCaml. A standard way of implementing continuations is to use continuation-passing style (CPS) in the compiler's intermediate representation (IR) [35]. OCaml does not use a CPS IR, and changing the compiler to utilise a CPS IR would be an enormous undertaking that would affect the performance profile of existing OCaml applications due to the increased memory allocations as the continuation closures get allocated on the heap [21]. Moreover, with CPS, an explicit stack is absent, and hence, we would lose compatibility with tools that inspect the program stack. Hence, we choose not to use CPS translation and represent the continuations as call stacks.

The search for an expressive effect system that guarantees that all the effects performed in the program are handled (effect safety) in the presence of advanced features such as polymorphism, modularity and generativity is an active area of research [6, 7, 27, 35]. We do not focus on this question in this paper, and our implementation of effect handlers in OCaml does not guarantee effect safety. We leave the question of effect safety for future work.

1.1 Requirements

We motivate our effect handler design based on the following ideal requirements:

R1 Backwards compatibility. Existing OCaml programs do not break under OCaml extended with effect handlers. OCaml code that does not use effect handlers will pay minimal performance and memory cost.
R2 Tool compatibility. OCaml programs with effect handlers produce well-formed backtraces and remain compatible with program analysis tools such as debuggers and profilers that inspect the stack using DWARF unwind tables.
R3 Effect handler efficiency. The program must accommodate millions of continuations at the same time to support highly-concurrent applications. Installing effect handlers, capturing and resuming continuations must be fast.
R4 Forwards compatibility. As a cornerstone of modularity, we also want blocking I/O code to transparently be made asynchronous with the help of effect handlers.

The need to host millions of continuations at the same time rules out the use of a large contiguous stack space as in C for continuations. Instead, we resort to using small initial stacks and growing the stacks on demand. As a result, OCaml functions, irrespective of whether they use effect handlers, need to perform stack overflow checks, and external C functions (which do not have stack overflow checks) must be performed on a separate system stack. Additionally, we must generate DWARF stack unwind tables for stacks that may be non-contiguous. In this work, we develop the compiler and runtime support required for implementing efficient effect handlers for OCaml that satisfy these requirements.

Our work is also timely. The WebAssembly [26] community group is considering effect handlers as one of the mechanisms for supporting concurrency, asynchronous I/O and generators [55]. Project Loom [38] is an OpenJDK project that adds virtual threads and delimited continuations to Java. The Swift roadmap [53] includes direct style asynchronous programming and structured concurrency as milestones. We believe that our design choices will inform similar choices to be made in other industrial-strength languages.

1.2 Contributions

Our contributions are to present:

• the design and implementation of effect handlers for OCaml. Our design retains OCaml's compatibility with program analysis tools that inspect the stack using DWARF unwind tables. We have validated our DWARF unwind tables with the assistance of an automated validator tool [3].
• a formal operational semantics for the effect handler implementation in OCaml. Our formalism explicitly models the interactions with the C stack, which is generally overlooked by other formal models, but which the implementations must handle.
• extensive evaluation which shows that our implementation has minimal impact on code that does not use effect handlers, and serves as an efficient foundation for scalable concurrent programming.

We have implemented effect handlers in a multicore extension of the OCaml programming language which we call Multicore OCaml to distinguish it from stock OCaml. Multicore OCaml delineates concurrency (overlapped execution of tasks) from parallelism (simultaneous execution of tasks) with distinct mechanisms for expressing them. Sivaramakrishnan et al. [50] describe the parallelism support in Multicore OCaml enabled by domains. The focus of this paper is the concurrency support enabled by effect handlers.

The remainder of the paper continues with a description of the stock OCaml program stack (§2). We then describe effect handlers in Multicore OCaml focussing on the challenges in retrofitting them into a mainstream systems language (§3), followed by the static and dynamic semantics for Multicore OCaml effect handlers (§4). We then discuss the compiler and the runtime system support for implementing effect handlers (§5), and present an extensive performance evaluation of effect handlers (§6) against our design goals (§1.1). Finally, we discuss the related work (§7) and conclude (§8).

2 Background: OCaml Stacks

The main challenge in implementing effect handlers in Multicore OCaml is managing the program stack and preserving its desirable properties. In this section, we provide an overview of the program stack and related mechanisms in stock OCaml.

Consider the layout of the stock OCaml stack for the program shown in Figures 1a and 1b. The OCaml main function omain installs two exception handlers h1 and h2 to handle the exceptions E1 and E2. omain calls the external C function ocaml_to_c, which in turn calls back into the OCaml function c_to_ocaml, which raises the exception E1. OCaml supports raising exceptions in C as well as throwing exceptions across external calls. Hence, the exception E1 gets caught in the handler h1, and omain returns 42. The layout of the stack in the native code backend just before raising the exception in c_to_ocaml is illustrated in Figure 1c. Note that the stack grows downwards.

OCaml uses the same program stack as C, and hence the stack has alternating sequences of C and OCaml frames. However, unlike C, OCaml does not create pointers into OCaml frames. OCaml uses the hardware support for call and return instructions for function calls and returns. OCaml does not perform explicit stack overflow checks in code, and, just like C, relies on the guard page at the end of the stack region to detect stack overflow. Stack overflow is detected by a memory fault and a Stack_overflow exception is raised to unwind the stack.

2.1 External calls and callbacks

OCaml does not use the C calling convention. In particular, there are no callee-saved registers in OCaml. In the x86-64 backend, the OCaml runtime makes use of two C callee-saved registers for supporting OCaml execution. The register r15 holds the allocation pointer into the minor heap used for bump pointer allocation, and r14 holds a reference to the Caml_state, a table of global variables used by the runtime. This makes external calls extremely fast in OCaml. If the external function does not allocate in the OCaml heap, then it can be called directly and no bookkeeping is necessary. For external functions which allocate in the OCaml heap, the cached allocation pointer is saved to Caml_state before the external call and it is restored on return. Similarly, callbacks into OCaml from C are also cheap: these involve loading the arguments in the right registers and calling the OCaml function. OCaml callbacks are relatively common as the garbage collector (GC), which is implemented in C, executes OCaml finalisation functions as callbacks.

2.2 Exception handlers

The lack of callee-saved registers also makes exception handling fast. In the absence of callee-saved registers, no registers need to be saved when entering a try block. Similarly, no registers need to be restored when handling an exception. Installing an exception handler simply pushes the program counter (pc) of the handler and the current exception pointer (exn_ptr, a field in Caml_state) onto the stack. After this, the current exception pointer is updated to be the current stack pointer (rsp). This creates a linked-list of exception handler frames on the stack as shown in Figure 1c. Raising an exception simply sets rsp to exn_ptr, loads the saved exn_ptr, and jumps to the pc of the handler.

In order to forward exceptions across C frames, the C stub function caml_call_ocaml pushes an exception handler frame that either forwards the exception to the innermost OCaml exception handler (raise_exn_c in Figure 1c) or prints a fatal error (fatal_uncaught) if there are no enclosing handlers. Exceptions are so cheap in OCaml that it is common to use them for local control flow.

2.3 Stack unwinding

OCaml generates stack maps in order to accurately identify roots on the stack for assisting the GC. For every call point in the program, the OCaml compiler emits the size of the frame and the set of all live registers in the frame that point to the heap. During a GC, the OCaml stack is walked and the roots are marked, skipping over the C frames.

OCaml also generates precise DWARF unwind information for OCaml, thanks to which debuggers such as gdb and lldb, and profilers such as perf work out-of-the-box. For example, for the program in Figures 1a and 1b, one could set a break point in gdb at caml_raise_exn to get the backtrace in Figure 1d which corresponds to the stack in Figure 1c.

The same backtrace can also be obtained by using frame pointers instead of DWARF unwind tables. OCaml allows compiling code with frame pointers, but they are not enabled by default. The OCaml stack tends to be deep with small frames due to the pervasive use of recursive functions, not all of which are tail-recursive. Hence, the addition of frame pointers can significantly increase the size of the stack (see https://github.com/ocaml/ocaml/issues/5721#issuecomment-472965549). Moreover, not using frame pointers saves two instructions

(a) meander.ml

     1 external ocaml_to_c
     2   : unit -> int = "ocaml_to_c"
     3 exception E1
     4 exception E2
     5 let c_to_ocaml () = raise E1
     6 let _ = Callback.register
     7   "c_to_ocaml" c_to_ocaml
     8 let omain () =
     9   try (* h1 *)
    10     try (* h2 *) ocaml_to_c ()
    11     with E2 -> 0
    12   with E1 -> 42;;
    13 let _ = assert (omain () = 42)

(b) meander.c

     1 ...
     2 #include <caml/mlvalues.h>
     3 #include <caml/callback.h>
     4
     5 value ocaml_to_c (value unit) {
     6   caml_callback(*caml_named_value
     7     ("c_to_ocaml"), Val_unit);
     8   return Val_int(0);
     9 }

(c) Stack layout before raise E1 (the stack grows downwards; each handler frame holds the pc of the handler and the saved exception pointer).

    Main C:         main, ..., caml_call_ocaml
                    handler frame: pc(fatal_uncaught), NULL
    Main OCaml:     ..., omain
                    handler frame: pc(h1), sp(fatal_uncaught)
                    handler frame: pc(h2), sp(h1)
    External call:  ocaml_to_c, caml_callback, ..., caml_call_ocaml
                    handler frame: pc(raise_exn_c), sp(h2)    <- exn_ptr
    Callback:       c_to_ocaml, caml_raise_exn                <- sp

(d) gdb backtrace before raise E1.

    #0  0x925dc in caml_raise_exn ()
    #1  0x6fd3e in camlMeander__c_to_ocaml_83 () at meander.ml:5
    #2  0x925a4 in caml_call_ocaml ()
    #3  0x8a84a in caml_callback_exn (...) at callback.c:145
    #4  caml_callback (...) at callback.c:199
    #5  0x76e0a in ocaml_to_c (unit=1) at meander.c:5
    #6  0x6fd77 in camlMeander__omain_88 () at meander.ml:10
    #7  0x6fe92 in camlMeander__entry () at meander.ml:13
    #8  0x6f719 in caml_program ()
    #9  0x925a4 in caml_call_ocaml ()
    #10 0x92e4c in caml_startup_common (...) at startup_nat.c:162
    #11 0x92eab in caml_startup_exn (...) at startup_nat.c:167
    #12 caml_startup (...) at startup_nat.c:172
    #13 0x6f55c in main (...) at main.c:44

Figure 1. Program stack on stock OCaml.

in the function prologue and epilogue, and makes an extra register (rbp on x86_64) available. Note that the DWARF unwind information is complementary to the information used by OCaml to walk the stack for GC.

3 Effect Handlers

In this section, we describe the effect handlers in Multicore OCaml, and refine the design to retrofit them onto OCaml.

3.1 Asynchronous I/O

Since our primary motivation is to enable composable asynchronous I/O, let us implement a cooperative lightweight thread library with support for forking new threads and yielding control to other threads. We will then extend this library with support for synchronously reading from channels and subsequently make it asynchronous without changing the client code for asynchrony. In order to support forking and yielding threads, we declare the following effects:

    effect Fork : (unit -> unit) -> unit
    effect Yield : unit

The Fork effect takes a thunk which is spawned as a concurrent thread, and the Yield effect yields control to another thread in the scheduler queue. We can define helper functions to perform these effects:

    let fork f = perform (Fork f)
    let yield () = perform Yield

The implementation of the scheduler queue is defined in the run function, which handles the effects appropriately:

     1 let run main =
     2   let runq = Queue.create () in
     3   let suspend k = Queue.push k runq in
     4   let rec run_next () =
     5     match Queue.pop runq with
     6     | k -> continue k ()
     7     | exception Queue.Empty -> ()
     8   in
     9   let rec spawn f =
    10     match f () with
    11     | () -> run_next () (* value case *)
    12     | effect Yield k -> suspend k; run_next ()
    13     | effect (Fork f') k -> suspend k; spawn f'
    14   in
    15   spawn main

The function spawn (line 9) evaluates the computation f in an effect handler. The computation f may return normally with a value, or perform effects Fork f' and Yield. The pattern effect Yield k handles the effect Yield and binds k to the continuation of the corresponding perform delimited by this handler. The scheduler queue runq maintains a queue of these continuations. suspend pushes continuations into the queue, run_next pops continuations from the queue and resumes them with () value using the continue primitive. In the case of the Yield effect, we suspend the current continuation k and resume the next available continuation. In the case of the Fork f' effect, we suspend the current continuation and recursively call spawn on f' in order to run f' concurrently. Observe that we can change the scheduling algorithm from FIFO to LIFO by changing the scheduler queue to a stack.
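For readers who want to run the cooperative scheduler described above, the following is a minimal sketch of it ported to the Effect module of the OCaml 5 standard library. This is an assumption on our part: the listing in the paper uses Multicore OCaml's `effect` declarations and `match ... with effect` patterns, whereas this version uses Effect.Deep.match_with with explicit retc, exnc and effc fields. The queue here stores resumption thunks rather than raw continuations, purely to keep the types simple.

```ocaml
open Effect
open Effect.Deep

type _ Effect.t +=
  | Fork : (unit -> unit) -> unit Effect.t
  | Yield : unit Effect.t

let fork f = perform (Fork f)
let yield () = perform Yield

let run main =
  (* FIFO queue of suspended threads, each stored as a resumption thunk. *)
  let runq : (unit -> unit) Queue.t = Queue.create () in
  let suspend k = Queue.push (fun () -> continue k ()) runq in
  let run_next () =
    match Queue.take_opt runq with
    | Some resume -> resume ()
    | None -> ()
  in
  let rec spawn f =
    match_with f ()
      { retc = (fun () -> run_next ());   (* value case *)
        exnc = raise;                     (* let exceptions propagate *)
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Yield ->
            Some (fun (k : (a, unit) continuation) ->
              suspend k; run_next ())
          | Fork f' ->
            Some (fun (k : (a, unit) continuation) ->
              suspend k; spawn f')
          | _ -> None) }
  in
  spawn main

(* Usage: a main thread forks one child; both yield once. *)
let () =
  run (fun () ->
    fork (fun () -> print_string "a1 "; yield (); print_string "a2 ");
    print_string "m1 "; yield (); print_string "m2 ");
  print_newline ()
```

As in the paper's version, forking runs the child immediately and parks the parent, so the forked thread makes progress first; replacing the Queue with a Stack would turn the FIFO policy into LIFO.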

We can implement support for synchronous read from channels by adding the following case to the effect handler in spawn:

    let rec spawn f =
      match f () with
      ...
      | effect (In_line ic) k -> continue k (input_line ic)

This uses OCaml's standard input_line function to read a line synchronously from the channel ic and resume the continuation k with the resultant string. However, performing reads synchronously blocks the entire scheduler, preventing other threads from running until the I/O is completed.

We can make the I/O asynchronous by modifying the run function as follows:

     1 let run main =
     2   let runq = Queue.create () in
     3   let suspend k = Queue.push (continue k) runq in
     4   let pending_reads = ref [] in
     5   let rec run_next () =
     6     match Queue.pop runq with
     7     | f -> f ()
     8     | exception Queue.Empty ->
     9       match !pending_reads with
    10       | [] -> () (* no pending reads *)
    11       | todo ->
    12         let compl,todo = do_reads todo in
    13         List.iter (fun (str,k) ->
    14           Queue.push (fun () -> continue k str) runq) compl;
    15         pending_reads := todo;
    16         run_next ()
    17   in
    18   let rec spawn f =
    19     match f () with
    20     | () -> run_next () (* value case *)
    21     | effect Yield k -> suspend k; run_next ()
    22     | effect (Fork f') k -> suspend k; spawn f'
    23     | effect (In_line ic) k ->
    24       pending_reads := (ic,k)::!pending_reads; run_next ()
    25   in
    26   spawn main

The scheduler queue runq now holds thunks instead of continuations. The value pending_reads maintains a list of pending reads and the associated continuations (line 4). At line 24, we handle the In_line effect by pushing the pair of input channel ic and continuation k to pending_reads, allowing other threads in the scheduler to run.

When the scheduler queue is empty, the run_next function performs the pending reads. We abstract away the details of the event-based I/O using the do_reads function (line 12). do_reads takes a list of pending reads and blocks until at least one of the reads succeeds. It returns a pair of lists compl and todo. compl contains the result strings from successful reads and corresponding continuations. These continuations are arranged to be resumed with the read result and pushed into the scheduler queue. todo contains the channels on which input is still pending and their corresponding continuations. pending_reads is updated to point to the todo list so that they may be attempted later. Observe that all of the changes to add asynchrony are localised to the run function, and the computation that performs these effects can remain in direct style (as opposed to the monadic-style in Lwt and Async).

This example does not resume a continuation more than once. This also holds true for other use cases such as generators and coroutines. Hence, our continuations are one-shot, and resuming the continuation more than once raises an Invalid_argument exception. It is well-known that one-shot continuations can be implemented efficiently [9].

While OCaml permits throwing exceptions across C frames, we do not allow effects to propagate across C frames as the C frames would become part of the captured continuation. Managing C frames as part of the continuation is a complex endeavour [34], and we find that the complexity budget outweighs the relatively fewer mechanisms enabled by this addition in our setting.

3.2 Resource cleanup

The interaction of non-local control flow with systems programming is quite subtle [18, 36]. Consider the following function that uses blocking I/O functions from the OCaml standard library to copy data from the input channel ic to the output channel oc:

    let copy ic oc =
      let rec loop () =
        output_string oc ((input_line ic) ^ "\n"); loop () in
      try loop () with
      | End_of_file -> close_in ic; close_out oc
      | e -> close_in ic; close_out oc; raise e

The function input_line raises an End_of_file exception on reaching the end of input, which is handled by the exception handler which closes the channels. The close_* functions do nothing if the channel is already closed. The code is written in a defensive style to handle other exceptional cases such as the channels being closed externally. Both input_line and output_string raise a Sys_error exception if the channel is closed. In this case, the catch-all exception handler closes the channels and reraises the exception to communicate the exceptional behaviour to the caller.

One of our goals (§1.1) is to make this code transparently asynchronous. We can define effects for performing the I/O operations and wrap them up in functions with the same signature as the one from the standard library:

    effect In_line : in_channel -> string
    effect Out_str : out_channel * string -> unit
    let input_line ic = perform (In_line ic)
    let output_string oc s = perform (Out_str (oc, s))

We can then use the run function that we defined earlier, to discharge the I/O operations asynchronously and resume with the result. While this handles value return cases, what about the exceptional cases End_of_file and Sys_error? To this end, we introduce a discontinue primitive to resume a continuation by raising an exception. In this example, on reaching the end of file, we would discontinue the captured continuation of the input_line function with discontinue k End_of_file, which raises the exception at input_line call site, and the open channels will be closed.

OCaml programs that use resources such as channels are usually written defensively with the assumption that calling a function will return exactly once, either normally or exceptionally. Since effect handlers in Multicore OCaml do not ensure that all the effects are handled, if the function performs an effect with no matching handler, then the function will not return at all. To remedy this, when such an effect bubbles up to the top-level, we discontinue the continuation with an Unhandled exception so that the exception handlers may run and clean up the resources.

4 Semantics

In this section, we formalise the effect handler design for Multicore OCaml.

4.1 Static semantics

As mentioned earlier, effect handlers in Multicore OCaml do not guarantee effect safety, but only guarantee type safety. Programs without matching effect handlers are well-typed Multicore OCaml programs. As a result, our static semantics is simpler than languages that ensure effect safety [4, 6, 12, 27, 35, 48]. This is important for backwards compatibility as our goal is to retrofit effect handlers to a language with large legacy codebases; programs that do not use effects remain well-typed, and those that do compose well with those that don't.

The static semantics of effect handlers in OCaml is captured succinctly by its API:

    type 'a eff = ..
    type ('a,'b) continuation
    val perform: 'a eff -> 'a
    val continue: ('a,'b) continuation -> 'a -> 'b
    val discontinue: ('a,'b) continuation -> exn -> 'b
    (* Internal API *)
    type 'a comp = unit -> 'a
    type ('a,'b) handler =
      { retc: 'a -> 'b;
        effc: 'c.'c eff -> ('c,'b) continuation -> 'b; }
    val match_with: 'a comp -> ('a,'b) handler -> 'b

We introduce an extensible variant type [45] 'a eff of effect values, which when performed using the perform primitive returns an 'a value. Constructors for the value of type 'a eff are declared using the effect declarations. For example, the declaration effect E : string -> int is syntactic sugar for adding a new constructor to the variant type type _ eff += E : string -> int eff. We introduce the type ('a,'b) continuation of delimited continuations which expects a 'a value for resumption and returns a 'b value. The continuations may be continued with a suitably typed value or discontinued with an exception.

For handling the effects, our implementation extends OCaml's match ... with syntax with effect patterns. The expression

    match e with
    | None -> false | Some b -> b
    | effect (E s) k1 -> e1 | effect (F f) k2 -> e2

is translated to the equivalent of

    match_with (fun () -> e)
    { retc = (function None -> false | Some b -> b);
      effc = (function
        | (E s) -> (fun k1 -> e1)
        | (F f) -> (fun k2 -> e2)
        | e -> (fun k -> match perform e with
                | v -> continue k v
                | exception e -> discontinue k e)); }

For the sake of exposition, we introduce a ('a,'b) handler type. This handler handles a 'a comp that returns a 'a value, and itself returns a 'b value. The handler has a return field retc of type 'a -> 'b. The effect field effc handles effects of type 'c eff with ('c,'b) continuation and returns a value of type 'b. The last case in effc reperforms any unmatched effect to the outer handler and returns the value and exceptions back to the original performer. In the implementation, reperform is implemented as a primitive to avoid executing code on the resumption path.

4.2 Dynamic semantics

We present an operational semantics for a core language of effect handlers that faithfully captures the semantics of the Multicore OCaml implementation. An executable version of the semantics, implemented as an OCaml interpreter, along with examples, is included in the supplementary material.

4.2.1 Syntax. Our expressions (Figure 2a) consist of integer constants ($n$), variables ($x$), abstraction ($\Lambda x.e$), application ($e\ e$), arithmetic expressions ($e \odot e$) where $\odot$ ranges over $\{+, -, *, /\}$, raising exceptions ($\mathsf{raise}\ l\ e$), performing effects ($\mathsf{perform}\ l\ e$), and handling effects ($\mathsf{match}\ e\ \mathsf{with}\ h$). Abstractions come in two forms: OCaml abstractions ($\lambda^o$) and C abstractions ($\lambda^c$). The handler consists of a return case ($\mathsf{return}\ x \mapsto e$), zero or more exception cases ($\mathsf{exception}\ l\ x \mapsto e$) with label $l$, parameter $x$ and body $e$, and zero or more effect cases ($\mathsf{effect}\ l\ x\ k \mapsto e$) with label $l$, parameter $x$, continuation $k$ and body $e$.

The operational semantics is an extension of the CEK machine semantics [22] for effect handlers, following the abstract machine semantics of Hillerström et al. [27]. The key difference from Hillerström et al. is that our stacks are composed of alternating sequence of OCaml and C stack segments. The program state is captured as configuration $\mathcal{C} := \|\tau, \epsilon, \sigma\|$ with the current term $\tau$ under evaluation, its environment $\epsilon$ and the current stack $\sigma$. The term is either an expression $e$ or a value $v$. The values are integer constants $n$,

(a) Syntax of expressions and configurations

Constants         n ≔ ℤ
Abstractions      Λ ≔ λo | λc
Expressions       e ≔ n | x | e e | Λx.e | e ⊙ e | raise l e | match e with h | perform l e
Handlers          h ≔ {return x ↦ e} | {exception l x ↦ e} ⊎ h | {effect l x k ↦ e} ⊎ h
Values            v ≔ n | k | ⦇Λx.e, ϵ⦈ | eff l k | exn l
Frames            r ≔ ⟨e ϵ⟩a | ⟨v⟩f | ⟨⊙ e ϵ⟩b1 | ⟨⊙ n⟩b2
Environments      ϵ ≔ ∅ | ϵ[x ↦ v]
Handler closures  η ≔ (h, ϵ)
Frame list        ψ ≔ [] | r :: ψ
Fibers            φ ≔ (ψ, η)
Continuations     k ≔ [] | φ ◁ k
C stacks          γ ≔ ⟪ψ, ω⟫c
OCaml stacks      ω ≔ • | ⟪k, γ⟫o
Stacks            σ ≔ γ | ω
Terms             τ ≔ e | v
Configurations    C ≔ ∥τ, ϵ, σ∥

(b) Top-level reductions

StepC    ∥τ, ϵ, ⟪ψ, ω⟫c∥ → C    if (τ, ϵ, ψ, ω) →c C
StepO    ∥τ, ϵ, ⟪k, γ⟫o∥ → C    if (τ, ϵ, k, γ) →o C

(c) Administrative reductions: (τ, ϵ, ψ) ⇝ (τ, ϵ, ψ)

Var      (x, ϵ, ψ) ⇝ (ϵ(x), ϵ, ψ)
Arith1   (e1 ⊙ e2, ϵ, ψ) ⇝ (e1, ϵ, ⟨⊙ e2 ϵ⟩b1 :: ψ)
Arith2   (n1, _, ⟨⊙ e2 ϵ⟩b1 :: ψ) ⇝ (e2, ϵ, ⟨⊙ n1⟩b2 :: ψ)
Arith3   (n2, ϵ, ⟨⊙ n1⟩b2 :: ψ) ⇝ (⟦n1 ⊙ n2⟧, ϵ, ψ)
App1     (e1 e2, ϵ, ψ) ⇝ (e1, ϵ, ⟨e2 ϵ⟩a :: ψ)
App2     (Λx.e, ϵ, ψ) ⇝ (⦇Λx.e, ϵ⦈, ϵ, ψ)
App3     (⦇Λx.e1, ϵ1⦈, _, ⟨e2 ϵ2⟩a :: ψ) ⇝ (e2, ϵ2, ⟨⦇Λx.e1, ϵ1⦈⟩f :: ψ)
Resume1  (k, _, ⟨e1 ϵ1⟩a :: ⟨e2 ϵ2⟩a :: ψ) ⇝ (e1, ϵ1, ⟨k⟩f :: ⟨e2 ϵ2⟩a :: ψ)
Resume2  (⦇Λx.e1, ϵ1⦈, _, ⟨k⟩f :: ⟨e2 ϵ2⟩a :: ψ) ⇝ (e2, ϵ2, ⟨k⟩f :: ⟨⦇Λx.e1, ϵ1⦈⟩f :: ψ)
Perform  (perform l e, ϵ, ψ) ⇝ (e, ϵ, ⟨eff l [[], ({return x ↦ x}, ∅)]⟩f :: ψ)
Raise    (raise l e, ϵ, ψ) ⇝ (e, ϵ, ⟨exn l⟩f :: ψ)

(d) C reductions: (τ, ϵ, ψ, ω) →c C

AdminC    (τ, ϵ, ψ, ω) →c ∥τ′, ϵ′, ⟪ψ′, ω⟫c∥    if (τ, ϵ, ψ) ⇝ (τ′, ϵ′, ψ′)
CallC     (v, _, ⟨⦇λc x.e, ϵ⦈⟩f :: ψ, ω) →c ∥e, ϵ[x ↦ v], ⟪ψ, ω⟫c∥
Callback  (v, _, ⟨⦇λo x.e, ϵ⦈⟩f :: ψ, ω) →c ∥e, ϵ[x ↦ v], ⟪k, ⟪ψ, ω⟫c⟫o∥    if k = [[], ({return x ↦ x}, ∅)]
RetToO    (v, ϵ, [], ⟪k, γ⟫o) →c ∥v, ϵ, ⟪k, γ⟫o∥
ExnFwdO   (v, ϵ, ⟨exn l⟩f :: _, ⟪(ψ, η) ◁ k, γ⟫o) →c ∥v, ϵ, ⟪(⟨exn l⟩f :: ψ, η) ◁ k, γ⟫o∥

(e) OCaml reductions: (τ, ϵ, k, γ) →o C

AdminO     (τ, ϵ, (ψ, η) ◁ k, γ) →o ∥τ′, ϵ′, ⟪(ψ′, η) ◁ k, γ⟫o∥    if (τ, ϵ, ψ) ⇝ (τ′, ϵ′, ψ′)
CallO      (v, _, (⟨⦇λo x.e, ϵ⦈⟩f :: ψ, η) ◁ k, γ) →o ∥e, ϵ[x ↦ v], ⟪(ψ, η) ◁ k, γ⟫o∥
ExtCall    (v, _, (⟨⦇λc x.e, ϵ⦈⟩f :: ψ, η) ◁ k, γ) →o ∥e, ϵ[x ↦ v], ⟪[], ⟪(ψ, η) ◁ k, γ⟫o⟫c∥
RetToC     (v, _, [([], (h, ∅))], γ) →o ∥v, ϵ, γ∥    if h = {return x ↦ x}
RetFib     (v, _, ([], (h, ϵ)) ◁ k, γ) →o ∥e, ϵ[x ↦ v], ⟪k, γ⟫o∥    if {return x ↦ e} ∈ h and k ≠ []
Handle     (match e with h, ϵ, k, γ) →o ∥e, ϵ, ⟪([], (h, ϵ)) ◁ k, γ⟫o∥
ExnHn      (v, _, (⟨exn l⟩f :: _, (h, ϵ)) ◁ k′, γ) →o ∥e, ϵ[x ↦ v], ⟪k′, γ⟫o∥    if {exception l x ↦ e} ∈ h
ExnFwdC    (v, ϵ, [(⟨exn l⟩f :: _, (h, _))], ⟪ψ′, ω⟫c) →o ∥v, ϵ, ⟪⟨exn l⟩f :: ψ′, ω⟫c∥    if {exception l _ ↦ _} ∉ h
ExnFwdFib  (v, ϵ, (⟨exn l⟩f :: _, (h, _)) ◁ (ψ′, η′) ◁ k′, γ) →o ∥v, ϵ, ⟪(⟨exn l⟩f :: ψ′, η′) ◁ k′, γ⟫o∥    if {exception l _ ↦ _} ∉ h
EffHn      (v, _, (⟨eff l k⟩f :: ψ, (h, ϵ)) ◁ k′, γ) →o ∥e, ϵ[r ↦ k″][x ↦ v], ⟪k′, γ⟫o∥    if {effect l x r ↦ e} ∈ h and k″ = k @ [(ψ, (h, ϵ))]
EffFwd     (v, ϵ′, (⟨eff l k⟩f :: ψ, (h, ϵ)) ◁ (ψ′, η′) ◁ k′, γ) →o ∥v, ϵ′, ⟪(⟨eff l k″⟩f :: ψ′, η′) ◁ k′, γ⟫o∥    if {effect l _ _ ↦ _} ∉ h and k″ = k @ [(ψ, (h, ϵ))]
EffUnHn    (v, _, [(⟨eff l k⟩f :: ψ, (h, ϵ))], γ) →o ∥e, ∅, ⟪k @ [(ψ, (h, ϵ))], γ⟫o∥    if {effect l _ _ ↦ _} ∉ h and e = raise Unhandled 0
Resume     (v, _, (⟨k⟩f :: ⟨⦇λo x.e, ϵ⦈⟩f :: ψ, η) ◁ k′, γ) →o ∥e, ϵ[x ↦ v], ⟪k @ ((ψ, η) ◁ k′), γ⟫o∥
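As a companion to the grammar in panel (a), the following is a minimal sketch of how part of it could be written down as OCaml data types. All type, constructor and function names are assumptions made for illustration (the paper gives no such encoding), and the arithmetic forms and the nesting of C and OCaml stacks are left out for brevity. The two helper functions mirror the way continuations are extended in rules EffHn and EffFwd and spliced in rule Resume.

```ocaml
(* Hypothetical encoding of part of the syntax in panel (a). *)
type label = string
type var = string
type lang = OCamlAbs | CAbs                       (* λo | λc *)

type expr =
  | Const of int
  | Var of var
  | App of expr * expr
  | Abs of lang * var * expr
  | Raise of label * expr
  | Match of expr * handler
  | Perform of label * expr

and handler = {
  return_case : var * expr;                       (* return x ↦ e *)
  exn_cases : (label * (var * expr)) list;        (* exception l x ↦ e *)
  eff_cases : (label * (var * var * expr)) list;  (* effect l x k ↦ e *)
}

type value =
  | Int of int
  | Cont of cont
  | Closure of lang * var * expr * env
  | Eff of label * cont
  | Exn of label
and env = (var * value) list
and frame =
  | ArgFrame of expr * env                        (* ⟨e ϵ⟩a *)
  | FunFrame of value                             (* ⟨v⟩f *)
and fiber = { frames : frame list; handler : handler; henv : env }
and cont = fiber list                             (* [] | φ ◁ k *)

(* k'' = k @ [(ψ, (h, ϵ))]: extend a captured continuation with the
   fiber whose handler is being consulted (rules EffHn and EffFwd). *)
let capture (k : cont) (psi : frame list) (h : handler) (eps : env) : cont =
  k @ [ { frames = psi; handler = h; henv = eps } ]

(* Rule Resume, simplified: the resumed continuation is placed in front
   of the fibers that are currently running. *)
let resume (k : cont) (current : cont) : cont = k @ current
```

Because a continuation is just a list of fibers in this encoding, forwarding an effect through n handlers appends n fibers one at a time, which matches the linear cost of handler lookup discussed in the implementation section.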

Figure 2. Operational semantics of Multicore OCaml effect handlers.

continuations k, closures ⦇Λx.e, ϵ⦈, effects being performed (eff l k) and exceptions being raised (exn l). The environment is a map from variables to values.

The stack σ is either a C stack (γ) or an OCaml stack (ω). The C stack ⟪ψ, ω⟫c consists of a list of frames ψ, and the OCaml stack ω under it. The OCaml stack is either empty • or non-empty ⟪k, γ⟫o with the current continuation k and the C stack γ under it. Thus, the program stack is an alternating sequence of C and OCaml stacks terminating with an empty OCaml stack •. The frame list ψ is composed of individual frames r, each of which is one of an argument frame ⟨e ϵ⟩a with the expression e at the argument position of an application with its environment ϵ, a function frame ⟨v⟩f with the value v at the function position of an application, or a frame for evaluating the arguments of an arithmetic expression.

A continuation k is either empty [] or a non-empty list of fibers. A fiber φ ≔ (ψ, η) is a list of frames ψ and a handler closure η ≔ (h, ϵ), which is a pair of handler h and its environment ϵ. We use the infix operator @ for appending two lists.

4.2.2 Top-level reductions. The initial configuration for an expression e is ∥e, ∅, ⟪[], •⟫c∥, where the environment and the stack are empty. The top-level reductions (Figure 2b) can be performed either by taking a C step →c or an OCaml step →o.

4.2.3 C reductions. We can take a C step (Figure 2d) by taking an administrative reduction step ⇝. The administrative reductions are common to both C and OCaml. The rules Var, Arith1, Arith2, App1, App2 and App3 are standard. Arith3 performs the arithmetic operation on the integers (⟦n1 ⊙ n2⟧). Raise pushes a function frame with an exception value to indicate that an exception is being raised. Similarly, Perform pushes a function frame with an effect value with an empty continuation [[], ({return x ↦ x}, ∅)] with no captured frames and an empty handler with an identity return case alone. We shall return to Resume1 and Resume2 in the next subsection.

Continuing with the rest of the C reduction steps, CallC captures the behaviour of calling a C function. Since the program is currently executing C, we can perform the call on the current stack. In case the abstraction is an OCaml abstraction (Callback), we create an OCaml stack with the C stack as its tail, with the current continuation being empty. This captures the behaviour of calling back into OCaml from C. RetToO returns a value to the enclosing OCaml stack. ExnFwdO forwards a raised exception to the enclosing OCaml stack, unwinding the rest of the frames. This captures the semantics of raising OCaml exceptions from C.

4.2.4 OCaml reductions. In OCaml (Figure 2e), reductions always occur on the top-most fiber in the current stack. AdminO performs administrative reductions. CallO evaluates an OCaml function on the current stack. ExtCall captures the behaviour of external calls, which are evaluated on an empty C stack with the current OCaml stack as its tail. RetToC returns a value to the enclosing C stack. In this case, we have exactly one fiber on the stack, and this was created in the rule Callback, whose handler has an identity return case alone and whose environment is empty. RetFib returns the value from a fiber to the previous one, evaluating the body of the return case.

The rule Handle installs a handler by pushing a fiber with no frames and the given handler. The rule ExnHn handles an exception, if the current handler has a matching exception case, unwinding the current fiber. The rule ExnFwdC forwards the exception to C. Here, there is exactly one fiber on the current stack, and the handler does not have a matching exception case, which we know is the case (see Callback rule). The rule ExnFwdFib forwards the exception to the next fiber if the current handler does not handle it.

The rule EffHn captures the handling of effects when the current handler has a matching effect case. We evaluate the body of the matching case, and bind the continuation parameter r to the captured continuation k″. Observe that the captured continuation k″ includes the current handler. Intuitively, the handler wraps around the captured continuation. This gives Multicore OCaml effect handlers deep handler semantics [27]. EffFwd forwards the effect to the outer fiber, and extends the captured continuation k″ in the process. Recall that we do not capture C frames as part of a continuation. To this end, EffUnHn models an unhandled effect. If the effect bubbles up to the top fiber (which we know does not have an effect case; see Callback rule), we raise the Unhandled exception at the point where the corresponding effect was performed. This is achieved by appending the captured continuation to the front of the current continuation.

Observe that continue and discontinue are not part of the expressions. They are encoded as continue k e = (k (λo x.x)) e and discontinue k l e = (k (λo x.raise l x)) e. Intuitively, resuming a continuation in both cases involves evaluating the appropriate abstraction on top of the continuation. We perform the administrative reductions Resume1 and Resume2 to evaluate the arguments to continue and discontinue. The rule Resume appends the given continuation to the front of the current continuation, and evaluates the body of the closure.

5 Implementation

We now present the implementation details of effect handlers in Multicore OCaml. While we assume an x86_64 architecture for the remainder of this paper, our design does not preclude other architectures and operating systems.

5.1 Exceptions

The implementation follows the operational semantics, but has a few key representational differences. Unlike the operational semantics, handlers with just exception patterns

(exception handlers) are implemented differently than effect handlers. As mentioned in §2.2, exceptions are pervasive in OCaml and are so cheap that they are used for local control flow. Hence, we retain the linked exception handler frame implementation of stock OCaml in Multicore OCaml to ensure performance backwards compatibility. This differs from other research languages with effect handlers [4, 12, 27], which implement exceptions using effects (by ignoring the continuation argument in the handler).

5.2 Heap-allocated fibers

In the operational semantics, the continuations may be resumed more than once. Captured continuations are copies of the original fibers and resuming the continuation copies the fibers and leaves the continuation as it is. Since our primary use case is concurrency, continuations will be resumed at most once, and copying fibers is unnecessary and inefficient. Instead, Multicore OCaml optimises fibers for one-shot continuations. Fibers are allocated on the C heap using malloc and are freed when the handled computation returns with a value or an exception. Similar to Farvardin et al. [21], we use a stack cache of recently freed stacks in order to speed up allocation.

Figure 3a shows the layout of a fiber in Multicore OCaml. At the bottom of the stack, we have the handler_info, which contains the pointer to the parent fiber, and the closures for the value, exception and effect cases. The closures are created by the translation described in §4.1; Multicore OCaml supports exception patterns in addition to effect patterns in the same handler. This is followed by a context block needed for DWARF and GC bookkeeping with callbacks. Then, there is a top-level exception handler frame that forwards exceptions to the parent fiber. When the exceptions are caught by this handler, the control switches to the parent stack, and the exception handler closure clos_hexn is invoked. This is followed by the pc of the code that returns values to the parent fiber. This stack is laid out such that when the handled computation returns, the control switches to the parent fiber and the value handler clos_hval is invoked.

Next, we have the variable-sized area for the OCaml frames. In order to keep fibers small, this area is initially 16 words in length. When the stack pointer rsp becomes less than the stack threshold (maintained in the Caml_state table), the stack is said to have overflowed. On stack overflow, we copy the whole fiber to a new area with double the size. In Multicore OCaml, we introduce stack overflow checks into the function prologue of OCaml functions. These stack overflows are rare and so the overflow checks will be correctly predicted by the CPU branch predictor.

In our evaluation of real world OCaml programs (§6), we observed that most function calls are to leaf functions with small frame sizes. Can we eliminate the stack overflow checks for these functions? To this end, we introduce a small, fixed-sized red zone at the top of the stack. The compiler elides the stack overflow check for leaf functions whose frame size is less than the size of the red zone. The default size of the red zone in Multicore OCaml is 16 words.

Finally, we have the saved exception pointer, which points to the top-most exception frame, and the saved stack pointer, which points to the top of the stack. Switching between fibers only involves saving the exception and the stack pointer of the current stack and loading the same on the target stack. Since OCaml does not generate pointers into the stack, the two fiber_info fields are the only ones that need to be updated when fibers are moved.

5.3 External calls and callbacks

Since C functions do not have stack overflow checks, we have to execute the external calls in the system stack. Calling a C function from OCaml involves saving the stack pointer in the current fiber, saving the allocation pointer value in r15 in the Caml_state, updating rsp to the top of the system stack (maintained in Caml_state), and calling the C function. The actions are reversed when returning from the external call. For C functions that take arguments on the stack, the arguments must be copied to the C stack from the OCaml stack.

When we first enter OCaml from C, a new fiber is allocated for the main OCaml stack. Since callbacks may be frequent in OCaml programs that use finalisers, we run the callbacks on the same fiber as the current one. For example, the layout of the Multicore OCaml stack at caml_raise_exn in the meander example from §2 is shown in Figure 3b. The functions caml_call_c and caml_call_ocaml switch the stacks, and hence are shown in both the system stack and the fiber. Since we are reusing the fiber for the callback, care must be taken to save and restore the handler_info before calling and after returning from the c_to_ocaml function, respectively. Thanks to the fiber representation, external calls and callbacks remain competitive with stock OCaml.

5.4 Effect handlers

Similar to exception handlers, the lack of callee-saved registers in OCaml benefits effect handlers. There is no register state to save when entering an effect handler or performing an effect. Similarly, there is no register state to restore when handling an effect or resuming a continuation. This fortuitous design choice in stock OCaml has a significant impact in enabling fast switching between fibers in Multicore OCaml.

In order to illustrate the runtime support for handling effects, consider the example presented in Figure 3c. The layout of the program state as the program executes is captured in Figure 3d. The code performs effect E which is handled in the outer-most handler, and is immediately resumed. The arrows between the fibers are parent pointers. At position p1, rsp is at the top of the fiber f.

When the effect E is performed, we allocate a continuation object ke in the OCaml heap that points to the current fiber

[Figure 3, panel (a): diagram of the fiber layout. From the bottom of the stack upwards, the regions are: handler_info (parent_fiber, clos_heffect, clos_hexn, clos_hval); context block (2 words); top-level exception handler frame (pc(ExnHandle), NULL); pc(RetVal); OCaml frames (variable size); free space, bounded by the stack threshold; red zone (16 words); fiber_info (saved_exn_ptr, saved_sp); header word.]

[Figure 3, panel (b): Stack layout for meander example from §2. The diagram shows the system stack and the fiber side by side; caml_call_ocaml and caml_call_c appear on both, and the frames shown include main, omain, ocaml_to_c, caml_callback, c_to_ocaml and caml_raise_exn.]

[Figure 3, panel (c): code.]

    effect E : unit
    effect F : unit;;

    match (* comp_e *)
      match (* comp_f *)
        (*p1*) perform E (*p3*)
      with
      | v -> v
      | effect F kf -> ()
    with
    | v -> v
    | effect E ke -> (*p2*) continue ke ()

[Figure 3, panel (d): diagram of the fibers main, e and f, running main, comp_e and comp_f respectively, at positions p1, p2 and p3, with the location of rsp and the continuation object ke marked.]
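For readers who want to run the program in panel (c), here is a hedged transcription to the Effect module that ships with the standard library of OCaml 5.0 and later. Using that API is an assumption on our part: the paper itself uses the Multicore OCaml syntax (effect E : unit and match ... with effect E k -> ...). The trace reference and the mark helper are hypothetical additions that record the order in which positions p1, p2 and p3 are reached.

```ocaml
(* Assumes OCaml >= 5.0, whose standard library provides Effect. *)
open Effect
open Effect.Deep

type _ Effect.t += E : unit Effect.t
type _ Effect.t += F : unit Effect.t

(* Positions reached so far, most recent first (illustrative helper). *)
let trace : string list ref = ref []
let mark p = trace := p :: !trace

(* comp_f: the inner computation, which runs on its own fiber. *)
let comp_f () =
  mark "p1";
  perform E;
  mark "p3"

(* comp_e: installs the inner handler, which only knows about F. *)
let comp_e () =
  match_with comp_f ()
    { retc = (fun v -> v);
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | F -> Some (fun (_kf : (a, unit) continuation) -> ())
        | _ -> None (* E is not handled here, so it is forwarded *)) }

(* The outer handler handles E and resumes the continuation at once. *)
let run () =
  trace := [];
  match_with comp_e ()
    { retc = (fun v -> v);
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | E -> Some (fun (ke : (a, unit) continuation) ->
            mark "p2";
            continue ke ())
        | _ -> None) };
  List.rev !trace
```

Calling run () yields the positions in the order p1, p2, p3: the effect passes through the inner handler, is handled by the outer one, and the resumed computation then finishes on the inner fiber.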

Figure 3. Layout of Multicore OCaml effect handlers. Panels: (a) Fiber layout; (c) Constructing continuation objects; (d) Program state for code in 3c.

f, set fiber f's parent pointer to NULL, and evaluate the continuation closure clos_heffect on the parent fiber e with the effect E and the continuation ke as arguments. Since the first handler does not handle effect E, the effect is reperformed (§4.1) by appending the fiber e to the tail of continuation ke, setting fiber e's parent pointer to NULL, and evaluating the current continuation closure on the parent fiber main with E and ke as arguments, which handles E (position p2). Thus, continuations are captured without copying frames. Since every handler closure is evaluated until a matching one is found, the time taken to handle an effect is linear in the number of handlers. We observed that the handler stack is shallow in real programs.

When the continuation is resumed, we overwrite the value of ke to NULL to enforce at-most-once semantics. Resuming a continuation involves traversing the linked list of fibers and making the last fiber point to the current fiber. Just as in the operational semantics, the implementation invokes the appropriate closure to either continue or discontinue the continuation (position p3). We perform tail-call optimisation so that resumptions at tail positions do not build up stack.

5.5 Stack unwinding

The challenge with DWARF stack unwinding is to make it aware of the non-contiguous stacks. While the complete details of DWARF stack unwinding are beyond the scope of the paper, it is beneficial to know how DWARF unwind tables are constructed in order to appreciate our solution. We refer the interested reader to Bastian et al. [3] for a good overview of DWARF stack unwinding.

Logically, DWARF call-frame information maintains a large table which records for every machine instruction where the return address and callee-saved registers are stored. To avoid reifying this large table, DWARF directives represent the table using a compact bytecode representation that describes the unwind table as a sequence of edits from the start of the function. In order to compute the call-frame information at any given instruction within a function, the DWARF bytecode from the start of that function must be interpreted on demand. For each function, DWARF maintains a canonical frame address (CFA), which is traditionally the stack pointer before entering this function. Hence, on x86-64, where the return address is pushed on the stack on call, the return address is at CFA - 8.

Our goal is to compute the CFA of the caller when stacks are switched using the DWARF directives. Recall that stack switching occurs in effect handlers, external calls and callbacks. At the entry to an effect handler block, we insert DWARF bytecode to follow the parent_fiber pointer and dereference the saved_sp to get the CFA (saved_sp + 8). During callbacks into OCaml, we save the current system stack pointer in the context block in Figure 3a to identify the CFA in the C stack. DWARF unwinding for external calls is implemented by following a link to the current OCaml stack pointer. With these changes, we get the same backtrace for the meander program from §2.3, modulo runtime system functions due to effect handlers. We have verified the correctness of our DWARF directives using the verification tool from Bastian et al. [3].

Despite the correct DWARF unwind information, using DWARF to record call stack information in perf only captures the call stack of the current fiber in Multicore OCaml. Since stack unwinding using DWARF is slow due to bytecode interpretation overhead, perf dumps the (user) call stack when sampled [3]. This only includes the frames from the current fiber. This is a limitation of perf and not of our stack layout.

Table 1. Micro benchmarks without effects. Each entry is the percentage difference for Multicore OCaml over stock OCaml.

        exnval  exnraise  extcall  callback  ack    fib    motzkin  sudan  tak
Time    +0.0    -1.9      +17      +65       +5.3   +2.2   +10      +0.0   +4.2
Instr   +0.0    +0.0      +10      +72       +16    +24    +16      +14    +17

Bastian et al. [3] report on a technique to pre-compile the unwind table to assembly, which speeds up DWARF-based unwinding by up to 25×. With this technique, perf can unwind the stack at sample points rather than dumping the call stack, which would capture the complete backtrace rather than just the current fiber.

5.6 Garbage collection

Recall that OCaml programs are written with the expectation that function calls return exactly once (§3). Consider the scenario when a continuation is never resumed. Since fibers allocate memory for the stack using malloc, which is freed when the computation returns, not resuming continuations leaks memory. In addition, unresumed continuations may also leak other system resources such as sockets and open file descriptors.

We make a pragmatic trade-off and expect the user code to resume captured continuations exactly once. One can use the GC support to free up resources by installing a finaliser that discontinues the continuation and ignores the result:

    Gc.finalise (fun k ->
      try ignore (discontinue k Unwind) with _ -> ()) k

This frees up both the memory allocated for the fiber stack as well as other system resources, assuming that user code does not handle the Unwind exception and then fail to re-raise it. Since installing a finaliser on every captured continuation introduces significant overhead (§6.3.3), we choose not to do it by default. It is also useful to note that even if the memory for the fiber stack is managed by the GC, we would still need a finalisation mechanism to unwind the stack and release other system resources that may be held by the continuation.

The challenges and the solutions for integrating fibers with the concurrent mark-and-sweep GC of Multicore OCaml have been discussed previously [50].

6 Evaluation

In this section, we evaluate the performance of Multicore OCaml effect handlers against the performance requirements set in §1.1. Multicore OCaml is an extension of the OCaml 4.10.0 compiler with support for shared memory parallelism and effect handlers. Since our objective is to evaluate the impact of effect handlers, none of our benchmarks utilise parallelism. These results were obtained on a 2-socket Intel® Xeon® Gold 5120 x86-64 [29] server running Ubuntu 18.04 with 64 GB of main memory.

6.1 No effects benchmarks

In this section, we measure the impact of the addition of effect handlers on code that does not use effect handlers. Our macro benchmark suite consists of 54 real OCaml workloads including verification tools (Coq, Cubicle, AltErgo), parsers (menhir, yojson), storage engines (irmin), utilities (cpdf, decompress), bioinformatics (fasta, knucleotide, revcomp2, regexredux2), numerical analysis (grammatrix, LU_decomposition) and simulations (nbody, game_of_life). In addition to Stock (stock) and Multicore OCaml (MC), we also ran the benchmarks on Multicore OCaml with no red zone (MC+RedZone0) in the fibers (all OCaml functions will have a stack overflow check) and a red zone size of 32 words (MC+RedZone32). Recall that the default red zone size in Multicore OCaml is 16 words (§5.2).

Figure 4 presents the running time of the different multicore variants normalized against the sequential baseline stock. On average (geometric mean of the normalized values against stock as the baseline), the multicore variants were less than 1% slower than stock. The outliers (on either end) were due to the difference in the allocator and the GC between stock and Multicore OCaml. Of the 54 programs in the benchmark suite, 32 programs had an overhead of 5% or lower, and 8 programs had more than 10% overhead.

The biggest impact was the increase in the OCaml text section size (OTSS) due to the stack overflow checks. We define OTSS as the sum of the sizes of all the OCaml text sections in the compiled binary file ignoring the data sections, the debug symbols, the text sections associated with the OCaml runtime and other statically linked C libraries. Figure 5 presents the OTSS of the multicore variants normalized against the sequential baseline stock. Compared to stock, OTSS is 19% more for MC and MC+RedZone32, and 30% more for MC+RedZone0. The result shows that our 16-word red zone is effective at reducing OTSS compared to having no red zone, whereas the 32-word red zone does not noticeably reduce OTSS further. Further work is required to bring OTSS closer to stock.

We also present micro benchmark results in Table 1. Since micro benchmarks magnify micro-architectural optimisations, we also report the number of instructions executed (obtained using perf) along with time. exnval performs 100 million iterations of installing exception handlers and returning with a value. exnraise is similar, but raises an exception in each iteration. extcall and callback perform 100 million external calls and callbacks to identity functions. The other micro benchmarks are highly recursive programs and were taken from Farvardin et al. [21]. For micro benchmarks, we observed that padding tight loops with a few nop instructions, which changes the loop alignment, makes the code up to

[Figure 4: bar chart. Y axis: normalized time. X axis: the macro benchmarks, each labelled with its stock running time. Series: MC, MC+RedZone0, MC+RedZone32.]

Figure 4. Normalized time of macro benchmarks. Baseline is Stock OCaml, whose running time in seconds is given in parentheses.

[Figure 5: bar chart. Y axis: normalized OTSS. X axis: the macro benchmarks, each labelled with its stock text section size. Series: MC, MC+RedZone0, MC+RedZone32.]

Figure 5. Normalized OCaml text section size (OTSS) of macro benchmarks. Baseline is Stock OCaml, whose size in kilobytes is given in parentheses.
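The averages quoted for these figures are geometric means of per-benchmark values normalised against stock. The following is a minimal sketch of that calculation; the function names are our own, and the sample timings are made up for illustration and are not measurements from the paper.

```ocaml
(* Ratio of a variant's measurement to the baseline's measurement. *)
let normalise ~baseline ~variant = variant /. baseline

(* Geometric mean of a non-empty list of positive floats. *)
let geometric_mean (xs : float list) : float =
  match xs with
  | [] -> invalid_arg "geometric_mean: empty list"
  | _ ->
    let n = float_of_int (List.length xs) in
    (* Summing logarithms avoids overflow or underflow of a long product. *)
    exp (List.fold_left (fun acc x -> acc +. log x) 0.0 xs /. n)

(* Hypothetical (baseline, variant) timings in seconds, NOT paper data. *)
let samples = [ (4.0, 4.2); (10.0, 9.5); (2.0, 2.1) ]

let mean_ratio =
  geometric_mean
    (List.map (fun (b, v) -> normalise ~baseline:b ~variant:v) samples)
```

A geometric mean is the usual choice for ratios because a benchmark that is twice as fast and one that is twice as slow cancel out, which an arithmetic mean of the ratios would not give.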

15% faster. Hence, the difference in running times under 15% may not be statistically significant.

The results show that exceptions are no more expensive in MC compared to stock. In the other programs, MC executes more instructions due to stack overflow checks. The performance impact on callbacks is more significant than external calls. For callbacks, since we reuse the current fiber stack, we need to ensure it has enough room for inserting additional frames, while stock does not need to do this. Callback performance is less important than external calls, which are far more numerous.

Table 2. Micro benchmarks with handlers but no perform. Each entry is the slowdown factor (× times) over its idiomatic implementation in stock OCaml.

        ack     fib    motzkin  sudan  tak
MC      12.25   12.05  11.44    6.74   8.9
monad   348.69  69.77  39.24    33.29  42.79

6.2 No perform benchmarks

Next, we aim to quantify the overhead of setting up and tearing down effect handlers compared to a non-tail function call. To this end, we surround the non-tail calls in the recursive micro benchmarks with an effect handler. These programs do not perform effects. We also implemented the same benchmarks using a concurrency monad [10] (monad) as a proxy for CPS versions. Recall that the OCaml compiler does not use CPS in its IR. In the monad version, we use a fork to invoke the non-tail call and use an MVar to collect its result. The results are presented in Table 2. They show that using effect handlers (concurrency monad) is 10.02× (67.09×) more

expensive than the idiomatic implementation using non-tail calls. The concurrency monad suffers due to the heap allocation of continuation frames (which need to be garbage collected), whereas effect handlers benefit from stack allocation of the frames. For example, the number of major collections for the ack benchmark is 0 for stock OCaml, 1 for MC and 112 for monad. Our concurrency monad (and other monadic concurrency libraries such as Lwt [54] and Async [1]) also have other downsides: exceptions, backtraces, and DWARF unwinding are no longer useful due to the lack of a stack. We note that a compiler that uses CPS IR will be faster than the concurrency monad implementation due to optimisations to reduce the heap allocation of continuation frames. But Farvardin et al. [21] show that CPS with optimisations is still slower than using the call stack.

6.3 Concurrent benchmarks

Next we look at benchmarks that utilize non-local control flow using effect handlers. First, we quantify the cost of individual operations in effect handling. Consider the following annotated code:

    effect E : unit
    (*a*) match (*b*) perform E (*d*) with
    | v -> (*e*) v
    | effect E k -> (*c*) continue k ()

The sequence a-b involves allocating a new fiber and switching to it. b-c is performing the effect and handling it. c-d is resuming the continuation. d-e is returning from the fiber with a value and freeing the fiber. We measured the time taken to execute these sequences using perf support for cycle-accurate tracing on modern Intel processors. We executed 10 iterations of the code, with 3 warm-up runs. For calibration, the idle memory load latency for the local NUMA node is 93.2 ns as measured using the Intel MLC tool [42]. We observed that the sequences a-b, b-c, c-d and d-e took 23 ns, 5 ns, 11 ns and 7 ns, respectively. The time in the sequence a-b is dominated by the memory allocation. Thus, the individual operations in effect handling are fast.

6.3.1 Generators. Generators allow data structures to be traversed on demand. Many languages including JavaScript and Python provide generators as a built-in primitive. Using effect handlers (MC), given any data structure ('a t) and its iterator (val iter: 'a t -> ('a -> unit) -> unit), we can derive its generator function (val next : unit -> 'a option) [footnote 2]. We evaluate the performance of traversing a complete binary tree of depth 25 using this generator. This involves 2^26 stack switches in total. For comparison, we implemented a hand-written, selective CPSed [44], defunctionalised [16] version (cps) and a concurrency monad (monad) version of the generator for the tree. Both cps and monad versions are specialised to the binary tree with the usual caveats of not using the stack for function calls. We observed that the cps version was the fastest, thanks to specialisation and hand optimisation. The MC version was only 2.76× slower than cps while being a generic solution, and the monad version was 8.69× slower than cps.

Footnote 2: https://gist.github.com/kayceesrk/eb0ab496c22861f21b1d9484772e982d

6.3.2 Chameneos. Chameneos [30] is a concurrency game aimed at measuring context switching and synchronization overheads. Our implementation uses MVars for synchronization. We compare effect handler (MC), concurrency monad (monad) and Lwt, a widely used concurrency programming library for OCaml (lwt) versions. We observed that MC was the fastest, and monad (lwt) was 1.67× (4.29×) slower than MC.

6.3.3 Finalised continuations. In §5.6, we described how continuation resources can be cleaned up by attaching a finaliser. Attaching this finaliser to every captured continuation slows down the generator (chameneos) benchmark by 4.1× (2.1×) compared to not attaching a finaliser. Hence, Multicore OCaml does not attach such a finaliser to every continuation by default.

6.3.4 Webserver. Using effect handlers, we have implemented a full-fledged HTTP/1.1 web server by extending the example from §3.1 (MC). The web server spawns a lightweight thread per request. We use httpaf [28] for HTTP handling, and libev [37] for the event loop. We compare our implementation against an Lwt version (lwt) which also uses httpaf and libev. Unlike using effect handlers, the Lwt version is written in monadic style and does not have the notion of a thread per request. For comparison, we include a Go 1.13 version (go) that uses the net/http [43] package. As both the OCaml versions are single threaded, the Go benchmark is run with GOMAXPROCS=1.

[Figure 6: two line charts, each with series MC, lwt and go. (a) Throughput: serviced (x1000 req/s) against offered (x1000 req/s), 1000 connections. (b) Tail latency: milliseconds against percentile, 1000 connections at 20000 req/s.]

Figure 6. Web server performance.

The client workload was generated with wrk2 [56]. We maintain 1k open connections and perform requests for a static web page at different rates, and record the service rate and latency. The throughput and tail latency graphs are given in Figures 6a and 6b. In all the versions, the throughput plateaus at around 30k requests per second. We measure the tail latencies at 2/3rd of this rate (20k requests per second) to simulate optimal load. We observe that both of the OCaml versions remain competitive with go, and MC performs best in terms of tail latency.

Multicore OCaml supports backtraces for continuations in addition to backtraces of the current stack as in stock

OCaml. Using effect handlers in a system such as a web server aids debugging and profiling because it is possible to get a backtrace snapshot of all current requests. This feature is available in Go [25], but not in OCaml concurrency libraries such as Async and Lwt which lack the notion of a thread.

7 Related Work

There are several strategies for implementing effect handlers. Eff [4], Helium [7], Frank [12] and the Links server backend [27] use an interpreter similar to our operational semantics to implement effect handlers. Effekt [48], the Links JavaScript backend [27] and Koka [34] use type-directed selective CPS translation. These languages are equipped with an effect system, which allows compiling pure code in direct style and effectful code in CPS. Leijen [35] implements effect handlers as a library in C using stack copying. C allows pointers into the stack, so care is taken to ensure that when continuations are resumed, the constituent frames are restored to the same memory addresses as at the time of capture. Kiselyov et al. [33] use an implementation of multi-prompt delimited continuations as an OCaml library [32] to embed the Eff language in OCaml. Indeed, Forster et al. [24] showed that in an untyped setting, effect handlers, monadic reflection and delimited control can macro-express each other.

Multicore OCaml uses the call stack for implementing continuations (as do [32, 35]), but with one-shot continuations. Bruggeman et al. [9] show how to implement one-shot continuations efficiently using segmented stacks in Scheme. Farvardin et al. [21] perform a comprehensive evaluation of various implementation strategies for continuations on modern hardware. Multicore OCaml stacks do not neatly fit the description of one of these implementation strategies: they are best described as using the resize strategy from Farvardin et al. for each of the fibers, which are linked to represent the current stack and the captured continuations. Kawahara et al. [31] implement one-shot effect handlers using coroutines as a macro-expressible translation, and present an embedding in Lua and Ruby. Lua provides asymmetric coroutines [17] where each coroutine uses its own stack, similar to how each handled computation runs in its own fiber in Multicore OCaml.

Multicore OCaml is not the first language to support stack inspection in the presence of non-local control operators. Chez Scheme supports continuation marks [23] which permit stack inspection as a language feature. This enables implementation of dynamic binding, exceptions, profilers, debuggers, etc., in the presence of first-class continuations. As the authors note, continuation marks can be implemented using effect handlers, but direct support for continuation marks leads to better performance. In this work, we focus on retaining the support for stack inspection through DWARF unwind tables in the presence of effect handlers.

The interaction of non-local control flow and resources has been studied extensively. Scheme uses dynamic-wind [47], which is a generalisation of Common Lisp unwind-protect [52], which ensures de-allocation and re-allocation of resources every time the non-local control leaves and enters back into a context. dynamic-wind is not quite the right abstraction as resources need to be cleaned up only on non-returning exits [20, 49]. This requires distinguishing returning exits from non-returning ones.

Multicore OCaml builds on the existing defensive coding practices against exceptions to clean up resources on non-returning exits. We assume that the continuations are resumed exactly once using continue or discontinue. Under this assumption, when a computation performs an effect, we expect the control to return. For the non-returning cases (value and exceptional return), the code already handles resource cleanup.

OCaml does not have a try/finally construct commonly used for resource cleanup in many programming languages. The OCaml standard library [51] as well as alternative standard libraries such as Base [2] and Core [14] provide mechanisms analogous to unwind-protect, which are in turn implemented using exception handlers. Thus, the linear use of continuations enabled by the discontinue primitive ensures backwards compatibility of legacy OCaml systems code under non-local control flow introduced by effect handlers.

Leijen [36] explicitly extends effect handlers with initially and finally clauses in Koka for resource safety. Dolan et al. [18] describe the interaction of effect handlers and asynchronous exceptions. This is orthogonal to the contributions of this paper. Our focus is the compiler and runtime system support for implementing effect handlers.

8 Conclusions

Our design for effect handlers in OCaml walks the tightrope of maintaining compatibility (for profiling, debuggers and minimal overheads for existing programs), while unlocking the full power of non-local control flow constructs. Our evaluation shows that we have achieved our goal: we retain compatibility with a surprisingly low performance overhead for sequential code that preserves the spirit of “fast exceptions” that has always characterised OCaml programming. We believe that the introduction of effect handlers into OCaml implemented using lightweight fibers, along with a parallel runtime [50], as has been demonstrated in our work, will open OCaml to highly scalable concurrent and task-parallel applications with minimal hit to sequential performance.

Acknowledgements

We thank Sam Lindley, François Pottier, the PLDI reviewers and our shepherd, Matthew Flatt, whose comments substantially helped improve the presentation. This research was funded via Royal Commission for the Exhibition of 1851 and Darwin College Research Fellowships, and by grants from Jane Street and the Tezos Foundation.
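As an illustration of the generator derivation described in §6.3.1, the following is a minimal sketch of how an iterator might be turned into a generator with effect handlers. It is not the benchmark code used in the evaluation. It assumes OCaml 5.0 or later and uses the standard library's Effect.Deep interface instead of the Multicore OCaml `effect` syntax used in the paper; the names generate, Yield, tree and iter_tree are hypothetical and chosen only for this example.

```ocaml
(* Sketch only: assumes OCaml >= 5.0 (stdlib Effect module). *)
open Effect
open Effect.Deep

(* Hypothetical helper: turn a traversal [iter] into a pull-style generator.
   Each call to the result runs the traversal up to the next element. *)
let generate (type a) (iter : (a -> unit) -> unit) : unit -> a option =
  let module M = struct
    type _ Effect.t += Yield : a -> unit Effect.t
  end in
  (* [step] always holds the rest of the traversal. *)
  let step = ref (fun () -> None) in
  step := (fun () ->
    match_with iter (fun x -> perform (M.Yield x))
      { (* Traversal finished: every later call returns None. *)
        retc = (fun () -> step := (fun () -> None); None);
        exnc = (fun e -> raise e);
        effc = (fun (type b) (eff : b Effect.t) ->
          match eff with
          | M.Yield x ->
            Some (fun (k : (b, a option) continuation) ->
              (* Save the one-shot continuation; it is resumed exactly once. *)
              step := (fun () -> continue k ());
              Some x)
          | _ -> None) });
  fun () -> !step ()

(* A small binary tree and its iterator, with the argument order of
   [val iter : 'a t -> ('a -> unit) -> unit]. *)
type 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

let rec iter_tree (t : 'a tree) (f : 'a -> unit) : unit =
  match t with
  | Leaf -> ()
  | Node (l, x, r) -> iter_tree l f; f x; iter_tree r f

let () =
  let t = Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf)) in
  let next = generate (iter_tree t) in
  assert (next () = Some 1);
  assert (next () = Some 2);
  assert (next () = Some 3);
  assert (next () = None)
```

Every element costs one switch into the traversal's fiber and one switch back, which is the per-element work the generator benchmark exercises. The sketch resumes each captured continuation at most once, in line with the one-shot discipline assumed in the paper.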

References

[1] Async 2020. Typeful concurrent programming. https://opensource.janestreet.com/async/
[2] Base.Exn.protect 2020. Unwind-protect in JaneStreet Base library. https://ocaml.janestreet.com/ocaml-core/v0.13/doc/base/Base/Exn/index.html#val-protectx
[3] Théophile Bastian, Stephen Kell, and Francesco Zappa Nardelli. 2019. Reliable and Fast DWARF-Based Stack Unwinding. Proc. ACM Program. Lang. 3, OOPSLA, Article 146 (Oct. 2019), 24 pages. https://doi.org/10.1145/3360572
[4] Andrej Bauer and Matija Pretnar. 2015. Programming with algebraic effects and handlers. Journal of Logical and Algebraic Methods in Programming 84, 1 (2015), 108–123. https://doi.org/10.1016/j.jlamp.2014.02.001 Special Issue: The 23rd Nordic Workshop on Programming Theory (NWPT 2011) Special Issue: Domains X, International workshop on Domain Theory and applications, Swansea, 5-7 September, 2011.
[5] Karthikeyan Bhargavan, Barry Bond, Antoine Delignat-Lavaud, Cédric Fournet, Chris Hawblitzel, Catalin Hritcu, Samin Ishtiaq, Markulf Kohlweiss, Rustan Leino, Jay Lorch, Kenji Maillard, Jianyang Pang, Bryan Parno, Jonathan Protzenko, Tahina Ramananandro, Ashay Rane, Aseem Rastogi, Nikhil Swamy, Laure Thompson, Peng Wang, Santiago Zanella-Béguelin, and Jean-Karim Zinzindohoué. 2017. Everest: Towards a Verified, Drop-in Replacement of HTTPS. In 2nd Summit on Advances in Programming Languages. http://drops.dagstuhl.de/opus/volltexte/2017/7119/pdf/LIPIcs-SNAPL-2017-1.pdf
[6] Dariusz Biernacki, Maciej Piróg, Piotr Polesiuk, and Filip Sieczkowski. 2019. Abstracting Algebraic Effects. Proc. ACM Program. Lang. 3, POPL, Article 6 (Jan. 2019), 28 pages. https://doi.org/10.1145/3290319
[7] Dariusz Biernacki, Maciej Piróg, Piotr Polesiuk, and Filip Sieczkowski. 2020. Binders by Day, Labels by Night: Effect Instances via Lexically Scoped Handlers. Proc. ACM Program. Lang. 4, POPL, Article 48 (Dec. 2020), 29 pages. https://doi.org/10.1145/3371116
[8] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2018. Pyro: Deep Universal Probabilistic Programming. arXiv:1810.09538 [cs.LG]
[9] Carl Bruggeman, Oscar Waddell, and R. Kent Dybvig. 1996. Representing Control in the Presence of One-Shot Continuations. In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation (Philadelphia, Pennsylvania, USA) (PLDI ’96). Association for Computing Machinery, New York, NY, USA, 99–107. https://doi.org/10.1145/231379.231395
[10] Koen Claessen. 1999. A Poor Man’s Concurrency Monad. J. Funct. Program. 9, 3 (May 1999), 313–323. https://doi.org/10.1017/S0956796899003342
[11] Colour 2020. What Color is Your Function? http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/
[12] Lukas Convent, Sam Lindley, Conor McBride, and Craig McLaughlin. 2020. Doo bee doo bee doo. Journal of Functional Programming 30 (2020), e9. https://doi.org/10.1017/S0956796820000039
[13] Coq 2020. The Coq Proof Assistant. https://coq.inria.fr/
[14] Core.Exn.protect 2020. Unwind-protect in JaneStreet Core library. https://ocaml.janestreet.com/ocaml-core/109.20.00/doc/core/Exn.html
[15] Olivier Danvy and Andrzej Filinski. 1990. Abstracting Control. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming (Nice, France) (LFP ’90). Association for Computing Machinery, New York, NY, USA, 151–160. https://doi.org/10.1145/91556.91622
[16] Olivier Danvy and Lasse R. Nielsen. 2001. Defunctionalization at Work. In Proceedings of the 3rd ACM SIGPLAN International Conference on Principles and Practice of Declarative Programming (Florence, Italy) (PPDP ’01). Association for Computing Machinery, New York, NY, USA, 162–174. https://doi.org/10.1145/773184.773202
[17] Ana Lúcia de Moura, Noemi Rodriguez, and Roberto Ierusalimschy. 2004. Coroutines in Lua. Journal of Universal Computer Science 10, 7 (Jul. 2004), 910–925. http://www.jucs.org/jucs_10_7/coroutines_in_lua
[18] Stephen Dolan, Spiros Eliopoulos, Daniel Hillerström, Anil Madhavapeddy, K. C. Sivaramakrishnan, and Leo White. 2018. Concurrent System Programming with Effect Handlers. In Trends in Functional Programming, Meng Wang and Scott Owens (Eds.). Springer International Publishing, Cham, 98–117.
[19] DWARF 2020. The DWARF Debugging Standard. http://dwarfstd.org/
[20] Dynamic Wind 2020. The dynamic-wind problem. http://okmij.org/ftp/continuations/against-callcc.html#dynamic_wind
[21] Kavon Farvardin and John Reppy. 2020. From Folklore to Fact: Comparing Implementations of Stacks and Continuations. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 75–90. https://doi.org/10.1145/3385412.3385994
[22] Matthias Felleisen and Daniel P. Friedman. 1986. Control Operators, the SECD-Machine, and the Lambda-Calculus. Technical Report. https://help.luddy.indiana.edu/techreports/TRNNN.cgi?trnum=TR197
[23] Matthew Flatt and R. Kent Dybvig. 2020. Compiler and Runtime Support for Continuation Marks. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 45–58. https://doi.org/10.1145/3385412.3385981
[24] Yannick Forster, Ohad Kammar, Sam Lindley, and Matija Pretnar. 2019. On the expressive power of user-defined effects: Effect handlers, monadic reflection, delimited control. J. Funct. Program. 29 (2019), e15. https://doi.org/10.1017/S0956796819000121
[25] Go PProf 2020. Profiling a Go Program. https://golang.org/pkg/runtime/pprof/#Profile
[26] Andreas Haas, Andreas Rossberg, Derek L. Schuff, Ben L. Titzer, Michael Holman, Dan Gohman, Luke Wagner, Alon Zakai, and JF Bastien. 2017. Bringing the Web up to Speed with WebAssembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (Barcelona, Spain) (PLDI 2017). Association for Computing Machinery, New York, NY, USA, 185–200. https://doi.org/10.1145/3062341.3062363
[27] Daniel Hillerström, Sam Lindley, and Robert Atkey. 2020. Effect handlers via generalised continuations. Journal of Functional Programming 30 (2020), e5. https://doi.org/10.1017/S0956796820000040
[28] httpaf 2020. A high performance, memory efficient, and scalable web server written in OCaml. https://github.com/inhabitedtype/httpaf
[29] Intel Xeon Gold 5120 2020. Intel® Xeon® Gold 5120 Processor Specification. https://ark.intel.com/content/www/us/en/ark/products/120474/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html
[30] C. Kaiser and J. Pradat-Peyre. 2003. Chameneos, a concurrency game for Java, Ada and others. In ACS/IEEE International Conference on Computer Systems and Applications, 2003. Book of Abstracts. 62–. https://doi.org/10.1109/AICCSA.2003.1227495
[31] Satoru Kawahara and Yukiyoshi Kameyama. 2020. One-Shot Algebraic Effects as Coroutines. In Trends in Functional Programming, Aleksander Byrski and John Hughes (Eds.). Springer International Publishing, Cham, 159–179.
[32] Oleg Kiselyov. 2010. Delimited Control in OCaml, Abstractly and Concretely: System Description. In Functional and Logic Programming, Matthias Blume, Naoki Kobayashi, and Germán Vidal (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 304–320.
[33] Oleg Kiselyov and KC Sivaramakrishnan. 2018. Eff Directly in OCaml. Electronic Proceedings in Theoretical Computer Science 285 (Dec 2018), 23–58. https://doi.org/10.4204/eptcs.285.2
[34] Daan Leijen. 2017. Implementing Algebraic Effects in C. In Asian Symposium on Programming Languages and Systems, Bor-Yuh Evan Chang (Ed.). Springer International Publishing, Cham, 339–363.
[35] Daan Leijen. 2017. Type Directed Compilation of Row-Typed Algebraic Effects. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (Paris, France) (POPL 2017). Association for Computing Machinery, New York, NY, USA, 486–499. https://doi.org/10.1145/3009837.3009872
[36] Daan Leijen. 2018. Algebraic Effect Handlers with Resources and Deep Finalization. Technical Report MSR-TR-2018-10. 35 pages.
[37] libev 2020. A high performance full-featured event loop written in C. https://metacpan.org/pod/distribution/EV/libev/ev.pod#NAME
[38] Loom 2020. Fibers, continuations and tail-calls for the JVM. https://openjdk.java.net/projects/loom/
[39] Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. 2013. Unikernels: Library Operating Systems for the Cloud. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (Houston, Texas, USA) (ASPLOS ’13). Association for Computing Machinery, New York, NY, USA, 461–472. https://doi.org/10.1145/2451116.2451167
[40] Laurent Mauborgne. 2004. Astrée: Verification of Absence of Runtime Error. In Building the Information Society, Renè Jacquart (Ed.). Springer US, Boston, MA, 385–392.
[41] Yaron Minsky, Anil Madhavapeddy, and Jason Hickey. 2013. Real World OCaml: Functional Programming for the Masses. O’Reilly. https://realworldocaml.org
[42] MLC 2020. Intel Memory Latency Checker v3.9. https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html
[43] net/http 2020. HTTP client and server implementations in Go. https://golang.org/pkg/net/http/
[44] Lasse R. Nielsen. 2001. A Selective CPS Transformation. Electronic Notes in Theoretical Computer Science 45 (2001), 311–331. https://doi.org/10.1016/S1571-0661(04)80969-1 MFPS 2001, Seventeenth Conference on the Mathematical Foundations of Programming Semantics.
[45] OCaml Manual 2020. Extensible variant types. https://caml.inria.fr/pub/docs/manual-ocaml/extensiblevariants.html
[46] Gordon Plotkin and Matija Pretnar. 2009. Handlers of Algebraic Effects. In Programming Languages and Systems, Giuseppe Castagna (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 80–94.
[47] 1998. Revised^5 Report on the Algorithmic Language Scheme. Higher-Order and Symbolic Computation 11, 1 (Aug. 1998), 7–105. https://doi.org/10.1023/A:1010051815785
[48] Philipp Schuster, Jonathan Immanuel Brachthäuser, and Klaus Ostermann. 2020. Compiling Effect Handlers in Capability-Passing Style. Proc. ACM Program. Lang. 4, ICFP, Article 93 (Aug. 2020), 28 pages. https://doi.org/10.1145/3408975
[49] Dorai Sitaram. 2003. Unwind-protect in portable Scheme. In Proceedings of the 4th Workshop on Scheme and Functional Programming (7 Nov. 2003), M. Flatt, Ed., no. UUCS-03-023 in Tech. Rep., School of Computing, University of Utah. 48–52.
[50] KC Sivaramakrishnan, Stephen Dolan, Leo White, Sadiq Jaffer, Tom Kelly, Anmol Sahoo, Sudha Parimala, Atul Dhiman, and Anil Madhavapeddy. 2020. Retrofitting Parallelism onto OCaml. Proc. ACM Program. Lang. 4, ICFP, Article 113 (Aug. 2020), 30 pages. https://doi.org/10.1145/3408995
[51] Stdlib.Fun.protect 2020. Unwind-protect in the OCaml 4.10.0 standard library. https://caml.inria.fr/pub/docs/manual-ocaml/libref/Fun.html#exception
[52] Guy L. Steele. 1990. Common LISP: The Language (2nd Ed.). Digital Press, USA. https://www.cs.cmu.edu/Groups/AI/html/cltl/cltl2.html
[53] Swift 2020. Swift Concurrency Roadmap. https://forums.swift.org/t/swift-concurrency-roadmap/41611
[54] Jérôme Vouillon. 2008. Lwt: A Cooperative Thread Library. In Proceedings of the 2008 ACM SIGPLAN Workshop on ML (Victoria, BC, Canada) (ML ’08). Association for Computing Machinery, New York, NY, USA, 3–12. https://doi.org/10.1145/1411304.1411307
[55] Wasm Effect Handlers 2020. Typed continuations to model stacks. https://github.com/WebAssembly/design/issues/1359
[56] Wrk2 2020. A constant throughput, correct latency recording variant of wrk. https://github.com/giltene/wrk2