Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform

Phil Trinder, Natalia Chechina, Nikolaos Papaspyrou, Konstantinos Sagonas, Simon Thompson, Stephen Adams, Stavros Aronis, Robert Baker, Eva Bihari, Olivier Boudeville, Francesco Cesarini, Maurizio Di Stefano, Sverker Eriksson, Viktoria Fordos, Amir Ghaffari, Aggelos Giantsios, Rickard Green, Csaba Hoch, David Klaftenegger, Huiqing Li, Kenneth Lundin, Kenneth MacKenzie, Katerina Roukounaki, Yiannis Tsiouris, Kjell Winblad

RELEASE EU FP7 STREP (287510) project, www.release-project.eu/

April 25, 2017

Abstract

Distributed actor languages are an effective means of constructing scalable reliable systems, and Erlang has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems, and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model. We systematically study the scalability limits of Erlang, and then address the issues at the virtual machine, language and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues, and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug we have developed and made open source releases of five complementary tools, some specific to SD Erlang. Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale (256 hosts, 6144 cores).

I. Introduction

Distributed programming languages and frameworks are central to engineering large scale systems, where key properties include scalability and reliability. By scalability we mean that performance increases as hosts and cores are added, and by large scale we mean architectures with hundreds of hosts and tens of thousands of cores. Experience with high performance and data centre computing shows that reliability is critical at these scales, e.g. host failures alone account for around one failure per hour on commodity servers with approximately 10^5 cores [11]. To be usable, programming languages employed on them must be supported by a suite of deployment, monitoring, refactoring and testing tools that work at scale.

Controlling shared state is the only way to build reliable scalable systems. State shared by multiple units of computation limits scalability due to high synchronisation and communication costs. Moreover shared state is a threat for reliability as failures corrupting or permanently locking shared state may poison the entire system.

Actor languages avoid shared state: actors or processes have entirely local state, and only interact with each other by sending messages [2]. Recovery is facilitated in this model, since actors, like processes, can fail independently without affecting the state of other actors. Moreover an actor can supervise other actors, detecting failures and taking remedial action, e.g. restarting the failed actor.

Erlang [6, 19] is a beacon language for reliable scalable computing with a widely emulated distributed actor model. It has influenced the design of numerous programming languages like Clojure [43] and F# [74], and many languages have Erlang-inspired actor frameworks, e.g. Kilim for Java [73], Cloud Haskell [31], and Akka for C#, F# and Scala [64]. Erlang is widely used for building reliable scalable servers, e.g. Ericsson's AXD301 telephone exchange (switch) [79], the Facebook chat server, and the Whatsapp instant messaging server [77].

In Erlang, the actors are termed processes and are managed by a sophisticated Virtual Machine on a single multicore or NUMA host, while distributed Erlang provides relatively transparent distribution over networks of VMs on multiple hosts. Erlang is supported by the Open Telecom Platform (OTP) libraries that capture common patterns of reliable distributed computation, such as the client-server pattern and process supervision. Any large-scale system needs scalable persistent storage and, following the CAP theorem [39], Erlang uses and indeed implements Dynamo-style NoSQL DBMS like Riak [49] and Cassandra [50].

While the Erlang distributed actor model conceptually provides reliable scalability, it has some inherent scalability limits, and indeed large-scale distributed Erlang systems must depart from the distributed Erlang paradigm in order to scale, e.g. not maintaining a fully connected graph of hosts. The EU FP7 RELEASE project set out to establish and address the scalability limits of the Erlang reliable distributed actor model [67].

After outlining related work (Section II) and the benchmarks used throughout the article (Section III), we investigate the scalability limits of Erlang/OTP, seeking to identify specific issues at the virtual machine, language and persistent storage levels (Section IV). We then report the RELEASE project work to address these issues, working at the following three levels.

1. We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level reliability and scalability issues. An operational semantics is provided for the key new s_group construct, and the implementation is validated against the semantics (Section V).


2. We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have improved the shared ETS tables, time management, and load balancing between schedulers. Most of these improvements are now included in the Erlang/OTP release, currently downloaded approximately 50K times each month (Section VI).

3. To facilitate the development of scalable Erlang systems, and to make them maintainable, we have developed three new tools: Devo, SDMon and WombatOAM, and enhanced two others: the visualisation tool Percept, and the refactorer Wrangler. The tools support refactoring programs to make them more scalable, easier to deploy at large scale (hundreds of hosts), and easier to monitor and visualise their behaviour. Most of these tools are freely available under open source licences; the WombatOAM deployment and monitoring tool is a commercial product (Section VII).

Throughout the article we use two benchmarks to investigate the capabilities of our new technologies and tools. These are a computation in symbolic algebra, more specifically an algebraic 'orbit' calculation that exploits a non-replicated distributed hash table, and an Ant Colony Optimisation (ACO) parallel search program (Section III).

We report on the reliability and scalability implications of our new technologies using Orbit, ACO, and other benchmarks. We use a Chaos Monkey instance [14] that randomly kills processes in the running system to demonstrate the reliability of the benchmarks and to show that SD Erlang preserves the Erlang language-level reliability model. While we report measurements on a range of NUMA and cluster architectures as specified in Appendix A, the key scalability experiments are conducted on the Athos cluster with 256 hosts and 6144 cores. Having established scientifically the folklore limitations of around 60 connected hosts/nodes for distributed Erlang systems in Section IV, a key result is to show that the SD Erlang benchmarks exceed this limit and do not reach their limits on the Athos cluster (Section VIII).

Contributions. This article is the first systematic presentation of the coherent set of technologies for engineering scalable reliable Erlang systems developed in the RELEASE project.

Section IV presents the first scalability study covering Erlang VM, language, and storage scalability. Indeed we believe it is the first comprehensive study of any distributed actor language at this scale (100s of hosts, and around 10K cores). Individual scalability studies, e.g. into Erlang VM scaling [8], or language and storage scaling, have appeared before [38, 37].

At the language level the design, implementation and validation of the new libraries (Section V) have been reported piecemeal [21, 60], and are included here for completeness. While some of the improvements made to the Erlang Virtual Machine (Section i) have been thoroughly reported in conference publications [66, 46, 47, 69, 70, 71], others are reported here for the first time (Sections iii, ii). In Section VII, the WombatOAM and SDMon tools are described for the first time, as is the revised Devo system and visualisation. The other tools for profiling, debugging and refactoring developed in the project have previously been published piecemeal [52, 53, 55, 10], but this is their first unified presentation.

All of the performance results in Section VIII are entirely new, although a comprehensive study of SD Erlang performance is now available in a recent article by [22].


II. Context

i. Scalable Reliable Programming Models

There is a plethora of shared memory concurrent programming models like PThreads or Java threads, and some models, like OpenMP [20], are simple and high level. However synchronisation costs mean that these models generally do not scale well, often struggling to exploit even 100 cores. Moreover, reliability mechanisms are greatly hampered by shared state: for example, a lock becomes permanently unavailable if the thread holding it fails.

The High Performance Computing (HPC) community build large-scale (10^6 core) distributed memory systems using the de facto standard MPI communication libraries [72]. Increasingly these are hybrid applications that combine MPI with OpenMP. Unfortunately, MPI is not suitable for producing general purpose concurrent software as it is too low level with explicit message passing. Moreover, the most widely used MPI implementations offer no fault recovery:¹ if any part of the computation fails, the entire computation fails. Currently the issue is addressed by using what is hoped to be highly reliable computational and networking hardware, but there is intense research interest in introducing reliability into HPC applications [33].

Server farms use commodity computational and networking hardware, and often scale to around 10^5 cores, where host failures are routine. They typically perform rather constrained computations, e.g. Big Data Analytics, using reliable frameworks like Google MapReduce [26] or Hadoop [78]. The idempotent nature of the analytical queries makes it relatively easy for the frameworks to provide implicit reliability: queries are monitored and failed queries are simply re-run. In contrast, actor languages like Erlang are used to engineer reliable general purpose computation, often recovering failed stateful computations.

¹ Some fault tolerance is provided in less widely used MPI implementations like [27].

ii. Actor Languages

The actor model of concurrency consists of independent processes communicating by means of messages sent asynchronously between processes. A process can send a message to any other process for which it has the address (in Erlang the "process identifier" or pid), and the remote process may reside on a different host. While the notion of actors originated in AI [42], it has been used widely as a general metaphor for concurrency, as well as being incorporated into a number of niche programming languages in the 1970s and 80s. More recently it has come back to prominence through the rise of not only multicore chips but also larger-scale distributed programming in data centres and the cloud.

With built-in concurrency and data isolation, actors are a natural paradigm for engineering reliable scalable general-purpose systems [1, 41]. The model has two main concepts: actors that are the unit of computation, and messages that are the unit of communication. Each actor has an address-book that contains the addresses of all the other actors it is aware of. These addresses can be either locations in memory, direct physical attachments, or network addresses. In a pure actor language, messages are the only way for actors to communicate. After receiving a message an actor can do the following: (i) send messages to another actor from its address-book, (ii) create new actors, or (iii) designate a behaviour to handle the next message it receives. The model does not impose any restrictions in the order in which these actions must be taken. Similarly, two messages sent concurrently can be received in any order. These features enable actor based systems to support indeterminacy and quasi-commutativity, while providing locality, modularity, reliability and scalability [41].

Actors are just one message-based paradigm, and other languages and libraries have related message passing paradigms. Recent example languages include Go [28] and Rust [62] that provide explicit channels, similar to actor mailboxes. Probably the most famous message passing library is MPI [72], with APIs for many languages and widely used on clusters and High Performance Computers. It is, however, arguable that the most important contribution of the actor model is the one-way asynchronous communication [41]. Messages are not coupled with the sender, and neither are they transferred synchronously to a temporary container where transmission takes place, e.g. a buffer, a queue, or a mailbox. Once a message is sent, the receiver is the only entity responsible for that message.

Erlang [6, 19] is the pre-eminent programming language based on the actor model, having a history of use in production systems, initially with its developer Ericsson and then more widely through open source adoption. There are now actor frameworks for many other languages; these include Akka for C#, F# and Scala [64], CAF² for C++, Pykka³, Cloud Haskell [31], PARLEY [51] for Python, and Termite Scheme [35], and each of these is currently under active use and development. Moreover, the recently defined Rust language [62] has a version of the actor model built in, albeit in an imperative context.

² http://actor-framework.org
³ http://pykka.readthedocs.org/en/latest/

iii. Erlang's Support for Concurrency

In Erlang, actors are termed processes, and virtual machines are termed nodes. The key elements of the actor model are: fast process creation and destruction; lightweight processes, e.g. enabling 10^6 concurrent processes on a single host with 8GB RAM; fast asynchronous message passing with copying semantics; process monitoring; strong dynamic typing; and selective message reception.

By default Erlang processes are addressed by their process identifier (pid), e.g.

    Pong_PID = spawn(fun some_module:pong/0)

spawns a process to execute the anonymous function given as argument to the spawn primitive, and binds Pong_PID to the new process identifier. Here the new process will execute the pong/0 function which is defined in some_module. A subsequent call

    Pong_PID ! finish

sends the message finish to the process identified by Pong_PID. Alternatively, processes can be given names using a call of the form:

    register(my_funky_name, Pong_PID)

which registers this process name in the node's process name table if not already present. Subsequently, these names can be used to refer to or communicate with the corresponding processes (e.g. send them a message):

    my_funky_process ! hello.

A distributed Erlang system executes on multiple nodes, and the nodes can be freely deployed across hosts, e.g. they can be located on the same or different hosts. To help make distribution transparent to the programmer, when any two nodes connect they do so transitively, sharing their sets of connections. Without considerable care this quickly leads to a fully connected graph of nodes. A process may be spawned on an explicitly identified node, e.g.

    Remote_Pong_PID = spawn(some_node, fun some_module:pong/0).

After this, the remote process can be addressed just as if it were local. It is a significant burden on the programmer to identify the remote nodes in large systems, and we will return to this in Sections ii and i.2.
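These fragments assemble into a complete example. The sketch below is ours rather than the article's (only the pong/0 function in some_module is assumed above); it shows a minimal server loop that answers pings until told to finish:

    -module(some_module).
    -export([pong/0]).

    %% Minimal server loop: reply to each ping, stop on finish.
    pong() ->
        receive
            {ping, From} ->
                From ! pong,   % reply to the sending process
                pong();        % loop to await the next message
            finish ->
                ok             % terminate normally
        end.

A client spawns it with Pong_PID = spawn(fun some_module:pong/0), sends Pong_PID ! {ping, self()}, and eventually Pong_PID ! finish.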


iv. Scalability and Reliability in Erlang Systems

Erlang was designed to solve a particular set of problems, namely those in building telecommunications infrastructure, where systems need to be scalable to accommodate hundreds of thousands of calls concurrently, in soft real-time. These systems need to be highly available and reliable: i.e. to be robust in the case of failure, which can come from software or hardware faults. Given the inevitability of the latter, Erlang adopts the "let it fail" philosophy for error handling. That is, encourage programmers to embrace the fact that a process may fail at any point, and have them rely on the supervision mechanism, discussed shortly, to handle the failures.

Figure 1: Conceptual view of Erlang's concurrency, multicore support and distribution.

Figure 1 illustrates Erlang's support for concurrency, multicores and distribution. Each Erlang node is represented by a yellow shape, and each rectangle represents a host with an IP address. Each red arc represents a connection between Erlang nodes. Each node can run on multiple cores, and exploit the inherent concurrency provided. This is done automatically by the VM, with no user intervention needed. Typically each core has an associated scheduler that schedules processes; a new process will be spawned on the same core as the process that spawns it, but work can be moved to a different scheduler through a work-stealing allocation algorithm. Each scheduler allows a process that is ready to compute at most a fixed number of computation steps before switching to another. Erlang built-in functions or BIFs are implemented in C, and at the start of the project were run to completion once scheduled, causing performance and responsiveness problems if the BIF had a long execution time.

Scaling in Erlang is provided in two different ways. It is possible to scale within a single node by means of the multicore virtual machine exploiting the concurrency provided by the multiple cores or NUMA nodes. It is also possible to scale across multiple hosts using multiple distributed Erlang nodes.

Reliability in Erlang is multi-faceted. As in all actor languages each process has private state, preventing a failed or failing process from corrupting the state of other processes. Messages enable stateful interaction, and contain a deep copy of the value to be shared, with no references (e.g. pointers) to the senders' internal state. Moreover Erlang avoids type errors by enforcing strong typing, albeit dynamically [7]. Connected nodes check liveness with heartbeats, and can be monitored from outside Erlang, e.g. by an operating system process.

However, the most important way to achieve reliability is supervision, which allows a process to monitor the status of a child process and react to any failure, for example by spawning a substitute process to replace a failed process. Supervised processes can in turn supervise other processes, leading to a supervision tree. The supervising and supervised processes may be in different nodes, and on different hosts, and hence the supervision tree may span multiple hosts or nodes.

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).
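In OTP, supervision trees are built with the supervisor behaviour. The following minimal sketch is ours, assuming a hypothetical pong_server module with a start_link/0 function:

    -module(pong_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, pong_sup}, ?MODULE, []).

    %% one_for_one: restart only the failed child, at most 5 times
    %% in 10 seconds before the supervisor itself fails upwards
    %% in the supervision tree.
    init([]) ->
        {ok, {{one_for_one, 5, 10},
              [{pong_server,                    % child id
                {pong_server, start_link, []},  % {Module, Function, Args}
                permanent, 5000, worker, [pong_server]}]}}.

With this strategy only the failed child is restarted; because clients address the server by a registered name rather than its pid, they are unaffected by the restart, which is exactly the reliability idiom discussed next.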

To see global registration in action, consider a pong server process:

    global:register_name(pong_server, Remote_Pong_PID).

Clients of the server can send messages to the registered name, e.g.

    global:whereis_name(pong_server) ! finish.

If the server fails the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.

v. ETS: Erlang Term Storage

Erlang is a pragmatic language and the actor model it supports is not pure. Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries and underlies the in-memory Mnesia database. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry; atomically inserts two elements with keys some_key and 42; updates the value associated with the table entry with key 42; and then looks up this entry.

    Table = ets:new(my_table, [set, public, {keypos, 1}]),
    ets:insert(Table, [{some_key, an_atom_value},
                       {42, {a,tuple,value}}]),
    ets:update_element(Table, 42, [{2, {another,tuple,value}}]),
    [{Key, Value}] = ets:lookup(Table, 42).

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.
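That contention is easy to reproduce. The micro-benchmark below is our own sketch, not from the article: N writer processes hammer a single public table with atomic counter updates, so adding schedulers mainly adds contention on the one shared table.

    -module(ets_stress).
    -export([run/2]).

    %% Spawn N writers that each increment a shared counter Iters times.
    run(N, Iters) ->
        Tab = ets:new(counter, [set, public]),
        ets:insert(Tab, {hits, 0}),
        Parent = self(),
        Pids = [spawn(fun() -> loop(Tab, Iters), Parent ! {self(), done} end)
                || _ <- lists:seq(1, N)],
        [receive {Pid, done} -> ok end || Pid <- Pids],
        ets:lookup(Tab, hits).

    loop(_Tab, 0) -> ok;
    loop(Tab, I) ->
        ets:update_counter(Tab, hits, 1),  % atomic read-modify-write on shared state
        loop(Tab, I - 1).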


III. Benchmarks for scalability and reliability

The two benchmarks that we use throughout this article are Orbit, that measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), that allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks/. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i. Orbit

Orbit is a computation in symbolic algebra, which generalises a transitive closure computation [57]. To compute the orbit for a given space [0..X], a list of generators g1, g2, ..., gn are applied on an initial vertex x0 ∈ [0..X]. This creates new values (x1, ..., xn) ∈ [0..X], where xi = gi(x0). The generator functions are applied on the new values until no new value is generated.
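Stripped of distribution, the computation is a worklist fixpoint over a set. The following sequential sketch is ours (the benchmark itself partitions the set over a distributed hash table):

    -module(orbit_seq).
    -export([orbit/2]).

    %% orbit(Gens, Xs0): repeatedly apply every generator to the
    %% frontier of newly discovered values until no new value appears.
    orbit(Gens, Xs0) ->
        worklist(Gens, Xs0, sets:from_list(Xs0)).

    worklist(_Gens, [], Seen) ->
        sets:to_list(Seen);                      % fixpoint reached
    worklist(Gens, Frontier, Seen) ->
        Candidates = [G(X) || G <- Gens, X <- Frontier],
        Fresh = lists:usort([Y || Y <- Candidates,
                                  not sets:is_element(Y, Seen)]),
        worklist(Gens, Fresh, sets:union(Seen, sets:from_list(Fresh))).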

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set and, in distributed environments, is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover it is only a few hundred lines of code.

Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.

ii. Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a meta-heuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article, we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

We implement four distributed coordination patterns for the same multi-colony ACO computation as follows.


Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

In each implementation, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step, each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying the TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO if a single colony fails to report back the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1, and the architecture of SR-ACO is discussed in detail there.
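The collect-and-broadcast step shared by all four patterns can be sketched as a master loop. This is our illustration; the {best, ...} and {global_best, ...} message shapes are assumptions, not taken from the benchmark sources:

    %% One global iteration of a TL-ACO-style master: ask each colony
    %% to run its local iterations, gather their best solutions,
    %% select the global best, and broadcast it back.
    master_iterate(Colonies) ->
        [C ! {run_local_iterations, self()} || C <- Colonies],
        Results = [receive {best, C, Cost, Sol} -> {Cost, Sol} end
                   || C <- Colonies],
        {BestCost, BestSol} = lists:min(Results),
        [C ! {global_best, BestCost, BestSol} || C <- Colonies],
        {BestCost, BestSol}.

GR-ACO guards the blocking receive here by supervising the colonies, so a crashed colony is restarted rather than leaving the master waiting forever.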


IV. Erlang Scalability Limits

This section investigates the scalability of Erlang at the VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i. Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.⁴ BenchErl shows how an application's performance changes when resources, like cores or schedulers, are added; or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;
• the number of cores per node;
• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;
• the Erlang/OTP release and flavor; and
• the command-line arguments used to start the Erlang nodes.

⁴ Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl/.

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01) but also revealed VM and language level scalability bottlenecks.

Figure 5 shows runtime and speedup curves for the Orbit benchmark where master and workers run on a single Erlang node in configurations with and without intra-worker parallelism. In both configurations the program scales. Runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores but


increases less rapidly from that point on; we see a similar but more clearly visible pattern for the other configuration (red curve) where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Figure 6: Runtime and speedup of the ets_test BenchErl benchmark using Erlang/OTP R15B01.

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run in many VM schedulers (threads). For example the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling like limited parallelism in the application, or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark, that spawns some n processes, each of which creates a list of m timestamps and, after it checks that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown, thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01.
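The kernel of parallel can be reconstructed as follows (a sketch, not BenchErl's exact code). Every erlang:now/0 call contends for the same global lock; OTP 18 later introduced scalable time primitives such as erlang:monotonic_time/0 that avoid this bottleneck:

    %% n processes each build a list of m timestamps and check that
    %% the list is strictly increasing.
    run(N, M) ->
        Parent = self(),
        Pids = [spawn(fun() ->
                          Ts = [erlang:now() || _ <- lists:seq(1, M)],
                          true = strictly_increasing(Ts),
                          Parent ! {self(), ok}
                      end) || _ <- lists:seq(1, N)],
        [receive {Pid, ok} -> ok end || Pid <- Pids].

    strictly_increasing([A, B | Rest]) ->
        A < B andalso strictly_increasing([B | Rest]);
    strictly_increasing(_) -> true.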


Discussion Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii. Distributed Erlang Scalability

Network Connectivity Costs When any normal distributed Erlang nodes communicate, they share their connection sets and this typically leads to a fully connected graph of nodes. So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution as there is no need to discover nodes, and the design works well for small numbers of nodes. However at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters: Kalkyl and Athos as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network. It allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node, include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look up in the local name table.
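The three classes correspond to calls of roughly the following shape (our illustration, not DE-Bench's actual code):

    %% (i) P2P: run a function on an explicitly named remote node.
    p2p_cmd(Node) ->
        rpc:call(Node, erlang, length, [[1, 2, 3]]).

    %% (ii) Global: registration synchronises every connected node.
    global_cmd(Name, Pid) ->
        yes = global:register_name(Name, Pid).

    %% (iii) Local: name lookup reads the local replica of the table.
    local_cmd(Name) ->
        global:whereis_name(Name).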


Figure 8: DE-Bench's internal workflow.

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10) and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm. Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands and shows that, while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed; and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system.

These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this becomes outdated.


Figure 9: Scalability vs. percentage of global commands in Distributed Erlang.

Discussion Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability, for example Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii. Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.


Figure 10: Latency of commands as the number of Erlang nodes increases.

Discussion We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically an Erlang Cassandra interface is available and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover the scalability of Riak is much improved in subsequent versions.

V. Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i. SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of name spaces, i.e. they minimise global information. Specifically names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process, Pid, with name Name in s_group SGroupName we use the following function; the name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group.

    s_group:register_name(SGroupName, Name, Pid) -> yes | no


Table 1: Summary of s_group functions.

new_s_group(SGroupName, Nodes): creates a new s_group consisting of some nodes.
delete_s_group(SGroupName): deletes an s_group.
add_nodes(SGroupName, Nodes): adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes): removes a list of nodes from an s_group.
s_groups(): returns a list of all s_groups known to the node.
own_s_groups(): returns a list of s_group tuples the node belongs to.
own_nodes(): returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName): returns a list of nodes from the given s_group.
info(): returns s_group state information.
register_name(SGroupName, Name, Pid): registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid): re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name): unregisters a name in the given s_group.
registered_names({node, Node}): returns a list of all registered names on the given node.
registered_names({s_group, SGroupName}): returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name), whereis_name(Node, SGroupName, Name): return the pid of a name registered in the given s_group.
send(SGroupName, Name, Msg), send(Node, SGroupName, Name, Msg): send a message to a name registered in the given s_group.

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes while the throughput of SD Erlang continues to grow linearly.
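Putting Table 1 to work, a deployment function might create a group and register a server as follows. This is a sketch: the node names are placeholders, the nodes are assumed to be already running, and the return shapes follow Table 1 and the create_s_group/3 example later in this section.

    %% Create an s_group of two nodes and register a server in it.
    {ok, group1, _Nodes} =
        s_group:new_s_group(group1, ['node1@host1', 'node2@host2']),
    yes = s_group:register_name(group1, pong_server, Pong_PID),

    %% The name is synchronised only within group1; a client on a
    %% group1 node resolves and uses it with:
    Pid = s_group:whereis_name(group1, pong_server),
    Pid ! finish.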



Figure 11: Global operations in Distributed Erlang vs. SD Erlang.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, or reducing the namespace, or both) s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction. For example many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Submaster nodes (Figure 13).



Figure 12: SD Erlang ACO (SR-ACO) architecture.

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                Master ! {GroupName, Nodes};
            _ ->
                io:format("exception: message")
        end.

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered with s_group:register_name/3 only on the nodes of the s_group.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

    spawn(some_node, fun some_module:pong/0).

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems discussed in Section ii.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node we can use the attr:choose_node/1 function to define the target node, i.e.

    spawn(attr:choose_node(Params), fun some_module:pong/0).
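The idea can be illustrated with a toy selector of our own devising; this is not the API of the libraries of [60], which obtain attribute and distance information automatically:

    %% Toy semi-explicit placement: AttrFun maps a node to a property
    %% map; choose the first connected node satisfying Pred, falling
    %% back to the local node.
    choose_node(Pred, AttrFun) ->
        case [N || N <- [node() | nodes()], Pred(AttrFun(N))] of
            [Node | _] -> Node;
            []         -> node()
        end.

    %% Example: place a compute-heavy process on a node advertising
    %% at least eight cores (the #{cores := _} attribute is assumed):
    %% spawn(choose_node(fun(#{cores := C}) -> C >= 8 end, AttrFun),
    %%       fun some_module:pong/0).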


Figure 13: SD Erlang (SD-Orbit) architecture.

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.

ii. S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids.

The state of a system is modelled as a four tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node, i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id, pid, pairs and is replicated on all nodes of the associated group.

A node has the following four parameters: node_id identifier, node_type that can be either hidden or normal, connections, and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state′, value), meaning that executing command on node ni in state returns value and transitions to state′.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15).


(grs, fgs, fhs, nds) ∈ {state} ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ (s_group_name, {node_id}, namespace)
fg ∈ fgs ≡ {free_group} ≡ ({node_id}, namespace)
fh ∈ fhs ≡ {free_hidden_group} ≡ (node_id, namespace)
nd ∈ nds ≡ {node} ≡ (node_id, node_type, connections, gr_names)
gs ∈ {gr_names} ≡ NoGroup, {s_group_name}
ns ∈ {namespace} ≡ {(name, pid)}
cs ∈ {connections} ≡ {node_id}
nt ∈ {node_type} ≡ {Normal, Hidden}
s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21].

Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, and false otherwise; and OutputNms returns the set of process names registered in the namespace ns of s_group s.

iii. Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First an abstract state machine, embedded as an "eqc_statem" module, is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied.

Figure 16: Testing s_groups using QuickCheck.
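Returning to the executable semantics: a minimal sketch of how the registered_names rule of Figure 15 (below) can be rendered as an Erlang function, with lists standing in for sets. The representation is ours for illustration, not the project's actual specification module.

    %% Executable rendering of s_group:registered_names/1 over the
    %% state four-tuple of Figure 14; sets are represented as lists.
    registered_names({SGroups, _FGs, _FHs, _Nds} = State, S, Ni) ->
        case lists:keyfind(S, 1, SGroups) of
            {S, NodeIds, Ns} ->
                case lists:member(Ni, NodeIds) of
                    true  -> {State, [{S, Nm} || {Nm, _Pid} <- Ns]};
                    false -> {State, []}
                end;
            false ->
                {State, []}
        end.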


    ((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
        −→ ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
        −→ ((grs, fgs, fhs, nds), {})     otherwise

    where (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
          nms ≡ OutputNms(s, ns)

    IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs

    OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations. Test oracles are encoded as the postcondition for s_group operations.

During testing, each test command is applied both to the abstract model and to the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state, as described by the transition functions; whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for that command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one generic constraint that applies to all the s_group operations is that after each s_group operation the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors, all subsequently corrected, were found. Some errors in the library implementation were found, including one error due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent: an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

VI. Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.


i. Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks, and the default number of reader groups, were both increased from 16 to 64.

Figure 17: Runtime of 17M ETS operations in different Erlang/OTP releases: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis is the number of OS threads (VM schedulers).

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions. Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns of the ETS implementation for better scalability: (1) allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; (2) using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; (3) using queue delegation locking to improve scalability [47]; and (4) adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].
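As a concrete illustration of the concurrency options discussed above, the following sketch creates a hash-based set table with the (real) read_concurrency and write_concurrency options, and runs the kind of mixed lookup/insert/delete workload that ets_bench measures. The module structure and operation mix are our illustrative assumptions, not the benchmark source.

    %% A set table with fine-grained bucket locking (write_concurrency)
    %% and reader groups (read_concurrency) enabled.
    bench() ->
        T = ets:new(bench_table, [set, public,
                                  {read_concurrency, true},
                                  {write_concurrency, true}]),
        [true = ets:insert(T, {K, K}) || K <- lists:seq(1, 1000000)],
        %% ~10% updates: 1 insert and 1 delete per 18 lookups.
        do_ops(T, 17000000).

    do_ops(_T, 0) -> ok;
    do_ops(T, N) ->
        K = rand:uniform(1000000),
        case N rem 20 of
            0 -> ets:insert(T, {K, N});
            1 -> ets:delete(T, K);
            _ -> ets:lookup(T, K)
        end,
        do_ops(T, N - 1).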


Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups (RG): 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis is the number of schedulers.

Figure 19: Throughput of CA tree variants, set and ordered_set (operations per microsecond vs. number of threads): 10% updates (left) and 1% updates (right).

Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is because hash tables use fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks to the current contention level, and also to the parts of the key range where contention is occurring.
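A note on how this line of work later surfaced in the mainstream API: from Erlang/OTP 22 (to the best of our knowledge; this post-dates the releases discussed here) an ordered_set table with write_concurrency enabled is backed by a contention-adapting tree, so the adaptive locking described above is obtained with a one-line table option.

    %% ordered_set + write_concurrency selects the CA-tree-backed
    %% implementation in recent Erlang/OTP releases (hedged: OTP 22+).
    ca_table() ->
        T = ets:new(ca_demo, [ordered_set, public,
                              {write_concurrency, true}]),
        true = ets:insert(T, {some_key, 1}),
        T.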

ii. Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT-thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores: an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure iii.2.

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.

iii. Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Figure i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", or time since Epoch, with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this. For example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee.

The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59], i.e. a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do this. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.

Time Retrieval
Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe, but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is solely based on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP). That is, they will run at different frequencies. Linux is an exception, with a monotonic clock that is NTP adjusted and runs at the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment to use at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map it to exactly the same Erlang monotonic time. This requires synchronisation on updates to the adjustment information using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line, we use a bespoke reader-optimised RW lock implementation where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the SNZI algorithm of [30].

Timer Wheel and BIF Timer
The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables and only insert one timer at a time into the timer wheel.
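The new API can be used directly; the following minimal sketch uses only BIFs that exist in Erlang/OTP 18 and later.

    %% Erlang system time (in native units) is monotonic time plus the
    %% dynamically varying time offset.
    system_time() ->
        erlang:monotonic_time() + erlang:time_offset().

    %% A scalable replacement for the old idiom of calling erlang:now/0
    %% to obtain unique, strictly increasing integers.
    unique_id() ->
        erlang:unique_integer([monotonic, positive]).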


Benchmarks
We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs (see §2.5.4 of the RELEASE project Deliverable 2.4: http://release-project.eu/documents/D2.4.pdf). We present three of them here.

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value; a sketch of this pattern appears after Figure 20. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, the total execution time with the optimised timers is only 5% longer than without timers.

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4, and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: 1. the performance of time management remained roughly unchanged between Erlang/OTP releases prior to 18.0; 2. the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck even when using erlang:now/0; and 3. the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

Figure 20: The parallel BenchErl benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).
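For reference, the code pattern the first micro benchmark times looks like the following; the function is our illustration, not the benchmark source.

    %% A receive with a timeout: blocking here sets a timer, and the
    %% arrival of a message cancels it. Without the 'after' clause no
    %% timer is involved.
    wait_for_msg(Timeout) ->
        receive
            Msg -> {ok, Msg}
        after Timeout ->
            timeout
        end.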

VII. Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together as appropriate for solving the problem at hand, rather than as a single, monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine, through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer application (http://www.erlang.org/doc/apps/observer/) gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so can other actor frameworks, e.g. Akka can use the Kamon JVM monitoring system (http://kamon.io). Similarly, tools for other actor languages or frameworks could use data derived through probes of the OS-level tracing frameworks DTrace (http://dtrace.org/blogs/about/) and SystemTap (https://sourceware.org/systemtap/wiki), as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i. Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare a program for modification or extension, or, as is the case here, in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool. There are a number of tools that support refactoring in Erlang: in the RELEASE project we chose to extend Wrangler (http://www.cs.kent.ac.uk/projects/wrangler/) [56]; other tools include Tidier [68] and RefactorErl [44].

Supporting API Migration
The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically.

The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work and the technicalities of the implementation can be found in a paper by [52].

Support for Introducing Parallelism
We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach (a sketch of such a parallel map appears at the end of this subsection). The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently, using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot, due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.

Support for Program Slicing
Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE (http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf) [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.



Figure 21: The architecture of WombatOAM.

ii. Scalable Deployment

We have developed the WombatOAM tool (a commercial product available from Erlang Solutions Ltd.: https://www.erlang-solutions.com/products/wombat-oam.html) to provide a deployment, operations and maintenance framework for large-scale Erlang distributed systems. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP.

Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture
The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes.

As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers, which engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services
WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics
WombatOAM supports the collection of some hundred metrics on a regular basis from the Erlang VMs running on each of the hosts, including, for instance, the number of processes on a node and the message traffic in and out of the node. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom (which collects and publishes metrics through an Erlang API: https://github.com/boundary/folsom), and interface with other tools, such as Graphite (https://graphiteapp.org), which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications
As well as metrics, WombatOAM supports event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which are part of the standard distribution, or the lager logging framework (https://github.com/basho/lager), and are displayed and logged as they occur.

Alarms
Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared. They also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately. Alarms are generated by SASL or lager, just as for notifications.

Topology
The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a middle manager part, because it doesn't talk to the nodes directly: instead it asks the Node manager service to do so.

Node manager
This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration
This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with the cloud providers, the Orchestration service uses an external library called Libcloud (the unified cloud API [4]: https://libcloud.apache.org), for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM.


Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

Deployment
The deployment mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers that support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified; this enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines, and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes, a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to, or removed from, the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii. Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools, and adapting existing ones, was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes. (This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.)

iii.1 Percept2

Percept2 (https://github.com/RefactoringTools/percept2) builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems.


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways, as detailed by [53], including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including: the migration history of a process between run queues; statistics about message passing between processes (the number of messages and the average message size sent/received by a process); and the accumulated runtime per process (the accumulated time when a process is in a running state).

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.


Figure 23: Percept2: showing processes running and runnable (with 4 schedulers/run queues).

• Recording the dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows.

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled or disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.
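Percept2's data collection rests on the VM's built-in tracing. A minimal sketch of that underlying mechanism, using the standard dbg module (this is our illustration, not Percept2's actual collection code):

    %% Trace scheduling ('running'), process lifecycle ('procs') and
    %% message-passing events for all processes to a trace file.
    start_trace(File) ->
        {ok, _} = dbg:tracer(port, dbg:trace_port(file, File)),
        dbg:p(all, [running, procs, send, 'receive']).

    stop_trace() ->
        dbg:stop().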


iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks they are relatively lightweight, requiring neither specially recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the gathered data. Using these probes it is possible to identify bottlenecks both in the VM itself and in applications.

VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring; it shows how the sizes of the run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Figure 24: DTrace: run-queue visualisation of the bang benchmark with 16 schedulers (time on the x-axis).

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible: it uses a different mechanism for collecting information about Erlang programs, and a different format for the trace files, but the same storage infrastructure and presentation facilities.

iii.3 Devo

Devo (https://github.com/RefactoringTools/devo) [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes, grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and the height of each column, and process migrations are illustrated by a (fading) arc between the queues within the circle. A green arc shows migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.


On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

Figure 25: Devo: low- and high-level visualisations of SD-Orbit.

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.

An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the User Interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group or when the group is deleted, another agent takes over, as shown in Figure 27.


Figure 26: SD-Mon architecture.

When an agent is stopped, all traces on the controlled nodes are switched off. The monitoring network is also supervised in order to take account of network fragility: when an agent node goes down, another node is deployed to play its role. There are also periodic consistency checks for the system as a whole, and when an inconsistency is detected that part of the system can be restarted.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master and made available in a readable format in the master file system. SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups. This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

Figure 27: SD-Mon evolution, before (top) and after (bottom) eliminating s_group 2.


Figure 28: SD-Mon: online monitoring.

VIII. Systemic Evaluation

Preceding sections have investigated improvements to individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability, and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable (Deliverable 6.2: http://www.release-project.eu/documents/D6.2.pdf). The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impacts of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.
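Stated symbolically (our formulation of the standard definitions, which the text gives informally), with t(n) the runtime on n cores:

    RelativeSpeedup(n) = t(1) / t(n)    (strong scaling: fixed total work)
    t(n) ≈ t(1)                          (ideal weak scaling: work grows with n)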



Figure 29: Speedup of distributed Erlang vs. SD Erlang Orbit (2M and 5M elements) [strong scaling].

i. Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups: beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii. Ant Colony Optimisation (ACO)

Deployment
The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed, one per core (i.e. 24 nodes on each of 209 24-core Athos hosts), in 101s, i.e. 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been demand to do so, as most Erlang systems are long-running servers.



Figure 30: WombatOAM deployment time: experimental data and best fit 0.0125N + 83 seconds for N nodes.

The measurement data shows two important facts: WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of the effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability
SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, while SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters, and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery, at this failure frequency, has no measurable impact on runtime. This is because processes are recovered within the virtual machine using (globally synchronised) local recovery information. For example, on a common x86/Ubuntu platform typical Erlang process recovery times are around 0.3ms, around 1000x less than the Unix process recovery time on the same platform [58]. We have conducted more detailed experiments on an Instant Messenger benchmark and obtained similar results [23].

We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model.
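A hedged sketch of the per-node Chaos Monkey service described above: every second it kills one randomly chosen process on the local node. The code is our illustration; the experiments used the Chaos Monkey service of [14].

    %% Kill a random local process once per second, forever. A real
    %% service would also avoid killing vital system processes.
    chaos_loop() ->
        Pids = erlang:processes() -- [self()],
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),
        timer:sleep(1000),
        chaos_loop().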



Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [weak scaling].

The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability
Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability.

While ML-ACO does not provide reliability, and hence doesn't register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic to SR-ACO (the dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

iii. Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.


Figure 32: Number of sent packets in ML-ACO, GR-ACO, and SR-ACO.

iii. Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

As SD Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes respectively, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60 node scaling limits of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.

IX. Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic, VM-supported reliability mechanisms, makes it extremely easy to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language, and persistent storage levels, and have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language-level scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly, the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea: there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including the memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and the global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
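As an indication of what such validation looks like, here is a QuickCheck-style property of the kind one might write against the s_group semantics: a name registered in an s_group should be resolvable within it. This is our own illustrative reconstruction, assuming the Quviq eqc library [9, 24] and the s_group calls sketched earlier; it is not a property taken from the project test suite.

-module(s_group_props).
-include_lib("eqc/include/eqc.hrl").
-export([prop_register_then_whereis/0]).

%% Toy generator for names to register; a real suite would generate
%% whole command sequences against the operational semantics.
name_gen() ->
    elements([ant_master, ant_sub, ant_colony]).

%% Registering a name in an s_group makes it resolvable in that s_group.
prop_register_then_whereis() ->
    ?FORALL(Name, name_gen(),
            begin
                Pid = spawn(fun() -> receive stop -> ok end end),
                yes = s_group:register_name(test_group, Name, Pid),
                Found = s_group:whereis_name(test_group, Name),
                s_group:unregister_name(test_group, Name),   % clean up between tests
                Pid ! stop,
                Found =:= Pid
            end).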


To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the numbers of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
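At the application level, the relevant ETS tuning is exposed through standard table options. The fragment below shows the options that engage these concurrency mechanisms; which OTP release backs an ordered_set with the contention-adapting tree is stated only approximately, so treat the comments as assumptions and consult the documentation for a given release.

%% Standard ets options that reduce lock contention on shared tables.
create_tables() ->
    %% Hash-based table: finer-grained bucket locking and reader groups
    %% are engaged via the two concurrency options.
    Stats = ets:new(stats, [set, public,
                            {read_concurrency, true},
                            {write_concurrency, true}]),
    %% Ordered table: in OTP releases that include this work's
    %% contention-adapting tree, write_concurrency selects it.
    Index = ets:new(index, [ordered_set, public,
                            {write_concurrency, true}]),
    {Stats, Index}.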
To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially, we exceed the 60 node limit for distributed Erlang and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However, the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate the RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages: for example, the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network (see RELEASE Deliverable 6.7, http://www.release-project.eu/documents/D6.7.pdf, available online).
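Semi-explicit placement, which we suggest other actor frameworks could adopt, lets a program pick target nodes by attribute rather than by hostname. The sketch below is modelled on the description in [60]; choose_nodes/1 and its attribute tuple are hypothetical names for illustration, not a verbatim SD Erlang API.

%% Semi-explicit placement sketch: ask for candidate nodes by attribute,
%% then spawn on one of them, keeping the code portable across clusters.
spawn_on_suitable_node(SGroup, Fun) ->
    %% choose_nodes/1 and the {s_group, ...} attribute are assumptions.
    Candidates = s_group:choose_nodes([{s_group, SGroup}]),
    Node = lists:nth(rand:uniform(length(Candidates)), Candidates),
    spawn(Node, Fun).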


Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines. (1) An AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276 processors at 2.3GHz, each with 16 "Bulldozer" cores, giving a total of 64 cores. (2) An Intel NUMA machine with 128GB RAM and four Intel Xeon E5-4650 processors at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.

Table 2: Cluster Specifications (RAM per host in GB).

Name   | Hosts | Cores/host | Cores (total) | Cores (max avail.) | Processor                               | RAM    | Interconnection
GPG    | 20    | 16         | 320           | 320                | Intel Xeon E5-2640v2 8C, 2GHz           | 64     | 10GB Ethernet
Kalkyl | 348   | 8          | 2,784         | 1,408              | Intel Xeon 5520v2 4C, 2.26GHz           | 24–72  | InfiniBand 20Gb/s
TinTin | 160   | 16         | 2,560         | 2,240              | AMD Opteron 6220v2 Bulldozer 8C, 3.0GHz | 64–128 | 2:1 oversubscribed QDR InfiniBand
Athos  | 776   | 24         | 18,624        | 6,144              | Intel Xeon E5-2697v2 12C, 2.7GHz        | 64     | InfiniBand FDR14


Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 "SCIEnce: Symbolic Computing Infrastructure in Europe", IST-2011-287510 "RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software", and by the UK's Engineering and Physical Sciences Research Council grant EP/G055181/1 "HPC-GAP: High Performance Computational Algebra and Discrete Mathematics".

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.
[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.
[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).
[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org/.
[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.
[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.
[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.
[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.
[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.
[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.
[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.
[12] Basho Technologies. Riakdocs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.
[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.
[14] Cory Bennett and Ariel Tseitlin. Chaos Monkey released into the wild. Netflix Blog, 2012.
[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.
[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.
[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.
[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.
[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O'Reilly Media, 1st edition, 2009.
[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.
[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.
[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.
[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang'16, pages 33–41, New York, NY, USA, 2016. ACM.
[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.
[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley's ORLIB.
[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI'06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.
[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.
[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.
[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.
[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell '11, pages 118–129. ACM, 2011.
[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.
[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.
[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.
[36] Amir Ghaffari. DE-Bench, a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.
[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.
[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.
[39] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.
[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.
[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.
[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI'73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.
[43] Rich Hickey. The Clojure programming language. In DLS'08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.
[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.
[45] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.
[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.
[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.
[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.
[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP '10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.
[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.
[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiui.edu/parley/.
[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE'12. IEEE Computer Society, 2012.
[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.
[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.
[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM '15. ACM SIGPLAN, January 2015.
[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.
[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.
[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.
[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.
[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.
[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43(221):207–221, 1998.
[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.
[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.
[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.
[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.
[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.
[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.
[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP'09), pages 13–24. ACM, 2009.
[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proceedings of the ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.
[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computer Society, 2015.
[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing: 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.
[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP'08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.
[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.
[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.
[76] Marcus Völker. Linux timers, May 2014.
[77] WhatsApp, 2015. https://www.whatsapp.com/.
[78] Tom White. Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. O'Reilly, 3rd edition, revised and updated, 2012.
[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL'00), Aachen, Germany, September 2000. Springer-Verlag.