Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform

Phil Trinder, Natalia Chechina, Nikolaos Papaspyrou, Konstantinos Sagonas, Simon Thompson, Stephen Adams, Stavros Aronis, Robert Baker, Eva Bihari, Olivier Boudeville, Francesco Cesarini, Maurizio Di Stefano, Sverker Eriksson, Viktoria Fordos, Amir Ghaffari, Aggelos Giantsios, Rickard Green, Csaba Hoch, David Klaftenegger, Huiqing Li, Kenneth Lundin, Kenneth MacKenzie, Katerina Roukounaki, Yiannis Tsiouris, Kjell Winblad

RELEASE EU FP7 STREP (287510) project, www.release-project.eu/

April 25, 2017

Abstract

Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits, and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems, and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model. We systematically study the scalability limits of Erlang, and then address the issues at the virtual machine, language and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues, and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug we have developed and made open source releases of five complementary tools, some specific to SD Erlang. Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale (256 hosts, 6144 cores).

I. Introduction

Distributed programming languages and frameworks are central to engineering large scale systems, where key properties include scalability and reliability. By scalability we mean that performance increases as hosts and cores are added, and by large scale we mean architectures with hundreds of hosts and tens of thousands of cores. Experience with high performance and data centre computing shows that reliability is critical at these scales, e.g. host failures alone account for around one failure per hour on commodity servers with approximately 10^5 cores [11]. To be usable, programming languages employed on them must be supported by a suite of deployment, monitoring, refactoring and testing tools that work at scale.

Controlling shared state is the only way to build reliable scalable systems. State shared by multiple units of computation limits scalability due to high synchronisation and communication costs. Moreover shared state is a threat to reliability, as failures corrupting or permanently locking shared state may poison the entire system.

Actor languages avoid shared state: actors or processes have entirely local state, and only interact with each other by sending messages [2]. Recovery is facilitated in this model, since actors, like processes, can fail independently without affecting the state of other actors. Moreover an actor can supervise other actors, detecting failures and taking remedial action, e.g. restarting the failed actor.

Erlang [6, 19] is a beacon language for reliable scalable computing with a widely emulated distributed actor model. It has influenced the design of numerous programming languages like Clojure [43] and F# [74], and many languages have Erlang-inspired actor frameworks, e.g. Kilim for Java [73], Cloud Haskell [31], and Akka for C#, F# and Scala [64]. Erlang is widely used for building reliable scalable servers, e.g. Ericsson's AXD301 telephone exchange (switch) [79], the Facebook chat server, and the WhatsApp instant messaging server [77]. In Erlang, the actors are termed processes and are managed by a sophisticated Virtual Machine on a single multicore or NUMA host, while distributed Erlang provides relatively transparent distribution over networks of VMs on multiple hosts. Erlang is supported by the Open Telecom Platform (OTP) libraries that capture common patterns of reliable distributed computation, such as the client-server pattern and process supervision. Any large-scale system needs scalable persistent storage and, following the CAP theorem [39], Erlang uses and indeed implements Dynamo-style NoSQL DBMS like Riak [49] and Cassandra [50].

While the Erlang distributed actor model conceptually provides reliable scalability, it has some inherent scalability limits, and indeed large-scale distributed Erlang systems must depart from the distributed Erlang paradigm in order to scale, e.g. not maintaining a fully connected graph of hosts. The EU FP7 RELEASE project set out to establish and address the scalability limits of the Erlang reliable distributed actor model [67].

After outlining related work (Section II) and the benchmarks used throughout the article (Section III) we investigate the scalability limits of Erlang/OTP, seeking to identify specific issues at the virtual machine, language and persistent storage levels (Section IV). We then report the RELEASE project work to address these issues, working at the following three levels.

1. We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level reliability and scalability issues. An operational semantics is provided for the key new s_group construct, and the implementation is validated against the semantics (Section V).

2. We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have improved the shared ETS tables, time management, and load balancing between schedulers. Most of these improvements are now included in the Erlang/OTP release, currently downloaded approximately 50K times each month (Section VI).

3. To facilitate the development of scalable Erlang systems, and to make them maintainable, we have developed three new tools: Devo, SDMon and WombatOAM, and enhanced two others: the visualisation tool Percept, and the refactorer Wrangler. The tools support refactoring programs to make them more scalable, easier to deploy at large scale (hundreds of hosts), and easier to monitor and visualise their behaviour. Most of these tools are freely available under open source licences; the WombatOAM deployment and monitoring tool is a commercial product (Section VII).

Throughout the article we use two benchmarks to investigate the capabilities of our new technologies and tools. These are a computation in symbolic algebra, more specifically an algebraic 'orbit' calculation that exploits a non-replicated distributed hash table, and an Ant Colony Optimisation (ACO) parallel search program (Section III).

We report on the reliability and scalability implications of our new technologies using Orbit, ACO, and other benchmarks. We use a Chaos Monkey instance [14] that randomly kills processes in the running system to demonstrate the reliability of the benchmarks and to show that SD Erlang preserves the Erlang language-level reliability model. While we report measurements on a range of NUMA and cluster architectures as specified in Appendix A, the key scalability experiments are conducted on the Athos cluster with 256 hosts and 6144 cores. Having established scientifically the folklore limitations of around 60 connected hosts/nodes for distributed Erlang systems in Section IV, a key result is to show that the SD Erlang benchmarks exceed this limit and do not reach their limits on the Athos cluster (Section VIII).

Contributions. This article is the first systematic presentation of the coherent set of technologies for engineering scalable reliable Erlang systems developed in the RELEASE project.

Section IV presents the first scalability study covering Erlang VM, language, and storage scalability. Indeed we believe it is the first comprehensive study of any distributed actor language at this scale (100s of hosts, and around 10K cores). Individual scalability studies, e.g. into Erlang VM scaling [8], or language and storage scaling, have appeared before [38, 37].

At the language level the design, implementation and validation of the new libraries (Section V) have been reported piecemeal [21, 60], and are included here for completeness.

While some of the improvements made to the Erlang Virtual Machine (Section i) have been thoroughly reported in conference publications [66, 46, 47, 69, 70, 71], others are reported here for the first time (Sections iii, ii).

In Section VII, the WombatOAM and SDMon tools are described for the first time, as is the revised Devo system and visualisation. The other tools for profiling, debugging and refactoring developed in the project have previously been published piecemeal [52, 53, 55, 10], but this is their first unified presentation.

All of the performance results in Section VIII are entirely new, although a comprehensive study of SD Erlang performance is now available in a recent article [22].

II. Context

i. Scalable Reliable Programming Models

There is a plethora of shared memory concurrent programming models like PThreads or Java threads, and some models, like OpenMP [20], are simple and high level. However synchronisation costs mean that these models generally do not scale well, often struggling to exploit even 100 cores.

Moreover, reliability mechanisms are greatly hampered by the shared state: for example, a lock becomes permanently unavailable if the thread holding it fails.

The High Performance Computing (HPC) community build large-scale (10^6 core) distributed memory systems using the de facto standard MPI communication libraries [72]. Increasingly these are hybrid applications that combine MPI with OpenMP. Unfortunately, MPI is not suitable for producing general purpose concurrent software as it is too low level with explicit message passing. Moreover, the most widely used MPI implementations offer no fault recovery (some fault tolerance is provided in less widely used MPI implementations like [27]): if any part of the computation fails, the entire computation fails. Currently the issue is addressed by using what is hoped to be highly reliable computational and networking hardware, but there is intense research interest in introducing reliability into HPC applications [33].

Server farms use commodity computational and networking hardware, and often scale to around 10^5 cores, where host failures are routine. They typically perform rather constrained computations, e.g. Big Data Analytics, using reliable frameworks like Google MapReduce [26] or Hadoop [78]. The idempotent nature of the analytical queries makes it relatively easy for the frameworks to provide implicit reliability: queries are monitored and failed queries are simply re-run. In contrast, actor languages like Erlang are used to engineer reliable general purpose computation, often recovering failed stateful computations.

ii. Actor Languages

The actor model of concurrency consists of independent processes communicating by means of messages sent asynchronously between processes. A process can send a message to any other process for which it has the address (in Erlang the "process identifier" or pid), and the remote process may reside on a different host. While the notion of actors originated in AI [42], it has been used widely as a general metaphor for concurrency, as well as being incorporated into a number of niche programming languages in the 1970s and 80s. More recently it has come back to prominence through the rise of not only multicore chips but also larger-scale distributed programming in data centres and the cloud.

With built-in concurrency and data isolation, actors are a natural paradigm for engineering reliable scalable general-purpose systems [1, 41]. The model has two main concepts: actors that are the unit of computation, and messages that are the unit of communication. Each actor has an address-book that contains the addresses of all the other actors it is aware of. These addresses can be either locations in memory, direct physical attachments, or network addresses. In a pure actor language, messages are the only way for actors to communicate. After receiving a message an actor can do the following: (i) send messages to another actor from its address-book, (ii) create new actors, or (iii) designate a behaviour to handle the next message it receives. The model does not impose any restrictions in the order in which these actions must be taken. Similarly, two messages sent concurrently can be received in any order. These features enable actor based systems to support indeterminacy and quasi-commutativity, while providing locality, modularity, reliability and scalability [41].

Actors are just one message-based paradigm, and other languages and libraries have related message passing paradigms. Recent example languages include Go [28] and Rust [62] that provide explicit channels, similar to actor mailboxes. Probably the most famous message passing library is MPI [72], with APIs for many languages and widely used on clusters and High Performance Computers. It is, however, arguable that the most important contribution of the actor model is the one-way asynchronous communication [41]. Messages are not coupled with the sender, nor are they transferred synchronously to a temporary container where transmission takes place, e.g. a buffer, a queue, or a mailbox. Once a message is sent, the receiver is the only entity responsible for that message.

Erlang [6, 19] is the pre-eminent programming language based on the actor model, having a history of use in production systems, initially with its developer Ericsson and then more widely through open source adoption. There are now actor frameworks for many other languages; these include Akka for C#, F# and Scala [64], CAF (http://actor-framework.org) for C++, Pykka (http://pykka.readthedocs.org/en/latest/), Cloud Haskell [31], PARLEY [51] for Python, and Termite Scheme [35], and each of these is currently under active use and development. Moreover, the recently defined Rust language [62] has a version of the actor model built in, albeit in an imperative context.

iii. Erlang's Support for Concurrency

In Erlang, actors are termed processes, and virtual machines are termed nodes. The key elements of the actor model are: fast process creation and destruction; lightweight processes, e.g. enabling 10^6 concurrent processes on a single host with 8GB RAM; fast asynchronous message passing with copying semantics; process monitoring; strong dynamic typing, and selective message reception.

By default Erlang processes are addressed by their process identifier (pid), e.g.

    Pong_PID = spawn(fun some_module:pong/0)

spawns a process to execute the anonymous function given as argument to the spawn primitive, and binds Pong_PID to the new process identifier. Here the new process will execute the pong/0 function which is defined in some_module. A subsequent call

    Pong_PID ! finish

sends the message finish to the process identified by Pong_PID. Alternatively, processes can be given names using a call of the form:

    register(my_funky_name, Pong_PID)

which registers this process name in the node's process name table if not already present. Subsequently, these names can be used to refer to or communicate with the corresponding processes (e.g. send them a message):

    my_funky_process ! hello.

A distributed Erlang system executes on multiple nodes, and the nodes can be freely deployed across hosts, e.g. they can be located on the same or different hosts. To help make distribution transparent to the programmer, when any two nodes connect they do so transitively, sharing their sets of connections. Without considerable care this quickly leads to a fully connected graph of nodes. A process may be spawned on an explicitly identified node, e.g.

    Remote_Pong_PID = spawn(some_node, fun some_module:pong/0).

After this, the remote process can be addressed just as if it were local. It is a significant burden on the programmer to identify the remote nodes in large systems, and we will return to this in Sections ii, i.2.

iv. Scalability and Reliability in Erlang Systems

Erlang was designed to solve a particular set of problems, namely those in building telecommunications infrastructure, where systems need to be scalable to accommodate hundreds of thousands of calls concurrently, in soft real-time. These systems need to be highly available and reliable: i.e. to be robust in the case of failure, which can come from software or hardware faults. Given the inevitability of the latter, Erlang adopts the "let it fail" philosophy for error handling. That is, it encourages programmers to embrace the fact that a process may fail at any point, and to rely on the supervision mechanism, discussed shortly, to handle the failures.

Figure 1 illustrates Erlang's support for concurrency, multicores and distribution. Each Erlang node is represented by a yellow shape, and each rectangle represents a host with an IP address. Each red arc represents a connection between Erlang nodes. Each node can run on multiple cores, and exploit the inherent concurrency provided. This is done automatically by the VM, with no user intervention needed. Typically each core has an associated scheduler that schedules processes; a new process will be spawned on the same core as the process that spawns it, but work can be moved to a different scheduler through a work-stealing allocation algorithm. Each scheduler allows a process that is ready to compute at most a fixed number of computation steps before switching to another. Erlang built-in functions or BIFs are implemented in C, and at the start of the project were run to completion once scheduled, causing performance and responsiveness problems if the BIF had a long execution time.

Figure 1: Conceptual view of Erlang's concurrency, multicore support and distribution.

Scaling in Erlang is provided in two different ways. It is possible to scale within a single node by means of the multicore virtual machine exploiting the concurrency provided by the multiple cores or NUMA nodes. It is also possible to scale across multiple hosts using multiple distributed Erlang nodes.

Reliability in Erlang is multi-faceted. As in all actor languages each process has private state, preventing a failed or failing process from corrupting the state of other processes. Messages enable stateful interaction, and contain a deep copy of the value to be shared, with no references (e.g. pointers) to the senders' internal state. Moreover Erlang avoids type errors by enforcing strong typing, albeit dynamically [7]. Connected nodes check liveness with heartbeats, and can be monitored from outside Erlang, e.g. by an operating system process. However, the most important way to achieve reliability is supervision, which allows a process to monitor the status of a child process and react to any failure, for example by spawning a substitute process to replace a failed process. Supervised processes can in turn supervise other processes, leading to a supervision tree. The supervising and supervised processes may be in different nodes, and on different hosts, and hence the supervision tree may span multiple hosts or nodes.

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process

    global:register_name(pong_server, Remote_Pong_PID).

Clients of the server can send messages to the registered name, e.g.

    global:whereis_name(pong_server) ! finish.

If the server fails the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.
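Supervision and global registration combine naturally in OTP. The sketch below is ours, not the article's: a supervisor restarts a failed, globally registered pong server, so clients that address it by name need not change. The pong_server module is an assumed gen_server that registers itself globally.

    %% Our illustrative sketch (not from the article): a supervisor
    %% restarts a globally registered server; because clients address
    %% it via the global name, they survive the restart.
    -module(pong_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link(pong_sup, []).

    init([]) ->
        RestartStrategy = {one_for_one, 5, 10},
        %% pong_server (assumed) would start with, e.g.,
        %% gen_server:start_link({global, pong_server}, pong_server, [], []).
        Child = {pong_server, {pong_server, start_link, []},
                 permanent, 5000, worker, [pong_server]},
        {ok, {RestartStrategy, [Child]}}.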


v. ETS: Erlang Term Storage

Erlang is a pragmatic language and the actor model it supports is not pure. Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries and underlies the in-memory Mnesia DBMS. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry; atomically inserts two elements with keys some_key and 42; updates the value associated with the table entry with key 42; and then looks up this entry.

    Table = ets:new(my_table,
                    [set, public, {keypos, 1}]),
    ets:insert(Table, [
        {some_key, an_atom_value},
        {42, {a,tuple,value}}]),
    ets:update_element(Table, 42,
        [{2, {another,tuple,value}}]),
    [{Key, Value}] = ets:lookup(Table, 42).

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.

III. Benchmarks for scalability and reliability

The two benchmarks that we use throughout this article are Orbit, that measures scalability without looking at reliability, and Ant Colony Optimisation (ACO) that allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks/. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i. Orbit

Orbit is a computation in symbolic algebra, which generalises a transitive closure computation [57]. To compute the orbit for a given space [0..X], a list of generators g1, g2, ..., gn are applied on an initial vertex x0 ∈ [0..X]. This creates new values (x1...xn) ∈ [0..X], where xi = gi(x0). The generator functions are applied on the new values until no new value is generated.

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set and, in distributed environments, is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism.
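To make the computation concrete, here is a small sequential sketch of the orbit algorithm (ours, not the benchmark's code), accumulating the set in an ETS table the way the distributed version accumulates it in DHT fragments:

    %% Sequential orbit sketch (ours): apply each generator to every
    %% newly discovered value until no new values appear; the ETS set
    %% plays the role of the distributed hash table.
    orbit(Gens, X0) ->
        Tab = ets:new(orbit, [set]),
        explore(Gens, [X0], Tab),
        Orbit = [X || {X} <- ets:tab2list(Tab)],
        ets:delete(Tab),
        Orbit.

    explore(_Gens, [], _Tab) ->
        done;
    explore(Gens, [X | Rest], Tab) ->
        case ets:insert_new(Tab, {X}) of
            true  ->                         % X is new: expand it
                explore(Gens, [G(X) || G <- Gens] ++ Rest, Tab);
            false ->                         % X already in the set
                explore(Gens, Rest, Tab)
        end.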


In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.

Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

ii. Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a metaheuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article, we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation as follows. In each implementation, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step, each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.
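The gather-select-broadcast protocol common to all four variants can be sketched as follows. This is our illustration; colony:start/1 and best_of/1 are assumed helpers, not the benchmark's actual API:

    %% Sketch (ours) of the coordination pattern: gather each colony's
    %% best solution, select the global best, broadcast it back, and
    %% repeat for a number of global iterations.
    master(ColonyNodes, GlobalIters) ->
        Colonies = [spawn(Node, colony, start, [self()])
                    || Node <- ColonyNodes],
        iterate(Colonies, GlobalIters, none).

    iterate(Colonies, 0, Best) ->
        [C ! stop || C <- Colonies],
        Best;
    iterate(Colonies, I, Best0) ->
        Solutions = [receive {best, C, Sol} -> Sol end
                     || C <- Colonies],
        %% best_of/1 (assumed) picks the lowest-cost solution and
        %% ignores the initial none.
        Best = best_of([Best0 | Solutions]),
        [C ! {global_best, Best} || C <- Colonies],
        iterate(Colonies, I - 1, Best).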

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back, the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1, and the architecture of SR-ACO is discussed in detail there.

Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

IV. Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i. Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface (information about BenchErl is available at http://release.softlab.ntua.gr/bencherl/). BenchErl shows how an application's performance changes when resources, like cores or schedulers, are added; or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;
• the number of cores per node;
• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;
• the Erlang/OTP release and flavor; and
• the command-line arguments used to start the Erlang nodes.
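For reference (our addition, not part of BenchErl's interface), the first three of these quantities can be inspected from the Erlang shell:

    %% Inspecting (ours) the resources BenchErl varies:
    erlang:system_info(schedulers_online).  % schedulers currently running
    erlang:system_info(logical_processors). % cores visible to the VM
    erlang:system_info(otp_release).        % Erlang/OTP release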


[Figure: runtime (ms) and speedup vs. number of schedulers for the two Orbit configurations.]

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01.

Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01) but also revealed VM and language level scalability bottlenecks.

Figure 5 shows runtime and speedup curves for the Orbit benchmark where master and workers run on a single Erlang node in configurations with and without intra-worker parallelism. In both configurations the program scales. Runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores but increases less rapidly from that point on; we see a similar but more clearly visible pattern for the other configuration (red curve) where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run in many VM schedulers (threads). For example the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling like limited parallelism in the application, or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark, that spawns some n processes, each of which creates a list of m timestamps and, after it checks that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown, thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp. That is, two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
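Later Erlang/OTP releases (18 and onwards) provide a time API that avoids this serialisation; the sketch below (ours) contrasts the two ways of generating ordered, unique timestamps:

    %% Our sketch: erlang:now/0 serialises all callers on a global
    %% lock; since Erlang/OTP 18 a unique, monotonic stamp can be
    %% built without that contention.
    old_stamp() ->
        erlang:now().                                  % global lock
    new_stamp() ->
        {erlang:monotonic_time(),
         erlang:unique_integer([monotonic])}.          % scalable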

[Figures: runtime (ms) and speedup vs. number of schedulers for the two benchmarks.]

Figure 6: Runtime and speedup of the BenchErl benchmark called ets_test using Erlang/OTP R15B01.

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01.

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii. Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets and this typically leads to a fully connected graph of nodes. So a system with n nodes will maintain O(n^2) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution as there is no need to discover nodes, and the design works well for small numbers of nodes. However at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters: Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network.

It allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).

Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and the design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look up in the local name table.

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10) and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm. Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands and shows that, while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.
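To make the three command classes concrete, here is a sketch (ours, not DE-Bench's code) of one representative call from each class:

    %% Our sketch of the three command classes DE-Bench measures.
    p2p_cmd(Node) ->
        rpc:call(Node, erlang, node, []).     % P2P: run on a remote node
    global_cmd(Name, Pid) ->
        global:register_name(Name, Pid).      % global: synchronises all nodes
    local_cmd(Name) ->
        global:whereis_name(Name).            % local: node-local table lookup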
Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed; and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system. These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this becomes outdated.


Figure 8: DE-Bench’s internal workflow.

Figure 9: Scalability vs. percentage of global commands in Distributed Erlang.

Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability, for example Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii. Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage and evaluated four popular Erlang DBMS against these requirements.


Figure 10: Latency of commands as the number of Erlang nodes increases.

For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically an Erlang Cassandra interface is available and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover the scalability of Riak is much improved in subsequent versions.

V. Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i. SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains, and the size of name spaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names.

A node can belong to many s_groups or to none. If a node belongs to no s_group it behaves as a usual distributed Erlang node.

The s_group library defines the functions shown in Table 1. Some of these functions manipulate s_groups and provide information about them, such as creating s_groups and providing a list of nodes from a given s_group. The remaining functions manipulate names registered in s_groups and provide information about these names. For example, to register a process, Pid, with name Name in s_group SGroupName we use the following function. The name will only be registered if the process is being executed on a node that belongs to the given s_group, and neither Name nor Pid are already registered in that group.

    s_group:register_name(SGroupName, Name, Pid) -> yes | no

To illustrate the impact of s_groups on scalability we repeat the global operations experiment from Section ii (Figure 9). In the SD Erlang experiment we partition the set of nodes into s_groups each containing ten nodes, and hence the names are replicated and synchronised on just ten nodes, and not on all nodes as in distributed Erlang. The results in Figure 11 show that with 0.01% of global operations the throughput of distributed Erlang stops growing at 40 nodes, while the throughput of SD Erlang continues to grow linearly.

The connection topology of s_groups is extremely flexible: they may be organised into a hierarchy of arbitrary depth or branching, e.g. there could be multiple levels in the tree of s_groups; see Figure 12. Moreover it is not necessary to create a hierarchy of s_groups; for example, we have constructed an Orbit implementation using a ring of s_groups.

Given such a flexible way of organising distributed systems, key questions in the design of an SD Erlang system are the following. How should s_groups be structured? Depending on the reason the nodes are grouped (reducing the number of connections, or reducing the namespace, or both), s_groups can be freely structured as a tree, ring, or some other topology. How large should the s_groups be? Smaller s_groups mean more inter-group communication, but the synchronisation of the s_group state between the s_group nodes constrains the maximum size of s_groups. We have not found this constraint to be a serious restriction. For example many s_groups are either relatively small, e.g. 10-node, internal or terminal elements in some topology, e.g. the leaves and nodes of a tree. How do nodes from different s_groups communicate? While any two nodes can communicate in an SD Erlang system, to minimise the number of connections, communication between nodes from different s_groups is typically routed via gateway nodes that belong to both s_groups. How do we avoid single points of failure? For reliability, and to minimise communication load, multiple gateway nodes and processes may be required.

Information to make these design choices is provided by the tools in Section VII and by benchmarking. A further challenge is how to systematically refactor a distributed Erlang application into SD Erlang, and this is outlined in Section i. A detailed discussion of distributed system design and refactoring in SD Erlang is provided in a recent article [22].

We illustrate typical SD Erlang system designs by showing refactorings of the Orbit and ACO benchmarks from Section III. In both distributed and SD Erlang the computation starts on the Master node and the actual computation is done on the Worker nodes. In the distributed Erlang version all nodes are interconnected, and messages are transferred directly from the sending node to the receiving node (Figure 2). In contrast, in the SD Erlang version nodes are grouped into s_groups, and messages are transferred between different s_groups via Sub-master nodes (Figure 13).

A fragment of code that creates an s_group on a Sub-master node is as follows:

    create_s_group(Master, GroupName, Nodes0) ->
        case s_group:new_s_group(GroupName, Nodes0) of
            {ok, GroupName, Nodes} ->
                Master ! {GroupName, Nodes};
            _ -> io:format("exception: message")
        end.


Table 1: Summary of s_group functions.

Function                                   Description
new_s_group(SGroupName, Nodes)             Creates a new s_group consisting of some nodes.
delete_s_group(SGroupName)                 Deletes an s_group.
add_nodes(SGroupName, Nodes)               Adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes)            Removes a list of nodes from an s_group.
s_groups()                                 Returns a list of all s_groups known to the node.
own_s_groups()                             Returns a list of s_group tuples the node belongs to.
own_nodes()                                Returns a list of nodes from all s_groups the node
                                           belongs to.
own_nodes(SGroupName)                      Returns a list of nodes from the given s_group.
info()                                     Returns s_group state information.
register_name(SGroupName, Name, Pid)       Registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid)    Re-registers a name (changes a registration) in a
                                           given s_group.
unregister_name(SGroupName, Name)          Unregisters a name in the given s_group.
registered_names({node,Node})              Returns a list of all registered names on the given
                                           node.
registered_names({s_group,SGroupName})     Returns a list of all registered names in the given
                                           s_group.
whereis_name(SGroupName, Name)             Return the pid of a name registered in the given
whereis_name(Node, SGroupName, Name)       s_group.
send(SGroupName, Name, Msg)                Send a message to a name registered in the given
send(Node, SGroupName, Name, Msg)          s_group.
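A hypothetical session (ours) exercising the Table 1 API: create a group, register a server under a group-local name, then look it up and message it.

    %% Hypothetical session (ours) using the Table 1 API; the node
    %% names and Pong_PID are placeholders.
    {ok, group1, Nodes} =
        s_group:new_s_group(group1, ['node1@host1', 'node2@host2']),
    yes = s_group:register_name(group1, pong_server, Pong_PID),
    Pid = s_group:whereis_name(group1, pong_server),
    s_group:send(group1, pong_server, finish).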

Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

    spawn(some_node, fun some_module:pong/0).

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems discussed in Section ii.


[Figure: throughput (1e+08 to 1e+09) vs. number of nodes (10 to 100) for Distributed Erlang and SD Erlang.]
Figure 11: Global operations in Distributed Erlang vs. SD Erlang.

[Figure: a three-level architecture; the master process at Level 0; sub-master processes P1 to P16, one per s_group, at Level 1; colony nodes at Level 2.]
Figure 12: SD Erlang ACO (SR-ACO) architecture.

To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node we can use the attr:choose_node/1 function to define the target node, i.e.

    spawn(attr:choose_node(Params), fun some_module:pong/0).

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.
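The shape of Params is not fixed by the discussion above, so the following wrapper is purely illustrative; the attribute and distance criteria are invented:

    %% Purely illustrative (ours): choose a powerful, nearby node for
    %% a computation-heavy process; the criteria syntax is invented,
    %% only attr:choose_node/1 comes from the library.
    spawn_heavy(Fun) ->
        Params = [{ram_mb, '>=', 16384}, {distance, nearest}],
        spawn(attr:choose_node(Params), Fun).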


Figure 13: SD Erlang (SD-Orbit) architecture.

ii. S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids.

The state of a system is modelled as a four tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node, i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of name and process id, pid, pairs and is replicated on all nodes of the associated group.

A node has the following four parameters: node_id identifier; node_type, which can be either hidden or normal; connections; and group_names, i.e. names of groups the node belongs to. The node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state', value), meaning that executing command on node ni in state returns value and transitions to state'.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, false otherwise; and OutputNms returns a set of process names registered in the namespace ns of s_group s.

iii. Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout.


(grs, fgs, fhs, nds) ∈ {state} ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ (s_group_name, {node_id}, namespace)
fg ∈ fgs ≡ {free_group} ≡ ({node_id}, namespace)
fh ∈ fhs ≡ {free_hidden_group} ≡ (node_id, namespace)
nd ∈ nds ≡ {node} ≡ (node_id, node_type, connections, gr_names)
gs ∈ {gr_names} ≡ NoGroup, {s_group_name}
ns ∈ {namespace} ≡ {(name, pid)}
cs ∈ {connections} ≡ {node_id}
nt ∈ {node_type} ≡ {Normal, Hidden}
s ∈ {NoGroup, s_group_name}

Figure 14: SD Erlang state [21].

((grs, fgs, fhs, nds), s_group:registered_names(s), ni)
    −→ ((grs, fgs, fhs, nds), nms)    if IsSGroupNode(ni, s, grs)
    −→ ((grs, fgs, fhs, nds), {})     otherwise
where (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs
      nms ≡ OutputNms(s, ns)

IsSGroupNode(ni, s, grs) = ∃ nis, ns, grs′ . (s, {ni} ⊕ nis, ns) ⊕ grs′ ≡ grs

OutputNms(s, ns) = {(s, nm) | (nm, p) ∈ ns}

Figure 15: SD Erlang semantics of the s_group:registered_names/1 function.

Having an executable semantics allows users to engage with it, and to understand how the semantics behaves vis-à-vis the library, giving them an opportunity to assess the correctness of the library against the semantics.

Better still, we can automatically assess how the system behaves in comparison with the (executable) semantics by executing them in lockstep, guided by the constraints of which operations are possible at each point. We do that by building an abstract state machine model of the library. We can then generate random sequences (or traces) through the model, with appropriate library data generated too. This random generation is supported by the QuickCheck property-based testing system [24, 9].

The architecture of the testing framework is shown in Figure 16. First an abstract state machine embedded as an "eqc_statem" module is derived from the executable semantic specification. The state machine defines the abstract state representation and the transition from one state to another when an operation is applied. Test case and data generators are then defined to control the test case generation; this includes the automatic generation of eligible s_group operations and the input data to those operations.
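The shape of such an eqc_statem module is sketched below. The callbacks are QuickCheck's standard ones; the module names (s_group_ops, s_group_semantics) and generators are ours, and the bodies are illustrative stubs rather than the project's actual model:

    %% Sketch of the abstract state machine testing s_group operations.
    -module(s_group_eqc).
    -include_lib("eqc/include/eqc.hrl").
    -include_lib("eqc/include/eqc_statem.hrl").
    -compile(export_all).

    %% Abstract state: the four-tuple of Figure 14, with lists for sets.
    initial_state() -> {[], [], [], []}.

    %% Generate an eligible s_group operation for the current state.
    command(State) ->
        oneof([{call, s_group_ops, new_s_group,
                [gen_s_group_name(), gen_node_ids(State)]},
               {call, s_group_ops, registered_names, [gen_s_group(State)]}]).

    %% Transitions are given by the executable semantics.
    next_state(State, _Res, {call, _, Op, Args}) ->
        element(1, s_group_semantics:apply_op(State, Op, Args)).

    %% Oracle: the library's result must agree with the semantics.
    postcondition(State, {call, _, Op, Args}, Actual) ->
        {_State1, Expected} = s_group_semantics:apply_op(State, Op, Args),
        Actual =:= Expected.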

Test oracles are encoded as the postconditions for s_group operations.

During testing, each test command is applied to both the abstract model and the s_group library. The application of the test command to the abstract model takes the abstract model from its current state to a new state as described by the transition functions, whereas the application of the test command to the library leads the system to a new actual state. The actual state information is collected from each node in the distributed system, then merged and normalised to the same format as the abstract state representation. For a test to be successful, after the execution of a test command the test oracles specified for this command should be satisfied. Various test oracles can be defined for s_group operations; for instance, one generic constraint that applies to all s_group operations is that after each operation the normalised system state should be equivalent to the abstract state.

Thousands of tests were run, and three kinds of errors — all subsequently corrected — were found. Some errors in the library implementation were found, including one due to the synchronisation between nodes, and another related to the remove_nodes operation, which erroneously raised an exception. We also found a couple of trivial errors in the semantic specification itself, which had been missed by manual examination. Finally, we found some situations where there were inconsistencies between the semantics and the library implementation, despite their states being equivalent: an example of this was in the particular values returned by functions on certain errors. Overall, the automation of testing boosted our confidence in the correctness of the library implementation and the semantic specification. This work is reported in more detail by [54].

Figure 16: Testing s_groups using QuickCheck.

VI. Improving VM Scalability

This section reports the primary VM and library improvements we have designed and implemented to address the scalability and reliability issues identified in Section i.

i. Improvements to Erlang Term Storage

Because ETS tables are so heavily used in Erlang systems, they are a focus for scalability improvements. We start by describing their redesign, including some improvements that pre-date our RELEASE project work, i.e. those prior to Erlang/OTP R15B03. These historical improvements are very relevant for a scalability study and form the basis for our subsequent changes and improvements. At the point when Erlang/OTP got support for multiple cores (in release R11B), there was a single reader-writer lock for each ETS table. Optional fine-grained locking of hash-based ETS tables (i.e. set, bag or duplicate_bag tables) was introduced in Erlang/OTP R13B02-1, adding 16 reader-writer locks for the hash buckets. Reader groups to minimise read synchronisation overheads were introduced in Erlang/OTP R14B. The key observation is that a single count of the multiple readers must be synchronised across many cache lines, potentially far away in a NUMA system. Maintaining reader counts in multiple (local) caches makes reads fast, although writes must now check every reader count. In Erlang/OTP R16B the number of bucket locks, and the default number of reader groups, were both upgraded from 16 to 64.



Figure 17: Runtime of 17M ETS operations in different Erlang/OTP releases: 10% updates (left) and 1% updates (right). The y-axis shows time in seconds (lower is better) and the x-axis the number of OS threads (VM schedulers).

Figure 18: Runtime of 17M ETS operations with varying numbers of reader groups: 10% updates (left) and 1% updates (right). The y-axis shows runtime in seconds (lower is better) and the x-axis the number of schedulers.

We illustrate the scaling properties of the ETS concurrency options using the ets_bench BenchErl benchmark on the Intel NUMA machine with 32 hyperthreaded cores specified in Appendix A. The ets_bench benchmark inserts 1M items into the table, then records the time to perform 17M operations, where an operation is either a lookup, an insert, or a delete. The experiments are conducted on a hash-based (set) ETS table with different percentages of update operations, i.e. insertions or deletions.

Figure 17 shows the runtimes in seconds of 17M operations in different Erlang/OTP versions, varying the number of schedulers (x-axis), reflecting how the scalability of ETS tables has improved in more recent Erlang/OTP releases. Figure 18 shows the runtimes in seconds of 17M operations on an ETS table with different numbers of reader groups, again varying the number of schedulers. We see that one reader group is not sufficient with 10% updates, nor are two with 1% updates. Beyond that, different numbers of reader groups have little impact on the benchmark performance, except that using 64 groups with 10% updates slightly degrades performance.

We have explored four other extensions or redesigns of the ETS implementation for better scalability: 1. allowing more programmer control over the number of bucket locks in hash-based tables, so the programmer can reflect the number of schedulers and the expected access pattern; 2. using contention-adapting trees to get better scalability for ordered_set ETS tables, as described by [69]; 3. using queue delegation locking to improve scalability [47]; and 4. adopting schemes for completely eliminating the locks in the meta table. A more complete discussion of our work on ETS can be found in the papers by [69] and [47].
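The concurrency options exercised here are selected at table creation. A brief sketch using the real ETS option names (the table name and tuple contents are ours): read_concurrency selects the reader-group optimisation and write_concurrency the fine-grained bucket locks described above.

    %% Create a hash-based table tuned for mixed concurrent access.
    bench_table() ->
        T = ets:new(bench_table, [set, public,
                                  {read_concurrency, true},
                                  {write_concurrency, true}]),
        true = ets:insert(T, {key, 1}),      % insert (an update operation)
        [{key, 1}] = ets:lookup(T, key),     % lookup (a read operation)
        T.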


Here we outline only our work on contention-adapting (CA) trees. A CA tree monitors contention in different parts of a tree-shaped data structure, introducing routing nodes with locks in response to high contention, and removing them in response to low contention. For experimental purposes two variants of the CA tree have been implemented to represent ordered_sets in the virtual machine of Erlang/OTP 17.0. One extends the existing AVL trees in the Erlang VM, and the other uses a Treap data structure [5]. Figure 19 compares the throughput of the CA tree variants with that of ordered_set and set as the number of schedulers increases. It is unsurprising that the CA trees scale so much better than an ordered_set, which is protected by a single readers-writer lock. It is more surprising that they also scale better than set. This is due to hash tables using fine-grained locking at a fixed granularity, while CA trees can adapt the number of locks both to the current contention level and to the parts of the key range where contention is occurring.

Figure 19: Throughput of CA tree variants: 10% updates (left) and 1% updates (right). The y-axis shows operations per microsecond and the x-axis the number of threads.

ii. Improvements to Schedulers

In the Erlang VM, a scheduler is responsible for executing multiple processes concurrently, in a timely and fair fashion, making optimal use of hardware resources. The VM implements preemptive multitasking with soft real-time guarantees. Erlang processes are normally scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. Each process is allowed to execute until it either blocks waiting for input, typically a message from some other process, or until it has executed its quota of reductions.

The Erlang VM is usually started with one scheduler per logical core (SMT thread) available on the host machine, and schedulers are implemented as OS threads. When an Erlang process is spawned it is placed in the run queue of the scheduler of its parent, and it waits on that queue until the scheduler allocates it a slice of core time. Work stealing is used to balance load between cores, that is, an idle scheduler may migrate a process from another run queue. Scheduler run queues are visualised in Figure 24 (Section iii.2).

The default load management mechanism is load compaction, which aims to keep as many scheduler threads as possible fully loaded with work, i.e. it attempts to ensure that scheduler threads do not run out of work. We have developed a new optional scheduler utilisation balancing mechanism that is available from Erlang/OTP 17.0. The new mechanism aims to balance scheduler utilisation between schedulers; that is, it strives for equal scheduler utilisation on all schedulers.

The scheduler utilisation balancing mechanism has no performance impact on the system when not enabled. On the other hand, when enabled, it results in changed timing in the system; normally there is a small overhead due to measuring utilisation and calculating balancing information, which depends on the underlying primitives provided by the operating system.

The new balancing mechanism results in a better distribution of processes to schedulers, reducing the probability of core contention. Together with other VM improvements, such as interruptable BIFs and garbage collection, it results in lower latency and improved responsiveness, and hence reliability, for soft real-time applications.


iii. Improvements to Time Management

Soon after the start of the RELEASE project, time management in the Erlang VM became a scalability bottleneck for many applications, as illustrated by the parallel benchmark in Section i. The issue came to prominence as other, more severe, bottlenecks were eliminated. This subsection motivates and outlines the improvements to time management that we made; these were incorporated into Erlang/OTP 18.x as a new API for time and time warping. The old API is still supported at the time of writing, but its use is deprecated.

The original time API provides the erlang:now/0 built-in that returns "Erlang system time", i.e. time since Epoch with microsecond resolution. This time is the basis for all time internally in the Erlang VM.

Many of the scalability problems of erlang:now/0 stem from its specification, written at a time when the Erlang VM was not multi-threaded, i.e. SMT-enabled. The documentation promises that values returned by it are strictly increasing, and many applications ended up relying on this; for example, applications often employ erlang:now/0 to generate unique integers.

Erlang system time should align with the operating system's view of time since Epoch, or "OS system time". However, while OS system time can be freely changed both forwards and backwards, Erlang system time cannot, without invalidating the strictly increasing value guarantee. The Erlang VM therefore contains a mechanism that slowly adjusts Erlang system time towards OS system time if they do not align.

One problem with time adjustment is that the VM deliberately presents time with an inaccurate frequency; this is required to align Erlang system time with OS system time smoothly when the two have deviated, e.g. in the case of clock shifts when leap seconds are inserted or deleted. Another problem is that Erlang system time and OS system time can differ for very long periods of time. In the new API we resolve this using a common OS technique [59]: a monotonic time that has its zero point at some unspecified point in time. Monotonic time is not allowed to make leaps forwards or backwards, while system time is allowed to do so. Erlang system time is thus just a dynamically varying offset from Erlang monotonic time.

Time Retrieval

Retrieval of Erlang system time was previously protected by a global mutex, which made the operation thread safe but scaled poorly. Erlang system time and Erlang monotonic time need to run at the same frequency, otherwise the time offset between them would not be constant. In the common case, monotonic time delivered by the operating system is based solely on the machine's local clock and cannot be changed, while the system time is adjusted using the Network Time Protocol (NTP); that is, they run with different frequencies. Linux is an exception, with a monotonic clock that is NTP-adjusted and runs with the same frequency as system time [76]. To align the frequencies of Erlang monotonic time and Erlang system time, we adjust the frequency of the Erlang monotonic clock. This is done by comparing monotonic time and system time delivered by the OS, and calculating an adjustment. To achieve this scalably, one VM thread calculates the time adjustment at least once a minute. If the adjustment needs to be changed, new adjustment information is published and used to calculate Erlang monotonic time in the future.

When a thread needs to retrieve time, it reads the monotonic time delivered by the OS and the time adjustment information previously published, and calculates Erlang monotonic time. To preserve monotonicity it is important that all threads that read the same OS monotonic time map this to exactly the same Erlang monotonic time.


This requires synchronisation on updates to the adjustment information using a readers-writer (RW) lock. This RW lock is write-locked only when the adjustment information is changed. This means that in the vast majority of cases the RW lock will be read-locked, which allows multiple readers to run concurrently. To prevent bouncing the lock cache-line we use a bespoke reader-optimised RW lock implementation where reader threads notify about their presence in counters on separate cache-lines. The concept is similar to the reader indicator algorithm described by [48, Fig. 11]; alternatives include the ingress-egress counter used by [18] and the

SNZI algorithm of [30].
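At the language level the new API is used as follows. A minimal sketch (the function names system_time_us/0 and unique_id/0 are ours; the time-unit atom spelling follows recent OTP releases, and erlang:unique_integer/1 is the recommended replacement for erlang:now/0-based unique values):

    %% Erlang system time is Erlang monotonic time plus a dynamically
    %% varying offset; both reads scale without a global mutex.
    system_time_us() ->
        Native = erlang:monotonic_time() + erlang:time_offset(),
        erlang:convert_time_unit(Native, native, microsecond).

    %% Unique, strictly increasing integers without erlang:now/0.
    unique_id() ->
        erlang:unique_integer([monotonic, positive]).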

Figure 20: BenchErl parallel benchmark using erlang:monotonic_time/0 or erlang:now/0 in different Erlang/OTP releases: runtimes (left) and speedup obtained using erlang:monotonic_time/0 (right).

Timer Wheel and BIF Timer

The timer wheel contains all timers set by Erlang processes. The original implementation was protected by a global mutex and scaled poorly. To increase concurrency, each scheduler thread has been assigned its own timer wheel that is used by processes executing on the scheduler.

The implementation of timers in Erlang/OTP uses a built-in function (BIF), as most low-level operations do. Until Erlang/OTP 17.4 this BIF was also protected by a global mutex. Besides inserting timers into the timer wheel, the BIF timer implementation also maps timer references to a timer in the timer wheel. To improve concurrency, from Erlang/OTP 18 we provide scheduler-specific BIF timer servers as Erlang processes. These keep information about timers in private ETS tables and only insert one timer at a time into the timer wheel.

Benchmarks

We have measured several benchmarks on a 16-core Bulldozer machine with eight dual CPU AMD Opteron 4376 HEs; we present three of them here (see §2.5.4 of the RELEASE project Deliverable 2.4, http://release-project.eu/documents/D2.4.pdf).

The first micro benchmark compares the execution time of an Erlang receive with that of a receive after that specifies a timeout and provides a default value. The receive after sets a timer when the process blocks in the receive, and cancels it when a message arrives. In Erlang/OTP 17.4 the total execution time with standard timers is 62% longer than without timers. Using the improved implementation in Erlang/OTP 18.0, total execution time with the optimised timers is only 5% longer than without timers.

The second micro benchmark repeatedly checks the system time, calling the built-in erlang:now/0 in Erlang/OTP 17.4, and calling both erlang:monotonic_time/0 and erlang:time_offset/0 and adding the results in Erlang/OTP 18.0. On this machine, where the VM uses 16 schedulers by default, the 18.0 release is more than 69 times faster than the 17.4 release.
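The pattern measured by the first micro benchmark is the ubiquitous receive with timeout; a minimal sketch (names ours):

    %% A receive with a timeout: the `after` clause sets a timer when
    %% the process blocks and cancels it when a message arrives.  The
    %% per-scheduler timer wheels of Erlang/OTP 18 make this far cheaper.
    wait_for_reply(Timeout, Default) ->
        receive
            {reply, Reply} -> Reply
        after Timeout ->
            Default
        end.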

The third benchmark is the parallel BenchErl benchmark from Section i. Figure 20 shows the results of executing the original version of this benchmark, which uses erlang:now/0 to create monotonically increasing unique values, using three Erlang/OTP releases: R15B01, 17.4, and 18.1. We also measure a version of the benchmark in Erlang/OTP 18.1 where the call to erlang:now/0 has been substituted with a call to erlang:monotonic_time/0. The graph on the left shows that: 1. the performance of time management remained roughly unchanged between Erlang/OTP releases prior to 18.0; 2. the improved time management in Erlang/OTP 18.x makes time management less likely to be a scalability bottleneck even when using erlang:now/0; and 3. the new time API (using erlang:monotonic_time/0 and friends) provides a scalable solution. The graph on the right side of Figure 20 shows the speedup that the modified version of the parallel benchmark achieves in Erlang/OTP 18.1.

VII. Scalable Tools

This section outlines five tools developed in the RELEASE project to support scalable Erlang systems. Some tools were developed from scratch, like Devo, SD-Mon and WombatOAM, while others extend existing tools, like Percept and Wrangler. These include tooling to transform programs to make them more scalable, to deploy them for scalability, and to monitor and visualise them. Most of the tools are freely available under open source licences (Devo, Percept2, SD-Mon, Wrangler), while WombatOAM is a commercial product. The tools have been used for profiling and refactoring the ACO and Orbit benchmarks from Section III.

The Erlang tool "ecosystem" consists of small stand-alone tools for tracing, profiling and debugging Erlang systems that can be used separately or together as appropriate for solving the problem at hand, rather than as a single, monolithic super-tool. The tools presented here have been designed to be used as part of that ecosystem, and to complement already available functionality rather than to duplicate it. The Erlang runtime system has built-in support for tracing many types of events, and this infrastructure forms the basis of a number of tools for tracing and profiling. Typically the tools build on or specialise the services offered by the Erlang virtual machine, through a number of built-in functions. Most recently, and since the RELEASE project was planned, the Observer application (http://www.erlang.org/doc/apps/observer/) gives a comprehensive overview of many of these data on a node-by-node basis.

As actor frameworks and languages (see Section ii) have only recently become widely adopted commercially, their tool support remains relatively immature and generic in nature. That is, the tools support the language itself, rather than its distinctively concurrent aspects. Given the widespread use of Erlang, the tools developed for it point the way for tools in other actor languages and frameworks. For example, just as many Erlang tools use tracing support provided by the Erlang VM, so can other actor frameworks, e.g. Akka can use the Kamon JVM monitoring system (http://kamon.io). Similarly, tools for other actor languages or frameworks could use data derived through the OS-level tracing frameworks DTrace (http://dtrace.org/blogs/about/) and SystemTap (https://sourceware.org/systemtap/wiki), as we show in this section for Erlang, provided that the host language has tracing hooks into the appropriate infrastructure.

i. Refactoring for Scalability

Refactoring [65, 32, 75] is the process of changing how a program works without changing what it does. This can be done for readability, for testability, to prepare it for modification or extension, or — as is the case here — in order to improve its scalability. Because refactoring involves the transformation of source code, it is typically performed using machine support in a refactoring tool.
There are a number of tools that support refactoring in Erlang: in the RELEASE project we have chosen to extend Wrangler (http://www.cs.kent.ac.uk/projects/wrangler/) [56]; other tools include Tidier [68]

and RefactorErl [44].

Supporting API Migration

The SD Erlang libraries modify Erlang's global_group library, becoming the new s_group library; as a result, Erlang programs using global_group will have to be refactored to use s_group. This kind of API migration problem is not uncommon, as software evolves and this often changes the API of a library. Rather than simply extend Wrangler with a refactoring to perform this particular operation, we instead added a framework for the automatic generation of API migration refactorings from a user-defined adaptor module.

Our approach to automatic API migration works in this way: when an API function's interface is changed, the author of this API function implements an adaptor function, defining calls to the old API in terms of the new. From this definition we automatically generate the refactoring that transforms the client code to use the new API; this refactoring can also be supplied by the API writer to clients on library upgrade, allowing users to upgrade their code automatically. The refactoring works by generating a set of rules that "fold in" the adaptation to the client code, so that the resulting code works directly with the new API. More details of the design choices underlying the work and the technicalities of the implementation can be found in a paper by [52].

Support for Introducing Parallelism

We have introduced support for parallelising explicit list operations (map and foreach), for process introduction to complete a computationally intensive task in parallel, for introducing a worker process to deal with call handling in an Erlang "generic server", and for parallelising a tail-recursive function. We discuss these in turn now; more details and practical examples of the refactorings appear in a conference paper describing that work [55].

Uses of map and foreach in list processing are among the most obvious places where parallelism can be introduced. We have added a small library to Wrangler, called para_lib, which provides parallel implementations of map and foreach (a sketch of the parallel map pattern appears at the end of this subsection). The transformation from an explicit use of sequential map/foreach to the use of their parallel counterparts is very straightforward; even manual refactoring would not be a problem. However, a map/foreach operation could also be implemented differently using recursive functions, list comprehensions, etc.; identifying this kind of implicit map/foreach usage can be done using Wrangler's code inspection facility, and a refactoring that turns an implicit map/foreach into an explicit map/foreach can also be specified using Wrangler's rule-based transformation API.

If the computations of two non-trivial tasks do not depend on each other, then they can be executed in parallel. The Introduce a New Process refactoring implemented in Wrangler can be used to spawn a new process to execute a task in parallel with its parent process. The result of the new process is sent back to the parent process, which will then consume it when needed. In order not to block other computations that do not depend on the result returned by the new process, the receive expression is placed immediately before the point where the result is needed.

While some tail-recursive list processing functions can be refactored to an explicit map operation, many cannot due to data dependencies. For instance, an example might perform a recursion over a list while accumulating results in an accumulator variable. In such a situation it is possible to "float out" some of the computations into parallel computations. This can only be done when certain dependency constraints are satisfied, and these are checked by program slicing, which is discussed below.
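The parallel map pattern that such refactorings introduce can be sketched as follows; this is our illustration, not para_lib's actual implementation:

    %% Parallel map: one worker per element, replies tagged with a
    %% unique reference so results are collected in order.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        %% Receive in spawn order, preserving the input list's order.
        [receive {Ref, Res} -> Res end || Ref <- Refs].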

Support for Program Slicing

Program slicing is a general technique of program analysis for extracting the part of a program, also called the slice, that influences or is influenced by a given point of interest, i.e. the slicing criterion. Static program slicing is generally based on program dependency, including both control dependency and data dependency. Backward intra-function slicing is used by some of the refactorings described above; it is also useful in general, and is made available to end-users under Wrangler's Inspector menu [55].

Our work can be compared with that in PaRTE (http://paraphrase-enlarged.elte.hu/downloads/D4-3_user_manual.pdf) [17], a tool developed in another EU project that also re-uses the Wrangler front end. This work concentrates on skeleton introduction, as does some of our work, but we go further in using static analysis and slicing in transforming programs to make them suitable for the introduction of parallel structures.

ii. Scalable Deployment

We have developed the WombatOAM tool to provide a deployment, operations and maintenance framework for large-scale distributed Erlang systems; WombatOAM (https://www.erlang-solutions.com/products/wombat-oam.html) is a commercial tool available from Erlang Solutions Ltd. These systems typically consist of a number of Erlang nodes executing on different hosts. These hosts may have different hardware or operating systems, be physical or virtual, or run different versions of Erlang/OTP. Prior to the development of WombatOAM, deployment of systems would use scripting in Erlang and the shell, and this is the state of the art for other actor frameworks; it would be possible to adapt the WombatOAM approach to these frameworks in a straightforward way.

Architecture

The architecture of WombatOAM is summarised in Figure 21. Originally the system had problems addressing full scalability because of the role played by the central Master node; in its current version an additional layer of Middle Managers was introduced to allow the system to scale easily to thousands of deployed nodes. As the diagram shows, the "northbound" interfaces to the web dashboard and the command line are provided through RESTful connections to the Master. The operations of the Master are delegated to the Middle Managers that engage directly with the managed nodes. Each managed node runs a collection of services that collect metrics, raise alarms and so forth; we describe those now.

Services

WombatOAM is designed to collect, store and display various kinds of information and events from running Erlang systems, and these data are accessed and managed through an AJAX-based Web Dashboard; they include the following.

Metrics. WombatOAM supports the collection of some hundred metrics — including, for instance, the number of processes on a node and the message traffic in and out of the node — on a regular basis from the Erlang VMs running on each of the hosts. It can also collect metrics defined by users within other metrics collection frameworks, such as Folsom (an Erlang API for collecting and publishing metrics: https://github.com/boundary/folsom), and interface with other tools such as Graphite (https://graphiteapp.org), which can log and display such information. The metrics can be displayed as histograms covering different windows, such as the last fifteen minutes, hour, day, week or month.

Notifications. As well as metrics, it can support event-based monitoring through the collection of notifications from running nodes. Notifications, which are one-time events, can be generated using the Erlang System Architecture Support Libraries (SASL), which is part of the standard distribution, or the lager logging framework (https://github.com/basho/lager), and will be displayed and logged as they occur.

Alarms. Alarms are more complex entities. Alarms have a state: they can be raised, and once dealt with they can be cleared; they also have identities, so that the same alarm may be raised on the same node multiple times, and each instance will need to be dealt with separately.


Figure 21: The architecture of WombatOAM.

Alarms are generated by SASL or lager, just as for notifications.

Topology. The Topology service handles adding, deleting and discovering nodes. It also monitors whether they are accessible; if not, it notifies the other services and periodically tries to reconnect. When the nodes are available again, it also notifies the other services. It doesn't have a Middle Manager part, because it doesn't talk to the nodes directly: instead it asks the Node manager service to do so.

Node manager. This service maintains the connection to all managed nodes via the Erlang distribution protocol. If it loses the connection towards a node, it periodically tries to reconnect. It also maintains the states of the nodes in the database (e.g. if the connection towards a node is lost, the Node manager changes the node state to DOWN and raises an alarm). The Node manager doesn't have a REST API, since the node states are provided via the Topology service's REST API.

Orchestration. This service can deploy new Erlang nodes on already running machines. It can also provision new virtual machine instances using several cloud providers, and deploy Erlang nodes on those instances. For communicating with


the cloud providers, the Orchestration service uses an external library called Libcloud (the unified cloud API [4]: https://libcloud.apache.org), for which Erlang Solutions has written an open source Erlang wrapper called elibcloud, to make Libcloud easier to use from WombatOAM. Note that WombatOAM Orchestration doesn't provide a platform for writing Erlang applications: it provides infrastructure for deploying them.

Deployment

The mechanism consists of the following five steps.

1. Registering a provider. WombatOAM provides the same interface for different cloud providers which support the OpenStack standard or the Amazon EC2 API. WombatOAM also provides the same interface for using a fixed set of machines. In WombatOAM's backend this has been implemented as two driver modules: the elibcloud driver module, which uses the elibcloud and Libcloud libraries to communicate with the cloud providers, and the SSH driver module, which keeps track of a fixed set of machines.

2. Uploading a release. The release can be either a proper Erlang release archive or a set of Erlang modules. The only important aspect from WombatOAM's point of view is that start and stop commands should be explicitly specified. This enables WombatOAM to start and stop nodes when needed.

3. Defining a node family. The next step is creating the node family, which is the entity that refers to a certain release, contains deployment domains that refer to certain providers, and contains other information necessary to deploy a node.

4. Defining a deployment domain. At this step a deployment domain is created that specifies (i) which providers should be used for provisioning machines; and (ii) the username that will be used when WombatOAM connects to the hosts using SSH.

5. Node deployment. To deploy nodes a WombatOAM user needs only to specify the number of nodes and the node family these nodes belong to. Nodes can be dynamically added to, or removed from, the system depending on the needs of the application. The nodes are started, and WombatOAM is ready to initiate and run the application.

iii. Monitoring and Visualisation

A key aim in designing new monitoring and visualisation tools and adapting existing ones was to provide support for systems running on parallel and distributed hardware. Specifically, in order to run modern Erlang systems, and in particular SD Erlang systems, it is necessary to understand both their single-host ("multicore") and multi-host ("distributed") nature. That is, we need to be able to understand how systems run on the Erlang multicore virtual machine, where the scheduler associated with a core manages its own run queue, and processes migrate between the queues through a work stealing algorithm; at the same time, we have to understand the dynamics of a distributed Erlang program, where the user explicitly spawns processes to nodes. (This is in contrast with the multicore VM, where programmers have no control over where processes are spawned; however, they still need to gain insight into the behaviour of their programs to tune performance.)

iii.1 Percept2

Percept2 (https://github.com/RefactoringTools/percept2) builds on the existing Percept tool to provide post hoc offline analysis and visualisation of Erlang systems. Percept2 is designed to allow users to visualise and tune the parallel performance of Erlang systems on a single node on a single manycore host. It visualises Erlang application-level concurrency and identifies concurrency bottlenecks. Percept2 uses


Erlang built-in tracing and profiling to monitor process states, i.e. waiting, running, runnable, free and exiting. A waiting or suspended process is considered inactive, and a running or runnable process is considered active. As a program runs with Percept, events are collected and stored to a file. The file is then analysed, with the results stored in a RAM database, and this data is viewed through a web-based interface. The process of offline profiling for a distributed Erlang application using Percept2 is shown in Figure 22.

Percept generates an application-level zoomable concurrency graph, showing the number of active processes at each point during profiling; dips in the graph represent low concurrency. A lifetime bar for each process is also included, showing the points during its lifetime when the process was active, as well as other per-process information.

Percept2 extends Percept in a number of ways — as detailed by [53] — including most importantly:

• Distinguishing between running and runnable time for each process: this is apparent in the process runnability comparison as shown in Figure 23, where orange represents runnable and green represents running. This shows very clearly where potential concurrency is not being exploited.

• Showing scheduler activity: the number of active schedulers at any time during the profiling.

• Recording more information during execution, including the migration history of a process between run queues; statistics about message passing between processes: the number of messages and the average message size sent/received by a process; and the accumulated runtime per process: the accumulated time when a process is in a running state.

• Presenting the process tree: the hierarchy structure indicating the parent-child relationships between processes.

• Recording the dynamic function call graph/count/time: the hierarchy structure showing the calling relationships between functions during the program run, and the amount of time spent in a function.

• Tracing of s_group activities in a distributed system. Unlike global_group, s_group allows dynamic changes to the s_group structure of a distributed Erlang system. In order to support SD Erlang, we have extended Percept2 to allow the profiling of s_group related activities, so that the dynamic changes to the s_group structure of a distributed Erlang system can be captured.

We have also improved on Percept as follows.

• Enabling finer-grained control of what is profiled. The profiling of port activities, scheduler activities, message passing, process migration, garbage collection and s_group activities can be enabled/disabled, while the profiling of process runnability (indicated by the "proc" flag) is always enabled.

• Selective function profiling of processes. In Percept2 we have built a version of fprof which does not measure a function's own execution time, but measures everything else that fprof measures. Eliminating measurement of a function's own execution time gives users the freedom of not profiling all the function calls invoked during the program execution. For example, they can choose to profile only functions defined in their own applications' code, and not those in libraries.

• Improved dynamic function callgraph. With the dynamic function callgraph, a user is able to understand the causes of certain events, such as heavy calls of a particular function, by examining the region around the node for the function, including the path to the root of the graph. Each edge in the callgraph is annotated with the number of times the target function is called by the source function, as well as further information.


Figure 22: Offline profiling of a distributed Erlang application using Percept2.

Figure 23: Percept2: showing processes running and runnable (with 4 schedulers/run queues).
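The offline workflow of Figure 22 can be driven from the Erlang shell. A minimal sketch, assuming Percept2 retains Percept's profile/analyse/webserver entry points (the exact signatures may differ, and the profiled module is ours):

    %% Trace a run into a file, analyse it into the RAM database, and
    %% browse the results in a web interface.
    percept2:profile("mandelbrot.dat", {mandel, run, []}, [all]),
    percept2:analyze(["mandelbrot.dat"]),
    percept2:start_webserver(8888).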

Finally, we have also improved the scalability of Percept in three ways. First, we have parallelised the processing of trace files so that multiple data files can be processed at the same time. We have also compressed the representation of call graphs, and cached the history of generated web pages. Together these make the system more responsive and more scalable.

iii.2 DTrace/SystemTap tracing

DTrace provides dynamic tracing support for various flavours of Unix, including BSD and Mac OS X, and SystemTap does the same for Linux; both allow the monitoring of live, running systems with minimal overhead. They can be used by administrators, language developers and application developers alike to examine the behaviour of applications, language implementations and the operating system during development, or even on live production systems. In comparison to other similar tools and instrumentation frameworks, they are relatively lightweight, and require neither special recompiled versions of the software to be examined nor special post-processing tools to create meaningful information from the data gathered. Using these probes, it is possible to identify bottlenecks both in the VM itself and in applications.


VM bottlenecks are identified using a large number of probes inserted into Erlang/OTP's VM, for example to explore scheduler run-queue lengths. These probes can be used to measure the number of processes per scheduler, the number of processes moved during work stealing, the number of attempts to gain a run-queue lock, how many of these succeed immediately, etc. Figure 24 visualises the results of such monitoring; it shows how the sizes of run queues vary during execution of the bang BenchErl benchmark on a VM with 16 schedulers.

Figure 24: DTrace: run-queue visualisation of the bang benchmark with 16 schedulers (time on x-axis).

Application bottlenecks are identified by an alternative back-end for Percept2, based on DTrace/SystemTap instead of the Erlang built-in tracing mechanism. The implementation re-uses the existing Percept2 infrastructure as far as possible. It uses a different mechanism for collecting information about Erlang programs and a different format for the trace files, but the same storage infrastructure and presentation facilities.

iii.3 Devo

Devo (https://github.com/RefactoringTools/devo) [10] is designed to provide real-time online visualisation of both the low-level (single node, multiple cores) and high-level (multiple nodes, grouped into s_groups) aspects of Erlang and SD Erlang systems. Visualisation is within a browser, with web sockets providing the connections between the JavaScript visualisations and the running Erlang systems, instrumented through the trace tool builder (ttb).

Figure 25 shows visualisations from Devo in both modes. On the left-hand side a single compute node is shown. This consists of two physical chips (the upper and lower halves of the diagram) with six cores each; with hyperthreading this gives twelve virtual cores per chip, and hence 24 run queues in total. The size of these run queues is shown by both the colour and height of each column, and process migrations are illustrated by (fading) arcs between the queues within the circle. A green arc shows a migration on the same physical core, a grey one on the same chip, and blue shows migrations between the two chips.

On the right-hand side of Figure 25 is a visualisation of SD-Orbit in action. Each node in the graph represents an Erlang node, and colours (red, green, blue and orange) are used to represent the s_groups to which a node belongs. As is evident, the three nodes in the central triangle belong to multiple groups, and act as routing nodes between the other nodes. The colour of the arc joining two nodes represents the current intensity of communication between the nodes (green quiescent; red busiest).

iii.4 SD-Mon

SD-Mon is a tool specifically designed for monitoring SD Erlang systems. This purpose is accomplished by means of a shadow network of agents that collect data from a running system. An example deployment is shown in Figure 26, where blue dots represent nodes in the target system and the other nodes make up the SD-Mon infrastructure. The network is deployed on the basis of a configuration file describing the network architecture in terms of hosts, Erlang nodes, global group and s_group partitions. Tracing to be performed on monitored nodes is also specified within the configuration file.


An agent is started by a master SD-Mon node for each s_group and for each free node. Configured tracing is applied on every monitored node, and traces are stored in binary format in the agent file system. The shadow network follows system changes so that agents are started and stopped at runtime as required, as shown in Figure 27. Such changes are persistently stored so that the last configuration can be reproduced after a restart. Of course, the shadow network can always be updated via the user interface.

Each agent takes care of an s_group or of a free node. At start-up it tries to get in contact with its nodes and apply the tracing to them as stated by the master. Binary files are stored in the host file system. Tracing is internally used in order to track s_group operations happening at runtime. An asynchronous message is sent to the master whenever one of these changes occurs. Since each process can only be traced by a single process at a time, each node (including those belonging to more than one s_group) is controlled by only one agent. When a node is removed from a group or when the group is deleted, another agent takes over, as shown in Figure 27. When an agent is stopped, all traces on the controlled nodes are switched off.

The monitoring network is also supervised, in order to take account of network fragility: when an agent node goes down another node is deployed to play its role, and there are also periodic consistency checks for the system as a whole; when an inconsistency is detected, that part of the system can be restarted.

Figure 25: Devo: low- and high-level visualisations of SD-Orbit.

SD-Mon does more than monitor activities one node at a time. In particular, inter-node and inter-group messages are displayed at runtime. As soon as an agent is stopped, the related tracing files are fetched across the network by the master, and they are made available in a readable format in the master file system. SD-Mon provides facilities for online visualisation of this data, as well as post hoc offline analysis. Figure 28 shows, in real time, messages that are sent between s_groups.
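The configuration file pairs the network description with the tracing to apply. The following Erlang-term sketch is purely hypothetical — the actual SD-Mon file syntax is not shown in this article — but illustrates the kind of information described above:

    %% Hypothetical SD-Mon configuration (illustrative syntax only):
    %% hosts, the nodes on them, the s_group partition, and the trace
    %% flags to apply to each monitored group.
    {hosts,    ['host1.example.org', 'host2.example.org']}.
    {nodes,    ['node1@host1', 'node2@host1', 'node3@host2']}.
    {s_groups, [{group1, ['node1@host1', 'node2@host1']},
                {group2, ['node2@host1', 'node3@host2']}]}.
    {tracing,  [{group1, [s_group_ops, messages]},
                {group2, [messages]}]}.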


Figure 26: SD-Mon architecture.

This data can also be used as input to the animated Devo visualisation, as illustrated in the right-hand side of Figure 25.

VIII. Systemic Evaluation

Preceding sections have investigated the improvements of individual aspects of an Erlang system, e.g. ETS tables in Section i. This section analyses the impact of the new tools and technologies from Sections V, VI and VII in concert. We do so by investigating the deployment, reliability, and scalability of the Orbit and ACO benchmarks from Section III. The experiments reported here are representative: similar experiments show consistent results for a range of micro-benchmarks, several benchmarks, and the very substantial (approximately 150K lines of Erlang) Sim-Diasca case study [16], on several state-of-the-art NUMA architectures and the four clusters specified in Appendix A. A coherent presentation of many of these results is available in an article by [22] and in a RELEASE project deliverable (Deliverable 6.2: http://www.release-project.eu/documents/D6.2.pdf). The bulk of the experiments reported here are conducted on the Athos cluster using Erlang/OTP 17.4 and the associated SD Erlang libraries.

Figure 27: SD-Mon evolution. Before (top) and after eliminating s_group 2 (bottom).


Figure 28: SD-Mon: online monitoring.

The experiments cover two measures of scalability. As Orbit does a fixed-size computation, the scaling measure is relative speedup (or strong scaling), i.e. speedup relative to execution time on a single core. As the work in ACO increases with compute resources, weak scaling is the appropriate measure. The benchmarks also evaluate different aspects of s_groups: Orbit evaluates the scalability impact of network connections, while ACO evaluates the impact of both network connections and the global namespace required for reliability.

i. Orbit

Figure 29 shows the speedup of the D-Orbit and SD-Orbit benchmarks from Section i.1. The measurements are repeated seven times, and we plot the standard deviation. The results show that D-Orbit performs better on a small number of nodes, as communication is direct rather than via a gateway node. As the number of nodes grows, however, SD-Orbit delivers better speedups, i.e. beyond 80 nodes in the case of 2M orbit elements, and beyond 100 nodes in the case of 5M orbit elements. When we increase the size of Orbit beyond 5M, the D-Orbit version fails because some VMs exceed the available RAM of 64GB. In contrast, SD-Orbit experiments run successfully even for an orbit with 60M elements.

ii. Ant Colony Optimisation (ACO)

Deployment. The deployment and monitoring of ACO, and of the large (150K lines of Erlang) Sim-Diasca simulation, using WombatOAM (Section ii) on the Athos cluster is detailed in [22].



Figure 29: Speedup of distributed Erlang vs. SD Erlang Orbit (2M and 5M elements) [Strong Scaling].

An example experiment deploys 10,000 Erlang nodes without enabling monitoring, and hence allocates three nodes per core (i.e. 72 nodes on each of 139 24-core Athos hosts). Figure 30 shows that WombatOAM deploys the nodes in 212s, which is approximately 47 nodes per second. It is more common to have at least one core per Erlang node, and in a related experiment 5000 nodes are deployed, one per core (i.e. 24 nodes on each of 209 24-core Athos hosts), in 101s, or 50 nodes per second. Crucially, in both cases the deployment time is linear in the number of nodes. The deployment time could be reduced to be logarithmic in the number of nodes using standard divide-and-conquer techniques; however, there hasn't been the demand to do so, as most Erlang systems are long-running servers.

The measurement data shows two important facts: it shows that WombatOAM scales well (up to a deployment base of 10,000 Erlang nodes), and that WombatOAM is non-intrusive, because its overhead on a monitored node is typically less than 1.5% of effort on the node.

We conclude that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs. The experiments in the remainder of this section use standard distributed Erlang configuration file deployment.

Reliability. SD Erlang changes the organisation of processes and their recovery data at the language level, so we seek to show that these changes have not disrupted Erlang's world-class reliability mechanisms at this level. As we haven't changed them, we don't exercise Erlang's other reliability mechanisms, e.g. those for managing node failures, network congestion, etc. A more detailed study of SD Erlang reliability, including the use of replicated databases for recovering Instant Messenger chat sessions, finds similar results [23].

We evaluate the reliability of two ACO versions using a Chaos Monkey service that kills processes in the running system at random [14]. Recall that GR-ACO provides reliability by registering the names of critical processes globally, whereas SR-ACO registers them only within an s_group (Section ii).

For both GR-ACO and SR-ACO a Chaos Monkey runs on every Erlang node, i.e. master, submasters, and colony nodes, killing a random Erlang process every second. Both ACO versions run to completion. Recovery, at this failure frequency, has no measurable impact on runtime.
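The essence of such a Chaos Monkey service is easily expressed; the sketch below is ours, not the service used in the experiments (which is based on [14]), and a real implementation would exclude system processes and itself:

    %% Kill a random process on this node every second.
    chaos_monkey() ->
        timer:sleep(1000),
        Procs = erlang:processes(),
        Victim = lists:nth(rand:uniform(length(Procs)), Procs),
        exit(Victim, kill),   % unconditional kill, as with exit/2 reason kill
        chaos_monkey().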



Figure 30: Wombat deployment time. The best fit to the experimental data is 0.0125N + 83 seconds for N nodes.


Figure 31: ACO execution times, Erlang/OTP 17.4 (RELEASE) [Weak Scaling].

37 P. Trinder, N. Chechina, N. Papaspyrou, K. Sagonas, S. Thompson, et al. • Scaling Reliably • 2017 impact on runtime. This is because processes shows that SD Erlang significantly reduces the are recovered within the Virtual machine using network traffic between Erlang nodes. Even (globally synchronised) local recovery informa- with the s_group name registration SR-ACO tion. For example, on a common X86/Ubuntu has less network traffic than ML-ACO that has platform typical Erlang process recovery times no global name registration. are around 0.3ms, so around 1000x less than the Unix process recovery time on the same iii. Evaluation Summary platform [58]. We have conducted more de- tailed experiments on an Instant Messenger We have shown that WombatOAM is capable benchmark, and obtained similar results [23]. of deploying and monitoring substantial dis- We conclude that both GR-ACO and SR- tributed Erlang and SD Erlang programs like ACO are reliable, and that SD Erlang preserves ACO and Sim-Diasca. The Chaos Monkey ex- distributed Erlang reliability model. The re- periments with GR-ACO and SR-ACO show mainder of the section outlines the impact of that both are reliable, and hence that SD Erlang maintaining the recovery information required preserves the distributed Erlang language-level for reliability on scalability. reliability model. As SD Orbit scales better than D-Orbit, SR- Scalability Figure 31 compares the runtimes ACO scales better than ML-ACO, and SR- of the ML, GR, and SR versions of ACO (Sec- ACO has significantly less network traffic, tion ii) on Erlang/OTP 17.4(RELEASE). As out- we conclude that, even when global recovery lined in Section i.1, GR-ACO not only main- data is not maintained, partitioning the fully- tains a fully connected graph of nodes, it reg- connected network into s_groups reduces net- isters process names for reliability, and hence work traffic and improves performance. While scales significantly worse than the unreliable the distributed Orbit instances (W=2M) and ML-ACO. We conclude that providing reliabil- (W=5M) reach scalability limits at around 40 ity with standard distributed Erlang process and 60 nodes, Orbit scales to 150 nodes on SD registration dramatically limits scalability. Erlang (limited by input size), and SR-ACO While ML-ACO does not provide reliability, is still scaling well on 256 nodes (6144 cores). and hence doesn’t register process names, it Hence not only have we exceeded the 60 node maintains a fully connected graph of nodes scaling limits of distributed Erlang identified which limits scalability. SR-ACO, that main- in Section ii, we have not reached the scaling tains connections and registers process names limits of SD Erlang on this architecture. only within s_groups scales best of all. Fig- Comparing GR-ACO and ML-ACO scalabil- ure 31 illustrates how maintaining the process ity curves shows that maintaining global re- namespace, and fully connected network, im- covery data, i.e. a process name space, dra- pacts performance. This reinforces the evi- matically limits scalability. Comparing GR- dence from the Orbit benchmarks, and oth- ACO and SR-ACO scalability curves shows ers, that partitioning the network of Erlang that scalability can be recovered by partitioning nodes significantly improves performance at the nodes into appropriately-sized s_groups, large scale. and hence maintaining the recovery data only To investigate the impact of SD Erlang on within a relatively small group of nodes. 
We conclude that both GR-ACO and SR-ACO are reliable, and that SD Erlang preserves the distributed Erlang reliability model. The remainder of the section outlines the impact on scalability of maintaining the recovery information required for reliability.

Scalability. Figure 31 compares the runtimes of the ML, GR, and SR versions of ACO (Section ii) on Erlang/OTP 17.4 (RELEASE). As outlined in Section i.1, GR-ACO not only maintains a fully connected graph of nodes, it registers process names for reliability, and hence scales significantly worse than the unreliable ML-ACO. We conclude that providing reliability with standard distributed Erlang process registration dramatically limits scalability. While ML-ACO does not provide reliability, and hence does not register process names, it maintains a fully connected graph of nodes, which limits scalability. SR-ACO, which maintains connections and registers process names only within s_groups, scales best of all. Figure 31 illustrates how maintaining the process namespace and a fully connected network impacts performance. This reinforces the evidence from the Orbit benchmarks, and others, that partitioning the network of Erlang nodes significantly improves performance at large scale.

To investigate the impact of SD Erlang on network traffic, we measure the number of sent and received packets on the GPG cluster for three versions of ACO: ML-ACO, GR-ACO, and SR-ACO. Figure 32 shows the total number of sent packets. The highest traffic (the red line) belongs to GR-ACO, and the lowest traffic belongs to SR-ACO (the dark blue line). This shows that SD Erlang significantly reduces the network traffic between Erlang nodes: even with the s_group name registration, SR-ACO has less network traffic than ML-ACO, which has no global name registration.

Figure 32: Number of sent packets in ML-ACO, GR-ACO, and SR-ACO.

iii. Evaluation Summary

We have shown that WombatOAM is capable of deploying and monitoring substantial distributed Erlang and SD Erlang programs like ACO and Sim-Diasca. The Chaos Monkey experiments with GR-ACO and SR-ACO show that both are reliable, and hence that SD Erlang preserves the distributed Erlang language-level reliability model.

As SD Orbit scales better than D-Orbit, SR-ACO scales better than ML-ACO, and SR-ACO has significantly less network traffic, we conclude that, even when global recovery data is not maintained, partitioning the fully-connected network into s_groups reduces network traffic and improves performance. While the distributed Orbit instances (W=2M) and (W=5M) reach scalability limits at around 40 and 60 nodes, Orbit scales to 150 nodes on SD Erlang (limited by input size), and SR-ACO is still scaling well on 256 nodes (6144 cores). Hence not only have we exceeded the 60 node scaling limit of distributed Erlang identified in Section ii, we have not reached the scaling limits of SD Erlang on this architecture.

Comparing the GR-ACO and ML-ACO scalability curves shows that maintaining global recovery data, i.e. a process name space, dramatically limits scalability. Comparing the GR-ACO and SR-ACO scalability curves shows that scalability can be recovered by partitioning the nodes into appropriately-sized s_groups, and hence maintaining the recovery data only within a relatively small group of nodes. These results are consistent with other experiments.
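The Chaos Monkey experiments above amount to repeatedly killing randomly chosen processes and checking that the application continues to produce correct results. A Chaos-Monkey-style loop can be sketched in a few lines of plain Erlang; this sketch is our illustration, not the harness used in the reported experiments.

    %% Sketch of a Chaos-Monkey-style killer: every IntervalMs it kills
    %% a random process on a random connected node. Illustrative only.
    -module(chaos).
    -export([run/1]).

    run(IntervalMs) ->
        Nodes = [node() | nodes()],
        Node = lists:nth(rand:uniform(length(Nodes)), Nodes),
        Pids = rpc:call(Node, erlang, processes, []),
        Victim = lists:nth(rand:uniform(length(Pids)), Pids),
        exit(Victim, kill),          % supervisors should restart it
        timer:sleep(IntervalMs),
        run(IntervalMs).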


IX. Discussion

Distributed actor platforms like Erlang, or Scala with Akka, are a common choice for internet-scale system architects, as the model, with its automatic and VM-supported reliability mechanisms, makes it straightforward to engineer scalable reliable systems. Targeting emergent server architectures with hundreds of hosts and tens of thousands of cores, we report a systematic effort to improve the scalability of a leading distributed actor language while preserving reliability. The work is a vade mecum for addressing scalability in reliable actor languages and frameworks. It is also high impact, with downloads of our improved Erlang/OTP running at 50K a month.

We have undertaken the first systematic study of scalability in a distributed actor language, covering the VM, language, and persistent storage levels, and have developed the BenchErl and DE-Bench tools for this purpose. Key VM-level scalability issues we identify include contention for shared ETS tables and for commonly-used shared resources like timers. Key language-level scalability issues are the costs of maintaining a fully-connected network, maintaining global recovery information, and explicit process placement. Unsurprisingly the scaling issues for this distributed actor language are common to other distributed or parallel languages and frameworks with other paradigms, like CHARM++ [45], Cilk [15], or Legion [40]. We establish scientifically the folklore limitation of around 60 connected nodes for distributed Erlang (Section IV).

The actor model is no panacea, and there can still be scalability problems in the algorithms that we write, either within a single actor or in the way that we structure communicating actors. A range of pragmatic issues also impact the performance and scalability of actor systems, including memory occupied by processes (even when quiescent), mailboxes filling up, etc. Identifying and resolving these problems is where tools like Percept2 and WombatOAM are needed. However, many of the scalability issues arise where Erlang departs from the private state principle of the actor model, e.g. in maintaining shared state in ETS tables, or a shared global process namespace for recovery.

We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level scalability issues. The key constructs are s_groups for partitioning the network and the global process namespace, and semi-explicit process placement for deploying distributed Erlang applications on large heterogeneous architectures in a portable way. We have provided a state transition operational semantics for the new s_groups, and validated the library implementation against the semantics using QuickCheck (Section V).
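To illustrate the constructs, the sketch below partitions four nodes into an s_group and registers a process name that is visible only within that group. It assumes the SD Erlang s_group API (s_group:new_s_group/2, s_group:register_name/3, s_group:whereis_name/2) as we read it from the project documentation; the node, group, and module names are illustrative.

    %% Sketch of s_group usage; the API and return shapes follow our
    %% reading of the SD Erlang documentation, names are illustrative.
    Nodes = ['n1@host1', 'n2@host1', 'n3@host2', 'n4@host2'],
    {ok, _Group, _Started} = s_group:new_s_group(swarm1, Nodes),

    %% Register a name for recovery within the group only: the name is
    %% not propagated to nodes outside swarm1, so no global locking.
    Master = spawn('n1@host1', aco_master, start, []),
    yes = s_group:register_name(swarm1, master, Master),

    %% Lookup succeeds on the members of swarm1, and only there.
    Master = s_group:whereis_name(swarm1, master).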

To improve the scalability of the Erlang VM and libraries we have improved the implementation of shared ETS tables, time management, and load balancing between schedulers. Following a systematic analysis of ETS tables, the numbers of fine-grained (bucket) locks and of reader groups have been increased. We have developed and evaluated four new techniques for improving ETS scalability: (i) programmer control of the number of bucket locks; (ii) a contention-adapting tree data structure for ordered_sets; (iii) queue delegation locking; and (iv) eliminating the locks in the meta table. We have introduced a new scheduler utilisation balancing mechanism to spread work to multiple schedulers (and hence cores), and new synchronisation mechanisms to reduce contention on the widely-used time management mechanisms. By June 2015, with Erlang/OTP 18.0, the majority of these changes had been included in the primary releases. In any scalable actor language implementation such thoughtful design and engineering will be required to schedule large numbers of actors on hosts with many cores, and to minimise contention on shared VM resources (Section VI).
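The ETS improvements are selected through ordinary table options, as in the sketch below using the standard ets API. Note that the contention-adapting trees back ordered_set tables with write_concurrency only in OTP releases later than those measured here, and the programmer-controlled bucket-lock count was an experimental extension.

    %% Standard ETS options that select the scalable implementations:
    %% write_concurrency requests fine-grained (bucket) locking, and
    %% read_concurrency optimises the table for concurrent readers.
    Tab = ets:new(stats, [ordered_set, public,
                          {write_concurrency, true},
                          {read_concurrency, true}]),
    true = ets:insert(Tab, {requests, 0}),
    1 = ets:update_counter(Tab, requests, 1).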
To facilitate the development of large Erlang systems, and to make them understandable, we have developed a range of tools. The proprietary WombatOAM tool deploys and monitors large distributed Erlang systems over multiple, and possibly heterogeneous, clusters or clouds. We have made open source releases of four concurrency tools: Percept2 now detects concurrency bad smells; Wrangler provides enhanced concurrency refactoring; the Devo tool is enhanced to provide interactive visualisation of SD Erlang systems; and the new SD-Mon tool monitors SD Erlang systems. We anticipate that these tools will guide the design of tools for other large scale distributed actor languages and frameworks (Section VII).

We report on the reliability and scalability implications of our new technologies using a range of benchmarks, and consistently use the Orbit and ACO benchmarks throughout the article. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even when global recovery data is not maintained, partitioning the network into s_groups reduces network traffic and improves the performance of the Orbit and ACO benchmarks above 80 hosts. Crucially we exceed the 60 node limit for distributed Erlang and do not reach the scalability limits of SD Erlang with 256 nodes/VMs and 6144 cores. Chaos Monkey experiments show that two versions of ACO are reliable, and hence that SD Erlang preserves the Erlang reliability model. However the ACO results show that maintaining global recovery data, i.e. a global process name space, dramatically limits scalability in distributed Erlang. Scalability can, however, be recovered by maintaining recovery data only within appropriately sized s_groups. These results are consistent with experiments with other benchmarks and on other architectures (Section VIII).

In future work we plan to incorporate RELEASE technologies, along with other technologies, in a generic framework for building performant large scale servers. In addition, preliminary investigations suggest that some SD Erlang ideas could improve the scalability of other actor languages (see Deliverable 6.7, http://www.release-project.eu/documents/D6.7.pdf). For example the Akka framework for Scala could benefit from semi-explicit placement, and Cloud Haskell from partitioning the network.

Appendix A: Architecture Specifications

The specifications of the clusters used for measurement are summarised in Table 2. We also use the following NUMA machines. (1) An AMD Bulldozer with 16M L2/16M L3 cache, 128GB RAM, and four AMD Opteron 6276s at 2.3 GHz, each with 16 “Bulldozer” cores, giving a total of 64 cores. (2) An Intel NUMA with 128GB RAM and four Intel Xeon E5-4650s at 2.70GHz, each with eight hyperthreaded cores, giving a total of 64 cores.


Table 2: Cluster Specifications (RAM per host in GB).

Name    Hosts  Cores/host  Total cores  Max avail.  Processor                                 RAM     Interconnect
GPG        20      16          320          320     Intel Xeon E5-2640v2 8C, 2GHz             64      10Gb Ethernet
Kalkyl    348       8        2,784        1,408     Intel Xeon 5520v2 4C, 2.26GHz             24–72   InfiniBand 20Gb/s
TinTin    160      16        2,560        2,240     AMD Opteron 6220v2 Bulldozer 8C, 3.0GHz   64–128  2:1 oversubscribed QDR InfiniBand
Athos     776      24       18,624        6,144     Intel Xeon E5-2697v2 12C, 2.7GHz          64      InfiniBand FDR14

Acknowledgements

We would like to thank the entire RELEASE project team for technical insights and administrative support. Roberto Aloi and Enrique Fernandez Casado contributed to the development and measurement of WombatOAM. This work has been supported by the European Union grant RII3-CT-2005-026133 “SCIEnce: Symbolic Computing Infrastructure in Europe”, IST-2011-287510 “RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software”, and by the UK’s Engineering and Physical Sciences Research Council grant EP/G055181/1 “HPC-GAP: High Performance Computational Algebra and Discrete Mathematics”.

References

[1] Gul Agha. ACTORS: A Model of Concurrent Computation in Distributed Systems. PhD thesis, MIT, 1985.

[2] Gul Agha. An overview of Actor languages. SIGPLAN Not., 21(10):58–67, 1986.

[3] Bulldozer (microarchitecture), 2015. https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture).

[4] Apache SF. Libcloud, 2016. https://libcloud.apache.org/.

[5] C. R. Aragon and R. G. Seidel. Randomized search trees. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 540–545, October 1989.

[6] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007.

[7] Joe Armstrong. Erlang. Commun. ACM, 53:68–75, 2010.

[8] Stavros Aronis, Nikolaos Papaspyrou, Katerina Roukounaki, Konstantinos Sagonas, Yiannis Tsiouris, and Ioannis E. Venetis. A scalability benchmark suite for Erlang/OTP. In Torben Hoffman and John Hughes, editors, Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 33–42, New York, NY, USA, September 2012. ACM.

[9] Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with Quviq QuickCheck. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 2–10, New York, NY, USA, 2006. ACM.


[10] Robert Baker, Peter Rodgers, Simon Thompson, and Huiqing Li. Multi-level visualization of concurrent and distributed computation in Erlang. In Visual Languages and Computing (VLC): The 19th International Conference on Distributed Multimedia Systems (DMS 2013), 2013.

[11] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The Datacenter as a Computer. Morgan and Claypool, 2nd edition, 2013.

[12] Basho Technologies. Riak Docs: Basho Bench, 2014. http://docs.basho.com/riak/latest/ops/building/benchmarking/.

[13] J. E. Beasley. OR-Library: Distributing test problems by electronic mail. The Journal of the Operational Research Society, 41(11):1069–1072, 1990. Datasets available at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/wtinfo.html.

[14] Cory Bennett and Ariel Tseitlin. Chaos Monkey released into the wild. Netflix Blog, 2012.

[15] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[16] Olivier Boudeville. Technical manual of the Sim-Diasca simulation engine. EDF R&D, 2012.

[17] István Bozó, Viktória Fördős, Dániel Horpácsi, Zoltán Horváth, Tamás Kozsik, Judit Kőszegi, and Melinda Tóth. Refactorings to enable parallelization. In Jurriaan Hage and Jay McCarthy, editors, Trends in Functional Programming: 15th International Symposium, TFP 2014, Revised Selected Papers, volume 8843 of LNCS, pages 104–121. Springer, 2015.

[18] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. NUMA-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 157–166, New York, NY, USA, 2013. ACM.

[19] Francesco Cesarini and Simon Thompson. Erlang Programming: A Concurrent Approach to Software Development. O’Reilly Media, 1st edition, 2009.

[20] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, USA, 2001.

[21] Natalia Chechina, Huiqing Li, Amir Ghaffari, Simon Thompson, and Phil Trinder. Improving the network scalability of Erlang. J. Parallel Distrib. Comput., 90(C):22–34, 2016.

[22] Natalia Chechina, Kenneth MacKenzie, Simon Thompson, Phil Trinder, Olivier Boudeville, Viktória Fördős, Csaba Hoch, Amir Ghaffari, and Mario Moro Hernandez. Evaluating scalable distributed Erlang for scalability and reliability. IEEE Transactions on Parallel and Distributed Systems, 2017.

[23] Natalia Chechina, Mario Moro Hernandez, and Phil Trinder. A scalable reliable instant messenger using the SD Erlang libraries. In Erlang’16, pages 33–41, New York, NY, USA, 2016. ACM.

[24] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, pages 268–279, New York, NY, USA, 2000. ACM.


[25] H. A. J. Crauwels, C. N. Potts, and L. N. van Wassenhove. Local search heuristics for the single machine total weighted tardiness scheduling problem. INFORMS Journal on Computing, 10(3):341–350, 1998. The datasets from this paper are included in Beasley’s ORLIB.

[26] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[27] David Dewolfs, Jan Broeckhove, Vaidy Sunderam, and Graham E. Fagg. FT-MPI, fault-tolerant metacomputing and generic name services: A case study. In EuroPVM/MPI’06, pages 133–140, Berlin, Heidelberg, 2006. Springer-Verlag.

[28] Alan A. A. Donovan and Brian W. Kernighan. The Go Programming Language. Addison-Wesley Professional, 2015.

[29] Marco Dorigo and Thomas Stützle. Ant Colony Optimization. Bradford Company, Scituate, MA, USA, 2004.

[30] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable NonZero indicators. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing, pages 13–22, New York, NY, USA, 2007. ACM.

[31] Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. Towards Haskell in the Cloud. In Haskell ’11, pages 118–129. ACM, 2011.

[32] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[33] Ana Gainaru and Franck Cappello. Errors and faults. In Fault-Tolerance Techniques for High-Performance Computing, pages 89–144. Springer International Publishing, 2015.

[34] Martin Josef Geiger. New instances for the Single Machine Total Weighted Tardiness Problem. Research Report 10-03-01, March 2010. Datasets available at http://logistik.hsu-hh.de/SMTWTP.

[35] Guillaume Germain. Concurrency oriented programming in Termite Scheme. In Proceedings of the 2006 ACM SIGPLAN Workshop on Erlang, pages 20–20, New York, NY, USA, 2006. ACM.

[36] Amir Ghaffari. DE-Bench, a benchmark tool for distributed Erlang, 2014. https://github.com/amirghaffari/DEbench.

[37] Amir Ghaffari. Investigating the scalability limits of distributed Erlang. In Proceedings of the 13th ACM SIGPLAN Workshop on Erlang, pages 43–49. ACM, 2014.

[38] Amir Ghaffari, Natalia Chechina, Phil Trinder, and Jon Meredith. Scalable persistent storage for Erlang: Theory and practice. In Proceedings of the 12th ACM SIGPLAN Workshop on Erlang, pages 73–74, New York, NY, USA, 2013. ACM.

[39] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33:51–59, 2002.

[40] Andrew S. Grimshaw, Wm. A. Wulf, and the Legion team. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, 1997.

[41] Carl Hewitt. Actor model for discretionary, adaptive concurrency. CoRR, abs/1008.1459, 2010.

[42] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular ACTOR formalism for artificial intelligence. In IJCAI’73, pages 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann.

[43] Rich Hickey. The Clojure programming language. In DLS’08, pages 1:1–1:1, New York, NY, USA, 2008. ACM.


[44] Zoltán Horváth, László Lövei, Tamás Kozsik, Róbert Kitlei, Anikó Nagyné Víg, Tamás Nagy, Melinda Tóth, and Roland Király. Building a refactoring tool for Erlang. In Workshop on Advanced Software Development Tools and Techniques, WASDETT 2008, 2008.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. In ACM SIGPLAN Notices, volume 28, pages 91–108. ACM, 1993.

[46] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. On the scalability of the Erlang term storage. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 15–26, New York, NY, USA, September 2013. ACM.

[47] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Delegation locking libraries for improved performance of multithreaded programs. In Euro-Par 2014 Parallel Processing, volume 8632 of LNCS, pages 572–583. Springer, 2014.

[48] David Klaftenegger, Konstantinos Sagonas, and Kjell Winblad. Queue delegation locking. IEEE Transactions on Parallel and Distributed Systems, 2017. To appear.

[49] Rusty Klophaus. Riak Core: Building distributed applications without shared state. In ACM SIGPLAN Commercial Users of Functional Programming, CUFP ’10, pages 14:1–14:1, New York, NY, USA, 2010. ACM.

[50] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, April 2010.

[51] J. Lee et al. Python actor runtime library, 2010. osl.cs.uiuc.edu/parley/.

[52] Huiqing Li and Simon Thompson. Automated API migration in a user-extensible refactoring tool for Erlang programs. In Tim Menzies and Motoshi Saeki, editors, Automated Software Engineering, ASE’12. IEEE Computer Society, 2012.

[53] Huiqing Li and Simon Thompson. Multicore profiling for Erlang programs using Percept2. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, 2013.

[54] Huiqing Li and Simon Thompson. Improved semantics and implementation through property-based testing with QuickCheck. In 9th International Workshop on Automation of Software Test, 2014.

[55] Huiqing Li and Simon Thompson. Safe concurrency introduction through slicing. In PEPM ’15. ACM SIGPLAN, January 2015.

[56] Huiqing Li, Simon Thompson, György Orosz, and Melinda Tóth. Refactoring with Wrangler, updated. In ACM SIGPLAN Erlang Workshop, volume 2008, 2008.


[57] Frank Lübeck and Max Neunhöffer. Enumerating large orbits and direct condensation. Experimental Mathematics, 10(2):197–205, 2001.

[58] Andreea Lutac, Natalia Chechina, Gerardo Aragon-Camarasa, and Phil Trinder. Towards reliable and scalable robot communication. In Proceedings of the 15th International Workshop on Erlang, Erlang 2016, pages 12–23, New York, NY, USA, 2016. ACM.

[59] LWN.net. The high-resolution timer API, January 2006. https://lwn.net/Articles/167897/.

[60] Kenneth MacKenzie, Natalia Chechina, and Phil Trinder. Performance portability through semi-explicit placement in distributed Erlang. In Proceedings of the 14th ACM SIGPLAN Workshop on Erlang, pages 27–38. ACM, 2015.

[61] Jeff Matocha and Tracy Camp. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43:207–221, 1998.

[62] Nicholas D. Matsakis and Felix S. Klock II. The Rust language. In ACM SIGAda Ada Letters, volume 34, pages 103–104. ACM, 2014.

[63] Robert McNaughton. Scheduling with deadlines and loss functions. Management Science, 6(1):1–12, 1959.

[64] Martin Odersky et al. The Scala programming language, 2012. www.scala-lang.org.

[65] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.

[66] Nikolaos Papaspyrou and Konstantinos Sagonas. On preserving term sharing in the Erlang virtual machine. In Torben Hoffman and John Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20. ACM, 2012.

[67] RELEASE Project Team. EU Framework 7 Project 287510 (2011–2015), 2015. http://www.release-project.eu.

[68] Konstantinos Sagonas and Thanassis Avgerinos. Automatic refactoring of Erlang programs. In Principles and Practice of Declarative Programming (PPDP’09), pages 13–24. ACM, 2009.

[69] Konstantinos Sagonas and Kjell Winblad. More scalable ordered set for ETS using adaptation. In Proc. ACM SIGPLAN Workshop on Erlang, pages 3–11. ACM, September 2014.

[70] Konstantinos Sagonas and Kjell Winblad. Contention adapting search trees. In Proceedings of the 14th International Symposium on Parallel and Distributed Computing, pages 215–224. IEEE Computing Society, 2015.

[71] Konstantinos Sagonas and Kjell Winblad. Efficient support for range queries and range updates using contention adapting search trees. In Languages and Compilers for Parallel Computing, 28th International Workshop, volume 9519 of LNCS, pages 37–53. Springer, 2016.

[72] Marc Snir, Steve W. Otto, D. W. Walker, Jack Dongarra, and Steven Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.

[73] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java. In ECOOP’08, pages 104–128, Berlin, Heidelberg, 2008. Springer-Verlag.

[74] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F# 4.0. Springer, 2015.

[75] Simon Thompson and Huiqing Li. Refactoring tools for functional languages. Journal of Functional Programming, 23(3), 2013.

[76] Marcus Völker. Linux timers, May 2014.

[77] WhatsApp, 2015. https://www.whatsapp.com/.

[78] Tom White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale. O’Reilly, 3rd edition, revised and updated, 2012.

[79] Ulf Wiger. Industrial-strength functional programming: Experiences with the Ericsson AXD301 project. In Implementing Functional Languages (IFL’00), Aachen, Germany, September 2000. Springer-Verlag.
