Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform

Phil Trinder, Natalia Chechina, Nikolaos Papaspyrou, Konstantinos Sagonas, Simon Thompson, Stephen Adams, Stavros Aronis, Robert Baker, Eva Bihari, Olivier Boudeville, Francesco Cesarini, Maurizio Di Stefano, Sverker Eriksson, Viktoria Fordos, Amir Ghaffari, Aggelos Giantsios, Rickard Green, Csaba Hoch, David Klaftenegger, Huiqing Li, Kenneth Lundin, Kenneth MacKenzie, Katerina Roukounaki, Yiannis Tsiouris, Kjell Winblad

RELEASE EU FP7 STREP (287510) project, www.release-project.eu/
April 25, 2017
Abstract
Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang programming language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems, and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model. We systematically study the scalability limits of Erlang, and then address the issues at the virtual machine, language and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues, and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug we have developed and made open source releases of five complementary tools, some specific to SD Erlang. Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale (256 hosts, 6144 cores).
I. Introduction

Distributed programming languages and frameworks are central to engineering large scale systems, where key properties include scalability and reliability. By scalability we mean that performance increases as hosts and cores are added, and by large scale we mean architectures with hundreds of hosts and tens of thousands of cores. Experience with high performance and data centre computing shows that reliability is critical at these scales, e.g. host failures alone account for around one failure per hour on commodity servers with approximately 10^5 cores [11]. To be usable, programming languages employed on them must be supported by a suite of deployment, monitoring, refactoring and testing tools that work at scale.

Controlling shared state is the only way to build reliable scalable systems. State shared by multiple units of computation limits scalability due to high synchronisation and communication costs. Moreover shared state is a threat to reliability, as failures corrupting or permanently locking shared state may poison the entire system.

Actor languages avoid shared state: actors or processes have entirely local state, and only interact with each other by sending messages [2]. Recovery is facilitated in this model, since actors, like operating system processes, can fail independently without affecting the state of other actors. Moreover an actor can supervise other actors, detecting failures and taking remedial action, e.g. restarting the failed actor.

Erlang [6, 19] is a beacon language for reliable scalable computing with a widely emulated distributed actor model. It has influenced the design of numerous programming languages like Clojure [43] and F# [74], and many languages have Erlang-inspired actor frameworks, e.g. Kilim for Java [73], Cloud Haskell [31], and Akka for C#, F# and Scala [64]. Erlang is widely used for building reliable scalable servers, e.g. Ericsson's AXD301 telephone exchange (switch) [79], the Facebook chat server, and the Whatsapp instant messaging server [77]. In Erlang, the actors are termed processes and are managed by a sophisticated Virtual Machine on a single multicore or NUMA host, while distributed Erlang provides relatively transparent distribution over networks of VMs on multiple hosts. Erlang is supported by the Open Telecom Platform (OTP) libraries that capture common patterns of reliable distributed computation, such as the client-server pattern and process supervision. Any large-scale system needs scalable persistent storage and, following the CAP theorem [39], Erlang uses and indeed implements Dynamo-style NoSQL DBMS like Riak [49] and Cassandra [50].

While the Erlang distributed actor model conceptually provides reliable scalability, it has some inherent scalability limits, and indeed large-scale distributed Erlang systems must depart from the distributed Erlang paradigm in order to scale, e.g. not maintaining a fully connected graph of hosts. The EU FP7 RELEASE project set out to establish and address the scalability limits of the Erlang reliable distributed actor model [67].

After outlining related work (Section II) and the benchmarks used throughout the article (Section III), we investigate the scalability limits of Erlang/OTP, seeking to identify specific issues at the virtual machine, language and persistent storage levels (Section IV). We then report the RELEASE project work to address these issues, working at the following three levels.

1. We have designed and implemented a set of Scalable Distributed (SD) Erlang libraries to address language-level reliability and scalability issues. An operational semantics is provided for the key new s_group construct, and the implementation is validated against the semantics (Section V).
2. We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have improved the shared ETS tables, time management, and load balancing between schedulers. Most of these improvements are now included in the Erlang/OTP release, currently downloaded approximately 50K times each month (Section VI).

3. To facilitate the development of scalable Erlang systems, and to make them maintainable, we have developed three new tools: Devo, SDMon and WombatOAM, and enhanced two others: the visualisation tool Percept and the refactorer Wrangler. The tools support refactoring programs to make them more scalable, easier to deploy at large scale (hundreds of hosts), and easier to monitor and visualise. Most of these tools are freely available under open source licences; the WombatOAM deployment and monitoring tool is a commercial product (Section VII).

Throughout the article we use two benchmarks to investigate the capabilities of our new technologies and tools. These are a computation in symbolic algebra, more specifically an algebraic 'orbit' calculation that exploits a non-replicated distributed hash table, and an Ant Colony Optimisation (ACO) parallel search program (Section III).

We report on the reliability and scalability implications of our new technologies using Orbit, ACO, and other benchmarks. We use a Chaos Monkey instance [14] that randomly kills processes in the running system to demonstrate the reliability of the benchmarks and to show that SD Erlang preserves the Erlang language-level reliability model. While we report measurements on a range of NUMA and cluster architectures, as specified in Appendix A, the key scalability experiments are conducted on the Athos cluster with 256 hosts and 6144 cores. Having scientifically established the folklore limit of around 60 connected hosts/nodes for distributed Erlang systems in Section IV, a key result is to show that the SD Erlang benchmarks exceed this limit and do not reach their limits on the Athos cluster (Section VIII).

Contributions. This article is the first systematic presentation of the coherent set of technologies for engineering scalable reliable Erlang systems developed in the RELEASE project.

Section IV presents the first scalability study covering Erlang VM, language, and storage scalability. Indeed we believe it is the first comprehensive study of any distributed actor language at this scale (100s of hosts, and around 10K cores). Individual scalability studies, e.g. into Erlang VM scaling [8], or language and storage scaling, have appeared before [38, 37].

At the language level the design, implementation and validation of the new libraries (Section V) have been reported piecemeal [21, 60], and are included here for completeness.

While some of the improvements made to the Erlang Virtual Machine (Section i) have been thoroughly reported in conference publications [66, 46, 47, 69, 70, 71], others are reported here for the first time (Sections iii, ii).

In Section VII, the WombatOAM and SDMon tools are described for the first time, as is the revised Devo system and visualisation. The other tools for profiling, debugging and refactoring developed in the project have previously been published piecemeal [52, 53, 55, 10], but this is their first unified presentation.

All of the performance results in Section VIII are entirely new, although a comprehensive study of SD Erlang performance is now available in a recent article [22].

II. Context

i. Scalable Reliable Programming Models

There is a plethora of shared memory concurrent programming models like PThreads or Java threads, and some models, like OpenMP [20], are simple and high level. However synchronisation costs mean that these models generally do not scale well, often struggling to exploit even 100 cores. Moreover, reliability mechanisms are greatly hampered by the shared state: for example, a lock becomes permanently unavailable if the thread holding it fails.
The High Performance Computing (HPC) community build large-scale (10^6 core) distributed memory systems using the de facto standard MPI communication libraries [72]. Increasingly these are hybrid applications that combine MPI with OpenMP. Unfortunately, MPI is not suitable for producing general purpose concurrent software as it is too low level, with explicit message passing. Moreover, the most widely used MPI implementations offer no fault recovery:¹ if any part of the computation fails, the entire computation fails. Currently the issue is addressed by using what is hoped to be highly reliable computational and networking hardware, but there is intense research interest in introducing reliability into HPC applications [33].

Server farms use commodity computational and networking hardware, and often scale to around 10^5 cores, where host failures are routine. They typically perform rather constrained computations, e.g. Big Data Analytics, using reliable frameworks like Google MapReduce [26] or Hadoop [78]. The idempotent nature of the analytical queries makes it relatively easy for the frameworks to provide implicit reliability: queries are monitored and failed queries are simply re-run. In contrast, actor languages like Erlang are used to engineer reliable general purpose computation, often recovering failed stateful computations.

ii. Actor Languages

The actor model of concurrency consists of independent processes communicating by means of messages sent asynchronously between processes. A process can send a message to any other process for which it has the address (in Erlang the "process identifier" or pid), and the remote process may reside on a different host. While the notion of actors originated in AI [42], it has been used widely as a general metaphor for concurrency, as well as being incorporated into a number of niche programming languages in the 1970s and 80s. More recently it has come back to prominence through the rise of not only multicore chips but also larger-scale distributed programming in data centres and the cloud.

With built-in concurrency and data isolation, actors are a natural paradigm for engineering reliable scalable general-purpose systems [1, 41]. The model has two main concepts: actors, which are the unit of computation, and messages, which are the unit of communication. Each actor has an address-book that contains the addresses of all the other actors it is aware of. These addresses can be either locations in memory, direct physical attachments, or network addresses. In a pure actor language, messages are the only way for actors to communicate.

After receiving a message an actor can do the following: (i) send messages to another actor from its address-book, (ii) create new actors, or (iii) designate a behaviour to handle the next message it receives. The model does not impose any restrictions on the order in which these actions must be taken. Similarly, two messages sent concurrently can be received in any order. These features enable actor-based systems to support indeterminacy and quasi-commutativity, while providing locality, modularity, reliability and scalability [41].

Actors are just one message-based paradigm, and other languages and libraries have related message passing paradigms. Recent example languages include Go [28] and Rust [62], which provide explicit channels, similar to actor mailboxes. Probably the most famous message passing library is MPI [72], with APIs for many languages, widely used on clusters and High Performance Computers. It is, however, arguable that the most important contribution of the actor model is the one-way asynchronous communication [41]. Messages are not coupled with the sender, and neither are they transferred synchronously to a temporary container where transmission takes place, e.g. a buffer, a queue, or a mailbox. Once a message is sent, the receiver is the only entity responsible for that message.

¹ Some fault tolerance is provided in less widely used MPI implementations like [27].
Erlang [6, 19] is the pre-eminent programming language based on the actor model, having a history of use in production systems, initially with its developer Ericsson and then more widely through open source adoption. There are now actor frameworks for many other languages; these include Akka for C#, F# and Scala [64], CAF² for C++, Pykka³, Cloud Haskell [31], PARLEY [51] for Python, and Termite Scheme [35], and each of these is currently under active use and development. Moreover, the recently defined Rust language [62] has a version of the actor model built in, albeit in an imperative context.

iii. Erlang's Support for Concurrency

In Erlang, actors are termed processes, and virtual machines are termed nodes. The key elements of the actor model are: fast process creation and destruction; lightweight processes, e.g. enabling 10^6 concurrent processes on a single host with 8GB RAM; fast asynchronous message passing with copying semantics; process monitoring; strong dynamic typing; and selective message reception.

By default Erlang processes are addressed by their process identifier (pid), e.g.

Pong_PID = spawn(fun some_module:pong/0)

spawns a process to execute the function given as argument to the spawn primitive, and binds Pong_PID to the new process identifier. Here the new process will execute the pong/0 function which is defined in some_module. A subsequent call

Pong_PID ! finish

sends the message finish to the process identified by Pong_PID. Alternatively, processes can be given names using a call of the form:

register(my_funky_name, Pong_PID)

which registers this process name in the node's process name table if not already present. Subsequently, these names can be used to refer to or communicate with the corresponding processes (e.g. send them a message):

my_funky_process ! hello.

A distributed Erlang system executes on multiple nodes, and the nodes can be freely deployed across hosts, e.g. they can be located on the same or different hosts. To help make distribution transparent to the programmer, when any two nodes connect they do so transitively, sharing their sets of connections. Without considerable care this quickly leads to a fully connected graph of nodes. A process may be spawned on an explicitly identified node, e.g.

Remote_Pong_PID = spawn(some_node, fun some_module:pong/0).

After this, the remote process can be addressed just as if it were local. It is a significant burden on the programmer to identify the remote nodes in large systems, and we will return to this in Sections ii, i.2.

iv. Scalability and Reliability in Erlang Systems

Erlang was designed to solve a particular set of problems, namely those in building telecommunications infrastructure, where systems need to be scalable to accommodate hundreds of thousands of calls concurrently, in soft real-time. These systems need to be highly available and reliable: i.e. to be robust in the case of failure, which can come from software or hardware faults. Given the inevitability of the latter, Erlang adopts the "let it fail" philosophy for error handling. That is, it encourages programmers to embrace the fact that a process may fail at any point, and to rely on the supervision mechanism, discussed shortly, to handle the failures.

² http://actor-framework.org
³ http://pykka.readthedocs.org/en/latest/
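For concreteness, a minimal sketch of the pong/0 function assumed by the snippets above; the request protocol ({ping, From} messages, answered with pong) is our assumption, only the finish message appears in the text:

-module(some_module).
-export([pong/0]).

%% Minimal server loop: replies to ping requests, and terminates
%% on receiving the finish message used in the examples above.
pong() ->
    receive
        {ping, From} ->
            From ! pong,
            pong();
        finish ->
            ok
    end.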
Figure 1 illustrates Erlang's support for concurrency, multicores and distribution. Each Erlang node is represented by a yellow shape, and each rectangle represents a host with an IP address. Each red arc represents a connection between Erlang nodes. Each node can run on multiple cores, and exploit the inherent concurrency provided. This is done automatically by the VM, with no user intervention needed. Typically each core has an associated scheduler that schedules processes; a new process will be spawned on the same core as the process that spawns it, but work can be moved to a different scheduler through a work-stealing allocation algorithm. Each scheduler allows a process that is ready to compute at most a fixed number of computation steps before switching to another. Erlang built-in functions or BIFs are implemented in C, and at the start of the project were run to completion once scheduled, causing performance and responsiveness problems if the BIF had a long execution time.

Figure 1: Conceptual view of Erlang's concurrency, multicore support and distribution.

Scaling in Erlang is provided in two different ways. It is possible to scale within a single node by means of the multicore virtual machine exploiting the concurrency provided by the multiple cores or NUMA nodes. It is also possible to scale across multiple hosts using multiple distributed Erlang nodes.

Reliability in Erlang is multi-faceted. As in all actor languages each process has private state, preventing a failed or failing process from corrupting the state of other processes. Messages enable stateful interaction, and contain a deep copy of the value to be shared, with no references (e.g. pointers) to the sender's internal state. Moreover Erlang avoids type errors by enforcing strong typing, albeit dynamically [7]. Connected nodes check liveness with heartbeats, and can be monitored from outside Erlang, e.g. by an operating system process. However, the most important way to achieve reliability is supervision, which allows a process to monitor the status of a child process and react to any failure, for example by spawning a substitute process to replace a failed process. Supervised processes can in turn supervise other processes, leading to a supervision tree. The supervising and supervised processes may be in different nodes, and on different hosts, and hence the supervision tree may span multiple hosts or nodes.

To provide reliable distributed service registration, a global namespace is maintained on every node, which maps process names to pids. It is this that we mean when we talk about a 'reliable' system: it is one in which a named process in a distributed system can be restarted without requiring the client processes also to be restarted (because the name can still be used for communication).

To see global registration in action, consider a pong server process:

global:register_name(pong_server, Remote_Pong_PID).

Clients of the server can send messages to the registered name, e.g.

global:whereis_name(pong_server) ! finish.

If the server fails the supervisor can spawn a replacement server process with a new pid and register it with the same name (pong_server). Thereafter client messages to the pong_server will be delivered to the new server process. We return to discuss the scalability limitations of maintaining a global namespace in Section V.
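To make the restart-and-re-register pattern concrete, here is a minimal hand-rolled supervisor sketch; the module name is our assumption, and production code would normally use an OTP supervisor behaviour instead:

-module(pong_sup).
-export([start/0]).

%% Trap exits so the failure of the linked server arrives as a
%% message rather than killing the supervisor itself.
start() ->
    spawn(fun() ->
              process_flag(trap_exit, true),
              loop(start_server())
          end).

%% Spawn the server and (re-)register the global name: clients keep
%% sending to pong_server regardless of the current pid.
start_server() ->
    Pid = spawn_link(fun some_module:pong/0),
    global:re_register_name(pong_server, Pid),
    Pid.

loop(Pid) ->
    receive
        {'EXIT', Pid, _Reason} ->
            loop(start_server())
    end.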
v. ETS: Erlang Term Storage

Erlang is a pragmatic language and the actor model it supports is not pure. Erlang processes, besides communicating via asynchronous message passing, can also share data in public memory areas called ETS tables.

The Erlang Term Storage (ETS) mechanism is a central component of Erlang's implementation. It is used internally by many libraries and underlies the in-memory databases. ETS tables are key-value stores: they store Erlang tuples where one of the positions in the tuple serves as the lookup key. An ETS table has a type that may be either set, bag or duplicate_bag, implemented as a hash table, or ordered_set, which is currently implemented as an AVL tree. The main operations that ETS supports are table creation, insertion of individual entries and atomic bulk insertion of multiple entries in a table, deletion and lookup of an entry based on some key, and destructive update. The operations are implemented as C built-in functions in the Erlang VM.

The code snippet below creates a set ETS table keyed by the first element of the entry; atomically inserts two elements with keys some_key and 42; updates the value associated with the table entry with key 42; and then looks up this entry.

Table = ets:new(my_table,
                [set, public, {keypos, 1}]),
ets:insert(Table, [
    {some_key, an_atom_value},
    {42, {a,tuple,value}}]),
ets:update_element(Table, 42,
    [{2, {another,tuple,value}}]),
[{Key, Value}] = ets:lookup(Table, 42).

ETS tables are heavily used in many Erlang applications. This is partly due to the convenience of sharing data for some programming tasks, but also partly due to their fast implementation. As a shared resource, however, ETS tables induce contention and become a scalability bottleneck, as we shall see in Section i.

III. Benchmarks for scalability and reliability

The two benchmarks that we use throughout this article are Orbit, which measures scalability without looking at reliability, and Ant Colony Optimisation (ACO), which allows us to measure the impact on scalability of adding global namespaces to ensure reliability. The source code for the benchmarks, together with more detailed documentation, is available at https://github.com/release-project/benchmarks/. The RELEASE project team also worked to improve the reliability and scalability of other Erlang programs, including a substantial (approximately 150K lines of Erlang code) Sim-Diasca simulator [16] and an Instant Messenger that is more typical of Erlang applications [23], but we do not cover these systematically here.

i. Orbit

Orbit is a computation in symbolic algebra, which generalises a transitive closure computation [57]. To compute the orbit for a given space [0..X], a list of generators g1, g2, ..., gn are applied on an initial vertex x0 ∈ [0..X]. This creates new values (x1...xn) ∈ [0..X], where xi = gi(x0). The generator functions are applied on the new values until no new value is generated.

Orbit is a suitable benchmark because it has a number of aspects that characterise a class of real applications. The core data structure it maintains is a set and, in distributed environments, is implemented as a distributed hash table (DHT), similar to the DHTs used in replicated form in NoSQL database management systems. Also, in distributed mode, it uses standard peer-to-peer (P2P) techniques like a credit-based termination algorithm [61]. By choosing the orbit size, the benchmark can be parameterised to specify smaller or larger computations that are suitable to run on a single machine (Section i) or on many nodes (Section i). Moreover it is only a few hundred lines of code.

As shown in Figure 2, the computation is initiated by a master which creates a number of workers. In the single node scenario of the benchmark, workers correspond to processes, but these workers can also spawn other processes to apply the generator functions on a subset of their input values, thus creating intra-worker parallelism. In the distributed version of the benchmark, processes are spawned by the master node to worker nodes, each maintaining a DHT fragment. A newly spawned process gets a share of the parent's credit, and returns this on completion. The computation is finished when the master node collects all credit, i.e. all workers have completed.
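As a hint of the underlying fixed-point computation, here is a sequential sketch with the generators passed as funs; the function names are ours, not the benchmark's:

%% Sequential sketch of an orbit: apply every generator to each
%% newly discovered value until no new values appear.
orbit(Generators, X0) ->
    explore(Generators, [X0], sets:from_list([X0])).

explore(_Generators, [], Seen) ->
    sets:to_list(Seen);
explore(Generators, [X | Queue], Seen) ->
    New = lists:usort([G(X) || G <- Generators]),
    Fresh = [Y || Y <- New, not sets:is_element(Y, Seen)],
    explore(Generators, Fresh ++ Queue,
            sets:union(Seen, sets:from_list(Fresh))).

For example, orbit([fun(X) -> (X * X) rem 1024 end, fun(X) -> (X + 7) rem 1024 end], 1) computes the orbit of 1 under two generators on the space [0..1023].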
Figure 2: Distributed Erlang Orbit (D-Orbit) architecture: workers are mapped to nodes.

ii. Ant Colony Optimisation (ACO)

Ant Colony Optimisation [29] is a meta-heuristic which has been applied to a large number of combinatorial optimisation problems. For the purpose of this article, we have applied it to an NP-hard scheduling problem known as the Single Machine Total Weighted Tardiness Problem (SMTWTP) [63], where a number of jobs of given lengths have to be arranged in a single linear schedule. The goal is to minimise the cost of the schedule, as determined by certain constraints.

The ACO method is attractive from the point of view of distributed computing because it can benefit from having multiple cooperating colonies, each running on a separate compute node and consisting of multiple "ants". Ants are simple computational agents which concurrently compute possible solutions to the input problem guided by shared information about good paths through the search space; there is also a certain amount of stochastic variation which allows the ants to explore new directions. Having multiple colonies increases the number of ants, thus increasing the probability of finding a good solution.

We implement four distributed coordination patterns for the same multi-colony ACO computation as follows. In each implementation, the individual colonies perform some number of local iterations (i.e. generations of ants) and then report their best solutions; the globally-best solution is then selected and is reported to the colonies, which use it to update their pheromone matrices. This process is repeated for some number of global iterations.

Two-level ACO (TL-ACO) has a single master node that collects the colonies' best solutions and distributes the overall best solution back to the colonies. Figure 3 depicts the process and node placements of the TL-ACO in a cluster with NC nodes. The master process spawns NC colony processes on available nodes. In the next step, each colony process spawns NA ant processes on the local node. Each ant iterates IA times, returning its result to the colony master. Each colony iterates IM times, reporting their best solution to, and receiving the globally-best solution from, the master process. We validated the implementation by applying TL-ACO to a number of standard SMTWTP instances [25, 13, 34], obtaining good results in all cases, and confirmed that the number of perfect solutions increases as we increase the number of colonies.

Figure 3: Distributed Erlang Two-level Ant Colony Optimisation (TL-ACO) architecture.

Multi-level ACO (ML-ACO). In TL-ACO the master node receives messages from all of the colonies, and thus could become a bottleneck. ML-ACO addresses this by having a tree of submasters (Figure 4), with each node in the bottom level collecting results from a small number of colonies. These are then fed up through the tree, with nodes at higher levels selecting the best solutions from their children.

Globally Reliable ACO (GR-ACO). In ML-ACO, if a single colony fails to report back the system will wait indefinitely. GR-ACO adds fault tolerance, supervising colonies so that a faulty colony can be detected and restarted, allowing the system to continue execution.
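A sketch of the TL-ACO master's global iteration loop described above; the message shapes and the {Cost, Schedule} solution representation are our assumptions:

%% Each colony sends {colony_best, ColonyPid, Solution}; the master
%% selects the overall best and broadcasts it back so colonies can
%% update their pheromone matrices. Solutions are assumed to be
%% {Cost, Schedule} pairs, so lists:min/1 picks the lowest cost.
master(Colonies, GlobalIters) ->
    master_loop(Colonies, GlobalIters, none).

master_loop(_Colonies, 0, Best) ->
    Best;
master_loop(Colonies, N, _PrevBest) ->
    Solutions = [receive {colony_best, C, S} -> S end || C <- Colonies],
    Best = lists:min(Solutions),
    [C ! {global_best, Best} || C <- Colonies],
    master_loop(Colonies, N - 1, Best).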
Figure 4: Distributed Erlang Multi-level Ant Colony Optimisation (ML-ACO) architecture.

Scalable Reliable ACO (SR-ACO) also adds fault-tolerance, but using supervision within our new s_groups from Section i.1; the architecture of SR-ACO is discussed in detail there.

IV. Erlang Scalability Limits

This section investigates the scalability of Erlang at VM, language, and persistent storage levels. An aspect we choose not to explore is the security of large scale systems where, for example, one might imagine providing enhanced security for systems with multiple clusters or cloud instances connected by a Wide Area Network. We assume that existing security mechanisms are used, e.g. a Virtual Private Network.

i. Scaling Erlang on a Single Host

To investigate Erlang scalability we built BenchErl, an extensible open source benchmark suite with a web interface.⁴ BenchErl shows how an application's performance changes when resources, like cores or schedulers, are added, or when options that control these resources change:

• the number of nodes, i.e. the number of Erlang VMs used, typically on multiple hosts;
• the number of cores per node;
• the number of schedulers, i.e. the OS threads that execute Erlang processes in parallel, and their binding to the topology of the cores of the underlying computer node;
• the Erlang/OTP release and flavor; and
• the command-line arguments used to start the Erlang nodes.

⁴ Information about BenchErl is available at http://release.softlab.ntua.gr/bencherl/.
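For instance, the scheduler count is controlled by the Erlang VM's +S flag, so a node for a single-host experiment with eight schedulers might be started along these lines (the host and cookie names are illustrative):

erl -name bench@host1 -setcookie bench_cookie +S 8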
Using BenchErl, we investigated the scalability of an initial set of twelve benchmarks and two substantial Erlang applications using a single Erlang node (VM) on machines with up to 64 cores, including the Bulldozer machine specified in Appendix A. This set of experiments, reported by [8], confirmed that some programs scaled well in the most recent Erlang/OTP release of the time (R15B01) but also revealed VM and language level scalability bottlenecks.

Figure 5 shows runtime and speedup curves for the Orbit benchmark where master and workers run on a single Erlang node in configurations with and without intra-worker parallelism. In both configurations the program scales. Runtime continuously decreases as we add more schedulers to exploit more cores. The speedup of the benchmark without intra-worker parallelism, i.e. without spawning additional processes for the computation (green curve), is almost linear up to 32 cores but increases less rapidly from that point on; we see a similar but more clearly visible pattern for the other configuration (red curve), where there is no performance improvement beyond 32 schedulers. This is due to the asymmetric characteristics of the machine's micro-architecture, which consists of modules that couple two conventional x86 out-of-order cores that share the early pipeline stages, the floating point unit, and the L2 cache with the rest of the module [3].

Figure 5: Runtime and speedup of two configurations of the Orbit benchmark using Erlang/OTP R15B01.

Some other benchmarks, however, did not scale well or experienced significant slowdowns when run on many VM schedulers (threads). For example the ets_test benchmark has multiple processes accessing a shared ETS table. Figure 6 shows runtime and speedup curves for ets_test on a 16-core (eight cores with hyperthreading) Intel Xeon-based machine. It shows that runtime increases beyond two schedulers, and that the program exhibits a slowdown instead of a speedup.

For many benchmarks there are obvious reasons for poor scaling, like limited parallelism in the application, or contention for shared resources. The reasons for poor scaling are less obvious for other benchmarks, and it is exactly these we have chosen to study in detail in subsequent work [8, 46].

A simple example is the parallel BenchErl benchmark, which spawns some n processes, each of which creates a list of m timestamps and, after it checks that each timestamp in the list is strictly greater than the previous one, sends the result to its parent. Figure 7 shows that up to eight cores each additional core leads to a slowdown, thereafter a small speedup is obtained up to 32 cores, and then again a slowdown. A small aspect of the benchmark, easily overlooked, explains the poor scalability. The benchmark creates timestamps using the erlang:now/0 built-in function, whose implementation acquires a global lock in order to return a unique timestamp: two calls to erlang:now/0, even from different processes, are guaranteed to produce monotonically increasing values. This lock is precisely the bottleneck in the VM that limits the scalability of this benchmark. We describe our work to address VM timing issues in Section iii.
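The core of such a benchmark might look as follows (our reconstruction, not BenchErl's actual code); every erlang:now/0 call serialises the processes on the VM's global timestamp lock:

run(N, M) ->
    Parent = self(),
    Pids = [spawn(fun() ->
                      Stamps = [erlang:now() || _ <- lists:seq(1, M)],
                      true = ascending(Stamps),
                      Parent ! {done, self()}
                  end)
            || _ <- lists:seq(1, N)],
    [receive {done, Pid} -> ok end || Pid <- Pids],
    ok.

%% Each timestamp must be strictly greater than its predecessor.
ascending([A, B | Rest]) -> A < B andalso ascending([B | Rest]);
ascending(_) -> true.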
Figure 6: Runtime and speedup of the BenchErl ets_test benchmark using Erlang/OTP R15B01.

Figure 7: Runtime and speedup of the BenchErl benchmark called parallel using Erlang/OTP R15B01.

Discussion. Our investigations identified contention for shared ETS tables, and for commonly-used shared resources like timers, as the key VM-level scalability issues. Section VI outlines how we addressed these issues in recent Erlang/OTP releases.

ii. Distributed Erlang Scalability

Network Connectivity Costs. When any normal distributed Erlang nodes communicate, they share their connection sets and this typically leads to a fully connected graph of nodes. So a system with n nodes will maintain O(n²) connections, and these are relatively expensive TCP connections with continual maintenance traffic. This design aids transparent distribution as there is no need to discover nodes, and it works well for small numbers of nodes. However at emergent server architecture scales, i.e. hundreds of nodes, this design becomes very expensive and system architects must switch from the default Erlang model, e.g. they need to start using hidden nodes that do not share connection sets.

We have investigated the scalability limits imposed by network connectivity costs using several Orbit calculations on two large clusters: Kalkyl and Athos, as specified in Appendix A. The Kalkyl results are discussed by [21], and Figure 29 in Section i shows representative results for distributed Erlang computing orbits with 2M and 5M elements on Athos. In all cases performance degrades beyond some scale (40 nodes for the 5M orbit, and 140 nodes for the 2M orbit). Figure 32 illustrates the additional network traffic induced by the fully connected network: it allows a comparison between the number of packets sent in a fully connected network (ML-ACO) with those sent in a network partitioned using our new s_groups (SR-ACO).
Global Information Costs. Maintaining global information is known to limit the scalability of distributed systems, and crucially the process namespaces used for reliability are global. To investigate the scalability limits imposed on distributed Erlang by such global information we have designed and implemented DE-Bench, an open source, parameterisable and scalable peer-to-peer benchmarking framework [37, 36]. DE-Bench measures the throughput and latency of distributed Erlang commands on a cluster of Erlang nodes, and its design is influenced by the Basho Bench benchmarking tool for Riak [12]. Each DE-Bench instance acts as a peer, providing scalability and reliability by eliminating central coordination and any single point of failure.

To evaluate the scalability of distributed Erlang, we measure how adding more hosts increases the throughput, i.e. the total number of successfully executed distributed Erlang commands per experiment. Figure 8 shows the parameterisable internal workflow of DE-Bench. There are three classes of commands in DE-Bench: (i) Point-to-Point (P2P) commands, where a function with tunable argument size and computation time is run on a remote node; these include spawn, rpc, and synchronous calls to server processes, i.e. gen_server or gen_fsm. (ii) Global commands, which entail synchronisation across all connected nodes, such as global:register_name and global:unregister_name. (iii) Local commands, which are executed independently by a single node, e.g. whereis_name, a look-up in the local name table.

The benchmarking is conducted on 10 to 100 host configurations of the Kalkyl cluster (in steps of 10) and measures the throughput of successful commands per second over 5 minutes. There is one Erlang VM on each host and one DE-Bench instance on each VM. The full paper [37] investigates the impact of data size and computation time in P2P calls, both independently and in combination, and the scaling properties of the common Erlang/OTP generic server processes gen_server and gen_fsm. Here we focus on the impact of different proportions of global commands, mixing global with P2P and local commands. Figure 9 shows that even a low proportion of global commands limits the scalability of distributed Erlang, e.g. just 0.01% global commands limits scalability to around 60 nodes. Figure 10 reports the latency of all commands and shows that, while the latencies for P2P and local commands are stable at scale, the latency of the global commands increases dramatically with scale. Both results illustrate that the impact of global operations on throughput and latency in a distributed Erlang system is severe.

Explicit Placement. While network connectivity and global information impact performance at scale, our investigations also identified explicit process placement as a programming issue at scale. Recall from Section iii that distributed Erlang requires the programmer to identify an explicit Erlang node (VM) when spawning a process. Identifying an appropriate node becomes a significant burden for large and dynamic systems. The problem is exacerbated in large distributed systems where (1) the hosts may not be identical, having different hardware capabilities or different software installed; and (2) communication times may be non-uniform: it may be fast to send a message between VMs on the same host, and slow if the VMs are on different hosts in a large distributed system. These factors make it difficult to deploy applications, especially in a scalable and portable manner. Moreover, while the programmer may be able to use platform-specific knowledge to decide where to spawn processes to enable an application to run efficiently, if the application is then deployed on a different platform, or if the platform changes as hosts fail or are added, this knowledge becomes outdated.
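As an illustration of why global commands are so much more expensive than P2P and local ones, a registration round in the style of DE-Bench's global commands (the name generation is our own) blocks until every connected node has updated its replica of the name table:

%% Register, look up, and unregister a unique global name.
%% register_name/2 and unregister_name/1 synchronise with all
%% connected nodes; whereis_name/1 is a cheap local look-up.
global_round(I) ->
    Name = list_to_atom("proc_" ++ integer_to_list(I)),
    yes = global:register_name(Name, self()),
    Pid = global:whereis_name(Name),
    true = (Pid =:= self()),
    _ = global:unregister_name(Name),
    ok.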
Figure 8: DE-Bench’s internal workflow.
Figure 9: Scalability vs. percentage of global commands in Distributed Erlang.
Discussion. Our investigations confirm three language-level scalability limitations of Erlang from developer folklore. (1) Maintaining a fully connected network of Erlang nodes limits scalability, for example Orbit is typically limited to just 40 nodes. (2) Global operations, and crucially the global operations required for reliability, i.e. to maintain a global namespace, seriously limit the scalability of distributed Erlang systems. (3) Explicit process placement makes it hard to build performance portable applications for large architectures. These issues cause designers of reliable large scale systems in Erlang to depart from the standard Erlang model, e.g. using techniques like hidden nodes and storing pids in data structures. In Section V we develop language technologies to address these issues.

iii. Persistent Storage

Any large scale system needs reliable scalable persistent storage, and we have studied the scalability limits of Erlang persistent storage alternatives [38]. We envisage a typical large server having around 10^5 cores on around 100 hosts. We have reviewed the requirements for scalable and available persistent storage and evaluated four popular Erlang DBMS against these requirements. For a target scale of around 100 hosts, Mnesia and CouchDB are, unsurprisingly, not suitable. However, Dynamo-style NoSQL DBMS like Cassandra and Riak have the potential to be.
Figure 10: Latency of commands as the number of Erlang nodes increases.

We have investigated the current scalability limits of the Riak NoSQL DBMS using the Basho Bench benchmarking framework on a cluster with up to 100 nodes and independent disks. We found that the scalability limit of Riak version 1.1.1 is 60 nodes on the Kalkyl cluster. The study placed into the public scientific domain what was previously well-evidenced, but anecdotal, developer experience.

We have also shown that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting the global and gen_server OTP libraries we identified a specific Riak remote procedure call that fails to scale. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks.

Discussion. We conclude that Dynamo-like NoSQL DBMSs have the potential to deliver reliable persistent storage for Erlang at our target scale of approximately 100 hosts. Specifically, an Erlang Cassandra interface is available and Riak 1.1.1 already provides scalable and available persistent storage on 60 nodes. Moreover the scalability of Riak is much improved in subsequent versions.

V. Improving Language Scalability

This section outlines the Scalable Distributed (SD) Erlang libraries [21] we have designed and implemented to address the distributed Erlang scalability issues identified in Section ii. SD Erlang introduces two concepts to improve scalability. S_groups partition the set of nodes in an Erlang system to reduce network connectivity and partition global data (Section i.1). Semi-explicit placement alleviates the issues of explicit process placement in large heterogeneous networks (Section i.2). The two features are independent and can be used separately or in combination. We overview SD Erlang in Section i, and outline s_group semantics and validation in Sections ii and iii respectively.

i. SD Erlang Design

i.1 S_groups

S_groups reduce both the number of connections a node maintains and the size of namespaces, i.e. they minimise global information. Specifically, names are registered on, and synchronised between, only the nodes within the s_group. An s_group has the following parameters: a name, a list of nodes, and a list of registered names. A node can belong to many s_groups or to none. If a node belongs to no s_group it behaves as a usual distributed Erlang node.
14 P. Trinder, N. Chechina, N. Papaspyrou, K. Sagonas, S. Thompson, et al. • Scaling Reliably • 2017 a list of nodes, and a list of registered names. A structured as a tree, ring, or some other topol- node can belong to many s_groups or to none. ogy. How large should the s_groups be? Smaller If a node belongs to no s_group it behaves as a s_groups mean more inter-group communica- usual distributed Erlang node. tion, but the synchronisation of the s_group The s_group library defines the functions state between the s_group nodes constrains the shown in Table 1. Some of these functions maximum size of s_groups. We have not found manipulate s_groups and provide information this constraint to be a serious restriction. For about them, such as creating s_groups and pro- example many s_groups are either relatively viding a list of nodes from a given s_group. small, e.g. 10-node, internal or terminal ele- The remaining functions manipulate names ments in some topology, e.g. the leaves and registered in s_groups and provide informa- nodes of a tree. How do nodes from different tion about these names. For example, to regis- s_groups communicate? While any two nodes ter a process, Pid, with name Name in s_group can communicate in an SD Erlang system, to SGroupName we use the following function. The minimise the number of connections communi- name will only be registered if the process is cation between nodes from different s_groups being executed on a node that belongs to the is typically routed via gateway nodes that be- given s_group, and neither Name nor Pid are long to both s_groups. How do we avoid single already registered in that group. points of failure? For reliability, and to minimise communication load, multiple gateway nodes s_group:register_name(SGroupName, Name, and processes may be required. Pid) -> yes | no Information to make these design choices is provided by the tools in Section VII and by To illustrate the impact of s_groups on scal- benchmarking. A further challenge is how to ability we repeat the global operations exper- systematically refactor a distributed Erlang ap- iment from Section ii (Figure 9). In the SD plication into SD Erlang, and this is outlined in Erlang experiment we partition the set of nodes Section i. A detailed discussion of distributed into s_groups each containing ten nodes, and system design and refactoring in SD Erlang hence the names are replicated and synchro- provided in a recent article [22]. nised on just ten nodes, and not on all nodes We illustrate typical SD Erlang system de- as in distributed Erlang. The results in Fig- signs by showing refactorings of the Orbit and ure 11 show that with 0.01% of global opera- ACO benchmarks from Section III. In both dis- tions throughput of distributed Erlang stops tributed and SD Erlang the computation starts growing at 40 nodes while throughput of SD on the Master node and the actual computation Erlang continues to grow linearly. is done on the Worker nodes. In the distributed The connection topology of s_groups is ex- Erlang version all nodes are interconnected, tremely flexible: they may be organised into and messages are transferred directly from the a hierarchy of arbitrary depth or branching, sending node to the receiving node (Figure 2). e.g. there could be multiple levels in the tree In contrast, in the SD Erlang version nodes of s_groups; see Figure 12. 
Moreover it is not are grouped into s_groups, and messages are necessary to create an hierarchy of s_groups, transferred between different s_groups via Sub- for example, we have constructed an Orbit im- master nodes (Figure 13). plementation using a ring of s_groups. A fragment of code that creates an s_group Given such a flexible way of organising dis- on a Sub-master node is as follows: tributed systems, key questions in the design of an SD Erlang system are the following. How create_s_group(Master, should s_groups be structured? Depending on GroupName, Nodes0) -> the reason the nodes are grouped – reducing case s_group:new_s_group(GroupName, the number of connections, or reducing the Nodes0) of namespace, or both – s_groups can be freely {ok, GroupName, Nodes} ->
Table 1: Summary of s_group functions.
Function                                   Description
new_s_group(SGroupName, Nodes)             Creates a new s_group consisting of some nodes.
delete_s_group(SGroupName)                 Deletes an s_group.
add_nodes(SGroupName, Nodes)               Adds a list of nodes to an s_group.
remove_nodes(SGroupName, Nodes)            Removes a list of nodes from an s_group.
s_groups()                                 Returns a list of all s_groups known to the node.
own_s_groups()                             Returns a list of s_group tuples the node belongs to.
own_nodes()                                Returns a list of nodes from all s_groups the node belongs to.
own_nodes(SGroupName)                      Returns a list of nodes from the given s_group.
info()                                     Returns s_group state information.
register_name(SGroupName, Name, Pid)       Registers a name in the given s_group.
re_register_name(SGroupName, Name, Pid)    Re-registers a name (changes a registration) in a given s_group.
unregister_name(SGroupName, Name)          Unregisters a name in the given s_group.
registered_names({node, Node})             Returns a list of all registered names on the given node.
registered_names({s_group, SGroupName})    Returns a list of all registered names in the given s_group.
whereis_name(SGroupName, Name)             Returns the pid of a name registered in the given s_group.
whereis_name(Node, SGroupName, Name)
send(SGroupName, Name, Msg)                Sends a message to a name registered in the given s_group.
send(Node, SGroupName, Name, Msg)
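For example, a worker node in the same group might locate and message a registered submaster like this (the group and process names are illustrative, not fixed by the library):

%% Assuming the submaster has already called
%% s_group:register_name(group1, submaster, self()).
Pid = s_group:whereis_name(group1, submaster),
s_group:send(group1, submaster, {hello, self()}).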
Similarly, we introduce s_groups in the GR-ACO benchmark from Section ii to create Scalable Reliable ACO (SR-ACO); see Figure 12. Here, apart from reducing the number of connections, s_groups also reduce the global namespace information. That is, instead of registering the name of a pid globally, i.e. with all nodes, the name is registered only on all nodes in the s_group with s_group:register_name/3.

A comparative performance evaluation of distributed Erlang and SD Erlang Orbit and ACO is presented in Section VIII.

i.2 Semi-Explicit Placement

Recall from Section iii that distributed Erlang spawns a process onto an explicitly named Erlang node, e.g.

spawn(some_node, fun some_module:pong/0).

and also recall the portability and programming effort issues associated with such explicit placement in large scale systems discussed in Section ii.
Figure 11: Global operations in Distributed Erlang vs. SD Erlang.
Figure 12: SD Erlang ACO (SR-ACO) architecture.
To address these issues we have developed a semi-explicit placement library that enables the programmer to select nodes on which to spawn processes based on run-time information about the properties of the nodes. For example, if a process performs a lot of computation one would like to spawn it on a node with considerable computation power, or if two processes are likely to communicate frequently then it would be desirable to spawn them on the same node, or on nodes with a fast interconnect.

We have implemented two Erlang libraries to support semi-explicit placement [60]. The first deals with node attributes, and describes properties of individual Erlang VMs and associated hosts, such as total and currently available RAM, installed software, hardware configuration, etc. The second deals with a notion of communication distances, which models the communication times between nodes in a distributed system. Therefore, instead of specifying a node we can use the attr:choose_node/1 function to define the target node, i.e.

spawn(attr:choose_node(Params), fun some_module:pong/0).

[60] report an investigation into the communication latencies on a range of NUMA and cluster architectures, and demonstrate the effectiveness of the placement libraries using the ML-ACO benchmark on the Athos cluster.
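A request might then read roughly as follows; the attribute names and parameter format here are purely illustrative, not the attr library's actual vocabulary:

%% Hypothetical constraints: pick a node reporting at least 4 GB of
%% available RAM that is communication-wise 'near' this node, then
%% spawn the worker there.
Params = [{ram_available, {at_least, 4096}}, {distance, near}],
Remote_Pong_PID = spawn(attr:choose_node(Params),
                        fun some_module:pong/0).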
Figure 13: SD Erlang (SD-Orbit) architecture.
ii. S_group Semantics

For precise specification, and as a basis for validation, we provide a small-step operational semantics of the s_group operations [21]. Figure 14 defines the state of an SD Erlang system and the associated abstract syntax variables. The abstract syntax variables on the left are defined as members of sets, denoted {}, and these in turn may contain tuples, denoted (), or further sets. In particular nm is a process name, p a pid, ni a node_id, and nis a set of node_ids.

The state of a system is modelled as a four-tuple comprising a set of s_groups, a set of free_groups, a set of free_hidden_groups, and a set of nodes. Each type of group is associated with nodes and has a namespace. An s_group additionally has a name, whereas a free_hidden_group consists of only one node, i.e. a hidden node simultaneously acts as a node and as a group, because as a group it has a namespace but does not share it with any other node. Free normal and hidden groups have no names, and are uniquely defined by the nodes associated with them. Therefore, group names, gr_names, are either NoGroup or a set of s_group_names. A namespace is a set of (name, pid) pairs and is replicated on all nodes of the associated group.

A node has the following four parameters: a node_id identifier; a node_type that can be either hidden or normal; connections; and group_names, i.e. the names of the groups the node belongs to. A node can belong to either a list of s_groups or one of the free groups. The type of the free group is defined by the node type. Connections are a set of node_ids.

Transitions in the semantics have the form (state, command, ni) −→ (state', value), meaning that executing command on node ni in state returns value and transitions to state'.

The semantics is presented in more detail by [21], but we illustrate it here with the s_group:registered_names/1 function from Section i.1. The function returns a list of names registered in s_group s if node ni belongs to the s_group, and an empty list otherwise (Figure 15). Here ⊕ denotes disjoint set union; IsSGroupNode returns true if node ni is a member of some s_group s, false otherwise; and OutputNms returns a set of process names registered in the ns namespace of s_group s.

iii. Semantics Validation

As the semantics is concrete it can readily be made executable in Erlang, with lists replacing sets throughout. Having an executable seman-
(grs, fgs, fhs, nds) ∈ {state} ≡ ({s_group}, {free_group}, {free_hidden_group}, {node})
gr ∈ grs ≡ {s_group} ≡ (s_group_name, {node_id}, namespace)
fg ∈ fgs ≡ {free_group} ≡ ({node_id}, namespace)
fh ∈ fhs ≡ {free_hidden_group} ≡ (node_id, namespace)
nd ∈ nds ≡ {node} ≡ (node_id, node_type, connections, gr_names)
gs ∈ {gr_names} ≡ NoGroup, {s_group_name}
ns ∈ {namespace} ≡ {(name, pid)}
cs ∈ {connections} ≡ {node_id}
nt ∈ {node_type} ≡ {Normal, Hidden}
s ∈ {NoGroup, s_group_name}
Figure 14: SD Erlang state [21].
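As a hint of what the executable semantics looks like, here is a sketch of the registered_names rule over the state of Figure 14, with sets represented as lists; the function and variable names are ours:

%% State is {Grs, Fgs, Fhs, Nds}; an s_group is a
%% {SGroupName, NodeIds, Namespace} tuple, as in Figure 14.
%% Returns the names registered in s_group S if node Ni is a
%% member of S, and the empty list otherwise.
registered_names(S, {Grs, _Fgs, _Fhs, _Nds}, Ni) ->
    case lists:keyfind(S, 1, Grs) of
        {S, NodeIds, Ns} ->
            case lists:member(Ni, NodeIds) of
                true  -> [Nm || {Nm, _Pid} <- Ns];  %% OutputNms
                false -> []
            end;
        false -> []
    end.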