Work-Stealing, Locality-Aware Actor Scheduling

Saman Barghi and Martin Karsten
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
Email: [email protected]

Abstract—The actor programming model is gaining popularity due to the prevalence of multi-core systems along with the rising need for highly scalable and distributed applications. Frameworks such as Akka, Orleans, Pony, and the C++ Actor Framework (CAF) have been developed to address these application requirements. Each framework provides a runtime system to schedule and run millions of actors, potentially on multi-socket platforms with non-uniform memory access (NUMA). However, the literature provides only limited research that studies or improves the performance of actor-based applications on NUMA systems. This paper studies the performance penalties that are imposed on actors running on a NUMA system and characterizes applications based on the actor type, behavior, and communication pattern. This information is used to identify workloads that can benefit from improved locality on a NUMA system. In addition, two locality- and NUMA-aware work-stealing schedulers are proposed and their respective execution overhead in CAF is studied on both AMD and Intel machines. The performance of the proposed work-stealing schedulers is evaluated against the default scheduler in CAF.

Keywords-Actors, Scheduling, NUMA, Locality

I. INTRODUCTION

Modern computers utilize multi-core processors to increase performance, because the breakdown of Dennard scaling [1] makes a substantial increase of clock frequencies unlikely and because the depth and complexity of instruction pipelines have also reached a breaking point. Other hardware scaling limitations have led to the emergence of non-uniform memory access (NUMA) multi-chip platforms [2] as a trade-off between low-latency and symmetric memory access. Multi-core and multi-chip computing platforms provide a shared memory interface across a non-uniform memory hierarchy by way of hardware-based cache coherence [3].

The actor model of computing [4], [5], [6] is a model for writing concurrent applications for parallel and distributed systems. The actor model provides a high-level abstraction of concurrent tasks where information is exchanged by message passing, in contrast to task parallelism where tasks communicate by sharing memory. A fundamental building block of any software framework implementing the actor model is the proper distribution and scheduling of actors on multiple underlying processors (often represented by kernel/system threads) and the efficient use of system resources to achieve the best possible runtime performance. The C++ Actor Framework (CAF) [7] provides a runtime for the actor model and has a low memory footprint and improved CPU utilization to support a broad range of applications.

This paper studies the challenges for a popular type of actor scheduler in CAF, work-stealing, on NUMA platforms. Actor models have specific characteristics that need to be taken into account when designing scheduling policies. The contributions are: a) the structured presentation of these characteristics, b) an improved hierarchical scheduling policy for actor runtime systems, and c) the experimental evaluation of the new scheduling proposals and a study of their performance in comparison to a randomized work-stealing scheduler.

The rest of the paper is organized as follows. The next section provides background information related to scheduling, the actor programming model, and the C++ Actor Framework (CAF). Section 3 describes existing research work related to the problem studied in this paper, while Section 4 presents a discussion of workload and application characteristics, followed by the actual scheduling proposal. An experimental evaluation is provided in Section 5 and the paper is concluded with brief remarks in Section 6.
II. BACKGROUND

A. The Actor Programming Model

In the actor model, the term actor describes autonomous objects that communicate asynchronously through messages. Each actor has a unique address that is used to send messages to that actor, and each actor has a mailbox that is used to queue received messages before processing. Actors do not share state and only communicate by sending messages to each other. Sending a message is a non-blocking operation and an actor processes each message in a single atomic step. Actors may perform three types of action as a result of receiving a message: (1) send messages to other actors, (2) create new actors, (3) update their local state [6], [8]. Actors can change their behavior as a result of updating their local state. In principle, message processing in a system of actors is non-deterministic, because reliable, in-order delivery of messages is not guaranteed. This non-deterministic nature of actors makes it hard to predict their behavior based on static compile-time analysis or dynamic analysis at runtime.

Despite their differences [9], all actor systems provide a runtime that multiplexes actors onto multiple system threads to take advantage of multi-core hardware and provide concurrency and parallelism. For example, Erlang [10] uses a virtual machine along with a work-stealing scheduler to distribute actors and to balance the workload. Akka [11] provides various policies to map actors to threads and by default uses a fork/join policy. Pony [12] provides only a work-stealing scheduler, while CAF provides both work-sharing and work-stealing policies. The type of the actor runtime can influence how tasks should be scheduled with regard to locality. For example, in a managed language such as Erlang, actors have their own private heap and stack, whereas in CAF actor objects and variables are directly allocated from the default global heap.

Scheduling is affected by implementation details of the actor framework and the type of workload. Most importantly, depending on the type of an actor and how memory is allocated and accessed by that actor, scheduling might or might not benefit from locality improvements related to CPU caches or NUMA. Therefore, it is important to identify all factors at play when it comes to locality-aware scheduling. These factors must be studied both individually and in combination to determine scenarios that benefit from locality-aware scheduling in a message-passing software framework.
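To make the basic structure concrete, the following is a minimal, framework-agnostic C++ sketch of an actor that owns a mailbox and processes one message at a time. The message type, the run loop in main(), and the simple mutex-based mailbox are assumptions made purely for illustration; this is not CAF's API or implementation.

#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <string>

// Illustrative sketch only: one actor with a mailbox whose messages are
// processed one at a time. A real framework adds scheduling, typed messages,
// actor addresses, and spawning on top of this basic structure.
struct Message { std::string payload; };

class Actor {
  std::mutex lock_;              // protects the mailbox
  std::deque<Message> mailbox_;
  int local_state_ = 0;          // actors may keep private state

public:
  // Sending is non-blocking: the message is simply enqueued.
  void send(Message m) {
    std::lock_guard<std::mutex> g(lock_);
    mailbox_.push_back(std::move(m));
  }

  // One atomic processing step: dequeue a single message and react to it by
  // updating local state (and, in a full system, by sending messages to
  // other actors or spawning new ones).
  bool step() {
    Message m;
    {
      std::lock_guard<std::mutex> g(lock_);
      if (mailbox_.empty()) return false;   // actor remains in a waiting state
      m = std::move(mailbox_.front());
      mailbox_.pop_front();
    }
    local_state_ += static_cast<int>(m.payload.size());
    std::cout << "processed: " << m.payload << "\n";
    return true;
  }
};

int main() {
  Actor a;
  a.send({"hello"});
  while (a.step()) {}   // a scheduler would normally drive this loop
}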
B. Actor Model vs. Task Parallelism

Actor-based systems can benefit from work-stealing due to the dynamic nature of tasks and their asynchronous execution, which is similar to task parallelism. However, most variations of task parallelism, e.g., dataflow programming, are used to deconstruct strict computational problems to exploit hardware parallelism. As such, interaction patterns between tasks are usually deterministic, because dependencies between tasks are known at runtime. Tasks are primarily concerned with the availability of input data and therefore do not have any need to track their state. In contrast, the actor model provides non-deterministic, asynchronous, message-passing concurrency. Computation in the actor model cannot be considered as a directed acyclic graph (DAG) [7] (e.g., assume DAG computation through fork/join [13]) and actors usually have to maintain state. Due to these differences, the internals of an actor runtime, such as scheduling, differ from runtime systems aimed at task parallelism.

Most importantly, applications written using the actor model, such as a chat server, are sensitive to latency and fairness for individual tasks. Therefore, a scheduler designed for an actor system must be both efficient and fair, otherwise applications show a long tail in their latency distribution. On the contrary, for task parallelism, as long as the entire problem space is explored in an acceptable time, fairness and latency of individual tasks do not matter.

Furthermore, in task parallelism many lightweight tasks are created that run from start to end without yielding execution. In contrast, actors in an actor system can wait for events and other messages, or cooperatively yield execution to guarantee fairness. Hence, the execution pattern of actors is different from tasks in a task-parallel workload. This affects both scheduling objectives and locality.

Finally, since actors fully encapsulate their internal state and only communicate through passing messages that are placed into the respective mailboxes of other actors, the resulting memory access pattern is not necessarily the same as the access pattern seen in task-parallel frameworks. For instance, in OpenStream [14] each consumer task has multiple data-buffers for each producer task. OpenMP 4.0 [15] allows task dependencies through shared memory; however, this is based on a deterministic sporadic DAG model which only allows dependencies to be defined among sibling tasks. Therefore, although locality-aware scheduling is a well-studied topic for task parallelism, due to those differences, it cannot automatically be assumed that findings for task parallelism directly translate into similar findings for actor models, or vice versa.

C. Work-Stealing Scheduling

Multiprocessor scheduling is a well-known NP-hard problem. In practice, runtime schedulers apply broad strategies to satisfy objectives such as resource utilization, load balancing, and fairness. Work-stealing has emerged as a popular strategy for task placement and also load balancing [16]. Work-stealing primarily addresses resource utilization by stipulating that a processor without work "steals" tasks from another processor.

Work-stealing has been investigated for general multi-threaded programs with arbitrary dependencies [17], generalizing from its classical formulation limited to the fork-join pattern. The main overhead of work-stealing occurs during the stealing phase, when an idle processor polls other deques to find work, which might cause interprocessor communication and contention that negatively impact performance. The particulars of victim selection vary among work-stealing schedulers. In Randomized Work-Stealing (RWS), when a worker runs out of work it chooses its victims randomly. The first randomized work-stealing algorithm for fully-strict computing is given in [17]. The algorithm has an expected execution time of T1/P + O(T∞) on P processors, and also has much lower communication cost than work-sharing.

For message-driven applications, such as those built with actor-based programming, these bounds can only be regarded as an approximation. The reason is that deep recursion does not occur in event-based actor systems, since computation is driven by asynchronous message passing and cannot be considered as a DAG.

It has been shown that work-stealing fundamentally is efficient for message-driven applications [18]. However, random victim selection is not scalable [19], because it does not take into account locality, architectural diversity, and the memory hierarchy [18], [20], [21]. In addition, RWS does not consider data distribution and the cost of inter-node task migration on NUMA platforms [21], [22], [23].
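The randomized victim selection just described can be summarized by a short sketch. The worker-deque type and the retry bound are illustrative placeholders under assumed names, not the interface of CAF or any other particular framework.

#include <cstddef>
#include <optional>
#include <random>
#include <vector>

// Illustrative sketch of randomized work-stealing (RWS) victim selection:
// an idle worker repeatedly picks another worker uniformly at random and
// tries to steal a task from the tail of that worker's deque.
struct Task { /* payload omitted */ };

struct WorkerDeque {
  // Placeholder: a real implementation pops from the tail under synchronization.
  std::optional<Task> steal() { return std::nullopt; }
};

std::optional<Task> steal_randomly(std::vector<WorkerDeque>& deques,
                                   std::size_t self, std::mt19937& rng) {
  if (deques.size() < 2) return std::nullopt;
  std::uniform_int_distribution<std::size_t> pick(0, deques.size() - 1);
  for (int attempt = 0; attempt < 64; ++attempt) {  // bounded retry loop
    std::size_t victim = pick(rng);
    if (victim == self) continue;                   // never steal from oneself
    if (auto t = deques[victim].steal())            // fails if the deque is empty
      return t;
  }
  return std::nullopt;  // caller backs off and tries again later
}

int main() {
  std::vector<WorkerDeque> deques(4);
  std::mt19937 rng(42);
  return steal_randomly(deques, 0, rng).has_value() ? 0 : 1;
}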
D. C++ Actor Framework

The C++ Actor Framework (CAF) [7] provides a runtime that multiplexes N actors onto M threads on the local system. The number of threads (M) is configurable and by default is equal to the number of cores available on the system, while the number of actors (N) changes during runtime. Actors in CAF transition between four states: ready, waiting, running, and done. An actor changes its state from waiting to ready in reaction to a message being placed in its mailbox. Actors in CAF are modeled as lightweight state machines that are implemented in user space and cannot be preempted.

CAF's work-stealing scheduler uses a double-ended task queue (deque) per worker. Worker threads treat this deque as LIFO and other threads treat the queue as FIFO. New tasks that are created by an actor are pushed to the head of the local queue of the worker thread where the actor is running. Tasks that are created by spawning actors outside of other actors, e.g., from the main() function, are placed into the task queues in a round-robin manner for load balancing. Actors create new tasks by either spawning a new actor or sending a message to an existing actor with an empty mailbox. If the receiver actor's mailbox is not empty, sending a message to its mailbox does not result in the creation of a new task, since a task already processes the existing messages.

The RWS scheduler in the CAF runtime uses a uniform random number generator to randomly select victims when a worker thread runs out of work. Although CAF provides a support layer for seamless heterogeneous hardware to bridge architectural design gaps between platforms, it does not yet provide a locality-aware work-stealing scheduler.
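The deque discipline described above can be illustrated with a simplified, mutex-based C++ sketch in which the owner pushes and pops at the head (LIFO) while thieves steal from the tail (FIFO). This is a didactic model of the behavior, not CAF's actual data structure.

#include <deque>
#include <mutex>
#include <optional>

// Simplified sketch of a per-worker deque: the owning worker operates on the
// head (LIFO, favoring cache-hot tasks), while other workers steal from the
// tail (FIFO, taking the oldest and typically coldest task).
template <typename Task>
class WorkStealingDeque {
  std::mutex lock_;
  std::deque<Task> tasks_;

public:
  void push(Task t) {                       // owner: push a new task at the head
    std::lock_guard<std::mutex> g(lock_);
    tasks_.push_front(std::move(t));
  }

  std::optional<Task> pop() {               // owner: LIFO pop from the head
    std::lock_guard<std::mutex> g(lock_);
    if (tasks_.empty()) return std::nullopt;
    Task t = std::move(tasks_.front());
    tasks_.pop_front();
    return t;
  }

  std::optional<Task> steal() {             // thief: FIFO steal from the tail
    std::lock_guard<std::mutex> g(lock_);
    if (tasks_.empty()) return std::nullopt;
    Task t = std::move(tasks_.back());
    tasks_.pop_back();
    return t;
  }
};

int main() {
  WorkStealingDeque<int> dq;
  dq.push(1); dq.push(2);          // owner pushes 1 then 2 (2 is now at the head)
  auto newest = dq.pop();          // owner gets 2 (LIFO)
  auto oldest = dq.steal();        // a thief would get 1 (FIFO)
  return (newest && oldest) ? 0 : 1;
}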
E. NUMA Effects

Contemporary large-scale shared-memory computer systems are built using non-uniform memory access (NUMA) hardware where each processor chip has direct access to a part of the overall system memory, while access to the remaining remote memory (which is local to other processor chips) is routed through an interconnect and thus slower. NUMA provides the means to reach a higher core count than single-chip systems, albeit at the (non-uniform) cost of occasionally accessing remote memory. Clearly, with remote memory, the efficacy of the CPU caching hierarchy becomes even more important. However, CPU caches are shared between cores, with the particular nature of sharing depending on the particulars of the hardware architecture. Therefore, new scheduling algorithms are proposed that are aware of the memory hierarchy and shared resources to exploit cache and memory locality and minimize overhead [16], [20], [21], [22], [23], [24], [25].

Accordingly, efficient scheduling of actors on NUMA machines requires careful analysis of applications built using actor programming. Application analysis must be combined with a proper understanding of the underlying memory hierarchy to limit the communication overhead and scheduling costs, and achieve the best possible runtime performance.

III. RELATED WORK

There has been very limited research addressing locality-aware or specifically NUMA-aware scheduling for actor runtime systems. Francesquini et al. [22] provide a NUMA-aware runtime environment based on the Erlang virtual machine. They identify actor lifespan and communication cost as information that the actor runtime can use to improve performance. Actors with a longer lifespan that create and communicate with many short-lived actors are called hub actors. The proposed runtime system lowers the communication cost among actors and their hub actor by placing the short-lived actors on the same NUMA node as the hub actor, called the home node. When a worker thread runs out of work, it first tries to steal from workers on the same NUMA node. If unsuccessful, the runtime system tries to migrate previously migrated actors back to that home node. The private heap of each actor is allocated on the home node, so executing on the home node improves locality. As a last resort, the runtime steals actors from other NUMA nodes and moves them to the worker's NUMA node.

Although the evaluation results look promising, the caveat, as stated by the authors, is in assuming that hub actors are responsible for the creation of the majority of actors. This is a strong assumption that only applies to some applications. Also, when multiple hub actors are responsible for creating actors, the communication pattern among non-hub actors can still be complicated. Another assumption is that all NUMA nodes have the same distance from each other, and the scheduler does not take the CPU cache hierarchy into account. The approach presented takes advantage of knowledge that is available within the Erlang virtual machine, but not necessarily available in an unmanaged language runtime, such as CAF.

In contrast, the work presented here is not based on any assumptions about the communication pattern among actors or information available through a virtual machine runtime. Instead, it is solely focused on improving performance by improving the scheduler. Also, the full extent and variability of the memory hierarchy is taken into account.

A simple affinity-type modification to task scheduling is reported in [26]. Tasks in this system block on a channel waiting for messages to arrive, and thus show similar behavior to actors waiting on empty mailboxes. In contrast to basic task scheduling, an existing task that is unblocked is never placed on the local queue. Instead, it is always placed at the end of the queue of the worker thread previously executing the task. A task is only migrated to another worker thread by stealing. This modification leads to significant performance improvements for certain workloads and thus contradicts assumptions about treating the queues in a LIFO manner. However, the system has a well-defined communication pattern and long task lifespans, in contrast to an actor system that is non-deterministic with a mixture of short- and long-lived actors. We have implemented both a LIFO and an affinity policy using our hierarchical scheduler and present results in Section V.
Various locality-aware work-stealing schedulers have been proposed for other parallel programming models and shown to improve performance. Suksompong et al. [25] investigate localized work-stealing and provide running time bounds when workers try to steal their own work back. Acar et al. [27] study the data locality of work-stealing scheduling on shared-memory machines and provide lower and upper bounds for the number of cache misses, and also provide a locality-guided work-stealing scheduling scheme. Chen et al. [28] present a cache-aware two-tier scheduler that uses an automatic partitioning method to divide an execution DAG into an inter-chip tier and an intra-chip tier.

Olivier et al. [29] provide a hierarchical scheduling strategy where threads on the same chip share a FIFO task queue. In this proposal, load balancing is performed by work sharing within a chip, while work-stealing only happens between chips. In follow-up work, the proposal is improved by using a shared LIFO queue to exploit cache locality between sibling tasks as well as between a parent and a newly created task [23]. Moreover, the work-stealing strategy is changed, so that only a single thread can steal work on behalf of other threads on the same chip to limit the number of costly remote steals. Pilla et al. [30] propose a hierarchical load balancing approach to improve the performance of applications on parallel multi-core systems and show that Charm++ can benefit from such a NUMA-aware load balancing strategy.

Min et al. [21] propose a hierarchical work-stealing scheduler that uses the Hierarchical Victim Selection (HVS) policy to determine from which thread a thief steals work, and the Hierarchical Chunk Selection (HCS) policy that determines how much work a thief steals from the victim. The HVS policy relies on the scheduler having information about the memory hierarchy: cache, socket, and node (this work also considers many-core clusters). Threads first try to steal from the nearest neighbors and only upon failure move up the locality hierarchy. The number of times that each thread tries to steal from different levels of the hierarchy is configurable. The victim selection strategy presented here in Section IV-B is similar to HVS, but takes NUMA distances into account.

Drebes et al. [31], [32] combine topology-aware work-stealing with work pushing and dependence-aware memory allocation to improve NUMA locality and performance for data-flow task parallelism. Work pushing transfers a task to a worker whose node contains the task's input data according to some dependence heuristics. Each worker has a Multiple-Producer-Single-Consumer (MPSC) FIFO queue in addition to a work-stealing deque. The MPSC queue is only processed when the deque is empty. However, this approach is not applicable to latency-sensitive actor applications for two reasons: first, the actor model is non-deterministic and data dependences are difficult to infer at runtime. Second, adding a lower-priority MPSC queue adds complexity and can cause some actors to be inactive for a long time, which violates fairness and thus causes long tail latencies for the application. Moreover, the proposed deferred memory allocation relies on knowing the task dependencies in advance, which is not possible with the actor model. Therefore, this optimization cannot be applied to the actor model. The topology-aware work-stealing introduced by this work is similar to ours, but it is evaluated in combination with deferred memory allocation and work-pushing. Thus, it is not possible to discern the isolated contribution of topology-aware work-stealing.

Majo et al. [2] point out that optimizing for data locality can counteract the benefits of cache contention avoidance and vice versa. In Section V we present results that demonstrate this effect for actor workloads, where aggressive optimization for locality increases last-level cache contention.
IV. LOCALITY-AWARE SCHEDULER FOR ACTORS

This section first discusses the key characteristics of an actor-based application and workload that need to be considered when designing a locality-aware, work-stealing scheduler. These characteristics also provide valuable hints for designing evaluation benchmarks. Based on these findings, a novel hierarchical, locality-aware work-stealing scheduler is presented.

A. Characteristics of Actor Applications

Key operations can be slowed down when an actor migrates to another core on a NUMA system, depending on the NUMA distance. This performance degradation can come from messages that arrive from another core, or from accessing the actor's state that is allocated on a different NUMA node. Depending on the type of actor and the communication pattern, the amount of degradation differs. Therefore, improving locality does not benefit all workloads. We identify the following factors in applications and workloads for actor-based programming that can affect the performance of a work-stealing scheduler on a hierarchical NUMA machine:

1) Memory allocated for the actor and its access pattern: Actors sometimes only perform computations on data passed to them through messages. For simplicity, we denote actors that only depend on message data as stateless actors, and actors that do manage local state as stateful actors.

Stateful actors allocate local memory upon creation and access it or perform other computations depending on the type of a message and their state when they receive a message. A stateful actor with sizable state and intensive memory access to that state is better executed closer to the NUMA node where it was created. Also, for better cache locality, especially if the actor receives messages from its spawner frequently, it is better to keep such an actor on the same core, or on a core that shares a higher-level CPU cache with the spawner core. The reason is that those messages are hot in the cache of the spawner actor.

On the other hand, stateless actors do not allocate any local memory and can be spawned multiple times for better scalability. For such actors, the required memory to process messages is allocated when they start processing a message and deallocated when the processing is done. Therefore, the only substantial memory that is accessed is the memory allocated for the received message by the sender of that message. Such actors are better executed closer to the message sender.

2) Message size and access pattern: The size of messages has a direct impact on the performance and locality of actors on NUMA machines. Messages are allocated on the NUMA node of the sender, but accessed on the core that is executing the receiver actor. If the size of messages is typically larger than the size of the local state of an actor, and the receiving actor accesses the message intensively, such actors are better activated on the same node as the sender of the message.

3) Communication pattern: Since the actor model is non-deterministic, it is difficult to generally analyze the communication pattern between actors. Two actors that are sending messages to each other can go through different states and thus have various memory access patterns. In addition, the type and size of each message can vary depending on the state of the actor. We do not make any assumptions about the communication pattern of actors, unlike others [22].

Aside from illustrating the trade-offs involved in actor scheduling, these observations are also useful to determine which benchmarks realistically demonstrate the benefits of locality-aware work-stealing schedulers and which represent worst-case tests.
B. Locality-Aware Scheduler (LAS)

Our locality-aware scheduler consists of three stages: memory hierarchy detection, work placement, and work stealing. When an application starts running, the scheduler determines the memory hierarchy of the machine. Also, a new actor is placed on the local or a remote NUMA node depending on the type of the actor. Finally, when a worker thread runs out of work, it uses a locality-aware policy to steal work from another worker thread.

1) Memory Hierarchy Detection: Our work-stealing algorithm needs to be aware of the memory hierarchy of the underlying system. In addition to the cache and NUMA hierarchy, differing distances between NUMA nodes are an important factor in deciding where to steal tasks from. Access latencies can vary significantly based on the topological distance between the accessing node and the storage node.

The scheduler builds a representation of the locality hierarchy using the hwloc library [33], which uses hardware information to determine the memory hierarchy, which the scheduler represents as a tree. The root is a place-holder representing the whole system, while intermediate levels represent NUMA nodes and groups, taking into account NUMA distances. Subsequent nodes represent shared caches and the leaves represent the cores on the system. This approach is independent from any particular hardware architecture.
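The following sketch shows how such a topology tree could be obtained with hwloc. It only queries object counts and common ancestors and is meant as an illustration of the detection stage under simplifying assumptions, not as the scheduler's actual code.

#include <hwloc.h>
#include <cstdio>

// Illustrative sketch: query the machine topology with hwloc and report the
// number of NUMA nodes, packages, and cores. A scheduler can walk the same
// object tree to group worker threads by shared cache and NUMA node.
int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  int numa_nodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
  int packages   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE);
  int cores      = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
  std::printf("NUMA nodes: %d, packages: %d, cores: %d\n",
              numa_nodes, packages, cores);

  // Walk each core and print the depth of its deepest common ancestor with
  // core 0, a crude proxy for "closeness" in the locality tree.
  hwloc_obj_t first = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
  for (int i = 1; i < cores; ++i) {
    hwloc_obj_t c = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
    hwloc_obj_t common = hwloc_get_common_ancestor_obj(topo, first, c);
    std::printf("core 0 <-> core %d: common ancestor depth %d\n",
                i, (int)common->depth);
  }
  hwloc_topology_destroy(topo);
}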
2) Actor Placement: For fully-strict computations, data dependencies of a task only go to its parent. Thus the natural placement for new tasks is the core of the parent task. However, actors can communicate arbitrarily and thus, local placement of newly created actors does not guarantee the best performance. For example, actors receiving remote messages pollute the CPU cache for actors that execute later and process messages from their parents. Also, as stated earlier, depending on the size of the message in comparison to state variables, placing the actor in the sender's NUMA node can help or hurt performance. Determining the best strategy at runtime can add significant overhead, so there is no apparent optimal approach.

The exception are hub actors [22], i.e., long-living actors that spawn many children and communicate with them frequently. Such actors place high demand on the memory allocator and can interfere with each other if placed on the same NUMA node. Furthermore, if a locality-aware affinity policy tries to keep actors on their home node, placing multiple hub actors on the same NUMA node further increases contention over shared resources and thus reduces the performance. Hence, our scheduler uses the same algorithm for the initial placement of hub actors [22] to spread them across different NUMA nodes. The programmer needs to annotate hub actors. The system then tags corresponding structures at compile time and the runtime scheduler uses this information to place such actors far from each other.

3) Locality-aware Work-stealing: A locality-aware victim selection policy attempts to keep tasks closer to the core that created them or was running them previously to take advantage of better cache locality. Depending on the hardware architecture, cores might share a higher-level CPU cache. Therefore, in our scheduler the thief worker thread first steals from worker threads executing on nearby cores in the same NUMA node with shared caches to improve locality. If there is no work available in the local NUMA node, the hierarchical victim selection policy tries to steal jobs from worker threads of other NUMA nodes with increasing NUMA distance. The goal of NUMA-aware work stealing is to avoid migrating actors between NUMA nodes to the extent possible, and thus to remove the need for remote memory accesses.

Algorithm 1 Hierarchical Victim Selection
1: T: memory hierarchy tree
2: C: set of cores under v
3: p: number of threads polling under the local NUMA node
4: r: number of steal attempts for v
5: procedure CHOOSEVICTIM(v)
6:   if r = size(C) and v ≠ root(T) then
7:     v ← parent(v)
8:   if v is in the local NUMA node then
9:     S ← {s | all non-empty local deques}
10:    if S = ∅ then
11:      v ← parent(v)
12:    else if size(S) = 1 and p > size(C)/2 then
13:      return ∅
14:    else return a random element of S
15:  return a random element of C

Limiting the worker threads to initially choose their victims within their own NUMA node can lead to more frequent contention over deques on the local NUMA node in comparison to using the random victim selection strategy. For example, a single queue that still has work while all other worker threads run out of work is the worst-case scenario for a work-stealing scheduler. However, this case appears frequently in actor applications, where a hub actor creates multiple actors and other worker threads steal from the local deque of the thread that runs the hub actor. Our investigation shows that when stealing fine-grained tasks with workloads that are 20µs or shorter, the performance penalty ratio increases exponentially as the number of thief threads increases. For more coarse-grained tasks, the performance penalty is not significant, since the probability of contention decreases. To alleviate this problem, our scheduler keeps track of the number of threads per NUMA node that are polling the local node. This number is used along with the approximate size of the deques in the node to reduce the number of threads that are simultaneously polling a deque (Algorithm 1). If there is only a single non-empty deque and more than half of the threads under that node are polling that deque, the thief thread backs off and tries again later.

In addition, polling the queues of many other worker threads with empty queues can result in wasting CPU cycles when the number of potential victims is limited. In CAF, a worker thread constantly polls its own deque and after a certain number of attempts, polls a victim deque. To avoid wasting CPU cycles, we have modified the deque and added an approximate size of the deque using a counter. A thief uses this approximate size when it attempts to steal from other workers executing on the same NUMA node. If there are non-empty queues, it chooses one randomly; otherwise, if all the queues are empty, the thief immediately moves up to the next higher level (Algorithm 1). This approach removes the overhead of polling empty queues on the local NUMA node and thus decreases the number of wasted CPU cycles. Since there is a fixed number of cores on a NUMA node, scanning their queue sizes adds little overhead that remains constant even when the application scales.

When a worker runs out of work, it becomes a thief and uses the memory hierarchy tree provided by the scheduler to perform hierarchical victim selection as described in Algorithm 1. The updated vertex v is passed to the function each time to complete the tree traversal. An empty result or a victim with an empty deque means that the thief has to try again.

We have created two variants of LAS that differ in their placement strategy. When an existing actor is unblocked by a message, the local variant (LAS/L) places the actor on the local deque, while the affinity variant (LAS/A) places the actor at the end of the deque of the worker thread previously executing it. In both cases, newly created actors are pushed to the head of the local deque. LAS/L is similar to typical work-stealing placement where all activated and newly created tasks are pushed to the head of the local deque. LAS/A improves actor-to-thread (and thus to-core) affinity, because actors are moved only by stealing. However, it adds overhead to saturated workers and increases contention when placing actors on remote deques, which is further discussed in Section V-C.
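A compact C++ sketch of the victim selection logic in Algorithm 1 is given below. The tree node layout, the approximate-size counters, and the polling counter are simplified assumptions for illustration only; the actual scheduler integrates this logic with CAF's worker threads.

#include <atomic>
#include <cstddef>
#include <random>
#include <vector>

// Simplified model of the locality tree and per-worker deques used by the
// hierarchical victim selection in Algorithm 1 (illustrative sketch only).
struct Worker {
  std::atomic<int> approx_size{0};   // approximate number of queued tasks
};

struct TreeNode {
  TreeNode* parent = nullptr;
  bool local_numa_node = false;      // true while v lies within the thief's node
  std::vector<Worker*> cores;        // workers (cores) under this vertex
};

struct Thief {
  TreeNode* v = nullptr;             // current vertex in the locality tree
  std::size_t attempts = 0;          // steal attempts at the current vertex
  std::atomic<int>* polling = nullptr; // # threads polling under the local node
  std::mt19937 rng{std::random_device{}()};

  // Returns the chosen victim, or nullptr if the thief should back off/retry.
  Worker* choose_victim() {
    if (attempts == v->cores.size() && v->parent) { v = v->parent; attempts = 0; }
    ++attempts;
    if (v->local_numa_node) {
      std::vector<Worker*> nonempty;
      for (Worker* w : v->cores)
        if (w->approx_size.load(std::memory_order_relaxed) > 0)
          nonempty.push_back(w);
      if (nonempty.empty()) {                       // nothing local: move up
        if (v->parent) { v = v->parent; attempts = 0; }
        return nullptr;
      }
      // Single busy deque with too many thieves already polling it: back off.
      if (nonempty.size() == 1 &&
          polling->load(std::memory_order_relaxed) > (int)v->cores.size() / 2)
        return nullptr;
      std::uniform_int_distribution<std::size_t> pick(0, nonempty.size() - 1);
      return nonempty[pick(rng)];
    }
    if (v->cores.empty()) return nullptr;
    std::uniform_int_distribution<std::size_t> pick(0, v->cores.size() - 1);
    return v->cores[pick(rng)];
  }
};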
V. EXPERIMENTS AND EVALUATION

Experiments are conducted on an Intel and an AMD machine that have different NUMA topologies and memory hierarchies. The Intel machine is a Xeon with 4 sockets, 4 NUMA nodes, and 32 cores. Each NUMA node has 64 GB of memory for a total of 256 GB. Each socket has 8 cores, running at 2.3 GHz, that share a 16 MB L3 cache, and each core has private L1 and L2 caches. Each NUMA node is only directly connected to two other NUMA nodes. Hyper-threading is disabled. The AMD machine is an Opteron with 4 sockets, 8 NUMA nodes, and 64 cores. Each NUMA node contains 64 GB of memory for a total of 512 GB. Each socket has 8 cores running at 2.5 GHz. Each core has a private L1 data cache, and shares the L1 instruction cache and an L2 cache (2 MB) with a neighbour core. All cores in the same socket share one L3 cache (6 MB). The experiments are performed with CAF version 0.12.2, compiled with GCC 5.4.0, on Ubuntu Linux 16.04 with kernel 4.4.0.

The experiments compare CAF's default Randomized Work-Stealing (RWS) scheduler with LAS/L and LAS/A. We first evaluate the performance using benchmarks to study the effect of the scheduling policy on different communication patterns and message sizes. Next, we use a simple chat server to observe the efficiency of the schedulers for an application that has a large number of actors with non-trivial communication patterns, different behaviors, and various message sizes.

A. Benchmarks

The first set of experiments attempts to isolate the effects of each scheduling policy for different actor communication patterns. A subset of benchmarks from the BenchErl [34] and Savina [35] benchmark suites is chosen that represents typical communication patterns used in actor-based applications. Some of these benchmarks are adopted from task-parallelism benchmarks, but modified to fit the actor model.
• Big (BIG): In a many-to-many message passing scenario many actors are spawned and each one sends a ping message to all other actors. An actor responds with a pong message to any ping message it receives.

• Bang (BANG): In a many-to-one scenario, multiple senders flood the one receiver with messages. Senders send messages in a loop without waiting for any response.

• Logistic Map Series (LOGM): A synchronous request-response benchmark pairs control actors with compute actors to calculate logistic map polynomials through a sequence of requests and responses between each pair.

• All-Pairs Shortest Path (APSP): This benchmark is a weighted graph exploration application that uses the Floyd-Warshall algorithm to compute the shortest path among all pairs of nodes. The weight matrix is divided into blocks. Each actor performs calculations on a particular block and communicates with the actors holding adjacent blocks.

• Concurrent Dictionary (CDICT): This benchmark maintains a key-value store by spawning a dictionary actor with a constant-time data structure (hash table). It also spawns multiple sender actors that send write and read requests to the dictionary actor. Each request is served with a constant-time operation on the hash table.

• Concurrent Sorted Linked-List (CSLL): This benchmark is similar to CDICT but the data structure has linear access time (linked list). The time to serve each request depends on the type of the operation and the location of the requested item. Also, actors can inquire about the size of the list, which requires iterating through all items.

• NQueens first N Solutions (NQN): A divide-and-conquer style algorithm searches a solution for the problem: "How can N queens be placed on an N × N chessboard, so that no pair attacks each other?"

• Trapezoid Approximation (TRAPR): This benchmark consists of a master actor that partitions an integral and assigns each part to a worker. After receiving all responses they are added up to approximate the total integral. The message size and computation time are the same for all workers.

• Publish-Subscribe (PUBSUB): Publish/subscribe is an important communication pattern in actor programs that is used extensively in many applications, such as chat servers and message brokers. This benchmark is implemented using CAF's group communication feature and measures the end-to-end latency of individual messages. It represents a one-to-many communication pattern where a publisher actor sends messages to multiple subscribers. Actors can subscribe to more than one publisher.

B. Experiments

The benchmark results are shown in Figure 1. The execution time is the average of 10 runs and normalized to the slowest scheduler. All experiments are configured to keep all cores busy most of the time, i.e., the system operates at peak load.

The RWS scheduler performs relatively better for the BIG benchmark and outperforms both LAS/L and LAS/A. This benchmark represents a symmetric many-to-many communication pattern where all actors are sending messages to each other. This workload benefits from a symmetric distribution of work. Other experiments (not shown here) show that using the NUMA interleave memory allocation policy improves the performance further. For this particular workload, improving locality does not translate to improving the performance.

The BANG benchmark represents workloads using many-to-one communication. Messages have very small sizes and no computation is performed. Since the receiver's mailbox is the bottleneck, improving locality does not significantly affect the overall performance. LAS/L only improves the performance slightly by allocating more messages on the local node.

LOGM and APSP both create multiple actors during startup and each actor frequently communicates with a limited number of other actors. In addition, computations depend on an actor's local state and message content. For both workloads, LAS/L and LAS/A outperform RWS by a great margin. In such workloads, each actor can only be activated by one of the actors it communicates with. If one of the communicating actors is stolen and executes on another core, in RWS and LAS/L it causes the other actors to follow and execute on the new core upon activation. Since all actors maintain local state that is allocated upon creation of the actor, all actors that are part of the communication group experience longer memory access times if one of them migrates to another NUMA node. Keeping actors on the same NUMA node and closer to the core they were running on before can prevent this. LAS/A improves performance further by preventing other actors from migrating along with the stolen actor.
Even though actors are occasionally moved to another NUMA node, the rest of the group stays on their own NUMA node. Thus, LAS/A performs better than LAS/L by preventing group migration of actors. For LOGM, since each pair of actors is isolated from other actors, stealing one actor translates to moving one other actor along with it. However, in APSP each actor communicates with multiple actors, which means stealing one actor can cause a chain reaction and several actors that do not directly communicate with the stolen actor might also migrate. LAS/A therefore has a stronger effect on APSP than on LOGM.

CDICT and CSLL represent workloads where a central actor spawns multiple worker actors and communicates with them frequently. The central actor is responsible for managing a data structure and receives read and write requests from the worker actors. CDICT benefits from the improved locality provided by both locality-aware schedulers. Since the majority of operations between the central actor and the worker actors are allocating and accessing messages, placing worker actors closer to the central actor leads to improved performance due to faster memory accesses. In CDICT all requests are served from a hash table with roughly constant time. In such a setting, LAS/A can cause an imbalance in service times, because some actors are being placed on cores with higher memory access times. Since each request is fairly small, this additional overhead can slightly slow down the application.

[Figure 1. Results of running the benchmarks (BIG, BANG, LOGM, APSP, CDICT, CSLL, NQN, TRAPR, PUBSUB) with the various schedulers on (a) AMD Opteron and (b) Intel Xeon. Execution times are normalized to the slowest scheduler (lower is better).]
However, in CSLL LAS/A outperforms LAS/L and RWS. First, the overhead that LAS/A imposes becomes negligible in comparison with the linear service time of the central actor on the linked list. Because of the linear lookup times, most actors ultimately become inactive, waiting for a response from the central actor, and the corresponding worker threads end up seeking work. With LAS/A, the response unblocks the worker actor on the worker thread that previously executed it, so it can continue execution on the same thread as its previous work. With LAS/L, however, the unblocked actor is placed on the deque of the thread running the central actor, which introduces additional latency until the actor is executed by that worker thread or, alternatively, until it is stolen by an idle worker thread.

NQN is a divide-and-conquer algorithm where a master actor is responsible for dividing the work among a fixed number of worker actors. Each worker actor performs a recursive operation on the task assigned to it. But instead of spawning new actors, it divides the task into smaller subtasks and reports back to the master actor, which assigns new tasks to all worker actors in a round-robin fashion. Therefore, worker actors are constantly producing and consuming messages. The computation for each message depends on the content of the message, and items in each message are accessed during the computation. Improving locality and placing worker actors closer to the master actor has a significant impact on the performance: LAS/L and LAS/A perform 5 times faster than RWS in this case.

In TRAPR, worker actors receive a message from a master actor, perform some calculations, send back a message, and exit. LAS/L and LAS/A improve the performance up to 10 times for this benchmark. Since all actors are created on the local deque of the master actor, and these tasks are very fine-grained, locality-aware scheduling increases the chance of local cores to steal and run these tasks closer to the master actor. Since all communications are with the master actor, the performance is improved significantly.

The PUBSUB benchmark shows a significant end-to-end message latency improvement when the LAS/L policy is used in comparison with RWS. LAS/L keeps the subscribers closer to the publisher that sends them a message and improves the locality. However, LAS/A shows worse performance than the other two policies. Profiling the code reveals that most of the contention is worker threads that are stalled by lock contention: with LAS/A, the worker threads of other cores place most of the newly created tasks on the deque of the publisher core rather than their local cores. Since the publishers are constantly unblocking a large number of actors on other cores, this leads to higher contention when there are a large number of publishers and subscribers.

In general, the results indicate that workloads with many-to-many communication patterns (BIG) do not benefit from the locality-aware schedulers. On the other hand, workloads where actors are communicating with a small cluster of other actors (LOGM and APSP), where actors communicate with a central actor and access message contents (CDICT and CSLL), or where actors communicate with a central actor multiple times and perform computations that depend on the message content (NQN and TRAPR), benefit from locality-aware schedulers. Moreover, LAS/A improves performance and in most cases provides similar or better locality in comparison with LAS/L. However, in one case (PUBSUB) LAS/A causes high contention and a performance decrease.

There are minor differences between the results from the AMD machine and the Intel machine. These differences come from the differences in the NUMA setup of each machine, explained at the beginning of this section. The probability that RWS moves tasks to a NUMA node with higher access times is slightly higher on the AMD machine. Thus, the locality-aware schedulers are slightly more effective on the AMD machine.

We have performed another experiment to study the effect of message size on the performance of the locality-aware schedulers. The CDICT benchmark is modified to make the size of the value configurable for each key-value pair. This affects the size of the request messages and the size of the central actor. Worker actors submit memory write operations 20% of the time. Figure 2 shows the results for this experiment executed on the AMD machine. Experiments on the Intel machine show similar results, but are not shown here due to limited space.

[Figure 2. Results for varying the size of the value (message size in words, from 1 to 8112) in the CDICT benchmark, for RWS, LAS/L, and LAS/A. Execution times are normalized to the slowest scheduler (lower is better).]
The results show that the LAS/L scheduler outperforms both RWS and LAS/A for value sizes smaller than 256 words. LAS/L improves locality the most, since messages and value objects fit into the lower-level caches (L1 and L2), which further improves the performance. RWS distributes tasks among NUMA nodes and therefore adds additional memory access times. LAS/A also imposes higher overhead, because it causes actors to unblock remote actors on other NUMA nodes. However, as the value size gets larger, the dictionary and value objects do not fit into the lower-level caches. Since LAS/L keeps most actors on the same NUMA node as the dictionary actor, this creates contention on the L3 cache, which slows down other actors on the same node as well. LAS/A, on the other hand, distributes the actors to other NUMA nodes, which avoids the contention in L3; the remote access overhead is compensated by the lower contention. RWS also avoids the contention problem, such that the difference between LAS/L and RWS decreases when increasing the percentage of write requests. In fact, we have observed (not shown due to limited space) that RWS can even outperform LAS/L.

C. Chat Server

To evaluate both variants of LAS using a more realistic scenario, we have implemented a chat server that supports one-to-one and group chats, similar to [36]. Each user session is represented by an actor (session) in the application, and chat groups are implemented using the publish/subscribe-based group communication in CAF, with the server holding the state created for the groups. To simplify the implementation and not include network operations, the communication with remote clients is not carried out; the workload for the server is generated and consumed in the same process. However, the chat server implements the pre- and post-processing operations that would normally be carried out in the context of communication with remote clients, such as encryption.

Each user has a friend list, group list, and blocked list, which represent the corresponding lists of users, respectively. Information about each session is stored in an in-memory key-value database actor, and a log of messages is controlled by a storage actor. In addition, each session actor is controlled by a local cache actor. When a session actor receives a message, it uses its user ID to find the receiving actor, decrypts the message, and forwards the message to that actor. The message is also logged both in the local cache and the storage. When the receiver actor receives the message, it first checks whether the sender is blocked. If it is not, it encrypts the message and forwards it out to the client. If a message is sent to a group, the group actor forwards the message to all remote subscribers, which creates a one-to-many communication pattern.

The chat server is configured to run with 1 million actors and 10000 groups.


[Figure 3. Throughput and distribution of end-to-end message latency using different scheduling policies (RWS, LAS/L, LAS/A) on (a) AMD Opteron and (b) Intel Xeon. Latencies are quantified on the left Y-axis using a logarithmic scale (lower is better). Throughput is quantified on the second Y-axis (higher is better).]

Each user has a random number of friends (max. 100), subscribes to a random number of groups (max. 10), and blocks a random number of users (max. 5). Random numbers are uniformly distributed and all session and group actors are spawned before the experiment starts. To simulate receiving messages from users, every second each session actor uses a timer to send a message with a random length of up to 1024 bytes to a randomly chosen user or group from its friend or group list. 5% of the messages are sent to groups and the rest are direct messages. The experiment drives each system to peak load. We measure the overall throughput, along with the end-to-end latency of each message from the moment the sender actor (on the server side) generates the message until the moment the receiver actor would be ready to send it to the remote client.

Figure 3 uses letter-value plots to show the distribution of latencies using different scheduling policies on the AMD and Intel machines. LAS/L has higher throughput and a better latency distribution than the other two policies on both the AMD and Intel machines. LAS/A has significantly lower throughput than both RWS and LAS/L, and its latency distribution shows a higher average latency with a long latency tail that goes up to 1000 seconds on both machines. Profiling indicates that the workload is dominated by deque lock contention, similar to PUBSUB. This contention causes variable and high latency numbers, and fairly low throughput, because threads are relatively often blocked on a lock.

Hence, although LAS/A performs fairly well in simple scenarios and benchmarks, lock contention significantly affects its performance when the application scales and the number of actors increases. Therefore, the proposal in [26] does not apply to large-scale actor-based applications. The LAS/L policy, on the other hand, shows stable performance improvements over RWS and reduces the latency.

VI. CONCLUSION

This paper presents a study of the effectiveness of locality-aware schedulers for actor runtime systems. Various characteristics of the actor model and message-passing frameworks are investigated that can affect execution performance on NUMA machines. In addition, the applicability of existing work-stealing schedulers is discussed. We use these findings to develop two variants of a novel locality-aware work-stealing scheduler (LAS) for the C++ Actor Framework (CAF) that takes into account the distance between cores and NUMA nodes. The performance of LAS/L and LAS/A is compared with CAF's default randomized victim scheduler. Locality-aware work stealing shows comparable or better performance in most cases. However, it is also demonstrated that the effectiveness of locality-aware schedulers depends on the workload. In particular, affinity-unblocking can cause lock contention.

ACKNOWLEDGMENT

This work has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Huawei Technologies.

REFERENCES

[1] M. Bohr, "A 30 Year Retrospective on Dennard's MOSFET Scaling Paper," IEEE Solid-State Circuits Society Newsletter, vol. 12, no. 1, pp. 11-13, Winter 2007.

[2] Z. Majo and T. R. Gross, "Memory System Performance in a NUMA Multicore Multiprocessor," in Proc. 4th Ann. Int'l Conf. Systems and Storage, 2011, pp. 12:1-12:10.
[3] D. Molka, D. Hackenberg, R. Schone, and M. S. Muller, "Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System," in Proc. 18th Int'l Conf. Parallel Architectures and Compilation Techniques, 2009, pp. 261-270.
[4] C. Hewitt, P. Bishop, and R. Steiger, "A Universal Modular ACTOR Formalism for Artificial Intelligence," in Advance Papers of the Conf., vol. 3. Stanford Research Inst., 1973, p. 235.
[5] C. Hewitt and H. G. Baker, "Actors and Continuous Functionals," Massachusetts Inst. of Technology, Tech. Rep., 1978.
[6] G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, 1986.
[7] D. Charousset, R. Hiesgen, and T. C. Schmidt, "Revisiting Actor Programming in C++," Computer Languages, Systems & Structures, vol. 45, pp. 105-131, April 2016.
[8] G. Agha, "Concurrent Object-oriented Programming," Commun. ACM, vol. 33, no. 9, pp. 125-141, Sep. 1990.
[9] J. De Koster, T. Van Cutsem, and W. De Meuter, "43 Years of Actors: A Taxonomy of Actor Models and Their Key Properties," in Proc. 6th Int'l Workshop on Programming Based on Actors, Agents, and Decentralized Control, 2016, pp. 31-40.
[10] J. Armstrong, R. Virding, C. Wikström, and M. Williams, "Concurrent Programming in ERLANG," 1993.
[11] J. Bonér, "Introducing Akka - Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Actors," 2010, http://jonasboner.com/introducing-akka.
[12] S. Clebsch, S. Drossopoulou, S. Blessing, and A. McNeil, "Deny Capabilities for Safe, Fast Actors," in Proc. 5th Int'l Workshop on Programming Based on Actors, Agents, and Decentralized Control, 2015, pp. 1-12.
[13] R. D. Blumofe et al., Cilk: An Efficient Multithreaded Runtime System. ACM, 1995, vol. 30, no. 8.
[14] A. Pop and A. Cohen, "OpenStream," ACM Trans. Architecture and Code Optimization, vol. 9, no. 4, pp. 1-25, 2013.
[15] OpenMP, "OpenMP Application Program Interface Version 4.0," 2013.
[16] J. Yang and Q. He, "Scheduling Parallel Computations by Work Stealing: A Survey," Int'l J. Parallel Programming, pp. 1-25, 2017.
[17] R. D. Blumofe and C. E. Leiserson, "Scheduling Multithreaded Computations by Work Stealing," Journal of the ACM (JACM), vol. 46, no. 5, pp. 720-748, Sep. 1999.
[18] Z. Vrba, H. Espeland, P. Halvorsen, and C. Griwodz, "Limits of Work-Stealing Scheduling," in Job Scheduling Strategies for Parallel Processing, 2009, pp. 280-299.
[19] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha, "Scalable Work Stealing," in Proc. Conf. High Performance Computing Networking, Storage and Analysis, 2009, pp. 53:1-53:11.
[20] Y. Guo, J. Zhao, V. Cave, and V. Sarkar, "SLAW: a Scalable Locality-aware Adaptive Work-stealing Scheduler," in 2010 IEEE Int'l Symp. Parallel Distributed Processing (IPDPS 10), 2010, pp. 1-12.
[21] S. Min, C. Iancu, and K. Yelick, "Hierarchical Work Stealing on Manycore Clusters," in 5th Conf. Partitioned Global Address Space Programming Models, 2011.
[22] E. Francesquini, A. Goldman, and J. F. Méhaut, "A NUMA-Aware Runtime Environment for the Actor Model," in 2013 42nd Int'l Conf. on Parallel Processing, 2013, pp. 250-259.
[23] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins, "OpenMP Task Scheduling Strategies for Multicore NUMA Systems," The Int'l J. High Performance Computing Applications, vol. 26, no. 2, pp. 110-124, 2012.
[24] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto, "Survey of Scheduling Techniques for Addressing Shared Resources in Multicore Processors," ACM Computing Surveys (CSUR), vol. 45, no. 1, pp. 4:1-4:28, Dec. 2012.
[25] W. Suksompong, C. E. Leiserson, and T. B. Schardl, "On the Efficiency of Localized Work Stealing," Information Processing Letters, vol. 116, no. 2, pp. 100-106, 2016.
[26] Z. Vrba, P. Halvorsen, and C. Griwodz, "A Simple Improvement of the Work-stealing Scheduling Algorithm," in Proc. Int'l Conf. Complex, Intelligent and Software Intensive Systems, 2010, pp. 925-930.
[27] U. A. Acar, G. E. Blelloch, and R. D. Blumofe, "The Data Locality of Work Stealing," in Proc. 12th Ann. ACM Symp. Parallel Algorithms and Architectures, 2000, pp. 1-12.
[28] Q. Chen, M. Guo, and Z. Huang, "Adaptive Cache Aware Bitier Work-Stealing in Multisocket Multicore Architectures," IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 12, pp. 2334-2343, 2013.
[29] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, and J. F. Prins, "Scheduling Task Parallelism on Multi-socket Multicore Systems," in Proc. 1st Int'l Workshop on Runtime and Operating Systems for Supercomputers, 2011, pp. 49-56.
[30] L. L. Pilla et al., "A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems," in Proc. 41st Int'l Conf. Parallel Processing, 2012, pp. 118-127.
[31] A. Drebes, K. Heydemann, N. Drach, A. Pop, and A. Cohen, "Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages," ACM Trans. Architecture and Code Optimization, vol. 11, no. 3, pp. 1-25, Aug. 2014.
[32] A. Drebes, A. Pop, K. Heydemann, A. Cohen, and N. Drach, "Scalable Task Parallelism for NUMA," in Proc. Int'l Conf. Parallel Architectures and Compilation (PACT 16), 2016.
[33] F. Broquedis et al., "hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications," in 2010 18th Euromicro Conf. on Parallel, Distributed and Network-based Processing. IEEE, 2010, pp. 180-186.
[34] S. Aronis et al., "A Scalability Benchmark Suite for Erlang/OTP," in Proc. 11th ACM SIGPLAN Workshop on Erlang, 2012, pp. 33-42.
[35] S. M. Imam and V. Sarkar, "Savina - An Actor Benchmark Suite: Enabling Empirical Evaluation of Actor Libraries," in Proc. 4th Int'l Workshop on Programming Based on Actors, Agents & Decentralized Control, 2014, pp. 67-80.
[36] Riot Games, "Chat Service Architecture: Servers," Sep. 2015, https://engineering.riotgames.com/news/chat-service-architecture-servers.