Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime

Galen Shipman†, Patrick McCormick — Los Alamos National Laboratory, Los Alamos, NM, USA
Ramanan Sankaran — Oak Ridge National Laboratory, Oak Ridge, TN, USA
Kevin Pedretti, Stephen Olivier, Kurt B. Ferreira — Sandia National Laboratories, Albuquerque, NM, USA

Sean Treichler — NVIDIA, Santa Clara, CA, USA
Michael Bauer, Alex Aiken — Stanford University, Stanford, CA, USA

ABSTRACT

Application scalability can be significantly impacted by node level performance variability in HPC. While previous studies have demonstrated the impact of one source of variability, OS noise, in message passing runtimes, none have explored the impact on dynamically scheduled runtimes. In this paper we examine the impact that OS noise has on the Legion runtime. Our work shows that 2.5% net performance variability at the node level can result in 25% application slowdown for MPI+OpenACC based runtimes compared to 2% slowdown for Legion. We then identify the mechanisms that contribute to better noise absorption in Legion, quantifying their impact. Furthermore, we assess the impact of OS noise at the granularity of communication, task scheduling, and application level tasks within Legion, demonstrating that where noise is injected can significantly affect scalability. The implications of this study for OS and runtime architecture are then discussed.

1. INTRODUCTION

Modern high performance computing systems and applications are composed of a variety of system and user level services, all of which are subject to varying degrees of performance variability. In particular, system level services such as parallel file systems, I/O forwarding layers, and system monitoring daemons can compete with application threads for node level and system wide resources. Similarly, at the application level, multi-physics packages and coupled analytics can create contention for resources. Emerging power management services such as power capping at the node and CPU/GPU level can introduce similar performance variability in the form of application and system level threads competing for execution under a fixed power budget. All of these sources of variability can significantly harm the performance and scalability of production applications.

Traditionally the management of resource contention has been an operating system (OS) level function, and in HPC the general approach has been to minimize OS level service interrupts, also known as OS noise. In practice, this is done by containing OS level services through core specialization or by function shipping OS services to a remote system, thereby providing a level of isolation of system services from the application. Management of resource contention at the application level is handled by the application developer. Many application developers opt for a static partitioning of resources mirroring domain decomposition due to its simplicity. However, this approach leaves many HPC applications vulnerable to the effects of OS noise, especially at scale.

The growing effects of OS noise have been one of the contributing factors in the development of dynamic runtime systems such as Charm++ [1], the Open Community Runtime [2], Uintah [3], StarSs [4] and Legion [5]. Dynamic runtime systems have the potential to react to the performance variability introduced by resource contention and mitigate its effects. While prior work has examined the impact of OS noise within static runtime environments, no prior work has examined the effects of OS noise on dynamic runtime systems.

Our work explores these effects and demonstrates the ways that a dynamic runtime system such as Legion can mitigate the effects of OS noise in comparison to traditional static systems such as MPI and OpenACC. We investigate these effects within the context of S3D, a production-grade turbulent combustion simulation. This paper makes the following novel contributions:

1. A comparison of the impact of OS noise on both static and dynamic runtime systems using a real-world application
2. An analysis of the mechanisms that help absorb OS noise in a dynamic runtime
3. Identification of dynamic runtime tasks that are more susceptible to OS noise

Section 2 gives a brief overview of the Legion runtime system. Section 3 describes each version of the S3D application used in this study. Our evaluation and results are then presented in Section 4 and we conclude with implications of our work for future systems in Section 5.

2. THE LEGION RUNTIME SYSTEM

Legion [5, 6, 7] is a dynamic task-based runtime system which relies on a dynamic program analysis to both discover parallelism and map applications onto target machines. Legion programs are structured as a hierarchical tree of tasks, with tasks permitted to launch arbitrary numbers of sub-tasks. All data in Legion is stored within logical regions, a relational abstraction that decouples data specification from both its layout and placement in the memory hierarchy. Importantly, all tasks are required to name the set of logical regions that they will access during their execution. The Legion runtime performs a dynamic dependence analysis based on the logical region usage of different tasks to implicitly extract parallelism from Legion programs. From this analysis, the Legion runtime constructs an explicit asynchronous task graph to be executed by the lower-level Realm runtime (see Section 2.1).

To hide the latency of this analysis, Legion relies on a deferred execution model [8], in which the runtime is performing its dynamic analysis well in advance of the actual application execution. This allows Legion to discover task parallelism, execute tasks out-of-order, and hide operation latencies while still preserving sequential program semantics.

The dynamic analysis performed by Legion is important for two reasons. First, automatic extraction of parallelism eases the burden on programmers. Second, it makes it possible to dynamically adapt how a Legion program is mapped onto a target architecture. To map a Legion application onto a particular machine, every task is assigned a processor on which to run, and a physical instance must be created (or reused) in some memory for each logical region requested by a task. These decisions are referred to as the mapping of a Legion program. In Legion, all of these mapping decisions are exposed directly to the application programmer through a mapping interface.

A crucial aspect of the Legion mapping interface is that it allows the construction of mapper objects (also called mappers) to handle dynamic queries from the runtime about how to map tasks and logical regions. By making this interface dynamic, mappers can introspect the state of the application and the underlying hardware when making mapping decisions. For example, mappers can observe which processors are executing tasks faster or slower (possibly depending on OS noise) and schedule tasks onto more lightly loaded processors. A mapper can continually monitor the changing state of the machine, and decide the best course of action as anomalous behavior occurs throughout a run.

2.1 Realm

Realm [8] is a low-level runtime that provides a portable interface for execution of asynchronous task graphs on systems with heterogeneous computational resources and complicated memory hierarchies. Realm uses a deferred execution model in which all operations (tasks, copies, even critical sections in the form of reservations) are included in the task graph with dependencies captured explicitly by events. Realm handles the scheduling of operations (i.e., determining when dependencies have been satisfied and when the assigned execution resources are available), but leaves the mapping decisions (i.e. where tasks and data should be placed) to the Realm client - a combination of the Legion runtime and the application's mapper objects in this case.

To allow the client to make intelligent mapping decisions, Realm provides a model of the underlying machine in the form of a graph of processors (e.g. x86 CPU cores, CUDA-capable GPUs) and memories (e.g. a node's system memory, a GPU's framebuffer, or remote-DMA-accessible memory on another node), with edges indicating memories accessible to a given processor or pairs of memories between which efficient copies can be performed.

The client's control over the system is maximized by having the performance of operations mapped to the machine model be as predictable as possible. Thus, Realm focuses on providing an efficient mechanism for execution and not interfering with the mapping decisions made by the client. It also tries to reduce, or at least isolate, performance variability in the underlying operating system and hardware. The most obvious way in which Realm does this is in the way it sets the CPU affinity masks of threads. Realm uses a large number of threads internally:

• A persistent thread is created for each "CPU" processor exposed in the machine graph. (The number to expose is configurable at runtime startup.) All tasks mapped onto that processor by the client will be executed in this thread. With good mapping decisions, these threads should be constantly busy.
• A persistent thread is created for each "GPU" processor that runs application tasks mapped to that processor. The typical activity of a task mapped to a GPU processor is to perform CUDA kernel launches, and these threads spend most of their time waiting for the CUDA operations to complete.
• Persistent threads are also created for "utility" processors, which are used by the Legion runtime for its dynamic analysis. This analysis work is bursty, so while it is common to expose several utility processors, they are often idle.
• Background DMA threads are created for each NUMA domain and each GPU to perform copies that have been requested by the Realm client. These threads are often very busy, but entirely memory-bound.
• Finally, several background progress threads are created to deal with sending and receiving inter-node messages via the GASNet API [9].

Realm uses sched_setaffinity on Linux-based systems to control which CPU cores may be used by each thread. For each "CPU" processor thread, an available CPU core is chosen and the thread's affinity mask is set to include only that core. The core is also removed from every other thread's affinity mask, eliminating processor interference due to any other Realm threads. On machines that support Hyper-Threading [10], the second core is removed from the affinity mask of all threads. Once all "CPU" processor threads have been assigned their cores, all other Realm threads share the remaining cores, including any enabled by Hyper-Threading, allowing resource sharing for the bursty workloads these other threads perform. The affinity mask of the original application thread is also adjusted, in the hopes that any other background threads that are created (e.g. for asynchronous file I/O) are similarly kept from interfering with the main computation threads that are being used by the application.
of in pattern communication trated perfor- The not periodic usually critical. for are mance required but monitoring, are and communications synchronization message Global point-to-point asynchronous passing. using called ex- topology, also zone box topology, gridpoints a ghost communication 5 The cartesian and dimensional exchanged. 4 the between be are occurs the change to oper- that filtering respectively, difference zones and wide, derivatives finite all ghost the utilize The require computing to solution for node node. used network multiple that ators each on systems, on available multi-core placed cores On are tasks MPI balancing. achieve load to tasks MPI perfect was paralleliza- multiple mesh the (SPMD) among cartesian partitioned structured equally data three-dimensional multiple The tion. program single based only MPI - S3D 3.1 model. 48 chemical of n-heptane/air representative size wall- grid a as each a using presented in for node solution, timestep, used to per is time time problem clock the benchmark measures same and case, The paper. other this flows. non-reacting to for comparison used in are that cost solvers computational flow significant the a of are properties fraction these compute computa- that The kernels thermophysical, tional properties. the transport for molecular and models chemical detailed using flow S3D chemically reacting multi-species in simulating used for formulation tailored specifically The is estimators. error built-in time scheme with integration in Runge-Kutta advanced using multi-stage explicit and mesh an methods with Cartesian difference finite structured order are conventional higher equations a The on energy. equations and solved conservation momentum unsteady mass, species coupled for fully the geometries. solves canonical It in combustion turbulent of simulation study. 
of our implementations in different used several S3D the cover then and S3D processes 256 at stencil) 1: Figure ic h omncto sdmntdb point-to-point by dominated is communication the Since MPI- pure a used S3D of implementation original The in examined are S3D of implementations different Three numerical direct the for solver flow reacting a is [11] S3D

Figure 1: Spyplot of S3D communication pattern at 256 processes (9 point stencil; x, y, and z neighbors of the three-dimensional topology). Axes: sending MPI process vs. receiving MPI process (0-250).
rdonsper gridpoints 250 ie miial. hnpsil,tempe ilload will deter- mapper usually the is possible, machine When given a empirically.) for bandwidth strategy mined bus correct system on the and depends - sizes, mappings memory two speeds, the processor of performance relative (The and GPUs a and CPUs GPU the strategies: between work mapping divides different strategy two supports which improved its to mitigation. contributors exe- noise key show, OS and are will parallelism out-of-order we additional tasks dif- as discover cute and on to state, and work ability simulation physical tasks Legion’s the various these of of subsets the ferent Many of time each each computations. For used chemical perform be simulation. sub-tasks will the that step, of regions execution logical the throughout creates task top-level The un- I/O file and monitoring, interface, code changed. user the of with main bulk deals ap- the the that This allowing moving while of step. Legion time benefits into next computation performance the the calculate yields to back proach side handed then Legion is the data to the checkpointing, the or filter- of analysis, occasional state some ing, Legion performing updated possibly the the After receives for simulation. and waits finish thread to computation. application equivalent computation Fortran the data performs main input that The explicit the task Legion of by a handoff to followed a with calculation integration) the right-hand-side Runge-Kutta to of call re- stages it entire implementation, an MPI-only places original the from Starting S3D of Version MPI+Legion 3.3 S3D-MPI. to similar and I/O GAMPI collective using MPI computed placement uses task task it However, on-node multi-node S3D-MPI. require the for not above does MPI described so 1 as placement and uses node only network S3D-OpenACC per device. 
rank the on asyn- streams execution parallel executed multiple of using be accelerator to the allowed S3D-OpenACC on chronously has in in advances regions support parallel recent compiler the Furthermore, and directives [13]. OpenACC accelerators Ref. on the executed in be described to kernels as to the addition hy- allow in was to added S3D OpenMP were Once directives OpenACC efficiency. hybridized the of be bridized, loss to could without loops that OpenMP grid regions with various computational the Therefore, large in combining form by region. cost refactored OpenMP fixed was each the S3D for offset threads to the computation therefore spawning and enough granularity having fine would not too approach having an from suffered such have au- However, through or parallelization. manually tomatic either OpenMP be to could converted that they easily loops grid, in Cartesian occur three-dimensional computations the S3D’s traverse of multicore while all parallel Since architectures massively systems. on accelerated scalability GPU its improving to also Levesque it by porting directives thereby OpenACC and OpenMP MPI+OpenACC - S3D 3.2 [12]. distance min- internode by mean ordering task the a imizing computes and scheduler batch the by h einvrino 3 ssacsoie mapper customized a uses S3D of version Legion The tasks. of tree a using implemented is code Legion The approach. alternate an takes S3D of version Legion The with augmented and refactored was S3D MPI-only The taeypae smc oka osbeo h GPU. the on possible as work much as places strategy integrate ucin(..six (i.e. function tal. et mixed [13] all- balance work by assigning tasks to sets of processors where For our experiments, we used the three versions of the S3D the task can execute. In Section 4 we will show how this application described in Section 3: a CPU only MPI version, load balancing feature improves tolerance to OS noise. a mixed CPU/GPU MPI+OpenACC version, and a Legion version. 
The Legion version was tested in two configura- 4. EVALUATION tions: 1) All CPU, where S3D tasks were mapped to only CPUs, and 2) CPU+GPU, where S3D tasks were mapped In this section we present our experimental evaluation of to both CPUs and GPUs. Using the selfish detour utility, the impact of system performance variability on three ver- noise is injected into each of the S3D versions. The same sions of S3D each optimized for different runtimes. First set of noise signatures is tested across all S3D versions and we describe the experimental environment 4.1 used to con- this noise is selectively targeted at one or more bound CPU duct our evaluation. Next we describe our testing methodol- threads. The hybrid MPI+OpenACC and Legion versions ogy 4.2 describing how performance variability is introduced differ significantly from the MPI only version of S3D in that in each of the three versions of S3D and the amount of per- the former run a single process per node while the MPI only formance variability“injected”into the application. We then version is optimally configured with a single process per core present the impact of this induced performance variability on this system. This impacts our noise injection strategy as (noise injection) 4.3 on the performance of the three versions a single signal handler for noise is generally configured for of S3D. Finally, we conclude this section with an analysis of each process. To accommodate this difference, for the MPI the mechanisms 4.4 within Legion that help mitigate perfor- only version, the noise interval is increased by the number mance variability. of processes per node and the interval timer used for each 4.1 Experimental Environment process is offset by local process number × noise duration resulting in a wave front of noise across the local processes. 
All experiments were performed on two Cray XK-7 sys- In all versions noise injection is not coordinated between tems, the Titan supercomputer and a smaller scale testbed nodes. of the same configuration at the Oak Ridge National Lab- The noise signatures used for our experiments were based oratory. Each compute node in these systems is composed on previous work [19, 20]. For the MPI version of S3D, of one 16-core AMD Opteron running at 2200MHz and one we explore the impact of injecting noise to a single process Nvidia K20X GPU with 32 Gigabytes of DRAM memory per node vs. distributing noise across all MPI processes on and 6 Gigabytes of GDDR memory. Nodes are intercon- the node. For the MPI+OpenACC version, noise is injected nected via the Cray Gemini interconnect in a 3-D torus into the single master process running on the host, affecting topology. The system software environment was Cray CLE all CPU threads within the OpenACC runtime. For both 5.2UP02. Legion versions we explore the effects of noise injection on To verify the experimental results as well as get a better different thread types: understanding of how noise gets amplified and absorbed in S3D, a validated OS noise simulator was also used. This • S3D Fortran Thread: All noise is injected on the simulator framework comprises LogGOPSim [14] and the tool thread running the Fortran component of the appli- chain developed by Levy et al. [15]. LogGOPSim uses the cation. This thread is mainly busy at start-up and LogGOPS model, an extension of the well known LogP tear-down, and idle for most of a run. model [16], to simulate application traces that contain all • App Thread 1: All noise is injected on the first CPU exchanged messages and group operations. In this way, Log- processor thread allocated to application level tasks. GOPSim reproduces all happens-before dependencies and the This thread is bound to a dedicated physical core. 
transitive closures of all delay chains of the application exe- • Realm Threads: All noise is injected on the low-level cution. It can also extrapolate traces from small application Realm threads responsible for handling active mes- runs with p processes to application runs with k·p processes. sages, performing data movement, and offloading work The extrapolation produces exact communication patterns onto GPU processors. Realm threads are bound to for all collective communications and approximates point- the set of physical cores not used by CPU processor to-point communications [14]. Levy et al.’s tool chain adds threads. the capability to simulate resilience activities such as check- • Legion Threads: All noise is injected on the Legion pointing and other scenarios such as in-situ analytics and runtime threads responsible for performing the dy- coupled codes. This tool chain extension has been validated namic program analysis necessary to map and execute against experiments and established models in [15, 17]. a Legion program on a target machine. Legion threads OS Noise was injected into the application using the Self- are bound to the same physical cores as Realm threads. ish Detour [18] utility developed as part of the ZeptoOS • Any Thread: Noise is injected on all threads. project at Argonne National Laboratory. This utility uti- lizes the SIGALRM signal available in POSIX environments The ability to control where noise is injected within the to trigger noise injection within one or more application level Legion runtime provides the opportunity to assess not only threads at a configurable interval. A signal handler then the ability of dynamic task scheduling to potentially absorb samples a random distribution to determine if a configurable noise but also the impact of system noise on distinct task amount of work will be conducted within the signal handler types in the runtime. 
thereby creating additional work for the CPU and block- Each noise profile was injected using two different meth- ing the application thread during execution of the handler. ods, unscaled and scaled. For the unscaled method, the noise For all of our experiments we set the probability to unity, profile being injected on each node was distributed evenly guaranteeing that work will be conducted at each interval. among the threads being targeted on that node, resulting in the node as a whole experiencing one copy of the noise pro- 4.2 Testing Methodology file. For the scaled noise injection method, the noise profile 1.2 0.90 No Noise No Noise App Thread 1 App Thread 1 Realm Threads Realm Threads 1.1 S3D Fortran Thread 0.85 S3D Fortran Thread Legion Threads Legion Threads Any Thread Any Thread 1.0

Figure 2: Legion CPU+GPU version with 10Hz-2500us-unscaled noise signature (runtime per step)

Figure 4: Legion CPU+GPU version with 10Hz-2500us-scaled noise signature (runtime per step)

Figure 3: OpenACC version with 10Hz-2500us-unscaled noise signature (runtime per step)

4.3 Noise Injection: Impact on S3D Versions

We first examine the impact of a noise duration of approximately 2500usec occurring at a frequency of 10Hz, which corresponds empirically to a loaded preemption of a process by the operating system.

Figures 2 and 3 illustrate the performance in terms of walltime/timestep for the Legion CPU+GPU and MPI+OpenACC S3D versions for unscaled noise injection. The Legion CPU+GPU version shows very little performance impact, 1 − 2%, from this noise pattern irrespective of the number of nodes. The MPI+OpenACC version incurs a minimum of 0.75% performance degradation at 16 nodes and a maximum of 4.5% at 1024 nodes. Overall, S3D absorbs this noise signature well for both of these configurations and does not appear to amplify noise as a function of scale due to its use of a nearest neighbor stencil communication pattern. The Legion All CPU and MPI only versions were similarly insensitive.

Next we examine the impact of this same noise signature, 10Hz − 2500usec, scaled by the number of cores per node. In the Legion and MPI+OpenACC versions this is accomplished by injecting the noise signature with frequency scaled by 10Hz × cores per node to the single application process per node. In the MPI only version this is accomplished by injecting the noise signature without frequency scaling to every process in the "Distribute to All Processes" configuration, and the frequency scaled noise signature to the first MPI process on each node in the "Concentrate to Process 0" configuration. As illustrated in Figure 4, the Legion CPU+GPU version is marginally impacted by this noise signature, with the most significant impact (up to 6% degradation) occurring when noise is injected to any thread or restricted to just the low-level Realm threads. This is a result of noise impacting communication and work scheduling threads in the runtime, limiting the efficient scheduling of tasks and communication of data dependencies in the S3D stencil. The Legion CPU+GPU version realizes little performance impact, < 1%, when noise is injected into an application CPU processor thread ("App Thread 1") or the Fortran thread. Figure 5 illustrates a similar performance impact in the Legion CPU only version: when noise is injected into the Legion, Realm, or any thread, performance degradation of up to 16% occurs, compared to 5% or less when noise is injected into the application threads. These results indicate that Legion is better able to adapt to noise on the application threads than noise on the Legion runtime's communication and utility threads. The MPI+OpenACC version slowdown is more pronounced, resulting in 16% − 26% degradation as the noise impacts any communication or CPU level calculations within the process, as illustrated in Figure 6. The MPI only version exhibits up to a 4% slowdown when noise is distributed across all processes on a node. When noise is concentrated on node local process 0, performance degradation from 4 − 16% was observed, as illustrated in Figure 7. This is a result of a single process per node significantly degrading local stencil computations. When contrasted with the Legion CPU only version with noise injection directed at a single computation task, we begin to see the impact of static mapping of tasks to cores when 2.5% net noise is induced. Dynamic task scheduling as in the Legion CPU only version alleviates some of the impact of this type of noise.

Next we consider a shorter duration noise signature (250usec) with higher frequency (100Hz).

Figure 5: Legion all CPU version with 10Hz-2500us-scaled noise signature (runtime per step)

Figure 7: MPI only version with 10Hz-2500us-scaled noise signature (runtime per step)

1.2 0.70 22 23 24 25 26 27 28 29 210 22 23 24 25 26 27 28 29 210 Node count Node count Figure 6: OpenACC version with 10Hz-2500us-scaled noise Figure 8: Legion all GPU version with 100Hz-250us- signature (runtime per step) unscaled noise signature (runtime per step) compared to the 10Hz − 2500usec noise signature. In the then drops to 0% as scale increases to 512 and 1024 nodes. unscaled noise signature case we see very little performance Conversely, S3D exhibits very little performance degrada- impact, from 1−3% as illustrated in Figures 8 and 9. Legion tion when noise is isolated to communication and schedul- CPU only and MPI only exhibit similar results. ing threads at smaller node counts but then increases sig- When noise is scaled by core count the performance im- nificantly as scale increases, indicating that at larger node pact is more pronounced. The Legion CPU+GPU version counts Legion’s dynamic scheduling of application level tasks exhibits up to an 8% slowdown with this noise signature can effectively mitigate noise even as scale increases but that with the largest slowdown at 256 nodes and noise injected communication and task scheduling threads are more sus- to either Realm or Legion level threads again demonstrat- ceptible to OS noise in this application. ing that the performance degradation of communication and In all of these cases, the performance impact of directing scheduling threads is the most impacting to application per- noise to the “S3D Fortran Core” is significantly lower than formance for S3D. Application performance is minimally af- the impact of noise on the Realm or Legion threads. Recall fected across all scales when noise is only injected in ap- that all of these threads are sharing the same set of physical plication level tasks. More pronounced is the performance cores (i.e. the ones not assigned to run application tasks), so impact of OS noise on the Legion CPU only version of S3D. 
noise events handled by any of these threads have identical Figure 10 illustrates the impact this noise signature has on impacts on the availability of execution resources, poten- the Legion CPU only version. Of note is the significant dif- tially displacing communication and scheduling work being ference in performance degradation when noise is injected in performed by the runtime. The difference is in which run- communication and scheduling threads, up to 12.8%, com- time operations are interrupted (because they are assigned pared to a maximum of 6% when noise is isolated to ap- to the thread that is handling the noise event). The Fortran plication computation tasks. Perhaps more significant is thread has no work assigned to it during the bulk of the the inverse relationship between these scenarios in terms of S3D execution, while the Legion and Realm threads have performance degradation as a function of scale. Figure 11 scheduling and communication operations whose delay im- provides a more detailed view of this phenomena. S3D per- pacts overall performance more, especially at scale. formance is degraded by 6% from 16 to 256 nodes and The MPI+OpenACC versions exhibits moderate perfor- 1.8 14 No Noise No Noise Master Process 12 App Thread 1 1.7 Realm Threads 10 S3D Fortran Thread Legion Threads 1.6 8 Any Thread

1.5 6

Change (%) 4 1.4

Runtime/timestep (sec) 2 1.3 0

1.2 2 22 23 24 25 26 27 28 29 210 22 23 24 25 26 27 28 29 210 Node count Node count Figure 9: OpenACC version with 100Hz-250us-unscaled Figure 11: Legion all cpu version with 100Hz-250us-scaled noise signature (runtime per step) noise signature (Percentage degradation)

9.5 1.9 No Noise No Noise 9.0 App Thread 1 Master Process Realm Threads 1.8 8.5 S3D Fortran Thread Legion Threads 1.7 8.0 Any Thread

7.5 1.6

7.0 1.5 Runtime/timestep (sec) Runtime/timestep (sec) 6.5 1.4 6.0

5.5 1.32 3 4 5 6 7 8 9 10 22 23 24 25 26 27 28 29 210 2 2 2 2 2 2 2 2 2 Node count Node count Figure 10: Legion all cpu version with 100Hz-250us-scaled Figure 12: OpenACC version with 100Hz-250us-scaled noise signature (runtime per step) noise signature (runtime per step) mance degradation under this noise signature, between experiment load balancing reduces the impact of OS noise 4 − 7.5% as illustrated in Figure 12. The MPI only ver- directed at an application level task from 20% to 5%. These sion exhibits a 4 − 7% degradation when noise is injected in results indicates that load balancing of computational tasks node local process 0 and a minimal degradation of 1% when plays a significant role in reducing the impact of OS noise. noise is spread across all processes on the node. Compared with the longer duration, lower frequency noise signature Table 1: Legion all cpu version runtime/step with and with- out load balancing (10Hz −2500usec) the statically mapped version of S3D are Noise signature With load Without significantly less impacted by this noise signature. This is (scaled - App Thread 1) balancing load balancing in contrast to the Legion version of S3D whose performance No noise 6.007 6.01 degradation is similar across both noise signatures and with 100Hz-250us 6.373 7.3 respect to which threads are targeted with noise. 10Hz-2500us 6.337 7.27 4.4 Mechanisms in Legion that Mitigate Noise While our performance studies suggest that dynamic task In our next experiment we demonstrate the impact of task scheduling can alleviate some forms of OS noise a more thor- scheduling windows by reducing the window size from 1024 ough analysis of the scheduling mechanisms that contribute to 2. As in the load balancing experiment we fix the num- to this is in order. Legion provides two primary mechanisms ber of nodes at 64 and examine the impact of our two noise for dynamic task scheduling, task scheduling windows sim- signatures with load balancing enabled. 
Figures 13 and 14 ilar to out-of-order instruction windows [21] and load bal- illustrate the impact of reducing the window size with noise ancing of tasks across processors. To evaluate the impact signatures of 100Hz − 250us and 10Hz − 2500us respec- of each of these mechanisms we begin with an experiment tively. As the window size decreases from 1024 to 256 the at a fixed node count (64) with the same task window size impact of noise remains fairly constant indicating that Le- (1024) used in our previous experiments and comparing the gion is still able to identify sufficient work within the task no-noise case with the scaled noise signatures with and with- stream such that the induced noise continues to have mini- out load balancing. Table 1 summarizes these results. In this mal impact on performance. As the task window is reduced to 128 Legion is no longer able to find sufficient work re- advance by the runtime. This result indicates that on node sulting in performance degradation that begins to dominate load balancing and out-of-order execution strategies are im- any induced noise effects. Performance continues to degrade portant mechanisms in dealing with performance variability. significantly as the window size is decreased further. These Our results also demonstrate that application perfor- mance is significantly impacted by where noise is injected and what threads of execution are interrupted or displaced
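The load-balancing effect summarized in Table 1 can be reproduced qualitatively with a toy list-scheduling model. The sketch below is a minimal Python illustration, not Legion's actual scheduler; the task count, worker count, and 25% slowdown are illustrative assumptions rather than measurements from our experiments.

```python
def makespan_static(n_tasks, n_workers, noisy_slowdown):
    """Static partitioning: each worker is pre-assigned n_tasks/n_workers
    unit-cost tasks; worker 0 absorbs all OS noise and runs
    noisy_slowdown x slower, so it alone sets the critical path."""
    per_worker = n_tasks // n_workers
    return per_worker * noisy_slowdown

def makespan_dynamic(n_tasks, n_workers, noisy_slowdown):
    """Greedy list scheduling: each task goes to the worker that would
    finish it earliest, so healthy workers absorb the tasks the noisy
    worker cannot keep up with."""
    finish = [0.0] * n_workers
    for _ in range(n_tasks):
        w = min(range(n_workers), key=lambda i: finish[i])
        finish[w] += noisy_slowdown if w == 0 else 1.0
    return max(finish)

# 64 unit tasks on 8 workers; worker 0 is slowed 25% by injected noise.
print(makespan_static(64, 8, 1.25))   # 10.0 (vs. 8.0 noise-free)
print(makespan_dynamic(64, 8, 1.25))  # 9.0
```

In this toy model, static partitioning passes the noisy worker's full 25% penalty through to the makespan, while greedy balancing shifts work to the healthy workers and roughly halves the penalty -- the same qualitative behavior as the with/without load balancing columns of Table 1.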

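For context, the two noise signatures used throughout this study can be compared by the net CPU time they consume. The back-of-envelope calculation below (an illustration, not part of the measurement harness) shows that both signatures remove the same 2.5% of each second per core and differ only in how the interruptions are distributed:

```python
def net_noise_fraction(freq_hz, duration_us):
    """Fraction of each second consumed by noise events:
    (events per second) x (seconds stolen per event)."""
    busy_us_per_second = freq_hz * duration_us
    return busy_us_per_second / 1_000_000

# Both signatures steal 25,000 us per second per core -- 2.5% net noise --
# but 10Hz-2500us delivers rare, long interruptions while 100Hz-250us
# delivers frequent, short ones.
print(net_noise_fraction(10, 2500))   # 0.025
print(net_noise_fraction(100, 250))   # 0.025
```

That the two signatures are identical in net noise yet produce very different slowdowns is exactly why the noise signature, rather than the net noise, dominates observed behavior.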
Figure 13: Legion all-CPU version runtime/step with varying window size and 100Hz-250us scaled noise signature.

Figure 14: Legion all-CPU version runtime/step with varying window size and 10Hz-2500us scaled noise signature.

5. CONCLUSIONS

This paper has presented a study of the impact of OS noise in dynamic runtime environments. Previous works have focused on the impact of OS noise in static runtime environments (MPI) and have not explored its impact on different application-level activities such as computation, communication, and task scheduling. Our results show that dynamic runtime environments such as Legion can absorb OS noise when load balancing and out-of-order task scheduling mechanisms are employed. We show that when noise is injected into a single core with a 10Hz-2500us signature, the MPI+OpenACC version of S3D realizes up to a 25% slowdown while the Legion CPU+GPU version realizes at most a 2% slowdown. We further demonstrate that adaptive load balancing can reduce the overall impact of noise in the Legion CPU-only version from a 20% slowdown to 6% when sufficient work is scheduled in advance by the runtime. This result indicates that on-node load balancing and out-of-order execution strategies are important mechanisms for dealing with performance variability.

Our results also demonstrate that application performance is significantly impacted by where noise is injected and which threads of execution are interrupted or displaced by OS noise. Previous work has demonstrated that how noise is generated (the noise signature) is often more important than the net noise on a system. Our work demonstrates that OS noise affecting the communication and scheduling threads in a dynamic runtime environment reduces application performance by up to 12%, with the impact growing as a function of node scale. Conversely, the same noise signature targeted instead at application tasks results in at most a 6% performance degradation at 16 nodes, decreasing to 0% at 1024 nodes. This effect is more pronounced in our experiments because the number of cores dedicated to the runtime is 1/8th of those dedicated to application-level tasks, which places additional constraints on the runtime system when attempting to mitigate this form of noise. Future systems such as Intel's Knights Landing will contain significantly more processing cores than today's systems. In such systems, runtimes such as Legion will likely be able to consume more processing cores with similar relative overheads, presenting new opportunities to mitigate the impact of OS noise within the communication and scheduling threads.

The observed differences in the noise sensitivity of the different Legion threads show that, in dynamic runtime environments, having the OS run a background operation on a separate thread can be significantly better than "hijacking" a user-level thread. The former consumes execution resources but still allows the runtime to load-balance across the remaining resources, while the latter also delays the specific application or runtime operations that have been assigned to that user-level thread.

Acknowledgments

The majority of the work presented here is supported by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the United States Department of Energy, under the guidance of Dr. Sonia Sachs. Additional support was provided by the Department of Energy's National Nuclear Security Administration, Advanced Simulation and Computing Program.

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Los Alamos National Laboratory is operated by Los Alamos National Security LLC for the U.S. Department of Energy under contract DE-AC52-06NA25396. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
6. REFERENCES

[1] L. V. Kale and S. Krishnan, CHARM++: A Portable Concurrent Object Oriented System Based on C++. ACM, 1993, vol. 28, no. 10.
[2] "Open Community Runtime," September 2014. [Online]. Available: https://xstackwiki.modelado.org/images/1/13/Ocr-v0.9-spec.pdf
[3] Q. Meng, A. Humphrey, and M. Berzins, "The Uintah framework: A unified heterogeneous task scheduling and runtime system," in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, Nov. 2012, pp. 2441-2448.
[4] J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta, "Hierarchical task-based programming with StarSs," International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 284-299, 2009.
[5] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, "Legion: Expressing locality and independence with logical regions," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp. 66:1-66:11.
[6] S. Treichler, M. Bauer, and A. Aiken, "Language support for dynamic, hierarchical data partitioning," in Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, ser. OOPSLA '13. New York, NY, USA: ACM, 2013, pp. 495-514.
[7] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, "Structure slicing: Extending logical regions with fields," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Los Alamitos, CA, USA: IEEE Computer Society Press, 2014, to appear.
[8] S. Treichler, M. Bauer, and A. Aiken, "Realm: An event-based low-level runtime for distributed memory architectures," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, ser. PACT '14. New York, NY, USA: ACM, 2014, pp. 263-276.
[9] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, M. Welcome, and T. Wen, "Productivity and performance using partitioned global address space languages," 2007.
[10] D. Koufaty and D. T. Marr, "Hyperthreading technology in the NetBurst microarchitecture," IEEE Micro, vol. 23, no. 2, pp. 56-65, Mar. 2003.
[11] J. H. Chen, A. Choudhary, B. de Supinski, M. DeVries, E. R. Hawkes, S. Klasky, W. K. Liao, K. L. Ma, J. Mellor-Crummey, N. Podhorszki, R. Sankaran, S. Shende, and C. S. Yoo, "Terascale direct numerical simulations of turbulent combustion using S3D," Computational Science & Discovery, vol. 2, p. 015001, 2009.
[12] R. Sankaran, J. Angel, and W. M. Brown, "Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications," Concurrency and Computation: Practice and Experience, Mar. 2015.
[13] J. Levesque, R. Sankaran, and R. Grout, "Hybridizing S3D into an Exascale application using OpenACC: An approach for moving to multi-petaflops and beyond," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), Nov. 2012, pp. 1-11.
[14] T. Hoefler, T. Schneider, and A. Lumsdaine, "LogGOPSim - Simulating large-scale applications in the LogGOPS model," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, Jun. 2010, pp. 597-604.
[15] S. Levy, B. Topp, K. B. Ferreira, D. Arnold, T. Hoefler, and P. Widener, "Using simulation to evaluate the performance of resilience strategies at scale," in High Performance Computing, Networking, Storage and Analysis (SCC), 2013 SC Companion. IEEE, 2013.
[16] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a realistic model of parallel computation," SIGPLAN Not., vol. 28, no. 7, pp. 1-12, Jul. 1993.
[17] S. Levy, B. Topp, K. B. Ferreira, D. Arnold, P. Widener, and T. Hoefler, "Using simulation to evaluate the performance of resilience strategies and process failures," Sandia National Laboratories, Technical Report SAND2014-0688, 2014.
[18] "Selfish Detour Noise Injection Utility." [Online]. Available: http://www.mcs.anl.gov/research/projects/zeptoos/downloads/
[19] K. B. Ferreira, R. Brightwell, and P. G. Bridges, "Characterizing application sensitivity to OS interference using kernel-level noise injection," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Supercomputing '08), November 2008.
[20] K. B. Ferreira, P. Widener, S. Levy, D. Arnold, and T. Hoefler, "Understanding the effects of communication and coordination on checkpointing at scale," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 883-894.
[21] P. K. Dubey, G. B. Adams, and M. Flynn, "Instruction window size trade-offs and characterization of program parallelism," IEEE Transactions on Computers, vol. 43, no. 4, pp. 431-442, 1994.

LA-UR-15-22880. Approved for public release; distribution is unlimited.