Optimizing the LULESH Stencil Code using Concurrent Collections

Chenyang Liu and Milind Kulkarni
Purdue University, 465 Northwestern Avenue, West Lafayette, Indiana
[email protected], [email protected]

ABSTRACT
Writing scientific applications for modern multicore machines is a challenging task. There is a myriad of hardware solutions available for many different target applications, each having its own advantages and trade-offs. An attractive approach is Concurrent Collections (CnC), which provides a programming model that separates the concerns of the application expert from those of the performance expert. CnC uses a data and control flow model paired with philosophies from previous data-flow programming models and tuple-space influences. By following the CnC programming paradigm, the runtime will seamlessly exploit available parallelism regardless of the platform; however, there are limitations to its effectiveness depending on the algorithm. In this paper, we explore ways to optimize the performance of the proxy application Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH), written using Concurrent Collections. The LULESH algorithm is expressed as a minimally-constrained set of partially-ordered operations with explicit dependencies. However, performance is plagued by scheduling overhead and synchronization costs caused by the fine granularity of computation steps. For LULESH and similar stencil codes, we show that an algorithmic CnC program can be tuned by coalescing CnC elements through step fusion and tiling to become a well-tuned and scalable application on multi-core systems. With these optimizations, we achieve up to 38x speedup over the original implementation, with good scalability on machines with up to 48 processors.

Keywords
Concurrent Collections, LULESH, Tiling

1. INTRODUCTION
Effectively programming scientific applications is a very challenging task. There are many research fields dedicated to producing new models and methods for simulating real-world phenomena. However, the domain scientists who develop these models require expertise in parallel programming to run their applications effectively and take advantage of modern and future architectures. An intuitive way to solve this problem is to use an architecture-agnostic programming language that separates the responsibilities of the domain scientist from those of the programming expert. This can be achieved using Concurrent Collections (CnC), a parallel programming model that combines ideas from previous work in tuple-spaces, streaming languages, and dataflow languages [2, 7, 13].

In CnC, algorithmic programs are expressed as a set of partially-ordered operations with explicitly defined dependencies. The CnC runtime exploits potential parallelism, so long as it satisfies the dependencies set by the programmer. The responsibility of the programmer is minimal, only requiring them to provide the basic semantic scheduling constraints of their problem. The advantage of simplified programmability comes at the expense of sub-optimal fine-grained parallelism for these programs. In this paper, we explore how to rewrite CnC programs with specific equivalent-state transformations to boost parallel performance. We investigate the effects of these transformation techniques on the Livermore Unstructured Lagrange Explicit Shock Hydro (LULESH) mini-app, a hydrodynamics code created in the DARPA UHPC program [9, 11].

We discover that there are many opportunities to tune complex CnC programs. Algorithms with many unique computational steps can be transformed to alter the working-set granularity through step fusion or distribution. These transformations are similar to classic loop fusion and distribution, providing benefits through computation reordering while preserving execution semantics. Furthermore, we look at tiling the computation steps to increase the working-set size and reduce the amount of fine-grained parallelism. These techniques help reduce scheduling overhead and improve data locality for stencil codes such as LULESH. In our experiments, we show that step fusion and tiling greatly impact performance and scalability in the mini-app, giving up to 38x speedup over the sequential baseline implementation on 48 cores for a moderate (60³) simulation size.

WOLFHPC2015, November 15-20, 2015, Austin, TX, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. 978-1-4503-4016-8/15/11. http://dx.doi.org/10.1145/2830018.2830024

2. OVERVIEW OF CNC
In this section, we provide a brief overview of Concurrent Collections (CnC). We summarize the basic concepts and highlight the important aspects that make CnC a compelling programming model for scientific applications such as LULESH. A more in-depth description of CnC can be found in previous related work [2, 3, 14].

The CnC programming model separates the algorithmic semantics of a program from how the program is executed. This is an attractive solution for a domain scientist, whose primary concern is algorithmic correctness and stability. CnC relies on a tuning expert, whose expertise is in performance, to properly map the application to a specific platform. This separation of concerns simplifies the roles of each party. The domain scientist can program without needing to reason about parallelism or architecture constraints, while the tuning expert can focus on performance, given maximum freedom to map the execution onto the hardware.

CnC programs must, however, adhere to a certain set of rules. Firstly, computation units (steps) in CnC must execute statelessly, so they cannot modify any global data. Instead, the language uses built-in put/get operations to store and retrieve data items, which assist the runtime with scheduling dependencies. Secondly, all data items exhibit the dynamic single-assignment property, meaning they are immutable once they are written. This requirement prevents data races during parallel execution and helps resolve data dependencies at runtime. In the following section, we give a brief overview of the CnC programming model.

CnC Programming Model
We describe the CnC language using a simple example of a reduction routine. Assume that we wish to sum values from four different sources, F = Σ_{i=1}^{4} x_i. Figure 1 shows a CnC representation of how a domain expert may represent the computation and its dependencies.

Figure 1: Reduction Operation in CnC

There are three types of nodes in this figure: computation steps, data items, and control tags. The solid directed edges between nodes specify producer-consumer data dependencies in the program, while the dotted edge corresponds to a control dependency. The computational units in CnC are steps, represented as ovals in Figure 1. Steps are organized in step collections, which contain distinct stateless operations that need to be scheduled and executed. In CnC, steps are dynamically created when control tags are prescribed in the program. The set of control tags is saved in the tag collection. In our example, a tag is created for every reduction step that needs to execute, represented as the yellow triangle. Finally, the program data is stored in the (data) item collection, shown in Figure 1 by the rectangular nodes. The item collection contains instances of data that steps use as inputs or outputs. Each data item is dynamic and is atomically assigned once in its lifetime (dynamic single-assignment) in order to provide the runtime with definitive information and prevent race conditions. These data items are also important during execution for scheduling; many steps consume data from previous steps as well as produce it, indicating program dependencies.

In a typical CnC program, a step collection is defined similarly to a traditional C function, except that it cannot hold any state and must receive data from other steps through data items. For our reduction, the program will first prescribe the reduction computation step using a control tag t_i. Once prescribed, the step will be scheduled to run once all data dependencies are resolved. In the example, the reduction step waits to get 4 values, x_1 to x_4, indicating that the function will consume those values produced from a previous computation. Just as the get construct indicates a consumer dependence, the producers of the data will produce those values by initiating a put. Once the computation is performed in the reduction step, it will also store the result F by initiating a put to its respective data item using a tag for future retrieval. The get and put constructs assist the runtime in scheduling steps and resolving dependencies.

A step collection can only be associated with a single tag collection/space; however, a single tag collection can prescribe multiple step collections. Steps are created when a control tag prescribes a particular instance of the step collection. While the existence of a tag prescribed for a specific step means that instance of the step will execute, it does not specify when that step executes. Execution of that step depends on the availability of data items (consumer dependencies) as well as the CnC scheduler, which determines how tasks are handled. How and when tags are assigned is left up to the programmer. CnC allows for prescribing tags in advance, leaving the scheduler to manage dormant tags, as well as dynamically prescribing tags for each step instance as its data becomes ready. Previous work shows tradeoffs exist for each implementation, but it is a flexible parameter that can be tuned [4].

Tasks are scheduled as the program executes, with steps producing data and sometimes prescribing tags. A step becomes available as soon as its tag is prescribed, and produced data items become ready once they are atomically written. Whenever its item dependencies are ready and its tag has been prescribed, a step becomes enabled and may be executed. The runtime exploits parallelism by scheduling the available step instances as soon as they become enabled. Once all prescribed steps have been executed, the CnC program completes.

In principle, CnC programs should be well-tailored to run on both shared-memory and distributed systems [5]. Furthermore, there is support for various different schedulers in the CnC runtime. We focus on the CnC Thread Building Blocks (TBB) implementation, which uses work stealing on top of the default work queues [6, 15]. Although alternative schedulers exist, we find that the TBB scheduler consistently outperforms, or performs almost as well as, the best scheduler in most cases. Our work uses the Intel CnC implementation, one of the more mature and robust versions of CnC [5]. In this version, there is an additional speculation mechanism built into the runtime. As threads are assigned steps to execute, sometimes a step will begin execution before all of its inputs are ready. In this scenario, the runtime will requeue the step and try it again at a later time. We discuss the effects of requeuing later in the results.

3. OVERVIEW OF LULESH
In this section, we describe the LULESH (2.0) application and the details of the algorithm written in CnC. LULESH is a fully-featured hydrodynamics mini-app developed by Lawrence Livermore National Laboratory that simulates the effect of a blast wave in a physical domain through explicit time-stepping. Hydrodynamics is a challenging problem, and was previously shown to account for a significant fraction (27%) of computing resources used by the Department of Defense [9]. We use LULESH to test optimization techniques that can likely be applied to problems in other domains that share the same static/dynamic properties with time-stepping execution.

The original LULESH specification is physics code that operates on an unstructured hexahedral mesh with two centerings. There is an element centering (the center of each hexahedron) that stores data on thermodynamic and physical properties, and a nodal centering (the corners of each hexahedron) that tracks kinetic values such as position and velocity of those points. These two centerings are also the primary iteration spaces involved in the program, and form the foundation for the CnC tag collections.

The application begins by initializing a 3-dimensional hexahedral mesh of arbitrary size, then defining the spatial coordinates and neighbors of all elements and nodes in the mesh. Surrounding boundary conditions are applied and initial values are assigned before beginning the time-stepping algorithm. At every iteration, the time scale is computed based on physical constraints to determine a maximum safe time value to advance for the subsequent iteration. Once determined, the kinetic values (force, position, velocity) are computed for all nodal quantities, which in turn are used to calculate the thermodynamic variables for all elements. The cycle completes as the next time-step value is calculated using the constraint values from the current step. More detailed information about the LULESH mini-app can be found in other papers [9, 11, 12].

Figure 2: High-level LULESH Algorithm

The high-level algorithm of LULESH can be seen in Figure 2, represented in a decomposed form consisting of algorithmic steps. Each node in the graph represents an important computational step involved in the LULESH algorithm, with directed edges corresponding to producer/consumer data dependencies for each step. For simplicity, only dependencies within the current time step are shown, since dependencies on previous time steps are trivial and do not alter the algorithm's control flow. We list and give a brief summary of each computational step.

• Compute Delta Time: Prior to every iteration, this checks all element data from the previous iteration to determine the next time-step value. Has a separate tag space.
• Compute Stress/Hourglass Partial Force: Forces are calculated for each element using data from the previous iteration's elements.
• Force Reduction: Partial forces for every node are summed up from the 8 neighboring elements.
• Compute Velocity/Position: Kinetic values are computed for each node using previous nodal forces/positions/velocities.
• Compute Volume/Derivative/Gradient/Characteristic: Physical properties are computed for each element using kinetic values.
• Compute Viscosity Terms: Neighbor gradient data is gathered along with volume data to calculate element viscosity terms.
• Compute Energy Terms/Time Constraints: Thermodynamics/physics terms are calculated for each element using previous element data.
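The per-iteration control flow behind the step list above can be sketched as a plain sequential skeleton (a minimal sketch: the function names mirror the steps listed above, but the physics bodies are stubbed out and the `Mesh` type is our own invention, not LULESH's actual data layout):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical, heavily simplified stand-in for the LULESH mesh state.
struct Mesh {
    int n;                      // elements per dimension
    double time = 0.0;          // simulated time
    std::vector<double> force;  // per-node reduced forces (flattened)
};

// Stub: in LULESH the next dt is constrained by per-element courant/hydro limits.
double compute_delta_time(const Mesh&) { return 1.0e-3; }

// Stubs mirroring the step collections listed above.
void compute_partial_forces(Mesh&) {}      // stress/hourglass per element
void reduce_forces(Mesh&) {}               // sum 8 element contributions per node
void compute_velocity_position(Mesh&) {}   // nodal kinematics
void compute_element_properties(Mesh&) {}  // volume/gradient/viscosity/energy terms

// One explicit time-stepping cycle per iteration, as in the overview above.
double run(Mesh& mesh, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        double dt = compute_delta_time(mesh);  // global check, separate tag space
        compute_partial_forces(mesh);
        reduce_forces(mesh);
        compute_velocity_position(mesh);
        compute_element_properties(mesh);
        mesh.time += dt;
    }
    return mesh.time;
}
```

In the CnC version each call in this loop body becomes a step collection prescribed over the nodal or element tag space, rather than a function invoked in sequence.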
3.1 CnC Specification
Every CnC program includes a high-level skeleton called the CnC specification, which can also be referred to as a domain specification. This specification contains information regarding the item, step, and tag collections, as well as all data produced and consumed by each step. Using this data, a high-level dependency graph can be constructed to analyze most of the data and control flow. Initially, we believed this information was enough to make high-level program changes by applying graph-based transformations. However, it became apparent that the data available was insufficient to apply high-level transformations due to lower-level program interactions.

This information is also depicted in CnC's textual representation of the CnC specification for the program, as shown in Figure 3, which contains the high-level information of each collection, including the list of steps, tag declarations, and data items produced and consumed by each step. In CnC programs, the user must explicitly declare all items for each of the 3 collections. This information is helpful for determining the available optimizations that can be performed to transform the code, but not sufficient to determine the legality of all transformations. In the following sections, we describe the different transformations that can be legally made in CnC to optimize its performance.

    struct lulesh_context : public context<lulesh_context> {
        // Step Collections
        step_collection<...> step_compute_dt;
        step_collection<...> step_reduce_force;
        ...

        // Item Collections
        // per node items
        item_collection<...> force;
        item_collection<...> position;
        item_collection<...> velocity;
        // per element items
        ...

        // Tag Collections
        tag_collection<...> iteration_node;
        tag_collection<...> iteration_element;
        tag_collection<...> iteration;
        ...

        // Consumer Dependencies
        step_compute_dt.consumes(dt);
        ...
        // Producer Dependencies
        step_compute_dt.produces(dt);
        ...
    };

Figure 3: LULESH CnC Specification

The two techniques we focus on are step fusion and step tiling. Conceptually, the ideas are similar to loop fusion and tiling in classical compilers, except in the context of CnC. One of the drawbacks of the CnC runtime is its inability to efficiently handle fine-grained parallelism due to its dynamic scheduling framework. Both step fusion and tiling help alleviate the effects of excessive synchronization and scheduling costs without modifying the underlying program semantics. Although CnC has built-in tuners, almost none of these tuners alter the organization of the algorithm itself; most tuners operate at a much lower level to benefit the CnC runtime with things like memory usage, garbage collection, serialization/synchronization, and thread management. The closest alternative is the tag-range tuner, but it cannot handle step fusion and other optimizations such as removing redundant get operations from shared dependencies. The effects of our fusion and tiling techniques complement the back-end tuners provided in CnC. In the following sections, we explore the legality of fusion and tiling and how each technique affects performance in the LULESH application.

3.2 Step Fusion
Step fusion is an effective way to serialize multiple steps in a CnC program without altering the computation. The primary restriction on fusion is that steps remain "step-like"—conceptually, once a step receives all of its inputs, it must be able to run to completion.

Given multiple step collections, these steps can be legally fused if and only if both step collections are prescribed using identical tags and all dependencies between the fused steps are computed in previous steps generated by the same tag. Step fusion is illegal for steps prescribed from different tags, or if the resulting fused step would become a coroutine—in other words, if the execution of the step would require interleaving with another step, violating the step-like property.

In certain cases, data dependencies may come from step instances generated by different tags, such as a reduction operation, where a current step needs recently computed data from multiple instances of a previous step. Fusion is not legal between these two steps because we only serialize the computation for step instances of a single tag, not all instances of a previous one. When multiple steps are fused, data dependencies that exist between the fused steps can be expressed as temporary data in the new step computation. The sets of get data consumed by each fused step are unioned to form a new set of dependencies for the fused step. In the collection space, fusion reduces the number of items in the step collections, but the number of data dependencies per step will likely increase.

Figure 4: Fused LULESH Algorithm

We are able to apply step fusion in the CnC-LULESH program to reduce the number of step collection items from 13 to 5. Figure 4 highlights the steps in the algorithm where it is legal to fuse step collections. The leftmost node, Compute Delta Time, requires its own space of tags per iteration due to the delta-time calculation. The other steps are either in the nodal iteration space (red) or the element iteration space (blue/green), each requiring separate tag collections. All the steps corresponding to the node tags, and most of the steps associated with the element tags, can be fused. The two independent partial-force calculations, shown in green, can be fused because they are independent and use the same element tags. They cannot be combined with the steps highlighted in blue, even though they share the same tags, because the force calculations depend on a delta-time calculation that is a global check occurring at the beginning of each iteration and has a separate tag. The spatial computations that are prescribed by node tags can all be fused, reducing the three node steps into a single step, highlighted in red. For the rest of the element computation, fusion can reduce the remaining 6 element routines into 2 fused routines, as shown in blue in Figure 4. During the viscosity step, there is a gather operation that requires data dependencies computed in a prior step using multiple different tags, implying a synchronization barrier and preventing those steps from legal fusion.

Initially, we believed that using the consumer/producer dependencies expressed in the CnC specification would be sufficient to determine the legality of step fusion. However, that by itself is not enough, because there can be cases where dependencies span multiple instances of the same step prescribed from different tags, such as the element viscosity operation. Data items in CnC programs are written and read in an atomic fashion, thus naturally enforcing synchronization between dependencies. Step fusion alters execution semantics by changing what would be a synchronization into a serialization, but only for one tag instance. This is why steps which have dependencies coming from multiple tags cannot be fused.

3.3 Step Tiling
We characterize step tiling as the coalescing of multiple tags to reduce the number of dynamic step instances. Tiling reduces the number of tags, but increases the amount of work performed in each step. Similarly to fusion, step tiling serializes computation, altering the computation and the working-set data. Instead of many dynamic step instances, a single step instance does the same amount of work, reducing scheduling overhead. This large step operates on a much larger working set, and in turn requires more memory. Additionally, step tiling alters the dependence structure. For larger tiles, the number of dependencies increases, and in some cases, neighboring data items may allow for data reuse, thereby reducing the number of get operations required.

The challenges of applying step fusion and tiling come from conforming to CnC coding standards. Blindly serializing multiple computations and iterations into a combined routine will likely modify the data and control flow within that new routine. However, good CnC performance relies on all CnC step collections being step-like, where all get operations occur at the beginning of the step (put operations can still occur freely). If get operations do not occur first, the performance of the scheduler suffers, and in Intel CnC, requeueing activity heavily impacts performance. Additionally, this coalescing increases the size of the working set, which increases memory usage, since there is more temporary data to save between computations.

Step tiling can be further tuned when dependencies are stencil-like, such as in LULESH. In addition to the spatial locality achieved from tiling nearby elements, there is significant reuse of dependencies due to nodal and elemental operations that require neighboring data. A single element operation would normally require 8 get operations.
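That nodal sharing can be checked with a few lines of index arithmetic (a sketch: the `std::set` stands in for CnC's get bookkeeping, and the cubic tile edge `t` is a parameter we introduce for illustration):

```cpp
#include <cassert>
#include <set>
#include <tuple>

// Naive count: every element in a t x t x t tile performs its own 8 corner gets.
long naive_gets(int t) { return 8L * t * t * t; }

// Shared count: gets are deduplicated across the tile, so only the distinct
// corner nodes of the whole tile are fetched -- a (t+1)^3 block of nodes.
long shared_gets(int t) {
    std::set<std::tuple<int, int, int>> nodes;
    for (int i = 0; i < t; ++i)
        for (int j = 0; j < t; ++j)
            for (int k = 0; k < t; ++k)
                // Enumerate the 8 corner nodes of element (i, j, k).
                for (int di = 0; di <= 1; ++di)
                    for (int dj = 0; dj <= 1; ++dj)
                        for (int dk = 0; dk <= 1; ++dk)
                            nodes.insert(std::make_tuple(i + di, j + dj, k + dk));
    return static_cast<long>(nodes.size());
}
```

For a tile edge of 2 (an 8-element tile), this gives 64 naive gets against only 27 distinct corner nodes; in general the counts are 8t³ versus (t+1)³, so the saving grows with the tile size.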
A naively tiled code could perform that 8 times, requiring 64 total gets. However, an optimized tiling scheme for the same 8 steps would only require 27 get operations. We explore this optimization in LULESH, but the effects are marginal once the extra calculation to figure out which data items are shared is weighed against executing redundant get operations. On a shared-memory machine this is the case, but on systems where communication costs far outweigh local resources, removing the extra data movement will be beneficial.

We find that step tiling is legal when there are no dependencies between the different tags which prescribe that step. An example where tiling would be invalid for step collections is Gauss-Seidel, due to producer-consumer inter-dependencies between orthogonal tags for the same step.

There are almost no inter-tag dependencies in LULESH that prevent step tiling from being legal. Each set of tags corresponds to independent nodes and elements of the mesh, and each step clearly defines the producer-consumer relationship between steps. Since CnC handles the scheduling aspect, the challenge is to correctly distribute and initialize the tags at the start of the program. There are also multiple ways to coalesce tags; specifically for LULESH, we look at a blocking scheme and a strided per-row implementation. The block-tiling implementation partitions the problem into equal-sized cubes that are a fraction of the size of the whole mesh, whereas the strided tiling implementation tiles the nodes and elements as long rows, where each row length is equal to the mesh dimension. Figure 5 illustrates the difference between a block tile (red) and a strided tile (blue). When discussing tiling implementations, we will focus on these two tiling schemes.

Figure 5: Blocked vs. Strided Tiling Example

4. RESULTS
In this section, we evaluate the performance of the LULESH application in CnC, with various configurations of the baseline, fused, tiled, and two fused+tiled (blocked and strided) schemes. We compare their execution on our shared-memory system running on up to 48 processors. The following implementations are tested:

Baseline - This version, originally created by researchers at the Pacific Northwest National Laboratory, describes the LULESH application at its most decomposed level, with minimal constraints. Domain scientists would most likely want to express their algorithms this way. It follows CnC's principle of writing the program without expressing how it is parallelized. The burden falls on the CnC runtime, which is given the maximum amount of flexibility to optimize parallel execution. In this version, all tags are prescribed in advance and every stencil element performs every computation, requiring dynamic instances for each. Unfortunately, CnC is unable to effectively schedule stencil programs with that much fine-grained parallelism, causing reduced performance without further tuning.

Fusion-only - This implementation fuses all computation that can be coalesced in each tag space. Step fusion is applied to the tags corresponding to the nodal and element iteration spaces, reducing the step collection size from 13 to 5. This step-count reduction provides a roughly constant 2x speedup over the baseline implementation.

Tiling-only - This version of LULESH tiles the baseline implementation and coarsens the tag space, thereby reducing the number of dynamic step instances created. Every dynamic instance computes over a larger working data set, but also requires more intermediate memory. The size of the working set is a tuneable parameter, which we investigate. Tiling improves overall program scalability and performance by reducing scheduling overhead and improving data locality. Neighboring data elements and nodes are grouped into the same tile, allowing data reuse for computations that access shared data. The items in the data collections are minimally altered to maintain the overall structure of the original program.

Table 1: Timing Results: 60³ Sized Mesh (per-iteration times)

Number of Cores        1        2        4        8       12       32       48
Baseline            53.180   52.900   71.852  108.531  110.515  110.572  119.466
Fusion Only         26.288   26.405   33.339   52.224   51.391   54.340   57.430
Tiling Only         12.676   12.569    6.677    4.234    2.652    2.504    2.526
Blocked Fuse-Tile   10.749   10.699    5.110    2.883    1.845    1.557    1.768
Strided Fuse-Tile   11.171   10.610    6.025    3.498    2.104    1.483    1.397
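The headline speedups can be recomputed directly from the Table 1 entries (the timings are copied from the table; the small helper is ours):

```cpp
#include <cassert>
#include <cmath>

// Per-iteration times from Table 1, 60^3 mesh.
const double baseline_1core = 53.180;  // sequential baseline
const double blocked_48core = 1.768;   // blocked fuse-tile, 48 cores
const double strided_48core = 1.397;   // strided fuse-tile, 48 cores

// Speedup of an optimized configuration over a reference time.
double speedup(double reference, double optimized) {
    return reference / optimized;
}
```

This gives 53.180 / 1.397 ≈ 38.1x for the strided fused-tiled version at 48 cores, consistent with the "up to 38x" figure quoted in the abstract, and roughly 30.1x for the blocked variant.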

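The two coalescing schemes compared in the results reduce the element tag space by very different amounts; a sketch of the tag counts for an n³ element mesh (the counting helpers are ours, not the paper's code):

```cpp
#include <cassert>

// Strided: one tag per row of n elements -> n^2 tags; granularity fixed by n.
long strided_tags(int n) { return static_cast<long>(n) * n; }

// Blocked: one tag per b x b x b cube (b must divide n) -> (n/b)^3 tags;
// granularity tunable via the block edge b.
long blocked_tags(int n, int b) {
    assert(n % b == 0);
    long cubes_per_dim = n / b;
    return cubes_per_dim * cubes_per_dim * cubes_per_dim;
}
```

For the 60³ mesh, strided tiling always yields 3600 row tags per step collection, while blocked tiling with the block edge of 15 used in the experiments yields only 4³ = 64 much coarser tags — consistent with the observation below that blocked tiles do better at low thread counts (fewer, larger tasks) while strided tiling exposes more parallelism at 32-48 threads.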
Fused-Tiled - This implementation combines both the fusion and tiling optimizations. The step collections and tag collections are fused and tiled to create new, larger steps. We also try to exploit the locality of shared data that exists for tags with common neighbors. The transformation requires some coding changes, with bookkeeping to account for extra variables and computation re-ordering to preserve the step-like properties required of every CnC step, where all gets occur at the beginning of a computation step.

We also note that there are specific tuners available in the Intel CnC implementation that would have eased programmability for the fusion and tiling implementations. One of these is the depends tuner, which prevents pre-emptive scheduling of a step collection until all of its variables are ready.

Evaluation
Experiments were run on mesh sizes up to 60 for 30 iterations, ten times per heuristic, with the minimum and maximum results excluded to reduce variance. The hardware is a shared-memory AMD Opteron 6176 SE system configured with four 12-core processors (48 cores total) running at 2.3 GHz, with 512 KB of per-core level-2 cache and 12 MB of level-3 cache. Table 1 shows the per-iteration timing results for a mesh of dimension 60³. In the LULESH manual, the authors recommend experiments with problem sizes around 50-90 along each dimension, and a problem size of 60 provides us with consistent results in that range while allowing for evenly sized block partitions [12].

The baseline implementation performs poorly in CnC. With the baseline program, CnC creates 60³ dynamic step instances for each minimally-constrained step. Such fine granularity requires CnC to perform excessive scheduling and bookkeeping during execution, even causing it to perform worse than the sequential version. Applying step fusion halves the number of steps and results in a 2x speedup; fusion by itself does not impact scalability until tiling is also considered. A much larger benefit can be seen from step tiling: for the tiling-only implementation, we see speedups of over 20x compared to sequential when running on 48 threads. Figure 6 shows the scaling for the three tiling implementations.

Figure 6: Scalability for Tiling Implementations

Step tiling gives a significant speedup compared to fusion, improving with higher thread counts, but it does not scale well past 16 threads. Combining both step fusion and tiling gives the greatest performance increase, up to 37x speedup over the sequential version. It is interesting to see the block tiling method perform better than the strided implementation at lower thread counts. We reason that this is due to the larger tile sizes of the block tiles, which are 15 in each dimension. Larger tiles coarsen the partitioning, reducing the available parallelism, but perform better for smaller thread counts. However, at 32 and 48 threads, strided tiling, which offers more parallelism and better tile parity, scales better than the blocking method.

We further investigate the impact of step tiling and the two different tiling implementations in LULESH. The first method, strided tiling, groups full rows of nodes or elements along one dimension of the mesh into a single block of computation. These long strides scale to all mesh sizes, and the transition between node-based and element-based computations is seamless. Every element tile depends on exactly four neighboring node tiles when executing each iteration. Strided tiling is able to create equivalently sized tiles, although at the cost of being finer-grained, with no way to adjust the granularity.

The block tiling method partitions the element space of the original mesh into equivalently sized subdomains. However, the node space cannot be mapped directly onto the element partitions, and is instead partitioned similarly to strided tiling. In an ideal scenario, all nodal dependencies would fit into a single tile that maps perfectly onto each element tile, so that all dependencies would come from a single tile. However, because of shared neighbors and the offset sizes between nodes and elements, it is extremely difficult to map properly under the CnC framework, which expects dynamic single assignment of data. Alternative solutions add increased complexity to the partitioning setup and/or more redundant computation, so we do not include those results. Even with an imperfect mapping, there is ample parallelism in the application, but we believe that scalability suffers because of it. The parity between node and element tiles in the strided method is shown to give benefits in terms of scalability; after 16 cores, block tiling becomes inferior to strided tiling, most likely for this reason.

While strided row-tiles are exactly the size of the mesh dimension, block tiles have no restrictions on their sizes, provided that the data and corresponding tags are initialized correctly. We investigate 4 different block sizes for a mesh of size 60³, with results shown in Figure 7. When executing with fewer than 8 threads, the behavior is similar to the performance with 8 threads, with the larger block sizes being superior.

5. DISCUSSION
The final fused-tiled implementation had a number of improvements over the baseline code. As stated previously, the number of get and put operations was reduced, as was the number of tags that were prescribed. Additionally, some data calculations were able to be reused, and a few dependencies were serialized, which reduced scheduling synchronization. During the fusion and tiling modifications, we reordered specific read/write operations to best accommodate CnC's functional guidelines, which gave a slight performance boost and reduced requeue events.

It should be noted that two premature implementations were tested which did not perform as well as the final heuristics. The first thoughtlessly fused and/or tiled the step functions and relied on a built-in tuner to handle the dependencies, allowing non-ideal orderings of get and put operations in each function. This Depends tuner prevents speculation, which negates any negative requeuing effect. Although it provided a simple programming solution, the tuner was too restrictive and the resulting program never scaled when running in parallel. The second iteration included the reordering of get and put operations, but did not manage memory and temporary values effectively. With some additional garbage collection and the removal of redundancy between shared neighboring data, we were able to get a 10% speedup over the non-optimized code.

Although the CnC port is based on the original LULESH 2.0 implementation provided by LLNL, its performance, especially that of the sequential code, is not as good. This was one of the original motivations for the work, but the transformations discussed can be generalized to other applications as well. When comparing, we found that the CnC sequential baseline is far worse than its OpenMP alternative. With optimizations, our CnC implementation is around 10x slower than the later versions of the LLNL code. Upon further investigation, we reason that the degradation is primarily caused by cache performance, owing to the lack of cache-specific tuners for our tiled/fused implementation. Additionally, there are greater overheads with accessing in-
However, we can see that after using dividual elements from the data collection during CnC’s dy- more than 8 threads, it becomes more advantageous to use namic execution of steps, compared to OpenMP which op- smaller block sizes, creating more fine-grained parallel units. erates on parallel regions with uniform memory access and Depending on the problem size, this is most likely a tuneable does not dynamically manage dependencies. With under- parameter. lying changes to the item collections and a modified data layout to match the computation tiles, we believe removing the bottlenecks associated with data access and movement will give us performance much closer to that of the LULESH 2.0 code. An execution trace is shown during parallel execution of 8 threads, shown in Figure 8. Each block represents an ac- tive task working on a piece of computation, for the fused- tiled LULESH program, but on a smaller 203 problem. This demonstrates that LULESH is a complex application to han- dle in CnC, especially in parallel. The gaps between the partial force calculations are mostly due to requeued events, which occur in Intel CnC when dependencies are not ready for a specific step collection. The data dependencies before and after the force calculations require a reduction opera- tion, effectively becoming a synchronization barrier. As seen by the trace, the element-based computations afterwards happen without much idle time in between. Finally, we Figure 7: Impact of Block Size note that in between iterations, the delta time computation requires a significant fraction per iteration. This step is a using the information from the high level CnC specification as well as inter-tag dependencies based on get and put in- dices. We believe that this is possible, but more information must be available at the specification level in order to detect when dependencies cross between tag instances. 
Preliminary results using CnC-OCR, a separate CnC implementation, has shown promise in generating a bare-bones CnC program which automatically interfaces every step collection to its respective data collections [8]. This feature requires depen- dencies to be explicitly stated, but variable tile sizes will require underlying data structure changes. Future work in that direction will hopefully produce a process to automate step tiling that will allow fusion and other optimizations.

6. RELATED WORK Parallel Programming is still a difficult task, even after so many decades. Legacy codes are difficult to maintain; meanwhile, no one is certain what our hardware will look like in 10 years. Many researchers are still trying to figure Figure 8: Multi-threaded CnC Execution Trace out the best trade-off between programming portability and performance. Concurrent Collections is just one approach for efficiently program parallel applications. safety check which requires every element to have calculated all values for the previous iteration before advancing to the 6.1 Scientific Applications next one. With CnC, which exploits asynchronous-parallel One well-known approach to programming scientific ap- execution, we achieve between a 70-80% utilization ratio but plications is through the use of domain-specific languages are unable to fully take advantage of maximum parallelism (DSLs). Previously, Karlin et al. [10] had explored the ap- due to implicit barriers caused by the dependence structure plicability of using traditional and emerging parallel pro- in LULESH. gramming models for LULESH. These included Chapel, Charm++, Liszt, and Loci. Their results were mixed, cit- 5.1 Lessons Learned ing that no single model handled all optimizations per- In our study, we focus on a single application, LULESH. fectly, and that only the longstanding frameworks performed From the minimally constrained algorithm written in CnC, competitively due to having more time to mature runtime we applied high level fusion and tiling transformations on the and compiler technologies. However, the newer technologies program by altering the step, data, and tag collections while mostly utilized high-level programming constructs and pro- making negligible changes to the underlying data structures vided easier programmability and better tuning interfaces, and low level code. 
However, the applicability of this step with better potential for programmers to obtain reasonable fusion and tiling technique are not limited to the single ap- performance using significantly less time and effort. plication. These techniques should be applicable to any high CnC would fall into the category of offering superior ease level algorithm using the CnC language and are not domain- of programmability and portability for reasonable perfor- specific. The advantages of using the CnC programming mance. Previous work evaluating the performance of CnC model creates a separation of concerns that allows domain on a Cholesky Factorization yielded very competitive re- scientists to focus on the algorithm, while the tuning expert sults compared to highly tuned MKL kernels [4]. In their can focus on performance concerns such as tiling and fu- study, they acknowledge that performance is impacted by sion. However, the current CnC tuning capabilities do not tile-size and locality, and observe minor performance degra- always yield scalable applications using the CnC language if dation from queuing and scheduling overheads. We surmise the algorithm specifies a level of parallelism that is too fine. that with enough tuning to taking advantage of Intel’s opti- We show this using the LULESH mini-app and exploit the mizations, we could achieve competitive performance rival- fact that there was enough algorithmic complexity to take ing tuned versions of LULESH. advantage of step fusion. Being a 3D stencil-based code, it also benefits from tiling techniques to exploit the available 6.2 Parallel Programming Models parallelism in each iteration. When it comes to partition- CnC goes beyond just that of scientific applications. 
The ing the iteration space of CnC steps, fusion and tiling are core belief in CnC is that any parallel application can be pro- semantically equivalent to combining multiple steps into a grammed using their model, which was designed specifically single instance. It is a very general approach, and the trans- for that cause. The semantics of parallelism is expressed formations do not alter any computations, but only change without explicitly determining any parallel execution, sim- how and when they are executed. plifying the programming process for non-performance ex- Although the information from CnCs high level specifica- perts. The issue that arises is how to optimally express the tion can aid in the transformations, the dependency infor- computation granularity so that it an be tuned efficiently. mation is insufficient to fully automate the process. Parsing Our approach focuses on tiling/fusion techniques and high- the CnC graph does not provide enough data to analyze the level tuning in CnC to coarsen the iteration space and bal- dependence structure between individual step instances. In ance the work/scheduling ratio to aid the runtime. the future, we wish to automate fusion and tiling techniques Another alternative to exploiting parallelism is through the use of polyhedral frameworks. Polyhedral frameworks We would also like to thank Ellen Porter and her work at are very powerful compiler tools for analyzing and trans- the Pacific Northwest National Laboratory for providing us forming loop-based codes. They can reason about separate with the original LULESH program, as well as Kath Knobe, iterations in loops and analyze symbolic expressions, un- Nick Vrvilo, Frank Schlimbach, and Zoran Budimlic, for like more traditional compiler techniques. PLUTO, an au- their comments and feedback during discussions regarding tomatic parallelizer based on the polyhedral model, focuses Concurrent Collections. specifically on addressing tiling and locality concerns [1]. 
We note that the notion of tiling they use is the general term addressed for common loops, and not the tiling we focus on, References in the CnC space. Another similar work that has connec- [1] U. Bondhugula, A. Hartono, J. Ramanujam, and tions to both CnC and polyhedral compilation frameworks P. Sadayappan. Pluto: A practical and fully is Data Flow Graph Representation (DFGR), an intermedi- automatic polyhedral program optimization system. ate graph representation for macro-dataflow programs [16]. In Proceedings of the ACM SIGPLAN 2008 In their work, Sbirlea et al. uses program information simi- Conference on Programming Language Design and lar to the CnC specification to help her framework generate Implementation (PLDI 08), Tucson, AZ (June 2008). scalable parallel programs that run using various libraries. Citeseer, 2008. One of their purposes for relying on the framework is to [2] Z. Budimlić, M. Burke, V. Cavé, K. Knobe, generate well-tiled code, but do not need to tune any tiling G. Lowney, R. Newton, J. Palsberg, D. Peixotto, parameters. V. Sarkar, F. Schlimbach, et al. Concurrent CnC is still a relatively new programming model. There collections. Scientific Programming, 18(3-4):203–217, are two major implementations, with Intel’s version and an- 2010. other version being developed by Rice that runs using their Open Community Runtime (OCR) [8]. Future work will in- [3] M. G. Burke, K. Knobe, R. Newton, and V. Sarkar. clude trying to automate the process of fusion and tiling for Concurrent collections programming model. In programs using either CnC framework. Another idea being Encyclopedia of , pages 364–371. explored is to have a dynamic runtime that can leverage the Springer, 2011. benefits of fusion and tiling on demand to run the original, [4] A. Chandramowlishwaran, K. Knobe, and R. Vuduc. fused, tiled, or both, depending on which implementation Performance evaluation of concurrent collections on will be more advantageous. 
This increases platform flexibil- high-performance multicore computing systems. In ity, potential dynamic adaptation, but will also likely add Parallel & Distributed Processing (IPDPS), 2010 increased complexity along with unknown feasibility. IEEE International Symposium on, pages 1–12. IEEE, 2010. 7. CONCLUSION [5] I. C. Frank Schlimbach. Intel concurrent collections In this paper, we present a transformative method in Con- for c++ for windows and linux, 2015. current Collections to optimize the performance of the LU- [6] M. Frigo, C. E. Leiserson, and K. H. Randall. The LESH mini-app, a hydrodynamics stencil code developed implementation of the cilk-5 multithreaded language. by LLNL. Borrowing from classical compiler concepts, we In ACM Sigplan Notices, volume 33, pages 212–223. apply fusion and tiling onto the fundamentals core compo- ACM, 1998. nents of CnC programs, altering only the collections while [7] D. Gelernter. Multiple tuple spaces in Linda. Springer, making negligible changes to underlying code. This pro- 1989. duced a semantically equivalent algorithm while reducing the granularity of the program. Although Concurrent Col- [8] Habanero-Rice. Concurrent collections on ocr, 2015. lections offers intuitive programming simplifications, it has [9] I. Karlin, A. Bhatele, B. L. Chamberlain, J. Cohen, trouble with over-scheduling programs (like stencils) with Z. Devito, M. Gokhale, R. Haque, R. Hornung, fine-grained parallelism. We start with what a scientist un- J. Keasler, D. Laney, et al. Lulesh programming derstands, beginning with a decomposed algorithm consist- model and performance ports overview. Lawrence ing of minimally constrained computational steps, and are Livermore National Laboratory (LLNL), Livermore, able to transform that program into a well-optimized CnC CA, Tech. Rep, 2012. program that is scalable and architecture-agnostic. [10] I. Karlin, A. Bhatele, J. Keasler, B. L. Chamberlain, To achieve good performance in CnC, the programmer J. 
Cohen, Z. DeVito, R. Haque, D. Laney, E. Luke, has to more than just express the partially ordered set of F. Wang, et al. Exploring traditional and emerging computations. If a CnC program has extremely fine-grained parallel programming models using a proxy parallelism, no amount of built-in performance tuning will application. In Parallel & Distributed Processing prevent severe performance degradation. For most decom- (IPDPS), 2013 IEEE 27th International Symposium posed algorithms for stencil codes similar to LULESH, step on, pages 919–932. IEEE, 2013. fusion and step tiling or an equivalent programming change [11] I. Karlin, J. Keasler, and R. Neely. Lulesh 2.0 updates will be necessary to obtain scalable performance on modern and changes. Livermore, CAAugust, 2013. hardware. [12] I. Karlin, J. McGraw, J. Keasler, and B. Still. Tuning Acknowledgments. the lulesh mini-app for current and future hardware. This research is supported by the Department of Energy un- In Nuclear Explosive Code Development Conference der contract DE-FC02-12ER26104. We would like to thank Proceedings (NECDC12), 2012. the anonymous reviewers for their feedback and suggestions. [13] K. Knobe. Ease of use with concurrent collections (cnc). Hot Topics in Parallelism, 2009. [14] K. Knobe and Z. Budimlic. Compiler optimization of an application-specific runtime. [15] J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. " O’Reilly Media, Inc.", 2007. [16] A. Sbirlea, L.-N. Pouchet, and V. Sarkar. Dfgr an intermediate graph representation for macro-dataflow programs. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2014 Fourth Workshop on, pages 38–45. IEEE, 2014.
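To make the difference between the strided and block partitioning schemes in the evaluation concrete, the tile-index arithmetic can be sketched as follows. This is an illustrative reconstruction in Python, with hypothetical names, not code from the paper; it shows only the element-space mapping and ignores the separate node-space partitioning discussed in the text.

```python
# Illustrative sketch of the two partitioning schemes for an N^3
# element mesh: strided row-tiles versus cubic block tiles.

N = 60  # mesh dimension used in the experiments
B = 15  # block-tile edge length (the block tiles are 15 per dimension)

def strided_tile_id(x, y, z, n=N):
    """Strided tiling: a full row of n elements along the x-dimension
    forms one tile, so the tile id depends only on (y, z)."""
    return z * n + y

def block_tile_id(x, y, z, n=N, b=B):
    """Block tiling: the element space is cut into (n/b)^3 cubic
    subdomains of b^3 elements each."""
    t = n // b  # tiles per dimension
    return (z // b) * t * t + (y // b) * t + (x // b)

def tile_counts(n=N, b=B):
    """Strided tiling fixes the granularity at n^2 tiles of n elements
    each; block tiling exposes b as a tunable coarsening parameter."""
    return n * n, (n // b) ** 3
```

The contrast mirrors the scaling results: for N = 60 and b = 15, strided tiling yields 3600 small tiles (more parallelism, fixed granularity), while block tiling yields 64 large tiles unless the hypothetical parameter `b` is shrunk, matching the block-size trade-off shown in Figure 7.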
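The step-fusion transformation described in the implementation and discussion sections can also be sketched abstractly. The following Python sketch is hypothetical: the dict-backed `get`/`put` helpers stand in for CnC item collections and are not the real Intel CnC API. It shows two steps prescribed by the same tag merged into one fused step whose gets are hoisted to the top, halving the number of dynamic step instances while preserving the items produced.

```python
# Hypothetical model of CnC-style steps over a shared item store.

def make_io(items):
    """Build get/put closures over a dict standing in for item
    collections, enforcing dynamic single assignment on put."""
    def get(key):
        return items[key]           # in real CnC, a missing get requeues
    def put(key, value):
        assert key not in items     # dynamic single assignment
        items[key] = value
    return get, put

def step_force(get, put, t):
    pos = get(("pos", t))
    put(("force", t), 2.0 * pos)

def step_accel(get, put, t):
    f = get(("force", t))
    put(("accel", t), f / 4.0)

def fused_step(get, put, t):
    # All gets occur at the beginning of the fused step, preserving
    # the step-like property; the intermediate value is reused
    # directly instead of round-tripping through the item store.
    pos = get(("pos", t))
    f = 2.0 * pos
    put(("force", t), f)            # still published for other consumers
    put(("accel", t), f / 4.0)
```

Running `fused_step` produces exactly the same items as running `step_force` followed by `step_accel`, which is the sense in which the paper's transformation is semantically equivalent.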
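The requeue behavior discussed around the execution trace, and the effect of a depends-style tuner that holds a step back until its inputs exist, can be illustrated with a toy scheduling model. This is a sketch of the idea only, not the Intel CnC runtime; all names are invented for illustration.

```python
from collections import deque

def schedule(steps, arrival_order, with_depends_tuner):
    """Toy model of requeue events. `steps` maps a step name to the
    set of items it gets; items become available one at a time in
    `arrival_order`. Returns the number of requeue events."""
    available = set()
    queue = deque(steps)  # every step instance is prescribed up front
    requeues = 0
    for item in arrival_order:
        available.add(item)
        for _ in range(len(queue)):
            s = queue.popleft()
            if steps[s] <= available:
                continue            # all gets succeed; the step retires
            if with_depends_tuner:
                queue.append(s)     # held back, never speculatively run
            else:
                requeues += 1       # a get failed mid-step: requeue it
                queue.append(s)
    return requeues
```

For a step needing three items that arrive one per tick, the speculative scheduler requeues it twice before it finally runs, while the depends-tuner variant reports zero requeues, which is why the tuner removes requeue noise but, as the discussion notes, can also over-serialize execution.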