SUBMISSION TO IEEE TRANS. ON COMPUTERS

Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware

Nadesh Ramanathan, John Wickerson, Member, IEEE, and George A. Constantinides, Senior Member, IEEE

Abstract: Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations (‘atomics’), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This article explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS).

We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. In addition, we show that we can support the pipelining of loops containing atomics by injecting further inter-iteration constraints. We implement our approach on two constraint-based scheduling HLS tools: LegUp 4.0 and LegUp 5.1. We extend both tools to support two memory models that are capable of synthesising atomics correctly. The first memory model only supports sequentially consistent (SC) atomics and the second supports weakly consistent (‘weak’) atomics as defined by the 2011 revision of the C standard. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many multi-threaded algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics in accordance with the C standard. A case study on a circular buffer suggests that on average circuits synthesised from programs that schedule atomics correctly can be 6x faster than an existing lock-based implementation of atomics, that weak atomics can yield a further 1.3x speedup, and that pipelining can yield a further 1.3x speedup.

Index Terms: High-Level Synthesis, HLS, Lock-Free Algorithms, Atomic Operations, FPGA.

I. INTRODUCTION

In his comprehensive empirical study, Gramoli [1] demonstrates that, when writing multi-threaded programs for conventional multi-processors, the most efficient way to synchronise threads is to use fine-grained atomic operations (‘atomics’) – as opposed to, for instance, coarse-grained mutual exclusion based on locks. In this article, we explore how lock-free programs can be compiled from C to reconfigurable hardware via high-level synthesis (HLS), and the performance benefits of doing so.

We focus on the scheduling stage of synthesis, in which software instructions are assigned to hardware clock cycles. Typical HLS schedulers seek to maximise instruction-level parallelism by allowing independent instructions to be executed out-of-order or simultaneously. In particular, non-aliasing memory accesses, or those that exhibit only read-after-read dependencies (e.g. x=z; y=z), can be reordered. These reorderings are invisible in a single-threaded context, but in a multi-threaded context, they can introduce unexpected behaviours. For instance, if another thread is simultaneously writing to z, then reordering two instructions above may introduce the behaviour where x is assigned the latest value but y gets an old one.¹

¹ Throughout this article, we use thread to refer both to software threads and to the hardware modules synthesised from them.

The implication of this is not that existing HLS tools are wrong; these optimisations can only introduce new behaviours when the code already exhibits a race condition, and races are deemed a programming error in C [2, §5.1.2.4]. Rather, the implication is that if these memory accesses are upgraded to become atomic (and hence allowed to race), then existing scheduling constraints are insufficient.

One approach for implementing atomics correctly is to enclose each atomic operation in its own critical region, and ensure that the surrounding lock() and unlock() calls cannot be reordered. We show that this approach scales poorly and inhibits loop pipelining. Instead, we frame the implementation of atomics as a scheduling problem: we treat atomic accesses as regular memory accesses but impose additional intra-thread dependencies when devising a schedule for each thread.

By default, C atomics enforce sequential consistency (SC), which means that all threads maintain a completely consistent view of shared memory, and memory accesses always occur in the order specified by the programmer [3]. Though simple for programmers to understand, SC is an expensive guarantee for language implementations to meet in the presence of optimisations by compilers (such as constant propagation, which can disrupt the order of memory accesses) and by architectures (such as store buffering, which can delay the propagation of writes to other threads).

In fact, many multi-threaded algorithms do not need all threads to share a completely consistent view of shared memory, and hence can tolerate weakly consistent atomics, which do not provide this guarantee in general. These ‘weak atomics’ include the acquire/release and relaxed atomics provided by the 2011 revision of the C standard (‘C11’) [2, §7.17.3], and later incorporated into OpenCL [4, §3.3.4]. The exact guarantees provided by these operations are specified by each language’s memory consistency model; the rough idea is that while SC forbids all reorderings, acquire loads cannot be executed later, release stores cannot be executed earlier, and relaxed accesses can be moved freely. We show that C11’s acquire/release and relaxed consistency can be implemented using fewer dependencies than SC, and hence offer the potential for more efficient scheduling. We also show how we can enable loop pipelining – an optimisation that is inhibited in the presence of locks but becomes available in our lock-free setting – by selectively imposing constraints between the memory operations in successive iterations of a loop.

Unfortunately, weak atomics are notoriously hard to implement correctly. A failure to anticipate their complex and counterintuitive behaviours has been the root cause of bugs in compilers [5], language specifications [6], and vendor-endorsed programming guides [7]. To build confidence that our work implements C11 atomics correctly, we use the Alloy model checker [8], first to debug our implementation during development, and then to verify automatically that any C11 program (with a bounded number of memory accesses) will be synthesised correctly.

We implement our approach on two versions of the LegUp HLS framework [9]. We treat these two versions as separate tools for memory-related optimisations, as discussed in §V-B. We evaluate our approach in the context of both these tools via a case study: an application in which threads communicate via lock-free circular buffers. On average, we show that using SC atomics yields a 6x speedup compared to lock-based implementation of atomics, that switching from SC atomics to weak atomics (where safe to do so) yields a further 1.3x speedup, and that enabling loop pipelining of weak atomics can yield a further 1.3x speedup.

In summary,

• we show that traditional HLS schedulers cannot (in general) synthesise multi-threaded algorithms without relying on locks, because some instruction reorderings permitted by standard dependence-based schedulers that only consider aliasing memory dependencies can introduce erroneous behaviours, and we illustrate this using the open-source

II. BACKGROUND

This section summarises existing HLS support for multi-threaded programming (§II-A), explains how HLS tools perform scheduling (§II-B), and introduces the C11 memory consistency model (§II-C).

A. High-level synthesis for multi-threaded programs

Several HLS tools only accept sequential input, deriving parallelisation opportunities either automatically (e.g. ROCCC [12]) or with the aid of synthesis directives (e.g. Vivado HLS [13]). Other tools accept multi-threaded input but only allow threads to synchronise via locks (e.g. LegUp [9] and Kiwi [14]) or via execution barriers (e.g. SDAccel [15]). Some HLS tools also support the OpenMP programming standard, which defines an atomic directive that enables lock-free programming. Leow et al. [16] transform OpenMP to Handel-C for hardware synthesis and Cilardo et al. [17] generate heterogeneous hardware/software systems with OpenMP. Neither of these works support the explicit multi-threading constructs defined by the Pthreads standard, so a direct comparison with the present work is difficult. Altera’s SDK for OpenCL [18] supports lock-free programming via SC atomics [19], though the commercial nature of the tool makes it difficult to ascertain exactly how these operations are implemented. LEAP facilitates parallel memory access through its provision of memory hierarchies that potentially can be shared among Pthreads in a lock-free manner [20].

The most important point of comparison between the tools reviewed above and the present work is that this is the first to synthesise hardware from software that features weak atomics (as defined by C11 [2] and OpenCL 2.x [4]). Efficient implementations of weak atomics have been extensively studied in the conventional processor domain, with one study suggesting that they can yield average whole-program speedups
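
As a rough illustration of the acquire/release idea summarised in the introduction (accesses after an acquire load must not move before it, and accesses before a release store must not move after it), the following C11 sketch shows a single-producer, single-consumer circular buffer. It is not the buffer used in the case study, whose code does not appear in this excerpt. The names `ring_t`, `ring_init`, `ring_push`, `ring_pop` and the capacity `RING_CAPACITY` are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 8 /* hypothetical size; one slot is always left empty */

typedef struct {
    int slots[RING_CAPACITY];
    atomic_size_t head; /* next slot to read; written only by the consumer */
    atomic_size_t tail; /* next slot to write; written only by the producer */
} ring_t;

void ring_init(ring_t *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the buffer is full. */
bool ring_push(ring_t *r, int value) {
    /* Relaxed: only the producer ever writes tail. */
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t next = (tail + 1) % RING_CAPACITY;
    /* Acquire: the slot write below must not move before this load. */
    if (next == atomic_load_explicit(&r->head, memory_order_acquire))
        return false;
    r->slots[tail] = value;
    /* Release: the slot write above must not move after this store. */
    atomic_store_explicit(&r->tail, next, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the buffer is empty. */
bool ring_pop(ring_t *r, int *out) {
    /* Relaxed: only the consumer ever writes head. */
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire: the slot read below must not move before this load. */
    if (head == atomic_load_explicit(&r->tail, memory_order_acquire))
        return false;
    *out = r->slots[head];
    /* Release: the slot read above must not move after this store. */
    atomic_store_explicit(&r->head, (head + 1) % RING_CAPACITY,
                          memory_order_release);
    return true;
}
```

In this sketch, replacing every `memory_order_*` argument with `memory_order_seq_cst` would give the SC variant, which orders each atomic access against all others in the thread; the acquire/release version keeps only the orderings that the producer/consumer hand-off relies on.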