Chronos: Efficient Speculative Parallelism for Accelerators

Maleen Abeydeera ([email protected])
Daniel Sanchez ([email protected])
Massachusetts Institute of Technology

Abstract

We present Chronos, a framework to build accelerators for applications with speculative parallelism. These applications consist of atomic tasks, sometimes with order constraints, and need speculative execution to extract parallelism. Prior work extended conventional multicores to support speculative parallelism, but these prior architectures are a poor match for accelerators because they rely on cache coherence and add non-trivial hardware to detect conflicts among tasks.

Chronos instead relies on a novel execution model, Spatially Located Ordered Tasks (SLOT), that uses order as the only synchronization mechanism and limits task accesses to a single read-write object. This simplification avoids the need for cache coherence and makes speculative execution cheap and distributed. Chronos abstracts the complexities of speculative parallelism, making accelerator design easy.

We develop an FPGA implementation of Chronos and use it to build accelerators for four challenging applications. When run on commodity AWS FPGA instances, these accelerators outperform state-of-the-art software versions running on a higher-priced multicore instance by 3.5× to 15.3×.

CCS Concepts • Computer systems organization → Multicore architectures.

Keywords speculative parallelism; fine-grain parallelism; accelerators; specialization; FPGA

ACM Reference Format:
Maleen Abeydeera and Daniel Sanchez. 2020. Chronos: Efficient Speculative Parallelism for Accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20), March 16–20, 2020, Lausanne, Switzerland. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3373376.3378454

ASPLOS '20, March 16–20, 2020, Lausanne, Switzerland
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-7102-5/20/03.
https://doi.org/10.1145/3373376.3378454

1 Introduction

The impending end of Moore's Law is forcing architectures to rely on application- or domain-specific accelerators to improve performance. Accelerators require large amounts of parallelism. Consequently, prior accelerators have focused on domains where parallelism is easy to exploit, such as deep learning [12, 13, 37], and rely on conventional parallelization techniques, such as data-parallel or dataflow execution [48]. However, many applications do not have such easy-to-extract parallelism, and have remained off-limits to accelerators.

In this paper, we focus on building accelerators for applications that need speculative execution to extract parallelism. These applications consist of tasks that are created dynamically and operate on shared data, and where operations on shared data must happen in a certain order for execution to be correct. Order constraints may arise from the need to preserve atomicity (e.g., operations across tasks must be ordered to not interleave with each other), or from the need to order tasks due to application semantics (e.g., tasks dequeued from a priority queue). Enforcing these order constraints a priori, before running each task, is often too costly and/or limits parallelism. Thus, it is preferable to run tasks speculatively and check that they followed a correct order a posteriori.

For instance, consider discrete event simulation, which has wide applicability in simulating digital circuits, networked systems, and physical processes. Discrete event simulation consists of dynamically created tasks that may operate on the same simulated object and must run in the correct simulated order. Running these tasks non-speculatively requires excessive synchronization and limits parallelism [10, 28]. Running tasks speculatively is far more profitable [32, 34].

To make speculation efficient, prior work has proposed hardware support for speculation, including Thread-Level Speculation [21, 34, 53, 55, 57] and Hardware Transactional Memory [1, 6, 9, 20, 26, 29, 30, 46]. Unfortunately, prior speculative architectures are hard to apply to accelerators, because they all rely on coherent cache hierarchies to perform speculative execution, modifying the coherence protocol to detect conflicts among tasks. This is a natural match for multicores, which already have a coherence protocol. But such a solution would be onerous and complex for an accelerator: it would require implementing coherent caches and speculation-tracking structures that, while a minor overhead for general-purpose cores, are too expensive for small, specialized ones.

To address this challenge, in this paper we present a hardware system that implements speculative execution without using coherence. Instead, this system follows a data-centric approach, where shared data is mapped across the system; work is divided into small tasks that access at most one shared object each; and tasks are always sent to run at the place where their data is mapped. To enforce atomicity across task groups, or other order constraints, tasks are ordered through timestamps (these are program-specified logical timestamps completely decoupled from physical time).

We formalize these semantics through the Spatially Located Ordered Tasks (SLOT) execution model. In SLOT, all work happens through tasks that are ordered using timestamps. A task may create children tasks ordered after them, and parent tasks communicate input values to children directly. Each task must operate on a single read-write object, which must be declared when the task is created (besides this restriction, tasks may access an arbitrary amount of read-only data).

We leverage SLOT to implement Chronos, a novel acceleration framework for speculative algorithms. Each Chronos instance consists of spatially distributed tiles. Each tile has multiple processing elements (PEs) that execute tasks, and a local cache. Each tile also implements hardware to queue tasks, dispatch them to PEs, track their speculative state, and abort or commit them in timestamp order. Chronos maps read-write objects across tiles, and sends each newly created task to the tile where its read-write object is mapped. This enables completely distributed operation without a cache coherence protocol.

Chronos provides a common framework to accelerate speculative algorithms, abstracting away the complexities of task management and speculative execution. Developers need only express their application as SLOT tasks coded against a high-level API. To achieve high performance, Chronos supports two types of customization. First, applications can customize the PEs, which can be specified in RTL or described using High-Level Synthesis (HLS). PEs can also be general-purpose cores, so developers can start with a software implementation and specialize tasks as needed to achieve high performance. Second, Chronos lets applications turn off unneeded features. For example, if the algorithm is naturally resilient to out-of-order writes (e.g., if updates are monotonic), applications can disable rollback on misspeculation.

We evaluate Chronos by implementing it on an FPGA and use it to implement accelerators for several graph analytics and simulation applications. We use four hard-to-parallelize applications with speculative parallelism. We deploy these accelerators on commodity AWS FPGA instances. We compare these accelerators with state-of-the-art software implementations of these applications running on a higher-priced 40-thread multicore instance. Chronos achieves speedups of up to 15.3× and gmean 5.4× over the software versions. Chronos outperforms the multicore baseline despite running at a 19× slower frequency, because it exploits orders of magnitude more parallelism. These results show that FPGAs are a practical and cost-effective way to accelerate applications with speculative parallelism.

In summary, this paper contributes:
• SLOT, the first execution model that supports speculative parallelism without cache coherence (Sec. 3).
• Chronos, a customizable framework that implements the SLOT execution model and makes it easy to accelerate applications with speculative parallelism (Sec. 4).
• A detailed evaluation of Chronos using commodity FPGAs in the cloud that demonstrates significant speedups for several challenging applications, analyzes system efficiency, and quantifies the benefits of customization (Sec. 6).

Our Chronos implementation is open-source and available at https://chronos-arch.csail.mit.edu.

2 Motivation and Background

In this section we first present a case for speculative parallelism through a simple application, discrete event simulation (des). We then review the types of parallelism exploited by prior accelerators, and see that most do not exploit speculative parallelism. Finally, we review prior speculative architectures, and use des to identify a simplification that these architectures have missed: support for task order avoids the need for coherence-based conflict detection, motivating SLOT.

2.1 A case for speculative parallelism

We illustrate the utility of speculative parallelism through des, a discrete event simulator for digital circuits [28]. Listing 1 shows code for a sequential implementation of des. Each des task processes a gate input toggling at a particular time. If this input toggle causes the gate's output to toggle, the task enqueues events for all inputs connected to that output at the appropriate time. The sequential implementation processes one task at a time in simulated time order, and maintains the set of tasks to process in a priority queue.

    PrioQueue eventQueue;

    void simToggle(Time time, GateInput input) {
        Gate gate = input.gate;
        bool outToggled = gate.simulateToggle(input);
        if (outToggled) {
            // Toggle all inputs connected to this gate
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                eventQueue.enqueue(nextTime, i);
            }
        }
    }

    ... // Enqueue initial events (input waveforms)
    // Main loop
    while (!eventQueue.empty()) {
        (time, input) = eventQueue.dequeue();
        simToggle(time, input);
    }

Listing 1. Sequential implementation of des.

Fig. 1a shows a circuit with input waveforms and propagation delays, and Fig. 1b shows the task diagram of an execution of des on this circuit. Arrows between tasks show
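The event-driven core of Listing 1 can be distilled into a short runnable sketch (Python for brevity; the two-gate fanout, the per-input delays, and the assumption that every input toggle propagates to the output are illustrative, not part of the original listing):

```python
import heapq

def simulate(initial_events, delay, fanout):
    """Sequential des kernel: pop the earliest event, simulate the toggle,
    and enqueue downstream toggles after each gate's propagation delay."""
    event_queue = list(initial_events)   # (time, gate_input) pairs
    heapq.heapify(event_queue)
    processed = []
    while event_queue:
        time, inp = heapq.heappop(event_queue)
        processed.append((time, inp))
        # Assume the output toggles on every input toggle (outToggled == true),
        # so every event fans out to the connected inputs.
        for nxt in fanout[inp]:
            heapq.heappush(event_queue, (time + delay[nxt], nxt))
    return processed

# A made-up two-gate chain: input "a" drives gate input "b", which drives "c".
fanout = {"a": ["b"], "b": ["c"], "c": []}
delay = {"b": 2, "c": 3}
events = simulate([(0, "a")], delay, fanout)
# Tasks are processed one at a time, in simulated-time order.
```

The priority queue is exactly the serialization bottleneck the paper targets: every task must wait its turn on `eventQueue`, even when tasks touch independent gates.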


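The data-centric organization described above (read-write objects mapped across tiles, each task sent to its object's home tile, tasks committed in timestamp order) can be sketched non-speculatively in a few lines of Python. The modulo mapping, the four-object relaxation chain, and all names are illustrative assumptions, and the speculation/rollback machinery is omitted:

```python
import heapq
from itertools import count

N_TILES = 2

def home_tile(obj_id):
    # Each read-write object has a fixed home tile; a simple modulo
    # mapping stands in for Chronos's actual mapping (an assumption).
    return obj_id % N_TILES

tiles = [{"queue": [], "data": {}} for _ in range(N_TILES)]
seq = count()  # tie-breaker for tasks with equal timestamps

def enqueue(ts, obj_id, value):
    # A new task is sent to the tile that owns its single read-write object.
    heapq.heappush(tiles[home_tile(obj_id)]["queue"],
                   (ts, next(seq), obj_id, value))

def run():
    order = []
    while any(t["queue"] for t in tiles):
        # Dispatch the globally earliest task: tasks commit in timestamp order.
        tile = min((t for t in tiles if t["queue"]),
                   key=lambda t: t["queue"][0][:2])
        ts, _, obj_id, value = heapq.heappop(tile["queue"])
        data = tile["data"]
        # Task body: keep the minimum value seen (an sssp-style relaxation).
        if value < data.get(obj_id, float("inf")):
            data[obj_id] = value
            # Children are ordered after their parent via larger timestamps.
            if obj_id + 1 < 4:
                enqueue(ts + 1, obj_id + 1, value + 1)
        order.append((ts, obj_id))
    return order

enqueue(0, 0, 0)
order = run()
```

Because each task touches only the objects homed at its tile, no cross-tile coherence traffic is ever needed; the real system additionally runs tasks speculatively out of timestamp order and rolls back misspeculations.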
while they use priority scheduling to reduce useless work.

Compared to the 125 MHz FPGA, a 2 GHz ASIC achieves a 16× performance improvement. We throttle DDR memory bandwidth with frequency, since these applications are not bandwidth-bound (the FPGA prototype has about 50 GB/s of memory bandwidth); throttling bandwidth by 1/16th would not change performance.

6.4 Analysis of implementation costs

Lines of code: Chronos makes it simple to design custom accelerators to extract speculative parallelism. The framework components take over 20000 lines of SystemVerilog. By contrast, each application is much simpler: sssp and astar take just about 100 lines, color is around 300 lines, and maxflow is around 600 lines.

FPGA utilization: Table 4 shows the FPGA resource consumption of each framework component and PE. Overall, we observe that, while the framework components consume substantial resources, they are comparable to those of the PEs, which are very simple.

Table 4. Per-tile FPGA resource consumption (LUTs (K), FFs (K), BRAM, URAM) for each of the framework components (TQ, CQ, Cache, TSB) and the application-specific PEs (des, sssp, astar, color, maxflow), along with the available resources.

7 Additional Related Work

Transactional memory on accelerators: Prior work has demonstrated HTM systems on FPGAs [8]. However, they do not target application acceleration using FPGAs, and instead focus on implementing a prototype with soft cores, where conflict detection is achieved by augmenting a coherence protocol. Unfortunately, for high-throughput FPGA accelerators, the overheads of a coherence protocol are not desirable. Kilo TM [19] proposes to implement HTM on GPUs without using cache coherence. Instead, it uses value-based conflict detection, relying on a post-completion validation phase where read values are re-read to detect conflicts. This technique is expensive (e.g., requiring logging of read values) and is restricted to lazy version management, which makes it hard to support speculative forwarding, a key feature for Chronos. Further, they do not support strict order constraints among tasks, only unordered transactions. Ma et al. is the only system that targets FPGA acceleration using TM. However, they do not use an on-chip cache, and hence suffer from reduced performance.

Accelerators for graph algorithms: Prior work has also proposed accelerators for graph algorithms, both for FPGAs [14, 40, 51] and ASICs [23, 49]. However, none of them support strict task ordering, and as a result resort to less work-efficient algorithms like Bellman-Ford for sssp.

Simulation accelerators: Numerous other work in parallel discrete event simulation has proposed accelerators for different aspects of the Time Warp protocol. The Rollback chip [17] accelerates the speculative versioning and rollback process, but leaves other aspects, such as conflict detection, to software. Rahman et al. implement a discrete event simulation accelerator on an FPGA. However, it uses a centralized event queue that saturates at around 0.15 events per cycle, 50× lower task throughput than a 16-tile Chronos system. This shows why Chronos's distributed, high-throughput design approach is crucial. Moreover, Rahman et al. evaluated their design using a microbenchmark with long tasks, and do not accelerate actual applications. Hence, they do not consider subtle issues that arise when doing so, such as dealing with limited on-chip queue capacity.

FPGAs have also been used to accelerate architectural simulation. RAMP [60, 61] simulates multicore systems, and FireSim [38] simulates large, scale-out clusters. These systems use non-speculative CMB-style simulation, which may limit parallelism, and could benefit from Chronos's techniques.

8 Conclusion

We have presented Chronos, the first framework to build accelerators for applications with ordered speculative parallelism. Chronos makes speculative execution cheap by relying on SLOT, a new execution model that limits tasks to access a single read-write object, avoiding the need for cache coherence.

We implement Chronos on an FPGA and use it to accelerate several challenging applications in graph analytics and simulation. We deploy these accelerators on commodity AWS FPGAs, where we demonstrate 5.4× gmean speedup for the same applications over their software-parallel versions.

Acknowledgments

We sincerely thank Mark Jeffrey, Victor Ying, Joel Emer, Po-An Tsai, Anurag Mukkara, Guowei Zhang, Quan Nguyen, Hyun Ryong Lee, Keiko Yamaguchi, Shuichi Konami, and the anonymous reviewers for their helpful feedback. This work was supported in part by NSF grants CAREER-1452994 and SHF-1814969, NSF/SRC grant E2CDA-1640012, and by a Sony research grant.

A Artifact Appendix

A.1 Abstract
Our artifact consists of the source code for the Chronos FPGA acceleration framework; pre-compiled FPGA images for our evaluated configurations (to facilitate a quick evaluation); and scripts to set up the development environment, compile the images from source code, run the experiments in the paper, and regenerate the graphs.
This appendix describes how to use Chronos to reproduce the paper's results, and explains how to set up and run other Chronos configurations and experiments. All experiments are run on the Amazon AWS f1.2xlarge instance, configured using the Amazon-provided FPGA Developer AMI.

A.2 Artifact check-list (meta-information)
• Compilation: Xilinx Vivado, GNU RISC-V embedded GCC compiler.
• Run-time environment: Amazon AWS FPGA instance.
• Hardware: Xilinx UltraScale VU9P.
• How much disk space required (approximately)?: 2 GB.
• How much time is needed to prepare workflow (approximately)?: Approx. 1 hour.
• How much time is needed to complete experiments (approximately)?: 2 weeks to reproduce the full results from scratch, or 2 hours if using the precompiled images. The tutorials (Sec. A.7) take about 2 days each, or 2 hours if using precompiled images.
• Publicly available?: Yes.
• Code licenses (if publicly available)?: GPL v2.
• Archived (provide DOI)?: 10.5281/zenodo.3558760

A.3 Description
A.3.1 How delivered. Our artifact can be downloaded from https://doi.org/10.5281/zenodo.3558760 as a .zip file.
A.3.2 Hardware dependencies. Chronos is designed to run on an Amazon AWS f1.2xlarge instance configured with the Amazon FPGA Developer AMI.
A.3.3 Software dependencies. The main dependency is Xilinx Vivado 2018.2, which comes with the FPGA Developer AMI. The RISC-V Chronos variant relies on the GNU RISC-V embedded GCC compiler.
A.3.4 Data sets. For small, testing runs, we include scripts to generate synthetic datasets. The experiments in the paper use large, publicly available datasets from other projects. Since these datasets are large and publicly available, they are not included directly in the artifact code. Instead, the artifact includes scripts to download them. These datasets are also archived, with the DOI 10.5281/zenodo.3563178.

A.4 Installation
1. Launch an AWS f1.2xlarge instance using the Amazon FPGA Developer AMI. Log into the instance.
2. Extract the Chronos artifact .zip file, and navigate to its base directory.
3. Run source install.sh. This will clone the Amazon FPGA SDK repository and install the necessary drivers.
4. Run aws configure to set up the instance with your AWS credentials.
5. (Optional) Install the GNU RISC-V embedded GCC compiler within the instance (https://xpack.github.io/riscv-none-embed-gcc/). This step is optional because the distribution already includes the pre-compiled RISC-V binaries necessary for the workflow.

A.5 Experiment workflow
We provide an automated workflow to validate the main results in the paper from scratch. Note that this process involves synthesizing multiple Chronos instances for each application, a process that takes about two weeks to complete.
To facilitate a quick evaluation, we also provide precompiled FPGA images of the Chronos instances; when using these images, reproducing the results takes about two hours.
The cl_chronos/validation/scripts/ directory contains the necessary scripts to validate the results from the paper. The full process is explained in comments in the master script run_validation.py.
To run all experiments from scratch, run:
    python run_validation.py
To run all experiments with precompiled images, run:
    python run_validation.py --precompiled
This will download a list of precompiled image IDs from a shared S3 bucket and run the rest of the workflow.
Sec. A.7 includes two smaller tutorials using Chronos, which can be completed in about 2 hours.

A.6 Evaluation and expected result
Running run_validation.py would generate all evaluation plots (Figures 10-14).

A.7 Experiment customization
This section provides two smaller tutorials on using Chronos. First, we illustrate the SLOT programming model using a sample application running on a Chronos instance with RISC-V soft cores. Second, we describe how to generate Chronos instances with specialized cores.
Before starting either tutorial, run source aws_setup.sh to configure the necessary environment variables and to define the $CL_DIR environment variable to point to the cl_chronos subdirectory. Please see the README.txt here for more detailed information, including topics not covered in this workflow, such as how to simulate Chronos RTL and how to debug with Chronos.

(The commands below follow application is to break the application down into SLOT tasks the standard instructions on how to generate a runnable (single-object tasks ordered using timestamps). Initially, these FPGA image from the placed-and-routed design, at https: tasks can be expressed as software functions and run on a //github.com/aws/aws-fpga/blob/master/hdk/README. Chronos instance with RISC-V cores. md#step3.) Once the SLOT implementation is verified, a specialized First, copy the design file to a location in Amazon S3: core can be designed for each task. Please refer to the script aws s3 cp $CL_DIR/build/checkpoints/to_aws/ $CL_DIR/design/scripts/gen_cores.py on how to integrate .Developer_CL.tar .tar new specialized cores into the Chronos workflow. Then, create the FPGA image aws ec2 create-fpga-image --name --input-storage-location Bucket=, References Key= --logs-storage-location [1] C. Scott Ananian, Krste Asanović, Bradley C. Kuszmaul, Charles E. Leis- Bucket=, Key= erson, and Sean Lie. 2005. Unbounded transactional memory. In Proc. of the 11th IEEE intl. symp. on High Performance Computer Architecture Running this command generates an AGFI-ID that can (HPCA-11). be used to load the image into the FPGA. [2] Richard J. Anderson and João C. Setubal. 1992. On the parallel imple- mentation of Goldberg’s maximum flow algorithm. In Proc. of the 4th Step 3: Compile sssp RISC-V code. ACM Symp. on Parallelism in Algorithms and Architectures (SPAA). This step requires the RISC-V embedded GCC compiler. [3] AWS FPGA Hardware and Software Development Kit. 2017. https: You can skip this step by using the precompiled binaries from //github.com/aws/aws-fpga. [4] Ranjita Bhagwan and Bill Lin. 2000. Fast and scalable priority queue ar- $CL_DIR/riscv-code/binaries in the next step. chitecture for high-speed network switches. In Proc. of the IEEE Infocom To build sssp from source, run: 2000. cd $CL_DIR/riscv-code/sssp [5] Guy E. Blelloch, Jonathan C. 
Hardwick, Siddhartha Chatterjee, Jay make Sipelstein, and Marco Zagha. 1993. Implementation of a portable nested data-parallel language. In Proc. of the ACM SIGPLAN Symp. on Principles Step 4: Run sssp on the FPGA. and Practice of Parallel Programming (PPoPP). [6] Jayaram Bobba, Kevin E. Moore, Haris Volos, Luke Yen, Mark D. Hill, First load the generated image into the FPGA (This com- Michael M. Swift, and David A. Wood. 2007. Performance pathologies mand may have to be run twice the first time it is loaded). in hardware transactional memory. In Proc. of the 34th annual Intl. sudo fpga-load-local-image -S 0 -I Symp. on Computer Architecture (ISCA-34).

[7] Christopher D. Carothers, David Bauer, and Shawn Pearce. 2000. ROSS: A High-performance, Low Memory, Modular Time Warp System. In Proc. of the 14th Workshop on Parallel and Distributed Simulation (PADS).
[8] Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan G. Bronson, Christos Kozyrakis, and Kunle Olukotun. 2011. Hardware Acceleration of Transactional Memory on Commodity Systems. In Proc. of the 16th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVI).
[9] Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. 2007. A scalable, non-blocking approach to transactional memory. In Proc. of the 13th IEEE intl. symp. on High Performance Computer Architecture (HPCA-13).
[10] K. Mani Chandy and Jayadev Misra. 1981. Asynchronous distributed simulation via a sequence of parallel computations. Commun. ACM 24, 4 (1981).
[11] Tao Chen, Shreesha Srinath, Christopher Batten, and G. Edward Suh. 2018. An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware. In Proc. of the 51st annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-51).
[12] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proc. of the 47th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-47).
[13] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proc. of the 43rd annual Intl. Symp. on Computer Architecture (ISCA-43).
[14] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search. In Proc. of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
[15] C. Demetrescu, A. Goldberg, and D. Johnson. 2006. 9th DIMACS Implementation Challenge: Shortest Paths. http://www.dis.uniroma1.it/~challenge9
[16] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the Cilk-5 multithreaded language. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI).
[17] R. M. Fujimoto, J.-J. Tsai, and G. C. Gopalakrishnan. 1992. Design and evaluation of the rollback chip: special purpose hardware for Time Warp. IEEE Trans. Comput. 41, 1 (1992).
[18] Wilson W. L. Fung and Tor M. Aamodt. 2013. Energy Efficient GPU Transactional Memory via Space-Time Optimizations. In Proc. of the 46th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-46).
[19] Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, and Tor M. Aamodt. 2011. Hardware Transactional Memory for GPU Architectures. In Proc. of the 44th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-44).
[20] Epifanio Gaona-Ramirez, Rubén Titos-Gil, Juan Fernandez, and Manuel E. Acacio. 2010. Characterizing energy consumption in hardware transactional memory systems. In Proc. of the 22nd symp. on Computer Architecture and High Performance Computing (SBAC-PAD 22).
[21] María Jesús Garzarán, Milos Prvulovic, José María Llabería, Víctor Viñals, Lawrence Rauchwerger, and Josep Torrellas. 2003. Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. In Proc. of the 9th IEEE intl. symp. on High Performance Computer Architecture (HPCA-9).
[22] Mordechai Haklay and Patrick Weber. 2008. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008).
[23] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. 2016. Graphicionado: A High-performance and Energy-efficient Accelerator for Graph Analytics. In Proc. of the 49th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-49).
[24] Lance Hammond, Mark Willey, and Kunle Olukotun. 1998. Data speculation support for a chip multiprocessor. In Proc. of the 8th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII).
[25] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. 2004. Transactional memory coherence and consistency. In Proc. of the 31st annual Intl. Symp. on Computer Architecture (ISCA-31).
[26] Tim Harris, James Larus, and Ravi Rajwar. 2010. Transactional memory. Synthesis Lectures on Computer Architecture (2010).
[27] William Hasenplaugh, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. 2014. Ordering heuristics for parallel graph coloring. In Proc. of the 26th ACM Symp. on Parallelism in Algorithms and Architectures (SPAA).
[28] Muhammad Amber Hassaan, Martin Burtscher, and Keshav Pingali. 2011. Ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms. In Proc. of the ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP).
[29] Maurice Herlihy and J. Eliot B. Moss. 1993. Transactional memory: Architectural support for lock-free data structures. In Proc. of the 20th annual Intl. Symp. on Computer Architecture (ISCA-20).
[30] Syed Ali Raza Jafri, Gwendolyn Voskuilen, and T. N. Vijaykumar. 2013. Wait-n-GoTM: improving HTM performance by serializing cyclic dependencies. In Proc. of the 18th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVIII).
[31] D. Jefferson, B. Beckman, F. Wieland, L. Blume, and M. Diloreto. 1987. Time Warp Operating System. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles.
[32] David R. Jefferson. 1985. Virtual time. ACM TOPLAS 7, 3 (1985).
[33] Mark C. Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, and Daniel Sanchez. 2016. Data-centric execution of speculative parallel programs. In Proc. of the 49th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-49).
[34] Mark C. Jeffrey, Suvinay Subramanian, Cong Yan, Joel Emer, and Daniel Sanchez. 2015. A scalable architecture for ordered parallelism. In Proc. of the 48th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-48).
[35] Mark C. Jeffrey, Victor A. Ying, Suvinay Subramanian, Hyun Ryong Lee, Joel Emer, and Daniel Sanchez. 2018. Harmonizing Speculative and Non-Speculative Execution in Architectures for Ordered Parallelism. In Proc. of the 51st annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-51).
[36] Mark T. Jones and Paul E. Plassmann. 1993. A Parallel Graph Coloring Heuristic. SIAM J. Sci. Comput. 14, 3 (1993).
[37] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance

15 Analysis of a Tensor Processing Unit. In Proc. of the 44th annual Intl. [54] Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Ef- Symp. on Computer Architecture (ISCA-44). ficient GPU Synchronization without Scopes: Saying No to Complex [38] Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Consistency Models. In Proc. of the 48th annual IEEE/ACM intl. symp. Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin on Microarchitecture (MICRO-48). Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, [55] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 1995. Multi- Randy Katz, Jonathan Bachrach, and Krste Asanović. 2018. Firesim: scalar processors. In Proc. of the 22nd annual Intl. Symp. on Computer FPGA-accelerated Cycle-exact Scale-out System Simulation in the Pub- Architecture (ISCA-22). lic Cloud. In Proc. of the 45th annual Intl. Symp. on Computer Architecture [56] SpinalHDL. 2018. A FPGA friendly 32 bit RISC-V CPU implementation. (ISCA-45). https://github.com/SpinalHDL/VexRiscv. [39] Jure Leskovec and Andrej Krevl. 2014. SNAP datasets: Stanford large [57] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. network dataset collection. http://snap.stanford.edu/data. Mowry. 2000. A scalable approach to thread-level speculation. In Proc. [40] Zhaoshi Li, Leibo Liu, Yangdong Deng, Shouyi Yin, Yao Wang, and of the 27th annual Intl. Symp. on Computer Architecture (ISCA-27). Shaojun Wei. 2017. Aggressive Pipelining of Irregular Applications [58] Suvinay Subramanian. 2018. Architectural Techniques to Unlock Ordered on Reconfigurable Hardware. In Proc. of the 44th annual Intl. Symp. on and Nested Speculative Parallelism. Ph.D. Dissertation. Massachusetts Computer Architecture (ISCA-44). Institute of Technology. [41] Kyle Locke. 2011. Parameterizable Content-Addressable Memory. Xil- [59] Suvinay Subramanian, Mark C. Jeffrey, Maleen Abeydeera, Hyun Ryong inx Application Note (2011). Lee, Victor A. 
Ying, Joel Emer, and Daniel Sanchez. 2017. Fractal: An [42] Xiaoyu Ma, Dan Zhang, and Derek Chiou. 2017. FPGA-Accelerated execution model for fine-grain nested speculative parallelism. In Proc. Transactional Execution of Graph Workloads. In Proc. of the 2017 of the 44th annual Intl. Symp. on Computer Architecture (ISCA-44). ACM/SIGDA International Symposium on Field-Programmable Gate Ar- [60] Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry rays (FPGA). Cook, David Patterson, and Krste Asanović. 2010. RAMP gold: an [43] Steve Margerm, Amirali Sharifian, Apala Guha Guha, and Gilles Shri- FPGA-based architecture simulator for multiprocessors. In Proc. of the raman, Arrvindh Shriraman Pokam. 2018. TAPAS: Generating Par- 47th Design Automation Conf. (DAC-47). allel Accelerators from Parallel Programs. In Proc. of the 51st annual [61] John Wawrzynek, David Patterson, Mark Oskin, Shih-Lien Lu, Christo- IEEE/ACM intl. symp. on Microarchitecture (MICRO-51). foros Kozyrakis, James C Hoe, Derek Chiou, and Krste Asanovic. 2007. [44] Ulrich Meyer and Peter Sanders. 1998. Delta-Stepping: A Parallel Single RAMP: Research accelerator for multiple processors. IEEE Micro 27, 2 Source Shortest Path Algorithm. In Proc. of the 6th Annual European (2007). Symposium on Algorithms (ESA). [62] Sewook Wee, Jared Casper, Njuguna Njoroge, Yuriy Tesylar, Daxia Ge, [45] Vincent Mirian and Paul Chow. 2012. FCache: A System for Cache Christos Kozyrakis, and Kunle Olukotun. 2007. A Practical FPGA- Coherent Processing on FPGAs. In Proc. of the 2012 ACM/SIGDA Inter- based Framework for Novel CMP Research. In Proceedings of the 2007 national Symposium on Field Programmable Gate Arrays (FPGA). ACM/SIGDA 15th International Symposium on Field Programmable Gate [46] Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark D. Hill, and Arrays (FPGA). David Wood. 2006. LogTM: Log-based transactional memory. In Proc. [63] H. J. Yang, K. Fleming, M. Adler, and J. Emer. 2014. 
LEAP Shared of the 12th IEEE intl. symp. on High Performance Computer Architecture Memories: Automating the Construction of FPGA Coherent Memories. (HPCA-12). In Proc. of the Annual International Symposium on Field-Programmable [47] Njuguna Njoroge, Jared Casper, Sewook Wee, Yuriy Teslyar, Daxia Custom Computing Machines (FCCM). Ge, Christos Kozyrakis, and Kunle Olukotun. 2007. ATLAS: A Chip- [64] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris multiprocessor with Transactional Memory Support. In Proc. of the Volos,Mark D. Hill, Michael M. Swift, and David A. Wood. 2007. LogTM- conf. on Design, Automation and Test in Europe (DATE). SE: Decoupling hardware transactional memory from caches. In Proc. [48] Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan of the 13th IEEE intl. symp. on High Performance Computer Architecture Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proc. of the (HPCA-13). 44th annual Intl. Symp. on Computer Architecture (ISCA-44). [65] Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas. 1998. Hardware [49] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, for speculative run-time parallelization in distributed shared-memory John Greth, Steven Burns, and Ozcan Ozturk. 2016. Energy Efficient multiprocessors. In Proc. of the 4th IEEE intl. symp. on High Performance Architecture for Graph Analytics Accelerators. In Proc. of the 43rd Computer Architecture (HPCA-4). annual Intl. Symp. on Computer Architecture (ISCA-43). [50] Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prount- zos, and Xin Sui. 2011. The tao of parallelism in algorithms. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Imple- mentation (PLDI). [51] Shafiur Rahman, Nael Abu-Ghazaleh, and Walid Najjar. 2017. PDES- A: A Parallel Discrete Event Simulation Accelerator for FPGAs. In Proc. 
of the ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS). [52] Ravi Rajwar and James R Goodman. 2002. Transactional lock-free execution of lock-based programs. In Proc. of the 10th intl. conf. on Ar- chitectural Support for Programming Languages and Operating Systems (ASPLOS-X). [53] Jose Renau, Karin Strauss, Luis Ceze, Wei Liu, Smruti Sarangi, James Tuck, and Josep Torrellas. 2005. Thread-level speculation on a CMP can be energy efficient. In Proc. of the Intl. Conf. on Supercomputing (ICS’05).
