Chronos: Efficient Speculative Parallelism for Accelerators

Maleen Abeydeera ([email protected])
Daniel Sanchez ([email protected])
Massachusetts Institute of Technology

Abstract

We present Chronos, a framework to build accelerators for applications with speculative parallelism. These applications consist of atomic tasks, sometimes with order constraints, and need speculative execution to extract parallelism. Prior work extended conventional multicores to support speculative parallelism, but these prior architectures are a poor match for accelerators because they rely on cache coherence and add non-trivial hardware to detect conflicts among tasks.

Chronos instead relies on a novel execution model, Spatially Located Ordered Tasks (SLOT), that uses order as the only synchronization mechanism and limits task accesses to a single read-write object. This simplification avoids the need for cache coherence and makes speculative execution cheap and distributed. Chronos abstracts the complexities of speculative parallelism, making accelerator design easy.

We develop an FPGA implementation of Chronos and use it to build accelerators for four challenging applications. When run on commodity AWS FPGA instances, these accelerators outperform state-of-the-art software versions running on a higher-priced multicore instance by 3.5× to 15.3×.

CCS Concepts • Computer systems organization → Multicore architectures.

Keywords speculative parallelism; fine-grain parallelism; accelerators; specialization; FPGA

ACM Reference Format:
Maleen Abeydeera and Daniel Sanchez. 2020. Chronos: Efficient Speculative Parallelism for Accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20), March 16–20, 2020, Lausanne, Switzerland. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3373376.3378454

ASPLOS '20, March 16–20, 2020, Lausanne, Switzerland
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-7102-5/20/03.
https://doi.org/10.1145/3373376.3378454

1 Introduction

The impending end of Moore's Law is forcing architectures to rely on application- or domain-specific accelerators to improve performance. Accelerators require large amounts of parallelism. Consequently, prior accelerators have focused on domains where parallelism is easy to exploit, such as deep learning [12, 13, 37], and rely on conventional parallelization techniques, such as data-parallel or dataflow execution [48]. However, many applications do not have such easy-to-extract parallelism, and have remained off-limits to accelerators.

In this paper, we focus on building accelerators for applications that need speculative execution to extract parallelism. These applications consist of tasks that are created dynamically and operate on shared data, and where operations on shared data must happen in a certain order for execution to be correct. Order constraints may arise from the need to preserve atomicity (e.g., operations across tasks must be ordered to not interleave with each other), or from the need to order tasks due to application semantics (e.g., tasks dequeued from a priority queue). Enforcing these order constraints a priori, before running each task, is often too costly and/or limits parallelism. Thus, it is preferable to run tasks speculatively and check that they followed a correct order a posteriori.

For instance, consider discrete event simulation, which has wide applicability in simulating digital circuits, networked systems, and physical processes. Discrete event simulation consists of dynamically created tasks that may operate on the same simulated object and must run in the correct simulated order. Running these tasks non-speculatively requires excessive synchronization and limits parallelism [10, 28]. Running tasks speculatively is far more profitable [32, 34].

To make speculation efficient, prior work has proposed hardware support for speculation, including Thread-Level Speculation [21, 34, 53, 55, 57] and Hardware Transactional Memory [1, 6, 9, 20, 26, 29, 30, 46]. Unfortunately, prior speculative architectures are hard to apply to accelerators, because they all rely on coherent cache hierarchies to perform speculative execution, modifying the coherence protocol to detect conflicts among tasks. This is a natural match for multicores, which already have a coherence protocol. But such a solution would be onerous and complex for an accelerator: it would require implementing coherent caches and speculation-tracking structures that, while a minor overhead for general-purpose cores, are too expensive for small, specialized ones.

To address this challenge, in this paper we present a hardware system that implements speculative execution without using coherence. Instead, this system follows a data-centric approach, where shared data is mapped across the system; work is divided into small tasks that access at most one shared object each; and tasks are always sent to run at the place where their data is mapped. To enforce atomicity across task groups, or other order constraints, tasks are ordered through timestamps (these are program-specified logical timestamps completely decoupled from physical time).

We formalize these semantics through the Spatially Located Ordered Tasks (SLOT) execution model. In SLOT, all work happens through tasks that are ordered using timestamps. A task may create children tasks ordered after them, and parent tasks communicate input values to children directly. Each task must operate on a single read-write object, which must be declared when the task is created (besides this restriction, tasks may access an arbitrary amount of read-only data).

We leverage SLOT to implement Chronos, a novel acceleration framework for speculative algorithms. Each Chronos instance consists of spatially distributed tiles. Each tile has multiple processing elements (PEs) that execute tasks, and a local cache. Each tile also implements hardware to queue tasks, dispatch them to PEs, track their speculative state, and abort or commit them in timestamp order. Chronos maps read-write objects across tiles, and sends each newly created task to the tile where its read-write object is mapped. This enables completely distributed operation without a cache coherence protocol.

Chronos provides a common framework to accelerate speculative algorithms, abstracting away the complexities of task management and speculative execution. Developers need only express their application as SLOT tasks coded against a high-level API. To achieve high performance, Chronos supports two types of customization. First, applications can customize the PEs, which can be specified in RTL or described using High-Level Synthesis (HLS). PEs can also be general-purpose cores, so developers can start with a software implementation and specialize tasks as needed to achieve high performance. Second, Chronos lets applications turn off unneeded features. For example, if the algorithm is naturally resilient to out-of-order writes (e.g., if updates are monotonic), applications can disable rollback on misspeculation.

We evaluate Chronos by implementing it on an FPGA and use it to implement accelerators for several graph analytics and simulation applications. We use four hard-to-parallelize applications with speculative parallelism. We deploy these accelerators on commodity AWS FPGA instances. We compare these accelerators with state-of-the-art software implementations of these applications running on a higher-priced 40-thread multicore instance. Chronos achieves speedups of up to 15.3× and gmean 5.4× over the software versions. Chronos outperforms the multicore baseline despite running at a 19× slower frequency, because it exploits orders of magnitude more parallelism. These results show that FPGAs are a practical and cost-effective way to accelerate applications with speculative parallelism.

In summary, this paper contributes:
• SLOT, the first execution model that supports speculative parallelism without cache coherence (Sec. 3).
• Chronos, a customizable framework that implements the SLOT execution model and makes it easy to accelerate applications with speculative parallelism (Sec. 4).
• A detailed evaluation of Chronos using commodity FPGAs in the cloud that demonstrates significant speedups for several challenging applications, analyzes system efficiency, and quantifies the benefits of customization (Sec. 6).

Our Chronos implementation is open-source and available at https://chronos-arch.csail.mit.edu.

2 Motivation and Background

In this section we first present a case for speculative parallelism through a simple application, discrete event simulation (des). We then review the types of parallelism exploited by prior accelerators, and see that most do not exploit speculative parallelism. Finally, we review prior speculative architectures, and use des to identify a simplification that these architectures have missed: support for task order avoids the need for coherence-based conflict detection, motivating SLOT.

2.1 A case for speculative parallelism

We illustrate the utility of speculative parallelism through des, a discrete event simulator for digital circuits [28]. Listing 1 shows code for a sequential implementation of des. Each des task processes a gate input toggling at a particular time. If this input toggle causes the gate's output to toggle, the task enqueues events for all inputs connected to that output at the appropriate time. The sequential implementation processes one task at a time in simulated time order, and maintains the set of tasks to process in a priority queue.

    PrioQueue eventQueue;

    void simToggle(Time time, GateInput input) {
        Gate gate = input.gate;
        bool outToggled = gate.simulateToggle(input);
        if (outToggled) {
            // Toggle all inputs connected to this gate
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                eventQueue.enqueue(nextTime, i);
            }
        }
    }

    ... // Enqueue initial events (input waveforms)
    // Main loop
    while (!eventQueue.empty()) {
        (time, input) = eventQueue.dequeue();
        simToggle(time, input);
    }

Listing 1. Sequential implementation of des.

Fig. 1a shows a circuit with input waveforms and propagation delays, and Fig. 1b shows the task diagram of an execution of des on this circuit. Arrows between tasks show
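The event-driven core of Listing 1 can be distilled into a short runnable sketch (Python for brevity; the two-gate fanout, the per-input delays, and the assumption that every input toggle propagates to the output are illustrative, not part of the original listing):

```python
import heapq

def simulate(initial_events, delay, fanout):
    """Sequential des kernel: pop the earliest event, simulate the toggle,
    and enqueue downstream toggles after each gate's propagation delay."""
    event_queue = list(initial_events)   # (time, gate_input) pairs
    heapq.heapify(event_queue)
    processed = []
    while event_queue:
        time, inp = heapq.heappop(event_queue)
        processed.append((time, inp))
        # Assume the output toggles on every input toggle (outToggled == true),
        # so every event fans out to the connected inputs.
        for nxt in fanout[inp]:
            heapq.heappush(event_queue, (time + delay[nxt], nxt))
    return processed

# A made-up two-gate chain: input "a" drives gate input "b", which drives "c".
fanout = {"a": ["b"], "b": ["c"], "c": []}
delay = {"b": 2, "c": 3}
events = simulate([(0, "a")], delay, fanout)
# Tasks are processed one at a time, in simulated-time order.
```

The priority queue is exactly the serialization bottleneck the paper targets: every task must wait its turn on `eventQueue`, even when tasks touch independent gates.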


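The data-centric organization described above (read-write objects mapped across tiles, each task sent to its object's home tile, tasks committed in timestamp order) can be sketched non-speculatively in a few lines of Python. The modulo mapping, the four-object relaxation chain, and all names are illustrative assumptions, and the speculation/rollback machinery is omitted:

```python
import heapq
from itertools import count

N_TILES = 2

def home_tile(obj_id):
    # Each read-write object has a fixed home tile; a simple modulo
    # mapping stands in for Chronos's actual mapping (an assumption).
    return obj_id % N_TILES

tiles = [{"queue": [], "data": {}} for _ in range(N_TILES)]
seq = count()  # tie-breaker for tasks with equal timestamps

def enqueue(ts, obj_id, value):
    # A new task is sent to the tile that owns its single read-write object.
    heapq.heappush(tiles[home_tile(obj_id)]["queue"],
                   (ts, next(seq), obj_id, value))

def run():
    order = []
    while any(t["queue"] for t in tiles):
        # Dispatch the globally earliest task: tasks commit in timestamp order.
        tile = min((t for t in tiles if t["queue"]),
                   key=lambda t: t["queue"][0][:2])
        ts, _, obj_id, value = heapq.heappop(tile["queue"])
        data = tile["data"]
        # Task body: keep the minimum value seen (an sssp-style relaxation).
        if value < data.get(obj_id, float("inf")):
            data[obj_id] = value
            # Children are ordered after their parent via larger timestamps.
            if obj_id + 1 < 4:
                enqueue(ts + 1, obj_id + 1, value + 1)
        order.append((ts, obj_id))
    return order

enqueue(0, 0, 0)
order = run()
```

Because each task touches only the objects homed at its tile, no cross-tile coherence traffic is ever needed; the real system additionally runs tasks speculatively out of timestamp order and rolls back misspeculations.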
while they use priority scheduling to reduce useless work.

Compared to the 125 MHz FPGA, a 2 GHz ASIC achieves a 16× performance improvement. We throttle DDR memory bandwidth with frequency, since these applications are not bandwidth-bound (the FPGA prototype has about 50 GB/s of memory bandwidth); throttling bandwidth by 1/16th would not change performance.

6.4 Analysis of implementation costs

Lines of code: Chronos makes it simple to design custom accelerators to extract speculative parallelism. The framework components take over 20000 lines of SystemVerilog. By contrast, each application is much simpler: sssp and astar take just about 100 lines, color is around 300 lines, and maxflow is around 600 lines.

FPGA utilization: Table 4 shows the FPGA resource consumption of each framework component and PE. Overall, we observe that, while the framework components consume substantial resources, they are comparable to those of the PEs, which are very simple.

Table 4. Per-tile FPGA resource consumption (LUTs (K), FFs (K), BRAM, URAM) for each of the framework components (TQ, CQ, Cache, TSB) and the application-specific PEs (des, sssp, astar, color, maxflow), along with the available resources.

7 Additional Related Work

Transactional memory on accelerators: Prior work has demonstrated HTM systems on FPGAs [8]. However, they do not target application acceleration using FPGAs, and instead focus on implementing a prototype with soft cores, where conflict detection is achieved by augmenting a coherence protocol. Unfortunately, for high-throughput FPGA accelerators, the overheads of a coherence protocol are not desirable. Kilo TM [19] proposes to implement HTM on GPUs without using cache coherence. Instead, it uses value-based conflict detection, relying on a post-completion validation phase where read values are re-read to detect conflicts. This technique is expensive (e.g., requiring logging of read values) and is restricted to lazy version management, which makes it hard to support speculative forwarding, a key feature for Chronos. Further, they do not support strict order constraints among tasks, only unordered transactions. Ma et al. is the only system that targets FPGA acceleration using TM. However, they do not use an on-chip cache, and hence suffer from reduced performance.

Accelerators for graph algorithms: Prior work has also proposed accelerators for graph algorithms, both for FPGAs [14, 40, 51] and ASICs [23, 49]. However, none of them support strict task ordering, and as a result resort to less work-efficient algorithms like Bellman-Ford for sssp.

Simulation accelerators: Numerous other work in parallel discrete event simulation has proposed accelerators for different aspects of the Time Warp protocol. The Rollback chip [17] accelerates the speculative versioning and rollback process, but leaves other aspects, such as conflict detection, to software. Rahman et al. implement a discrete event simulation accelerator on an FPGA. However, it uses a centralized event queue that saturates at around 0.15 events per cycle, 50× lower task throughput than a 16-tile Chronos system. This shows why Chronos's distributed, high-throughput design approach is crucial. Moreover, Rahman et al. evaluated their design using a microbenchmark with long tasks, and do not accelerate actual applications. Hence, they do not consider subtle issues that arise when doing so, such as dealing with limited on-chip queue capacity.

FPGAs have also been used to accelerate architectural simulation. RAMP [60, 61] simulates multicore systems, and FireSim [38] simulates large, scale-out clusters. These systems use non-speculative CMB-style simulation, which may limit parallelism, and could benefit from Chronos's techniques.

8 Conclusion

We have presented Chronos, the first framework to build accelerators for applications with ordered speculative parallelism. Chronos makes speculative execution cheap by relying on SLOT, a new execution model that limits tasks to access a single read-write object, avoiding the need for cache coherence.

We implement Chronos on an FPGA and use it to accelerate several challenging applications in graph analytics and simulation. We deploy these accelerators on commodity AWS FPGAs, where we demonstrate 5.4× gmean speedup for the same applications over their software-parallel versions.

Acknowledgments

We sincerely thank Mark Jeffrey, Victor Ying, Joel Emer, Po-An Tsai, Anurag Mukkara, Guowei Zhang, Quan Nguyen, Hyun Ryong Lee, Keiko Yamaguchi, Shuichi Konami, and the anonymous reviewers for their helpful feedback. This work was supported in part by NSF grants CAREER-1452994 and SHF-1814969, NSF/SRC grant E2CDA-1640012, and by a Sony research grant.

A Artifact Appendix

A.1 Abstract
Our artifact consists of the source code for the Chronos FPGA acceleration framework; pre-compiled FPGA images for our evaluated configurations (to facilitate a quick evaluation); and scripts to set up the development environment, compile the images from source code, run the experiments in the paper, and regenerate the graphs.
This appendix describes how to use Chronos to reproduce the paper's results, and explains how to set up and run other Chronos configurations and experiments. All experiments are run on the Amazon AWS f1.2xlarge instance, configured using the Amazon-provided FPGA Developer AMI.

A.2 Artifact check-list (meta-information)
• Compilation: Xilinx Vivado, GNU RISC-V embedded GCC compiler.
• Run-time environment: Amazon AWS FPGA instance.
• Hardware: Xilinx UltraScale VU9P.
• How much disk space required (approximately)?: 2 GB.
• How much time is needed to prepare workflow (approximately)?: Approx. 1 hour.
• How much time is needed to complete experiments (approximately)?: 2 weeks to reproduce the full results from scratch, or 2 hours if using the precompiled images. The tutorials (Sec. A.7) take about 2 days each, or 2 hours if using precompiled images.
• Publicly available?: Yes.
• Code licenses (if publicly available)?: GPL v2.
• Archived (provide DOI)?: 10.5281/zenodo.3558760

A.3 Description
A.3.1 How delivered. Our artifact can be downloaded from https://doi.org/10.5281/zenodo.3558760 as a .zip file.
A.3.2 Hardware dependencies. Chronos is designed to run on an Amazon AWS f1.2xlarge instance configured with the Amazon FPGA Developer AMI.
A.3.3 Software dependencies. The main dependency is Xilinx Vivado 2018.2, which comes with the FPGA Developer AMI. The RISC-V Chronos variant relies on the GNU RISC-V embedded GCC compiler.
A.3.4 Data sets. For small, testing runs, we include scripts to generate synthetic datasets. The experiments in the paper use large, publicly available datasets from other projects. Since these datasets are large and publicly available, they are not included directly in the artifact code. Instead, the artifact includes scripts to download them. These datasets are also archived, with the DOI 10.5281/zenodo.3563178.

A.4 Installation
1. Launch an AWS f1.2xlarge instance using the Amazon FPGA Developer AMI. Log into the instance.
2. Extract the Chronos artifact .zip file, and navigate to its base directory.
3. Run source install.sh. This will clone the Amazon FPGA SDK repository and install the necessary drivers.
4. Run aws configure to set up the instance with your AWS credentials.
5. (Optional) Install the GNU RISC-V embedded GCC compiler within the instance (https://xpack.github.io/riscv-none-embed-gcc/). This step is optional because the distribution already includes the pre-compiled RISC-V binaries necessary for the workflow.

A.5 Experiment workflow
We provide an automated workflow to validate the main results in the paper from scratch. Note that this process involves synthesizing multiple Chronos instances for each application, a process that takes about two weeks to complete.
To facilitate a quick evaluation, we also provide precompiled FPGA images of the Chronos instances; when using these images, reproducing the results takes about two hours.
The cl_chronos/validation/scripts/ directory contains the necessary scripts to validate the results from the paper. The full process is explained in comments in the master script run_validation.py.
To run all experiments from scratch, run:
    python run_validation.py
To run all experiments with precompiled images, run:
    python run_validation.py --precompiled
This will download a list of precompiled image IDs from a shared S3 bucket and run the rest of the workflow.
Sec. A.7 includes two smaller tutorials using Chronos, which can be completed in about 2 hours.

A.6 Evaluation and expected result
Running run_validation.py would generate all evaluation plots (Figures 10-14).

A.7 Experiment customization
This section provides two smaller tutorials on using Chronos. First, we illustrate the SLOT programming model using a sample application running on a Chronos instance with RISC-V soft cores. Second, we describe how to generate Chronos instances with specialized cores.
Before starting either tutorial, run source aws_setup.sh to configure the necessary environment variables and to define the $CL_DIR environment variable to point to the cl_chronos subdirectory. Please see the README.txt here for more detailed information, including topics not covered in this workflow, such as how to simulate Chronos RTL and how to debug with Chronos.

(The commands below follow application is to break the application down into SLOT tasks the standard instructions on how to generate a runnable (single-object tasks ordered using timestamps). Initially, these FPGA image from the placed-and-routed design, at https: tasks can be expressed as software functions and run on a //github.com/aws/aws-fpga/blob/master/hdk/README. Chronos instance with RISC-V cores. md#step3.) Once the SLOT implementation is verified, a specialized First, copy the design file to a location in Amazon S3: core can be designed for each task. Please refer to the script aws s3 cp $CL_DIR/build/checkpoints/to_aws/ $CL_DIR/design/scripts/gen_cores.py on how to integrate .Developer_CL.tar .tar new specialized cores into the Chronos workflow. Then, create the FPGA image aws ec2 create-fpga-image --name --input-storage-location Bucket=, References Key= --logs-storage-location [1] C. Scott Ananian, Krste Asanović, Bradley C. Kuszmaul, Charles E. Leis- Bucket=, Key= erson, and Sean Lie. 2005. Unbounded transactional memory. In Proc. of the 11th IEEE intl. symp. on High Performance Computer Architecture Running this command generates an AGFI-ID that can (HPCA-11). be used to load the image into the FPGA. [2] Richard J. Anderson and João C. Setubal. 1992. On the parallel imple- mentation of Goldberg’s maximum flow algorithm. In Proc. of the 4th Step 3: Compile sssp RISC-V code. ACM Symp. on Parallelism in Algorithms and Architectures (SPAA). This step requires the RISC-V embedded GCC compiler. [3] AWS FPGA Hardware and Software Development Kit. 2017. https: You can skip this step by using the precompiled binaries from //github.com/aws/aws-fpga. [4] Ranjita Bhagwan and Bill Lin. 2000. Fast and scalable priority queue ar- $CL_DIR/riscv-code/binaries in the next step. chitecture for high-speed network switches. In Proc. of the IEEE Infocom To build sssp from source, run: 2000. cd $CL_DIR/riscv-code/sssp [5] Guy E. Blelloch, Jonathan C. 
Hardwick, Siddhartha Chatterjee, Jay make Sipelstein, and Marco Zagha. 1993. Implementation of a portable nested data-parallel language. In Proc. of the ACM SIGPLAN Symp. on Principles Step 4: Run sssp on the FPGA. and Practice of Parallel Programming (PPoPP). [6] Jayaram Bobba, Kevin E. Moore, Haris Volos, Luke Yen, Mark D. Hill, First load the generated image into the FPGA (This com- Michael M. Swift, and David A. Wood. 2007. Performance pathologies mand may have to be run twice the first time it is loaded). in hardware transactional memory. In Proc. of the 34th annual Intl. sudo fpga-load-local-image -S 0 -I Symp. on Computer Architecture (ISCA-34).

[7] Christopher D. Carothers, David Bauer, and Shawn Pearce. 2000. ROSS: A High-performance, Low Memory, Modular Time Warp System. In Proc. of the 14th Workshop on Parallel and Distributed Simulation (PADS).
[8] Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan G. Bronson, Christos Kozyrakis, and Kunle Olukotun. 2011. Hardware Acceleration of Transactional Memory on Commodity Systems. In Proc. of the 16th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVI).
[9] Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. 2007. A scalable, non-blocking approach to transactional memory. In Proc. of the 13th IEEE intl. symp. on High Performance Computer Architecture (HPCA-13).
[10] K. Mani Chandy and Jayadev Misra. 1981. Asynchronous distributed simulation via a sequence of parallel computations. Commun. ACM 24, 4 (1981).
[11] Tao Chen, Shreesha Srinath, Christopher Batten, and G. Edward Suh. 2018. An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware. In Proc. of the 51st annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-51).
[12] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proc. of the 47th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-47).
[13] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proc. of the 43rd annual Intl. Symp. on Computer Architecture (ISCA-43).
[14] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search. In Proc. of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
[15] C. Demetrescu, A. Goldberg, and D. Johnson. 2006. 9th DIMACS Implementation Challenge: Shortest Paths. http://www.dis.uniroma1.it/~challenge9
[16] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the Cilk-5 multithreaded language. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI).
[17] R. M. Fujimoto, J.-J. Tsai, and G. C. Gopalakrishnan. 1992. Design and evaluation of the rollback chip: special purpose hardware for Time Warp. IEEE Trans. Comput. 41, 1 (1992).
[18] Wilson W. L. Fung and Tor M. Aamodt. 2013. Energy Efficient GPU Transactional Memory via Space-Time Optimizations. In Proc. of the 46th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-46).
[19] Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, and Tor M. Aamodt. 2011. Hardware Transactional Memory for GPU Architectures. In Proc. of the 44th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-44).
[20] Epifanio Gaona-Ramirez, Rubén Titos-Gil, Juan Fernandez, and Manuel E. Acacio. 2010. Characterizing energy consumption in hardware transactional memory systems. In Proc. of the 22nd symp. on Computer Architecture and High Performance Computing (SBAC-PAD 22).
[21] María Jesús Garzarán, Milos Prvulovic, José María Llabería, Víctor Viñals, Lawrence Rauchwerger, and Josep Torrellas. 2003. Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. In Proc. of the 9th IEEE intl. symp. on High Performance Computer Architecture (HPCA-9).
[22] Mordechai Haklay and Patrick Weber. 2008. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008).
[23] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. 2016. Graphicionado: A High-performance and Energy-efficient Accelerator for Graph Analytics. In Proc. of the 49th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-49).
[24] Lance Hammond, Mark Willey, and Kunle Olukotun. 1998. Data speculation support for a chip multiprocessor. In Proc. of the 8th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII).
[25] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. 2004. Transactional memory coherence and consistency. In Proc. of the 31st annual Intl. Symp. on Computer Architecture (ISCA-31).
[26] Tim Harris, James Larus, and Ravi Rajwar. 2010. Transactional memory. Synthesis Lectures on Computer Architecture (2010).
[27] William Hasenplaugh, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. 2014. Ordering heuristics for parallel graph coloring. In Proc. of the 26th ACM Symp. on Parallelism in Algorithms and Architectures (SPAA).
[28] Muhammad Amber Hassaan, Martin Burtscher, and Keshav Pingali. 2011. Ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms. In Proc. of the ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP).
[29] Maurice Herlihy and J. Eliot B. Moss. 1993. Transactional memory: Architectural support for lock-free data structures. In Proc. of the 20th annual Intl. Symp. on Computer Architecture (ISCA-20).
[30] Syed Ali Raza Jafri, Gwendolyn Voskuilen, and T. N. Vijaykumar. 2013. Wait-n-GoTM: improving HTM performance by serializing cyclic dependencies. In Proc. of the 18th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVIII).
[31] D. Jefferson, B. Beckman, F. Wieland, L. Blume, and M. Diloreto. 1987. Time Warp Operating System. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles.
[32] David R. Jefferson. 1985. Virtual time. ACM TOPLAS 7, 3 (1985).
[33] Mark C. Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, and Daniel Sanchez. 2016. Data-centric execution of speculative parallel programs. In Proc. of the 49th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-49).
[34] Mark C. Jeffrey, Suvinay Subramanian, Cong Yan, Joel Emer, and Daniel Sanchez. 2015. A scalable architecture for ordered parallelism. In Proc. of the 48th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-48).
[35] Mark C. Jeffrey, Victor A. Ying, Suvinay Subramanian, Hyun Ryong Lee, Joel Emer, and Daniel Sanchez. 2018. Harmonizing Speculative and Non-Speculative Execution in Architectures for Ordered Parallelism. In Proc. of the 51st annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-51).
[36] Mark T. Jones and Paul E. Plassmann. 1993. A Parallel Graph Coloring Heuristic. SIAM J. Sci. Comput. 14, 3 (1993).
[37] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance

15 Analysis of a Tensor Processing Unit. In Proc. of the 44th annual Intl. [54] Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Ef- Symp. on Computer Architecture (ISCA-44). ficient GPU Synchronization without Scopes: Saying No to Complex [38] Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Consistency Models. In Proc. of the 48th annual IEEE/ACM intl. symp. Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin on Microarchitecture (MICRO-48). Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, [55] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 1995. Multi- Randy Katz, Jonathan Bachrach, and Krste Asanović. 2018. Firesim: scalar processors. In Proc. of the 22nd annual Intl. Symp. on Computer FPGA-accelerated Cycle-exact Scale-out System Simulation in the Pub- Architecture (ISCA-22). lic Cloud. In Proc. of the 45th annual Intl. Symp. on Computer Architecture [56] SpinalHDL. 2018. A FPGA friendly 32 bit RISC-V CPU implementation. (ISCA-45). https://github.com/SpinalHDL/VexRiscv. [39] Jure Leskovec and Andrej Krevl. 2014. SNAP datasets: Stanford large [57] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. network dataset collection. http://snap.stanford.edu/data. Mowry. 2000. A scalable approach to thread-level speculation. In Proc. [40] Zhaoshi Li, Leibo Liu, Yangdong Deng, Shouyi Yin, Yao Wang, and of the 27th annual Intl. Symp. on Computer Architecture (ISCA-27). Shaojun Wei. 2017. Aggressive Pipelining of Irregular Applications [58] Suvinay Subramanian. 2018. Architectural Techniques to Unlock Ordered on Reconfigurable Hardware. In Proc. of the 44th annual Intl. Symp. on and Nested Speculative Parallelism. Ph.D. Dissertation. Massachusetts Computer Architecture (ISCA-44). Institute of Technology. [41] Kyle Locke. 2011. Parameterizable Content-Addressable Memory. Xil- [59] Suvinay Subramanian, Mark C. Jeffrey, Maleen Abeydeera, Hyun Ryong inx Application Note (2011). Lee, Victor A. 
Ying, Joel Emer, and Daniel Sanchez. 2017. Fractal: An [42] Xiaoyu Ma, Dan Zhang, and Derek Chiou. 2017. FPGA-Accelerated execution model for fine-grain nested speculative parallelism. In Proc. Transactional Execution of Graph Workloads. In Proc. of the 2017 of the 44th annual Intl. Symp. on Computer Architecture (ISCA-44). ACM/SIGDA International Symposium on Field-Programmable Gate Ar- [60] Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry rays (FPGA). Cook, David Patterson, and Krste Asanović. 2010. RAMP gold: an [43] Steve Margerm, Amirali Sharifian, Apala Guha Guha, and Gilles Shri- FPGA-based architecture simulator for multiprocessors. In Proc. of the raman, Arrvindh Shriraman Pokam. 2018. TAPAS: Generating Par- 47th Design Automation Conf. (DAC-47). allel Accelerators from Parallel Programs. In Proc. of the 51st annual [61] John Wawrzynek, David Patterson, Mark Oskin, Shih-Lien Lu, Christo- IEEE/ACM intl. symp. on Microarchitecture (MICRO-51). foros Kozyrakis, James C Hoe, Derek Chiou, and Krste Asanovic. 2007. [44] Ulrich Meyer and Peter Sanders. 1998. Delta-Stepping: A Parallel Single RAMP: Research accelerator for multiple processors. IEEE Micro 27, 2 Source Shortest Path Algorithm. In Proc. of the 6th Annual European (2007). Symposium on Algorithms (ESA). [62] Sewook Wee, Jared Casper, Njuguna Njoroge, Yuriy Tesylar, Daxia Ge, [45] Vincent Mirian and Paul Chow. 2012. FCache: A System for Cache Christos Kozyrakis, and Kunle Olukotun. 2007. A Practical FPGA- Coherent Processing on FPGAs. In Proc. of the 2012 ACM/SIGDA Inter- based Framework for Novel CMP Research. In Proceedings of the 2007 national Symposium on Field Programmable Gate Arrays (FPGA). ACM/SIGDA 15th International Symposium on Field Programmable Gate [46] Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark D. Hill, and Arrays (FPGA). David Wood. 2006. LogTM: Log-based transactional memory. In Proc. [63] H. J. Yang, K. Fleming, M. Adler, and J. Emer. 2014. 
LEAP Shared of the 12th IEEE intl. symp. on High Performance Computer Architecture Memories: Automating the Construction of FPGA Coherent Memories. (HPCA-12). In Proc. of the Annual International Symposium on Field-Programmable [47] Njuguna Njoroge, Jared Casper, Sewook Wee, Yuriy Teslyar, Daxia Custom Computing Machines (FCCM). Ge, Christos Kozyrakis, and Kunle Olukotun. 2007. ATLAS: A Chip- [64] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris multiprocessor with Transactional Memory Support. In Proc. of the Volos,Mark D. Hill, Michael M. Swift, and David A. Wood. 2007. LogTM- conf. on Design, Automation and Test in Europe (DATE). SE: Decoupling hardware transactional memory from caches. In Proc. [48] Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan of the 13th IEEE intl. symp. on High Performance Computer Architecture Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proc. of the (HPCA-13). 44th annual Intl. Symp. on Computer Architecture (ISCA-44). [65] Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas. 1998. Hardware [49] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, for speculative run-time parallelization in distributed shared-memory John Greth, Steven Burns, and Ozcan Ozturk. 2016. Energy Efficient multiprocessors. In Proc. of the 4th IEEE intl. symp. on High Performance Architecture for Graph Analytics Accelerators. In Proc. of the 43rd Computer Architecture (HPCA-4). annual Intl. Symp. on Computer Architecture (ISCA-43). [50] Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prount- zos, and Xin Sui. 2011. The tao of parallelism in algorithms. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Imple- mentation (PLDI). [51] Shafiur Rahman, Nael Abu-Ghazaleh, and Walid Najjar. 2017. PDES- A: A Parallel Discrete Event Simulation Accelerator for FPGAs. In Proc. 
of the ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS). [52] Ravi Rajwar and James R Goodman. 2002. Transactional lock-free execution of lock-based programs. In Proc. of the 10th intl. conf. on Ar- chitectural Support for Programming Languages and Operating Systems (ASPLOS-X). [53] Jose Renau, Karin Strauss, Luis Ceze, Wei Liu, Smruti Sarangi, James Tuck, and Josep Torrellas. 2005. Thread-level speculation on a CMP can be energy efficient. In Proc. of the Intl. Conf. on Supercomputing (ICS’05).
