Chronos: Efficient Speculative Parallelism for Accelerators

Maleen Abeydeera ([email protected]) and Daniel Sanchez ([email protected])
Massachusetts Institute of Technology

Abstract
We present Chronos, a framework to build accelerators for applications with speculative parallelism. These applications consist of atomic tasks, sometimes with order constraints, and need speculative execution to extract parallelism. Prior work extended conventional multicores to support speculative parallelism, but these prior architectures are a poor match for accelerators because they rely on cache coherence and add non-trivial hardware to detect conflicts among tasks. Chronos instead relies on a novel execution model, Spatially Located Ordered Tasks (SLOT), that uses order as the only synchronization mechanism and limits task accesses to a single read-write object. This simplification avoids the need for cache coherence and makes speculative execution cheap and distributed. Chronos abstracts the complexities of speculative parallelism, making accelerator design easy.
We develop an FPGA implementation of Chronos and use it to build accelerators for four challenging applications. When run on commodity AWS FPGA instances, these accelerators outperform state-of-the-art software versions running on a higher-priced multicore instance by 3.5× to 15.3×.

CCS Concepts • Computer systems organization → Multicore architectures.

Keywords speculative parallelism; fine-grain parallelism; accelerators; specialization; FPGA

ACM Reference Format:
Maleen Abeydeera and Daniel Sanchez. 2020. Chronos: Efficient Speculative Parallelism for Accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’20), March 16–20, 2020, Lausanne, Switzerland. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3373376.3378454

1 Introduction
The impending end of Moore’s Law is forcing architectures to rely on application- or domain-specific accelerators to improve performance. Accelerators require large amounts of parallelism. Consequently, prior accelerators have focused on domains where parallelism is easy to exploit, such as deep learning [12, 13, 37], and rely on conventional parallelization techniques, such as data-parallel or dataflow execution [48]. However, many applications do not have such easy-to-extract parallelism, and have remained off-limits to accelerators.

In this paper, we focus on building accelerators for applications that need speculative execution to extract parallelism. These applications consist of tasks that are created dynamically and operate on shared data, and where operations on shared data must happen in a certain order for execution to be correct. Order constraints may arise from the need to preserve atomicity (e.g., operations across tasks must be ordered to not interleave with each other), or from the need to order tasks due to application semantics (e.g., tasks dequeued from a priority queue). Enforcing these order constraints a priori, before running each task, is often too costly and/or limits parallelism. Thus, it is preferable to run tasks speculatively and check that they followed a correct order a posteriori.

For instance, consider discrete event simulation, which has wide applicability in simulating digital circuits, networked systems, and physical processes. Discrete event simulation consists of dynamically created tasks that may operate on the same simulated object and must run in the correct simulated time order. Running these tasks non-speculatively requires excessive synchronization and limits parallelism [10, 28]. Running tasks speculatively is far more profitable [32, 34].

To make speculation efficient, prior work has proposed hardware support for speculation, including Thread-Level Speculation [21, 34, 53, 55, 57] and Hardware Transactional Memory [1, 6, 9, 20, 26, 29, 30, 46]. Unfortunately, prior speculative architectures are hard to apply to accelerators, because they all rely on coherent cache hierarchies to perform speculative execution, modifying the coherence protocol to detect conflicts among tasks. This is a natural match for multicores, which already have a coherence protocol. But such a solution would be onerous and complex for an accelerator: it would require implementing coherent caches and speculation-tracking structures that, while a minor overhead for general-purpose cores, are too expensive for small, specialized ones.
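To make the ordered-task model concrete, here is a small illustrative sketch in plain Python (not Chronos code, and not how the hardware executes): single-source shortest paths (sssp), one of the applications evaluated later, expressed as dynamically created tasks that must run in timestamp order, where each task reads and writes only a single object (its vertex's distance) — the restriction that SLOT exploits.

```python
import heapq

# Illustrative sketch of the sequential semantics that speculative
# execution must preserve: tasks carry a timestamp, run in timestamp
# order, access one read-write object each, and spawn child tasks.
def sssp(graph, source):
    dist = {v: float('inf') for v in graph}
    tasks = [(0, source)]                 # priority queue of (timestamp, vertex)
    while tasks:
        ts, v = heapq.heappop(tasks)      # always run the earliest task
        if ts >= dist[v]:
            continue                      # stale task: a shorter path already won
        dist[v] = ts                      # write this task's single object
        for u, w in graph[v]:
            heapq.heappush(tasks, (ts + w, u))  # child tasks at later timestamps
    return dist

g = {'a': [('b', 2), ('c', 5)], 'b': [('c', 1)], 'c': []}
print(sssp(g, 'a'))   # {'a': 0, 'b': 2, 'c': 3}
```

Speculative hardware runs many such tasks in parallel; correctness requires that the final state be indistinguishable from this sequential, timestamp-ordered execution.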
To address this challenge, in this paper we present a hard-
6.4 Analysis of implementation costs
Lines of code: Chronos makes it simple to design custom accelerators that extract speculative parallelism. The Chronos framework components take over 20000 lines of SystemVerilog. By contrast, each application is much simpler, taking from just 100 lines to around 600 lines of code.

FPGA utilization: Table 4 shows the FPGA resource consumption of each framework component and PE. Overall, we observe that, while the framework components consume substantial resources, the application PEs, which are very simple, consume comparatively few.

Table 4. Per-tile FPGA resource consumption (LUTs, FFs, BRAM, and URAM) for each of the framework components and application-specific PEs.

These results also let us project the performance of an ASIC implementation. Compared to the 125 MHz FPGA, a 2 GHz ASIC achieves a 16× higher frequency; since off-chip memory bandwidth would not change with frequency, we model this by throttling DDR memory bandwidth by 1/16th (the FPGA prototype has about 50 GB/s of memory bandwidth). Under this model, color and des, which are not bandwidth-bound, would see speedups of up to 24.4× over the FPGA, while for sssp the improvement is limited to 13×.

7 Additional Related Work
Transactional memory on accelerators: Prior work has demonstrated HTM systems on FPGAs [8, 47, 62]. However, they do not target application acceleration using FPGAs, and instead focus on implementing a prototype with soft cores, where conflict detection is achieved by augmenting a coherence protocol. Unfortunately, for high-throughput FPGA accelerators, the overheads of a coherence protocol are not desirable.
Kilo TM [18, 19] proposes to implement TM on GPUs without using cache coherence. Instead, it uses value-based conflict detection, relying on a post-completion validation phase where read values are re-read to detect conflicts. This technique is expensive (e.g., requiring logging of read values) and is restricted to lazy version management, which makes it hard to support speculative forwarding, a key feature for Chronos. Further, they do not support strict order constraints among tasks, only unordered transactions.
Ma et al. [42] is the only system that targets FPGA acceleration using TM. However, they do not use an on-chip cache, and hence suffer from reduced performance.

Accelerators for graph algorithms: Numerous other works have proposed accelerators for graph algorithms, both for FPGAs [14, 40] and ASICs [23, 49]. However, none of them support strict task ordering, and as a result resort to less work-efficient algorithms like Bellman-Ford for sssp.

Simulation accelerators: Prior work in parallel discrete event simulation has proposed accelerators for different aspects of the Time Warp protocol. The Rollback chip [17] accelerates speculative versioning and the rollback process, but leaves other aspects, such as conflict detection, to software. Rahman et al. [51] implement a discrete event simulation accelerator on an FPGA. However, their design uses a centralized event queue that saturates at around 0.15 events per cycle, a 14× lower task throughput than a 16-tile Chronos accelerator. This shows why Chronos's distributed, high-throughput design approach is crucial. Moreover, Rahman et al. evaluated their design using a microbenchmark with long tasks and do not accelerate actual applications. Hence, they do not explore the subtle issues that arise when doing so, such as dealing with limited on-chip queue capacity.
FPGAs have also been used to accelerate architectural simulation. RAMP [60, 61] simulates multicore systems, and FireSim [38] simulates large, scale-out clusters. These systems use non-speculative CMB-style simulation, which may limit parallelism, and could benefit from Chronos's techniques.

8 Conclusion
We have presented Chronos, the first framework to build accelerators for applications with ordered speculative parallelism. Chronos makes speculative execution cheap by relying on SLOT, a new execution model that limits tasks to access a single read-write object, avoiding the need for cache coherence. We implement Chronos on an FPGA and use it to accelerate several challenging applications in graph analytics and simulation. We deploy these accelerators on commodity AWS FPGAs, where we demonstrate a 5.4× gmean speedup for the same applications over their software-parallel versions.

Acknowledgments
We sincerely thank Mark Jeffrey, Victor Ying, Joel Emer, Po-An Tsai, Anurag Mukkara, Guowei Zhang, Quan Nguyen, Hyun Ryong Lee, Keiko Yamaguchi, Shuichi Konami, and the anonymous reviewers for their helpful feedback. This work was supported in part by NSF grants CAREER-1452994 and SHF-1814969, NSF/SRC grant E2CDA-1640012, and by a Sony research grant.

A Artifact Appendix

A.1 Abstract
Our artifact consists of the source code for the Chronos FPGA acceleration framework; pre-compiled FPGA images for our evaluated configurations (to facilitate a quick evaluation); and scripts to set up the development environment, compile the images from source code, run the experiments in the paper, and regenerate the graphs.
This appendix describes how to use Chronos to reproduce the paper's results, and explains how to set up and run other Chronos configurations and experiments. All experiments are run on the Amazon AWS f1.2xlarge instance, configured using the Amazon-provided FPGA Developer AMI.

A.2 Artifact check-list (meta-information)
• Compilation: Xilinx Vivado, GNU RISC-V embedded GCC compiler.
• Run-time environment: Amazon AWS FPGA instance.
• Hardware: Xilinx UltraScale+ VU9P.
• How much disk space required (approximately)?: 2 GB.
• How much time is needed to prepare workflow (approximately)?: Approx. 1 hour.
• How much time is needed to complete experiments (approximately)?: 2 weeks to reproduce the full results from scratch, or 2 hours if using the precompiled images. The tutorials (Sec. A.7) take about 2 days each, or 2 hours if using precompiled images.
• Publicly available?: Yes.
• Code licenses (if publicly available)?: GPL v2.
• Archived (provide DOI)?: 10.5281/zenodo.3558760

A.3 Description
A.3.1 How delivered. Our artifact can be downloaded from https://doi.org/10.5281/zenodo.3558760 as a .zip file.
A.3.2 Hardware dependencies. Chronos is designed to run on an Amazon AWS f1.2xlarge instance configured with the Amazon FPGA Developer AMI.
A.3.3 Software dependencies. The main dependency is Xilinx Vivado 2018.2, which comes with the FPGA Developer AMI. The RISC-V Chronos variant relies on the GNU RISC-V embedded GCC compiler.
A.3.4 Data sets. For small testing runs, we include scripts to generate synthetic datasets. The experiments in the paper use large, publicly available datasets from other projects. Since these datasets are large and publicly available, they are not included directly in the artifact code; instead, the artifact includes scripts to download them. These datasets are also archived, with the DOI 10.5281/zenodo.3563178.

A.4 Installation
1. Launch an AWS f1.2xlarge instance using the Amazon FPGA Developer AMI. Log into the instance.
2. Extract the Chronos artifact .zip file, and navigate to its base directory.
3. Run source install.sh. This will clone the Amazon FPGA SDK repository and install the necessary drivers.
4. Run aws configure to set up the instance with your AWS credentials.
5. (Optional) Install the GNU RISC-V embedded GCC compiler within the instance (https://xpack.github.io/riscv-none-embed-gcc/). This step is optional because the distribution already includes the pre-compiled RISC-V binaries necessary for the workflow.

A.5 Experiment workflow
We provide an automated workflow to validate the main results in the paper from scratch. Note that this process involves synthesizing multiple Chronos instances for each application, a process that takes about two weeks to complete. To facilitate a quick evaluation, we also provide precompiled FPGA images of the Chronos instances; when using these images, reproducing the results takes about two hours.
The cl_chronos/validation/scripts/ directory contains the necessary scripts to validate the results from the paper. The full process is explained in comments in the master script run_validation.py.
To run all experiments from scratch, run:
python run_validation.py
To run all experiments with precompiled images, run:
python run_validation.py --precompiled
This will download a list of precompiled image IDs from a shared S3 bucket and run the rest of the workflow.
Sec. A.7 includes two smaller tutorials on using Chronos, which can be completed in about 2 hours.

A.6 Evaluation and expected result
Running run_validation.py will generate all evaluation plots (Figures 10-14).

A.7 Experiment customization
This section provides two smaller tutorials on using Chronos. First, we illustrate the SLOT programming model using a sample application running on a Chronos instance with RISC-V soft cores. Second, we describe how to generate Chronos instances with specialized cores.
Before starting either tutorial, run source aws_setup.sh to configure the necessary environment variables and to define the $CL_DIR environment variable to point to the cl_chronos subdirectory. Please see the README.txt there for more detailed information, including topics not covered in this workflow, such as how to simulate Chronos RTL and how to debug Chronos.
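The synthetic-dataset scripts mentioned in A.3.4 (and the graph_gen tool used in Tutorial 1) produce small test graphs. As a rough illustration of what such a generator does — the actual tool's algorithm and file format are not specified here, so the "n_vertices n_edges" header plus "src dst weight" edge-list format below is purely an assumption — here is a minimal sketch:

```python
import random

# Illustrative sketch only: emit a small random weighted graph as an
# edge list. The real graph_gen tool's output format may differ; the
# header/edge-line format used here is an assumption for illustration.
def gen_graph(n_vertices, n_edges, max_weight=100, seed=0):
    rng = random.Random(seed)             # seeded for reproducible test inputs
    lines = [f"{n_vertices} {n_edges}"]   # assumed header: vertex and edge counts
    for _ in range(n_edges):
        src = rng.randrange(n_vertices)
        dst = rng.randrange(n_vertices)
        lines.append(f"{src} {dst} {rng.randint(1, max_weight)}")
    return "\n".join(lines)

print(gen_graph(8, 4))   # one header line followed by 4 edge lines
```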
A.7.1 Tutorial 1: Chronos using RISC-V soft cores
Step 1: Generate a test graph. The graph_gen tool can be used to generate test graphs to test our implementation of sssp.
cd $CL_DIR/tools/graph_gen
make
Next, compile and run the test_chronos program that transfers the input graph to the FPGA, collects results, and analyzes performance.
cd $CL_DIR/software/runtime
make
./test_chronos --n_tiles=1 sssp
[7] Christopher D. Carothers, David Bauer, and Shawn Pearce. 2000. ROSS: A High-performance, Low Memory, Modular Time Warp System. In Proc. of the 14th Workshop on Parallel and Distributed Simulation (PADS).
[8] Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan G. Bronson, Christos Kozyrakis, and Kunle Olukotun. 2011. Hardware Acceleration of Transactional Memory on Commodity Systems. In Proc. of the 16th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVI).
[9] Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. 2007. A scalable, non-blocking approach to transactional memory. In Proc. of the 13th IEEE intl. symp. on High Performance Computer Architecture (HPCA-13).
[10] K. Mani Chandy and Jayadev Misra. 1981. Asynchronous distributed simulation via a sequence of parallel computations. Commun. ACM 24, 4 (1981).
[11] Tao Chen, Shreesha Srinath, Christopher Batten, and G. Edward Suh. 2018. An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware. In Proc. of the 51st annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-51).
[12] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proc. of the 47th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-47).
[13] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proc. of the 43rd annual Intl. Symp. on Computer Architecture (ISCA-43).
[14] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search. In Proc. of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
[15] C. Demetrescu, A. Goldberg, and D. Johnson. 2006. 9th DIMACS Implementation Challenge: Shortest Paths. http://www.dis.uniroma1.it/~challenge9
[16] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the Cilk-5 multithreaded language. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI).
[17] R. M. Fujimoto, J.-J. Tsai, and G. C. Gopalakrishnan. 1992. Design and evaluation of the rollback chip: special purpose hardware for Time Warp. IEEE Trans. Comput. 41, 1 (1992).
[18] Wilson W. L. Fung and Tor M. Aamodt. 2013. Energy Efficient GPU Transactional Memory via Space-Time Optimizations. In Proc. of the 46th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-46).
[19] Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, and Tor M. Aamodt. 2011. Hardware Transactional Memory for GPU Architectures. In Proc. of the 44th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-44).
[20] Epifanio Gaona-Ramirez, Rubén Titos-Gil, Juan Fernandez, and Manuel E. Acacio. 2010. Characterizing energy consumption in hardware transactional memory systems. In Proc. of the 22nd symp. on Computer Architecture and High Performance Computing (SBAC-PAD 22).
[21] María Jesús Garzarán, Milos Prvulovic, José María Llabería, Víctor Viñals, Lawrence Rauchwerger, and Josep Torrellas. 2003. Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. In Proc. of the 9th IEEE intl. symp. on High Performance Computer Architecture (HPCA-9).
[22] Mordechai Haklay and Patrick Weber. 2008. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008).
[23] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. 2016. Graphicionado: A High-performance and Energy-efficient Accelerator for Graph Analytics. In Proc. of the 49th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-49).
[24] Lance Hammond, Mark Willey, and Kunle Olukotun. 1998. Data speculation support for a chip multiprocessor. In Proc. of the 8th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII).
[25] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. 2004. Transactional memory coherence and consistency. In Proc. of the 31st annual Intl. Symp. on Computer Architecture (ISCA-31).
[26] Tim Harris, James Larus, and Ravi Rajwar. 2010. Transactional memory. Synthesis Lectures on Computer Architecture (2010).
[27] William Hasenplaugh, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. 2014. Ordering heuristics for parallel graph coloring. In Proc. of the 26th ACM Symp. on Parallelism in Algorithms and Architectures (SPAA).
[28] Muhammad Amber Hassaan, Martin Burtscher, and Keshav Pingali. 2011. Ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms. In Proc. of the ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP).
[29] Maurice Herlihy and J. Eliot B. Moss. 1993. Transactional memory: Architectural support for lock-free data structures. In Proc. of the 20th annual Intl. Symp. on Computer Architecture (ISCA-20).
[30] Syed Ali Raza Jafri, Gwendolyn Voskuilen, and T. N. Vijaykumar. 2013. Wait-n-GoTM: improving HTM performance by serializing cyclic dependencies. In Proc. of the 18th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVIII).
[31] D. Jefferson, B. Beckman, F. Wieland, L. Blume, and M. Diloreto. 1987. Time Warp Operating System. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles.
[32] David R. Jefferson. 1985. Virtual time. ACM TOPLAS 7, 3 (1985).
[33] Mark C. Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, and Daniel Sanchez. 2016. Data-centric execution of speculative parallel programs. In Proc. of the 49th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-49).
[34] Mark C. Jeffrey, Suvinay Subramanian, Cong Yan, Joel Emer, and Daniel Sanchez. 2015. A scalable architecture for ordered parallelism. In Proc. of the 48th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-48).
[35] Mark C. Jeffrey, Victor A. Ying, Suvinay Subramanian, Hyun Ryong Lee, Joel Emer, and Daniel Sanchez. 2018. Harmonizing Speculative and Non-Speculative Execution in Architectures for Ordered Parallelism. In Proc. of the 51st annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-51).
[36] Mark T. Jones and Paul E. Plassmann. 1993. A Parallel Graph Coloring Heuristic. SIAM J. Sci. Comput. 14, 3 (1993).
[37] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proc. of the 44th annual Intl. Symp. on Computer Architecture (ISCA-44).
[38] Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanović. 2018. FireSim: FPGA-accelerated Cycle-exact Scale-out System Simulation in the Public Cloud. In Proc. of the 45th annual Intl. Symp. on Computer Architecture (ISCA-45).
[39] Jure Leskovec and Andrej Krevl. 2014. SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data.
[40] Zhaoshi Li, Leibo Liu, Yangdong Deng, Shouyi Yin, Yao Wang, and Shaojun Wei. 2017. Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware. In Proc. of the 44th annual Intl. Symp. on Computer Architecture (ISCA-44).
[41] Kyle Locke. 2011. Parameterizable Content-Addressable Memory. Xilinx Application Note (2011).
[42] Xiaoyu Ma, Dan Zhang, and Derek Chiou. 2017. FPGA-Accelerated Transactional Execution of Graph Workloads. In Proc. of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
[43] Steve Margerm, Amirali Sharifian, Apala Guha, Arrvindh Shriraman, and Gilles Pokam. 2018. TAPAS: Generating Parallel Accelerators from Parallel Programs. In Proc. of the 51st annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-51).
[44] Ulrich Meyer and Peter Sanders. 1998. Delta-Stepping: A Parallel Single Source Shortest Path Algorithm. In Proc. of the 6th Annual European Symposium on Algorithms (ESA).
[45] Vincent Mirian and Paul Chow. 2012. FCache: A System for Cache Coherent Processing on FPGAs. In Proc. of the 2012 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA).
[46] Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark D. Hill, and David Wood. 2006. LogTM: Log-based transactional memory. In Proc. of the 12th IEEE intl. symp. on High Performance Computer Architecture (HPCA-12).
[47] Njuguna Njoroge, Jared Casper, Sewook Wee, Yuriy Teslyar, Daxia Ge, Christos Kozyrakis, and Kunle Olukotun. 2007. ATLAS: A Chip-multiprocessor with Transactional Memory Support. In Proc. of the conf. on Design, Automation and Test in Europe (DATE).
[48] Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proc. of the 44th annual Intl. Symp. on Computer Architecture (ISCA-44).
[49] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. 2016. Energy Efficient Architecture for Graph Analytics Accelerators. In Proc. of the 43rd annual Intl. Symp. on Computer Architecture (ISCA-43).
[50] Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. 2011. The tao of parallelism in algorithms. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI).
[51] Shafiur Rahman, Nael Abu-Ghazaleh, and Walid Najjar. 2017. PDES-A: A Parallel Discrete Event Simulation Accelerator for FPGAs. In Proc. of the ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS).
[52] Ravi Rajwar and James R. Goodman. 2002. Transactional lock-free execution of lock-based programs. In Proc. of the 10th intl. conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X).
[53] Jose Renau, Karin Strauss, Luis Ceze, Wei Liu, Smruti Sarangi, James Tuck, and Josep Torrellas. 2005. Thread-level speculation on a CMP can be energy efficient. In Proc. of the Intl. Conf. on Supercomputing (ICS'05).
[54] Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models. In Proc. of the 48th annual IEEE/ACM intl. symp. on Microarchitecture (MICRO-48).
[55] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 1995. Multiscalar processors. In Proc. of the 22nd annual Intl. Symp. on Computer Architecture (ISCA-22).
[56] SpinalHDL. 2018. A FPGA friendly 32 bit RISC-V CPU implementation. https://github.com/SpinalHDL/VexRiscv.
[57] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. 2000. A scalable approach to thread-level speculation. In Proc. of the 27th annual Intl. Symp. on Computer Architecture (ISCA-27).
[58] Suvinay Subramanian. 2018. Architectural Techniques to Unlock Ordered and Nested Speculative Parallelism. Ph.D. Dissertation. Massachusetts Institute of Technology.
[59] Suvinay Subramanian, Mark C. Jeffrey, Maleen Abeydeera, Hyun Ryong Lee, Victor A. Ying, Joel Emer, and Daniel Sanchez. 2017. Fractal: An execution model for fine-grain nested speculative parallelism. In Proc. of the 44th annual Intl. Symp. on Computer Architecture (ISCA-44).
[60] Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry Cook, David Patterson, and Krste Asanović. 2010. RAMP gold: an FPGA-based architecture simulator for multiprocessors. In Proc. of the 47th Design Automation Conf. (DAC-47).
[61] John Wawrzynek, David Patterson, Mark Oskin, Shih-Lien Lu, Christoforos Kozyrakis, James C. Hoe, Derek Chiou, and Krste Asanovic. 2007. RAMP: Research accelerator for multiple processors. IEEE Micro 27, 2 (2007).
[62] Sewook Wee, Jared Casper, Njuguna Njoroge, Yuriy Tesylar, Daxia Ge, Christos Kozyrakis, and Kunle Olukotun. 2007. A Practical FPGA-based Framework for Novel CMP Research. In Proceedings of the 2007 ACM/SIGDA 15th International Symposium on Field Programmable Gate Arrays (FPGA).
[63] H. J. Yang, K. Fleming, M. Adler, and J. Emer. 2014. LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories. In Proc. of the Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).
[64] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill, Michael M. Swift, and David A. Wood. 2007. LogTM-SE: Decoupling hardware transactional memory from caches. In Proc. of the 13th IEEE intl. symp. on High Performance Computer Architecture (HPCA-13).
[65] Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas. 1998. Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors. In Proc. of the 4th IEEE intl. symp. on High Performance Computer Architecture (HPCA-4).