
Speeding Up FPGA Placement: Parallel Algorithms and Methods

Matthew An, J. Gregory Steffan, Vaughn Betz
Department of Electrical and Computer Engineering, University of Toronto, Ontario, Canada
{ansiyuan, steffan, vaughn}@eecg.toronto.edu

Abstract—Placement of a large FPGA design now commonly requires several hours, significantly hindering designer productivity. Furthermore, FPGA capacity is growing faster than CPU speed, which will further increase placement time unless new approaches are found. Multi-core processors are now ubiquitous, however, and some recent processors also have hardware support for transactional memory (TM), making parallelism an increasingly attractive approach for speeding up placement. We investigate methods to parallelize the simulated annealing placement in VPR, which is widely used in FPGA research. We explore both algorithmic changes and the use of different parallel programming paradigms and hardware, including TM, thread-level speculation (TLS) and lock-free techniques. We find that hardware TM enables large speedups (8.1x on average), but compromises "move fairness" and leads to an unacceptable quality loss. TLS scales poorly, with a maximum 2.2x speedup, but preserves quality. A new dependency checking parallel strategy achieves the best balance: the deterministic version achieves 5.9x speedup and no quality loss, while the non-deterministic, lock-free version can scale to a 34x speedup.

Keywords—FPGA placement; parallel placement; simulated annealing; transactional memory

I. INTRODUCTION

Over the last 15 years, the size of FPGA devices has been growing at nearly four times the rate of single-core CPU performance [1]. As a result, CAD tools typically spend hours compiling large designs targeting modern FPGAs. For example, a recent study of 21 large FPGA benchmarks found that placement is the most time-consuming CAD step and comprised 49% of total compile time in Altera's Quartus II CAD system [2]. The largest design that would fit in a 40 nm FPGA required over 16 hours to place.

Quartus II uses simulated annealing (SA) placement [1], which is commonly used for FPGAs because it handles both legality constraints and non-linear delay functions well. Analytic placement is also used commercially [3], but it is usually paired with annealing for fine-tuning the result [4], [5]. Consequently, reducing the computation time of SA using readily available parallel hardware is an attractive option. However, SA makes a series of local improvement "decisions" to generate a high-quality placement, and as one decision impacts subsequent decisions, the algorithm is naturally sequential and parallelization is non-trivial.

There have been several recent efforts in parallelizing SA placement for FPGAs [1], [6], [7]. We evaluate new parallel approaches that build on these prior techniques, and also leverage new processor features such as transactional memory (TM) and thread-level speculation (TLS) that aim to make parallel programming easier. Our contributions include a quantitative comparison of the speedup and quality-of-results obtained with various parallel algorithmic and programming approaches. We find that while TM and TLS simplify parallel programming, neither can achieve a compelling combination of speedup and placement quality. Our best algorithms require more programming effort than TM or TLS, but outperform prior approaches: without loss of placement quality, we can reach 5.9x speedup with a deterministic algorithm and 34x speedup with a non-deterministic one.

This paper is organized as follows. We first detail the relevant prior work, and then outline the three broad parallel approaches we investigate. We then quantitatively compare the result quality and speedup achieved with each approach, and finally conclude.

II. BACKGROUND

A. Simulated Annealing Placement

SA mimics the process of controlled cooling in metallurgy to produce high-quality objects [8]. The placer begins with a random placement of blocks on the FPGA and evaluates a large number of perturbations to the placement, called moves. In VPR [9], moves can be split into two phases. During proposal, a block and a destination are randomly chosen. During evaluation, the block is moved to its destination (if the destination already contains a block, the two blocks are swapped) and the change in placement cost (according to some cost function) is computed. Moves that decrease the cost of the overall placement are always accepted, while those that increase cost are still accepted with some probability to avoid being trapped in a local minimum of the cost function.

B. Prior Work in Parallel SA

A simple way to parallelize SA is to distribute the work within one move across multiple threads, but there is insufficient parallelism for this approach to scale well [10]. A more effective way is to distribute moves among multiple threads to be evaluated in parallel. However, conflicts can occur when two threads attempt to move the same block to different locations (Figure 1a), different blocks to the same location (Figure 1b), or blocks connected to the same net, which in some cases will cause both threads to incorrectly evaluate the cost function for this net (Figure 1c). To ensure that the final placement is valid, the parallel algorithm must either detect and resolve conflicts [1], [7], or prevent them by guaranteeing the independence of moves being evaluated in parallel [11], [6].

Fig. 1. Examples of move conflicts. Arrows represent moves being evaluated in parallel.

An early parallel implementation for standard cell placement is in the TimberWolf placement and routing package [11]. It partitions the chip to ensure that moves do not conflict. A related approach for FPGAs was developed in [6], where each thread generates moves in a local region distinct from those of all other threads, and also uses stale location data for the placement of blocks outside its region. Region boundaries move periodically to allow blocks to move outside their original regions. Threads also periodically broadcast updated placement locations to other threads to limit how "stale" their data can be. This algorithm scales very well (51x speedup over sequential VPR, with 16 threads) and is deterministic, but incurs a 10% quality loss.

The parallelized Quartus II placer (Q2P) speculatively evaluates moves in parallel and uses a manually coded dependency checker to detect conflicts [1]. Whenever possible, it resolves conflicts by providing speculative moves with updated information and repairing them. Q2P attains limited speedup (2.4x on 8 threads), but it is deterministic and maintains the same placement quality as the original sequential algorithm. The conflict detection and resolution components are more difficult to code than the partitioning-based approaches above.

Transactional VPR (TVPR) also evaluates parallel moves speculatively, and leverages software TM to automatically detect and resolve conflicts [7]. TVPR scales better than Q2P (self-relative speedup of 7.3x on 8 threads), but suffers from excessive TM overhead and only attains real speedup for some benchmarks, averaging about 0.9x across all tested benchmarks. The average quality loss of 1% is negligible. The use of TM makes TVPR non-deterministic, but easier to code than all of the placers mentioned above.

C. Transactional Memory

TM provides atomicity, consistency, and isolation for arbitrary sections of code accessing shared memory, making them behave like database transactions [12]. TM makes it possible to parallelize a program by dividing it into tasks, assigning the tasks to all available threads, and executing each task as a transaction. Since the independence of tasks is not known ahead of time, the TM system must monitor all memory accesses made by transactions and ensure that transactions which conflict are aborted and rolled back to preserve correctness. A conflict occurs when multiple transactions access the same memory location, with at least one transaction writing to it. If a transaction completes with no conflict, it can commit all its changes to main memory and make them visible to other threads.

Software TM (STM) implements tracking, conflict detection, and rollback entirely in code. To apply STM to a program, the compiler or programmer inserts extra instructions to track memory accesses made inside transactions. Hardware TM (HTM) relies on extensions to the memory subsystem to detect conflicts and perform rollbacks. Automatic tracking by the hardware provides two major advantages: tracking instructions become unnecessary and the programmer only needs to annotate transaction boundaries; and the overhead associated with tracking becomes minimal. The authors of TVPR observed that TVPR should attain much higher speedups using HTM [7]. However, unlike STM, hardware has limited capacity and hence the amount of data that can be tracked by any HTM system is limited. Transactions that reach this limit must be aborted and retried non-transactionally, which prevents parallel execution and negatively impacts performance.

Some of the latest multi-core CPUs, in particular IBM's Blue Gene/Q [13] and Intel's Haswell [14], now incorporate HTM support, motivating investigation into the utility of HTM for parallelizing CAD algorithms.

III. PARALLEL PLACERS

We present three different parallel SA algorithms, some of which are coded with multiple techniques. For algorithms with several implementations, we name them as Algorithm-Technique, e.g. MoveSpec-STM, which is Move Speculation implemented using software transactional memory.

A. Move Speculation

This is the approach taken by TVPR. Entire moves are made speculative by enclosing them in transactions, and the TM system automatically detects conflicts between them. On a conflict, one of the conflicting moves is aborted and rolled back by the TM system, so the other move still has a consistent view of global state. If a move successfully commits, any changes it makes to the placement will be valid. Figure 2 shows an example:

1) Thread 1 starts transaction T1 and proposes to move block A, while thread 2 starts transaction T2 and proposes to move block B. Neither transaction can see the conflict with connection AC because they have not written anything to memory yet.
2) T1 moves A to its destination and evaluates new costs for connections AB and AC. At the same time, T2 moves B to its destination. Now the conflict is detected by the TM system because T1 is reading the location of B while T2 is writing to it. Either T1 or T2 must abort, and the programmer cannot specify which one, so the behavior of Move Speculation is fundamentally non-deterministic.
3) If T1 aborts:
   a) A is restored to its original location so that T2 can correctly compute the new cost of connection AB. Cost computations done by T1 are discarded. These include the cost of connection AC, even though it was not involved in the conflict.
   b) T1 is re-executed with a new move proposal.
4) If T2 aborts:
   a) B is restored to its original location so that T1 can correctly compute the new cost of connection AB.

Fig. 2. Illustration of Move Speculation with two threads. Objects in red (dashed lines) have been modified by transactions and changes to them have not been committed to memory.

   b) T2 is re-executed. Since each thread uses a separate random number generator to propose moves, the new move for T2 is not the same as the one for T1.

While it is possible to re-propose the original unsuccessful move, the placer could get stuck in an infinite loop of rolling back and retrying two transactions that repeatedly conflict with each other. The solution in TVPR is to always abandon the unsuccessful move and propose a new move when a transaction is retried. This causes the entire set of moves to change depending on when and where conflicts occur, and the result is favoritism (called swap favoritism by the TVPR authors): the placer favors shorter, simpler moves that access less memory and have a lower probability of encountering conflicts, while larger moves are frequently rolled back and never attempted again.

1) MoveSpec-STM: MoveSpec-STM is our implementation of Move Speculation by applying STM to VPR 6. MoveSpec-STM uses the same programming techniques as TVPR: throughout the placement code, we annotate transaction boundaries and all shared memory accesses within the transaction. This is necessary because STM only detects conflicts between annotated memory accesses. Annotations for TVPR were also added manually due to limited compiler support when TVPR was being developed. Searching through all functions called during move evaluation and annotating them takes a significant amount of programming effort. However, it provides opportunities to reduce the annotations to a minimum and hence lower the overhead associated with STM tracking. In MoveSpec-STM, we only annotate accesses that are essential for conflict detection (which means read-only accesses are left out) and identify data structures being used as scratch space, which can also be left out and instead replicated for each thread (privatized) to eliminate false conflicts.

An important optimization from TVPR, which we also apply to all of our parallel placers, is ignoring big nets. This is based on the observation that high fan-out nets tend to have very little impact on placement quality because they usually cannot be localized well and hence do not cause significant cost changes. By ignoring cost computations for these nets, the parallel placer can avoid conflicts associated with them and reduce the frequency of transaction rollbacks. TVPR uses a size threshold of 10%, which means that nets containing more than 10% of all the blocks in the design are ignored.

2) MoveSpec-HTM: MoveSpec-HTM is our implementation of Move Speculation using HTM. As discussed in Section II-C, the key advantages of using HTM are ease of programming and potentially better performance than STM. Since we cannot instruct the hardware to leave out certain memory locations for tracking, we cannot manually apply annotations to further reduce tracking overhead.

3) MoveSpec-TLS: Thread-level speculation (TLS) can be viewed as an ordered (hence deterministic) version of TM. The transaction being executed by the currently oldest thread is guaranteed to commit; and before younger threads are allowed to commit, they must wait for older threads to commit first. As a result, the commit order of transactions is always the same, regardless of the number of threads used. It also becomes impossible to abandon moves on conflict, so TLS can only exploit parallelism between consecutively proposed moves, not actively find parallelism the way MoveSpec with TM does. Our implementation of Move Speculation with TLS is called MoveSpec-TLS, and the only programming difference from MoveSpec-HTM is the type of annotation for speculative code.

B. Proposal Speculation

Abandoning moves that conflict is effective at finding parallelism, but wastes work done up to the point of detecting the conflict. We improve on this method by making conflict detection as early as possible. During move proposal, the placer identifies all dependencies of this move—blocks, grid locations, and nets this move will affect—and tags them as "in-use" or "reserved". The placer encounters a conflict if it tries to tag a dependency that has already been tagged. When a conflict occurs, the current move proposal is abandoned by untagging anything tagged so far. On the other hand, if tagging is successful, the placer can safely evaluate the proposed move concurrently with other moves in progress, because all dependencies for this move (and others in progress) have already been exclusively reserved. Conflicts during evaluation are effectively prevented. At the end of move evaluation, the dependencies are untagged. We call this algorithm Proposal Speculation, or PropSpec for short. Figure 3 shows an example:

1) Net AB connects blocks A and B, while net ACD connects blocks A and D to C. Thread 1 (T1) proposes to move A, while thread 2 (T2) proposes to move B.
2) T1 successfully tags A and its destination, then tries to tag all nets connected to A (AB and ACD). T2 successfully tags B and its destination, then tries to tag AB. Execution then depends on which thread tags AB first.
3) If T2 tags AB first:
   a) T1 will be unable to tag AB. T1 then abandons its move and untags A and its destination. If T1 has tagged ACD already, it will also untag ACD.
   b) T2 proceeds to evaluate its move while T1 proposes a new move.
4) If T1 tags AB first:
   a) T2 will be forced to abandon its move and untag B and its destination.
   b) T1 tags ACD and proceeds to evaluate its move. Meanwhile, T2 proposes a new move. T2 will conflict with T1 again because D is part of ACD, even though the two moves can safely execute in parallel. (T2 will not affect any bounding box or connection costs that T1 will compute.) Note that because conflicting proposals get abandoned, PropSpec can be subject to favoritism.

Fig. 3. Illustration of Proposal Speculation with two threads. Objects in red (dashed lines) are tagged and the tagging can be done directly in memory.

Step 4b shows that the actual dependencies of a move may be specific (not all) terminals of a net, so conflict detection using entire nets is more coarse-grained and conservative, but much more efficient to tag and check given the fan-out of typical nets. Our dependency checking scheme is similar to the dependency checker from Q2P [1] but is more conservative and applied at the beginning instead of the end of move evaluation. As with Q2P, the programmer must correctly identify all dependencies for this scheme to work. Failing to tag a dependency allows another thread to potentially modify its associated data and render the proposed move illegal or improperly costed. Tagging and untagging actually take less coding than STM annotations, since the logic to determine dependencies is straightforward, consisting mainly of traversals through lists.

We implement PropSpec two ways. Since there is fine-grained parallelism between proposals, TM provides an easy way to exploit it. In PropSpec-HTM, we put the move proposal and tagging into a transaction and use HTM to detect tagging conflicts. (Note that we cannot use TLS because only move proposal is speculative but evaluation is not, and a speculative thread cannot switch to non-speculative mode before committing the entire evaluation.) In PropSpec-LF, we exploit the parallelism manually using lock-free techniques [15]: tagging is done using compare-and-swap operations, with failed operations indicating conflicts. Because we cannot leverage the TM rollback mechanism to resolve conflicts here, we manually write rollback code to untag any dependencies that have been tagged so far. Consequently, PropSpec-LF takes the most programming effort among all the algorithms we present.

C. Serialized Proposals

Like MoveSpec, PropSpec is fundamentally non-deterministic because it does not enforce any ordering between threads. Non-determinism is an undesirable characteristic for FPGA CAD tools because it makes debugging difficult, and is unacceptable for designers such as high-security FPGA users, who expect the same results every time they use the tool. To make PropSpec deterministic, we simply serialize the move proposals. We designate one master thread to do all the proposing, including tagging, conflict detection, and untagging. All other available threads are designated as workers and only perform move evaluations. Like Q2P, we assign each move a sequential ID. Proposed moves are inserted into a task queue of length L. Since these moves are independent, they can be evaluated by any worker in any order. Initially, moves are inserted until the queue is full. To maintain determinism, workers do not untag dependencies after evaluating a move, and finished moves only leave the queue when they reach its front. When a finished move leaves the queue, its dependencies are untagged and a new move must be inserted to keep the queue full. The constraints above ensure that before every proposal, the master always sees the dependencies of the previous L − 1 moves tagged, and everything else untagged.

TABLE I. QUALITATIVE COMPARISON OF OUR PARALLEL PLACERS.

Name          Deterministic?  Programming Effort
MoveSpec-STM  No              Medium
MoveSpec-HTM  No              Low
MoveSpec-TLS  Yes             Low
PropSpec-HTM  No              High
PropSpec-LF   No              Highest
SerialProp    Yes             High
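As an illustration of the tag/untag protocol from Section III-B, the following Python sketch stands in for the compare-and-swap tagging of PropSpec-LF. All names here are ours and hypothetical (this is not VPR code); `dict.setdefault`, which is atomic in CPython, plays the role of the compare-and-swap claim, and the manual rollback on a failed claim mirrors the hand-written untagging described above.

```python
class DependencyTagger:
    # Sketch of PropSpec's tag/untag protocol. A failed claim (the slot
    # already holds another move's ID) indicates a conflict, in which case
    # everything tagged so far is rolled back by hand.
    def __init__(self):
        self.tags = {}  # dependency -> ID of the move that reserved it

    def try_tag_all(self, move_id, deps):
        tagged = []
        for dep in deps:
            owner = self.tags.setdefault(dep, move_id)  # CAS-like claim
            if owner != move_id:
                # Conflict: another in-flight move owns this dependency.
                self.untag(tagged)  # manual rollback of partial tagging
                return False
            tagged.append(dep)
        return True

    def untag(self, deps):
        for dep in deps:
            self.tags.pop(dep, None)
```

In the Figure 3 scenario, T1 would claim {A, its destination, AB, ACD}; T2's later attempt to claim AB fails, so T2's partial tags are removed and it proposes a new move.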

Table I shows the key characteristics of our parallel placer implementations.
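The queue discipline of Serialized Proposals can be illustrated with a single-threaded Python simulation (all names are ours and hypothetical, not VPR code; conflicting candidates are simply skipped in this sketch). It demonstrates the key property argued above: because finished moves retire only from the front of the queue, the sequence of proposals does not depend on the order in which workers happen to finish.

```python
from collections import deque

def serialprop_sim(candidates, L, finish_order):
    # Simulates SerialProp's master thread. `candidates` is the
    # deterministic stream of (name, dependency-set) proposals;
    # `finish_order` is the order in which workers finish evaluations.
    tags = set()       # union of dependencies of in-flight moves
    queue = deque()    # in-flight moves, oldest at the front
    finished = set()
    proposed = []
    cand_iter = iter(candidates)

    def refill():
        # The master proposes until the queue is full, abandoning any
        # candidate that conflicts with a currently tagged dependency.
        while len(queue) < L:
            for name, deps in cand_iter:
                if tags.isdisjoint(deps):
                    tags.update(deps)
                    queue.append((name, deps))
                    proposed.append(name)
                    break
            else:
                return  # no candidates left

    refill()
    for name in finish_order:  # workers finish in arbitrary order...
        finished.add(name)
        # ...but moves retire only from the front, so dependencies are
        # untagged in a fixed order regardless of evaluation timing.
        while queue and queue[0][0] in finished:
            _, deps = queue.popleft()
            tags.difference_update(deps)
            refill()
    return proposed
```

Running the simulation with two different finish orders yields the same proposal sequence, which is the determinism argument of Section III-C in miniature.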

Consequently, the placement state before every proposal only depends on L and not the number of threads, so the algorithm is deterministic. We call this algorithm Serialized Proposals, or SerialProp for short. Figure 4 shows an example with L = 2:

1) The master proposes moves M1 and M2, tags their dependencies, and inserts them into the queue. It cannot propose another move because the queue is full. Workers T2 and T3 each choose an arbitrary move from the queue to evaluate.
2) M2 is accepted first. Since M2 is not at the front of the queue, it cannot leave and its dependencies are still tagged.
3) M1 is accepted. It leaves the queue and its dependencies are untagged. M2 reaches the front but is not allowed to leave early because there is space in the queue. Now the master tries to propose M3, but encounters a conflict.
4) M3 is re-proposed successfully, tagged, and inserted. Now M2 can leave because the queue has filled up. Note that the final proposal of M3 would be the same if M1 had been accepted first.

Fig. 4. Illustration of Serialized Proposals with two workers and a task queue of length 2. Each diagram shows tagged dependencies (dashed lines) along with the corresponding moves in the queue.

Using the dependency checking infrastructure from PropSpec, the implementation of SerialProp is straightforward. The task queue is easily specified using OpenMP [16], and the only communication between threads that must be explicitly coded is for workers to notify the master that they have finished evaluating a particular move.

IV. METHODOLOGY

We implemented all of our parallel placers by modifying the sequential placer in VPR 6. For MoveSpec-STM, we used TinySTM 1.0.4 [17], a newer version of the STM used in TVPR. We conducted all experiments on an IBM Blue Gene/Q (BG/Q) compute node containing 16 cores. Each core is threaded 4 ways for a total of 64 hardware threads. The BG/Q provides hardware support for both TM and TLS, and the IBM XL compiler makes it available to the programmer [13]. Following [18], we compiled all of our placers with the -O3 and -qhot optimization flags.

We ran all placers with default placement options, but reduced the level of effort by setting inner_num to 1. (This is the default for VPR 7.) For performance results, we measured the wall-clock time of placement only and excluded time taken by timing analysis. For quality of results, we used the placement estimates of wire length and critical path delay provided by VPR. All results were averaged over the 8 largest VTR benchmarks [19]. We used VPR to pack them for an FPGA architecture with 6-input LUTs and 10 LUTs per logic block, and the packed circuit sizes range from 1142 to 7761 blocks. All available heterogeneous block types (RAM, DSP, etc.) are used by at least one of the benchmarks.

V. RESULTS

Table II summarizes the speedup and quality-of-results for all of our parallel placers at their optimal number of threads. Speedup and quality metrics are calculated relative to sequential VPR 6. Some of the placers scale beyond the number of threads given, but they lose too much quality with more threads. We allow a maximum average increase of 20% in either wire length or critical path delay.

TABLE II. PERFORMANCE AND QUALITY RESULTS FOR EACH PLACER. ONLY PLACERS MARKED WITH * ARE DETERMINISTIC.

Name           Threads  Speedup  Wire Length  Critical Path Delay
MoveSpec-STM   16       3.6      +8.5%        +2.3%
MoveSpec-HTM   4        8.1      +19.0%       +1.7%
MoveSpec-TLS*  5        1.7      -0.1%        +0.2%
PropSpec-HTM   64       17.2     +0.8%        +0.7%
PropSpec-LF    64       33.9     +1.7%        +0.6%
SerialProp*    9        5.9      +0.0%        +0.1%

A. Move Speculation

Figure 5 shows the quality of placement obtained with MoveSpec-STM and MoveSpec-HTM. The run time for each is varied by changing the thread count from 2 to 64; quality varies with thread count due to move favoritism (Section III-A). Figure 5 also shows the quality vs. run time for sequential VPR in two modes: annealing and quench (T = 0). Run time for sequential VPR is varied by changing the placement effort level via inner_num [9]. Note that for very low run times, quenching results in a higher quality placement than annealing because there are too few moves for effective hill-climbing. Quality and run time values are normalized to those from sequential VPR in annealing mode with inner_num = 1.

Fig. 5. Placement quality vs. time for Move Speculation. Data labels indicate thread count.

1) MoveSpec-STM: The results for MoveSpec-STM are plotted as the STM quality curves in Figure 5. At 8 threads, we obtain a 1.7x speedup at a small quality loss of 1.6% wire length increase and 0.2% delay increase. This outperforms TVPR, which also uses STM and achieves a 0.9x speedup at 8 threads. Even though MoveSpec-STM tracks a minimal amount of data via careful annotations, the ratio of speedup to thread count is still low due to STM overhead. At 16 threads (Table II), the speedup improves to 3.6x but incurs a significant quality loss of 8.7% wire length and 2.3% delay. Speedup further increases to 11x at 64 threads but the quality loss becomes unacceptable. Due to move favoritism (Section III-A), the quality losses from using more threads are worse compared to sequential VPR at lower effort levels. In other words, running MoveSpec-STM on multiple threads always produces a lower quality placement than running VPR on one thread for the same amount of time. This is shown in Figure 5 by the STM quality curves being above the sequential (and quench) ones.

2) MoveSpec-HTM: We expected MoveSpec-HTM to behave similarly to MoveSpec-STM, albeit with much lower overhead. Unfortunately, this is not the case due to transactional capacity limits imposed by the hardware (Section II-C). The BG/Q only monitors transactions in its L2 cache, so a transaction with a memory footprint that does not fit in the cache will abort. Furthermore, since the cache is divided into non-overlapping sets (meaning that when one set is full, it must start evicting data; it cannot make room by transferring its data to another set), a transaction that overflows any set will be aborted. We suspect that even if most transactions do not conflict and would normally fit in the cache, they start to abort when many are executed concurrently because the cache sets cannot accommodate all of them together.

The net result for MoveSpec-HTM appears to be extreme favoritism, because only moves that do not overflow the cache can successfully evaluate. Effectively, the algorithm is unable to consider moves requiring a large amount of memory, which usually are the ones having the most positive impact on quality. A disproportionate number of small moves leads to super-linear speedup (8x on 4 threads), since the algorithm does less work to finish the annealing process. Accordingly, the 19% wire length increase at 4 threads is much worse than the 8.5% increase for MoveSpec-STM at 16 threads (Table II). Running MoveSpec-HTM with more threads only exacerbates the favoritism, as capacity limits increase the bias toward lower quality. As shown in Figure 5, the HTM curves rise more sharply than the STM ones, but all of their shapes are similar. In both cases, the quality curves are above the sequential ones, indicating that sequential VPR offers a better quality/run time trade-off.

3) MoveSpec-TLS: According to Table II, MoveSpec-TLS is the least scalable of our algorithms (1.7x speedup on 5 threads). However, it always gives results of the same quality regardless of thread count, and achieves an average quality equal to that of sequential VPR. Generally speaking, larger circuits have better performance than smaller ones. This is consistent with the observation from TVPR that parallel moves for large circuits are less likely to conflict [7].

The limited scaling of MoveSpec-TLS is due to its inability to actively search for parallel moves, since moves must complete in order. Stalling and wasted work caused by rollbacks also contribute to overhead. When placing the circuit with the best speedup among all tested benchmarks (2.2x), 5 threads only spend 57% of their time doing useful work. Overall performance is also reduced by HTM overhead, since TLS in the BG/Q is implemented on top of its HTM support. Our results broadly agree with the (simulated) 2x speedup on 4 threads obtained in [20] by manually applying TLS to a much older (SPEC 2000) version of VPR.

B. Proposal Speculation

According to Table II, this algorithm shows the best scaling among all of our algorithms, in addition to negligible losses in quality. At 64 threads, the lock-free implementation outperforms HTM, attaining a 34x speedup with a 1.7% wire length increase and 0.6% delay increase. However, both implementations of PropSpec scale equally well up to 16 threads. We suspect that the BG/Q HTM performance stops improving beyond 16 threads because the hardware only handles transaction retries at a limited rate, and is unable to keep up when many transactions (proposals) are frequently conflicting and aborting [18].

Despite the similarity of PropSpec to MoveSpec, its impact on quality is much less problematic. In Figure 6, the quality curves for both PropSpec implementations are nearly flat, which means that quality changes very little with more threads. The amount of shared data accessed during dependency checking in PropSpec is mainly determined by the number of nets affected, which varies much less across different moves than the number of block locations and connection cost changes that must be accessed transactionally in MoveSpec. Thus, the difference in memory footprint between small move proposals and large ones is not enough to cause much favoritism toward small moves, especially as the circuit grows in size and exhibits improved locality.

C. Serialized Proposals

As Table II shows, this algorithm preserves both quality and determinism while scaling reasonably well (5.9x speedup). Like MoveSpec-TLS, SerialProp always gives the same results regardless of thread count, and also achieves an average quality equal to that of sequential VPR. Eventually, move proposal becomes the bottleneck as it is all done by one thread, so adding workers only increases the proportion of time they spend waiting for successful proposals. For some circuits, moves tend to be quick to evaluate and this scaling limit is reached at 5 workers. The maximum number of useful workers is about 10, since none of the benchmarks have an evaluation-to-proposal time ratio greater than 10.

Fig. 6. Placement quality vs. time for Proposal Speculation. Data labels indicate thread count.

Fig. 7. Scaling behavior of Serialized Proposals. In the breakdown, "Success" means work during successful proposals, and "Failure" means work during proposals that were ultimately abandoned. The master thread spends all remaining time waiting for workers to finish move evaluation.

Figure 7 shows the correlation between overall speedup and load on the master thread. While the ratio of useful work (successful proposals) to wasted work (abandoned proposals) remains fairly consistent, the increase in work done by the master diminishes with each additional worker. This is because waiting for move evaluation is unavoidable: moves in the task queue can be evaluated in any order and not all of them will finish in the same amount of time, but the waiting is done in a fixed order. The final result is that beyond 3 workers, each additional worker contributes less performance gain.

We set the task queue length L to 12 for our experiments, but it should be possible to fine-tune this parameter in order to trade quality for performance. A shorter queue decreases the probability of conflict and hence reduces time spent on failed proposals. It also discourages favoritism, but the queue would empty out more frequently and force workers to wait longer. A longer queue keeps workers busy, but encourages favoritism and makes proposal a bottleneck at fewer threads.

VI. CONCLUSION

We have found that speculatively evaluating moves in parallel is an effective strategy for parallelizing SA. When conflict tracking is applied to the entire move (MoveSpec), we obtain two extremes: TM scales well but has too much favoritism and quality loss, while TLS scales poorly due to too many rollbacks. While TM and TLS certainly simplify the parallelization of placement, their use results in problematic behavior. This observation may hold true for other decision-based CAD algorithms as well.

By tracking conflicts only during proposal, our dependency checking scheme greatly reduces favoritism but takes more programming effort. The best non-deterministic implementation (PropSpec) attains a 34x speedup on 64 threads with less than 2% quality degradation. The speedup is comparable to approaches based on partitioning the move space across threads, but the quality loss is much smaller. The deterministic version (SerialProp) incurs no quality loss and can reach 5.9x speedup on 9 threads, which is better scaling than Q2P. This is mainly due to the coarsening of conflict detection and checking dependencies before move evaluation instead of allowing evaluation to be speculative.

There are several improvements we can make to our parallel placers: make proposals more efficient, modify the move generator to create more independent moves, and fine-tune the mechanism for ignoring big nets. Evaluating the scalability of our algorithms on the very large FPGA benchmarks from [2] is also important future work.

ACKNOWLEDGMENTS

We are grateful to NSERC and Altera for funding. We thank Amy Wang and Rahul Gopalkrishnan for providing insights into the HTM and TLS results, and also thank Henry Wong for improving the formulation of SerialProp. Computations were performed on the BGQ at the SciNet HPC Consortium. SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto.

REFERENCES

[2] K. E. Murray, S. Whitty, S. Liu, J. Luu, and V. Betz, "Titan: Enabling large and complex benchmarks in academic CAD," FPL, 2013.
[3] T. Feist, "Vivado design suite," 2012. Available: http://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf
[4] Q. Wu and K. McElvain, "A fast discrete placement algorithm for FPGAs," FPGA, 2012, pp. 115–118.
[5] T.-H. Lin, P. Banerjee, and Y.-W. Chang, "An efficient and effective analytical placer for FPGAs," DAC, 2013, p. 10.
[6] J. B. Goeders, G. G. Lemieux, and S. J. Wilton, "Deterministic timing-driven parallel placement by simulated annealing using half-box window decomposition," ReConFig, 2011, pp. 41–48.
[7] S. Birk, J. G. Steffan, and J. H. Anderson, "Parallelizing FPGA placement using transactional memory," FPT, 2010, pp. 61–69.
[8] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.
[9] V. Betz and J. Rose, "VPR: A new packing, placement and routing tool for FPGA research," FPL, 1997, pp. 213–222.
[10] A. Ludwin, V. Betz, and K. Padalia, "High-quality, deterministic parallel placement for FPGAs on commodity hardware," FPGA, 2008, pp. 14–23.
[11] W.-J. Sun and C. Sechen, "A parallel standard cell placement algorithm," TCAD, vol. 16, no. 11, pp. 1342–1357, 1997.
[12] T. Harris, J. Larus, and R. Rajwar, "Transactional memory," Synthesis Lectures on Computer Architecture, vol. 5, no. 1, pp. 1–263, 2010.
[13] M. Gilge, "IBM system Blue Gene solution: Blue Gene/Q application development," IBM Redbook Draft SG24-7948-00, vol. 9, 2012.
[14] T. Jain and T. Agrawal, "The Haswell microarchitecture–4th generation processor," International Journal of Computer Science and Information Technologies, vol. 4, no. 3, pp. 477–480, 2013.
[15] M. Herlihy, V. Luchangco, and M. Moir, "Obstruction-free synchronization: Double-ended queues as an example," Distributed Computing Systems, 2003, pp. 522–529.
[16] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2008, vol. 10.
[17] P. Felber, C. Fetzer, and T. Riegel, "Dynamic performance tuning of word-based software transactional memory," PPoPP, 2008, pp. 237–246.
[18] A. Wang et al., "Evaluation of Blue Gene/Q hardware support for transactional memories," PACT, 2012, pp. 127–136.
[19] J. Rose et al., "The VTR project: Architecture and CAD for FPGAs from Verilog to routing," FPGA, 2012, pp. 77–86.
[1] A. Ludwin and V.
Betz, “Efficient and deterministic parallel placement [20] M. K. Prabhu and K. Olukotun, “Exposing speculative thread paral- for FPGAs,” TODAES, vol. 16, no. 3, p. 22, 2011. lelism in SPEC2000,” PPoPP, 2005, pp. 142–152.