Performance and Analysis of Sorting Algorithms on the SRC 6 Reconfigurable Computer
Total Page:16
File Type:pdf, Size:1020Kb
2 Nov. 2005 Performance and Analysis of Sorting Algorithms on the SRC 6 Reconfigurable Computer John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang The George Washington University [email protected], {tarek, esam, mqhuang}@gwu.edu Abstract architecture in this paper. Sorting is perhaps the most widely studied problem in 2.1. SRC 6 versus 6E Hardware Architecture computer science and is frequently used as a benchmark Changes of a system’s performance. This work compares the execution speed of the FPGA processing elements to the The most significant change to the hardware microprocessor processing elements in the SRC 6 architecture between the SRC 6 and the 6E is the addition reconfigurable computer using the following algorithms of a 16 port high-speed crossbar switch as shown in for sorting unsigned integer keys: Quick Sort, Heap Sort, Figure 1. Each switch port supplies dedicated input and Radix Sort, Bitonic Sort, and Odd/Even Merge. SRC output paths at a peak bandwidth of 1.6 GB/s for each compiler performance is also examined. The results show path. Multiple switches allow the system to be scaled to that, for sorting, FPGA technology may not be the best up to 256 nodes. processor choice and that factors such as memory A second change is the addition of a new external bandwidth, clock speed, algorithm computational memory node that attaches directly to the switch as requirements and an algorithm’s ability to be pipelined shown in Figure 1. Memory nodes are available in 4, 8 all have an impact on FPGA performance. 16, & 32 GB capacities. The third major change involves the upgrade of the Keywords: reconfigurable computer, sorting, FPGA, microprocessor nodes from dual 1 GHz PentiumIII’s to hardware compiler, SRC, Quick Sort, Heap Sort, Radix dual 2.8 GHz Xeon microprocessors. This allows much Sort, Bitonic Sort, Odd/Even Merge. higher rate DMA transfers in the new system (sustained 1.4 GB/s DMA rates). 1. Introduction Field Programmable Gate Array (FPGA) technology has recently been added to several high-performance computing systems as a means to increase overall system performance [1], [2], [3]. In this study, we use a reconfigurable computer from SRC Computers to evaluate the performance of reconfigurable computers for sorting. Several researchers have investigated an earlier model Figure 1 SRC 6 system architecture. SRC system, the SRC 6E, with largely favorable results including crypto algorithm speedups on the order of The subject system configuration is as follows: 10~1000x [4], [5], and wavelet processing speedups on the order of 10x [6]. • two microprocessor nodes (four Xeons) • two FPGA nodes (four FPGAs total; two per 2. SRC 6 System node) • 16 GB memory node The SRC 6 reconfigurable computer includes both The PE specifications as tested are: microprocessors and FPGAs for use as processing elements (PEs) and offers C and Fortran compilers for • Xeons: 2.8 GHz, 512 KB L2 cache, 1 GB main translating functions into executable FPGA hardware. memory per node (shared) Several of the previously referenced papers provide • FPGAs: Xilinx VirtexII XC2V6000, 100 MHz, detailed descriptions of the earlier SRC 6E system. We with six shared local memories per node, each will only point out the major changes to the new bank 512 K x 64 bits wide. 1 2 Nov. 2005 Figure 2 MIMD implementation in FPGAs. Note that the SRC system architecture imposes a fixed Figure 3 SIMD implementation in FPGAs. clock rate on the FPGA resources of 100 MHz regardless of the potential maximum speed for code compiled to run in parallel on their own data sources in memory banks A in an FPGA PE. through F. All sorting instances in the MIMD model will be coded using a sequential implementation of the their 2.2. SRC 6 FPGA C Compiler respective algorithms. One drawback to this approach is that each parallel instance in the FPGA stalls on Coding for the FPGA resources in the SRC 6 follows completion until all parallel instances have completed. the same procedures as those for the 6E. Code can be This means that sorts with data dependent performance, compiled to hardware from C, Fortran, Verilog, and such as Quick Sort and Heap Sort, will cause all parallel VHDL. SRC’s mcc compiler is used to compile code for instances to execute at the speed of the instance with the the FPGAs. Microprocessor codes are compiled using least favorable key ordering in its bank. Intel’s C compiler, icc. Another possible drawback to this approach is that a six-way merge operation is needed to fully sort the results of each parallel sort. However, this invites a possible 3. Methodology hybrid FPGA-microprocessor approach with the microprocessor handling merging while the FPGAs The main goal of this study was to compare the run- perform the sorting. time performance of an FPGA versus a microprocessor on integer sorting, a common general-purpose computing 3.2. SIMD Model benchmark. Since achieving the best performance typically requires some amount of optimization, our The single instruction stream multiple data stream approach was, (a) to develop the algorithm for both (SIMD) model [7] relies on a single central controller that architectures using as generic and similar a code base as synchronously controls one or multiple instances of a possible and, (b) to then look for optimization hardware sorter with multiple data sources. An example opportunities to improve performance. As a result, the using two parallel instances is shown in Figure 3. The best implementations for the microprocessors and the SIMD model can be used for algorithms that can be FPGAs usually took slightly different approaches. These expressed using parallel sorting networks; it cannot be differences are explained in the text. For the most part, used for data dependent algorithms. All parallel instances the microprocessor code implementations are textbook in a SIMD implementation are synchronous and complete examples compiled with strong speed optimization turned at the same time. on. The FPGA implementations are based on the architecture models described in the following paragraphs. 4. Algorithm Selection 3.1. MIMD Model The sorting algorithms selected as the basis for this The multiple instruction stream multiple data stream study include Quick Sort, Heap Sort, Radix Sort, Bitonic (MIMD) model [7] is a simple way to express parallelism Sort, and Odd/Even Merge. Algorithms were selected in FPGAs as shown in Figure 2. In this model, multiple based on three main factors: (1) the algorithm’s relevance instances of the same hardware sorter, designated by the to sorting in general, (2) the ease with which an algorithm dotted boxes F1 through F6, operate asynchronously but could be implemented on FPGAs, and (3) the expected performance gain in implementing an algorithm in 2 2 Nov. 2005 hardware versus software. For the remainder of this paper, made for the FPGA version. A few array references were assume N represents the total number of keys to be converted to scalar references to aid the FPGA C compiler sorted, the key size is 64 bits, and all logarithms are base in generating efficient hardware. All code for the Heap 2. Sort implementations was written in C. 4.1. Quick Sort 4.3. Radix Sort Quick Sort is a divide and conquer algorithm that is Radix Sort is a non-comparison based sort that works most often implemented using recursion [8]. Quick Sort by performing successive stable sorts of the next has worst-case running time of Θ(N2) but is typically significant radix digit until all digits are sorted. The O(Nlog(N)) and in practice one of the fastest of the Radix Sort algorithm we implement makes use of comparison based sorting algorithms. The initial feeling Counting Sort to sort the radix digits during each pass was that Quick Sort would be difficult to implement on [13]. It does this by keeping a tally of each occurrence of FPGAs due to its recursive nature (it requires a stack) and the current radix digit within each key. Keys are then that the performance may suffer because the sequential ordered according to the radix value by noting that the implementation has several nested loops. The FPGA C first occurrence of a radix digit is placed in the position compiler does not support recursion and typically inserts equal to the sum of tallies for all digits smaller than the clock penalties (delays) for nested loops. However, no one being stored. The sum for each radix is incremented study on sorting can be complete without the inclusion of as keys with that radix are sorted. Quick Sort. Furthermore, non-recursive implementations Since non-comparison based sorts can achieve O(N) are possible and the FPGA block RAMs are available to performance and involve operations that are suited to help implement a stack. In addition, Quick Sort can be FPGAs, Radix Sort was expected to perform well in designed to sort in-place and the MIMD architecture hardware. This sort also fits the MIMD model; however, model can be applied in this case. one drawback to Radix Sort is that it typically requires The microprocessor and FPGA implementations are O(2N) memory. The codes used for the microprocessor both based on sequential, in-place implementations of the and the FPGA instances were sequential. The optimal algorithm. The microprocessor code uses recursion and number of bits to sort per pass on the microprocessor side includes the optimization to switch from Quick Sort to was found to be eight; on the FPGA side, the optimal Insertion Sort when partitions fall below 20 keys (the was the maximum feasible which was thirteen. Both value of 20 was selected after empirical testing determined implementations for the Radix Sort were written in C.