A Comparative Study of Sorting Algorithms with FPGA Acceleration by High Level Synthesis
Total Page:16
File Type:pdf, Size:1020Kb
A Comparative Study of Sorting Algorithms with FPGA Acceleration by High Level Synthesis Yomna Ben Jmaa1,2, Rabie Ben Atitallah2, David Duvivier2, Maher Ben Jemaa1 1 NIS University of Sfax, ReDCAD, Tunisia 2 Polytechnical University Hauts-de-France, France [email protected], fRabie.BenAtitallah, [email protected], [email protected] Abstract. Nowadays, sorting is an important operation the risk of accidents, the traffic jams and pollution. for several real-time embedded applications. It is one Also, ITS improve the transport efficiency, safety of the most commonly studied problems in computer and security of passengers. They are used in science. It can be considered as an advantage for various domains such as railways, avionics and some applications such as avionic systems and decision automotive technology. At different steps, several support systems because these applications need a applications need to use sorting algorithms such sorting algorithm for their implementation. However, as decision support systems, path planning [6], sorting a big number of elements and/or real-time de- cision making need high processing speed. Therefore, scheduling and so on. accelerating sorting algorithms using FPGA can be an attractive solution. In this paper, we propose an efficient However, the complexity and the targeted hardware implementation for different sorting algorithms execution platform(s) are the main performance (BubbleSort, InsertionSort, SelectionSort, QuickSort, criteria for sorting algorithms. Different platforms HeapSort, ShellSort, MergeSort and TimSort) from such as CPU (single or multi-core), GPU (Graphics high-level descriptions in the zynq-7000 platform. In Processing Unit), FPGA (Field Programmable addition, we compare the performance of different Gate Array) [14] and heterogeneous architectures algorithms in terms of execution time, standard deviation can be used. and resource utilization. From the experimental results, we show that the SelectionSort is 1.01-1.23 times faster Firstly, the FPGA is the most preferable platform than other algorithms when N < 64; Otherwise, TimSort is the best algorithm. for implementing the sorting process. Thus, industry uses frequently FPGAs for many real-time Keywords. FPGA, sorting algorithms, heterogeneous applications improve performance in terms of architecture CPU/FPGA, zynq platform. execution time and energy consumption [4, 5]. On the other hand, using an FPGA board allows to build complex applications which have a high 1 Introduction performance. These applications are being made by receiving a large number of available At present, Intelligent Transportation Systems programmable fabrics. They provide an imple- (ITS) is an advanced application combining en- mentation of massively-parallel architectures [5]. gineering transport, communication technologies The increase in the complexity of the applications and geographical information systems. These has led to high-level design methodologies[10]. systems [15] play a significant part in minimizing Hence, High-Level Synthesis (HLS) tools have Computación y Sistemas, Vol. 23, No. 1, 2019, pp. 213–230 ISSN 1405-5546 doi: 10.13053/CyS-23-1-2999 214 Yomna Ben Jmaa, Rabie Ben Atitallah, David Duvivier, Maher Ben Jemaa been developed to improve the productivity of is lower than an optimal parameter (OP) which FPGA-based designs. depends on the target architecture and the However, the first generation of HLS has failed sorting implementation; otherwise, MergeSort is to meet hardware designers’ expectations; some considered with integrating some other steps to reasons have facilated researchers to continue improve the execution time. producing powerful hardware devices. Among The main goal of our work is to create a new these reasons, we quote the sharp increase simulator which simulates the behavior of elements of silicon capability; recent conceptions tend to / components related to intelligent transportation employ heterogeneous Systems on Chips (SoCs) systems; In [2], the authors implemented an and accelerators. The use of behavioral designs execution model for a future test bench and in place of Register-Transfer-Level (RTL) designs simulation. However, they propose a new allows improving design productivity, reducing the hardware and software execution support for the time-to-market constraints, detaching the algorithm next generation of test and simulation system from architecture to provide a wide exploration for in the field of avionics. In [33], the authors implementing solutions [11]. From a high-level developed an efficient algorithm for helicopter path programming language (C / C ++), Xilinx created planning. They proposed different scheduling the vivado HLS [34] tool to generate hardware methods for the optimization of the process on accelerators. a real time high performance heterogeneous The sorting algorithm is an important process architecture CPU/FPGA. The authors in [25] for several real-time embedded applications. presented a new method of 3D path design There are many sorting algorithms [17] in the using the concept of “Dubins gliding symmetry literature such as BubbleSort, InsertionSort and conjecture“. This method has been integrated in SelectionSort which are simple to develop and to a real-time decisional system to solve the security realize, but they have a weak performance (time problem. complexity is O(n2)). Several researchers have In our research group, several researchers used MergeSort, HeapSort and QuickSort with proposed a new adaptive approach for 2D path O(nlog(n)) time complexity to resolve the restricted planning according to the density of obstacle performance of these algorithms [17]. On the one in a static environment. They improved this hand, HeapSort starts with the construction of a approach into a new method of 3D path planning heap for the data group. Hence, it is essential to with multi-criteria optimizing. The main objective remove the greatest element and to put it at the end of this work is to propose a different optimized of the partially-sorted table. Moreover, QuickSort hardware accelerated version of sorting algo- is very efficient in the partition phase for dividing rithms (BubbleSort, InsertionSort, SelectionSort, the table into two. However, the selection of the HeapSort, ShellSort, QuickSort, TimSort and value of the pivot is an important issue. On the MergeSort) from High-Level descriptions in avionic other hand, MergeSort gives a comparison of each applications. We use several optimization steps to element index, chooses the smallest element, puts obtain an efficient hardware implementation in two them in an array and merges two sorted arrays. different cases: software and hardware for different Furthermore, other sorting algorithms are vectors and permutations. improving the previous one. For example, The paper is structured as follows: Section ShellSort is an enhancement of the insertion II presents several studies of different sorting sort of algorithm. It divides the list into a algorithms using different platforms (CPU, GPU minimum number of sub-lists which are sorted and FPGA). Section III shows a design flow of with the insertion sort algorithm. Hybrid sorting our application. Section IV gives an overview algorithms emerged as a mixture between several of sorting algorithms. Section V describes our sorting algorithms. For example, Timsort combines architecture and a variety of optimizations of MergeSort and InsertionSort. These algorithms hardware implementation. Section VI shows choose the InsertionSort if the number of elements experimental results. We conclude in Section VII. Computación y Sistemas, Vol. 23, No. 1, 2019, pp. 213–230 ISSN 1405-5546 doi: 10.13053/CyS-23-1-2999 A Comparative215 Study of Sorting Algorithms with FPGA Acceleration by High Level Synthesis Table 1. Complexity of the sorting algorithms — According to the number of elements to be sorted, we choose the corresponding sorting Best Average Worst BubbleSort O(n) O(n2) O(n2) algorithms. For example, if the number of ele- InsertionSort O(n) O(n2) O(n2) ments is small, then InsertionSort is selected, 2 2 2 SelectionSort O(n ) O(n ) O(n ) otherwise; MergeSort is recommended. HeapSort O(n log(n)) O(n log(n)) O(n log(n)) ShellSort O(n log(n)) O(n log(n)2) O(n log(n)2) QuickSort O(n log(n)) O(n log(n)) O(n2) Zurek et al. [35] proposed two different hardware MergeSort O(n log(n)) O(n log(n)) O(n log(n)) TimSort O(n) O(n log(n)) O(n log(n)) implementation algorithms: the quick-merge parallel and the hybrid algorithm (parallel bitonic algorithm on the GPU + sequence MergeSort on 2 Background and Related Work CPU) using a framework openMP and CUDA. They compared two new implementations with a different number of elements. The obtained result Sorting is a common process and it is considered shows that multicore sorting algorithms are the as one of the well-known problems in the best scalable and the most efficient. The GPU computational world. To achieve this, several sorting algorithms, compared to a single core, are algorithms are available in different research up to four times faster than the optimized quick works. They can be organized in various ways: sort algorithm. The implemented hybrid algorithm (executed partially on CPU and GPU) is more — Depending on the algorithm time complexity, efficient than algorithms only run on the GPU Table I presents three different cases (best, (despite transfer delays) but a little slower than the average and worst) of the complexity for most efficient, quick-merge parallel CPU algorithm. several sorting algorithms. We can mention They showed that the hybrid algorithm is slower that QuickSort, HeapSort, MergeSort, timSort than the most efficient quick-merge parallel CPU have a best complexity of O(nlog(n)) in algorithm. average case. By contrast, the worst case Abirami et al. [1] presented an efficient hardware performance for the four algorithms is O(n2) implementation of the MergeSort algorithm with obtained by QuickSort. A simple pretest Designing Digital Circuits. They measured makes all algorithms in the best case to O(n) the efficiency, reliability and complexity of the complexity. MergeSort algorithm with a digital circuit.