Recent Researches in Telecommunications, Informatics, Electronics and Signal Processing

Sorting networks on FPGA

Devi Prasad, Mohamad Yusri Mohamad Yusof, Smruti Santosh Palai, Ahmad Hafez Nawi
Microelectronics Department, MIMOS Berhad, Technology Park Malaysia, Kuala Lumpur, Malaysia-57000
http://www.mimos.my

Abstract: - Speed and efficiency are essential for high-speed data processing. FPGA-based hardware accelerators show better performance than general-purpose processors. At the same time, traditional algorithms may not always be efficient on FPGAs. Sorting networks have emerged as suitable alternatives that can be implemented efficiently on FPGAs. Each application has its own constraints on latency and throughput. A carefully selected sorting network with a suitable number of pipeline stages achieves higher throughput without adding much latency.

Key-Words: - FPGA, Sorting networks.

1 Introduction
With the demand for high-speed networking and computing, speed and parallel algorithms have become essential tools for development. Many of these operations were performed by a general-purpose processor [1]. Nowadays, owing to the availability of FPGAs, many researchers try to implement various algorithms more efficiently on FPGAs [2] [3]. FPGAs are often used as hardware accelerators.

One of the commonly used operations in high-speed data processing is data sorting. The most commonly used sorting method is bubble sort. To implement sorting efficiently and with fewer operations, Batcher proposed a technique of sorting using sorting networks [4] [5] [6]. Many of these have been implemented on FPGAs and general-purpose processors [7] [8] [9]. In this paper we evaluate various sorting networks based on complexity and speed, focusing on FPGA implementation. For analysis purposes, all the networks are configured to accept eight unsorted numbers, each eight bits wide. The result is the sorted numbers.

2 Sorting Networks

2.1 Bubble Sort
In bubble sort, adjacent pairs of data elements are compared and swapped if they are found in the wrong order, and this process is repeated until the last two elements of the array are compared. With each pass of the bubble sort, the compare-and-swap process makes the smaller elements bubble, or move up, to their designated locations in the array.

2.2 Shell Sort
Shell sort is one of the oldest sorting algorithms. It arranges the data sequence in a two-dimensional array and then sorts the columns of the array [10], which results in the data sequence being partially sorted. As the process repeats, the array becomes narrower: each time the number of columns decreases, until in the end the array has a single column. The Shell method actually groups the data at each step, rather than sorting the data by itself. At each step, either [...] or [...] is used to arrange the data. The number of times the data elements need to be rearranged is reduced in this type of sorting method.

Fig.1. Shell Sort, n=8.

2.3 Odd-even transposition sort
In general, odd-even transposition sort compares adjacent pairs of data in the array to be sorted and, if a pair is found to be in the wrong order, swaps those elements. In the first step, each odd-index element and its adjacent even-index element are compared and swapped if found in the wrong order. In the next step, each even-index element and its adjacent odd-index element are compared and swapped if found in the wrong order. This process continues with alternating (odd, even) and (even, odd) phases until no swapping of data elements is required. The resulting array is thus sorted. This network comprises the same number of comparator stages as the number of inputs. In each stage, either the odd or the even index positions are compared with their respective neighbors, and the stages alternate between even and odd.
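As an illustration of the stage structure described above, the following minimal Python sketch builds the comparator stages of an n-input odd-even transposition network and checks the result with the zero-one principle. It is a software model only, not the Verilog description used in this paper; the function names and the zero-based wire numbering are assumptions made for this example.

```python
from itertools import product


def odd_even_transposition_stages(n):
    """Return the n comparator stages of an n-input odd-even transposition network.

    Each stage is a list of (i, j) wire pairs with j = i + 1; the smaller value
    ends up on wire i. Wires are numbered from 0 (an assumption of this sketch),
    so stages alternate between pairs starting at wire 0 and pairs starting at wire 1.
    """
    return [[(i, i + 1) for i in range(stage % 2, n - 1, 2)] for stage in range(n)]


def apply_network(stages, values):
    """Pass the values through every stage; comparators within a stage are independent."""
    data = list(values)
    for stage in stages:
        for i, j in stage:
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
    return data


def sorts_all_binary_inputs(stages, n):
    """Zero-one principle: a comparator network that sorts every 0/1 input sorts any input."""
    return all(
        apply_network(stages, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )


stages = odd_even_transposition_stages(8)
comparators = sum(len(stage) for stage in stages)
print(len(stages), comparators)                                   # 8 28
print(apply_network(stages, [200, 3, 77, 15, 255, 0, 128, 64]))  # [0, 3, 15, 64, 77, 128, 200, 255]
print(sorts_all_binary_inputs(stages, 8))                         # True
```

For n=8 the model has eight stages and 28 comparators, which is consistent with a stage count that grows linearly and a comparator count that grows quadratically with the number of inputs.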

Fig.2. Odd-even transposition Sort, n=8.

2.4 Bitonic Sort
Bitonic sort is a sorting algorithm designed for parallel machines. On any arbitrary sequence to be sorted, bitonic sort produces a bitonic sequence by sorting the two halves of the input sequence in opposite directions.

Fig.3. Bitonic Sort, n=8.

A bitonic sequence is one that consists of two subsequences, one monotonically increasing and the other monotonically decreasing. Hence, for any arbitrary sequence of length n, bitonic sort first performs two n/2-element sorts, one increasing and the other decreasing. This results in an n-element bitonic sequence. The entire sequence is then bitonically sorted to produce a sorted (monotonic) sequence.

2.5 Odd-even merge sort
The odd-even transposition sorting algorithm described earlier has a complexity of O(n^2). With such complexity, for any large sequence of length n, the number of steps needed to perform a complete sort will be very high in real-time situations. Odd-even merge sort solves this problem. In odd-even merge sort, all the odd-index elements and even-index elements are sorted separately and then merged; this step is repeated until a completely sorted sequence is obtained. Odd-even merge sort is also called an optimal sorting algorithm.

Fig.4. Odd-even merge sort, n=8.

A summary of the complexities of the above-mentioned sorting networks is given in Table 1.

TABLE 1
SUMMARY OF COMPLEXITIES

Sort Network                   Comparator Stages   Comparators
Odd-even transposition sort    O(n)                O(n^2)
Bubble Sort                    O(n)                O(n^2)
Bitonic Sort                   O(log(n)^2)         O(n·log(n)^2)
Odd-Even mergesort             O(log(n)^2)         O(n·log(n)^2)
Shellsort                      O(log(n)^2)         O(n·log(n)^2)

Where n is the number of inputs.

3 FPGA Implementation
All five algorithms are described in Verilog HDL using two different approaches: one without any pipeline registers and the other with pipeline registers. Each module accepts eight parallel input data, each eight bits wide, and the clock. The output is the sorted data. The functionality of each of the designed modules is verified by simulation in ModelSim, as shown in Fig.5.

Fig.5. Functional simulation.
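To make the effect of pipeline registers concrete, the following Python sketch models a pipelined eight-input sorting network in software. It is not the authors' Verilog: the network is generated with the commonly published iterative form of Batcher's odd-even merge sort, and placing one register after every comparator stage is an assumption of this model. Table 3 reports five pipeline stages for this network, so the authors' register placement may differ. The names `odd_even_merge_sort_stages` and `PipelinedSorter` are hypothetical.

```python
import random


def odd_even_merge_sort_stages(n):
    """Comparator stages of Batcher's odd-even merge sort (n assumed a power of two)."""
    stages = []
    p = 1
    while p < n:
        k = p
        while k >= 1:
            stage = []
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare wires that lie in the same block of 2*p wires.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        stage.append((i + j, i + j + k))
            stages.append(stage)
            k //= 2
        p *= 2
    return stages


def apply_stage(stage, values):
    """Apply one stage of independent compare-and-swap elements."""
    data = list(values)
    for i, j in stage:
        if data[i] > data[j]:
            data[i], data[j] = data[j], data[i]
    return data


class PipelinedSorter:
    """One register after each comparator stage (an assumption of this sketch)."""

    def __init__(self, stages):
        self.stages = stages
        self.regs = [None] * len(stages)  # None marks a register holding no valid data

    def clock(self, new_input):
        """Advance one clock edge and return the contents of the last register."""
        inputs = [new_input] + self.regs[:-1]
        self.regs = [
            None if value is None else apply_stage(stage, value)
            for stage, value in zip(self.stages, inputs)
        ]
        return self.regs[-1]


stages = odd_even_merge_sort_stages(8)
sorter = PipelinedSorter(stages)
rng = random.Random(0)
vectors = [[rng.randrange(256) for _ in range(8)] for _ in range(10)]

# Feed one new vector per clock, then idle clocks to drain the pipeline.
outputs = [sorter.clock(v) for v in vectors + [None] * len(stages)]
first_valid = next(k for k, out in enumerate(outputs) if out is not None)
results = [out for out in outputs if out is not None]

print(len(stages), sum(len(s) for s in stages))  # 6 19
print(first_valid + 1)                           # 6
print(results == [sorted(v) for v in vectors])   # True
```

In this model the first sorted vector leaves the pipeline on the sixth clock edge, and afterwards one sorted vector of eight values is produced on every clock. This is the behavior that lets a pipelined design trade a few cycles of latency for a shorter critical path and a higher clock frequency.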


Each module is then synthesized and implemented using Xilinx ISE [11]. Each module is then tested by downloading the bit file to a Xilinx Virtex-5 LX50 FPGA.

TABLE 2
SUMMARY OF FPGA IMPLEMENTATION (NO PIPELINE STAGES)

Sort Network                   LUT    FLOP   Frequency (MHz)
Odd-even transposition sort    1237   128    72
Bubble Sort                    1331   128    46
Bitonic Sort                   800    128    85
Odd-Even mergesort             654    128    101
Shellsort                      885    128    61

TABLE 3
SUMMARY OF FPGA IMPLEMENTATION (WITH PIPELINE STAGES)

Sort Network                   LUT    FLOP   Frequency (MHz)   Pipeline Stages
Odd-even transposition sort    632    512    353               7
Bubble Sort                    650    512    363               12
Bitonic Sort                   540    448    353               5
Odd-Even mergesort             425    368    368               5
Shellsort                      537    448    360               10

Tables 2 and 3 summarize the performance of the non-pipelined and pipelined designs.

It is observed that the pipelined designs have a better operating frequency than the non-pipelined designs. On the other hand, the pipelined designs consume more FPGA resources, namely flip-flops (the FLOP column), than the non-pipelined designs. Odd-even merge sort consumes fewer resources and achieves the best operating frequency, whereas bitonic sort not only consumes more resources but also has a lower operating frequency. Both modules have a throughput of eight sorted data per clock with a latency of five clock cycles.

4 Conclusion
It is often critical to decide on the best sorting algorithm for a given application. The decision is based on the tradeoff between pipeline stages, area and speed. It is observed that adding five pipeline registers to odd-even merge sort increases throughput significantly without much increase in area (FPGA resources). To achieve similar performance, other sorting algorithms require more pipeline stages.

References:
[1] Amato N., Ravishankar Iyer, Sharad Sundaresan, Yan Wu, “A Comparison of Parallel Sorting Algorithms on Different Architectures,” Technical Report No. 98/029, Department of [...], Texas A&M University, 1996.
[2] R. Mueller, J. Teubner, G. Alonso, “Data Processing on FPGAs,” Proceedings of the VLDB Endowment, vol. 2, Aug. 2009.
[3] Martinez J., Cumplido R., Feregrino C., “An FPGA-based parallel sorting architecture for the Burrows Wheeler transform,” Reconfigurable Computing and FPGAs, 2005.
[4] Y. Jun, Li Na, D. Jun, Guo Y., Tang Z., “A research of high-speed Batcher's odd-even merging network,” E-Health Networking, Digital Ecosystems and Technologies (EDT), Vol. 1, pp. 77-80, April 2010.
[5] K.E. Batcher, “Sorting networks and their applications,” Proceedings of the AFIPS Spring Joint Computer Conference 32, pp. 307-314, July 1968.
[6] Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition). Addison-Wesley, 1998.
[7] M. Matveev, “Implementation of the sorting schemes in a programmable logic,” 6th Workshop on Electronics for LHC Experiments, pp. 476-478, Sept. 2000.
[8] R.M. Alonso, D. T. Lucio, “Parallel Architecture for the Solution of Linear Equation Systems Implemented in FPGA,” Proceedings of the 2009 Electronics, Robotics and Automotive Mechanics Conference (CERMA 2009), pp. 275-280.
[9] R.M. Alonso, D. T. Lucio, “A Programmable Accelerator for next generation wireless communications,” 17th European Signal Processing Conference (EUSIPCO 2009), pp. 1294-1298.
[10] D.L. Shell, “A High-Speed Sorting Procedure,” Communications of the ACM, pp. 30-32, 1959.
[11] http://www.xilinx.com.

ISBN: 978-1-61804-005-3