Merge Sort
Roberto Hibbler
Florida Institute of Technology
Melbourne, FL 32901
[email protected]

ABSTRACT
Given an array of elements, we want to arrange those elements into a sorted order. To sort those elements, we will need to make comparisons between the individual elements efficiently. Merge Sort uses a divide and conquer strategy to sort an array efficiently while making the least number of comparisons between array elements. Our results show that for arrays with large numbers of array elements, Merge Sort is more efficient than three other comparison sort algorithms: Bubble Sort, Insertion Sort, and Selection Sort. Our theoretical evaluation shows that Merge Sort beats a quadratic worst-case time complexity, while our empirical evaluation shows that on average Merge Sort is 32 times faster than Insertion Sort, the currently recognized most efficient comparison algorithm, on one real data set.

Keywords
Merge Sort, comparisons, Selection Sort, time complexity

1. INTRODUCTION
The ability to arrange an array of elements into a defined order is very important in Computer Science. Sorting is heavily used with online stores, where the order in which services or items were purchased determines what orders can be filled and who receives their order first. Sorting is also essential for the database management systems used by banks and financial systems, such as the New York Stock Exchange, to track and rank the billions of transactions that occur in one day. There are many algorithms which provide a solution to sorting arrays, including algorithms such as Bubble Sort [1], Insertion Sort [2], and Selection Sort [3]. While these algorithms are programmatically correct, they are not efficient for arrays with a large number of elements and exhibit quadratic time complexity.

We are given an array of comparable values. We need to arrange these values into either an ascending or descending order.

We introduce the Merge Sort algorithm. The Merge Sort algorithm is a divide and conquer algorithm. It takes an array as input and divides that array into sub-arrays of single elements. A single element is already sorted, and so the elements are merged back into sorted arrays two sub-arrays at a time, until we are left with a final sorted array. We contribute the following:

1. We introduce the Merge Sort algorithm.
2. We show that theoretically Merge Sort has a worst-case time complexity faster than O(n²).
3. We show that empirically Merge Sort is faster than Selection Sort.

This paper will discuss in Section 2 comparison sort algorithms related to the problem, followed by the detailed approach of our solution in Section 3, the evaluation of our results in Section 4, and our final conclusion in Section 5.

2. RELATED WORK
The three algorithms that we will discuss are Bubble Sort, Selection Sort, and Insertion Sort. All three are comparison sort algorithms, just as Merge Sort is.

The Bubble Sort algorithm works by continually swapping adjacent array elements if they are out of order until the array is in sorted order. Every iteration through the array places at least one element at its correct position. Although algorithmically correct, Bubble Sort is inefficient for use with arrays with a large number of array elements and has an O(n²) worst-case time complexity. Knuth also observed [1] that while Bubble Sort shares its worst-case time complexity with other prevalent sorting algorithms, compared to them it makes far more element swaps, resulting in poor interaction with modern CPU hardware. We intend to show that Merge Sort needs to make on average fewer element swaps than Bubble Sort.

The Selection Sort algorithm arranges array elements in order by first finding the minimum value in the array and swapping it with the array element that is in its correct position, depending on how the array is being arranged. The process is then repeated with the second smallest value until the array is sorted. This creates two distinct regions within the array: the part that is sorted and the part that has not been sorted. Selection Sort improves over Bubble Sort by not comparing all the elements in its unsorted part until it is time for an element to be placed into its sorted position, which makes Selection Sort less affected by the input's order. However, it is still inefficient for arrays with a large number of array elements, and even with these improvements Selection Sort still shares the same O(n²) worst-case time complexity. We intend to show that Merge Sort will operate at a worst-case time complexity faster than O(n²).

The Insertion Sort algorithm takes elements from the input array and places those elements in their correct place in a new array, shifting existing array elements as needed. Insertion Sort improves over Selection Sort by making only as many comparisons as it needs to determine the correct position of the current element, while Selection Sort makes comparisons against each element in the unsorted part of the array. The tradeoff of Insertion Sort is that on average more elements are swapped, as array elements are shifted within the array with the addition of new elements. In the average case Insertion Sort's time complexity is O(n²), and its worst case is also O(n²), the same as Bubble Sort and Selection Sort. Even with its limitations, Insertion Sort is the currently fastest comparison-based sorting algorithm, since it equals Bubble Sort and Selection Sort in the worst case but is considerably better in the average case. We intend to show that Merge Sort operates at an average-case time complexity faster than O(n²).
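For concreteness, the Insertion Sort behavior described above can be sketched in Java. This is an illustrative sketch of the common in-place variant (the class and method names are our own), not the implementation used in the evaluation of Section 4:

```java
import java.util.Arrays;

public class InsertionSortDemo {
    // Shifts larger elements one position right until the current element's
    // correct place is found, making only as many comparisons as needed.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int current = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > current) {
                a[j + 1] = a[j];   // shift existing elements as needed
                j--;
            }
            a[j + 1] = current;
        }
    }

    public static void main(String[] args) {
        int[] a = {38, 27, 43, 3, 9, 82, 10, 1};
        insertionSort(a);
        System.out.println(Arrays.toString(a)); // [1, 3, 9, 10, 27, 38, 43, 82]
    }
}
```

Note how the inner `while` loop stops as soon as a smaller element is found, which is the source of Insertion Sort's good average-case behavior on nearly sorted input.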

3. APPROACH
A large array with an arbitrary order needs to be arranged in an ascending or descending order, either lexicographically or numerically. Merge Sort can solve this problem by using two key ideas.

Input – A: array of n elements (38 27 43 3 9 82 10 1)
Output – array A in ascending order

The first key idea of Merge Sort is that the problem can be divided and conquered: the array can be broken into smaller arrays, and those smaller arrays can be sorted independently. The second key idea is that the array can be divided into halves, and those halves recursively halved until only single element arrays remain; since a single element array is already sorted, the sorted arrays can then be merged back together two at a time. Refer to the following pseudocode:

Input – A: array of n elements
Output – array A sorted in ascending order
 1. proc mergesort(A: array)
 2.   var array left, right, result
 3.   if length(A) <= 1
 4.     return(A)
 5.   var middle = length(A)/2
 6.   for each x in A up to middle
 7.     add x to left
 8.   for each x in A after middle
 9.     add x to right
10.   left = mergesort(left)
11.   right = mergesort(right)
12.   result = merge(left, right)
13.   return result

Figure 1: Pseudocode of the Merge Sort algorithm

Input – left: array of m elements, right: array of k elements
Output – array result sorted in ascending order
14. proc merge(left: array, right: array)
15.   var array result
16.   while length(left) > 0 and length(right) > 0
17.     if first(left) <= first(right)
18.       append first(left) to result
19.       left = rest(left)
20.     else
21.       append first(right) to result
22.       right = rest(right)
23.   end while
24.   if length(left) > 0
25.     append left to result
26.   if length(right) > 0
27.     append right to result
28.   return result

Figure 2: Pseudocode of the Merge step of Merge Sort

As the pseudocode shows, after the array is broken up into a left half and a right half (lines 5-9), the two halves are divided recursively (lines 10-11) until each is a single element array. Then the two halves' elements are compared to determine how the two arrays should be arranged (lines 16-22). Should either half contain elements not yet added to the sorted array after the comparisons are made, the remainder is appended so no elements are lost (lines 24-27). In the following examples, using the given input, the division of the array (Figure 3) and how the array is merged back into a sorted array (Figure 4) are illustrated.

Figure 3: Shows the splitting of the input array into single element arrays.

Figure 4: Shows the merging of the single element arrays during the Merge step.

As the example shows, array A is broken in half continuously until the elements are in arrays of only a single element; those single elements are then merged together until they form a single sorted array in ascending order.

4. EVALUATION
4.1 Theoretical Analysis
4.1.1 Evaluation Criteria
All comparison based sorting algorithms count the comparisons of array elements as one of their key operations. The Merge Sort algorithm can therefore be evaluated by measuring the number of comparisons between array elements. As the key operation, we can measure the number of comparisons made to determine the overall efficiency of the algorithm. We intend to show that because the Merge Sort algorithm makes fewer comparisons than the currently acknowledged most efficient algorithm, Insertion Sort (Section 2), Merge Sort is the most efficient comparison sort algorithm.
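To make the counting of comparisons concrete, the pseudocode of Figures 1 and 2 can be rendered in Java with a counter on the key operation. This is an illustrative sketch with names of our own choosing, not the implementation used in the empirical evaluation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeSortCount {
    static long comparisons = 0;  // counts the key operation: element comparisons

    // Figure 1: divide the array into halves and sort each half recursively.
    static List<Integer> mergesort(List<Integer> a) {
        if (a.size() <= 1) return a;            // a single element is already sorted
        int middle = a.size() / 2;
        List<Integer> left = mergesort(new ArrayList<>(a.subList(0, middle)));
        List<Integer> right = mergesort(new ArrayList<>(a.subList(middle, a.size())));
        return merge(left, right);
    }

    // Figure 2: repeatedly move the smaller head element into the result.
    static List<Integer> merge(List<Integer> left, List<Integer> right) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            comparisons++;                      // one element comparison (line 17)
            if (left.get(i) <= right.get(j)) result.add(left.get(i++));
            else result.add(right.get(j++));
        }
        // lines 24-27: append any remainder so no elements are lost
        result.addAll(left.subList(i, left.size()));
        result.addAll(right.subList(j, right.size()));
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(38, 27, 43, 3, 9, 82, 10, 1); // input from Section 3
        System.out.println(mergesort(a));   // [1, 3, 9, 10, 27, 38, 43, 82]
        System.out.println(comparisons);    // 17 for this input
    }
}
```

For this particular input the counter reports 17 comparisons, which for n = 8 equals n·log₂n − n + 1, the worst-case total derived below.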
4.1.2 Merge Sort Case Scenarios

4.1.2.1 Worst Case
Merge Sort makes the element comparisons we want to measure during the merge step, where pairs of arrays are recursively merged into a single array. Merge Sort's worst case, depicted in Figure 5, is the scenario where during each recursive call of the merge step, the two largest elements are located in different arrays. This forces the maximum number of comparisons to occur. In this case, the Merge Sort algorithm's efficiency can be represented by the number of comparisons made during each recursive call of the merge step, which is described in the following recurrence, where n denotes the array size and T(n) the total comparisons in the merge step:

T(n) = 2T(n/2) + n − 1    (1)
T(1) = 0                  (2)

Equation (1) gives the total number of comparisons that occur in Merge Sort dependent on the number of array elements. The term 2T(n/2) refers to the comparisons made to the two halves of the array before the arrays are merged. The term n − 1 refers to the total comparisons in the merge step. Equation (2) states the base case: no comparisons are needed for a single element array. With these two equations, we can determine the total number of comparisons by looking at each recursive call of the merge step. We next solve equation (1) by expanding it and performing substitution.

T(n) = 2(2T(n/4) + n/2 − 1) + n − 1           (3)
T(n) = 4T(n/4) + n − 2 + n − 1                (4)
T(n) = 4(2T(n/8) + n/4 − 1) + n − 2 + n − 1   (5)
T(n) = 8T(n/8) + n − 4 + n − 2 + n − 1        (6)

By expanding equation (1) to get equations (3), (4), (5), and (6), we can discern a pattern. By using the variable k to indicate the depth of the recursion, we get the following equation:

T(n) = 2^k T(n/2^k) + kn − (2^k − 1)          (7)

We can solve equation (7) by using the base case in equation (2) and determining the value of k, which refers to the depth of recursion.

2^k = n                                        (8)
k = log₂ n                                     (9)
T(n) = n·0 + n log₂ n − (n − 1) = n log₂ n − n + 1   (10)

By making the statement in equation (8), equation (7) and equation (2) are equal. We can then solve equation (8) for k and get equation (9). Equation (9) can then be used to reduce equation (7) to equation (10), which represents the total number of comparisons of array elements in the merge step; this results in a Big O time complexity of O(n log n) overall.

4.1.2.2 Best Case
The best case of Merge Sort, depicted in Figure 5, occurs when the largest element of one array is smaller than any element in the other. In this scenario, only n/2 comparisons of array elements are made during each merge. Using the same process that we used to determine the total number of comparisons for the worst case, we get the following equations:

T(n) = 2T(n/2) + n/2                           (11)
T(n) = 2^k T(n/2^k) + kn/2                     (12)
T(n) = n·0 + (n/2) log₂ n = (n/2) log₂ n       (13)

Similarly to earlier, equation (11) can be expanded to find a pattern; equation (12) can then be created by substituting k, and by solving for k we get equation (13), which is the total number of comparisons for the best case of Merge Sort. This also results in a Big O time complexity of O(n log n), just like the worst case.

Figure 5: Depicts how single element arrays are merged together during the Merge step in the best and worst case.

4.1.3 Analysis
Comparing the time complexities of Merge Sort and Insertion Sort, Insertion Sort beats Merge Sort in the best case, as Merge Sort has a Big O time complexity of O(n log n) while Insertion Sort's is O(n). However, taking the worst cases, Merge Sort is faster than Insertion Sort, with a time complexity of O(n log n) against Insertion Sort's O(n²). This adds to the theory that Merge Sort will overall be a better algorithm than Insertion Sort with less than optimum input.

4.2 Empirical Analysis
4.2.1 Evaluation Criteria
The goal of this evaluation is to demonstrate the improved efficiency of the Merge Sort algorithm based on the execution's CPU runtime. We also intend to show that the CPU runtimes for each run are statistically different between Merge Sort and Insertion Sort within a 95% confidence interval.
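As a numerical sanity check on the derivation above (an illustrative aside, not part of the original evaluation), the recurrence in equation (1) can be evaluated directly and compared against the closed form of equation (10) for power-of-two array sizes:

```java
public class WorstCaseCheck {
    // Worst-case comparison count from equation (1): T(n) = 2T(n/2) + (n - 1), T(1) = 0.
    static long t(long n) {
        if (n <= 1) return 0;
        return 2 * t(n / 2) + (n - 1);
    }

    public static void main(String[] args) {
        // Equation (10) predicts T(n) = n*log2(n) - n + 1 when n is a power of two.
        for (long n = 2; n <= 1024; n *= 2) {
            long log2 = Long.numberOfTrailingZeros(n); // exact log2 for powers of two
            long closedForm = n * log2 - n + 1;
            System.out.println("n=" + n + " recurrence=" + t(n)
                    + " closedForm=" + closedForm + " match=" + (t(n) == closedForm));
        }
    }
}
```

The two columns agree for every power of two tested, which is consistent with the algebraic reduction from equation (7) to equation (10).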

4.2.2 Evaluation Procedures
The efficiency of Merge Sort can be measured by comparing the CPU runtime of an implementation of the algorithm against the CPU runtime of an implementation of the Insertion Sort algorithm sorting the same elements. The dataset used is a server log of connecting IP Addresses over the course of one day. All the IP Addresses are first extracted from the log, in a separate process, and placed within a file. Subsets of the IP Addresses are then sorted using both the Merge Sort algorithm and the Insertion Sort algorithm with the following array sizes: 5, 10, 15, 20, 30, 50, 100, 1000, 5000, 10000, 15000, and 20000. Each array size was run ten times and the average of the CPU runtimes was taken. Both algorithms take as their parameter an array of IP Addresses. For reproducibility, the dataset used for the evaluation can be found at "http://cs.fit.edu/~pkc/pub/classes/writing/httpdJan24.log.zip". The original dataset has over thirty thousand records; for the purposes of this experiment, only twenty thousand were used.

The tests were run on a PC running Microsoft Windows XP with the following specifications: Intel Core 2 Duo CPU E8400 at 3.00 GHz with 3 GB of RAM. The algorithms were implemented in Java using the Java 6 API.

4.2.3 Results and Analysis
Compared to the Insertion Sort algorithm, the Merge Sort algorithm shows faster CPU runtimes for array sizes over 100. A summary of the results is contained in Figure 6.

Figure 6: Depicts the slower speed of Insertion Sort vs. Merge Sort.

With Merge Sort shown in red and Insertion Sort in blue, Figure 6 shows the relative execution speed curves of the two algorithms. It can be seen that for array sizes under 100, the Insertion Sort algorithm has much faster execution times, showing a clear advantage. This is probably due to the overhead incurred during the creation and deletion of new arrays during splitting. Around array sizes of 100, the algorithms' curves intersect, and Merge Sort begins to show an advantage over Insertion Sort. As the size of the array grows larger, Insertion Sort's execution runtime increases at a faster pace than Merge Sort's. This shows that Insertion Sort will progressively get worse than Merge Sort as the array size increases, while Merge Sort's efficiency will degrade at a slower rate. It can be concluded from the results that for array sizes over 100, Insertion Sort is unsuitable compared to the efficiency offered by the Merge Sort algorithm. We performed a two-tailed t test. As can be seen in Table 1, we found a significant difference in the CPU runtimes both when the array sizes were less than 100 and when the array sizes were greater than 100.

Table 1: Analysis of Merge Sort and Insertion Sort t test for each array size.

Array Size | t Critical  | P(T<=t)
5          | 2.144786681 | 3.94348E-06
10         | 2.228138842 | 1.19285E-06
15         | 2.200985159 | 2.03073E-05
20         | 2.228138842 | 2.13539E-05
30         | 2.262157158 | 5.0094E-07
50         | 2.262157158 | 2.26509E-06
100        | 2.119905285 | 0.224732042
1000       | 2.131449536 | 8.36349E-13
5000       | 2.144786681 | 2.45523E-33
10000      | 2.262157158 | 4.95691E-06
15000      | 2.262157158 | 6.60618E-18
20000      | 2.262157158 | 8.20913E-17

5. CONCLUSION
The purpose of this paper was to introduce the Merge Sort algorithm and show its improvements over its predecessors. Our theoretical analysis shows that compared to the Bubble Sort, Insertion Sort, and Selection Sort algorithms, Merge Sort has a faster Big O worst-case time complexity of O(n log n). Our empirical analysis shows that compared to Insertion Sort, Merge Sort is 32 times faster for arrays larger than 100 elements. This makes Merge Sort far more efficient than Insertion Sort for array sizes larger than 100. For array sizes both less than and greater than 100, the Merge Sort and Insertion Sort algorithms' CPU runtimes were found to be statistically different within a 95% confidence interval.

Although Merge Sort has been shown to be better in the worst case for array sizes larger than 100, it was still slower than Insertion Sort for array sizes less than 100. This can be explained by the overhead required for the creation and merging of all the arrays during the merge step. A future improvement to the Merge Sort algorithm would be to determine a way to reduce this overhead.

6. REFERENCES
[1] Knuth, D. Sorting by Exchanging. "The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition." 1998. pp. 106-110.
[2] Knuth, D. Sorting by Insertion. "The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition." 1998. pp. 80-105.
[3] Knuth, D. Sorting by Selection. "The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition." 1998. pp. 138-141.