Roberto Hibbler
Dept. of Computer Science, Florida Institute of Technology
Melbourne, FL 32901
[email protected]

ABSTRACT
Given an array of elements, we want to arrange those elements into a sorted order. To sort those elements, we will need to make comparisons between the individual elements efficiently. Merge Sort uses a divide and conquer strategy to sort an array efficiently while making the least number of comparisons between array elements. Our results show that for arrays with large numbers of array elements, Merge Sort is more efficient than three other comparison sort algorithms: Bubble Sort[1], Insertion Sort[3], and Selection Sort[2]. Our theoretical evaluation shows that Merge Sort beats a quadratic time complexity, while our empirical evaluation shows that on average Merge Sort is 32 times faster than Insertion Sort[3], the currently recognized most efficient comparison algorithm, across ten different data sets.

Keywords
Merge Sort, comparisons, Selection Sort[2], arrange

1. INTRODUCTION
The ability to arrange an array of elements into a defined order is very important in Computer Science. Sorting is heavily used with online stores, where the order in which services or items were purchased determines what orders can be filled and who receives their order first. Sorting is also essential for the database management systems used by banks and financial systems, such as the New York Stock Exchange, to track and rank the billions of transactions that go on in one day. There are many algorithms which provide a solution to sorting arrays, including algorithms such as Bubble Sort[1], Insertion Sort[3], and Selection Sort[2]. While these algorithms are programmatically correct, they are not efficient for arrays with a large number of elements and exhibit quadratic time complexity.

We are given an array of comparable values. We need to arrange these values into either an ascending or descending order.

We introduce the Merge Sort algorithm. The Merge Sort algorithm is a divide-and-conquer algorithm. It takes as input an array and divides that array into sub-arrays of single elements. A single element is already sorted, and so the elements are merged back into sorted arrays two sub-arrays at a time, until we are left with a final sorted array. We contribute the following:

1. We introduce the Merge Sort algorithm.
2. We show that theoretically Merge Sort has a worst-case time complexity better than O(n²).
3. We show that empirically Merge Sort is faster than Selection Sort[2] over ten data sets.

This paper will discuss in Section 2 comparison sort algorithms related to the problem, followed by the detailed approach of our solution in Section 3, the evaluation of our results in Section 4, and our final conclusion in Section 5.

2. RELATED WORK
The three algorithms that we will discuss are Bubble Sort[1], Selection Sort[2], and Insertion Sort[3]. All three are comparison sort algorithms, just as Merge Sort is.

The Bubble Sort[1] algorithm works by continually swapping adjacent array elements if they are out of order until the array is in sorted order. Every iteration through the array places at least one element at its correct position. Although algorithmically correct, Bubble Sort[1] is inefficient for use with arrays with a large number of array elements and has an O(n²) time complexity. Knuth observed, also, that while Bubble Sort[1] shares the worst-case time complexity with other prevalent sorting algorithms, compared to them it makes far more element swaps, resulting in poor interaction with modern CPU hardware. We intend to show that Merge Sort needs to make on average fewer element swaps than Bubble Sort[1].

The Selection Sort[2] algorithm arranges array elements in order by first finding the minimum value in the array and swapping it with the array element that is in its correct position, depending on how the array is being arranged. The process is then repeated with the second smallest value until the array is sorted. This creates two distinctive regions within the array: the part that is sorted and the part that has not been sorted. Selection Sort[2] shows an improvement over Bubble Sort[1] by not comparing all the elements in its unsorted part until it is time for an element to be placed into its sorted position. This makes Selection Sort[2] less affected by the input's order. It is, however, still inefficient for arrays with a large number of array elements, and even with these improvements Selection Sort[2] still shares the same worst-case time complexity of O(n²). We intend to show that Merge Sort will operate at a worst-case time complexity faster than O(n²).

The Insertion Sort[3] algorithm takes elements from the input array and places those elements in their correct place in a new array, shifting existing array elements as needed. Insertion Sort[3] improves over Selection Sort[2] by only making as many comparisons as it needs to determine the correct position of the current element, while Selection Sort[2] makes comparisons against each element in the unsorted part of the array. In the average case, Insertion Sort[3]'s time complexity is O(n²), as is its worst case, the same as Bubble Sort[1] and Selection Sort[2]. The tradeoff of Insertion Sort[3] is that on average more elements are swapped, as array elements are shifted within the array with the addition of each new element. We intend to show that Merge Sort operates at an average-case time complexity faster than O(n²).
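The three algorithms above can be sketched compactly in Java, the language used for the evaluation in Section 4. This is a minimal illustration under our own naming, not the paper's implementation:

```java
import java.util.Arrays;

/** Minimal Java sketches of the three related comparison sorts.
 *  Illustrative only; the paper's own implementations are not given. */
public class ElementarySorts {

    /** Bubble Sort[1]: repeatedly swap adjacent out-of-order elements
     *  until a full pass makes no swaps. */
    public static void bubbleSort(int[] a) {
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int i = 0; i + 1 < a.length; i++) {
                if (a[i] > a[i + 1]) {            // adjacent pair out of order
                    int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                    swapped = true;               // at least one swap this pass
                }
            }
        }
    }

    /** Selection Sort[2]: grow a sorted region by repeatedly selecting the
     *  minimum of the unsorted region and swapping it into place. */
    public static void selectionSort(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int min = i;
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[min]) min = j;       // scan unsorted part for minimum
            }
            int t = a[i]; a[i] = a[min]; a[min] = t;
        }
    }

    /** Insertion Sort[3]: insert each element into the sorted region,
     *  shifting larger elements right as needed. */
    public static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {        // compares only until position found
                a[j + 1] = a[j];                  // shift element right
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        int[] a = {38, 27, 43, 3, 9, 82, 10, 1};
        insertionSort(a);
        System.out.println(Arrays.toString(a)); // prints [1, 3, 9, 10, 27, 38, 43, 82]
    }
}
```

Note how the inner loops differ: Selection Sort always scans the whole unsorted region, while Insertion Sort stops as soon as the position is found, which is the advantage described above.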

3. APPROACH
A large array with an arbitrary order needs to be arranged in an ascending or descending order, either lexicographically or numerically. Merge Sort solves this problem by using two key ideas.

The first key idea of merge sort is that the problem can be divided and conquered: the array can be broken into smaller arrays, and those smaller arrays can be solved. The second key idea is that by dividing the array into halves, then recursively halving those halves until only single-element arrays remain, two sorted arrays can be merged into one, since a single-element array is already sorted. Refer to the following pseudocode:

Input – A: array of n elements
Output – array A sorted in ascending order

1. proc mergesort(A: array)
2.   var array left, right, result
3.   if length(A) <= 1
4.     return(A)
5.   var middle = length(A)/2
6.   for each x in A up to middle
7.     add x to left
8.   for each x in A after middle
9.     add x to right
10.  left = mergesort(left)
11.  right = mergesort(right)
12.  result = merge(left, right)
13.  return result

Figure 1: Shows the splitting of the input array into single element arrays.

Input – left: array of m elements, right: array of k elements
Output – array result sorted in ascending order

14. proc merge(left: array, right: array)
15.   var array result
16.   while length(left) > 0 and length(right) > 0
17.     if first(left) <= first(right)
18.       append first(left) to result
19.       left = rest(left)
20.     else
21.       append first(right) to result
22.       right = rest(right)
23.   end while
24.   if length(left) > 0
25.     append left to result
26.   if length(right) > 0
27.     append right to result
28.   return result

Figure 2: Shows the merging of the single element arrays during the Merge Step.

As the example shows, array A is broken in half continuously until the elements are in arrays of only a single element; those single elements are then merged together until they form a single sorted array in ascending order.

As the pseudocode shows, after the array is broken up into a left half and a right half (lines 5–9), the two halves are divided recursively (lines 10–11) until each is a single-element array. Then the elements of the two halves are compared to determine how the two arrays should be arranged (lines 16–22). Should either half contain elements not yet added to the sorted array after the comparisons are made, the remainder is appended so that no elements are lost (lines 24–27).

4. EVALUATION

4.1 Theoretical Analysis

4.1.1 Evaluation Criteria
All comparison-based sorting algorithms count comparisons of array elements as one of their key operations. The Merge Sort algorithm can therefore be evaluated by measuring the number of comparisons made between array elements: as the key operation, the number of comparisons determines the overall efficiency of the algorithm. We intend to show that, because the Merge Sort algorithm makes fewer comparisons than the currently acknowledged most efficient algorithm, Insertion Sort[3], Merge Sort is the most efficient comparison sort algorithm.
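As a concrete companion to the pseudocode, the following Java sketch implements mergesort and merge and counts element comparisons, the key operation measured here. This is our own illustrative rendering; the names and the counter are ours, not the paper's implementation:

```java
import java.util.Arrays;

/** Java rendering of the mergesort/merge pseudocode, instrumented to count
 *  element comparisons (the key operation of the evaluation). Sketch only. */
public class MergeSortSketch {
    static long comparisons = 0;   // comparisons made in the merge step

    public static int[] mergesort(int[] a) {
        if (a.length <= 1) return a;                       // base case: already sorted
        int middle = a.length / 2;
        int[] left  = mergesort(Arrays.copyOfRange(a, 0, middle));
        int[] right = mergesort(Arrays.copyOfRange(a, middle, a.length));
        return merge(left, right);
    }

    static int[] merge(int[] left, int[] right) {
        int[] result = new int[left.length + right.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            comparisons++;                                 // one element comparison
            if (left[i] <= right[j]) result[k++] = left[i++];
            else                     result[k++] = right[j++];
        }
        while (i < left.length)  result[k++] = left[i++];  // append any remainder
        while (j < right.length) result[k++] = right[j++]; // so no elements are lost
        return result;
    }

    public static void main(String[] args) {
        int[] sorted = mergesort(new int[]{38, 27, 43, 3, 9, 82, 10, 1});
        // prints [1, 3, 9, 10, 27, 38, 43, 82], comparisons = 17
        System.out.println(Arrays.toString(sorted) + ", comparisons = " + comparisons);
    }
}
```

For the eight-element example input, the counter reports 17 comparisons, which matches the worst-case count n log₂ n − n + 1 derived below for n = 8.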
In the following example, using the input A = (38 27 43 3 9 82 10 1), the division of the array (Figure 1) and how the array is merged back into a sorted array (Figure 2) are illustrated.

4.1.1.1 Merge Sort Case Scenarios

4.1.1.1.1 Worst Case
Merge Sort makes the element comparisons we want to measure during the merge step, where pairs of arrays are recursively merged into a single array. Merge Sort's worst case, depicted in Figure 3, is the scenario where during each recursive call of the merge step the two largest elements are located in different arrays. This forces the maximum number of comparisons to occur. In this case, the Merge Sort algorithm's efficiency can be represented by the number of comparisons made during each recursive call of the merge step, which is described by the following recurrence, where n denotes the array size and T(n) the total number of comparisons in the merge step:

T(n) = 2T(n/2) + n − 1    (1)
T(1) = 0                  (2)

Equation (1) gives the total number of comparisons that occur in Merge Sort as a function of the number of array elements. The term 2T(n/2) refers to the comparisons made in the two halves of the array before the arrays are merged, and the term n − 1 refers to the comparisons made in the merge step itself. Equation (2) states the base case: no comparisons are needed for a single-element array. With these two equations, we can determine the total number of comparisons by looking at each recursive call of the merge step. We next solve equation (1) by repeatedly expanding it and substituting:

T(n) = 2(2T(n/4) + n/2 − 1) + n − 1            (3)
     = 4T(n/4) + n − 2 + n − 1                 (4)
     = 4(2T(n/8) + n/4 − 1) + n − 2 + n − 1    (5)
     = 8T(n/8) + n − 4 + n − 2 + n − 1         (6)

By expanding equation (1) to get equations (3) through (6) we can discern a pattern. Using the variable k to indicate the depth of the recursion, we get the following equation:

T(n) = 2^k T(n/2^k) + kn − (2^k − 1)    (7)

We can solve equation (7) by using the base case in equation (2) and determining the value of k, the depth of the recursion:

2^k = n                                                (8)
k = log2 n                                             (9)
T(n) = n · 0 + n log2 n − (n − 1) = n log2 n − n + 1   (10)

By making the statement in equation (8), the recursive term of equation (7) reduces to the base case of equation (2). Solving equation (8) for k gives equation (9). Equation (9) can then be used to reduce equation (7) to equation (10), which represents the total number of comparisons of array elements in the merge step. This results in a Big O time complexity of O(n log n) overall.

4.1.1.2 Best Case
The best case of Merge Sort, depicted in Figure 3, occurs when the largest element of one array is smaller than any element in the other. In this scenario, only n/2 comparisons of array elements are made at each merge. Using the same process that we used to determine the total number of comparisons for the worst case, we get the following equations:

B(n) = 2B(n/2) + n/2           (11)
B(n) = 2^k B(n/2^k) + kn/2     (12)
B(n) = n · 0 + (n/2) log2 n    (13)

As before, equation (11) can be expanded to find a pattern; equation (12) is created by substituting the recursion depth k, and solving for k gives equation (13), the total number of comparisons for the best case of Merge Sort. This also results in a Big O time complexity of O(n log n), just like the worst case.

4.1.2 Insertion Sort[3] Case Scenarios

4.1.2.1 Worst Case
Insertion Sort[3] builds the sorted array one element at a time, removing one element from the input data and placing it in its correct location each iteration until the array is in sorted order. The worst case for Insertion Sort[3] occurs when the input array is in reverse of sorted order. In this case, every iteration of the inner loop scans and shifts the entire sorted region of the array before inserting the next element. Denoting n as the number of array elements, the number of comparisons can be described by the following equation:

C(n) = (n − 1) + (n − 2) + ... + 1 = n(n − 1)/2    (14)

This results in a Big O time complexity of O(n²) in the worst case.

4.1.2.2 Best Case
The best case for Insertion Sort[3] is when the input array is already sorted. In this scenario, each element is removed from the input array and placed into the sorted array without the need to shift elements. This results in a Big O time complexity of O(n) in the best case.

4.1.3 Analysis
Comparing the time complexities of Merge Sort and Insertion Sort[3], Insertion Sort[3] beats Merge Sort in the best case, as Merge Sort has a Big O time complexity of O(n log n) while Insertion Sort[3] is O(n). However, taking the worst cases, Merge Sort is faster than Insertion Sort[3], with a time complexity of O(n log n) against Insertion Sort[3]'s n(n − 1)/2, which is equivalent to O(n²). This supports the theory that Merge Sort will overall be the better algorithm when given less than optimum input.
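The closed forms in equations (10), (13), and (14) can be checked numerically against their recurrences. The following small Java sketch (ours, for illustration) does so for power-of-two array sizes:

```java
/** Numerically checks the closed-form comparison counts against their
 *  recurrences for power-of-two array sizes. Illustrative sketch only. */
public class ComparisonCounts {
    /** Worst case, equation (1): T(n) = 2T(n/2) + n - 1, with T(1) = 0. */
    static long worst(long n) { return n <= 1 ? 0 : 2 * worst(n / 2) + n - 1; }

    /** Best case, equation (11): B(n) = 2B(n/2) + n/2, with B(1) = 0. */
    static long best(long n) { return n <= 1 ? 0 : 2 * best(n / 2) + n / 2; }

    /** Floor of log base 2 for positive n. */
    static long log2(long n) { return 63 - Long.numberOfLeadingZeros(n); }

    public static void main(String[] args) {
        for (long n = 2; n <= (1 << 20); n *= 2) {
            long lg = log2(n);
            // Equation (10): T(n) = n log2(n) - n + 1
            if (worst(n) != n * lg - n + 1) throw new AssertionError("eq (10) fails at n=" + n);
            // Equation (13): B(n) = (n/2) log2(n)
            if (best(n) != (n / 2) * lg) throw new AssertionError("eq (13) fails at n=" + n);
        }
        // Equation (14): Insertion Sort worst case = n(n-1)/2 comparisons
        long n = 1000, sum = 0;
        for (long i = 1; i < n; i++) sum += i;
        if (sum != n * (n - 1) / 2) throw new AssertionError("eq (14) fails");
        System.out.println("All closed forms match their recurrences.");
    }
}
```

For example, at n = 8 the worst-case recurrence yields 17 comparisons (8·3 − 8 + 1) and the best case yields 12 ((8/2)·3), both consistent with the derivations above.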

Figure 4: Depicts the slower speed of Insertion Sort vs Merge Sort with array sizes greater than 1000.

Figure 3: Depicts how single element arrays are merged together during the Merge step in the best and worst case.

4.2 Empirical Analysis

4.2.1 Evaluation Criteria
The goal of this evaluation is to demonstrate the improved efficiency of the Merge Sort algorithm based on the CPU runtime of its execution.

4.2.2 Evaluation Procedures
The efficiency of Merge Sort can be measured by determining the CPU runtime of an implementation of the algorithm sorting a number of elements versus the CPU runtime of an implementation of the Insertion Sort[3] algorithm. The dataset used is a server log of connecting IP addresses over the course of one day. All the IP addresses are first extracted from the log, in a separate process, and placed in a file. Subsets of the IP addresses are then sorted using both the Merge Sort algorithm and the Insertion Sort[3] algorithm with the following array sizes: 5, 10, 15, 20, 30, 50, 100, 1000, 5000, 10000, 15000, and 20000. Each array size was run ten times and the average of the CPU runtimes was taken. Both algorithms take as their parameter an array of IP addresses. For reproducibility, the dataset used for the evaluation can be found at "http://cs.fit.edu/~pkc/pub/classes/writing/httpdJan24.log.zip". The original dataset has over thirty thousand records; for the purposes of this experiment, only twenty thousand were used. The tests were run on a PC running Microsoft Windows XP with the following specifications: Intel Core 2 Duo CPU E8400 at 3.00 GHz with 3 GB of RAM. The algorithms were implemented in Java using the Java 6 API.

Figure 5: Depicts the faster speed of Insertion Sort until around array sizes of 1000.

4.2.2.1 Results and Analysis
Compared to the Insertion Sort[3] algorithm, the Merge Sort algorithm shows faster CPU runtimes for array sizes over 1000. A summary of the results is contained in Figure 4 and Figure 5.

With Merge Sort plotted in red and Insertion Sort in blue, Figure 5 shows the relative execution-speed curves of the two algorithms. It can be seen that for array sizes under 1000 the Insertion Sort[3] algorithm has much faster execution times, showing a clear advantage. This is probably due to the overhead incurred during the creation and deletion of new arrays during splitting. Around array sizes of 1000 the algorithms' curves intersect, and Merge Sort begins to show an advantage over Insertion Sort[3], which can be seen in Figure 4. As the size of the array grows larger, Insertion Sort[3]'s execution runtime increases at a faster pace than Merge Sort's. This shows that Insertion Sort[3] will get progressively worse relative to Merge Sort as the array size increases, while Merge Sort's efficiency will decrease at a slower rate. It can be concluded from the results that for array sizes over 1000, Insertion Sort[3] is unsuitable compared to the efficiency presented when using the Merge Sort algorithm.
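The paper does not reproduce its timing harness, but the measurement procedure described above (ten runs per array size, averaged) can be sketched as follows. The class and method names, the use of System.nanoTime, and the random stand-in data are our assumptions; the real evaluation sorted IP address strings from the server log:

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal sketch of the timing procedure described above: for each array
 *  size, run the sort ten times on fresh copies of the data and average the
 *  runtime. Random integers stand in for the log's IP addresses. */
public class TimingHarness {

    /** Average wall-clock nanoseconds over ten runs, sorting copies of a. */
    public static double averageSortNanos(int[] a) {
        final int runs = 10;                           // ten runs per size, as in the paper
        long total = 0;
        for (int r = 0; r < runs; r++) {
            int[] copy = Arrays.copyOf(a, a.length);   // sort a fresh copy each run
            long start = System.nanoTime();
            Arrays.sort(copy);                         // stand-in for the sort under test
            total += System.nanoTime() - start;
        }
        return total / (double) runs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);                   // fixed seed for repeatability
        for (int size : new int[]{5, 100, 1000, 20000}) {
            int[] data = rnd.ints(size).toArray();
            System.out.printf("n = %5d: avg %.0f ns%n", size, averageSortNanos(data));
        }
    }
}
```

Swapping `Arrays.sort` for the Merge Sort and Insertion Sort implementations under test, and the random data for the extracted IP addresses, reproduces the shape of the experiment.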

5. CONCLUSION
The purpose of this paper was to introduce the Merge Sort algorithm and show its improvements over its predecessors. Our theoretical analysis shows that compared to the Bubble Sort[1], Insertion Sort[3], and Selection Sort[2] algorithms, Merge Sort has a faster Big O worst-case time complexity of O(n log n). Our empirical analysis shows that compared to Insertion Sort[3], Merge Sort is 32 times faster for arrays larger than 1000 elements. This makes Merge Sort far more efficient than Insertion Sort[3] for array sizes larger than 1000.

Although Merge Sort has been shown to be better in the worst case for array sizes larger than 1000, it was still slower than Insertion Sort[3] for array sizes less than 1000. This can be explained by the overhead required for the creation and merging of all the arrays during the merge step. A future improvement to the Merge Sort algorithm would be to determine a way to reduce this overhead.

6. REFERENCES
[1] Knuth, D. Sorting by Exchanging. "The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition." 1997. pp. 106–110.
[2] Knuth, D. Sorting by Selection. "The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition." 1997. pp. 138–141.
[3] Knuth, D. Sorting by Insertion. "The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition." 1998. pp. 80–105.