Merge Sort

Luis Quiles
Florida Institute of Technology
[email protected]

ABSTRACT

Currently, the best solution to the task of sorting is Insertion Sort. The performance of Insertion Sort diminishes as datasets grow larger due to its quadratic time complexity. We introduce a new algorithm called Merge Sort, which utilizes a Divide and Conquer approach to solve the sorting problem with a time complexity that is lower than quadratic time. We will show that Merge Sort has a time complexity of O(N*log(N)). We will also analyze Merge Sort and its closest competitor to verify that Merge Sort performs fewer comparisons and has a lower time complexity than Insertion Sort.

Keywords
Divide and Conquer, Sorting, Merge Sort

1. INTRODUCTION

The task of sorting is a problem that commonly occurs in applications that retrieve information from a database or search through a file system. In both situations, the tables and file system can be very large, containing thousands of sources of data that need to be sorted quickly. For this reason, fast sorting solutions are desirable.

Insertion Sort, Selection Sort, and Bubble Sort are three existing algorithms that solve the task of sorting a list of comparable values. Of these three algorithms, Insertion Sort performs best, yet its performance diminishes rapidly as dataset sizes increase. This decrease in performance is due to its quadratic time complexity, which indicates that as a problem doubles in size, the time required to solve it will quadruple.

Given a list of comparable values, we desire an algorithm that can sort this list in polynomial time, with a time complexity lower than the quadratic time complexity of Insertion Sort.

We introduce a new algorithm called Merge Sort. The Merge Sort algorithm solves the problem of rearranging a list of linearly comparable values in less than quadratic time by using a Divide-and-Conquer approach to produce the output. This algorithm produces the rearranged list faster than other algorithms such as Insertion Sort, Selection Sort, and Bubble Sort.

Section 2 describes the Selection, Insertion, and Bubble Sort algorithms. Section 3 details the Merge Sort algorithm being introduced. Section 4 proves that Merge Sort has an O(N*log(N)) time complexity. Section 5 provides empirical evidence of Merge Sort's superiority over Insertion Sort. Section 6 summarizes our results.

2. RELATED WORK

We began our work on developing a new algorithm by first looking at other solutions to the problem of reordering a list of comparable values that are given in an arbitrary order.

2.1 Selection Sort

Selection Sort is performed by selecting the smallest value of the list, then placing it at the beginning of the list [2]. This process is repeated once for every value of the list. Selection Sort is sound and easy to understand. It is also very slow: it has a time complexity of O(N^2) for both its worst and best case inputs due to the many comparisons it performs.

A single pass of Selection Sort can only guarantee that one element is sorted, because only one element is placed into its correct position. In developing Merge Sort, we thought of an algorithm that would not only always sort every element into place correctly, but also sort faster on certain inputs.

2.2 Insertion Sort

Insertion Sort is another algorithm which solves the problem. Given a list of comparable values, search the list from the beginning. When a value is smaller than a previously viewed value, move the new value into its correct position and shift all of the previous values forward [1]. Like Selection Sort, Insertion Sort is sound and has a time complexity of O(N^2) for the worst case. Unlike Selection Sort, Insertion Sort has a time complexity of O(N) for its best case input, and it will sort faster because it performs fewer comparisons [4].

Insertion Sort moves from the beginning of the list to the end of the list exactly once. During the process of sorting, if a value is out of place, it is moved backward into its correct position [1]. With a list that is mostly sorted, Insertion Sort will correctly reorder the list faster than Selection Sort can, because it has fewer comparisons to perform and it only has to move a few elements a small number of times.

Insertion Sort is a faster solution than Selection Sort, but it is still not fast enough for demanding applications. Our Merge Sort algorithm will perform faster than Insertion Sort.

2.3 Bubble Sort

Bubble Sort takes a different approach to solving our problem, which is why it is a good algorithm to include in our research. Bubble Sort's approach is to make multiple passes through the list, swapping any two consecutive values that are out of their proper order [3]. This process of full passes with order swapping has the effect of moving at least one value – the largest unsorted value – into its correct place.

The idea of sorting multiple items into their correct positions in each pass is a great idea, and it is used again in our Merge Sort algorithm. Also, due to this swapping process, it is possible to stop sorting early: if any pass of Bubble Sort is completed without any swaps, then the list is fully sorted.

Due to the total number of swaps that Bubble Sort performs during each pass, this sort is actually slower than Insertion and Selection Sort. As you will see in the next section, one of the improvements that our Merge Sort algorithm makes over Bubble Sort is that it will continue to sort quickly even with the worst possible inputs.

3. APPROACH

A list, L, is known to have a total of N linearly comparable elements in an arbitrary order. The task required is to rearrange the list's elements in ascending order.

Merge Sort is known as a Divide and Conquer algorithm: given a task, divide it into parts, then solve the parts individually. There are two parts to the Merge Sort algorithm:

1. Divide the list into two separate lists
2. Merge the new lists together in order

First let's look at Divide. The algorithm is simple: divide list L into left and right halves. This process is recursive, and does not stop until only one element remains in the list. A single-element list is already sorted. The sorted halves are then merged to produce the sorted list. Figure 1 demonstrates the Divide algorithm, while Figure 2 shows the resulting sort that occurs when Merge is called on Line 5. In the Merge pseudo-code, to merge left and right into list L, we take the lowest value of either list that has not been placed in L, and place it at the end of L (Lines 8-10). Once either left or right is exhausted, L is filled with the remainder of the list that has not been merged (Lines 11-12).

Figure 1 (Divide)    Figure 2 (Merge)

Merge does what its name implies: it merges two lists (left and right) into a single list (L) in ascending order. Merge also makes a claim: lists left and right are already sorted. Since left and right are already sorted in ascending order, the two lists can be merged by repeatedly removing the next lowest value in both lists and adding it to the result list L. This operation requires only a single comparison per value added to the result list, until one list is exhausted.

MERGESORT:
Input:  List L (contains N elements)
Output: List L (ascending order)

MERGESORT(List L):
1. if(N > 1)
2.    (List left, List right) <- DIVIDE(L)
3.    left <- MERGESORT(left)
4.    right <- MERGESORT(right)
5.    L <- MERGE(left, right)

MERGE:
Input:  List left (sorted ascending order), List right (sorted ascending order)
Output: List L (sorted ascending order)

MERGE(List left, List right):
6.  i = j = k = 0
7.  while(i < LEN(left) and j < LEN(right))
8.     if(left[i] < right[j])
9.        L[k++] = left[i++]
10.    else L[k++] = right[j++]
11. if(i == LEN(left)) append(L, rest(right))
12. else append(L, rest(left))

In the pseudo-code, Line 1 is the stopping condition for Divide. The recursive Divide occurs on Lines 2-4. Specifically, Lines 3-4 perform the recursive calls. Finally, the lists are merged on Line 5.

Figure 3

Figure 3, above, illustrates the Merge algorithm (Lines 6-12). The black arrows above the left and right lists represent i and j from the pseudo-code. In (a), these values are initially zero, so the first elements from each list are compared. In (b) and (c), the lowest element from the previous step was appended to the end of L, and both i and j have increased. Finally, (d) represents Lines 11-12, where right is appended to L because left has been exhausted.

4. THEORETICAL EVALUATION

4.1 Criteria

The goal of both Merge Sort and Insertion Sort is to sort a list of comparable elements. The key operation in the execution of this goal is the comparison between list elements during the sorting process. This theoretical evaluation will measure the total number of comparisons between list elements in Merge Sort and in Insertion Sort according to each algorithm's worst and best case input.

4.2 Merge Sort

In Merge Sort, the comparisons occur during the merging step, when two sorted lists are combined to output a single sorted list. During the merging step, the first available element of each list is compared and the lower value is appended to the output list. When either list runs out of values, the remaining elements of the opposing list are appended to the output list.

4.2.1 Worst Case

The worst case scenario for Merge Sort is when, during every merge step, exactly one value remains in the opposing list; in other words, no comparisons were skipped. This situation occurs when the two largest values in a merge step are contained in opposing lists. When this occurs, Merge Sort must continue comparing list elements in each of the opposing lists until the two largest values are compared.

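The comparison counts analyzed in this section can be checked mechanically. The sketch below is an illustrative Java harness (not part of the algorithm's definition) that counts the element comparisons performed by the merge step. For N = 8, a worst-case arrangement such as [0, 4, 2, 6, 1, 5, 3, 7] forces the maximum count, while an already-sorted list exhibits the best case.

```java
// Illustrative harness: merge sort over index ranges, counting element comparisons.
class MergeComparisons {
    static int comparisons;

    static int count(int[] list) {
        comparisons = 0;
        sort(list, 0, list.length);
        return comparisons;
    }

    // Sort a[lo..hi) by recursive divide-and-merge.
    static void sort(int[] a, int lo, int hi) {
        if (hi - lo <= 1) return;          // a single-element list is already sorted
        int mid = (lo + hi) / 2;           // Divide into left and right halves
        sort(a, lo, mid);
        sort(a, mid, hi);
        merge(a, lo, mid, hi);             // Merge the sorted halves
    }

    static void merge(int[] a, int lo, int mid, int hi) {
        int[] out = new int[hi - lo];
        int i = lo, j = mid, k = 0;
        while (i < mid && j < hi) {
            comparisons++;                 // exactly one comparison per appended value
            out[k++] = (a[i] < a[j]) ? a[i++] : a[j++];
        }
        while (i < mid) out[k++] = a[i++]; // remainder copied without comparisons
        while (j < hi)  out[k++] = a[j++];
        System.arraycopy(out, 0, a, lo, out.length);
    }

    public static void main(String[] args) {
        System.out.println(count(new int[]{0, 4, 2, 6, 1, 5, 3, 7})); // worst case: prints 17
        System.out.println(count(new int[]{0, 1, 2, 3, 4, 5, 6, 7})); // best case: prints 12
    }
}
```

For N = 8 the two calls report 17 and 12 comparisons, matching the worst- and best-case counts N*log2(N) - N + 1 and (N/2)*log2(N) derived below.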
The complexity of worst-case Merge Sort is:

T(N) = 2T(N/2) + N - 1                      (2.1)
T(N) = 2[2T(N/4) + N/2 - 1] + N - 1         (2.2)
T(N) = 4[2T(N/8) + N/4 - 1] + 2N - 3        (2.3)
T(N) = 8T(N/8) + N + N + N - 4 - 2 - 1      (2.4)

Eq. 2.1 is the recurrence relation for Merge Sort. T(N) refers to the total number of comparisons between list elements in Merge Sort when we are sorting the entire list of N elements. The divide stage performs Merge Sort on two halves of the list, which is what 2T(N/2) refers to. The final part of the equation, N - 1, refers to the total comparisons in the merge step that returns the sorted list of N elements.

Eq. 2.1 describes the number of comparisons that occur in a merge sort, which is a recursive procedure. Since the method is recursive, we will not be able to count every comparison that occurs by only looking at a single call to Merge. Instead, we need to unroll the recursion and count the total number of comparisons.

Equations 2.2-2.3 perform this action of unrolling the recursion by performing substitution. We know what the value of T(N) is from 2.1, and by substitution we know the value of T(N/2). Eq. 2.4 is just a clearer view of Eq. 2.3. At this point we are in the third recursive call of Merge Sort, and a pattern has become clear enough to produce Eq. 2.5 below:

T(N) = 2^k * T(N/2^k) + kN - (2^k - 1)      (2.5)

In Eq. 2.5, a new variable called "k" is introduced. This variable represents the depth of the recursion. When the sort is recursively dividing the input list, the recursion stops when the list contains a single element. A single-element list is already sorted.

T(1) = 0                                    (2.6)
2^k = N                                     (2.7)
k = log2(N)                                 (2.8)
T(N) = N*log2(N) - N + 1                    (2.9)

By substituting Eq. 2.6 through Eq. 2.8 into Eq. 2.5, we eliminate the k term and reduce the recurrence relation to produce the complexity for Merge Sort, Eq. 2.9. Thus we have shown that Merge Sort has O(N*log(N)) complexity with the worst possible input.

4.2.2 Best Case

Merge Sort's best case is when the largest element of one sorted sub-list is smaller than the first element of its opposing sub-list, for every merge step that occurs. Only one element from the opposing list is compared, which reduces the number of comparisons in each merge step to N/2.

T(N) = 2T(N/2) + N/2                        (2.10)
T(N) = 2[2T(N/4) + N/4] + N/2               (2.11)
T(N) = 4[2T(N/8) + N/8] + N                 (2.12)
T(N) = 8T(N/8) + 3N/2                       (2.13)
T(N) = 2^k * T(N/2^k) + kN/2                (2.14)
T(N) = (N/2)*log2(N)                        (2.15)

Eq. 2.14 is produced from 2.10 using the same substitution process (2.11-2.13) as before. Like Eq. 2.9 from the worst case, Eq. 2.15 shows that Merge Sort has O(N*log(N)) complexity with the best possible input.

4.3 Results

While Insertion Sort's best case for sorting certainly does beat Merge Sort's best case, it is usually far more interesting to know about the worst case scenario. Since we are far more likely to sort an unsorted list than a pre-sorted list, comparing the worst cases between the two provides far more relevant information. The exception to this is when we have some prior knowledge that the list will usually already be sorted.

Comparing the worst cases for Insertion Sort and Merge Sort, the complexity of Merge Sort is far better. The number of comparisons performed by Insertion Sort is bounded above by an N^2 function, while Merge Sort is bounded above by an N*log(N) function. Merge Sort's O(N*log(N)) time complexity is much more efficient for large inputs than Insertion Sort's quadratic complexity.

Theoretically, Merge Sort is a better solution than Insertion Sort when there is no prior knowledge of the data to be sorted. Now we show through experiments that this holds true.

5. EMPIRICAL EVALUATION

The goal of this evaluation is to show a clear improvement in the task of sorting when using Merge Sort instead of Insertion Sort. The evaluation criterion is a measurement of the processor time for each algorithm.

5.1 Environment and Data

The evaluation of Merge Sort was conducted on a Windows machine containing a 7200RPM internal drive, an Intel Q6600 quad-core processor, and 3.5GB of DDR2 RAM. The Merge Sort and Insertion Sort algorithms were run on the Java 1.5.0 Virtual Machine with no other processes being run on the same CPU core. The data for the evaluation can be found at http://cs.fit.edu/~pkc/classes/writing/httpdJan24.log.zip

5.2 Procedure

The data file above contains over 320 thousand records, which is too many records to deal with under the default memory constraints of the JVM. Instead, 20 thousand records are read at a time and used with the procedure outlined below. This complete process is repeated ten times to produce ten samples, and the time measurements were averaged to produce the results in this paper.

The procedure is as follows:
1. Store the first 20000 records in a list.
2. Choose any 500 consecutive records to sort.
3. Sort these 500 records separately using Merge Sort and Insertion Sort.
4. Record the completion time (from the moment a sort is called until the moment that sort completes) of each sort.
5. Repeat steps 2-4, incrementing the number of consecutive records by 500.

5.3 Results

Figure 4 below shows the average of the results of the experiment. In the figure, the completion time for Merge Sort is in red and the completion time for Insertion Sort is in blue.

Figure 4

From Figure 4 we clearly see that, as the number of sorted records increases, the time Insertion Sort takes to sort will also increase at a very rapid rate. In fact, performing a curve fitting on the graph produces a quadratic curve that fits the data points very well. Merge Sort, on the other hand, takes so much less time that it is unclear from Figure 4 whether its completion time is increasing. Linear regression was performed on the data to verify that the time was increasing – it is in fact linearly increasing as the size of the data increases.

Table 1 shows the t-test performed on the 10 experimental samples.

Table 1: t-test
                        Variable 1      Variable 2
Mean                    3379.42         107.06
Variance                9295934.83      6705.57
t Stat                  6.974
P(T<=t) two-tail        1.97
t Critical two-tail     1.83

Our critical value was 2.02, and we exceeded it, meaning our Merge Sort algorithm is clearly faster at sorting than Insertion Sort.

From the results we have shown that Merge Sort sorts faster than Insertion Sort for data sets of any significant size, and will continue to do so no matter how much the data size is increased. Also, from the same results, the difference between the two sorts for list sizes smaller than 2000 is so insignificantly small that using Insertion Sort would not give any short-term speed increase.

6. CONCLUSION

In this paper we introduced the Merge Sort algorithm, which takes a Divide-and-Conquer approach to rearranging a list of linearly comparable values into ascending order. We have shown that Merge Sort is a faster sorting algorithm than Insertion Sort, both theoretically and empirically. We have also shown that Merge Sort has a best- and worst-case time complexity of O(N*log2(N)).

Despite its time complexity, Merge Sort's performance degrades when the input size decreases during the Split phase. This is due to the overhead of creating, working with, and destroying many small lists. One possible improvement to the Merge Sort algorithm may be to switch to a "faster" sorting algorithm when list sizes decrease below a certain size, to prevent the slowdown caused by the overhead of data storage. Such future work would also include determining whether such a size should be an input parameter, and whether it should be a discrete value or a percentage of the input size.

7. REFERENCES

[1] Cormen, Thomas, et al. Introduction to Algorithms, 2nd Edition. 15-19. MIT Press. 2001.
[2] Sedgewick, Robert. 2003. "Selection Sort". Algorithms in Java, 3rd Edition. 283-284. Addison-Wesley, Princeton University.
[3] Sedgewick, Robert. 2003. "Bubble Sort". Algorithms in Java, 3rd Edition. 285-287. Addison-Wesley, Princeton University.
[4] Sedgewick, Robert. 2003. "Insertion Sort". Algorithms in Java, 3rd Edition. 288-289. Addison-Wesley, Princeton University.