Luis Quiles
Florida Institute of Technology
[email protected]

ABSTRACT
Currently, the best solution to the task of sorting is the Insertion Sort. The performance of Insertion Sort diminishes as datasets grow larger due to its quadratic time complexity. We introduce a new algorithm called Merge Sort, which utilizes a Divide and Conquer approach to solve the sorting problem with a time complexity that is lower than quadratic time. We will show that Merge Sort has a time complexity of O(N*log(N)). We will also analyze Merge Sort and its closest competitor to verify that Merge Sort performs fewer comparisons and has a lower time complexity than Insertion Sort.

Keywords
Divide and Conquer, Sorting, Merge Sort

1. INTRODUCTION
The task of sorting is a problem that commonly occurs in applications that retrieve information from a database or search through a file system. In both situations, the tables and file systems can be very large, containing thousands of sources of data that need to be sorted quickly. For this reason, fast and efficient solutions are desirable.

Insertion Sort, Selection Sort, and Bubble Sort are three existing algorithms that solve the task of sorting a list of comparable values. Of these three algorithms, Insertion Sort performs best, yet its performance diminishes rapidly as dataset sizes increase. This decrease in performance is due to its quadratic time complexity, which indicates that as a problem doubles in size, the time required to solve it will quadruple. We desire an algorithm that can solve the problem in polynomial time faster than the quadratic time guarantee of Insertion Sort.

We introduce a new algorithm called Merge Sort. The Merge Sort algorithm solves the problem of rearranging a list of linearly comparable values in less than quadratic time by using a Divide-and-Conquer approach to produce the output. This algorithm produces the rearranged list faster than other algorithms, such as Insertion Sort, Selection Sort, and Bubble Sort.

Section 2 describes the Selection, Insertion, and Bubble Sort algorithms. Section 3 details the Merge Sort algorithm being introduced. Section 4 proves that Merge Sort has an O(N*log(N)) time complexity. Section 5 provides empirical evidence of Merge Sort's superiority over Insertion Sort. Section 6 summarizes our results.

2. RELATED WORK
We began our work on developing a new algorithm by first looking at other solutions to the problem of reordering a list of comparable values that are given in an arbitrary order.

2.1 Selection Sort
Selection sort is performed by selecting the smallest value of the list, then placing it at the beginning of the list [2]. This process is repeated once for every value of the list. Selection sort is sound and easy to understand. It is also very slow, and it always performs with the same speed regardless of its input.

A single pass of Selection sort can only guarantee that one element is sorted, because only one element is placed into its correct position. In developing merge sort, we thought of an algorithm that would not only always place every element correctly, but also sort faster on certain inputs.

2.2 Insertion Sort
Insertion sort is another algorithm which solves the sorting problem. Given a list of comparable values, search the list from the beginning. When a value is smaller than a previously viewed value, move the new value into its correct position and shift all of the previous values forward [1]. Like Selection sort, Insertion sort is sound and easy to understand. Unlike Selection sort, Insertion sort can sort faster on certain inputs [4].

Insertion sort moves from the beginning of the list to the end of the list exactly once. During the process of sorting, if a value is out of place, it is moved backward into its correct position [1]. With a list that is mostly sorted, Insertion sort will correctly reorder the list faster than Selection sort can, because it has fewer comparisons to perform and it only has to move a few elements a small number of times.

Insertion sort is a faster solution than Selection sort, but it is still not fast enough.

2.3 Bubble Sort
Bubble sort takes a different approach to solving our problem, which is why it is a good algorithm to include in our research. Bubble sort's approach is to make multiple passes through the list, swapping any two consecutive values that are out of their proper order [3]. Each full pass with order swapping has the effect of moving at least one value, the largest unsorted value, into its correct place. This is different from Selection sort and Insertion sort, which are only capable of sorting exactly one value at a time.

Sorting multiple items into their correct positions in each pass is a powerful idea, and it is used again in our merge sort algorithm. Also, due to this swapping process, it is possible to stop sorting early: if any pass of bubble sort is completed without any swaps, then the list is fully sorted.

Despite the novel approach that bubble sort takes to the sorting problem, the algorithm is still a slow sort. As we will see in the next section, one of the improvements that our merge sort algorithm makes over bubble sort is that it continues to sort quickly even with the worst possible inputs.

3. APPROACH
A list, L, is known to have a total of N linearly comparable elements in an arbitrary order. The task is to rearrange the list's elements in ascending order.

Merge Sort is a Divide and Conquer algorithm: given a task, divide it into parts, then solve the parts individually. There are two parts to the Merge Sort algorithm:

1. Divide the list into two separate lists.
2. Merge the new lists together in order.

First, let us look at Divide. The algorithm is simple: divide list L into left and right halves. This process is recursive, and it does not stop until only one element remains in the list. A single-element list is already sorted.

Merge does what its name implies: it merges two lists (left and right) into a single list (L) in ascending order. Merge also makes a claim: lists left and right are already sorted. Since left and right are already sorted in ascending order, the two lists can be merged by repeatedly removing the next lowest value of both lists and adding it to the result list L. This operation requires only a single comparison per value added to the result list, until one list is exhausted.

MERGESORT:
Input: List L (contains N elements)
Output: List L (ascending order)

MERGESORT(List L):
1.  if(N > 1)
2.    (List left, List right) <- DIVIDE(L)
3.    left <- MERGESORT(left)
4.    right <- MERGESORT(right)
5.    L <- MERGE(left, right)

MERGE:
Input: List left (sorted ascending order), List right (sorted ascending order)
Output: List L (sorted ascending order)

MERGE(List left, List right):
6.  i = j = k = 0
7.  while(i < LEN(left) and j < LEN(right))
8.    if(left[i] < right[j])
9.      L[k++] = left[i++]
10.   else L[k++] = right[j++]
11. if(i == LEN(left)) append(L, rest(right))
12. else append(L, rest(left))

In the pseudo-code, Line 1 is the stopping condition for Divide. The recursive Divide occurs on Lines 2-4; specifically, Lines 3-4 perform the recursive calls. Finally, the lists are merged on Line 5 to produce the sorted list. Figure 1 demonstrates the Divide algorithm, while Figure 2 shows the resulting sort that occurs when Merge is called on Line 5.

[Figure 1: the Divide step. Figure 2: the sorted result of Merge. Figures not reproduced.]

In the Merge algorithm, to merge the left and right lists into list L, we take the lowest value of either list that has not yet been placed in L, and place it at the end of L (Lines 8-10). Once either left or right is exhausted, L is filled with the remainder of the list that has not been merged (Lines 11-12).

[Figure 3: step-by-step illustration of the Merge algorithm, panels (a)-(d). Figure not reproduced.]

Figure 3, above, illustrates the Merge algorithm (Lines 6-12). The black arrows above the left and right arrays represent i and j from the pseudo-code. In (a), these values are initially zero, so the first elements of each array are compared. In (b) and (c), the lowest element from the previous step was appended to the end of L, and both i and j have increased. Finally, (d) represents Lines 11-12, where right is appended to L because left has been exhausted.

4. THEORETICAL EVALUATION

4.1 Criteria
The goal of both Merge Sort and Insertion Sort is to sort a list of comparable elements. The key operation in the execution of this goal is the comparison between array elements during the sorting process. This theoretical evaluation will measure the total number of comparisons between array elements in Merge Sort and in Insertion Sort according to each algorithm's worst and best case input.

4.2 Merge Sort
In merge sort, the comparisons occur during the merging step, when two sorted arrays are combined to output a single sorted array. During the merging step, the first available element of each array is compared, and the lower value is appended to the output array. When either array runs out of values, the remaining elements of the opposing array are appended to the output array.

4.2.1 Worst Case
The worst case scenario for Merge Sort is when, during every merge step, exactly one value remains in the opposing array; in other words, no comparisons were skipped. This situation occurs when the two largest values in a merge step are contained in opposing arrays. When this situation occurs, Merge Sort must continue comparing array elements in each of the opposing arrays until the two largest values are compared. The complexity of worst-case Merge Sort is:

T(N) = 2T(N/2) + N - 1                    (2.1)
T(N) = 2[2T(N/4) + N/2 - 1] + N - 1       (2.2)
T(N) = 4[2T(N/8) + N/4 - 1] + 2N - 3      (2.3)
T(N) = 8T(N/8) + N + N + N - 4 - 2 - 1    (2.4)

Eq. 2.1 is the recurrence relation for Merge Sort. T(N) refers to the total number of comparisons between array elements in Merge Sort when we are sorting the entire array of N elements. The divide stage performs Merge Sort on two halves of the array, which is what 2T(N/2) refers to. The final part of the equation, N-1, refers to the total comparisons in the merge step that returns the sorted array of N elements.

Eq. 2.1 describes the number of comparisons that occur in a merge sort, which is a recursive procedure. Since the method is recursive, we cannot count every comparison that occurs by looking at only a single call to Merge. Instead, we need to unroll the recursion and count the total number of comparisons. Equations 2.2-2.3 perform this unrolling of the recursion by substitution. We know the value of T(N) from 2.1, and by substitution we know the value of T(N/2). Eq. 2.4 is just a clearer view of Eq. 2.3. At this point we are in the third recursive call of Merge Sort, and a pattern has become clear enough to produce Eq. 2.5 below:

T(N) = 2^k * T(N/2^k) + kN - (2^k - 1)    (2.5)

In Eq. 2.5, a new variable called "k" is introduced. This variable represents the depth of the recursion. When the sort is recursively dividing the input array, the recursion stops when the array contains a single element. A single-element array is already sorted.

T(1) = 0                                  (2.6)
2^k = N                                   (2.7)
k = log2(N)                               (2.8)

By substituting Eq. 2.6 through Eq. 2.8 into Eq. 2.5, we eliminate the k term and reduce the recurrence relation to produce the complexity for merge sort:

T(N) = N*log2(N) - N + 1                  (2.9)

Eq. 2.9 shows that Merge Sort has O(N*log(N)) complexity with the worst possible input.

4.2.2 Best Case
Merge sort's best case is when the largest element of one sorted sub-array is smaller than the first element of its opposing sub-array, for every merge step that occurs. Only one element of the opposing array is ever compared, which reduces the number of comparisons in each merge step to N/2.

T(N) = 2T(N/2) + N/2                      (2.10)
T(N) = 2[2T(N/4) + N/4] + N/2             (2.11)
T(N) = 4[2T(N/8) + N/8] + N               (2.12)
T(N) = 8T(N/8) + 3N/2                     (2.13)
T(N) = 2^k * T(N/2^k) + kN/2              (2.14)
T(N) = (N/2)*log2(N)                      (2.15)

Eq. 2.14 is produced from 2.10 using the same substitution process (2.11-2.13) as before. Like Eq. 2.9 from the worst case, Eq. 2.15 shows that Merge Sort has O(N*log(N)) complexity with the best possible input.

4.3 Insertion Sort
Insertion sort passes forward (left-to-right) through the list once, comparing consecutive elements to see if any are out of place. If such a value is found, it is swapped backward (right-to-left) through the list until it is correctly sorted.

4.3.1 Worst Case
Insertion sort's worst case input is an array sorted in reverse order. In its worst case, Insertion sort places each new element at the beginning of the list, because it is the smallest value of the known portion of the list. For the ith element, Insertion Sort must perform i-1 comparisons to place the element correctly.

T(N) = sum(i = 1 to N-1) of i = N(N-1)/2  (3.1)

Eq. 3.1 shows that Insertion Sort has O(N^2) complexity with the worst possible input.

4.3.2 Best Case
The best case for Insertion Sort is when the input is already sorted. Since every element is correctly sorted, no swaps occur. This leads to Insertion sort having the complexity below:

T(N) = N - 1                              (3.2)

The best case for Insertion Sort is O(N) complexity.

4.4 Results
While Insertion Sort's best case certainly does beat Merge Sort's best case, it is usually far more interesting to know about the worst case scenario. Since we are far more likely to sort an unsorted list than a pre-sorted list, comparing the worst cases of the two algorithms provides far more relevant information. The exception is when we have some prior knowledge that the list will usually already be sorted.

Comparing the worst cases of Insertion Sort and Merge Sort, the complexity of Merge Sort is far better. The number of comparisons performed by Insertion Sort is bounded above by an N^2 function, while Merge Sort is bounded above by an N*log(N) function. Merge Sort's N*log(N) time complexity is much more efficient for large inputs than Insertion Sort's complexity. Thus, Merge Sort is a better choice than Insertion Sort when there is no prior knowledge of the data to be sorted.

5. EMPIRICAL EVALUATION
The goal of this evaluation is to show a clear improvement in the task of sorting when using Merge Sort instead of Insertion Sort. The evaluation criterion is a measurement of the processor time used by each algorithm.

5.1 Environment and Data
The evaluation of Merge Sort was conducted on a Windows machine containing a 7200 RPM internal drive, an Intel Q6600 quad-core processor, and 3.5GB of DDR2 RAM. The Merge Sort and Insertion Sort algorithms were run on the Java 1.5.0 Virtual Machine with no other processes running on the same CPU core. The data for the evaluation can be found at http://cs.fit.edu/~pkc/classes/writing/httpdJan24.log.zip

5.2 Procedure
The data file above contains over 320 thousand records, which is too many records to handle under the default memory constraints of the JVM. Instead, 20 thousand records are read and used at a time with the procedure outlined below.

To get a better analysis of Merge Sort and Insertion Sort, the procedure was repeated ten times, and the time measurements were averaged to produce the results in this paper. The procedure is as follows:

1. Store the first 20000 records in an array.
2. Choose any 500 consecutive records to sort.
3. Sort these 500 records separately using Merge Sort and Insertion Sort.
4. Record the completion time (from the moment a sort is called until the moment that sort completes) of each sort.
5. Repeat steps 2-4, incrementing the number of consecutive records by 500.

5.3 Results
Figure 4 below shows the results of the experiment. In the figure, the completion time for Merge Sort is in red and the completion time for Insertion Sort is in blue.

[Figure 4: completion time versus number of records for Merge Sort and Insertion Sort. Figure not reproduced.]

From Figure 4 we clearly see that, as the number of sorted records increases, the time Insertion sort takes to sort also increases at a very rapid rate. In fact, performing a curve fitting on the graph produces a curve that fits the data points very well. Merge sort, on the other hand, takes so much less time that it is unclear from Figure 4 alone whether its completion time is increasing at all. Linear regression was performed on the data to verify that the time was increasing; it is in fact increasing linearly as the size of the data increases.

From the results it is very clear that Merge Sort sorts faster than Insertion Sort for data sets of any significant size, and it will continue to do so no matter how much the data size is increased. Also, from the same results, the difference between the two sorts for array sizes smaller than 2000 is so insignificantly small that using Insertion sort would not give any short-term speed increase.

6. CONCLUSION
In this paper we introduced the Merge Sort algorithm, which takes a Divide-and-Conquer approach to rearranging a list of linearly comparable values into ascending order. The algorithm is an improvement over its predecessors, Insertion Sort, Selection Sort, and Bubble Sort. While its predecessors are limited to solving the problem with a quadratic time complexity, Merge Sort will rearrange any random arrangement of values with a guaranteed time complexity of O(N*log(N)).

Despite this guarantee, Merge Sort actually performs worse as input sizes decrease during the Divide phase. This is largely due to the overhead of creating, working with, and destroying many small arrays. One possible improvement to the Merge Sort algorithm may be to use a "faster" sorting algorithm once array sizes decrease below a certain size, to prevent the slowdown caused by this overhead. Such future work would also include determining whether that size should be an input parameter, and whether it should be a discrete value or a percentage of the input size.

7. REFERENCES
[1] Cormen, Thomas, et al. Introduction to Algorithms, 2nd Edition. 15-19. MIT Press, 2001.
[2] Sedgewick, Robert. "Selection Sort". Algorithms in Java, 3rd Edition. 283-284. Addison-Wesley, 2003.
[3] Sedgewick, Robert. "Bubble Sort". Algorithms in Java, 3rd Edition. 285-287. Addison-Wesley, 2003.
[4] Sedgewick, Robert. "Insertion Sort". Algorithms in Java, 3rd Edition. 288-289. Addison-Wesley, 2003.
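The Selection Sort and Bubble Sort procedures described in Sections 2.1 and 2.3 can be sketched as follows. This is illustrative Python rather than the Java used in the paper's evaluation, and the function names are ours; the early-exit check in bubble_sort is the "stop when a pass makes no swaps" property noted in Section 2.3.

```python
def selection_sort(a):
    """Section 2.1: repeatedly select the smallest remaining value
    and place it at the front of the unsorted portion."""
    a = list(a)
    for i in range(len(a)):
        smallest = min(range(i, len(a)), key=a.__getitem__)
        a[i], a[smallest] = a[smallest], a[i]
    return a

def bubble_sort(a):
    """Section 2.3: repeated passes swapping out-of-order neighbors;
    a pass with no swaps means the list is already sorted, so stop early."""
    a = list(a)
    for end in range(len(a) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:          # early termination
            break
    return a
```

Both sketches return a new list rather than sorting in place, purely for convenience of illustration.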
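The MERGESORT/MERGE pseudo-code of Section 3 can be sketched as a minimal executable version, written here in Python for illustration (the pseudo-code's in-place output list L becomes a returned list; names and line references follow the pseudo-code).

```python
def merge_sort(L):
    """MERGESORT (Lines 1-5): recursively divide, then merge."""
    if len(L) <= 1:                  # Line 1: a single-element list is sorted
        return list(L)
    mid = len(L) // 2                # Line 2: DIVIDE splits L into halves
    left = merge_sort(L[:mid])       # Line 3
    right = merge_sort(L[mid:])      # Line 4
    return merge(left, right)        # Line 5

def merge(left, right):
    """MERGE (Lines 6-12): combine two sorted lists into one sorted list."""
    L = []
    i = j = 0                                   # Line 6
    while i < len(left) and j < len(right):     # Line 7
        if left[i] < right[j]:                  # Line 8
            L.append(left[i]); i += 1           # Line 9
        else:
            L.append(right[j]); j += 1          # Line 10
    # Lines 11-12: append the remainder of whichever list is not exhausted
    L.extend(left[i:] if i < len(left) else right[j:])
    return L
```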
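The comparison counts derived in Section 4.2 can also be checked empirically. The sketch below instruments the merge step with a counter; the worst_input construction is ours, not the paper's: it interleaves a sorted list so that every merge step keeps its two largest values in opposing halves, which is exactly the worst case described in Section 4.2.1. For N a power of two, the counts land exactly on Eqs. 2.9 and 2.15.

```python
import math

def merge_count(left, right, counter):
    """Merge two sorted lists, tallying element comparisons in counter[0]."""
    L, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if left[i] < right[j]:
            L.append(left[i]); i += 1
        else:
            L.append(right[j]); j += 1
    L.extend(left[i:] if i < len(left) else right[j:])
    return L

def merge_sort_count(L, counter):
    if len(L) <= 1:
        return list(L)
    mid = len(L) // 2
    return merge_count(merge_sort_count(L[:mid], counter),
                       merge_sort_count(L[mid:], counter), counter)

def worst_input(vals):
    """Reorder a sorted list so every merge step does the maximum
    number of comparisons (even/odd interleaving, applied recursively)."""
    if len(vals) <= 1:
        return list(vals)
    return worst_input(vals[0::2]) + worst_input(vals[1::2])

N = 64
# Best case (Eq. 2.15): an already-sorted list costs (N/2)*log2(N) comparisons.
c = [0]
merge_sort_count(list(range(N)), c)
assert c[0] == (N // 2) * int(math.log2(N))        # 192 for N = 64
# Worst case (Eq. 2.9): costs N*log2(N) - N + 1 comparisons.
c = [0]
merge_sort_count(worst_input(list(range(N))), c)
assert c[0] == N * int(math.log2(N)) - N + 1       # 321 for N = 64
```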
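Insertion Sort's bounds (Eqs. 3.1 and 3.2 in Section 4.3) can be checked the same way; this sketch counts comparisons directly while sorting.

```python
def insertion_sort_count(a):
    """Insertion sort (Section 4.3): returns (sorted_list, comparison_count)."""
    a = list(a)
    comps = 0
    for i in range(1, len(a)):
        j = i
        # swap backward while the element is smaller than its left neighbor
        while j > 0:
            comps += 1
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1
            else:
                break
    return a, comps

N = 100
# Worst case (Eq. 3.1): reverse-sorted input needs N(N-1)/2 comparisons.
_, worst = insertion_sort_count(range(N - 1, -1, -1))
assert worst == N * (N - 1) // 2
# Best case (Eq. 3.2): already-sorted input needs N-1 comparisons.
_, best = insertion_sort_count(range(N))
assert best == N - 1
```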
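The measurement procedure of Section 5.2 can be sketched as below. This is a stand-in, not the paper's harness: random integers replace the actual log records, Python's built-in sorted() (itself a merge-based sort) stands in for the paper's Merge Sort, and the loop stops at 2000 records to keep the demo quick; the paper continued to 20000 in steps of 500 and averaged ten runs.

```python
import random
import time

def insertion_sort(a):
    """Straightforward insertion sort used as the comparison baseline."""
    a = list(a)
    for i in range(1, len(a)):
        v, j = a[i], i
        while j > 0 and a[j - 1] > v:
            a[j] = a[j - 1]
            j -= 1
        a[j] = v
    return a

def completion_time(sort_fn, data, repeats=10):
    """Steps 4-5: time from the moment the sort is called until it
    completes, averaged over repeated runs (the paper averaged ten)."""
    total = 0.0
    for _ in range(repeats):
        sample = list(data)          # fresh copy so each run sorts raw input
        start = time.perf_counter()
        sort_fn(sample)
        total += time.perf_counter() - start
    return total / repeats

# Steps 1-2: stand-in for the first 20000 log records (random keys, hypothetical)
records = [random.randrange(10 ** 6) for _ in range(20000)]
for n in range(500, 2001, 500):      # step 5: grow the window by 500 each round
    window = records[:n]
    t_merge = completion_time(sorted, window)
    t_insert = completion_time(insertion_sort, window)
    print(n, t_merge, t_insert)
```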