
COSC 460
Department of Computer Science, University of Canterbury

Supervisor: Dr. A. Moffat

John Wright
October 1986

CONTENTS

Abstract

1.0 Introduction

2.0 Preliminary Testing
    2.1 Sortedness Ratio
    2.2 Inversions
    2.3 Number of Ascending Sequences
    2.4 Longest Ascending Sequence
    2.5 Number of Exchanges
    2.6 Results
    2.7 Conclusions

3.0 The Implementation of Five Algorithms
    3.1 Linear Insertion Sort
    3.2 Ysort
    3.3 Cksort
    3.4 Natural Mergesort
    3.5 Smoothsort
    3.6 Results
    3.7 Conclusions

4.0 Improvements, Modifications and Other Ideas
    4.1 Improvements to Ysort
    4.2 Improvements to Cksort
    4.3 Alterations to Smoothsort
    4.4 Alterations to Linear Insertion Sort

5.0 Conclusions

Appendix A
Appendix B
Appendix C

References

ABSTRACT

The following five algorithms for sorting in situ are examined: linear insertion sort, cksort, natural mergesort, ysort and smoothsort. Quicksort and heapsort are also considered, although they are not discussed in detail. The algorithms have been implemented and compared, with particular emphasis placed on their efficiency when the input lists are nearly sorted. Five measures of sortedness are investigated: the sortedness ratio, the number of inversions, the longest ascending sequence, the number of ascending sequences, and the number of exchanges. The sortedness ratio is chosen as the basis for comparison between these measures and is used in the experiments on the above algorithms. Improvements to cksort and ysort are suggested, while modifications to smoothsort and linear insertion sort failed to improve the efficiency of those algorithms.

1.0 INTRODUCTION

One of the most widely studied problems in computer science is sorting. Algorithms for sorting were developed early and have received considerable attention from mathematicians and computer scientists. A large number of algorithms have been developed, but no single algorithm is best for all situations, and in many cases the files to be sorted are too large to fit in memory and must be sorted externally using discs or tapes as the storage medium. In recent years interest in sorting has concentrated on algorithms that exploit the degree of sortedness of the input. Quicksort is an internal sorting algorithm that was first proposed by C.A.R. Hoare [1] in 1962 and has been studied extensively. It has been widely accepted as the most efficient internal sorting algorithm and achieves the O(n log n) lower bound on average, but it has a worst case complexity of O(n^2) and does not attempt to exploit the degree of sortedness of the input. Many modifications to quicksort have been proposed and there exists a large number of algorithms based on the original theme. Other algorithms such as heapsort and mergesort achieve the O(n log n) bound without quicksort's worst case complexity, but they too do not take the sortedness of the input into account.

Considered here are five algorithms for sorting in situ. Each algorithm is claimed to have good behaviour on nearly sorted lists. In particular, linear insertion sort, natural mergesort, ysort, cksort and smoothsort have been studied. Both ysort and cksort utilize the quicksort algorithm or one of its descendants. Natural mergesort is an extension of the merge sorting technique, but uses the natural runs, ascending or descending sequences, in the data. Smoothsort is a new algorithm, proposed by Dijkstra [8], that uses a sophisticated data structure based on heaps; it was designed specifically to sort nearly sorted lists in linear time while retaining a worst case complexity of O(n log n). Finally, linear insertion sort has been included as its good performance on nearly sorted lists is widely known, although it has a worst case complexity of O(n^2). All of the algorithms attempt to utilize the sortedness of the data in some way and each is claimed to have O(n) complexity on nearly sorted lists.

2.0 PRELIMINARY TESTING

In dealing with sorting algorithms the phrases " ... on a nearly sorted list ... " and " ... on a randomly sorted list ... " arise frequently, but a list may be "nearly" sorted according to one measure and "randomly" sorted according to another. Intuitively a list is nearly sorted if only a few of its elements are out of order. The problem is to define an easily computable measure that coincides with intuition. A list will be considered sorted if its elements form a non decreasing sequence, and reverse sorted if its elements form a non ascending sequence. A list of length n can be sorted in O(n) time if the number of operations on all the elements of the list is proportional to n. For example the list,

5 6 7 8 4 3 2 1,

can be sorted in O(n) time, since n/2 comparisons and n/2 swaps suffice, provided the list is sorted by merging sequences from opposite ends. This chapter discusses and relates five measures of sortedness: the sortedness ratio, the number of inversions, the number of ascending sequences, the longest ascending sequence and the number of exchanges. The first measure, discussed in [2], provides a basis from which to compare the other four. Results have been graphed and are also included in Appendix A.

2.1 Sortedness Ratio

Cook and Kim [2] defined the sortedness ratio of a list of length n as

sortedness ratio = k/n,

where k is the minimum number of elements that must be removed to leave the list sorted. For a sorted list this ratio is 0, since all the elements are in their correct positions, and for a list in reverse sorted order this ratio approaches 1. For example, the list

2 1 3 5 4

has sortedness ratio 2/5, since removing either 5 or 4, and either 1 or 2, leaves the list sorted.
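The minimum number of removals, k, equals n minus the length of the longest non decreasing subsequence of the list, so the ratio is easily computed. Below is a minimal sketch of this computation in Pascal (illustrative code, not taken from the report), using a simple O(n^2) dynamic program.

    program SortednessRatio;
    { illustrative: k = n - length of the longest non decreasing
      subsequence, found here by an O(n^2) dynamic program }
    const
      n = 5;
    var
      A: array[1..n] of integer;
      len: array[1..n] of integer;  { len[i] = longest non decreasing
                                      subsequence ending at A[i] }
      i, j, best: integer;
    begin
      A[1] := 2; A[2] := 1; A[3] := 3; A[4] := 5; A[5] := 4;
      best := 0;
      for i := 1 to n do
      begin
        len[i] := 1;
        for j := 1 to i - 1 do
          if (A[j] <= A[i]) and (len[j] + 1 > len[i]) then
            len[i] := len[j] + 1;
        if len[i] > best then
          best := len[i]
      end;
      writeln('k = ', n - best)   { prints k = 2, a ratio of 2/5 }
    end.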

This measure of sortedness is by no means perfect. Consider the following lists:

8 7 6 5 4 3 2 1
2 1 4 3 6 5 8 7

The first list has a sortedness ratio of 7/8, the maximum possible for a list, yet it can be sorted in O(n) time by reversing it. The second list contains only local disorder and has a sortedness ratio of 4/8, indicating a high degree of unsortedness; yet it too can be sorted in O(n) time by an algorithm that exploits local disorder.

2.2 Inversions

For a list of n elements x(1), x(2), ..., x(n), the number of inversions is defined as

number of inversions = the sum over i = 1, ..., n-1 and j = i+1, ..., n of inv(i, j),

where inv(i, j) is 1 if x(i) > x(j), and 0 if x(i) <= x(j); that is, the number of pairs of elements that are out of order.

The number of inversions is bounded below by 0, for the sorted list, and above by n(n - 1)/2, for the reverse sorted list. For example, in the list

4 2 3 8 6 5

there are 5 inversions, namely (4,2), (4,3), (8,6), (8,5) and (6,5). This measure suggests that a list in reverse order is less sorted than a list of elements chosen from a random distribution. For example, in the lists

10 9 8 7 6 5 4 3 2 1
8 7 5 3 1 4 9 2 6 10

the first list has 45 inversions, indicating a high degree of unsortedness, but it can certainly be sorted in O(n) time. The second list contains 22 inversions but has no obvious properties that allow it to be sorted in O(n) time.
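A direct O(n^2) count from the definition is straightforward. The following sketch is illustrative (assuming a type TList = array[1..maxn] of integer) and is sufficient for lists of the sizes used in these experiments.

    function Inversions(var A: TList; n: integer): integer;
    var
      i, j, count: integer;
    begin
      count := 0;
      for i := 1 to n - 1 do
        for j := i + 1 to n do
          if A[i] > A[j] then
            count := count + 1;
      Inversions := count   { returns 5 for the list 4 2 3 8 6 5 }
    end;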

2.3 Number of Ascending Sequences

The number of ascending sequences in a list of length n is defined as

number of ascending sequences = 1 + the sum over i = 1, ..., n-1 of run(i),

where run(i) is 1 if x(i+1) < x(i), and 0 otherwise.

For example in the list,

5 4 9 2 6 7 8 3 10 11 15, there are 4 ascending sequences partitioned as follows,

(5) (4 9) (2 6 7 8) (3 10 11 15).

For a list in sorted order there is just 1 ascending sequence and for a list in reverse order there are n ascending sequences. This measure has its disadvantages on lists which have a high degree of local disorder. For example, lists such as

2 1 4 3 6 5 8 7 ...

have a large number of ascending sequences but certainly have properties that allow linear time sorting, since each element is only one position from its correct position in the sorted list.
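Counting this measure takes a single pass. A sketch follows (illustrative, with the same assumed TList type as above):

    function AscendingSequences(var A: TList; n: integer): integer;
    var
      i, runs: integer;
    begin
      runs := 1;
      for i := 1 to n - 1 do
        if A[i + 1] < A[i] then   { a descent starts a new run }
          runs := runs + 1;
      AscendingSequences := runs  { returns 4 for 5 4 9 2 6 7 8 3 10 11 15 }
    end;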

2.4 Longest Ascending Sequence

The longest ascending sequence of a list can be seen from the following example,

1 5 4 1 2 3 6 7 9 8,

which is partitioned into the following ascending sequences:

(1 5) (4) (1 2 3 6 7 9) (8),

with the longest ascending sequence being of length 6.

A sorted list has a single ascending sequence of length n and a reverse sorted list has n ascending sequences of length 1; generally, the greater the number of ascending sequences the less sorted the list. But consider the following lists:

2 1 4 3 6 5 8 7 10 9
10 9 8 7 6 5 4 3 2 1
10 5 6 3 5 8 2 1 7 4

Both the 1st and 2nd lists have immediate properties that allow O(n) time sorting. In the first list each element is one position from its sorted position, even though its longest ascending sequence is only of length 2; the second list is simply in reverse order, with a longest ascending sequence of length 1. The 3rd list has no obvious properties, yet its longest ascending sequence is of length 3, so according to this measure it is more sorted than the first two lists.

2.5 Number of Exchanges

The number of exchanges in a list is the smallest number of exchanges of elements needed to bring the list into a sorted order. For example the list,

1 3 8 7 9 5 4 6

requires 3 exchanges to move the elements into their correct positions:

1 3 8 7 9 5 4 6
1 3 4 7 9 5 8 6
1 3 4 5 9 7 8 6
1 3 4 5 6 7 8 9

A sorted list clearly requires 0 exchanges; a reverse sorted list, on the other hand, requires floor(n/2) exchanges. Since each exchange moves at least one element into its correct place, at most O(n) exchanges are required to construct the sorted list. For example in the list,

5 1 2 3 4,

n-1, or 4, exchanges are required to sort the list as follows.

5 1 2 3 4
1 5 2 3 4
1 2 5 3 4
1 2 3 5 4
1 2 3 4 5

One of the disadvantages of this measure can be seen in the above example, which required the maximum number of exchanges to sort the list, indicating a high degree of unsortedness, even though the list can certainly be sorted in O(n) time by moving the first element to the last position of the list. The main disadvantage of this measure is illustrated in figure 4, which shows that the number of exchanges does not correlate well with the sortedness ratio and the other measures compared. Consider a sorted list of length n. If just one element is removed and inserted at a random position elsewhere in the list, the average distance between the element's new position and its correct position will be n/3, and about n/3, that is O(n), exchanges will be required to move the element to its correct position. So a sortedness ratio of 1/n already entails O(n) exchanges; as n increases, 1/n becomes a smaller fraction but still results in O(n) exchanges, whereas a list with a smaller sortedness ratio should require less effort to sort.
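The minimum number of exchanges itself is easy to compute from the cycle structure of the list, since each exchange can place at least one element in its final position. A minimal sketch follows (illustrative, assuming the keys are a permutation of 1..n; the example data is the list 5 1 2 3 4 from above):

    program MinExchanges;
    { illustrative sketch: follow the permutation cycles, each
      exchange placing one element in its home position }
    const
      n = 5;
    var
      A: array[1..n] of integer;
      i, t, count: integer;
    begin
      A[1] := 5; A[2] := 1; A[3] := 2; A[4] := 3; A[5] := 4;
      count := 0;
      for i := 1 to n do
        while A[i] <> i do
        begin
          t := A[A[i]];          { swap A[i] into its home position }
          A[A[i]] := A[i];
          A[i] := t;
          count := count + 1
        end;
      writeln('minimum exchanges = ', count)   { prints 4 for this list }
    end.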

2.6 Results

Experiments were run on lists of length 10, 100 and 1000. The lists were generated by first taking a sorted list and removing k elements from random positions within the list, and then inserting these k elements at new random positions. If an inserted element formed an ascending sequence with its two adjacent elements it was shifted either left or right to ensure the list had a sortedness ratio of k/n. The sortedness ratio was varied between 0% and 30%, and for each list generated the number of inversions, the number of ascending sequences, the number of exchanges and the longest ascending sequence were determined. Results obtained were graphed, each point being the average of 10 trials. Results have also been included in Appendix A, along with the maximum and minimum values for each trial. Of particular interest was the relationship with the other measures of sortedness when the list was nearly sorted. Further analysis showed that as the sortedness ratio rose above 30%, the first two measures considered, the number of inversions and the number of ascending sequences, began to approach an upper bound asymptotically. This can be seen in figures 1 and 2, where a straight line has been fitted to the points plotted; as the sortedness ratio increases the points fall below the fitted line. It can be seen from the graphs that the first three measures considered correlate successfully with the sortedness ratio. The number of inversions, figure 1, increases quadratically as the sortedness ratio increases, but a linear relationship is obtained when the number of inversions is scaled by 1/n^2. The number of ascending sequences, figure 2, gives a good linear approximation, especially when the sortedness ratio is small, but as the sortedness ratio approaches 25% this relationship begins to deteriorate. The inverse of the longest ascending sequence, figure 3, has the closest relationship with the sortedness ratio, and no deterioration in the linear correlation was apparent as the sortedness ratio was varied from 0% to 30%. The last measure considered, the number of exchanges, shows a poor correlation with the sortedness ratio for the reasons explained in 2.5. As n increases, the number of exchanges approaches the upper bound of n-1 exchanges more rapidly.

[Figure 1. Sortedness Ratio and the Number of Inversions (n = the length of the list; inversions scaled by 1/n^2).]

[Figure 2. Sortedness Ratio and the Number of Ascending Sequences.]

[Figure 3. Sortedness Ratio and the Largest Ascending Sequence.]

[Figure 4. Sortedness Ratio and the Number of Exchanges.]

2.7 Conclusions

The problem still remains to find a measure of sortedness that returns appropriate values for all lists that have properties allowing O(n) sorting. For the measures discussed here, a list may be nearly sorted by one measure and very unsorted by another. Algorithms that have been claimed to give good performance on nearly sorted lists have typically been tried on lists in which the sortedness happens to be of the kind the algorithm best exploits. For example, natural mergesort utilizes the natural runs, ascending or descending sequences, and compared with alternative algorithms gives good results on lists which consist of a small number of long runs. Insertion sort performs well on lists that have a high degree of local disorder, for example sequences of the form

2, 1, 4, 3, 6, 5, ...

This type of sequence has a low number of inversions but is almost totally unsorted according to the other measures discussed.

3.0 THE IMPLEMENTATION OF FIVE ALGORITHMS

Five algorithms, linear insertion sort, ysort, cksort, natural mergesort and smoothsort, were chosen, implemented in Pascal and run on a Prime 750 at the University of Canterbury. These algorithms were chosen because their authors have claimed good performance on nearly sorted lists, although in many cases a precise definition of a nearly sorted list has not been given, and justification of an algorithm's efficiency has only been provided by way of worked examples or a general discussion of the algorithm's behaviour. In addition to these five algorithms, quicksort and heapsort have been implemented. Quicksort is generally accepted as the best internal sorting algorithm and is often implemented as a system sort. It has been implemented here as a hybrid with linear insertion sort, using the middle element as the partition element [3]. Heapsort was included as its worst case and average case complexity is O(n log n), unlike quicksort, which has an O(n^2) worst case complexity. For this reason heapsort is often implemented in situations where the sorting time of an algorithm is critical, since its O(n log n) behaviour can be guaranteed.

3.1 Linear Insertion Sort

The simplest of the algorithms implemented was linear insertion sort [4]:

begin
  i := 2;
  while i <= n do
  begin
    j := i;
    { the and must be evaluated conditionally (written cand in the
      original), so that A[j] < A[j-1] is not tested when j = 1 }
    while (j > 1) and (A[j] < A[j-1]) do
    begin
      swap(A[j], A[j-1]);
      j := j - 1
    end;
    i := i + 1
  end
end

The simplicity of this algorithm makes it very useful, and for small values of n it is frequently implemented even if the structure of the list is unknown. The algorithm has an O(n^2) worst case complexity, but on lists in which each element is no more than k positions from its final position it has an O(kn) complexity. This makes it very useful as a sorting algorithm for nearly sorted lists and also as a component of hybrid algorithms. For example, Sedgewick [3] used insertion sort to very good effect in a hybrid algorithm with quicksort.

3.2 Ysort

Ysort [5] is a variation on the quicksort algorithm. Its main difference from quicksort occurs in the construction of the sublists as shown in the following description of the algorithm.

The unsorted array is in A[l..r]; initially l is 1 and r is n.

begin
  i, j, T := l, r, partition element
  while i <= j do
  begin
    while A[i] < T do
    begin
      i := i + 1
      record the positions of the minimum and maximum values
        of the left subfile, A[l..i]
    end;
    while A[j] > T do
    begin
      j := j - 1
      record the positions of the minimum and maximum values
        of the right subfile, A[j..r]
    end;
    i, j, A[i], A[j] := i+1, j-1, A[j], A[i]
    record the positions of the minimum and maximum values
      for the left and right subfiles
  end

  swap the minimum and maximum values into position, if necessary

  if the left subfile is not sorted then sort the sublist A[l+1 .. j-1]
  if the right subfile is not sorted then sort the sublist A[i+1 .. r-1]
end

As the sublists are constructed, the locations of the minimum and maximum elements of each sublist are recorded. When the partitioning step is completed the minimum and maximum elements are exchanged with the leftmost and rightmost elements of each sublist respectively. Ysort is then used again to sort the left and right sublists. The new sublists to be sorted exclude the end elements, as these are the minimum and maximum elements and are now in their correct positions. After a partitioning step on an unsorted array A[l..r] the array will have the following properties.

[Diagram: after partitioning, A[l..j] holds the elements <= T and A[i..r] the elements >= T, where T is the value of the partition element. Within each subfile the minimum has been swapped to the left end and the maximum to the right end, so A[l] and A[j] are the minimum and maximum of the left subfile, A[i] and A[r] the minimum and maximum of the right subfile, and only A[l+1..j-1] and A[i+1..r-1] remain to be sorted.]

The advantage of this algorithm is that during the construction of each sublist it also keeps track of whether each new element placed in the sublist is the new maximum element of the left sublist or the new minimum element of the right sublist. If, for the left sublist, each new element placed in the list becomes the new maximum element, then the list is sorted and need not be partitioned any further. A similar argument applies to the right sublist. For a nearly sorted list ysort needs only to partition until it finds a sorted sublist, and since there will be many sorted sublists in a nearly sorted list, few partitioning steps will be required to sort the list completely. The cost of this algorithm compared with quicksort is the additional number of comparisons it makes to keep track of the maximum and minimum elements of each sublist.

3.3 Cksort

Cksort as proposed by Cook and Kim [2] is essentially a hybrid sorting technique based on three sorting algorithms: quicksort, linear insertion sort and merging. Like all hybrid algorithms it attempts to exploit the advantages of each of its composite algorithms.

A is the unsorted list, B is used for processing.

begin
  first, second, place, upto := 1, 2, 1, 1
  while second <= n do
  begin
    if A[first] > A[second] then
    begin
      B[place], B[place+1], place := A[first], A[second], place + 2
      first, second := previous element compared, next uncompared element
    end
    else
    begin
      A[upto], A[upto+1] := A[first], A[second]
      first, second, upto := upto+1, second+1, upto+2
    end
  end
  if place - 1 > 30 then        { number of elements in list B }
    quicksort B[1 .. place-1]
  else
    insertion sort B[1 .. place-1]
  merge lists A[1 .. upto-1] and B[1 .. place-1]
end

The first pass of cksort scans the list, removes all unordered pairs of elements and places them in a separate list. After a pair of unordered elements has been removed, the next pair compared are the elements immediately preceding and immediately following the pair just removed. Upon completion of the first pass the original list will be sorted, since it contains only ordered pairs. The second list, of unordered pairs, is then sorted by quicksort if it contains more than 30 elements, or by insertion sort otherwise. Finally the two sorted lists are merged. For a nearly sorted list few unordered pairs will be removed: the removal of the unordered pairs takes O(n) time, the small number of unordered pairs is sorted efficiently by quicksort or linear insertion sort, and the merging with the first list takes O(n) time, giving the algorithm a complexity of O(n) on a nearly sorted list. An unsorted list will consist mostly of unordered pairs, which will be placed in the second list to be sorted by quicksort; in this case the algorithm deteriorates to the average case complexity of quicksort, namely O(n log n).
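The first pass can be viewed as a stack scan: keep a growing ordered prefix and, whenever the next element is smaller than the last kept element, move both to list B and back up one position. A minimal runnable sketch follows (illustrative code, not the authors' implementation):

    program CksortFirstPass;
    { illustrative sketch of cksort's first pass: extract unordered
      (descending) pairs into B, leaving a non decreasing list in A }
    const
      maxn = 100;
    type
      TList = array[1..maxn] of integer;
    var
      A, B: TList;
      i, n, na, nb: integer;
    begin
      n := 8;
      A[1] := 1; A[2] := 3; A[3] := 8; A[4] := 7;
      A[5] := 9; A[6] := 5; A[7] := 4; A[8] := 6;
      na := 0; nb := 0;
      for i := 1 to n do
        if na = 0 then
        begin
          na := 1; A[1] := A[i]     { nothing kept yet: keep the element }
        end
        else if A[na] > A[i] then
        begin
          { unordered pair: move both elements to B and back up }
          nb := nb + 1; B[nb] := A[na];
          nb := nb + 1; B[nb] := A[i];
          na := na - 1
        end
        else
        begin
          na := na + 1;
          A[na] := A[i]             { A[1..na] stays non decreasing }
        end;
      write('kept in A:');
      for i := 1 to na do write(' ', A[i]);
      writeln;
      write('moved to B:');
      for i := 1 to nb do write(' ', B[i]);
      writeln
      { prints "kept in A: 1 3 4 6" and "moved to B: 8 7 9 5" }
    end.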

3.4 Natural Mergesort

Natural mergesort is described by Knuth [6] and takes advantage of the runs, ascending and descending sequences, within a list. The algorithm merges runs from opposite ends of the unsorted list, the merged sequences being placed at alternate ends of a separate list. When all runs from the first list have been merged, the separate list will contain half as many runs as the first. The separate list is then processed in the same manner, until eventually a list contains only one non decreasing sequence.

list1 contains the unsorted list; list2 is the list used in processing.

begin
  number of runs := any value > 1    { forces entry into the loop }
  lista, listb := list1, list2
  while number of runs > 1 do
  begin
    l, r := 1, n
    while l <= r do
    begin
      merge the ascending sequence starting at l with the ascending
        sequence (read right to left) starting at r
      put the merged sequence at the opposite end of listb from
        the previous merged sequence
      l := start of the next ascending sequence from the left
      r := start of the next ascending sequence from the right
    end
    lista, listb := listb, lista
  end

  if the sorted list has ended up in list2 then
    copy it back to list1
end

If the list is nearly sorted there will be few passes, but on a random list there will be about n/2 runs, resulting in O(log n) passes, since each pass halves the number of runs. Each pass makes O(n) comparisons, giving the algorithm an average case and worst case complexity of O(n log n).
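The heart of each pass is merging one run read left to right with another read right to left. A sketch of one such merge step follows (illustrative, assuming the TList array type used earlier; the real algorithm also alternates the end of listb at which output is placed):

    procedure MergeRuns(var A, B: TList; l1, r1, l2, r2: integer;
                        var p: integer);
    { A[l1..r1] ascends left to right; A[l2..r2] ascends when read
      from r2 down to l2.  The merged run is written to B starting
      at position p, which is advanced past the output. }
    var
      i, j: integer;
    begin
      i := l1; j := r2;
      while (i <= r1) and (j >= l2) do
      begin
        if A[i] <= A[j] then
        begin B[p] := A[i]; i := i + 1 end
        else
        begin B[p] := A[j]; j := j - 1 end;
        p := p + 1
      end;
      while i <= r1 do begin B[p] := A[i]; i := i + 1; p := p + 1 end;
      while j >= l2 do begin B[p] := A[j]; j := j - 1; p := p + 1 end
    end;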

In the following example each line shows the list at the start of the algorithm and after each merging pass. Each pass merges natural runs from opposite ends of the list, placing the merged sequences at alternate ends of the new list.

503 087 512 061 908 170 897 275 653 426 154 509 612 677 765 703
503 703 765 061 612 908 154 275 426 653 897 509 170 677 512 087
087 503 512 677 703 765 154 275 426 653 908 897 612 509 170 061
061 087 170 503 509 512 612 677 703 765 897 908 653 426 275 154
061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908

3.5 Smoothsort

Smoothsort is a sorting algorithm proposed by Dijkstra [8][9]. It attempts to address the problem of sorting nearly sorted lists with O(n) complexity while retaining a worst case complexity of only O(n log n), and to provide a smooth transition between the two. The implementation of the algorithm has also been described by Hertel [10], with a slight variation on Dijkstra's proposal. However the asymptotic behaviour of the algorithm remains the same. Hertel's description is easier to analyse and implement and was chosen as the algorithm to be coded and explained.

begin

while elements still remain do
begin
  k := floor(log2(m + 1)), where m is the number of remaining elements
  make a complete heap of the next 2^k - 1 remaining elements
  swap the new root leftwards until the roots form a non decreasing sequence
  sift the root into its correct place in the heap
end;

m := n;
while m > 1 do
begin
  if the size of the last heap is 1 then
    remove the heap from consideration; its element is in its correct position
  else
  begin
    remove the root from consideration (the root element is in its correct
      position) and split the remainder of the heap into two heaps of equal size;
    swap the new roots leftwards so that the roots form a non decreasing sequence;
    sift the new roots into their correct places within their heaps
  end;
  m := m - 1    { one more element in its correct place }
end

end

The algorithm uses a sophisticated data structure based on heaps and consists of two passes. The first pass builds a forest of complete heaps. Each of the heaps constructed has 2^k - 1 elements, k >= 1, so each element of a heap has either 0 or 2 sons. The first heap is as large as possible and will contain at least half the elements of the list. Successive heaps are of decreasing size; the exception to this rule occurs for the last two heaps constructed, which may be of the

same size. The root of each heap is at the rightmost position of the elements of the heap. Each heap is constructed over A[i..j] so that a root at position j has sons at position j-1 and at position i + (j-i) div 2 - 1. For example, in an array of size 28 the sizes of the heaps constructed will be 15, 7, 3 and 3, the roots of the heaps being located at positions 15, 22, 25 and 28 respectively. The last two heaps constructed are of size 3.

[Diagram: an array of 28 elements partitioned into complete heaps of sizes 15, 7, 3 and 3, with roots at positions 15, 22, 25 and 28; arrows indicate the sons of each element.]

In addition to the above properties, the forest of heaps is constructed so that the roots of the heaps form a non decreasing sequence. The last heap constructed will have its root at position n, and A[n] will be the largest of all the roots and therefore of all the heaps. A heap of size k can be constructed in O(k) time [4], and since the sizes of all the heaps constructed sum to n, the construction of the forest of heaps takes O(n) time. Keeping the roots of all the heaps in non decreasing order takes O(log n) time per heap, since a root can be swapped left at most log n times and then sifted to its maximum depth of log k, with k <= n, in one of the heaps. The first pass then takes O((log n)^2 + n) = O(n) time. For a list already in sorted order the forest of heaps already exists and no swapping is required; only the construction of the heaps is performed, which takes O(n) time.
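The sizes of the heaps in the forest follow directly from n: repeatedly take the largest complete heap of 2^k - 1 elements that fits into what remains. A minimal sketch (illustrative helper, not part of the published algorithm):

    program HeapSizes;
    { print the heap sizes smoothsort's first pass would build }
    var
      remaining, size: integer;
    begin
      remaining := 28;
      while remaining > 0 do
      begin
        size := 1;
        while 2 * size + 1 <= remaining do  { next complete size is 2k+1 }
          size := 2 * size + 1;
        writeln(size);                      { prints 15, 7, 3, 3 for n = 28 }
        remaining := remaining - size
      end
    end.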

An example of the first pass of smoothsort on list A consisting of 11 elements follows. The roots of the current heaps being considered and the heaps constructed are boldfaced.

7 45 2 54 5 32 45 5 23 53 4

The size of the first heap is 7, with the root of the heap at position 7; the sons of the root are located at positions 3 and 6. First, two heaps are constructed in A[1..3] and A[4..6].

7 2 45 32 5 54 45 5 23 53 4

A heap is constructed over A[1..7] by combining the two heaps in A[1..3] and A[4..6] with the new root A[7].

7 2 45 32 5 45 54 5 23 53 4

The root of the next heap is at position 10; the sons of this root are at positions 8 and 9. A heap is constructed in A[8..10].

7 2 45 32 5 45 54 5 23 53 4

The new root, 53, is swapped with roots to its left so that the roots form a non decreasing sequence. The new root is then sifted, if necessary, into its correct place within its heap.

7 2 45 32 5 45 53 5 23 54 4

The root of the last heap is at position 11 and has no sons. The root is swapped left until the roots form a non decreasing sequence.

7 2 45 32 5 45 4 5 23 53 54

The element 4 is sifted to its correct place within its heap and the first pass is complete.

4 2 7 32 5 45 45 5 23 53 54

The second pass of smoothsort moves from right to left in the list and consists of removing the root of the last heap being considered (since this root is the largest current element) and dividing the remainder of the heap into two heaps of equal size. If the heap being considered occupies A[i..j], the sons of the root will be at positions i + (j - i) div 2 - 1 and j-1, and will be roots of the heaps over A[i .. i + (j - i) div 2 - 1] and A[i + (j - i) div 2 .. j-1]. The structure is then rebuilt so that the roots of all the heaps form a non decreasing sequence. Each step of the second pass removes one element from consideration, as it is in its correct place, and after all n elements have been removed the list is sorted. For a nearly sorted list each step of the second pass removes one element from consideration, but the new roots need to be swapped leftwards only infrequently, as they are usually already in their correct positions within the forest of heaps; thus there are O(n) steps, giving the complete algorithm a complexity of O(n) on sorted or nearly sorted lists. If the list is unsorted then the second pass may need to swap each new root past O(log n) roots leftwards and sift it to its maximum depth of O(log n) in its new heap. Since O(n) new roots may be created in the second pass, the second pass has a complexity of O(n (2 log n)) = O(n log n). Thus the complete algorithm has a complexity of O(n log n) for lists that are unsorted.

An example of the second pass of smoothsort follows. The forest of heaps is the one constructed in the example for the first pass; the roots of the heaps are at positions 7, 10 and 11.

4 2 7 32 5 45 45 5 23 53 54

The last element is removed from consideration, since it is in its correct position; because it is a heap of only one element it cannot be divided further.

4 2 7 32 5 45 45 5 23 53 54

The last element, 53, is removed and its heap divided into two smaller heaps of equal size.

4 2 7 32 5 45 45 5 23 53 54

The structure is then rebuilt by swapping the roots of the two new heaps leftwards so that the roots form a non decreasing sequence, and then sifting the elements into their correct positions within their new heaps.

4 2 7 5 5 23 32 45 45 53 54

The last two elements are removed from consideration as they are both heaps of size one.

4 2 7 5 5 23 32 45 45 53 54

The root of the last heap is removed, dividing the remaining elements into two heaps. The new roots already form a non decreasing sequence, as required.

4 2 7 5 5 23 32 45 45 53 54

The last root, 23, is removed and its sons divided into two heaps of size 1. The roots are then sorted so that they form a non decreasing sequence.

4 2 5 5 7 23 32 45 45 53 54

The last two elements are removed since they are heaps of size one. The root of the last remaining heap is removed and its two sons form the roots of two more heaps, which are then sorted.

2 4 5 5 7 23 32 45 45 53 54

The last two elements are removed as they are heaps of size one.

2 4 5 5 7 23 32 45 45 53 54

The list is sorted.

3.6 Results

The main measure of the algorithms' efficiency has been taken to be the number of comparisons required to sort the list. Also measured were the number of array accesses and the CPU time required to sort each list; the CPU time does not include the time taken to collect statistics on comparisons or array accesses. The number of comparisons has been plotted against the sortedness ratio for lists of length 64, 256, 1024 and 4096, with the sortedness ratio varied between 0% and 25% both in an ascending manner and on reverse ordered lists. Results were gathered on reverse ordered lists because the decision that a list is sorted when its elements form a non decreasing sequence is arbitrary. The number of comparisons was also plotted for lists in a completely random order, which highlighted algorithms that perform well on nearly sorted lists but badly on random lists. For the lists of length 4096 the CPU time was also plotted against the sortedness ratio. The CPU time was not considered the main criterion for evaluating an algorithm's efficiency, as the time may depend on machine features such as system overhead, disc accesses and time slicing, but the CPU time should reflect an algorithm's complexity, and plots of CPU time and of the number of comparisons against the sortedness ratio should give similar curves. Each point on the graphs represents an average of 10 trials. Lines in red indicate lists with reverse sortedness and have been plotted only where the forward and reverse behaviour differs significantly, as is the case with cksort and smoothsort. Appendix B contains a complete list of the results for lists of size 64 and 4096.

Linear insertion sort performed well on lists that were very close to being sorted, and for a list of length 64 it was one of the better sorts compared. As the length of the list increased its efficiency deteriorated relative to the other sorting algorithms tested, mainly because the average distance of elements from their final positions increases as the length of the list increases. Unfortunately, insertion sort's behaviour on a reverse sorted list, and on lists with a reverse sortedness ratio, represents its worst possible case, and the algorithm deteriorates to its known complexity of O(n^2); for this reason it was impractical to collect statistics when the list was sorted or nearly sorted in reverse. Ysort and natural mergesort were not the best sorting algorithms analysed, but they have several features that make them worth considering. Both methods are symmetrical, in that they handle lists nearly sorted in reverse as well as lists nearly sorted in an ascending manner, making them more robust than some of the other methods analysed. This is an important feature, since in practice a list may be nearly sorted in reverse as frequently as it is nearly sorted in an ascending manner, and it makes sense to exploit this in a sorting algorithm. However, both ysort and mergesort lose their O(n) behaviour quickly as the sortedness ratio grows, and the longer the list the faster they approach their worst case complexity of O(n log n); for a list of size 4096 these algorithms were close to their worst case complexities at a sortedness ratio of about 6%. Smoothsort requires a greater number of initial comparisons than all the other sorting methods compared, but it soon becomes more efficient than methods such as mergesort, heapsort and ysort. However, the performance of the algorithm does not support Dijkstra's claim of a smooth transition from O(n) to O(n log n) complexity as the list becomes unsorted. The worst case for smoothsort occurs on a reverse sorted list, which it is not capable of handling in O(n) time, as the heaps are built with their leaves at the left of the list and their roots, the greatest elements of each heap, at the right. Cksort is the best sorting technique analysed. It had the smoothest transition from an O(n) complexity on a sorted list to an O(n log n) complexity on an unsorted list, and for a list with a varying k/n ratio in an ascending order cksort was unsurpassed. Descending sequences, however, proved a problem for cksort, with both ysort and mergesort being more efficient while the list was close to reverse sorted (k/n in the range 0% to 5%). The bad performance on reverse lists is due to the design decision to extract elements from the initial list if they form descending pairs: for a reverse list most of the elements are descending pairs and will be extracted and then sorted by quicksort, resulting in an O(n log n) complexity. The graph of CPU time against the sortedness ratio, figure 9, confirms the complexity of the algorithms, as the curves for most algorithms reflect those obtained for the number of comparisons. Mergesort appears more efficient when its CPU time is plotted against the other algorithms, but this is due to the recursive nature of most of the algorithms and the iterative nature of the mergesort implementation. Quicksort is the most efficient, in terms of CPU time, when the list is totally unsorted.

[Figure 5. Sortedness Ratio and the Number of Comparisons (n = 64).]

[Figure 6. Sortedness Ratio and the Number of Comparisons (n = 256).]

[Figure 7. Sortedness Ratio and the Number of Comparisons (n = 1024).]

[Figure 8. Sortedness Ratio and the Number of Comparisons (n = 4096).]

[Figure 9. Sortedness Ratio and the CPU time (n = 4096).]

3.7 Conclusions

From the graphs of the results it can be seen that if the structure of the list is unknown quicksort remains the best internal sorting algorithm, since it outperforms all the other algorithms when the list is unsorted. When the sortedness ratio of the list is greater than 5% quicksort is outperformed only by cksort, except on very short lists, where linear insertion sort is also more efficient. For very short lists linear insertion sort is the best algorithm to implement, owing to its short, simple code, even though it may not be the most efficient in terms of the number of comparisons. For lists that are known to be nearly sorted in an ascending manner cksort is the best algorithm to implement, but its bad performance on lists that are nearly sorted in reverse and on random lists makes quicksort preferable under those conditions. Of the other algorithms compared, ysort and mergesort handled lists that were nearly sorted in reverse as efficiently as lists that were nearly sorted in an ascending manner, but their transition from O(n) to O(n log n) complexity was rapid, making quicksort preferable for all but very nearly sorted lists. Smoothsort was easily outperformed by cksort and in most cases by quicksort as well; it could not efficiently sort lists with reverse sortedness and it is also the most complicated of the algorithms. For these reasons it is not a suitable algorithm to implement in any situation. The ideal algorithm would provide the best performance under all conditions: lists with ascending and descending sortedness would be handled efficiently and there would be a smooth transition from O(n) to O(n log n) complexity as the list became more unsorted. The graphs highlighted several deficiencies in each of the algorithms analysed, and none of them gave ideal performance under all conditions.

4.0 IMPROVEMENTS, MODIFICATIONS AND OTHER IDEAS

The results in the previous sections showed many deficiencies in the algorithms implemented, and this motivated changes to the algorithms in the hope of decreasing the number of comparisons required to sort a list. Attempts were made to modify the algorithms to cater for lists with reverse sortedness and random order. In some cases simple changes produced dramatic improvements in performance, while some more elaborate changes degraded performance. Particular attention was paid to cksort, as it had the best performance for lists with an ascending sortedness ratio and had the potential to be the best all round algorithm analysed. Where improvements in an algorithm's performance occurred, results have been graphed against the unoptimized version on lists of size 4096. Appendix B contains a complete list of the results for the optimized versions for lists of length 64 and 4096.

4.1 Improvements to Ysort

Although ysort has good performance on sorted and nearly sorted lists (k/n in the range 0% to 5%), its performance deteriorates rapidly to its O(n log n) behaviour. The major reason for this deterioration is the expense of finding the maximum and minimum values of the left and right subfiles: if the subfiles are unsorted, most elements require two comparisons to determine whether they are the new maximum or new minimum element. Ysort's biggest improvement over quicksort comes from determining whether or not the subfiles are sorted, not from calculating their maximum and minimum values. A more efficient algorithm can determine whether a subfile is sorted using only one comparison per element and, without any additional comparisons, still determine the maximum or minimum element of each subfile. The ysort algorithm was modified with the following code placed after a subfile had been partitioned:

{ for the left subfile }
max := l;
for k := l + 1 to j do
  if A[max] <= A[k] then
    max := k
  else
    lsorted := false;
swap(A[max], A[j])

and a similar piece of code to determine the minimum element of the right subfile and whether the right subfile is sorted.

[Figure 10. Sortedness Ratio and the Number of Comparisons.]

As can be seen, each element in each subfile now requires only one comparison to determine whether the subfile is sorted; in addition the maximum value of the left subfile and the minimum value of the right subfile are swapped into their correct places. To sort the subfiles, the last element of the left subfile and the first element of the right subfile need not be considered, as they are already in their correct places. In the unoptimized algorithm both the minimum and maximum element of each subfile are in their correct place after a file has been partitioned. The code to control the sorting of the subfiles is then as follows, where lsorted and rsorted indicate whether the left and right subfiles are sorted, and l and r are the leftmost and rightmost positions of the unsorted file:

if not lsorted then sort the subfile A[l .. j-1]

if not rsorted then sort the subfile A[i+1 .. r]

The above changes resulted in significant improvements for lists of length 64, 256, 1024 and 4096. Figure 10 shows the improvement for lists of length 4096, and Appendix B contains a complete listing of all the results for lists of length 64 and 4096. In all cases the biggest improvement occurred when the list was totally unsorted, but improvements occurred for all degrees of sortedness.

4.2 Improvements to Cksort

The algorithm for cksort is very effective and is the most efficient sorting algorithm analysed. Its major deficiency is that it cannot process lists with reverse sortedness as efficiently as those with forward sortedness; as can be seen from the graphs in the previous chapter, cksort is close to its O(n log n) behaviour on lists with reverse sortedness. An extension to the algorithm uses a third list, C. Decreasing pairs are extracted from list A and placed in list B as before, and increasing pairs are then extracted from list B and placed in list C. This results in two sorted lists: A, which contains non decreasing elements, and B, which contains non ascending elements. List C contains all the unordered pairs that were extracted from list B and is sorted by quicksort if there are more than 30 elements, and by linear insertion sort otherwise. Finally all three lists are merged into the original list A. Lists that are sorted or nearly sorted in reverse will contain mostly decreasing pairs, which remain in list B, resulting in a small number of unordered pairs in list C.

[Figure 11. Sortedness Ratio and the Number of Comparisons.]

A further improvement results from partitioning only until the subfiles are smaller than some critical value and using linear insertion sort to complete the sorting of the partitioned subfiles. The threshold value at which to stop partitioning was selected as 10 [3]. This is a reasonably simple improvement to make, since the algorithm for linear insertion sort is already contained within the algorithm for cksort. As can be seen from figures 5 to 9, cksort performed badly when the list was unsorted, and quicksort easily outperforms it in these situations. This is a result of a bad choice of partition element: the ideal partition element is close to the median of the list being sorted, and choosing the middle element can result, and frequently does, in an element that is small or large compared with the other elements of the list being sorted. This is because the unsorted list, list C in the improved algorithm, consists of pairs of increasing elements. By choosing the mean of the two middle elements, a value that better approximates the median is obtained, resulting in fewer partitioning steps and therefore fewer comparisons. For a random list of size 4096 this one simple modification resulted in almost a 10% reduction in the number of comparisons. The optimized algorithm is as follows.

list A - the unsorted array
list B - contains the decreasing pairs (a non ascending list)
list C - contains the unordered pairs

Extract all decreasing pairs from list A (as before) and place them in list B. Array A now contains a sorted non decreasing list.

Extract all increasing pairs from list B and place them in list C. Array B now contains a sorted non ascending list.

if the number of elements in list C is less than 30 then
    sort list C by insertion sort
else
    sort list C by quicksort, using a cutoff with insertion sort and
    taking the mean of the two middle elements as the partition element

Merge lists A, B and C into A.

The major modification to cater for reverse lists resulted in an O(n) complexity when the list was sorted in reverse and a slight degradation in efficiency for lists with ascending sortedness. Both of the other two improvements resulted in an increase in efficiency for lists with both forward and reverse sortedness, see figure 11. The biggest increase in efficiency as a result of using a cutoff occurs when the list is unsorted as then a large portion of the original list will be present in list C and require sorting by quicksort.
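A sketch of the modified pivot choice follows (illustrative code, using the TList type assumed earlier, not the authors' implementation): because list C consists of increasing pairs, the mean of the two middle elements approximates the median better than the middle element alone.

    function PartitionElement(var C: TList; lo, hi: integer): integer;
    var
      mid: integer;
    begin
      mid := (lo + hi) div 2;
      { C[lo..hi] consists of increasing pairs, so average the
        two middle elements rather than taking either alone }
      PartitionElement := (C[mid] + C[mid + 1]) div 2
    end;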

4.3 Alterations to Smoothsort

The sophisticated data structure of smoothsort makes it difficult to change the structure of the program and almost impossible to improve the algorithm to cater for lists that are nearly sorted in reverse. One of the simplest changes to the algorithm is to alter the branching factor of the heaps. The previous results were obtained using binary heaps; by using ternary heaps it was hoped that the number of comparisons needed to sort a list would be reduced. For a list of size n the upper bound for the number of binary heaps in a forest of complete heaps is

N2 = ceil(log2 n),

and the upper bound for the number of ternary heaps is

N3 = ceil(1.2618 log2 n + 0.2618)

(see Appendix C). Experiments performed on lists of size 1 to 1000 showed that there was typically a greater number of ternary heaps than binary heaps. Sorting using a single ternary heap has been shown to have better performance than a single binary heap [7], due to fewer comparisons being required to sift an element to its correct place within the heap. Results on lists of size 64, 256, 1024 and 4096, however, showed that the ternary version of smoothsort was inferior in just about all cases, and this can be attributed to the greater average number of heaps required: for an unsorted list, in the second pass of smoothsort a greater number of heaps means a greater number of swaps, and therefore comparisons, to move an element to its correct heap, and this outweighs the advantage of fewer comparisons when sifting an element into place within a ternary heap. For lists of length 4096, with the sortedness ratio varied in an ascending manner, the numbers of comparisons required to sort a list are as follows:

k/n %    binary heaps    ternary heaps
 0.00        16344.0          16334.0
 6.25        60098.5          75755.4
12.50        69598.1          89833.7
18.75        80558.0         104173.2
25.00        83115.0         105701.5

The number of comparisons for lists of length 64, 256 and 1024 gave similar results, with a forest of binary heaps having better performance in almost all cases. A forest of ternary heaps gave almost the same number of comparisons when the list was completely sorted; however, as the list became unsorted a forest of binary heaps always had better performance. For a sortedness ratio of 25% and a list of length 64 a forest of binary heaps had 12% fewer comparisons, and for lists of length 256, 1024 and 4096 a forest of binary heaps had at least 20% fewer comparisons.

4.4 Alterations to Linear Insertion Sort

Although linear insertion sort has a very bad worst case performance, it gives the best results, along with cksort, when the list is sorted or nearly sorted in an ascending manner. Modifications to improve its performance when the list is not sorted are worthwhile, since the code is short and simple and the algorithm is frequently implemented, both by itself and within hybrid algorithms. The biggest disadvantage of linear insertion sort is the O(n^2) comparisons it performs when the list is unsorted. To overcome this problem while preserving the algorithm's best case performance the algorithm was modified. During the sorting of a list of n elements the following situation exists:

[Diagram: positions 1 to k-1 hold the sorted part of the list, position k holds the element to sort, and positions k+1 to n are still unsorted.]

The implemented algorithm performs a linear search from right to left to locate the correct position for the unsorted element. However, the list it searches is in fact sorted, and an improvement would be to use a binary search to locate the position for the unsorted element. This alone would forfeit insertion sort's best case performance, since it would not necessarily compare the unsorted element with the element immediately to its left. To overcome this, the sorted list is first searched using a divergent binary search to isolate the region for the new element, and then a conventional binary search isolates the correct position within that region. For example, consider the list above being sorted by this algorithm when the next element to be sorted is at position k. The element is compared with the elements at positions k-1, k-3, k-7, k-15, ... In this way the correct region for the element can be located in O(log2 k) comparisons. The actual position for the element is then isolated using a conventional binary search on the region determined, taking at most O(log2 (k/2)) comparisons. In this way the correct position for an element can be located with

O(log2 k) + O(log2 (k/2)) = O(log2 k) <= O(log2 n)

comparisons, and yet only one comparison is required if the list is sorted. This may seem an ideal solution to the problem, but once a position is located there may be O(n) swaps needed to move the unsorted element into position, giving the algorithm a worst case of O(n^2) as before. In an attempt to overcome the possible O(n^2) complexity, the following data structure, incorporating pointers, was used:

[Diagram: an array of n header cells kept in sorted order; each cell is the head of a linked list whose elements are also kept in sorted order.]

To locate the correct position for an element within the sorted structure, the array of list heads is searched using first a divergent binary search and then a convergent binary search, as before; for each comparison the element at the head of a list is compared with the unsorted element. Over the whole sort, O(n log n) comparisons are required to locate the correct linked lists to search. The correct position within a list is then located by a linear search, and the unsorted element is inserted into the list without causing any swaps. Unfortunately this algorithm performed worse than straight linear insertion sort. The best case performance was preserved for completely sorted lists, but as the sortedness ratio increased straight linear insertion sort easily outperformed the new algorithm. With a sortedness ratio of 6.25% and a list of length 64 the new algorithm made more than 3 times as many comparisons, and for higher sortedness ratios or longer lists even more comparisons were required. The problem with this algorithm can be seen when sorting the list

10 1 2 3 4 5 6 7 8 9,

which has a sortedness ratio of 0.1 and is therefore nearly sorted. The algorithm inserts each element at the head of the first linked list, and the advantage of using binary search to locate the correct position is lost, since the current sorted structure consists of only one linked list. Suppose the first four elements have been sorted; then the situation is as follows.

10 -> 1 -> 2 -> 3

To insert the next element, 4, the complete chain has to be traversed to locate the correct position, and the algorithm is equivalent to straight linear insertion sort with respect to the number of comparisons. The new algorithm also had considerably slower execution times, due to the pointer manipulation.
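For reference, the divergent-then-convergent search itself can be sketched as follows (illustrative code, assuming the TList type used earlier, with A[1..k-1] sorted and x = A[k] the element to place; the shifts needed to insert the element are unchanged, so the O(n^2) worst case remains):

    function FindPosition(var A: TList; k: integer): integer;
    var
      x, d, lo, hi, mid: integer;
      done: boolean;
    begin
      x := A[k];
      { divergent phase: probe positions k-1, k-3, k-7, k-15, ... }
      d := 1;
      done := false;
      while not done do
        if k - d < 1 then done := true
        else if A[k - d] <= x then done := true
        else d := 2 * d + 1;
      { now A[k-d] <= x (or k-d < 1), and A[k - d div 2] > x (or d = 1) }
      lo := k - d;
      if lo < 1 then lo := 0;        { 0 acts as a virtual minus infinity }
      hi := k - (d div 2);           { hi = k means "leave x where it is" }
      { convergent phase: binary search for the boundary }
      while hi - lo > 1 do
      begin
        mid := (lo + hi) div 2;
        if A[mid] <= x then lo := mid else hi := mid
      end;
      FindPosition := hi
    end;

If the list is already sorted, the first probe at k-1 succeeds and the function returns k after a single comparison, preserving the best case behaviour described above.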

5.0 CONCLUSIONS

Of all the algorithms compared, the optimized cksort had the best performance in all cases except when the list was very unsorted, and then quicksort outperformed cksort only by a constant factor. However, quicksort is a considerably easier algorithm to implement, and if the structure of the list is unknown, or if the list is most likely to be totally unsorted, then quicksort is probably the preferable algorithm. As mentioned previously, cksort and the other algorithms have been analysed using a measure of sortedness that cksort exploits efficiently, whereas in practice the sortedness ratio may not best capture the sortedness properties of a nearly sorted list. For example, consider the list

n/2 + 1, n/2 + 2, ..., n, n/2, n/2 - 1, ..., 1,

which consists of an ascending and a descending sequence and has a sortedness ratio of 1/2. Cksort would require all the elements to be sorted by quicksort, and the optimized algorithm would require n-2 of the elements to be sorted by quicksort; in both cases the algorithm would have a complexity of O(n log n) on this list. On the other hand, sorting the list by natural mergesort would require only two passes and therefore have a complexity of O(n) on a list with this structure. The development of faster algorithms for nearly sorted lists requires measures of sortedness that better capture the properties of a list that allow O(n) time sorting. These measures should also return appropriate values for the nearly sorted lists that occur in practice and not be confined to a theoretical definition of a nearly sorted list.

APPENDIX A

A summary of the results obtained for the preliminary tests. Each average has been taken over 10 trials.

Table 1

Number of Ascending Sequences

  k      n    average  minimum  maximum
  0     10        1.0        1        1
  1     10        2.0        2        2
  2     10        3.0        3        3
  3     10        4.0        4        4

  0    100        1.0        1        1
  5    100        5.9        5        6
 10    100       10.1       10       11
 15    100       14.9       13       16
 20    100       19.1       18       21
 25    100       23.5       22       25
 30    100       27.1       25       30

  0   1000        1.0        1        1
 50   1000       49.8       49       51
100   1000       96.8       93       99
150   1000      139.3      135      143
200   1000      180.9      172      183
250   1000      219.4      209      227
300   1000      256.1      248      267

Table 2

Largest Ascending Sequence

  k      n    average  minimum  maximum
  0     10      0.1      0.1      0.1
  1     10      0.14     0.11     0.16
  2     10      0.19     0.14     0.25
  3     10      0.25     0.17     0.33

  0    100      0.01     0.01     0.01
  5    100      0.026    0.016    0.040
 10    100      0.036    0.025    0.046
 15    100      0.052    0.042    0.063
 20    100      0.063    0.044    0.100

  0   1000      0.001    0.001    0.001
 50   1000      0.012    0.007    0.018
100   1000      0.020    0.014    0.029
150   1000      0.029    0.021    0.039
200   1000      0.040    0.029    0.050
250   1000      0.047    0.036    0.059
300   1000      0.056    0.040    0.067

Table 3

Number of Inversions

  k      n    average  minimum  maximum
  0     10        0.0        0        0
  1     10        3.6        1        7
  2     10        6.7        2       12
  3     10        9.4        3       17

  0    100        0.0        0        0
  5    100      186.4       89      239
 10    100      302.1      218      382
 15    100      528.9      429      658
 20    100      666.8      471      795
 25    100      821.9      759      919
 30    100      905.6      793      997

  0   1000        0.0        0        0
 50   1000    16214.5    14865    17833
100   1000    32619.4    28923    35162
150   1000    48892.9    43281    52770
200   1000    63572.4    55146    72136
250   1000    78816.0    75795    82604
300   1000    93154.2    88144    99383

Table 4

Number of Exchanges

  k      n    average  minimum  maximum
  0     10        0.0        0        0
  1     10        3.6        1        7
  2     10        4.9        2        8
  3     10        4.6        2        7

  0    100        0.0        0        0
  5    100       62.6       38       83
 10    100       71.5       60       83
 15    100       85.5       70       94
 20    100       85.8       75       95

  0   1000        0.0        0        0
 50   1000      903.7      856      939
100   1000      924.4      841      969
150   1000      933.5      871      970
200   1000      957.4      908      991
250   1000      965.6      942      982
300   1000      959.4      931      984

APPENDIX B

A listing of the results obtained for the sorting algorithms. Natural mergesort is referred to here as mergesort and linear insertion sort is referred to simply as insertion sort. Entries marked '-' indicate situations in which the time required to compute the statistics was prohibitively large. Results have been included for lists of length 64 and 4096 only, with the sortedness ratio varied in both an ascending and a descending manner; the direction of sortedness for each list is indicated. Each value represents an average over 10 trials.

Comparisons

Table 1   n = 64, Forward sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort   Ysort   Cksort  Mergesort
0.00         258.0     593.0            63.0       234.0   128.0     63.0      126.0
6.25         307.8     593.4           135.6       345.1   487.7    137.7      471.6
12.50        356.0     581.8           242.7       435.4   574.4    176.3      527.1
18.75        338.8     588.2           287.8       453.4   563.6    219.4      582.9
25.00        387.1     582.1           384.3       518.6   614.0    299.5      604.0
random       455.6     572.3          1014.7       713.6   728.0    566.9      651.6

Table 2   n = 4096, Forward sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort     Ysort   Cksort  Mergesort
0.00       40968.0   88835.0          4097.0     16352.0    8196.0   4097.0     8194.0
6.25       54263.0   87563.9        337013.7     63884.7  106560.5  13794.0    72889.4
12.50      54850.9   86532.7        687817.3     69598.1  112811.5  20198.3    80473.6
18.75      55981.4   86382.5       1005505.1     80558.0  117022.2  26942.4    88205.9
25.00      55642.1   86382.5       1322238.3     83115.0  119404.0  33637.6    87958.4
random     63271.1   85833.7               -    126138.1  149505.5  75164.7    89458.6

Table 3   n = 64, Reverse sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort   Ysort   Cksort  Mergesort
0.00         259.0     525.0               -       758.0   128.0    367.0      126.0
6.25         293.8     532.6               -       760.9   455.6    472.0      471.6
12.50        317.2     533.7               -       752.7   531.2    495.5      527.1
18.75        344.1     541.6               -       753.9   555.7    528.9      582.9
25.00        368.2     544.7               -       748.5   613.4    539.6      604.0

Table 4   n = 4096, Reverse sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort     Ysort   Cksort  Mergesort
0.00       40975.0   82304.0               -    109203.0    8192.0  47131.0     8190.0
6.25       52633.9   82390.7               -    112681.4  105879.0  64036.3    72860.5
12.50      53751.3   82651.2               -    115656.7  111699.8  64877.5    80473.6
18.75      54670.8   83066.9               -    117525.9  115309.8  66717.0    88205.9
25.00      55863.6   83355.7               -    120199.9  118868.9  67016.1    87958.4

Accesses

Table 5   n = 64, Forward sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort   Ysort   Cksort  Mergesort
0.00          28.0    1448.0             0.0         0.0     0.0      0.0      320.0
6.25         157.8    1432.8           193.0       217.6   169.6    248.0      678.4
12.50        243.8    1407.6           460.4       402.8   206.8    359.4      742.4
18.75        249.0    1416.4           553.6       446.0   229.2    451.2      806.4
25.00        335.4    1392.8           751.4       592.8   252.0    541.4      832.0
random       474.8    1322.4          2028.0      1123.2   342.4    705.7      896.0

Table 6   n = 4096, Forward sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort    Ysort   Cksort  Mergesort
0.00        2044.0  193440.0             0.0         0.0      0.0      0.0    20480.0
6.25       31169.8  189874.0        684016.2     76088.4  22188.8  20639.1    86016.0
12.50      30202.2  187368.8       1375597.6     91896.0  23484.0  25312.3    94208.0
18.75      34807.0  185750.8       2010978.2    113356.4  26668.8  30574.1   102400.0
25.00      36199.4  186473.2       2644445.0    119770.0  28080.8  35215.9   102400.0
random     52810.0  182683.2               -    207862.4  43194.4  69274.8   104857.6

Table 7   n = 64, Reverse sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort   Ysort   Cksort  Mergesort
0.00         152.0    1152.0               -      1312.0   128.0    419.0      320.0
6.25         223.2    1169.6               -      1301.6   246.4    545.2      678.4
12.50        292.4    1186.0               -      1273.2   295.2    574.7      742.4
18.75        348.4    1195.6               -      1281.2   308.8    646.6      806.4
25.00        359.2    1209.6               -      1257.6   317.2    642.4      832.0

Table 8   n = 4096, Reverse sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort    Ysort   Cksort  Mergesort
0.00       10232.0  170120.0               -    178728.0   8192.0  26627.0    20480.0
6.25       36123.2  170740.4               -    185080.4  28266.8  52099.4    86016.0
12.50      37992.2  171579.2               -    190944.0  29838.0  54345.4    94208.0
18.75      40624.6  172835.6               -    195164.4  32512.8  57426.6   102400.0
25.00      42002.0  173777.2               -    199878.0  33726.8  59320.4   102400.0

CPU Time (seconds)

Table 9   n = 64, Forward sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort  Ysort  Cksort  Mergesort
0.00          0.00      0.02            0.00        0.02   0.00    0.00       0.00
6.25          0.01      0.02            0.00        0.03   0.01    0.00       0.01
12.50         0.01      0.02            0.01        0.03   0.02    0.01       0.01
18.75         0.01      0.02            0.01        0.03   0.02    0.01       0.01
25.00         0.01      0.02            0.01        0.04   0.02    0.01       0.01
random        0.01      0.02            0.02        0.03   0.02    0.02       0.01

Table 10   n = 4096, Forward sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort  Ysort  Cksort  Mergesort
0.00          0.49      3.00            0.10        1.04   0.22    0.12       0.15
6.25          0.74      2.87            6.84        2.59   2.67    0.32       0.80
12.50         0.74      2.82           13.76        2.95   2.76    0.45       0.87
18.75         0.76      2.80           20.10        3.33   2.87    0.59       0.95
25.00         0.77      2.82           26.44        3.44   2.93    0.71       0.95
random        0.89      2.90               -        4.35   3.84    1.50       0.95

Table 11   n = 64, Reverse sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort  Ysort  Cksort  Mergesort
0.00          0.01      0.02               -        0.05   0.01    0.01       0.00
6.25          0.01      0.02               -        0.06   0.02    0.01       0.01
12.50         0.01      0.02               -        0.05   0.02    0.01       0.01
18.75         0.01      0.02               -        0.05   0.02    0.02       0.01
25.00         0.01      0.02               -        0.05   0.02    0.02       0.01

Table 12   n = 4096, Reverse sortedness

k/n %    Quicksort  Heapsort  Insertion sort  Smoothsort  Ysort  Cksort  Mergesort
0.00          0.52      2.66               -        4.35   0.30    0.78       0.15
6.25          0.74      2.66               -        4.49   2.69    1.26       0.78
12.50         0.76      2.68               -        4.59   2.80    1.28       0.85
18.75         0.79      2.69               -        4.66   2.90    1.32       0.93
25.00         0.80      2.69               -        4.77   2.98    1.34       0.93

OPTIMIZED ALGORITHMS

Comparisons

Table 13   n = 64, Forward sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00               126.0              63.0
6.25               451.9             143.0
12.50              511.4             184.6
18.75              504.9             230.6
25.00              538.9             304.1
random             624.1             564.4

Table 14   n = 4096, Forward sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00              8190.0            4095.0
6.25             84557.0           13701.1
12.50            86830.9           20084.9
18.75            90123.1           26452.3
25.00            89951.6           32571.8
random          105606.9           70675.1

Table 15   n = 64, Reverse sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00               126.0              96.0
6.25               423.6             232.7
12.50              481.6             317.0
18.75              497.1             432.0
25.00              518.4             434.6

Table 16   n = 4096, Reverse sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00              8190.0            6144.0
6.25             83384.5           22002.7
12.50            85932.7           29020.4
18.75            88080.7           35578.3
25.00            90498.5           41278.0

Accesses

Table 17   n = 64, Forward sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00                 5.0               2.0
6.25               210.4             269.4
12.50              250.4             400.4
18.75              277.1             510.0
25.00              304.7             683.6
random             393.9            1092.5

Table 18   n = 4096, Forward sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00                 5.0               2.0
6.25             25040.3           23755.9
12.50            26756.1           31746.0
18.75            30301.9           39622.7
25.00            31886.0           46800.6
random           47780.6           98330.3

Table 19   n = 64, Reverse sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00               129.0             636.0
6.25               284.5             713.8
12.50              339.7             856.4
18.75              357.5            1096.4
25.00              369.5            1022.4

Table 20   n = 4096, Reverse sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00              8193.0           40956.0
6.25             30969.7           50499.2
12.50            33020.5           57473.3
18.75            36019.8           64331.1
25.00            37698.2           70065.2

CPU Time (seconds)

Table 21   n = 64, Forward sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00                0.00              0.00
6.25                0.01              0.00
12.50               0.01              0.00
18.75               0.01              0.01
25.00               0.01              0.01
random              0.01              0.01

Table 22   n = 4096, Forward sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00                0.08              0.10
6.25                1.04              0.31
12.50               1.07              0.42
18.75               1.12              0.54
25.00               1.13              0.64
random              1.37              1.35

Table 23   n = 64, Reverse sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00                0.00              0.00
6.25                0.01              0.01
12.50               0.01              0.01
18.75               0.01              0.01
25.00               0.01              0.01

Table 24   n = 4096, Reverse sortedness

k/n %    Optimized Ysort  Optimized Cksort
0.00                0.12              0.26
6.25                1.07              0.52
12.50               1.10              0.65
18.75               1.14              0.76
25.00               1.17              0.87

APPENDIX C

This appendix gives a proof of the upper bounds on the number of complete ternary and binary heaps in a list of size n.

For a forest of complete binary heaps the relationship between the length of the list, n, and the upper bound on the number of heaps, k, is

    n                               k
    1  = 1                          1
    2  = 1 + 1                      2
    5  = 3 + 1 + 1                  3
    12 = 7 + 3 + 1 + 1              4
    27 = 15 + 7 + 3 + 1 + 1         5
    58 = 31 + 15 + 7 + 3 + 1 + 1    6

etc., and a precise mathematical description of this relation is

    n = 1 + Σ_{i=1..k-1} (2^i - 1)
      = 1 - (k - 1) + Σ_{i=1..k-1} 2^i
      = (2 + 4 + 8 + ... + 2^(k-1)) - k + 2
      = 2(1 - 2^(k-1)) / (1 - 2) - k + 2
      = 2 - (2 - 2^k) - k
      = 2^k - k.

Thus the number of complete binary heaps is given by

    k = log2(n + k),

and since log2(n + k) → log2(n) as n → ∞, k ≈ log2(n). Hence an upper bound for the number of complete binary heaps is

    ⌈log2(n)⌉.
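As a quick sanity check on this bound, the following small C program (illustrative only, not part of the original experiments) evaluates the worst-case lengths n = 2^k - k and confirms that k never exceeds ⌈log2(n)⌉ once k ≥ 3; for k = 1 and k = 2 the approximation log2(n + k) ≈ log2(n) has not yet taken hold. Compile with the maths library (-lm).

    #include <math.h>
    #include <stdio.h>

    /* For the worst-case lengths n = 2^k - k the forest needs exactly
     * k complete binary heaps; the derived bound is ceil(log2(n)). */
    int main(void)
    {
        for (int k = 3; k <= 20; k++) {
            long n = (1L << k) - k;
            int bound = (int)ceil(log2((double)n));
            printf("k = %2d  n = %7ld  ceil(log2 n) = %2d  %s\n",
                   k, n, bound, k <= bound ? "ok" : "exceeded");
        }
        return 0;
    }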

For a forest of complete ternary heaps the relationship between the upper bound on the number of heaps and the length of the list is

    n                              k   x
    1   = 1                        1   0
    3   = 2 + 1                    3   1
    11  = 8 + 2 + 1                5   2
    37  = 26 + 8 + 2 + 1           7   3
    117 = 80 + 26 + 8 + 2 + 1      9   4

etc., where k is the number of heaps and x = (k - 1) / 2. A precise mathematical description is as follows:

    n = f(x) = 1 + Σ_{i=1..x} (3^i - 1)
             = 1 - x + Σ_{i=1..x} 3^i
             = 1 - x + (3 + 9 + 27 + ... + 3^x)
             = 1 - x + 3(1 - 3^x) / (1 - 3)

so that

    f(x) = 3^(x+1)/2 - x - 1/2.

Substituting x = (k - 1)/2 gives

    n = f((k - 1)/2) = 3^((k+1)/2)/2 - k/2
    2n + k = 3^((k+1)/2)
    k = 2 log3(2n + k) - 1.

Since log3(2n + k) → log3(2n) as n → ∞,

    k ≈ 2 log3(2n) - 1
      = 1.2618 log2(2n) - 1
      = 1.2618 log2(n) + 0.2618.

Hence an upper bound for the number of complete ternary heaps is

    ⌈1.2618 log2(n) + 0.2618⌉.
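The same sanity check works for the ternary bound; again this is an illustrative program, not part of the original report. The worst-case lengths are n = (3^(x+1) - 2x - 1)/2 with k = 2x + 1 heaps, which are exactly the values 3, 11, 37, 117, ... tabulated above.

    #include <math.h>
    #include <stdio.h>

    /* Worst-case ternary forests: n = (3^(x+1) - 2x - 1)/2 needs
     * k = 2x + 1 heaps; the derived bound is
     * ceil(1.2618 * log2(n) + 0.2618). */
    int main(void)
    {
        long p = 3;                                 /* p = 3^(x+1) */
        for (int x = 1; x <= 8; x++) {
            p *= 3;
            long n = (p - 2 * x - 1) / 2;
            int k = 2 * x + 1;
            int bound = (int)ceil(1.2618 * log2((double)n) + 0.2618);
            printf("x = %d  n = %6ld  k = %2d  bound = %2d  %s\n",
                   x, n, k, bound, k <= bound ? "ok" : "exceeded");
        }
        return 0;
    }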

References

[1] Hoare, C.A.R. Algorithm 63: Partition; Algorithm 64: Quicksort. Commun. ACM 4, 7 (July 1961), 321.

[2] Cook, C.R. and Kim, D.J. Best Sorting Algorithm for Nearly Sorted Lists. Commun. ACM 23, 11 (November 1980), 620-624.

[3] Sedgewick, R. Implementing Quicksort Programs. Commun. ACM 21, 10 (October 1978), 847-857.

[4] Aho, A.V., Hopcroft, J.E. and Ullman, J.D. The Design and Analysis of Computer Algorithms, (Addison-Wesley, Reading, MA, 1975), 90-91.

[5] Wainwright, R.L. A Class of Sorting Algorithms Based on Quicksort. Commun. ACM 28, 4 (April 1985), 396-402.

[6] Knuth, D.E. The Art of Computer Programming, Volume 3, Sorting and Searching, (Addison-Wesley, Reading, MA, 1973), 161-163.

[7] Knuth, D.E. The Art of Computer Programming, Volume 3, Sorting and Searching, (Addison-Wesley, Reading, MA, 1973), 619.

[8] Dijkstra, E.W. An Alternative for Sorting in Situ. Science of Computer Programming 1, 1982, 223-233.

[9] Dijkstra, E.W. Errata. Science of Computer Programming 2, 1982, 85.

[10] Hertel, S. Smoothsort's Behavior on Presorted Sequences. Information Processing Letters 16, 1983, 165-170.
