INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH, VOLUME 9, ISSUE 03, MARCH 2020, ISSN 2277-8616

An Efficient Methodology to Sort Large Volume of Data

S.Bharathiraja, G.Suganya, M.Premalatha, R.Kumar, Sakkaravarthi Ramanathan

Abstract—Sorting is a basic data processing technique used in everyday applications. To cope with technological advancement and the extensive increase in data acquisition and storage, sorting requires improvement to minimize processing time, response time, and the space required for processing. Various sorting techniques have been proposed by researchers, but their applicability to large volumes of data is not assured. The main focus of this work is to propose a new sorting technique, titled Neutral Sort, to reduce the time taken for sorting and decrease the response time for large volumes of data. Neutral Sort is designed as an enhancement to Merge sort. The advantages and disadvantages of existing techniques in terms of performance, efficiency, and throughput are discussed, and the comparative study shows that Neutral Sort drastically reduces the time taken for sorting and hence reduces the response time.

Index Terms—Chunking, Banding, Sorting, Efficiency, Merge sort, Bigdata Sorting, Neutral Sort.

——————————  ——————————

1 INTRODUCTION

In data science, ordering of elements is very useful in various applications and database querying. Different approaches have been proposed by researchers for performing the sorting operation effectively based on the type of application. The approaches are normally validated using space complexity, which refers to the temporary space occupied by the algorithm during sorting, and time complexity, the time taken to sort and return the final sorted data. Algorithms emerge with the intent to minimize both space and time complexities. The literature discussed in the following section asserts that the Quick sort and Merge sort algorithms are efficient and effective for real-time usage. In addition, depending on the location where sorting is executed, the techniques are broadly categorized into internal and external sorting. The existing sorting techniques provide different throughput and efficiency, out of which the Quick, Merge, and Binary sorting techniques have good responses. Chunking and banding are widely used in different applications to create modularity and to ensure specificity. Also, Merge sort, a technique that divides the data into chunks until the size of each chunk becomes one and then combines the chunks by merging in order, has been proven efficient by many researchers. Considering the fact that the divided chunks may already be sorted and hence need no further splitting, we have proposed a modified Merge sort, titled "Neutral Sort", to reduce time complexity and hence the response time.

Section 2 elaborates on the various sorting techniques that exist in practice, with a detailed analysis of their merits and limitations. A detailed discussion of the proposed methodology with an algorithmic explanation is presented in Section 3. The prototype is tested and the results are discussed in Section 4.

————————————————
• S.Bharathiraja is currently working as an Assistant Professor at Vellore Institute of Technology Chennai and is pursuing his Ph.D. PH-04439931120. E-mail: [email protected]
• Dr. G.Suganya is currently working as an Associate Professor at Vellore Institute of Technology Chennai and is specialized in Software Engineering and Machine Learning. PH-04439931399. E-mail: [email protected]
• M.Premalatha is currently working as an Assistant Professor at Vellore Institute of Technology Chennai and is pursuing her Ph.D. PH-04439931071. E-mail: [email protected]
• Dr. R.Kumar is currently working as an Associate Professor at Vellore Institute of Technology Chennai. PH-0443993. E-mail: [email protected]
• Sakkaravarthi Ramanathan is currently working as a Professor in the Department of Computer Science, Cegep Gaspesie, Montreal, Canada. PH-+1(438) 395-9639. E-mail: [email protected]

2 EXISTING ALGORITHMS

2.1 Bubble Sort
Bubble sort is one of the simplest algorithms for sorting data. The algorithm compares each element with its next element and, if they are out of order, swaps them [1]. The comparison and swapping operations are repeated during each pass until all the elements in the list are completely sorted. It is inefficient since, in the worst case, the maximum possible number of swaps is performed. The time taken to sort data with Bubble sort in the worst case is O(n²).

2.1.1 Benefits and Drawbacks
Though Bubble sort is inefficient, its simplicity makes it advantageous for sorting small sets of data. Bubble sort utilizes O(1) auxiliary space. Due to its time constraint, it is inefficient for sorting large sets of data [1].

2.2 Selection Sort
The number of comparisons, and hence of swaps, is slightly reduced in Selection sort when compared to Bubble sort. During every pass, only one swap occurs to place the appropriate element in its respective position; hence, Selection sort is better than Bubble sort in terms of swapping. Even then, when applied to large datasets, the performance of this sort is O(n²) [2] in the worst case.
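For concreteness, the compare-and-swap passes of Bubble sort described above can be sketched as follows (a minimal illustration with our own naming, not the paper's implementation):

```python
def bubble_sort(data):
    """Repeatedly compare adjacent elements and swap them when out of order.

    Each pass bubbles the largest remaining element to the end of the list,
    giving O(n^2) comparisons in the worst case and O(1) auxiliary space.
    """
    items = list(data)          # work on a copy
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:         # early exit: list already sorted
            break
    return items
```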

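The one-swap-per-pass behaviour of Selection sort described above can be sketched as follows (an illustrative sketch with our own naming, not the paper's implementation):

```python
def selection_sort(data):
    """Each pass finds the minimum of the unsorted suffix and performs
    exactly one swap, so data movement is far lower than in bubble sort
    even though comparisons remain O(n^2)."""
    items = list(data)
    n = len(items)
    for i in range(n - 1):
        min_idx = i
        for j in range(i + 1, n):
            if items[j] < items[min_idx]:
                min_idx = j
        if min_idx != i:        # the single swap of this pass
            items[i], items[min_idx] = items[min_idx], items[i]
    return items
```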

2.2.1 Performance Analysis
Selection sort needs O(n²) comparisons and (n−1) swaps to sort a list of n elements [2][3]. Two other quite popular variations of Selection sort are Quadratic sort [5] and the variant described in [2][3].

2.2.2 Benefits and Drawbacks
Though Selection sort is inefficient for large sets of data due to its performance, it is better than Bubble sort because of its lower number of swaps, which greatly reduces data movement. Selection sort utilizes O(1) auxiliary space.

2.3 Insertion Sort
A more efficient algorithm compared with Bubble and Selection sort is Insertion sort. It works much like arranging playing cards. As each new element is picked up, the algorithm looks for its appropriate position and inserts the element there, shifting all the elements after the insertion position one position forward to maintain the sorted order of the list. This insertion process continues for all the elements in the list. Insertion sort is more suitable for almost or partially sorted lists of elements. Its time complexity is proven to be O(n²).

2.3.1 Performance Analysis
Insertion sort's performance degrades to quadratic computational complexity if the order of elements is reversed. The advantage of Insertion sort is its efficient performance when the elements in the list are partially sorted. Insertion sort is inefficient as the size grows, since its average-case running time is also O(n²); its best-case run time is O(n) [13], attained only when the elements are already sorted. A popular variant of this sort is Shell sort [7], an unstable, in-place comparison sorting algorithm with a best-case performance of O(n log n). Insertion sort utilizes O(1) auxiliary space.

2.3.2 Benefits and Drawbacks
Insertion sort is inefficient, like Bubble and Selection sort, for large sets of data. Also, to locate the insertion position it has to scan through a number of the elements to be sorted.

2.4 Quick Sort
The fastest internal sorting technique is Quick sort [8]. Quick sort is widely used on many application data sets, since it sorts in place without requiring an additional array. It exhibits a divide-and-conquer strategy for solving the problem.
Quick sort selects an element called a pivot and divides the array elements into two parts by placing the pivot element in its sorted location through an exchange procedure [8]. Once the pivot element is placed in its sorted location, all the elements to the left of the pivot are smaller than the pivot and, similarly, all the elements to its right are greater. This method of splitting the array in two is called partitioning. The process is repeated recursively for both the left and right partitions until each partition holds a single element. Worst-case behavior is obtained if the pivot is selected as either the leftmost or the rightmost element. This drawback was later eliminated by selecting a suitable method of choosing the pivot. One of the best methods for choosing a pivot that divides the list into two almost equal halves during all recursive partitions is to choose the median of the first, middle, and last elements. Quick sort works efficiently even in a virtual memory environment, and it is an in-place algorithm.

2.4.1 Performance Analysis
The fastest sorting algorithm with an average running time of O(n log n) is Quick sort [8]. Generally, selecting either the first or the last element as pivot produces the worst-case run time of O(n²). However, these worst-case scenarios are not that frequent. One variant of Quick sort is Qsort [10], which is robust and faster than the traditional Quick sort algorithm. Quick sort is well suited for large data sets.

2.4.2 Benefits and Drawbacks
Quick sort is a fast and efficient algorithm for large sets of data [8]. But it is inefficient if the elements in the list are already sorted, which results in the worst-case time complexity of O(n²). Quick sort uses O(log n) [13] secondary space for recursive function calls, so it might be expensive in terms of space occupancy for large sets of data. Moreover, Quick sort carries out a sequential traversal through the elements of the array, which results in good locality of reference and cache behavior for arrays [9].

2.5 Merge Sort
Merge sort [11] also uses the divide-and-conquer approach. The procedure splits the set of elements to be sorted into two equal halves if the number of elements is even; if the number of elements is odd, it splits them into two partitions, one containing one element more than the other. The procedure continues recursively until every partition contains one element.
The merge operation is applied after partitioning the elements. The merging operation merges the corresponding left and right partitions and grows until it forms a single partition that contains all the elements of the original set, now in sorted order. Merge sort is mostly implemented as a recursive procedure, though it can also be done non-recursively. It is a stable sorting technique, since it preserves the relative order of elements with equal keys.

2.5.1 Performance Analysis
Merge sort is efficient when compared with methods like Quick sort [11]. The worst- and average-case time complexity of Merge sort is O(n log n). Merge sort uses a separate array to store the entire sub-array along with the main array. It is an external sorting algorithm that needs O(n) [13] additional memory for n elements. Hence, Merge sort is inefficient for applications that run on machines with limited memory resources. Therefore, Merge sort is recommended for large data sets with external memory.
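For concreteness, the three algorithms of Sections 2.3–2.5 can be sketched as follows. These are minimal illustrations with our own function names; for brevity the Quick sort and Merge sort sketches return new lists rather than sorting in place, and the Quick sort uses the median-of-three pivot selection described above:

```python
def insertion_sort(data):
    """Pick up each element and shift larger elements right to open its slot;
    O(n) on an already-sorted list, O(n^2) on a reversed one."""
    items = list(data)
    for i in range(1, len(items)):
        key = items[i]
        j = i - 1
        while j >= 0 and items[j] > key:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = key
    return items

def quick_sort(items):
    """Partition around a median-of-three pivot, then recurse on both halves;
    average O(n log n), worst case O(n^2)."""
    if len(items) <= 1:
        return list(items)
    first, mid, last = items[0], items[len(items) // 2], items[-1]
    pivot = sorted([first, mid, last])[1]      # median of first, middle, last
    left = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    right = [x for x in items if x > pivot]
    return quick_sort(left) + equal + quick_sort(right)

def merge_sort(items):
    """Split into halves until single elements remain, then merge;
    O(n log n) time with O(n) auxiliary space."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:                # "<=" keeps the sort stable
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```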

3 PROPOSED METHODOLOGY

3.1 The Working Principle
The working principle of Neutral Sort is: divide the whole list of elements into chunks that are already sorted within themselves; if the total number of chunks is greater than 1, apply the Banding algorithm given below until the total number of chunks becomes 1. In general, Merge sort restricts itself to sub-files of sizes 1, 2, 4, 8, ..., 2^k, where 2^k ≥ the size of the file F. It also needs as many split-merge phases as dictated by the number of elements. Normally, unnecessary split and merge processing is carried out even when a particular chunk is already sorted. For those chunks that are already sorted, we prevent further splitting, thereby reducing the split-merge phases for those chunks. If such sorted chunks are considerably large, this greatly reduces the time required to perform the sorting. Our Neutral Sort is a version of Merge sort that uses the strategy of identifying such sorted chunks and minimizing the unnecessary split-merge operations on them. The strategy is especially well suited where sorting is applied to large volumes of data.

A sample example is given below to demonstrate the working principle of Neutral Sort. Consider a file F containing the elements:

F: 7 9 0 2 1 8 6 4 3 5 10

Segments of F consist of elements that may already be in sorted order. Divide the elements of file F into chunks such that the elements within each chunk are sorted, and copy the chunks alternately into F1 and F2 (chunk 1 into F1, chunk 2 into F2, chunk 3 into F1, chunk 4 into F2, and so on):

F1: 7 9 1 8 4
F2: 0 2 6 3 5 10

Now, perform the Banding algorithm to combine the chunks in F1 and F2. Banding is almost the same as performing the merge algorithm; it sorts the elements of two corresponding chunks while merging them:

F: 0 2 7 9 1 6 8 3 4 5 10

After F1 and F2 are combined, we have 3 chunks, to which the same procedure is applied. Divide the three chunks as before:

F1: 0 2 7 9 3 4 5 10
F2: 1 6 8

Perform the banding operation, which combines the first chunk of F1 with the chunk in F2; the resultant number of chunks is 2:

F: 0 1 2 6 7 8 9 3 4 5 10

Again divide the last two chunks by copying them to F1 and F2:

F1: 0 1 2 6 7 8 9
F2: 3 4 5 10

Apply the Banding algorithm; the two chunks are merged and thus the elements are sorted:

F: 0 1 2 3 4 5 6 7 8 9 10

The above procedure is iterated until the number of chunks becomes equal to one.

3.2 Algorithm

3.2.1 Chunking Pseudocode for Neutral Sort
/* File F containing the data is split and written to two sub-files F1 and F2 by copying segments of sorted data in F alternately to F1 and F2. */
1. Open file F for input, with F1 and F2 as output files.
2. While the EOF of file F has not been reached:
   a. Copy elements of F to F1 as follows: read one element at a time from F and write it to F1 until the next element in F is less than the last item copied to F1, or the EOF is reached.
   b. If not at EOF, copy the next sorted sub-file of F into F2, following the same procedure used for copying to F1.
   End While

3.2.2 Merge Pseudocode for Neutral Sort
/* Merge corresponding sorted sub-files from F1 and F2 to produce the output file F. */
1. Open F1 and F2 for input and F for output.
2. Initialize a global variable "__sub__files" = 0.
3. While neither the EOF of F1 nor the EOF of F2 has been reached:
   a. While the end of the current sub-file has not been reached in either F1 or F2:
      If the next element in F1 is less than the next element of F2, copy the next element from input sub-file F1 into the output file F.
      Else, copy the next element from input sub-file F2 into the output file F.
      End While
   b. If the end of a sub-file in F1 has been reached:
      Copy the rest of the corresponding input sub-file F2 into F.
      Else, copy the rest of the corresponding input sub-file F1 into F.
   c. Increment the global variable "__sub__files" by 1.
   End While
4. Copy any sub-file remaining in F1 or F2 to F, incrementing the global variable "__sub__files" by 1 for each.
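The chunking and banding steps above can be sketched in memory as follows; lists stand in for the files F, F1, and F2, and the function names are our own illustrative choices, not the paper's implementation:

```python
def chunk(f):
    """Split a non-empty list f into its maximal already-sorted runs and
    distribute them alternately to f1 and f2, as in the chunking pseudocode."""
    runs = []
    current = [f[0]]
    for x in f[1:]:
        if x >= current[-1]:
            current.append(x)
        else:                       # run ends where the ordering breaks
            runs.append(current)
            current = [x]
    runs.append(current)
    f1 = runs[0::2]                 # chunks 1, 3, 5, ... go to F1
    f2 = runs[1::2]                 # chunks 2, 4, 6, ... go to F2
    return f1, f2

def band(f1, f2):
    """Merge corresponding runs from f1 and f2 back into a single list f,
    counting the resulting sub-files like the "__sub__files" variable."""
    f, sub_files = [], 0
    for r1, r2 in zip(f1, f2):
        merged, i, j = [], 0, 0
        while i < len(r1) and j < len(r2):
            if r1[i] < r2[j]:
                merged.append(r1[i]); i += 1
            else:
                merged.append(r2[j]); j += 1
        f.extend(merged + r1[i:] + r2[j:])
        sub_files += 1
    for rest in f1[len(f2):] + f2[len(f1):]:   # leftover unmatched runs
        f.extend(rest)
        sub_files += 1
    return f, sub_files
```

On the paper's sample file, `chunk` yields F1 = [7 9][1 8][4] and F2 = [0 2][6][3 5 10], and one `band` pass reproduces F: 0 2 7 9 1 6 8 3 4 5 10 with three remaining chunks.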


3.2.3 Pseudocode for Neutral Sort
/* Use Neutral Sort to sort the file F. */
Repeat the following steps until the global variable "__sub__files" is equal to 1:
1. Call the Chunking algorithm to split the file F into two distinct files F1 and F2.
2. Merge the corresponding chunks in files F1 and F2 using the Merge algorithm; the merged content is written back to file F.

4 RESULTS AND DISCUSSIONS

The run times of all the existing algorithms are listed in Table 1. The execution time for each algorithm is given in microseconds, measured on the Windows operating system with the .NET framework and an i3 processor. The number of elements was gradually increased, and the corresponding run time was recorded for each algorithm. As we can see in Table 1, as the number of elements increases, the execution time also increases for most of the conventional algorithms. A relatively lower execution time is achieved using Neutral Sort for the same sets of elements. Due to the introduction of the chunking and banding algorithms into the existing merge sorting technique, the execution time is considerably reduced, which increases the efficiency of the Neutral Sort algorithm. To show the performance of Neutral Sort against the other sorting algorithms, we present the comparative performance as a line chart in Figure 1. As we can see from Table 1, the performance of Bubble, Selection, and Insertion sort is reasonable up to 1000 elements. When the number of elements to be sorted increases considerably beyond 1000, the performance of Bubble, Selection, and Insertion sort starts degrading: the amount of time required to sort greatly increases as the volume of elements grows. Since our aim is to analyze the performance of Merge sort after introducing the chunking and banding algorithms, the comparison between Merge and Neutral sort in Table 1 gives an insight in terms of execution time. When the number of elements sorted using Merge sort is around 20000, the time required was 176036 microseconds, whereas for the same number of elements Neutral Sort's execution time was 18596 microseconds, which is on par with the performance of Quick sort. It is evident from the result data that, because of the introduction of the chunking and banding algorithms, the execution time is greatly reduced. We can also notice that the performance of Neutral Sort is almost on par with Quick sort.
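Putting the chunking and merge phases together, the Neutral Sort loop of Section 3.2.3 can be sketched as a single in-memory routine (our own illustrative naming; lists stand in for the files, and the run count plays the role of "__sub__files"):

```python
def neutral_sort(f):
    """Repeat chunking and banding until a single sorted run remains."""
    if len(f) <= 1:
        return list(f)
    f = list(f)
    while True:
        # Chunking: split f into maximal sorted runs, alternating to f1/f2.
        runs, current = [], [f[0]]
        for x in f[1:]:
            if x >= current[-1]:
                current.append(x)
            else:
                runs.append(current)
                current = [x]
        runs.append(current)
        if len(runs) == 1:          # one sub-file left: f is sorted
            return f
        f1, f2 = runs[0::2], runs[1::2]
        # Banding: merge corresponding runs back into f.
        f = []
        for r1, r2 in zip(f1, f2):
            i, j = 0, 0
            while i < len(r1) and j < len(r2):
                if r1[i] < r2[j]:
                    f.append(r1[i]); i += 1
                else:
                    f.append(r2[j]); j += 1
            f.extend(r1[i:]); f.extend(r2[j:])
        if len(f1) > len(f2):       # odd number of runs: last run unpaired
            f.extend(f1[-1])
```

On the paper's sample file F = 7 9 0 2 1 8 6 4 3 5 10, this loop performs three chunk-and-band passes before a single run remains, matching the worked example of Section 3.1.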

Figure 1 – Comparison of the execution time of Neutral Sort with other sorting techniques, shown as a line chart.


Table 1 – Execution time in microseconds for various sorting algorithms in comparison with Neutral Sort.

Execution Time in Microseconds (Windows Platform, .NET Framework, i3 Processor)

Total Elements | Neutral | Bubble    | Selection | Insertion | Quick | Merge
100            | 1537    | 1333      | 874       | 766       | 735   | 1222
200            | 1745    | 1250      | 1005      | 1471      | 212   | 937
300            | 1890    | 1826      | 1386      | 1527      | 467   | 944
500            | 1941    | 4062      | 2911      | 2567      | 720   | 1863
1000           | 2909    | 13071     | 7412      | 2693      | 823   | 4604
2000           | 2959    | 52545     | 25149     | 23531     | 914   | 7083
5000           | 2225    | 326268    | 140043    | 100694    | 1931  | 17314
10000          | 2470    | 1097879   | 598587    | 498780    | 3598  | 51063
20000          | 2596    | 4637621   | 2479249   | 2681191   | 7362  | 176036
50000          | 4433    | 27555655  | 13896535  | 12612051  | 18769 | 1039849
100000         | 8552    | 108555145 | 55732827  | 468913578 | 39261 | 33798155

5 CONCLUSION

We have presented the conventional sorting algorithms and extensively discussed the modified Quick and Merge sort algorithms in this work. Since our basic aim is to improve the performance of Merge sort by introducing the concepts of chunking and banding, we have discussed Merge sort extensively. We have also elaborated our Neutral Sort algorithm with a sample example along with the algorithm itself. The results show a convincing improvement in terms of execution time in microseconds. It is evident from the results that the performance of our Neutral Sort algorithm is better compared to the other conventional algorithms. As the number of elements increases, performance degrades in all the conventional algorithms. Because of the introduction of the chunking and banding algorithms into conventional Merge sort, we could increase efficiency considerably. The overhead involved in introducing the chunking and banding algorithms is not discussed in this paper and could be addressed in a subsequent paper.

REFERENCES
[1] Donald E. Knuth, "The Art of Computer Programming", Volume 3, 2nd Edition, Addison-Wesley, 1998.
[2] E. H. Friend, "Sorting on electronic computers", Journal of Applied and Computational Mathematics, Vol. 3, Issue 2, pp. 34-168, 1956.
[3] Debasis Samanta, "Classic Data Structures", 2nd Edition, PHI, 2009.
[4] D. L. Shell, "A high-speed sorting procedure", Communications of the ACM, Vol. 2, Issue 7, pp. 30-32, July 1959.
[5] C. A. R. Hoare, "Algorithm 64: Quicksort", Communications of the ACM, Vol. 4, p. 321, 1961.
[6] Omar Khan Durrani, Shreelakshmi V, Sushma Shetty, and Vinutha D C, "Analysis and determination of asymptotic behavior range for popular sorting algorithms", International Journal of Computer Science Informatics, ISSN 2231-5292, Vol. 2, Issue 1.
[7] Jon Louis Bentley and M. Douglas McIlroy, "Engineering a sort function", Software—Practice and Experience, Vol. 23(11), pp. 1249-1265, 1993.
[8] Coenraad Bron, "Merge sort algorithm [M1] (Algorithm 426)", Communications of the ACM, Vol. 15(5), pp. 357-358, 1972.
[9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, "Introduction to Algorithms", 2nd Edition, MIT Press and McGraw-Hill, 2001, ISBN 0-262-03293-7, p. 17.
[10] Michael A. Bender, Martín Farach-Colton, and Miguel Mosteiro, 2006.
[11] A. D. Mishra and D. Garg, "Selection of Best Sorting Algorithm", International Journal of Intelligent Processing, 2(2):363-368, July-December 2008.
[12] Canaan C., M. S. Garai, and M. Daya, "Popular sorting algorithms", World Applied Programming, Vol. 1, pp. 62-71, April 2011.
[13] Pankaj Sareen, "Comparison of Sorting Algorithms (On the Basis of Average Case)", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, Issue 3, March 2013, ISSN: 2277-128X.
