Sub-linear Algorithms for Shortest Unique Substring and Maximal Unique Matches

by

Sanjeewa Bandara Senanayaka

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science

(Computer Science)

at

The University of Wisconsin–Whitewater

December 2019

Graduate Studies

The members of the Committee approve the thesis of Sanjeewa Bandara Senanayaka, presented on December 6, 2019 (date of oral exam).

Dr. Arnab Ganguly, Chair

Dr. Athula Gunawardena

Dr. Hien Nguyen

Abstract

Living in an era where we produce quintillions of bytes of data on a daily basis means that there are many intriguing (possibly unrevealed) facts hidden behind them. Some of the data that exists in the largest quantities is encountered in Computational Biology; our DNA strands are extremely long (≈ 3.2 billion characters). Genomic problems such as comparing the similarities between two strands of DNA, or finding the uniqueness between two similar strands of DNA, can hugely benefit the understanding of evolution and possibly lead to medical advances. Two avenues for addressing a problem of similar flavor (known as gene alignment) are Shortest Unique Substrings and Maximal Unique Matches. Existing techniques for these problems either necessitate the use of high performance computers, or use techniques that have large memory footprints, or use (theoretical) techniques that are complicated to construct. Prior to our work, Ganguly et al. [TCS' 17] showed that these problems can be solved in O(nτ log²(n/τ)) time and O(n/τ) space for any parameter τ = ω(1) and τ = o(n). However, these techniques require the use of complex data structures and techniques, which makes them practically intractable. Presented here are sub-linear algorithms for the Shortest Unique Substring problem and the Maximal Unique Matches problem. Besides using minimal amounts of working space and being (reasonably) fast, our algorithms are a combination of two simple techniques – the Rabin-Karp fingerprint and Bloom Filters; this makes our work easily amenable to implementation. However, our algorithms suffer from a bounded error probability, which can be reduced using (slightly) more space.

ACKNOWLEDGMENTS

First, I would like to thank my thesis advisor, Dr. Arnab Ganguly of the Computer Science department at the University of Wisconsin-Whitewater. Dr. Ganguly has always been available to assist me and pull me out of tough spots whenever I had questions or ran into trouble during my research. I appreciate his patience and thoroughness throughout my research. He provided me with guidance that allowed me to excel.

I would also like to thank the experts, Dr. Athula Gunawardena and Dr. Hien Nguyen, who were a part of my thesis defense committee. Their passionate participation and their willingness to help allowed me to successfully finalize my research.

I would also like to acknowledge Dr. Jiazhen Zhou, Dr. Lopamudra Mukherjee, Dr. Sobitha Samaranayake, Dr. Zachary Oster and the entire department of Computer Science at the University of Wisconsin-Whitewater for being great teachers and for providing all the knowledge and support required for a student to grow.

Finally, I must express my very profound gratitude to my parents and family for providing me with unconditional support and continuous encouragement throughout my years of study. This accomplishment would not have been possible without them. Thank you.


TABLE OF CONTENTS

Page

ABSTRACT ...... iii

LIST OF TABLES ...... vii

LIST OF FIGURES ...... x

1 Introduction ...... 1

1.1 The Problems ...... 5
1.2 Applications ...... 6
1.3 Previous Work and Motivation ...... 7

2 Preliminaries ...... 10

2.1 Trie ...... 10
2.2 Suffix Trees and Suffix Array ...... 11
2.3 Longest Common Prefix ...... 13
2.4 Bloom Filters ...... 13
2.5 Rabin-Karp Fingerprint ...... 15

3 Computing Shortest Unique Substring ...... 17

3.1 Using Suffix Array and LCP array ...... 17
3.2 Our Approach ...... 18
3.3 Illustration ...... 19

4 Computing Maximal Unique Matches ...... 23

4.1 Using Generalized Suffix Trees ...... 23
4.2 Our Approach ...... 25
4.3 Illustration ...... 29



5 Experiments for Shortest Unique Substring ...... 34

5.1 Setup ...... 34
5.2 Synthetic Genome Data ...... 34
5.3 Real Genome Data ...... 35
5.4 Synthetic Data of Varying Alphabet Size ...... 35
5.5 Peak Memory Usage ...... 35

LIST OF REFERENCES ...... 48


LIST OF TABLES

Table Page

3.1 Explanation of the process for finding the SUS in the string babaccc ...... 21

3.2 Explanation of a false negative in the string abcabdcacb ...... 22

4.1 Explanation of MUM showing the cases where MUM is found and MUM fails. MUM calculation shown for X1 = eabcdacbea and X2 = cabccebdac with window size 1. ...... 30

4.3 Explanation of MUM showing the cases where MUM is found and MUM fails. MUM calculation shown for X1 = eabcdacbea and X2 = cabccebdac with window size 2. ...... 31

4.5 Continued explanation of MUM showing the cases where MUM is found and MUM fails. MUM calculation shown for X1 = eabcdacbea and X2 = cabccebdac with window size 2. ...... 32

4.6 Explanation of MUM failure when the string contains nothing unique, as shown for string aaaa ...... 33

5.1 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly and contains 4 characters. Percentage of error = 14.2% ...... 36

5.2 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The Genomes are real. Percentage of error = 0% ...... 37

5.3 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 2. Percentage of error = 0% ...... 38


5.4 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 4. Percentage of error = 0% ...... 39

5.5 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 8. Percentage of error = 0% ...... 40

5.6 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 16. Percentage of error = 0% ...... 41

5.7 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 32. Percentage of error = 0% ...... 42

5.8 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 64. Percentage of error = 0% ...... 43

5.9 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 128. Percentage of error = 0% ...... 44

5.10 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 256. Percentage of error = 0% ...... 45

5.11 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 512. Percentage of error = 7.1% ...... 46


Appendix Table Page

5.12 Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 1024. Percentage of error = 0% ...... 47


LIST OF FIGURES

Figure Page

2.1 Trie of the strings {abc, ac, cac} ...... 11

2.2 Suffix Tree, Suffix Array, and LCP Array for the string banana$ ...... 14

2.3 Illustration of how to compute LCP array for the string banana$ ...... 14

2.4 Insertion of elements x, y, and z with collision on z ...... 15

4.1 Generalized Suffix Tree of the strings X1 : abcaxy$ and X2 : abcayx#. Red dashed arrows are the suffix links, whereas blue nodes correspond to MUMs. . . . 24

5.1 Figures associated with Table 5.1 ...... 36

5.2 Figures associated with Table 5.2 ...... 37

5.3 Figures associated with Table 5.3 ...... 38

5.4 Figures associated with Table 5.4 ...... 39

5.5 Figures associated with Table 5.5 ...... 40

5.6 Figures associated with Table 5.6 ...... 41

5.7 Figures associated with Table 5.7 ...... 42

5.8 Figures associated with Table 5.8 ...... 43

5.9 Figures associated with Table 5.9 ...... 44

5.10 Figures associated with Table 5.10 ...... 45

5.11 Figures associated with Table 5.11 ...... 46

5.12 Figures associated with Table 5.12 ...... 47


Chapter 1

Introduction

Bernard Marr, in the article [BM], “How much data do we create every day? The mind-blowing stats everyone should know”, states that around 2.5 quintillion bytes of data are created daily by internet users. Many researchers from a variety of fields (such as Computer Science, Computational Biology/Physics, and financial market analysis) have taken up the arduous task of mining this enormous pool of data. A large share of this data can be visualized as a string, i.e., as a sequence of characters; this is not really surprising, as in some sense any data that is used for computation is serialized and stored in the form of a string. Consequently, string data forms the basic backbone of a plethora of engineering tools. This widespread use and applicability of string data naturally brings in the demand for efficient ways to store and analyze the data. For example, consider the Illumina next-generation sequencing (NGS) technology, which generates numerous short reads of an organism; the task is to align these short reads to create the entire genome of the organism. Or, consider the 1000 Genomes Project, which is a comprehensive collection of human genome information. At the core of this storage and retrieval lies the classical problem of pattern matching: find the occurrences of a given string (called the pattern) in another (possibly much longer) string (called the text). This problem forms the backbone of most string-related analysis tools.

Early Developments. The textbook algorithms for the pattern matching problem are the Knuth-Morris-Pratt algorithm [KJP77], the Boyer-Moore algorithm [BM77], and the Rabin-Karp algorithm [KR87]. Although these are (near) optimal when the text and the pattern are both provided at query time, they are too slow in the indexing setting: the text is provided beforehand and is fixed, and the pattern is provided later as a query. The primary reason is that these algorithms scan through the entire text to find occurrences of the pattern, which is rather wasteful, especially when we take into account that texts are much longer than patterns. Moreover, answers to more involved queries are often needed, such as: report occurrences with a bounded number of mismatches allowed, report only the occurrences which are non-overlapping, estimate the average distance between consecutive occurrences, or report the shortest unique substring. What we need are indexes, i.e., data structures built by pre-processing the text, that can be used to answer pattern matching queries without requiring us to read through the text completely.

Indexing. Peter Weiner in 1973 invented the Suffix Tree [Wei73], which is essentially a compacted trie built out of all the suffixes of the text T (see Chapter 2 for formal definitions). Using the suffix tree, one can report all the occurrences of the pattern in time proportional to the length of the pattern and the number of occurrences. In fact, the suffix tree is much more versatile – it forms the basic building block of many other indexes for answering a host of other interesting queries (such as approximate pattern matching [BGVV14, CEPR10, CGL04, TWLY09] and document retrieval [HTSV13, Mut02, Nav13, NN12]) that are required in many practical settings. Later, a space-efficient alternative (a constant-factor improvement) known as the suffix array (along with the Longest-Common-Prefix array, or LCP array) was developed [MM90], which still answers pattern matching queries almost as fast.

Succinct Text Indexes. Although suffix trees/arrays are fast and relatively easy to implement, they occupy too much space compared to the string itself. This effect becomes even more pronounced in the case of DNA (or any scenario where the text length is much larger than the size of the alphabet from which the characters are drawn). Take for example the human genome, which contains approximately 3.2 billion characters from the alphabet Σ = {A, C, G, T}, where the alphabet size σ = 4. The human genome can be stored in about 1GB of space (or even less using compression techniques); a suffix tree of the same, on the other hand, even with a space-efficient implementation such as in [Kur99], occupies almost 40GB. Even with the big boom in processing power over the past years, an average computer still cannot afford this much space in memory. This clearly suggests the need for further study of alternatives to suffix trees, not only because of the theoretical intrigue, but also due to the potential applications in the analysis of synthetic chromosomal regions, the understanding of evolution, and gene replication [DKF+99].

To formalize the notion, notice that each character in a text T of length n over an alphabet Σ of size σ can be encoded in ⌈log σ⌉ bits; therefore, the total number of bits needed to store T is n⌈log σ⌉ bits. The suffix tree (or even the suffix array), on the other hand, requires Θ(n log n) bits of space; this space blowup factor of O(log_σ n) becomes prohibitively large when n ≫ σ, which is precisely the case in the context of genomic data. To alleviate the above scenario, one workaround is to use (rather expensive) computers with large memory-handling capacity. However, most commodity hardware is simply incapable of handling storage of this voluminous capacity. The other approach is to look for alternative avenues, chief among which are small data structures and algorithms with low memory footprints, possibly at the cost of computation time. What this demands is indexes that provide space-efficient storage of data, yet at the same time support fast application-specific queries.

Unfortunately, space-efficient storage and time-efficient analysis of data are often conflicting goals; evidence can be found in the fact that it took almost three decades after the invention of suffix trees (and about a decade after the invention of suffix arrays) to come up with a space-efficient alternative. More specifically, Ferragina and Manzini [FM00, FM05] and Grossi and Vitter [GV00, GV05] introduced their respective succinct text indexes, known respectively as the FM Index and the Compressed Suffix Array (CSA). These indexes occupy n log σ bits plus lower-order terms and can report all the occ occurrences of a pattern P of length p in O((p + occ) log^c n) time, where c > 1 is an arbitrarily small constant.

CSA exploits a crucial structural component (known as suffix links) of the suffix tree/array to achieve compression; the FM Index, on the other hand, relies on a reversible transformation of the text, known as the Burrows-Wheeler Transformation (BWT) [BW94]. Subsequent developments [BOSS12, FLMM09, MBN+17, NM07, OV11, Sad02, Sad07, SOG12] led to compressed indexes that capture the suffix tree in a near-optimal number of bits (w.r.t. the information-theoretic lower bound). Apart from the provably good guarantees, the FM Index is also highly practical; for the human genome, it occupies less than 1.5GB, and query answering is also reasonably fast. Hence, it is now the backbone of numerous Bioinformatics tools such as the Burrows-Wheeler Aligner (BWA) [LD10], Bowtie [LTPS09], and others [APPA15, APPA16, KLL+12, LSM12, LYL+09, SD10].

The Caveats of (Succinct) Indexing. Our problems are in the algorithmic setting, where the string is provided online as a query and the goal is to report some information about the string. Therefore, what we are interested in is the working space (i.e., the peak memory usage when the algorithm runs). We could follow the suffix tree/suffix array route, whereby we build the suffix tree/array of the string and use it to report the desired information; however, as explained earlier, this entails a large memory footprint (that of the working space needed to build the respective structures). Using succinct indexing (in the form of the FM Index or CSA) is another option; however, it presents us with the following problems:

• Either, we have to implement algorithms [HSS03, Bel14, MNN17] for constructing the FM Index/CSA in compact space, i.e., O(n log σ) bits of space. These algorithms are (arguably) difficult to implement and (possibly) involve high constant factors.

• Or, resort to slower (small space) FM Index/CSA construction algorithms, which, although useful in the indexing setting, are not particularly helpful in the online setting.

• Or, use the suffix array to construct the FM Index/CSA, which rather defeats the purpose (as we consider the online setting, where the working space is the deciding factor).

In a nutshell, we need fast algorithms that require a small working memory. 5

Sub-linear Algorithms. In the sub-linear setting, the string of length n is read-only (we can access it but cannot change it); however, the allowed residual space is sub-linear in the number of bits needed to represent the string (i.e., n⌈log σ⌉ bits). Although the Knuth-Morris-Pratt (KMP) algorithm [KJP77] works in this setting in some sense (as it only needs access to the text and maintains a table of size equaling the length of the pattern), it cannot be extended (at least not obviously) to our problems. For us, the more useful technique is the Rabin-Karp (RK) fingerprint [KR87]. We draw inspiration from the work of Porat and Porat [PP09], who, by using the RK fingerprint in a remarkably elegant way, discovered a pattern matching algorithm with a memory footprint of just O(log n) words. We employ similar strategies, whereby we couple the RK fingerprint with the Bloom Filter [Blo70], another sketching technique for answering membership queries (i.e., is an element present in a set or not). We are now equipped to present the problems and existing results.
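To illustrate the RK fingerprint just mentioned, here is a minimal rolling-hash pattern matcher; the base and prime modulus are our illustrative choices, not parameters taken from the thesis.

```python
# Rabin-Karp fingerprint sketch: the hash of a sliding window is updated
# in O(1) per shift instead of being recomputed from scratch.
BASE, MOD = 256, (1 << 61) - 1  # a Mersenne prime keeps collisions unlikely

def fingerprint(s: str) -> int:
    """Polynomial hash of the whole string s."""
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % MOD
    return h

def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return the 0-based starting positions where pattern occurs in text."""
    p, n = len(pattern), len(text)
    if p > n:
        return []
    target = fingerprint(pattern)
    window = fingerprint(text[:p])
    shift = pow(BASE, p - 1, MOD)  # weight of the outgoing character
    hits = []
    for i in range(n - p + 1):
        # Fingerprints can collide, so verify before reporting.
        if window == target and text[i:i + p] == pattern:
            hits.append(i)
        if i + p < n:
            window = ((window - ord(text[i]) * shift) * BASE
                      + ord(text[i + p])) % MOD
    return hits

print(find_occurrences("ana", "banana"))  # [1, 3]
```

With a sufficiently large random modulus the probability that two distinct windows share a fingerprint is negligible, which is exactly the kind of bounded error probability our algorithms inherit.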

1.1 The Problems

Analysis of our genetic code plays an important role in understanding our evolutionary background, the structure and function of certain human characteristics, etc. This area of study has given rise to the Gene Alignment problem, which involves aligning DNA, RNA, or protein sequences to identify the similarity between organisms, as a step toward understanding different aspects of evolution. Among the numerous techniques (see, e.g., [LD10, LTPS09, APPA15, APPA16, KLL+12, LSM12, LYL+09, SD10, LD09] and references therein) for solving the gene alignment problem, two popular ones use Shortest Unique Substrings and Maximal Unique Matches.

Definition 1. A Shortest Unique Substring (SUS) of a string X is a substring X[i, j] that satisfies the following properties:

• X[i, j] appears exactly once in X
• There exists no other substring X[i′, j′] such that it is unique and j′ − i′ < j − i

For example, consider the following string: X = babaccc. An SUS is ab since this is a shortest substring of X that appears only once.

Problem 1 (Shortest Unique Substring Problem [PWY13]). Given a string T , find a shortest unique substring of T .
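For intuition, Problem 1 admits the following brute-force baseline (our illustration, not the algorithm developed in this thesis); note that occurrence counting must allow overlaps, which is why we avoid Python's non-overlapping `str.count`.

```python
def count_occurrences(x: str, sub: str) -> int:
    """Number of (possibly overlapping) occurrences of sub in x."""
    return sum(1 for i in range(len(x) - len(sub) + 1) if x.startswith(sub, i))

def shortest_unique_substring(x: str):
    """Try lengths 1, 2, ...; return the first substring that occurs
    exactly once in x. Polynomial-time brute force, for illustration only."""
    n = len(x)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            sub = x[i:i + length]
            if count_occurrences(x, sub) == 1:
                return sub
    return None  # only possible for the empty string

print(shortest_unique_substring("babaccc"))  # ab
```

On the example above, every single character of babaccc repeats, the length-2 window ba repeats, and ab is the first window to occur exactly once.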

Definition 2. A Maximal Unique Match (MUM) of two strings X and Y is a substring X[i, j] of X that satisfies the following properties:

• X[i, j] appears exactly once in X,
• X[i, j] appears exactly once in Y, and
• neither X[i − 1, j] nor X[i, j + 1] appears in Y

For example, consider the two strings: X = eabcdcebea and Y = cabccebdac.

Substring   MUM?   Reason
abc         Yes    Unique in X and Y. Neither the left extension eabc nor the right extension abcd appears in Y.
ea          No     Appears multiple times in X.
ebe         No     Unique in X, but not present in Y.
ab          No     Unique in X and Y. However, the right extension abc exists in Y.

Problem 2 (Maximal Unique Matches Problem [DKF+99]). Given two strings T1 and T2, find a maximal unique match of T1 and T2.
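Definition 2 can likewise be checked by brute force. The sketch below is our illustration, following the definition literally (extensions of the match in X must be absent from Y); it is not the algorithm developed later.

```python
def count_occ(s: str, sub: str) -> int:
    """Number of (possibly overlapping) occurrences of sub in s."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))

def maximal_unique_matches(x: str, y: str) -> list[str]:
    """All substrings of x that occur exactly once in each of x and y and
    whose one-character extensions (within x) do not occur in y."""
    mums = []
    n = len(x)
    for i in range(n):
        for j in range(i, n):
            sub = x[i:j + 1]
            if count_occ(x, sub) != 1 or count_occ(y, sub) != 1:
                continue
            left = x[i - 1:j + 1] if i > 0 else None    # left extension, if any
            right = x[i:j + 2] if j + 1 < n else None   # right extension, if any
            if (left is None or count_occ(y, left) == 0) and \
               (right is None or count_occ(y, right) == 0):
                mums.append(sub)
    return mums

mums = maximal_unique_matches("eabcdcebea", "cabccebdac")
print("abc" in mums, "ab" in mums)  # True False
```

Running it on the example strings reproduces the table: abc qualifies, while ab is rejected because its right extension abc occurs in Y.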

1.2 Applications

The SUS problem was introduced by Pei et al. [PWY13]. Some of its common applications include finding the uniqueness between two closely related organisms [HPMW05], polymerase chain reaction (PCR) primer design through the analysis of DNA sequences [PWY13], genome mappability [DEMS+12], and next-generation short-read sequencing [ABF+15]. Pei et al. also listed other potential applications, such as document searching on the internet.

For genome alignment, one popular technique [DKF+99, DPCS02] uses the concept of Maximal Unique Matches as an intermediary step. More formally, in this intermediary step, one has to compute the maximal unique matches of the genes that are to be aligned. In fact, this principle has been used to develop one of the most popular software tools for genome alignment, known as MUMmer [DKF+99].

1.3 Previous Work and Motivation

Shortest Unique Substring. Pei et al. [PWY13] introduced a problem similar to Problem 1, where we are required to find a SUS covering a given position k in the input string. Using the suffix tree of the input string, they presented an O(n)-time and Θ(n)-word solution. Ileri et al. [IKX14] (and, independently, Tsuruta et al. [TIBT14]) discovered algorithms with competitive practical performance. More importantly, in comparison to the algorithm of Pei et al. [PWY13], they showed that their algorithm not only saves space by a factor of 20, but also attains a speedup by a factor of 4. This is achieved by replacing the suffix tree with a combination of the Suffix Array, its inverse, and the LCP array. Hon et al. [HTX15] achieved further space improvements by introducing an in-place framework. Specifically, their algorithm needs 2n words of space in addition to the input string. We refer to [BC14, HPT14, MIBT16] for closely related works.

Maximal Unique Matches. For Problem 2, Delcher et al. [DKF+99] used the Generalized Suffix Tree of X and Y to devise an O(n)-time and Θ(n)-space algorithm for the MUM problem; here and henceforth, n = |X| + |Y| denotes the total length of the two strings.¹ The software MUMmer 1.0 presented an implementation, which was subsequently improved (by a factor of three in terms of space-efficiency) to MUMmer 2.0 in [DPCS02]. Later, Kurtz et al. [KPD+04] developed a more space-efficient version in MUMmer 3.0. In MUMmer 1.0, aligning the 4.7 Mb genome of E. coli requires 293 megabytes (MB) of memory; MUMmer 2.0 improved this to 102 MB, and MUMmer 3.0 further reduced the space to 77 MB.

¹The generalized suffix tree (GST) [Gus97] of a collection of strings of total length n is a compact trie that stores all the suffixes of every string in the collection. The GST occupies O(n) words of space.

The Case for Sub-linear Algorithms. Unfortunately, the underlying use of suffix trees (or equivalent data structures) leads to high memory consumption, making them rather impractical for huge volumes of data. The space-occupancy issue is most profound in the typical application of suffix trees, that of DNA, in which the alphabet has size four, but the lengths of the strings (such as the Human Genome) are typically in the billions.² Even with a space-efficient construction/implementation, such as in [McC76, Ukk95, Kur99], a suffix tree occupies 40 gigabytes, whereas the input Human Genome can be stored under one gigabyte (possibly using compression techniques). Since the primary applications of both problems discussed here involve DNA, the use of suffix trees presents a serious bottleneck; evidence can be found in the experimental results of Ileri et al. [IKX14], who were not able to run Pei et al.'s algorithm [PWY13] on large data. Likewise, in the case of the MUM problem, despite the improvements in the MUMmer software, the peak memory usage is still roughly 15 times the size of the input sequence. The primary reason for this blowup in space is again the use of (generalized) suffix trees.

The reason for this blowup can be understood by looking at the difference between the number of bits required to store the input string(s) and the space in bits needed by the (generalized) suffix tree. If the input string(s) is of (total) length n, then the space needed to store them is ⌈n log σ⌉ bits, where σ is the size of the alphabet from which characters are drawn. The suffix tree, on the other hand, requires O(n) words, or equivalently O(n log n) bits, in the Word-RAM model.³ This O(log_σ n) blowup in space becomes prohibitive when one considers the typical setting: DNA or RNA has σ = 4 and n in the billions. One can make slight (constant factor) improvements by replacing the suffix tree with a combination of the suffix array and the longest-common-prefix (LCP) array, but that is still too large for these instances.

For the MUM problem, Hon and Sadakane [HS02] showed that the problem can be solved in O(n log n) time using O(n log σ) bits of working space. The main ingredient is the use of the FM-Index (in place of the suffix tree). One can slightly modify the techniques of Belazzougui and Cunial [BC14] to obtain an algorithm that needs O(n log σ) bits of working space for the SUS problem. Although this improves the suffix tree/suffix array based solutions by an O(log_σ n) factor, one still needs a constant factor (multiplicative) overhead on the actual number of bits occupied by the input strings.

The first sub-linear solutions for these problems were obtained by Ganguly et al. [GHST17]; they showed that the problems can be solved in O(nτ log²(n/τ)) time and O(n/τ) space for any parameter τ = ω(1) and τ = o(n). Although this can obtain arbitrary space improvements (in the sub-linear model), it suffers from a drawback: limited applicability due to the use of sophisticated data structures and algorithms, which are typically tricky to implement (and may not be practically feasible). We aim to bridge this gap by supplanting theoretical results with their practical counterparts.

²For example, in its effort to compare “E. coli K12 vs E. coli O157:H7”, the peak memory usage of the latest version of MUMmer (version 3.0) [KPD+04] is 77MB, whereas according to [BO98]: “chromosome lengths among natural strains of E. coli can differ by as much as 1Mb, ranging from 4.5 to 5.5 Mb in length”. As seen from this, the data has blown up by over 10 times!
³In the RAM model, the word size is Θ(log n) bits.

Chapter 2

Preliminaries

Throughout this thesis, we use the Word-RAM model of computation. In this model, each word consists of Θ(log n) bits and all operations on a single word take constant time. For any string S, the ith character of the string is given by S[i], where 1 ≤ i ≤ |S|. The concatenation of two strings S and S′ is denoted by S ◦ S′.

2.1 Trie

A trie is basically a tree that stores a collection of strings {S1, S2, . . . , Sk}, where si is the length of Si and n = s1 + s2 + · · · + sk. Initially, the trie consists of a root node. During the insertion of the first string S1 = S1[1] ◦ S1[2] ◦ · · · ◦ S1[s1], we form a sequence of nodes ⟨root = v0, v1, . . . , vs1⟩, where vi is the parent of vi+1 and the edge from vi to vi+1 is labeled by S1[i + 1], where 0 ≤ i < s1.

Once Si is inserted, the insertion of Si+1 is done by traversing down the tree using each character of Si+1 until a mismatch occurs, say at a node v after matching x characters. Now, we insert the characters Si+1[x + 1], Si+1[x + 2], . . . , Si+1[si+1] by treating v as the root.

A suffix trie is a trie of all the suffixes of a text T [1, n]. Before creating the suffix trie, a $ is appended to the end of T so that the set of suffixes is prefix-free.¹ This ensures that each leaf in the trie corresponds to a suffix. The leaves are arranged from left to right in the lexicographic order of the suffixes they represent. Additionally, note that each leaf corresponds to a unique suffix of T;

¹A set of strings is prefix-free if no string appears as the prefix of another string in the collection.

we augment each leaf with the starting position of the corresponding suffix.

Figure 2.1: Trie of the strings {abc, ac, cac}

To match a pattern P , simply traverse the suffix trie for every character of P . There are two possible outcomes.

• We have completely matched P, and the last traversed edge is from u to v. We call v the locus of P, denoted locus(P). The suffixes that begin with P lie in the subtree of v.

• We have failed to match P . This means that the pattern P does not exist in T .

All the occurrences of P in T can be found in O(p + occ) time, where p is the length of the pattern P and occ is the number of occurrences of P in T. In the worst case, where every character in T is unique, a node is created for each character of each suffix of the text. Thus, there will be a total of O(n²) nodes, making the size of the trie quadratic in the length of the text.
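The trie insertion and traversal described above can be sketched with plain nested dictionaries (a minimal illustration of the mechanism, not the thesis implementation):

```python
def make_trie(strings):
    """Build a trie as nested dicts; the key "$" marks the end of a string."""
    root = {}
    for s in strings:
        node = root
        for c in s:
            node = node.setdefault(c, {})  # follow, or create, the edge labeled c
        node["$"] = {}                     # leaf marker
    return root

def occurs(trie, pattern):
    """Traverse edge by edge; True iff some stored string starts with pattern."""
    node = trie
    for c in pattern:
        if c not in node:
            return False  # mismatch: pattern absent
        node = node[c]
    return True           # node is locus(pattern)

trie = make_trie(["abc", "ac", "cac"])                    # the strings of Figure 2.1
suffix_trie = make_trie("banana$"[i:] for i in range(7))  # a suffix trie of banana$
print(occurs(trie, "ab"), occurs(trie, "cb"))             # True False
```

Building the suffix trie by inserting all suffixes, as in the last line, makes the quadratic worst case concrete: each of the n suffixes may contribute up to its full length in new nodes.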

2.2 Suffix Trees and Suffix Array

A suffix tree is a compact version of the suffix trie, which can be obtained as follows. Let v1 and v2 be two consecutive nodes on a root to leaf path in the suffix trie. Let v0 be the parent of v1. If v2 is the only child of v1, then remove v1 and create a single edge from v0 to v2, where the label of this new edge is the concatenation of the labels of the edges from v0 to v1 and v1 to v2. This ensures that the suffix tree of T [1, n] has at most 2n − 1 nodes. 12

A suffix array SA is an array of length n that maintains the lexicographic arrangement of all the suffixes of T. More formally, SA[i] = j and ISA[j] = i iff the ith lexicographically smallest suffix starts at position j. Every node in the suffix tree is equipped with perfect hashing [FKS84], so that one can find the edge (or detect that there is none) from that node beginning with a particular character in O(1) time. Each edge stores two integers a and b, such that the edge is labeled by T [a] ◦ T [a + 1] ◦ · · · ◦ T [b]. At every node u, we store two integers L and R, where (L − 1) and (R − 1) are respectively the number of leaves to the left of the leftmost and the rightmost leaf under u. Note that L and R are the leftmost and rightmost indexes such that for every i ∈ [L, R], SA[i] is the starting position of a suffix that is prefixed by str(u), the concatenation of edge labels from the root to u. Note that the suffix array is just a permutation of {1, 2, . . . , n}. Hence, the suffix tree and the suffix array can be stored in O(n) words (or equivalently, in O(n log n) bits). To find the occurrences of a pattern P, find the locus node by the same process as in the suffix trie; the locus can be found in O(|P|) time. Upon finding the locus node, the occurrences of P can be found by reporting SA[i] for every i ∈ [L, R]. The time for reporting all the occurrences of P [1, p] is O(p + occ), where occ is the number of occurrences of P in T. Since suffix trees consume too much space, one of the challenges in the initial stages was to support pattern matching using an index of smaller size. Indeed, pattern matching can be done using only the suffix array and the original text. As suffix arrays group suffixes together in lexicographic order, this can be used to find all the positions where the pattern occurs. We provide a brief sketch below (for more details, see [Gus97]):

• Use binary search to find [L, R] such that L is the smallest index and R is the largest index in the suffix array, where for any i ∈ [L, R], the suffix at SA[i] starts with the character P[1].

• Starting from [L, R], recursively use the same idea to find a maximal range [L′, R′] such that for any i ∈ [L′, R′], the suffix at SA[i] starts with the string P[1] ◦ P[2] ◦ · · ·

If at any point in the process above L > R, then the pattern does not exist in the text; otherwise, we end up with a range [L, R] in the suffix array such that for every i ∈ [L, R], the suffix at SA[i] starts with P. Report these occurrences using the suffix array. Since each binary search takes O(log n) time, all the occurrences of the pattern are retrieved in O(|P| log n + occ) time.
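The binary-search scheme above can be sketched in Java as follows. This is an illustrative implementation, not the thesis code: the class and method names are assumed, and the suffix array is built by plain comparator sorting (O(n² log n) worst case) purely for brevity.

```java
import java.util.Arrays;

// Pattern matching over a suffix array via two binary searches for [L, R].
public class SAPatternSearch {
    // 0-indexed suffix array built by naive sorting; sufficient for illustration.
    static int[] suffixArray(String t) {
        Integer[] sa = new Integer[t.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> t.substring(a).compareTo(t.substring(b)));
        return Arrays.stream(sa).mapToInt(Integer::intValue).toArray();
    }

    // Returns {L, R} (inclusive, 0-indexed) of suffixes prefixed by p, or null.
    static int[] range(String t, int[] sa, String p) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {                       // leftmost suffix >= p
            int m = (lo + hi) / 2;
            if (t.substring(sa[m]).compareTo(p) < 0) lo = m + 1; else hi = m;
        }
        int L = lo;
        hi = sa.length;
        while (lo < hi) {                       // leftmost suffix whose |p|-prefix > p
            int m = (lo + hi) / 2;
            String prefix = t.substring(sa[m], Math.min(sa[m] + p.length(), t.length()));
            if (prefix.compareTo(p) <= 0) lo = m + 1; else hi = m;
        }
        int R = lo - 1;
        return (L <= R && t.startsWith(p, sa[L])) ? new int[]{L, R} : null;
    }

    public static void main(String[] args) {
        String t = "banana$";
        int[] sa = suffixArray(t);
        int[] lr = range(t, sa, "ana");
        // suffixes "ana$" and "anana$" both start with "ana"
        System.out.println(lr[1] - lr[0] + 1); // prints 2 occurrences
    }
}
```

For `banana$`, the range returned for the pattern `ana` covers the two suffixes `ana$` and `anana$`, matching the two occurrences of `ana` in the text.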

2.3 Longest Common Prefix

Define lcp(X, Y) to be the length of the longest common prefix of two strings X and Y. For example, consider the strings X = cbacbabcb, Y = cbaccbacb, and Z = acbabcabc. Then,

v, w            lcp(v, w)   Reason
v = X, w = Y    4           The longest common prefix of X and Y is cbac
v = X, w = Z    0           There is no common prefix

The longest common prefix array (LCP array) is an array LCP[1, n] that maintains the length of the longest common prefix of every pair of consecutive suffixes in lexicographic order. More specifically,

LCP[n] = ⊥
LCP[i] = lcp(T[SA[i], n], T[SA[i + 1], n]), when i < n

Using the Manber-Myers algorithm [MM90], we can compute both the suffix array and the LCP array in O(n log n) time.
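The definitions above can be checked with a direct (non-Manber-Myers) construction: sort the suffixes, then take longest common prefixes of lexicographic neighbours. The following Java sketch (class name assumed) reproduces the SA and LCP values of Figure 2.2 for banana$; arrays are 1-indexed to match the thesis convention, with index 0 unused and −1 standing for ⊥.

```java
import java.util.Arrays;

// Naive O(n^2 log n) construction of SA and LCP, for illustration only.
public class SALcpDemo {
    static int lcp(String x, String y) {
        int k = 0;
        while (k < x.length() && k < y.length() && x.charAt(k) == y.charAt(k)) k++;
        return k;
    }

    // Returns {SA, LCP}; SA[i] = start (1-indexed) of the i-th smallest suffix,
    // LCP[i] = lcp of the suffixes at SA[i] and SA[i+1]; LCP[n] = -1 stands for ⊥.
    static int[][] build(String t) {
        int n = t.length();
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> t.substring(a).compareTo(t.substring(b)));
        int[] sa = new int[n + 1], lcp = new int[n + 1];
        for (int i = 1; i <= n; i++) sa[i] = order[i - 1] + 1;
        for (int i = 1; i < n; i++)
            lcp[i] = lcp(t.substring(sa[i] - 1), t.substring(sa[i + 1] - 1));
        lcp[n] = -1; // ⊥
        return new int[][]{sa, lcp};
    }

    public static void main(String[] args) {
        int[][] r = build("banana$");
        System.out.println(Arrays.toString(r[0])); // [0, 7, 6, 4, 2, 1, 5, 3]
        System.out.println(Arrays.toString(r[1])); // [0, 0, 1, 3, 0, 0, 2, -1]
    }
}
```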

2.4 Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure based on hashing that answers membership queries: is a given key from the universe {1, 2, . . . , U} in the data structure or not? More specifically, a Bloom filter is a bit-array BF of size m, which is


i       1  2  3  4  5  6  7
SA[i]   7  6  4  2  1  5  3
ISA[i]  5  4  7  3  6  2  1
LCP[i]  0  1  3  0  0  2  ⊥

Figure 2.2: Suffix Tree, Suffix Array, and LCP Array for the string banana$

Figure 2.3: Illustration of how to compute LCP array for the string banana$

augmented with β independent hash functions h1, h2, . . . , hβ, each of which maps a key from [1, U] to [1, m]. Initially, all entries of the bit-array are set to zero.

• To insert a key k into the Bloom filter, simply set BF[hi(k)] = 1 for i = 1, 2, . . . , β.

• To answer a search query for a key k, return true if BF[hi(k)] = 1 for every i = 1, 2, . . . , β; otherwise, return false.

Clearly, the space required is Θ(m) bits, and any operation requires Θ(β) time, assuming that hash values are computed in constant time. Suppose a search query on a key k returns true. This may be a false positive (i.e., k is actually not in the Bloom filter), because one or more other keys may have set exactly the same bits as those indexed by the hash functions for k. However, with a reasonably large number of hash functions and a large enough bit-array, the error probability is well-bounded.
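A minimal Bloom filter along the lines above can be sketched in Java as follows. The class name, the multiplicative mixing constants, and the seeded hash family are assumptions for the example; the thesis itself derives its hash family from Rabin-Karp fingerprints.

```java
import java.util.BitSet;

// Minimal Bloom filter: a BitSet of m bits and beta seeded hash functions.
public class BloomFilter {
    private final BitSet bits;
    private final int m, beta;

    BloomFilter(int m, int beta) { this.m = m; this.beta = beta; bits = new BitSet(m); }

    // i-th hash of key, mapped into [0, m); a simple multiply-and-mix scheme.
    private int h(int i, long key) {
        long x = key * 0x9E3779B97F4A7C15L + i * 0xBF58476D1CE4E5B9L;
        x ^= x >>> 31;
        return (int) Math.floorMod(x, (long) m);
    }

    void insert(long key) { for (int i = 0; i < beta; i++) bits.set(h(i, key)); }

    // false => definitely absent; true may be a false positive
    boolean mayContain(long key) {
        for (int i = 0; i < beta; i++) if (!bits.get(h(i, key))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1 << 16, 4);
        bf.insert(42); bf.insert(1337);
        System.out.println(bf.mayContain(42));   // true (inserted keys always answer true)
        System.out.println(bf.mayContain(99));   // false with high probability
    }
}
```

Note that deletions are not supported: clearing a bit could erase evidence of another key that hashed to the same position.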

Figure 2.4: Insertion of elements x, y, and z, with a collision on z

2.5 Rabin-Karp Fingerprint

The Rabin-Karp (RK) fingerprint algorithm is a string-searching algorithm that uses a rolling hash function, known as the RK fingerprint, to find the positions in a text where a pattern occurs. We sketch here some of the key properties of the Rabin-Karp fingerprint. Let S be a string over an alphabet of size σ, and let p > |S| be a prime number. The fingerprint of S is

Φ(S) = ( Σ_{k=1}^{|S|} S[k] · σ^(|S|−k) ) mod p

so the leading character of S is weighted by the highest power of σ.

Rolling-Hash Property. Let S = x ◦ S′, where x is the first character of S, and let y be any character. Given Φ(S) and σ^(|S|−1) mod p, we can compute Φ(S′ ◦ y) in constant time as follows:

Φ(S′ ◦ y) = ( Σ_{k=1}^{|S′|} S′[k] · σ^(|S′|+1−k) + y ) mod p
          = ( σ · Σ_{k=1}^{|S′|} S′[k] · σ^(|S′|−k) + y ) mod p
          = ( σ · (Φ(S) − x · σ^(|S|−1)) + y ) mod p
          = ( σ · (Φ(S) − x · (σ^(|S|−1) mod p)) + y ) mod p
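The identity above can be verified numerically. The following Java sketch (class name, σ = 256, and the demonstration prime are assumptions for the example) computes a fingerprint directly via Horner's rule and checks that sliding the window gives the same value as recomputing from scratch.

```java
// Numeric check of the rolling-hash identity.
public class RollingHash {
    static final long P = 1_000_000_007L, SIGMA = 256;

    // Φ(S) = (Σ S[k]·σ^(|S|−k)) mod p — leading character gets the highest power.
    static long phi(String s) {
        long f = 0;
        for (int k = 0; k < s.length(); k++) f = (f * SIGMA + s.charAt(k)) % P;
        return f;
    }

    // Given Φ(x ∘ S'), drop x and append y in O(1); sigmaPow = σ^(|S|-1) mod p.
    static long slide(long phiS, char x, char y, long sigmaPow) {
        return Math.floorMod(SIGMA * (phiS - x * sigmaPow % P) + y, P);
    }

    public static void main(String[] args) {
        String s = "abcd";                // S = x ∘ S' with x = 'a', S' = "bcd"
        long pow = 1;                     // σ^(|S|-1) mod p
        for (int i = 1; i < s.length(); i++) pow = pow * SIGMA % P;
        long rolled = slide(phi(s), 'a', 'e', pow);
        System.out.println(rolled == phi("bcde")); // true: matches direct evaluation
    }
}
```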

Chapter 3

Computing Shortest Unique Substring

Throughout this section, we consider the scenario where we are trying to find the shortest unique substring (SUS) of a given string X of length n.

3.1 Using Suffix Array and LCP array

We first obtain the suffix array SA[1, n] and the LCP array LCP[1, n] for the input string X of length n; both arrays can be computed in Θ(n log n) time using the Manber-Myers construction algorithm [MM90]. To compute the SUS, the main idea is to compute, for each suffix, the longest prefix that occurs more than once in X; the length of the SUS is one more than the shortest such prefix. Following is the outline of the algorithm:

• Initialize sus ← n and susIndex ← 1

• For i = 2, 3, . . . , n, do the following:

  – longestRepeatPrefix ← max{LCP[i − 1], LCP[i]}

  – if (longestRepeatPrefix + 1 < sus) and (SA[i] + longestRepeatPrefix ≤ n), then

    ∗ sus ← longestRepeatPrefix + 1

    ∗ susIndex ← SA[i]

Clearly, the algorithm above finds the SUS in O(n log n) time and uses O(n log n) bits of working space.
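The outline above can be coded directly. The following Java sketch (class name assumed) sorts the suffixes naively rather than with Manber-Myers, so it runs in O(n² log n); it is meant only to make the logic concrete. It returns the SUS length together with a 1-indexed starting position, as in the thesis convention.

```java
import java.util.Arrays;

// SUS via suffix array + LCP array, following the outline in Section 3.1.
public class SusViaSA {
    // Returns {susLength, startIndex} with startIndex 1-indexed.
    static int[] sus(String t) {
        int n = t.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> t.substring(a).compareTo(t.substring(b)));
        int[] lcp = new int[n]; // lcp[i] = lcp of suffixes at sa[i-1] and sa[i]
        for (int i = 1; i < n; i++) {
            int a = sa[i - 1], b = sa[i], k = 0;
            while (a + k < n && b + k < n && t.charAt(a + k) == t.charAt(b + k)) k++;
            lcp[i] = k;
        }
        int sus = n, susIndex = 1;
        for (int i = 0; i < n; i++) {
            // longest prefix of the suffix at sa[i] that also occurs elsewhere
            int rep = Math.max(i > 0 ? lcp[i] : 0, i + 1 < n ? lcp[i + 1] : 0);
            if (rep + 1 < sus && sa[i] + rep < n) { // the unique substring fits in t
                sus = rep + 1;
                susIndex = sa[i] + 1;
            }
        }
        return new int[]{sus, susIndex};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sus("babaccc"))); // [2, 2]: "ab" at position 2
    }
}
```

For `babaccc`, no single character is unique (a, b, and c all repeat), and the shortest unique substring found is `ab` of length 2 starting at position 2, consistent with the illustration in Section 3.3.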

3.2 Our Approach

We use the rolling-fingerprint technique of Rabin-Karp and two Bloom filters BF1 and BF2. To get decent probability bounds, we choose an appropriate number, say some constant c > 1, of prime numbers, where each prime number is in Θ(n). Each prime yields one RK hash function, giving the family h1, h2, . . . , hc. The main intuition is as follows: the length of the SUS is the smallest δ for which there exists a unique substring of length δ. Based on this, following is an outline of how the algorithm works.

Compute SUS

• Return the smallest δ ∈ {1, 2, . . . , n} for which there is a unique substring of length δ

Is there a unique substring of length δ?

• Run the bloom filter setting phase

• If the Bloom filter reading phase returns ‘YES’, then return ‘YES’; else return ‘NO’

Setting Phase

• For a window of size δ, compute the set of Rabin-Karp (RK) hash values h1, h2, . . . , hc. Initially, the first δ characters of the string form the first window.

• For each of the computed hash values {h1, h2, . . . , hc}, check the corresponding index in Bloom filter BF1. If all the indexes are already set to 1, then set BF2 to 1 at these indexes; else, set the corresponding indexes in BF1 to 1.

• Slide the window by one character, i.e., remove the first character of the window and include the next character of the string. If there is no next character, end the process.

Reading Phase

• For a window of size δ, compute the set of Rabin-Karp (RK) hash values h1, h2, . . . , hc. Initially, the first δ characters of the string form the first window.

• For each of the computed hash values {h1, h2, . . . , hc}, check the corresponding index in Bloom filter BF2. If some index is not set, then the corresponding δ-length string has appeared only once in the string; return ‘YES’.

• Slide the window by one character, i.e., remove the first character of the window and include the next character of the string. If there is no next character, return ‘NO’.

Refer to Algorithm 1 for a detailed pseudo-code.

3.3 Illustration

The procedure for finding a SUS of the string X = babaccc is shown in Table 3.1. This technique, however, can result in false negatives (i.e., it may skip over a unique string without recognizing it as one). Refer to Table 3.2 for a scenario where the SUS is missed for the string X = abcabdcacb.

Algorithm 1 Shortest Unique Substring (SUS)

function INITIALIZE(X[1, n], σ, primes[1, c], δ)
    for j ← 1 to c do
        h[j] ← σ^(δ−1) mod primes[j]
        Φ[j] ← (X[1] · σ^(δ−1) + X[2] · σ^(δ−2) + · · · + X[δ]) mod primes[j]
    return ⟨h, Φ⟩

function SLIDE(X[1, n], σ, primes[1, c], i, Φ, h, δ)
    for j ← 1 to c do
        Φ[j] ← (σ · (Φ[j] − X[i] · h[j]) + X[i + δ]) mod primes[j]

function SUS(X[1, n], σ, primes[1, c])
    δ ← 1
    t ← maximum value in primes
    Initialize two Bloom filters BF1 and BF2, each of size t
    while δ ≤ n do
        ⟨h, Φ⟩ ← INITIALIZE(X, σ, primes, δ)            ▷ Begin Setting Phase
        for i ← 1 to n − δ + 1 do
            allSet ← 1
            for j ← 1 to c do
                if BF1[Φ[j]] = 0 then
                    allSet ← 0
                BF1[Φ[j]] ← 1
            if allSet = 1 then
                for j ← 1 to c do BF2[Φ[j]] ← 1
            if i ≤ n − δ then
                SLIDE(X, σ, primes, i, Φ, h, δ)
        ⟨h, Φ⟩ ← INITIALIZE(X, σ, primes, δ)            ▷ Begin Reading Phase
        for i ← 1 to n − δ + 1 do
            for j ← 1 to c do
                if BF2[Φ[j]] = 0 then
                    return ⟨i, δ⟩
            if i ≤ n − δ then
                SLIDE(X, σ, primes, i, Φ, h, δ)
        Reset all bits of BF1 and BF2
        δ ← δ + 1
    return n
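For concreteness, the following is a compact Java rendering of Algorithm 1. It is an illustrative sketch, not the implementation used in our experiments: the class name, the choice σ = 256, and the small demonstration primes are assumptions for the example. The c Rabin-Karp fingerprints serve as the Bloom-filter hash family; BF1 records "seen at least once" and BF2 records "seen at least twice", so the scheme inherits the false-negative behavior discussed in Section 3.3.

```java
import java.util.Arrays;

// Bloom-filter + Rabin-Karp SUS, following Algorithm 1.
public class BloomSus {
    static final long SIGMA = 256;

    // Fingerprints of the first window X[0..delta-1] under each prime.
    static long[] init(String x, int delta, long[] primes) {
        long[] phi = new long[primes.length];
        for (int j = 0; j < primes.length; j++)
            for (int k = 0; k < delta; k++)
                phi[j] = (phi[j] * SIGMA + x.charAt(k)) % primes[j];
        return phi;
    }

    // Slide the window starting at i one step to the right (in place).
    static void slide(String x, int i, int delta, long[] phi, long[] pow, long[] primes) {
        for (int j = 0; j < primes.length; j++)
            phi[j] = Math.floorMod(SIGMA * (phi[j] - x.charAt(i) * pow[j] % primes[j])
                                   + x.charAt(i + delta), primes[j]);
    }

    // Returns {startIndex (1-indexed), susLength}, or {1, n} as a fallback.
    static int[] sus(String x, long[] primes) {
        int n = x.length(), t = 0, c = primes.length;
        for (long p : primes) t = Math.max(t, (int) p);
        boolean[] bf1 = new boolean[t], bf2 = new boolean[t];
        for (int delta = 1; delta <= n; delta++) {
            long[] pow = new long[c];           // sigma^(delta-1) mod primes[j]
            for (int j = 0; j < c; j++) {
                pow[j] = 1;
                for (int k = 1; k < delta; k++) pow[j] = pow[j] * SIGMA % primes[j];
            }
            long[] phi = init(x, delta, primes);        // setting phase
            for (int i = 0; i + delta <= n; i++) {
                boolean allSet = true;
                for (int j = 0; j < c; j++) {
                    if (!bf1[(int) phi[j]]) allSet = false;
                    bf1[(int) phi[j]] = true;
                }
                if (allSet) for (int j = 0; j < c; j++) bf2[(int) phi[j]] = true;
                if (i + delta < n) slide(x, i, delta, phi, pow, primes);
            }
            phi = init(x, delta, primes);               // reading phase
            for (int i = 0; i + delta <= n; i++) {
                for (int j = 0; j < c; j++)
                    if (!bf2[(int) phi[j]]) return new int[]{i + 1, delta};
                if (i + delta < n) slide(x, i, delta, phi, pow, primes);
            }
            Arrays.fill(bf1, false);
            Arrays.fill(bf2, false);
        }
        return new int[]{1, n};
    }

    public static void main(String[] args) {
        long[] primes = {101, 103, 107, 109}; // tiny primes for the demo; the thesis picks Θ(n) primes
        System.out.println(Arrays.toString(sus("babaccc", primes))); // [2, 2]
    }
}
```

On `babaccc` this reports the unique substring `ab` of length 2 starting at position 2, matching Table 3.1; with larger inputs, larger (and more numerous) primes are needed to keep the false-negative probability small.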

Window size 1. Hash values: h1(a) = 1, h2(a) = 6, h3(a) = 7; h1(b) = 2, h2(b) = 5, h3(b) = 10; h1(c) = 9, h2(c) = 8, h3(c) = 3.

• The character b at position 1 generates the same hash values as b at position 3; therefore, the second Bloom filter is populated.

• The character a at positions 2 and 4 generates the same hash values; therefore, the second Bloom filter is populated.

• The character c at positions 5, 6, and 7 generates the same hash values; therefore, the second Bloom filter is populated.

• In the final process, when checking with the second Bloom filter, every character has its entire hash family set, meaning that there is NO unique substring of length 1.

Window size 2. Hash values: h1(ba) = 1, h2(ba) = 7, h3(ba) = 2; h1(ab) = 8, h2(ab) = 3, h3(ab) = 9; h1(ac) = 2, h2(ac) = 10, h3(ac) = 5; h1(cc) = 2, h2(cc) = 6, h3(cc) = 12.

• ba at position 1 generates the same hash values as ba at position 3; therefore, the second Bloom filter is populated.

• ab at position 2 has no hash value that collides; hence, it is not hashed into the second Bloom filter.

• ac has its first hash value collide with h3(ba); however, the rest of its hash values do not collide. Hence, nothing is hashed into the second Bloom filter.

• cc at positions 5 and 6 generates the same hash values; therefore, the second Bloom filter is populated.

• In the final process, when checking with the second Bloom filter, ab has none of its hash values set, whereas ac has only its first hash value set. Since in either case the entire hash family is not set in the second Bloom filter, both ab and ac are reported as unique.

Table 3.1: Explanation of the process for finding the SUS in the string babaccc

Window size 1. Hash values: h1(a) = 1, h2(a) = 6, h3(a) = 7; h1(b) = 2, h2(b) = 5, h3(b) = 10; h1(c) = 9, h2(c) = 8, h3(c) = 3; h1(d) = 6, h2(d) = 2, h3(d) = 3.

• The character a at positions 1, 4, and 8 generates the same hash values; therefore, the second Bloom filter is populated.

• The character b at positions 2, 5, and 10 generates the same hash values; therefore, the second Bloom filter is populated.

• The character c at positions 3, 7, and 9 generates the same hash values; therefore, the second Bloom filter is populated.

• The character d occurs only once, at position 6, but each of its hash values collides with a hash value of another character: h1(d) = h2(a) = 6, h2(d) = h1(b) = 2, and h3(d) = h3(c) = 3. Therefore, d also gets hashed into the second Bloom filter.

• In the final process, when checking with the second Bloom filter, every character has its entire hash family set. The algorithm therefore concludes that there is no SUS of length 1 and moves to the next window size. Even though d was an SUS, it was falsely marked as not being one.

Table 3.2: Explanation of a false negative in the string abcabdcacb

Chapter 4

Computing Maximal Unique Matches

Throughout this section, we consider the scenario where we are trying to find the shortest maximal unique match (MUM) of two strings X1 and X2 of respective lengths n and m. Without loss of generality, we assume n ≤ m.

4.1 Using Generalized Suffix Trees

To find all the maximal unique matches between two strings X1 and X2, we use the generalized suffix tree (GST) of X1 and X2. (Recall that the GST is a compacted trie of all the suffixes of X1 and X2.) We assume that X1 and X2 end in two unique characters $ and #, respectively, that do not appear anywhere else in either string. Also, the suffix link of a node u points to a node v iff str(v) is str(u) with the first character removed. Here is a sketch of the algorithm [GHST17]:

• Given two texts X1 and X2, construct their generalized suffix tree.

• Pick a node u. Then, str(u) is an MUM of X1 and X2 if and only if the following hold:

  – Leaves of the node: The node contains exactly two leaves in its subtree, one from X1 and one from X2.

  – Suffix links: There is no suffix link coming into the node from another node that has exactly two leaves in its subtree, one from X1 and one from X2.

For example, consider the two strings X1 = abcaxy$ and X2 = abcayx#. Their MUMs are x, y, and abca. Correspondingly, we see in Figure 4.1 that each of the three nodes u, v, and w satisfies the criteria above. However, neither a, nor bca, nor ca is an MUM, because w′, v′, and u′ violate at least one of the criteria.

Figure 4.1: Generalized Suffix Tree of the strings X1 = abcaxy$ and X2 = abcayx#. Red dashed arrows are the suffix links, whereas blue nodes correspond to MUMs.

For a given set of strings {X1, X2, X3, . . . , Xk} with n = Σ_{i=1}^{k} |Xi|, where |Xi| is the length of Xi, the time required to construct the generalized suffix tree and find all the maximal unique matches is O(n). The space needed is Θ(n) words, or equivalently Θ(n log n) bits.
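As a reference point for the GST criteria above, an MUM can also be characterized directly: a string is an MUM iff it occurs exactly once in each of X1 and X2, and the two occurrences agree on neither their left nor their right neighboring character. The following brute-force Java sketch (class name assumed; roughly cubic time, fine only for toy inputs) implements this characterization.

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force MUM finder, used only to sanity-check small examples.
public class NaiveMum {
    static int count(String t, String p) {
        int c = 0;
        for (int i = 0; i + p.length() <= t.length(); i++)
            if (t.startsWith(p, i)) c++;
        return c;
    }

    static List<String> mums(String x1, String x2) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < x1.length(); i++)
            for (int j = i + 1; j <= x1.length(); j++) {
                String s = x1.substring(i, j);
                if (count(x1, s) != 1 || count(x2, s) != 1) continue;
                int k = x2.indexOf(s);
                // The match is maximal iff neither side extends in both strings.
                boolean leftExt = i > 0 && k > 0 && x1.charAt(i - 1) == x2.charAt(k - 1);
                boolean rightExt = j < x1.length() && k + s.length() < x2.length()
                        && x1.charAt(j) == x2.charAt(k + s.length());
                if (!leftExt && !rightExt) out.add(s);
            }
        return out;
    }

    public static void main(String[] args) {
        // terminal symbols $ and # as in the running example
        System.out.println(mums("abcaxy$", "abcayx#")); // [abca, x, y]
    }
}
```

On the running example X1 = abcaxy$ and X2 = abcayx#, this reports exactly the three MUMs abca, x, and y, matching the GST-based criteria.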

4.2 Our Approach

The shortest unique substring problem can be extended and modified to solve the maximal unique matches problem. The procedure is as follows:

Compute MUM

• Return the smallest δ ∈ {1, 2, . . . , n} for which there is an MUM of length δ

Is there a maximal unique match of length δ?

• Run the bloom filter setting phase

• If the Bloom filter reading phase returns ‘YES’, then return ‘YES’; else return ‘NO’

Setting Phase

• For a window of size δ, compute the set of Rabin-Karp (RK) hash values h1, h2, . . . , hc for the string X1. Initially, the first δ characters of the string form the first window.

• For each of the computed hash values {h1, h2, . . . , hc}, check the corresponding index in Bloom filter BF1. If all the indexes are already set to 1, then set BF2 to 1 at these indexes; else, set the corresponding indexes in BF1 to 1.

• Consider the string X2. Do the same as above, but with another pair of Bloom filters BF3 and BF4.

• Slide the window by one character, i.e., remove the first character of the window and include the next character of the string. If there is no next character, end the process.

Reading Phase

• For a window of size δ, compute the set of Rabin-Karp (RK) hash values h1, h2, . . . , hc for the string X1. Initially, the first δ characters of the string form the first window.

• For each of the computed hash values {h1, h2, . . . , hc}, check the corresponding index in Bloom filter BF2. We have the following scenarios:

  – If BF2 is set for all the hash values, then the string has appeared more than once in X1 and is not an MUM.

  – If at least one index in BF2 is not set, then the corresponding δ-length string has appeared only once in X1. Check the indexes in BF3 and BF4.

  – If any one index in BF3 is not set, then the string has not appeared in X2 and is not an MUM.

  – If every BF3 index is set and every BF4 index is also set, then the string has appeared more than once in X2 and is not an MUM.

  – Otherwise, the string has appeared exactly once in X2 and is a candidate for an MUM. To verify whether it is an MUM, obtain the set of hash values for the left and right extensions of the string, and use a similar process to check whether the left or right extension appears in X2. If X2 contains neither the left nor the right extension, then it is an MUM; return ‘YES’.

• Slide the window by one character, i.e., remove the first character of the window and include the next character of the string. If there is no next character, return ‘NO’.

Refer to Algorithms 2 and 3 for detailed pseudo-code.

Algorithm 2 Smallest Maximal Unique Match (MUM) Helper Functions

function INITIALIZE(X1[1, n], X2[1, m], σ, primes[1, c], δ)
    for j ← 1 to c do
        h[j] ← σ^(δ−1) mod primes[j]
        Φ1[j] ← (X1[1] · σ^(δ−1) + X1[2] · σ^(δ−2) + · · · + X1[δ]) mod primes[j]
        Φ2[j] ← (X2[1] · σ^(δ−1) + X2[2] · σ^(δ−2) + · · · + X2[δ]) mod primes[j]
    return ⟨h, Φ1, Φ2⟩

function SLIDE(X[1, n], σ, primes[1, c], i, Φ, h, δ)
    for j ← 1 to c do
        Φ[j] ← (σ · (Φ[j] − X[i] · h[j]) + X[i + δ]) mod primes[j]

function LEFTEXT(X[1, n], σ, primes[1, c], i, Φ, h, δ)
    for j ← 1 to c do
        Φℓ[j] ← (Φ[j] + σ · X[i − 1] · h[j]) mod primes[j]
    return Φℓ

function RIGHTEXT(X[1, n], σ, primes[1, c], i, Φ, h, δ)
    for j ← 1 to c do
        Φr[j] ← (Φ[j] · σ + X[i + δ]) mod primes[j]
    return Φr

function SET(X[1, x], δ, σ, primes[1, c], BF, BFaux, Φ, h)
    for i ← 1 to x − δ + 1 do
        allSet ← 1
        for j ← 1 to c do
            if BF[Φ[j]] = 0 then
                allSet ← 0
            BF[Φ[j]] ← 1
        if allSet = 1 then
            for j ← 1 to c do BFaux[Φ[j]] ← 1
        if i ≤ x − δ then
            SLIDE(X, σ, primes, i, Φ, h, δ)

Algorithm 3 Smallest Maximal Unique Match (MUM)

function MUM(X1[1, n], X2[1, m], σ, primes[1, c])
    δ ← 1
    t ← maximum value in primes
    Initialize four Bloom filters BF1 through BF4, each of size t
    while δ ≤ n do
        ⟨h, Φ1, Φ2⟩ ← INITIALIZE(X1, X2, σ, primes, δ)
        SET(X1, δ, σ, primes, BF1, BF2, Φ1, h)            ▷ Begin Setting Phase
        SET(X2, δ, σ, primes, BF3, BF4, Φ2, h)
        ⟨h, Φ1, Φ2⟩ ← INITIALIZE(X1, X2, σ, primes, δ)    ▷ Begin Reading Phase
        for i ← 1 to n − δ + 1 do
            flagBF2 ← 1, flagBF3 ← 1, flagBF4 ← 1
            for j ← 1 to c do
                if BF2[Φ1[j]] = 0 then
                    flagBF2 ← 0
                if BF3[Φ1[j]] = 0 then
                    flagBF3 ← 0
                if BF4[Φ1[j]] = 0 then
                    flagBF4 ← 0
            if flagBF2 = 0 and flagBF3 = 1 and flagBF4 = 0 then
                if i > 1 then
                    Φℓ ← LEFTEXT(X1, σ, primes, i, Φ1, h, δ)
                if i < n then
                    Φr ← RIGHTEXT(X1, σ, primes, i, Φ1, h, δ)
                for j ← 1 to c do
                    if (i = 1 or BF3[Φℓ[j]] = 0) and (i = n or BF3[Φr[j]] = 0) then
                        return ⟨i, δ⟩
            if i ≤ n − δ then
                SLIDE(X1, σ, primes, i, Φ1, h, δ)
        Reset all bits of BF1, BF2, BF3, and BF4
        δ ← δ + 1
    return n

4.3 Illustration

The procedure for finding an MUM is illustrated in Tables 4.1, 4.3, 4.5, and 4.6. This technique, however, can result in false negatives (i.e., it may skip over a unique string without recognizing it as one); this case is the same as for the SUS, and we skip it for the sake of simplicity.

Window size 1. Hash values for X1: h1(a) = 6, h2(a) = 2; h1(b) = 1, h2(b) = 4; h1(c) = 5, h2(c) = 7; h1(d) = 3, h2(d) = 9; h1(e) = 8, h2(e) = 10. The hash values for X2 are identical.

• The character a appears in X1 at positions 2, 6, and 10 and in X2 at positions 2 and 9, all hashing to the same place. Before even proceeding to X2, it can be seen that a is not unique in X1.

• The character b appears in X1 at positions 3 and 8 and in X2 at positions 3 and 7, all hashing to the same place. Before even proceeding to X2, it can be seen that b is not unique in X1.

• The character c appears in X1 at positions 4 and 7 and in X2 at positions 1, 4, 5, and 10, all hashing to the same place. Before even proceeding to X2, it can be seen that c is not unique in X1.

• The character d appears in X1 at position 5 and in X2 at position 8, hashing to the same place in the respective Bloom filters. Checking the second Bloom filter shows that d is unique in X1, and checking the fourth Bloom filter shows that d is unique in X2. Now that a unique substring common to both strings is found, the left and right extensions are checked. The left extension in X1 (cd) differs from that in X2 (bd), so the left-extension check passes. However, the right extension in X1 (da) equals the right extension in X2 (da), so the right-extension check fails, making d not an MUM.

• The character e appears in X1 at positions 1 and 9 and in X2 at position 6, hashing to the same place. Even though e is unique in X2 and both extension checks would hold, e is not an MUM since it is not unique in X1.

• Since there is no MUM of length 1, the process repeats with window size 2.

Table 4.1: Explanation of MUM, showing the cases where an MUM is found and where it fails. MUM calculation shown for X1 = eabcdacbea and X2 = cabccebdac with window size 1.

Window size 2. Hash values for X1: h1(ab) = 6, h2(ab) = 2; h1(bc) = 1, h2(bc) = 4; h1(be) = 11, h2(be) = 13; h1(cb) = 5, h2(cb) = 7; h1(cd) = 14, h2(cd) = 12; h1(da) = 3, h2(da) = 9; h1(ea) = 8, h2(ea) = 10.

• The substring ab appears in X1 at position 2 and in X2 at position 2, hashing to the same place. Checking the second Bloom filter reveals that ab is unique in X1, and checking the fourth Bloom filter reveals it is unique in X2, so the left and right extensions are checked. The left extensions (eab in X1, cab in X2) differ, so the left-extension check passes. However, the right extensions (abc in X1, abc in X2) are equal, so the check fails. Therefore, ab is not an MUM. The scenario is the same for da.

• The substring bc appears in X1 at position 3 and in X2 at position 3, hashing to the same place. Checking the second Bloom filter reveals that bc is unique in X1, and checking the fourth Bloom filter reveals it is unique in X2, so the extensions are checked. The right extensions (bcd in X1, bcc in X2) differ, so the right-extension check passes. However, the left extensions (abc in X1, abc in X2) are equal, so the check fails. Therefore, bc is not an MUM.

Table 4.3: Explanation of MUM, showing the cases where an MUM is found and where it fails. MUM calculation shown for X1 = eabcdacbea and X2 = cabccebdac with window size 2.

Window size 2. Hash values for X2: h1(ab) = 6, h2(ab) = 2; h1(ac) = 6, h2(ac) = 2; h1(bc) = 1, h2(bc) = 4; h1(bd) = 11, h2(bd) = 13; h1(ca) = 5, h2(ca) = 7; h1(cc) = 14, h2(cc) = 12; h1(da) = 3, h2(da) = 9; h1(eb) = 8, h2(eb) = 10.

• The substring be in X1 at position 8 is found to be unique when checking the second Bloom filter. However, checking against X2 shows that be is not present, so be is not an MUM. The scenario is the same for cb and cd.

• The substring ea appears in X1 at positions 1 and 9, both hashing to the same place. Checking the second Bloom filter therefore reveals that it is not unique, making it not an MUM.

Table 4.5: Continued explanation of MUM, showing the cases where an MUM is found and where it fails. MUM calculation shown for X1 = eabcdacbea and X2 = cabccebdac with window size 2.

Window size 1. Hash values: h1(a) = 1, h2(a) = 6.

• Since the window contains only a, all positions hash to the same place.

• In the final process, when checking with the second Bloom filter, the entire hash family is set. This means that there is no unique substring of length 1, and the algorithm moves to the next window size.

Window size 2. Hash values: h1(aa) = 4, h2(aa) = 7.

• Since every window contains the pair aa, all positions hash to the same place.

• Again, the entire hash family is set in the second Bloom filter, so there is no unique substring of length 2, and the algorithm moves to the next window size.

• This pattern continues throughout the string. Since nothing in the first string is unique, there is no MUM.

Table 4.6: Explanation of MUM failure when the string contains nothing unique, as shown for the string aaaa

Chapter 5

Experiments for Shortest Unique Substring

The genome data required for testing contained real data gathered from the National Center for Biotechnology Information¹, along with synthetic test data constructed through randomization. The following experiments were conducted on a daily-use laptop with an Intel(R) Core(TM) i7-8550 CPU @ 1.80GHz and 16GB RAM. The code was written in Java.

5.1 Setup

For each of the experiments, we chose 4 random prime numbers in the range [n/8, n/4], where n is the length of the corresponding string. The primes are found by the sieve of Eratosthenes in O(n log n) deterministic time using n bits of space. On average, the length of an SUS is expected to be O(log_σ n) when the text is independently and identically distributed over the input alphabet [DSR92]; this is corroborated by our experimental results. Hence, our proposed algorithm runs in O(n log n) expected time. Moreover, the use of Bloom filters and the choice of primes ensure that we do not use more than n bits of space in addition to the input string.
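The prime selection described above can be sketched as follows in Java. The class name and the fixed shuffle seed are assumptions for the example; the point is simply sieving up to n/4 and keeping the primes that fall in [n/8, n/4].

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sieve of Eratosthenes restricted to the range [n/8, n/4], then pick c primes.
public class PrimePicker {
    static List<Integer> primesInRange(int n) {
        int hi = n / 4, lo = n / 8;
        boolean[] composite = new boolean[hi + 1];
        for (int p = 2; (long) p * p <= hi; p++)
            if (!composite[p])
                for (int q = p * p; q <= hi; q += p) composite[q] = true;
        List<Integer> out = new ArrayList<>();
        for (int p = Math.max(2, lo); p <= hi; p++)
            if (!composite[p]) out.add(p);
        return out;
    }

    static List<Integer> pick(int n, int c) {
        List<Integer> ps = primesInRange(n);
        Collections.shuffle(ps, new java.util.Random(42)); // fixed seed for the demo
        return ps.subList(0, Math.min(c, ps.size()));
    }

    public static void main(String[] args) {
        System.out.println(pick(8000, 4)); // 4 primes from [1000, 2000]
    }
}
```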

5.2 Synthetic Genome Data

Our first set of experiments is on synthetic genome data of varying lengths. Comparing the results in Table 5.1 and Figure 5.1, the computation of the SUS via Bloom filters and RK fingerprints was around 2.5 times faster than the Manber-Myers based algorithm. Even though our algorithm performed faster, it had a percentage error of 14.2%.

¹https://www.ncbi.nlm.nih.gov/genome

5.3 Real Genome Data

When this test was performed on real genome data, our algorithm again outperformed the Manber-Myers based algorithm; additionally, no errors were found. Details of the experiments are in Table 5.2 and Figure 5.2. For C. elegans and T. nigroviridis, we were unable to run the Manber-Myers based method due to Java heap-size limitations.

5.4 Synthetic Data of Varying Alphabet Size

The next few tests measure the performance of the algorithms as a function of the alphabet size. As seen in Table 5.3 and Figure 5.3, when the alphabet size was 2, the performance of the two algorithms differed by an estimated factor of 1.5 with 0% error. However, as the alphabet size grows from 4 to 1024, the difference in performance grows, with minor percentages of error. The percentage error for these experiments was calculated by comparing the difference in the length of the SUS found by the two techniques.

5.5 Peak Memory Usage

Comparing the peak memory usage across all the mentioned experiments, they all showed the same pattern. The Manber-Myers method suffered a huge blow-up in peak memory usage relative to the input data (and to our algorithm). Our algorithm, on the other hand, showed little to no difference from the space actually required, making it much more space-efficient.

Figure 5.1: Figures associated with Table 5.1

Sequence    File Size    SUS Length        Time (in millisec)     Peak Memory (in MB)
Length      (in MB)      MM     BF & RK    MM        BF & RK      MM           BF & RK
1000        0.00191      4      4          5         5            0.03626      0.00216
2000        0.00381      4      5          4         3            0.07250      0.00423
4000        0.00763      5      5          8         6            0.14498      0.00836
8000        0.01526      5      6          10        16           0.28994      0.01662
16000       0.03052      6      6          18        17           0.57986      0.03295
32000       0.06104      6      6          36        28           1.15969      0.06590
64000       0.12207      7      7          55        44           2.31936      0.13193
128000      0.24414      7      7          162       94           4.63869      0.26131
256000      0.48828      8      8          270       235          9.27737      0.52112
512000      0.97656      8      8          735       368          18.55471     1.04835
1024000     1.95313      9      9          1942      881          37.10940     2.09732
2048000     3.90625      9      9          4549      1750         74.21877     4.17577
4096000     7.81250      9      9          10623     3441         148.43752    8.35973
8192000     15.62500     10     10         20147     7758         296.87502    16.72838

Table 5.1: Table displaying the length of the SUS found, the time required, and the memory used to find the SUS using Manber-Myers (MM) and the Bloom filter technique (BF & RK). The string is generated randomly over 4 characters. Percentage of error = 14.2%

Figure 5.2: Figures associated with Table 5.2

Genome            Sequence     File Size    SUS Length       Time (in millisec)     Peak Memory (in MB)
                  Length       (in MB)      MM    BF & RK    MM         BF & RK     MM            BF & RK
H.pylori          1667825      3.18112      6     6          9408       945         60.44137      3.40915
L.monocytogenes   2944528      5.61624      8     8          24503      2294        106.70861     5.98914
E.Coli            5594605      10.67086     8     8          48721      5135        202.74640     11.34465
S.cerevisiae      12157105     23.18784     9     9          145289     11397       440.56894     24.75500
N.crassa          41061603     78.31879     10    10         498736     41396       1488.05708    83.73215
C.elegans         100286401    191.28113    -     10         -          98277       -             202.07431
T.nigroviridis    312394419    595.84507    -     11         -          352467      -             634.12550

Table 5.2: Table displaying the length of the SUS found, the time required, and the memory used to find the SUS using Manber-Myers (MM) and the Bloom filter technique (BF & RK). The genomes are real. Percentage of error = 0%

Figure 5.3: Figures associated with Table 5.3

Sequence    File Size    SUS Length        Time (in millisec)     Peak Memory (in MB)
Length      (in MB)      MM     BF & RK    MM        BF & RK      MM           BF & RK
1000        0.00191      7      7          6         2            0.03625      0.00214
2000        0.00381      8      8          5         3            0.07249      0.00424
4000        0.00763      9      9          7         6            0.14497      0.00833
8000        0.01526      10     10         11        17           0.28993      0.01662
16000       0.03052      11     11         16        20           0.57985      0.03283
32000       0.06104      12     12         718       38           1.15968      0.06599
64000       0.12207      13     13         47        83           2.31935      0.13148
128000      0.24414      14     14         109       199          4.63868      0.26280
256000      0.48828      15     15         365       426          9.27736      0.52396
512000      0.97656      16     16         1111      934          18.55470     1.04140
1024000     1.95313      17     17         2208      1783         37.10939     2.09617
2048000     3.90625      18     18         5344      3763         74.21876     4.18944
4096000     7.81250      19     19         8977      7816         148.43751    8.34267
8192000     15.62500     20     20         21906     16440        296.87501    16.70374

Table 5.3: Table displaying the length of the SUS found, the time required, and the memory used to find the SUS using Manber-Myers (MM) and the Bloom filter technique (BF & RK). The string is generated randomly for an alphabet of size 2. Percentage of error = 0%

Figure 5.4: Figures associated with Table 5.4

Sequence    File Size    SUS Length        Time (in millisec)     Peak Memory (in MB)
Length      (in MB)      MM     BF & RK    MM        BF & RK      MM           BF & RK
1000        0.00191      4      4          0         0            0.03626      0.00216
2000        0.00381      4      4          1         1            0.07250      0.00424
4000        0.00763      5      5          1         1            0.14498      0.00831
8000        0.01526      5      5          3         3            0.28994      0.01656
16000       0.03052      6      6          7         9            0.57986      0.03282
32000       0.06104      6      6          14        21           1.15969      0.06609
64000       0.12207      7      7          33        50           2.31936      0.13168
128000      0.24414      7      7          97        92           4.63869      0.26318
256000      0.48828      8      8          278       217          9.27737      0.52486
512000      0.97656      8      8          734       432          18.55471     1.04748
1024000     1.95313      9      9          1802      913          37.10940     2.09533
2048000     3.90625      9      9          4509      1695         74.21877     4.18998
4096000     7.81250      9      9          9168      3466         148.43752    8.34682
8192000     15.62500     10     10         18880     7604         296.87502    16.69284

Table 5.4: Table displaying the length of the SUS found, the time required, and the memory used to find the SUS using Manber-Myers (MM) and the Bloom filter technique (BF & RK). The string is generated randomly for an alphabet of size 4. Percentage of error = 0%

Figure 5.5: Figures associated with Table 5.5

Sequence    File Size    SUS Length        Time (in millisec)     Peak Memory (in MB)
Length      (in MB)      MM     BF & RK    MM        BF & RK      MM           BF & RK
1000        0.00191      3      3          0         0            0.03629      0.00215
2000        0.00381      3      3          0         1            0.07253      0.00424
4000        0.00763      4      4          1         2            0.14500      0.00837
8000        0.01526      4      4          2         3            0.28996      0.01665
16000       0.03052      4      4          6         6            0.57988      0.03297
32000       0.06104      4      4          12        10           1.15971      0.06606
64000       0.12207      5      5          30        33           2.31938      0.13119
128000      0.24414      5      5          69        56           4.63872      0.26179
256000      0.48828      5      5          196       118          9.27739      0.52265
512000      0.97656      5      5          595       246          18.55473     1.04391
1024000     1.95313      6      6          1551      578          37.10942     2.08998
2048000     3.90625      6      6          3583      1122         74.21880     4.18721
4096000     7.81250      7      7          7102      2630         148.43755    8.36772
8192000     15.62500     7      7          16153     5493         296.87505    16.67570

Table 5.5: Table displaying the length of the SUS found, the time required, and the memory used to find the SUS using Manber-Myers (MM) and the Bloom filter technique (BF & RK). The string is generated randomly for an alphabet of size 8. Percentage of error = 0%

Figure 5.6: Figures associated with Table 5.6

Sequence    File Size    SUS Length        Time (in millisec)     Peak Memory (in MB)
Length      (in MB)      MM     BF & RK    MM        BF & RK      MM           BF & RK
1000        0.00191      2      2          0         0            0.03633      0.00216
2000        0.00381      2      2          1         1            0.07257      0.00421
4000        0.00763      3      3          1         1            0.14505      0.00833
8000        0.01526      3      3          3         2            0.29001      0.01650
16000       0.03052      3      3          4         3            0.57993      0.03310
32000       0.06104      3      3          12        9            1.15976      0.06590
64000       0.12207      3      3          28        19           2.31943      0.13115
128000      0.24414      4      4          68        47           4.63876      0.26056
256000      0.48828      4      4          195       99           9.27744      0.52428
512000      0.97656      4      4          613       190          18.55478     1.04696
1024000     1.95313      5      5          1693      469          37.10947     2.09132
2048000     3.90625      5      5          3894      913          74.21884     4.18527
4096000     7.81250      5      5          8667      1847         148.43759    8.34720
8192000     15.62500     5      5          18047     3620         296.87509    16.67537

Table 5.6: Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 16. Percentage of error = 0%

Figure 5.7: Figures associated with Table 5.7

Sequence   File Size   SUS Length      Time (in millisec)    Peak Memory (in MB)
Length     (in MB)     MM    BF & RK   MM        BF & RK     MM           BF & RK
1000       0.00191     2     2         0         0           0.03642      0.00215
2000       0.00381     2     2         1         0           0.07266      0.00424
4000       0.00763     2     2         2         0           0.14514      0.00833
8000       0.01526     2     2         2         1           0.29010      0.01663
16000      0.03052     3     3         5         4           0.58002      0.03298
32000      0.06104     3     3         10        10          1.15985      0.06585
64000      0.12207     3     3         27        17          2.31952      0.13183
128000     0.24414     3     3         47        34          4.63885      0.26315
256000     0.48828     3     3         161       72          9.27753      0.52189
512000     0.97656     4     4         462       174         18.55487     1.04227
1024000    1.95313     4     4         1279      369         37.10956     2.09272
2048000    3.90625     4     4         4443      740         74.21893     4.14534
4096000    7.81250     4     4         9673      1517        148.43768    8.37418
8192000    15.62500    4     4         20053     2973        296.87518    16.70149

Table 5.7: Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 32. Percentage of error = 0%

Figure 5.8: Figures associated with Table 5.8

Sequence   File Size   SUS Length      Time (in millisec)    Peak Memory (in MB)
Length     (in MB)     MM    BF & RK   MM        BF & RK     MM           BF & RK
1000       0.00191     2     2         1         0           0.03661      0.00215
2000       0.00381     2     2         1         0           0.07285      0.00424
4000       0.00763     2     2         0         0           0.14532      0.00831
8000       0.01526     2     2         2         1           0.29028      0.01660
16000      0.03052     2     2         4         3           0.58020      0.03278
32000      0.06104     2     2         11        5           1.16003      0.06574
64000      0.12207     3     3         19        17          2.31970      0.13178
128000     0.24414     3     3         56        34          4.63904      0.26269
256000     0.48828     3     3         163       68          9.27771      0.52386
512000     0.97656     3     3         471       141         18.55505     1.04871
1024000    1.95313     3     3         1333      284         37.10974     2.09658
2048000    3.90625     3     3         3334      527         74.21912     4.17215
4096000    7.81250     4     4         7123      1500        148.43787    8.32667
8192000    15.62500    4     4         26498     3365        296.87537    16.70483

Table 5.8: Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 64. Percentage of error = 0%

Figure 5.9: Figures associated with Table 5.9

Sequence   File Size   SUS Length      Time (in millisec)    Peak Memory (in MB)
Length     (in MB)     MM    BF & RK   MM        BF & RK     MM           BF & RK
1000       0.00191     2     2         1         0           0.03697      0.00215
2000       0.00381     2     2         0         1           0.07321      0.00422
4000       0.00763     2     2         1         1           0.14569      0.00836
8000       0.01526     2     2         3         1           0.29065      0.01659
16000      0.03052     2     2         5         3           0.58057      0.03305
32000      0.06104     2     2         15        6           1.16040      0.06598
64000      0.12207     2     2         29        11          2.32007      0.13158
128000     0.24414     2     2         50        22          4.63940      0.26172
256000     0.48828     3     3         166       73          9.27808      0.52451
512000     0.97656     3     3         521       143         18.55542     1.04820
1024000    1.95313     3     3         1541      302         37.11011     2.08956
2048000    3.90625     3     3         3026      558         74.21948     4.18748
4096000    7.81250     3     3         7423      1065        148.43823    8.29085
8192000    15.62500    3     3         15340     2139        296.87573    16.61886

Table 5.9: Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 128. Percentage of error = 0%

Figure 5.10: Figures associated with Table 5.10

Sequence   File Size   SUS Length      Time (in millisec)    Peak Memory (in MB)
Length     (in MB)     MM    BF & RK   MM        BF & RK     MM           BF & RK
1000       0.00191     1     1         2         0           0.03770      0.00214
2000       0.00381     1     1         2         1           0.07394      0.00422
4000       0.00763     2     2         3         1           0.14642      0.00835
8000       0.01526     2     2         2         1           0.29138      0.01664
16000      0.03052     2     2         3         4           0.58130      0.03285
32000      0.06104     2     2         12        5           1.16113      0.06585
64000      0.12207     2     2         28        10          2.32080      0.13169
128000     0.24414     2     2         51        20          4.64014      0.26166
256000     0.48828     2     2         158       45          9.27881      0.52523
512000     0.97656     2     2         491       85          18.55615     1.04861
1024000    1.95313     3     3         1527      279         37.11084     2.08475
2048000    3.90625     3     3         3459      546         74.22021     4.16340
4096000    7.81250     3     3         7463      1123        148.43896    8.31774
8192000    15.62500    3     3         13967     2243        296.87646    16.67623

Table 5.10: Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 256. Percentage of error = 0%

Figure 5.11: Figures associated with Table 5.11

Sequence   File Size   SUS Length      Time (in millisec)    Peak Memory (in MB)
Length     (in MB)     MM    BF & RK   MM        BF & RK     MM           BF & RK
1000       0.00191     1     4         0         0           0.03917      0.00215
2000       0.00381     1     1         0         0           0.07541      0.00420
4000       0.00763     1     1         1         0           0.14789      0.00835
8000       0.01526     2     2         2         1           0.29285      0.01665
16000      0.03052     2     2         4         3           0.58276      0.03291
32000      0.06104     2     2         10        6           1.16260      0.06581
64000      0.12207     2     2         17        12          2.32227      0.13171
128000     0.24414     2     2         34        22          4.64160      0.26276
256000     0.48828     2     2         185       46          9.28027      0.52377
512000     0.97656     2     2         530       90          18.55762     1.04430
1024000    1.95313     2     2         1393      180         37.11230     2.09334
2048000    3.90625     2     2         3513      327         74.22168     4.18058
4096000    7.81250     2     2         7662      753         148.44043    8.34642
8192000    15.62500    3     3         17728     2322        296.87793    16.62958

Table 5.11: Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 512. Percentage of error = 7.1%

Figure 5.12: Figures associated with Table 5.12

Sequence   File Size   SUS Length      Time (in millisec)    Peak Memory (in MB)
Length     (in MB)     MM    BF & RK   MM        BF & RK     MM           BF & RK
1000       0.00191     1     1         0         0           0.04210      0.00215
2000       0.00381     1     1         1         0           0.07834      0.00420
4000       0.00763     1     1         1         1           0.15082      0.00838
8000       0.01526     1     1         2         0           0.29578      0.01664
16000      0.03052     2     2         4         3           0.58569      0.03311
32000      0.06104     2     2         7         6           1.16553      0.06540
64000      0.12207     2     2         16        10          2.32520      0.13094
128000     0.24414     2     2         39        24          4.64453      0.26215
256000     0.48828     2     2         103       41          9.28320      0.52495
512000     0.97656     2     2         338       98          18.56055     1.04609
1024000    1.95313     2     2         1320      173         37.11523     2.08223
2048000    3.90625     2     2         3395      344         74.22461     4.18626
4096000    7.81250     2     2         8064      700         148.44336    8.27443
8192000    15.62500    2     2         14432     1332        296.88086    16.65580

Table 5.12: Table displaying the Length of SUS found, the time required, and the memory usage to find the SUS using Manber Myers (MM) and Bloom Filter technique (BF). The string is generated randomly for an alphabet of size 1024. Percentage of error = 0%
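The BF & RK columns in the tables above refer to the combination of Bloom Filters and Rabin-Karp fingerprints. As a rough illustration of that idea (and not the implementation that was measured), the following sketch finds the length of a shortest unique substring by fingerprinting every window of a candidate length; a Python dictionary stands in for the Bloom filter, so the sketch is exact but lacks the small memory footprint, and the function name and parameters are illustrative only.

```python
def sus_length(text, base=256, mod=(1 << 61) - 1):
    """Length of a shortest unique substring, via Rabin-Karp fingerprints.

    A dict of fingerprint counts stands in for the Bloom filter; distinct
    substrings may collide on a fingerprint, giving the same kind of bounded
    error probability discussed in the thesis.
    """
    n = len(text)
    for length in range(1, n + 1):           # try the shortest lengths first
        power = pow(base, length, mod)       # base^length, used to roll out
        counts = {}                          # fingerprint -> occurrence count
        h = 0
        for i, ch in enumerate(text):
            h = (h * base + ord(ch)) % mod   # roll the new character in
            if i >= length:                  # roll the oldest character out
                h = (h - ord(text[i - length]) * power) % mod
            if i >= length - 1:              # a full window is available
                counts[h] = counts.get(h, 0) + 1
        if 1 in counts.values():             # some window occurs exactly once
            return length
    return n
```

For example, `sus_length("abab")` returns 2, since "ba" occurs only once while both single characters repeat. Each pass over the text is a single left-to-right scan, which is what makes the fingerprinting approach amenable to the low-memory setting.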

LIST OF REFERENCES

[ABF+15] Boran Adas, Ersin Bayraktar, Simone Faro, Ibraheem Elsayed Moustafa, and M. Oğuzhan Külekçi. Nucleotide sequence alignment and compression via shortest unique substring. In Bioinformatics and Biomedical Engineering - Third International Conference, IWBBIO 2015, Granada, Spain, April 15-17, 2015. Proceedings, Part II, pages 363–374, 2015.

[APPA15] José M Abuín, Juan C Pichel, Tomás F Pena, and Jorge Amigo. BigBWA: approaching the Burrows–Wheeler aligner to big data technologies. Bioinformatics, 31(24):4003–4005, 2015.

[APPA16] José M Abuín, Juan C Pichel, Tomás F Pena, and Jorge Amigo. SparkBWA: speeding up the alignment of high-throughput DNA sequencing data. PLoS ONE, 11(5):e0155461, 2016.

[BC14] Djamal Belazzougui and Fabio Cunial. Indexed matching statistics and shortest unique substrings. In String Processing and Information Retrieval - 21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014. Proceedings, pages 179–190, 2014.

[Bel14] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 148–193, 2014.

[BGVV14] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory of Computing Systems, 55(1):41–60, 2014.

[Blo70] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, 1970.

[BM] https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#76d71d3a60ba.

[BM77] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977.

[BO98] Ulfar Bergthorsson and H Ochman. Distribution of chromosome length variation in natural isolates of Escherichia coli. Molecular biology and evolution, 15:6–16, 1998.

[BOSS12] Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Algorithms in Bioinformatics - 12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, pages 225–235, 2012.

[BW94] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994.

[CEPR10] Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. Pattern matching with don't cares and few errors. Journal of Computer and System Sciences, 76(2):115–124, 2010.

[CGL04] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In STOC, pages 91–100. ACM, 2004.

[DEMS+12] Thomas Derrien, Jordi Estellé, Santiago Marco Sola, David G. Knowles, Emanuele Raineri, Roderic Guigó, and Paolo Ribeca. Fast computation and applications of genome mappability. PLoS ONE, 7:1–16, 2012.

[DKF+99] Arthur L Delcher, Simon Kasif, Robert D Fleischmann, Jeremy Peterson, Owen White, and Steven L Salzberg. Alignment of whole genomes. Nucleic acids research, 27(11):2369–2376, 1999.

[DPCS02] Arthur L Delcher, Adam Phillippy, Jane Carlton, and Steven L Salzberg. Fast algorithms for large-scale genome alignment and comparison. Nucleic acids research, 30(11):2478–2483, 2002.

[DSR92] Luc Devroye, Wojciech Szpankowski, and Bonita Rais. A note on the height of suffix trees. SIAM J. Comput., 21(1):48–53, 1992.

[FKS84] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.

[FLMM09] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1), 2009.

[FM00] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390–398, 2000.

[FM05] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, 2005.

[GHST17] Arnab Ganguly, Wing-Kai Hon, Rahul Shah, and Sharma V. Thankachan. Space-time trade-offs for finding shortest unique substrings and maximal unique matches. Theor. Comput. Sci., 700:75–88, 2017.

[Gus97] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.

[GV00] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, May 21-23, 2000, Portland, OR, USA, pages 397–406, 2000.

[GV05] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005.

[HPMW05] Bernhard Haubold, Nora Pierstorff, Friedrich Möller, and Thomas Wiehe. Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics, 6(1):1–11, 2005.

[HPT14] Xiaocheng Hu, Jian Pei, and Yufei Tao. Shortest unique queries on strings. In String Processing and Information Retrieval - 21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014. Proceedings, pages 161–172, 2014.

[HS02] Wing-Kai Hon and Kunihiko Sadakane. Space-economical algorithms for finding maximal unique matches. In Combinatorial Pattern Matching, 13th Annual Symposium, CPM 2002, Fukuoka, Japan, July 3-5, 2002, Proceedings, pages 144–152, 2002.

[HSS03] Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Breaking a time-and-space barrier in constructing full-text indices. In 44th Symposium on Foundations of Computer Science (FOCS 2003), 11-14 October 2003, Cambridge, MA, USA, Proceedings, pages 251–260, 2003.

[HTSV13] Wing-Kai Hon, Sharma V Thankachan, Rahul Shah, and Jeffrey Scott Vitter. Faster compressed top-k document retrieval. In 2013 Data Compression Conference, pages 341–350. IEEE, 2013.

[HTX15] Wing-Kai Hon, Sharma V. Thankachan, and Bojian Xu. An in-place framework for exact and approximate shortest unique substring queries. In Algorithms and Computation - 26th International Symposium, ISAAC 2015, Nagoya, Japan, December 9-11, 2015, Proceedings, pages 755–767, 2015.

[IKX14] Atalay Mert Ileri, M. Oğuzhan Külekçi, and Bojian Xu. Shortest unique substring query revisited. In Combinatorial Pattern Matching - 25th Annual Symposium, CPM 2014, Moscow, Russia, June 16-18, 2014. Proceedings, pages 172–181, 2014.

[KJP77] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.

[KLL+12] Petr Klus, Simon Lam, Dag Lyberg, Ming Sin Cheung, Graham Pullan, Ian McFarlane, Giles SH Yeo, and Brian YH Lam. BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC research notes, 5(1):27, 2012.

[KPD+04] Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004.

[KR87] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[Kur99] Stefan Kurtz. Reducing the space requirement of suffix trees. Softw., Pract. Exper., 29(13):1149–1171, 1999.

[LD09] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.

[LD10] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, 2010.

[LSM12] Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows–Wheeler transform. Bioinformatics, 28(14):1830–1837, 2012.

[LTPS09] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3):R25, 2009.

[LYL+09] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.

[MBN+17] Martin D. Muggli, Alexander Bowe, Noelle R. Noyes, Paul S. Morley, Keith E. Belk, Robert Raymond, Travis Gagie, Simon J. Puglisi, and Christina Boucher. Succinct colored de Bruijn graphs. Bioinformatics, 33(20):3181–3187, 2017.

[McC76] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272, 1976.

[MIBT16] Takuya Mieno, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Shortest unique substring queries on run-length encoded strings. In 41st International Symposium on Mathematical Foundations of Computer Science, MFCS 2016, August 22-26, 2016 - Kraków, Poland, pages 69:1–69:11, 2016.

[MM90] Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1990, San Francisco, California, USA, pages 319–327, 1990.

[MNN17] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 408–424, 2017.

[Mut02] S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 657–666, 2002.

[Nav13] Gonzalo Navarro. Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv., 46(4):52:1–52:47, 2013.

[NM07] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1), 2007.

[NN12] Gonzalo Navarro and Yakov Nekrich. Top-k document retrieval in optimal time and linear space. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1066–1077. SIAM, 2012.

[OV11] Alessio Orlandi and Rossano Venturini. Space-efficient substring occurrence estimation. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 95–106, 2011.

[PP09] Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pages 315–323, 2009.

[PWY13] Jian Pei, Wush Chi-Hsuan Wu, and Mi-Yen Yeh. On shortest unique substring queries. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 937–948, 2013.

[Sad02] Kunihiko Sadakane. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 225–232, 2002.

[Sad07] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589–607, 2007.

[SD10] Jared T Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367–i373, 2010.

[SOG12] Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput., 213:13–22, 2012.

[TIBT14] Kazuya Tsuruta, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Shortest unique substrings queries in optimal time. In SOFSEM 2014: Theory and Practice of Computer Science - 40th International Conference on Current Trends in Theory and Practice of Computer Science, Nový Smokovec, Slovakia, January 26-29, 2014, Proceedings, pages 503–513, 2014.

[TWLY09] Alan Tam, Edward Wu, Tak-Wah Lam, and Siu-Ming Yiu. Succinct text indexing with wildcards. In International Symposium on String Processing and Information Retrieval, pages 39–50. Springer, 2009.

[Ukk95] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

[Wei73] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11, 1973.