Suffix Tree, Minwise Hashing and Streaming Algorithms for Big Data Analysis in Bioinformatics
University of Nebraska - Lincoln
DigitalCommons@University of Nebraska - Lincoln
Computer Science and Engineering: Theses, Dissertations, and Student Research
Computer Science and Engineering, Department of
Fall 12-4-2020

Sairam Behera, University of Nebraska-Lincoln

Behera, Sairam, "SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS" (2020). Computer Science and Engineering: Theses, Dissertations, and Student Research. 201. https://digitalcommons.unl.edu/computerscidiss/201

SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS
by Sairam Behera

A DISSERTATION Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfilment of Requirements For the Degree of Doctor of Philosophy
Major: Computer Science
Under the Supervision of Professor Jitender S. Deogun
Lincoln, Nebraska
December, 2020

Sairam Behera, Ph.D.
University of Nebraska, 2020
Adviser: Jitender S. Deogun

In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise-hashing (minhash) and locality-sensitive hashing (LSH).
Streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (KmerEstimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a contiguous subsequence (substring) of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie-based data structure, to build alignment-free, non-pairwise algorithms for the conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for the identification of longer CNSs (≥ 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Using LSH as well, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust), which also uses minhash and LSH techniques, was developed for the isoform clustering problem. Isoforms are generated from the same gene by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we performed a comprehensive performance analysis of different transcriptome assemblers using simulated benchmark datasets.
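As a rough illustration of the k-mer and minhash ideas summarized above, the following Python sketch extracts the k-mer set of a sequence and estimates the Jaccard similarity of two such sets from their minhash signatures. It is not the implementation used in MinCNE or MinIsoClust: the function names, the seeded BLAKE2b hash, the signature length of 128, and the example sequences are all assumptions chosen for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all contiguous substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function.

    BLAKE2b with a salt is an illustrative choice, not the dissertation's.
    """
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(kmer_set, num_hashes=128):
    """Keep, for each seeded hash function, the minimum hash over the set.

    Raises ValueError if kmer_set is empty (sequence shorter than k).
    """
    return [
        min(seeded_hash(km, seed) for km in kmer_set)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree between two signatures."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Two made-up sequences that differ in two adjacent bases.
    seq_a = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA"
    seq_b = "ACGTACGTTGCAACGTAGCTAGCTAGGTTCCA"
    set_a, set_b = kmers(seq_a, 5), kmers(seq_b, 5)
    sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
    print("exact Jaccard:    ", exact_jaccard(set_a, set_b))
    print("minhash estimate: ", estimate_jaccard(sig_a, sig_b))
```

The estimate converges to the exact Jaccard similarity as the number of hash functions grows, while each sequence is represented by a fixed-length signature regardless of its length. An LSH step would typically split such signatures into bands and bucket sequences whose bands collide, which is omitted here.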
Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform clustering using the minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods.

ACKNOWLEDGMENTS

Firstly, I would like to express my sincere gratitude to my advisor, Dr. Jitender S. Deogun, for his support and guidance in the last few years here at UNL. We spent a lot of time discussing ideas and possible solutions for various problems. I am really grateful to him for all his mentorship during my PhD research. Besides my advisor, I would like to thank my committee members: Dr. Vinod N. Variyam, Dr. Lisong Xu, Dr. James C. Schnable, and Dr. Etsuko N. Moriyama. It has been a matter of great pleasure and fortune to get a chance to collaborate and work with Dr. Schnable, Dr. Vinod, and Dr. Moriyama on different projects. I thank Dr. Schnable for introducing me to the bioinformatics problems in plant science research. I also thank Dr. Vinod for helping me to understand some big data algorithms. My sincere thanks to Dr. Xu for his insightful comments on some of my research and his advice on my research progress. I would like to thank Dr. Moriyama for allowing me to join her research lab at the School of Biological Sciences. Without her support and guidance, it would not have been possible to conduct this research. I am grateful to the Computer Science and Engineering Department, the Center for Root and Rhizobiome Innovation (CRRI), and Nebraska EPSCoR for supporting my research. I thank my fellow lab-mates, including Suraj, Baset, Natasha, Mohammed, and Aziza, for all their support and insightful discussions. I thank all the professors, staff, and friends in the CSE department for their help and encouragement throughout my PhD research.
I am so grateful to the members of the Schnable lab for helping me to understand many concepts of biology and plant science. My sincere thanks go to Sutanu of Vinod's group, Xianjun and Zhikai from the Schnable lab, and Adam and Kushagra from the Moriyama lab for their help and collaboration.

Last but not least, I would like to thank my family: my parents, parents-in-law, brothers, sisters, my wife, and my beautiful daughter Saisha for supporting me throughout my PhD and my life in general.

PREFACE

• Some of the results presented in Chapter 2 have been published in [13].
• The results presented in Chapter 3 have been published in [74] and [11].
• The results presented in Chapter 4 will appear in [14].
• The results presented in Chapter 5 have been published in [15].
• Results presented in Chapter 6 have been published in [12], [145], and [17] and are included in a manuscript currently in preparation [16].

GRANT INFORMATION

This work has been partly supported by NSF EPSCoR RII Track-1: Center for Root and Rhizobiome Innovation (CRRI) Award OIA-1557417 to Etsuko N. Moriyama. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these agencies.

Table of Contents

List of Figures
List of Tables
1 Introduction
2 Streaming algorithm for approximating k-mer frequency counts
  2.1 Introduction
    2.1.1 Problem statement
    2.1.2 Related works
  2.2 Methods
    2.2.1 Implementation
  2.3 Results
    2.3.1 Experimental setup
    2.3.2 Accuracy
    2.3.3 Time and space
    2.3.4 Sample size
  2.4 Conclusion
3 Discovery of conserved non-coding sequences efficiently
  3.1 Introduction
  3.2 Background and Related Works
  3.3 Methodology
    3.3.1 Problem definition
    3.3.2 Algorithm
    3.3.3 CNS with mismatches
  3.4 Experimental Results
    3.4.1 Accuracy and sensitivity of our approach
    3.4.2 Comparison of results from our approach and CDP
    3.4.3 Association of CNSs with DNase hypersensitive sites
    3.4.4 Running time
  3.5 Conclusion and Future Works
4 Identifying conserved non-coding elements using min-wise hashing
  4.1 Introduction
  4.2 Materials and Methods
    4.2.1 Minhash signatures
    4.2.2 LSH-based clustering
    4.2.3 CNE identification
    4.2.4 Benchmark dataset
    4.2.5 Performance evaluation
  4.3 Results and Discussion
    4.3.1 CNE identification performance
    4.3.2 Time and space usage
  4.4 Conclusion
5 Isoform clustering using minhash and locality-sensitive hashing
  5.1 Introduction
  5.2 Materials and Methods
    5.2.1 Sequence comparison using minhash signatures
    5.2.2 MinIsoClust isoform-clustering strategy
    5.2.3 LSH-based bucketing
    5.2.4 Identification of isoforms by clustering
    5.2.5 Benchmark datasets
    5.2.6 Performance evaluation
    5.2.7 Program execution
  5.3 Results and Discussion
    5.3.1 Isoform-clustering accuracy
    5.3.2 Computational time and space usage
  5.4 Conclusion
6 New ensemble approach for improving transcriptome assembly
  6.1 Introduction
  6.2 Transcriptome Assembly Strategies
    6.2.1 Genome-guided approach
    6.2.2 De novo approach
    6.2.3 Ensemble approach
  6.3 Performance Evaluation of Transcriptome Assembly
    6.3.1 Performance metrics without references
    6.3.2 Performance metrics using actual biological data
    6.3.3 Performance metrics using simulated benchmark data
  6.4 Simulated Benchmark Transcriptome Datasets Generation
    6.4.1 RNA-seq simulation methods
    6.4.2 Examples of RNA-seq simulation
  6.5 Performance Comparison among Transcriptome Assemblers
    6.5.1 Genome-guided approach
    6.5.2 De novo approach
    6.5.3 Combining de novo assemblies generated using different k-mers
    6.5.4 Analysis of k-mers used in assembled contigs
    6.5.5 Ensemble approach
  6.6 Minsemble: a New Ensemble Approach
    6.6.1 Minhash signature generation
    6.6.2 Clustering of potential isoforms
    6.6.3 Selection of contigs for final assembly