Suffix Structures and Circular Pattern Problems

Graduate Theses, Dissertations, and Problem Reports 2011 Suffix Structures and Circular Pattern Problems Jie Lin West Virginia University Follow this and additional works at: https://researchrepository.wvu.edu/etd Recommended Citation Lin, Jie, "Suffix Structures and Circular Pattern Problems" (2011). Graduate Theses, Dissertations, and Problem Reports. 3402. https://researchrepository.wvu.edu/etd/3402 This Dissertation is protected by copyright and/or related rights. It has been brought to you by the The Research Repository @ WVU with permission from the rights-holder(s). You are free to use this Dissertation in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you must obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/ or on the work itself. This Dissertation has been accepted for inclusion in WVU Graduate Theses, Dissertations, and Problem Reports collection by an authorized administrator of The Research Repository @ WVU. For more information, please contact [email protected]. Suffix Structures and Circular Pattern Problems Jie Lin Dissertation submitted to the College of Engineering and Mineral Resources at West Virginia University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer and Information Sciences Dr. Donald Adjeroh, Ph.D., Chair Dr. Elaine M Eschen , Ph.D. Dr. Arun Ross, Ph.D. Dr. James Harner, Ph.D. Dr. Cun-Quan Zhang, Ph.D Lane Department of Computer Science and Electrical Engineering Morgantown, West Virginia, 2011 Keywords: Suffix Array, Suffix Tree, Pattern Matching, Text Mining, Probabilistic Suffix Trees, Probabilistic Suffix Arrays, Markov Models, Space Efficiency, Circular Patterns, Mul- tidomain Proteins, Circular Pattern Discovery Copyright@ 2011 Jie Lin ABSTRACT The suffix tree is a data structure used to represent all the suffixes in a string. However, a major problem with the suffix tree is its practical space requirement. In this dissertation, we propose an efficient data structure – the virtual suffix tree (VST) – which requires less space than other recently proposed data structures for suffix trees and suffix arrays. On average, the space requirement (including that for suffix arrays and suffix links) is 13:8n bytes for the regular VST, and 12:05n bytes in its compact form, where n is the length of the sequence. Markov models are very popular for modeling complex sequences. In this dissertation, we present the probabilistic suffix array (PSA), a space-efficient alternative to the probabilistic suffix tree (PST) used to represent Markov models. The PSA provides all the capabilities of the PST, such as learning and prediction, and maintains the same linear time construction (linearity with respect to sequence length). The PSA, however, has a significantly smaller memory requirement than the PST, for both the construction stage, and at the time of usage. Using the proposed suffix data structures, we study the circular pattern matching (CPM) problem. We provide a linear time, linear space algorithm to solve the exact circular pattern matching problem. We then present four algorithms to address the approximate circular pattern matching (ACPM) problem. Our bidirectional ACPM algorithm provides the best time complexity when compared with other algorithms proposed in the literature. Further, we define the circular pattern discovery (CPD) problem and present algorithms to solve this problem. Using the proposed circular pattern matching algorithms, we perform experiments on computational analysis and function prediction for multidomain proteins. Acknowledgement I would like to thank my advisor, Dr. Don Adjeroh, for his guidance, advice, and contin- ued encouragement. It has been a pleasure to work under his supervision. Without him, this dissertation could not have come about. I would also like to thank my other committee members: Dr. Elaine Eschen, Dr. Arun Ross, Dr. James Harner, and Dr. Cun-Quan Zhang for their help during my studies. And finally, I thank my family members for their constant support, encouragement, and help. The work reported in this thesis was partly supported by a DOE CAREER award (No: DE-FG02-02ER25541 ), an NSF ITR award (No: 0312484), and a WV-EPSCoR RCG grant. iii Contents 1 Introduction 1 1.1 Overview . 1 1.2 The Problem . 2 1.2.1 Suffix Tree . 2 1.2.2 Markov Models and Probabilistic Suffix Tree . 3 1.2.3 Circular Pattern Matching and Circular Pattern Discovery . 5 1.3 Contribution . 6 1.4 Organization . 7 2 Related Work 9 2.1 Suffix Tree and Suffix Array . 10 2.1.1 Basic Notations and Definitions . 10 2.1.2 Suffix Tree . 10 2.1.3 Suffix Links . 11 iv 2.1.4 Implementation and Problems with the Suffix Tree . 12 2.1.5 Suffix Array . 12 2.2 Space-Efficient Suffix Trees . 12 2.2.1 ESA . 13 2.2.2 LST . 14 2.3 Markov Models and Probabilistic Suffix Tree . 15 2.3.1 Variable Length Markov Models . 15 2.3.2 Probabilistic Suffix Tree . 16 2.3.3 Computing TF and DF via Suffix Arrays . 16 2.4 Circular Pattern Matching . 21 2.4.1 String Pattern Matching . 21 2.4.2 Circular Pattern Matching Problems . 23 2.4.3 Exact Circular Pattern Matching (ECPM) . 25 2.4.4 Approximate Circular Pattern Matching (ACPM) . 26 2.4.5 ACPM Problem in Protein Sequences . 29 2.5 Pattern Discovery Problem . 30 3 The Virtual Suffix Tree 31 3.1 Introduction . 31 3.2 Basic Data Structure . 33 3.2.1 Example VST . 34 v 3.2.2 Properties of the Virtual Suffix Tree . 34 3.2.3 Pattern Matching on VST . 36 3.3 Improved Virtual Suffix Tree . 38 3.3.1 Adjusting Edge Lengths . 38 3.3.2 Construction Algorithm . 40 3.3.3 Further Space Reduction . 41 3.3.4 Complexity Analysis . 41 3.4 Computing Suffix Links . 43 3.5 From SA to VST . 45 3.6 Summary . 50 4 The Probabilistic Suffix Array 59 4.1 Introduction . 59 4.2 Probabilistic Suffix Tree . 60 4.3 Proposed Data Structure . 61 4.3.1 Internal Node Attributes . 62 4.3.2 Measurement Attributes . 63 4.3.3 Example PSA . 64 4.3.4 Interval Array and Document Frequency in Linear Time . 65 4.4 Constructing the PSA . 66 4.4.1 Building the Interval Tree . 67 vi 4.4.2 Building the Suffix Link . 68 4.4.3 Sorting the PSA Structure . 68 4.4.4 Computing Conditional Probabilities Using the PSA . 70 4.4.5 Prediction with VLMM via the PSA . 71 4.5 Space Analysis . 74 4.5.1 Storage Space . 74 4.5.2 Construction Space . 75 4.6 Experiments . 77 4.6.1 Predicting Protein Families . 77 4.6.2 Space Consideration . 79 4.6.3 Computational Time Requirement . 80 4.6.4 PSA in Phylogenetic Tree Construction . 81 4.7 Summary . 82 5 Circular Pattern Matching 87 5.1 Introduction . 87 5.2 Exact Circular Pattern Matching Problem . 89 5.2.1 Linear Time ECPM Algorithm . 89 5.2.2 Comparison of ECPM algorithms . 92 5.3 Approximate Circular Pattern Matching Problem . 93 5.3.1 Greedy ACPM Algorithm . 93 vii 5.3.2 ACPM with LIS . 94 5.3.3 ACPM with q-grams and Suffix Array . 96 5.3.4 Improved Algorithm: ACPM with Bidirectional Edit Distance . 98 5.3.5 Comparison with Other ACPM Algorithms . 104 5.4 Experiments . 105 5.4.1 Data Set . 106 5.4.2 CPM Experimental Design . 107 5.4.3 Multidomain Protein Networks using Circular Patterns . 110 5.5 Summary . 115 6 Circular Pattern Discovery 133 6.1 Introduction . 133 6.2 The Circular Pattern Discovery Problem . 134 6.3 The ECPD Algorithm . 136 6.4 The ACPD Algorithm . 137 6.4.1 ACPD using Maes’ Algorithm . 137 6.4.2 Proposed ACPD Algorithm . 139 6.4.3 Comparison . 140 6.5 Experiments . 140 6.6 Summary . 146 viii 7 Conclusion and Future Work 147 7.1 Conclusion . 147 7.2 Future Work . 149 7.2.1 Circular Pattern Discovery . 149 7.2.2 Network Analysis for Circular Multidomain Proteins . 149 7.2.3 From PSA to PFA . 150 7.2.4 Approximate Pattern Matching Using PSA . 150 7.2.5 Prediction with PSA using Inexact Matching . 150 7.3 Publications from the Dissertation . ..

Load more