Graduate Theses, Dissertations, and Problem Reports

2011

Suffix Structures and Circular Pattern Problems

Jie Lin West Virginia University

Follow this and additional works at: https://researchrepository.wvu.edu/etd

Recommended Citation Lin, Jie, "Suffix Structures and Circular Pattern Problems" (2011). Graduate Theses, Dissertations, and Problem Reports. 3402. https://researchrepository.wvu.edu/etd/3402

This Dissertation is protected by copyright and/or related rights. It has been brought to you by the The Research Repository @ WVU with permission from the rights-holder(s). You are free to use this Dissertation in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you must obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/ or on the work itself. This Dissertation has been accepted for inclusion in WVU Graduate Theses, Dissertations, and Problem Reports collection by an authorized administrator of The Research Repository @ WVU. For more information, please contact [email protected]. Suffix Structures and Circular Pattern Problems

Jie Lin

Dissertation submitted to the College of Engineering and Mineral Resources at West Virginia University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer and Information Sciences

Dr. Donald Adjeroh, Ph.D., Chair Dr. Elaine M Eschen , Ph.D. Dr. Arun Ross, Ph.D. Dr. James Harner, Ph.D. Dr. Cun-Quan Zhang, Ph.D

Lane Department of Computer Science and Electrical Engineering Morgantown, West Virginia, 2011

Keywords: Suffix Array, Suffix Tree, , Text Mining, Probabilistic Suffix Trees, Probabilistic Suffix Arrays, Markov Models, Space Efficiency, Circular Patterns, Mul- tidomain Proteins, Circular Pattern Discovery

Copyright@ 2011 Jie Lin ABSTRACT

The suffix tree is a used to represent all the suffixes in a string. However, a major problem with the suffix tree is its practical space requirement. In this dissertation, we propose an efficient data structure – the virtual suffix tree (VST) – which requires less space than other recently proposed data structures for suffix trees and suffix arrays. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form, where n is the length of the sequence.

Markov models are very popular for modeling complex sequences. In this dissertation, we present the probabilistic suffix array (PSA), a space-efficient alternative to the probabilistic suffix tree (PST) used to represent Markov models. The PSA provides all the capabilities of the PST, such as learning and pre- diction, and maintains the same linear time construction (linearity with respect to sequence length). The PSA, however, has a significantly smaller memory requirement than the PST, for both the construction stage, and at the time of usage.

Using the proposed suffix data structures, we study the circular pattern matching (CPM) problem. We provide a linear time, linear space algorithm to solve the exact circular pattern matching problem. We then present four algorithms to address the approximate circular pattern matching (ACPM) problem. Our bidirectional ACPM algorithm provides the best time complexity when compared with other algorithms proposed in the literature. Further, we define the circular pattern discovery (CPD) problem and present algorithms to solve this problem. Using the proposed circular pattern matching algorithms, we perform experiments on computational analysis and function prediction for multidomain proteins. Acknowledgement

I would like to thank my advisor, Dr. Don Adjeroh, for his guidance, advice, and contin- ued encouragement. It has been a pleasure to work under his supervision. Without him, this dissertation could not have come about.

I would also like to thank my other committee members: Dr. Elaine Eschen, Dr. Arun Ross, Dr. James Harner, and Dr. Cun-Quan Zhang for their help during my studies.

And finally, I thank my family members for their constant support, encouragement, and help.

The work reported in this thesis was partly supported by a DOE CAREER award (No: DE-FG02-02ER25541 ), an NSF ITR award (No: 0312484), and a WV-EPSCoR RCG grant.

iii Contents

1 Introduction 1

1.1 Overview ...... 1

1.2 The Problem ...... 2

1.2.1 Suffix Tree ...... 2

1.2.2 Markov Models and Probabilistic Suffix Tree ...... 3

1.2.3 Circular Pattern Matching and Circular Pattern Discovery ...... 5

1.3 Contribution ...... 6

1.4 Organization ...... 7

2 Related Work 9

2.1 Suffix Tree and Suffix Array ...... 10

2.1.1 Basic Notations and Definitions ...... 10

2.1.2 Suffix Tree ...... 10

2.1.3 Suffix Links ...... 11

iv 2.1.4 Implementation and Problems with the Suffix Tree ...... 12

2.1.5 Suffix Array ...... 12

2.2 Space-Efficient Suffix Trees ...... 12

2.2.1 ESA ...... 13

2.2.2 LST ...... 14

2.3 Markov Models and Probabilistic Suffix Tree ...... 15

2.3.1 Variable Length Markov Models ...... 15

2.3.2 Probabilistic Suffix Tree ...... 16

2.3.3 Computing TF and DF via Suffix Arrays ...... 16

2.4 Circular Pattern Matching ...... 21

2.4.1 String Pattern Matching ...... 21

2.4.2 Circular Pattern Matching Problems ...... 23

2.4.3 Exact Circular Pattern Matching (ECPM) ...... 25

2.4.4 Approximate Circular Pattern Matching (ACPM) ...... 26

2.4.5 ACPM Problem in Protein Sequences ...... 29

2.5 Pattern Discovery Problem ...... 30

3 The Virtual Suffix Tree 31

3.1 Introduction ...... 31

3.2 Basic Data Structure ...... 33

3.2.1 Example VST ...... 34

v 3.2.2 Properties of the Virtual Suffix Tree ...... 34

3.2.3 Pattern Matching on VST ...... 36

3.3 Improved Virtual Suffix Tree ...... 38

3.3.1 Adjusting Edge Lengths ...... 38

3.3.2 Construction Algorithm ...... 40

3.3.3 Further Space Reduction ...... 41

3.3.4 Complexity Analysis ...... 41

3.4 Computing Suffix Links ...... 43

3.5 From SA to VST ...... 45

3.6 Summary ...... 50

4 The Probabilistic Suffix Array 59

4.1 Introduction ...... 59

4.2 Probabilistic Suffix Tree ...... 60

4.3 Proposed Data Structure ...... 61

4.3.1 Internal Node Attributes ...... 62

4.3.2 Measurement Attributes ...... 63

4.3.3 Example PSA ...... 64

4.3.4 Interval Array and Document Frequency in Linear Time ...... 65

4.4 Constructing the PSA ...... 66

4.4.1 Building the Interval Tree ...... 67

vi 4.4.2 Building the Suffix Link ...... 68

4.4.3 Sorting the PSA Structure ...... 68

4.4.4 Computing Conditional Probabilities Using the PSA ...... 70

4.4.5 Prediction with VLMM via the PSA ...... 71

4.5 Space Analysis ...... 74

4.5.1 Storage Space ...... 74

4.5.2 Construction Space ...... 75

4.6 Experiments ...... 77

4.6.1 Predicting Protein Families ...... 77

4.6.2 Space Consideration ...... 79

4.6.3 Computational Time Requirement ...... 80

4.6.4 PSA in Phylogenetic Tree Construction ...... 81

4.7 Summary ...... 82

5 Circular Pattern Matching 87

5.1 Introduction ...... 87

5.2 Exact Circular Pattern Matching Problem ...... 89

5.2.1 Linear Time ECPM Algorithm ...... 89

5.2.2 Comparison of ECPM algorithms ...... 92

5.3 Approximate Circular Pattern Matching Problem ...... 93

5.3.1 Greedy ACPM Algorithm ...... 93

vii 5.3.2 ACPM with LIS ...... 94

5.3.3 ACPM with q-grams and Suffix Array ...... 96

5.3.4 Improved Algorithm: ACPM with Bidirectional . . . . . 98

5.3.5 Comparison with Other ACPM Algorithms ...... 104

5.4 Experiments ...... 105

5.4.1 Data Set ...... 106

5.4.2 CPM Experimental Design ...... 107

5.4.3 Multidomain Protein Networks using Circular Patterns ...... 110

5.5 Summary ...... 115

6 Circular Pattern Discovery 133

6.1 Introduction ...... 133

6.2 The Circular Pattern Discovery Problem ...... 134

6.3 The ECPD Algorithm ...... 136

6.4 The ACPD Algorithm ...... 137

6.4.1 ACPD using Maes’ Algorithm ...... 137

6.4.2 Proposed ACPD Algorithm ...... 139

6.4.3 Comparison ...... 140

6.5 Experiments ...... 140

6.6 Summary ...... 146

viii 7 Conclusion and Future Work 147

7.1 Conclusion ...... 147

7.2 Future Work ...... 149

7.2.1 Circular Pattern Discovery ...... 149

7.2.2 Network Analysis for Circular Multidomain Proteins ...... 149

7.2.3 From PSA to PFA ...... 150

7.2.4 Approximate Pattern Matching Using PSA ...... 150

7.2.5 Prediction with PSA using Inexact Matching ...... 150

7.3 Publications from the Dissertation ...... 152

ix List of Tables

2.1 Interval Array ...... 18

3.1 VST node attributes for the example sequence T = missississippi$ used in Figure 3.1...... 35

3.2 Node attributes in the improved VST for the example sequence, T = missississippi$. 40

3.3 Branching factor and maximum space requirement for various sample files. . . 54

3.4 Storage requirement for the VST, including suffix links ...... 55

3.5 Detailed attributes for nodes in the TA data structure using the sample sequence, T = missississippi$...... 55

3.6 Node mapping table from TA to VST ...... 56

4.1 Attributes of PSA ...... 63

4.2 Example PSA internal nodes, using the PSA of the sequence T = accactact$ 65

4.3 Example PSA leaf nodes, using the PSA of the sequence T = accactact$ . 65

x 4.4 Performance of the PSA in modeling and prediction of protein families. Fam- ilies correspond to the first 51 protein families with 12 or more members in the Pfam database, ordered alphabetically based on their abbreviated names in Pfam. For comparison, we have included the results obtained using the PST [17] on the same data set. (TP stands for true positive, while MD stands for missed detection). ∗∗The family apple was not in the dataset used in [17]...... 84

4.5 Summary performance in protein family classification using the PSA and PST . 85

4.6 Summary data on the first 51 families in Pfam, as described in Table 4.4. . . . . 85

4.7 Construction memory needed for the PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4...... 85

4.8 Construction time comparison for PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time is time needed per family (in seconds). Speedup is computed as the ratio with respect to PSA time...... 85

4.9 Prediction time comparison between PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time (in seconds) is prediction time per family – i.e. total time needed to predict all members in the family against all the other families...... 86

5.1 Comparison of ECPM algorithms ...... 93

5.2 Comparison with other proposed ACPM Algorithms ...... 105

5.3 Top 15 highest degree proteins with GO function ...... 109

5.4 The predicted protein functions using union for In-edge and Out-edge . . . . . 111

5.5 The predicted protein functions using intersection for In-edge and Out-edge . . 112

5.6 Performance in Protein Function Prediction using the Top-500 Proteins . . . . 112

xi 5.7 Network statistics for multidomain protein networks ...... 114

5.8 Top 25 proteins with the highest node degree differences between protein net- works using the circular and non-circular patterns...... 116

5.9 The longest path in the Protein network ...... 117

5.10 The longest path in the Family network ...... 118

6.1 The number of distinct patterns with pattern length ...... 144

6.2 Sample discovered circular patterns with length five...... 144

6.3 Sample discovered circular patterns with length thirteen...... 145

xii List of Figures

1.1 Summary of work in this dissertation...... 8

2.1 Algorithm for computing document frequency [131] ...... 20

2.2 Edit Graph of T and PP [87]...... 27

2.3 Maes’ Algorithm ...... 28

3.1 Suffix tree and virtual suffix tree for the string T = missississippi$. . . . 51

3.2 Example VST (solid nodes) showing left SA index (lSA) and right SA index (rSA) for sample nodes...... 51

3.3 Edge-length adjustment procedure...... 52

3.4 Improved VST for the string T = missississippi$ ...... 52

3.5 Suffix links on the VST for the sample string T = missississippi$. . . . 55

3.6 Constructing VST from the suffix array...... 56

4.1 State diagram and transition matrix for a first order Markov model for an exam- ple sequence T = accactact$...... 61

4.2 Example suffix tree and probabilistic suffix tree for the string T = accactact$. 62

xiii 4.3 Top-k classification rate for sample protein families using the PSA...... 79

4.4 Memory consumption factor (MC Factor) needed to construct the PSA and PST data structures for the first 51 protein families in Pfam...... 80

4.5 Phylogenetic tree for 20 species constructed using the predicted probabilities obtained using the PSA...... 82

5.1 Suffix tree for the string T = missississippi$ with some suffix links. . . 118

5.2 The number of hypotheses with q-gram ...... 121

5.3 Dynamic Programing in q-gram matching ...... 122

5.4 Three cases in computing the circular edit distance in the ACPM algorithm using the bidirectional edit distance. The numbered double-header show the symbol positions involved in each case...... 123

5.5 The time cost of the CPM algorithms ...... 124

5.6 Degree distributions in the network of multidomain proteins constructed based on the circular patterns they contain...... 125

5.7 Number of directly connected pairs in Top-K highest degree proteins ...... 126

5.8 The Protein network (using both CPs and non-CPs) ...... 127

5.9 The Protein network using only non-circular patterns ...... 128

5.10 The Protein network using only circular patterns ...... 129

5.11 The Family network (using both CPs and non-CPs) ...... 130

5.12 The Family network using only non-circular patterns ...... 131

5.13 The Family network using only circular patterns ...... 132

xiv 6.1 Variation of the number of distinct patterns (including non-circular and circular patterns) with pattern length...... 143

6.2 Variation of number of circular patterns with pattern length...... 143

6.3 Variation of maximum number of occurrences with pattern length...... 145

xv Chapter 1

Introduction

1.1 Overview

The suffix tree is an important data structure used to represent sequences (for example, text, DNA sequence, video, etc.). However, its space requirement is huge for most practical applica- tions. Most methods for suffix tree (ST) and suffix array (SA) have focused on the theoretical time and space complexity. Markov models are popular for modeling complex sequences whose sources are unknown, or whose underlying statistical characteristics are not well understood. A major problem, however, is its space complexity, which grows exponentially with the order of the Markov model. The probabilistic suffix tree (PST) was proposed to address with the space problem of the Markov model. This reduced the space complexity theoretically. However, in practice, the space requirement for the PST is still relatively large, and often impractical for most real-life problems. Circular permutations and circular pattern matching are interesting problems in computer science and biology. There are several algorithms and methods to solve this problem. But in practice, these algorithms and methods have huge time and space costs.

In the first part of this dissertation, we develop efficient suffix data structures for analysis of huge sequences. In the second part, we use these data structures to study the problem of circular permutations and their applications in computational biology. Based on our circular permutation work, we define and study the circular pattern discovery problem.

1 CHAPTER 1. INTRODUCTION 2

First we propose a space efficient data structure called the virtual suffix tree (VST). The VST supports the same functions as the suffix tree, but with much less space practical requirement. The average space requirement is significantly smaller than other data structures for suffix tree and suffix array. Secondly, we propose an efficient data structure, the probabilistic suffix array (PSA) to represent the Markov model. PSA takes a much smaller space than probabilistic suffix tree which is implemented on a regular suffix tree. We show the experiment in the biology applications. Lastly, using suffix trees and suffix arrays we propose algorithms to solve the circular pattern matching problem and use these to study circular permutations and function prediction for multidomain proteins. We also introduce the circular pattern discovery problem and present algorithms to solve the problem.

1.2 The Problem

1.2.1 Suffix Tree

The suffix tree is an important data structure used to represent the set of all suffixes of a string. The suffix tree is efficient in both time and space, and has been used in a variety of applications, such as pattern matching, , identification of repetitions in genome-scale biological sequences, and in data compression. Various algorithms have been de- veloped for efficient construction of suffix trees [38,89,122,128]. However, one major problem with the suffix tree is its practical space requirement. The suffix array is a related data structure, which was originally introduced in [84] as a space-efficient alternative to the suffix tree. The suffix array simply provides a listing of all the suffixes of a given string in lexicographic order. The suffix array can be used in most (though, not all) situations where a suffix tree is used.

Although the theoretical space complexity is linear for both data structures, typically, for a given string T of length n, the suffix array requires about three to five times less space than the suffix tree. The construction time for both algorithms is also O(n) on average. For suffix arrays, construction algorithms that run in O(nlogn) worst case are relatively easy to develop, but O(n) worst case algorithms are much harder to come by. Recent suffix sorting algorithms with worst-case linear time complexity have been reported in [57, 63, 65]. Gusfield [46] provides a comprehensive treatment of suffix trees and its applications. Puglisi et al. [104] provide a CHAPTER 1. INTRODUCTION 3

recent survey on suffix arrays. An extensive discussion on the connection between the Burrows- Wheeler Transform [23] and suffix trees and suffix arrays is provided in [3].

For small alphabet sizes, the suffix tree and the suffix array have about the same complexity in pattern matching. For pattern matching, the suffix array requires time in O(mlogn) to locate one occurrence of a pattern P of length m in T. However, with additional data structures, such as the lcp array, this time can be reduced to O(m+logn). With the suffix tree, the same search can be performed in O(m) time. The problem, however, is for sequences with large alphabets where |Σ| → n. Here, |Σ|, the alphabet size is no longer negligible. Using the array representation of nodes in the suffix tree will require O(n|Σ|) space for the suffix tree, and O(m) time for pattern matching. For linear space, the linked list or binary search tree can be used, but the search time becomes O(m|Σ|) or O(mlog|Σ|) respectively.

The Challenge. A key problem then is to develop space-efficient data structures that can support pattern matching using the same time complexity as suffix trees, but at a practical space requirement that approaches that of the suffix array. Such a data structure should also support the complete functionality of the suffix tree, such as support for suffix links, as may be required in certain applications. Two recent data structures that have attempted to address this problem are the ESA – enhanced suffix array [2], and the LST – linearized suffix tree [61, 62]. Both methods are based on the notion of lcp-intervals [60], constructed using the suffix array and the lcp array.

1.2.2 Markov Models and Probabilistic Suffix Tree

Markov models are very popular for modeling complex sequences whose sources are un- known, or whose underlying statistical characteristics are not well understood. This is especially the case when the sequences exhibit some memory. For a short term memory of length, say L, this means that the conditional distribution of the next symbol given the last L symbols does not change significantly if we condition on L or more previous symbols. Thus, such sequences are often modeled using Markov models of order L , or using the Hidden Markov Models (HMM) [37]. The models provide efficient mechanisms to compute the required conditional probabilities, and also for generating sequences from the models. The problem is that the size of Markov models increases exponentially with increasing memory length L. Thus, they are CHAPTER 1. INTRODUCTION 4 practical only for low order models with short memory lengths. This leads to the second chal- lenge: such low-order Markov models often provide a poor approximation of the true sequence being modeled. It is known that learning with Hidden Markov Models is computationally very challenging. Hardness results on the learnability of HMM are discussed in [1], while similar results on inferencing using HMM are reported in [42].

Probabilistic suffix models such as probabilistic suffix trees (PST) have been proposed by Ron et al. [108] to address some of the key problems with Markov models. They showed the equivalence between PSTs and a subclass of probabilistic finite automata (PFAs) called probabilistic suffix automata (PSF/PSA): for a given PST, there is an algorithm to construct a PSF/PSA whose size is the same as that of the PST within a constant factor. Further, the distribution generated by a PST is guaranteed to be within a small distance from that generated using the PSF/PSA, as measured by the Kullback-Leibler divergence.

Probabilistic suffix trees and probabilistic suffix automatons are related to context-based models which are extensively used in sequence prediction and data compression. Typical exam- ples of such context models include context tree weighting (CTW) [129], prediction by partial matching (PPM) [28], and the Lempel-Ziv decomposition [133, 134]. See also [106, 107]. The use of these context models as a surrogate for variable length Markov models with applications in sequence prediction are reviewed in [16]. Other applications have been found in model- ing DNA sequences [109], protein sequence classification [17, 71], and in modeling API call sequences for malicious codes in Windows XP [88].

The Challenge. The probabilistic suffix models however require O(Ln2) time and space to construct. The algorithm to construct the PFA/PSA from the PST also runs in O(Ln2) time. Later Apostolico and Bejerano [9] showed how the PST can be constructed in O(n) time, inde- pendent of the order L, using traditional suffix links used in constructing suffix trees, and the notion of reverse suffix links. They did not consider the problem of constructing the PSF/PSA from the PST. Their use of the suffix tree also implies that the space requirements for construct- ing the PST will be very high, although it is still linear in terms of the sequence length. CHAPTER 1. INTRODUCTION 5

1.2.3 Circular Pattern Matching and Circular Pattern Discovery

Computing similarity (or dissimilarity) between two strings is an important problem in general sequence analysis [44, 44, 46, 51, 81, 115], pattern recognition [113, 120] and biology [40, 47, 55]. The circular edit distance is an extension of traditional edit distance which seeks to determine similarity between strings in a circular shift. A circular shift is a mapping f : ∗ ∗ t Σ → Σ , f (c1...cr) = ct+1...crc1...ct−1ct, where 0 ≤ t ≤ r − 1 and r is the length of string c1c2...cr. The circular edit distance between two strings s1 and s2 is defined as EDc(s1,s2) = i j min{ED[ f (s1), f (s2)]|0 ≤ i ≤ |s1| − 1, 0 ≤ j ≤ |s2| − 1}, where |s1| is the length of string s1 and |s2| is the length of string s2, ED[A,B] is the standard edit distance between A and B. Thus, the dissimilarity between two strings in a circular shift is a function of the circular distance between them. Computational methods have also been proposed to study circular patterns in biological problems. [123, 124, 126, 127]

In biology, circular proteins and circular permutations in proteins are becoming of increas- ing interest, especially given their role in the structure, function, folding, and stability of pro- teins [40, 47, 55]. In a circular (or cyclic) protein, the traditional N- and C-termini are joined, resulting in a protein sequence with no termini [123]. The cyclotides is a typical example of a naturally-occurring family of cyclic proteins in the Plant Kingdom. Cyclotides are known to play a major role and provide important functions in terms of plant defense against insects and other pathogens [35]. Their cyclic structure is known to be an important factor in their unusual stability [35]. Other common examples of cyclic proteins are the bacteriocins, small antimicro- bial peptides with 30-70 residues produced by bacteria [33], cyclosporins found in fungi [66], and the primate rhesus θ-defensin -1 [119] with antibacterial properties for the immune system of macaques monkeys.

The Challenge. There are several algorithms for calculating the circular distance between text T and circular pattern P, but they did not consider the problem of finding circular pattern P and its circular shifts inside of text T. In our work, we define variants of the CPM problem and propose algorithms to solve them.

Pattern discovery is a fundamental analysis method used to identify possibly hidden rela- tions within or between the sequences. Pattern discovery problems are well studied in many CHAPTER 1. INTRODUCTION 6

applications. In biology, motif discovery is often performed as a kind of pattern discovery ap- plication. To our knowledge, there is no existing work that explicitly studied the problem of pattern discovery involving circular patterns. In this work, we introduce the Circular Pattern Discovery problem and propose algorithms for its solution.

1.3 Contribution

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(mlog|Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [2], and the linearized suffix tree [62]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.

We present the probabilistic suffix array (PSA), a data structure for representing informa- tion in variable length Markov chains. The PSA essentially encodes information in a Markov model by providing a space-efficient representation of the probabilistic suffix tree (PST). Our PSA provides the same functionality as the PST, but at a significantly reduced space require- ment. Given a sequence of length n, construction and learning in the PSA is done in O(n) time and space, independent of the Markov order. Prediction using the PSA is performed in n O(mlog |Σ| ) time, where m is the pattern length, and Σ is the symbol alphabet. The specific memory requirement is 33n bytes in the worst case, and 26n bytes on average, including space for the suffix array and the input sequence. This can be compared with the 41n bytes needed using the PST.

We propose an exact circular pattern matching (ECPM) algorithm that runs in linear time CHAPTER 1. INTRODUCTION 7 and linear space. We also propose algorithms to solve the approximate circular pattern matching (ACPM) problem. In our work, we solved a harder version of the ACPM problem when com- pared to other previous work on ACPM [44, 81, 123, 124, 126, 127]. We present an experiment on finding circular relations in multi-domain proteins. Our experiments show that the methods based on circular permutations can produce very good results in predicting protein functions.

We propose two algorithms for the Circular Pattern Discovery (CPD) problem. The first algorithm uses suffix trees and suffix links to solve the exact circular pattern discovery problem 2 in O(m2N) time. The second algorithm uses suffix arrays to solve the more challenging ap- 2 2 2 proximate circular pattern discovery (ACPD) problem in O(km2N ) worst case, and O(km2N) on average. By exploiting the nature of the ACPD problem, the complexity can be reduced to 2 2 2 O(m2N ) worst case, and O(m2N) on average.

Aspects of the work from this dissertation are reported in the following papers: [74–79].

1.4 Organization

The dissertation is organized as follows. In Chapter 2, we introduce related work, includ- ing basic notations and definitions. Chapter 3 presents the Virtual Suffix Tree (VST) including its construction algorithm, searching algorithm and support for suffix links. We also analyze its time and space complexity and compare with related algorithms. Chapter 4 introduces the Prob- abilistic Suffix Array (PSA) including detailed explanation and implementation. The practical space requirement of the data structure is also examined. In this chapter, we present the experi- ments of PSA using protein sequences and miDNA sequences and protein sequences. Chapter 5 discusses the circular pattern matching (CPM) problem. Algorithms are then proposed for the exact and inexact variants of the problem using suffix data structures. We implement the algorithms and show examples of using these algorithms to search for circular patterns in mul- tidomain proteins. In Chapter 6, we define the circular pattern discovery (CPD) problem and present algorithms to solve the CPD problem. Chapter 7 draws some conclusions and also de- scribes possible directions for future work. Figure 1.1 provides a summary of the work reported in this dissertation. CHAPTER 1. INTRODUCTION 8

Data Structure

Algorithm PSA Protein Family Classification Experiment Phylogenetic Tree Construction

Data Structure VST Algorithm

Dissertation work ECPM Algorithm ACPM

CPM Protein function prediction Experiment

CPM & CPD Circular pattern network

ECPD Algorithm ACPD CPD

Experiment Pattern discovery in multi-domain protein database

Figure 1.1. Summary of work in this dissertation. Chapter 2

Related Work

In this chapter, first we introduce the suffix data structures, namely suffix trees and suffix arrays, which form the basis for the majority of the work proposed in this dissertation. The suffix data structures are important in computer science and bioinformatics. In Section 2, we introduce the enhanced suffix array (ESA), the Linearized Suffix Tree (LST), and other space-efficient suffix trees. These two structures use the suffix array and LCP array to simulate the suffix tree. They are related to our new structure, the virtual suffix tree (VST).

In Section 3 we describe variable length Markov models (VLMM) and the probabilistic suffix tree (PST) which is used to implement VLMM. They are related to a new data structure, the probabilistic suffix array (PSA) proposed in this work. We also present previous work on calculating the term frequency (TF) and document frequency (DF), which are used in our proposed PSA representation of VLMM.

In Section 4, we discuss the circular pattern matching (CPM) problem. We first review the general string pattern matching problem. We define two CPM problems and introduce previous work. We will provide our solutions to the different CPM problems and experimental results in Chapter 5. In Section 5, we discuss the related work on the pattern discovery problem.

9 CHAPTER 2. RELATED WORK 10

2.1 Suffix Tree and Suffix Array

2.1.1 Basic Notations and Definitions

Let T = T[1..n] be the input string of length n, over an alphabet Σ. Let T = αβγ, for some strings α, β, and γ (α and γ could be empty). The string β is called a of T, α is called a prefix of T, while γ is called a suffix of T. The prefix α is called a proper prefix of T if α 6= T.

Similarly, the suffix γ is called a proper suffix of T if γ 6= T. We will also use ti = T[i] to denote the i-th symbol in T — both notations are used interchangeably. We use Ti = T[i..n] = titi+1 ...tn to denote the i-th suffix of T. For simplicity in constructing suffix trees, we ensure that no suffix of the string is a proper prefix of another suffix by appending a special symbol, $ to T, such that $ ∈/ Σ, and $ < σ, ∀σ ∈ Σ. We let P = P[1...m] to be the pattern string that needs to be found in T.

In our work, the size of alphabet |Σ| is not fixed. It may be small, example |Σ|=4 for DNA sequences, or it may be large, example |Σ| ≈ 106 for multidomain protein sequences.

2.1.2 Suffix Tree

Given a string T, its suffix tree (ST) is a rooted tree with n leaves, where the i-th leaf node

corresponds to the i-th suffix Ti of T. Except for the root node and the leaf nodes, every node must have at least two descendant child nodes. Each edge in the suffix tree represents a substring of T, and no two edges out of a node start with the same character. For a given edge, the edge

label is simply the substring in T corresponding to the edge. We use li to denote the i-th leaf node. Then, li corresponds to Ti, the i-th suffix of T. When the edges from each node are sorted

alphabetically, then li will correspond to TSA[i], the i-th suffix of T in lexicographic order, where SA denotes the suffix array.

For edge (u,v) between nodes u and v in ST, the edge label (denoted label(u,v) ) is a non- empty substring of T. The edge length is simply the length of the edge label. The edge label is usually represented compactly using two pointers to the beginning and end of its corresponding substring in T. For a given node u in the suffix tree, its path label, L(u) is defined as the label of CHAPTER 2. RELATED WORK 11 the path from the root node to u. Since each edge represents a substring in T, L(u) is essentially the string formed by the concatenation of the labels of the edges traversed in going from the root node to the given node, u. The string depth of node u, (also called its string length or path length) is simply |L(u)|, the number of characters in L(u). The node depth (also called node level) of node u is the number of nodes encountered in following the path from the root to u. The root is assumed to be a node at depth 0.

Given the string T = T[1..n], of length n, but with the end of string symbol appended to give a sequence T$ with length n + 1, the suffix tree of the resulting string T$ will have the following properties:

1. Exactly n + 1 leaf nodes.

2. At most n internal (or branching) nodes (the root node is considered an internal node).

3. Every distinct substring of T is encoded exactly once in the suffix tree. Each distinct substring is spelled out exactly once by traveling from the root node to some node u, such that L(u) is the required substring. Note that the node u may be an implicit node, i.e. ending at a position between two (explicit) nodes.

4. No two edges out of a given node in the suffix tree start with the same symbol.

5. Every internal node has at least two outgoing edges. Properties (1), (2), (4), and (5) imply that a suffix tree will have at most 2n + 1 total nodes, and at most 2n edges.

2.1.3 Suffix Links

Some suffix tree construction algorithms make use of suffix links. The notion of suffix links is based on a well-known fact about suffix trees [89, 128], namely, if there is an internal node u in ST such that its path label L(u) = aα for some single character a ∈ Σ, and a (possibly empty) string α ∈ Σ∗, then there is a node v in ST such that L(v) = α. A pointer from node u to node v is called a suffix link. If α is an empty string, then the pointer goes from u to the root node. Suffix links are important in certain applications, such as in computing matching statistics needed in approximate pattern matching, matching, or in certain types of traversal of the suffix tree. CHAPTER 2. RELATED WORK 12

2.1.4 Implementation and Problems with the Suffix Tree

A predominant factor in the space cost for suffix trees is the number of interior nodes in the tree, which depends on the tree topology. Thus, a major consideration is how the outgoing edges from a node in the suffix tree are represented. The three major representations used for outgoing edges are arrays, linked lists, and binary search trees. While the array is simple to implement, it could require a large memory for large alphabets. However, independent of the specific method adopted, a simple implementation of the suffix tree, for example, using Ukknoen’s algorithm [122], can require as large as 33n bytes of storage with suffix links, or 25n bytes without suffix links [3].

2.1.5 Suffix Array

The suffix array (SA) is another data structure, closely related to the suffix tree. The suffix array simply provides a lexicographically ordered list of all the suffixes of a string. If SA[i] =

j, it means that the i-th smallest suffix of T is Tj, the suffix starting at position j in T.A related structure, the LCP array contains the length of the longest common prefixes between adjacent positions in the suffix array. Combining the suffix array with the LCP information provides a powerful data structure for pattern matching. With this combination, decisions on the occurrence (or otherwise) of a pattern P of length m in the string T of length n can be made in O(m + logn) time. Given the new worst-case linear-time direct SA construction algorithms, and the small memory footprint of suffix arrays, it is becoming more attractive to construct the suffix tree from the suffix array. A linear-time algorithm for constructing ST from SA is presented in [3].

2.2 Space-Efficient Suffix Trees

The problem of practical space needed in using suffix trees have been recognized, and methods have been proposed to provide space-efficient data structures [8,45,82,83,94,110,111]. Andersson et al. [8] presented a level-compressed suffix tree in O(n) bytes. Munro et al. [94] proposed some space efficient suffix structures in O(nlogn)bits of space and O(m|Σ|) searching CHAPTER 2. RELATED WORK 13 time. The structures include a suffix array and other auxiliary structures to represent the suffix tree. But these structures did not include the suffix link which is an important part of the suffix tree. Grossi et al. [45] proposed compressed suffix structures, an indexing structure in O(n) bits with O(m|Σ|) searching time. Compact Suffix Array [82] uses at most 9n bytes of space 1 (or 13 8 n for including the LCP array), but the time of construction is O(nlogn) and time of 2 searching is O(mloglogn + (logn) loglogn + nocc lognloglogn), where nocc is the number of occurrences.

While the suffix array reduces the problem of space, the suffix tree still provides a simpler way in certain analysis problems, such as in computing matching statistics [27]. Thus, methods have been proposed to improve the suffix array with extra information to provide the full func- tionality of the suffix tree. The enhanced suffix array [2] and the linearized suffix tree [61, 62] are two example data structures recently proposed for full-text indexing with the functionalities of both the suffix tree and suffix array.

2.2.1 ESA

The enhanced suffix array (ESA) [2] is composed of the suffix array, and extra data struc- tures, namely, the lcp array and a child table that contains branching information between parent and child nodes in the suffix tree. The key idea used in the ESA is the concept of lcp-intervals (originally used in [60]). Given the suffix array, SA an interval [i.. j] in SA, 1 ≤ i < j ≤ n + 1 is called an lcp-interval with lcp-value l if the following conditions hold:

1. lcp[i] < l;

2. lcp[k] ≥ l ∀ k s.t. i + 1 ≤ k ≤ j;

3. lcp[k] = l, for some k, s.t. i + 1 ≤ k ≤ j;

4. lcp[ j + 1] < l.

Thus, rather than the traditional suffix tree, the ESA constructs an lcp-interval tree. Nodes in a suffix tree are now replaced with lcp-intervals, such that the parent-child relationships CHAPTER 2. RELATED WORK 14 in a traditional suffix tree are now captured by equivalent parent-child relationships between lcp-intervals. The root node corresponds to the interval [1..n] in the suffix array, essentially the entire suffix array. In the ESA, the basic structure used to represent the child-nodes from a given parent node is a linked list. Essentially, the child table is composed of three arrays, namely up, down, and nextIndex. The up and down arrays store information about the edges in the tree, while array nextIndex records information about the linked list used to represent the sibling relationship between nodes with the same parent. These three arrays would ordinarily require 3n elements to store. Interestingly, only n of these elements are required, and hence the child table requires only n integers to store. The ESA assumes that |Σ| is small relative to n. Thus, for large alphabets, pattern matching could take longer on the ESA. For instance, with the binary tree representation of the nodes in the suffix tree, pattern matching will take O(mlog|Σ|) time for a pattern of length m; doing the same on the enhanced suffix array (which uses a linked list representation of the nodes) will require O(m|Σ|) time, a significant difference for large alphabets.

2.2.2 LST

The linearized suffix tree (LST) [61, 62] is an improvement on the ESA. It uses the same up and down arrays as the ESA, but replaces the nextIndex array with two other arrays: lchild and rchild. Thus, the two arrays store information about siblings at a given node in the interval tree, such that the intervals can be represented by a complete binary tree. Specifi- cally, let the interval [i.. j] denote the lcp-interval for a given node in the lcp-interval tree (equiva- lently a node in the traditional suffix tree). Then, the two new arrays are defined as follows [61]: lchild[i] records the first index of the left child node of the longest interval starting at i in the complete binary tree; rchild[i] records the corresponding value for the right child. Like the ESA, the nature of the four arrays in the new child table used by the LST makes it possible to represent the relevant information in the table using only n integers rather than the 4n integers that ordinarily would be required. However, unlike the ESA, the LST uses a complete binary tree as the basic structure to represent information about sibling nodes at a given node. This important difference makes it possible for the LST to support pattern matches in O(mlog|Σ|) time, the same time bound for suffix trees. CHAPTER 2. RELATED WORK 15

2.3 Markov Models and Probabilistic Suffix Tree

2.3.1 Variable Length Markov Models

A Markov model is a sequence of stochastic events {Xn,n = 0,1,2,3...} with state space S that satisfies the Markov property:

P(Xn+1 = j|Xn = i,Xn−1 = in−1...,X0 = i0) = P(Xn+1 = j|Xn = i)

A Markov model (or Markov chain) of order L, where L is finite, is a sequence of events satisfying:

P(Xn+1 = j|Xn = i,Xn−1 = in−1...,X0 = i0) = P(Xn+1 = j|Xn = i,Xn−1 = in−1...,Xn−L = in−L)

Thus, in an order-L Markov chain, the current state is dependent on the past L states. Fixed length Markov models (FLMM) represent a probabilistic finite state machine which can be used to model arbitrarily complex sequential data. Such models aim at learning the probability, P(σ|C), the conditional probability distribution of a symbol σ, given its context C, where σ ∈ Σ, and C ∈ ΣL, L is the order or memory length of the model, and is fixed. The FLMM of order L is represented as a ΣL × Σ matrix. The space requirement is thus in O(ΣL+1).

Variable length Markov Models (VLMM) differ from FLMM in an important way. Vari- able length Markov models attempt to learn the conditional distribution of a symbol whereby the context length or model order could be varying, depending on the data being modeled. Es- SL i sentially, for VLMM, C ∈ i=1 Σ . This property of varying memory length implies that with the VLMM, Markov dependencies of varying order in the training data – both large and small – could be captured with ease. This flexibility of the VLMM however comes at a huge cost in terms of space requirement. To represent a VLMM of order L, we have to store L matrices, one for each order, from 1 to L. The total space requirement will be in O(Σ2 + Σ3 + ... + ΣL+1), or O(ΣL+2). The space is huge, even when L is small. Thus, the space requirement for Markov models is exponential in L, whether we consider fixed length or variable length models.

An important observation that could point to a potential reduction in the space requirement for Markov models is that for a given sequence of length n, there are n(n + 1)/2 possible sub- CHAPTER 2. RELATED WORK 16 strings in the sequence. Thus, there are at most n(n + 1)/2 states that can be represented in the Markov model for the sequence, for any given order. For the length-n sequence, the maximum order of a Markov model will be n − 1. Thus, with knowledge of the sequence, we can have a limit on the possible number of states in its Markov model.

2.3.2 Probabilistic Suffix Tree

The probabilistic suffix tree (PST) is a probabilistic suffix model which is based on the traditional suffix tree. Like the suffix tree, the PST represents all the n(n + 1)/2 substrings from the root to the leaf nodes. The PST models variable length Markov models, which means that the string depth is not fixed for every node. For an FLMM, its corresponding PST can be obtained from the PST of the VLMM by constraining each leaf node to be of the same string depth. The transition probability of a symbol on a given path is computed as the relative frequency of the symbol in the observed data, given the preceding substring on the path. The length of the substring used to determine such conditional probabilities is simply given by the memory length or order of the model.

Example PST and ST for a sample sequence T = accactact$ are given in Chapter 4 (see Figure 4.2).

The original algorithm [109] used an O(Ln2) time complexity to construct and prune the PST from a suffix tree. The improved algorithm [88] used balanced red-black trees [31] to construct the PST in O(Lnlogn) time complexity. Apostolico and Bejerano [9] presented an al- gorithm having an O(n) time complexity using suffix links and reverse suffix links, independent of L.

2.3.3 Computing TF and DF via Suffix Arrays

In [9, 109], the conditional probabilities required for the PST were computed as relative counts, using the notion of empirical probabilities, based on symbol frequencies from the ob- served data. To compute the empirical probabilities, we use the notions of term frequency and document frequency as used in information retrieval. The term frequency, TF is simply the CHAPTER 2. RELATED WORK 17 number of times a given term (or substring in our case) occurred in a given text. Here, the text could contain many documents. Therefore, the document frequency for a given term is the number of times the term occurred in the document. Thus, the term frequency is easily obtained from the document frequency as a simple sum. When the text contains only one document, the TF and DF will be the same.

There are n(n + 1)/2 substrings in a sequence of size n. Using a na¨ıve algorithm, we will need to compute TFs for all the n(n + 1)/2 terms. However, with the suffix tree for this sequence, we have n leaf nodes and at most n internal nodes. These 2n nodes represent the n(n+1)/2 substrings in the sequence. Therefore, some multiple substrings at different positions in the sequence will be represented by the same node. These substrings represented by the same node must have the same frequency count. Since the node labels are unique in the suffix tree, this means that the multiple substrings in the same node are essentially the same substring, that were repeated multiple times in the sequence. Thus, while there are O(n2) possible substrings in T, there are at most O(n) unique substrings. Hence, we need to compute the TF for only the 2n unique substrings.

Yamamoto and Church [131] presented a data structure to represent nodes in the suffix tree using a suffix array. They called this structure the interval array. Using the interval array, their algorithm calculated all the required TFs in O(n) time, but needed O(nlogn) time to compute the DFs. Our data structure is based on the interval array, and we will show how we can improve the algorithm for computing DF to O(n) time.

Interval Array Structure

From the suffix tree, we know we can cluster the potential n(n+1)/2 substrings into at most 2n ”groups”. In [131], the interval array was proposed as a data structure to represent these groups. Table 2.1 shows the interval array for sample sequence T = accactact$. From the table, we can describe some important properties of the interval array as follows.

For all substrings, we have the following:

1. There are at most 2n groups in the document of size n. Substrings in the same group have CHAPTER 2. RELATED WORK 18

the same statistics (for example: TF and DF) and the same derivative measurements from these statistics.

2. An lcp-delimited interval < i, j > is constructed using the LCP array, where < i, j > is an interval on the suffix array. An lcp-delimited interval < i, j > must meet the condition:

max(LCP[i],LCP[ j+1]) < Lengthgroup(i, j) ≤ min(LCP[i+1],LCP[i+2],...,LCP[ j]), where, Lengthgroup(i, j) is the LCP of the suffixes that belong to the same group with the < i, j > interval.

The lcp-delimited intervals have the following properties:

1. Each lcp-delimited interval represents one unique group of substrings.

2. The maximum length of an lcp-delimited interval < i, j > is given by: min(LCP[i + 1],LCP[i + 2],...,LCP[ j]).

3. A non-trivial lcp-delimited interval is one with a start position that is less than the end position. That is, the lcp-delimited interval < i, j > is non-trivial if i < j. Otherwise, if i ≥ j, the interval is said to be trivial. There are n trivial lcp-delimited intervals with TF=1. There are at most n − 1 non-trivial lcp-delimited intervals with TF > 1

4. The lcp-delimited intervals for a document can form a nested structure of intervals, but no two lcp-delimited intervals can overlap. Thus, the lcp-delimited intervals can be rep- resented in a tree-like structure.

5. Let α and β be two substrings in the same lcp-delimited interval < i, j >. Then, the following two conditions hold:

Table 2.1. Interval Array Interval LCP Term Freqency < 2,3 > 3 2 < 1,3 > 2 3 < 6,7 > 2 2 < 4,7 > 1 4 < 8,9 > 1 2 CHAPTER 2. RELATED WORK 19

• TF(α) = TF(β) = j − i + 1 • DF(α) = DF(β)

Below we consider algorithms for computing TF and DF.

Determining the lcp-delimited intervals

Computing the term frequency (TF) is almost analogous to determining the lcp-delimited in- tervals. Given the lcp-delimited interval < i, j >, the TF algorithm in [131] uses the relation TF(< i, j >) = j − i + 1 to determine the term frequency for the interval. The time complexity of the algorithm is O(n), using O(n) space.

We observe that there are at most n neighboring LCP pairs (i.e. LCP[i] and LCP[i + 1]) in a suffix array of size n. Thus, there are at most n increasing orders in such neighboring pairs. Similarly, there are at most n decreasing orders between the neighboring pairs. A decreasing order between neighboring pairs LCP[i] and LCP[i + 1] implies that there exists at least one lcp-delimited interval < k,i >, where k ≤ i. Using the foregoing and the fact that we have at most 2n lcp-delimited intervals in a document of size n, Yamamoto and Church [131] computed the term frequency. Given that the lcp-delimited intervals could be nested, a stack can be used to hold the information on the intervals.

Assuming we have m decreasing neighboring pairs in an LCP array of size n. Let the interval between each pair be: n1,n2,...,nm. Hence, there are n1 +1+n2 +1+n3 +1,...ni +1+ m ...nm + 1 = m + ∑i=1 ni ≤ m + n ≤ 2n lcp-delimited interval. When we find the ith decreasing neighboring pair, we will output the ni intervals. Thus, determining the lcp-delimited intervals can be done in O(n) time. This linear time complexity can be compared with the work reported in [2], where they constructed a similar structure (the LCP-interval tree) in O(nlog|Σ|) time.

Computing the Document Frequency (DF)

Determining the document frequency (DF) is more difficult than computing the term frequency (TF). In [131], an algorithm was given to calculate DF in O(nlogn) time and O(n) space. The CHAPTER 2. RELATED WORK 20 algorithm (reproduced below in Figure 2.1 ) uses the procedure described above to get the lcp- delimited intervals. It uses a new array to map the first symbol of the substring to a document id based on the order of the suffix array index. Using a simple algorithm, we could search in each interval to determine the document frequency. This will however be too expensive in time cost, leading to time in O(n2). Given that lcp-delimited intervals could be nested, we only need to check in the extra range over the calculated range. So the core of the algorithm is to check whether the document id of the current position has been previously computed. We reproduce the DF algorithm proposed in [131] below (see Figure 2.1). In Chapter 4, we modify this algorithm for an improved time complexity.

Figure 2.1. Algorithm for computing document frequency [131] CHAPTER 2. RELATED WORK 21

2.4 Circular Pattern Matching

2.4.1 String Pattern Matching

The pattern matching problem is to find the occurrences of a given pattern in a given text string. This is an old problem, which has been approached from different fronts, motivated by both its practical significance and its algorithmic importance. Matches between strings are determined based on the string edit distance. Given two strings T : t1...tn and P : p1...pm, over an alphabet Σ, the edit distance indicates the minimum member of edit operations which transform one string into other string. There are three basic edit operations: insertion of a symbol, deletion of a symbol, and substitution of a symbol with another symbol. We assume that the cost of each edit operation is unity. If two characters are identical, the cost of the match operation is zero. The substitution operation can also considered as a mismatch operation. The exact string matching problem is to look for all occurrences of a pattern matching a substring of the text with zero edit operations. Various algorithms have been proposed for both exact and approximate pattern matching [3, 11, 20, 34, 46, 48, 64, 117].

Grossi and Vitter [45] pointed out that a full-text indexing system is expected to be able to support three basic types of queries, existential query: returns a binary value (true or false) indicating whether a pattern, P occurs in the text T; counting query: returns nocc, (0 ≤ nocc ≤ n), the number of occurrences of P in T; and enumerative query: returns nocc numbers, each indicating the starting position in T, of an occurrence of P [3].

Exact String Matching

Three well-know efficient string matching algorithms with linear time complexity are the Knuth- Morris-Pratt(KMP) algorithm [64], Karp-Rabin algorithm [59] and the Boyer-Moore(BM) al- gorithm [20]. Like KMP, the BM algorithm matches the pattern and the text by skipping char- acters that are not likely to result in exact matching with the pattern. Unlike the other methods, it compares the strings from right to left of the pattern. These algorithms need an O(m) prepro- cessing for the pattern and search in O(n) or sometimes even sublinear in practice. The total time will be O(n + m). CHAPTER 2. RELATED WORK 22

A different approach to pattern matching based on bitwise operations was introduced by R. Baeza-Yates and G. Gonnet [13]. Here, the pattern is represented by a binary mask. Bit-wise SHIFT and AND operations that are considered constant time are used to find the patterns. Under this framework, SHIFT and AND correspond to the pattern movement and matching respectively. The algorithm is effective for small patterns, when the pattern length is less than a computer word (say, 64 characters), which is usual for the text searching problem.

When multiple patterns need to be searched, alternative algorithms are used, such as the Aho-Corasick algorithm [5] which used a keyword tree for the set of patterns. In addition to multiple pattern matching, the suffix tree algorithm [46] is efficient when the pattern will be searched multiple times. There are linear time suffix tree construction algorithms available [89, 122, 128]. The search time for each pattern will be O(m). The total time for searching s patterns will thus be O(n+sm). This can be compared with O(m+sn) that would be needed by algorithms such as KMP and BM.

Approximate String Matching

Algorithms for the approximate pattern matching problem can be grouped into three major categories: methods based on dynamic programming [114]; methods based on bit-wise oper- ations [130]; and methods based on the longest common subsequence (LCS) [68]. The edit distance can be computed using dynamic programing. The path that leads to the minimal dis- tance can easily be identified by adding trace-back pointers. The time complexity for computing the edit distance is generally in O(mn). When the number of allowed errors k is known, the edit distance can be computed in O(kn), for example using Ukkonen’s deterministic finite state au- tomaton (DFA) approach [121]. Typical approximate matching algorithms based on bit-wise operations are AGREP [130] and NRGREP [95]. These are obtained by extending the bit-wise exact matching algorithms. Another variation of pattern matching with errors is the k-mismatch problem, where only substitution operations are considered during the edit distance computa- tion. In [68], the k-mismatch problem is solved using the suffix tree so that the LCS query can be answered in constant time. To find all the k-mismatch patterns of P in text T, we perform k-mismatch checks for every alignment of P, each time starting at different character position in T. CHAPTER 2. RELATED WORK 23

Dynamic programming is the most popular algorithm in the approximate string matching (ASM). From the classical searching for longest common subsequence (LCS) [31] to multiple strings alignment [25, 125]. It uses divide and conquer method to break a complex problem into subproblems by solving the subproblems first and then integrating the solutions to get the optimal answer. The search time is O(mn) and its space requirement is O(m). This algorithm can be improved to achieve O(√kn ) [26] on average. Smith-Waterman algorithm is one of |Σ| the most famous algorithms in this category which was first proposed by Temple F. Smith and Michael S. Waterman in 1981 [116] for local alignment. Needleman-Wunsch [96] algorithm is a general global alignment technique used in bioinformatics. Both of these algorithms are based on dynamic programming. Smith-Waterman [116] algorithm guarantees finding the optimal solution for local alignment and Needleman-Wunsch [96] algorithm finds the optimal solution for global alignment.

2.4.2 Circular Pattern Matching Problems

∗ ∗ t A circular shift is a mapping f : Σ → Σ , f (c1...cr) = ct+1...crc1...ct, where 0 ≤ t ≤ r − 1 0 and r is the length of string c1c2...cr. Thus, f (c1...cr) corresponds to the original string. Let [s] be a set of circular shifts of string s, then [s] = { f i(s)|0 ≤ i ≤ |s| − 1}. Given two circular strings s1 and s2, the edit distance between s1 and s2, ED(s1,s2), is the minimum number of edit operations needed to transform one member of [s1] to one member of [s2]. This is defined as i j EDc(s1,s2) = min{ED[ f (s1), f (s2)]|0 ≤ i ≤ |s1| − 1 and 0 ≤ j ≤ |s2| − 1}, where |s1| is the length of string s1 and |s2| is the length of string s2, ED[A,B] is the standard edit distance. We consider two major problems related to circular pattern matching (CPM) defined as follows.

Problem 1: Exact circular pattern matching (ECPM). Given one circular pattern P = P[1...m] and the text T = T[1...n], return all occurrences of circular string [P] and its circu- lar shifts inside text T without any error. [P] is a match to text T at position j ∈ [1...n − m + 1] ⇔ f t(P) = T[ j... j + m − 1], for some t, 0 ≤ t ≤ m − 1.

Problem 2: Approximate circular pattern matching (ACPM1). Given text T = T[1...n], circular pattern P = P[1...m] and maximum error k, return “Matching” when the edit distance between text T and circular pattern P is less or equal to k. Thus, the result will be the matching

pair g = {(T,P)|EDc(P,T) ≤ k,s.t. − k ≤ n − m ≤ k} . CHAPTER 2. RELATED WORK 24

This problem uses the existential query to look for circular pattern matches between text T and circular pattern P, where −k ≤ n − m ≤ k. This problem compares two sequences with circular edit distance less than k.

We also consider a harder variation of the ACPM problem with the extension −k ≤ n − m. This variation is to find circular permutations of P inside T. This problem compares circular pattern P and substring of T with circular edit distance less than k. We define this variation more formally as Problem 3 below:

Problem 3: Approximate circular pattern matching problem 2 (ACPM2)

Given text T = T[1...n], circular pattern P = P[1...m] and maximum error k, return all positions where the circular string [P] matched text T with at most k errors. [P] is said to be a k-approximate match with text T at position j ∈ [1...n −m −k +1] i f EDc(P,T[ j... j +m]) ≤ k, where 0 ≤ t ≤ m − 1,−k ≤ n − m.

Comparing with the ECPM problem, the ACPM2 problem looks for all approximate matches within the given maximum error k. The ACPM1 problem looks for an approximate match be- tween two whole strings text T and circular pattern P, but the ACPM2 problem looks for all approximate matches between every substring of text T and the circular pattern P and its circular shifts. The ACPM2 problem is the hardest of these three problems.

An extention of the CPM problem is the All-Against-All variant.

Problem 4: All-Against-All CPM problem

Given SeqDB, the database of sequences, the All-Against-All CPM problem is to compare each sequence in the database with every other sequence in the database for possible circular matches.

Similar to the standard CPM problem, we can also consider the ECPM and ACPM versions for the All-Against-All CPM problem. CHAPTER 2. RELATED WORK 25

2.4.3 Exact Circular Pattern Matching (ECPM)

Given a pattern P and a text T, the exact circular pattern matching problem is to find the position in T which matches a circular permutation of P. The exact circular pattern match- ing problem was first studied by Booth [19] in 1980. He proposed an algorithm to detect the lexicographically smallest conjugate of a word. Improved methods were proposed in [10, 36]. However, the focus of the algorithms was on the canonical rotation(s) of a word. ECPM was a particular case in that problem. The swap pattern matching problem [7] is related to the ECPM problem. It looks for an exact match of a swapped pattern P in text T. We can see that the ECPM problem is a particular case of the swap pattern matching problem. Gusfield [46] discussed the ECPM problem as an end-of-chapter exercise but did not provide an explicit so- lution. Shiloach [115] provided an algorithm for the ECPM problem. However both only solve the online version of the ECPM problem which does not use indexing to preprocess the text T.

Our work is perhaps more closely related Iliopoulos et al. [51] who proposed two algo- rithms that solve the ECPM problem using indexing. These two algorithms are based on suffix structures and the time complexity are O(mloglogn + nocc) and O(mlogn + nocc) respectively, where nocc is the number of occurs.

The first algorithm builds a new data structure CPI-I to index circular patterns. Two steps are used to find the circular pattern. First, the algorithm will index the text T in two suffix trees ¯ ¯ STT and STT¯ where T is the reverse order of T. It also maintains two list LL(R) and LL(R) which are linked lists of all the leaf nodes from left to right in STT and STT¯ respectively. Secondly, the algorithm will search two parts Q1,Q2 of each permutation of pattern P in STT and STT¯ respectively. STT returns the occurrences of Q1 and STT¯ returns the occurrences of Q2. The algorithm finds the intersection of these two sets by using the two linked list LL(R) and LL(R¯).

The construction time and space complexity for this algorithm is O(nlog1+ε n), where 0 <

ε < 1 and n is the length of text T. The query time complexity is O(mloglogn + nocc).

The second algorithm used another new structure CPI-II to address the ECPM problem. The data structure CPI-II is constructed by the suffix array SA, inverse suffix array SA−1 and arrays Pre and Su f , where Pre is an array for the prefix of pattern P and Su f is an array for the suffix of pattern P. There are three steps involved. First, compute the interval of prefix CHAPTER 2. RELATED WORK 26 of pattern P into array Pre and the interval of suffix of pattern P into array Su f . There are m prefixes and suffixes in P. For each prefix or suffix, it is calculated using the previous prefix or suffix, so the time complexity is O(logn) using the suffix array SA and inverse suffix array SA−1. Secondly, for each circular permutation pattern which can be constructed by P[m − i]P[i] where 1 ≤ i ≤ m, the algorithm finds the intervals of P[m − i] and P[i] in Su f [m − i] and Pre[i] separately. Thirdly, output the intersection of the intervals of P[m − i] and P[i].

The space complexity are O(n) bytes for this algorithm implemented using in the suffix array. The time complexity for answering a query is O(mlogn + nocc). When implemented using the compress suffix array [3,45], its space complexity will be O(nlogn) bits, but the time 2 complexity for queries increases to O(mlog n + nocc).

2.4.4 Approximate Circular Pattern Matching (ACPM)

The ACPM problem is to find k-approximate matches between circular pattern [P] and text T. The na¨ıve method for the ACPM problem is to use each of circular strings f t(P) to calculate the edit distance between T and f t(P), where m is the length of pattern and 0 ≤ t ≤ m − 1. Thus the dynamic programming procedure will be run m times. The time complexity of a na¨ıve algorithm to compute ED([P],T) is O(m2n).

Maes [81] published a “divide and conquer” algorithm to compute ED([P],T) in O(mnlogm). Up to now, this is the best theoretical result for computing the edit distance between a circular pattern and a text. Given the significance of Maes’ algorithm, we present the details below.

In theory, Maes [81] algorithm is the best algorithm. It uses “divide and conquer” to cal- culate the edit distance by using a dynamic program table. This algorithm constructs an edit graph between text T and string PP (Figure 2.2 [87]), where PP is a concatenation of pattern P to itself. In this edit graph, let path(x1,y1)−(x2,y2) be a path from vertex (x1,y1) to vertex (x2,y2). For each vertex (x,y) on this path, we have x1 ≤ x ≤ x2,y1 ≤ y ≤ y2. The edit distance between T and f i[P] in the subgraph can be computed by following a path from vertex (0, i) to vertex (n, i+m-1), where 0 ≤ i ≤ m − 1. Let Pathi be an optimal edit path between f i(P) and T in the edit subgraph. That is a path of minimum cost from vertex (0, i) to vertex (n, i+m-1). Maes’ algorithm is based on an important observation on the edit graph: if Pathi and Path j are each CHAPTER 2. RELATED WORK 27 an optimal edit path, then Pathi and Path j can not cross each other, where 0 ≤ i < j ≤ m − 1. We can see that when Pathi and Path j has a crossing point say at (x,y), then edit distance of j path(0,i)−(x,y) is less than edit distance of path(0, j)−(x, y), whenever i < j. Thus Path is no longer an optimal path. The set of paths {Patht|i < t ≤ j} do not have optimal path, because each Patht has to cross Pathi at the point (x,y).

Figure 2.2. Edit Graph of T and PP [87].

Based on the above observations, the algorithm calculates Pathi and Path j first, where i < j. If two paths do not cross each other, that means there is an optimal edit path between Pathi and Path j. Next, they calculate the path Patht in-between Path j and Path j, where i < t < j. In this case, the time for calculating Patht is O(( j − i) × n). When Pathi and Path j cross each other, we do not need to calculate the path set {Patht|i < t ≤ j} anymore, because these is no optimal path between them.

Following this idea, the algorithm calculates all optimal paths starting from Path0 and Pathm, where Pathm is the optimal path from (0, m) to (n, 2m − 1), and Path0 and Pathm are parallel paths. Figure 2.3 illustrates this algorithm. The step (1) of Figure 2.3 shows this with time cost of O(mn). In the second step(step (2) of Figure 2.3), the algorithm computes m 0 m the optimal path Path 2 between Path and Path with time complexity of O(mn). In the third m 3m step (step (3) of Figure 2.3), two optimal paths Path 4 and Path 4 are computed. Time cost for mn computing each path is O( 2 ), hence time complexity of this step is O(mn) too. The step 4 m 3m 5m 7m (step (4) of Figure 2.3) calculates four paths Path 8 ,Path 8 ,Path 8 and Path 8 . Time cost of CHAPTER 2. RELATED WORK 28

mn computing each path is O( 4 ), so time complexity of this step is O(mn) too. And so on and so forth, there are O(logm) steps and time cost of each step is O(mn). Even when there are some crossing points, the time for calculating each step is still O(mn). Thus the total time complexity is O(mnlogm). After getting all optimal paths, the minimum edit distance between text T and circular pattern P can be computed.

Figure 2.3. Maes’ Algorithm

Gregor et al. [44] gave a O(m2n) algorithm, however, this is a data-dependent algorithm. In practice, the algorithm may reach to O(mn) time complexity on average. Oncina [98] presented an algorithm which has the same time complexity as Gregor et al. [44]. Marzal et al. [87] provide a branch and bound algorithm which is based on Maes [81] algorithm. The worst case time complexity is the same as Maes algorithm, but with more efficient time complexity on average.

The above methods all produce complete results in calculation of the circular edit distance. Some studies [22, 90–92] also present suboptimal algorithms with reduced time complexity that runs on O(mn), but with the possibility of missing some results. Bunke and Buhler [22] presented a suboptimal algorithm whose time complexity is O(mn). Mollineda [90–92] pub- lished two algorithms based on Bunke algorithm [22] and showed in an experiment that the suboptimal solution is almost as good as the optimal counterparts. CHAPTER 2. RELATED WORK 29

We note that all the above methods on the ACPM problem have only considered the ACPM1 variant. To our knowledge, there has been no published work addressing the more challenging ACPM2 problem.

2.4.5 ACPM Problem in Protein Sequences

A number of studies have been reported on algorithms for detecting circular permutations for protein sequences [40,47,55]. The first method [54] used the dot matrix and human visual- ization to identify circular relationships between protein sequence pairs. The work in [6] used a dictionary method to find short fragments common to the protein sequence pairs and used human visualization to report the best local matches.

Needleman et. al [96] proposed a method for global alignment between two protein se- quences. The global alignment algorithm measures the number of edit operations (insertion, deletion, and substitution) for transforming one sequence to another sequence. Uliel et al. [123, 124] introduced a method to detect circular permutations in protein sequences using global alignment [96]. They gave an O(m3) time complexity algorithm to find the complete set of matching circular permutations. They also proposed a greedy algorithm in O(m2) time com- plexity, but which could miss some valid circular permutations in the text T. Weiner et al. [126, 127] proposed another greedy method that runs in O(m2) time complexity. They focused on circular multidomain proteins, where the alphabet are now the protein domain blocks, rather than traditional protein symbols. Thus, |Σ| could be quite large, of the order of 20q, where q is the length of the domain blocks. This is the first application of the CPM problem in studying multidomain proteins. However, they did not consider the problems posed by the expanded alphabet.

The algorithm of Uliel et al. [123, 124] used a simple method that calculated the edit dis- tance [72] between text and one of the circular permutations using the Needleman and Wunsch algorithm [96]. This was repeated m times for all the circular permutations, with O(m3) time complexity and O(m2) space complexity for each sequence as a pattern P against on the other protein sequences. The greedy algorithm of Uliel et al. [123, 124] modified the local align- ment algorithm to find the best local alignment in a 2m × n matrix. This algorithm is similar to the Smith-Waterman local alignment algorithm [116]. It is not guaranteed to find all circular CHAPTER 2. RELATED WORK 30 matches, and thus may miss some valid matches. The algorithm of Weiner et al. [126, 127] is also a greedy algorithm and thus could miss some valid matches. They concatenated the text T as TT and the pattern P as PP, and thus constructed a 2n×2m matrix using the Needleman and Wunsch algorithm [96]. At the verification phase, the circular matching condition must satisfy certain conditions defined on the 2n × 2m matrix [126, 127].

More fundamentally, both groups [123, 124, 126, 127] that have studied CPM in protein sequences have focused on whole sequence comparison with another whole sequence. In their experiments, they have to group the protein sequences based on their specified lengths, and used the dissimilarity in lengths for initial pruning. These methods ignored the fact that a shorter circular protein sequence could be part of the functional region of a much larger multidomain protein. This, however, could be a key consideration in function prediction for multidomain proteins. Further, as with the more theoretical algorithms for the ACPM problem, the methods for protein sequences [123, 124, 126, 127] also only considered the ACPM1 problem.

2.5 Pattern Discovery Problem

Pattern discovery is a well studied problem in computational biology and data mining, and various methods have been proposed. The basic method is to identify short sequences that tend to be over-represented within a given set of sequences. Mining sequential patterns was studied in [30, 132]. Motif discovery methods in bioinformatics are surveyed in [112]. Algorithms for discovery of proximity patterns were proposed in [12]. Proximity pattern discovery is closely related to the more recent notion of ”complex motif”, which is defined as a composite motif whereby the individual components are constrained to be within a specified seperation distance. Perhaps, a more closely related work is the method of pattern discovery using mutable permu- ation patterns [49]. However, although permu-patterns offer a lot of flexibility in the match, ignoring the order of the patterns still does not handle the problem of possible cyclic relations between patterns. Most efforts in pattern discovery have been invested in studing the statistical significance of the patterns (see for e.g. [85,86]), and the biological relevance of the discovered patterns, in the case of biological applications [112]. There has not been much attention on the pattern matching problem involved, which forms the basis of pattern discovery. Chapter 3

The Virtual Suffix Tree

3.1 Introduction

Our proposed data structure is most closely related to the ESA and LST. The virtual suffix tree can be constructed in the same time and space bounds as the suffix tree. It also supports basic search operations in the same time and space bound as the suffix tree. However, the VST requires a much smaller practical space than the suffix tree. The space requirement (12.05n bytes using the compact form) is generally smaller than that of ESA and LST, the other closely related data structures (each requires 20n bytes). Other related data structures that have been proposed include the suffix cactus [56], suffix vectors [93, 103], compact suffix trees [82], the lazy suffix trees [41], level-compressed suffix trees [8], compressed suffix trees [94], and com- pressed suffix arrays [45]. See also [3].

Main results1 We introduce a new data structure, the virtual suffix tree (VST), an efficient data structure for suffix trees and suffix arrays. The VST neither stores the lcp array nor the lcp-intervals, but rather exploits the inherent nature of the suffix tree topology. We state our main results in the form of two theorems about the VST. 1Part of the work reported in this chapter has been published in the following papers: [78, 79]

31 CHAPTER 3. THE VIRTUAL 32

Theorem 3.1: Given a string T = T[1..n], with symbols from an alphabet Σ, and the virtual suffix tree for T, we can count the number of occurrences of a pattern P = P[1..m] in T in

O(mlog|Σ|) time, and locate all the ηocc occurrences of P in T in O(mlog|Σ| + ηocc) time.

Theorem 3.2: Given a string T = T[1..n], with symbols from an alphabet Σ, the virtual suf- fix tree, including the suffix links, can be constructed in O(n) time, and O(n) space, independent of Σ.

Essentially, the VST provides the same functionality as the suffix tree, but at a much smaller space requirement. It has the same linear time construction for large |Σ|, requires O(n) space to store, and allows searching for a pattern of length m to be performed in O(mlog|Σ|) time, the same time needed for a suffix tree. To provide the complete functionality of the suffix tree, we describe a simple linear time algorithm that computes the suffix links based on the VST. We present two algorithms for VST construction. The first algorithm builds the VST from the suffix tree, which in turn is generated from the suffix array. The second algorithm eliminates the need for the suffix tree construction step, and thus builds the VST directly from the suffix array. Although the space needed for the VST is linear (as in suffix tree implementations using linked lists or binary trees), the practical space requirement is much smaller than that of a suffix tree. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the ESA [2], and the LST [62]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form. This can be compared with the 20n bytes needed by the LST or the ESA.

Organization In Section 2, we introduce the basic data structure and discuss the properties of the VST. Section 3 presents an improved data structure, along with algorithms for its con- struction. A complexity analysis on the construction and use of the VST is also presented in this section. Section 4 shows how the suffix link can be constructed on the VST. In Section 5, we eliminate the need to construct the suffix tree, and show how the VST can be constructed directly from the suffix array. We make the summary in Section 6. CHAPTER 3. THE VIRTUAL SUFFIX TREE 33

3.2 Basic Data Structure

Starting from the suffix array, we construct an efficient data structure to simulate the suffix tree (ST). We call this structure a virtual suffix tree (VST). The VST stores information about the basic topology of the suffix tree, the suffix array, and the suffix links. Thus, the VST is represented as a set of arrays that maintains information on the internal nodes of the suffix tree. The leaf nodes are not stored directly. However, whenever needed, information about any leaf node can be obtained via the suffix array. Unlike the ESA and LST, the VST neither uses the lcp-interval tree nor stores the lcp array. We call the data structure a virtual suffix tree in the sense that it provides all the functionalities of the suffix tree using the same space and time complexity as a suffix tree, but without storing the actual suffix tree. Later, we show that the VST leads to a more compact representation of suffix trees and suffix arrays. (We mention that [60] also used the term ”virtual suffix tree”, but for a limited form of the enhanced suffix array).

Below, we present the basic VST. This structure will require 14 bytes for each node in the VST and supports pattern matching in O(mlog|Σ|) time, for an m-length pattern. In the next section, we present an improved data structure that reduces the space cost by eliminating the need to store edge lengths, while still maintaining O(mlog|Σ|) time for pattern matching. We also describe a more compact structure for the VST that uses only 10 bytes for each internal node of the VST, and 5 bytes for each leaf node. Pattern matching on this compact representa- tion will, however, be in O(m|Σ|) time.

Each node in the VST corresponds to a distinct internal node in the suffix tree. In its basic form, each node in the VST is characterized by five attributes. For a given node in the VST (say node u), with a corresponding internal node in ST (say node uST ), the five attributes are defined as follows.

• sa index: index in the suffix array (SA index) of the leftmost leaf node under the

internal node uST of the suffix tree.

• fchild: the node ID of the first child node of uST that is also an internal node. (Scanning is done left to right; edges at a node are also sorted left to right in ascending lexicographic CHAPTER 3. THE VIRTUAL SUFFIX TREE 34

order). If node u is a leaf node in the VST, the value will be negative. The absolute value will point to the first child node of the next internal node in the VST.

• elength: The edge length of the edge (v,u) in the VST, or equivalently (vST ,uST ) in the suffix tree, where v is the parent node of u and vST is the parent node of uST .

• nfleaf: the number of child leaf nodes before the first child of uST that is also an internal node.

• nnleaf: the number of sibling leaf nodes after uST , the current internal node of the suffix tree, but before the next sibling internal node.

In terms of storage, the sa index, fchild and elength each requires one integer (4 bytes), while nfleaf and nnleaf each requires one byte of storage (assuming |Σ| ≤ 256).

3.2.1 Example VST

We use an example sequence to explain the above definitions. The suffix tree and VST for the string missississippi$ are shown in Figure 3.1. Note that the string missississippi$ is made intentionally different from mississippi$, to capture some of the cases involved in a VST. Only the internal nodes (dark nodes) are explicitly stored in the VST. The leaf nodes (empty circles) are not stored. The order of storage is based on the node-depths, from top to bottom. Table 3.1 shows the corresponding values of the VST node attributes for each VST node in the example.

3.2.2 Properties of the Virtual Suffix Tree

We can trace the properties of the VST based on the standard properties of a suffix tree.

1. The VST only stores the internal nodes of the suffix tree. No leaf nodes in the ST are represented in the VST. Information about the leaf nodes can be obtained from the SA when needed. Then the space requirement of the VST depends on the topology of the the suffix tree, or more specifically, on the number of internal nodes. CHAPTER 3. THE VIRTUAL SUFFIX TREE 35

2. The number of leaf nodes in a suffix tree is n. The number of internal nodes in the suffix tree (and hence number of nodes in the VST) is at most n.

3. The VST stores only the SA index of the leftmost leaf nodes and information about the child nodes.

4. For a given node in the VST, the number of child nodes will be no larger than |Σ|. Thus, the time needed to match a symbol is at most O(log|Σ|).

5. The nodes in the VST are ordered based on the internal nodes of the suffix tree using the hierarchy sequential access method (HSAM). The child nodes from any given node will be stored sequentially. The child nodes of two nearby nodes will therefore be stored in nearby locations. This is an important property for addressing problems involving locality of reference.

We introduce further definitions needed in the description below. For a given node u in the VST, we use the term prior node to denote the node that appears before the current node u in the HSAM ordering. Similarly, next node denotes the node that appears after the current node u in this ordering. We use lsa index (left sa index) to denote the SA index of the leftmost leaf node that is a descendant of u. Similarly, rsa index (right sa index) denotes the rightmost leaf node that has u as its ancestor. Figure 3.2 shows an example.

It is simple to determine the lsa index and the leftmost child node of any given node. The properties of the VST and the organization of the VST lead to the following lemma about the VST:

Table 3.1. VST node attributes for the example sequence T = missississippi$ used in Figure 3.1. node root N1 N2 N3 N4 N5 N6 N7 N8 N9 sa index 0 1 7 9 3 9 12 4 10 13

fchild N1 N4 −N5 N5 N7 N8 N9 elength 0 1 1 1 3 1 2 3 3 3 nfleaf 1 2 2 0 1 1 1 2 2 2 nnleaf 0 1 0 0 0 0 0 0 0 0 CHAPTER 3. THE VIRTUAL SUFFIX TREE 36

Lemma 1: For a given node in the VST, its rightmost child node, and the right sa index can each be determined in constant time.

Proof: Let u be the current node in the VST, with parent node v. Let w be the next node in the HSAM ordering. By property 5, if w is an internal node in the VST, the rightmost child (rchild) node of u will be the prior node to the leftmost child node of w. If w is a leaf node in the VST, then w.fchild will point to the next node after u’s rightmost child node. Then the time to determine the rightmost child node will be O(1).

For the right sa index, if u has a next sibling node, say w, the right sa index of u will be the left sa index of this sibling node w minus the nnleaf of u. If u does not have a next sibling node, the right sa index of u will be the right sa index of node v (u’s parent) minus the nnleaf of u. That is,

( w.sa index − u.nnleaf − 1 :u has a next sibling, w u.rsa index = v.rsa index − u.nnleaf :otherwise (3.1)

Thus the time required to determine the right sa index is O(1). 

3.2.3 Pattern Matching on VST

Lemma 1 provides an indication of how pattern matching can be performed on the VST. For pattern matching using the suffix tree, an important issue is how to quickly locate all the child nodes for a given internal node. In the VST, each node points to its leftmost leaf node using the sa index. During pattern matching, at any given node in the VST, we will need to determine four parameters, namely the leftmost child node (lchild), the rightmost child node (rchild), the left sa index (lsa index) and the right sa index (rsa index). These parameters define the boundaries of the search at the given node. To search in a leaf node of the VST, we will need only the left sa index and right sa index of the node. When we search in an internal node, we will need all the four parameters to match a pattern. Lemma 1 shows that for any given node, we can determine each of these parameters in constant time. The following examples illustrate the two cases involved in computing the rsa index, and how CHAPTER 3. THE VIRTUAL SUFFIX TREE 37 pattern matching can be performed on the VST.

Example: Determining the right boundary from a next sibling node. Consider node N5 in Figure 3.2. The left sa index of N5 is 9 and the right sa index is 11, since N5.sa index=9 and N5+1.sa index=12, and hence the right sa index of N5=12-1=11. The leftmost child node is the fchild of the current node, thus the leftmost child of N5 is N8. The next node of the rightmost child node is N5+1.fchild=N9. Then the rightmost child node is N9−1=N8, since the child node will be stored side by side between sibling nodes.

Example Determining the right boundary from the right boundary of the parent node.

Consider node N1 in Figure 3.2. The left sa index of N1 is N1.sa index=1. The right sa index of N1 is N2.sa index -(N1.nnleaf -1)=7-1-1=5. The leftmost child node of N1 is N1.fchild=N4. The next node of N1 is N2. Since N2.fchild=-N5 is negative, N2 must be a leaf node in the VST. We therefore know that the next node of the rightmost child node of N1 will

be N5. Finally, the rightmost child node of N1 can be determined as N5−N1.nnleaf = N5−1 = N4.

We summarize the foregoing discussion as the first main result of this work:

Theorem 3.1: Given a string T = T[1..n] of length n, with symbols from an alphabet Σ, and the virtual suffix tree for T, we can count the number of occurrences of a pattern P = P[1..m] in T

in O(mlog|Σ|) time, and locate all the ηocc occurrences of P in T in O(mlog|Σ| + ηocc) time.

Proof: The theorem is a consequence of Lemma 1. First consider the cost of one single symbol- by-symbol comparison at a node in the VST. The number of child nodes at any internal node can be no larger than |Σ|, and we can find the boundaries of the search in constant time. Since the edges are ordered lexically at each internal node, and given the HSAM ordering, matching a single symbol can be done in O(log|Σ|) time steps using binary search. To find the first match, we need to consider the m symbols in the pattern. We perform the above symbol-by-symbol comparisons at most m times to decide whether there is a match or not. After a match is found, we can again use binary search (using lsa index and rsa index as bounds) to determine

all the ηocc occurrences of the pattern. Reporting each occurrence can be done in constant time, or an additional ηocc time for all the occurrences.  CHAPTER 3. THE VIRTUAL SUFFIX TREE 38

3.3 Improved Virtual Suffix Tree

The basic data structure introduced above stores the length of each edge in the VST. We can improve the structure to reduce the space requirement by avoiding the need to store information about the edge lengths directly. The improved data structure has only four attributes rather than five. The attributes sa index and elength in the basic structure are now combined into one attribute called the adjusted SA index (asa index). This requires a key modification to the suffix tree, leading to an important distinction between the suffix tree and the virtual suffix tree.

3.3.1 Adjusting Edge Lengths

A well-known property of the suffix tree is that no two edges out of a node in the tree can start with the same symbol. For efficient representation of the VST, this characteristic of the ST is modified such that, for a given node, every edge that leads to an internal node in the VST has an equal length. This modification is done as follows: Start from the root node and progress towards the leaf nodes in the VST. For a given internal node, say u, adjust the edge label from u to each of its children such that all edges that lead to an internal node will have the same edge length. The major criteria is that, for two sibling internal nodes, their edge labels differ only in the last symbol. If for some edge, say (u,w), the original edge length (or edge label) is longer than the new length, prepend the extraneous part of old label(u,w) to each outgoing edge from w. The edge length for edges that lead to leaf nodes are left unchanged. Then repeat the adjustment at each child node of u. Figure 3.3 shows an example of this procedure. Observe that this adjustment only affects the edge lengths, and does not change the general topology of the suffix tree.

The above adjustment procedure leads to an important property of the VST:

Property: In the improved VST, all internal sibling nodes occur at the same node-depth, and same string-depth, and the edge labels for the edges from the parent to each sibling differ only in the last symbol. This means that, in the VST, two branches from the same node can start with the same symbol, but their edge labels will differ.

This property provides an important difference between the suffix tree and the VST. The CHAPTER 3. THE VIRTUAL SUFFIX TREE 39 suffix tree mandates that no two edges from the same node have the same starting symbol. Further, the suffix tree only guarantees that the node-depth of two sibling nodes are the same, but not their string depth. This property of equal-length sibling edge labels is the key to more efficient representation of the VST, without explicit edge labels. Figure 3.4 shows an exam- ple of the modified suffix tree with equal-length edges for sibling nodes that are also internal nodes, and the corresponding improved virtual suffix tree. Table 3.2 shows the corresponding values of the attributes for each node in the improved VST. What remains is how we compute asa index, the adjusted SA index. This is done by combining the original sa index with elength.

Lemma 2 : Given a node in the VST say u, and its parent node (say v), we can compute the adjusted SA index in constant time. Further, when required, the edge length can be determined in constant time.

Proof: Computing the adjusted edge length (new elength) and the adjusted SA index (asa index) can be done using the following relations:  u.new elength + u.sa index : u = v.fchild and  u.asa index = u.new elength 6= 1 (3.2)  u.sa index : otherwise

u.sa index = u.fchild.sa index − u.nfleaf (3.3)

At time of VST construction, we calculate asa index from bottom to top. For leaf nodes in the VST, we already know the sa index and new elength, then we can calculate the asa index from Eqn (3.2). When the node u is an internal node in the VST, we first obtain u.sa index from Eqn (3.3) since we know u.nfleaf and u.fchild.sa index. Then we determine u.asa index from Eqn (3.2). The new edge length is not stored explicitly in the VST nodes, but can be computed in constant time whenever needed (for instance, during pattern matching) by simply changing the subjects in Eqns (3.2) and (3.3). This is possible since at this time we already know u.asa index for each node in the VST. 

Thus while we store only the asa index, our calculations will still use the original sa index. However, this can be derived from the asa index in constant time. In fact, CHAPTER 3. THE VIRTUAL SUFFIX TREE 40 we can observe that in practice, we need to compute the asa index for only the leftmost child node at each node-level, while keeping the original sa index for all other nodes. To determine the new elength for these other nodes, we simply make a constant time access to their leftmost (sibling) node (at the same node-level), and then use this to compute the length. For searching with the VST, we will calculate the length of the common string at each level. If the length is greater than 0, then we know there is a common string in the edge labels for the child nodes and only the last character is different. Thus, we do not need to store the edge lengths explicitly, leading to a reduction of one integer per node over the basic VST.

Table 3.2. Node attributes in the improved VST for the example sequence, T = missississippi$. NodeName root N1 N2 N3 N4 N5 N6 N7 N8 N9 sa index 0 1 7 9 3 9 12 4 10 13

fchild N1 N4 -N5 N5 N7 N8 N9 new elength 0 1 1 1 1 1 1 3 1 2 nfleaf 1 2 2 0 1 1 1 2 2 2 nnleaf 0 1 0 0 0 0 0 0 0 0 asa index 0 1 7 9 3 9 12 4+3=7 10 13+2=15

We have included new elength, so one can compare with elength in Table 3.1. However, in practice this will not be stored in the VST.

3.3.2 Construction Algorithm

Construction of the VST makes use of an array Q which records the internal nodes of the suffix tree. This array maps the internal nodes of the suffix tree to nodes in the VST. Thus, elements in the array are in the same ordering as the corresponding nodes in the VST.

Given an input string T, the first step is to construct the suffix array for T. This can be done in worst case linear time and linear space using any of the existing algorithms [57,63,65]. Using the SA, we construct the suffix tree as described in [3]. While the suffix tree can be constructed directly in linear time, working from the SA to the ST will require less space for the construction. The suffix tree is then preprocessed in linear time to adjust the edges from a given parent node that lead to internal child nodes to equal-length edges. Using the adjusted suffix tree, the algorithm will process the internal nodes in the suffix tree in a top-down manner CHAPTER 3. THE VIRTUAL SUFFIX TREE 41 to determine the attributes (fchild, nfleaf and nnleaf) for the corresponding nodes in the VST. Next, we process the VST from the VST leaf nodes to the root, using the Q array to update the asa index at each node. The adjusted asa index field includes information on the sa index and edge length.

The steps for constructing the VST for a given input string are summarized in Algorithm 3.1.

3.3.3 Further Space Reduction

We can further reduce the space needed by the VST, at the cost of an increased time for pattern matching. In the pattern matching phase, if the algorithm is to compare symbols one- by-one, rather than using binary search on the branches from a given node in the VST, we will only need to compute the lsa index and rsa index of the node.

Consider an arbitrary node (say node u) in the VST. The number of children from u or the number of u’s leaf nodes cannot be larger than |Σ|. Thus, the sa index of any child node of u will lie between node u’s lsa index and rsa index. Then comparing one symbol from the pattern against the first symbol on each edge from u to its children will require at most O(|Σ|) time steps. The left child node and the right child node will not need to be used again. Thus, the attributes fchild and nfleaf in the leaf nodes of the VST are no longer required. We make the asa index to be negative for the leaf nodes. Thus, during pattern matching, this serves as a flag for the VST leaf nodes. This compact structure will reduce the space requirement at each leaf node of the VST by 5 bytes. Time for pattern matching, however, will increase to O(|Σ|) for each symbol in the pattern P, or O(m|Σ|) overall, where m = |P|.

3.3.4 Complexity Analysis

Time and space complexity.

The time cost for lines 1-3 in the construction algorithm CONSTRUCT-VST (Algorithm 3.1) is O(n)+O(n)+O(n)=O(n). Lines 5-17 in the algorithm perform a one time traversal of the nodes CHAPTER 3. THE VIRTUAL SUFFIX TREE 42 in the suffix tree. The respective values of pTop and pBottom range from 1 to 2n. Thus the cost for the traversals is O(n). Lines 18-27 in the algorithm run at most pBottom times. The time for lines 18-27 in the algorithm is thus O(n), since each iteration of the loop requires constant time. Therefore, for the regular VST, the overall construction time is O(n). The time for pattern matching is in O(mlog|Σ|). For the compact structure, the construction time is the same as the regular structure, but the VST is no longer stored linearly. Here we use an array to store the relation between the Q array and the compact VST. The searching time is now O(m|Σ|).

The space requirement clearly depends on the number of nodes in the VST, which is at most n for a sequence of length n. Each node requires a fixed amount of memory to store, leading to an O(n) space requirement.

Number of nodes and practical space requirement.

The actual space needed for the VST depends on the topology of the suffix tree. This topology can be captured by the number of internal nodes in the suffix tree, or alternatively, by the quantity RIL, the ratio between the number of internal nodes and the number of leaf nodes. We call RIL the density or branching factor for the suffix tree. We conducted an experiment to evaluate the effect of this branching factor on the storage requirement of the VST. The suffix tree was constructed and the branching factors computed for a set of files taken from [104]. For each file, we used the first 224 symbols as the text, and computed the branching factor. Table 3.3 shows the results. The maximum ratio of 0.76 was observed for the file Jdk13c. On average, however, the maximum ratio was around 0.63. The worst case occurs for a sequence with |Σ| = 1, (that is, T = an), leading to a branching factor of 1. The table shows that, for a given sequence, the branching factor depends on a complex relationship between n, |Σ|, and the mean LCP.

The space requirement for the VST, for both the compact and regular structures depends directly on the branching factor. The last two columns in Table 3.3 show the maximum space requirement for each file.

The foregoing discussion leads to the following lemma on VST construction: CHAPTER 3. THE VIRTUAL SUFFIX TREE 43

Lemma 3: Given a string T = T[1..n], with symbols from an alphabet Σ, the virtual suffix tree (without suffix links) can be constructed in O(n) time, and O(n) space, independent of Σ.

3.4 Computing Suffix Links

Constructing the suffix tree from the suffix array as described in [3] does not include the suffix link. There are also a number of other suffix tree construction algorithms that build the suffix tree without the suffix link. See for example, Farach et al. [38], and Cole and Hariharan [29]. The suffix link, however, is a significant component of the suffix tree, and is important in certain applications, such as approximate pattern matching using matching statistics, and other forms of traversal on the suffix tree. Thus, a data structure to support the complete functionality of the suffix tree requires an inclusion of the suffix link. Recent efficient data structures for suffix trees have thus provided mechanisms for constructing the suffix link. The ESA [2] provided suffix links using complicated RMQ preprocessing [18]. The LST [62] also supported suffix links using the lcp-interval tree and intervals defined on the inverse suffix array. A recent work by Maaβ [80] focused exclusively on suffix link construction from suffix arrays, or from suffix trees that do not have such links.

The virtual suffix tree provides a natural mechanism for constructing suffix links. The key idea is that suffix links in the VST can be computed bottom-up, from the nodes with the highest node-depth (leaf nodes) in the VST to those with the least (the root). This is based on the following two observations about suffix links.

1. Consider a leaf node uST in the suffix tree corresponding to suffix Ti in the original se- quence. The suffix link from uST will point to the leaf node corresponding to the suffix Ti+1 (that is, the suffix that starts at the next position in the sequence).

2. The suffix link from a node u in the VST will point to some node w with a smaller string-

depth in the VST, such that |L(u)| = |L(w)|+1 (or equivalently |L(uST )| = |L(wST )|+1).

The following lemma establishes how we can build suffix links on the VST. CHAPTER 3. THE VIRTUAL SUFFIX TREE 44

Lemma 4 Given the VST for a string T = T[1..n] of length n, the suffix links can be con- structed in O(n) time using an additional O(n) space.

Proof: Let u and w be two arbitrary nodes in the VST. Let v be the parent node of u. Let u.slink be the node to which the suffix link from node u points. We consider two cases:

Case A: u is a leaf node in the VST. Then, using the above observations, the suffix link from node u will point to node w in the VST (that is, u.slink = w) such that SA[w.sa index] = SA[u.sa index] + 1. Clearly, |L(w)| = |L(u)| − 1, where L(x) is the path label of node x. Note that this path label is not explicitly stored in the VST, but for each node, the length can be computed in constant time. This computation can be performed in constant time by maintaining two arrays and observing that n − |L(w)| = n − |L(u)| + 1. One array is the inverse suffix array (ISA) for the given string, defined as follows: ISA[i] = j if SA[ j] = i, (i, j = 1,2,...,n). The second is an array M that maps the SA values to the corresponding parent nodes in the VST, defined as follows: M[i] = u, if uST in ST is the parent node of the leaf node corresponding to the suffix TSA[i]. Clearly, both arrays can be computed in linear time, and require linear space.

Case B: u is not a leaf node in the VST. This is a simpler case. When u is an internal node in the VST, the suffix link from u will point to some node w, such that w is an ancestor of node u.fchild.slink, such that |label(u,u.fchild)| = |label(w,u.fchild.slink)|. The O(n) time result then follows by using the skip/count trick [46], by observing that a VST has at most n nodes, a node depth of at most n, and that each upward traversal on the suffix link decreases the node depth by at least 1. 

Although the above description is from the viewpoint of a VST already constructed, the suffix links can be constructed as the VST is being built, by some modification of the VST construction algorithm. Algorithm CONSTRUCT-VST-WITH-SUFFIXLINKS (Algorithm 3.2) shows a modification of Algorithm algorithm CONSTRUCT-VST (Algorithm 3.1) to incorporate sections to compute the suffix links. The suffix link construction algorithm is based on the Q array used during the VST construction. We observe that the additional work required to construct the suffix links on the VST is independent of the alphabet size.

Figure 3.4 shows the result of the suffix link algorithm when applied to the VST of our example string T = missississippi$. Essentially, given the VST, the suffix link is con- CHAPTER 3. THE VIRTUAL SUFFIX TREE 45 structed right to left, node-depth by node-depth, starting with the rightmost node at the deepest node-depth, and moving up the VST until we reach the root. Thus, the order of suffix link construction in the example will be SL1,SL2,...,SL9.

Algorithm 3.2 shows that the additional work required to compute all the suffix links is linear in the length of the string. After construction, the suffix link on the VST will require one additional integer per internal node in the VST. This can be compared with the 2 integers per node required to store the suffix link using the ESA, or LST. In a typical VST, where the maximum branching factor is usually less than 0.7, the suffix link will require a maximum extra space of 0.7n ∗ 4 = 2.8n bytes. Table 3.4 shows the space required for the VST (including the suffix array and suffix links) for both the compact structure and the regular VST, at varying values of the branching factor.

We summarize the above discussion in the following theorem which captures the second main result of the work:

Theorem 3.2: Given a string T = T[1..n], with symbols from an alphabet Σ, the virtual suffix tree, including the suffix links, can be constructed in O(n) time and O(n) space, independent of Σ.

Proof. The theorem follows directly from Lemma 3 and Lemma 4. 

3.5 From SA to VST

So far, we have constructed the VST by first building the suffix tree from the suffix array, and then converting the suffix tree to a VST. The major problem with this approach is the relatively large memory requirement for suffix tree construction (for instance, compared to its storage). In this section, we eliminate this problem by constructing the VST directly from the suffix array, without a need to first construct the suffix tree.

The VST mainly encodes the structural information in a suffix tree, while avoiding the need to store some information that could be computed from the encoded structure. Thus, the key to going from SA to VST directly is to observe how the SA encodes the structural information CHAPTER 3. THE VIRTUAL SUFFIX TREE 46 in a suffix tree. The observation is that, given a sequence, the branching information in its suffix tree can be determined by making use of the corresponding suffix array and lcp array of the sequence. The edge labels, and hence edge lengths can be determined by analyzing the differences between adjacent lcp values. In a sense, it was this same observation that was used in constructing the suffix tree from the suffix array in [3] which was exploited in Algorithm 3.1.

Like the suffix tree, the VST has two types of nodes, leaf nodes and non-leaf nodes. The VST encodes only the non-leaf nodes in the suffix tree. Each leaf node in the VST corresponds to an internal node in the suffix tree whose child nodes are all leaf nodes in the suffix tree. The suffix tree leaf nodes in turn point to positions in the suffix array. Thus, to determine the VST nodes and their respective attributes from the suffix array, we consider whether the node is a VST leaf node, or a non-leaf node. We call the former Type 0 nodes, and the later Type 1 nodes. The problem then is to determine how the VST node attributes are derived from the SA and lcp for each type of node. We take a two step approach. First, we scan the SA and lcp from left to right, and use a temporary data structure to record pertinent information about each node in the VST. The temporary data structure (denoted TA) will be an array of structures, (similar to a VST node structure), but a TA node will contain more information than a VST node. Each node in the VST has a corresponding entry in TA. In the second stage, we construct a mapping function (denoted MAP) that provides a one-to-one map from the elements in TA to the VST nodes. At this stage, some attributes in TA are renamed, and non-required fields in the TA structure are removed to give the required virtual suffix tree.

The first stage makes use of two structures – the TA structure and a stack data structure. The stack structure has two elements, the stack value (an integer) and the stack type (one bit). The stack type shows the VST node types described above. Type 0 indicates an unmerged leaf node, while Type 1 indicates an unmerged internal node. The TA structure contains the same attributes as a VST node (sa index, fchild, nfleaf, and nnleaf), in addition to two pointers, namely, next which points to the next sibling node of the current node, and rsa index, the rsa index of the current node.

The algorithm scans the suffix array (SA) and lcp array (LCP) from left to right, (assumes the suffixes are sorted left to right in ascending order), and determines whether to create a new node based on the lcp values. The condition for starting the procedure to create a new node is when LCP[Stack.top.index] is larger than lcp of current index. We exit from the CHAPTER 3. THE VIRTUAL SUFFIX TREE 47 procedure when LCP[Stack.top.index] is less than lcp of current index. When entering the node creation procedure, we use curNode to denote the current index, and the curLCP to denote the lcp of current index. Whenever we exit the procedure, we run a special exiting function for housekeeping, which could also create a new node. We make use of the following definitions:

( Stack.top.value : Stack.top.type = 0 Stack.top.index = TA[Stack.top.value].rsa index : Stack.top.type=1 (3.4)

( curNode.value : curNode.type=0 curNode.sa index = TA[curNode.value].sa index : curNode.top.type=1 (3.5)

Essentially, a node is created by merging an existing internal node with another internal node, or with a leaf node, or by merging two leaf nodes to form an internal node. The algorithm makes use of several auxiliary routines, depending on the type of node. There are two cases, corresponding to the two node types:

1. CASE 1: VST LEAF NODES. Here, Stack.top.type=0. We consider two sub-cases.

Case 1A: LCP[Stack.top.index] > curLCP

• Case 1A1: LCP[Stack.top.index] 6= LCP[curNode.sa index] We merge Stack.top.index and curNode to a new element of TA, say T[k]. Let Stack.top.index be the leftmost leaf node and curNode be the right- most child. The required update is performed using the merge1A1( ) routine described as follows: If curNode is an index for SA, then T[k] has two leaf nodes: if (curNode.type = 0), then update T[k] as follows: T[k].sa index = Stack.top.index, T[k].nfleaf=2 (since there are 2 leaf nodes), T[k].rsa index = curNode.value. If curNode is an element of TA, then T[k] has one leaf node, CHAPTER 3. THE VIRTUAL SUFFIX TREE 48

and one child node; then, update T[k] as follows: T[k].sa index = Stack.top.index, T[k].fchild = curNode.value, T[k].nfleaf=1 (since there is one leaf node), T[k].rsa index = TA[curNode.value].rsa index.

• Case 1A2: LCP[Stack.top.index] = LCP[curNode.sa index]. In this case, curNode must be an element of TA. We update the node as follows: TA[curNode.value].sa index = Stack.top.index, TA[curNode.value].nfleaf = TA[curNode.value].nfleaf+1, and pop the stack.

Case 1B: LCP[Stack.top.index] = curLCP. Again, in this case, curNode must be an element of TA. We simply update the num- ber of leafs, namely, numleaf=numleaf+1 and pop the stack

2. CASE 2: VST INTERNAL NODES. Here Stack.top.type=1. We also consider two sub-cases.

Case 2A: LCP[Stack.top.index] > curLCP. We merge TA[Stack.top.value] and curNode to a new element of TA, say T[k]. The first child node will be TA[Stack.top.value], and the next (sibling) node will be curNode. The required update is performed using the merge2A() routine described as follows: If curNode is an index for SA, then T[k] has one leaf node. Update T[k] as follows: T[k].sa index = TA[Stack.top.value].sa index, T[k].fchild = TA[Stack.top.value], T[k].nfleaf=0, T[k].rsa index = curN- ode.value, TA[Stack.top.value].nnleaf=1 (this leaf is curNode). If curNode is an ele- ment of TA, then T[k] has two child nodes. If LCP[Stack.top.index] = LCP[Stack.(top- 1).index] then pop the stack. (These two must be an element of TA). Then, update T[k] as follows: CHAPTER 3. THE VIRTUAL SUFFIX TREE 49

T[k].sa index = TA[Stack.top.value].sa index, T[k].fchild = TA[Stack.top.value], T[k].nfleaf=0, T[k].rsa index = TA[curNode.value].rsa index, TA[Stack.top.value].next = curNode.value.

Case 2B: LCP[Stack.top.index] = curLCP. The update here is performed using the merge2B() routine, described as follows: If curNode is an index for SA, then, update as follows: TA[Stack.top.value].nnleaf = TA[Stack.top.value].nnleaf + numleaf +1. If curNode is an element of TA, then update as follows: TA[Stack.top.value].nnleaf = TA[Stack.top.value].nnleaf + numleaf, TA[Stack.top.value].next = curNode.value.

The special exit housekeeping procedure is performed using exitFunction(). The proce- dure is described as follows: If numleaf = 0, then push curNode into stack. Otherwise, (so we must have numleaf 6= 0), then let T[k] be the node resulting from merging curN- ode with the leaf which has the same LCP value as curLCP. Note that curNode is an el- ement of TA. Update T[k] as follows: T[k].sa index = TA[curNode.value].sa index- numleaf, T[k].nfleaf = numleaf, T[k].fchild = curNode.value, T[k].rsa index = TA[curNode.value].rsa index. Push T[k] into the stack, (equivalently, push (k) and set stack type to 1), and increment k by 1. If curLCP ≤ lcp of next index then push curNode into the stack).

Algorithm 3.3 shows the steps for constructing the TA structure, given the SA and LCP array. Figure 3.5 shows the VST nodes (nodes in TA) created using the algorithm on our running example string T = missississippi$. Table 3.5 shows the attributes of each node in the TA structure. Notice that some nodes, such as TA[2] and TA[4] were updated at later steps of the algorithm, after their initial creation.

Algorithm 3.4 shows how the TA node labels are mapped to VST node labels. The algo- rithm computes elength for the TA nodes, in order to compute asa index, the adjusted sa index which is used in the VST to avoid storing the edge lengths. Table 3.5 shows the result of the mapping for the TA structure shown in Fig. 3.5 and Table 3.5. The algorithm uses a simple auxiliary function computeChildren elengths() to determine the edge lengths. CHAPTER 3. THE VIRTUAL SUFFIX TREE 50

Building the suffix links on the VST structure above can be done as was done earlier. The two algorithms still maintain the linear time construction for the VST.

3.6 Summary

In this work, we have presented the virtual suffix tree (VST), an efficient data structure for suffix trees and suffix arrays. The searching performance is the same as the suffix tree, that is, O(mlog|Σ|) for a pattern of length m, with symbol alphabet Σ. We also showed how suffix links can be constructed on the VST in linear time, independent of the alphabet size. The VST does not store the edge lengths explicitly. This is achieved by modifying a key property of the suffix tree - the requirement that no two edges from a given node in the suffix tree can start with the same symbol. This key modification leads to a major distinction between the VST and the suffix tree, and results in extra space saving. However, whenever needed, the length for any arbitrary edge in the VST can be obtained in constant time using a simple computation. A further space reduction leads to a more compact representation of the VST, but at the expense of an increased search time, from O(mlog|Σ|) to O(m|Σ|).

The space requirement depends on the topology of the suffix tree, in particular, the branch- ing factor. For the compact structure, the worst case space requirement (including the suffix array) is 11.5n bytes without suffix links, and 15.5n bytes with suffix links, where n is the length of the string. However, in practice, the branching factor is typically less than 0.7. For the compact structure, this gives less than 9.25n bytes on average without the suffix links, or 12.05n bytes with suffix links.

In this work, we started from efficient storage of the suffix tree and suffix array after they have been constructed. Thus, we constructed the VST from the suffix tree, which in turn was constructed from the suffix array. To reduce the space requirement at construction time, we introduced another algorithm that constructs the VST directly from the suffix array. An inter- esting question is whether one can construct compressed versions of the VST, in a way that is analogous to compressed suffix trees and compressed suffix arrays. This could lead to further space saving for the VST. CHAPTER 3. THE VIRTUAL SUFFIX TREE 51

(a) (b)

Figure 3.1. Suffix tree and virtual suffix tree for the string T = missississippi$. (a) suffix tree ; (b) virtual suffix tree. The number at each leaf node indicates the position in SA. The number at each internal node indicates the node ID in the VST.

Figure 3.2. Example VST (solid nodes) showing left SA index (lSA) and right SA index (rSA) for sample nodes. CHAPTER 3. THE VIRTUAL SUFFIX TREE 52

(a) original tree (b) improved tree after adjusting the edge lengths

Figure 3.3. Edge-length adjustment procedure.

(a) modified suffix tree (b) improved virtual suffix tree

Figure 3.4. Improved VST for the string T = missississippi$ . CHAPTER 3. THE VIRTUAL SUFFIX TREE 53

Algorithm 3.1: VST Construction Algorithm

CONSTRUCT-VST(T,n) 1 SA ← COMPUTE-SUFFIXARRAY(T,n) 2 ST ← SUFFIXTREE-FROM-SUFFIXARRAY(SA) 3 ST ← ADJUST-EDGELENGTHS(ST) 4 Initialize VST[],Q[], pTop=0, pBottom=0, curNode=root, Q[pTop]=root 5 while (pBottom >= pTop) 6 for ( each childnode in curNode) do 7 if (childnode is internal node in ST ) then 8 pBottom ← pBottom + 1; Q[pBottom] ← childNode 9 if childnode is first internal node then 10 VST[pTop].fchild ← pBottom 11 end if 12 else 13 Update VST[pTop].nfleaf and VST[pBottom].nnleaf 14 end if 15 end for 16 pTop ← pTop + 1; curNode ← Q[pTop] 17 end while 18 for (pb ← pBottom down to 0) do 19 if (VST[pb] is leaf node) then 20 VST[pb].asa index ← Q[pb].fchild 21 else if (Q[pb].elength=1) then 22 VST[pb].asa index←VST[pb].fchild.asa index + VST[pb].nfleaf - Q[pb].elength 23 else 24 VST[pb].asa index←VST[pb].fchild.asa index + VST[pb].nfleaf - Q[pb].elength+ Q[pb].elength 25 end if 26 end if 27 end for CHAPTER 3. THE VIRTUAL SUFFIX TREE 54

Table 3.3. Branching factor and maximum space requirement for various sample files. File |Σ| Max Ratio Compact Regular Description Bible 63 0.61 8.60n 10.13n King James bible Chr22 5 0.73 9.50n 11.33n Human chromosome 22 E.coli 4 0.65 8.89n 10.52n Escherichia coli genome Etext 146 0.54 8.02n 9.36n Texts from Gutenberg project Howto 197 0.55 8.13n 9.51n Linux Howto files Jdk13c 113 0.76 9.69n 11.59n JDK 1.3 documentation Rctail 93 0.66 8.95n 10.60n Reuters news in XML format Rfc 120 0.64 8.77n 10.36n Concatenated IETF RFC files Sprot 94 0.61 8.54n 10.05n World 94 0.54 8.06n 9.41n CIA world fact book Average 0.63 8.71n 10.29n

Algorithm 3.2: VST construction with suffix links

CONSTRUCT-VST-WITH-SUFFIXLINKS(T,n) 4 Initialize VST[],Q[],ISA[],M[], pTop←0, pBottom←0, curNode←root, Q[pTop]←root . . 18 for (pb ← pBottom down to 0) do 19 if (VST[pb] is leaf node) then 20 Update array M to map SA index and node VST [pb] . . 26 end if 27 end for 28 for (pb ← pBottom down to 0) do 29 if (VST[pb] is leaf node) then 30 VST[pb].slink ← M[ISA[VST[pb].sa index+1]] 31 else 32 Find ancestor w of VST[pb].fchild.slink s.t. |label(w, VST[pb].fchild.slink)|=|label(VST[pb], VST[pb].fchild)| 33 Set VST[pb].slink ← w 34 end if 35 end for CHAPTER 3. THE VIRTUAL SUFFIX TREE 55

(a) (b)

Figure 3.5. Suffix links on the VST for the sample string T = missississippi$. (a) suffix links on the VST, but showing the leaf nodes of the suffix tree; (b) suffix links on VST (no ST leaf nodes).

The suffix links are labeled SL1,SL2,...SL9, indicating the order in which they were constructed

Table 3.4. Storage requirement for the VST, including suffix links Ratio Compact Regular Worst Case 1 15.50n 18.00n Average Case 0.75 12.63n 14.50n 0.7 12.05n 13.80n 0.65 11.48n 13.10n 0.6 10.90n 12.40n

Table 3.5. Detailed attributes for nodes in the TA data structure using the sample sequence, T = missississippi$. TA[0] TA[1] TA[2] TA[3] TA[4] TA[5] TA[6] TA[7] TA[8] TA[9]

label P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 sa index 4 3 1 0 7 10 9 13 12 9 fchild null TA[0] TA[1] TA[2] null null TA[5] null TA[7] TA[6] next null null TA[4] null TA[9] null TA[8] null null null rsa index 5 5 5 5 8 11 11 14 14 14 nfleaf 2 1 2 1 2 2 1 2 1 0 nnleaf 0 0 1 0 0 0 0 0 0 0 CHAPTER 3. THE VIRTUAL SUFFIX TREE 56

Figure 3.6. Constructing VST from the suffix array.

Nodes are labeled based on their labels in the temporary array, TA. The mapping of the TA node labels to the corresponding VST node labels is shown in Table 3.5.

Table 3.6. Node mapping table from TA to VST TA nodes P0 P1 P2 P3 P4 P5 P6 P7 P8 P9

VST nodes N7 N4 N1 root N2 N8 N5 N9 N6 N3 CHAPTER 3. THE VIRTUAL SUFFIX TREE 57

Algorithm 3.3: Constructing VST From Suffix Array

CONSTRUCT-VST-FROM-SA(LCP[],SA[]) 1 Stack←buildStack(); TA[n]; k ← 0; st.value=0; st.type=0; push(st) 2 for (i ← 1 to n) do 3 if (LCP[i] ≥ LCP[Stack.top.index]) then 4 st.value=i; st.type=0; push(st) 5 else 6 curLCP←LCP[i]; curNode.value←i; curNode.type←0; numleaf←0 7 do while(Stack is not empty & curLCP≤ LCP[Stack.top.index]) 8 if(Stack.top.type=0 & LCP[Stack.top.index] > curLCP) then 9 if(LCP[Stack.top.index] 6= LCP[curNode.sa index]) then 10 TA[k] ← merge1A1(Stack.top.index,curNode) 11 curNode.value ← k; curNode.type ← 1; k ← k + 1 12 else 13 TA[curNode.value].sa index ← Stack.top.index; TA[curNode.value].nfleaf ← TA[curNode.value].nfleaf+1 14 end if 15 else if(Stack.top.type=0 & LCP[Stack.top.index]=curLCP) then 16 numleaf ← numleaf+1 17 else if(Stack.top.type=1 & LCP[Stack.top.index]>curLCP) then 18 TA[k] ← merge2A(TA[Stack.top.value],curNode) 19 curNode.value ← k; curNode.type ← 1; k ← k + 1 20 else 21 merge2B(TA[Stack.top.value],curNode) 22 pop(Stack); break 23 end if 24 pop(Stack) 25 end while 26 exitFunction() 27 end if 28 end for 29 root ← the value of last element of Stack 30 return MAP-TA-TO-VST(TA[],k,ROOT) CHAPTER 3. THE VIRTUAL SUFFIX TREE 58

Algorithm 3.4: Mapping TA nodes to VST nodes

MAP-TA-TO-VST(TA[],k,root) 1 MAP[0..k-1]; W[0..k-1]; pTop←-1; curNode ← root 2 TA[curNode].fchild.elength ← 1 3 for (j ← 0 to k-1) do 4 MAP[j] ← curNode 5 computeChildren elengths(curNode) 6 if (curNode.next 6= NULL) then 7 curNode ← curNode.next 8 else 9 pTop ← pTop + 1 10 do while (TA[MAP[pTop]].fchild = NULL) 11 pTop ← pTop + 1 12 end while 13 curNode ← TA[MAP[pTop]].fchild 14 if (TA[curNode].elength > 1) then 15 TA[curNode].sa index←TA[curNode].sa index+TA[curNode].elength 16 end if 17 TA[MAP[pTop]].fchild ← j+1 18 end if 19 end for 20 for (j ← 0 to k-1) do 21 W[MAP[j]] ← j 22 end for 23 for (j ← 0 to k-1) do 24 if (MAP[j] ≥ 0 then) 25 index ← j 26 SWAP1 ← TA[MAP[index]] 27 do while(MAP[index] ≥ 0) 28 SWAP2 ← TA[index] 29 TA[index] ← SWAP1 30 SWAP1 ← SWAP2 31 MAP[index] ← -1 32 index ← W[index] 33 end while 34 end if 35 end for 36 Remove next,rsa index,elength from TA 37 Return TA as VST Chapter 4

The Probabilistic Suffix Array

4.1 Introduction

It has earlier been shown [109] that the probabilistic suffix tree (PST) is equivalent to the probabilistic suffix automata which is a type of probabilistic finite automata (PFA). The PFA on the other hand can be viewed as a variable length Markov model (VLMM), whose memory length constraint is determined by the observed data. We present the probabilistic suffix array (PSA), a new data structure for representing information in variable length Markov chains. The PSA essentially encodes information in a VLMM by providing a space-efficient representation of the probabilistic suffix tree (PST). Our PSA provides the same functionality as the PST, but at a reduced space requirement. The equivalence between the PST and a class of PFAs implies that our PSA is also equivalent to this class of PFAs.

Main Resultes. We present algorithms to construct the PSA and for sequence prediction based on the constructed PSA. Our algorithms are based on the notion of empirical probabilities, modeled using information retrieval notions of term frequency (TF) and document frequency (DF). We present a linear time algorithm for computing the document frequency. We state our main results in the form of two theorems about the PSA.

59 CHAPTER 4. THE PROBABILISTIC 60

Theorem 4.1: Given a sequence T = T[1...n], with symbols from an alphabet Σ, and the memory constraint L on the variable length Markov model, the probabilistic suffix array (PSA) for T can be constructed in O(n) time, and O(n) space, independent of the Markov order, L.

Theorem 4.2: Given a sequence T = T[1...n], with symbols from an alphabet σ, where

σ = σ1σ2...σ|Σ|, and the probabilistic suffix array (PSA) for T, we can decide on whether a pattern P = P[1..m] is generated by the same variable length Markov chain that generated T in n O(mlog |Σ| ) time.

In previous work by Ron et al. [109] and Apostolico and Bejerano [9], the probabilistic finite automata (PFA) was represented using the probabilistic suffix tree (PST). Here, we present a space-efficient data structure to simulate such finite state machines when represented as a PST. Since a variable length Markov model is a finite state machine, and can be represented as a PST, our proposed structure can be used to capture the information in a variable length Markov model. We call our data structure the probabilistic suffix array (PSA), since it is built on suffix arrays rather than the suffix tree data structure. The PSA uses an array of nodes to capture the branching structure in the suffix tree, and other auxiliary arrays to maintain information needed for learning from the observed data. Learning in the PSA is performed by computing conditional probabilities at each node in the PSA using empirical probabilities computed via the TF and DF.

Organization. In the next section, we briefly describe the PST using an example sequence. In Section 3, we present our PSA data structure. We also give an example for the PSA to explain how it works. The construction algorithm is presented in Section 4. We analyze its practical space requirement in Section 5. Section 6 presents experimental results on protein family classification and phylogenetic tree construction. The last section provides a summary and conclusion.

4.2 Probabilistic Suffix Tree

Given a sequence T with n observations, the Markov model needed to represent T will require space that is exponential in L, the order of the Markov model. In practice, the transition CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 61 matrix for the Markov model to represent such a sequence will be sparse as the order L increases. Suffix tree data structures represent all substrings of a sequence seq in O(n) internal nodes and leaf nodes. Ron et al. [109] presented a space-efficient data structure called the probabilistic suffix tree (PST) to represent the order L transition matrix. The PST encodes only the non- zero transition probabilities. Therefore, the PST is a suffix tree which contains the transition probabilities for the Markov model with any order L, where 0 < L ≤ n. However, in practice, the space requirement for the suffix tree is still a problem.

Consider the sample sequence T = accactact$. Its first order transition matrix and the corresponding state diagram are shown in Figure 4.1. Figure 4.2 shows an example suffix tree and the corresponding PST, for this example sequence. The PST is shown for the case of order L = 3. In this PST, we label the transition probability in each symbol. The transition probability for a given symbol is calculated using the conditioning context.

0.25 a c 0.25 1

0.5

1 g t (a) State Diagram (b) Transition Matrix

Figure 4.1. State diagram and transition matrix for a first order Markov model for an example sequence T = accactact$.

4.3 Proposed Data Structure

We propose the probabilistic suffix array (PSA) as a way to simulate the probabilistic suffix tree. Each node in the PSA has a corresponding node in the PST. The basic PSA structure has three types of attributes. The first category of attributes are the foundation attributes, which consists of the original text and its suffix array. Construction of the suffix array is in O(n) time CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 62

(a) (b)

Figure 4.2. Example suffix tree and probabilistic suffix tree for the string T = accactact$.

(a) suffix tree ; (b) probabilistic suffix tree. The trees are shown without the suffix links. The numbers at each node of the PST are based on the count of the number of times the symbols are observed after observing the sequence corresponding to the node label. Essentially, at a given node, say u, these encode the conditional probabilities P(σ|C) of observing the symbol σ following the sequence L(u).

and space, using any of the various linear-time linear-space algorithms. See for example [4,57, 97, 104]. The second category of attributes are the internal node attributes. These are derived from the interval array, which is determined following [131]. These are used to represent the internal nodes in the suffix tree including the suffix links. The suffix link is the link from an internal node to its suffix node. The third type of attributes are measurement attributes. These record measurement information, such as term frequency, document frequency, and conditional probabilities, which are needed to compute probabilities in the Markov model. Table 4.1 shows the three categories of attributes used in the PSA. We use the term length of PSA to refer to the number of nodes in the PSA. The term length emphasizes the fact that our PSA nodes are stored as arrays. In this work, we use M to denote the PSA length.

4.3.1 Internal Node Attributes

The internal node attributes are derived from the interval array. The pair < Start,End > represents the interval position of this node in the suffix array. Length denotes that length CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 63 of longest common prefix between the substrings represented by the node. This essentially corresponds to the length of the path from root to the current node. The PSA internal node attributes are used to simulate the internal nodes in a suffix tree. The attribute Suffix Link is a regular suffix link from the current internal node to its suffix node. We use this link to continue the searching process when a mismatch occurs at the time of prediction. The internal node attributes including the suffix link are constructed using Algorithm BUILDPSA.

4.3.2 Measurement Attributes

These attributes are dependent on the application. For example, for applications in docu- ment feature selection, or in protein sequence classification, we only compute and store TF and DF, and the conditional probabilities. In other applications, such as document clustering, we may need to compute document frequency for the classes, rather than just the document fre- quency. The attributes are also dependent on the methods used in calculating the probabilities in the Markov model. Thus, we focus on the method of computing the conditional probabilities, and the probability of a node in the VLMM. For a given node with node label, say (S1...Sn), its probability, PVLMM is given by:

PVLMM = P(S1...Sn) = P(Sn|S1...Sn−1)...P(S2|S1)P(S1) (4.1)

If S = S1...St−1St (where 1 ≤ t ≤ n ) does not occur in the training data, we find the longest suffix of S which occurred in the training data. Assume the Sk...St−1St (where 1 ≤ k ≤ t) is the

Table 4.1. Attributes of PSA Type Attribute Space Fundamental Original Text(text) char Attributes Suffix Array(SA) integer Internal Node Start integer Attributes End integer Length integer Suffix Link integer Measurement Term Frequency(TF) integer Attributes Document Frequency(DF) integer

cProbability P(St |Sk...St−1)(CP) float CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 64 longest suffix of S, then P(St|S1...St−1) = P(St|Sk...St−1). Thus,

 TFS  t : k = t  n P(St|S1...St−1) = P(St|Sk...St−1) = (4.2) TF  Sk...St  TF : k < t Sk...St−1

Here, TFu is the term frequency of the node with node label u. We make two observations: First, if the terminal symbol St of a path Sk...St−1St is a first symbol in an edge of the suffix tree, then the conditional probability P(St|Sk...St−1) can be computed as the frequency of the current node divided by the frequency of the parent node. Secondly, if the terminal symbol

St of a path Sk...St−1St is a non-first symbol in an edge of the suffix tree, then the conditional probability P(St|Sk...St−1) is 1. We call this a trivial conditional probability. Thus, we need to store the conditional probabilities for only the first symbols in each edge. When we determine that the terminal symbol of a path is a non-first symbol in an edge, we simply return 1 for the conditional probability of the symbol.

4.3.3 Example PSA

Table 4.2 and Table 4.3 show the nature of the PSA nodes and the order-3 conditional probabilities in a PSA, using the sequence T = accactact$ used in Figures 4.2 and 4.1. The entries in Table 4.2 are directly calculated from the interval array. We notice that P(c|a) is 1. This indicates the substring “ac” is represented on one edge and that the terminal symbol ”c” is the second symbol on this edge. This is an example of a trivial conditional probability. P(c|ta) also is an example whose probability is 1.

Entries in Table 4.3 are computed from the PSA leaf nodes. We only showed the non- trivial conditional probabilities on the leaf nodes. These conditional probabilities are calculated based on the first observation described in Section 4.3.2. The numerator is 1 since this is a leaf node which must have a frequency of 1. The denominator is the frequency of the substring corresponding to the node label of the parent of the current leaf node. This is easily obtained as (End − Start + 1) using the elements in the interval array. The PSA nodes can be compared with nodes in the example PST shown in Figure 4.2. The transition matrix in Figure 4.1(b) can be constructed in the PSA. CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 65

PSA node Interval Length Suffix Probability Conditional Index Link Expression Probability 2 1 < 2,3 > 3 2 P(t|ac) 3 1 2 < 1,3 > 2 3 P(a) 3 2 < 1,3 > 2 3 P(c|a) 1 1 3 < 6,7 > 2 4 P(t|c) 2 4 4 < 4,7 > 1 -1 P(c) 9 2 5 < 8,9 > 1 -1 P(t) 9

Table 4.2. Example PSA internal nodes, using the PSA of the sequence T = accactact$

SA index Probability Conditional Expression Probability 1 1 P(c|ac) 3 3 P(a|act) 1 1 4 P(a|c) 4 1 5 P(c|c) 4 1 7 P(a|ct) 2 9 P(a|t) 1

Table 4.3. Example PSA leaf nodes, using the PSA of the sequence T = accactact$

4.3.4 Interval Array and Document Frequency in Linear Time

The interval nodes represent the basic structure of the PSA. Since other attributes are based on the structure of these interval nodes, we will need to compute these interval nodes first. The term frequency (TF), and document frequency(DF) are computed as by products, as we build the interval nodes. In this section, we give a new algorithm to compute DF in linear time. Compared with the original algorithm [131] which runs in O(nlogn) time, our algorithm is more efficient and applicable to large document collections. The interval nodes are stored as an array using the interval array data structure. To build the interval array, we use Yamamoto and Church’s TF algorithm [131]. In our structure, the non-trivial lcp-delimited intervals represent the internal nodes of the suffix tree. Therefore, we modify the algorithm to output only the non- trivial lcp-delimited intervals. After this procedure, we obtain the interval array which includes interval attributes < Start,End > and Length for each node. Length is simply the LCP of the interval represented by < Start,End >. The attribute TF is also computed at this stage. CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 66

For applications that involve multiple sequences or documents, for instance, in text clus- tering or in protein sequence classification, we may require the document frequency. In this work, the input sequence in such applications will be a concatenation of all the sequences, with a special end of document symbol ($) delimiting each individual sequence.

The original algorithm proposed in [131] for computing document frequency runs on O(nlogn) time. Here, we modify the algorithm to improve its running time to O(n). The original algo- rithm was reproduced in Figure2.1 for easy reference.

In line 4-6, the algorithm searches for the largest x. The worst case for this search will be O(logn). We add a new array docsp that maps the document id to a stack. The length of this array is the number of documents, Z. When we calculate a new interval < Start,End >, the al- gorithm will check whether the element has been observed previously. The algorithm searches docsp with the document id of the new element. If the document id is found, the document frequency (DF) is changed. The process performs a simple look-up using docsp in constant time, and hence the modified algorithm runs in O(n) time. Algorithm COMPUTEDOCUMENT- FREQUENCY implements the proposed modifications.

Algorithm 4.1: Computing Document Frequency in Linear Time

COMPUTEDOCUMENTFREQUENCY (3) doc ← getdocnum(s[j]), docsp[doc] ← sp (4) if doclink[doc] 6= -1, do (5) if docsp[doc] >sp or stack[docsp[doc]].i > doclink[doc] do (5.5) stack df[docsp[doc]] ← stack df[docsp[doc]]-1 (6) doclink[doc] ← j, docsp[doc] ← sp

4.4 Constructing the PSA

Having described the building blocks for the probabilistic suffix array (PSA), we are now ready to describe how we put them together to construct the PSA. Algorithm BUILDPSA (Al- gorithm 4.2) uses five major steps or procedures in constructing the PSA data structure for a given input sequence. The first step is the construction of the suffix array from the original sequence. We use standard linear-time linear-space algorithms for this step. Using Yamamoto CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 67 and Church’s PRINT LDIS STACK() function [131], we construct the interval array. We then simulate the tree-like structure from the interval array in the third step. The third step maps each position in the input sequence to its interval. In the forth step, the routine BUILDSUFFIXLINK starts by first constructing the inverse suffix array W. From W, the algorithm determines a link that allows the interval array to point to the next position. Thus, the suffix link is easy to construct.

The final procedure constructs a ranked list of the elements in the interval array (the nodes in the PSA) in a non-decreasing order. This process uses the counting sort in O(M) time and using 2M integer space, where M is the number of non-trivial lcp-delimited intervals. Since M ≤ n, this will take O(n) time.

Algorithm 4.2: Building the Probabilistic Suffix Array

BUILDPSA(Text) 1 SA ← BUILDSA(Text) 2 IntervalArray ← PRINT LDIS STACK(SA) 3 ← BUILDINTERVALTREE(SA,IntervalArray) 4 PSA ← BUILDSUFFIXLINK(posInterval,parent) 5 PSA ← SORTPSA(PSA)

4.4.1 Building the Interval Tree

Algorithm BUILDINTERVALTREE (Algorithm 4.3) uses the interval pairs < Start,End > to construct a tree-like structure that encodes the parent-child relationships between the inter- vals. The main idea is to set the interval ID into a position between positions Start and End, such that the position has not been previously set. The problem is that the pairs < Start,End > can overlap (nesting). A na¨ıve algorithm for this task will require an O(n2) time.

We use a stack to store the free position which is the current pair < Start,End >. If the position has not been earlier set to some interval ID, the algorithm sets the interval ID one by one from Start to End. If the position has earlier been set to some interval ID, the algorithm will set the interval ID in the last part of the pair < Start,End > which has not earlier been set CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 68 to an interval ID. This process guarantees that the position will be set to one interval ID, and that the position is included in only one < Start,End > pair.

Some positions which could not be set to an interval ID in the last pass will be set to an interval ID in the next loop. The following step pops positions from the stack and sets interval IDs into these positions. This process guarantees that every position will be set to one interval ID, even the positions that belong to the next higher level interval, or to a lower level interval.

After determining the interval IDs, the algorithm computes the parent array. The parent array stores the link from one interval to the higher interval position which shares the same value for Start. The variable pN is a stack. This stack stores the intervals which have been computed, but the parent’s interval ID has not yet been set. Line 16 pushes the interval into the stack.

When the current interval includes the interval at the top of the stack pN, it means the current interval is the parent of the interval at the top of pN. Thus, we set the value for parent of the interval at the top of pN to the current interval. The loop in lines 13 to 15 repeat the step to find all the child intervals of the current interval.

4.4.2 Building the Suffix Link

Algorithm BUILDSUFFIXLINK (Algorithm 4.4) uses the suffix array (SA) to compute the inverse suffix array W. To compute the suffix link, the algorithm sets the position of the parent in W to be the current position. Thus we obtain the suffix link for each leaf node. To compute the suffix link of an internal node, we simply use the parent array to find the suffix link of the internal node.

4.4.3 Sorting the PSA Structure

After obtaining the suffix link, we resort the PSA structure to make it suitable for efficient searching during the VLMM prediction stage. The prediction procedure performs frequent searches using the interval array. (See Section 4.4.5). After sorting using the < Start,Length > attributes of the PSA, searching will be done in O(logM) time, where M is the PSA length, the CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 69

Algorithm 4.3: Buliding the Interval Tree

BUILDINTERVALTREE(SA[],Start[],End[],Length[]) 1 posInterval[] ← -1; pt ← 0 2 for (i ← 1 to M) do 3 if posInterval[i]=-1 do //first time to observe Start[i] 4 for (j ← Start[i] to End[i]) do 5 if posInterval[j]=-1 do posInterval[j] ← i else break end if 6 end for 7 for (j ← End[i] down to Start[i]) do 8 if posInterval[j]=-1 do posInterval[j] ← i else break end if 9 end for 10 end if 11 pop position pos which is between Start[i] and End[i] in Stack 12 posInterval[pos] ←i; push the position between Start[i] and End[i-1] 13 while pt > 0 and Start[i] ≤ Start[pN[pt-1]] 14 parent[pN[pt-1]] ← i; pt← pt-1 15 end while 16 pN[pt] ← i; pt ← pt+1 17 end for 18 return < posInterval[], parent[] >

Algorithm 4.4: Buliding Suffix Link

BUILDSUFFIXLINK(SA[],Start[],Length[], posInterval[], parent[]) 1 for (i ← 1 to n) do W[SA[i]-1] ← i end for 2 for (i ← 1 to n) do W[i] ← posInterval[W[i]] end for 3 for (i ← 1 to M) do 4 if Length[i] 6= 1 and sLink = NULL do 5 sLink[i] ←W[SA[Start[i]]] 6 k ← parent[i]; pi ← i 7 do while (k 6= root and sLink[k] = NULL) 8 sLink[k] ← parent[sLink[pi]] 9 do while (Length[sLink[k]] 6= Length[k]-1) 10 sLink[k] ← parent[sLink[k]] 11 end while 12 pi ← k; k ← parent[k] 13 end while 14 end if 15 end for CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 70 number of the interval pairs in the PSA.

The algorithm SORTPSA (Algorithm 4.5) uses counting sort to sort the PSA structure. In the interval array, the attribute Length is already in decreasing order for intervals that share the same Start. Therefore, we only need to sort the structure based on the order of the Start attribute. The time complexity of this algorithm is linear with respect to the number of nodes (number of intervals). The additional space is at most 2n integers, see analysis in Section 4.5.

Algorithm 4.5: Sorting the PSA Structure

SORTPSA(Start[],Length[],sLink[]) 1 Count[] ← 0; Order[] ← 0; wOrder[] ← 0; NewStart[] ← 0; NewEnd[] ← 0 NewLength[]← 0; NewsLink[]← 0; sum ← 0; NewPattern[]← 0 2 for (i ← 1 to M) do Count[Start[i]] ← Count[Start[i]]+1 end for 3 for (i ← 1 to n) do 4 if Count[i] 6=0 do sum ← sum+Count[i]; Count[i] ← sum-Count[i]; end if 5 end for 6 for (i ← 1 to M) do 7 Order[Count[Start[i]]] ←i; Count[Start[i]] ← Count[Start[i]] + 1 8 end for 9 for (i ← 1 to M) do wOrder[Order[i] ← i end for 10 for (i ← 1 to M) do sLink[i] ← wOrder[sLink[i]] end for 11 for (i ← 1 to M) do 12 NewStart[i] ← Start[Order[i]];NewEnd[i] ← End[Order[i]];NewLength[i] ← Length[Order[i]] 13 NewsLink[i] ← sLink[Order[i]];NewsPattern[i] ← pattern[Order[i]] 14 end for

4.4.4 Computing Conditional Probabilities Using the PSA

Algorithm COMPUTEPROBABILITY (Algorithm 4.6) is a simple routine which is based on the PSA structure. It computes the conditional probability at each internal node by using equation (4.2). The algorithm uses the temporary array parent which was generated by algo- rithm BUILDSUFFIXLINK (Algorithm 4.4). The array parent contains a record of the parent for each given interval node. The algorithm is linear in time and is in place. After computing the conditional probabilities, the space used by array pattern can be released, since the array is no longer needed. CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 71

We summarize the above results in Theorem 4.1, our first main result on the PSA:

Theorem 4.1: Given a sequence T = T[1...n], with symbols from an alphabet Σ, and the memory constraint L on the variable length Markov model, the probabilistic suffix array (PSA) for T can be constructed in O(n) time, and O(n) space, independent of the Markov order, L.

Proof: Algorithm BUILDSA and PRINT LDIS STACK each runs in O(n) time complexity as in [104] and [131] respectively. Algorithm BUILDINTERVALTREE computes two arrays posInterval and parent. The array posInterval stores the relationship between leaf nodes and interval nodes, while parent represents the parent nodes for the interval nodes. There are n leaf nodes and M interval nodes. Lines 2-17 in algorithm BUILDINTERVALTREE searches each interval node. Lines 4-6 makes the relationships between current interval node and the leaf nodes before the first child interval node of the current interval node. Lines 7-9 calculates the relationships between current interval node and leaf nodes after the last child interval node of current interval node. Line 11 computes the relationships between current interval node and leaf nodes which is not included in other interval nodes. Each leaf node will be computed one time in algorithm BUILDINTERVALTREE. Lines 13-15 compute the child nodes of the current interval node. The time cost is O(M) over the whole algorithm, even with the loop in Line 2. Algorithm BUILDSUFFIXLINK calculates the suffix link via the inverse suffix array W. It calculates suffix links starting at the lowest level interval nodes which does not include other interval nodes. Then it calculates the suffix link for the parent node of current interval node following the parent array. So the time is O(M) for computing the suffix links for all interval nodes. There are 6 loops in algorithm SORTPSA. Each loop runs either n times or M times. So the time complexity is O(n). Therefore, overall, the probabilistic suffix array (PSA) for a sequence T of length n can be constructed in O(n) time.

Linear space requirement follows from the space analysis in Section 4.5. 

4.4.5 Prediction with VLMM via the PSA

An important procedure in Markov models is to compute the probability that a given test pattern is generated by the model. For the variable length Markov model (VLMM), we denote CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 72

Algorithm 4.6: Calculating Conditional Probability

COMPUTEPROBABILITY(PSA,M,n) 1 for (i ← 1 to M) do 2 if PSA.Length[i]=1 do 3 PSA.cProbability[i] ← PSA.TF[i]/n 4 else 5 PSA.cProbability[i] ← PSA.TF[i]/PSA.TF[PSA.parent[i]] 6 end if 7 end for

this probability as the PVLMM of the input pattern. Algorithm VLMM-PREDICTION (Algorithm 4.7) calculates the PVLMM of a test sequence. This algorithm uses Equation (4.1) to compute P(S1),P(S2|S1)... by searching for the sub-patterns S1,S1S2,... When there is a mismatch while searching with the sub-pattern Sk...St, the algorithm jumps to the node pointed to by the suffix link attribute of the current node. Thus, matching proceeds from the node representing the suffix

Sk+1...St, after accounting for the prefix of this suffix, which has already been matched in the previous step.

The algorithm scans positions in the pattern from left to right. While matching the sub- pattern, it uses the function LeftmostMatchedPosition which uses standard suffix ar- ray search algorithms [3, 84] to find the left most position (le ftPosn) that matched the sub- pattern. The function LeftmostMatchedPosition searches the pattern Pattern[s...i]#, where # is a symbol that never occurred in the alphabet. That is, # ∈/ Σ,# < σ,∀σ ∈ Σ, and $ < #. The search will result in a mismatch. Since Pattern[s...i] has already matched up to position i − 1 using the SA, we can easy determine the leftmost position of Pattern[s...i] in the suffix array. We use le ftPosn to denote this leftmost position in the algorithm. The function SearchPSA uses the determined leftmost position to search in the PSA. This function also uses standard suffix array search algorithms [3, 84]. The function determines the index of the PSA node such that PSA.index.Start = le ftPosn and PSA.index.Length is the minimum value that is larger than i − s, the length of matching prefix of the pattern. Since the PSA is aleady sorted by Start and Length, the search for the index will be done in O(logM) time complexity, where M is the length of PSA (i.e. the number of nodes in the PSA).

When a mismatch occurs, the algorithm uses the previously computed index to determine CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 73 the suffix link. Thus the search will be redirected to the new branch following the suffix link. The algorithm now uses the longest suffix of the sub-pattern that so far matched as the new sub-pattern, and re-starts matching from the symbol that mismatched. Determining this point where matching should re-start is a constant time operation.

The foregoing implies that, given a sequence T = T[1...n], with symbols from an alphabet Σ, and the PSA for T, we can decide on whether a pattern P = P[1...m] is generated by the same n variable length Markov chain that generated T in O(mlog |Σ| ) time.

We can use the predicted probability above to perform protein sequence classification. Sup-

pose we have F protein families and we have computed the PSA for each family. Let PSAk be the model constructed using the k-th protein family. Further, let PVLMM(P,PSAk) be the probability that protein sequence P is generated by PSAk, as returned by Algorithm VLMM- PREDICTION. Then, we classify P to protein family f , where f is given by:

f = argmax{PVLMM(P,PSAk)}. (4.3) k=1...F We summarize the foregoing in the following Theorem, the second major contribution of this work on the PSA:

Theorem 4.2: Given a sequence T = T[1...n], with symbols from an alphabet σ, where

σ = σ1σ2...σ|Σ|, and the probabilistic suffix array (PSA) for T, we can decide on whether a pattern P = P[1..m] is generated by the same variable length Markov chain that generated T in n O(mlog |Σ| ) time.

Proof: We observe that whenever a mismatch occurs, the start position of the sub-pattern s will be increased by one. The number of mismatches is at most m. Similarly, the number of matching positions will equally be at most m. Thus, the loop in Lines 2-17 will run at most 2m times. Line 12 calls the function LeftmostMatchedPosition which uses standard suffix array search algorithms [3, 84], that run in O(logn) time per call. The function SearchPSA in Line 13 searches the PSA data structure with the parameters determined in the previous step in O(logM) time, where M is the length of the PSA. Thus, the total running time will be in n O(mlogn). We improve this time to O(mlog |Σ| ) using an extra |Σ| space to record the starting CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 74 position of each symbol in the suffix array. 

Algorithm 4.7: Prediction with VLMM via the PSA

VLMM-PREDICTION(Pattern,PSA) 1 s ← 1, Prob ← 1,L ← 1,R ← n, index ← 0,m ← kPatternk 2 for (i ← 1 to m) do 3 Search Pattern[s...i] in PSA.SA with parameter L,R 4 if mismatch do 5 index ← PSA.index.sLink 6 L ← PSA.index.Start, R ← PSA.index.End 7 s ← s + 1 8 if i ≤ s − 1 do 9 i ← i + 1 10 end if 11 else 12 position ← LeftmostMatchedPosition(Pattern[s...i],PSA.SA,L,R) 13 index ← SearchPSA(PSA,position,i − s + 1) 14 Prob ← Prob × PSA.index.Conditional Probability 15 L ←PSA.index.Start, R ← PSA.index.End 16 end if 17 end for 18 return Prob

4.5 Space Analysis

In this section, we analyze the space requirement for the PSA structure. We consider space required during its construction (work space), and for its storage and use.

4.5.1 Storage Space

The basic PSA structure has four types of attributes. The measurement attributes are de- pendent on the application, and may not be needed every time the PSA is used. We also observe that, though the TF is needed for computing the empirical probabilities required for the later stage of determining the conditional probabilities, we do not need to store the TFs directly. CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 75

They can be obtained easily using the < Start,End > pairs stored at each node. Thus, we focus on the other attributes. From Table 4.1, we see that the worst case space for the PSA structure is 6n integers plus n characters, or 25n bytes, assuming n ≤ 232, and |Σ| = 256. On average, we could save at least n integers by storing the < Start,End > pair as < Start,(End − Start) >, and the fact that the maximum value in the Length array will be the maximum LCP value for the sequence, which is known to be of length in O(log|Σ| n) [58].

Further, with M = PSA length (number of PSA nodes or intervals), the space for the internal node attributes and suffix link attributes will be 4M integers. In this case, M is a sub-linear M function of n. Thus, the ratio γ = n , where 0 ≤ γ ≤ 1 is an important measure on the node branching structure of the PSA, and hence the complexity of the original sequence. With larger γ, we need more practical space to store the PSA. At γ < 0.7, which is our observation for example sequences tested (See Table 3.3), this will result in an average space requirement of (5n + 12n ∗ γ), or 13.4n bytes for the PSA.

4.5.2 Construction Space

There are five steps to build the PSA and compute the VLMM probabilities. We list the work space in each step and analyze the worst case space over all the construction steps.

1. Building Interval Array and Computing TF, DF The space requirement will depend on the following:

(a) Input: Text of size n, suffix array (n integers) and LCP Array (n integers). (b) Work space (in integers): Stack(2n), DF (n), doclink(Z) and docsp(Z), where Z is the number of documents. (c) Output: interval array with < Start,End > and Length, n integers each.

The total space will be (7n + 2Z) integers plus n characters for the text. Memory reuse: The interval < Start,End > and Stack can share the same space. Stack is the space which stores the yet-to-be-computed non-trivial lcp-delimited intervals. In the extreme case, the length of Stack is at most n, since there are at most n non-trivial CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 76

lcp-delimited intervals. So when a non-trivial lcp-delimited interval has been processed, it will be stored in the interval < Start,End >. The length of the interval < Start,End > is at most n. Thus, the interval < Start,End > records the calculated non-trivial lcp- delimited intervals. However, the sum of length of the interval < Start,End > and Stack is at most n. Thus, the interval < Start,End > and Stack can share the same space. The maximum space required will then be (5n + 2Z) integers plus n characters.

2. Building the Interval Tree The space requirement will depend on the following:

(a) Input: Text of size n, suffix array, interval array < Start,End > and Length. (b) Work space and output (in integers): posInterval (n), parent (n), pN (n) and Stack (n).

The total space needed for this step is 8n integers plus n characters. Memory reuse: The arrays Stack and pN can share the same n integer space, since pN is a stack that indicates the nodes have been processed, while Stack is the space that indicates those not yet processed. These are mutually exclusive, with a combined length n. The maximum space will be 7n integer plus n char.

3. Calculating Suffix Links The space requirement will depend on the following:

(a) Input : Text of size n; 1n integer array each for, SA, Start, Length, posInterval, and parent. (b) Work space and output (in integers): inverse suffix array W(n) and suffix link(n).

The total space for this step is 7n integer plus n characters. Memory reuse: Suffix link and posInterval share the same n integer space. In the algo- rithm BUILDSUFFIXLINK, after line 7, the array posInterval is never used again, and the space could be reused for the suffix link. The maximum space is thus 6n integer plus n characters. CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 77

4. Sort PSA The space requirement will depend on the following:

(a) Input: Text of size n; 1n integer array each for Start, Length, and Su f fixLink. (b) Work space and output: 8 counting arrays (n integer each), namely: Count, Order, wOrder, NewStart, NewEnd, NewLength, NewsLink, NewPattern.

Memory reuse: At most two additional arrays are need at any given time. Thus, the maximum work space is 5n integers plus n characters.

5. Computing the Conditional Probabilities This is an in place algorithm, requiring no extra space.

From above analysis, the maximum space required during PSA construction will be 7n integers plus n characters, where we have assumed that n  Z. We can use an argument similar to that made for the storage space to save n integers, and applying the γ ratio to get an average case construction space requirement of 5.15n integers, or 20.6n bytes.

The above PSA space requirement can be compared with the space requirement using the PST. The use of reverse suffix links [9] imply that the PST will require at least 37n bytes on average (assuming Ukkonen’s suffix tree construction algorithm), without counting other auxiliary structures needed for the PST construction.

4.6 Experiments

We performed experiments on protein sequences to test the proposed data structure. The experiments were performed using a DELL PC, with 4 × 2.67GHz CPU, and 8G memory, running Ubuntu 10.10 Linux operating system. All programs were compiled using gcc.

4.6.1 Predicting Protein Families

Our major objective was to develop a time- and space-efficient alternative to the PST. How- ever, to place our results in the correct context, we must first verify that the proposed PSA pro- CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 78 duces an equivalent performance in protein sequence modeling and prediction, when compared with the original PST. Thus, to be able to compare the PSA results with previous approaches using the PST [17, 71], we used the same protein sequence dataset that was used in [17] and in [71]. We downloaded the Pfam databse [15, 39], release 1.0. The database contains 175 families originally derived from the SWISSPROT 33. We use family members from the un- aligned sequences to generate the PSA, one PSA for each family. To test the performance in modeling the protein sequences and in predicting the family for unknown sequences, we used leave-one-out cross validation. For each sequence in the SWISSPROT 33 database, we com- pute the PVLMM using the PSA structure for each family. We then assign the protein sequence to the family with the maximum probability. When the maximum probability is obtained using the model of the correct family (whose PSA is generated without the test sequence), we say we have a correct classification (true positive), otherwise, there is a classification error. This simple approach avoids the difficult problem of setting thresholds for correct classification.

Table 4.4 shows the classification performance using the PSA. For comparison, we have also included the results obtained using the PST on the same dataset, as reported in [17]. Table 4.5 shows the summary classification performance. As expected, both the PST and the PSA produce comparable results with respect to modeling and classification of protein families. The PST had an average true positive rate of 90.8%, while the PSA had 90.2%. We note the signif- icant differences in the family sizes for the PST and PSA results. Although we used the same general Pfam dataset, there has been various additions to the Pfam database since the original publication of the PST results. On average, the size of the families in our dataset was 128.21, while the size used for the PST was 97.82. The total size of the current dataset used for the PSA was more than 1500 sequences larger than that of PST (6539 versus 4891). As can be seen from Table 4.4, most of the families where the PST performed significantly better than the PSA could be due to this difference in family sizes (see for example, ank, C2, efhand).

One advantage of performing classification using Eqn. (4.3), is that, when there is an error in the classification, we can consider the family with the next highest predicted probability. For instance, we can consider the the top-k classification rate, which shows the probability of finding the correct protein family within the first k families, as ordered based on their generated probabilities, using the test sequence. Fig. 4.3 shows the classification performance of the PSA on some sample families, using the top-k classification rate. We can observe how the CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 79 classification performance rapidly approaches 100% after the first few k values.

● ● ● ● ● ● ● ●

● True Positive Percentage Positive True

7tm_1 ● adh_short cytochrome_b_C efhand 0 20 40 60 80 100

0 2 4 6 8 10

Top k

Figure 4.3. Top-k classification rate for sample protein families using the PSA.

4.6.2 Space Consideration

A major problem with suffix trees is their practical memory space requirement. Although they have the same theoretical linear space requirement as suffix arrays, in practice, suffix trees consume much more space [3, 46]. This was our primary motivation for developing the PSA. Table 4.6 shows the summary data on the protein families in Pfam used in our experiments. Table 4.7 compares the memory space required to construct the PST [17] and SPST [71]) with that required for the PSA.

We have included results for PST-20, PST-FULL, and SPST [71]. PST-FULL corresponds to the complete PST with no pruning, i.e. with the full string depth for each leaf node. PST-20 corresponds to orde-20 PST, i.e. PST with a maximum string depth of 20 symbols. This was the variant used in [17]. The SPST proposed in [71] also involved some pruning of the suffixes. First, we can observe the nature of the protein sequences (Table 4.6). The maximum branching M factor (γ = N ) observed was 0.75, while the minimum was 0.46. The average was 0.62. As a key performance measure, we used the memory consumption factor (MC Factor), defined as the ratio of the required memory to the total sequence length (N) of the family. We compared CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 80 the MC Factor for PSA, PST, and SPST. The table shows that the PSA ratio was steady at about 33.16N bytes. The maximum memory required by the PSA for any of the families was 53.46N bytes. This can be compared with the (mean and maximum) memory needed for PST- 20 (41.67N,222.12N), PST-FULL (167.47N,1216N) and SPST (67.17N,111.53N). Fig. 4.4 shows more detailed information on the memory consumption needed to construct the data structures, using the Pfam protein families. Perhaps, more significantly, while the PSA memory is relatively constant independent of the sequence or family, we can observe the huge fluctuation in the memory needed for the PST and SPST, as captured by the range and standard deviation on the memory consumption factor.

● PSA PST−20 PST−FULL SPST MC Factor

● ●●●●●●●●●●●●●●●●● ● ● ● ● ●● ● ●●● ●● ● ●●● 0 200 400 600 800 1000 1200

0 20 40 60 80 100 120 140

File Size (K)

Figure 4.4. Memory consumption factor (MC Factor) needed to construct the PSA and PST data structures for the first 51 protein families in Pfam.

4.6.3 Computational Time Requirement

Table 4.8 shows the summary of the time required for constructing the PSA and the PST data structures. Table 4.9 shows the corresponding summary of the time needed for prediction using the models. The tables show that for prediction, on average, the PSA is about 2.5 times faster than PST-20. The PSA was much faster at the construction stage. For instance, while the PSA was about 3 times faster to build than PST-20, and about 250 times faster than constructing CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 81

PST-FULL. The speedup could be related to the fact that the PSA requires a much smaller construction space, and thus less time is spent on moving data in memory.

4.6.4 PSA in Phylogenetic Tree Construction

Ron et al [109] and Bejerano and Yona [17] have enumerated various applications of the PST. As earlier indicated, the PSA is simply a more efficient alternative to the PST. Thus, the PSA can be used anywhere the PST is used. To study this versatility of the PSA/PST further, we considered a new application of the PSA – specifically, its use in the problem of phylogenetic tree construction, using mtRNA or mtDNA sequences. We used the mtDNA se- quences from 20 species, namely, human (Homo sapiens, V00662), common chimpanzee (Pan troglodytes, D38116), pigmy chimpanzee (Pan paniscus, D38113), gorilla (Gorilla gorilla, D38114), orangutan (Pongo pygmaeus, D38115), gibbon (Hylobates lar, X99256),baboon (Pa- pio hamadryas, Y18001), horse (Equus caballus, X79547), white rhinoceros (Ceratotherium si- mum, Y07726), harbor seal (Phoca vitulina, X63726), gray seal (Halichoerus grypus, X72004), cat (Felis catus, U20753), fin whale (Balenoptera physalus, X61145), blue whale (Balenopter- amusculus, X72204), cow (Bos taurus, V00654), rat (Rattusnorvegicus, X14848), mouse (Mus musculus, V00711), opossum (Didelphis virginiana, Z29573), wallaroo (Macropusrobustus, Y10524) and platypus (Ornithorhyncus anatinus, X83427). This is the same dataset previously used in constructing phylogenetic trees by Otu et al [99] and Li et al [73]. This is a challenging dataset, and there has been some debate on the position of some of the species [24, 105].

To construct the phylogenetic tree, we use the PSA to compute a dissimilarity measure between every pair of species in the dataset. First we construct the PSA for the mtDNA sequence for each species. Then for a given species, we compute the PVLMM, the probability that the given species is generated by the PSA constructed from each of the other species. After getting the probabilities, we use the quantity λ(A,B) = −logPVLMM(A,PSAB) as the dissimilarity measure between the sequences from two species A and B, where PVLMM(A,PSAB) is the predicted probability that sequence A is generated by the model represented by the PSA of sequence B.

We then use the measurements λ(A,B) for all pairs of species to construct the phylogenetic tree.

Fig. 4.5 shows the constructed phylogenetic tree using the PSA. The results are generally in agreement with earlier work on this dataset (see [73,99] for example). The only major difference CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 82 is in the placement of cow, which is supposed to be closer to bluewhale and finwhale than to mouse and rat. This is a very encouraging result, especially, given that the PSA approach is completely alignment free. We believe a more detailed study of the use of PSA in phylogentic tree construction (for instance, deriving other features or measures of similarity) could further improve the results. platypus cat wallaroo opossum cow rat g seal h seal baboon horse w rhino mouse gibbon gorilla orangutan human f whale b whale pig chim com chim

Figure 4.5. Phylogenetic tree for 20 species constructed using the predicted probabilities obtained using the PSA.

4.7 Summary

We have presented the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov models. The PSA provides the same functionality as the probabilistic suffix tree (PST), but at a significantly reduced time and space requirement. Given a sequence of length N, construction and learning in the PSA is done in O(N) time and space, N independent of the Markov order. Prediction using the PSA is performed in O(mlog |Σ| ) time, where m is the pattern length, and Σ is the symbol alphabet. The specific memory requirement for PSA constuction is 33N bytes in the worst case, and 26N bytes on average, including space CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 83 for the suffix array and the input sequence. This can be compared with the 41N bytes needed using the PST.

We have shown experiments in computational biology. The first experiment compares PSA with PST [17] and SPST [71] on the same data set. The space for PSA is more efficient than PST-FULL and SPST. The space of PSA is close to the PSA-20 which only stores L=20 depth path. The construction time of PSA is significant fast than PST and SPST and the prediction time of these three methods (PSA, PST and SPST) is similar. The other experiment is Phyloge- netic Tree Construction by PSA. This experiment shows a very encouraging result, especially, given that the PSA approach is completely alignment free. CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 84

Table 4.4. Performance of the PSA in modeling and prediction of protein families. Families correspond to the first 51 protein families with 12 or more members in the Pfam database, ordered alphabetically based on their abbreviated names in Pfam. For comparison, we have included the results obtained using the PST [17] on the same data set. (TP stands for true positive, while MD stands for missed detection). ∗∗The family apple was not in the dataset used in [17]. Family Size # MD TP (%) Size # MD TP (%) by PSA by PSA by PST by PST 7tm 1 530 26 0.951 515 36 0.930 7tm 2 36 2 0.944 36 2 0.944 7tm 3 12 1 0.917 12 2 0.833 AAA 79 9 0.886 66 8 0.879 ABC tran 330 46 0.861 269 44 0.836 actin 160 22 0.863 142 4 0.972 adh short 186 48 0.742 180 20 0.889 adh zinc 129 22 0.829 129 6 0.953 aldedh 69 11 0.841 69 9 0.870 alpha-amylase 114 0 1.000 114 14 0.877 aminotran 63 14 0.778 63 7 0.889 ank 305 60 0.803 83 10 0.880 apple∗∗ 16 1 0.938 arf 43 2 0.953 43 4 0.907 asp 72 5 0.931 72 12 0.833 ATP-synt A 79 1 0.987 79 6 0.924 ATP-synt ab 183 1 0.995 180 6 0.967 ATP-synt C 62 1 0.984 62 5 0.919 beta-lactamase 51 2 0.961 51 7 0.863 bZIP 95 14 0.853 95 10 0.895 C2 101 21 0.792 78 6 0.923 cadherin 168 12 0.929 31 4 0.871 cellulase 40 3 0.925 40 6 0.850 cNMP binding 69 4 0.942 42 3 0.929 COesterase 62 3 0.952 61 5 0.918 connexin 40 3 0.925 40 1 0.975 copper-bind 61 1 0.984 61 3 0.951 COX1 80 4 0.950 80 13 0.838 COX2 114 10 0.912 109 2 0.982 cpn10 58 1 0.983 57 4 0.930 cpn60 84 1 0.988 84 5 0.940 crystall 103 6 0.942 53 1 0.981 cyclin 80 19 0.763 80 9 0.888 Cys-protease 95 1 0.989 91 11 0.879 cystatin 88 11 0.875 53 4 0.925 Cy knot 61 6 0.902 61 4 0.934 cytochrome b C 133 4 0.970 130 27 0.792 cytochrome b N 170 4 0.976 170 3 0.982 cytochrome c 175 10 0.943 175 11 0.937 DAG PE-bind 108 5 0.954 68 7 0.897 DNA methylase 57 2 0.965 48 8 0.833 DNA pol 51 12 0.765 46 9 0.804 dsrm 22 14 0.364 14 2 0.857 E1-E2 ATPase 117 2 0.983 102 7 0.931 efhand 739 96 0.870 320 25 0.922 EGF 676 33 0.951 169 18 0.893 enolase 41 2 0.951 40 0 1.000 fer2 88 15 0.830 88 5 0.943 fer4 156 34 0.782 152 18 0.882 fer4 NifH 49 2 0.959 49 2 0.959 FGF 39 2 0.949 39 1 0.974 CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 85

Table 4.5. Summary performance in protein family classification using the PSA and PST Family Size # MD TP (%) Size # MD TP (%) by PSA by PSA by PST by PST Mean 128.216 12.373 0.902 97.820 8.720 0.908 Std 145.798 17.711 0.103 84.778 8.690 0.050 Min 12.000 0.000 0.364 12.000 0.000 0.792 Max 739.000 96.000 1.000 515.000 44.000 1.000 Total 6539 631 4891 436

Table 4.6. Summary data on the first 51 families in Pfam, as described in Table 4.4. M Protein Family Length # of Internal γ = N Min Max Size (N) nodes (M) LCP LCP Mean 128.216 22173.451 252.353 0.616 190.608 14.659 Std 145.798 23359.451 136.862 0.079 139.531 9.198 Min 12.000 1360.000 68.000 0.460 26.000 3.987 Max 739.000 140744.000 753.000 0.750 719.000 49.912

Table 4.7. Construction memory needed for the PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. PSA MC PST-20 MC PST- MC SPST MC Factor Factor FULL Factor Factor Mean 616 33.16 425 41.67 1577 167.47 1068 67.17 Std 655 8.23 116 42.68 125 211.22 541 23.77 Min 71 16.74 263 4.33 1455 12.77 79 18.53 Max 4523 53.46 1023 222.12 2235 1216 2547 111.53

Table 4.8. Construction time comparison for PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time is time needed per family (in seconds). Speedup is computed as the ratio with respect to PSA time. PSA PST-20 Speedup PST-FULL Speedup SPST Speedup Mean 0.092 0.244 3.150 12.648 250.967 1.825 1.740 Std 0.106 0.231 1.297 9.984 238.806 2.701 2.617 Min 0.005 0.016 0.854 3.780 20.917 0.047 0.036 Max 0.545 1.376 6.895 56.540 1076.000 17.121 16.658 CHAPTER 4. THE PROBABILISTIC SUFFIX ARRAY 86

Table 4.9. Prediction time comparison between PSA and PST. Results are based on the first 51 families in Pfam, as described in Table 4.4. Recorded time (in seconds) is prediction time per family – i.e. total time needed to predict all members in the family against all the other families. PSA PST-20 Speedup PST-FULL Speedup SPST Speedup Mean 87.71 98.83 2.52 113.25 2.79 207.03 3.4 Std 93.89 66.4 3.28 74.47 3.46 130.93 2.39 Min 5.00 59.23 0.25 21.47 0.32 33.50 1.03 Max 576 528 20.13 579.3 18.86 600 17.6 Chapter 5

Circular Pattern Matching

5.1 Introduction

The CPM problem was first introduced in 1980 [19]. Since then, variations of the circular pattern matching problem has been studied. The first variant [46, 115] is the exact circular pattern matching(ECPM) problem. This problem is to find all occurrences of a circular pattern P in a text T without any error. The second variant [44, 81] is the approximate circular pattern matching problem. This problem allows some error between the circular pattern P and the text T. According to the definitions in Chapter 2 on related work, most approaches have focused on the existential query for the ACPM1 problem.

The ACPM2 problem is given text T, circular pattern P and maximum error k, return all positions where the circular string [P] match to text T with at most k errors. [P] is said to be a k-approximate match with text T at position j ∈ [1...n − m − k + 1] if EDc(P,T[ j... j + m]) ≤ k, where 0 ≤ t ≤ m − 1,−k ≤ n − m. This is clearly more difficult than the ECPM problem or the ACPM1 problem for existential and counting queries.

Main Results. Our main goal is to solve the ECPM and ACPM2 problems for existential queries which were introduced under related work. We then define and solve the ECPD and ACPD problems to find “interesting” circular patterns, as defined using specified constraints.

87 CHAPTER 5. CIRCULAR PATTERN MATCHING 88

In this chapter, we present a new algorithm to solve the ECPM problem. This algorithm runs in linear time and space complexity. To date this is the best ECPM algorithm with respect to time and space complexity. We also present four algorithms to solve the ACPM2 problem. Three of the algorithms report complete results and one is a greedy(suboptimal) algorithm reporting an incomplete result. We compare our algorithms with other algorithms in the literature such as those reported by Maes [81],Gregor [44],Uliel [123] which were introduced in related work. On average, our ACPM2 algorithm provides the best result for the ACPM2 problem with respect to time and space complexity.

The following three theorems represent our main contributions on the CPM problem.

Theorem 5.1: Given a text T = T[1..n] and a circular pattern P = P[1..m], with symbols from an alphabet Σ, the ECPM algorithm can solve the ECPM problem in O(mlog|Σ|) worst case time after constructing the suffix tree and suffix links in O(n) time and space complexity.

Theorem 5.2: Given a text T = T[1...n] and a circular pattern P = P[1...m], with symbols from an alphabet Σ, Algorithm ACPM2 solves the ACPM2 problem in O(km2n) time, and O(n) space.

Theorem 5.3: Given a database sequences SeqDB, with Z sequences and N total symbols

(|Σ| could be O(N)), Algorithm ACPM2 solves the all-against-all ACPM problem in O(kmaN) 2 time cost on average, and O(kmmN ) time worst case, using O(N) worst case space, where N ma = Z , mm is the length of the longest sequence in SeqDB.

Organization. In the next section, we present a linear-time linear-space algorithm for the ECPM problem. Algorithms for the ACPM2 problem are presented and analyzed in Section 3. In Section 4, we show experiments on analyzing circular permutations in multidomain proteins using our algorithms. Based on the results, we perform protein function prediction for multido- main proteins. In Section 5, we summarize our work on circular pattern matching problems. CHAPTER 5. CIRCULAR PATTERN MATCHING 89

5.2 Exact Circular Pattern Matching Problem

In this section, we present an algorithm to solve the ECPM problem in linear time and linear space. Our method is index-based and can be built on the space-efficient virtual suffix tree proposed earlier in Chapter 3. The key to our method is the suffix link provided by the suffix tree and the VST. To our knowledge, this is the first time that the suffix link has been exploited to solve the circular pattern matching problem.

Notation. Before describing the algorithm, we assume there is a sequence database SeqDB with Z sequences. The total number of symbols in SeqDB is N. Let SeqDB[i] be the i-th sequence in SeqDB, where 0 < i ≤ Z. Let mi be the length of SeqDB[i]. The average number N of symbols per sequence in SeqDB is ma = Z . Let k be the allowed error in a match. In this database SeqDB, the alphabet size is |Σ|.

5.2.1 Linear Time ECPM Algorithm

The index-based approach was first introduced by Iliopoulos and Rahman [51]. However, their methods were quite complex and difficult to implement. The best time complexity of Iliopoulos and Rahman’s algorithms is O(N logN). Our algorithm is relatively simple with lower time and space complexity. First we give an example of ECPM and use this to explain our algorithm. Then we present a formal algorithm for the ECPM problem with input pattern P and ST, the suffix tree constructed from T. We build this algorithm to solve the all-against-all version of the ECPM problem, given a database of sequences.

Examples. In chapter 3, we showed the suffix tree for the string T =missississippi$. We utilize this example again here. Figure 5.1 shows the suffix tree for T with a few suffix links. Let circular pattern P =iss, so [P] = {iss,ssi,sis}, where f 0(P) = iss, f 1(P) = ssi, f 2(P) = sis. f i+1(P) is obtained by removing the first symbol from f i(P) and then appending this symbol at the end, where 0 ≤ i < m − 1. So in our algorithm, when we match f i(P), we use the suffix link to find the first (m−1) symbols of f i+1(P). This operation is done in constant time. Then we only need to compare the last symbol of f i+1(P) to the corresponding position in the suffix tree and if the symbols match, we report a match for f i+1(P). CHAPTER 5. CIRCULAR PATTERN MATCHING 90

In our algorithm, we use the suffix link to match the circular pattern in an incremental manner. For example, we search f 0(P) = iss first. In this case, we find iss is in the edge 1 between node N1 and node N4, thus there is a match. Then match for f (P) = ssi by following the suffix link from node N4 to node N6. Thus we get the next matching edge in internal node N6 by using the “skip/count” method [46]. The time cost of this operation is constant time. To look 2 for the matches to f (P) = sis, we use the suffix link from node N6 to node N5. The prefix si has matched up to node N5, thus we only check the outgoing edges from node N5 to its children, and select the one whose first symbol matches symbol s. We find the edge between node N5 and N8 is matched by symbol s. So all of circular matchings of [P] have been found whose positions start from the leaf nodes of N4,N6 and N5 respectively. From position 2 to position 9 in T, there are nine substrings that matched the circular pattern P = iss.

We give another more complex example for searching f 0(P) = ism. First, we find the prefix is in the edge between node N1 and node N4, but there is a mismatch at the last symbol 1 m. To search for f (P) = smi, we follow the suffix link from node N4 to node N6. The previous iteration only matched two symbols, so this iteration start from the second symbol of f 1(P) = smi which is m. However, the length of path from root to N6 is 3, which is larger than 1, so this searching starts from node N3 which is the parent node of N6. Then we check the second symbol, but it is still a mismatch. Thus we know that f 1(P) = smi does not occur in T. Now to 2 search for f (P) = mis, we again follow the suffix link from node N3 to the root, and continue the match from the root. Thus, we find a match of f 2(P) = mis on the path to leaf node 6. Therefore, the system will report one occurrence in leaf node 6 of the suffix tree, which corresponds to position 1 in T.

ECPM Algorithm Description

Algorithm ECPM (Algorithm 5.1) shows the pseudo code for our exact circular pattern match- ing algorithm. In this algorithm, the input ST is the suffix tree for the text T. T denotes a given text (string, or sequence) to be searched. For a given pattern P with length of m, the algorithm derives a new pattern, called PP, PP is derived from Pattern P which repeats the Pattern P and removes the last character of P. Thus is, PP = P[1...m]◦P[1...m−1]. Therefore, the new pattern PP has a length of 2m − 1. CHAPTER 5. CIRCULAR PATTERN MATCHING 91

In the suffix tree, ST, the algorithm searches PP starting from the root of ST from the leftmost branch to the rightmost. The iteration variable i indicates the starting position in PP.

The variable len indicates the position of node label. When len is equal to 1, a new child node is created for the current node. Line 4 in the algorithm represents the process of finding the right child for the current node. This operation requires O(log(|Σ|)) time for comparison. If len is larger than 1, the matching operation takes place in the current node.

The variable top is the pointer which points to the current circular pattern. If the length of the matched pattern PP[top...i] is m, it means that a circular pattern occurs. Since the number of possible circular patterns is m, pointer top will never be larger than m. If top pointer is larger than i, it indicates that the symbol pattern[top] character never occurred in the text. Thus, this set of circular patterns cannot be in the text. The pointer top is increased by one in the following two cases.

The first case is when a mismatch occurs in the current node. In this situation, the current node is replaced by the node which is pointed to by its suffix link. At the same time the string depth for the current node decreases by one. Pointer top increases by one. After this, the next iteration of the string matching process starts. The other case is when the path length from the root to the current node is m. It indicates one of the circular patterns occurs inside Text. Before matching the next position of PP, pointer top increases by one to keep the length of the pattern equal to m.

Based on the ECPM algorithm, we develop an algorithm to solve the all-against-all prob- lem. That is, we compute ECPM(SeqDB[i],SeqDB[ j]),∀i, j,i 6= j.Algorithm ALL-VS-ALL- ECPM (Algorithm 5.2) enumerates each sequence in the database as a pattern to search the circular relation by using the former ECPM algorithm. It constructs the suffix tree ST for the entire database first. Then let each sequence be a pattern to search in ST.

Algorithm Analysis

In the ECPM algorithm (Algorithm 5.1), when the same position in PP is compared again (Line 18), pointer top increases by one. top remains less than or equal to m. Thus, Line 18 runs at CHAPTER 5. CIRCULAR PATTERN MATCHING 92 most m times.

The “for” loop from Line 3 to Line 23 runs at most 2m−1 times. Inside this loop, the cost of Line 4 is at most O(log|Σ|). The other lines inside this “for” loop have a constant running time. Therefore, the time cost of this algorithm is at most log|Σ| × (3m − 1)=O(mlog|Σ|). Given the relation between the ST and VST as described in Chapter 3, the proposed algorithm can easily be modified to use the VST.

The space cost of implementation using the suffix tree is 32n + 2m − 1, where m ≤ n. The space cost if we implement use the VST is 13.8n + 2m − 1.

We summarize the above in Theorem 5.1.

Theorem 5.1: Given a text T = T[1..n] and a circular pattern P = P[1..m], with symbols from an alphabet Σ, the ECPM algorithm can solve the ECPM problem in O(mlog|Σ|) worst case time after constructing the suffix tree and suffix links in O(n) time and space complexity.

Algorithm ALL-VS-ALL-ECPM (Algorithm 5.2) builds the ST in O(N) time cost. Time cost of line 4 is O(mlog|Σ|), so the total cost of the loop from line 2 to line 5 is O(N). Worst case time complexity of this algorithm is O(N log|Σ|). Space complexity of this algorithm is O(N).

5.2.2 Comparison of ECPM algorithms

Table5.1 shows the comparison of our ECPM algorithm with Iliopoulos and Rahman’s algorithms [51] CPI-I, CPI-II. The table shows that the ECPM algorithm is the best algorithm with respect to time complexity, even when |Σ| → O(N). In terms of space, the ECPM algorithm has the same time complexity with CPI-II algorithm which was implemented using the suffix array. The CPI-II algorithm implemented using the compressed suffix array has the best space complexity among these algorithms, but the time complexity is not as good. We note that for practical space complexity, the CPI-I need two suffix trees and various auxiliary arrays, and hence will require a significantly larger practical space. CHAPTER 5. CIRCULAR PATTERN MATCHING 93

5.3 Approximate Circular Pattern Matching Problem

In this section, we present our algorithms for the ACPM problem. We start with a sim- ple greedy algorithm and then consider a suffix-array based q-gram algorithm for the ACPM problem. First, we introduce a basic LIS algorithm APM-VIA-LIS (Algorithm 5.3) to find an approximate match of a pattern P in text T. The algorithm does not handle circular pattern matching. Next, we propose our algorithms for the ACPM problem and analyze their complex- ity. The LIS method for pattern matching will be used in these algorithms. When we use this algorithm to solve the ACPM problem, we have to use all circular shifts f t(P) to match the text T.

The LIS method utilizes the LIS algorithm [46,50] to calculate the longest common subse- quence (LCS) [31,46] between two sequences. The verification process is to verify whether the edit distance between these two sequences is less than k. When we calculate LIS and LCS, each matched symbol will occur in the LCS. We are able to get occurring positions in two sequences for the matched symbols. We can use these positions to check the number of edit operations between two matched symbols. Thus the algorithm reports the edit distance between these two mn sequences. The time complexity for this algorithm is O( |Σ| logm). When |Σ| is close to O(m), as in the case for multidomain proteins, the time complexity will be O(nlogm).

5.3.1 Greedy ACPM Algorithm

Algorithm ACPM-GREEDY (Algorithm 5.4) compares any two sequences with one as text and the other one as circular pattern in two main steps. The first step is the generation of LCS. The second step will verify whether the LCS generated in step 1 represents a part of a valid subsequence. These two steps were presented in Algorithm APM-VIA-LIS(Algorithm 5.3).

Table 5.1. Comparison of ECPM algorithms ECPM algorithm CPI-I [51] CPI-II [51] CPI-II(Compressed Suffix Array) [51] Time Complexity O(N log|Σ|) O(N log1+ε N + N loglogN) O(N logN) O(N logN logN) Space Complexity O(N)bytes O(N log1+ε N)bytes O(N)bytes O(N logN)bits CHAPTER 5. CIRCULAR PATTERN MATCHING 94

First, Algorithm ACPM-GREEDY (Algorithm 5.4) will choose two sequences, one as text and the other one as circular pattern. After getting text T and circular pattern P, the ACPM works on the following two steps. The first step creates a new pattern PP by concatenation of P. And then the second step calculates the LCS between PP and T and returns the LCS string lcs. This procedure is performed in line 5. This step also verifies the approximate pattern matching with parameter k.

This method is greedy(suboptimal): it finds only one occurrence of the pattern, it may not to detect all the existing circular patterns in the text. If there is more than one LCS in T, this method may miss some matches.

Time Complexity Analysis

For the time complexity analysis, we need to consider three cases.

1. For the case of using one sequence as pattern P and the other sequence as text T, the mn time complexity of getting LCS (line 5) is O( |Σ| logm). When |Σ| is close to O(m) as in mutildomain proteins, the time complexity will be O(nlogm).

2. For the case of searching for one sequence against a group of sequences (loop from line Z 2 to line 6), the time complexity is O(∑i=1 ni logm) = O(N logm), where N is the total Z length of all sequences used, N = ∑i=1 ni Z is the number of sequences, and ni is the length of the i-th sequence in SeqDB.

3. For the case of searching for a CP among a group of sequences (loop from line 1 to line 7), the time complexity is ZN logm), where m is the length of the longest sequences. The final time complexity is O(N2 logm), since Z = O(N). In our experiment with multidomain proteins, N ≈ 6Z

5.3.2 ACPM with LIS

Here we present a second algorithm (Algorithm 5.5) for the ACPM problem. This method also utilizes the LIS algorithm [46, 50] to calculate LCS [31, 46]. However, the method to CHAPTER 5. CIRCULAR PATTERN MATCHING 95 construct Pattern P and Text T are changed. More importantly, unlike the greedy algorithm described earlier, this algorithm can detect all the circular patterns. The pseudo code is listed in Algorithm ACPM-LIS . One sequence is used to construct the circular pattern P. Another sequence is used as text T. All possible circular shifts of the pattern are enumerated. The subT is extracted from sequence T by a sliding window with size of m + k. Finally, each enumerated circular shift of the pattern is searched against the sliding window separately. Assuming the sequence to be researched using a sliding window has the length n, then, there are max{1,n − (m + k) + 1} = O(n) windows to be constructed.

During the searching process, if there is a common subsequence with length m-k is found, then, there is a circular pattern occurring in Text T. This method reveals all the circular patterns in each sequence. Thus it can find the optimal solution, with respect to completeness of the results.

Time Complexity Analysis

m(m+k) For one sliding window, the time complexity of finding one circular pattern is O( |Σ| logm (line 7 to line 9). There are O(n) siding windows and O(m) circular patterns inside one query pattern and one text (line 5 to line 10). Therefore, the time complexity of detecting a circular m(m+k) pattern between one pattern and one text is O( |Σ| logm × mn). Examining these terms, we can find that k is at most O(m) and |Σ| = O(m). Thus, the time complexity can be abbreviated as O(mlogm × mn) = O(m2nlogm).

For each pattern, the algorithm compares with the other sequences (line 2 to line 11). The 2 Z 2 time complexity of each pattern comparing with all other sequences is O(m logm×∑i=1 ni)=O(m N logm), where N is the total length of all sequences and ni is the length of i-th sequence in SeqDB.

Z 2 After considering all patterns (line 1 to line 12), the time complexity becomes ∑i=1 mi N logm), Z 2 Z 2 2 where m is the length of the longest sequence. In fact, ∑i=1 mi ≤ (∑i=1 mi) = N . The final time complexity is therefore O(N3 logm). This is the worst case complexity. CHAPTER 5. CIRCULAR PATTERN MATCHING 96

5.3.3 ACPM with q-grams and Suffix Array

The q-gram approach [3] is a two-phase method to reveal all approximate patterns. The first phase is the Hypothesis Phase which determines all potential matching positions using only q-gram substrings of P and T. In the second phase, the Verification Phase, the algorithm verifies each potential matching position to report the correct matches. First we introduce an ACPM algorithm with q-grams, then we present a hybrid algorithm with ECPM and ACPM with q-grams. The latter algorithm is more time efficient in practice, but the theoretical time complexity is the same as the ACPM algorithm with q-grams.

Figure 5.2 shows the number of hypotheses with different q values in the ProDom database of multidomain protein sequences [32]. Here we used N = 106. We notice that when q increases, the number of hypotheses will decrease fast. So when q is not very small, e.g q ≥ 3, the number hypotheses will typically reduce to O(N).

Algorithm Description

Algorithm ACPM-QGRAM (Algorithm 5.6) shows the process. Lines 1-7 is the preprocessing stage. This stage constructs a long concatenated sequence, seq, using all the sequences so far encountered in SeqDB. It also builds an auxiliary array pos. This array is used to maintain the relationship between position in seq and SeqDB. Line 8 constructs the suffix array for the concatenated sequence.

Lines 9-24 is a loop to generate all of the hypotheses for the q-gram method using the LCP array. Line 11-13 determines candidate matching positions that have the same q-gram prefix. Line 14 considers each pair of candidate positions obtained with the current q-gram for verification.

Lines 15-22 is the verification algorithm. We use the LIS algorithm to verify the approx- imate patterns. Constructing the circular pattern is the same as in the previous algorithm. We enumerate the m circular patterns from a sequence one by one. We construct subT from the second sequence T as follows. Assume the q-gram occurs in position y, so let subT be the substring of T which includes T[y...y + q − 1] and the length is (m + k). So text will be CHAPTER 5. CIRCULAR PATTERN MATCHING 97

T[y − m − k + q...y + q − 1], T[y − m − k + q + 1...y + q], ... T[y...y + m + k − 1]. There are (m + k − q) number of such substrings.

Time Complexity Analysis

The time complexity of LIS to verify a pattern vs. substrings of Text which includes one matched q-gram is O(mlogm × (m + k − q)). Since k ≤ O(m) and q ≤ O(m), the time com- plexity is O(m2 logm). Each pair in the same group has O(m) circular pattern operations, thus the time complexity for verifying each pair is O(m2 logm×m) = O(m3 logm) There are r groups r 2 3 and group i has ni elements and there are ∑i=1 ni pairs. The total complexity is O(m logm × r 2 2 3 ∑i=1 ni ). The worst case occurs when r = 1 with time complexity of O(N m logm). For the average case, m is the average length of the sequences. Then the time complexity will be in 3 N O(Nma log(ma)), where is ma = Z .

Hybrid Algorithm

We can combine the ECPM algorithm (ECPM) and the ACPM algorithm with q-gram (ACPM- QGRAM) for a possible improvement in practical time. The ECPM algorithm reports the exact circular pattern matches which should not be computed again when we search for approximate matches. First, the hybrid algorithm uses the ECPM algorithm to look for matching exact circular patterns and stores them. Next the hybrid algorithm uses the q-gram method to generate hypothesis. In the verification phase, the algorithm checks whether this hypothesis has occurred before within the ECPM results. If it occurred, then the verification stage is skipped.

Thus this approach will reduce the practical running time, given the reduced verifica- tions. For checking the each hypothesis, it takes O(logN) time cost. The time complexity is O(N2(m3 logm + logN). But this algorithm needs O(N2) space to maintain the pair-wise results from the ECPM algorithm. CHAPTER 5. CIRCULAR PATTERN MATCHING 98

5.3.4 Improved Algorithm: ACPM with Bidirectional Edit Distance

In this subsection, we propose an algorithm to solve the all-against-all ACPM2 problem. The algorithm uses a two-stage hypothesis generation – hypothesis verification paradigm. After generating the hypotheses using the q-gram filteration method, the algorithm verifies each hy- pothesis in O(km) time complexity, where k is the maximum error allowed and m is the length of the pattern. This algorithm follows the same general paradigm as the previous algorithm (Section 5.3.3), however, there are significant differences in both the hypothesis. In practical, this algorithm uses the suffix tree than the suffix array.

Filteration via q-grams

The q-gram approach [53] is a filteration method which is based on the fact that for any two strings that are approximate matches, there must be some exact matching sub-region between them. The problem is how to determine such sub-regions and their length(s). Lemma 5.1 shows this fact and points out how to choose the value q, the minimum length of the matching regions.

Lemma 5.1 [14] : Given a text T, a pattern P of length m, and an integer k, (0 ≤ k < m), for a k-approximate match of P to occur in T, there must exist at least one q-length block of m symbols in P that form an exact match to some q-length substring in T, where q = b k+1 c.

Approximate pattern matching based on q-gram uses two phases. The first phase is the hypothesis phase which identifies all potential matches using q-gram filtering operations. Based on partial exact matching, the algorithm can find O(N2) potential matches. The second phase is the verification phase. Here the hypothesized potential matches from the first phase are verified to determine whether they are true k-approximate matches. Our ACPM2 algorithm is based on q-gram filteration. First we generate hypotheses for potential approximate circular pattern matches by q-gram filter operations. Then we verify each hypothesis to find the true circular patterns. CHAPTER 5. CIRCULAR PATTERN MATCHING 99

ACPM Hypothesis Generation

Algorithm 5.7 represents our hypothesis generation phase using suffix trees. First, it builds the generalized suffix tree ST for the sequence database. Then, the algorithm treats each sequence m as a pattern. For each pattern, q is calculated using q = b k+1 c, where m is the length of the current pattern. There exists O(m) overlapping q-grams in a given m-length pattern. We search each overlapping q-gram of the pattern from left to right in the suffix tree ST. Each exact match is a hypothesis with the current q-gram. The use of the suffix tree implies that for each given distinct q-gram in P, all the occurrences in T will be found as the leaf nodes from the same parent node in the suffix tree. Finding this parent node requires only one q-gram match

in O(qlog|Σ|) time. Similar to the ECPM algorithm, after searching the first q-gram, say qi = P[i...i+q−1] in ST, the next q-gram qi+1 = P[i+1...i+q] can be matched incrementally from qi by using suffix links in constant time (Using the function Search-in-ST). Thus, counting all the matching q-grams or locating all the parent nodes for each unique matching q-gram can be done in O(mlog|Σ|) time, independent of n or N, where n is the length of the current sequence (the text), and N is the total length of all the sequences in the database (the number of leaf nodes in the generalized suffix tree).

Given pattern P = P[1..m] and text T = T[1..m], there are O(m) q-grams in P and O(n) q-grams in T. So each q-gram of P can produce at most (n − q + 1) exact matches in T. The number of hypotheses is O(m(n − q + 1)) = O(mn). We notice the added difficulty introduced by the ACPM2 problem. Unlike for the ACPM1 problem where only one occurrence of P in T is required, here, we have to verify each of the potential O(mn) hypotheses, in order to identify all the occurrences. In the sequence database seqDB, we have the total length of all sequences Z as: N = ∑i=1 mi. Each q-gram can produce potentially O(N) exact matches. Thus, the number of hypotheses will be in O(N2). In the worst case, there exist O(|Σ|q) unique q-grams. The q N 2 N2 number of hypotheses will be O(|Σ| × ( |Σ|q ) ) = O( |Σ|q ) on average. When q increases, the number of hypotheses will decrease exponentially. It will be O(N) when |Σ| is O(N). CHAPTER 5. CIRCULAR PATTERN MATCHING 100

ACPM Verification Algorithm Description

Our verification algorithm makes use of a novel bidirectional edit distance that uses both the direct sequence and its reverse. Before we introduce our ACPM verification algorithm, we first describe some important characteristics of edit distances.

Lemma 5.2 : Suppose there is an exact matching q-gram common to both the text T = T[1...n] and the pattern P = P[1...m]. Let (i, j) be the position of occurrence of the q-gram in T and P respectively. If this q-gram is part of a true approximate match between P and T, then the edit distance between T and P is given by ED(P,T) = ED(P[1... j −1],T[1...i−1])+ED(P[ j + q...m],T[i + q...n]), where 1 ≤ j ≤ m,1 ≤ i ≤ n,q ≥ 1.

Proof: There exists one q-length block in exact match between the pattern and the text (Figure 5.3). Since this q-gram is involved in a true approximate match between P and T, it must be involved in the edit distance computation between P and T. Thus the optimal edit path must contain the subpath between these regions of the pattern and the text. P[ j... j + q − 1] only compares with T[i...i + q − 1], so the edit cost of the subpath is zero. We don’t need to compare P[1... j − 1] with T[i...n], nor do we need to compare P[ j...m] with T[1...i − 1], because the optimal edit paths do not cross these areas. Figure 5.3 illustrates this using a q- gram exact match (the solid line). Three optimal paths (the dashed lines) pass through point (i, j), but no optimal path can pass through point (i,L) or point (H, j), where L > j and H > i. P[1... j − 1] only compares with T[1...i − 1] and P[ j + q...m] compares with T[i + q...n]. Hence the edit distance between pattern P and text T is the sum of three components ED(P[1... j − 1],T[1...i − 1]), ED(P[ j... j + q − 1],T[i...i + q − 1]) and ED(P[ j + q...m],T[i + q...n]). Since ED(P[ j... j + q − 1],T[i...i + q − 1])=0, the Lemma holds. 

Lemma 5.3 : For unit cost edit operations, the ED(P,T) = ED(PR,T R), where PR and T R are reversed version of P and T respectively.

Proof: This Lemma has been proved in the proof of [118] Lemma 2.

We need to verify each hypothesis generated by Algorithm 5.7. Assume QT and QP are the q-grams from the text T and the pattern P respectively. Let QT = QP = Q. So there is CHAPTER 5. CIRCULAR PATTERN MATCHING 101

an exact matching q-gram between the text and the pattern using this q-gram. We present an O(km) verification algorithm to determine whether the q-gram Q is part of a true k-approximate circular match between P and T.

We now describe the idea of bidirectional edit distance, based on which we verify a given

hypothesis. Suppose QP is a substring of pattern P which starts at position j and QT is a substring of text T which starts at position i, where 0 ≤ j ≤ m and 0 ≤ i ≤ n. That is, QP = P[ j... j + q − 1] and QT = T[i...i + q − 1]. Our goal is to compute the circular edit distance between pattern P and the substring of T denoted subTi, where subTi = T[i+q−1−m−k...i+ m +k]. First, we construct two strings from the pattern P1 = P[ j +q...m]◦ P[1... j −1] and P2 = R R P1 . We also construct two strings T1 = T[i+q...i+m+k] and T2 = T[i−1−m−k +q...i−1] from the text. We compute the edit distance between P1 and T1 using Ukkonen’s algorithm [121]. This can be done in O(km) and returns an array ED1 which contains the minimum edit distance of each row. The value of ED1[h] indicates the minimum distance between P1[1...h] and T1[1...r], where h − k ≤ r ≤ h + k. Similarly, We use the same algorithm to calculate the minimum edit distance array ED2 between P2 and T2. ED2[h] is the minimum distance between P2[1...h] and T2[1...r], where h − k ≤ r ≤ h + k.

Figure 5.4 shows the comparisons made by the algorithm. Here we have P1 = P[ j +q...m]◦ R R R R P[1... j − 1] and P2 = P1 = (P[ j + q...m] ◦ P[1... j − 1]) = P[1... j − 1] ◦ P[ j + q...m] . P2 is matched against T2 in reverse direction, while P1 is matched against T1 in the regular (forward) direction.

We construct an array ED from ED1 and ED2 as follows.   ED2[m − q − h] : 1 ≤ h ≤ m − q ED[h] = 0 : m − q + 1 ≤ h ≤ m (5.1)  ED1[h − m] : m + 1 ≤ h ≤ 2m − q

Lemma 5.4 : For a given hypothesis occurring at positions i and j in T and P respectively,

h+ j+q−2 ED( f (P),subTi) = ED[h] + ED[h + m − 1], where 1 ≤ h ≤ m.

R Proof : According to Figure 5.4, we construct a new string PP as P2 ◦ P[ j... j + q − 1] ◦ P1, R where P2 = P1. Then PP = P[ j+q...m]◦P[1... j−1]◦P[ j... j+q−1]◦P[ j+q...m]◦P[1... j−1]. CHAPTER 5. CIRCULAR PATTERN MATCHING 102

There are three cases in representing f h(P). These cases are indicated as numbers 1,2, and 3 respectively on the double-headed arrows in Figure 5.4.

CASE 1 : h = 1, where f h+ j+q−2(P) = PP[1...m]. PP[1...m] is constructed from two parts. R One is P2 and the other is P[ j... j + q − 1]. The edit distance in this case is ED[1] + ED[m] by Lemma 5.2, where ED[1] is ED2[m − q] and ED[m] = 0.

CASE 2 : 2 ≤ h ≤ m − q, where f h+ j+q−2(P) = PP[h...h + m − 1]. PP[h...h + m − 1] is R constructed from three parts. The first part of P2 . The second part is P[ j... j +q−1] and the last part is P1. From Lemma 5.2, we know the edit distance is ED[h] + ED[h + m − 1].

CASE 3 : h = j. This case is similar to the first case, so the edit distance is ED[h]+ED[h+ m − 1].

In these three cases, we only compute 1 + m − q − 1 + 1 = m − q + 1 circular shifts of the pattern. We do not calculate the circular shifts f r(P), where j + 1 ≤ r ≤ j + q − 1. There exists an exact match involving P[ j... j +q−1] (the matching q-gram), and f r(P) does not contain this substring, when j + 1 ≤ r ≤ j + q − 1. Therefore, it is not necessary to calculate f h(P), when j + 1 ≤ r ≤ j + q − 1.

Lemma 5.5 : For a given hypothesis occurring at positions i and j in T and P respectively,

EDc(P,subTi) = min0≤h≤m−1{ED[h]+ED[h+m−1]}, where subTi = T[i+q−1−m−k...i+ m + k].

Proof : There are m circular shifts in the pattern P. From Lemma 5.4, we calculate all h of the minimum edit distances between possible circular shifts f (P) and the text subTi in this hypothesis, where h ∈ [0, j] ∪ [ j + q,m]. The circular edit distance of P against subTi is the minimum edit distance of all possible circular shifts against the text subTi in current hypothesis. The lemma holds. 

Lemma 5.6 : For a given hypothesis, the time complexity of the verification algorithm is O(km).

Proof : Algorithm 5.8 presents the verification processes following the above method. Clearly reversing the strings can be done in linear time, and computing edit distances on the CHAPTER 5. CIRCULAR PATTERN MATCHING 103

reversed string does not change the time required. Similarly, EDc() in Lemma 5.5 can be computed in O(km) time, by maintaining an O(m) array to record intermediate results during the computation of standard edit distance using dynamic programming. Line 4 calls the k- approximate pattern matching algorithm in O(km) time using Ukkonen’s algorithm [121] by bidirection. Line 5 implements equation (5.1). Line 6 to Line 9 run in O(m) time cost to check the matching by Lemma 5.4. Thus the time complexity is O(km). 

Theorem 5.2: Given a text T = T[1...n] and a circular pattern P = P[1...m], with symbols from an alphabet Σ, Algorithm ACPM2 solves the ACPM2 problem in O(km2n) time, and O(n) space.

Proof: Suffix tree construction (including suffix links) can done in linear time and lin- ear space. After constructing the suffix tree, hypothesis generation phase is performed in O(mlog|Σ|) time, independent of n or N, since at this stage we only need a count of the number of hypothesis, and the the parent nodes of the leaf nodes in the ST that correspond to the start positions of the matching q-grams in the text or database. Given pattern P and text T, there exists O(mn) possible hypotheses. From Lemma 5.6, the time complexity of verification algo- rithm is O(km) for each hypothesis. This means that the algorithm solves the ACPM2 problem 2 in O(km × mn) = O(km n) time.

Theorem 5.3: Given a database sequences SeqDB, with Z sequences and N total symbols

(|Σ| could be O(N)), Algorithm ACPM2 solves the all-against-all ACPM problem in O(kmaN) 2 time cost on average, and O(kmmN ) time worst case, using O(N) worst case space, where N ma = Z , mm is the length of the longest sequence in SeqDB.

Proof: The result essentially from Theorem 5.2. For a sequence database seqDB of length N, hypothesis generation phase is performed in O(N log|Σ|). The number of hypotheses will be 2 2 O(N ), so the time complexity is O(kmmN ) in the worst case, where mm is the length of the N2 longest sequences. On average, the number of hypotheses is O( |Σ|q ), then the time complexity N2 is O(kma |Σ|q ), where ma is the average length of sequences in seqDB. When q increases or |Σ| → N, the time complexity will be O(kmaN). Space requirement is in O(N) to maintain the suffix tree data structure.  CHAPTER 5. CIRCULAR PATTERN MATCHING 104

5.3.5 Comparison with Other ACPM Algorithms

In Table5.2, we compare our ACPM-QGRAM algorithm and ACPM-BIDIRECTIONAL al- gorithm with the other related algorithms which were introduced in Chapter 2, namely Maes’ al- gorithm [81], Gregor and Thomason’s algorithm [44] and Uliel et. al’s algorithm [123]. Weiner et. al’s algorithm [126, 127] is a greedy algorithm which may miss some important circular relations. We also compare this algorithm with the other algorithms.

Our goal is to develop an algorithm to solve the ACPM problems and to apply this to study circular permutations in multidomain proteins. We make minor changes in Maes’ algorithm [81], Gregor and Thomason’s algorithm [44], Uliel et. al’s algorithm [123] and Weiner et. al’s algorithm [126, 127] for adapting them to the ACPM problems. Because these algorithms all focus on computing the circular edit distance, we extend them to match the pattern against all the substring of T. This takes (n − m) steps, and hence the total time complexity will increase by n − m = O(n) times.

3 2 The time complexity of our ACPM-QGRAM algorithm is O(maN ) in the worst case. On 3 2 q average, the time complexity is O(maN /|Σ| ), where ma is the average length of sequence m 2 q and N is the total length of sequences, and q = b k+1 c. When q increases, O(N /|Σ| ) will be reduced to O(N), since |Σ|q ≤ O(N).

2 The time complexity of our ACPM-BIDIRECTIONAL algorithm is O(kmmN ) in the worst 2 q case. On average, the time complexity is O(kmaN /|Σ| ), where ma is the average length of se- m 2 q quence and N is the total length of the sequences, and q = b k+1 c. When q increases, O(N /|Σ| ) will be reduce to O(N) since |Σ|q ≤ O(N).

Table 5.2 shows the time complexity for the worst case of these algorithms. The last row shows the average case for the most challenging problem of all-against-all approximate circular pattern matching. Comparing with the ACPM-QGRAM, when m is large (m = O(N)), our q- gram algorithm will be worse. In this case, the Maes [81] algorithm will be the best algorithm in m the worse case. But when k+1 increases and m is not very large, the ACPM-QGRAM algorithm 3 N N will run in O(m N), where m = Z . This can be treated as a constant ( Z ≈ 6 for the case of multidomain proteins). Therefore, under such condictions, the ACPM-QGRAM algorithm is a linear time algorithm on average. The proposed ACPM-QGRAM algorithm is better than the CHAPTER 5. CIRCULAR PATTERN MATCHING 105

Table 5.2. Comparison with other proposed ACPM Algorithms

Maes [81] Gregor et. al’s [44] Uliel et. al’s [123] Weiner et. al’s1 [126, 127] ACPM-qgram ACPM- BIDIRECTIONAL One circular pattern O(m2 logm) O(m3) O(m3) O(m2) O(m2 logm/|Σ|) O(km) against one text =O(mlogm) window One-against-One O(m2 logmn) O(m3n) O(m3n) O(m2n) O(m3 logm) O(kmn) One-against-All O(m2 logmN) O(m3N) O(m3N) O(m2N) O(m2N logm) O(kmN) Z 2 Z 3 Z 3 Z 2 3 2 2 All-against-All O(∑1 m N logm) O(∑1 m N) O(∑1 m N) O(∑1 m N) O(m N logm) O(kmN ) =O(N3 logm) =O(N4) =O(N4) =O(N3) 2 2 2 2 2 3 Average Case O(N ma logma) O(N ma) O(N ma) O(maN ) O(maN log(ma)) O(kmN) All-against-All N (m = ma = Z )

other four related algorithms which were introduced in related work. Comparing the ACPM- BIDIRECTIONAL algorithm with Maes’ algorithm [81] which is the best available algorithm for the ACPM1 problem, we can see that apply Maes’ algorithm to the all-against-all ACPM 2 3 problem requires time in Θ(N ma logma) on average, and O(N logmm) worst case. These are 2 still worse than our proposed algorithm that runs in O(kmaN) time on average and O(kmmN ) worst case. Landau et al’s algorithm [67] runs in O(kmn) to solve the ACPM2 problem (for one-against-one). In the all-against-all ACPM2 case, the time complexity will be Θ(kN2). This algorithm is worse than our algorithm that run in O(kmaN) time on average.

5.4 Experiments

As discussed in Chapter 1, circular permutations and cyclic pattern have been used in var- ious studies in biology. We performed some experiments using the results of the proposed algorithms to study circular permutations in molecular biology. In our experiments, we apply our algorithms on multidomain proteins to look for potential circular permutation relationships between them. We also use the results of the proposed algorithm to predict potential functions for uncharacterized or unknown proteins. In these experiments, the alphabet size is the number of domains which is a large number, close to 106.

1Algorithm could produce incomplete results CHAPTER 5. CIRCULAR PATTERN MATCHING 106

5.4.1 Data Set

Protein Domain Database (ProDom)

The protein domain is a section of the protein sequence whose structure can evolve, function and it exists independently of the rest of the protein chain [32]. Most proteins consist of several domains. The same protein domain may occur in related proteins. The ProDom is a database of known protein domains. The ProDom web site (http://ProDom.prabi.fr) provides a tool to search a protein domain in the protein database. The results are the proteins which contain a given protein domain. Each domain is represented as a unique symbol, thus a multidomain protein is viewed as a sequence of such symbols. The length of the domain representation is generally much smaller than the original protein sequence, but the size of alphabets has increased drastically.

Gene Ontology Database (GO)

The Gene Ontology (GO) project (http://www.geneontology.org/) provides a description of genes and protein products in different databases including the known functions of the genes. Currently the GO Consortium includes many databases such as GeneDB (http://www.genedb.org/), UniProtKB-Gene Ontology Annotation @ EBI (UniProtKB-GOA) (http://www.ebi.ac.uk/GOA/) and FlayDB (http://flybase.bio.indiana.edu/). More details on the GO Consortium is available at http: //www.geneontology.org/GO.consortiumlist.shtml.

The ProDom database provides the Accession Number for the parent protein of each do- main. The Accession Number is also provided for UniProtKB-GOA. This establishes a con- nection between entities in ProDom and their corresponding entities in GO database. In our experiments, we used this relation to obtain the GO terms used to describe the protein func- tion. Based on this, we can predict functions for multidomain proteins using our CPM results obtained using the proteins in ProDom. CHAPTER 5. CIRCULAR PATTERN MATCHING 107

5.4.2 CPM Experimental Design

We implemented the exact circular pattern matching algorithm and three approximate cir- cular pattern matching algorithms and applied them to detect circular patterns in ProDom database. We downloaded data from ProDom web site(http://ProDom.prabi.fr) on March 12, 2009 (ProDom version 2006.1 as released on November 6th, 2008). There were 1,997,497 pro- teins in this database. We removed proteins with less than three domains, and also removed redundant proteins with more than 90% similarity to some other protein. The result is a reduced database with 973,686 proteins. This means that ECPM and ACPM will only apply to proteins that contains an entire copy of another protein.

Results: Speed and Completeness

We ran the four algorithms on the reduced database and use the results to analyze the relation- ship between multidomain proteins. ACPM-QGRAM algorithm was executed on two different parameters, namely q = 1 and q = 2. When q = 1, the result is complete. When q = 2, the result is suboptimal (incomplete). We use the complete results as a benchmark to compare with the results from the algorithm.

The exact algorithm is the fastest algorithm. It only needed six minutes to build the suffix tree and perform searches for all circular patterns. ACPM-LIS algorithm is the slowest al- gorithm. ACPM-GREEDY algorithm is faster than the other ACPM algorithms, but the result has low accuracy (around 50%). Figure 5.5 shows the practical time required by these three algorithms, where q-gram has two instances, q = 1 and q = 2.

A comparison of the outputs of the algorithms provides some insight in their overall per- formance. There are 29,625,738 relations in the complete result. ECPM algorithm can be viewed as a greedy algorithm when the objective is approximate matching. The number of relations found using ECPM was 28,096,046 which is close to the complete result. ACPM- GREEDY only identified 15,075,729 relations. The ACPM-QGRAM algorithm with parameter q = 2 found 29,345,380 relations. When we run the ECPM algorithm in ProDom, we get almost 95% of the complete relations. When we run ACPM-QGRAM algorithm of q = 2, we get more CHAPTER 5. CIRCULAR PATTERN MATCHING 108

than 99% of the complete relations.

We run the hybrid algorithm where the ECPM algorithm was applied first and followed by using ACPM-QGRAM algorithm with parameter q = 1. We get the complete result and the time cost was reduced from 41 hours to 14 hours.

Analysis of Results

Based on the results, we built a relationship network among the multidomian proteins. This is a directed graph. The proteins are represented as the vertices, while the relations are represented

by the edges. The In-edges and Out-edges are defined as follows. If a protein sequence P1 is a circular pattern in protein sequence P2, then there is an Out-edge from P1 to P2. Conversely, there is an In-edge from P2 to P1. Figure 5.6 shows the degree distribution of the network. Panel (a) of Figure 5.6 is the degree distribution of all vertices and panel (b) shows the degree distribution of the Top-100 highest degree nodes. Panel (c) and (d) are log-log plots of panel (a) and (b) respectively.

Each protein sequence is not only used as a pattern to search against the other protein sequences, but also used as text to be searched against using the other protein sequences in the database. 424,888 protein sequences were found to be a pattern in some other protein sequences. 799,044 protein sequences contain at least one other protein sequence as a circular pattern. 374,279 protein sequences have both out-edge and in-edges. 50,609 protein sequences only have out-edges while 424,765 protein sequences only have in-edges. The average degree of this graph was 23 with an average out-degree of 46 and an average in-degree of 24.5.

Figure 5.7(a) shows the number of directly connected pairs in the Top-K highest degree proteins, where K is 10, 20 ... 1000. Let the Top-K highest degree proteins be vertices of a subgraph, the number of directly connected pairs is the number of edges. We define a ratio ρK # o f total edges # o f observed edges as follows: ρK = # o f edges in Top−K complete subgraph = 1 . Figure 5.7(b) shows the 2 ×K×(K−1) ratio ρk in Top-K proteins. When K is less than 460, the ratio ρK stays stable at around in 0.5. When K is larger than 460, the ratio ρK starts to decrease. Thus in this graph, Top 460 highest degree proteins have higher relations. CHAPTER 5. CIRCULAR PATTERN MATCHING 109

Table 5.3. Top 15 highest degree proteins with GO function Rank Count AC Number Go Description 1 23353 Q7VMZ1 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 2 23344 Q9CPC5 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 3 23338 Q3EG14 Protein not found in GO 4 20508 Q33HH1 Protein not found in GO 5 20446 Q47AY9 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 6 20446 Q4UQ62 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 7 20446 Q8P4K7 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 8 20446 Q8PG73 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 9 20415 Q426Q5 Protein not found in GO 10 20398 Q3BNR9 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 11 20393 Q73PA3 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 12 20273 Q50XK7 Protein not found in GO 13 20246 Q66C16 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity 14 20244 Q5NU40 No function in GO 15 20244 O32748 nucleotide binding ; ATP binding ; ATPase activity ; nucleoside-triphosphatase activity

Protein Function Prediction

Table 5.3 shows the protein function for the Top-15 highest degree proteins. We notice that 10 of the 15 proteins have exactly the same functions. There are four proteins (rank is 3,4,9,12 respectively) that do not have entries in the GO database. Protein Q5NU40 (rank 14) has a record in GO database, but there is no function assigned to it in GO database. With high probability, we can say that the four proteins with no known function are likely to have the same function as the other 10 proteins. We plan to verify these functions by searching the biology literature in future.

We use the z-score as a measure of significance of the relationship between two proteins. For a given random variable x, the z-score is defined as follows. z = x−µx , where µ is the mean, σx x and σx is the standard deviation.

To predict the function for a protein say PA, we compute the z-scores for the number of occurrences of given function for the proteins in the respective In-edge and Out-edge sets for protein PA. We then assign the protein function as the function with z-score above a threshold.

Table 5.4 shows the prediction results on 9 multidomain protein sequences using the union of the functions of the In-edge and Out-edge proteins at different thresholds on the z-scores. CHAPTER 5. CIRCULAR PATTERN MATCHING 110

Table 5.5 shows equivalent result using intersection.

We also conducted an experiment to predict the protein functions in the Top-500 highest degree proteins of these 156 proteins were not found in the GO database. Table 5.6 shows the prediction performance in terms of precision, recall and the F-measure, where FP is the number of false positive; FN is the number of false negative; TP is the number of true positive. TP TP The recall is calculated as TP+FN and the precision is calculated as TP+FP . The F-measure is recall×precision calculated as 2 × recall+precision .

From the F-measure of Table 5.6, we notice the union method at z ≥ 3 has the highest F-measure 0.84. This indicates the union method at z ≥ 3 provides the best result of all the combination.

5.4.3 Multidomain Protein Networks using Circular Patterns

Introduction

Philipp et al. [100] introduced a tool to discover the potential relationships between proteins using the protein domain network. This protein domain network was based on the protein do- main interaction networks. They built a web resource to explore the Protein Domain Interaction MAp(DIMA). In this network, the nodes are the protein domains and the edges are the inter- actions between two protein domains. In our work, network formation is based primarily on cyclic relationships between multidomain proteins.

Dataset

In this experiment, we further studied the use of our proposed ECPM and ACPM algorithms on the problem of analyzing multi-domain protein sequences. Based on the patterns found by our algorithms, we constructed multidomain protein networks by connecting different multidomain proteins that are found to be associated by some matching circular or non-circular patterns found by our algorithms. We use the Pfam database [15, 39] to identify the families for the multidomain proteins in ProDom. (We had introduced the Pfam database earlier in Chapter CHAPTER 5. CIRCULAR PATTERN MATCHING 111

Table 5.4. The predicted protein functions using union for In-edge and Out-edge Protein Function Predicted Predicted Predicted AC Number Function Function Function (z ≥3) (z ≥2) (z ≥1) Q7VMZ1 GO:0000166 GO:0000166 GO:0000166 GO:0000166 GO:0005524 GO:0005524 GO:0005524 GO:0005215 GO:0016887 GO:0016887 GO:0016887 GO:0005524 GO:0017111 GO:0017111 GO:0017111 GO:0016787 GO:0042626 GO:0016887 GO:0017111 GO:0042626 O32184 GO:0003824 GO:0003824 GO:0003824 GO:0003824 GO:0005488 GO:0005488 GO:0005488 GO:0004316 GO:0016491 GO:0016491 GO:0016491 GO:0005488 GO:0016491 Q2Y7W6 GO:0000156 GO:0000155 GO:0000155 GO:0000155 GO:0004871 GO:0004871 GO:0004871 Q33CH5 GO:0003723 GO:0003723 GO:0003723 GO:0003723 GO:0003968 GO:0003968 GO:0003968 GO:0003968 Q30U32 GO:0000156 GO:0000156 GO:0000155 GO:0000155 GO:0004871 GO:0000156 GO:0000156 GO:0004871 GO:0004871 O93828 GO:0004585 GO:0004585 GO:0004585 GO:0004585 GO:0016597 GO:0016597 GO:0016597 GO:0016597 GO:0016740 GO:0016740 GO:0016740 GO:0016740 GO:0016743 GO:0016743 GO:0016743 GO:0016743 Q30SN9 GO:0003824 GO:0003824 GO:0004252 GO:0004252 GO:0005515 Q2YTY7 GO:0003723 GO:0003723 GO:0009982 GO:0009982 O78911 GO:0008137 GO:0008137 GO:0008137 GO:0008137 GO:0016491 GO:0016491 GO:0016491 GO:0016491 CHAPTER 5. CIRCULAR PATTERN MATCHING 112

Table 5.5. The predicted protein functions using intersection for In-edge and Out-edge Protein Function Predicted Predicted Predicted AC Number Function Function Function (z ≥3) (z ≥2) (z ≥1) Q7VMZ1 GO:0000166 GO:0000166 GO:0000166 GO:0000166 GO:0005524 GO:0005524 GO:0005524 GO:0005524 GO:0016887 GO:0016887 GO:0016887 GO:0016887 GO:0017111 GO:0017111 GO:0017111 GO:0017111 O32184 GO:0003824 GO:0005488 GO:0016491 Q2Y7W6 GO:0000156 GO:0004871 GO:0004871 GO:0000155 GO:0004871 Q33CH5 GO:0003723 GO:0003723 GO:0003723 GO:0003968 GO:0003968 GO:0003968 Q30U32 GO:0000156 GO:0004871 GO:0000155 GO:0004871 O93828 GO:0004585 GO:0016597 GO:0016597 GO:0016740 GO:0016740 GO:0016743 GO:0016743 Q30SN9 GO:0003824 GO:0004252 Q2YTY7 GO:0003723 GO:0009982 O78911 GO:0008137 GO:0008137 GO:0008137 GO:0008137 GO:0016491 GO:0016491 GO:0016491 GO:0016491

Table 5.6. Performance in Protein Function Prediction using the Top-500 Proteins Method Parameter TP FP FN Recall Precision F measure Union z ≥3 1349 186 317 0.81 0.88 0.84 z ≥2 1353 1302 313 0.81 0.51 0.63 z ≥1 1464 1950 202 0.88 0.43 0.58 Intersection z ≥3 162 522 1504 0.1 0.24 0.14 z ≥2 1269 3891 397 0.76 0.25 0.37 z ≥1 1345 6857 321 0.81 0.16 0.27 CHAPTER 5. CIRCULAR PATTERN MATCHING 113

4). There are 40807 protein sequences in ProDom database which is also in Pfam database. There are 12104 families in Pfam database. In our experiment, Proteins in ProDom that do not have corresponding families in Pfam were not included in this analysis. Similar to Chapter 4, for function prediction based on the circular pattern networks, we use the protein functions as maintained in the GO database.

To ensure diversity, and reduce the problem of redundancy in the protein families, each family is represented by only two member proteins. We select two proteins from each family in the Pfam database to construct a new network. For each family, we select the respective proteins with the maximum and minimum number of circular pattern relationships. These often correspond to the longest and shortest protein sequences in the family. Some proteins belong to multiple families, thus some proteins will be chosen several times. We only keep one of them on our network. This network presents non-redundant relationships with all protein families. There are 3659 proteins and 4725 families in this network. The resulting data set contained 3,659 proteins from 4,725 families.

Network Formation

We construct two types of network. One is based on the circular permutation relationship be- tween proteins. We call this the “Protein” network. The other is based on the families. Any circular relationships between two proteins will be circular permutation relationship between the two families to which the two proteins belong. We call this the “Family” network. Further, we construct three networks for each of the two networks. The first is a network using only non-circular patterns (non-CPs) found between the proteins. This is constructed based on non- circular matching relationships between proteins. The second network is the circular pattern network. This network is constructed using only the circular matching relationships (excluding non-circular matches) between the multidomain proteins. The last network is the combined network which is constructed from all matching relationships (including both circular and non- circular matches). Thus, we construct six networks from our data. Table 5.7 shows the statistics of these six networks. The networks are shown in Figures 5.8-5.13. CHAPTER 5. CIRCULAR PATTERN MATCHING 114

Table 5.7. Network statistics for multidomain protein networks Parameters Protein Protein Protein Family Family Family non-circular circular combined non-circular circular combined network network network network network network Clustering Coefficient 0.157 0.102 0.174 0.181 0.084 0.249 Connected Components 756 407 786 600 369 608 Network Diameter 7 9 9 13 15 11 Shortest paths 23515 17104 25593 103683 58164 123380 Characteristic Path Length 2.079 2.099 2.083 3.525 3.814 3.294 Average number of neighbors 2.612 2.496 2.723 5.325 4.073 6.844 Number of nodes 3458 2299 3659 4416 3140 4725 Number of edge 4517 2869 7386 13031 6741 19772

Significance of Circular Pattern Networks

The networks in Figures 5.4.3 and 5.11 have two colors for the network nodes. The red nodes are the nodes in the non-circular pattern networks. The green nodes are the nodes found only in the circular pattern networks. We notice there are very few green nodes in the networks. There are also two colors for the edges. The blue edges indicate edges that occur in the non-circular networks. The pink edges indicate the edges that only occurred in the circular networks. From these two figures, we see that the pink edges show more significant differences between the non-circular pattern network and the circular pattern network.

For example, in the combined Protein network (with CPs and non-CPs), we can observe the pink edges between between some major clusters. This shows an important relationship be- tween these clusters that are only exposed using the circular permutations. For function predic- tion work based on this network, one can expect that these edges will provide more information about the clusters, potentially leading to improved prediction results.

In the family networks, there are some interesting observations between the combined net- work and the non-circular network. The part labeled ”G” in the combined network (Figure 5.4.3) is connected to the main component, but in the non-circular pattern network, the part G is a large component which is disconnected from the main component. This indicates that there exist some important edges that only occur red in the circular network. Such relationships can not be found using direct pattern matching, and require methods for circular pattern matching, as proposed in this chapter. CHAPTER 5. CIRCULAR PATTERN MATCHING 115

Table 5.8 shows the Top 25 proteins that exhibited the highest difference (in terms of node degree) between protein circular-pattern network and non-circular pattern network. Perhaps, the significance of the CP network is clearer in this table, which shows the quantitative differences between selected nodes in the two network. We can observe that, in some cases, the more than half of the associations to some nodes in the Protein networks are due mainly to the circular pattern relationships. We can also see that some small networks are formed only exclusively based on circular patterns, with no associations using direct patterns (i.e non-CPs).

Table 5.9 shows the proteins found on the longest path in the Protein network. Given the nature of the network, it means that these multidomain proteins must have some common circular patterns (and perhaps some non-circular) shared between them. Table 5.10 shows the corresponding longest chain of families in the Family network.

5.5 Summary

In this chapter, we present four ACPM algorithms to solve the ACPM2 problem and one algorithm to solve the ECPM problem. ECPM matching algorithm is not only the best algo- rithm in theory, but also it is the fastest in practice. The ACPM-BIDIRECTIONAL algorithm is the best of our ACPM algorithms. Comparing with other algorithms in literature, ACPM- BIDIRECTIONAL algorithm also has the best time complexity on average.

Based on the results, we analyzed circular permutations in multidomain proteins and used this to perform protein function prediction. Our results show a performance of 0.88, 0.81 in precision and recall respectively, at z ≥ 3.0, using the union of the functions inter In-edge and Out-edge proteins. CHAPTER 5. CIRCULAR PATTERN MATCHING 116

Table 5.8. Top 25 proteins with the highest node degree differences between protein net- works using the circular and non-circular patterns. Degree in circular network Protein Degree in Degree in Degree in Degree in combined network (%) circular network combined network non-circular network P03304 266 704 438 37.78% P19525 221 592 371 37.33% P35409 199 567 368 35.10% Q00962 175 470 295 37.23% P17546 175 464 289 37.72% P42684 122 390 268 31.28% P43699 147 402 255 36.57% P48633 174 403 229 43.18% P29476 123 348 225 35.34% Q04610 122 320 198 38.13% Q99323 113 306 193 36.93% P22009 95 276 181 34.42% Q06889 94 260 166 36.15% Q03351 431 588 157 73.30% P10272 82 238 156 34.45% P03336 83 225 142 36.89% P16112 101 240 139 42.08% P38939 77 216 139 35.65% P41381 81 203 122 39.90% P19560 59 175 116 33.71% P30963 68 176 108 38.64% P26762 64 168 104 38.10% P27742 72 169 97 42.60% P19559 62 150 88 41.33% P08049 65 150 85 43.33% CHAPTER 5. CIRCULAR PATTERN MATCHING 117

Table 5.9. The longest path in the Protein network order Protein Family 1 P25930 Pfam-B 8089 1 P25930 Pfam-B 8090 2 P34540 Pfam-B 2386 3 P43141 Pfam-B 2823 4 Q03351 ig 4 Q03351 Pfam-B 11753 4 Q03351 Pfam-B 3706 4 Q03351 Pfam-B 3707 5 P29276 Pfam-B 10807 6 P13677 Pfam-B 135 6 P13677 Pfam-B 735 7 P11461 Pfam-B 8910 8 P03967 ras 9 P31133 Pfam-B 3874 10 P46870 Pfam-B 210 10 P46870 Pfam-B 229 10 P46870 Pfam-B 461 10 P46870 Pfam-B 547 10 P46870 Pfam-B 548 10 P46870 Pfam-B 616 11 P13068 Pfam-B 5383 12 P42686 Pfam-B 7008 12 P42686 Pfam-B 7009 13 P44768 Pfam-B 10433 14 P32745 Pfam-B 4919 14 P32745 Pfam-B 4920 15 P23678 Pfam-B 7252 15 P23678 PH 16 P25892 Pfam-B 3816 17 P31134 Pfam-B 1832 18 P35409 7tm 1 18 P35409 Pfam-B 3952 18 P35409 Pfam-B 725 19 P21838 Pfam-B 8095 20 P07199 Pfam-B 6062 21 P09803 Pfam-B 4482 21 P09803 Pfam-B 6146 22 P24710 Pfam-B 1620 22 P24710 Pfam-B 1621 CHAPTER 5. CIRCULAR PATTERN MATCHING 118

Table 5.10. The longest path in the Family network Order Family 1 Pfam-B 6544 2 Pfam-B 11160 3 ins 4 vwc 5 Pfam-B 5115 6 Pfam-B 1677 7 Pfam-B 4686 8 Pfam-B 7481 9 Pfam-B 1063 10 Pfam-B 2682 11 Pfam-B 3197 12 Pfam-B 8974 13 Pfam-B 604 14 Pfam-B 818 15 adh zinc 16 Pfam-B 11159 17 Pfam-B 842 18 Pfam-B 2068

root $ 0 s N i 3 p si N 1 $ N N 6 1 2 i ppi$ ssi ssi ppi$ N pi$ N 9 2 missississippi$ i$ 5 N ssi ppi$ 4 8 12 7 N 13 ppi$ ssi ppi$ 8 ssippi$

N ppi$ ssippi$ 3 7 6 9 ppi$ 10 14 ssippi$ 11 4 5

Figure 5.1. Suffix tree for the string T = missississippi$ with some suffix links. (This figure is the same as Figure 3.1 with some suffix link.) CHAPTER 5. CIRCULAR PATTERN MATCHING 119

Algorithm 5.1: ECPM Algorithm ECPM(ST,Pattern) 1 PP←Pattern ◦ Pattern[1...m-1] 2 Current←ST.root,top←1,len←1 3 for ( i ← 1 to 2m-1) do 4 if (len = 1) and Current.child.label[1]=PP[i] then 5 Current ← Current.child 6 end if 7 if Current.child.label[1]=PP[i] then 8 len ← len + 1 9 if len > label.length then 10 len ← 1 11 end if 12 if i-top=m-1 then 13 output ”(matched, Current)” 14 Current ← Current.SuffixLink, top ← top + 1 15 end if 16 else 17 Current ← Current.SuffixLink 18 top ← top + 1, i ← i - 1 19 end if 20 if i < top or top > m then 21 break 22 end if 23 end for

Algorithm 5.2: All vs. All ECPM Algorithm

ALL-VS-ALL-ECPM(SeqDB,Z) 1 ST←Get Suffix Tree(SeqDatabase) 2 for ( i ← 1 to Z) do 3 Pattern←SEQDB[i] 4 ECPM(ST,Pattern) 5 end for CHAPTER 5. CIRCULAR PATTERN MATCHING 120

Algorithm 5.3: Pattern Matching Using LIS

APM-VIA-LIS(T,P,k) 1 Build the mapping table mapTable which stores the positions in P of each symbol in decreasing order 2 seq ← NULL 3 for ( i ← 1 to n) do 4 seq ← seq ◦ mapTable[T[i]] 5 end for 6 Generate LIS from seq 7 Calculate LCS between T and P from LIS 8 if verify(LCS,k) is true then 9 return matched 10 else 11 return mismatched 8 end if

Algorithm 5.4: ACPM2 with Greedy Algorithm

ACPM-GREEDY(SeqDB,Z,k) 1 for ( i ← 1 to Z) do 2 for ( j ← 1 to Z) do 3 P ← SeqDB[i], m ← |P|, PP ← P[1...m] ◦ P[1...m − 1] 4 T ← SeqDB[ j], n ← |T| 5 APM-via-LIS(T,PP,k) 6 end for 7 end for CHAPTER 5. CIRCULAR PATTERN MATCHING 121

Algorithm 5.5: ACPM2 Algorithm with LIS ACPM-LIS(SeqDB,Z,k) 1 for ( i ← 1 to Z) do 2 for ( j ← i + 1 to Z) do 3 P ← SeqDB[i], m ← |P| 4 T ← SeqDB[ j], n ← |T| 5 for ( v ← 1 to n − m + k) do 6 subT ← T[v...v + m + k] 7 for ( h ← 1 to m) do 8 APM-via-LIS(subT, f h(P),k) 9 end for 10 end for 11 end for 12 end for

Figure 5.2. The number of hypotheses with q-gram CHAPTER 5. CIRCULAR PATTERN MATCHING 122

Algorithm 5.6: ACPM2 Algorithm with q-gram and Suffix Array

ACPM-QGRAM(SeqDB,N,Z,q,k) 1 seq ← NULL, pos ← NULL, s ← 1 2 for ( i ← 1 to Z) do 3 seq ← seq ◦ SeqDB[i]

4 for ( j ← 1 to mi) do 5 pos[s] ← i, s ← s + 1 6 end for 7 end for 8 ← BuildSA(seq) 9 for ( i ← 1 to N) do 10 Candidates ← {} 11 do while (lcp[i] ≥ q ) 12 Candidates ← Candidates ∪ {i}, i ← i + 1 13 end do 14 for each Pair {x,y} ∈ Candidates do 15 P ← SeqDB[pos[SA[x]]], m ← |P| 16 T ← SeqDB[pos[SA[y]]], n ← |T| 17 for ( j ← min(1,y − m − k + q) to y + m + k − q) do 18 subT ← T[ j... j + m + k − 1] 19 for ( h ← 1 to m) do 20 APM-via-LIS(subT, f h(P),k) 21 end for 22 end for 23 end for 24 end for

i n 0

j

m

Figure 5.3. Dynamic Programing in q-gram matching CHAPTER 5. CIRCULAR PATTERN MATCHING 123

P2 P1

P[j+q...m] P[1...j-1] QP P[j+q...m] P[1...j-1]

2 3 1

Q Text T

Figure 5.4. Three cases in computing the circular edit distance in the ACPM algorithm using the bidirectional edit distance. The numbered double-header show the symbol positions involved in each case.

Algorithm 5.7: q-gram ACPM Hypothesis Generation

ACPM-BIDIRECTIONAL(SeqDB[],k,N) 1 ST ← Build-Suffix-Tree(SeqDB) 2 for i=1 to N 3 P ← SeqDB[i] 4 m ← |P| m 5 q ← b k+1 c 6 start← 1 7 {< SeqDB ID,position>} ← Search-in-ST(ST,P[1...q]) 8 for each pair < SeqDB ID,position> 9 T ← SeqDB[SeqDB ID] 10 BIDIRECTIONALED(P,1,T,position,k,q) 11 end for 12 start ← start + 1 13 do while (start ≤ m-p) 14 {< SeqDB ID,position>} ← Search-in-SuffixLink(ST,P[start...start+q-1]) via suffix link 15 for each pair < SeqDB ID,position> 16 T ← SeqDB[SeqDB ID] 17 BIDIRECTIONALED(P,1,T, position,k,q) 18 end for 19 start ← start + 1 20 end do 21 end for CHAPTER 5. CIRCULAR PATTERN MATCHING 124

Algorithm 5.8: ACPM Hypothesis Verification

BIDIRECTIONALED(P, posP,T, posT,k,q) 1 m ← |P| 2 P1 ← P[posP+q...m] ◦ P[1...posP-1] 3 P2 ← P1R 4 ed1 ← DP(P1,T[posT+q...posT+q+m+k-1],k) ; ed2 ← DP(P2,T[posT-m+q...posT-1]R,k)

5 ED ← ComputeED(ED1,ED2,m,q) 6 for h=1 to m-q+1 7 if (ED[h] + ED[h+m-1] ≤ k) then do 8 return match 9 end for

● ● 300 GreedyACPM ● ACPMLIS ACPMq−gram(q=2) ● ACPMq−gram(q=1)

● 200

● Time(Minutes)

log(Time(Minutes)) ● GreedyACPM ● ● ACPMLIS ● ACPMq−gram(q=2)

100 ACPMq−gram(q=1)

● ● ● ● −2 0 2 4 6 8 0

0 50000 100000 150000 0e+00 2e+05 4e+05 6e+05 8e+05 1e+06

Number of protein Number of protein

Figure 5.5. The time cost of the CPM algorithms CHAPTER 5. CIRCULAR PATTERN MATCHING 125

(a) Degree distribution (b) Degree distribution in Top-100 degree nodes

(c) Log degree distribution (d) Log degree distribution in Top-100 degree nodes

Figure 5.6. Degree distributions in the network of multidomain proteins constructed based on the circular patterns they contain. CHAPTER 5. CIRCULAR PATTERN MATCHING 126

(a) (b)

Figure 5.7. Number of directly connected pairs in Top-K highest degree proteins CHAPTER 5. CIRCULAR PATTERN MATCHING 127

Figure 5.8. The Protein network (using both CPs and non-CPs)

Red nodes are nodes found in the non-CP network. Green nodes are nodes found only in the CP-network. Blue edges denote edges found in the CP network. Pink edges are edges found only in the CP network. CHAPTER 5. CIRCULAR PATTERN MATCHING 128

Q09893

P36331 P11204 P31822 P07566 P11516 P18146 P04281 P36999 P13529 P36499 P11161 P17343 P19560 P31242 P25387 P38123 P11047 P38488 P06019 P22056 P11490 P00397 Q05215 P13473 P10272 Q01981 P26632 P08012 P29387 P41838 P08078 P21480 P19028 P24794 P21838 P35880 P03001 P08768 P08364 P07268 P15270 P03302 P05844 P36330 P36304 Q04726 P32479 P34956 P03200 P28159 Q01014 Q00184 P19561 P03878 P22495 P20014 P27410 P20126 P13561 Q05159 P38675 P17010 P13068 P19199 P25059 P10071 P08151 P47025 P35928 P19907 P06935 P03599 P19901 P24384 P03556 P03314 P18247 Q06889 P39806 Q09715 Q05057 P46070 P14787 P16604 P42000 P03304 P25049 Q04610 P36130 P09814 Q04544 Q06110 P13900 P05959 P27409 P39770 P46684 P07105 Q00962 P49177 P42841 P13897 P36309 P33748 P33749 P18917 P16649 Q04538 P27282P03305 P37693 P17097 P24120 P13959 P48984 P27285 P03316Q02597 P15269 P47164 P38952 P10978 P43088 P29324 P09803 P31630 P20021 P25490 P29393 P17593 P07949 P03306 P30336 P32113 P06654 P32795 P36327 Q02413 Q02040 P41810 P28713 Q01931 Q10087 P39102

Q00899 P10896 P45219

P47695 P05222

P20310 P11461 P24503

P33760 P33151 P33289 P41836 P46298

P44325 Q00861 P23787 P17641 P40180 P12878 Q00619 P32468 P47421 P45051 P27439 P23598 P30296 Q03025 P22036 P46466 P12234 P15370 P07109 P34123 P19950 P33661 P23678 P24683 P28743 P49501 Q02592 P33299 P41647 P39109 Q03519 P03319 P16683 P17980 P21448 P21439 P46471 P36028 P45962P46872 P28739P28025 Q04982P39583 P46465 P03320 Q07970 P11092 P38735 P46864 P13568 P22940 P40967 P46502 Q05597 P21441 P45791 P44887 P33297 P45637

P46867 P33310 P36371 P19771 P45861 P44047 P46870 P10636P16521

P33311 P37029 P15962 P44917 P15187 P25997 Q09427

P08266 P45019 P40024 P30963

P03956 P33176 P42436 P34540 P38046

P46871 Q03252 P05955

P45170P08007 P37759 P45052 P39456 P23815 P45600

P31774 P34383 P45321 P04176 Q00609 P09833 P42065 P21178 P35352 P45105 P16684 P42566 Q06851 P28246

P33681 P35093 P21852 P01008 P32718 P33200 P31060 P09012 Q03203

P30628 P43074 P24136

P03593 P12383 P45167 P26361 P14728 P17259 P46920 P13615 P33302 P34216 P32380 P10964 Q03396 P10567 P42337 P17863 P24710 P38748

P17771 P16355 P35416 Q02926 P15398 P15310 P46302 P48810 P25892P12757

P35418 P35417 P25066 P23463 P27609 P33286 P47840

P35579 Q00680 P14105 P08799 P35415

P19706P35748 Q02440 P31367

P32492 P16477

P42522 P09629 P34092 P09632 Q99323 P23459P02836 P19524 P09015 P05661 P37806Q01860 P10180 P36197 P12845 P20265 P21000 P36006 P49335P21952 Q01226 P10569 P24061 P34694 P46735 P20267 P21001 P10628 P43345 P31249 P31360 Q04996 P39984 P31899 P33216 P40592 P03434 P10267 P20719 P20263 P31364 P14858 Q08727P03274 P31258 P37935 Q01630 Q00466 P31314 P22317 P49640 P29825 P02832 P17278 P48241 P10181 P15424 P21693 P40317 P09026 P42587 P09079 P16143 P39546 P42571 P80205 P43120 P37938 P43698P07548 P37275 P32242 P20793 P25808 P46604 P40426 P31362 P34689 P14652 P45577 P39687 P32892 P42694 Q05466 P22009P43699 P46320 P17919 P24782 P23812 P05824 P33906

P40764 P46608 P26802 P18264 Q03974 Q09727 P42305

P41381 P02482 P20448 P06634 P44526 P22735 P16989

P28366

P13590 P38041 P47847 P38380 P49649 P80206 Q06461 P13592 P32243 P41241 P35845 P13594 P13238

P05129 Q04912 P07199 P41823 P46934 P22059 P16258

P24507 P27414 P38432 P34711 P39745 P39940 P47812

P36095 P07524 P49137 P28548 P32490 P06845 P03967 P29590 Q04899 P18293 Q06846 P47709 P18265P36005 P34891 Q03043 P15242 P33402 P42086 P12575 P49185 P34244 P31169 P11730 P48510 P27704 P27037 P45984 P07602 P36978 Q06548 P23435 P32612 P09386 P13596 Q09499 P47811 P08413 P13595 P18431 P32358 P07313 P48749 P39000 P41279 P35513 P13677 P16911 P37305 P06625 P25693 P09599 P23458 P04409 Q10056 Q09435 P26779 Q10071 P33530 P16671 P02703 P01193 P27966 P25321 P40061 P23443 P03641 P98137 Q02156 P47735 Q08264 P25389 P47987 P01029 P42282 P15925P14205 P40550 P07931 P06782 P10643 P36314 P38438 Q02216 Q04976 P33497 P06244 P35247P29400 P45894 P29597 P10665 P19525 P44410 Q03364 P18612 P34152 P26645 P31894 P49339 Q03351 P37173 P18653 Q05999 P24571 P40230 P35248 P05997 P44953 P33484 P07876 P34927 P40424 P05126 P03643 P21146 P27792 P09215 P06681 P03219 P12980 P13187 Q01887 P47809 P29317 P21180 P17789 Q07005 P12080 Q00701 P46556 P02893 Q06226 P44602 P22985 P15989 P21758 P42906 P08103 P14996 P39414 Q01149 P15710 P13814P07307 P39962 P24583 P05555 P08631 P34369 P21709 P35246 P24721 P24723 P15790 P13185 Q05513 P34947 P28481 P37274 P21288 Q08942 P20806 P27658 P08691 P13244 P32739 P23292 P23339 P21757 P16112 P27446 P13406 P48615 P38970 P13368 P42983 P27514 P12319P20489 P28693 P20701 P00516 P02458 P16385 P25184 P14282 P20908 P08318 P11444 P43403 P06245 P35761 P26818 P42890 P08630 P04278 P08414 P18546 P08674 P46360 Q09537 P34925 Q02763 P24063 P02874 P06546 Q07252 P11276 P27113 P08673 P80412 Q05655 Q06807 P38047 P09333 P00519 P23327 P39968 P34892 P34340 P07897 P00719 P23298 P43404 P43535 P11798 P20613 P32361 Q06060 P00717 P46378 P15054P42690 P02694 P09619 P18760 P23352 P21860 P00262 P45723 P42681 P23219P04584P26993 P26619 P35827 P04937 Q00342 Q01225 Q01406 P16230 P33005 P18168 P34024 P20999 P05622 P38361 P16144 P29367

P15498 P13186 P25524 Q06805 P40996 P27870 P32328 Q09898 P21795 P09989 Q01481 P42684 P35590 P27400 P46530 Q09014 Q06806 P29366 P00530 Q09734 P35331 P10039 P45539 P04185 P48025 Q03696 P35991 P24821 Q09746 P32790 P98083 P46108 P15216 P42686

P08025

P31695 P04501 P02680 P02679 P36844

P31135 P12799 P46110 P24133 Q01705

P36135 P29598 P05813

P18031 P20607 P35821

P05990

P34442 P30989P49684 Q08209 P35377 P43657 P39143 P37005 P35233 P48483 P47748 P07621 P30546 P28827 P29352 P30098 P32236 Q04573 P41143 P26045 P15409 P47751 P48452 P30549 P49445 P43378 P29074 P17706 P28680 P48459 P31389 P33533 Q05909 P28336 P35371 P14747 P35409 P05363 P23470 P34975 P20346 P46090 P28828 P19398 P23467 P18052 P47901 P35343 P41144 P23469 P48611 Q06180 P36176

P10586 P49446 P35372 P20638 P20989 P35407 P21463 P46960 P38867 P23265 P08173 P19924 P48974 P25930 P42866 P26040 P46023 P01452 P47799 P32940 P24939 P34813 P27580 Q03560 P27352 P16473 P30938 P30937 P42842 P21450 P25106 P37972 P48748 P06724 P05078 P18825P08912 P47319 P31391 P28646 P11613 P32211 P44092 P21084 P23163 P30874 P37650 P35350 P12805 P25473 P30680 P17124 P22321 P47804 P38825 P21039 P30872 P30936 P41842 P22332 P21917 P28088 P30372 P32745 P47898 P32311 P47803 P11229 P37285 P13945 P31948 P23362 P04001 P20309 P42289 Q07866 P15705 P43253 P08172

P49578 P43505 P10720 P34969 P32306 P14126 P43142 P29276 P22297 P25962 P25115 Q02942 P18901 P34980 P42347 Q01717 P35408

P35056 P43115 P30557 P44520 P33292 Q05394 P32240 P32512 Q03569 P10608 P42290 P04274 P30559 P48271 P43141 P31134 P49533 P10909

P49534

P22322 P34055 P34979 P49650 P07700

P32482 Q02099 P49518 P46737 P48279 P23801

P44768

P31133

P32250 P41231P35383

P37088 P22002 P32660 P22138 P20028 P06565 P34455 P04013 P14614 P45975 P45638 P48915 Q08345 P12348 P06295 P13088 P15348 P45047 P41225 P49596 Q03586 P03680 P32595 P07167 Q08341 P33613 Q02343 P28621 P29801 P36753 P24884 P40527 P80151 P45046 P48432 P34778 P40118 P14283 P24793 P46592 Q00942 P01283 P15564 P13903 P22704 P14110 P29719 P36749 P00451 P43739 P47872 P03042 Q09683 P17667 P34203 Q09173 P06566 P36754 P14043 P35208 P21347 P48433 P30318 P09259 P12093 P03345 P36763 P00533 P30320 P22316 P34855P05982 P32770 P33144 Q10115 P07004 P33510 P17920 P48436 P36642 Q01279 P13089 P06622 P01142 P17474 P15582 P98092 P15172 P34221 P13911 P48337 P47582 P33811 P27235 P40010 P49444 P07392 P41360 P04955 P48409 P12349 P28340 P13547 P26806 Q07875 P34852 P24128 P13806 P03336 P38144 P06262 P18540 P29072 P23025 P07851 P07065 P35710 P47792 P22139 P08739 P15178 P31813 P28857 P17393 P09933 P01143 P03105 P29913 P10493 P26687 P30190 P17600 P36419 P10475 P15306 P13122 P23811 P27570 P07666 Q09172 Q01836 P15581 P13097 Q06831 P46976 Q08879 Q01158 P35428 P17546 P04775 P19559 P32296 P23988 P04540 P41298 P41255 P17192 P48756 P01085 P43825 P10870 P15559 P07204 Q06234P01106 P37871 P25247 P26849 P24880 P18680 P11236 P22082 P26846P34854 P46519 Q06945 P39966 P28365 P13280 P02858 P42530 P19811 P41134 P14868 P40875 P05664 P10394 P43763 P45075 P43869 P47359 P07917 P43799 P08104 P29617 P23999 P22831 P28339 P48120 P32481 P20067 P21402 P15390 P04323 P20459 P48372 P13121 P35499 P28480 Q07965 Q06145 P13586 P31373 P05510 P08424 P37889 P47928 P49598 P11718P05030 P14543 Q05066 P31971 P03701 P15173 P16850 P19716 P26374 P22670 P46518 P10723 P42486 P11369 Q02363 P36429 P20504 P02724 P04956 P27106 P34429 P14003 P22816 P16025 P32264 P13846 P11705 Q08436 P21240 P47632 P46198 P47382 P38707 Q07964 P41556 P07987 P04800 P38690 P21850 P22037 P19828 P14550 P48046 P21328 P29149 P00538 P09441 P41811 P32639 P25155 P25345 P43829 P17658 P29990 P48377P40391 P07395 P14090 P16216 P01093 P48371 P22462 P98095 P10085 P43634 P41059 P40339 P13587 P23665 P11454 P13025 P47672 P00797 P17970 P11831 P14547 P25122 Q09891 Q08435 P23634 P19424 P48633 P24588 P30714 P43890 P23989 P36938 P48419 P17326 P27035 P40390 P22460 Q08788 Q01206 Q05037 P25464 Q01207 P48424 P12830 P27742 P25472 P40872 P05804 Q01365 P26257 P39524 P22001 P49380 P22189 P08510 P27743 P06864 P46238 P06476 P48547 P80313 P26046 P10378 Q04574 P36909 Q01842

P03172 P00722 P22700

P07645

P15215 P42199 Q01205 P37051 P31647 P37116 P49563 Q04747 P46065 P46329 Q04707 Q09690 P35956 P13497 P10933 P20396 P31285 P20951 Q08400 P16732 P42228 P27058 P06547 P08159 P98064 P06847 P44955 P09231 P42087 P16954 P38513 P28868 P45059 P28821 P13583 P31909 P28571 P47728 P20797 P28305 P28570 P28573 P27206 P42825 P43027 Q10059 P07200 P48207 P31617 P29120 P45161 P12839 P11181 P06236 Q02207 P16603 P20000 P38552 P17692 P49083 P36051 P11908 P16053 P45045 P00738 P36592 Q04575 P27676 P11522 Q07868 P15684 P11368 P10031 P36501 P18665 Q02078 Q04777 P18292 P16176 Q06443 Q00420 Q06547 P31650 P31334 Q03157 P45274 P23759 P48029 P35036 P21552 P39137 P20965 P15800 P36957 P29474 P27090 P39875 P31623 P45358 P28969 P43644 P23977 P27000 Q00495 P40750 P23359 P08049 P05174 P24228 P20708 Q03432 P34820 P07861 Q05528 P38085 P16519 P22997 P20285 P02809 P38705 P35085 P40564 Q07009 P03803 P09570 P40791 P00956 P36178 P21274 Q00338 P30679 P42003 Q08289 P41343 P15541 P42227 P07284 P38939 P19269 P25691 Q01959 Q05001 P07944 P02808 P49591 P10342 P25306 P08777 P09001 P25502 P26262 P49337 P33117 Q06317 P37089 P31646 Q03416 Q03468 P22591 P42231 P43833 Q09803 P33046 P08539 P12680 P49404 P38776 P34821 P45118 P31643 P29476 P27751 P45345 P17125 P40825 P12894 P12695 P21036 P09603 P09498 P44423 P43652 P27259 Q02572 P40902 P02538 P16047 P25685 P08551 P35500 P26927 P21890 P25404 P04264 P18408 P08144 P13134 P30538 P04968 P08018 P11940 P31652 Q00381 P35037 P14980 Q09928 P27747 P11182 P36608 P98074 P49340 P48208 P48601 P30884 P18075 P26495 P39997 P05407 P37091 P46701 P29384 P35937 P36609 P21461 P42971 Q00288 P10529 P39844 P42731 P25353 P40591 P35900 P31645 P33945 Q04960 P00008 Q08047 P20506 P49082 P16451 P31641 Q06518 P39748 P35232 P35191 P21944 P23622 Q09175 P16263 P13036 P38536 Q04695 P00455 P17779 P17965 P40763 P37464 P14025 P02919 P17247 P49003 P16246 P14639 P09544 P15402 P08779 P14121 P20722 P25621 P37898 P14677 P98073 P29385 P34057 P28037 Q01657 P29475 P18520 P13516 Q08469 P44366 P38039 P06959 P43850 P31662 P08778 P37090

P11621 Q05237 P32215 P15009 P48937 P35236 P48113 P21977 P98131 P46208 P44862 P02636 P36351 P14749 P10950 P07337 P07132 P26808 P18642 P27169 P04958 Q06393 P28661 P09793 P46681 P21620 Q04861 P26381 P23739 P44900 P48983 P39473 P46153 P42891P17349P30545 P08664 P07826 Q03188 P15206 P31780 P13627 P41394 P37512 P31522 P00382 P48825 P24651 P38697 P08055 P14625 P43979 Q07801 P33363 P33817 P01347 P38508 P33859 Q00653 P08955 P46317 P36307 P18640 P32672 P06134 P37062 P42434 P17518 P31956 P46081 P34179 P03579 P35525 P42209 P35221 P30851 P31002 P15772 P31301 Q06927 P02828 P33905 P44854 P46455 P21376 P49086 P44020 P41109 P23773 P20717 P05342 Q00732 Q01002 P27170 P19838 P26382 P17542 P41128 P18626 P09323 P06407 Q03030 P27422 P15183 P07572 P23426 P33641 P05020 P12217 P12219 P12870 P27034 P05876 Q00496 P37020 P13213 P26231 P45085 P00689 P40429 Q08369 P43694 P15823 P30341 P34286 P38938 P30276 P45757 P07142 P26511 P23248 P20281 P36492 P26503 P13226 P16046 P20643 P43693 P14151 P06280 Q04519 P14002 P33775 P47932 Q01647 P03363 P21879 P03072 P05748 P25952 P14850 P32154 P46537 Q00761 P16126 P22474 P38628P07768 P17079 P41586 P41118 P09144 P33909 P24188 P27973 P47190 P19328 P08913 P37061 P39097 P46538 Q08642 P34410 P20642 P31434 P12204 Q00556 P30878 P42712 P33429 P19332 P45604 Q05763 P05794 P03544 P06380 P15976 P02662 P04090 P19669 P36212 P46605 P40406 P09286 P04025 P10253 P27981 Q09687 P44946 P22091 P44387 P14010 P18871 P33007 P20618 P49454 P30183 P45778 P07143 P44801 Q03040 P03354 P07375 P31382 P10845 P08081 P09745 P42072 P38086 P39567 P04808 P44715 P32198 P07259 P49242 P22506 P33818 P11657 P26207 P12155 P45048 P38060 Q04619 P47427 P20712 P17521 Q02581 Q00763 Q09439 P31562 P33068 P07124 P12256

P04386 P32086 P45646 P29951 P13605 P27214 P19976 P48892 P38285 P11701 P30518 P37127 P29093 P26678 P09233 P80517 P15043 P18860 P40599 P05369 P23176 P21097 P26177 P34028 P29921 P32131 P18158 P49426 Q05113 P38820 P38756 Q10057 P26764 P06665 P46240 P46314 P98056 P17869 P27336 P26379 P31020 P22102 Q02394 Q08276 P41152 P47200 P47424 P07075 P29041 P09057 P42789 P49557 P37388 P10856 P20724 P48828 P41772 P10758 P49697 P43304 P10047 P12040 P12733 P80448 P44487 P26683 P43502 P25980 P00480 P11066 P36740 P49530 P07985 P25415 P22731 P03004 P33768 P49522 P31251 Q06202 P07860 Q09765 P39474 P35888 P22314 P46182 P43753 P45003 P07838 P10955 P39409 P32485 P34750 P27120 P27121 P40851 P09943 P07133 P06385 P26719 P49155 P13045 Q06562 P38971 P27693 P37986 P48848 P28825 P46073 P46388 P22515 P07547 Q06758 P09975 P12684 P04035 P32440 P27542 Q05866 P08568 P41151 Q00613 P38530 P43629 P40801 P44795 P01229 P34650 P16549 P22465 P09451 P00415 P24814 P49331 P06012 P09832 P36924 Q07411 P09223 P36649 P39842 P28235 P42034 P22304 P42042 P22071 P00461 P06107 P32191 Q02934 P13460 P26762 P02959 P02964 Q02123 P32155 P21203 P06457 P39161 P31460 P18249 P18184 P09976 P18186 P06960 P23776 Q07861 P49307 P15645 P30438 P15687 P43922 P43909 P21872 P41441 P00967 P44354 P25090 P20973 P44578 P41226 P43879 Q02053 P04391 P43336 P07166 P47760 P32675 P15644 P41931 P47154 P06387 P45100 P03284 P00780 P11410 P13507 P43626 P16497 P32001 P35425 P47315 P00309 P07244 P16393 P37703 P38532 P21302 P36101 P31475 Q03065 P31743 P11635 P08973 P37894 P32257 P44836 P24014 P34949 Q06828 P20073 P43708 P14853 P21314 P29336 P48044 Q03460 Q10113 P26677 P41978 P45584 P45277 P33725 P24193 Q08890 P08017 P26670 P26237 P35890 P18776 P31007 Q00013 P37965 P13035 P32603 P03109

P38092 P37198 Q05146 P20828 P40142 P05945 P42176 P46973 P03089 P18285 P45340 P45438 P05055 P18395 P11711 P35633 P28298 P08651 P36213 P09781 P29578 P17846 P14560 P40879 P24695 P46740 P42188 P22146 P45767 P45018 P37446 P36317 P07296 Q06987 P12023 P14251 P01026 P45364 P13015 P33984 P49386 P09950 P17955 P20148 P20874 P34736 P28583 P19318 Q02869 Q05815 P11080 P25510 P43433 P40974 P12623 P20852 P02936 P46831 P21999 P12352 P17799 P29579 P40573 P00807 P42186 P34305 Q02473 P17561 P27526 P33282 P25689 P37447 P00606 P36958 P23372 P12627 P15333 P44971 P29037 P43157 P25781 P16381 P14206 P14046 P01023 Q01833 Q08426 P07896 P45380 P22881 P27754 Q09882 P24358 P33865 P42187 P14308 P49369 P33864 P15332 P48511 P45889 P45609 P06684 P45856 P15082 P21049 P16855 Q09056 P16700 P47846 P14907 Q01969 P04282 P33655 P05976 P11349 P03425 P27405 P16706 P25512 P09891 P46862 P38033 P10612 P06179 P42449 P17923 P49481 P05357 P29577 P17854 P10424 P01211 P33985 P27928 P44721 P28772 Q00511 P49642

P19763 P30530 P43847 P43851 P29029 P42883 P40467 Q00259 P03602 Q08100 P13720 P04008 P43471 P11223 P03551 P17502 P12792 P32229 P37840 P28594 P46059 P39868 P10520 P35441 P46483 P10549 P24088 P10614 P42220 P30573 P36599 P25445 P32232 P07374 P23700 P18509 P23279 Q09682 P06490 P35992 P44349 P06586 P08247 P20985 P21080 P33861 P05034 Q02140 P42963 P10171 P47456 P21046 P09560 P33800 P28070 P37426 Q09750 P26711 P06155 P07738 P30572 P22434 P25446 P37887 Q03283 P23701 P41585 P18485 P23638 P09489 P19642 P46786 P12146 P07825 P33832 P33862 P32206 P16453 P25605 P47551 P10170 P02635 Q01478 P16551 P32216 P25451 P00452 P08456 P05137 P08320 P35494 P43920 P36837 P27783 P08970 P35443 P13374 P08407 P24851 P37075 P07946 P03552 P15369 P16605 P32228 P33567 P17177 P18548 P38900 P14263 Q03046 P18254 P01338 P39061 P16167 P37551 P46456 P46702 P15567 P25765 P40954 P33752 P24264 P12880 P36836 P49102 P46813 Q05895 P43538 P30594 P32782 P22886 P35520 P16122 P23603 P48144 P23599 P21670 P10212 P32670 Q02326 P19954 P39965 P29191 P13053 P03296 P19113 P00894 P44524 P39768 P35155 P33863 P04049 P16913 P40112 P32282 P10498 P19416 P04890 P39773 P10615 P39058 P37379 P16097 P36574 P27967 P44333 Q03350 P40889 P06475 P15319 P48993 P31053 P15423 P42055 P16316 P06825 P21037 P40105 P00864 P39276 P22945 Q08875 Q06441 P14226 P11472 P19765 Q00993 P38972 P38025 P20533 Q09923 P39529 P11624 P04329 P03173 P08408 P03070 P13394 P10033 P04840 P23596 P03516 P17371 P37377

P06868 P07703 P35778 Q09828 P41569 P36619 P26185 P20409 P06485 P17365 P28968 P17797 P07169 P28917 P10211 P40973 P30764 P47583 P05835 P30268 P28348 P28715 P35689 P36617 P18310 P37455 P34601 P03958 P16922 P06615 P31098 P31096 P08721 P17144 P17143 P14729 P36265 P36266 P36260 P20024 P10290 P27898 P09958 P23377 P33087 P17888 P17904 P30181 P27303 P44928 P06839 P46586 P43853 P00499 P20825 P14553 P39599 P48572 P18869 P29696 P44462 P11680 P21775 P33291 P07842 P17959 P42756 P09901 P18076 Q01953 P21413 P27688 P20093 P04996 P27919 P07206 P41085 P35721 P37118 P37117 P29961 P45035 P25942 P27512 P01731 P30433 P38608 P27257 P22549 P11976 Q02184 Q02185 P33240 P39396 P13156 Q03031 P13649 P24220 P42326 P07606 Q07744 Q09145 P43061 P22675 P35538 Q03845 P46925 P37231 P33827 P10857 P29453 P45077 Q05740 P06531 P13292 P21045 Q08103 P09779 P09783 P28912 P13201 P16530 P14037 P30760 Q06458 P08201 P10512 P17424 P22805 P33396 Q02219

P39898 Q03181 P20534 P26637 P47913 P80059 P45610 P05328 Q08099 P21044 P23052 P05356 P05359 P28916 P06473 Q05526 P33514 P16023 P47471 P25104

P43823 P12081 P35446 P35447 P23549 P07984 P45837 P04143 P16422 P09758 P10481 P15698 P07056 P14698 P27633 P13393 P19244 P30236 P47418 P38174 P45175 Q07762 P26012 P26013 Q02201 P38755 P35565 P36581 P19237 P05547 P12747 P14286 P31639 P26430 P15292 P09131 P38984 P38979 Q01240 Q01241 P23591 P33217 P12777 P15553 P07061 P07062 Q04870 P41895 P25096 P25456 P04637 P10360 P29916 P29769 P24004 P46463 Q07307 P48777 P41705 P41718 P16284 Q08481 P15288 P44817 P13629 P07603 P13217 P10688 P14262 P08954 P36623 Q01992 P46236 P23658 P37590 P37589 P18133 P22253 P03362 P14078 P06725 P18139 P07147 P40126 P26570 P33329 P28062 P28063 P10486 P17224 Q99132 P12806 P11465 P18148 P25916 P02781 P11095 Q08684 P11835 P26010 P47657 P38754 P18754 P23800 P21893 P45112 P04844 P25235 P31350 P11157 P48535 P12687 P49312 P04256 P20232 P49373 P37970 P37971 P19940 P33658 P06163 P06162 P28702 P28704 P30352 Q01130 P49416 P31431

P08689 P29353 P31171 P35856 P16967 P43307 P19761 Q02445 P14147 P45608 Q09037 Q06222 P21151 P07871 P30597 P29465 P23849 P44843 P20406 P09506 P24761 P33851 P24766 P33854 P33855 P21066 P21069 P33858 P21607 P33819 P33871 P21019 P09396 P17780 P17791 P05351 Q04853 P06590 P27208 P26658 P35258 P41326 P17595 P17449 Q03499 Q04612 P37527 P43545 P39305 P44990 P25221 P22440 P37504 P37500 P06675 P04701 P36033 P21375 P16900 P03322 P05437 P13621 P23004 P32551 P47231 P11934 P43154 P98085 P45582 P48980 P42175 P09152 P42126 P23965 P19217 P36788 P10463 P41150 P06024 P24906 P98136 P48747 P03074 P12905 P29930 P29931 P47877 P12843 P11172 P14017 P13215 P30672 P00589 P00588 P42697 P27594 P34894 P39148 P30195 P20103 P43838 P10641 P15368 P19097 P48755 P18625 P29176 P15407 P46655 P28668 P43468 Q07211 P35052 P35053 P49698 P22449 P08722 P26022 P49259 P49260 P17817 P04070 P30305 P48966 P40959 P34118 P21810 P47853

P08199 P13383 P11834 P32736 P21556 P25105 P10124 P13609 P43119 P43252 P08487 P10686 P40012 P43309 Q01060 P27507 P40307 P22141 P49643 P33610 P44286 P07224 P32506 P15151 Q06600 P01374 P40242 Q01880 P20402 P49409 P49169 P49170 P29044 P29045 P22173 P22174 P44632 P40832 P23389 P05060 P21815 P13839 P16043 P09916 P01135 P01134 P49237 Q03467 P43912 P36244 P35456 P49616 P21115 P24083 P45130 P21077 P26699 P03739 P06723 P16407 P19903 P07233 P11078 P11077 P29077 P20981 P09274 P41064 P44957 P22990 P45678 P45677 P19711 P21530 P10172 P18581 P02844 P06125 Q10002 P22936 P36416 Q05885 P07668 P28329 P24081 P33696 P34425 P43449 P25250 Q05094 P08708 P14042 P24652 P37676 P18278 P16027 P48556 P39822 P41390 P04046 P26367 P47238 P48506 Q09768 P25468 P32747 P23228 Q01581 P04394 P40915 P35312 P19782 Q01263 Q10094 P26420 P26984 P22284 P22285 P35354 P35334 P21033 P07617 P15477 P11768 P40438 P40890 P07102 P14924

P13792 P26975 P15001 P34754 P30706 P26647 Q00269 P14381 P29130 P09889 P15309 P20646 P12852 P49108 P15639 P43852 P24271 P34088 P07567 P29175 P49389 P46801 P23003 P31812 P31393 P13797 P31396 P35398 P16073 Q06427 P31213 P31214 P46250 Q10099 P20616 P10362 P32032 P01545 P24770 P21005 P32097 P33071 P15722 P29155 P37002 P45102 P37474 P30958 P20696 P20698 P21577 P49466 P46067 P09918 P23998 Q03400 P34047 P10368 P31783 P41688 P40266 P06895 P49293 P37271 P14715 P10965 Q01826 P43007 P43346 P48769 P48620 P46312 P16466 P15320 P00958 P47267 P00450 P13635 P25358 P40319 P20585 P13705 Q05749 Q00286 Q00056 P06798 P41563 P41562 P26007 P23229 P43893 P38493 Q05880 P24425 P31325 P13286 P25976 P25977 P46025 P46026 P39656 P33767 P24529 P04177 P09607 P41510 P24555 P37081 P42909 P42905 P16569 P20116 P34546 P32842 P26469 P45042 P17520 P11623 P41115 P14041 P15520 P19909 P26654 P16717 P40584 P36031 P18466 P42797

P49608 P37032 P25384 P25383 P28324 P41158 P43002 P33879 P29859 P34531 P29915 P24918 P46060 P46061 P48562 P42059 P04009 P24598 P15509 P11052 P14605 P30528 P11131 P20174 P35169 P32600 P19254 P18126 P46579 P32921 P39189 P39192 P26982 P44947 Q01222 P33824 P28939 P10238 P37537 P33803 P47198 P38511 P46548 P14743 P34732 P18759 P49151 P16612 P20665 P19257 P20918 P06867 Q00005 P36872 P32933 P32935 P02982 P23054 P35159 P45221 P29976 P27608 P37141 P30710 Q09170 Q01518 P29016 P15812 P12046 P12047 P03515 P12793 P43452 Q06908 P43010 P11024 P16665 P17915 P24501 P23446 Q06031 Q00689 Q00518 P15750 P17490 P27092 P44330 P23386 P46018 P34335 P10503 P07117 P43751 P42632 P23402 P49041 P28527 P43804 P25785 P48032 Q00335 P15910 P08819 P49253 P40275 P07305 Q00991 P42191 Q05111 P36721 P25986 P29062 P46810 P49057 P44396 P42871 Q01015 P13288 Q07571 Q07574 P41006 P39766 P27980 P15629 P44356 Q07020 P23950 P17431

P03566 Q06660 P21876 Q00966 Q03244 P03538 P18609 Q00274 P18294 P37573 P34913 P80299 P10866 P12559 P27699 Q03061 P22557 P08680 P01860 P04220 P04776 P04347 P31319 P30625 P35972 P04851 P13496 Q01129 P35852 P12042 Q02385 P18450 P33472 P35931 P09030 P27864 Q10134 P05825 P46118 P11819 P32112 P39954 P39518 P09181 P08934 P01048 P05164 P11247 P21899 P09180 P37136 P09470 P37248 P19774 Q00709 P10415 P15109 P00436 Q03217 P46011 P42675 P42676 P20810 P27321 P04324 P05857 P03472 P03483 P15921 P30687 P17351 P46288 P42179 P00259 P42799 P23893 P45297 P25723 P23100 P23101 P17921 P27001 P33562 P17743 P49116 P49117 P40340 P44984 P32070 Q01056 P31483 Q01085 P13513 P27679 P13980 P22709 P10231 P08313 Q06793 P36227 P20991 P33839 P19348 P29816 P33857 P35777 P44042 P43262 P03226 P16729 P03519 P03710 P13837 P17053 Q05259 P40111 P08386 P16518 P47191 P26639 P25192 P36300 P36319 P36320 P30673 P36337 P24430 P36336

P09499 P15403 P09776 P17793 Q04916 P15155 P37896 P13372 P24443 Q01020 P22458 P19275 P08195 P10852 P43360 P43366 P42395 P37765 P22648 P44074 P37528 P45294 Q10106 P34247 P06804 P23383 P16916 P16917 Q02190 Q02189 P03828 P31064 P47194 P13701 P03539 P20608 P39337 Q08021 P44744 P37342 P15273 P08538 P12688 P18961 Q02187 Q02426 P27276 P07210 P44624 Q07075 P33733 Q07282 P25047 P25941 P41008 P05654 P16099 P42593 P29247 P33182 P11178 P12694 P11725 P12954 Q09671 Q09670 P38069 P49638 P11478 P49112 P29716 P17989 P27896 P44604 P06811 Q01679 P23843 P06202 P09835 P27668 P43775 Q04609 P06670 P12159 P21249 P47064 P29733 P29989 P29987 P29732 P32908 P41003 P00549 P30613 P16092 P21803 P36673 P36674 P42662 P42674 P18569 P41713 P04144 Q02917 Q05793 P13608 P04626 P18115 P43219 P09681 P14662 P40279 P25857 P09317 P31838 P30347 P14655 P43794 P36386 P48805

Figure 5.9. The Protein network using only non-circular patterns

Red nodes are nodes found in the non-CP network. Green nodes are nodes found only in the CP-network. Blue edges denote edges found in the CP network. Pink edges are edges found only in the CP network. CHAPTER 5. CIRCULAR PATTERN MATCHING 129

P28366 P10306

P00397 P44917

P44047 P49649 P47847 P03878 P06019 P24794

Q09893

P34956 P19771 P21448 P33311 P45861 P49501

P19560 P21441 P22036

Q02592 P03556 P08768 P21480 P10272 P13568 P45791 P13529 P11092 Q06461 P11204 P40967 P36330 P03302 P05844 P31822 P21439 P35416 P39109 P36371 P22495P36331 P19199 P06935 P33310 Q03519 P38041 P48988 P19028 P10978 P22056 P13900P13897 P38735 P19561 P25059 Q00619

P08364 P16604 P46920 Q02597 P16691 P15187 P18247 P36638 Q05057 P26361 Q03025 P36028 P03316 P24586 P09814 P16684 P17757 P27410 P27285 P35418 P35928 P31630 P29172 P03593 P19901 P03304 Q00962 P03314 P27282 P33941 Q08091 P20126 P42331 P36309 Q04610 P27409 P45167 P07199 P36304 P05959 Q04538 P36327 P29324 P37624 P03305 P11449 P03306 P03599 P24136 P02679 P13561 P02680 P17593 P30963 Q04544 P41647 P35417 P03200 P45052 P08007

P41233 P24683 P10569 P45051 P13595 P14105

P35748 P16477 P19706 P08266 P08799 P32492

P12845 Q02440 P19524 P35415 P18766 P36006 P45171 P36844 P42522 P24137 P10636 P46735 P06681 P03956 P25997 P02703 P46110 P34092 P31060 P05661 P04501 Q02975 Q04982 P35248 P13590 P40024 P21852 P09833 P35247 P47303 P30296 P45105 P31894 P07876 Q99323 P13592 P15925 P18546 P16521 P45321 P13594 P45170 P21758 P35246 Q09427 P45118 P01193 P25892

P21757 P09012 P45600 P20908 P28481 P10643 P29400 P42890 P01029 P12080 P42436 P44410 P33484 P35579 P38046 P40317 P42337 Q01205 P14205 P26645 P21180 P10039 P13516 P05997 P12695 P35383 P27658 Q01149 P15989 P02458 P33302 P27747 P05555 P16451 P34340 P20701 P49650 P16263 P11276 P41231 P07700 P14996 P13596 Q01657 P43074 P06959

P14282 P25524 Q06851 P01008 P27479 P21178 P10608 P42290 P33200 P12383 P42566 P04274

P27818 P04937 P32250 P34969 P00893

P38047 P34216 P17771 P29276 P13615 Q03396 P25184 P18825 P21795 P18901P25115 P16144 Q03696 P49140 P33005 P32380 P35331 P13368 P24710 P23352 P18760 P15310 P12757 P30874 P47898 P24821 P47811 P28646 P17124 P22321 P30938 P42289 P48749 P30872 P20309 P29367 P38748 P48810

P25962 P29598Q02926 P48998 P47812 P32745 P11229 P21917 Q01705 P30680 P30936 P29590 P49137 P18168 P18431 P29317 P31391 P49185 P12575 P04185 P15216 P30937 P20806 P08172 P45853 P06244 P08912 Q09022 P23435 Q09435 P21709 P25321 P01452 P49578 Q02763 P08413 P25930 Q06807 P45984 P38432 P35590 P28548 P31695 P32211 Q05513 P22985 Q06806 P45894 P40424

Q06805 P46530 P28693 P13185 P33402 P25389 Q03043 P21084 P07313 P25473 P10909 P23339 P10665 P23458 P35407 P08025 Q09537 P19398 Q02942 P07621 P38438 P43505 P37173 P23298 P42282 P36176 P32328 Q09734 Q01717 P43142 P34891 P20638 P35372 P25693 Q10071 P21039 P16385 P38361 P22297 P47901 P35343 P03641 P34892 P13244 P06245 P08173 P09386 P11613 P14126 P06724 Q05655 P16671 P32358 Q09898 Q03364 P06625 P39000 P32490 Q04573 P41143 P36314 P43253 P29597 Q04899 P09215 P10720 P35350 P30098 P47735 P30549 P10180 P24571 P18612 P47748 P16157 P39968 P09599 P30989 Q02216 P07931 P35409 P33533 P23443 P38867 P05363 Q05999 P18293 P24723 P46090 P48615 P43657 P47751 P47809 P33497 Q08942 P48974 Q03351 P41279 P28088 P34244 P35371 P18265 P32306 P35377 P19525 P23362 P36978 P36005 P20346 P41144 P03434 P11798 P21146 P35408 P42587 P46871 P25106 P34975 Q02156 P24583 P32940 P23219 P26993 P07602 P28680 P31362Q01630 P48241 P40550 P38970 Q00342 P04584 P47803 P09632 P21001 P14652 P23163 P30557 Q09499 Q06226 P31249 P34925 P05078 P30546 P21450 P42347 P28336 P15242 P02694 P32236 P40592 P47987 P13186 P24939 P15409 P23812 P31899 Q04996 P03643 P27966 P34369 P48748 P36095P37305 P34980 P31389 P33176 P40764 P09079 P46320 P06845 P17789 P32512 Q00466 P13677 P05622 P35513 P46737 P46604 P31314 P42086 Q01887 P14922 P13945 Q02011 P37275 P29825 P36135 P32361 P26619 P43115 P09026 P27704 P27037 P32240 P15705 P43535 P37285 P07548 Q01860P43699 P06782 P04409 Q06548 P37972 Q05394 P31135 P11730 P21860 P26818 P17919 P43698 P09619 Q01226 P16473 P09989 P37935 P34947 P21463 P31948 P36197 P08414 P48275 P09015 P18560 P21952 P20267 P07524 P35761 P47799 P32482 P34694 P22009 P26779 P34152 Q06060 P20613 P23292 P24061 P22332 Q10056 P18264 P42460 P09629 P20719 P37650 P21000 P03274 P49339 P18653 P04001 P31777 P32612 P39962 P02836 P49335 P33530 P46023 P02832 P33216 P00516 P32311 P27609 P45577 P13187 P42571 P17278 P40230 P19450 P48032 P43345 P35056 P14858 P49640 P10181 P49518 P25785 P05824 P31364P43120 P46608 P23463 P37806 Q08727 P28970 P10267 P48279 P46667P22317 P04278 P46872 P16230 P16143 P41241 P42686 Q01406 Q07866 P38825 P20263 P13406

P23459 P31360 P24133 P08631 P42681 P43403 P46870 P23327 P15054P42690

Q05466 Q04912 P31367 Q03974 P37938 P10628 P46867

P20265P31258

P20999 P35827 P27870 P24342 P05126 Q01225 P00530 P15790 P45962 P45539 P43404 P46864 P28739 Q07970 P15498 P27400 P28025 P08103

P34540 P48025 P32790 P08630 P42684 P22735

P27446 P45723 P16989P41381 P34689

P26802 P34024 P98083 P20793 P39687 P23678 P46108 P46360 P09333 Q09727 P32892 P15424 P00519 P35991 P33906

P20607 P21693 P49445 P06634 P24782 P44526 P42305 P39546 P42694

P20448 P25808

Q06846 P03967 P47709 P18031

P35821 P28828 P10586

P23469 P29074 P18052 P23470 P48611 P33151 P26045P23467Q05909 P11461 P28827 Q02040

P35233 P37005 P39102 P24503 Q08209

P13135 P05129 P28713 P05222 Q01931

P20310 P09803 P24507 P48483 P30336Q10087Q02413 P43378 P49446 P29352 P14747 P05990 P48459

P17706 Q06180 P20021

P46934 P48452

Q04726 P20459 P28621 P17658 P45973 P46471 P20722 Q01207 Q05215 P34576 P40339 P07004 P45638 P21850 Q01842 Q06234 P08510 Q01206 P17970 P01106 P23999 P15172 Q04574 P14043 P48409 Q05159 P21347 P37693 P06565 Q01365 Q08341 P14787 Q00184 P04955 P22460 P29387 P34429 P15173 P13903 P33299 P26632 P14666 P11161 P18540 P23988 P46198 P18075 P22462 P40118 P07268 Q01836 P06566 P22001 P03001 Q00899 P11490 P14547 P32296 P17920 P46592 P49178 P08365 P23811 P15270 Q01014 P04281 P06295 P42000 Q05037 P25122 P29990 P01142 P42530 P04323 P10085 P28159 P08151 P45075 Q02363 P13097 P17667 Q06110 P29719 P03319 P07167 P06476 P29072 P32763 P13025 P38690 P22816 P26374 P21274 P43763 P47025 P30884 P10394 P10071 P39806 P10896 Q09715 P80151 P17247 P19559 P10030 P24793P35208 P17010 P46466 P13497 Q01981 P47695 P18917 P32481 Q06889 P01283 P01143 P07061 P26806 P47872 P32264 P48984 P08078 P16176 P00538 P11369 P32770 P47672 P46470 P29149 P25247 P46070 P25490 P02724 P98092 P35428 P32795 P49177 P15390 P38552 P23025 P27106 P10475 P35499 P32562 P36130 P23359 P25049 P03336 P32776 P41134 P10870 P03345 P39770 P33748 Q08345 P47928 P38144 Q09683 P29393 P20014 P14550 P08104 P41836 P34821 P33289 P07200 P17125 P45219 P46684 P17097 P43799 P08012 P07949 P24588 P44325 P34820 P20067 P33749 P47164 P48756 P43890 P04775 P15269 P00451 P06654 P41298 P17980 P38488 P16649 P16047 Q06145 P14090 P49003 P18146 P24384 P34455 P04956 P22082 P46502 P27090 P40010 P33760 P43027 P29801 P07987 P40798

P15582 P25472 P32468 P43088 P29913 P13806 P33297 P22316 P24880 P15581 P34852 P26849 P25387 P04800 P34855 P48419 P23665 P16216 P41810 P06262 P31971 P04540 P19424 P46465 P15564 P05510 P13959 Q02343 P26846 P37088

P48915 P34854 P15559 P22002 P32639 P27035 P08739 P08424

P32113

P13547 P31662 P27792 P31242 P26207 Q06317 P08539 P13846 P10378 P27743 P35888 P43739 P45358 P18408 P20797 P08779 P02538P36501 P11522 P22805 P25621 P46329 P08144 P21977 P40902 P12217 Q08642 P33641 P37091 P00742 P12093 P15800 P28868 P22704 P09001 P00956 Q10059 P34927 P33751 P31650 P02893 P10933 P37889 P37116 Q04707 P40527 Q08469 P07392 P42971 P00455 P42087 P07204 P05664 P09259 P18520 P28037 P11657 P18665 P26503 P25691 P30268 P45059 P49404 P07861 P06547 P20489 P12319 P30679 P11831 P07395 P16246 P29474 P45048 P41586 P19332 P37089 P13586 P36499 P38776 Q00420 P13122 P38939 Q06547 P25464 P47582 P16053 P21890 P23634 P22700 P08674 P45604 P28857 P19828 P35890 P13036 P04264 Q05001 P41343 Q07009 Q00761 P46519 P14543 Q08436 P05030 P24721 P04292 P08551 P25502 P20717 P21941 P17326 P00717 P07307 P08673 P13473 P02919 P07944 P02534 P11718 P48633 P08049 P28339 P22139 P17546 P18680 P29476 P40750 P15215 P31334 P29384 P40825 P10529 P39524 P21036 P33396 P38536 P34410 P41109 P27676 P30714 P45161 P36592 P16112 P12980 P39844 P40872 P28365 P38039 P27751 P39875 Q01279 P11047 P28340 Q08400 P33046 P31652 P47728 Q01647 P31646 P21288 P13226 P00008 P49083 P45045 P31562 P10493 P31641 P11444 P03004 P37871 P13121 Q04695 P12839 P32215 P98095 P22189 P13814 P17393 Q06518 P08778 P07897 P40193 P07917 Q08788 P45345 P31645 P09323 P44423 Q03416 P21402 P34028 P19269 P12256 P44946 P42199 P00533 P13068 P14677 P39137 P12219 P13587 Q01959 P27742 P11454 Q06927 Q08435 P30318 P48120 P20504 P25104 P39748 P20000 P17692 P00719 Q02078 P35236 P16603 Q07868 P35900 P27206 P28573 P08018 P23977 P23759 P29475 P49380 P42486 P11705 P02874 P30320 P03680 P26046 P33768 P16025 P17192 P28571 P49082 P12830 P31647 P46317 P24228 P48029 P46556 P21838 P08691 P41811 P32739 P22037 P28570 Q09891

P48436 P98056 P06847 P36101 P14002 P16954 Q09690 P42187 P10253 P43626 P07259 P22591 P36959 P04968 P14749 P22138 P15541 P46153 P21551 Q09172 P42209 P09793 P43979 P34305 P22670 P33909 P48371 P03109 P10955 P05174 P31251 P46605 P32198 P49642 P25995 P02858 Q00942 P48432 P37897 P16208 Q05066 P26927 P02662 P45274 P48372 P48825 P35956 P43629 P41226 P15684 Q00338 P09498 Q08369 P13280 P10047 Q02053 P33363 P21552 P34221 P28661 P42072 P47427 P21879 P20506 P21376 P23773 Q02473 P46537 Q04575 P17474 P14110 P05342 P43694 P49337 P49444 P13213 P26231 P46455 P07065 Q07861 Q08289 P17779 P02828 P36740 P09975 P07124 P31623 P31002 P25306 P18292 P06280 P25404 P40391 P25980 P38085 P35037 Q03030 P14151 P22731 P31301 P43693 P45018 Q09882 P15348 P07666 P49307 P22515 Q09765 P44721 P03042 P14625P33905 P40406 P18626 P30878 P27259 P27034 P10342 P33811 P46208 P26762 P22506 P98131 P20693 P08955 P20396 P40645 P44020 P12894 P48601 P14010 Q05113 P22314 P35036 P46538 P06457 Q06831 P16497 P17965 P49340 Q09173 P09745 P35221 P44862 P24358 P46976 Q04619 P34203 P38820 P98074 P48561 P15976 P30538 P42188 Q05237 P02959 P21944 Q02581 P20028 P40757 P05020 P15402 P38697 P38513 P98073 P14308 P38060 P23132 P20973 P07337 Q08047 P37894 P31617 P38756 Q06945 P44578

P22997 P49452 P15206 P11711 P31475 P32257 P32086 P36619 Q08480 P21203 P12352 Q06562 P35165 P42712 P48892 P34894 P48828 P11701 P48044 P03519 P37127 P29093 P20646 P34750 P33725 P40599 P43851 Q02251 P27336 P17502 P11621 P23176 P26177 P41556 P32131 P00967 P16700 P46065 P27973 P01347 P42943 Q04960 P42825 P43829 P43825 P39997 P26379 P23739 Q02394 P37075 Q07801 P09057 P26683 P34949 P20724 P16732 P47846 P31813 P49697 P36609 P05794 P25781 P23426 P04090 P28480 P43471 P09950 P28305 P42731 Q00495 P48208 P40564 P10723 P17599 P07244 P27422 P21872 P00382 Q04861 P19838 Q00653 P26821 P47632 P43634 P35085 P14868 Q09687 P00689 P48841 P12684 P04035 P32440 P31053 P27120 P27121 P18184 Q03065 P34650 P00946 P38971 P27693 P43652 P14639 P12680 P38938 P30276 P10612 P31460 P40801 P44795 P45077 P00571 P43909 P36213 P13045 P30438 Q00763 P00415 P34895 P37986 P49331 P06012 P08325 P09832 P36924 P37232 P32485 P28235 P42034 P15567 P34055 P15687 P16316 P33068 P42042 P00461 P13911 P06107 P25353 P32155 P43157 Q04554 P36608 P07572 P12023 P03803 P43922 P43644 P25685 P38707 P17600 Q05763 P20965 P11940 P04808 P03105 P02636 P41931 P07860 P29951 P16397 Q03156 P38092 P32138 P16393 P13394 P16126 P22102 P21457 P26808 P47932 P80313 P35191 P43735 P47359 P36419 Q09782 P47315 P28821 P49454 P30183 P20852 Q10057 Q01415 P44836 P80059 P16304 P00309 P49481 P03284 P46358 P49086 P14853 P39148 P48848 P29336 P30518 P03710 Q03460 Q10113 P15309 P15644 P46691 P24193 P38025 P44520 P35425 P23596 P17518 P08017 Q02431 P49465 Q03586 P31007 Q00013 P12040 P45889

P00722 P17923 P23815 P32747 P46973 P11349 P36265 P07738 P11066 P07985 P45438 P35633 Q06458 P37464 P40275 P12348 P40879 P31522 P24695 Q06441 P11908 P31909 P01026 P33007 P45364 Q07432 P38530 P30530 P29029 Q09923 P39529 P80205 P32242 P25415 Q06758 P13015 P06407 P33984 P35441 P26257 P21999 P43334 P28272 Q02869 P19318 P36260 P35494 P21302 Q02934 P43433 P02936 P08201 P49591 P19375

P09570 P14046 P01023 Q01833 Q07946 P30341 Q08426 P07896 P18310 P37455 P41151 Q00613 P39061 P16167 P25765 P40954 P33752 P02482 P07547 P14283 P45380 P12155 P22881 Q05895 P12349 P05407

P06684 P37062 P45856 P34601 P38532 Q00993 P20533 P40908 P40467 P32243 P80206 P43879 P11635 P40591 P24131 P08651 P04176 P25468 P03425 P42176 P36266 P39773 P06960 P23776 P09891 P06179 P28348 P43833 P07305 P09933 P01211 P48113 P33985 Q03350

P31705 P06665 P42891 P08913 P16097 P36574 P27967 P29218 P08159 P18642 P44715 P32154 P13649 P32232 P06490 P17561 P03544 P12870 P14263 P10614 P10615 P10768 Q99289 P40604 P09231 P13134 Q09175 P46586 P43853 P00499 P00894 Q02140 P25605 P07075 P32675 P39409 P15710 P37274 P04276 P09958 P23377 P33087 P27303 P44928 P06839 P20825 P14553 P39599 P46081 Q00496 P04958 P48572 P18869 P29696 P20024 P10290 P27898 P48839 P21151 Q09928 P30414 P27919 P07206 P25916 P25172 Q10134 P05825 P30594 P30572 P30597 P29465 P29961 P45035 P24220 P37887 P09489 P27526 P43920 P36837 P27783 Q08875 P26495 P27981 P15183 Q00556 P03579 P15645 P15823 P12880 P36836 P49102 P46813 Q05528 P03363 P25971 P35520 P10212 P28772 P28594 P46059 P39868 P25416 Q02207 P07375 P00864 P39276 P22945 P08970 P35232 P05876 P26382 P32672 P31743 P41441 P06134 P18871

P34425 P43449 P33240 P39396 P45510 P43550 P18776 P45003 P30195 P20103 P24860 P14635 Q05146 Q01969 P35446 P35447 P23549 P07984 P16422 P09758 P10481 P15698 P48506 Q09768 P28031 P09545 Q02201 P38755 P35565 P36581 Q01015 P13288 P08111 Q08470 P14269 P35986 P12747 P12745 P46067 P09918 P15292 P09131 P10170 P39768 P31554 P44846 P42906 P08318 Q00680 P25066 P19217 P36788 Q07307 P48777 P16284 Q08481 P48612 P05374 P03362 P14078 P11095 Q08684 P02382 P47907 P31396 P35398 P49598 P21328 P08689 P29353 P16043 P09916 P42231 P02808 P36438 P07888 P05357 P09781 P03828 P31064 P11835 P26010 P46701 P37051 P28324 P41158 P12046 P12047 P13088 P16850 Q00709 P10415 P42175 P09152 P42126 P23965 P12777 P15553 P10463 P41150 P11515 P28583

P98136 P48747 P03958 P06615 P13215 P30672 P42697 P20591 P13794 P37726 P15368 P19097 P42557 P41440 P29176 P15407 P43219 P09681 P49698 P22449 P44602 P45354 P30309 P07223 P23591 P33217 P37198 P17955 P21556 P25105 P41006 P39766 P45172 P26311 P08487 P10686 P12273 P46542 P33775 P47190 P27028 P48147 P40307 P22141 P32506 P15151 P24081 P33696 P27276 P07210 P46975 P39007 P49237 Q03467 P44920 P34558 P45678 P45677 P11976 Q03610 P19711 P21530 P00549 P30613 P41083 P35113 P35613 P17790 P22759 P43315 P26339 P10354 P27405 Q05815 P25250 Q05094 P11680 P43885 P18278 P16027 P26719 P23107 P46236 P23658 P17115 P42502 P41390 P04046 P04394 P40915 Q01263 P28273 Q05880 P24425 P22284 P22285 P08373 P44605 P08199 P13383 P21577 P49466

P26367 P47238 P22557 P08680 P98085 P32597 Q00269 P14381 P29130 P09889 P26677 P26678 P27202 P49108 P07567 P29175 P26007 P23229 P15722 P29155 P36649 P45585 P37474 P30958 P42179 P00259 P20696 P20698 Q03046 P18548 P49293 P37271 P49112 P36192 P16466 P15320 P07838 P25090 P20585 P13705 P23228 Q01581 Q00056 P06798 P37734 P37329 P19214 Q02256 P01214 P22005 P46456 P38972 P25515 P32842 P37377 P37379 P37398 P21032 P49608 P37032 P43002 P33879 P15509 P11052 P14605 P30528 P47252 P33803 P40796 P11589 P20938 P25189 P46548 P42882 P34732 P18759 P43304 P32191 P29016 P15812 P31783 P41688 P16421 P14776 P38608 P27257 P43452 Q06908 P43010 P11024 P29120 P16519 P29915 P24918 P24501 P16323 P41439 Q05685 P00958 P00959 Q06031 Q00689

P31708 P15750 P44330 P23386 P10503 P07117 P11472 P46483 P08819 P49253 P01093 P13089 P40851 P42789 P18395 P12623 P36354 P24304 P16549 Q00966 P34913 P80299 P01860 P04220 P26982 P44947 P35852 P12042 P09030 P27864 P39518 P09181 P08934 P01048 P15001 P34754 Q03181 P37231 P14226 P10549 P28298 P46831 P37136 P09470 P15109 P00436 P28715 P35689 P35538 Q03845 P20810 P27321 P04324 P05857 P14614 P36642 P24128 P33613 P19410 P16099 P31483 Q01085 Q08099 P06485 P13837 P17053 P22071 P21097 P37896 P13372 Q09163 Q01481 P44483 P24652 P43360 P43366 P45609 P14206 Q05866 P18244 P22648 P44074 P37002 P46491 P16916 P16917 P49530 P09976 P30851 P45085 P03539 P20608 P12688 P18961 P44624 Q07075 Q09671 Q09670 P04776 P04347 P36033 P21375

P42675 P42676 P06811 Q01679 P43775 Q04609 P07654 P45190 P06670 P12159 P16092 P21803 P22304 Q08890 P00848 P47645 P04144 Q02917 P18254 P01338 P16104 P40279 P45657 P01229

Figure 5.10. The Protein network using only circular patterns

Red nodes are nodes found in the non-CP network. Green nodes are nodes found only in the CP-network. Blue edges denote edges found in the CP network. Pink edges are edges found only in the CP network. CHAPTER 5. CIRCULAR PATTERN MATCHING 130

Pfam-B_7553 Pfam-B_9638Pfam-B_364 Pfam-B_1783 Pfam-B_363 Pfam-B_2499 Pfam-B_1575 Pfam-B_3341 Pfam-B_9637 Pfam-B_1395 Pfam-B_3022 Pfam-B_1978Pfam-B_10214 Pfam-B_10216 Pfam-B_2723 Pfam-B_4314 Pfam-B_7248

Pfam-B_4801 Pfam-B_11491 Pfam-B_4424 Pfam-B_1111 Pfam-B_849 Pfam-B_2950 Pfam-B_758 Pfam-B_7246 Pfam-B_1551 Pfam-B_3424 Pfam-B_360 Pfam-B_1289 Pfam-B_5131 Pfam-B_1316Pfam-B_4721 Pfam-B_1680 Pfam-B_5133 Pfam-B_6535 Pfam-B_549Pfam-B_2370 Pfam-B_3500Pfam-B_1520 Pfam-B_10625 Pfam-B_10215

Pfam-B_10813

Pfam-B_6617 Pfam-B_199 Pfam-B_6622 Pfam-B_1663 Pfam-B_1308 Pfam-B_1399 Pfam-B_2848 Pfam-B_9702 Pfam-B_348 Pfam-B_4212

Pfam-B_8734 cytochrome_c Pfam-B_1477 Pfam-B_10802 Pfam-B_10511Pfam-B_10509 Pfam-B_10512 Pfam-B_4749 Pfam-B_6451 Pfam-B_832 Pfam-B_1660 Pfam-B_4213 Pfam-B_706 Pfam-B_4024 Pfam-B_707 Pfam-B_255 Pfam-B_1165 Pfam-B_6384 Pfam-B_1474 Pfam-B_1051 Pfam-B_1658Pfam-B_1659 Pfam-B_10510Pfam-B_427 Pfam-B_6513Pfam-B_6424 Pfam-B_4822

Pfam-B_914 Pfam-B_1685 Pfam-B_4592 Pfam-B_6423 Pfam-B_677 Pfam-B_10799 Pfam-B_10289 Pfam-B_9192 Pfam-B_3529 Pfam-B_6604 Pfam-B_6415 Pfam-B_5814 Pfam-B_10358 Pfam-B_6875 Pfam-B_10290Pfam-B_10288 Pfam-B_6733Pfam-B_2953 Pfam-B_1728 Pfam-B_8891 Pfam-B_4591 Pfam-B_4618 Pfam-B_5815 Pfam-B_2383Pfam-B_5679 Pfam-B_4593 Pfam-B_4619 Pfam-B_5687 Pfam-B_5812 Pfam-B_4224 Pfam-B_926 Pfam-B_3943

Pfam-B_1990 Pfam-B_11192 Pfam-B_1301 Pfam-B_684 Pfam-B_5584 Pfam-B_4873

Pfam-B_8548 Pfam-B_283 Pfam-B_405 Pfam-B_10370 Pfam-B_517 Pfam-B_3672 Pfam-B_8546 Pfam-B_2988 Pfam-B_6492 Pfam-B_1354 Pfam-B_7977 Pfam-B_560 Pfam-B_6863 Pfam-B_235 Pfam-B_10368 Pfam-B_1079 Pfam-B_5703cellulase Pfam-B_10508 Pfam-B_117 Pfam-B_133 Pfam-B_927 Pfam-B_1908 Pfam-B_24 Pfam-B_10373 Pfam-B_17 Pfam-B_6822 Pfam-B_50 Pfam-B_6426 Pfam-B_126 ketoacyl-synt

Pfam-B_6819 Pfam-B_177Pfam-B_36Pfam-B_52 Pfam-B_5747 Pfam-B_11649 Pfam-B_3528 Pfam-B_287 Pfam-B_1232 Pfam-B_6862Pfam-B_53 Pfam-B_8419Pfam-B_8289 Pfam-B_1650 Pfam-B_6736 Pfam-B_10371 Pfam-B_4159 Pfam-B_4255 Pfam-B_8358 Pfam-B_355 Pfam-B_4138 Pfam-B_1761 Pfam-B_1629Pfam-B_8423 Pfam-B_5669 Pfam-B_1233 Pfam-B_3670 Pfam-B_3138 Pfam-B_4141Pfam-B_8439 Pfam-B_2956 Pfam-B_234 Pfam-B_4011 Pfam-B_8284 Pfam-B_6846Pfam-B_11511 Pfam-B_6066 Pfam-B_5583 Pfam-B_6334 Pfam-B_2616Pfam-B_8508 Pfam-B_5768

Pfam-B_8927 Pfam-B_1853 Pfam-B_8919 Pfam-B_3303 Pfam-B_668Pfam-B_1454 Pfam-B_1449Pfam-B_800 Pfam-B_1627 Pfam-B_3240Pfam-B_4154 Pfam-B_8960 Pfam-B_5793 Pfam-B_799Pfam-B_1285 Pfam-B_1644 Pfam-B_1860 Pfam-B_8430 Pfam-B_1446Pfam-B_322 Pfam-B_8974 Pfam-B_9624 Pfam-B_6337Pfam-B_6333 Pfam-B_347 Pfam-B_629 Pfam-B_9623 Pfam-B_8544 Pfam-B_576 Pfam-B_4510 Pfam-B_33 Pfam-B_6336 Pfam-B_8763 Pfam-B_750 Pfam-B_6335 Pfam-B_8547 adh_zinc Pfam-B_2684 Pfam-B_4508 Pfam-B_609 Pfam-B_970 Pfam-B_5000 Pfam-B_2050 Pfam-B_2346 Pfam-B_10226 Pfam-B_7195 Pfam-B_4509 Pfam-B_2166 Pfam-B_2648 Pfam-B_1333 Pfam-B_3434 Pfam-B_8493 Pfam-B_9998 Pfam-B_5517Pfam-B_5471 Pfam-B_1299 Pfam-B_8312 Pfam-B_4102 Pfam-B_5482 Pfam-B_5509 Pfam-B_5759 Pfam-B_6237 Pfam-B_5513 Pfam-B_8545 Pfam-B_5516 Pfam-B_5758 Pfam-B_8305 Pfam-B_11160 Pfam-B_1953 Pfam-B_8364 Pfam-B_9973 Pfam-B_4511 Pfam-B_11159 Pfam-B_8360 Pfam-B_10419 Pfam-B_9984Pfam-B_9972 Pfam-B_4107 Pfam-B_666Pfam-B_1863 Pfam-B_10956 Pfam-B_8326 Pfam-B_10955 lectin_legA Pfam-B_3322 Pfam-B_741Pfam-B_295Pfam-B_2187 lectin_legB Pfam-B_8368 Pfam-B_1848 Pfam-B_89Pfam-B_3324 Pfam-B_296Pfam-B_294 Pfam-B_5487 Pfam-B_932Pfam-B_2068

Pfam-B_787

Pfam-B_10546 Pfam-B_645

Pfam-B_6467 Pfam-B_1996

Pfam-B_5730 Pfam-B_1683 Pfam-B_3427 Pfam-B_1469 Pfam-B_8027 Pfam-B_4028 Pfam-B_4453 Pfam-B_51 Pfam-B_8026 Pfam-B_1183 Pfam-B_11029 histone Pfam-B_59 Pfam-B_2600 Pfam-B_1614 Pfam-B_194 Pfam-B_5734 Pfam-B_1828 Pfam-B_6724 Pfam-B_2812 Pfam-B_2601 Pfam-B_1829 Pfam-B_6042 Pfam-B_492Pfam-B_9274 Pfam-B_5375 Pfam-B_3467 Pfam-B_9264 Pfam-B_3172 Pfam-B_7646Pfam-B_3981 Pfam-B_11301 Pfam-B_9255Pfam-B_1668 Pfam-B_9905 Pfam-B_40 Pfam-B_9267Pfam-B_1481 Pfam-B_9560 Pfam-B_3570 Pfam-B_7560 Pfam-B_9561 Pfam-B_8802 Pfam-B_222 Pfam-B_8801 Pfam-B_4913 Pfam-B_4410 Pfam-B_9898 Pfam-B_10259 Pfam-B_4104 Pfam-B_4494 Pfam-B_2755 Pfam-B_388 Pfam-B_3404Pfam-B_3738 Pfam-B_6544

Pfam-B_7904 Pfam-B_6227 Pfam-B_4089 Pfam-B_6225 Pfam-B_350 Pfam-B_11331 Pfam-B_2826 Pfam-B_7041 Pfam-B_9356 Pfam-B_3194 Pfam-B_361 Pfam-B_8239 Pfam-B_6903 Pfam-B_1960 Pfam-B_7501 Pfam-B_5138Pfam-B_1591 Pfam-B_1237 Pfam-B_723 Pfam-B_6428 Pfam-B_9968 Pfam-B_9380 Pfam-B_791 Pfam-B_408 Pfam-B_10374 Pfam-B_7970 Pfam-B_10555 Pfam-B_3117 Pfam-B_4547 Pfam-B_4087 Pfam-B_3933Pfam-B_7642Pfam-B_9980 Pfam-B_6021 Pfam-B_9971 Pfam-B_4134 Pfam-B_9859 Pfam-B_2445 Pfam-B_9982 Pfam-B_9981 Pfam-B_8987 Pfam-B_3393 Pfam-B_9983 Pfam-B_4662 Pfam-B_2002 Pfam-B_7664 Pfam-B_8988 Pfam-B_5187Pfam-B_1448 Pfam-B_1509 Pfam-B_11098 Pfam-B_1682 Pfam-B_1676 Pfam-B_10303 Pfam-B_7411 Pfam-B_8989 Pfam-B_9225 Pfam-B_6565 Pfam-B_3197 Pfam-B_380 p450 ins Pfam-B_442 Pfam-B_3196 Pfam-B_6657 Pfam-B_8445 STphosphatasePfam-B_1124Pfam-B_1123Pfam-B_2126

Pfam-B_3881 Pfam-B_6180 Pfam-B_3002 Pfam-B_4492 Pfam-B_2694 Pfam-B_11332 Pfam-B_2339 Pfam-B_2085 vwc Pfam-B_4852 Pfam-B_2682 Pfam-B_4415 Pfam-B_8322 Pfam-B_2998 Pfam-B_5714 Pfam-B_7669 Pfam-B_2763 Pfam-B_1961 Pfam-B_1177 Pfam-B_166 Pfam-B_1985 Pfam-B_1368 Pfam-B_1445 Pfam-B_1965 Pfam-B_9942 Pfam-B_10827 Pfam-B_4696 Pfam-B_10824 Pfam-B_686 Pfam-B_10826 Pfam-B_4813Pfam-B_1693Pfam-B_10817Pfam-B_10818Pfam-B_1512 Pfam-B_6485Pfam-B_8594 Pfam-B_9558 Pfam-B_7117 Pfam-B_8323 Pfam-B_6965 Pfam-B_5794 Pfam-B_1893 Pfam-B_6023 Pfam-B_11324 Pfam-B_6040 Pfam-B_4499 Pfam-B_3741Pfam-B_7119Pfam-B_5048 Pfam-B_1391 Pfam-B_5350Pfam-B_453 Pfam-B_7286 Pfam-B_7897Pfam-B_6484Pfam-B_4277 Pfam-B_4240 Pfam-B_6033fer4 Pfam-B_10822 Pfam-B_964 Pfam-B_10819 Pfam-B_818 Pfam-B_9562 Pfam-B_9436Pfam-B_8515Pfam-B_8516Pfam-B_9180 Pfam-B_8182 Pfam-B_748 Pfam-B_7263 Pfam-B_8986 Pfam-B_5571Pfam-B_9043 Pfam-B_1390 Pfam-B_1286 Pfam-B_2234 Pfam-B_5959Pfam-B_6964Pfam-B_11775 Pfam-B_4142 Pfam-B_2084 Pfam-B_804Pfam-B_5718 Pfam-B_4088Pfam-B_2611Pfam-B_1060 Pfam-B_1846 Pfam-B_7274 Pfam-B_6963 Pfam-B_1326 Pfam-B_1029 Pfam-B_687 Pfam-B_7118Pfam-B_9182Pfam-B_9041 Pfam-B_5444 Pfam-B_4070 Pfam-B_2756Pfam-B_688 Pfam-B_9044 Pfam-B_11833 Pfam-B_1839 Pfam-B_2178 GTP_EFTU Pfam-B_8241 Pfam-B_5028 Pfam-B_9927 Pfam-B_3195 Pfam-B_8302Pfam-B_1622 Pfam-B_7708 Pfam-B_2801 Pfam-B_7695 Pfam-B_8245 fn3 Pfam-B_9047 Pfam-B_11001 Pfam-B_9563 Pfam-B_9046 Pfam-B_9928 Pfam-B_3969 vwa Pfam-B_9042 Pfam-B_4475 Kunitz_BPTItrypsin zf-C2H2 Pfam-B_2715 Pfam-B_6619 Pfam-B_3970 Pfam-B_5441 Pfam-B_9045 Pfam-B_8249 Pfam-B_4474 Pfam-B_8242Pfam-B_8238

Pfam-B_390 Pfam-B_3461 Pfam-B_456Pfam-B_455 Pfam-B_4489 Pfam-B_1275 Pfam-B_220 fibrinogen_C Pfam-B_4898Pfam-B_801 Pfam-B_11846 Pfam-B_3369 Pfam-B_3877 Pfam-B_9241 Pfam-B_10904 Pfam-B_5386 Pfam-B_9418 Pfam-B_7462 Pfam-B_2278 Pfam-B_2729 Pfam-B_1063 Pfam-B_415 Pfam-B_8163 Pfam-B_2421 Pfam-B_1274 Pfam-B_100 Pfam-B_2408 Pfam-B_1739 Pfam-B_11280 Pfam-B_9527 Pfam-B_605 Pfam-B_1216 Pfam-B_585 Pfam-B_3824Pfam-B_5533 Pfam-B_1064 Pfam-B_1815 Pfam-B_8435 Pfam-B_646 Pfam-B_1816 Pfam-B_2797 Pfam-B_1373 Pfam-B_8862 Pfam-B_3360 Pfam-B_1819 Pfam-B_9901 Pfam-B_6248 Pfam-B_7809 fn1 fn2 Pfam-B_467 Pfam-B_466 Pfam-B_2420Pfam-B_10758 Pfam-B_10015 Pfam-B_4237 Pfam-B_1508 Pfam-B_5129 Pfam-B_167 Pfam-B_7577 Pfam-B_1818Pfam-B_1817 Pfam-B_4490Pfam-B_9903 Pfam-B_4713 Pfam-B_5710 Pfam-B_6062 Pfam-B_5709 Pfam-B_2907Pfam-B_6588 Pfam-B_5708 Pfam-B_9649 Pfam-B_4631 Pfam-B_4 Pfam-B_156 Pfam-B_2568 Pfam-B_9650 Pfam-B_3540 wap Pfam-B_6098 Pfam-B_4712 Pfam-B_2419 Pfam-B_11329 Pfam-B_8206 Pfam-B_8912 Pfam-B_16 Pfam-B_3978 Pfam-B_3193 Pfam-B_2906 Pfam-B_874 Pfam-B_10545 Pfam-B_425Pfam-B_337 Pfam-B_5316 Pfam-B_9974 Pfam-B_30 Pfam-B_22 Pfam-B_8205 Pfam-B_4006 Pfam-B_10736 Pfam-B_8181 Pfam-B_7816 Pfam-B_10751 Pfam-B_8204 HTH_1 Pfam-B_267 Pfam-B_1510 Pfam-B_10748 Pfam-B_2274 Pfam-B_2026Pfam-B_9 Pfam-B_982 Pfam-B_7932 Pfam-B_233 Pfam-B_2814 Pfam-B_10170 Pfam-B_10752 Pfam-B_174 Pfam-B_10760 Pfam-B_3509 Pfam-B_5634Pfam-B_1459 Pfam-B_4140 Pfam-B_4939 Pfam-B_3816 Pfam-B_4894 Pfam-B_1467Pfam-B_2494 Pfam-B_10350 Pfam-B_10993 Pfam-B_7545 serpin Pfam-B_6398 Pfam-B_10351 Pfam-B_4563 Pfam-B_3237 Pfam-B_6586 Pfam-B_3015Pfam-B_2550Pfam-B_9007 Pfam-B_1654 Pfam-B_4943 Pfam-B_3017 Pfam-B_1107 Pfam-B_10217 Pfam-B_547 Pfam-B_3436 Pfam-B_606 Pfam-B_6641 Pfam-B_3771 Pfam-B_2028 Pfam-B_7719Pfam-B_2963 Pfam-B_6417 Pfam-B_2710 Pfam-B_7334Pfam-B_4606Pfam-B_10925 Pfam-B_7717Pfam-B_7718 Pfam-B_2643 Pfam-B_6957Pfam-B_7053 Pfam-B_9437 Pfam-B_461Pfam-B_548 Pfam-B_472 Pfam-B_3012 Pfam-B_8273Pfam-B_2314 Pfam-B_5964 Pfam-B_1113 Pfam-B_3081 Pfam-B_616Pfam-B_210 Pfam-B_7155 Pfam-B_2386 Pfam-B_473 Pfam-B_2483 Pfam-B_6076 Pfam-B_7033 Pfam-B_851 Pfam-B_626 Pfam-B_229 Pfam-B_7182 Pfam-B_520 Pfam-B_8380 Pfam-B_675 Pfam-B_5385 Pfam-B_546 Pfam-B_633 Pfam-B_5750Pfam-B_11100 Pfam-B_6122Pfam-B_9008Pfam-B_3773 Pfam-B_948 Pfam-B_3223 Pfam-B_814 Pfam-B_8057 Pfam-B_9006Pfam-B_3013 Pfam-B_5514 Pfam-B_483 Pfam-B_3014 Pfam-B_8376Pfam-B_11590Pfam-B_9682 Pfam-B_6858 Pfam-B_2188 Pfam-B_1620 Pfam-B_9684 Pfam-B_7659 Pfam-B_9683Pfam-B_3942Pfam-B_5087 Pfam-B_8095 Pfam-B_1433 Pfam-B_7546 Pfam-B_1621 Pfam-B_7252 Pfam-B_4127 Pfam-B_10145 ank Pfam-B_8954 Pfam-B_4269 Pfam-B_3153 PH Pfam-B_1593 Pfam-B_4658 Pfam-B_4252 Pfam-B_2630 Pfam-B_2304Pfam-B_667 Pfam-B_5757 Pfam-B_6859 Pfam-B_11279 Pfam-B_8944 Pfam-B_3557Pfam-B_8371 Pfam-B_9276Pfam-B_5512 Pfam-B_1143 Pfam-B_1628 Pfam-B_7535 Pfam-B_7453 Pfam-B_7686 Pfam-B_1822Pfam-B_10719 Pfam-B_9605 Pfam-B_8676 Pfam-B_8222 Pfam-B_5740 Pfam-B_5085 Pfam-B_3249 Pfam-B_1144 pou Pfam-B_6987Pfam-B_4199 Pfam-B_11046 Pfam-B_5383 Pfam-B_1414Pfam-B_3097 Pfam-B_7091 Pfam-B_3994 Pfam-B_3492Pfam-B_5196 Pfam-B_11151 Pfam-B_10473 Pfam-B_5339 Pfam-B_4139 Pfam-B_5197 Pfam-B_11014 Pfam-B_8466 Pfam-B_5340 Pfam-B_4083 Pfam-B_3995 Pfam-B_8157 laminin_Nterm Pfam-B_1800 Pfam-B_2131 Pfam-B_6845 Pfam-B_8076 Pfam-B_1457 Pfam-B_1043 Pfam-B_5841 Pfam-B_7034Pfam-B_3121 Pfam-B_2777 Pfam-B_9138 Pfam-B_8096 Pfam-B_5109Pfam-B_9606 Pfam-B_1440 Pfam-B_10435 Pfam-B_5341Pfam-B_3222 Pfam-B_7640 Pfam-B_6650 Pfam-B_9838 Pfam-B_5361Pfam-B_8223Pfam-B_4093 Pfam-B_4126 Pfam-B_8370Pfam-B_5413 Pfam-B_11092 Pfam-B_2583 Pfam-B_308 Pfam-B_368 Pfam-B_178 Pfam-B_2324 Pfam-B_8910 Pfam-B_5360 Pfam-B_7981 Pfam-B_554 Pfam-B_270 Pfam-B_4908 Pfam-B_332 myosin_head Pfam-B_10527 Pfam-B_7680 Pfam-B_1876 Pfam-B_5842Pfam-B_1941 Pfam-B_105 EGF Pfam-B_6915 Pfam-B_4659 Pfam-B_5198 Pfam-B_4646 Pfam-B_2608 Pfam-B_334 Pfam-B_1084 Pfam-B_10722 cadherin Pfam-B_7982 Pfam-B_307 Pfam-B_6916 Pfam-B_11075 Pfam-B_4632 Pfam-B_8224 Pfam-B_10723Pfam-B_7056 Pfam-B_11278 Pfam-B_407 Pfam-B_6489 Pfam-B_844sushi homeobox Pfam-B_3173 Pfam-B_1782 Pfam-B_8869 Pfam-B_3446Pfam-B_590 Pfam-B_733 Pfam-B_352 trefoil Pfam-B_7751 Pfam-B_10440 Pfam-B_336 Pfam-B_8523Pfam-B_4129Pfam-B_8373 Pfam-B_3547 Pfam-B_3972 Pfam-B_728 Pfam-B_6659 Pfam-B_2900 Pfam-B_358 Pfam-B_5435 Pfam-B_2139 Pfam-B_6988 Pfam-B_9139 Pfam-B_4036 Pfam-B_7843 Pfam-B_309 Pfam-B_10147 Pfam-B_3543 Pfam-B_2635 Pfam-B_843 Pfam-B_8397 Pfam-B_8460Pfam-B_5510Pfam-B_4945 Pfam-B_11021 Pfam-B_3109 Pfam-B_1349 Pfam-B_11256 tRNA-synt_2 Pfam-B_3971 Pfam-B_5808 Pfam-B_4128 Pfam-B_4497 Pfam-B_7849 Pfam-B_7174 Pfam-B_4701 Pfam-B_4044 Pfam-B_1240 Pfam-B_7504 Pfam-B_3342 Pfam-B_3200 Pfam-B_5540Pfam-B_8443Pfam-B_5199 Pfam-B_3714 Pfam-B_1297 Pfam-B_9140 Pfam-B_968 Pfam-B_575 Pfam-B_7845 Pfam-B_1182 Pfam-B_10294 Pfam-B_4617 Pfam-B_10703Pfam-B_4700 Pfam-B_3873 Pfam-B_5809 Pfam-B_7681 Pfam-B_2310 Pfam-B_2377 Pfam-B_359 Pfam-B_4645 Pfam-B_7846 hormone Pfam-B_4045 Pfam-B_2983Pfam-B_11794 Pfam-B_5045 Pfam-B_3715 Pfam-B_10678 Pfam-B_2923 Pfam-B_6294 Pfam-B_1189 Pfam-B_2982Pfam-B_3713 Pfam-B_5084 Pfam-B_402 Pfam-B_7844 Pfam-B_6216 Pfam-B_1197 Pfam-B_1695 Pfam-B_4909 Pfam-B_4084 Pfam-B_6943Pfam-B_2981Pfam-B_2980 Pfam-B_5541Pfam-B_7999 Pfam-B_5650 Pfam-B_2790 Pfam-B_8192 Pfam-B_5810 Pfam-B_8379 Pfam-B_6287 Pfam-B_6419 Pfam-B_7848 Pfam-B_8524Pfam-B_6217 Pfam-B_2570 Pfam-B_8005 Pfam-B_11130 Pfam-B_2342 Pfam-B_2181 Pfam-B_1807 Pfam-B_8955 Pfam-B_196 Pfam-B_3495 Pfam-B_2622 Pfam-B_6146 Pfam-B_11026Pfam-B_11037 Pfam-B_3080 Pfam-B_20 Pfam-B_893 Pfam-B_370 Pfam-B_3946 Pfam-B_11009 Pfam-B_4130 Pfam-B_11898 Pfam-B_11112 Pfam-B_11017Pfam-B_11000 Pfam-B_7997Pfam-B_5317Pfam-B_8372Pfam-B_8383 Pfam-B_9662Pfam-B_3074 Pfam-B_9933 Pfam-B_9932 Pfam-B_4482 laminin_EGF Pfam-B_6944 Pfam-B_37 Pfam-B_7505 Pfam-B_2834 TGF-beta laminin_B Pfam-B_2245 Pfam-B_3037Pfam-B_2463 Pfam-B_1694 Pfam-B_335 Pfam-B_8447Pfam-B_2230 Pfam-B_11897 Pfam-B_3517 Pfam-B_1705 Pfam-B_3883 Pfam-B_136 Pfam-B_32 Pfam-B_3124Pfam-B_3102Pfam-B_2575notch Pfam-B_3059 Pfam-B_1613 Pfam-B_7666 Pfam-B_2158 Pfam-B_889 Pfam-B_367 Pfam-B_913 Pfam-B_58 Pfam-B_150 Pfam-B_671 Pfam-B_9931 Pfam-B_1302 pilin Pfam-B_9934 Pfam-B_9943 Pfam-B_447 Pfam-B_1339 Pfam-B_10131 Cys_knot Pfam-B_121 Pfam-B_10947 Pfam-B_6195 Pfam-B_5182 Pfam-B_830 Pfam-B_8684 Pfam-B_5335 Pfam-B_9935 Pfam-B_11896 Pfam-B_3327 Pfam-B_10299 Pfam-B_2506 Pfam-B_5428 Pfam-B_5348 Pfam-B_1774 RIP Pfam-B_2846 Pfam-B_1971 E1-E2_ATPase Pfam-B_11155 Pfam-B_7561 Pfam-B_2773 Pfam-B_3463 Pfam-B_10143 laminin_G Pfam-B_607 Pfam-B_7667 Pfam-B_1592 Pfam-B_8825Pfam-B_8824 Pfam-B_5154 Pfam-B_11145 DAG_PE-bind Pfam-B_1574 Pfam-B_9661 Pfam-B_8034 Pfam-B_479 Pfam-B_3083 Pfam-B_10105 Pfam-B_92 Pfam-B_10702 Pfam-B_436 Pfam-B_6378 Pfam-B_4594 Pfam-B_697 Pfam-B_813 Pfam-B_3898Pfam-B_9609 Pfam-B_8120 Pfam-B_701 Pfam-B_4521 Pfam-B_10707 Pfam-B_523Pfam-B_1331Pfam-B_6308Pfam-B_4051 Pfam-B_1075 sugar_tr Pfam-B_1110 Pfam-B_4324 Pfam-B_3494 Pfam-B_330 Pfam-B_8207 Pfam-B_4202 lectin_c Pfam-B_4420Pfam-B_275 Pfam-B_8170 Pfam-B_6380 Pfam-B_7558 Pfam-B_5136 Pfam-B_6716 Pfam-B_10130 Pfam-B_5421Pfam-B_8583Pfam-B_3169Pfam-B_8197 Pfam-B_10144Pfam-B_2361 Pfam-B_1516 Pfam-B_6721 Pfam-B_2519 Pfam-B_7235 Pfam-B_10298 Pfam-B_4191 Pfam-B_1466 Pfam-B_752 Pfam-B_4244 Pfam-B_10001 Pfam-B_3063 Pfam-B_747 Pfam-B_3817 Pfam-B_2350 Pfam-B_1041 Pfam-B_3187 Pfam-B_2061 Pfam-B_7116 Pfam-B_954 Pfam-B_1970 Pfam-B_3899 Pfam-B_2716 Pfam-B_2518Pfam-B_8228Pfam-B_4061 Pfam-B_7668 Pfam-B_459 Pfam-B_2120 Pfam-B_10021 Pfam-B_2172 Pfam-B_3516 Pfam-B_11762Pfam-B_2977 Pfam-B_2225 Pfam-B_7442 Pfam-B_2887 Pfam-B_1465Pfam-B_11323Pfam-B_5799 filament Pfam-B_73 Pfam-B_6291 Pfam-B_2211 Pfam-B_8194 Pfam-B_3708 Pfam-B_2416 Pfam-B_5864 Pfam-B_10142 Pfam-B_1532 Pfam-B_1826 Pfam-B_7559 Pfam-B_5604Pfam-B_9667 Pfam-B_8257 Pfam-B_764 Pfam-B_2204 Pfam-B_2836 Pfam-B_7425 Pfam-B_2541Pfam-B_5406Pfam-B_2775 Pfam-B_6534 Pfam-B_699 Pfam-B_7392 Pfam-B_7665 Pfam-B_460 Pfam-B_11763 Pfam-B_7557 Pfam-B_7595 Pfam-B_2122Pfam-B_534 hemopexin Pfam-B_5411 Pfam-B_2540Pfam-B_2604 Pfam-B_4060 Pfam-B_10189 Pfam-B_864Pfam-B_4538 Pfam-B_803 Pfam-B_10397 Pfam-B_3901 Pfam-B_2839Pfam-B_3580 Pfam-B_6532 Pfam-B_11068 Pfam-B_1283 Pfam-B_3916Pfam-B_4072 Pfam-B_698 Pfam-B_5607 Pfam-B_8982 Pfam-B_4555 Pfam-B_3133Pfam-B_7301 Pfam-B_4075Pfam-B_6600 Pfam-B_4394 Pfam-B_1067 Pfam-B_4556 Pfam-B_7209 Pfam-B_2016 Pfam-B_711 Pfam-B_11808 Pfam-B_3882 Pfam-B_8468 Pfam-B_8150 Pfam-B_4058Pfam-B_5909 Pfam-B_8145Pfam-B_8139 Pfam-B_2351 ATP-synt_C Pfam-B_2995Pfam-B_1291 Pfam-B_7300 Pfam-B_2747 Pfam-B_7952 Pfam-B_2701 Pfam-B_6013 Pfam-B_7299 Pfam-B_9191 Pfam-B_8196 Pfam-B_11038 Pfam-B_3474 Pfam-B_5397Pfam-B_11766 Pfam-B_4976 Pfam-B_4062 Pfam-B_7017Pfam-B_5313 Pfam-B_8603 Pfam-B_225 Pfam-B_887 Pfam-B_2173 Pfam-B_4931 Pfam-B_7210 Pfam-B_7951Pfam-B_1134 Pfam-B_7049 Pfam-B_5741 Pfam-B_1106 Pfam-B_3902 Pfam-B_5857 Pfam-B_11753dsrm Pfam-B_2598 Pfam-B_4714 Pfam-B_3072 Pfam-B_1979 Pfam-B_10721 Pfam-B_3371 Pfam-B_2994Pfam-B_11500 Pfam-B_735 pkinasePfam-B_3707 Pfam-B_2241 Pfam-B_5973 ras Pfam-B_6257 Pfam-B_4085 Pfam-B_8143 Pfam-B_11146 efhand Pfam-B_5405 Pfam-B_5908 Pfam-B_3706ig Pfam-B_2161Pfam-B_1136 Pfam-B_910 Pfam-B_2362 Pfam-B_3170Pfam-B_7926Pfam-B_4057Pfam-B_8153 Pfam-B_9767 Pfam-B_2305Pfam-B_282 Pfam-B_474 Pfam-B_6371 Pfam-B_6563 Pfam-B_11241 Pfam-B_8918Pfam-B_4077 Pfam-B_2949 Pfam-B_8975Pfam-B_2855 Pfam-B_1116 Pfam-B_3071Pfam-B_381 Pfam-B_8003 Pfam-B_5407 Pfam-B_2827 RuBisCO_smallPfam-B_3575 Pfam-B_5211 Pfam-B_3767 Pfam-B_147Pfam-B_6038 Pfam-B_1016 Pfam-B_1731 Pfam-B_3947 Pfam-B_5007 Pfam-B_102Pfam-B_3960 ABC_tranPfam-B_4206 Pfam-B_4709 Pfam-B_7677 Pfam-B_9309 Pfam-B_8138 Pfam-B_5986 Pfam-B_2 Pfam-B_4807 Pfam-B_3314 Pfam-B_6369 Pfam-B_10297 Pfam-B_7055 Pfam-B_2034 Pfam-B_4610Pfam-B_11743 Pfam-B_863Pfam-B_8760Pfam-B_1025 rrm Pfam-B_9692 Pfam-B_1203 Pfam-B_10784 Pfam-B_1927 Pfam-B_7994 Pfam-B_3192 Pfam-B_5742 zn-protease Pfam-B_3758 Pfam-B_8527 Pfam-B_6144 Pfam-B_1713 Pfam-B_1855 Pfam-B_8230 Pfam-B_1649 Pfam-B_10814 Pfam-B_4295 Pfam-B_7873 Pfam-B_1525 Pfam-B_5278 Pfam-B_4806 Pfam-B_3574 Pfam-B_963 Pfam-B_11198 Pfam-B_5398lipocalinPfam-B_11830Pfam-B_11809 Pfam-B_5412 Pfam-B_9506 Pfam-B_135 Pfam-B_1656Pfam-B_1618 Pfam-B_8409 Pfam-B_122 Pfam-B_1282Pfam-B_768 Pfam-B_7880 Pfam-B_93 Pfam-B_1436 Pfam-B_2581 Pfam-B_1047 Pfam-B_6949 Pfam-B_915 Pfam-B_7399Pfam-B_6053 Pfam-B_3897 Pfam-B_11199 Pfam-B_219 Pfam-B_83Pfam-B_8195 Pfam-B_4513 Pfam-B_8141Pfam-B_1821Pfam-B_5515 Pfam-B_8717 Pfam-B_2488 Pfam-B_2157Pfam-B_3075 Pfam-B_4607Pfam-B_1589Pfam-B_477Pfam-B_3257Pfam-B_8017 Pfam-B_1840 Pfam-B_1267 cyclin Pfam-B_148 Pfam-B_4418 Pfam-B_2035 Pfam-B_4059 Pfam-B_3663 Pfam-B_727 Pfam-B_1871 Pfam-B_2434 Pfam-B_3082 Pfam-B_7727 Pfam-B_8142Pfam-B_8193 Pfam-B_961Pfam-B_11905 Pfam-B_3429 Pfam-B_5754Pfam-B_700Pfam-B_1956 Pfam-B_1617 Pfam-B_1813 Pfam-B_3458 Pfam-B_7635 Pfam-B_4076 Pfam-B_8161 Pfam-B_6578 Pfam-B_1473 Pfam-B_7826 Pfam-B_2284 Pfam-B_10965Pfam-B_5293 Pfam-B_3405 Pfam-B_7451 Pfam-B_8175 Pfam-B_162 Pfam-B_5101Pfam-B_1255 Pfam-B_974 Pfam-B_5404 Pfam-B_7745 Pfam-B_6909 Pfam-B_8227Pfam-B_2840 Pfam-B_2285 copper-bind Pfam-B_7456 Pfam-B_5246 Pfam-B_7923 Pfam-B_28 Pfam-B_11892C2 Pfam-B_1823 Pfam-B_176 Pfam-B_9590 Pfam-B_7971 Pfam-B_2766 Pfam-B_11918 Pfam-B_1868 Pfam-B_11906 Pfam-B_1967 Pfam-B_5323 Pfam-B_2262 Pfam-B_9648 Pfam-B_9407 Pfam-B_9508 Pfam-B_303 Pfam-B_10946 Pfam-B_2555 Pfam-B_1026Pfam-B_1068 Pfam-B_11900 Pfam-B_2060Pfam-B_1612Pfam-B_8518Pfam-B_3618 Pfam-B_324 Pfam-B_10720 Pfam-B_5334 Pfam-B_878Pfam-B_11735 Pfam-B_3730 Pfam-B_1157 Pfam-B_8407Pfam-B_7010Pfam-B_4924Pfam-B_6414 Pfam-B_3302 Pfam-B_403Pfam-B_1238Pfam-B_10357 Pfam-B_831 Pfam-B_7349 Pfam-B_3289Pfam-B_4726 Pfam-B_2150Pfam-B_4724 Pfam-B_3428 Pfam-B_3837 Pfam-B_2425 Pfam-B_796 Pfam-B_4666 Pfam-B_1684 Pfam-B_2400Pfam-B_3633Pfam-B_2884 Pfam-B_7350 Pfam-B_1186 Pfam-B_284 Pfam-B_2327Pfam-B_4243 Pfam-B_10558 Pfam-B_10884 zf-C4 Pfam-B_7233 Pfam-B_10236 Pfam-B_4552 Pfam-B_5089 Pfam-B_5153 Pfam-B_3999 Pfam-B_975 Pfam-B_10244 Pfam-B_10563 Pfam-B_305 Pfam-B_5214 Pfam-B_909 Pfam-B_10562Pfam-B_5173 Pfam-B_10241 Pfam-B_594Pfam-B_42 Pfam-B_770 Pfam-B_4550 Pfam-B_880 Pfam-B_4665 Pfam-B_6338 Pfam-B_8116 Pfam-B_6374Pfam-B_38 Pfam-B_7756 Pfam-B_6660 Pfam-B_10581 Pfam-B_125 Pfam-B_762Pfam-B_7115 Pfam-B_4664 Pfam-B_6341 Pfam-B_41Pfam-B_110 Pfam-B_10578 Pfam-B_1720 Pfam-B_1726 Pfam-B_331Pfam-B_65 Pfam-B_10580 Pfam-B_11715 Pfam-B_4997Pfam-B_827 Pfam-B_10617Pfam-B_8019 Pfam-B_123 Pfam-B_49 Pfam-B_9936 Pfam-B_13 Pfam-B_1706Pfam-B_765 Pfam-B_11744 Pfam-B_10237 Pfam-B_7008 Pfam-B_422Pfam-B_595 Pfam-B_11716 Pfam-B_7009 Pfam-B_11487 Pfam-B_7069 Pfam-B_826 Pfam-B_8749 Pfam-B_4881 Pfam-B_138 Pfam-B_3505 Pfam-B_3703 Pfam-B_11714 Pfam-B_6342Pfam-B_11490 Pfam-B_109Pfam-B_395 Pfam-B_7197 Pfam-B_11717 Pfam-B_9904Pfam-B_9897 Pfam-B_1234 Pfam-B_2844 Pfam-B_993Pfam-B_10711 Pfam-B_1755 Pfam-B_2913Pfam-B_6595 Pfam-B_1351 Pfam-B_1758 Pfam-B_6594 SH2 Pfam-B_10391 Pfam-B_1220 Pfam-B_1521 Pfam-B_11240 Pfam-B_2912SH3 Pfam-B_10727 Pfam-B_6330 Pfam-B_10471 Pfam-B_10708Pfam-B_918 Pfam-B_10726 Pfam-B_1204 Pfam-B_6340 Pfam-B_7052 Pfam-B_2802 Pfam-B_159 Pfam-B_7196 Pfam-B_9911 Pfam-B_3593 Pfam-B_10245 Pfam-B_505 Pfam-B_4648 Pfam-B_2008 Pfam-B_10788 Pfam-B_7797 Pfam-B_7653 Pfam-B_10786 Pfam-B_10243Pfam-B_766 Pfam-B_423 ldl_recept_a Pfam-B_10246 Pfam-B_3564 Pfam-B_189 Pfam-B_10785 Pfam-B_3935 Pfam-B_3508 Pfam-B_10790 Pfam-B_6332Pfam-B_2841 Pfam-B_10242 Pfam-B_10239 Pfam-B_10789 Pfam-B_11806 Pfam-B_2223 Pfam-B_10010 Pfam-B_2659 Pfam-B_2696 Pfam-B_3507 Pfam-B_718 Pfam-B_11859 Pfam-B_1677 Pfam-B_411 Pfam-B_338 Pfam-B_10793 Pfam-B_11123 Pfam-B_6598 Pfam-B_888Pfam-B_9052Pfam-B_1725 Pfam-B_160Pfam-B_919 Pfam-B_8890Pfam-B_8808Pfam-B_11593 Pfam-B_6339 Pfam-B_10787 AAA Pfam-B_2675 Pfam-B_1223 Pfam-B_11805 Pfam-B_2697 Pfam-B_4451 Pfam-B_1089 Pfam-B_8313 Pfam-B_11592 Pfam-B_9750Pfam-B_9431Pfam-B_1883 Pfam-B_141 Pfam-B_6599 Pfam-B_7798 Pfam-B_3887 Pfam-B_6580 Pfam-B_11350 Pfam-B_2228 Pfam-B_372 Pfam-B_4885 Pfam-B_5656Pfam-B_10099 Pfam-B_9215 Pfam-B_2792 Pfam-B_2893 Pfam-B_7900 Pfam-B_1375 Pfam-B_11591 Pfam-B_4174 Pfam-B_969 Pfam-B_630 Pfam-B_1533 Pfam-B_6435 Pfam-B_5653 Pfam-B_2914 Pfam-B_7830 Pfam-B_1502Pfam-B_4483 Pfam-B_582 Pfam-B_4074 Pfam-B_530 Pfam-B_1729 Pfam-B_1152 Pfam-B_2676 Pfam-B_777 Pfam-B_7311 Pfam-B_2892 Pfam-B_2296 Pfam-B_5264 Pfam-B_3212 Pfam-B_382 Pfam-B_2298 Pfam-B_674 Pfam-B_5041 Pfam-B_10587 HSP20 Pfam-B_881Pfam-B_1824Pfam-B_2945Pfam-B_2299 Pfam-B_955 Pfam-B_4675 Pfam-B_10791 Pfam-B_3299 Pfam-B_2881 Pfam-B_5591 Pfam-B_97 Pfam-B_4082 Pfam-B_2297 zf-CCHC Pfam-B_2842 Pfam-B_4403Pfam-B_10556 Pfam-B_7488 Pfam-B_5654 Pfam-B_1490 Pfam-B_870 Pfam-B_3263 Pfam-B_3069 Pfam-B_10585 Pfam-B_6004 Pfam-B_8724 Pfam-B_2880 Pfam-B_9049 Pfam-B_3050 Pfam-B_6003 Pfam-B_1640 Pfam-B_5117 Pfam-B_5079 Pfam-B_2530rhv Pfam-B_10718Pfam-B_6000Pfam-B_2531 Pfam-B_2759 Pfam-B_75Pfam-B_10589 Pfam-B_7654 Pfam-B_7487 Pfam-B_5116 Pfam-B_563 Pfam-B_9548 Pfam-B_774 Pfam-B_1489Pfam-B_2013 Pfam-B_6511 Pfam-B_2290 Pfam-B_31 rvt Pfam-B_60 Pfam-B_9057 Pfam-B_885Pfam-B_865 Pfam-B_1802 Pfam-B_9533Pfam-B_9543 Pfam-B_2292 Pfam-B_2469 Pfam-B_4225Pfam-B_3953 Pfam-B_6515 Pfam-B_10543 Pfam-B_4397 Pfam-B_3068 Pfam-B_6523 Pfam-B_4395 Pfam-B_9531 Pfam-B_939 hormone_rec Pfam-B_1260 Pfam-B_7489 Pfam-B_4398 Pfam-B_6802 Pfam-B_1241 Pfam-B_620 Pfam-B_297Pfam-B_1361 Pfam-B_2515 Pfam-B_2289 Pfam-B_2240 Pfam-B_7483Pfam-B_10597Pfam-B_6012 Pfam-B_3572 Pfam-B_6801Pfam-B_11467 Pfam-B_6959 Pfam-B_10604Pfam-B_6002 Pfam-B_3879 Pfam-B_9494 Pfam-B_10572 Pfam-B_6478 Pfam-B_2403 Pfam-B_2669 Pfam-B_6006 Pfam-B_10574Pfam-B_400 Pfam-B_2989 Pfam-B_10575 Pfam-B_7569 Pfam-B_5144Pfam-B_130 Pfam-B_2758 Pfam-B_2838 Pfam-B_7486 Pfam-B_7491 Pfam-B_5797 Pfam-B_7573Pfam-B_7282 Pfam-B_1804 Pfam-B_10573Pfam-B_10403 Pfam-B_1556 Pfam-B_2405 Pfam-B_3903 Pfam-B_7329Pfam-B_3829 Pfam-B_7375 Pfam-B_6323 Pfam-B_6018Pfam-B_6017 Pfam-B_1595 Pfam-B_5143 Pfam-B_3288 Pfam-B_10576Pfam-B_2878 Pfam-B_5141 il8 Pfam-B_725 Pfam-B_10607 Pfam-B_5368 Pfam-B_4279 Pfam-B_39527tm_1 Pfam-B_10218 Pfam-B_8722 Pfam-B_1600 Pfam-B_5078 Pfam-B_170 Pfam-B_7750 Pfam-B_5115 Pfam-B_11828 Pfam-B_2877 Pfam-B_1418Pfam-B_3905 Pfam-B_7771 Pfam-B_9359 Pfam-B_9919 Pfam-B_8094Y_phosphatase Pfam-B_3904 Pfam-B_5142 Pfam-B_5820 Pfam-B_4396Pfam-B_7479 Pfam-B_5111 Pfam-B_9130 Pfam-B_1127 Pfam-B_2288 Pfam-B_9459 Pfam-B_7480 Pfam-B_7769 Pfam-B_749 Pfam-B_11625 Pfam-B_10594 Pfam-B_6992Pfam-B_2477 Pfam-B_1665 Pfam-B_659 Pfam-B_2291Pfam-B_1922 Pfam-B_6480 Pfam-B_7749Pfam-B_438Pfam-B_623 Pfam-B_11928 Pfam-B_6184Pfam-B_5766 Pfam-B_4164 Pfam-B_5765 Pfam-B_6878 Pfam-B_437 Pfam-B_6548 Pfam-B_108 Pfam-B_10566 Pfam-B_3880 Pfam-B_940Pfam-B_9916 Pfam-B_1602 Pfam-B_3165Pfam-B_9848 Pfam-B_1258 Pfam-B_3945Pfam-B_3315 Pfam-B_7321 Pfam-B_9918Pfam-B_3095Pfam-B_2703Pfam-B_7648 Pfam-B_5072 Pfam-B_11104 Pfam-B_7322 Pfam-B_11771 Pfam-B_4663 Pfam-B_6183 Pfam-B_5069Pfam-B_3842Pfam-B_2978 Pfam-B_6723 Pfam-B_1265 Pfam-B_8581Pfam-B_1601 Pfam-B_19 rnaseHPfam-B_859 Pfam-B_533 Pfam-B_3944 Pfam-B_4150 Pfam-B_3442 Pfam-B_5070 Pfam-B_5077 rvp Pfam-B_5376 Pfam-B_11643Pfam-B_4258Pfam-B_1125 Pfam-B_7077Pfam-B_990 Pfam-B_7374 Pfam-B_4110 Pfam-B_7647 Pfam-B_705Pfam-B_9846 Pfam-B_2528 Pfam-B_8580 Pfam-B_1388 Pfam-B_201 Pfam-B_5075 Pfam-B_4046 Pfam-B_1931 Pfam-B_5073 Pfam-B_5266 Pfam-B_478 Pfam-B_3876 Pfam-B_3940Pfam-B_3962 Pfam-B_7078 Pfam-B_7372Pfam-B_7373 Pfam-B_339Pfam-B_2529 Pfam-B_3939Pfam-B_9129Pfam-B_6547 Pfam-B_7712 Pfam-B_11584 Pfam-B_773 Pfam-B_2012 ldl_recept_b Pfam-B_9827 Pfam-B_5185 Pfam-B_11644 Pfam-B_6847 Pfam-B_3878 Pfam-B_7836 Pfam-B_2409 Pfam-B_4684 Pfam-B_11419 Pfam-B_1323 Pfam-B_3120 Pfam-B_5186 Pfam-B_3256 Pfam-B_495 Pfam-B_1803 Pfam-B_7838 Pfam-B_5643 Pfam-B_5184 Pfam-B_3963 Pfam-B_989 Pfam-B_5960 Pfam-B_497 Pfam-B_2618 Pfam-B_1056 Pfam-B_5114 Pfam-B_2011 Pfam-B_3689 Pfam-B_3310 Pfam-B_7616 Pfam-B_312 Pfam-B_5112 Pfam-B_1788 Pfam-B_2504Pfam-B_936Pfam-B_1112 Pfam-B_3048Pfam-B_2111 Pfam-B_6728 Pfam-B_7526Pfam-B_4851 Pfam-B_7713 Pfam-B_9303 Pfam-B_9365 Pfam-B_2619 Pfam-B_555Pfam-B_134 Pfam-B_8758Pfam-B_8759 Pfam-B_757Pfam-B_2015 Pfam-B_2007 Pfam-B_11638 Pfam-B_313 Pfam-B_2140 Pfam-B_7107 Pfam-B_2147 Pfam-B_9455 Pfam-B_2058 Pfam-B_375Pfam-B_341 Pfam-B_7426 Pfam-B_8727 Pfam-B_2546 Pfam-B_7482 Pfam-B_496 Pfam-B_5167 Pfam-B_376 Pfam-B_6924 Pfam-B_10601Pfam-B_10600 Pfam-B_1360Pfam-B_2009 Pfam-B_5234 Pfam-B_2547 Pfam-B_7481 Pfam-B_11209 Pfam-B_7770Pfam-B_7772 toxin Pfam-B_8696 Pfam-B_5168 Pfam-B_8089 Pfam-B_7111 Pfam-B_7378 HLH Pfam-B_6923 Pfam-B_5113 Pfam-B_2006 Pfam-B_4917 Pfam-B_81 Pfam-B_7477 Pfam-B_4261 Pfam-B_9305 Pfam-B_8090 Pfam-B_6516 Pfam-B_11927Pfam-B_4918Pfam-B_4688 Pfam-B_3611 Pfam-B_85 Pfam-B_10668 Pfam-B_4382 Pfam-B_3586 Pfam-B_6303 Pfam-B_4533Pfam-B_6671 Pfam-B_11737 Pfam-B_2479 Pfam-B_10082 Pfam-B_1820 Pfam-B_1913 Pfam-B_7022 Pfam-B_4532Pfam-B_6667Pfam-B_2764 Pfam-B_7376 Pfam-B_9285 Pfam-B_4921Pfam-B_4919 Pfam-B_3587 Pfam-B_4008Pfam-B_8333 Pfam-B_11164 Pfam-B_4847 Pfam-B_4922Pfam-B_1365Pfam-B_9331Pfam-B_4689 Pfam-B_23 Pfam-B_7355 Pfam-B_4920 Pfam-B_5491 Pfam-B_2992 Pfam-B_4148Pfam-B_4149 Pfam-B_7565Pfam-B_10599 Pfam-B_5999 Pfam-B_2340 Pfam-B_7354 Pfam-B_3294Pfam-B_3295 Pfam-B_3985 Pfam-B_8334Pfam-B_292 Pfam-B_5128 Pfam-B_10608 Pfam-B_9912Pfam-B_8528Pfam-B_8126 Pfam-B_3549 Pfam-B_11741 Pfam-B_481 Pfam-B_4350Pfam-B_4356 COX1 Pfam-B_8494Pfam-B_3122 Pfam-B_5267 Pfam-B_1531Pfam-B_1730 Pfam-B_3137 Pfam-B_5175 Pfam-B_5490 Pfam-B_6304 Pfam-B_2572 Pfam-B_731 Pfam-B_6221 Pfam-B_3988 Pfam-B_5265 Pfam-B_5176 Pfam-B_1835Pfam-B_4163 Pfam-B_3875Pfam-B_2527 Pfam-B_8332 Pfam-B_2149 Pfam-B_2789 Pfam-B_730 Pfam-B_941 Pfam-B_10478 Pfam-B_9439 Pfam-B_1498 Pfam-B_2148 Pfam-B_2784 Pfam-B_3987 Pfam-B_729 Pfam-B_2333 Pfam-B_2219 Pfam-B_3136Pfam-B_1022 Pfam-B_8578 Pfam-B_2220Pfam-B_2063 Pfam-B_440 Pfam-B_10479 Pfam-B_4652 Pfam-B_3373 Pfam-B_6674Pfam-B_4357 Pfam-B_10807 Pfam-B_118 Pfam-B_397 Pfam-B_9310 Pfam-B_3530 Pfam-B_7200 Pfam-B_4409 Pfam-B_1997 Pfam-B_10918 Pfam-B_271 Pfam-B_552Pfam-B_1357 Pfam-B_7201 Pfam-B_6608 Pfam-B_5163 Pfam-B_6475 Pfam-B_452 Pfam-B_597 Pfam-B_9861 Pfam-B_10756 Pfam-B_288 Pfam-B_10852Pfam-B_2125 Pfam-B_5304 Pfam-B_11658 Pfam-B_8103 Pfam-B_3612 Pfam-B_10477 Pfam-B_4653 Pfam-B_11657 Pfam-B_2961 Pfam-B_1281 Pfam-B_4829 Pfam-B_5162Pfam-B_9698 Pfam-B_3686 Pfam-B_11642 Pfam-B_4402Pfam-B_896 Pfam-B_5272 Pfam-B_302 Pfam-B_540Pfam-B_9546 Pfam-B_10476 Pfam-B_9544 Pfam-B_4536 Pfam-B_2823 Pfam-B_1895 Pfam-B_3444 Pfam-B_4535 mito_carr Pfam-B_1801 Pfam-B_8210 Pfam-B_2578 Pfam-B_8209

Pfam-B_1832 Pfam-B_9545

Pfam-B_8105 Pfam-B_11637

Pfam-B_1830Pfam-B_1831 Pfam-B_10270 Pfam-B_7023 Pfam-B_3422 Pfam-B_6041Pfam-B_1928

Pfam-B_3482

Pfam-B_5208 fer2 Pfam-B_5207

Pfam-B_3062

Pfam-B_4888 Pfam-B_6086Pfam-B_901 Pfam-B_3055Pfam-B_10833Pfam-B_2516 Pfam-B_819 apple Pfam-B_10433 Pfam-B_3852 Pfam-B_1698 Pfam-B_9398 Pfam-B_9706 Pfam-B_8482Pfam-B_8483 Pfam-B_820Pfam-B_4506 Pfam-B_2464Pfam-B_2639 Pfam-B_506 Pfam-B_1206Pfam-B_424 Pfam-B_5780 Pfam-B_767 Pfam-B_5958 Pfam-B_3874 Pfam-B_6610 Pfam-B_9028Pfam-B_6182 Pfam-B_99 Pfam-B_11860 Pfam-B_8108 Pfam-B_3854Pfam-B_3862 Pfam-B_2524 Pfam-B_3848 Pfam-B_7384 Pfam-B_4722Pfam-B_3853

Pfam-B_1158

Pfam-B_2698 Pfam-B_9220 Pfam-B_9219

Pfam-B_794

Pfam-B_3306

Pfam-B_550

Pfam-B_5760thiored Pfam-B_9149Pfam-B_1890

Pfam-B_8928 aminotran

Pfam-B_573Pfam-B_1133

Pfam-B_8471 Pfam-B_5971

Pfam-B_2191

Pfam-B_5161 Pfam-B_5590Pfam-B_5589

Pfam-B_528

Pfam-B_604 Pfam-B_514

Pfam-B_842Pfam-B_4698

Pfam-B_6572Pfam-B_10698 Pfam-B_10699 Pfam-B_6576Pfam-B_7136

Pfam-B_1088 Pfam-B_4686Pfam-B_10692

Pfam-B_9822 Pfam-B_268 Pfam-B_11929 Pfam-B_7062 Pfam-B_1780 Pfam-B_327 Pfam-B_232 Pfam-B_2454 Pfam-B_7465 Pfam-B_9701 Pfam-B_291 Pfam-B_8623 Pfam-B_8720 Pfam-B_4069 Pfam-B_2418 Pfam-B_3006 Pfam-B_3807 Pfam-B_10517 Pfam-B_9523 Pfam-B_1678Pfam-B_1175 Pfam-B_3154Pfam-B_3149 Pfam-B_6827 Pfam-B_7608 Pfam-B_6999 Pfam-B_191 Pfam-B_5781 Pfam-B_379 Pfam-B_7627 Pfam-B_9819 Pfam-B_1569 Pfam-B_2986 Pfam-B_2105 Pfam-B_8662 Pfam-B_5387 Pfam-B_11841 Pfam-B_213 Pfam-B_9818 Pfam-B_4339 Pfam-B_2141 Pfam-B_5157 Pfam-B_7607 Pfam-B_9382 Pfam-B_1991 Pfam-B_11276 Pfam-B_5803 Pfam-B_8137 Pfam-B_739 Pfam-B_11143 Pfam-B_6471 Pfam-B_7625 Pfam-B_10190 Pfam-B_8167 Pfam-B_2685 Pfam-B_2293 Pfam-B_8712 Pfam-B_3182 Pfam-B_3164 Pfam-B_3107 Pfam-B_3780 Pfam-B_6537 Pfam-B_8067 Pfam-B_9547 Pfam-B_2760 Pfam-B_2124 Pfam-B_6701 Pfam-B_11414Pfam-B_404 Pfam-B_7068 Pfam-B_5670 Pfam-B_7237 Pfam-B_10688 Pfam-B_1937 Pfam-B_2928 Pfam-B_9820 Pfam-B_2942 Pfam-B_11415 Pfam-B_8179 Pfam-B_7600 tRNA-synt_1 Pfam-B_838 Pfam-B_3168 Pfam-B_613 Pfam-B_10463 Pfam-B_4641 Pfam-B_10677 Pfam-B_9957 Pfam-B_6995 Pfam-B_10464 Pfam-B_9979 Pfam-B_7464 Pfam-B_7098 Pfam-B_4054 Pfam-B_2294 Pfam-B_4817 Pfam-B_824 Pfam-B_9383 Pfam-B_8667 Pfam-B_839 Pfam-B_3185 Pfam-B_1792 Pfam-B_153 Pfam-B_7619Pfam-B_7416 Pfam-B_740 Pfam-B_3412 Pfam-B_509 Pfam-B_7762 Pfam-B_3926 Zn_clus Pfam-B_981 Pfam-B_5647 Pfam-B_11864 Pfam-B_7605 Pfam-B_10465 Pfam-B_591 Pfam-B_663 Pfam-B_3162 Pfam-B_72 Pfam-B_1100 Pfam-B_4961 Pfam-B_9699 Pfam-B_3184 Pfam-B_9537 Pfam-B_2071 Pfam-B_7606 Pfam-B_55 Pfam-B_615 Pfam-B_642 Pfam-B_182 Pfam-B_1061 Pfam-B_6174 Pfam-B_3917 Pfam-B_4960 Pfam-B_10640 Pfam-B_2748 Pfam-B_8071 Pfam-B_1562 Pfam-B_10466 Pfam-B_9030 Pfam-B_9821Pfam-B_9817 Pfam-B_39 Pfam-B_101 Pfam-B_301 Pfam-B_56 Pfam-B_5019 Pfam-B_8127 Pfam-B_4470 Pfam-B_7417 Pfam-B_2295 Pfam-B_10469 Pfam-B_622 Pfam-B_2742 Pfam-B_1936 Pfam-B_7834 Pfam-B_6778 Pfam-B_2664 Pfam-B_6606 Pfam-B_4887 Pfam-B_1421 Pfam-B_9342 aldedh Pfam-B_3538 Pfam-B_10637 Pfam-B_11501 Pfam-B_4010 Pfam-B_5377 Pfam-B_9341 Pfam-B_7615 Pfam-B_4472 Pfam-B_3668 Pfam-B_5804 Pfam-B_451 response_reg Pfam-B_7601 Pfam-B_1172 Pfam-B_464 Pfam-B_1356 Pfam-B_3087 Pfam-B_1946 Pfam-B_5652 Pfam-B_2243 pyr_redox Pfam-B_11140 Pfam-B_6996 Pfam-B_11412 Pfam-B_848 Pfam-B_1487 Pfam-B_4400 Pfam-B_70 Pfam-B_4407 Pfam-B_7755 Pfam-B_2837 Pfam-B_250 Pfam-B_106 Pfam-B_4414 Pfam-B_1420 Pfam-B_10460 Pfam-B_9778 Pfam-B_9385 Pfam-B_2599 Pfam-B_4068 S12 Pfam-B_1327 Pfam-B_5004 Pfam-B_7602 Pfam-B_165 Pfam-B_7593 Pfam-B_709 Pfam-B_2803 Pfam-B_2762 Pfam-B_11141 Pfam-B_2443 Pfam-B_318 Pfam-B_4925 Pfam-B_11422 Pfam-B_3186 Pfam-B_35 Pfam-B_6834 Pfam-B_703 Pfam-B_7134 Pfam-B_1001 Pfam-B_362 Pfam-B_3931 Pfam-B_3667 Pfam-B_1296 Pfam-B_1179 Pfam-B_1062 Pfam-B_6946 Pfam-B_11566 Pfam-B_1724 Pfam-B_7795 Pfam-B_3737 Pfam-B_9319 Pfam-B_4207 Pfam-B_457 Pfam-B_4364 photoRC Pfam-B_3721 Pfam-B_5627 Pfam-B_4518 Pfam-B_1484 Pfam-B_7580 Pfam-B_10281 Pfam-B_5104 Pfam-B_2317 Pfam-B_10724 Pfam-B_1248 Pfam-B_5934 Pfam-B_2927 Pfam-B_11142 Pfam-B_6994 Pfam-B_430 Pfam-B_4963 Pfam-B_494 Pfam-B_214 Pfam-B_2870 Pfam-B_7594 pro_isomerase Pfam-B_10710 Pfam-B_3064 Pfam-B_7603 Pfam-B_197 Pfam-B_900 Pfam-B_682 Pfam-B_1552 Pfam-B_5379 Pfam-B_862 Pfam-B_906 Pfam-B_203 Pfam-B_173 Pfam-B_895 Pfam-B_2453 Pfam-B_11162 Pfam-B_984 Pfam-B_5746 Pfam-B_3183 Pfam-B_7624 Pfam-B_172 Pfam-B_4703 Pfam-B_2991 Pfam-B_48 Pfam-B_9565 Pfam-B_1554 Pfam-B_7662 Pfam-B_2926 Pfam-B_8680 Pfam-B_3411 Pfam-B_383 Pfam-B_6044 Pfam-B_1278Pfam-B_2591 Pfam-B_1785 Pfam-B_2858 Pfam-B_1214 Pfam-B_4710 Pfam-B_510 Pfam-B_3915 Pfam-B_1553 Pfam-B_4902 Pfam-B_1864 Pfam-B_9415 Pfam-B_3724 Pfam-B_5158 Pfam-B_7135 Pfam-B_4570 Pfam-B_3736 7tm_2 Pfam-B_8043 Pfam-B_435 Pfam-B_1485 Pfam-B_9357 Pfam-B_1174 Pfam-B_4569 Pfam-B_2072 Pfam-B_1759 Pfam-B_8688 Pfam-B_7998 Pfam-B_3558 Pfam-B_5094 Pfam-B_4962 Pfam-B_4401 Pfam-B_8Pfam-B_612Pfam-B_7108 Pfam-B_3603 Pfam-B_5925 Pfam-B_10470 Pfam-B_664 Pfam-B_856 Pfam-B_3016 Pfam-B_9146 Pfam-B_6015 Pfam-B_11163 Pfam-B_966 Pfam-B_208 cpn60 Pfam-B_584 Pfam-B_2129 Pfam-B_9143 Pfam-B_3054 Pfam-B_3437 Pfam-B_1249 Pfam-B_1046 Pfam-B_200 Pfam-B_1877 Pfam-B_8742 Pfam-B_5628 Pfam-B_9358 Pfam-B_3860 Pfam-B_8718 Pfam-B_4194 Pfam-B_583 Pfam-B_3053 Pfam-B_8109 Pfam-B_8056 Pfam-B_377 Pfam-B_7132 vwd Pfam-B_679 Pfam-B_8699 Pfam-B_1462 Pfam-B_6810 Pfam-B_2565 Pfam-B_1518 Pfam-B_1790 Pfam-B_4548 Pfam-B_7007 Pfam-B_1626 Pfam-B_5974 Pfam-B_4804 Pfam-B_6057 Pfam-B_647 Pfam-B_988Pfam-B_916 Pfam-B_4803 Pfam-B_1463 Pfam-B_9646 Pfam-B_1579 Pfam-B_4143 Pfam-B_5835 Pfam-B_809 Pfam-B_4285 hormone2 Pfam-B_8719 Pfam-B_9700 Pfam-B_1114 Pfam-B_6129 Pfam-B_5671 Pfam-B_3330 Pfam-B_2466 Pfam-B_4634 Pfam-B_2526 Pfam-B_69 Pfam-B_5972 Pfam-B_925 Pfam-B_4186Pfam-B_8650 Pfam-B_8716 Pfam-B_6452 Pfam-B_10775 Pfam-B_2442 Pfam-B_7285 Pfam-B_10633 Pfam-B_413 Pfam-B_4286 Pfam-B_469 Pfam-B_6542 Pfam-B_374 Pfam-B_9811 Pfam-B_8670 subtilase Pfam-B_2017 Pfam-B_204 Pfam-B_6233 Pfam-B_9145 Pfam-B_3041 Pfam-B_5789 Pfam-B_6128 Pfam-B_9144 Pfam-B_4789 Pfam-B_2109 Pfam-B_2560 Pfam-B_2478 Pfam-B_249 Pfam-B_6773 Pfam-B_1549 Pfam-B_9992 Pfam-B_3426 Pfam-B_965 Pfam-B_10741 Pfam-B_6231 Pfam-B_983 Pfam-B_4637 Pfam-B_9991 Pfam-B_835 Pfam-B_776 Pfam-B_4984Pfam-B_2815 Pfam-B_2816 Pfam-B_2559 Pfam-B_9230 Pfam-B_10452 Pfam-B_9232 Pfam-B_902 Pfam-B_9227 Pfam-B_4156 Pfam-B_241 Pfam-B_3469 Pfam-B_1224 Pfam-B_958 Pfam-B_1558 Pfam-B_1090 Pfam-B_10451Pfam-B_10450 Pfam-B_8507 Pfam-B_9993

Pfam-B_10161 Pfam-B_10162 Pfam-B_3523 Pfam-B_4157 Pfam-B_9231 Pfam-B_6601

Pfam-B_4155

Pfam-B_10160

Pfam-B_6849

Pfam-B_1199 Pfam-B_5926 Pfam-B_10163Pfam-B_4561 bZIP Pfam-B_1973 Pfam-B_2365 Pfam-B_7796 Pfam-B_333oxidored_fadPfam-B_10157

Pfam-B_419 Pfam-B_3956Pfam-B_5225 Pfam-B_10159 Pfam-B_6298Pfam-B_4560 Pfam-B_418Pfam-B_9347 Pfam-B_10166 Pfam-B_503Pfam-B_1972 Pfam-B_1200 Pfam-B_10164 Pfam-B_7736 Pfam-B_10167Pfam-B_10169 Pfam-B_10168

Pfam-B_3950 Pfam-B_3951 Pfam-B_3693 Pfam-B_4227 Pfam-B_5399 Pfam-B_2596 Pfam-B_7344 Pfam-B_3925 Pfam-B_2276 Pfam-B_8214 Pfam-B_6952 Pfam-B_10820 Pfam-B_6403 Pfam-B_3344Pfam-B_4312 Pfam-B_6991 Pfam-B_3846 Pfam-B_3735 Pfam-B_10674 Pfam-B_8799 Pfam-B_10091Pfam-B_1949 Pfam-B_21 Pfam-B_6317 Pfam-B_10624 Pfam-B_5532 Pfam-B_2303 Pfam-B_3573 Pfam-B_949 Pfam-B_676 Pfam-B_2668 Pfam-B_5551 Pfam-B_2098 Pfam-B_1450 Pfam-B_4384 Pfam-B_356 Pfam-B_11368 Pfam-B_11477 Pfam-B_3563 Pfam-B_9210 Pfam-B_2706 Pfam-B_7775 Pfam-B_513 Pfam-B_4720 Pfam-B_3268 Pfam-B_6529 Pfam-B_5220 Pfam-B_1231 Pfam-B_6450 Pfam-B_2018 Pfam-B_538 Pfam-B_7410 Pfam-B_286 Pfam-B_2973 Pfam-B_1225 Pfam-B_3525 Pfam-B_1596 Pfam-B_2674 Pfam-B_1213 Pfam-B_10823 Pfam-B_10318 Pfam-B_7478 Pfam-B_3614 Pfam-B_11367 Pfam-B_4798 Pfam-B_4670 Pfam-B_5239 Pfam-B_4185 Pfam-B_190 Pfam-B_9833 Pfam-B_5215 Pfam-B_4311 Pfam-B_7002 Pfam-B_8584 Pfam-B_11360 Pfam-B_2456 Pfam-B_9237 Pfam-B_7551 Pfam-B_2382 Pfam-B_5217 Pfam-B_779 Pfam-B_3345 Pfam-B_4050 Pfam-B_1699 Pfam-B_4705 Pfam-B_5783 Pfam-B_1530 oxidored_molyb Pfam-B_602 Pfam-B_5991 Pfam-B_8761 Pfam-B_4310 Pfam-B_4537 Pfam-B_6786 Pfam-B_266 Pfam-B_169 Pfam-B_652 Pfam-B_651 Pfam-B_922 Pfam-B_10825 Pfam-B_7383 Pfam-B_8467 Pfam-B_278 Pfam-B_1217 Pfam-B_9304 Pfam-B_3457 Pfam-B_6140 Pfam-B_5651 Pfam-B_5400 Pfam-B_1254 Pfam-B_1212 Pfam-B_5847 Pfam-B_414 Pfam-B_641 Pfam-B_7744 Pfam-B_9601 Pfam-B_192 Pfam-B_3562 Pfam-B_8702 Pfam-B_5222 Pfam-B_9832 Pfam-B_3949 Pfam-B_1581 Pfam-B_2359 Pfam-B_556 Pfam-B_2975 Pfam-B_161Pfam-B_4778 Pfam-B_3900 Pfam-B_11670 Pfam-B_2850 Pfam-B_29 Pfam-B_7774 Pfam-B_9521 Pfam-B_7722 Pfam-B_7632 Pfam-B_4915 Pfam-B_3265 Pfam-B_2432 Pfam-B_2397 Pfam-B_8969 Pfam-B_8253Pfam-B_1150 Pfam-B_561 Pfam-B_11671 Pfam-B_151 Pfam-B_3613 Pfam-B_4328 Pfam-B_1406 Pfam-B_7768 Pfam-B_198 Pfam-B_542 Pfam-B_3321 Pfam-B_2005 heme_1 Pfam-B_2894 Pfam-B_1320 Pfam-B_2428 Pfam-B_1963 Pfam-B_1416 Pfam-B_246 Pfam-B_8672 Pfam-B_5553 Pfam-B_7207 Pfam-B_5548 Pfam-B_9467 Pfam-B_11361 Pfam-B_1312Pfam-B_683 Pfam-B_8703 Pfam-B_5221 Pfam-B_1003 Pfam-B_10706 Pfam-B_4113 Pfam-B_1101 Pfam-B_163 Pfam-B_445 Pfam-B_8065 Pfam-B_6918 Pfam-B_8973 Pfam-B_2896 Pfam-B_694 Pfam-B_277 Pfam-B_343 Pfam-B_1155 Pfam-B_491 adh_short Pfam-B_860 Pfam-B_811 Pfam-B_4706 Pfam-B_11476 Pfam-B_2268 Pfam-B_1461 Pfam-B_2725 Pfam-B_4327 Pfam-B_3483 Pfam-B_6029 Pfam-B_882 Pfam-B_2642 Pfam-B_3822 Pfam-B_454 Pfam-B_1904 Pfam-B_6739 Pfam-B_254 Pfam-B_11600 Pfam-B_1337 Pfam-B_876 Pfam-B_11499 Pfam-B_179 Pfam-B_342 Pfam-B_10319 Pfam-B_7001 Pfam-B_2415 Pfam-B_7552 Pfam-B_950 Pfam-B_8851 Pfam-B_4193 Pfam-B_2192 Pfam-B_7206 Pfam-B_4146 Pfam-B_1919 Pfam-B_11478 Pfam-B_2396 Pfam-B_244 Pfam-B_2209 Pfam-B_4209 Pfam-B_164 Pfam-B_9168 Pfam-B_2229 Pfam-B_2112 Pfam-B_10090 Pfam-B_351 Pfam-B_6579 sigma54 Pfam-B_5118 Pfam-B_7153 Pfam-B_3567 Pfam-B_8968 Pfam-B_1812 Pfam-B_921Pfam-B_5620 Pfam-B_4708 Pfam-B_2053 Pfam-B_2266 Pfam-B_131 Pfam-B_1744 Pfam-B_1745 Pfam-B_3305 Pfam-B_3569 Pfam-B_2895 Pfam-B_8252 Pfam-B_5216 Pfam-B_1791 Pfam-B_242 Pfam-B_10821 Pfam-B_5846 Pfam-B_107 Pfam-B_1178 Pfam-B_247 Pfam-B_9211 Pfam-B_2123 Pfam-B_2959 Pfam-B_2627 Pfam-B_11495 Pfam-B_769 Pfam-B_3343 Pfam-B_3843 Pfam-B_650 Pfam-B_10087 Pfam-B_2417 Pfam-B_11747 Pfam-B_6743Pfam-B_2046Pfam-B_11369 Pfam-B_6138 Pfam-B_3090 Pfam-B_6347 Pfam-B_4916 Pfam-B_8619Pfam-B_10693 Pfam-B_2307 Pfam-B_780 Pfam-B_9830 Pfam-B_180 Pfam-B_10253 Pfam-B_7628 Pfam-B_2273 Pfam-B_2175 thiolase Pfam-B_2429 DNA_pol Pfam-B_1523 Pfam-B_543 Pfam-B_387 Pfam-B_206 Pfam-B_1447 Pfam-B_2613 Pfam-B_522 Pfam-B_2879 Pfam-B_207 Pfam-B_205 Pfam-B_1071Pfam-B_1964 Pfam-B_2653 Pfam-B_1882 Pfam-B_562 Pfam-B_9212 Pfam-B_712 Pfam-B_9831 Pfam-B_1404 Pfam-B_846 Pfam-B_9162 Pfam-B_2476 Pfam-B_648 Pfam-B_3568 Pfam-B_4800 Pfam-B_240 Pfam-B_4616 Pfam-B_977 Pfam-B_485 Pfam-B_5782 Pfam-B_4777 Pfam-B_1380 Pfam-B_868 Pfam-B_5171 Pfam-B_9317 Pfam-B_6363 Pfam-B_6742 Pfam-B_7409 Pfam-B_269 Pfam-B_2673 Pfam-B_2430 Pfam-B_2724 Pfam-B_329 Pfam-B_6510 Pfam-B_2134 Pfam-B_124 Pfam-B_4677 Pfam-B_493 Pfam-B_3948 Pfam-B_4313 Pfam-B_8797 Pfam-B_10086 Pfam-B_2538 Pfam-B_3350 Pfam-B_2705 Pfam-B_7773 Pfam-B_557 Pfam-B_4719 Pfam-B_5618Pfam-B_5619 Pfam-B_10731 Pfam-B_2514 Pfam-B_10816 Pfam-B_2252 Pfam-B_11922 Pfam-B_10598 Pfam-B_672 Pfam-B_5552 Pfam-B_5003 Pfam-B_8473 Pfam-B_5970 Pfam-B_11746 Pfam-B_11365 Pfam-B_2947 Pfam-B_2394 Pfam-B_7721 Pfam-B_259 Pfam-B_1167 Pfam-B_4271 Pfam-B_1017 Pfam-B_7146 Pfam-B_951 Pfam-B_262 Pfam-B_4354 Pfam-B_3028 Pfam-B_10 Pfam-B_1328 Pfam-B_10725 Pfam-B_802 Pfam-B_2431 Pfam-B_7511 Pfam-B_1565 Pfam-B_10682 Pfam-B_5396 Pfam-B_10092 Pfam-B_2373 Pfam-B_9159 Pfam-B_2512 Pfam-B_1348 Pfam-B_2737 Pfam-B_9158 Pfam-B_140 Pfam-B_139 Pfam-B_5170 Pfam-B_2768 Pfam-B_945 Pfam-B_10530 Pfam-B_7723 Pfam-B_9327 Pfam-B_938 Pfam-B_507

Pfam-B_9668 Pfam-B_6346 Pfam-B_6045 Pfam-B_1066 Pfam-B_11320 Pfam-B_11321 Pfam-B_6627 Pfam-B_276 Pfam-B_7956 Pfam-B_9989 Pfam-B_11774 Pfam-B_27 Pfam-B_2374 Pfam-B_2427 Pfam-B_1147 Pfam-B_6171 Pfam-B_5664 Pfam-B_2785 Pfam-B_3453 Pfam-B_2563 Pfam-B_519 Pfam-B_1102 Pfam-B_1095 Pfam-B_2422 Pfam-B_2040 UPAR_LY6 Pfam-B_580 Pfam-B_570 Pfam-B_3565 Pfam-B_1319 Pfam-B_10037Pfam-B_9986Pfam-B_9938Pfam-B_9657Pfam-B_947Pfam-B_9409Pfam-B_8844Pfam-B_8637Pfam-B_8563Pfam-B_8053Pfam-B_8030Pfam-B_7857Pfam-B_7457Pfam-B_6310Pfam-B_6280Pfam-B_6089Pfam-B_4993Pfam-B_498Pfam-B_4845Pfam-B_4836Pfam-B_3607Pfam-B_3521Pfam-B_3378Pfam-B_3366Pfam-B_3284Pfam-B_323Pfam-B_1322 Pfam-B_7598 Pfam-B_11468 Pfam-B_535 Pfam-B_3349 Pfam-B_3078 Pfam-B_1330 Pfam-B_8645 Pfam-B_5074 Pfam-B_1413 Pfam-B_10228 Pfam-B_6118 Pfam-B_421 Pfam-B_692 Pfam-B_976Pfam-B_6352 Pfam-B_2258 Pfam-B_11 actin Pfam-B_2800 Pfam-B_1588 Pfam-B_325 Pfam-B_3347 Pfam-B_1805 Pfam-B_6477 Pfam-B_3266 Pfam-B_4195 Pfam-B_7362 Pfam-B_4820 Pfam-B_2001 Pfam-B_3301 Pfam-B_2799 Pfam-B_2638 Pfam-B_1115 Pfam-B_4352 Pfam-B_10658 Pfam-B_7512 Pfam-B_8755 Pfam-B_321 Pfam-B_9964 Pfam-B_8753 Pfam-B_5107 Pfam-B_4575 Pfam-B_1854 Pfam-B_371 Pfam-B_3839 Pfam-B_1943 Pfam-B_10529 Pfam-B_2908 Pfam-B_183 Pfam-B_6353 Pfam-B_1950 Pfam-B_5071 Pfam-B_3844 Pfam-B_2523 Pfam-B_9344 Pfam-B_10644 Pfam-B_9320 Pfam-B_1153 Pfam-B_2301 Pfam-B_10231 Pfam-B_9423 Pfam-B_8418 Pfam-B_7212 Pfam-B_2480 Pfam-B_2404 Pfam-B_5390 Pfam-B_349 Pfam-B_3455 Pfam-B_3108 Pfam-B_565 Pfam-B_1563 Pfam-B_995 Pfam-B_3600 Pfam-B_2910 Pfam-B_2375 Pfam-B_6173 Pfam-B_9675 Pfam-B_3423 wnt Pfam-B_1982 Pfam-B_9588 Pfam-B_631 Pfam-B_10926 Pfam-B_3625 Pfam-B_2029 Pfam-B_4374 Pfam-B_7366 Pfam-B_3511 Pfam-B_1697 Pfam-B_5721 Pfam-B_8881 Pfam-B_6164 Pfam-B_9871 Pfam-B_5526 Pfam-B_6852 Pfam-B_3031 Pfam-B_3821 Pfam-B_9731 Pfam-B_816 Pfam-B_2741 Pfam-B_2088 Pfam-B_1151 Pfam-B_11366 Pfam-B_10660 Pfam-B_2770 Pfam-B_8894Pfam-B_7185 Pfam-B_1094Pfam-B_504 Cys-protease Pfam-B_6032 Pfam-B_5424 Pfam-B_4857 Pfam-B_6872 Pfam-B_5049 Pfam-B_639 Pfam-B_5896 fer4_NifH COX2 Pfam-B_2806Pfam-B_10932Pfam-B_4481Pfam-B_9410Pfam-B_8845Pfam-B_4183Pfam-B_8564Pfam-B_871Pfam-B_8932Pfam-B_7858Pfam-B_7454Pfam-B_420Pfam-B_239Pfam-B_9751Pfam-B_655Pfam-B_8937Pfam-B_11614Pfam-B_11404Pfam-B_3260Pfam-B_4579Pfam-B_5942Pfam-B_7112Pfam-B_884Pfam-B_5611Pfam-B_4366 Pfam-B_610 Pfam-B_6071 Pfam-B_5859 Pfam-B_5125 Pfam-B_11111 Pfam-B_1040 Pfam-B_1847 Pfam-B_3279Pfam-B_8707 Pfam-B_953 Pfam-B_6362 Pfam-B_1092 Pfam-B_6722 Pfam-B_4137 Pfam-B_3789 kazal Pfam-B_1362 Pfam-B_1583 Pfam-B_9346 Pfam-B_5389 Pfam-B_5663 Pfam-B_3918 Pfam-B_3286 Pfam-B_9674 Pfam-B_2769 Pfam-B_4767 Pfam-B_486 Pfam-B_67 Pfam-B_6031 Pfam-B_9878 Pfam-B_1338Pfam-B_9879 Pfam-B_3847 Pfam-B_7363 Pfam-B_3597 Pfam-B_9321 Pfam-B_7468 Pfam-B_2371 Pfam-B_4454 Pfam-B_6354 Pfam-B_9963 Pfam-B_9962 Pfam-B_1280 Pfam-B_5662 Pfam-B_8754 Pfam-B_3768 Pfam-B_10179 Pfam-B_11322Pfam-B_9421 Pfam-B_2738 Pfam-B_1503 Pfam-B_3503 Pfam-B_4995 Pfam-B_3790 Pfam-B_1568 Pfam-B_2014 Pfam-B_911 Pfam-B_5720 Pfam-B_6163 Pfam-B_4840 Pfam-B_2507 Pfam-B_1012 Pfam-B_10659 Pfam-B_3287 Pfam-B_7365 Pfam-B_3510 Pfam-B_111 Pfam-B_4821 Pfam-B_417 Pfam-B_470 Pfam-B_18 Pfam-B_2426 Pfam-B_6502 Pfam-B_8880 Pfam-B_9870 Pfam-B_6851 Pfam-B_3820 Pfam-B_10739 Pfam-B_1363 Pfam-B_5119 Pfam-B_994 Pfam-B_6355 Pfam-B_1180 Pfam-B_710 Pfam-B_11448Pfam-B_9762 Pfam-B_5318Pfam-B_4320 Pfam-B_3348Pfam-B_3079 Pfam-B_3076Pfam-B_3416 Pfam-B_2536Pfam-B_8653 Pfam-B_8604 Pfam-B_5064 Pfam-B_3858 Pfam-B_1412 Pfam-B_10225Pfam-B_195 Pfam-B_3207 Pfam-B_3231 Pfam-B_394 Pfam-B_1794 Pfam-B_7975 Pfam-B_3605 Pfam-B_3623 Pfam-B_4856 Pfam-B_221 Pfam-B_539 Pfam-B_1191 Pfam-B_1055 Pfam-B_10036Pfam-B_2813Pfam-B_7395Pfam-B_4434Pfam-B_8352Pfam-B_9408Pfam-B_2227Pfam-B_5615Pfam-B_4173Pfam-B_8049Pfam-B_8028Pfam-B_5280Pfam-B_7452Pfam-B_4983Pfam-B_6278Pfam-B_6088Pfam-B_4992Pfam-B_4437Pfam-B_11615Pfam-B_2056Pfam-B_3606Pfam-B_10328connexinPfam-B_3365Pfam-B_2218gln-synt Pfam-B_1321 Pfam-B_3920 Pfam-B_389 Pfam-B_6172 Pfam-B_3283 Pfam-B_3443 Pfam-B_3454 Pfam-B_3091 Pfam-B_564 Pfam-B_1103 Pfam-B_1378 Pfam-B_3512 Pfam-B_6605 Pfam-B_3856 Pfam-B_7160 Pfam-B_3582 Pfam-B_3645 Pfam-B_256 sigma70 Pfam-B_9238 Pfam-B_2082 Pfam-B_3709 Pfam-B_14 Pfam-B_1453

Pfam-B_6993 Pfam-B_274 Pfam-B_9873 Pfam-B_2217 Pfam-B_996 Pfam-B_586 Pfam-B_1587 Pfam-B_289 Pfam-B_1264 Pribosyltran Pfam-B_1221 Pfam-B_4832 Pfam-B_7540 Pfam-B_5944 Pfam-B_9318 Pfam-B_8012 Pfam-B_1442 Pfam-B_558 Pfam-B_10795 Pfam-B_4341 Pfam-B_4574 Pfam-B_6095 Pfam-B_1784 Pfam-B_7645 Pfam-B_4679 Pfam-B_9518 Pfam-B_6539 Pfam-B_3244 Pfam-B_2545 Pfam-B_621 Pfam-B_10381 Pfam-B_1952 Pfam-B_4030 Pfam-B_1778 Pfam-B_4826 Pfam-B_745 Pfam-B_3662 Pfam-B_3581 Pfam-B_10502 Pfam-B_2092 Pfam-B_3389 Pfam-B_3239 Pfam-B_9091 Pfam-B_115 Pfam-B_8711 Pfam-B_1507 Pfam-B_7340 Pfam-B_4884 Pfam-B_8708 Pfam-B_7586 Pfam-B_720 Pfam-B_5192 Pfam-B_5151 Pfam-B_7726 Pfam-B_6571 Pfam-B_6512 Pfam-B_7259 Pfam-B_7140KH-domain Pfam-B_3101 Pfam-B_3355 Pfam-B_3594 Pfam-B_2023 Pfam-B_1087Pfam-B_10638 Pfam-B_1359 Pfam-B_10026Pfam-B_10032 Pfam-B_1586 Pfam-B_1566 Pfam-B_4685 Pfam-B_8911 Pfam-B_8757 Pfam-B_8461 Pfam-B_7637 Pfam-B_7521 Pfam-B_7413 Pfam-B_7406 Pfam-B_7177 Pfam-B_7142 Pfam-B_6837 Pfam-B_5640 Pfam-B_4601 Pfam-B_4362 Pfam-B_3894 Pfam-B_3766 Pfam-B_3753 Pfam-B_3238 Pfam-B_3129 Pfam-B_2055 Pfam-B_7851 Pfam-B_1317 Pfam-B_4707 Pfam-B_11574 Pfam-B_2003 Pfam-B_3576 Pfam-B_7585 Pfam-B_719 Pfam-B_5191 Pfam-B_3908 Pfam-B_7725 Pfam-B_6570 Pfam-B_3566 Pfam-B_7258 Pfam-B_8746 Pfam-B_8748Pfam-B_10356 Pfam-B_4615 Pfam-B_10887 Pfam-B_209 Pfam-B_10717 Pfam-B_3579 Pfam-B_10031 Pfam-B_1754 Pfam-B_6490 Pfam-B_273 Pfam-B_9880 Pfam-B_9876 Pfam-B_1653 Pfam-B_9612 tsp_1 Pfam-B_9282Pfam-B_5092 Pfam-B_3760Pfam-B_722 Pfam-B_4704 Pfam-B_9338Pfam-B_9329 Pfam-B_2163Pfam-B_8013 Pfam-B_5458Pfam-B_665 Pfam-B_96 Pfam-B_6632 Pfam-B_8795Pfam-B_6609 Pfam-B_4340Pfam-B_635 Pfam-B_3046Pfam-B_6327 Pfam-B_898Pfam-B_6097 Pfam-B_721Pfam-B_3808 Pfam-B_7405 Pfam-B_7006 Pfam-B_6768 alpha-amylase Pfam-B_7178 neur Pfam-B_1257 Pfam-B_10902 Pfam-B_917 Pfam-B_1394 Pfam-B_1318 Pfam-B_1371 Pfam-B_3560 Pfam-B_3475 Pfam-B_6491 Pfam-B_2521 Pfam-B_867 Pfam-B_5062 Pfam-B_959 Pfam-B_920 Pfam-B_2069 Pfam-B_9339 Pfam-B_7583 Pfam-B_5006 Pfam-B_5189 Pfam-B_3906 Pfam-B_2556 Pfam-B_1735 Pfam-B_2398 Pfam-B_317 Pfam-B_2076 oxidored_nitro Pfam-B_10716Pfam-B_775 Pfam-B_2888 Pfam-B_10663 Pfam-B_1105 Pfam-B_3354 Pfam-B_6581 Pfam-B_11571 Pfam-B_4266 Pfam-B_2411 Pfam-B_5923 Pfam-B_1121 Pfam-B_7584 Pfam-B_3787 Pfam-B_5190 Pfam-B_3907 Pfam-B_2557 Pfam-B_4697 Pfam-B_3561 Pfam-B_2974 Pfam-B_8747 Pfam-B_11397 Pfam-B_265 Pfam-B_1806 Pfam-B_10506 Pfam-B_695 Pfam-B_2121 Pfam-B_5945 Pfam-B_10148 Pfam-B_4022 Pfam-B_1441 Pfam-B_4733 Pfam-B_10588 Pfam-B_3361 Pfam-B_230 Pfam-B_3445 Pfam-B_1396 Pfam-B_10030 Pfam-B_11421 Pfam-B_3061 Pfam-B_713 Pfam-B_7738 Pfam-B_4124 Pfam-B_5993 Pfam-B_6536 Pfam-B_2647 Pfam-B_1597 Pfam-B_3077 Pfam-B_10375 Pfam-B_1341 Pfam-B_1273 Pfam-B_1777 Pfam-B_4825 Pfam-B_5639 Pfam-B_3661 Pfam-B_10647 Pfam-B_94 Pfam-B_1243 Pfam-B_3388 Pfam-B_2644 Pfam-B_5821 Pfam-B_181 Pfam-B_3047 Pfam-B_1737 Pfam-B_11573Pfam-B_4831 Pfam-B_857 Pfam-B_6488 Pfam-B_476 Pfam-B_1190 Pfam-B_4344 Pfam-B_1410 Pfam-B_127 Pfam-B_567

Pfam-B_3 Pfam-B_2246 Pfam-B_444 Pfam-B_1559 Pfam-B_3977 Pfam-B_8913 Pfam-B_4040 Pfam-B_9946 Pfam-B_238 Pfam-B_345 Pfam-B_5684 Pfam-B_10073 Pfam-B_6046 Pfam-B_5121Pfam-B_7359 Pfam-B_3450 Pfam-B_1643 Pfam-B_529 Pfam-B_6754 Pfam-B_3620 Pfam-B_3782 Pfam-B_8346 Pfam-B_1431 Pfam-B_8452 Pfam-B_2753 Pfam-B_960 Pfam-B_5900 Pfam-B_4247 Pfam-B_11520 Pfam-B_2256 Pfam-B_891Pfam-B_76 Pfam-B_87Pfam-B_738 Pfam-B_8235Pfam-B_3889 Pfam-B_7531Pfam-B_3589 Pfam-B_6427Pfam-B_3793 Pfam-B_406 Pfam-B_2286 Pfam-B_785 phoslip Pfam-B_5592 Pfam-B_9258 Pfam-B_112 Pfam-B_1332 Pfam-B_3070 Pfam-B_290 Pfam-B_2718 Pfam-B_2093 Pfam-B_1432 Pfam-B_1426 Pfam-B_11253 gluts beta-lactamase Pfam-B_1131 Pfam-B_1044 Pfam-B_2137 Pfam-B_1981 Pfam-B_4331 Pfam-B_3045 Pfam-B_223 Pfam-B_10202 Pfam-B_1639 Pfam-B_11388 Pfam-B_10872 Pfam-B_103 Pfam-B_3815 Pfam-B_3246 Pfam-B_11894 Pfam-B_1930 Pfam-B_2833 Pfam-B_2533 Pfam-B_930 Pfam-B_5538 Pfam-B_5257 Pfam-B_7352 Pfam-B_589 Pfam-B_2544 Pfam-B_2321 Pfam-B_2938 Pfam-B_3110 Pfam-B_11726 Pfam-B_2448 Pfam-B_10893 Pfam-B_2186 Pfam-B_6115 Pfam-B_9386 Pfam-B_8741 Pfam-B_2663 Pfam-B_5307 Pfam-B_11458 Pfam-B_5 Pfam-B_11866 Pfam-B_6848 Pfam-B_5202 Pfam-B_2866 Pfam-B_3552 Pfam-B_10609 Pfam-B_10382 Pfam-B_10079 Pfam-B_577 Pfam-B_1272 Pfam-B_4338 Pfam-B_841 Pfam-B_5539 Pfam-B_5258 Pfam-B_7353 Pfam-B_78 Pfam-B_3089 Pfam-B_3440 Pfam-B_2048 Pfam-B_7782 Pfam-B_3699 Pfam-B_2449 Pfam-B_4730 Pfam-B_4123 Pfam-B_9790 Pfam-B_9387 thyroglobulin_1 Pfam-B_8589 Pfam-B_7924 Pfam-B_783 Pfam-B_7317 Pfam-B_6977 Pfam-B_6249 Pfam-B_7699 Pfam-B_4973 Pfam-B_433 Pfam-B_2406 Pfam-B_6436 Pfam-B_1132 Pfam-B_956 Pfam-B_2138 Pfam-B_1850 Pfam-B_9252 Pfam-B_2645 Pfam-B_1005 Pfam-B_2629 Pfam-B_1666 Pfam-B_8451 Pfam-B_7063 Pfam-B_6803 Pfam-B_5880 Pfam-B_2990 Pfam-B_5439 Pfam-B_1122 Pfam-B_9345 Pfam-B_753 Pfam-B_7503 Pfam-B_6047 Pfam-B_7476 Pfam-B_5022 Pfam-B_8489 Pfam-B_1773 Pfam-B_691 Pfam-B_6289 Pfam-B_9207 HSP70 Pfam-B_3748 Pfam-B_11469 Pfam-B_34 Pfam-B_807 Pfam-B_852 Pfam-B_1118 Pfam-B_2793 Pfam-B_10198 Pfam-B_186 Pfam-B_1752 Pfam-B_10873 Pfam-B_2487 Pfam-B_1287 Pfam-B_1129 Pfam-B_3929 Pfam-B_5537 Pfam-B_7818 Pfam-B_5059 Pfam-B_45 Pfam-B_2543 Pfam-B_2320 Pfam-B_11399 Pfam-B_1423 Pfam-B_2462 Pfam-B_1098 Pfam-B_10892 Pfam-B_1034 Pfam-B_1334 Pfam-B_1486 Pfam-B_8740 Pfam-B_2205 Pfam-B_4004 Pfam-B_11457 Pfam-B_393 Pfam-B_11865 Pfam-B_2958 Pfam-B_1263 Pfam-B_2865 Pfam-B_3539 Pfam-B_10602 Pfam-B_10379 Pfam-B_5122 Pfam-B_627 Pfam-B_4332 Pfam-B_7495 Pfam-B_5023 Pfam-B_5558 Pfam-B_3728 Pfam-B_4858 Pfam-B_593 Pfam-B_9257 Pfam-B_5906 Pfam-B_873 Pfam-B_5685 Pfam-B_5231 Pfam-B_3211 Pfam-B_1130 Pfam-B_2255 Pfam-B_5881Pfam-B_611 Pfam-B_86Pfam-B_662 Pfam-B_797Pfam-B_1415 Pfam-B_7527Pfam-B_10366 Pfam-B_5333Pfam-B_3791 Pfam-B_3796 Pfam-B_5123 Pfam-B_929 lipase Pfam-B_1962 Pfam-B_5520 Pfam-B_1910 Pfam-B_7389 Pfam-B_2948 Pfam-B_475 Pfam-B_4152 Pfam-B_11893 Pfam-B_628 Pfam-B_8453 Pfam-B_11676 Pfam-B_545 Pfam-B_11264 Pfam-B_5882 MHC_I Pfam-B_146 Pfam-B_1163 Pfam-B_1545 Pfam-B_1292 Pfam-B_5678 peroxidase Pfam-B_6201 Pfam-B_2356 Pfam-B_793 Pfam-B_5686 Pfam-B_2354 Pfam-B_2306 Pfam-B_9635 Pfam-B_6349 Pfam-B_9251 Pfam-B_7496 Pfam-B_7357 Pfam-B_9804 Pfam-B_1346 Pfam-B_487 Pfam-B_2936 Pfam-B_2036 Pfam-B_1108

Pfam-B_3368 Pfam-B_2602 Pfam-B_4433 Pfam-B_4435Pfam-B_2690 Pfam-B_1300Pfam-B_2222 Pfam-B_4848Pfam-B_4838 Pfam-B_3447Pfam-B_1838 Pfam-B_3178Pfam-B_11889 Pfam-B_2670Pfam-B_6228 Pfam-B_1606Pfam-B_8878 Pfam-B_2704COesterase Pfam-B_1366 Pfam-B_5980 Pfam-B_8888 Pfam-B_5155 Pfam-B_7672 Pfam-B_7243 Pfam-B_5384 Pfam-B_2612 Pfam-B_726 Pfam-B_224 Pfam-B_1769 Pfam-B_9470 Pfam-B_8885 Pfam-B_5156Pfam-B_1259 Pfam-B_5830 Pfam-B_786 Pfam-B_6708 Pfam-B_8275 Pfam-B_7747 Pfam-B_4501 Pfam-B_6940 Pfam-B_8585 Pfam-B_6961 Pfam-B_11783 Pfam-B_3430 Pfam-B_2455 Pfam-B_1407Pfam-B_2114Pfam-B_7356Pfam-B_1582Pfam-B_6577 Pfam-B_9575Pfam-B_9574 Pfam-B_9083Pfam-B_9082 Pfam-B_5774Pfam-B_9000 Pfam-B_8787Pfam-B_8786 Pfam-B_3259Pfam-B_8601 Pfam-B_217Pfam-B_8208 Pfam-B_8184Pfam-B_8180 Pfam-B_8319Pfam-B_7784 Pfam-B_5227Pfam-B_7737 Pfam-B_5188Pfam-B_7626 Pfam-B_7597Pfam-B_7596 Pfam-B_7145Pfam-B_7144 Pfam-B_6874Pfam-B_6873 Pfam-B_6209Pfam-B_6202 Pfam-B_7731Pfam-B_61 Pfam-B_5738Pfam-B_5736 Pfam-B_7657Pfam-B_572 Pfam-B_2615Pfam-B_5457 Pfam-B_7633Pfam-B_5346 Pfam-B_2571Pfam-B_5273 Pfam-B_5218Pfam-B_7724 Pfam-B_5096Pfam-B_5095 Pfam-B_7051Pfam-B_4941 Pfam-B_4893Pfam-B_4892 Pfam-B_490Pfam-B_489 Pfam-B_10317Pfam-B_4599 Pfam-B_4119Pfam-B_4120 Pfam-B_4081Pfam-B_4080 Pfam-B_3167Pfam-B_4052 Pfam-B_4175Pfam-B_3921 Pfam-B_7439Pfam-B_3859 Pfam-B_3832Pfam-B_3831 Pfam-B_7225Pfam-B_3799 Pfam-B_7125Pfam-B_3757 Pfam-B_11177Pfam-B_3588 Pfam-B_1054Pfam-B_3364 Pfam-B_4144Pfam-B_3242 Pfam-B_11891Pfam-B_2985 Pfam-B_2930Pfam-B_2929 Pfam-B_6444Pfam-B_2864 Pfam-B_7519Pfam-B_2534 Pfam-B_2935Pfam-B_2447 Pfam-B_9676Pfam-B_2312 Pfam-B_962Pfam-B_2300 Pfam-B_228Pfam-B_227 Pfam-B_9150Pfam-B_2251 Pfam-B_3218Pfam-B_2184 Pfam-B_1753Pfam-B_2049 Pfam-B_5438Pfam-B_202 Pfam-B_10424Pfam-B_1992 Pfam-B_8133Pfam-B_1836 Pfam-B_658Pfam-B_1786 Pfam-B_1969Pfam-B_1702 Pfam-B_2650Pfam-B_1632 Pfam-B_8587 Pfam-B_6962 Pfam-B_2979 Pfam-B_4438 Pfam-B_9288 Pfam-B_4660

Pfam-B_5978 FGF Pfam-B_1808 Pfam-B_8813 Pfam-B_2513 Pfam-B_11288 Pfam-B_9524 Pfam-B_3105 Pfam-B_2807 Pfam-B_6939 Pfam-B_3067 Pfam-B_366 Pfam-B_3174 Pfam-B_2662 Pfam-B_11832 Pfam-B_2064 Pfam-B_6065 Pfam-B_9656 Pfam-B_6060 Pfam-B_8848 Pfam-B_8840 Pfam-B_4849 Pfam-B_5274Pfam-B_11095Pfam-B_4839 Pfam-B_409 Pfam-B_482 Pfam-B_3275 Pfam-B_2474 Pfam-B_3100 Pfam-B_9985 Pfam-B_2709 Pfam-B_8883 Pfam-B_1367 Pfam-B_1929 Pfam-B_264 Pfam-B_2597

Pfam-B_4530Pfam-B_1513 Pfam-B_6256Pfam-B_1342 Pfam-B_155Pfam-B_128 Pfam-B_11796Pfam-B_11795Pfam-B_11578Pfam-B_11579 Pfam-B_4786Pfam-B_11545Pfam-B_11498Pfam-B_11497 Pfam-B_6789Pfam-B_11443 Pfam-B_6788Pfam-B_11442 Pfam-B_6787Pfam-B_11440Pfam-B_11363Pfam-B_11362 Pfam-B_525Pfam-B_11351 Pfam-B_6645Pfam-B_11034 Pfam-B_6564Pfam-B_10679 Pfam-B_3585Pfam-B_10666 Pfam-B_462Pfam-B_10268 Pfam-B_6306Pfam-B_10186Pfam-B_10129Pfam-B_10140Pfam-B_2551Pfam-B_7673Pfam-B_3861Pfam-B_7441Pfam-B_7418Pfam-B_7419Pfam-B_154Pfam-B_7327Pfam-B_1250Pfam-B_7257Pfam-B_2101Pfam-B_7232Pfam-B_11751Pfam-B_6921Pfam-B_8605Pfam-B_579Pfam-B_1580Pfam-B_5055Pfam-B_3792Pfam-B_5008Pfam-B_3705Pfam-B_4880Pfam-B_6836Pfam-B_4824Pfam-B_3871Pfam-B_3872Pfam-B_7459Pfam-B_3864Pfam-B_7398Pfam-B_3849Pfam-B_933Pfam-B_3716Pfam-B_808Pfam-B_3385Pfam-B_10967Pfam-B_3094Pfam-B_7292Pfam-B_3034Pfam-B_8325Pfam-B_2625Pfam-B_2198Pfam-B_2199Pfam-B_7176Pfam-B_2095 GATase Pfam-B_1866Pfam-B_1027Pfam-B_1842Pfam-B_319Pfam-B_1814Pfam-B_10311Pfam-B_1522Pfam-B_521Pfam-B_1117Pfam-B_6437Pfam-B_10399Pfam-B_1524Pfam-B_10316 recA Pfam-B_9860 Pfam-B_9714 Pfam-B_9292 Pfam-B_9259 Pfam-B_9147 Pfam-B_908 Pfam-B_897 Pfam-B_8403 Pfam-B_8237 Pfam-B_8112 Pfam-B_8111 Pfam-B_7835 Pfam-B_7658 Pfam-B_7614 Pfam-B_7589 Pfam-B_7579 Pfam-B_7515 Pfam-B_7497 Pfam-B_7494 Pfam-B_7470 Pfam-B_7448 Pfam-B_7438 Pfam-B_7316 Pfam-B_7277 Pfam-B_7254 Pfam-B_7216 Pfam-B_7190 Pfam-B_7187 Pfam-B_7104 Pfam-B_7003 Pfam-B_6955 Pfam-B_6870 Pfam-B_6869 Pfam-B_6864 Pfam-B_685 Pfam-B_6835 Pfam-B_6795 Pfam-B_6758 Pfam-B_6757 Pfam-B_6747 Pfam-B_6725 Pfam-B_6700 Pfam-B_6644 Pfam-B_6626 Pfam-B_6618 Pfam-B_6268 Pfam-B_6236 Pfam-B_6011 Pfam-B_5950 Pfam-B_5840 Pfam-B_5806 Pfam-B_5743 Pfam-B_5696 Pfam-B_5608 Pfam-B_5302 Pfam-B_5213 Pfam-B_5147 Pfam-B_5134 Pfam-B_5099 Pfam-B_5093 Pfam-B_4988 Pfam-B_4987 Pfam-B_4979 Pfam-B_4948 Pfam-B_4947 Pfam-B_4911 Pfam-B_4842 Pfam-B_4788

Pfam-B_11447Pfam-B_6790Pfam-B_11446Pfam-B_3640Pfam-B_11292Pfam-B_4345Pfam-B_617 S4 Pfam-B_1942Pfam-B_9863Pfam-B_1918Pfam-B_9465Pfam-B_2987Pfam-B_883Pfam-B_1766Pfam-B_853Pfam-B_2873Pfam-B_8356Pfam-B_7832Pfam-B_7833Pfam-B_6982Pfam-B_7683Pfam-B_9194Pfam-B_7581Pfam-B_5130Pfam-B_7547Pfam-B_450Pfam-B_7502Pfam-B_934Pfam-B_7094Pfam-B_3399Pfam-B_7060Pfam-B_11630Pfam-B_6876Pfam-B_6868Pfam-B_6867Pfam-B_1827Pfam-B_6731Pfam-B_8730Pfam-B_673Pfam-B_736Pfam-B_661Pfam-B_2401Pfam-B_6519Pfam-B_1951Pfam-B_6252Pfam-B_5997Pfam-B_5996Pfam-B_9290Pfam-B_5898Pfam-B_536Pfam-B_5836Pfam-B_2154Pfam-B_5300Pfam-B_7122Pfam-B_5124Pfam-B_8499Pfam-B_5103Pfam-B_11524Pfam-B_4812Pfam-B_5992Pfam-B_4790Pfam-B_6735Pfam-B_4776Pfam-B_4775Pfam-B_4774Pfam-B_1201Pfam-B_4633Pfam-B_7765Pfam-B_4309Pfam-B_10635Pfam-B_429Pfam-B_7510Pfam-B_3885 sodfe Pfam-B_3745Pfam-B_11589Pfam-B_3681Pfam-B_2828Pfam-B_304Pfam-B_6596Pfam-B_2943Pfam-B_4676Pfam-B_2883Pfam-B_6719Pfam-B_2595Pfam-B_3806Pfam-B_258Pfam-B_2593Pfam-B_2577Pfam-B_2044Pfam-B_2391Pfam-B_6070Pfam-B_2335Pfam-B_8803Pfam-B_2224Pfam-B_689Pfam-B_1923Pfam-B_320Pfam-B_1741Pfam-B_5566Pfam-B_171Pfam-B_6160Pfam-B_1262Pfam-B_11306Pfam-B_1193Pfam-B_2475Pfam-B_11903Pfam-B_11880Pfam-B_11873Pfam-B_11850Pfam-B_11854Pfam-B_2921Pfam-B_11745Pfam-B_6865Pfam-B_11613Pfam-B_6813Pfam-B_11512 Pfam-B_4787 Pfam-B_4763 Pfam-B_4759 Pfam-B_4597 Pfam-B_4581 Pfam-B_4528 Pfam-B_4512 Pfam-B_4413 Pfam-B_3993 Pfam-B_3923 Pfam-B_3922 Pfam-B_3838 Pfam-B_3812 Pfam-B_3749 Pfam-B_3679 Pfam-B_3362 Pfam-B_3190 Pfam-B_3171 Pfam-B_3132 Pfam-B_3093 Pfam-B_3085 Pfam-B_3073 Pfam-B_3043 Pfam-B_3010 Pfam-B_2954 Pfam-B_2657 Pfam-B_2495 Pfam-B_2369 Pfam-B_2338 Pfam-B_2287 Pfam-B_2171 Pfam-B_1762 Pfam-B_1749 Pfam-B_1619 Pfam-B_1609 Pfam-B_1548 Pfam-B_1205 Pfam-B_11779Pfam-B_11694Pfam-B_11663Pfam-B_11623Pfam-B_11601Pfam-B_11518Pfam-B_11302Pfam-B_11232Pfam-B_11194Pfam-B_11127Pfam-B_10962Pfam-B_10954Pfam-B_10849Pfam-B_1045Pfam-B_10363 gpdh

Pfam-B_3066Pfam-B_11330Pfam-B_6706Pfam-B_11259Pfam-B_4967Pfam-B_1104Pfam-B_10320Pfam-B_10305Pfam-B_822Pfam-B_10120

Figure 5.11. The Family network (using both CPs and non-CPs)

Red nodes are nodes found in the non-CP network. Green nodes are nodes found only in the CP-network. Blue edges denote edges found in the CP network. Pink edges are edges found only in the CP network. CHAPTER 5. CIRCULAR PATTERN MATCHING 131

Pfam-B_9905 Pfam-B_364 Pfam-B_1477 Pfam-B_4931 Pfam-B_2181 Pfam-B_474 Pfam-B_5740 Pfam-B_7664 Pfam-B_1663 Pfam-B_7299 Pfam-B_10259 Pfam-B_2848 Pfam-B_3341 Pfam-B_4314 Pfam-B_2826 Pfam-B_2622 Pfam-B_4494 Pfam-B_350 Pfam-B_1165 Pfam-B_6451 Pfam-B_3063Pfam-B_11500 Pfam-B_758 Pfam-B_7017 Pfam-B_363 Pfam-B_11762 Pfam-B_4134 Pfam-B_8734 Pfam-B_11918Pfam-B_1473Pfam-B_3075 Pfam-B_1532 Pfam-B_9935Pfam-B_10678 Pfam-B_2723 Pfam-B_1339 Pfam-B_1660 Pfam-B_4497Pfam-B_10147 Pfam-B_4212 Pfam-B_6384 Pfam-B_3463 Pfam-B_402 Pfam-B_9637 Pfam-B_9932 Pfam-B_199 Pfam-B_2887 Pfam-B_1448 Pfam-B_1694 Pfam-B_4213 Pfam-B_3424 Pfam-B_1680 Pfam-B_9638 Pfam-B_2342 Pfam-B_8684 Pfam-B_1474 Pfam-B_2995 Pfam-B_2977 Pfam-B_255

Pfam-B_2994 cytochrome_c Pfam-B_4424 Pfam-B_6534 Pfam-B_6195 Pfam-B_2362 Pfam-B_5187 Pfam-B_370 Pfam-B_7301Pfam-B_1291 Pfam-B_6532 TGF-beta Pfam-B_3901Pfam-B_7300 Pfam-B_10303 Pfam-B_5973 Pfam-B_9933 Pfam-B_3902 Pfam-B_7411 Pfam-B_1695 Pfam-B_9931 Pfam-B_1823 Pfam-B_4191Pfam-B_5313 Pfam-B_5293 Pfam-B_1041 Pfam-B_7923 Pfam-B_7880 Pfam-B_6565 Pfam-B_7971 Pfam-B_1308 Pfam-B_1826 Pfam-B_1509 ATP-synt_C Pfam-B_5334Pfam-B_3133 Pfam-B_6144 Pfam-B_2035Pfam-B_11763 Pfam-B_7994 Pfam-B_7873 Pfam-B_2581 Pfam-B_9590 Pfam-B_2211 Pfam-B_2157 Pfam-B_7425 Pfam-B_10965 Pfam-B_2834 Pfam-B_7504 copper-bind Pfam-B_1970 Pfam-B_136 hormone

Pfam-B_459 Pfam-B_1961 Pfam-B_3495 Pfam-B_460 Pfam-B_698 Pfam-B_4556 Pfam-B_9192 Pfam-B_11112 Pfam-B_10130 Pfam-B_1368 Pfam-B_1965 Pfam-B_8057 Pfam-B_4555 Pfam-B_166 Pfam-B_7669 Pfam-B_1516 Pfam-B_1985 Pfam-B_677 Pfam-B_1051 Pfam-B_8095 Pfam-B_2361 Pfam-B_9182Pfam-B_11833Pfam-B_3741 Pfam-B_7897 Pfam-B_893 Pfam-B_1197 Pfam-B_9043 Pfam-B_5794 Pfam-B_2434 Pfam-B_964 Pfam-B_2034 Pfam-B_2 Pfam-B_8515 Pfam-B_9650 Pfam-B_6964Pfam-B_5571 Pfam-B_6965Pfam-B_6040 Pfam-B_4696 Pfam-B_6485 Pfam-B_7118 Pfam-B_9436 Pfam-B_8891 Pfam-B_10142 Pfam-B_9649 Pfam-B_5048 Pfam-B_9045 Pfam-B_11046 Pfam-B_9180 Pfam-B_1893 Pfam-B_10145 Pfam-B_2339 Pfam-B_6291Pfam-B_10144 Pfam-B_11151 Pfam-B_8445 Pfam-B_3494 Pfam-B_6963Pfam-B_8163Pfam-B_11775 Pfam-B_6484Pfam-B_5959 Pfam-B_10105 Pfam-B_8594 Pfam-B_9044 Pfam-B_626 Pfam-B_1592 Pfam-B_697 Pfam-B_4277 Pfam-B_11279 Pfam-B_10131 Pfam-B_7117Pfam-B_8516 Pfam-B_5584 Pfam-B_483 Pfam-B_5385 Pfam-B_1433 Pfam-B_7119 Pfam-B_1971Pfam-B_10143Pfam-B_5335 Pfam-B_8545 Pfam-B_3153 Pfam-B_92 Pfam-B_1728 Pfam-B_4492 Pfam-B_9041 Pfam-B_6180 Pfam-B_6513 Pfam-B_1299 Pfam-B_963Pfam-B_147 Pfam-B_3883 Pfam-B_6424 Pfam-B_684 Pfam-B_2715Pfam-B_9046 Pfam-B_914 Pfam-B_3371 Pfam-B_8548 Pfam-B_2305 Pfam-B_4592 Pfam-B_9047 Pfam-B_4593 Pfam-B_1301 Pfam-B_3529Pfam-B_2383 Pfam-B_3324 Pfam-B_2120 Pfam-B_9042 Pfam-B_7055Pfam-B_4418 Pfam-B_10289 zf-C2H2 Pfam-B_5814 Pfam-B_5687 Pfam-B_8919 Pfam-B_5815 Pfam-B_5747 Pfam-B_5759 Pfam-B_89 Pfam-B_7399 Pfam-B_5679 Pfam-B_10290 Pfam-B_5703 Pfam-B_11014 Pfam-B_10814 Pfam-B_1927 Pfam-B_4591 Pfam-B_6415 Pfam-B_8546 Pfam-B_10288 Pfam-B_5768 Pfam-B_5758 Pfam-B_4224 Pfam-B_1650 Pfam-B_5360 Pfam-B_5812 Pfam-B_8676 Pfam-B_3322 Pfam-B_5383 efhand Pfam-B_6053 Pfam-B_1685 Pfam-B_1658 Pfam-B_750 Pfam-B_5669 Pfam-B_8927 Pfam-B_4142 Pfam-B_6563 Pfam-B_1457 Pfam-B_4618 Pfam-B_1659 Pfam-B_1908 cellulase Pfam-B_4093 Pfam-B_6423 Pfam-B_10358 E1-E2_ATPase Pfam-B_9648 Pfam-B_6650 laminin_Nterm Pfam-B_7462 Pfam-B_4619 Pfam-B_4255 Pfam-B_2608 Pfam-B_3877 Pfam-B_5435 Pfam-B_5361 Pfam-B_2801 Pfam-B_8224

Pfam-B_4083 Pfam-B_8096 Pfam-B_3943 Pfam-B_4646 Pfam-B_4084 Pfam-B_10625 Pfam-B_1440 Pfam-B_348 Pfam-B_4482 Pfam-B_11130 Pfam-B_3200 Pfam-B_4721 Pfam-B_4036 Pfam-B_11256

Pfam-B_7505 Pfam-B_4489 Pfam-B_6617 Pfam-B_6419 Pfam-B_801Pfam-B_4898 Pfam-B_6146 Pfam-B_11192 Pfam-B_6622 Pfam-B_3461Pfam-B_220 Pfam-B_3972 Pfam-B_4645 Pfam-B_330 Pfam-B_1990

Pfam-B_332 Pfam-B_813 Pfam-B_6721 Pfam-B_575 Pfam-B_150 Pfam-B_270 Pfam-B_8869 Pfam-B_6535 Pfam-B_5583 Pfam-B_3109 Pfam-B_8380 Pfam-B_10813 Pfam-B_2694 Pfam-B_4894 Pfam-B_1782 Pfam-B_1316Pfam-B_1289 Pfam-B_5348 Pfam-B_671 Pfam-B_3971 Pfam-B_3714 Pfam-B_3002 Pfam-B_1240 Pfam-B_4632 Pfam-B_3715 Pfam-B_5714 Pfam-B_178 Pfam-B_9943 Pfam-B_1628 Pfam-B_1223 Pfam-B_10299 Pfam-B_283 Pfam-B_8974 Pfam-B_10298 Pfam-B_9140 Pfam-B_20 Pfam-B_9973 Pfam-B_2773Pfam-B_10294 Pfam-B_196 Pfam-B_8763 Pfam-B_6294 Pfam-B_2846 Pfam-B_9662 Pfam-B_359 adh_zinc Pfam-B_2425 Pfam-B_804 Pfam-B_3303 Pfam-B_32 Pfam-B_1349 Pfam-B_8960 Pfam-B_6287Pfam-B_4594 Pfam-B_748Pfam-B_4240 Pfam-B_10711 Pfam-B_4483 Pfam-B_2310 Pfam-B_6578 Pfam-B_159 Pfam-B_11021 Pfam-B_3837 Pfam-B_1351 Pfam-B_1574 Pfam-B_9984 Pfam-B_9838 Pfam-B_9661 Pfam-B_10947 Pfam-B_968 Pfam-B_5842 Pfam-B_7350 Pfam-B_1220 Pfam-B_969 Pfam-B_913 Pfam-B_5718 pilin Pfam-B_3223 Pfam-B_2377 Pfam-B_1110 Pfam-B_2684 Pfam-B_10956Pfam-B_9972 Pfam-B_2324 Pfam-B_3528 Pfam-B_1644 Pfam-B_2792 Pfam-B_3516 Pfam-B_10373 Pfam-B_5793 Pfam-B_3237 Pfam-B_3446 Pfam-B_1502 Pfam-B_3074Pfam-B_2158 Pfam-B_3543 Pfam-B_10708 Pfam-B_830 Pfam-B_9138 Pfam-B_40 Pfam-B_3593 Pfam-B_3467 Pfam-B_10727 Pfam-B_58 Pfam-B_8547 Pfam-B_6225 Pfam-B_3342 Pfam-B_5841 Pfam-B_37 Pfam-B_8544 fibrinogen_C Pfam-B_993 Pfam-B_9139 Pfam-B_1079 Pfam-B_2812 Pfam-B_10726 Pfam-B_9380 cadherin Pfam-B_2790 Pfam-B_11159 Pfam-B_918 Pfam-B_590 Pfam-B_6380 Pfam-B_2600 Pfam-B_7640 Pfam-B_160 Pfam-B_1182 Pfam-B_6426 Pfam-B_11160 Pfam-B_9971 Pfam-B_6248 Pfam-B_831 Pfam-B_141Pfam-B_1089 Pfam-B_1354 Pfam-B_8371 Pfam-B_4140 Pfam-B_919 laminin_B Pfam-B_1189Pfam-B_8910 Pfam-B_372 Pfam-B_10955 Pfam-B_6580 Pfam-B_10419 Pfam-B_4547 pou Pfam-B_9981 laminin_EGF Pfam-B_6544 Pfam-B_11846 Pfam-B_7235 Pfam-B_10368 Pfam-B_3933 Pfam-B_5964 Pfam-B_3196 Pfam-B_2188 Pfam-B_2601 Pfam-B_9683 Pfam-B_1593 Pfam-B_1941 Pfam-B_7642 Pfam-B_10370Pfam-B_10371 Pfam-B_9983 Pfam-B_5199 Pfam-B_30 Pfam-B_3946 Pfam-B_8524Pfam-B_5109Pfam-B_8954Pfam-B_3995 Pfam-B_5540 Pfam-B_5087Pfam-B_8466Pfam-B_1144 Pfam-B_4127Pfam-B_5045 Pfam-B_3172 Pfam-B_11100 Pfam-B_10719Pfam-B_7997 Pfam-B_368 Pfam-B_425 Pfam-B_9980 Pfam-B_7453 ins Cys_knot Pfam-B_1414 Pfam-B_4269 Pfam-B_3197 Pfam-B_1143Pfam-B_9684Pfam-B_8376 Pfam-B_2304 Pfam-B_7680 Pfam-B_100 Pfam-B_4006 Pfam-B_11098 Pfam-B_6845Pfam-B_5198Pfam-B_10527 Pfam-B_7816 Pfam-B_3570 STphosphatase Pfam-B_948 Pfam-B_6216 Pfam-B_4631 Pfam-B_337 Pfam-B_9968 Pfam-B_2126 Pfam-B_5514 Pfam-B_7686 Pfam-B_4474 Pfam-B_6858 Pfam-B_982 Pfam-B_4104Pfam-B_9356 Pfam-B_1123 Pfam-B_1846 Pfam-B_5750 Pfam-B_814 Pfam-B_7056 Pfam-B_5196Pfam-B_4126Pfam-B_8373 Pfam-B_1124 Pfam-B_9527 Pfam-B_8034 Pfam-B_7681 Pfam-B_3173 Pfam-B_4945 Pfam-B_8157 Pfam-B_2262 Pfam-B_5413 Pfam-B_3492 Pfam-B_2555 Pfam-B_1613 Pfam-B_4475 Pfam-B_9682 Pfam-B_5197Pfam-B_8443 Pfam-B_6098 Pfam-B_324 Pfam-B_122 Pfam-B_6619 Pfam-B_2630Pfam-B_4128 Pfam-B_7932Pfam-B_3540 Pfam-B_6859 Pfam-B_3097 Pfam-B_3517 Pfam-B_11038 Pfam-B_8150 Pfam-B_4130 Pfam-B_5510 Pfam-B_2416 Pfam-B_1726 Pfam-B_7999Pfam-B_8460 Pfam-B_358Pfam-B_3557Pfam-B_667 Pfam-B_8383 Pfam-B_5084 Pfam-B_367 Pfam-B_1067 Pfam-B_2172 Pfam-B_11859 Pfam-B_5341 Pfam-B_11001 Pfam-B_3994 Pfam-B_8005 Pfam-B_4659 Pfam-B_5340 Pfam-B_447 filament AAA Pfam-B_3942 Pfam-B_5317 Pfam-B_9605 Pfam-B_10297 Pfam-B_11280 Pfam-B_5085 Pfam-B_8370Pfam-B_1822Pfam-B_6217 Pfam-B_8955 Pfam-B_10904 Pfam-B_1800 sugar_tr Pfam-B_2131 Pfam-B_5214 hemopexin Pfam-B_11590Pfam-B_8379 Pfam-B_5339 HTH_1 Pfam-B_3059 Pfam-B_2008 Pfam-B_10722Pfam-B_10723 DAG_PE-bind Pfam-B_8205 Pfam-B_3249 Pfam-B_607 Pfam-B_174 Pfam-B_1510 Pfam-B_7751 Pfam-B_5541Pfam-B_8372 Pfam-B_1075 Pfam-B_8447 Pfam-B_4139 Pfam-B_3767 Pfam-B_5864 homeobox Pfam-B_5412 Pfam-B_8801 Pfam-B_5710 Pfam-B_3222 Pfam-B_4129 Pfam-B_4658 Pfam-B_5512 Pfam-B_6660 Pfam-B_2797 Pfam-B_3302 Pfam-B_6916 Pfam-B_1106 Pfam-B_1282 Pfam-B_1622 Pfam-B_8523 Pfam-B_415 Pfam-B_8302 Pfam-B_8076 Pfam-B_9276 Pfam-B_5411 Pfam-B_7091 Pfam-B_10718 Pfam-B_10374 Pfam-B_455 laminin_G Pfam-B_7668 Pfam-B_1026 Pfam-B_2139 Pfam-B_2278 Pfam-B_1302 fn2 Pfam-B_10799 Pfam-B_73 Pfam-B_9606 Pfam-B_3314 Pfam-B_6428 Pfam-B_8242 Pfam-B_4237 Pfam-B_7665 Pfam-B_8238 Pfam-B_1286 Pfam-B_7667 Pfam-B_10827 wap Pfam-B_8397 Pfam-B_3758 Pfam-B_7756 Pfam-B_954 Pfam-B_2900 Pfam-B_6915 Pfam-B_4206 Pfam-B_1705 Pfam-B_9767 Pfam-B_5133 Pfam-B_5350 Pfam-B_4749 Pfam-B_8249 Pfam-B_3969 Pfam-B_8204 Pfam-B_3474 Pfam-B_4813 Pfam-B_6033Pfam-B_3195 Pfam-B_843 Pfam-B_7561 Pfam-B_8409 Pfam-B_9561 Pfam-B_9942 Pfam-B_2350 Pfam-B_880 Pfam-B_5131 Pfam-B_2836 Pfam-B_11068 Pfam-B_2682 Pfam-B_1216 Pfam-B_7654 Pfam-B_10703 ABC_tran Pfam-B_2463 Pfam-B_2408 Pfam-B_4700 Pfam-B_2488 Pfam-B_10824 Pfam-B_8206 Pfam-B_844 Pfam-B_2173 Pfam-B_10817 Pfam-B_727 Pfam-B_1060Pfam-B_7274 Pfam-B_8245 Pfam-B_701 Pfam-B_2351 Pfam-B_10189 Pfam-B_6021 Pfam-B_5028 Pfam-B_1084 Pfam-B_909 Pfam-B_3037 Pfam-B_2178 Pfam-B_309 Pfam-B_7666 Pfam-B_4490 PH Pfam-B_3574 Pfam-B_4521 Pfam-B_11075 Pfam-B_1693 Pfam-B_2756 Pfam-B_915 Pfam-B_1326Kunitz_BPTI Pfam-B_7252 Pfam-B_9692 Pfam-B_1871 Pfam-B_1177 Pfam-B_456 HSP20 Pfam-B_1157 RuBisCO_small Pfam-B_335 Pfam-B_3083 Pfam-B_1876 Pfam-B_1512 Pfam-B_8241 Pfam-B_8912 Pfam-B_11092 Pfam-B_1068 Pfam-B_6417 Pfam-B_1731 Pfam-B_1029 Pfam-B_97 trefoil Pfam-B_9562 fn3 Pfam-B_9903 Pfam-B_4199 vwc Pfam-B_5709 Pfam-B_2635 Pfam-B_1684 Pfam-B_3633 tRNA-synt_2 Pfam-B_1297 Pfam-B_2763 Pfam-B_4088 Pfam-B_4045 Pfam-B_10993 Pfam-B_686Pfam-B_4499 Pfam-B_688 Pfam-B_2611Pfam-B_9563 Pfam-B_2150 Pfam-B_5708 Pfam-B_11898 Pfam-B_687 vwa trypsin Pfam-B_9901 Pfam-B_4908 Pfam-B_1043 Pfam-B_10818 fn1 Pfam-B_4724 Pfam-B_4243 Pfam-B_4909 fer4 Pfam-B_4044 Pfam-B_2923 Pfam-B_10826 Pfam-B_1508 Pfam-B_6489 lectin_c sushi Pfam-B_2777 Pfam-B_5182 Pfam-B_6988 Pfam-B_8862 Pfam-B_10822 Pfam-B_7286 Pfam-B_5650 Pfam-B_10819 Pfam-B_606 Pfam-B_6987 Pfam-B_818 Pfam-B_3935 Pfam-B_5428 Pfam-B_8717 Pfam-B_3999 Pfam-B_11897 Pfam-B_7263 Pfam-B_3193 EGF Pfam-B_1649 Pfam-B_6378 Pfam-B_453 Pfam-B_4174 Pfam-B_764 Pfam-B_7653 Pfam-B_7719 Pfam-B_768 Pfam-B_5154 rrm zf-C4 Pfam-B_975Pfam-B_887Pfam-B_8468 Pfam-B_4701 Pfam-B_2963 Pfam-B_5808 Pfam-B_870 Pfam-B_10543 Pfam-B_9437 ank Pfam-B_7311 Pfam-B_4852 Pfam-B_1725 Pfam-B_6515 Pfam-B_2028 Pfam-B_5810 Pfam-B_7329 Pfam-B_889 Pfam-B_3873 Pfam-B_2245 Pfam-B_5041 Pfam-B_3829 Pfam-B_5264 Pfam-B_7718 Pfam-B_547 Pfam-B_5809 Pfam-B_390 Pfam-B_7717 Pfam-B_3429 Pfam-B_4074 Pfam-B_5075 Pfam-B_210 Pfam-B_9439 Pfam-B_5591 Pfam-B_5757 Pfam-B_3017 Pfam-B_5077 Pfam-B_11741 Pfam-B_10702 Pfam-B_461 Pfam-B_7830 Pfam-B_1113 Pfam-B_1056 Pfam-B_8944 Pfam-B_479 Y_phosphatase Pfam-B_616 Pfam-B_2541 Pfam-B_10707 RIP Pfam-B_7372 Pfam-B_1323 Pfam-B_3898Pfam-B_5741 Pfam-B_1465 Pfam-B_5078 Pfam-B_51 Pfam-B_11324 Pfam-B_7557 Pfam-B_5960 Pfam-B_3978 Pfam-B_7182 Pfam-B_8527Pfam-B_7451 Pfam-B_59 Pfam-B_4252 Pfam-B_2980Pfam-B_10946 Pfam-B_2540 Pfam-B_3899 Pfam-B_3575 Pfam-B_2659 Pfam-B_2978 Pfam-B_645 Pfam-B_6023 Pfam-B_1840Pfam-B_8975 Pfam-B_7559 Pfam-B_6724 Pfam-B_2568 Pfam-B_523 Pfam-B_1656 Pfam-B_7456 Pfam-B_4202 Pfam-B_3713 Pfam-B_2575 Pfam-B_11808Pfam-B_8003 Pfam-B_5101 Pfam-B_3165 Pfam-B_2981Pfam-B_11794 Pfam-B_4420 Pfam-B_5278 Pfam-B_7375 Pfam-B_520 notch Pfam-B_436 Pfam-B_4607 Pfam-B_11241 Pfam-B_1589 Pfam-B_9558 Pfam-B_5386 Pfam-B_11037 Pfam-B_1979 Pfam-B_8918 Pfam-B_8094 Pfam-B_2111 Pfam-B_6943 Pfam-B_275 Pfam-B_5404 Pfam-B_477Pfam-B_864Pfam-B_3882 serpin Pfam-B_2982 Pfam-B_3848 Pfam-B_3124Pfam-B_2386Pfam-B_11026 Pfam-B_381Pfam-B_8257 Pfam-B_8170 Pfam-B_11331 Pfam-B_3102 Pfam-B_9609 Pfam-B_2225 Pfam-B_1618Pfam-B_752Pfam-B_83 Pfam-B_9006 Pfam-B_9936 Pfam-B_1136 Pfam-B_1617 Pfam-B_1025 Pfam-B_9897 Pfam-B_11900 Pfam-B_11593 Pfam-B_8103 Pfam-B_11017 Pfam-B_4563 Pfam-B_7373Pfam-B_9846 Pfam-B_10270 Pfam-B_2983 lipocalinPfam-B_7558 Pfam-B_11771 Pfam-B_7376 Pfam-B_1467 Pfam-B_11715 Pfam-B_8197Pfam-B_8230Pfam-B_10217 Pfam-B_8017Pfam-B_2161Pfam-B_8145 Pfam-B_2228 Pfam-B_3842 Pfam-B_7809 Pfam-B_11009 Pfam-B_2241 Pfam-B_9904 Pfam-B_5069 Pfam-B_633 Pfam-B_1466 Pfam-B_10099Pfam-B_9750 Pfam-B_2515 Pfam-B_5072Pfam-B_10566 Pfam-B_773 Pfam-B_1817 Pfam-B_1063 Pfam-B_747 Pfam-B_2204 Pfam-B_9052 Pfam-B_7374 Pfam-B_7384 Pfam-B_11332 Pfam-B_11717 Pfam-B_10170Pfam-B_2855Pfam-B_6600 Pfam-B_796 Pfam-B_9848 Pfam-B_3360 Pfam-B_7708 Pfam-B_2314 Pfam-B_4072 Pfam-B_2701 Pfam-B_5073 Pfam-B_7646 Pfam-B_8987 Pfam-B_1816 Pfam-B_8143 Pfam-B_1331 Pfam-B_8195 Pfam-B_9911 Pfam-B_5079 Pfam-B_767 Pfam-B_2445 Pfam-B_1206 Pfam-B_1815 Pfam-B_1107 Pfam-B_4552 Pfam-B_8583 Pfam-B_4939 Pfam-B_3072 Pfam-B_5656 Pfam-B_3050 Pfam-B_10925 Pfam-B_11716 Pfam-B_93 Pfam-B_3509 Pfam-B_201Pfam-B_5070 Pfam-B_2234 Pfam-B_5129 Pfam-B_7635 Pfam-B_1525 Pfam-B_8808Pfam-B_2802 Pfam-B_794 Pfam-B_1819 Pfam-B_4550 Pfam-B_2716 Pfam-B_546 Pfam-B_974 Pfam-B_4726 Pfam-B_424 Pfam-B_3981 Pfam-B_8749Pfam-B_3081 Pfam-B_534 Pfam-B_3618 Pfam-B_6595 Pfam-B_1818 Pfam-B_8986 Pfam-B_4807Pfam-B_7392 Pfam-B_3707 Pfam-B_2060 Pfam-B_3289 Pfam-B_3048Pfam-B_705 Pfam-B_8989 Pfam-B_2483 Pfam-B_2223Pfam-B_2697 Pfam-B_1133 Pfam-B_2002 Pfam-B_7545 Pfam-B_11714Pfam-B_1621 Pfam-B_3580 Pfam-B_6414Pfam-B_8518 SH3 Pfam-B_11592 Pfam-B_305 Pfam-B_3080 Pfam-B_11000Pfam-B_4513 Pfam-B_4028 Pfam-B_5316 Pfam-B_7442 Pfam-B_229 Pfam-B_4610 Pfam-B_3730 Pfam-B_2913SH2 Pfam-B_506 Pfam-B_6076Pfam-B_3436 Pfam-B_6909 dsrm Pfam-B_4924 Pfam-B_3887 Pfam-B_5730 Pfam-B_1829 Pfam-B_473 Pfam-B_1713Pfam-B_8207 Pfam-B_3082Pfam-B_4077 Pfam-B_1956 Pfam-B_1883 Pfam-B_550 Pfam-B_7053 pkinase Pfam-B_403Pfam-B_1238 Pfam-B_2675 Pfam-B_8105 Pfam-B_573 Pfam-B_1960 Pfam-B_2016 Pfam-B_4061 Pfam-B_2912 Pfam-B_7546 Pfam-B_2643 Pfam-B_548Pfam-B_7727Pfam-B_11753 Pfam-B_11735Pfam-B_878 Pfam-B_6594 Pfam-B_3012 Pfam-B_3015 Pfam-B_9407 Pfam-B_4244Pfam-B_7952 Pfam-B_8890 Pfam-B_2516Pfam-B_8108 Pfam-B_5734Pfam-B_5375 Pfam-B_11198Pfam-B_8824 Pfam-B_2696 Pfam-B_1830 Pfam-B_8026 Pfam-B_7970 Pfam-B_3773 Pfam-B_7334 Pfam-B_8825 ig Pfam-B_7010 Pfam-B_10357 Pfam-B_1481 Pfam-B_1283 Pfam-B_3706 Pfam-B_1612 Pfam-B_1831 Pfam-B_492 Pfam-B_10471 Pfam-B_4712 Pfam-B_8120 Pfam-B_8139 Pfam-B_3663 Pfam-B_4451 Pfam-B_303 Pfam-B_11199 Pfam-B_3916 Pfam-B_699Pfam-B_2061Pfam-B_1855 Pfam-B_10833 Pfam-B_1654Pfam-B_6122 Pfam-B_4058 Pfam-B_9431 Pfam-B_1828Pfam-B_1683 Pfam-B_8192 Pfam-B_8153 Pfam-B_8407 Pfam-B_1614 Pfam-B_7033 Pfam-B_2550Pfam-B_770 Pfam-B_1967 Pfam-B_2747 Pfam-B_1807 Pfam-B_5406Pfam-B_5515 Pfam-B_888 Pfam-B_4648 Pfam-B_3014 Pfam-B_7595 Pfam-B_9274 Pfam-B_7659 Pfam-B_2519 Pfam-B_3817 Pfam-B_11591 Pfam-B_338 Pfam-B_1620 Pfam-B_3187 Pfam-B_8161 Pfam-B_1668 Pfam-B_675 Pfam-B_711 Pfam-B_8228 Pfam-B_7069 Pfam-B_8027Pfam-B_7560 Pfam-B_8982 Pfam-B_11809 Pfam-B_1016 Pfam-B_7378 Pfam-B_2219 Pfam-B_5421 Pfam-B_4051 Pfam-B_6949 Pfam-B_1996 Pfam-B_5397 Pfam-B_2604 Pfam-B_4976 Pfam-B_1720 Pfam-B_7797 Pfam-B_9267 Pfam-B_2494 Pfam-B_4075 Pfam-B_11830 Pfam-B_2598 ldl_recept_a Pfam-B_2166 lectin_legA Pfam-B_3192 Pfam-B_3169 lectin_legB Pfam-B_9264 Pfam-B_7155 Pfam-B_1436 Pfam-B_6944 Pfam-B_2775 Pfam-B_2220 Pfam-B_9255 Pfam-B_7926Pfam-B_5799Pfam-B_8142 Pfam-B_8193Pfam-B_2839 Pfam-B_3960 Pfam-B_11323 Pfam-B_3427 Pfam-B_3121 Pfam-B_2949 Pfam-B_4538Pfam-B_5398 Pfam-B_7826 Pfam-B_11301 Pfam-B_10473 Pfam-B_3013 Pfam-B_4085 Pfam-B_863 Pfam-B_2063 Pfam-B_1469 Pfam-B_7951 Pfam-B_5909 Pfam-B_8760 Pfam-B_6957 Pfam-B_1255Pfam-B_3257Pfam-B_8141 Pfam-B_8227 Pfam-B_10010Pfam-B_7798 Pfam-B_11240 Pfam-B_6338 Pfam-B_407 Pfam-B_3816 Pfam-B_1774 Pfam-B_8603 Pfam-B_3071Pfam-B_4057 Pfam-B_3708 Pfam-B_8175 Pfam-B_10241 Pfam-B_5857 Pfam-B_8138 Pfam-B_4076Pfam-B_2506 Pfam-B_6341 Pfam-B_10435Pfam-B_7034 Pfam-B_8194Pfam-B_700 Pfam-B_10244 Pfam-B_308 Pfam-B_10021 Pfam-B_6308Pfam-B_2840 Pfam-B_2518 Pfam-B_7677 Pfam-B_102 Pfam-B_9667 Pfam-B_352 Pfam-B_4059 Pfam-B_5604 Pfam-B_9191 Pfam-B_5908 Pfam-B_7008 Pfam-B_10236 Pfam-B_307 Pfam-B_5754 Pfam-B_5136 Pfam-B_7052 Pfam-B_2570Pfam-B_728 Pfam-B_4062Pfam-B_4806Pfam-B_1821 cyclin Pfam-B_7009 Pfam-B_4606 myosin_head Pfam-B_7049Pfam-B_11743 Pfam-B_2766 Pfam-B_7174 Pfam-B_2827 Pfam-B_2122 Pfam-B_11766Pfam-B_219 Pfam-B_10440 Pfam-B_334 Pfam-B_3897 Pfam-B_7981 Pfam-B_9309 Pfam-B_9506 Pfam-B_7849 Pfam-B_9008 Pfam-B_4060 Pfam-B_7848 Pfam-B_8196 Pfam-B_7846 Pfam-B_105Pfam-B_733 Pfam-B_7845 Pfam-B_3170 Pfam-B_336 Pfam-B_10748 Pfam-B_28 Pfam-B_1134 Pfam-B_9007 Pfam-B_1047 Pfam-B_7745 Pfam-B_7844Pfam-B_7982Pfam-B_554 Pfam-B_735 Pfam-B_123 Pfam-B_646 Pfam-B_2907 Pfam-B_225 Pfam-B_7843 Pfam-B_6586 Pfam-B_3824 Pfam-B_10758 Pfam-B_135 Pfam-B_109 Pfam-B_395Pfam-B_13 Pfam-B_2583 Pfam-B_6257 Pfam-B_5986Pfam-B_2906Pfam-B_10751 Pfam-B_1373 Pfam-B_233 Pfam-B_5742 Pfam-B_595 Pfam-B_422Pfam-B_138 Pfam-B_7569 Pfam-B_16 Pfam-B_467 Pfam-B_1739 Pfam-B_766 Pfam-B_6478 Pfam-B_2026 Pfam-B_3953 Pfam-B_1706 Pfam-B_167 Pfam-B_10752Pfam-B_2419 Pfam-B_2421 Pfam-B_5323 Pfam-B_6480 Pfam-B_885 Pfam-B_10243 Pfam-B_22 Pfam-B_4Pfam-B_6588 Pfam-B_2838 Pfam-B_9241 Pfam-B_4997Pfam-B_826 Pfam-B_6340 Pfam-B_3903 Pfam-B_2285 Pfam-B_6330 Pfam-B_3905 Pfam-B_2729 Pfam-B_11487 Pfam-B_10237 Pfam-B_961 Pfam-B_1601 Pfam-B_3508 Pfam-B_1600 Pfam-B_3505 Pfam-B_827 Pfam-B_6323 Pfam-B_10884 Pfam-B_1204 Pfam-B_6342 Pfam-B_2284 Pfam-B_7573Pfam-B_5142 Pfam-B_5141 Pfam-B_2842 il8 Pfam-B_7196 Pfam-B_1418 Pfam-B_3904 Pfam-B_6332Pfam-B_1521 Pfam-B_10218 Pfam-B_5163 Pfam-B_623 Pfam-B_1595 Pfam-B_267Pfam-B_5533 Pfam-B_865 Pfam-B_5143 Pfam-B_7197 Pfam-B_1758 Pfam-B_438 Pfam-B_437 Pfam-B_5144 Pfam-B_11490 Pfam-B_5304 Pfam-B_10242 Pfam-B_1868 Pfam-B_3507 Pfam-B_3852 Pfam-B_4713Pfam-B_9 Pfam-B_3442Pfam-B_8758 Pfam-B_7750 Pfam-B_765 Pfam-B_189 Pfam-B_9698 Pfam-B_7078 Pfam-B_8696 Pfam-B_2844 Pfam-B_9130 Pfam-B_466 Pfam-B_5765 Pfam-B_5162 Pfam-B_2841Pfam-B_505 Pfam-B_3444 Pfam-B_605 Pfam-B_5766 Pfam-B_423 Pfam-B_9508 Pfam-B_7749Pfam-B_1665 Pfam-B_10246 Pfam-B_1755 Pfam-B_8722Pfam-B_7771 Pfam-B_7712 Pfam-B_7648Pfam-B_7647 Pfam-B_2547 7tm_1Pfam-B_4258 Pfam-B_2703 Pfam-B_5176 Pfam-B_10245 Pfam-B_7426 Pfam-B_3612 Pfam-B_10760 Pfam-B_9359 Pfam-B_2578 Pfam-B_2546 Pfam-B_4150Pfam-B_725 Pfam-B_3315 Pfam-B_8581 Pfam-B_819 Pfam-B_302 Pfam-B_130 Pfam-B_6728 mito_carr Pfam-B_6184 Pfam-B_5643 Pfam-B_2420 Pfam-B_989 Pfam-B_6183 Pfam-B_5175 Pfam-B_9459 Pfam-B_3945 Pfam-B_9916 Pfam-B_3952 Pfam-B_3939 Pfam-B_9129 Pfam-B_901 Pfam-B_9918 Pfam-B_3944 ldl_recept_b Pfam-B_4394 Pfam-B_1788Pfam-B_4225 Pfam-B_3962Pfam-B_7526 Pfam-B_533 Pfam-B_1895 Pfam-B_331 Pfam-B_11625 Pfam-B_9303Pfam-B_1913 Pfam-B_2400 Pfam-B_7616 Pfam-B_940 Pfam-B_5376 Pfam-B_1698 Pfam-B_5820 Pfam-B_8580 Pfam-B_5186 Pfam-B_42 Pfam-B_7077 Pfam-B_5234 Pfam-B_2327 Pfam-B_3963 Pfam-B_1258 Pfam-B_1265 Pfam-B_718 Pfam-B_49 Pfam-B_990 Pfam-B_6924 Pfam-B_6723 Pfam-B_4665 Pfam-B_1125 Pfam-B_11644Pfam-B_7282 Pfam-B_376 Pfam-B_9285 Pfam-B_4666 Pfam-B_5167 Pfam-B_5184 Pfam-B_3288Pfam-B_11584 Pfam-B_6548Pfam-B_4164 Pfam-B_313 Pfam-B_9305 Pfam-B_8759 Pfam-B_7111 Pfam-B_2140 Pfam-B_10581Pfam-B_5173Pfam-B_7233 Pfam-B_125 Pfam-B_11419 Pfam-B_2619 Pfam-B_5128 Pfam-B_749 Pfam-B_6878 Pfam-B_5007 Pfam-B_7321Pfam-B_7322 Pfam-B_4664 Pfam-B_9365 Pfam-B_11744 Pfam-B_1388Pfam-B_936Pfam-B_375Pfam-B_2618 Pfam-B_312 Pfam-B_4046 Pfam-B_7769 Pfam-B_110 Pfam-B_5153Pfam-B_10558 Pfam-B_7210 Pfam-B_1556 Pfam-B_1931 Pfam-B_7107 Pfam-B_8727 Pfam-B_2884 Pfam-B_3256 Pfam-B_11104 Pfam-B_3095 Pfam-B_341 Pfam-B_7772 Pfam-B_284 Pfam-B_1112 Pfam-B_9455 ras Pfam-B_7713 Pfam-B_11643 Pfam-B_9919Pfam-B_3689 Pfam-B_5185 Pfam-B_1602 Pfam-B_7900 Pfam-B_10563Pfam-B_10580Pfam-B_3703 Pfam-B_6923Pfam-B_3940 Pfam-B_41 Pfam-B_8322 Pfam-B_11927 Pfam-B_10239 Pfam-B_4709 Pfam-B_3611Pfam-B_2504 Pfam-B_11638 Pfam-B_8019 Pfam-B_3586 Pfam-B_11164 Pfam-B_8313 Pfam-B_1186 Pfam-B_10562 Pfam-B_5168 Pfam-B_4688Pfam-B_10668Pfam-B_4918Pfam-B_3294Pfam-B_8089 GTP_EFTU Pfam-B_6847 Pfam-B_10477 Pfam-B_4279 Pfam-B_38 Pfam-B_4714 Pfam-B_7022 Pfam-B_11123 Pfam-B_1234 Pfam-B_65 Pfam-B_10578 Pfam-B_6547 Pfam-B_6475 Pfam-B_3310Pfam-B_2409 Pfam-B_4684 Pfam-B_10479 Pfam-B_3564 Pfam-B_8090 Pfam-B_2240 Pfam-B_10617 Pfam-B_118 Pfam-B_5797 hormone_rec Pfam-B_4881 Pfam-B_2479 Pfam-B_6671Pfam-B_4533Pfam-B_4532 Pfam-B_4851 Pfam-B_4652Pfam-B_397 Pfam-B_4919Pfam-B_4922 Pfam-B_271Pfam-B_1281Pfam-B_3549 Pfam-B_9215 Pfam-B_6667 Pfam-B_9331 Pfam-B_10478 Pfam-B_6339 Pfam-B_3853 Pfam-B_10082 Pfam-B_4920 toxin Pfam-B_1498 Pfam-B_288 Pfam-B_2469 Pfam-B_9057 Pfam-B_4917 Pfam-B_4689 Pfam-B_4722Pfam-B_3854 Pfam-B_4397 Pfam-B_2992 Pfam-B_1365 Pfam-B_563 Pfam-B_2989 Pfam-B_4653 Pfam-B_10476 Pfam-B_9049 Pfam-B_4082 Pfam-B_4149Pfam-B_4921 Pfam-B_597 Pfam-B_881 Pfam-B_2058 Pfam-B_2784 Pfam-B_1677 Pfam-B_3212 Pfam-B_4350 Pfam-B_3295 Pfam-B_3299 Pfam-B_2530 Pfam-B_2333 Pfam-B_1997 Pfam-B_1490 Pfam-B_4148 Pfam-B_3530 Pfam-B_2789 Pfam-B_1127 Pfam-B_3862 Pfam-B_297 Pfam-B_1241 Pfam-B_6004 Pfam-B_3587 Pfam-B_99 Pfam-B_3263Pfam-B_1824 Pfam-B_1489 Pfam-B_1445 Pfam-B_8578 Pfam-B_4847 Pfam-B_11642 Pfam-B_1357Pfam-B_2125 Pfam-B_6959 Pfam-B_5116Pfam-B_2759 Pfam-B_5407 Pfam-B_8323 rhv Pfam-B_10852 Pfam-B_3879Pfam-B_4396 Pfam-B_2764 Pfam-B_4829 Pfam-B_6003 Pfam-B_4356 Pfam-B_552 Pfam-B_4395 Pfam-B_2531 Pfam-B_2013 Pfam-B_6674 Pfam-B_2524 rvt Pfam-B_2477 Pfam-B_3120 Pfam-B_9548Pfam-B_2945 Pfam-B_2296 Pfam-B_1802 Pfam-B_2299 Pfam-B_10807 Pfam-B_11860 Pfam-B_9543 Pfam-B_11892Pfam-B_1267 Pfam-B_5999 Pfam-B_4415 Pfam-B_4357 Pfam-B_6802 Pfam-B_2298 Pfam-B_9310Pfam-B_3373 Pfam-B_11828 Pfam-B_6000 Pfam-B_11905 Pfam-B_170 Pfam-B_10556Pfam-B_9533 Pfam-B_2289 Pfam-B_4398 C2 Pfam-B_11928 Pfam-B_2297 Pfam-B_11467 Pfam-B_6511Pfam-B_9531 Pfam-B_7486 Pfam-B_4403 Pfam-B_7023 Pfam-B_6992 Pfam-B_1260Pfam-B_3068 Pfam-B_6801 Pfam-B_1813 Pfam-B_10756 Pfam-B_10607Pfam-B_10574 Pfam-B_5246 Pfam-B_2292 Pfam-B_400 Pfam-B_2403 Pfam-B_9861 Pfam-B_774 Pfam-B_7483 Pfam-B_9494Pfam-B_1361 Pfam-B_6610 Pfam-B_659Pfam-B_6012 Pfam-B_2405 Pfam-B_4110 Pfam-B_4536 Pfam-B_6017 Pfam-B_10597 Pfam-B_2823 Pfam-B_11658 Pfam-B_1640 Pfam-B_2290 Pfam-B_1804 Pfam-B_11657 Pfam-B_6018 Pfam-B_10604 Pfam-B_10594 Pfam-B_1832 Pfam-B_2961

Pfam-B_5115 Pfam-B_4535 Pfam-B_3686

Pfam-B_7480

Pfam-B_7479 Pfam-B_3062 Pfam-B_11906

Pfam-B_3876 Pfam-B_2528 Pfam-B_2529 Pfam-B_6006 Pfam-B_757Pfam-B_134 Pfam-B_859 Pfam-B_11637 rnaseH Pfam-B_2011Pfam-B_339Pfam-B_5114Pfam-B_7482 Pfam-B_495 Pfam-B_5112 Pfam-B_2669 Pfam-B_2758 Pfam-B_478 Pfam-B_2012Pfam-B_1803 rvp Pfam-B_10601Pfam-B_85 Pfam-B_5207 Pfam-B_630 Pfam-B_7481 Pfam-B_555 Pfam-B_3482 Pfam-B_674 Pfam-B_939 Pfam-B_10433 Pfam-B_2880 Pfam-B_1152 Pfam-B_7477 Pfam-B_2015 Pfam-B_3878 Pfam-B_582 fer2 Pfam-B_497 Pfam-B_19 zf-CCHCPfam-B_31 Pfam-B_5654 Pfam-B_620 Pfam-B_496 Pfam-B_411 Pfam-B_11209 Pfam-B_5208 Pfam-B_5113 Pfam-B_108 Pfam-B_81 Pfam-B_7487 Pfam-B_10600 Pfam-B_3874 Pfam-B_4675 Pfam-B_955 Pfam-B_7488 Pfam-B_2892 Pfam-B_10585 Pfam-B_5117 Pfam-B_5368 Pfam-B_2527 Pfam-B_530 Pfam-B_7489 Pfam-B_6516 Pfam-B_3069 Pfam-B_2893 Pfam-B_23 Pfam-B_8724Pfam-B_60 Pfam-B_2676 Pfam-B_1533 Pfam-B_6523 Pfam-B_10608 Pfam-B_5272 Pfam-B_75Pfam-B_382Pfam-B_10589 Pfam-B_10576 Pfam-B_452 Pfam-B_3875 Pfam-B_1531 COX1 Pfam-B_10572 Pfam-B_2881 Pfam-B_3572 Pfam-B_10918 Pfam-B_10587 Pfam-B_7491 Pfam-B_4409

Pfam-B_1730

Pfam-B_2877 Pfam-B_5266 Pfam-B_1928 Pfam-B_9220 Pfam-B_10575 Pfam-B_3880 Pfam-B_10573 Pfam-B_2698 Pfam-B_5653 Pfam-B_7836 Pfam-B_3422 Pfam-B_2878 Pfam-B_8209 Pfam-B_7838 Pfam-B_9219 Pfam-B_1922 Pfam-B_6062 Pfam-B_8210 Pfam-B_8181 Pfam-B_585 Pfam-B_2291

Pfam-B_1729 Pfam-B_2288 Pfam-B_2274Pfam-B_11329 Pfam-B_7201 Pfam-B_8334 Pfam-B_2147 Pfam-B_6041 Pfam-B_8528 Pfam-B_9544 Pfam-B_4663 Pfam-B_4402 Pfam-B_9545Pfam-B_896 Pfam-B_10403 Pfam-B_5491 applePfam-B_9398 Pfam-B_8332 HLH Pfam-B_2006 Pfam-B_2639Pfam-B_2464 Pfam-B_540Pfam-B_9546 Pfam-B_2007Pfam-B_1360 Pfam-B_8483 Pfam-B_3122 aminotran Pfam-B_9028 Pfam-B_1022 Pfam-B_4261 Pfam-B_4008 Pfam-B_4888 Pfam-B_5780 Pfam-B_5490 Pfam-B_5958 Pfam-B_6182 Pfam-B_1801

Pfam-B_1820 Pfam-B_1835

Pfam-B_10599

Pfam-B_3985Pfam-B_6304 Pfam-B_9928Pfam-B_4070Pfam-B_1839 Pfam-B_8126 Pfam-B_5265 Pfam-B_8333 Pfam-B_11737Pfam-B_4163Pfam-B_3137 Pfam-B_7355 Pfam-B_8494 Pfam-B_440 Pfam-B_6221

Pfam-B_8182 Pfam-B_481 Pfam-B_5971 Pfam-B_2148Pfam-B_2149 Pfam-B_3136

Pfam-B_729 Pfam-B_3987 Pfam-B_731 Pfam-B_6303 Pfam-B_8471 Pfam-B_3988Pfam-B_2340

Pfam-B_730 Pfam-B_2191 Pfam-B_2572Pfam-B_292

Pfam-B_2084 Pfam-B_941

Pfam-B_5161

Pfam-B_7354

Pfam-B_528 Pfam-B_1390Pfam-B_2998Pfam-B_1391

Pfam-B_5267

Pfam-B_2085 Pfam-B_5590Pfam-B_5589

Pfam-B_4472 Pfam-B_8312 Pfam-B_5746 Pfam-B_10509 Pfam-B_3469 Pfam-B_4887 Pfam-B_6467 Pfam-B_222 Pfam-B_327 Pfam-B_10450 Pfam-B_2454 Pfam-B_641 Pfam-B_2986 hormone2 Pfam-B_7775 Pfam-B_1399 Pfam-B_291 Pfam-B_7619 Pfam-B_10452 Pfam-B_10637 aldedh Pfam-B_2105 Pfam-B_9819 Pfam-B_9991 Pfam-B_11276 Pfam-B_839 Pfam-B_5803 Pfam-B_10281 Pfam-B_8364 Pfam-B_4102 Pfam-B_707 Pfam-B_7041 Pfam-B_4453 Pfam-B_10451 Pfam-B_56 Pfam-B_10517 Pfam-B_4107 Pfam-B_287 Pfam-B_2815 Pfam-B_5670 Pfam-B_4339 Pfam-B_5517Pfam-B_5509 Pfam-B_7237 Pfam-B_9822 Pfam-B_2816 Pfam-B_3162 Pfam-B_5628 Pfam-B_8305 Pfam-B_10512 Pfam-B_232 Pfam-B_4024 Pfam-B_560Pfam-B_2988 Pfam-B_9992 Pfam-B_3164 Pfam-B_900 Pfam-B_8368 Pfam-B_613 Pfam-B_8667 Pfam-B_8360 Pfam-B_6846 Pfam-B_3412 Pfam-B_5804 Pfam-B_4641 Pfam-B_7416 Pfam-B_5471Pfam-B_5482 Pfam-B_10508 Pfam-B_427 Pfam-B_4984 Pfam-B_2926 Pfam-B_4960 Pfam-B_207 Pfam-B_1111 Pfam-B_832 Pfam-B_6231 Pfam-B_4913 Pfam-B_835 Pfam-B_615 Pfam-B_1090 Pfam-B_405 Pfam-B_612 Pfam-B_1562 Pfam-B_3724 Pfam-B_5138 Pfam-B_983 Pfam-B_1848 Pfam-B_1046 Pfam-B_2927 Pfam-B_4961 Pfam-B_10640 Pfam-B_1406 Pfam-B_1395 Pfam-B_5019 Pfam-B_591 Pfam-B_10741 Pfam-B_1232 Pfam-B_11142 Pfam-B_4637 Pfam-B_11501 7tm_2 Pfam-B_9820 Pfam-B_295 Pfam-B_1333 Pfam-B_1233 Pfam-B_234 Pfam-B_6537 Pfam-B_838 Pfam-B_2243 Pfam-B_3668 Pfam-B_1936 Pfam-B_5516Pfam-B_5513 Pfam-B_10510 Pfam-B_6492 Pfam-B_3149 Pfam-B_7417 Pfam-B_5487 Pfam-B_6736 Pfam-B_902 Pfam-B_3404 Zn_clus Pfam-B_9817 Pfam-B_10511 Pfam-B_6819 Pfam-B_6233 Pfam-B_9146 Pfam-B_9993 Pfam-B_868 Pfam-B_6822 Pfam-B_70 Pfam-B_5387 Pfam-B_361 Pfam-B_1783 Pfam-B_666Pfam-B_741 Pfam-B_3434 Pfam-B_3667 Pfam-B_3022 Pfam-B_9358 Pfam-B_52 Pfam-B_9565 Pfam-B_8326 Pfam-B_294 Pfam-B_2953 Pfam-B_388 Pfam-B_8179 Pfam-B_6363 Pfam-B_4703 Pfam-B_4902 Pfam-B_5627 Pfam-B_1575 Pfam-B_6066 Pfam-B_1348 Pfam-B_5652 Pfam-B_9818 Pfam-B_1863 Pfam-B_36 Pfam-B_11649 Pfam-B_7904 Pfam-B_895 tRNA-synt_1 Pfam-B_2187 Pfam-B_133ketoacyl-synt Pfam-B_2071 Pfam-B_6946 Pfam-B_6174 Pfam-B_173 Pfam-B_1278 Pfam-B_1946 Pfam-B_296 Pfam-B_4873 Pfam-B_1761 Pfam-B_682 Pfam-B_2453Pfam-B_1552 Pfam-B_3846 Pfam-B_776 Pfam-B_53Pfam-B_177 Pfam-B_35 Pfam-B_3843 Pfam-B_7774 Pfam-B_3426 Pfam-B_723 Pfam-B_3737 Pfam-B_8623 Pfam-B_244 Pfam-B_2499 Pfam-B_9821 Pfam-B_4822 Pfam-B_6733 Pfam-B_6863 Pfam-B_1554 Pfam-B_5934 Pfam-B_6862 Pfam-B_8699 Pfam-B_3411 response_reg Pfam-B_3738 Pfam-B_1877 Pfam-B_7062 Pfam-B_7108 Pfam-B_430 Pfam-B_1553 Pfam-B_1785 Pfam-B_9646 Pfam-B_50 Pfam-B_4407 Pfam-B_408 Pfam-B_2072 Pfam-B_2748 Pfam-B_8650 Pfam-B_1759 Pfam-B_360 Pfam-B_4364 Pfam-B_1224 Pfam-B_24 Pfam-B_7501Pfam-B_1591 Pfam-B_10688 Pfam-B_241 Pfam-B_517 Pfam-B_126 Pfam-B_5377 Pfam-B_4186 Pfam-B_1487 Pfam-B_9357 Pfam-B_1484 Pfam-B_3523 Pfam-B_1249 Pfam-B_7132 Pfam-B_8670 Pfam-B_5671 Pfam-B_1812 Pfam-B_966 Pfam-B_6875 Pfam-B_117 Pfam-B_8 Pfam-B_2928 Pfam-B_3117 Pfam-B_2112 heme_1 Pfam-B_3670Pfam-B_17 Pfam-B_3154 Pfam-B_5972 Pfam-B_1626 Pfam-B_2858 cpn60 Pfam-B_235 Pfam-B_200 Pfam-B_2560 Pfam-B_10677 oxidored_molyb Pfam-B_2755 Pfam-B_3860 Pfam-B_6810 Pfam-B_7383 Pfam-B_926 Pfam-B_6044 Pfam-B_4548 Pfam-B_1853 Pfam-B_927 Pfam-B_11143 Pfam-B_3721 Pfam-B_4143 Pfam-B_1446 Pfam-B_11841 Pfam-B_809 Pfam-B_9143 Pfam-B_4804Pfam-B_4803 Pfam-B_3736 Pfam-B_3881 subtilase Pfam-B_1485 Pfam-B_11511 Pfam-B_791 Pfam-B_647 Pfam-B_7285 Pfam-B_362 Pfam-B_4285 Pfam-B_1114 Pfam-B_2956 Pfam-B_2109 Pfam-B_7098 Pfam-B_2664 Pfam-B_5239 Pfam-B_2648 Pfam-B_3672 Pfam-B_7248Pfam-B_7246 Pfam-B_1627 Pfam-B_208 Pfam-B_6778 Pfam-B_10216 Pfam-B_8662 Pfam-B_7773 Pfam-B_355 Pfam-B_6334 Pfam-B_8680 Pfam-B_8493 pro_isomerase Pfam-B_4286 Pfam-B_8719 Pfam-B_1579 Pfam-B_2591 Pfam-B_916 Pfam-B_977 Pfam-B_2443 Pfam-B_5974 Pfam-B_4011 Pfam-B_11414 Pfam-B_1676 Pfam-B_374 Pfam-B_301 Pfam-B_3041 Pfam-B_5835 Pfam-B_204 Pfam-B_1520 Pfam-B_583 Pfam-B_1978 Pfam-B_1551 Pfam-B_268 Pfam-B_72 Pfam-B_9144 Pfam-B_9702 Pfam-B_404 Pfam-B_10555 Pfam-B_6128 Pfam-B_709 Pfam-B_9385 p450 vwd Pfam-B_2017 Pfam-B_3500 Pfam-B_6335 Pfam-B_800 Pfam-B_8430Pfam-B_33 Pfam-B_101 Pfam-B_6129 Pfam-B_10633 Pfam-B_629 Pfam-B_3393 Pfam-B_849 Pfam-B_965 Pfam-B_2370 Pfam-B_2942 Pfam-B_469 Pfam-B_1454 Pfam-B_848 Pfam-B_6333 Pfam-B_3330 Pfam-B_250 Pfam-B_988 Pfam-B_576Pfam-B_3240 Pfam-B_10215 Pfam-B_1682 Pfam-B_1449Pfam-B_322 Pfam-B_11415 Pfam-B_1001 Pfam-B_3138 Pfam-B_10214 Pfam-B_2742 Pfam-B_4662 Pfam-B_9145 Pfam-B_9811 Pfam-B_4154 Pfam-B_1860 Pfam-B_6336 Pfam-B_5000 Pfam-B_11491Pfam-B_4801 Pfam-B_11140 Pfam-B_549 Pfam-B_39 Pfam-B_11141 Pfam-B_347 Pfam-B_1100 Pfam-B_6337 Pfam-B_10775 Pfam-B_4510 Pfam-B_9859 Pfam-B_4194 Pfam-B_668Pfam-B_1285 Pfam-B_11412 Pfam-B_203 Pfam-B_2442 Pfam-B_799 Pfam-B_7195 Pfam-B_11162 Pfam-B_706 Pfam-B_2950

Pfam-B_9623 Pfam-B_970 Pfam-B_249 Pfam-B_4509 Pfam-B_10226 Pfam-B_9383 Pfam-B_9624 Pfam-B_1569

Pfam-B_7068 Pfam-B_1558 Pfam-B_4508

Pfam-B_11163 Pfam-B_7977 Pfam-B_8284 Pfam-B_9382 Pfam-B_3087 Pfam-B_8419 Pfam-B_925 Pfam-B_4159 Pfam-B_9998 Pfam-B_8439

Pfam-B_6298 Pfam-B_4138 Pfam-B_6237 Pfam-B_1629 Pfam-B_8508 Pfam-B_8289 Pfam-B_4511 Pfam-B_10159 Pfam-B_2616 Pfam-B_4561 bZIP Pfam-B_1973 Pfam-B_10168Pfam-B_10167 Pfam-B_4789 Pfam-B_6849 Pfam-B_5926 Pfam-B_6604 Pfam-B_4141 Pfam-B_503 Pfam-B_6773 Pfam-B_10157 Pfam-B_1199 Pfam-B_10163 Pfam-B_418 Pfam-B_8358 Pfam-B_10802 Pfam-B_69 oxidored_fadPfam-B_4560 Pfam-B_8423 Pfam-B_2365 Pfam-B_10164 Pfam-B_1972Pfam-B_7796 Pfam-B_7736 Pfam-B_5225 Pfam-B_3956Pfam-B_419 Pfam-B_1200 Pfam-B_333 Pfam-B_10166 Pfam-B_9347 Pfam-B_10169

Pfam-B_1155 Pfam-B_9159 Pfam-B_5618 Pfam-B_4050 Pfam-B_3006 Pfam-B_10790 Pfam-B_10318 Pfam-B_3807 Pfam-B_3107 Pfam-B_3186 Pfam-B_10463 Pfam-B_6347 Pfam-B_2124 Pfam-B_9523 Pfam-B_6827 Pfam-B_739 Pfam-B_8067 Pfam-B_513 Pfam-B_10816 Pfam-B_3525 Pfam-B_5781 Pfam-B_10190 Pfam-B_8507 Pfam-B_538 Pfam-B_2320 Pfam-B_3925 Pfam-B_2276 Pfam-B_6576 Pfam-B_1174 Pfam-B_2685 Pfam-B_846 Pfam-B_9778 Pfam-B_9227 Pfam-B_953 Pfam-B_7344 Pfam-B_10693 Pfam-B_3457 Pfam-B_9537 Pfam-B_2141 Pfam-B_10464 Pfam-B_3951Pfam-B_7722 Pfam-B_3110 Pfam-B_9964 Pfam-B_3090 Pfam-B_9830 Pfam-B_9317 Pfam-B_5379 Pfam-B_5157 Pfam-B_4312Pfam-B_3345 Pfam-B_379 Pfam-B_984 Pfam-B_9547 Pfam-B_8043 Pfam-B_8127 Pfam-B_6701 Pfam-B_3343 Pfam-B_7762 Pfam-B_7603 Pfam-B_10825 Pfam-B_10699 Pfam-B_10466 Pfam-B_740 Pfam-B_4570 Pfam-B_10162 Pfam-B_445 Pfam-B_10793 Pfam-B_3184 Pfam-B_10821 Pfam-B_2642 Pfam-B_3926 Pfam-B_4156 Pfam-B_1581 Pfam-B_1254 Pfam-B_6572 Pfam-B_2724 Pfam-B_8252 Pfam-B_213 Pfam-B_1213 Pfam-B_3603 Pfam-B_602 Pfam-B_922 Pfam-B_1088 Pfam-B_5171 Pfam-B_1792 Pfam-B_4401 Pfam-B_1175 Pfam-B_622 Pfam-B_2870 Pfam-B_4817 Pfam-B_663 Pfam-B_4227 Pfam-B_9831 Pfam-B_10319 Pfam-B_153 Pfam-B_151 Pfam-B_4698 Pfam-B_2512 Pfam-B_7606 Pfam-B_11864 Pfam-B_182 Pfam-B_5217 Pfam-B_7007 Pfam-B_164 Pfam-B_4310 Pfam-B_3268 Pfam-B_1337 Pfam-B_10788 Pfam-B_318 Pfam-B_1420 Pfam-B_3950 Pfam-B_1061 Pfam-B_3185 Pfam-B_5647 Pfam-B_165 Pfam-B_9319 Pfam-B_493 Pfam-B_9962 Pfam-B_650 Pfam-B_3183 Pfam-B_10465 Pfam-B_180 Pfam-B_2896 Pfam-B_2428 Pfam-B_882 Pfam-B_1697 Pfam-B_604 Pfam-B_2209 Pfam-B_9832 Pfam-B_10391 Pfam-B_7511 Pfam-B_6606 Pfam-B_2430 Pfam-B_10823 sigma54 Pfam-B_9030 Pfam-B_342 Pfam-B_7632 Pfam-B_9168 Pfam-B_1167 Pfam-B_5396 Pfam-B_3305 Pfam-B_10785 Pfam-B_1062Pfam-B_1678 Pfam-B_703 Pfam-B_1212 Pfam-B_2295 Pfam-B_3949 Pfam-B_802 Pfam-B_7615 Pfam-B_6601 Pfam-B_1320 DNA_pol Pfam-B_2229 Pfam-B_509 Pfam-B_2895 Pfam-B_3613 Pfam-B_242 Pfam-B_4686 Pfam-B_2252 Pfam-B_7600 Pfam-B_4925 Pfam-B_2266 Pfam-B_1791 Pfam-B_5399 Pfam-B_214 Pfam-B_206 Pfam-B_7625 Pfam-B_9230 Pfam-B_842 Pfam-B_4311 Pfam-B_6015 Pfam-B_191 Pfam-B_7602 Pfam-B_8109 Pfam-B_4569 Pfam-B_10161 Pfam-B_1950 Pfam-B_3440 Pfam-B_7628 Pfam-B_2273 Pfam-B_5619 Pfam-B_9833 Pfam-B_2760 Pfam-B_7755 Pfam-B_862 Pfam-B_4414 Pfam-B_2429 Pfam-B_1147 Pfam-B_514 S12 Pfam-B_6991 Pfam-B_2476 Pfam-B_1421 Pfam-B_1327 Pfam-B_2762 Pfam-B_5004 photoRC Pfam-B_179 Pfam-B_8167 Pfam-B_2431 Pfam-B_7593 Pfam-B_205 Pfam-B_7136 Pfam-B_6834 Pfam-B_2382 Pfam-B_1423 Pfam-B_1404 Pfam-B_2725 Pfam-B_5846 Pfam-B_8799 Pfam-B_9415 Pfam-B_55 Pfam-B_2373 Pfam-B_7607 Pfam-B_1179 Pfam-B_491 Pfam-B_2294 Pfam-B_2565 Pfam-B_10469Pfam-B_464 Pfam-B_11566 Pfam-B_1745 Pfam-B_10698 Pfam-B_1150 Pfam-B_11805 Pfam-B_6599 Pfam-B_2850 Pfam-B_1356 Pfam-B_163 Pfam-B_9158 Pfam-B_1248 Pfam-B_3182 Pfam-B_10253 Pfam-B_3064 Pfam-B_7768 Pfam-B_9963 Pfam-B_2514 Pfam-B_107 Pfam-B_694 Pfam-B_4400 Pfam-B_4963 Pfam-B_494 Pfam-B_7721 Pfam-B_7594 Pfam-B_2268 Pfam-B_259 Pfam-B_9237 Pfam-B_6471 Pfam-B_5216Pfam-B_3948 Pfam-B_5215 Pfam-B_8056 Pfam-B_3168 Pfam-B_779 Pfam-B_3028 Pfam-B_9162 Pfam-B_3344 Pfam-B_7744 Pfam-B_4069 Pfam-B_2894 Pfam-B_7627 Pfam-B_824 Pfam-B_10160 Pfam-B_9232 Pfam-B_4885Pfam-B_1523 Pfam-B_10460 Pfam-B_9304 Pfam-B_2596 Pfam-B_7580 Pfam-B_7605 Pfam-B_10692 Pfam-B_457 Pfam-B_48 Pfam-B_485 Pfam-B_938 Pfam-B_2321 Pfam-B_5170 Pfam-B_2768 adh_short Pfam-B_4068 Pfam-B_5158 Pfam-B_7662 Pfam-B_417 Pfam-B_414 Pfam-B_8253 Pfam-B_6140 Pfam-B_6598 Pfam-B_10786 Pfam-B_2293 Pfam-B_8137 Pfam-B_10820 Pfam-B_5532 Pfam-B_7782 Pfam-B_3915 Pfam-B_906 Pfam-B_958 Pfam-B_2737 Pfam-B_1724 Pfam-B_664 Pfam-B_5094 Pfam-B_7723 Pfam-B_557 Pfam-B_1744 Pfam-B_4313 Pfam-B_435 Pfam-B_343 Pfam-B_7998 Pfam-B_139 Pfam-B_4962 Pfam-B_2599 Pfam-B_769 Pfam-B_4327 Pfam-B_7601 Pfam-B_4710 Pfam-B_856 Pfam-B_2129 Pfam-B_4185 Pfam-B_4518 Pfam-B_10470 Pfam-B_510 Pfam-B_131 Pfam-B_3917 Pfam-B_7608 Pfam-B_6138 Pfam-B_4328 Pfam-B_451 Pfam-B_377 Pfam-B_7135 Pfam-B_9327 Pfam-B_5400 Pfam-B_11806 Pfam-B_1790 Pfam-B_4354 Pfam-B_6403 Pfam-B_8797 Pfam-B_5847 Pfam-B_5620

Pfam-B_14 Pfam-B_556 Pfam-B_951 Pfam-B_3321 Pfam-B_7551 Pfam-B_2398 Pfam-B_10682 Pfam-B_8755 Pfam-B_2082 Pfam-B_6317 Pfam-B_254 Pfam-B_1890 Pfam-B_3728 Pfam-B_2098 Pfam-B_545 Pfam-B_356 Pfam-B_6349 Pfam-B_11368 Pfam-B_5222 Pfam-B_11477 Pfam-B_3563 Pfam-B_628 Pfam-B_5663 Pfam-B_2427 Pfam-B_2706 Pfam-B_5944 Pfam-B_9210 Pfam-B_9880 Pfam-B_2526 Pfam-B_6346 Pfam-B_10624 Pfam-B_190 Pfam-B_7598 Pfam-B_7212 Pfam-B_3576 Pfam-B_6045 Pfam-B_4704 Pfam-B_1963 Pfam-B_3856 Pfam-B_1225 Pfam-B_5760 Pfam-B_6353 Pfam-B_3510 Pfam-B_1964 Pfam-B_4821 Pfam-B_8881 Pfam-B_959 Pfam-B_11367 Pfam-B_5220 Pfam-B_4798 Pfam-B_780 Pfam-B_4670 Pfam-B_10659 Pfam-B_9344 Pfam-B_7818 Pfam-B_3820 Pfam-B_9675 Pfam-B_10 Pfam-B_11360 Pfam-B_3483 Pfam-B_5118 Pfam-B_543 Pfam-B_240 Pfam-B_651 Pfam-B_10660 Pfam-B_9212 Pfam-B_1017 Pfam-B_4705 Pfam-B_3287 Pfam-B_1530 Pfam-B_2374 Pfam-B_1416 Pfam-B_266 Pfam-B_652 Pfam-B_9876 Pfam-B_3768 Pfam-B_949 Pfam-B_1981 Pfam-B_5664 Pfam-B_10644 Pfam-B_3053 Pfam-B_8851 Pfam-B_9588 Pfam-B_2959 Pfam-B_9149 Pfam-B_192 Pfam-B_3562 Pfam-B_8451 Pfam-B_1066 Pfam-B_775 Pfam-B_10087 Pfam-B_278 Pfam-B_2975 Pfam-B_161Pfam-B_4778 Pfam-B_9320 Pfam-B_7956 Pfam-B_2411 Pfam-B_2769 Pfam-B_2375 Pfam-B_3286 Pfam-B_6786 Pfam-B_2397 Pfam-B_8969 Pfam-B_10716 Pfam-B_3821 Pfam-B_1312 Pfam-B_10086 Pfam-B_6173 Pfam-B_3265 kazal Pfam-B_7340 Pfam-B_2005 Pfam-B_9346 Pfam-B_1115 Pfam-B_3054 Pfam-B_67 Pfam-B_1071 Pfam-B_6512 Pfam-B_8619 Pfam-B_1318Pfam-B_6029 Pfam-B_2456 Pfam-B_7207 Pfam-B_6289 Pfam-B_11361 Pfam-B_2432 Pfam-B_8452 Pfam-B_3283 Pfam-B_3560 Pfam-B_1507 Pfam-B_10090 Pfam-B_5721 Pfam-B_8753 Pfam-B_712 Pfam-B_1003 Pfam-B_6918 Pfam-B_4352 Pfam-B_8973 Pfam-B_2003 Pfam-B_1221 Pfam-B_5257 Pfam-B_1982 Pfam-B_2613 Pfam-B_169 Pfam-B_454 Pfam-B_8467 Pfam-B_4820 Pfam-B_4706 Pfam-B_811 Pfam-B_8928 Pfam-B_2123 Pfam-B_11476 Pfam-B_3789 Pfam-B_10663 Pfam-B_683 Pfam-B_3279 Pfam-B_1949 Pfam-B_1447 Pfam-B_3561 Pfam-B_8880 Pfam-B_3354 thiored Pfam-B_7206 Pfam-B_2833 Pfam-B_6739 Pfam-B_8065 Pfam-B_2396 actin Pfam-B_7552 Pfam-B_6579 Pfam-B_2415 Pfam-B_1850 Pfam-B_3900 Pfam-B_11478 Pfam-B_3567 Pfam-B_10658 Pfam-B_8453 Pfam-B_8968 Pfam-B_1359 Pfam-B_3031 Pfam-B_9873 Pfam-B_18 Pfam-B_976 Pfam-B_6031 Pfam-B_1153 Pfam-B_3355 Pfam-B_1773 Pfam-B_3569 Pfam-B_111 Pfam-B_3582 Pfam-B_10092 Pfam-B_3822 Pfam-B_3301 Pfam-B_2417 Pfam-B_11747 Pfam-B_6743Pfam-B_2046Pfam-B_11369 Pfam-B_1363 Pfam-B_5662 Pfam-B_6605 Pfam-B_1092 Pfam-B_1737 Pfam-B_6355 Pfam-B_561 Pfam-B_11600 Pfam-B_627 Pfam-B_2426 Pfam-B_6352 Pfam-B_2653 Pfam-B_1904 Pfam-B_387 Pfam-B_2480 Pfam-B_3918 Pfam-B_2879 Pfam-B_10717 Pfam-B_1699 Pfam-B_1328 Pfam-B_562 Pfam-B_3920 Pfam-B_1596 Pfam-B_2738 Pfam-B_3423 Pfam-B_9668 Pfam-B_4537 Pfam-B_3568 Pfam-B_27 Pfam-B_2359 Pfam-B_2538 Pfam-B_648 Pfam-B_11894 Pfam-B_4777 Pfam-B_2134 Pfam-B_1380 Pfam-B_3790 Pfam-B_4679 Pfam-B_5258 Pfam-B_1190 Pfam-B_3511 Pfam-B_950 Pfam-B_9989 Pfam-B_6742 Pfam-B_9321 Pfam-B_3614 Pfam-B_2645 Pfam-B_631 Pfam-B_2069 Pfam-B_9211 Pfam-B_3573 Pfam-B_522 Pfam-B_1317 Pfam-B_6510 Pfam-B_5104 Pfam-B_262 Pfam-B_8754 Pfam-B_1178 Pfam-B_124 Pfam-B_1158 Pfam-B_5945 Pfam-B_10179 Pfam-B_2023 Pfam-B_2507 Pfam-B_3350 Pfam-B_695 Pfam-B_8707 Pfam-B_10091 Pfam-B_6032 Pfam-B_3566 Pfam-B_6450 Pfam-B_3306 Pfam-B_11893 Pfam-B_5003 Pfam-B_593 Pfam-B_11746 Pfam-B_2629 Pfam-B_11365 Pfam-B_5221 Pfam-B_2947 Pfam-B_2394 Pfam-B_2770 Cys-protease Pfam-B_2705 Pfam-B_9238 Pfam-B_4685 Pfam-B_9674 Pfam-B_21 Pfam-B_10598 Pfam-B_1568 Pfam-B_2307 Pfam-B_5720 Pfam-B_10725 Pfam-B_10674 Pfam-B_911 Pfam-B_7478 Pfam-B_29 Pfam-B_542 Pfam-B_1882 Pfam-B_860 Pfam-B_1180 Pfam-B_7160 Pfam-B_11 Pfam-B_6354 Pfam-B_3512 Pfam-B_351

Pfam-B_286 Pfam-B_276 Pfam-B_11774 Pfam-B_1518 Pfam-B_2448 Pfam-B_183 Pfam-B_2800 Pfam-B_246 Pfam-B_6627 Pfam-B_5551 Pfam-B_325 Pfam-B_3347 Pfam-B_1805 Pfam-B_519 Pfam-B_2258 Pfam-B_1102 Pfam-B_1095 Pfam-B_6477 Pfam-B_1217 Pfam-B_3266 Pfam-B_8346 Pfam-B_10529 Pfam-B_2422 Pfam-B_5558 Pfam-B_1413 UPAR_LY6 Pfam-B_580 Pfam-B_570 Pfam-B_3565Pfam-B_9986Pfam-B_947Pfam-B_8844Pfam-B_8637Pfam-B_8053Pfam-B_7857Pfam-B_7457Pfam-B_6310Pfam-B_6280Pfam-B_5881Pfam-B_4993Pfam-B_4845Pfam-B_4836Pfam-B_4574Pfam-B_4089Pfam-B_3607Pfam-B_3521Pfam-B_3378Pfam-B_3366Pfam-B_323Pfam-B_1322 Pfam-B_3780 Pfam-B_7002 Pfam-B_11468 Pfam-B_1131 Pfam-B_2674 Pfam-B_7540 Pfam-B_584 Pfam-B_1991 Pfam-B_11922 Pfam-B_1588 Pfam-B_6047 Pfam-B_4113 Pfam-B_2799 Pfam-B_2638 Pfam-B_186 Pfam-B_920 Pfam-B_7512 Pfam-B_6172 Pfam-B_2303 Pfam-B_994 Pfam-B_2973 Pfam-B_642 Pfam-B_1280 Pfam-B_2908 Pfam-B_10724 Pfam-B_9878 Pfam-B_487 Pfam-B_2523 Pfam-B_3207 Pfam-B_5782 Pfam-B_6581 Pfam-B_1461 Pfam-B_8418 Pfam-B_2404 Pfam-B_5553 Pfam-B_535 Pfam-B_4320 Pfam-B_3079 Pfam-B_565 Pfam-B_4454 Pfam-B_1563 Pfam-B_995 Pfam-B_1330 Pfam-B_4677 Pfam-B_8645 Pfam-B_5390 Pfam-B_3600 Pfam-B_7468 Pfam-B_8708 Pfam-B_1864 Pfam-B_1943 Pfam-B_4374 Pfam-B_6046 Pfam-B_7409 Pfam-B_6164 Pfam-B_9871 Pfam-B_5526 Pfam-B_6852 Pfam-B_1371 Pfam-B_1653 Pfam-B_4207 Pfam-B_1214 Pfam-B_2192 Pfam-B_6071 Pfam-B_5859 Pfam-B_5125 Pfam-B_2088 Pfam-B_2029 Pfam-B_1151 Pfam-B_11366 Pfam-B_11111 Pfam-B_10706 Pfam-B_1040 Pfam-B_3246 Pfam-B_345 Pfam-B_4266Pfam-B_1806 Pfam-B_1463Pfam-B_679 Pfam-B_2449 Pfam-B_195 Pfam-B_247 Pfam-B_9601 Pfam-B_9423 Pfam-B_1287 Pfam-B_2001 Pfam-B_4857 Pfam-B_6872 Pfam-B_5049 Pfam-B_639 COX2 Pfam-B_4481Pfam-B_8845Pfam-B_4183Pfam-B_871Pfam-B_7858Pfam-B_7454Pfam-B_420Pfam-B_239Pfam-B_5880Pfam-B_655Pfam-B_11614Pfam-B_11404Pfam-B_3046Pfam-B_8239Pfam-B_3260Pfam-B_4579Pfam-B_5942Pfam-B_7112Pfam-B_5611Pfam-B_4366 Pfam-B_7834 Pfam-B_7001 Pfam-B_610 Pfam-B_4209 Pfam-B_1101 Pfam-B_4137 Pfam-B_1362 Pfam-B_1332 Pfam-B_2627 Pfam-B_5651 pyr_redox Pfam-B_9635 Pfam-B_7410 Pfam-B_577 Pfam-B_529 Pfam-B_2301 Pfam-B_873 Pfam-B_11499 Pfam-B_1338 Pfam-B_3597 Pfam-B_793 Pfam-B_4271 Pfam-B_4152 Pfam-B_4884 Pfam-B_9957 Pfam-B_3594 Pfam-B_11495 Pfam-B_10710 Pfam-B_198 Pfam-B_4995 Pfam-B_2014 Pfam-B_6163 Pfam-B_4840 Pfam-B_6502 Pfam-B_4707 Pfam-B_277 Pfam-B_9870 Pfam-B_6851 Pfam-B_3211 Pfam-B_5389 Pfam-B_10739 Pfam-B_8489 Pfam-B_5107 Pfam-B_1098 Pfam-B_4856 Pfam-B_221 Pfam-B_539 Pfam-B_1191Pfam-B_2813Pfam-B_8352Pfam-B_2227Pfam-B_5615Pfam-B_8049Pfam-B_5280Pfam-B_7452Pfam-B_4983Pfam-B_6278Pfam-B_2256Pfam-B_4992Pfam-B_11615Pfam-B_2056Pfam-B_230Pfam-B_3194Pfam-B_3606Pfam-B_10328connexinPfam-B_3365gln-synt Pfam-B_1321 Pfam-B_981 Pfam-B_3538Pfam-B_3735 Pfam-B_1565Pfam-B_710 Pfam-B_11448 Pfam-B_1132 Pfam-B_2673 Pfam-B_3858 Pfam-B_2121 Pfam-B_2991 Pfam-B_7975 Pfam-B_3605 Pfam-B_1639 Pfam-B_1412 Pfam-B_1847 Pfam-B_9879 Pfam-B_9421 Pfam-B_256 Pfam-B_2306 Pfam-B_876 Pfam-B_5552 Pfam-B_5318 Pfam-B_3349 Pfam-B_3078 Pfam-B_564 Pfam-B_5119 Pfam-B_1103 Pfam-B_1378 Pfam-B_2536 Pfam-B_2018 Pfam-B_8604 Pfam-B_4010 Pfam-B_321 Pfam-B_6171 Pfam-B_5783 Pfam-B_269 sigma70 Pfam-B_3709

Pfam-B_10228 Pfam-B_274 Pfam-B_2217 Pfam-B_996 Pfam-B_7365 Pfam-B_7357 Pfam-B_1643 Pfam-B_5991 Pfam-B_1587 Pfam-B_5520 Pfam-B_289 Pfam-B_1264 Pfam-B_1257 Pribosyltran Pfam-B_4832 Pfam-B_1666Pfam-B_9318 Pfam-B_7738 Pfam-B_797 Pfam-B_7527 Pfam-B_3231 Pfam-B_1442 Pfam-B_4341 Pfam-B_6095 Pfam-B_371 Pfam-B_1784 Pfam-B_5006 Pfam-B_2406 Pfam-B_9701 Pfam-B_9518 Pfam-B_6539 Pfam-B_3454 Pfam-B_5830 Pfam-B_2545 Pfam-B_621 Pfam-B_10381 Pfam-B_1952 Pfam-B_4030 Pfam-B_1778 Pfam-B_4826 Pfam-B_745 Pfam-B_3662 Pfam-B_6993 Pfam-B_115 Pfam-B_1130 Pfam-B_5192 Pfam-B_6571 Pfam-B_5151 Pfam-B_7725 Pfam-B_7259 Pfam-B_10026 Pfam-B_7362 Pfam-B_7140KH-domain Pfam-B_3101 Pfam-B_3061 Pfam-B_10032 Pfam-B_10609 Pfam-B_9699 Pfam-B_8911 Pfam-B_8757 Pfam-B_816 Pfam-B_7672 Pfam-B_7637 Pfam-B_7521 Pfam-B_7413 Pfam-B_7406 Pfam-B_7177 Pfam-B_7142 Pfam-B_6837 Pfam-B_5640 Pfam-B_4601 Pfam-B_4575 Pfam-B_2055 Pfam-B_6490 Pfam-B_7851 Pfam-B_1272 Pfam-B_11574 Pfam-B_10231 Pfam-B_5191 Pfam-B_6570 Pfam-B_3908 Pfam-B_2557 Pfam-B_7258 Pfam-B_8746 Pfam-B_8748 Pfam-B_10356 Pfam-B_4615 Pfam-B_3839 Pfam-B_10887 Pfam-B_720 Pfam-B_7366 Pfam-B_209 Pfam-B_4331 Pfam-B_3787 Pfam-B_10031 Pfam-B_6491 Pfam-B_273 Pfam-B_9252 Pfam-B_9338Pfam-B_9329 Pfam-B_917Pfam-B_867 Pfam-B_5439Pfam-B_8235 Pfam-B_1122Pfam-B_7531 Pfam-B_8894Pfam-B_7185 Pfam-B_5458Pfam-B_665 Pfam-B_4340Pfam-B_635 Pfam-B_898Pfam-B_6097 Pfam-B_1094Pfam-B_421 Pfam-B_721Pfam-B_3808 alpha-amylase Pfam-B_6529 Pfam-B_4708Pfam-B_9521Pfam-B_5092 Pfam-B_3760Pfam-B_722Pfam-B_9207 HSP70 Pfam-B_5592 neur Pfam-B_1586 Pfam-B_7363 Pfam-B_3045 Pfam-B_10202 Pfam-B_1394 Pfam-B_7178 Pfam-B_3475 Pfam-B_2521 Pfam-B_1118 Pfam-B_10198 Pfam-B_2371 Pfam-B_1129 Pfam-B_5062 Pfam-B_9339 Pfam-B_5189 Pfam-B_1735 Pfam-B_3906 Pfam-B_2556 Pfam-B_317 Pfam-B_2076 oxidored_nitro Pfam-B_719 Pfam-B_1794 Pfam-B_1105 Pfam-B_11571 Pfam-B_10506 Pfam-B_1121 Pfam-B_5190 Pfam-B_4697 Pfam-B_3907 Pfam-B_7726 Pfam-B_2974 Pfam-B_8747 Pfam-B_11397 Pfam-B_1583 Pfam-B_265 Pfam-B_10030 Pfam-B_5923 Pfam-B_11573Pfam-B_4831 Pfam-B_4332 Pfam-B_10148 Pfam-B_7645 Pfam-B_738 Pfam-B_3889 Pfam-B_1854 Pfam-B_1441 Pfam-B_3361 Pfam-B_3445 Pfam-B_3503 Pfam-B_1396 Pfam-B_11421 Pfam-B_10602 Pfam-B_2317 Pfam-B_5993 Pfam-B_6536 Pfam-B_3453 Pfam-B_8813 Pfam-B_1597 Pfam-B_3077 Pfam-B_10375 Pfam-B_1341 Pfam-B_1273 Pfam-B_1777 Pfam-B_4825 Pfam-B_5639 Pfam-B_3661 Pfam-B_7359 Pfam-B_1346 Pfam-B_10731 Pfam-B_1410 Pfam-B_2753 Pfam-B_127 Pfam-B_567 Pfam-B_181 Pfam-B_7405 Pfam-B_3047 Pfam-B_10225 Pfam-B_6488 Pfam-B_476 Pfam-B_1431

Pfam-B_8473 Pfam-B_10502 Pfam-B_2092 Pfam-B_3389 Pfam-B_3239 Pfam-B_482 Pfam-B_3 Pfam-B_2246 Pfam-B_444 Pfam-B_9467 Pfam-B_4040 Pfam-B_9946 Pfam-B_8761 Pfam-B_2914 Pfam-B_5121Pfam-B_4616 Pfam-B_6754 Pfam-B_3620 Pfam-B_3782 Pfam-B_7585 Pfam-B_5074 Pfam-B_470 Pfam-B_960 Pfam-B_5900 Pfam-B_4247 Pfam-B_11520 Pfam-B_3589 Pfam-B_6427Pfam-B_3793 Pfam-B_406 Pfam-B_785 phoslip Pfam-B_2602 Pfam-B_6999 Pfam-B_5684 Pfam-B_4848 Pfam-B_3447 Pfam-B_1962 Pfam-B_2670 Pfam-B_1606 Pfam-B_2704 Pfam-B_9258 Pfam-B_112 Pfam-B_956 Pfam-B_2222 Pfam-B_4838 Pfam-B_2356 Pfam-B_11889 Pfam-B_6228 Pfam-B_8878 Pfam-B_3070 Pfam-B_5539 Pfam-B_5059 Pfam-B_726 Pfam-B_1769 Pfam-B_4146 Pfam-B_3894 Pfam-B_3766 Pfam-B_3753 Pfam-B_3238 Pfam-B_3178 Pfam-B_290 Pfam-B_2718 Pfam-B_2093 Pfam-B_1919 gluts beta-lactamase Pfam-B_10791 Pfam-B_11671 Pfam-B_11388 Pfam-B_10872 Pfam-B_103 Pfam-B_3815 Pfam-B_10638 Pfam-B_2533 Pfam-B_930 Pfam-B_2544 Pfam-B_11726 Pfam-B_10893 Pfam-B_9386 Pfam-B_11458 Pfam-B_5 Pfam-B_11866 Pfam-B_6848 Pfam-B_2048 Pfam-B_10382 Pfam-B_6994 Pfam-B_5538 Pfam-B_7353 Pfam-B_7747 Pfam-B_6940 Pfam-B_7584 Pfam-B_3847 Pfam-B_3625 Pfam-B_4338 Pfam-B_841 Pfam-B_3089 Pfam-B_2462 Pfam-B_4730 Pfam-B_9387 Pfam-B_783 Pfam-B_7317 Pfam-B_6977 Pfam-B_6249 Pfam-B_2938 Pfam-B_6436 Pfam-B_6435 Pfam-B_1005 Pfam-B_4660 Pfam-B_2040 Pfam-B_10926 Pfam-B_7063 Pfam-B_6803 Pfam-B_1231Pfam-B_2053 Pfam-B_7476 Pfam-B_5022 Pfam-B_3748 Pfam-B_11469 Pfam-B_9345 Pfam-B_753 Pfam-B_1172 Pfam-B_2888Pfam-B_857Pfam-B_1087 Pfam-B_2286Pfam-B_5123Pfam-B_5122 Pfam-B_11350 Pfam-B_777 Pfam-B_807 Pfam-B_852 Pfam-B_3174 Pfam-B_5925 Pfam-B_5537 Pfam-B_7352 Pfam-B_3105 Pfam-B_6939 Pfam-B_11670 Pfam-B_1752 Pfam-B_10873 Pfam-B_2487 Pfam-B_7583 Pfam-B_3929 Pfam-B_2543 Pfam-B_3699 Pfam-B_10892 Pfam-B_1486 Pfam-B_11457 Pfam-B_393 Pfam-B_11865 Pfam-B_2958 Pfam-B_11399 Pfam-B_10379 Pfam-B_7495 Pfam-B_1012 Pfam-B_5023 Pfam-B_9257 Pfam-B_5906 Pfam-B_6995 Pfam-B_5686 Pfam-B_5685 Pfam-B_4849 Pfam-B_5274Pfam-B_11095Pfam-B_4839 Pfam-B_34 Pfam-B_238 Pfam-B_3275 Pfam-B_2474 Pfam-B_3100 Pfam-B_9985 Pfam-B_2709 Pfam-B_8883 Pfam-B_4800 Pfam-B_7586 Pfam-B_10366 Pfam-B_5333Pfam-B_3791 Pfam-B_3796 Pfam-B_3579 Pfam-B_7503 Pfam-B_929 lipase Pfam-B_2597 Pfam-B_1910 Pfam-B_7389 Pfam-B_2948 Pfam-B_3844 Pfam-B_5064 Pfam-B_475 Pfam-B_5424 Pfam-B_5882 Pfam-B_1450 Pfam-B_94 Pfam-B_1243 Pfam-B_3388 Pfam-B_2644 Pfam-B_1838 MHC_I Pfam-B_146 Pfam-B_1163 Pfam-B_5970 peroxidase Pfam-B_6201 Pfam-B_1375 Pfam-B_7496 Pfam-B_3693 Pfam-B_2936 Pfam-B_2036 Pfam-B_1108

Pfam-B_8888 Pfam-B_11676 Pfam-B_5202 Pfam-B_45 Pfam-B_8712Pfam-B_1334 Pfam-B_5978 Pfam-B_7243 Pfam-B_5384 Pfam-B_2612 wnt Pfam-B_11320 Pfam-B_5789 Pfam-B_2354Pfam-B_6065 Pfam-B_8885 Pfam-B_691 Pfam-B_7699 Pfam-B_78Pfam-B_4193 Pfam-B_1259 Pfam-B_9790 Pfam-B_9470 Pfam-B_786 Pfam-B_6708 Pfam-B_8275 Pfam-B_6452 Pfam-B_2138 Pfam-B_6961 Pfam-B_11783 Pfam-B_8672 Pfam-B_2455 Pfam-B_2478 Pfam-B_10073 Pfam-B_2205Pfam-B_2663Pfam-B_8589 Pfam-B_4123Pfam-B_2186Pfam-B_1034thyroglobulin_1Pfam-B_8741 Pfam-B_8740Pfam-B_4004Pfam-B_5307Pfam-B_7924Pfam-B_7356Pfam-B_1582Pfam-B_6577Pfam-B_5155Pfam-B_1808Pfam-B_5156 Pfam-B_3437Pfam-B_9700 Pfam-B_9575Pfam-B_9574 Pfam-B_9083Pfam-B_9082 Pfam-B_3259Pfam-B_8601 Pfam-B_2163Pfam-B_8013 Pfam-B_8319Pfam-B_7784 Pfam-B_5218Pfam-B_7724 Pfam-B_5188Pfam-B_7626 Pfam-B_7596Pfam-B_7597 Pfam-B_7145Pfam-B_7144 Pfam-B_6874Pfam-B_6873 Pfam-B_6202Pfam-B_6209 Pfam-B_7657Pfam-B_572 Pfam-B_2615Pfam-B_5457 Pfam-B_7633Pfam-B_5346 Pfam-B_2866Pfam-B_4973 Pfam-B_4893Pfam-B_4892 Pfam-B_10317Pfam-B_4599 Pfam-B_4119Pfam-B_4120 Pfam-B_4081Pfam-B_4080 Pfam-B_3167Pfam-B_4052 Pfam-B_7439Pfam-B_3859 Pfam-B_3832Pfam-B_3831 Pfam-B_7225Pfam-B_3799 Pfam-B_7125Pfam-B_3757 Pfam-B_3552Pfam-B_3539 Pfam-B_11891Pfam-B_2985 Pfam-B_2930Pfam-B_2929 Pfam-B_7519Pfam-B_2534 Pfam-B_2935Pfam-B_2447 Pfam-B_9150Pfam-B_2251 Pfam-B_1753Pfam-B_2049 Pfam-B_5438Pfam-B_202 Pfam-B_10424Pfam-B_1992 Pfam-B_658Pfam-B_1786 Pfam-B_1969Pfam-B_1702 Pfam-B_4530Pfam-B_1513 Pfam-B_1407Pfam-B_2114 Pfam-B_5896Pfam-B_1319 Pfam-B_11796Pfam-B_11795Pfam-B_11578Pfam-B_11579Pfam-B_11498Pfam-B_11497 Pfam-B_6789Pfam-B_11443 Pfam-B_6962 Pfam-B_2979 Pfam-B_11321 Pfam-B_4470 Pfam-B_4634 Pfam-B_2137 Pfam-B_4438 Pfam-B_3430 Pfam-B_1937 Pfam-B_4767

FGF Pfam-B_1930 Pfam-B_1263 Pfam-B_589 Pfam-B_6115 Pfam-B_5980 Pfam-B_2513 Pfam-B_11288 Pfam-B_9524 Pfam-B_11322Pfam-B_6722 Pfam-B_2668 Pfam-B_3067 Pfam-B_8720 Pfam-B_11832 Pfam-B_2064

Pfam-B_6787Pfam-B_11440Pfam-B_11363Pfam-B_11362 Pfam-B_525Pfam-B_11351 Pfam-B_6645Pfam-B_11034Pfam-B_6564Pfam-B_10679 Pfam-B_3585Pfam-B_10666 Pfam-B_6306Pfam-B_10186Pfam-B_10129Pfam-B_10140 fer4_NifH Pfam-B_10036Pfam-B_8787Pfam-B_8786Pfam-B_2662Pfam-B_8587Pfam-B_8180Pfam-B_8184Pfam-B_2551Pfam-B_7673Pfam-B_3861Pfam-B_7441Pfam-B_7418Pfam-B_7419Pfam-B_154Pfam-B_7327Pfam-B_1250Pfam-B_7257Pfam-B_11751Pfam-B_6921Pfam-B_5096Pfam-B_5095Pfam-B_1580Pfam-B_5055Pfam-B_3792Pfam-B_5008Pfam-B_7051Pfam-B_4941Pfam-B_490Pfam-B_489Pfam-B_3705Pfam-B_4880Pfam-B_6836Pfam-B_4824Pfam-B_933Pfam-B_3716Pfam-B_10967Pfam-B_3094Pfam-B_7292Pfam-B_3034Pfam-B_6444Pfam-B_2864Pfam-B_4501Pfam-B_2807Pfam-B_3450Pfam-B_2793Pfam-B_8325Pfam-B_2625Pfam-B_228Pfam-B_227Pfam-B_8214Pfam-B_2175Pfam-B_7176Pfam-B_2095 GATase Pfam-B_1866Pfam-B_1027Pfam-B_1842Pfam-B_8133Pfam-B_1836Pfam-B_6256Pfam-B_1342Pfam-B_521Pfam-B_1117Pfam-B_6437Pfam-B_10399Pfam-B_1524Pfam-B_10316 recA Pfam-B_9860 Pfam-B_9714 Pfam-B_962 Pfam-B_96 Pfam-B_9292 Pfam-B_9147 Pfam-B_908 Pfam-B_8585 Pfam-B_8403 Pfam-B_8237 Pfam-B_7835 Pfam-B_7731 Pfam-B_7695 Pfam-B_7614 Pfam-B_7589 Pfam-B_7579 Pfam-B_7515 Pfam-B_7497 Pfam-B_7494 Pfam-B_7470 Pfam-B_7448 Pfam-B_7438 Pfam-B_7316 Pfam-B_7277 Pfam-B_7254 Pfam-B_7216 Pfam-B_7190 Pfam-B_7187 Pfam-B_7104 Pfam-B_7003 Pfam-B_6955 Pfam-B_6870 Pfam-B_6869 Pfam-B_6864 Pfam-B_6835 Pfam-B_6795 Pfam-B_6758 Pfam-B_6757 Pfam-B_6747 Pfam-B_6725 Pfam-B_6700 Pfam-B_6644 Pfam-B_6626 Pfam-B_6618 Pfam-B_6268 Pfam-B_6236 Pfam-B_6011 Pfam-B_5950 Pfam-B_5840 Pfam-B_579 Pfam-B_5774 Pfam-B_5743 Pfam-B_5738 Pfam-B_5608 Pfam-B_5302 Pfam-B_5213 Pfam-B_5134 Pfam-B_5099 Pfam-B_5093

Pfam-B_1566Pfam-B_713Pfam-B_10902Pfam-B_11447Pfam-B_6790Pfam-B_11446Pfam-B_9612 tsp_1 Pfam-B_617 S4 Pfam-B_11929Pfam-B_9979Pfam-B_1942Pfam-B_9863Pfam-B_2785Pfam-B_9731Pfam-B_6060Pfam-B_9656Pfam-B_1918Pfam-B_9465Pfam-B_9410Pfam-B_9408Pfam-B_264Pfam-B_9288Pfam-B_2690Pfam-B_8848Pfam-B_2987Pfam-B_883Pfam-B_1766Pfam-B_853Pfam-B_2873Pfam-B_8356Pfam-B_8932Pfam-B_8028Pfam-B_7832Pfam-B_7833Pfam-B_5227Pfam-B_7737Pfam-B_6982Pfam-B_7683Pfam-B_5130Pfam-B_7547Pfam-B_450Pfam-B_7502Pfam-B_2806Pfam-B_7395Pfam-B_934Pfam-B_7094Pfam-B_3399Pfam-B_7060Pfam-B_6952 thiolasePfam-B_11630Pfam-B_6876Pfam-B_6868Pfam-B_6867Pfam-B_1827Pfam-B_6731Pfam-B_8730Pfam-B_673Pfam-B_736Pfam-B_661Pfam-B_8795Pfam-B_6609Pfam-B_2401Pfam-B_6519Pfam-B_1951Pfam-B_6252Pfam-B_2990Pfam-B_611Pfam-B_9751Pfam-B_6088Pfam-B_5997Pfam-B_5996Pfam-B_9290Pfam-B_5898Pfam-B_536Pfam-B_5836Pfam-B_3947Pfam-B_5211 Pfam-B_4988 Pfam-B_4987 Pfam-B_4979 Pfam-B_4948 Pfam-B_4947 Pfam-B_4842 Pfam-B_4788 Pfam-B_4787 Pfam-B_4763 Pfam-B_4597 Pfam-B_4581 Pfam-B_4144 Pfam-B_3923 Pfam-B_3922 Pfam-B_3872 Pfam-B_3838 Pfam-B_3812 Pfam-B_3749 Pfam-B_3679 Pfam-B_3362 Pfam-B_3218 Pfam-B_3190 Pfam-B_3171 Pfam-B_3093 Pfam-B_3085 Pfam-B_3073 Pfam-B_3043 Pfam-B_3010 Pfam-B_2954 Pfam-B_2657 Pfam-B_2650 Pfam-B_2559 Pfam-B_2495 Pfam-B_2369 Pfam-B_2338 Pfam-B_2287 Pfam-B_2199 Pfam-B_2184 Pfam-B_2171 Pfam-B_2068 Pfam-B_1762 Pfam-B_1749 Pfam-B_1619 Pfam-B_1609 Pfam-B_1548 Pfam-B_1522 Pfam-B_128 Pfam-B_1205Pfam-B_11779Pfam-B_11694Pfam-B_11663Pfam-B_11623Pfam-B_11601Pfam-B_11518Pfam-B_11302Pfam-B_11232Pfam-B_11194Pfam-B_11127Pfam-B_10962Pfam-B_10954Pfam-B_10849Pfam-B_1054 Pfam-B_1045 gpdh

Pfam-B_8499Pfam-B_5103Pfam-B_11524Pfam-B_4812Pfam-B_5992Pfam-B_4790Pfam-B_6735Pfam-B_4776Pfam-B_4775Pfam-B_4774Pfam-B_1201Pfam-B_4633Pfam-B_8937Pfam-B_4437Pfam-B_4433Pfam-B_4435Pfam-B_10932Pfam-B_4434Pfam-B_10647Pfam-B_4362Pfam-B_11292Pfam-B_4345Pfam-B_7765Pfam-B_4309Pfam-B_10635Pfam-B_429Pfam-B_3921Pfam-B_4175Pfam-B_8564Pfam-B_4173Pfam-B_7510Pfam-B_3885Pfam-B_7459Pfam-B_3864Pfam-B_7398Pfam-B_3849 sodfe Pfam-B_3745Pfam-B_11589Pfam-B_3681Pfam-B_11177Pfam-B_3588Pfam-B_808Pfam-B_3385Pfam-B_884Pfam-B_3284Pfam-B_2828Pfam-B_304Pfam-B_4676Pfam-B_2883Pfam-B_2563Pfam-B_2741Pfam-B_6719Pfam-B_2595Pfam-B_3806Pfam-B_258Pfam-B_2044Pfam-B_2391Pfam-B_6070Pfam-B_2335Pfam-B_9804Pfam-B_223Pfam-B_689Pfam-B_1923Pfam-B_6768Pfam-B_1754Pfam-B_320Pfam-B_1741Pfam-B_5566Pfam-B_171Pfam-B_486Pfam-B_1453Pfam-B_1559Pfam-B_1432Pfam-B_3977Pfam-B_1426Pfam-B_1929Pfam-B_1367COesterase Pfam-B_1366Pfam-B_6160Pfam-B_1262Pfam-B_11306Pfam-B_1193Pfam-B_2475Pfam-B_11903Pfam-B_11880Pfam-B_11873Pfam-B_11850Pfam-B_11854Pfam-B_2921Pfam-B_11745Pfam-B_6865Pfam-B_11613Pfam-B_4786Pfam-B_11545Pfam-B_6813Pfam-B_11512Pfam-B_6788Pfam-B_11442Pfam-B_3066Pfam-B_11330Pfam-B_6706Pfam-B_11259Pfam-B_8913Pfam-B_11253Pfam-B_4967Pfam-B_1104Pfam-B_10320Pfam-B_10305Pfam-B_462Pfam-B_10268Pfam-B_1183 histone

Figure 5.12. The Family network using only non-circular patterns

Red nodes are nodes found in the non-CP network. Green nodes are nodes found only in the CP-network. Blue edges denote edges found in the CP network. Pink edges are edges found only in the CP network. CHAPTER 5. CIRCULAR PATTERN MATCHING 132

Pfam-B_2464 Pfam-B_10215 Pfam-B_8548 Pfam-B_1301 Pfam-B_4314 Pfam-B_9637

Pfam-B_11192

Pfam-B_549 Pfam-B_4424 Pfam-B_3424 Pfam-B_2383 Pfam-B_4619 Pfam-B_5780 Pfam-B_10358 Pfam-B_6182 cellulasePfam-B_684 Pfam-B_364 Pfam-B_2370 Pfam-B_5584 Pfam-B_5812 Pfam-B_2723 Pfam-B_10216 Pfam-B_5679

Pfam-B_1928 Pfam-B_1520 Pfam-B_3341 Pfam-B_5768 Pfam-B_4888 Pfam-B_8546 Pfam-B_6424 Pfam-B_4593Pfam-B_10289 Pfam-B_6513 Pfam-B_3062 Pfam-B_363 Pfam-B_6041 Pfam-B_4591 Pfam-B_10214 Pfam-B_5687 Pfam-B_1658 Pfam-B_3500 Pfam-B_10288 Pfam-B_4255 Pfam-B_4592Pfam-B_5814 Pfam-B_1680 Pfam-B_4224 Pfam-B_8891 Pfam-B_5703 Pfam-B_3529 Pfam-B_6415 Pfam-B_1728 Pfam-B_5669 Pfam-B_5815 Pfam-B_9638 Pfam-B_4618 Pfam-B_5958 Pfam-B_6423 Pfam-B_706 Pfam-B_1659Pfam-B_10290 Pfam-B_8919 Pfam-B_1685 Pfam-B_758 Pfam-B_8482

Pfam-B_5583 Pfam-B_1308 Pfam-B_8210 Pfam-B_914

Pfam-B_1531

Pfam-B_10802 Pfam-B_1801 Pfam-B_6622

Pfam-B_1644 Pfam-B_199 Pfam-B_10625 Pfam-B_4087 Pfam-B_2684

Pfam-B_6451 Pfam-B_8239 Pfam-B_10799 Pfam-B_1165 Pfam-B_1474

Pfam-B_9546 Pfam-B_3422 Pfam-B_4402 Pfam-B_8547 Pfam-B_2848 Pfam-B_9544 Pfam-B_4213 Pfam-B_2998

Pfam-B_6384 Pfam-B_8734 Pfam-B_2084 Pfam-B_255 Pfam-B_1660 Pfam-B_4474Pfam-B_4475 Pfam-B_4011

Pfam-B_8544 cytochrome_c Pfam-B_3303 Pfam-B_1477 Pfam-B_1663 fer4 Pfam-B_9562 Pfam-B_442Pfam-B_380 Pfam-B_9648 Pfam-B_7977 Pfam-B_2694

Pfam-B_7200 Pfam-B_8439 Pfam-B_4499 Pfam-B_4212 Pfam-B_10822 Pfam-B_1391 Pfam-B_6563 Pfam-B_347 Pfam-B_1512Pfam-B_4852Pfam-B_1177 Pfam-B_1326 Pfam-B_7477 Pfam-B_6053 Pfam-B_3876 Pfam-B_1629 Pfam-B_10824Pfam-B_2756Pfam-B_10819 Pfam-B_1216Pfam-B_2408 Pfam-B_5793 Pfam-B_5350 Pfam-B_2085 Pfam-B_10826 Pfam-B_1927 Pfam-B_8358 Pfam-B_1627 Pfam-B_9563 Pfam-B_9942 Pfam-B_3240 Pfam-B_10827 Pfam-B_1693 Pfam-B_2616 vwc fn3 Pfam-B_3970 Pfam-B_7274 Pfam-B_4154 Pfam-B_686 Pfam-B_1730Pfam-B_896 Pfam-B_2682 Pfam-B_540 Pfam-B_10918Pfam-B_2527 Pfam-B_10818Pfam-B_4813 trypsin Pfam-B_3969 Pfam-B_9927 Pfam-B_8289 Pfam-B_2648 Pfam-B_818 Pfam-B_9623 Pfam-B_8430 Pfam-B_4409 Pfam-B_282 Pfam-B_2763 Pfam-B_1060 Pfam-B_9624 Pfam-B_687 Pfam-B_85 Pfam-B_6038 Pfam-B_5718 Pfam-B_5509 Pfam-B_10817 Pfam-B_23 Pfam-B_668 Pfam-B_7286Pfam-B_688 Pfam-B_4159 Pfam-B_576 COX1 Pfam-B_2528 Pfam-B_2011 Pfam-B_799 Pfam-B_2012 Pfam-B_10601 Pfam-B_800 Pfam-B_453 Pfam-B_1285 Pfam-B_1848 Pfam-B_3002 Pfam-B_8423Pfam-B_4141 Pfam-B_2611 Pfam-B_3875 Pfam-B_7263 Pfam-B_4240 Pfam-B_452 Pfam-B_1446 Pfam-B_1860 Pfam-B_8326 Pfam-B_11209 Pfam-B_33 Pfam-B_6033 Pfam-B_1275Pfam-B_2178 Pfam-B_8419 Pfam-B_1853 vwa Pfam-B_5444 Pfam-B_4138 Pfam-B_295 Pfam-B_8305Pfam-B_5513 Pfam-B_1063 Pfam-B_10015 Pfam-B_8508 Pfam-B_2187 Pfam-B_5028 Pfam-B_81 Pfam-B_1274 Pfam-B_6516 Pfam-B_339 Pfam-B_629 Pfam-B_8493 Pfam-B_2529 Pfam-B_859 Pfam-B_11329 wap Pfam-B_2797 Pfam-B_1064 Pfam-B_2305 Pfam-B_8368 Pfam-B_10600Pfam-B_7482 Pfam-B_7481Pfam-B_5112 Pfam-B_5441 Pfam-B_555 efhand Pfam-B_8245 Pfam-B_757 Pfam-B_6062 Pfam-B_666 rvp Pfam-B_134 Pfam-B_322 Pfam-B_4102 Kunitz_BPTI Pfam-B_478 Pfam-B_741 Pfam-B_495 Pfam-B_4418 Pfam-B_496 rnaseH Pfam-B_5114 Pfam-B_1390 Pfam-B_10145 Pfam-B_5517 Pfam-B_3878 Pfam-B_8364 Pfam-B_2015 Pfam-B_1803 Pfam-B_497 Pfam-B_296 Pfam-B_8676 Pfam-B_8312 Pfam-B_11846 Pfam-B_5113 Pfam-B_8204 Pfam-B_19 Pfam-B_294 Pfam-B_11279 Pfam-B_1333 fer2 Pfam-B_456 Pfam-B_5115 Pfam-B_804 Pfam-B_8862 Pfam-B_11046 Pfam-B_2068 Pfam-B_1863 Pfam-B_11029 Pfam-B_7480 Pfam-B_4107 Pfam-B_10419 ins Pfam-B_10594 Pfam-B_5487 Pfam-B_11160 fn1 Pfam-B_1846 Pfam-B_9527 Pfam-B_6002 Pfam-B_3978 Pfam-B_9494 Pfam-B_932 Pfam-B_11159 Pfam-B_3434 Pfam-B_3467 Pfam-B_1286 Pfam-B_659 Pfam-B_7479 Pfam-B_1818 Pfam-B_5482 Pfam-B_1816 Pfam-B_6017 Pfam-B_8206 Pfam-B_5708 Pfam-B_7486 Pfam-B_5471 Pfam-B_194 Pfam-B_1804 Pfam-B_330 Pfam-B_5709 Pfam-B_1815 Pfam-B_5516 Pfam-B_1819 Pfam-B_7483 Pfam-B_4563 Pfam-B_4490 Pfam-B_2403 Pfam-B_6227 fibrinogen_C Pfam-B_9903 Pfam-B_10604 Pfam-B_332 Pfam-B_6544 Pfam-B_1817 Pfam-B_10597 Pfam-B_6801 Pfam-B_2405 Pfam-B_2790 Pfam-B_4410 Pfam-B_11467 Pfam-B_5111 Pfam-B_6000 Pfam-B_1361 Pfam-B_461 Pfam-B_390 Pfam-B_6012 Pfam-B_9971 Pfam-B_8205 Pfam-B_874 Pfam-B_2568 Pfam-B_10556 Pfam-B_10574 Pfam-B_3879 Pfam-B_6018 Pfam-B_400 Pfam-B_3342 Pfam-B_455 Pfam-B_1490 Pfam-B_10607 Pfam-B_1113 Pfam-B_6903 Pfam-B_1260 Pfam-B_9838 Pfam-B_4395 Pfam-B_10217Pfam-B_2386 Pfam-B_3482 Pfam-B_9980 Pfam-B_616 Pfam-B_2289 Pfam-B_2299 Pfam-B_5842 rhv Pfam-B_9543 Pfam-B_9140 Pfam-B_1677 Pfam-B_6004 Pfam-B_6511 Pfam-B_3816 Pfam-B_9973 Pfam-B_9531 Pfam-B_1802 Pfam-B_2823 Pfam-B_9983 Pfam-B_1941 Pfam-B_9943 Pfam-B_9982 Pfam-B_6802 rvt Pfam-B_3446 Pfam-B_9548Pfam-B_2531 Pfam-B_548 Pfam-B_968 Pfam-B_4632 Pfam-B_4397 Pfam-B_2530 Pfam-B_3068 Pfam-B_4939 Pfam-B_8302 Pfam-B_5710 Pfam-B_1489 Pfam-B_2298 Pfam-B_7809 Pfam-B_9901 Pfam-B_4398 Pfam-B_3017 Pfam-B_2013 Pfam-B_590 Pfam-B_3294 Pfam-B_5116 Pfam-B_2296 Pfam-B_10170 Pfam-B_2028 Pfam-B_9533 Pfam-B_3509 Pfam-B_606 Pfam-B_881 Pfam-B_6671 Pfam-B_2297 Pfam-B_1189 Pfam-B_3263 Pfam-B_3543 Pfam-B_10756 Pfam-B_2009 Pfam-B_6003Pfam-B_1824 Pfam-B_6294 Pfam-B_3193 Pfam-B_4237 Pfam-B_3299 Pfam-B_4396 Pfam-B_7252 Pfam-B_813 Pfam-B_5841 Pfam-B_9861 Pfam-B_4536 Pfam-B_2759 Pfam-B_8910 Pfam-B_4082Pfam-B_297 Pfam-B_7182 Pfam-B_9138 Pfam-B_1123 Pfam-B_4403Pfam-B_3212 PH Pfam-B_7642 Pfam-B_4535

Pfam-B_774 Pfam-B_3373 Pfam-B_9139 Pfam-B_210Pfam-B_546 cadherin Pfam-B_547 Pfam-B_10807Pfam-B_4350 STphosphatase Pfam-B_2290Pfam-B_2945 Pfam-B_3880 Pfam-B_229 Pfam-B_2126 Pfam-B_4920 Pfam-B_2479 Pfam-B_5820 Pfam-B_989 Pfam-B_7719 Pfam-B_3295 Pfam-B_5634 Pfam-B_4482 Pfam-B_9310 Pfam-B_2963 Pfam-B_7717 Pfam-B_4533 Pfam-B_5386 Pfam-B_1459 Pfam-B_4547 Pfam-B_2292 Pfam-B_6146 Pfam-B_4357 Pfam-B_9007 Pfam-B_4918 Pfam-B_4356 Pfam-B_1931 Pfam-B_11927 Pfam-B_2703 Pfam-B_3611 Pfam-B_7573 Pfam-B_7718 Pfam-B_10142 Pfam-B_1665 Pfam-B_4606 Pfam-B_7155 Pfam-B_4921Pfam-B_2992 Pfam-B_7659 Pfam-B_9129 Pfam-B_6957 Pfam-B_4148Pfam-B_4919 Pfam-B_623 Pfam-B_9008 Pfam-B_11037 Pfam-B_307 Pfam-B_7334 Pfam-B_6674 Pfam-B_4149 Pfam-B_4922Pfam-B_4532 Pfam-B_5757 Pfam-B_7033 Pfam-B_5335 Pfam-B_1516 Pfam-B_1124 Pfam-B_5144 Pfam-B_6006 Pfam-B_336 Pfam-B_7981 Pfam-B_3494 Pfam-B_2409Pfam-B_9919 Pfam-B_8944 Pfam-B_3015 Pfam-B_8089 Pfam-B_7849 Pfam-B_697 Pfam-B_4688 Pfam-B_7846 Pfam-B_105 Pfam-B_1970 Pfam-B_334 Pfam-B_10473 Pfam-B_1621 Pfam-B_8578 toxin Pfam-B_4917 Pfam-B_4252 Pfam-B_7321 Pfam-B_7053 Pfam-B_520 E1-E2_ATPase Pfam-B_8090 Pfam-B_2878 Pfam-B_5141 Pfam-B_3124 Pfam-B_459 Pfam-B_9331 Pfam-B_3587 ank Pfam-B_7034Pfam-B_7174 Pfam-B_10668 myosin_head Pfam-B_2550 Pfam-B_1601 Pfam-B_10575 Pfam-B_1640 Pfam-B_7845 Pfam-B_92 Pfam-B_10144 Pfam-B_4689Pfam-B_10082 Pfam-B_4150 Pfam-B_5142 Pfam-B_11009 Pfam-B_733 Pfam-B_554 Pfam-B_473 Pfam-B_7848 Pfam-B_1971 Pfam-B_3256 Pfam-B_3713 Pfam-B_2504 Pfam-B_11026 Pfam-B_2981 Pfam-B_2494 Pfam-B_2361 Pfam-B_11000 Pfam-B_2583 Pfam-B_3013 Pfam-B_5143Pfam-B_4110 Pfam-B_1622 Pfam-B_3369 Pfam-B_2120 Pfam-B_9130Pfam-B_1556 Pfam-B_2983Pfam-B_6943 Pfam-B_728 Pfam-B_10440 Pfam-B_460 Pfam-B_3310 Pfam-B_10143 Pfam-B_7713 Pfam-B_7843 Pfam-B_6291 Pfam-B_10131 Pfam-B_5184 Pfam-B_3315Pfam-B_7526 Pfam-B_7982 Pfam-B_10130 Pfam-B_1600 Pfam-B_5607 Pfam-B_6944 Pfam-B_2570 Pfam-B_4555 Pfam-B_2619 Pfam-B_1365 Pfam-B_11643Pfam-B_5766 Pfam-B_2982 Pfam-B_2547 Pfam-B_437 Pfam-B_10351 Pfam-B_10435Pfam-B_3121 Pfam-B_3586 Pfam-B_308 Pfam-B_5176 Pfam-B_10350 Pfam-B_407 Pfam-B_4684 Pfam-B_4851 Pfam-B_5175 Pfam-B_1788 Pfam-B_939 Pfam-B_4556 Pfam-B_2546 Pfam-B_11104 Pfam-B_352 Pfam-B_7844 Pfam-B_8696 7tm_1 notch Pfam-B_312 Pfam-B_2575 Pfam-B_11794 Pfam-B_156 Pfam-B_698 Pfam-B_4847 Pfam-B_438 Pfam-B_11017 Pfam-B_2980 Pfam-B_2618 Pfam-B_3963 Pfam-B_11638 Pfam-B_1388 ATP-synt_C Pfam-B_7712 Pfam-B_7648 Pfam-B_725 Pfam-B_2747 Pfam-B_7926 Pfam-B_1618 Pfam-B_5234 Pfam-B_5185 Pfam-B_6547 Pfam-B_2758 Pfam-B_4943 Pfam-B_1654 Pfam-B_2241 Pfam-B_1602 Pfam-B_11766Pfam-B_2122Pfam-B_2161 Pfam-B_11743 Pfam-B_3771Pfam-B_6398 Pfam-B_2140 Pfam-B_375 Pfam-B_8175 Pfam-B_6641 Pfam-B_6923 cyclin Pfam-B_10403 Pfam-B_940 Pfam-B_3940 Pfam-B_885 Pfam-B_4806 Pfam-B_8273 Pfam-B_2710 Pfam-B_4076 Pfam-B_11323 Pfam-B_4060 Pfam-B_472 Pfam-B_3952 Pfam-B_3095 Pfam-B_1774 Pfam-B_6183 Pfam-B_9916 Pfam-B_11644 Pfam-B_28 Pfam-B_735 Pfam-B_5168 Pfam-B_8982Pfam-B_2204 Pfam-B_6909Pfam-B_7442 Pfam-B_5186 Pfam-B_341 Pfam-B_2274 Pfam-B_7022 Pfam-B_8196 Pfam-B_523 Pfam-B_10021 Pfam-B_3944 Pfam-B_8003 Pfam-B_8230 Pfam-B_275 Pfam-B_6417 Pfam-B_7282 Pfam-B_8194 Pfam-B_3939 Pfam-B_7111 Pfam-B_8142 Pfam-B_6184 Pfam-B_2519 Pfam-B_10576 Pfam-B_7749Pfam-B_7750 Pfam-B_936 Pfam-B_711 Pfam-B_1025 Pfam-B_8825 Pfam-B_2669 Pfam-B_9418 Pfam-B_102Pfam-B_5406 Pfam-B_9309Pfam-B_4538 Pfam-B_1265 Pfam-B_7322 Pfam-B_8120 Pfam-B_10572 Pfam-B_3945 Pfam-B_436 Pfam-B_620 Pfam-B_3071 Pfam-B_8824 Pfam-B_337Pfam-B_6098 Pfam-B_8580 Pfam-B_3953 Pfam-B_3962 Pfam-B_4085 Pfam-B_5909 Pfam-B_1656 Pfam-B_3960 Pfam-B_4225 Pfam-B_3187 Pfam-B_4059Pfam-B_5754Pfam-B_3257 Pfam-B_851 Pfam-B_8581 Pfam-B_2061 Pfam-B_2314 Pfam-B_1112 Pfam-B_5398 Pfam-B_1713 Pfam-B_2506Pfam-B_5368 Pfam-B_11584 Pfam-B_7647 Pfam-B_9006 Pfam-B_425 il8 Pfam-B_8197 Pfam-B_7951 Pfam-B_2604Pfam-B_8257 Pfam-B_4062 Pfam-B_1620 Pfam-B_7107 Pfam-B_1136 Pfam-B_4057 Pfam-B_7816 Pfam-B_7616 Pfam-B_3882 Pfam-B_2598Pfam-B_5857 Pfam-B_8138 Pfam-B_5742 Pfam-B_10573 Pfam-B_1467 Pfam-B_30 Pfam-B_477 Pfam-B_7491 Pfam-B_6847 Pfam-B_2949Pfam-B_3192 Pfam-B_135 Pfam-B_4631 Pfam-B_2058 Pfam-B_4258 Pfam-B_3081 Pfam-B_3540 Pfam-B_11808 Pfam-B_2518 Pfam-B_6600 Pfam-B_7451 serpin Pfam-B_11753 Pfam-B_865 Pfam-B_479 Pfam-B_3707 ldl_recept_a Pfam-B_1016 Pfam-B_5604Pfam-B_7049 Pfam-B_4075 Pfam-B_3072 Pfam-B_8758 pkinasePfam-B_2839 Pfam-B_7826 Pfam-B_864 Pfam-B_4058 Pfam-B_2764 Pfam-B_8193 Pfam-B_2716 Pfam-B_4006 Pfam-B_83 Pfam-B_5136 Pfam-B_7932 HTH_1 Pfam-B_5799 Pfam-B_8527 Pfam-B_4051 Pfam-B_8141 dsrm RIP Pfam-B_7595Pfam-B_699Pfam-B_2840 Pfam-B_11419 Pfam-B_8170 Pfam-B_8161 Pfam-B_1821 Pfam-B_9191Pfam-B_6308 Pfam-B_11735 Pfam-B_3706Pfam-B_5278Pfam-B_11809 Pfam-B_5421 Pfam-B_10946 Pfam-B_796Pfam-B_1238 Pfam-B_1979 Pfam-B_8975 Pfam-B_9667 Pfam-B_1612 Pfam-B_1956 Pfam-B_11199 Pfam-B_8195 Pfam-B_5397 Pfam-B_8139 Pfam-B_5908 Pfam-B_10357 Pfam-B_1525 Pfam-B_8228Pfam-B_7392 Pfam-B_1255Pfam-B_4807 Pfam-B_3916 ig Pfam-B_7008 Pfam-B_108 Pfam-B_403 Pfam-B_7010 Pfam-B_7952 Pfam-B_4976Pfam-B_219 Pfam-B_8019 Pfam-B_9911 Pfam-B_7009 Pfam-B_93 Pfam-B_1589 Pfam-B_3289 Pfam-B_3170 Pfam-B_2540 Pfam-B_5323 Pfam-B_3703 Pfam-B_768 Pfam-B_11900 Pfam-B_8145 Pfam-B_7456 Pfam-B_3082 Pfam-B_1720 Pfam-B_5741 Pfam-B_8017 Pfam-B_11830 Pfam-B_3429 Pfam-B_4664 Pfam-B_4726 Pfam-B_3618 Pfam-B_2802Pfam-B_2696 Pfam-B_5101 Pfam-B_10297 Pfam-B_9904 Pfam-B_2827 Pfam-B_6523 Pfam-B_5656 Pfam-B_7052 Pfam-B_65 Pfam-B_11744 Pfam-B_3580 Pfam-B_1436 Pfam-B_1134 Pfam-B_10558 Pfam-B_5153 Pfam-B_8407 Pfam-B_2541 Pfam-B_1883 Pfam-B_10702 Pfam-B_49 Pfam-B_3474 SH2 Pfam-B_1234 Pfam-B_5089 Pfam-B_2230 Pfam-B_8518 Pfam-B_5154 Pfam-B_8143 Pfam-B_4451 Pfam-B_125 Pfam-B_176 Pfam-B_6371 Pfam-B_6595 Pfam-B_3817 lipocalin Pfam-B_1465 Pfam-B_2225 Pfam-B_4202 Pfam-B_41 Pfam-B_594 Pfam-B_3730 Pfam-B_110 Pfam-B_4881 Pfam-B_10099 Pfam-B_7558Pfam-B_7559 Pfam-B_3887 Pfam-B_7487 Pfam-B_3564 Pfam-B_148 Pfam-B_6369 Pfam-B_878 Pfam-B_8583 Pfam-B_1026 Pfam-B_6716 Pfam-B_888 Pfam-B_2016 Pfam-B_3169 Pfam-B_38 Pfam-B_9215 Pfam-B_2913 Pfam-B_3102 Pfam-B_3080 Pfam-B_4924Pfam-B_2060 Pfam-B_10563 Pfam-B_2327 Pfam-B_1203 Pfam-B_4072 Pfam-B_7557 Pfam-B_1068 Pfam-B_5117 Pfam-B_331Pfam-B_7116 Pfam-B_11592 Pfam-B_11198 Pfam-B_3302 Pfam-B_6594Pfam-B_9750SH3 Pfam-B_9431 Pfam-B_974 Pfam-B_3069 Pfam-B_10617 Pfam-B_2223 Pfam-B_1186 Pfam-B_10580 Pfam-B_1116 Pfam-B_6721 Pfam-B_381 Pfam-B_803 Pfam-B_11591 Pfam-B_7727 Pfam-B_4244 Pfam-B_9609 Pfam-B_5404Pfam-B_8918 Pfam-B_10578 Pfam-B_4061 Pfam-B_700 Pfam-B_3899 Pfam-B_6414 Pfam-B_4077 Pfam-B_42 Pfam-B_1182 Pfam-B_2912 Pfam-B_10581 Pfam-B_11146 Pfam-B_3516 Pfam-B_2228 Pfam-B_8207 Pfam-B_10562 Pfam-B_910 Pfam-B_5079 Pfam-B_2884 Pfam-B_5173 Pfam-B_150 Pfam-B_4420 Pfam-B_2701 Pfam-B_3572 Pfam-B_11021 Pfam-B_7505 Pfam-B_1331 Pfam-B_1283 Pfam-B_284 Pfam-B_2400 Pfam-B_11593 Pfam-B_3663 Pfam-B_6380 Pfam-B_10001 Pfam-B_830 Pfam-B_7378 Pfam-B_3898 Pfam-B_4666 zn-protease Pfam-B_4607 Pfam-B_7209 ABC_tran Pfam-B_5348 Pfam-B_3883Pfam-B_671 Pfam-B_5070 Pfam-B_7233 Pfam-B_1617 Pfam-B_1466 Pfam-B_20 Pfam-B_2675 Pfam-B_9897Pfam-B_2697 C2 Pfam-B_11145 Pfam-B_7069 zf-CCHC Pfam-B_1840 Pfam-B_60Pfam-B_10589 Pfam-B_982 Pfam-B_1574Pfam-B_10298 Pfam-B_4610 Pfam-B_2855 Pfam-B_4665 Pfam-B_2310 ras Pfam-B_3165 Pfam-B_747 Pfam-B_2158 Pfam-B_9052 Pfam-B_8153 Pfam-B_8724 Y_phosphatase Pfam-B_752 Pfam-B_11155 pilin Pfam-B_534 Pfam-B_674 Pfam-B_8116 Pfam-B_9662 Pfam-B_2377 Pfam-B_10784 Pfam-B_9661 Pfam-B_10585 Pfam-B_5072 Pfam-B_530 Pfam-B_1110 Pfam-B_955 Pfam-B_5077 Pfam-B_1267 Pfam-B_11896 Pfam-B_913Pfam-B_3074 Pfam-B_2111 Pfam-B_11906 Pfam-B_762 Pfam-B_10947 Pfam-B_4594 Pfam-B_2515 Pfam-B_10299 Pfam-B_8105 Pfam-B_2978 Pfam-B_359 Pfam-B_7374Pfam-B_9846 Pfam-B_3048 Pfam-B_630Pfam-B_582 Pfam-B_411Pfam-B_382 Pfam-B_121 Pfam-B_10294 thiored Pfam-B_7115 Pfam-B_2773 Pfam-B_7235 Pfam-B_7372 Pfam-B_2659 Pfam-B_75 Pfam-B_5073Pfam-B_11771 Pfam-B_1152 Pfam-B_3327 Pfam-B_6374 Pfam-B_162 Pfam-B_1349 Pfam-B_58 Pfam-B_5069 Pfam-B_8108 Pfam-B_9848 Pfam-B_514 Pfam-B_1832 Pfam-B_7635 Pfam-B_5654 Pfam-B_3767Pfam-B_3428 Pfam-B_37 Pfam-B_7376 Pfam-B_6287 Pfam-B_8928 Pfam-B_1158 Pfam-B_2516 Pfam-B_1323 Pfam-B_2880 Pfam-B_5760 Pfam-B_4295 Pfam-B_2892 Pfam-B_1729 Pfam-B_10566Pfam-B_5078 Pfam-B_2881 Pfam-B_10397 Pfam-B_3055 Pfam-B_8409 Pfam-B_7488 Pfam-B_2676 Pfam-B_5653 Pfam-B_773Pfam-B_1056 Pfam-B_2777 Pfam-B_4675 Pfam-B_3050 Pfam-B_5405 Pfam-B_2893 Pfam-B_2846Pfam-B_11256 Pfam-B_1890 Pfam-B_1831 Pfam-B_201Pfam-B_705 Pfam-B_3575 Pfam-B_5075 Pfam-B_3547 Pfam-B_767 Pfam-B_10587 Pfam-B_604 Pfam-B_3848 Pfam-B_8094 Pfam-B_31 Pfam-B_1533 Pfam-B_3842 Pfam-B_11278 Pfam-B_506 Pfam-B_1830 Pfam-B_10707 Pfam-B_7384 Pfam-B_3758 Pfam-B_424 Pfam-B_5960 Pfam-B_4174 Pfam-B_7665 Pfam-B_8528 Pfam-B_10270 Pfam-B_5041 Pfam-B_5490 Pfam-B_1206 Pfam-B_8332 Pfam-B_3137 Pfam-B_842 Pfam-B_889 Pfam-B_7489 Pfam-B_3985 Pfam-B_2814 Pfam-B_4261Pfam-B_292 Pfam-B_794 Pfam-B_7311 Pfam-B_5491 Pfam-B_1731 Pfam-B_9974

HLH Pfam-B_701Pfam-B_6572 Pfam-B_1835 Pfam-B_5591 Pfam-B_3987 Pfam-B_4008 Pfam-B_7667 Pfam-B_4074 Pfam-B_10698 Pfam-B_11737 Pfam-B_3122 Pfam-B_11280

Pfam-B_7830Pfam-B_5264 Pfam-B_3314 Pfam-B_1820 Pfam-B_2572 Pfam-B_7668 Pfam-B_1088 Pfam-B_10718 Pfam-B_870 Pfam-B_4206 Pfam-B_6576

Pfam-B_7565Pfam-B_3136 Pfam-B_10904 Pfam-B_5809

Pfam-B_2148 Pfam-B_1022 Pfam-B_4698 Pfam-B_10692 Pfam-B_9912 Pfam-B_7136 Pfam-B_3988 Pfam-B_6619 Pfam-B_4045

Pfam-B_4686 Pfam-B_2149 sushi Pfam-B_2245 Pfam-B_731

Pfam-B_5810 Pfam-B_730 Pfam-B_6303 Pfam-B_174 Pfam-B_309 Pfam-B_8524 Pfam-B_8954 Pfam-B_6845 Pfam-B_2340 Pfam-B_10719Pfam-B_3995 Pfam-B_4701 Pfam-B_915 Pfam-B_6915 Pfam-B_6489 Pfam-B_11100Pfam-B_4945Pfam-B_5413 Pfam-B_7836 Pfam-B_7640 Pfam-B_5197 Pfam-B_2304 Pfam-B_9692 Pfam-B_7997 Pfam-B_843 Pfam-B_4044 Pfam-B_6916 Pfam-B_8466 Pfam-B_844 Pfam-B_5045 Pfam-B_2288 Pfam-B_7751 Pfam-B_10703 Pfam-B_2139 Pfam-B_8397 Pfam-B_5512 Pfam-B_5540 Pfam-B_4700 Pfam-B_4269 Pfam-B_1414 Pfam-B_2900 Pfam-B_10527 Pfam-B_368 Pfam-B_6216 Pfam-B_4139 Pfam-B_358 Pfam-B_3492 Pfam-B_1593 Pfam-B_8076 Pfam-B_6859 Pfam-B_3222 Pfam-B_4126 Pfam-B_8372 Pfam-B_7680 Pfam-B_5199 Pfam-B_2188 Pfam-B_3237 Pfam-B_8379 Pfam-B_8443 Pfam-B_1649 EGF Pfam-B_3249 Pfam-B_8376 Pfam-B_7453 Pfam-B_8371 lectin_c Pfam-B_8383Pfam-B_3059 Pfam-B_8717 Pfam-B_8460 Pfam-B_4127 rrm Pfam-B_1223 Pfam-B_7681 Pfam-B_5109 Pfam-B_4129 Pfam-B_8447 Pfam-B_10722 Pfam-B_5196 Pfam-B_5198 Pfam-B_2635 Pfam-B_1822 Pfam-B_8373 Pfam-B_11590 Pfam-B_4658Pfam-B_7999 Pfam-B_9684Pfam-B_8005Pfam-B_4128 Pfam-B_5339 Pfam-B_5340 Pfam-B_4659 Pfam-B_5084Pfam-B_4130 Pfam-B_1800 Pfam-B_5510 Pfam-B_9683 Pfam-B_7091 Pfam-B_3942 Pfam-B_7686 Pfam-B_1144 Pfam-B_6858 Pfam-B_948 homeobox Pfam-B_5541 Pfam-B_3574 Pfam-B_2416 Pfam-B_5085 Pfam-B_3173 Pfam-B_8157 Pfam-B_9276 Pfam-B_3097 Pfam-B_9605 Pfam-B_3557 Pfam-B_2131 Pfam-B_1143 Pfam-B_5266 Pfam-B_3947 Pfam-B_667 pou Pfam-B_9606 Pfam-B_918 Pfam-B_993 Pfam-B_2147 Pfam-B_8163 Pfam-B_1502Pfam-B_10726 Pfam-B_3593 Pfam-B_3223Pfam-B_814 Pfam-B_4894 Pfam-B_10727 Pfam-B_3837 Pfam-B_10721

Pfam-B_8380 Pfam-B_1089 Pfam-B_5341Pfam-B_2630Pfam-B_8523Pfam-B_8370 Pfam-B_160 Pfam-B_372 Pfam-B_1351Pfam-B_10708 Pfam-B_2291 Pfam-B_6580 Pfam-B_141 Pfam-B_919 Pfam-B_6578 Pfam-B_10720 Pfam-B_1922 Pfam-B_2792 Pfam-B_831 Pfam-B_7349

Pfam-B_7354 Pfam-B_1220

Pfam-B_969 Pfam-B_159Pfam-B_10711 Pfam-B_5265

Pfam-B_2715 Pfam-B_2643 Pfam-B_2906

Pfam-B_2420

Pfam-B_6963

Pfam-B_9046

Pfam-B_11833Pfam-B_3741 Pfam-B_9180 Pfam-B_7669

Pfam-B_7897 Pfam-B_5959Pfam-B_8516 Pfam-B_5048 Pfam-B_964 Pfam-B_6965 Pfam-B_4277Pfam-B_7118 Pfam-B_7462

Pfam-B_2421 Pfam-B_466 Pfam-B_1893 Pfam-B_5571 zf-C2H2 Pfam-B_1965 Pfam-B_10736 Pfam-B_8515 Pfam-B_605 Pfam-B_7664

Pfam-B_1961 Pfam-B_3877 Pfam-B_9182 Pfam-B_1373 Pfam-B_6964 Pfam-B_4712Pfam-B_467 Pfam-B_9042 Pfam-B_167 Pfam-B_9047

Pfam-B_7117 Pfam-B_10760 Pfam-B_5316 Pfam-B_9043Pfam-B_7119 Pfam-B_5533 Pfam-B_16 Pfam-B_9045 Pfam-B_267 Pfam-B_2729 Pfam-B_10758 Pfam-B_9 Pfam-B_9241 Pfam-B_9650 Pfam-B_2907 Pfam-B_166 Pfam-B_6485 Pfam-B_9041 Pfam-B_3824 Pfam-B_5794 Pfam-B_1368Pfam-B_6040 Pfam-B_2419

Pfam-B_1448 Pfam-B_4696Pfam-B_9436 Pfam-B_646Pfam-B_6588 Pfam-B_1739 Pfam-B_9044 Pfam-B_10752Pfam-B_6586 Pfam-B_4 Pfam-B_6484

Pfam-B_7695 Pfam-B_10751Pfam-B_4713Pfam-B_233

Pfam-B_22 GTP_EFTU

Pfam-B_8323

Pfam-B_1985

Pfam-B_10303

Pfam-B_2826

Pfam-B_6333 Pfam-B_2362 Pfam-B_10168 Pfam-B_56 Pfam-B_464 Pfam-B_5435Pfam-B_3200 Pfam-B_361 Pfam-B_1990 Pfam-B_1281 Pfam-B_6403 Pfam-B_3917 Pfam-B_278 Pfam-B_6733 Pfam-B_900Pfam-B_10253 Pfam-B_1469 Pfam-B_2927 Pfam-B_6174 Pfam-B_3853 Pfam-B_123 Pfam-B_3507 Pfam-B_4382 Pfam-B_5973 Pfam-B_3164 Pfam-B_5643 Pfam-B_7300 Pfam-B_8989 Pfam-B_3412 Pfam-B_8224 Pfam-B_2994 Pfam-B_6849 Pfam-B_791 Pfam-B_10167 Pfam-B_8222 Pfam-B_10477 Pfam-B_8987 Pfam-B_189 Pfam-B_10157 Pfam-B_984 Pfam-B_7607 Pfam-B_6332 Pfam-B_3133 Pfam-B_6875 Pfam-B_191 laminin_EGF Pfam-B_6419 Pfam-B_7769 Pfam-B_10469 Pfam-B_288 Pfam-B_10512Pfam-B_10510 Pfam-B_140 Pfam-B_2002 Pfam-B_4722 Pfam-B_7770 Pfam-B_4560 Pfam-B_2953 Pfam-B_826 Pfam-B_4913 Pfam-B_271 Pfam-B_4028 Pfam-B_5734 Pfam-B_11490 Pfam-B_5000 Pfam-B_3943 Pfam-B_3411 Pfam-B_9565 Pfam-B_6537 Pfam-B_4083 Pfam-B_7577 Pfam-B_1904 Zn_clus Pfam-B_7197 Pfam-B_6532 Pfam-B_3901 Pfam-B_7796 Pfam-B_10640 Pfam-B_131 Pfam-B_613 Pfam-B_10237 Pfam-B_1200 Pfam-B_3549 Pfam-B_6610 Pfam-B_138 Pfam-B_10226 Pfam-B_10166 Pfam-B_10465 Pfam-B_7904 Pfam-B_5847 Pfam-B_1946 Pfam-B_8 Pfam-B_10637 Pfam-B_1440 Pfam-B_397 Pfam-B_7602 Pfam-B_3427 Pfam-B_419 Pfam-B_503 Pfam-B_612 Pfam-B_2844 Pfam-B_6330 Pfam-B_10463 Pfam-B_11860 Pfam-B_1758 Pfam-B_5376 oxidored_fad Pfam-B_10460 Pfam-B_3972 Pfam-B_6248 Pfam-B_6042 Pfam-B_4710 Pfam-B_1291 Pfam-B_6534 Pfam-B_3109 Pfam-B_2755 Pfam-B_445 Pfam-B_7605 Pfam-B_1668 Pfam-B_11301 response_reg Pfam-B_1521Pfam-B_827 Pfam-B_9303 Pfam-B_3902 Pfam-B_5225 Pfam-B_8223 Pfam-B_5138 Pfam-B_10509 Pfam-B_8988 Pfam-B_11143 Pfam-B_2524 Pfam-B_7771 Pfam-B_7195 Pfam-B_2365 Pfam-B_6475 Pfam-B_427 Pfam-B_765 Pfam-B_9365 Pfam-B_839 Pfam-B_4617 Pfam-B_1421 Pfam-B_9267 Pfam-B_6946 Pfam-B_2443 Pfam-B_6340 Pfam-B_333 Pfam-B_214 Pfam-B_10852 Pfam-B_1498 Pfam-B_1829Pfam-B_9255 Pfam-B_4997 Pfam-B_838 Pfam-B_10479 Pfam-B_10476 Pfam-B_139 Pfam-B_5387 Pfam-B_3854 Pfam-B_1532 Pfam-B_3956 Pfam-B_7600 Pfam-B_3721 Pfam-B_5162 Pfam-B_6339 Pfam-B_1913 Pfam-B_2887 Pfam-B_6298 Pfam-B_10464 Pfam-B_283 Pfam-B_1997 Pfam-B_6347 Pfam-B_4093 Pfam-B_9158 Pfam-B_7608 Pfam-B_2986 Pfam-B_165 Pfam-B_4084 Pfam-B_9162 tRNA-synt_1 Pfam-B_1960 Pfam-B_8026 Pfam-B_718 Pfam-B_766 Pfam-B_109 Pfam-B_418 Pfam-B_1591 Pfam-B_395 Pfam-B_9827 Pfam-B_3881 Pfam-B_10281 Pfam-B_9264 Pfam-B_1481 Pfam-B_2071 Pfam-B_10169 Pfam-B_430 Pfam-B_408 Pfam-B_3404 Pfam-B_4453 Pfam-B_4313 Pfam-B_7603 Pfam-B_676 Pfam-B_1399 Pfam-B_1562 Pfam-B_3852 Pfam-B_422Pfam-B_7196 Pfam-B_10159 Pfam-B_703 Pfam-B_5157 Pfam-B_2373 Pfam-B_99 Pfam-B_10242 Pfam-B_2608 Pfam-B_862 aldedh Pfam-B_505 Pfam-B_7301 Pfam-B_7017 Pfam-B_10164 Pfam-B_4561 Pfam-B_3738 Pfam-B_4652 Pfam-B_387 Pfam-B_4036 Pfam-B_491 Pfam-B_7560 Pfam-B_2789 Pfam-B_7108 Pfam-B_6337 Pfam-B_787 Pfam-B_4310 Pfam-B_5375 Pfam-B_1683 Pfam-B_4887 Pfam-B_2842 Pfam-B_10246 Pfam-B_6336 Pfam-B_575 Pfam-B_7248 Pfam-B_9698 Pfam-B_10163 Pfam-B_2724 Pfam-B_9274 Pfam-B_423Pfam-B_1204 Pfam-B_13 Pfam-B_5313 Pfam-B_1972 Pfam-B_10466 Pfam-B_213 Pfam-B_3971 Pfam-B_9859 Pfam-B_4311 Pfam-B_5377 Pfam-B_6335 Pfam-B_474 Pfam-B_55 Pfam-B_388 Pfam-B_2784 Pfam-B_1614 Pfam-B_595 Pfam-B_7606 Pfam-B_1183 Pfam-B_118 Pfam-B_1828 Pfam-B_11841 Pfam-B_70 Pfam-B_2926Pfam-B_3162 Pfam-B_1676 Pfam-B_1354 Pfam-B_5652 Pfam-B_2841Pfam-B_3508 Pfam-B_5797 Pfam-B_5926 Pfam-B_7736 Pfam-B_1996Pfam-B_8027 laminin_B Pfam-B_723 Pfam-B_1111 Pfam-B_1212 Pfam-B_4506 Pfam-B_2034 Pfam-B_9168 Pfam-B_2124 Pfam-B_10245 Pfam-B_4653 Pfam-B_7601 Pfam-B_360 Pfam-B_5730 Pfam-B_2725 Pfam-B_3154 Pfam-B_35 Pfam-B_4703 Pfam-B_2928 Pfam-B_820 Pfam-B_6608 Pfam-B_5360 Pfam-B_3117 Pfam-B_4312 Pfam-B_3321 Pfam-B_6342 Pfam-B_11164 Pfam-B_3149 Pfam-B_3505 Pfam-B_988 Pfam-B_222 Pfam-B_1167 Pfam-B_2211 Pfam-B_1199 Pfam-B_7041 Pfam-B_10478 Pfam-B_6144 Pfam-B_492 Pfam-B_9347 Pfam-B_107 Pfam-B_1706 Pfam-B_7994 Pfam-B_3528 Pfam-B_2017 Pfam-B_510 Pfam-B_7246 Pfam-B_4407 Pfam-B_3343 Pfam-B_3603 Pfam-B_6086 Pfam-B_11487 Pfam-B_4279 Pfam-B_1237 Pfam-B_552 Pfam-B_9706 Pfam-B_10243 Pfam-B_10370 Pfam-B_2252 Pfam-B_1882 Pfam-B_4339 Pfam-B_1213 lectin_legB Pfam-B_3724 Pfam-B_1079 Pfam-B_1357 Pfam-B_5846 Pfam-B_1575 Pfam-B_7880 laminin_Nterm Pfam-B_2157 Pfam-B_11918 Pfam-B_2333 Pfam-B_262 copper-bind Pfam-B_8096 Pfam-B_1725 Pfam-B_2434 Pfam-B_1356 Pfam-B_3153 Pfam-B_3345 Pfam-B_9159 Pfam-B_414 Pfam-B_1420 Pfam-B_6334 Pfam-B_2 Pfam-B_6426 Pfam-B_5385 Pfam-B_7501 Pfam-B_10373 Pfam-B_597 Pfam-B_10368 Pfam-B_1823 Pfam-B_6542 p450 Pfam-B_3344 DNA_pol Pfam-B_2581 Pfam-B_7299 Pfam-B_3530 Pfam-B_2870 Pfam-B_2425 Pfam-B_2125 Pfam-B_10965 Pfam-B_2977 Pfam-B_5383 Pfam-B_3458 Pfam-B_1433 Pfam-B_10371 Pfam-B_10555 Pfam-B_3670 Pfam-B_1724 Pfam-B_6601 Pfam-B_4155 Pfam-B_235 Pfam-B_916 Pfam-B_3075 ketoacyl-synt Pfam-B_52 Pfam-B_8057Pfam-B_5361 Pfam-B_10508 Pfam-B_509 Pfam-B_8095 Pfam-B_11763 Pfam-B_24 Pfam-B_3063 Pfam-B_6492 Pfam-B_7425 Pfam-B_9231 Pfam-B_11500Pfam-B_1473 Pfam-B_11511 Pfam-B_117 Pfam-B_11762 Pfam-B_1232 Pfam-B_36 Pfam-B_9227 Pfam-B_287 Pfam-B_10633 Pfam-B_3672 Pfam-B_958 hormone Pfam-B_6659 Pfam-B_2956 Pfam-B_4157 Pfam-B_17 Pfam-B_9230 Pfam-B_133 Pfam-B_234 Pfam-B_9590 Pfam-B_11649 Pfam-B_560 Pfam-B_6736

Pfam-B_6863 Pfam-B_177 Pfam-B_6846 Pfam-B_2988 Pfam-B_6819

Pfam-B_126 Pfam-B_1761 Pfam-B_4873 Pfam-B_517 Pfam-B_4191 Pfam-B_50 Pfam-B_355 Pfam-B_6822 Pfam-B_53 Pfam-B_405 Pfam-B_927

Pfam-B_1233

Pfam-B_6862

Pfam-B_9819 Pfam-B_3560 Pfam-B_3510 Pfam-B_5334 Pfam-B_5215 Pfam-B_5400 Pfam-B_9383 Pfam-B_8684 Pfam-B_8471 Pfam-B_1348 Pfam-B_2973Pfam-B_5789 Pfam-B_256 Pfam-B_2431 Pfam-B_3787 Pfam-B_895 Pfam-B_622 Pfam-B_8742 Pfam-B_4271 Pfam-B_712 Pfam-B_6349 Pfam-B_10644 Pfam-B_6233 Pfam-B_557 Pfam-B_1949 Pfam-B_3184 Pfam-B_5123 Pfam-B_6452 Pfam-B_9423 Pfam-B_4374 Pfam-B_3613 vwd Pfam-B_5671 Pfam-B_9831 Pfam-B_1318 Pfam-B_6355 Pfam-B_1296 Pfam-B_3511 Pfam-B_641 Pfam-B_124 Pfam-B_9818 Pfam-B_163 Pfam-B_4634 Pfam-B_469 Pfam-B_5974 Pfam-B_3737 Pfam-B_11806 Pfam-B_3949 Pfam-B_4354 Pfam-B_9932 Pfam-B_1699Pfam-B_10091 Pfam-B_9821 Pfam-B_2429 Pfam-B_4286 Pfam-B_9344 Pfam-B_911 Pfam-B_5216 Pfam-B_2342Pfam-B_3495 Pfam-B_3107 Pfam-B_5945 Pfam-B_10391 Pfam-B_2858 Pfam-B_4984 Pfam-B_2565 Pfam-B_11600 Pfam-B_2815 Pfam-B_3354 Pfam-B_2533 Pfam-B_9817 Pfam-B_6128 Pfam-B_10823 Pfam-B_4143 Pfam-B_6605 Pfam-B_2896 Pfam-B_10090 TGF-beta Pfam-B_3843 Pfam-B_4209 Pfam-B_1877 Pfam-B_9832 Pfam-B_513 Pfam-B_1963 Pfam-B_959 Pfam-B_11805 Pfam-B_3951 Pfam-B_2834 Pfam-B_2112 Pfam-B_922 Pfam-B_720 Pfam-B_5972 Pfam-B_5782 Pfam-B_627 actin Pfam-B_9382 Pfam-B_3016 Pfam-B_994 Pfam-B_2428 Pfam-B_8688 Pfam-B_247 Pfam-B_2426 Pfam-B_206 Pfam-B_2596 Pfam-B_190 Pfam-B_651 Pfam-B_3186 Pfam-B_6353 aminotran Pfam-B_10688 Pfam-B_2456 Pfam-B_9833 Pfam-B_902 Pfam-B_164 Pfam-B_6195 Pfam-B_2674 Pfam-B_583 Pfam-B_6778Pfam-B_4902 Pfam-B_2816 Pfam-B_351 Pfam-B_5647 Pfam-B_10789 Pfam-B_4497 Pfam-B_1487 Pfam-B_7503 Pfam-B_1982 Pfam-B_7755 Pfam-B_379 7tm_2 Pfam-B_6599 Pfam-B_2737 Pfam-B_8670 Pfam-B_413 Pfam-B_111 Pfam-B_3483 Pfam-B_457 Pfam-B_4472 Pfam-B_7722 Pfam-B_9933 Pfam-B_846 Pfam-B_779 Pfam-B_10825 Pfam-B_4518 Pfam-B_249 Pfam-B_180 Pfam-B_3090 Pfam-B_240 Pfam-B_5122 Pfam-B_2374 Pfam-B_179 Pfam-B_2748 Pfam-B_1003 Cys-protease Pfam-B_10087 Pfam-B_1359 Pfam-B_402 Pfam-B_1317 Pfam-B_9822 Pfam-B_9991 Pfam-B_2895 Pfam-B_7068 Pfam-B_1695 Pfam-B_977 Pfam-B_10820 Pfam-B_2894 Pfam-B_966 Pfam-B_3950 Pfam-B_10147 Pfam-B_370 Pfam-B_1745 Pfam-B_4194 Pfam-B_1337 Pfam-B_1964 Pfam-B_5971 Pfam-B_719 Pfam-B_3330 Pfam-B_9601 Pfam-B_1981 Pfam-B_9992 Pfam-B_1071 Pfam-B_6354 Pfam-B_11350 Pfam-B_1339 Pfam-B_10821 Pfam-B_8662 Pfam-B_5627 Pfam-B_561 Pfam-B_694 Pfam-B_3469 Pfam-B_3183Pfam-B_11422 Pfam-B_343 Pfam-B_2673 Pfam-B_2664 Pfam-B_3512 Pfam-B_6450 Pfam-B_9993 Pfam-B_4885 Pfam-B_1406 Pfam-B_9421 Pfam-B_1744 Pfam-B_2742 Pfam-B_277 Pfam-B_4285 Pfam-B_7098 Pfam-B_2959 Pfam-B_9346 Pfam-B_6129 Pfam-B_5399 Pfam-B_10678 Pfam-B_7834 Pfam-B_809 Pfam-B_7762 Pfam-B_2738 Pfam-B_6598 Pfam-B_7723 Pfam-B_9830 Pfam-B_9934 Pfam-B_8623 Pfam-B_10092 Pfam-B_4537Pfam-B_2359 Pfam-B_1694 Pfam-B_2286 Pfam-B_3948 Pfam-B_9327 Pfam-B_10086 Pfam-B_3355 Pfam-B_2685 Pfam-B_591 Pfam-B_769 Pfam-B_9931 Pfam-B_602 Pfam-B_5217 Pfam-B_3087 Pfam-B_6231 Pfam-B_9811 Pfam-B_2069 Pfam-B_5006 Pfam-B_2629 Pfam-B_2427 Pfam-B_976 Pfam-B_2430 Pfam-B_254 Pfam-B_3457 Pfam-B_67 Pfam-B_5651 subtilase Pfam-B_318 Pfam-B_10677 Pfam-B_198 Pfam-B_136 Pfam-B_3463 Pfam-B_3041 Pfam-B_7795 Pfam-B_1461 Pfam-B_3931 Pfam-B_10816 Pfam-B_2191Pfam-B_5590Pfam-B_7383 Pfam-B_1780 Pfam-B_6627 Pfam-B_10793 Pfam-B_1569 Pfam-B_5589 Pfam-B_9820 Pfam-B_7721 Pfam-B_5171 Pfam-B_6352 Pfam-B_1558 Pfam-B_10785 Pfam-B_5944 Pfam-B_2375 Pfam-B_2003 Pfam-B_965

Pfam-B_8749 Pfam-B_301 Pfam-B_1150 Pfam-B_1418 Pfam-B_3246 Pfam-B_7146 Pfam-B_2001 Pfam-B_2382 Pfam-B_8067 Pfam-B_9237 Pfam-B_949 Pfam-B_951 Pfam-B_3286 Pfam-B_169 heme_1 Pfam-B_9778 Pfam-B_2418 Pfam-B_6959 Pfam-B_1312 Pfam-B_5620Pfam-B_9358 Pfam-B_2507 Pfam-B_857 Pfam-B_6701 Pfam-B_6317 Pfam-B_4720 Pfam-B_538 Pfam-B_1272Pfam-B_1568 Pfam-B_6346 Pfam-B_1937 Pfam-B_10717 Pfam-B_1221 Pfam-B_97 Pfam-B_983 Pfam-B_3860Pfam-B_9143 Pfam-B_7405 Pfam-B_4054 Pfam-B_7366 Pfam-B_2268 Pfam-B_682 Pfam-B_3525 Pfam-B_4327 Pfam-B_2837 Pfam-B_3820 Pfam-B_3579 Pfam-B_9674 Pfam-B_8109 Pfam-B_9212 Pfam-B_4364 Pfam-B_7619 Pfam-B_9668 Pfam-B_7975 Pfam-B_3287 Pfam-B_4637 Pfam-B_1596 Pfam-B_5532 Pfam-B_9317 Pfam-B_7775 Pfam-B_3768 Pfam-B_173 Pfam-B_8702 Pfam-B_1129 Pfam-B_2082 Pfam-B_8712 Pfam-B_4704 Pfam-B_7654 Pfam-B_5618 Pfam-B_6478 Pfam-B_6502 pyr_redox Pfam-B_3350 Pfam-B_1484 Pfam-B_9030 Pfam-B_1178 Pfam-B_8650 Pfam-B_2512 Pfam-B_542 Pfam-B_3573 Pfam-B_7627 Pfam-B_3821 Pfam-B_3061 Pfam-B_10319 Pfam-B_8753 Pfam-B_1127 Pfam-B_9304 Pfam-B_2023 Pfam-B_835 Pfam-B_11715 Pfam-B_4328 Pfam-B_7773 Pfam-B_1130 Pfam-B_3935 Pfam-B_1278 Pfam-B_1523 oxidored_molyb Pfam-B_8703 Pfam-B_4207 Pfam-B_5835 Pfam-B_3915 Pfam-B_329 Pfam-B_4186 Pfam-B_4552 Pfam-B_2838 Pfam-B_1153 Pfam-B_7363 Pfam-B_200 Pfam-B_1280Pfam-B_5389 Pfam-B_4925 Pfam-B_2599 Pfam-B_7615 Pfam-B_7362 Pfam-B_10638 Pfam-B_9675 Pfam-B_6362 Pfam-B_106 Pfam-B_8065 Pfam-B_8127 Pfam-B_921 Pfam-B_679 Pfam-B_5390 Pfam-B_8252 Pfam-B_2266 Pfam-B_775 Pfam-B_4961 Pfam-B_1595 Pfam-B_2432 Pfam-B_7768 Pfam-B_1812 Pfam-B_10710 Pfam-B_3558 Pfam-B_9357 Pfam-B_7653 Pfam-B_1626 Pfam-B_250 Pfam-B_882 Pfam-B_8467 Pfam-B_1115 Pfam-B_1087 Pfam-B_2480 Pfam-B_556 Pfam-B_10190 Pfam-B_3839 Pfam-B_683 Pfam-B_10716 Pfam-B_1579 Pfam-B_11716 Pfam-B_8253 Pfam-B_3903 Pfam-B_6323 Pfam-B_8489 sigma54 Pfam-B_3120 Pfam-B_493 Pfam-B_1586 Pfam-B_3905 Pfam-B_3822 Pfam-B_868 Pfam-B_244 Pfam-B_9989 Pfam-B_11717 Pfam-B_2803 Pfam-B_950 Pfam-B_8754Pfam-B_2613 Pfam-B_7416 Pfam-B_5934 Pfam-B_7365 Pfam-B_2209 Pfam-B_7594 Pfam-B_6029 Pfam-B_7285 Pfam-B_9146 Pfam-B_2229 Pfam-B_207 Pfam-B_2989 kazal Pfam-B_7160 Pfam-B_4470 Pfam-B_1737 HSP20 Pfam-B_362 Pfam-B_7153 Pfam-B_1320 Pfam-B_1431 Pfam-B_7593 Pfam-B_811 Pfam-B_4719 Pfam-B_305 Pfam-B_7625 Pfam-B_9211 Pfam-B_4185 Pfam-B_1485 Pfam-B_3031 Pfam-B_2888 Pfam-B_8755 Pfam-B_170 Pfam-B_6480 Pfam-B_3614 Pfam-B_2476 Pfam-B_1518 Pfam-B_2591 Pfam-B_8056 Pfam-B_2850 Pfam-B_5239 Pfam-B_7624 Pfam-B_1794 Pfam-B_1583 Pfam-B_8071 Pfam-B_10318 Pfam-B_543 Pfam-B_1017 Pfam-B_9936 Pfam-B_11714 Pfam-B_10218 Pfam-B_10529 Pfam-B_2642 Pfam-B_5558 Pfam-B_780 Pfam-B_7569 Pfam-B_4550 Pfam-B_9979 Pfam-B_3268

Pfam-B_11320 Pfam-B_1953 Pfam-B_1450 Pfam-B_3426Pfam-B_3856 Pfam-B_6032 Pfam-B_891Pfam-B_9876Pfam-B_9521Pfam-B_8012Pfam-B_7552Pfam-B_7551Pfam-B_750Pfam-B_7465Pfam-B_7329Pfam-B_691Pfam-B_6428Pfam-B_6118Pfam-B_6045Pfam-B_5686Pfam-B_5062Pfam-B_504Pfam-B_3552Pfam-B_3405Pfam-B_3378Pfam-B_1413Pfam-B_10795Pfam-B_10037 Pfam-B_2356 Pfam-B_11331 Pfam-B_1549 Pfam-B_8718 Pfam-B_2923 Pfam-B_5552 Pfam-B_3920 Pfam-B_383Pfam-B_5746 Pfam-B_6046 Pfam-B_9288 Pfam-B_6993 Pfam-B_1371 Pfam-B_9467 Pfam-B_1509 Pfam-B_7006 Pfam-B_586 Pfam-B_86 Pfam-B_558 Pfam-B_2050 Pfam-B_2018 Pfam-B_2406 Pfam-B_349 Pfam-B_8587 Pfam-B_3244 Pfam-B_1132 Pfam-B_5131 Pfam-B_4916 Pfam-B_482 Pfam-B_470 Pfam-B_9091 Pfam-B_1962 Pfam-B_4789 Pfam-B_5539Pfam-B_4767 Pfam-B_59 Pfam-B_1287 Pfam-B_9321 Pfam-B_1332 Pfam-B_5970 Pfam-B_6180 Pfam-B_713 Pfam-B_2627 Pfam-B_10602 Pfam-B_9731 Pfam-B_8585 Pfam-B_8461 Pfam-B_793 Pfam-B_7553 Pfam-B_7001 Pfam-B_409 Pfam-B_3623 Pfam-B_3129 Pfam-B_2466 Pfam-B_11898 Pfam-B_5650 Pfam-B_925 Pfam-B_172 Pfam-B_848 Pfam-B_3645 Pfam-B_4511 Pfam-B_4146 Pfam-B_1090 Pfam-B_4884 Pfam-B_6581 Pfam-B_11092 cpn60 Pfam-B_5923 alpha-amylasePfam-B_8711 tsp_1 Pfam-B_9282 Pfam-B_2990Pfam-B_87 Pfam-B_96Pfam-B_6632 adh_zinc Pfam-B_609 Pfam-B_10706Pfam-B_4677Pfam-B_7410 Pfam-B_970 Pfam-B_5548 Pfam-B_1224 Pfam-B_5880Pfam-B_1507Pfam-B_6529Pfam-B_2163Pfam-B_1416Pfam-B_860Pfam-B_8545Pfam-B_7464Pfam-B_6515Pfam-B_4858Pfam-B_8802Pfam-B_8894Pfam-B_3423Pfam-B_1044Pfam-B_265Pfam-B_1094Pfam-B_433Pfam-B_1047Pfam-B_5942Pfam-B_7468Pfam-B_8795fer4_NifH Pfam-B_4548Pfam-B_8719 tRNA-synt_2 Pfam-B_4384 Pfam-B_4492 Pfam-B_1566 Pfam-B_876 Pfam-B_10079 Pfam-B_577 Pfam-B_203 Pfam-B_69 Pfam-B_4909 Pfam-B_7598 Pfam-B_366 Pfam-B_10546 Pfam-B_5551 Pfam-B_3368 Pfam-B_5538 wnt Pfam-B_51 Pfam-B_3211 Pfam-B_5553 Pfam-B_1180 Pfam-B_6173 Pfam-B_1001 Pfam-B_6057 Pfam-B_6047 Pfam-B_3594 Pfam-B_9635 Pfam-B_1462 Pfam-B_11897 Pfam-B_920 Pfam-B_4344 Pfam-B_76 Pfam-B_4733 Pfam-B_10955 Pfam-B_1217 Pfam-B_4113 Pfam-B_10609 Pfam-B_3443 Pfam-B_2662 Pfam-B_2647 Pfam-B_1131 Pfam-B_5133 Pfam-B_4915 Pfam-B_1838 Pfam-B_10926 Pfam-B_5821 Pfam-B_6988 Pfam-B_10506 Pfam-B_4707 Pfam-B_1919 Pfam-B_8445 Pfam-B_10902 Pfam-B_6773 Pfam-B_8716 Pfam-B_11075 Pfam-B_2192 Pfam-B_4352 Pfam-B_197Pfam-B_39 Pfam-B_2306 Pfam-B_264 Pfam-B_2255Pfam-B_9873Pfam-B_5991Pfam-B_4022Pfam-B_21 Pfam-B_10Pfam-B_3322Pfam-B_5104Pfam-B_1726Pfam-B_1930Pfam-B_10374Pfam-B_1503Pfam-B_1092Pfam-B_5684Pfam-B_3047Pfam-B_394Pfam-B_3539Pfam-B_2284connexinPfam-B_1412Pfam-B_10588Pfam-B_10036Pfam-B_238 Pfam-B_34 Pfam-B_645 Pfam-B_8346 Pfam-B_11321 Pfam-B_2346 Pfam-B_8473 Pfam-B_10741 Pfam-B_4821 Pfam-B_5537

Pfam-B_7359 Pfam-B_3108 Pfam-B_487 Pfam-B_8880 Pfam-B_18 Pfam-B_10660 Pfam-B_6571 Pfam-B_3844 Pfam-B_4050 Pfam-B_2121 Pfam-B_9341 Pfam-B_2645 Pfam-B_2222 Pfam-B_4848 Pfam-B_672 Pfam-B_4995 Pfam-B_4331 Pfam-B_3576 Pfam-B_2029 Pfam-B_10073Pfam-B_5662 Pfam-B_2556 Pfam-B_7583 Pfam-B_223 Pfam-B_2293 Pfam-B_9537 Pfam-B_8848 Pfam-B_7496 Pfam-B_6512 Pfam-B_8453 Pfam-B_4123 Pfam-B_8672Pfam-B_224 Pfam-B_631 Pfam-B_7725 Pfam-B_8840 Pfam-B_7495 Pfam-B_3561 Pfam-B_1175 Pfam-B_3045 Pfam-B_2741 Pfam-B_1639 Pfam-B_5071 Pfam-B_6994 Pfam-B_2186 Pfam-B_4501 Pfam-B_1300 Pfam-B_5121 Pfam-B_3566 Pfam-B_2663Pfam-B_2205Pfam-B_8589 Pfam-B_2938Pfam-B_11399Pfam-B_2048 Pfam-B_321Pfam-B_183Pfam-B_1847Pfam-B_2866Pfam-B_2865Pfam-B_4973Pfam-B_1098Pfam-B_2448Pfam-B_2449 Pfam-B_5978Pfam-B_9470 Pfam-B_9219Pfam-B_9220 Pfam-B_217Pfam-B_8208 Pfam-B_4004Pfam-B_7924 Pfam-B_7597Pfam-B_7596 Pfam-B_1334Pfam-B_6115 Pfam-B_78 Pfam-B_589 Pfam-B_8179Pfam-B_5803 Pfam-B_2571Pfam-B_5273 Pfam-B_4175Pfam-B_3921 Pfam-B_9963Pfam-B_389 Pfam-B_9676Pfam-B_2312 Pfam-B_10202Pfam-B_1346 Pfam-B_27 Pfam-B_10658 Pfam-B_7584 Pfam-B_2793 Pfam-B_8707 Pfam-B_2770 Pfam-B_7585 Pfam-B_3450 Pfam-B_451 Pfam-B_276 Pfam-B_9251 Pfam-B_4685 Pfam-B_692 Pfam-B_5231 Pfam-B_2690 Pfam-B_3070 Pfam-B_2398 sigma70 Pfam-B_1666 Pfam-B_2411 Pfam-B_4454 Pfam-B_2137 Pfam-B_1061 Pfam-B_5720 Pfam-B_6570 Pfam-B_5721 Pfam-B_14 Pfam-B_1735 Pfam-B_1012 Pfam-B_3305adh_short Pfam-B_4266Pfam-B_1653 Pfam-B_1172 Pfam-B_4193 Pfam-B_5663 Pfam-B_2557 Pfam-B_7586 Pfam-B_9804 Pfam-B_8881 Pfam-B_10659 Pfam-B_8418 Pfam-B_4332 Pfam-B_4124 Pfam-B_2910 Pfam-B_2354 Pfam-B_3279 Pfam-B_9342Pfam-B_8451 Pfam-B_1034 Pfam-B_2807 Pfam-B_3847 Pfam-B_6996 Pfam-B_628 Pfam-B_4849 Pfam-B_5274 Pfam-B_5064 Pfam-B_10693 Pfam-B_1806 Pfam-B_1363 Pfam-B_4697 Pfam-B_11 Pfam-B_7357 Pfam-B_3091 Pfam-B_186 Pfam-B_3301

Pfam-B_5758Pfam-B_1299Pfam-B_8763thyroglobulin_1Pfam-B_8740 Pfam-B_8741Pfam-B_1121Pfam-B_2521Pfam-B_7851Pfam-B_7818Pfam-B_5258Pfam-B_5257Pfam-B_5222Pfam-B_5221Pfam-B_5220Pfam-B_1807Pfam-B_5129Pfam-B_10545Pfam-B_3699Pfam-B_2462Pfam-B_11726Pfam-B_10530 thiolase Pfam-B_1147Pfam-B_9964Pfam-B_2806Pfam-B_9938Pfam-B_10259Pfam-B_9898Pfam-B_1190Pfam-B_9880Pfam-B_10932Pfam-B_9657Pfam-B_4433Pfam-B_9656Pfam-B_9410Pfam-B_9409Pfam-B_3918Pfam-B_9320Pfam-B_5774Pfam-B_9000Pfam-B_5830Pfam-B_8813Pfam-B_8564Pfam-B_8563Pfam-B_4415Pfam-B_8322 Pfam-B_11795Pfam-B_11796Pfam-B_4786Pfam-B_11545Pfam-B_740Pfam-B_739Pfam-B_2101Pfam-B_7232Pfam-B_2864Pfam-B_6444Pfam-B_1342Pfam-B_6256Pfam-B_6138Pfam-B_6140Pfam-B_3348Pfam-B_5859Pfam-B_5738Pfam-B_5736Pfam-B_2615Pfam-B_5457Pfam-B_1263Pfam-B_5202Pfam-B_5095Pfam-B_5096Pfam-B_6579Pfam-B_4706Pfam-B_486Pfam-B_4195Pfam-B_2321Pfam-B_2320Pfam-B_962Pfam-B_2300Pfam-B_227Pfam-B_228Pfam-B_303Pfam-B_1967Pfam-B_319Pfam-B_1814Pfam-B_2650Pfam-B_1632Pfam-B_4530Pfam-B_1513Pfam-B_128Pfam-B_155pro_isomerase Pfam-B_9962 Pfam-B_9259 Pfam-B_897 Pfam-B_8787 Pfam-B_8180 Pfam-B_8133 Pfam-B_8112 Pfam-B_8111 Pfam-B_7658 Pfam-B_7051 Pfam-B_685 Pfam-B_6788 Pfam-B_5808 Pfam-B_5806 Pfam-B_5696 Pfam-B_5515 Pfam-B_5227 Pfam-B_5167 Pfam-B_5147 Pfam-B_5059 Pfam-B_4911 Pfam-B_490 Pfam-B_4759 Pfam-B_4663 Pfam-B_462 Pfam-B_4528 Pfam-B_4512 Pfam-B_45 Pfam-B_4413 Pfam-B_3993 Pfam-B_3864 Pfam-B_3849 Pfam-B_3440 Pfam-B_3385 Pfam-B_3132 Pfam-B_2833 Pfam-B_2243 Pfam-B_1943Pfam-B_11177Pfam-B_10884Pfam-B_10363 histone

Pfam-B_3455Pfam-B_816Pfam-B_8932Pfam-B_8030Pfam-B_3981Pfam-B_7646Pfam-B_9194Pfam-B_7581Pfam-B_5118Pfam-B_7478Pfam-B_481Pfam-B_729Pfam-B_9238Pfam-B_7212Pfam-B_7580Pfam-B_7134Pfam-B_3197Pfam-B_6657Pfam-B_5439Pfam-B_662Pfam-B_1582Pfam-B_6577Pfam-B_8797Pfam-B_650Pfam-B_10775Pfam-B_647Pfam-B_3046Pfam-B_6327Pfam-B_6209Pfam-B_6202Pfam-B_7731Pfam-B_61Pfam-B_9751Pfam-B_6089Pfam-B_9762Pfam-B_6071Pfam-B_9560Pfam-B_6021Pfam-B_536Pfam-B_5836Pfam-B_8605Pfam-B_579Pfam-B_246Pfam-B_5783Pfam-B_2154Pfam-B_5300Pfam-B_7355Pfam-B_5267Pfam-B_7724Pfam-B_5218Pfam-B_5155Pfam-B_5156Pfam-B_3076Pfam-B_5125Pfam-B_7122Pfam-B_5124Pfam-B_8937Pfam-B_498Pfam-B_9588Pfam-B_4820Pfam-B_6604Pfam-B_4749Pfam-B_6617Pfam-B_4721Pfam-B_507Pfam-B_4616Pfam-B_6060Pfam-B_4435Pfam-B_6013Pfam-B_4394Pfam-B_3581Pfam-B_4362Pfam-B_73Pfam-B_4324Pfam-B_8435Pfam-B_4140Pfam-B_4119Pfam-B_4120Pfam-B_7535Pfam-B_3873Pfam-B_3871Pfam-B_3872Pfam-B_3260Pfam-B_3606Pfam-B_6065Pfam-B_3430Pfam-B_1054Pfam-B_3364Pfam-B_4144Pfam-B_3242Pfam-B_3110Pfam-B_7782Pfam-B_2828Pfam-B_304Pfam-B_11891Pfam-B_2985Pfam-B_6596Pfam-B_2943Pfam-B_6719Pfam-B_2595Pfam-B_2593Pfam-B_2577Pfam-B_2559Pfam-B_2560Pfam-B_8584Pfam-B_2415Pfam-B_3437Pfam-B_2317Pfam-B_8845Pfam-B_2227Pfam-B_8803Pfam-B_2224Pfam-B_884Pfam-B_2218Pfam-B_2198Pfam-B_2199Pfam-B_2184Pfam-B_3218Pfam-B_945Pfam-B_2175Pfam-B_1463Pfam-B_1864Pfam-B_917Pfam-B_1754Pfam-B_10311Pfam-B_1522Pfam-B_1545Pfam-B_1432Pfam-B_1292Pfam-B_1426Pfam-B_1122Pfam-B_1415Pfam-B_1407Pfam-B_2114COesterase Pfam-B_1367Pfam-B_1929Pfam-B_1366Pfam-B_6172Pfam-B_1338Pfam-B_8761Pfam-B_1231Pfam-B_3728Pfam-B_11894

Pfam-B_5246Pfam-B_11892Pfam-B_4829Pfam-B_11642Pfam-B_1378Pfam-B_11366Pfam-B_9225Pfam-B_11324Pfam-B_3640Pfam-B_11292 HSP70 Pfam-B_11264Pfam-B_5678Pfam-B_11253Pfam-B_11162Pfam-B_11141Pfam-B_3416Pfam-B_11111Pfam-B_6435Pfam-B_10787Pfam-B_5896Pfam-B_1055Pfam-B_8653Pfam-B_1040Pfam-B_822Pfam-B_10120Pfam-B_8034 Cys_knotPfam-B_10543 AAA

Figure 5.13. The Family network using only circular patterns

Red nodes are nodes found in the non-CP network. Green nodes are nodes found only in the CP-network. Blue edges denote edges found in the CP network. Pink edges are edges found only in the CP network. Chapter 6

Circular Pattern Discovery

6.1 Introduction

The circular pattern discovery (CPD) problem is to identify “interesting” circular patterns in text T. Here, “interesting” is typically defined in terms of constraints in the search, for instance, based on occurrence frequency, pattern length, proximity between patterns, etc. When T is a database of sequences, additional constraints, may be imposed, for example the coverage or quorum constraint [86]. In biological applications, for instance, interesting circular patterns are likely to have biological relevance, e.g., they could point to proteins with related functions, even with low sequence similarity.

The CPD problem is related to the more well-known circular pattern matching (CPM) prob- lem discussed in Chapter 5.

To our knowledge, there is no existing work that explicitly studied the problem of pattern discovery involving circular patterns. Motivated by the increasing significance of circular per- mutations in various applications, from computational biology to pattern analysis, we propose methods to address the circular pattern discovery problem.

Main Results. We define and solve the ECPD and ACPD problems to find the “interesting” circular patterns, as defined using specified constraints.

133 CHAPTER 6. CIRCULAR PATTERN DISCOVERY 134

In this chapter, we present an algorithm to solve the ECPD problem. We also present two algorithms to solve the ACPD problem. One algorithm is based on Maes [81] CPM algorithm. Another algorithm is based on our ACPM2 algorithm. On average, the ACPD algorithm based on ACPM2 algorithm is better than the ACPD algorithm based on Maes [81] CPM algorithm.

The following two theorems represent our main contribution on the CPD problem.

Theorem 6.1: Given a database sequence SeqDB, with r sequences, and the parameters m1,m2,k, f ,g, The algorithm ECPD uses suffix trees and suffix links to solve the exact circular 2 pattern discovery problem in O(m2N) time, where N is the total number of symbols in SeqDB.

Theorem 6.2: Given a database sequence SeqDB, with r sequences, and the parameters m1,m2,k, f ,g, Algorithm ACPD which based on ACPM2 algorithm uses suffix array to solve 3 2 3 the ACPD problem in O(km2N ) worst case, and O(km2N) on average, where N is the total number of symbols in SeqDB.

Organization. In the next section, we define the circular pattern discovery problems. Al- gorithms for the ECPD problem are presented and analyzed in Section 3. The ACPD problems are introduced and solved in Section 4. In Section 5, we show experiments on analyzing circular permutations in multidomain proteins using our algorithms. In Section 6, we summarize our work on circular pattern matching problems.

6.2 The Circular Pattern Discovery Problem

Although a lot of progress has been made in pattern discovery, sequential data mining, and pattern matching, there has not been much attention to the issue of pattern discovery with cyclic patterns. While the pattern matching problem assumes that a pattern of interest will be provided before matching can start, pattern discovery does not require any initial pattern. Starting with no specific query pattern, one may seek to find “interesting” circular substrings within the sequence, or database of sequences, for example, circular substrings that occurred with a minimal number of occurrences. We call this the circular pattern discovery (CPD) problem. We consider three variations of the CPD problem below. CHAPTER 6. CIRCULAR PATTERN DISCOVERY 135

Exact Circular Pattern Discovery Problem (ECPD). Given a text T and a number f , return all the high frequency cyclic substrings s (i.e. with f requency ≥ f ) and their respective circular shifts in T.

Approximate Circular Pattern Discovery Problem (ACPD). Given a text T and the num- bers f and k, return all the high frequency circular substrings s (i.e. with f requency ≥ f ) and their respective circular shifts with k-approximate matches in T.

Circular Pattern Discovery Problem (parameterized form). Given a database sequence

Q, with r sequences, and the parameters m1,m2,k, f ,g, return all circular substrings s and their respective circular shifts that have a k-approximate match in Q, with a high occurence frequency ( f requency ≥ f ), occurs in at least g sequences (where g ≤ r), and the length m of each matching substring satisfies the constraint m1 ≤ m ≤ m2.

The parameter g models the coverage or quorum constraints, often imposed in motif dis- covery for biological sequences [86]. For instance, we may want a subsequence to appear in a certain proportion of the members of a protein family, before we can accept the subsequence as a valid motif for that family. The ECPD problem corresponds to the case with k = 0. Clearly, the parameterization could be modified to impose more or less constraints in the discovery, as desired.

The Challenge. To see the difficulty involved in the CPD problem, we can consider the complexity of the na¨ıve algorithm for the problem. First, consider pattern discovery for patterns of a specific length, say m (that is, m1 = m2 = m), on a text T of length n. For the ECPD problem, we will have (n − m + 1) substrings, each with m cyclic shifts, with each m-length pattern requiring O(nm) time to search in T. The overall time will be in O(m2n2). We can improve this to O(m2n) using standard linear time pattern matching algorithms, such as the KMP or Boyer-Moore algorithms. When we consider a range of pattern lengths (for example, 3 m1 ≤ m ≤ m2), the overall complexity becomes O(m2n). When no length is specified, we will need to consider all possible pattern lengths, hence, the overall time complexity will be in O(n4). Using a similar analysis, for the ACPD problem, we will need time in O(km2n2) for 3 2 one single pattern length m, and in O(km2n )) for a range of pattern lengths (m1 ≤ m ≤ m2). We have assumed the use of Ukkonen’s O(kn) algorithm for k-approximate matching of a given m-length substring of the text. With no specified constraint on the length, we will require time CHAPTER 6. CIRCULAR PATTERN DISCOVERY 136

in O(kn5) for the parametrized ACPD problem.

The CPD problem is related to the CPM problem, but is much more complicated. Applying CPM algorithms directly will be making the assumption that we know what we are trying to ”discover”, or would require an exhaustive consideration of all cyclic substrings. In the follow- ing, we first propose a fast ECPD algorithm, by exploiting suffix links, which are typically part of a standard suffix tree. We then address the ACPD problem, using suffix arrays.

6.3 The ECPD Algorithm

Our ECPD algorithm is based on our algorithm for the ECPM problem as presented in Section 5.2.

First the ECPD algorithm (Algorithm 6.1) will build a suffix tree ST from the sequence T. Then the algorithm checks each m-length pattern from T for possible circular pattern matching, using the suffix tree. Each operation of checking an m-length pattern for possible CPM will need an O(m) time cost using the ECPM algorithm. There are O((m2 − m1)N) patterns whose length m is in the range [m1,m2]. So the time complexity of the ECPD algorithm will be in 2 O((m2 − m1 + 1)(m2 + m1)N) = O(m2N).

Algorithm 6.1: ECPD Algorithm

ECPD(T,N,m1,m2, f ) 1 ST ← BuildSuffixTree(T)

2 for m = m1 to m2 do 3 for each m-length substring P of T do

4 ηocc ← ECPM(ST,P) 5 if ηocc ≥ f then do 6 Output the circular pattern P 7 end if 8 end for 9 end for

Based on the foregoing, we summerize our results on the ECPD problem in the following theorem: CHAPTER 6. CIRCULAR PATTERN DISCOVERY 137

Theorem 6.1: Given a database sequence SeqDB, with r sequences, and the parameters

m1,m2,k, f ,g, The ECPD Algorithm uses suffix trees and suffix links to solve the exact circular 2 pattern discovery problem in O(m2N) time, where N is the total number of symbols in SeqDB.

6.4 The ACPD Algorithm

We first apply an existing ACPM algorithm directly on the ACPD problem. This models the use of a generic ACPM algorithm for the ACPD problem. In particular, we modify Maes’s ACPM algorithm to solve the ACPD problem. Subsequently, we describe our ACPD algorithm, and compare the two methods.

6.4.1 ACPD using Maes’ Algorithm

Algorithm 6.2 shows a method to solve the ACPD problem based on Maes’ algorithm. The algorithm compares each m-length pattern from the text with each (m+k)-length subtext, where m m1 ≤ m ≤ m2. For each length m, there are min{O(N),O(|Σ| )} patterns and subtext, so there 2 2 are O(N ) comparisons. The total number of comparisons is in O((m2 − m1)N ). For each comparison, the time cost is in O(m2 logm). Thus, the time complexity of Algorithm 5.10 is m2 O(N2m2 m) = O(m3N2 m ) ∑m=m1 log 2 log 2 .

Landau’s Algorithm

Algorithm 6.3 is based on Landau’s incremental algorithm [67]. Similar to Maes’ algorithm, this algorithm compares each m-length pattern with each m + k-length subtext, where m is the m length of pattern and m1 ≤ m ≤ m2. For each length m, there are min(O(N),O(|Σ| ) patterns and 2 2 subtext, so there are O(N ) comparisons. The total number of comparisons is O((m2 −m1)N ). For each comparisons, the time cost is O(km). Thus, the time complexity of algorithm 6.3 is m2 O(N2km) = O(km2N2) ∑m=m1 2 . CHAPTER 6. CIRCULAR PATTERN DISCOVERY 138

Algorithm 6.2: ACPD Based on Maes’ algorithm

ACPD-MAES(T,N,k,m1,m2, f ) 1 for m = m1 to m2 do 2 for each m-length substring P of T do

3 ηocc ← 0 4 for i = 1 to N − m − k do 5 if MAESALGORITHM(P,T[i...i + m + k]) is true then do

6 ηocc ← ηocc + 1 7 end if 8 end for

9 if ηocc ≥ f then do 10 Output P 11 end if 12 end for 13 end for

Algorithm 6.3: ACPD Based on Landau’s algorithm

ACPD-LANDAU(T,N,k,m1,m2, f ) 1 ST ← BuildSuffixTree(T)

2 for m = m1 to m2 do 3 for each m-length substring Pm do 4 for i = 1 to N − m − k do

5 LANDAUALGORITHM(Pm,T[i...i + m + k]) 6 end for 7 end for 8 end for CHAPTER 6. CIRCULAR PATTERN DISCOVERY 139

6.4.2 Proposed ACPD Algorithm

Our ACPD algorithm (Algorithm 6.4) uses the same framework as the described ACPM algorithm (Section 5.3.4). The algorithm first constructs the hypotheses on potential circular pattern matches using q-gram filtration. For each hypothesis (i.e. each matching q-gram in T), the ACPD algorithm verifies all possible circular shifts that involves this q-gram match.

The algorithm constructs O(N2) worst-case number of hypotheses with parameter q = m1 b k+1 c, where N is the total length of the concatenated sequences (for a database of sequences). N2 q On average, the number of hypotheses is O( |Σ|q ). When |Σ| is close to O(N), the number of hypotheses will reduce to O(N).

At verification, the algorithm checks the circular shifts of length m1,m1 +1,...,m2. For each pattern with length m, there are O(m) circular shifts to check, where m1 ≤ m ≤ m2. The time for checking all of m-length circular shifts in the same circular pattern involving the current hypothesis is in O(km) using the verification algorithm (Algorithm 5.5). In each hypothesis, the number of circular pattern with length m is m. So the time cost of checking all circular shifts of circular pattern with length m in a hypothesis is O(km2). The total time for verification will O( m2 km2) = O(km3) be ∑m=m1 2 . The worst case time complexity of Algorithm 6.4 will thus be in 3 2 3 2 km2N q O(km2N ). On average, the runing time will be in O( |Σ|q ). When |Σ| is close to O(N), as is 3 typically the case, the time complexity will reduce to O(km2N).

A further improvement will be to exploit the huge redundancy in going from a pattern of length m to a pattern of length (m + 1). We observe that to go from an m-length pattern starting at position i in T to an (m + 1)-length pattern starting from the same position involves adding just one symbol at the end. Most k-approximate cyclic pattern matches at one length will also be cyclic matches at the next length, if the cyclic edit distance is less than k − 1. Also, non- matching regions in T for the m-length pattern cannot be a match for the (m+1)-length pattern if the circular edit distance with the m-length pattern is greater than k + 1. Exploiting this redundancy by keeping the full dynamic programming table, will lead to a further reduction of 2 2 2 the overall complexity to O(m2N ) worst case and O(m2N) on average. This is made possible only by the nature of the CPD problem. CHAPTER 6. CIRCULAR PATTERN DISCOVERY 140

Based on the foregoing, we summerize our results on the ACPD problem in the following theorem:

Theorem 6.2: Given a database sequence SeqDB, with r sequences, and the parame- 3 2 ters m1,m2,k, f ,g, Algorithm ACPD solves the ACPD problem in O(km2N ) worst case, and 3 O(km2N) on average, where N is the total number of symbols in SeqDB.

In the above CPD algorithms, we have assumed one single input sequence, T, for simplicity. For a database of sequences, we can simply concatenate the sequences in the database to form one long sequence, and then construct the generalized suffix tree (or suffix array) for the long sequence. Then, we can keep track of the number of occurrence of each circular pattern in each sequence in the database to support quorum constraints.

6.4.3 Comparison

3 2 The time complexity of the ACPD solution via Maes’ algorithm is O(m2N logm2). The 2 2 time complexity of the ACPD solution via Landau’s algorithm is O(km2N . The time complex- 3 2 m ity of the proposed ACPD algorithm is O(km2N ). The number of patterns is minN,|Σ| . The proposed ACPD algorithm is based on q-gram filtration. The average number of hypotheses N2 m is in O( m ). When |Σ| k is close to O(N), the time complexity of the ACPD algorithm will |Σ| k 3 reduce to O(km2N). By exploiting the nature of the CPD problem in terms of the redundancy between m-length paterns and (m + 1)-length patterns starting at the same location in the text, 2 2 the complexity for the proposed ACPD algorithm can be reduced to O(m2N ) worst case, and 2 O(m2N) on average.

6.5 Experiments

We performed circular pattern discovery on the same protein multidomain sequences with pattern length m, where 4 ≤ m ≤ 30. Figure 6.1 shows the variation of the number of distinct patterns (both circular and non circular pattern) with pattern lengths. When the length is 4, the number of distinct patterns is large. When the length increases, the number of distinct patterns CHAPTER 6. CIRCULAR PATTERN DISCOVERY 141

Algorithm 6.4: Proposed ACPD algorithm

ACPD-QGRAM(T,N,k,m1,m2, f ) 1 < SA,LCP > ← COMPUTESA(T ) m1 2 q ← b k+1 c, Qset ← NULL 3 for i = 1 to N do 4 if LCP[i] ≥ q then do

5 Qset ← c(Qset ,SA[i]) 6 else

7 for j = 1 to length(Qset ) do 8 for l = j+1 to length(Qset ) do 9 Ppos ← Qset [j], Tpos ← Qset [l] 10 for m = m1 to m2 do 11 T1 ← T[Tpos+q+1...Tpos+m+k]; T2 ← Text[Tpos-(m-q+k)...Tpos-1] 12 for h = 0 to m-q do

13 P ← T[Ppos+q+1...Ppos+q+h] ◦ Text[Ppos-m+q+h...Ppos-1] 14 if BIDIRECTIONALED2(P,T1,T2,k) is true then do 15 if P is the first occurrence then do

16 ηocc(P) ← 1 17 else

18 ηocc(P) ← ηocc(P) + 1 19 end if 20 end if 21 end for 22 end for 23 end for 24 end for

25 Qset ← NULL 26 end if 27 end for

28 ∀ P, Output P if ηocc(P) ≥ f CHAPTER 6. CIRCULAR PATTERN DISCOVERY 142

Algorithm 6.5: Modified ACPM Hypothesis Verification

BIDIRECTIONALED2(P,T1,T2,k) 1 P2 ← PR

2 ED1 ← DP(P,T1,k) R 3 ED2 ← DP(P2,T2 ,k) 4 ED ← ED2R S rep(0,q) S ED1 5 for h=1 to |P| + 1 6 if (ED[h] + ED[h+m-1] ≤ k) then do 7 return true 8 end if 9 end for

decreases quickly. Figure 6.2 shows the variation of the number of the circular patterns (i.e excluding the non-circular patterns) with pattern lengths. As expected, the number decreases rapidly with increasing length. In Figure 6.3 we show the maximum number of occurrences of one pattern at different pattern lengths. Table 6.1 shows the number of distinct patterns and the number of non-circular patterns at different pattern lengths. Table 6.1 also shows the ratio with Number o f distinct patterns Number o f non−circular patterns . Interestingly, some circular patterns were discovered to occur with lenghts as large as 26 symbols. That is, 26 domains were repeated, but in the form of a cyclic permutation between two multidomain proteins in the database.

Table 6.2 shows an example of a circular pattern with five domains. The pattern is {PDA1N075, PD000041, PD000041, PD248344, PD000041}. We assign the symbol ”A” to protein domain PDA1N075, the symbol ”B” to protein domain PD000041 and the symbol ”C” to protein do- main PD248344. It occurs in multi-domain protein sequences Q99407 HUMAN, ANK1 HUMAN and Q13768 HUMAN with different orders.

Table 6.3 shows an example of circular pattern with thirteen domains. The pattern is {PD000768, PD000767, PD005993, PD000768, PD000767, PDA1J766, PD005993, PD000768, PD000767, PDA1J7O1, PD000165, PD272501, PDA1J3F2}. We assign the string {FGHFGIHFGJKLM} to the pattern. It occurs in the multi-domain protein sequences Q9Y4V9 HUMAN, Q5JR23 HUMAN and Q9UJ57 HUMAN with different orders. CHAPTER 6. CIRCULAR PATTERN DISCOVERY 143

Number of Distinct Patterns 15000 10000 5000 Number of Distinct Normal Patterns 0

5 10 15 20 25 30

Pattern Length

Figure 6.1. Variation of the number of distinct patterns (including non-circular and circular patterns) with pattern length.

Number of True CPs 400 300 200 Number of True CPs Number of True 100 0

5 10 15 20 25 30

Pattern Length

Figure 6.2. Variation of number of circular patterns with pattern length. CHAPTER 6. CIRCULAR PATTERN DISCOVERY 144

Table 6.1. The number of distinct patterns with pattern length Pattern # Distinct # Non-circular # Circular Ratio Length Patterns Patterns patterns 4 80333 79934 399 100.50% 5 8199 7944 255 103.21% 6 4720 4705 15 100.32% 7 2754 2677 77 102.88% 8 836 830 6 100.72% 9 550 532 18 103.38% 10 855 855 0 100.00% 11 937 921 16 101.74% 12 1011 983 28 102.85% 13 727 676 51 107.54% 14 299 296 3 101.01% 15 265 262 3 101.15% 16 141 141 0 100.00% 17 139 139 0 100.00% 18 116 100 16 116.00% 19 50 50 0 100.00% 20 58 57 1 101.75% 21 51 51 0 100.00% 22 19 19 0 100.00% 23 37 37 0 100.00% 24 18 18 0 100.00% 25 34 34 0 100.00% 26 18 17 1 105.88% 27 19 19 0 100.00% 28 2 2 0 100.00% 29 8 8 0 100.00% 30 5 5 0 100.00%

Table 6.2. Sample discovered circular patterns with length five. A≡PDA1N075, B≡PD000041, C≡PD000041, D≡PD248344, E≡PD000041 Protein Pos order Pattern Q99407 HUMAN 6 0 ABBCB Q13768 HUMAN 7 0 ABBCB ANK1 HUMAN 6 3 CBABB Q13768 HUMAN 6 4 BABBC Q99407 HUMAN 5 4 BABBC CHAPTER 6. CIRCULAR PATTERN DISCOVERY 145

Maximum Number of Occurrences Maximum Number of Occurrenct Maximum 0 20 40 60 80 100 120

5 10 15 20 25 30

Pattern Length

Figure 6.3. Variation of maximum number of occurrences with pattern length.

Table 6.3. Sample discovered circular patterns with length thirteen. F≡PD000768, G≡PD000767, H≡PD005993, I≡PDA1J766,J≡ PDA1J7O1, K≡PD000165, L≡PD272501, M≡PDA1J3F2 Protein Pos order Pattern Q9Y4V9 HUMAN 20 0 FGHFGIHFGJKLM Q5JR23 HUMAN 21 1 GHFGIHFGJKLMF Q9UJ57 HUMAN 38 2 HFGIHFGJKLMFG CHAPTER 6. CIRCULAR PATTERN DISCOVERY 146

6.6 Summary

In this chapter, we introduced the ECPD and ACPD problems, and proposed two algorithms for their solution. The first algorithm uses suffix trees and suffix links to solve the ECPD 2 problem in O(m2N) time. The second algorithm solves the more challenging ACPD problem 2 2 2 in O(km2N ) worst case, and O(km2N) on average, using suffix arrays. By exploiting the redundancy between patterns of different lengths that start at the same position in the text, 2 the overall complexity is reduced to O(km2N ) worst case, and O(km2N) on average. Our results can be compared with the approach that directly uses Maes’ algorithm, one of the best 3 2 available ACPM algorithms for the ACPD problem, which runs in O(m2N logm2) time, or 3 O(m2N logm2) for small values of m2. Although pattern discovery has been well studied, to our knowledge, this is the first attempt at a focused study on the CPD problem.

We show an experiment for discovering the exact circular patterns in ProDom database. We reported interesting exact circular patterns discovered by our algorithm. To our knowledge, this is the first experiment at a focused study on discovering the exact circular patterns. Chapter 7

Conclusion and Future Work

7.1 Conclusion

In this work, we present two novel data structures to solve the space problem of the suffix tree and its applications. These two structures are the virtual suffix tree (VST) and the Proba- bilistic Suffix Array (PSA). We consider the circular pattern matching problem and define the circular pattern discovery problem. We present algorithms using suffix data structures to solve both the exact and inexact variants of the CPM problem. Our focus is on efficiently answering enumerative queries involving circular patterns. We also present algorithms based on our CPM algorithms to solve the CPD problem. To our knowledge, this is the first attempt at a focused study on discovering circular patterns.

The Virtual Suffix Tree. We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. The VST provides the same functionality as the suffix tree, including support for pattern matching, and suffix links. But the VST requires a much smaller space than the suffix tree and the other recently proposed space-efficient data structures for suffix trees and suffix arrays. With n=length of the sequence, the worst case space is 18n bytes compared with 20n bytes for the other data structures for suffix trees and suffix arrays (such as the enhanced suffix array [2], and the linearized suffix tree [62]). On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the

147 CHAPTER 7. CONCLUSION AND FUTURE WORK 148

regular VST, and 12.05n bytes in its compact form.

The Probabilistic Suffix Array. We have presented the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov models. The PSA pro- vides the same functionality as the probabilistic suffix tree (PST), but the space is significantly smaller than that of the PST. We construct the PSA in linear time and linear space, independent of the order of the Markov model. The space is dependent on the number of interval nodes of the suffix array. In the worst case, the memory requirement is 33N bytes. On average, the needed space will be 26N bytes, which includes the work space of the construction phase. The needed space for PSA is significantly smaller than that for the PST which is implemented on n a regular suffix tree. Prediction using the PSA is in O(mlog |Σ| )time, where m is the pattern length, and Σ is the symbol alphabet.

Circular Pattern Matching. We defined the circular pattern matching(CPM) problems in the related work. We present a linear time algorithm to solve the ECPM problem which aims at finding an exact circular pattern P in the text T. The ACPM problem is to find an approximate occurrence of circular pattern P in text T. We present three algorithms for the ACPM2 problem and one greedy algorithm that produces an incomplete result. Of all the algorithms reported in the literature for the ACPM problem, our ACPM q-gram-based bidirectional algorithm with suffix trees provides the best results on average, with respect to both time and space complexity. Using our algorithms, we performed experiments on the analysis of circular permutations in multidomain proteins. Based on the results, we developed a method for function prediction for such multidomain proteins.

Circular Pattern Discovery. Based on the work on the circular pattern matching problem, we defined the Exact Circular Pattern Discovery (ECPD) problem and the Approximate Circular Pattern Discovery (ACPD) problem. We present an ECPD algorithm that uses suffix trees and 2 suffix links to solve the exact circular pattern discovery problem in O(m2N) time. We also present an efficient algorithm to solve the ACPD problem based on our bidirectional ACPM2 algorithm. Comparing with the ACPD algorithm based on Maes [81] CPM algorithm, our algorithm is the better solution on average. CHAPTER 7. CONCLUSION AND FUTURE WORK 149

7.2 Future Work

We briefly describe some potential future work, based on the material presented in this work.

7.2.1 Circular Pattern Discovery

We have presented algorithms for the circular pattern discovery problem. Our presented algorithms are based on our CPM algorithm. In our ACPD algorithm, there are lot of repeating operations that compares the circular shifts vs. the subtext. In our future thinking, we may find a way to avoid the repeated comparisons.

7.2.2 Network Analysis for Circular Multidomain Proteins

In this work we proposed efficient algorithms for rapid detection of cyclic permutations in multidomain proteins, using suffix trees and suffix arrays. Based on the results, we formed net- works linking different multidomain proteins based on cyclic patterns observed in the proteins, using current data in the ProDom database [21]. Using these networks we performed functional annotation of the multidomain proteins. Significance of functional relatedness between two multidomain proteins was accessed using z-scores and p-scores. We performed only simple analysis of the networks, for instance, based on simple in-degree and out-degree characteristics of the nodes in the network. The overall performance using the method on a network formed with the Top 500 proteins (as ranked by degree) was as follows: sensitivity 0.81; precision 0.88, F-measure 0.84. Although based only on sequence data, and with no alignment, these results are comparable with ProFunc [69, 70] and ProKnow [101], both of which perform at around 70% accuracy. These are web-based servers that predict protein function from its 3D structure, using initial structure alignment, and a combination of algorithms. Motivated by these impressive results, an interesting future direction will be a more detailed study of these networks of cyclic patterns, built on just protein sequence data. Network characterization using more rig- orous methods such as betweeness [43], centrality [52], and pairwise discontinuity [102] could show more light on the potential relationships between related multidomain proteins, or non- CHAPTER 7. CONCLUSION AND FUTURE WORK 150

related multidomain proteins with potentially similar function.

7.2.3 From PSA to PFA

The probabilistic suffix automata is a subclass of the probabilistic finite automata (PFA) which is a simple model for learning in Markov models with order L. Ron et al. [109] proved that “every distribution generated by a probabilistic suffix automata can equivalently be gener- ated by a PST”. In our future work, one can consider how to construct a probabilistic suffix automata from our probabilistic suffix array data structure. Previous work [9, 109] has con- structed the probabilistic suffix automata from the probabilistic suffix tree (PST). But the space and time complexity is O(Ln2) which is huge. The compressed suffix array [45] and the com- pact suffix array [82, 83] are space-efficient suffix data structures. It would be interesting to study how these space-efficient suffix data structures can be used to improve the PSA.

7.2.4 Approximate Pattern Matching Using PSA

Dynamic programing [114] is a well know method to find approximate patterns, but the time complexity is O(m2n) to find all the occurrences, where m is the length of pattern and n is the length of the text. Ukkonen’s algorithm [121] improved the time complexity of this problem to O(mnk), where k is maximum error allowed in the pattern. This is still is large in practice. In our thinking, the probability generated from PSA prediction may give us a useful measurement for the approximate pattern matching problem. This simple thinking will work for existential queries in O(n+mlog|Σ|) time, where O(n) is the PSA construction cost and O(mlog|Σ|) is the prediction cost. But for enumerative queries, finding all the possible matching positions using the PSA will be a major challenge.

7.2.5 Prediction with PSA using Inexact Matching

Markov models in information retrieval are often based on item frequency and document frequency. However, in the real world, there are often some mismatch/errors (i.e. insert, delete, substitute) in the sequence. If we could compute the Markov Model based on an approximation CHAPTER 7. CONCLUSION AND FUTURE WORK 151

of the item frequency and document frequency, we will get a more robust model in prediction. Here, approximate frequencies are computed by considering possible mutations in the sequence.

Calculating Approximate Probability

The transition matrix of the Markov model will be calculated based on an approximate probabil- ity. The approximate probability is calculated by the approximate item frequency and document frequency. We use the following equation in place of the regular equation which calculates the transition matrix using the exact term frequency (TF) or document frequency (DF).

P(Xn+1 = j|Xn = i,Xn−1 = in−1...,X0 = i0) = argmax P(Ym+n = j|YnYm+n−1Ym+n−2...Ym+n−L) ED(Y,X)≤k where ED(Y,X)=ED(Y[m+n-L+1 . . . m+n],X[n-L+1 . . . n])

Difficulty

For this extension of the PSA, we need to cope with two hard problems.

1. How to calculate the approximate term frequency (TF) and document frequency (DF) in efficient time and space complexity. This is similar to the motif discovery problem. We have to find a better algorithm to calculate the approximate term frequency (TF) and document frequency (DF).

2. How to associate nodes in the PSA to the approximate probability. In our current PSA, symbols in the same edge share the same node. Hence they have the same probability, tf and df. Thus the space for the PSA is linear with respect to the length of sequence. In this future work, symbols in the same edge may not share the same probability. Thus in the nai¨ve implementation, the space for the PSA will be quadratic with respect to the length of the sequence. Reducing this space requirement is an interesting challenge. CHAPTER 7. CONCLUSION AND FUTURE WORK 152

If these two problems can be solved, a robust representation for Markov models will be achieved.

7.3 Publications from the Dissertation

1. Jie Lin, Yue Jiang, and Don Adjeroh. The Virtual Suffix tree: An efficient data structure for suffix trees and suffix arrays, the Prague Stringology Conference (PSC), 2008.

2. Jie Lin, Yue Jiang, Donald A. Adjeroh. The Virtual Suffix Tree, International Journal of Foundations of Computer Science, Vol:20, Issue No:6, pp1109-1133, 2009.

3. Jie Lin and Don Adjeroh. All-against-all circular pattern matching. 2011. Under review.

4. Jie Lin and Don Adjeroh. Circular Pattern Discovery. 21st International Workshop on Combinatorial Algorithms, 2011.

5. Jie Lin, Don Adjeroh and Bing-Hua Jiang. Probabilistic Suffix Array: Efficient Modelling and Prediction of Protein Families. 2011. Under review.

6. Jie Lin, Don Adjeroh and Bing-Hua Jiang. Algorithms for Efficient Detection of CPs in Multidomain Protein. 2011. To be submitted. Bibliography

[1] N Abe and M Warmuth. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205–260, 1992.

[2] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2:53 – 86, 2004.

[3] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Com- pression, Suffix Arrays and Pattern Matching. Springer-Verlag, 2008.

[4] Don Adjeroh and Fei Nan. Suffix sorting via Shannon-Fano-Elias codes. Algorithms, 3(2):145–167, 2010.

[5] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to biblio- graphic search. Commun. ACM, 18:333–340, June 1975.

[6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, October 1990.

[7] Amihood Amir, Yonatan Aumann, Gad M. Landau, Moshe Lewenstein, and Noa Lewen- stein. Pattern matching with swaps. In FOCS, pages 144–153, 1997.

[8] Arne Andersson and Stefan Nilsson. Efficient implementation of suffix trees. 25(2):129– 141, 1995.

[9] Alberto Apostolico and Gill Bejerano. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. Journal of Computational Biology, 7(3-4):381–393, 2000.

153 BIBLIOGRAPHY 154

[10] Alberto Apostolico and Maxime Crochemore. Optimal canonization of all substrings of a string. Inf. Comput., 95(1):76–95, 1991.

[11] Alberto Apostolico and Raffaele Giancarlo. The Boyer Moore Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.

[12] H. Arimura, H. Asaka, H. Sakamoto, and S. Arikawa. Efficient discovery of proximity patterns with suffix arrays. In A. Amir and G.M. Landau, editors, CPM, volume 2089 of Lecture Notes in Computer Science, pages 152–156. Springer, 2001.

[13] R.A. Baeza-Yates and G.H. Gonnet. A new approach to text searching. In N.J. Belkin and C.J. van Rijsbergen, editors, SIGIR 89, Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, vol- ume 23, pages 168–75. ACM, New York (published as a special issue of SIGIR Forum, Vol. 23, 1-2, Fall 88/Winter 89), 1989.

[14] Ricardo A. Baeza-Yates and Chris H. Perleberg. Fast and practical approximate string matching. Information Processing Letters, 59(1):21–27, 1996.

[15] Alex Bateman, Lachlan Coin, Richard Durbin, Robert D. Finn, Volker Hollich1, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik L. L. Sonnhammer1, David J. Studholme, Corin Yeats, and Sean R. Eddy. The Pfam protein families database. Nucleic Acids Res, 32 (Database issue):D138–D141, 2004.

[16] Ron Begleiter, Ran El-Yaniv, and Golan Yona. On prediction using variable order Markov models. J. Artif. Intell. Res. (JAIR), 22:385–421, 2004.

[17] Gill Bejerano and Golan Yona. Variations on probabilistic suffix trees: Statistical mod- eling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.

[18] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, LATIN, volume 1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000.

[19] Kellogg S. Booth. Lexicographically least circular substrings. Inf. Process. Lett., 10(4/5):240–242, 1980. BIBLIOGRAPHY 155

[20] R.S. Boyer and J.S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):62–72, 1977.

[21] Catherine Bru, Emmanuel Courcelle, Sbastien Carrre, Yoann Beausse, Rine Dalmar, and Daniel Kahn. The ProDom database of protein domain families: More emphasis on 3D. Nucleic Acids Res, 33:212–215, 2005.

[22] H. Bunke and U. Buhler. Applications of approximate string matching to 2D shape recognition. Pattern Recognition, 26(12):1797–1812, December 1993.

[23] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, May 1994.

[24] Y. Cao, A. Janke, P.J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Paabo, and M. Hasegawa. Conflict among individual mitochondrial proteins in re- solving the phylogeny of eutherian orders. J Mol Evol., 47:307–322, 1998.

[25] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48(5):1073–1082, 1988.

[26] William I. Chang and Jordan Lampe. Theoretical and empirical comparisons of ap- proximate string matching algorithms. In CPM ’92: Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching, pages 175–184, London, UK, 1992. Springer-Verlag.

[27] William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327–344, 1994.

[28] John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Computer Journal, 40(2/3):67–75, 1997.

[29] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. In STOC, pages 407–415, 2000.

[30] S Cong, J Han, and D A Padua. Parallel mining of closed sequential patterns. In KDD. ACM, 2005. BIBLIOGRAPHY 156

[31] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.

[32] Florence Corpet, Florence Servant, Jerome Gouzy, and Daniel Kahn. ProDom and ProDom-CG: Tools for protein domain analysis and whole genome comparisons. Nu- cleic Acids Res, 28(1):267–269, 2000.

[33] Paul D. Cotter, Colin Hill1, and R. Paul Ross. Bacterial lantibiotics: Strategies to improve therapeutic potential. Current Protein and Peptide Science, 6:61–75, 2005.

[34] Maxime Crochemore and Thierry Lecroq. Tight bounds on the complexity of the Apostolico-Giancarlo algorithm. Information Processing Letters, 63:195–203, 1997.

[35] Craik DJ. Circling the enemy: Cyclic proteins in plant defence. Trends Plant Sci., 14(6):328–335, 2009.

[36] Jean-Pierre Duval. Factorizing words over an ordered alphabet. J. Algorithms, 4(4):363– 381, 1983.

[37] Yariv Ephraim, Amir Dembo, and Lawrence R. Rabiner. A minimum discrimination information approach for hidden Markov modeling. IEEE Transactions on Information Theory, 35(5):1001–1013, 1989.

[38] Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. On the sorting- complexity of suffix tree construction. J. ACM, 47(6):987–1011, 2000.

[39] Robert D. Finn, John Tate, Jaina Mistry, Penny C. Coggill, Stephen John Sammut, Hans rudolf Hotz, Goran Ceric, Kristoffer Forslund, Sean R. Eddy, Erik L. L. Sonnhammer, and Alex Bateman. The Pfam protein families database. Nucleic Acids Res, 36:281–288, 2008.

[40] S. Garcia-Vallve, A. Rojas, J. Palau, and A. Romeu. Circular permutants in beta- glucosidases (family 3) within a predicted double-domain topology that includes a (beta/alpha)8-barrel.. Proteins, 31:214–223, 1998.

[41] Robert Giegerich, Stefan Kurtz, and Jens Stoye. Efficient implementation of lazy suffix trees. Software — Practice and Experience, 33(11), 2003. BIBLIOGRAPHY 157

[42] David Gillman and Michael Sipser. Inference and minimization of hidden Markov chains. In COLT, pages 147–158, 1994.

[43] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. PNAS, 99(12):7821–7826, June 2002.

[44] Jens Gregor and Michael G. Thomason. Dynamic programming alignment of sequences representing cyclic patterns. IEEE Trans. Pattern Anal. Mach. Intell., 15(2):129–135, 1993.

[45] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005.

[46] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Com- putational Biology. Cambridge University Press, Cambridge, UK, 1997.

[47] U. Heinemann and M. Hahn. Circular permutation of polypeptide chains: Implications for protein folding and stability. Prog. Biophys. Mol. Biol., 64:121–143, 1995.

[48] R.N. Horspool. Practical fast searching in strings. 10(6):501–6, 1980.

[49] M Hu, J Yang, and W Su. Permu-pattern: Discovery of mutable permutation patterns. In KDD. ACM, 2008.

[50] James W. Hunt and Thomas G. Szymanski. A fast algorithm for computing longest common subsequences. Commun. ACM, 20(5):350–353, 1977.

[51] Costas S. Iliopoulos and M. Sohel Rahman. Indexing circular patterns. In WALCOM, pages 46–57, 2008.

[52] H. Jeong, S. P. Mason, A. L. Barabasi,´ and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, May 2001.

[53] Petteri Jokinen and Esko Ukkonen. Two algorithms for approximate string matching in static texts (extended abstract). In A. Tarlecki, editor, Mathematical Foundations of Computer Science 1991: Proc. of the 16th International Symposium, pages 240–248. Springer, Berlin, Heidelberg, 1991. BIBLIOGRAPHY 158

[54] Maizel JV Jr and Lenk RP. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proc Natl Acad Sci U S A, 78(12):7665–7669, 1981.

[55] Jongsun Jung and Byungkook Lee. Circularly permuted proteins in the protein structure database. Protein Sci., 10(9):1881–1886, 2001.

[56] Juha Karkk¨ ainen.¨ Suffix cactus: A cross between suffix tree and suffix array. In CPM: 6th Symposium on Combinatorial Pattern Matching, 1995.

[57] Juha Karkk¨ ainen,¨ Peter Sanders, and Stefan Burkhardt. Linear work suffix array con- struction. J. ACM, 53(6):918–936, 2006.

[58] S. Karlin, G. Ghandour, F. Ost, S. Tavare, and L.J. Korn. New approaches for com- puter analysis of nucleic acid sequences. Proceedings, National Academy of Sciences, 80(18):5660–5664, 1983.

[59] R.M. Karp and M.O. Rabin. Efficient randomized pattern-matching algorithms. 31(2):249–260, 1987.

[60] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size. In 12th Annual Symposium on Combinatorial Pattern Matching, 2001.

[61] Dong Kyue Kim, Jeong Eun Jeon, and Heejin Park. An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size. In SPIRE 2004, 2004.

[62] Dong Kyue Kim, Minhwan Kim, and Heejin Park. Linearized suffix tree: an efficient index data structure with the capabilities of suffix trees and suffix arrays. Algorithmica, 2007.

[63] Dong Kyue Kim, Jeong Seop Sim, Heejin Park, and Kunsoo Park. Constructing suffix arrays in linear time. J. Discrete Algorithms, 3(2-4):126–142, 2005.

[64] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. 6(2):323–350, 1977. BIBLIOGRAPHY 159

[65] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. J. Discrete Algorithms, 3(2-4):143–156, 2005.

[66] R. Kohli and C. Walsh. Enzymology of acyl chain macrocyclization in natural product biosynthesis. Chemical Communications, 3:297–307, 2003.

[67] Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string compar- ison. SIAM Journal on Computing, 27:557–582, 1998.

[68] G.M. Landau and U. Vishkin. Fast string matching with k differences. Journal of Com- puter and System Sciences, 37(1):63–78, 1988.

[69] Roman A. Laskowski, James D. Watson, and Janet M. Thornton. Profunc: A server for predicting protein function from 3D structure. Nucleic Acids Research, 33(Web-Server- Issue):89–93, 2005.

[70] Roman A. Laskowski, James D. Watson, and Janet M. Thornton. Protein function pre- diction using local 3D templates. Journal of Molecular Biology, 351:614–626, 2005.

[71] Florencia G. Leonardi. A generalization of the PST algorithm: Modeling the sparse nature of protein sequences. Bioinformatics, 22(11):1302–1307, 2006.

[72] Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions, and re- versals. Cybernetics and Control Theory, 10(8):707–710, 1966. Original in Doklady Akademii Nauk SSSR 163(4): 845–848 (1965).

[73] M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information- based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17:149–154, 2001.

[74] Jie Lin and Don Adjeroh. All-against-all circular pattern matching. 2011. Under review.

[75] Jie Lin and Don Adjeroh. Circular pattern discovery. 21st International Workshop on Combinatorial Algorithms, 2011.

[76] Jie Lin, Don Adjeroh, and Binghua Jiang. Algorithms for efficient detection of cps in multidomain protein. 2011. To be submitted. BIBLIOGRAPHY 160

[77] Jie Lin, Don Adjeroh, and Binghua Jiang. Probabilistic suffix array: Efficient modelling and prediction of protein families. 2011. Under review.

[78] Jie Lin, Yue Jiang, and Don Adjeroh. The virtual suffix tree: An efficient data structure for suffix trees and suffix arrays. In Jan Holub and Jan Zˇ dˇarek,´ editors, Proceedings of the Prague Stringology Conference 2008, pages 68–83, Czech Technical University in Prague, Czech Republic, 2008.

[79] Jie Lin, Yue Jiang, and Don Adjeroh. The virtual suffix tree. Int. J. Found. Comput. Sci., 20(6):1109–1133, 2009.

[80] Mortitz G. Maaβ. Computing suffix links for suffix trees and arrays. Information Pro- cessing Letters, 101(6), 2007.

[81] Maurice Maes. On a cyclic string-to-string correction problem. Inf. Process. Lett., 35(2):73–78, 1990.

[82] Veli Makinen.¨ Compact suffix array – a space-efficient full-text index. Fundam. Inform., 56(1-2):191–210, 2003.

[83] Veli Makinen¨ and Gonzalo Navarro. Compressed compact suffix arrays. Combinatorial Pattern Matching, pages 420–433, 2004.

[84] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Computing, 22(5):935–948, 1993.

[85] Tobias Marschall and Sven Rahmann. Probabilistic arithmetic automata and their appli- cation to pattern matching statistics. Combinatorial Pattern Matching, pages 95–106, 2008.

[86] Tobias Marschall and Sven Rahmann. Efficient exact motif discovery. Bioinformatics, 25(12), 2009.

[87] Andres Marzal and Sergio Barrachina. Speeding up the computation of the edit distance for cyclic strings. Pattern Recognition, Int’l Conference on, 2:2891, 2000. BIBLIOGRAPHY 161

[88] Geoffrey Mazeroff, Jens Gregor, Michael G. Thomason, and Richard Ford. Probabilistic suffix models for API sequence analysis of Windows XP applications. Pattern Recogni- tion, 41(1):90–101, 2008.

[89] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262–272, April 1976.

[90] R. A. Mollineda, E. Vidal, and F. Casacuberta. Cyclic sequence alignments: Approxi- mate versus optimal techniques. International Journal of Pattern Recognition and Arti- ficial Intelligence, 16:291–299, 2002.

[91] R. A. Mollineda, E. Vidal, and F. Casacuberta. A windowed weighted approach for approximate cyclic string matching. In ICPR ’02: Proceedings of the 16 th International Conference on Pattern Recognition (ICPR’02) Volume 4, page 40188, Washington, DC, USA, 2002. IEEE Computer Society.

[92] Ramon´ Alberto Mollineda, Enrique Vidal, and Francisco Casacuberta. Efficient tech- niques for a very accurate measurement of dissimilarities between cyclic patterns. In Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recog- nition, pages 337–346, London, UK, 2000. Springer-Verlag.

[93] Krisztian´ Monostori, Arkady Zaslavsky, and Heinz Schmidt. Suffix vector: Space- and time-efficient alternative to suffix trees. In Michael J. Oudshoorn, editor, Twenty-Fifth Australasian Computer Science Conference (ACSC2002), Melbourne, Australia, 2002. ACS.

[94] J. Ian Munro, Venkatesh Raman, and S. Srinivasa Rao. Space efficient suffix trees. J. Algorithms, 39(2):205–222, 2001.

[95] G. Navarro and J. Tarhio. Boyer-Moore string matching over Ziv-Lempel compressed text. Proceedings, Combinatorial Pattern Matching, LNCS 1848, pages 166–180, 2000.

[96] Saul Ben Needleman and Christian Dennis Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970. BIBLIOGRAPHY 162

[97] Ge Nong, Sen Zhang, and Wai Hong Chan. Linear time suffix array construction using d-critical substrings. In CPM, pages 54–67, 2009.

[98] Jose Oncina. The Cocke-Younger-Kasami algorithm for cyclic strings. In ICPR ’96: Proceedings of the 13th International Conference on Pattern Recognition, page 413, Washington, DC, USA, 1996. IEEE Computer Society.

[99] Hasan H. Otu and Khalid Sayood. A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(16):2122–2130, November 2003.

[100] Philipp Pagel, Matthias Oesterheld, Volker Stumpflen,¨ and Dmitrij Frishman. The DIMA web resource - exploring the protein domain network. Bioinformatics, 22(8):997–998, 2006.

[101] Debnath Pal. Inference of protein function from protein structure. Structure, 13:121–130, 2005.

[102] Anatolij Potapov, Bjorn¨ Goemann, and Edgar Wingender. The pairwise disconnectivity index as a new metric for the topological analysis of regulatory networks. BMC Bioin- formatics, 9, 2008.

[103] Elise Prieur and Thierry Lecroq. From suffix trees to suffix vectors. In Prague Stringol- ogy Conference(PCS2005), Prague, 2005.

[104] Simon J. Puglisi, William F. Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2), 2007.

[105] A. Reyes, C. Gissi, G. Pesole, F.M. Catzeflis, and C. Saccone. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Mol. Biol. Evol., 17:979–983, 2000.

[106] J. Rissanen. Complexity of strings in the class of Markov sources. IEEE Transactions on Information Theory, IT-32(4):526–532, 1986.

[107] Jorma Rissanen. Universal coding information prediction and estimation. IEEE Trans- actions on Information Theory, IT-30(4):629–636, 1984. BIBLIOGRAPHY 163

[108] Dana Ron, Yoram Singer, and Naftali Tishby. Learning probabilistic automata with vari- able memory length. In COLT, pages 35–46, 1994.

[109] Dana Ron, Yoram Singer, and Naftali Tishby. The power of amnesia: Learning prob- abilistic automata with variable memory length. Machine Learning, 25(2-3):117–149, 1996.

[110] Lu´ıs M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. Fully-compressed suffix trees. In LATIN’08: Proceedings of the 8th Latin American Conference on Theoretical Informatics, pages 362–373, Berlin, Heidelberg, 2008. Springer-Verlag.

[111] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Comput- ing Systems, 41(4):589–607, 2007.

[112] GK Sandve and F Drabls. A survey of motif discovery methods in an integrated frame- work. Biology Direct, 1(11), 2006.

[113] Thomas B. Sebastian, Philip N. Klein, and Benjamin B. Kimia. On aligning curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:116–125, 2003.

[114] P.H. Seller. The theory and computation of evolutionary distances: Pattern Recognition. jalg, 1:359–373, 1980.

[115] Y. Shiloach. Fast canonization of circular strings. J. Algorithms, 2(2):107–121, 1981.

[116] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147(1):195–197, March 1981.

[117] W.F. Smyth. Computing Patterns in Strings. Addison-Wesley, 2003.

[118] Kenneth Sorensen. Distance measures based on the edit distance for permutation-type representations. Journal of Heuristics, 13:35–47, 2007.

[119] YQ. Tang, J. Yuan, G. Osapay, K. Osapay, D. Tran, CJ. Miller, AJ. Ouellette, and ME. Selsted. A cyclic antimicrobial peptide produced in primate leukocytes by the ligation of two truncated alpha-defensins. Science, 286(5439):498–502, 1999. BIBLIOGRAPHY 164

[120] Steven L. Tanimoto. A method for detecting structure in polygons. Pattern Recognition, 13(6):389–394, 1981.

[121] Esko Ukkonen. Finding approximate patterns in strings. J. Algorithms, 6(1):132–137, 1985.

[122] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

[123] Shai Uliel, Amit Fliess, Amihood Amir, and Ron Unger. A simple algorithm for detecting circular permutations in proteins. Bioinformatics, 15(11):930–936, 1999.

[124] Shai Uliel, Amit Fliess, and Ron Unger. Naturally occurring circular permutations in proteins. Protein Eng., 14(8):533–542, August 2001.

[125] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of computational biology, 1(4):337–348, 1994.

[126] J. Weiner and E. Bornberg-Bauer. Evolution of circular permutations in multidomain proteins. Mol. Biol. Evol, 23(4):734–743, 2006.

[127] J. Weiner, G. Thomas, and E. Bornberg-Bauer. Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics, 21(7):932–937, 2005.

[128] P. Weiner. Linear pattern matching algorithm. Proceedings, 14th IEEE Symposium on Switching and Automata Theory, 21:1–11, 1973.

[129] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context-tree weight- ing method: Basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.

[130] S. Wu and U. Manber. Agrep — a fast approximate pattern matching tool. In Proceedings of the Winter 1992 USENIX Conference, pages 153–62. USENIX Association, Berkeley, CA, 1992.

[131] Mikio Yamamoto and Kenneth Ward Church. Using suffix arrays to compute term fre- quency and document frequency for all substrings in a corpus. Computational Linguis- tics, 27(1):1–30, 2001. BIBLIOGRAPHY 165

[132] J Yang, W Wang, P Yu, and J Han. Mining long sequential patterns in a noisy environ- ment. In SIGMOD. ACM, 2002.

[133] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

[134] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.