Suffix Trees and Suffix Arrays Some Problems Exact String Matching Applications in Bioinformatics Suffix Trees History of Suffix

Suffix Trees and Suffix Arrays Some Problems Exact String Matching Applications in Bioinformatics Suffix Trees History of Suffix

Some problems • Given a pattern P = P[1..m], find all occurrences of P in a text S = S[1..n] Suffix Trees and • Another problem: – Given two strings S [1..n ] and S [1..n ] find their Suffix Arrays 1 1 2 2 longest common substring. • find i, j, k such that S 1[i .. i+k-1] = S2[j .. j+k-1] and k is as large as possible. • Any solutions? How do you solve these problems (efficiently)? Exact string matching Applications in Bioinformatics • Finding the pattern P[1..m] in S[1..n] can be • Multiple genome alignment solved simply with a scan of the string S in – Michael Hohl et al. 2002 O(m+n) time. However, when S is very long – Longest common substring problem and we want to perform many queries, it – Common substrings of more than two strings would be desirable to have a search • Selection of signature oligonucleotides for algorithm that could take O(m) time. microarrays • To do that we have to preprocess S. The – Kaderali and Schliep, 2002 preprocessing step is especially useful in • Identification of sequence repeats scenarios where the text is relatively constant – Kurtz and Schleiermacher, 1999 over time (e.g., a genome), and when search is needed for many different patterns. Suffix trees History of suffix trees • Any string of length m can be degenerated • Weiner, 1973: suffix trees introduced, linear- into m suffixes. time construction algorithm – abcdefgh (length: 8) • McCreight, 1976: reduced space-complexity – 8 suffixes: • h, gh, fgh, efgh, defgh, cdefgh, bcefgh, abcdefgh • Ukkonen, 1995: new algorithm, easier to describe • The suffixes can be stored in a suffix-tree and this tree can be generated in O(n) time • In this course, we will only cover a naive (quadratic-time) construction. • A string pattern of length m can be searched in this suffix tree in O(m) time. – Whereas, a regular sequential search would take O(n) time. 1 Definition of a suffix tree An example suffix tree • Let S=S[1..n] be a string of length n over a • The suffix tree for string: 1 2 3 4 5 6 fixed alphabet S. A suffix tree for S is a tree x a b x a c with n leaves (representing n suffixes) and the following properties: – Every internal node other than the root has at least 2 children – Every edge is labeled with a nonempty substring of S. – The edges leaving a given node have labels starting with Does a suffix tree different letters. always exist? – The concatenation of the labels of the path from the root to leaf i spells out the i-th suffix S[i..n] of S. We denote S[i..n] by Si. What about the tree for xabxa? Problem • The suffix tree for string: 1 2 3 4 5 • Note that if a suffix is a prefix of another suffix x a b x a we cannot have a tree with the properties defined in the previous slides. – e.g. xabxa The fourth suffix xa or the fifth suffix a won’t be represented by a leaf node. xa an a are not leaf nodes. Solution: the terminal character $ The suffix tree for xabxa$ • Note that if a suffix is a prefix of another suffix we cannot have a tree with the properties defined in the previous slides. – e.g. xabxa The fourth suffix xa or the fifth suffix a won’t be represented by a leaf node. • Solution: insert a special terminal character at the end such as $. Therefore xa$ will not be a prefix of the suffix xabxa. 2 Suffix tree construction Suffix tree construction - 2 • Start with a root and a leaf numbered 1, connected • If the mismatch occurs at the middle of an by an edge labeled S$. edge e = S[u ... v] • Enter suffixes S[2..n]$; S[3...n]$; ... ; S[n]$ into the – let the label of that edge be a ...a tree as follows: 1 l – If the mismatch occurred at character ak, then • To insert Ki = S[i..n]$, follow the path from the root create a new node w, and replace e by two edges matching characters of Ki until the first mismatch at S[u ... u+k-1] and S[u+k ... v] labeled by a ...a character K [ j ] (which is bound to happen) 1 k and i ak+1...al (a) If the matching cannot continue from a node, denote • Finally, in both cases (a) and (b), create a that node by w (b) Otherwise the mismatch occurs at the middle of an new leaf numbered i, and connect w to it by edge, which has to be split an edge labeled with Ki[j ... |Ki|] Example construction Example contd... • Let’s construct a suffix tree for xabxac$ • Inserting the fourth suffix xac$ will cause the first edge to be split: $ $ • Start with: $ • After inserting the second and third suffix: $ $ $ $ $ • Same thing happens for the second edge when ac$ is inserted. Example contd... Complexity of the naive construction • After inserting the remaining suffixes the tree • We need O(n-i+1) time for the ith suffix. will be completed: Therefore the total running time is: n åO(i) = O(n 2 ) 1 • What about space complexity? – Can also take O(n2) because we may need to store every suffix in the tree separately, – e.g., abcdefghijklmn 3 Storing the edge labels efficiently Suffix tree applet • Note that, we do not store the actual • http://pauillac.inria.fr/~quercia/documents- substrings S[i ... j] of S in the edges, but only info/Luminy - their start and end indices (i, j). 98/albert/JAVA+html/SuffixTreeGrow.html • Nevertheless we keep thinking of the edge labels as substrings of S. • This will reduce the space complexity to O(n) Using suffix trees for pattern matching Illustration • Given S and P. How do we find all occurrences of P in S? • T = xabxac • Observation. Each occurrence has to be a prefix of some – suffixes ={xabxac, abxac, bxac, xac, ac, c} suffix. Each such prefix corresponds to a path starting at the • Pattern P1: xa root. • Pattern P2: xb 1. Of course, as a first step, we construct the suffix tree for S. Using the naive method this takes quadratic time, but linear-time algorithms b c a x (e.g., Ukkonen’s algorithm) exist. x 6 a 2. Try to match P on a path, starting from the root. Three cases: a c b (a) The pattern does not match ? P does not occur in T c 5 x c b a x (b) The match ends in a node u of the tree. Set x = u. a c 4 c (c) The match ends inside an edge (v,w) of the tree. Set x = w. 3 2 3. All leaves below x represent occurrences of P. 1 Running Time Analysis Scalability • Search time: • For very large problems a linear time and – O(m+k) where k is the number of occurrences of space bound is not good enough. This lead to P in T and m is the length of P the development of structures such as Suffix – O(m) to find match point if it exists Arrays to conserve memory . – O(k) to find all leaves below match point 4 Effects of alphabet size on Two implementation issues suffix trees • Alphabet size • Generalizing to multiple strings • We have generally been assuming that the trees are built in such a way that – from any node, we can find an edge in constant time for any specific character in S • an array of size |S| at each node • This takes Q(m|S|) space. More compact representation Generalized suffix trees • We may try to be more compact taking only O(m) • Build a suffix tree for a set of strings S = {S1, …, Sz} space. • Some issues – At each node, have pointers to only the edges that are needed • Nodes in tree may corresponds to substrings of • This slows down the search time potentially multiple strings Si • How much? – compact edge labels: need 3 fields (start position, stop position, string) – typically the minimum of O(log m) or O(log |S|) with a binary tree representation. – leaf labels now a set of pairs indicating starting position • This effects both suffix tree construction time and and string later searching time against the suffix tree. Longest common substring problem Selecting probes for microarrays • Build a generalized suffix tree for S1$1S2$2. • Wikipedia: Oligonucleotides are short Here $1 and $2 are different new symbols not sequences of nucleotides (RNA or DNA), occurring in S1 and S2. typically with twenty or fewer base pairs. • Given a set of genomic sequences, the problem is to identify at least one • Mark every internal node of the tree with {1}, signature oligonucleotide (probe) for each sequence. These probes must {2}, or {1,2} depending on whether its path hybridize to only the desired sequence. The algorithm produces a GST from the reverse compliment of all the genomic sequences (candidate label is a substring of S1 and/or S2. probe sequences). Using the GST, the algorithm identifies all common substrings and rejects these regions because probes designed in them • Find the internal node which is labeled by would not be specific to a single genomic sequence. Criteria such as {1,2} and has the largest “string depth”. probe length are used to further prune this tree. • Example: (with the applet) • http://www.zaik.uni- – pessimist%mississippi$ koeln.de/bioinformatik/arraydesign.html.en 5 Suffix arrays Suffix arrays • Suffix arrays were introduced by Manber and Example of a suffix array for acaaacatat$ Myers in 1993 3 • More space efficient than suffix trees 4 • A suffix array for a string x of length m is an 1 array of size m that specifies the lexicographic 5 ordering of the suffixes of x.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    7 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us