Suffix Arrays

Recap from Last Time

Suffix Trees
[Figure: suffix tree for nonsense$ (character positions 012345678), with each leaf labeled by the start position of its suffix]
Theorem: w is a substring of x if and only if w is a prefix of a suffix of x.

New Stuff!

Representing Suffix Trees

Representing a Suffix Tree
● We know that a suffix tree has O(m) nodes, where m is the number of characters in the input string.
● This means that there are O(m) edges. (Why?)
● Question: Why can’t we immediately claim that the space usage of the suffix tree is O(m)?

Representing a Suffix Tree
● Claim: Writing out all suffixes of a string of length m requires Θ(m²) characters.
● Proof idea: Those suffixes have lengths 1 + 2 + … + (m+1) = (m+1)(m+2)/2, factoring in the special character.
● Problem: It is indeed possible to build a suffix tree with Θ(m²) total letters on the edges.

Representing a Suffix Tree
● By being clever with our representation, we can guarantee that a suffix tree uses only Θ(m) space, regardless of the input string.
● Observation: Each edge is labeled with a substring of the original input string.
● Idea: Don’t actually write out the labels on the edges. Just write down the start and end index!

Representing a Suffix Tree
Edges leaving the root of the suffix tree for nonsense$, stored as index ranges:
first character   $  e  n  o  s
start             8  4  0  1  3
end               8  4  0  8  4
child             (one pointer per edge)

Representing a Suffix Tree
● Space usage required for a suffix tree:
  ● O(m) space for all the nodes.
  ● O(m) space for a copy of the original string.
  ● O(m) space for the edges.
● Total space: O(m).

Constructing a Suffix Tree
● The naive algorithm for building a suffix tree (add one suffix at a time) takes time Θ(m²).
● Claim: With a much more clever approach, this can be done in time O(m).
● This is not obvious. We’ll spend a full lecture on this idea later on.

Suffix Tree Space Usage
● Suffix tree edges take up a lot of space.
  ● Two machine words per edge to denote the range of characters visited.
  ● One machine word per edge for the pointer itself.
● Number of edges ranges from m to 2m - 1, so this is between 3m and 6m machine words for the whole string!
● Example: a human genome is about three billion characters long.
  ● With clever techniques, that can be packed into about 800MB.
  ● On a 32-bit machine, the suffix tree needs about 48GB, too big to fit into memory!
  ● On a 64-bit machine, the suffix tree needs about 96GB, way more than a typical machine can hold!

Key Question: Can we get the benefits of a suffix tree without the space penalty?

What is it about suffix trees that makes them so useful algorithmically?

Theorem: There is a node labeled ω in a suffix tree for T if and only if ω is a suffix of T$ or ω is a branching word in T$.
A string ω is a branching word in T$ if there are distinct characters a and b where ωa and ωb are substrings of T$.
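To make the definition of a branching word concrete, here is a minimal brute-force sketch in Python. The function name branching_words is hypothetical and not from the lecture, and the approach is deliberately naive (it enumerates every substring), so it only illustrates the definition and is not how a suffix tree would be built.

```python
from collections import defaultdict


def branching_words(text):
    """Return the sorted branching words of `text` (which should end in '$').

    A word w is branching if at least two distinct characters follow
    occurrences of w in `text`. Brute force: illustration only.
    """
    followers = defaultdict(set)
    for start in range(len(text)):
        # text[end] is the character that follows the word text[start:end].
        for end in range(start, len(text)):
            followers[text[start:end]].add(text[end])
    return sorted(word for word, chars in followers.items() if len(chars) >= 2)


print(branching_words("nonsense$"))  # ['', 'e', 'n', 'nse', 'se']
```

For nonsense$ the five branching words (including the empty word, which labels the root) are exactly the labels of the internal nodes of the suffix tree, as the theorem predicts.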
Key Intuition: The efficiency in a suffix tree is largely due to
1. keeping the suffixes in sorted order, and
2. exposing branching words.

Where We’re Going
● Today, we’ll see two data structures that encode much of the same information as suffix trees, but in much less space.
● The suffix array stores information about the ordering of the suffixes of a string.
● The LCP array stores information about the branching words of a string.
● Together, they support algorithms whose time bounds match, or come close to, the suffix tree bounds from last time.

Suffix Arrays

Suffix Arrays
● A suffix array for a string T is a sorted array of the suffixes of the string T$.
● Suffix arrays distill out just the first component of suffix trees: they store suffixes in sorted order.
● Non-obvious fact: Suffix arrays can be built in time O(m). Details next time!

Suffix array for ABANANABANDANA$:
$
A$
ABANANABANDANA$
ABANDANA$
ANA$
ANABANDANA$
ANANABANDANA$
ANDANA$
BANANABANDANA$
BANDANA$
DANA$
NA$
NABANDANA$
NANABANDANA$
NDANA$

Suffix Arrays
● Last time, we saw how to find all instances of a pattern P in a text T using suffix trees.
● How could we do that with suffix arrays?
● Question: What’s a good algorithm for finding an element of a sorted array?

Suffix Arrays
● Reminder: Our text string T has length m. Our pattern string P has length n.
● Claim: With a suffix array, we can determine whether P matches in T in time O(n log m).
  ● Binary search has O(log m) rounds.
  ● Each probe takes time O(n).
  ● This bound can be made tight. (How?)
● Since m is often much bigger than n, this is a huge win over a raw scan.
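The O(n log m) claim above can be sketched in Python as follows. The helper names build_suffix_array_naive and contains are hypothetical, and the construction shown simply sorts the suffixes directly, which is much slower than the O(m) construction the slides promise for next time; it is here only so the example is self-contained.

```python
def build_suffix_array_naive(text):
    """Return suffix start positions of `text`, sorted by suffix.

    Illustration only: sorting the suffixes directly is far slower than
    the O(m) construction mentioned in the slides.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, suffix_array, pattern):
    """Binary search for a suffix of `text` that begins with `pattern`."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # O(log m) rounds
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # One probe: compare the pattern against the first len(pattern)
        # characters of the suffix, which costs O(n).
        prefix = text[start:start + len(pattern)]
        if prefix == pattern:
            return True
        if prefix < pattern:
            lo = mid + 1
        else:
            hi = mid
    return False


text = "ABANANABANDANA$"
suffix_array = build_suffix_array_naive(text)
print(contains(text, suffix_array, "ANAN"))  # True
print(contains(text, suffix_array, "ANAD"))  # False
```

The search relies on the fact that truncating every suffix to its first n characters keeps the array in sorted order, so comparing only those prefixes is safe.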
Suffix Arrays
● Claim: With a suffix array, we can find all matches of a pattern P in T in time O(n log m + z), where z is the number of matches.
● Idea: Binary search can be used to find a range of values equal to some key. Adapt that idea to find all suffixes beginning with the same prefix.

Storing Suffix Arrays
● The way we’ve drawn suffix arrays is terribly space-inefficient.
● It always uses space Θ(m²), since that’s how many total characters occur in all suffixes.
● Can we do better?

Storing Suffix Arrays
● We reduced the space usage of suffix trees by representing substrings, implicitly, as ranges within the original string.
● Idea: Don’t store the suffixes themselves. Just store the starting positions of the suffixes.
● Space: Θ(m), and with only one machine word used per character of input.

Suffix array for ABANANABANDANA$ (character positions 012345678901234), stored as starting positions:
14 13 0 6 11 4 2 8 1 7 10 12 5 3 9

Storing Suffix Arrays
● Although the array of starting positions is how we’d represent the suffix array in memory, for this lecture we’ll draw things out the longer way.
● This is just to build intuition; we wouldn’t actually do that in practice.

The Story So Far
● Suffix arrays store all the suffixes of a string in sorted order.
● They provide an ⟨O(m), O(n log m + z)⟩ solution to the substring search problem.
● Intuition: Suffix trees are valuable in large part because they just keep the suffixes sorted.
● What else are suffix trees doing?
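As a rough illustration of the O(n log m + z) search over the compact representation, the sketch below runs two binary searches on the array of starting positions to find the block of suffixes that begin with the pattern. The function name find_all and the constants TEXT and SUFFIX_ARRAY are hypothetical; the array values are the starting positions listed in the slides.

```python
TEXT = "ABANANABANDANA$"
# Starting positions of the suffixes of TEXT in sorted order (from the slides).
SUFFIX_ARRAY = [14, 13, 0, 6, 11, 4, 2, 8, 1, 7, 10, 12, 5, 3, 9]


def find_all(text, suffix_array, pattern):
    """Return the start positions of all matches, in suffix-array order."""
    n = len(pattern)

    def first_index(strictly_greater):
        # Smallest index whose suffix, cut to n characters, is >= pattern
        # (or > pattern when strictly_greater is True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + n]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    left = first_index(False)   # first suffix that begins with the pattern
    right = first_index(True)   # one past the last such suffix
    return suffix_array[left:right]


print(find_all(TEXT, SUFFIX_ARRAY, "NA"))    # [12, 5, 3]
print(find_all(TEXT, SUFFIX_ARRAY, "ANAN"))  # [2]
```

Each binary search costs O(n log m), and reading off the z entries between the two boundaries costs O(z). The matches come back in suffix order rather than text order; sorting them would add O(z log z).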
Branching Words
● Recall: If T is a string, then ω is a branching word in T$ if there are characters a ≠ b such that ωa and ωb are substrings of T$.
