Suffix Arrays

Suffix Arrays Recap from Last Time Suffix Trees $ s nonsense$ e n o e onsense$ 8 n $ n o s s $ nsense$ e n 7 s n e s sense$ e s n 6 ense$ s e $ e $ n $ nse$ n e 4 5 s $ se$ s e 3 e$ e $ 1 $ $ 2 0 Theorem:Theorem: w w is is a a substringsubstring of of x x if if and and nonsense$ onlyonly if if w w is is a a prefix prefix ofof 012345678 aa suffix suffix ofof x x.. New Stuff! Representing Suffix Trees Representing a Suffix Tree ● We know that a suffix tree has O(m) $ s e n o e nodes, where m is 8 n the number of $ s n o s $ n characters in the 7 s n e e n 6 s input string. e s e $ e s ● $ n $ This means that n e 4 5 s $ there are O(m) s e 3 edges. (Why?) e $ 1 $ ● Question: Why can’t 2 we immediately 0 claim that the space usage of the suffix nonsense$ 012345678 tree is O(m)? Representing a Suffix Tree ● Claim: Writing out all suffixes of a string $ s e n o e of length m requires 8 n Θ(m2) characters. $ n o s s $ e n ● 7 s n e Proof idea: Those n 6 s suffixes have length e s e $ e $ s 1 + 2 + … + (m+1), n e $ n s factoring in the 4 s 5 $ 3 $ e special character. e $ 1 $ ● Problem: It is 2 indeed possible to 0 build a suffix tree with Θ(m2) total nonsense$ letters on the edges. 012345678 Representing a Suffix Tree ● By being clever with s our representation, we $ e can guarantee that a n o e n suffix tree uses only 8 $ s Θ(m) space, regardless n o s $ e e n of the input string. 7 s n s e s n 6 e ● Observation: Each s $ e $ n $ edge is labeled with a n e 4 5 s $ substring of the s e 3 original input string. e $ 1 $ ● Idea: Don’t actually 2 write out the labels on 0 the edges. Just write down the start and end nonsense$ index! 012345678 Representing a Suffix Tree $ s e n o e 8 n $ n o s s $ e n 7 s n e s e s n 6 s e $ e $ n $ n e 4 5 s $ $ e n o s s e 3 start e $ 1 8 4 0 1 3 $ end 8 4 0 8 4 2 0 child nonsense$ 012345678 Representing a Suffix Tree ● Space usage $ s required for a e n o e suffix tree: 8 n $ n o s s $ ● O(m) space for e n 7 s n e s all the nodes. e s n 6 s e $ e $ n $ ● O(m) space for n e 4 5 s $ a copy of the s e 3 e $ 1 original string. $ ● 2 O(m) space for 0 the edges. nonsense$ ● Total space: O(m). 012345678 Constructing a Suffix Tree ● The naive algorithm for building a suffix $ s e n o e tree (add one suffix 8 n at a time) takes time $ n s s o $ n Θ(m2). e 7 s n e s e s n 6 ● Claim: With a much s e $ e $ n $ more clever n e 4 5 s $ approach, this can s e 3 be done in time e $ 1 O(m). $ 2 ● This is not 0 obvious. We’ll spend a full lecture nonsense$ 012345678 on this idea later on. Suffix Tree Space Usage ● Suffix tree edges take up a lot of space. ● Two machine words per edge to denote the range of characters visited. ● One machine word per edge for the pointer itself. ● Number of edges ranges from m to 2m – 1, so this is between 3m and 6m machine words for the whole string! ● Example: a human genome is about three billion characters long. ● With clever techniques, that can be packed into about 800MB. ● On a 32-bit machine, the suffix tree needs about 48GB – too big to fit into memory! ● On a 64-bit machine, the suffix tree needs about 96GB – way more than a typical machine can hold! Key Question: Can we get the benefits of a suffix tree without the space penalty? What is it about suffix trees that make them so useful algorithmically? $ s e n o e 8 n $ n o s s $ e n 7 s n e s e s n 6 s e $ e $ n $ n e 4 5 s $ s e 3 e $ 1 $ 2 0 Theorem: There is a node labeled ω in a suffix tree for T if and only if ω is a suffix of T$ or ω is a branching word in T$. $ s e n o e 8 n $ n o s s $ e n 7 s n e s e s n 6 s e $ e $ n $ n e 4 5 s $ s e 3 e AA string string ω ω is is a a $ 1 branching word in $ branching word in 2 TT$$ if if there there are are distinct distinct 0 characterscharacters a a and and b b wherewhere ω ωaa and and ω ωbb are are nonsense$ substringssubstrings of of T T$.$. Theorem: There is a node labeled ω in a suffix tree for T if and only if ω is a suffix of T$ or ω is a branching word in T$. $ s e n o e 8 n $ n o s s $ e n 7 s n e s e s n 6 s e $ e $ n $ n e 4 5 s $ s e 3 e $ 1 $ 2 0 Key Intuition: The efficiency in a suffix tree is largely due to 1. keeping the suffixes in sorted order, and 2. exposing branching words. Where We’re Going ● Today, we’ll see two data structures that encode much of the same information as suffix trees, but in much less space. ● The suffix array stores information about the ordering of the suffixes of a string. ● The LCP array stores information about the branching words of a string. ● Together, they’ll provide algorithms that match or are comparable to the time bounds from last time. Suffix Arrays Suffix Arrays $ ● A suffix array for a string A$ T is a sorted array of the ABANANABANDANA$ suffixes of the string T$. ABANDANA$ ANA$ ● Suffix arrays distill out ANABANDANA$ just the first component of ANANABANDANA$ suffix trees: they store ANDANA$ suffixes in sorted order. BANANABANDANA$ ● BANDANA$ Non-obvious fact: Suffix DANA$ arrays can be built in time NA$ O(m). Details next time! NABANDANA$ NANABANDANA$ NDANA$ ABANANABANDANA$ Suffix Arrays $ ● Last time, we saw A$ how to find all ABANANABANDANA$ ABANDANA$ instances of a pattern ANA$ P in a text T using ANABANDANA$ suffix trees. ANANABANDANA$ ANDANA$ ● How could we do that BANANABANDANA$ with suffixarrays ? BANDANA$ DANA$ ● Question: What’s a NA$ NABANDANA$ good algorithm for NANABANDANA$ finding an element of NDANA$ a sorted array? ABANANABANDANA$ Suffix Arrays $ ● Reminder: Our text string T A$ has length m. Our pattern ABANANABANDANA$ string P has length n. ABANDANA$ ● Claim: With a suffix array, we ANA$ can determine whether P ANABANDANA$ matches in T in time ANANABANDANA$ O(n log m). ANDANA$ ● Binary search has O(log m) BANANABANDANA$ rounds. BANDANA$ ● Each probe takes time O(n). DANA$ ● This bound can be made tight. NA$ (How?) NABANDANA$ ● Figure that m is often much NANABANDANA$ bigger than n, so this is a huge NDANA$ win over a raw scan. ABANANABANDANA$ ANAN Suffix Arrays $ ● Claim: With a suffix A$ array, we can find all ABANANABANDANA$ matches of a pattern P in ABANDANA$ T in time O(n log m + z), ANA$ where z is the number of ANABANDANA$ ANANABANDANA$ matches. ANDANA$ ● Idea: Binary search can BANANABANDANA$ be used to find a range of BANDANA$ values equal to some key. DANA$ Adapt that idea to find all NA$ suffixes beginning with NABANDANA$ NANABANDANA$ the same prefix. NDANA$ ABANANABANDANA$ NA Storing Suffix Arrays $ ● The way we’ve A$ ABANANABANDANA$ drawn suffix ABANDANA$ arrays is terribly ANA$ ANABANDANA$ space-inefficient. ANANABANDANA$ ● ANDANA$ It always uses BANANABANDANA$ space Θ(m2), since BANDANA$ that’s how many DANA$ total characters NA$ NABANDANA$ occur in all NANABANDANA$ suffixes. NDANA$ ● Can we do better? ABANANABANDANA$ Storing Suffix Arrays 14 ● We reduced the space 13 usage of suffix trees by 0 representing substrings, 6 implicitly, as ranges 11 4 within the original string. 2 ● Idea: Don’t store the 8 suffixes themselves. Just 1 store the starting 7 10 positions of the suffixes. 12 ● Space: Θ(m), and with 5 only one machine word 3 used per character of 9 input. ABANANABANDANA$ 012345678901234 Storing Suffix Arrays 14 ● Although the picture 13 to the right is how 0 6 we’d represent the 11 suffix array in 4 memory, for this 2 8 lecture we’ll draw 1 things out the longer 7 way. 10 12 ● This is just to build 5 3 intuition; we 9 wouldn’t actually do that in practice. ABANANABANDANA$ 012345678901234 The Story So Far ● Suffix arrays store all the suffixes of a string in sorted order. ● They provide an ⟨O(m), O(n log m + z)⟩ solution to the substring search problem. ● Intuition: Suffix trees are valuable in large part because they just keep the suffixes sorted. ● What else are suffix trees doing? Branching Words $ ● Recall: If T is a A$ ABANANABANDANA$ string, then ω is a ABANDANA$ branching word ANA$ ANABANDANA$ in T$ if there are ANANABANDANA$ characters a ≠ b ANDANA$ BANANABANDANA$ such that ωa and BANDANA$ ωb are substrings DANA$ NA$ of T$.

Load more