Suffix Arrays

Recap from Last Time

Suffix Trees

[Figure: the suffix tree for nonsense$ (positions 012345678).]

Theorem: w is a substring of x if and only if w is a prefix of a suffix of x.

New Stuff!

Representing Suffix Trees

Representing a Suffix Tree

● We know that a suffix tree has O(m) nodes, where m is the number of characters in the input string.
● This means that there are O(m) edges. (Why?)
● Question: Why can’t we immediately claim that the space usage of the suffix tree is O(m)?

Representing a Suffix Tree

● Claim: Writing out all suffixes of a string of length m requires Θ(m²) characters.
● Proof idea: Those suffixes have total length 1 + 2 + … + (m+1), factoring in the special character.
● Problem: It is indeed possible to build a suffix tree with Θ(m²) total letters on the edges.
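As a quick check of the proof idea, the total length of the suffixes of T$ is an arithmetic series. The worked number uses the running example nonsense$, where m = 8:

```latex
\sum_{k=1}^{m+1} k \;=\; \frac{(m+1)(m+2)}{2} \;=\; \Theta(m^2),
\qquad \text{e.g. for } m = 8:\quad \frac{9 \cdot 10}{2} = 45 .
```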

Representing a Suffix Tree

● By being clever with our representation, we can guarantee that a suffix tree uses only Θ(m) space, regardless of the input string.
● Observation: Each edge is labeled with a substring of the original input string.
● Idea: Don’t actually write out the labels on the edges. Just write down the start and end index!

Representing a Suffix Tree

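[Figure: the suffix tree for nonsense$ (positions 012345678), with each edge stored as a start index, an end index, and a child pointer instead of a written-out label. The edges leaving the root are:]

  edge label starts with   $  e  n  o  s
  start                    8  4  0  1  3
  end                      8  4  0  8  4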

Representing a Suffix Tree

● Space usage required for a suffix tree:
  ● O(m) space for all the nodes.
  ● O(m) space for a copy of the original string.
  ● O(m) space for the edges.
● Total space: O(m).
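The index-pair idea can be sketched in a few lines of Python. This is only an illustration: the names `Node`, `Edge`, and `edge_label` are made up here, and the (start, end) pairs are the ones that appear to label the root's edges in the nonsense$ figure, with `end` treated as inclusive.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the first character of each outgoing edge label to that edge.
    edges: dict = field(default_factory=dict)


@dataclass
class Edge:
    start: int   # index in the text of the label's first character
    end: int     # index in the text of the label's last character (inclusive)
    child: Node  # node this edge leads to


def edge_label(text: str, edge: Edge) -> str:
    """Recover an edge's label on demand; the edge itself stores only two integers."""
    return text[edge.start:edge.end + 1]


text = "nonsense$"
root = Node()
# Hypothetical reconstruction of the five edges leaving the root.
for start, end in [(8, 8), (4, 4), (0, 0), (1, 8), (3, 4)]:
    root.edges[text[start]] = Edge(start, end, Node())

print({c: edge_label(text, e) for c, e in root.edges.items()})
```

Each edge costs a constant number of machine words no matter how long its label is, which is where the O(m) bound for the edges comes from.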

Constructing a Suffix Tree

● The naive algorithm for building a suffix tree (add one suffix at a time) takes time Θ(m²).
● Claim: With a much more clever approach, this can be done in time O(m).
● This is not obvious. We’ll spend a full lecture on this idea later on.

Suffix Tree Space Usage

● Suffix tree edges take up a lot of space.

  ● Two machine words per edge to denote the range of characters visited.
  ● One machine word per edge for the pointer itself.
  ● Number of edges ranges from m to 2m – 1, so this is between 3m and 6m machine words for the whole string!
● Example: a human genome is about three billion characters long.
  ● With clever techniques, that can be packed into about 800MB.
  ● On a 32-bit machine, the suffix tree needs about 48GB – too big to fit into memory!
  ● On a 64-bit machine, the suffix tree needs about 96GB – way more than a typical machine can hold!

Key Question: Can we get the benefits of a suffix tree without the space penalty?

What is it about suffix trees that makes them so useful algorithmically?

[Figure: the suffix tree for nonsense$.]

Theorem: There is a node labeled ω in a suffix tree for T if and only if ω is a suffix of T$ or ω is a branching word in T$.

A string ω is a branching word in T$ if there are distinct characters a and b where ωa and ωb are substrings of T$.


Key Intuition: The efficiency in a suffix tree is largely due to
1. keeping the suffixes in sorted order, and
2. exposing branching words.

Where We’re Going

● Today, we’ll see two data structures that encode much of the same information as suffix trees, but in much less space.

  ● The suffix array stores information about the ordering of the suffixes of a string.
  ● The LCP array stores information about the branching words of a string.
● Together, they support algorithms whose running times match, or come close to, the bounds from last time.

Suffix Arrays

● A suffix array for a string T is a sorted array of the suffixes of the string T$.
● Suffix arrays distill out just the first component of suffix trees: they store suffixes in sorted order.
● Non-obvious fact: Suffix arrays can be built in time O(m). Details next time!

Suffix array for ABANANABANDANA$:

  $
  A$
  ABANANABANDANA$
  ABANDANA$
  ANA$
  ANABANDANA$
  ANANABANDANA$
  ANDANA$
  BANANABANDANA$
  BANDANA$
  DANA$
  NA$
  NABANDANA$
  NANABANDANA$
  NDANA$
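To make the definition concrete, here is a minimal sketch that builds a suffix array by plain sorting. It is not the O(m) construction mentioned above: sorting whole suffixes can take Θ(m² log m) time in the worst case. The function name `build_suffix_array_naive` is made up for this example, each suffix is represented by its starting position, and the sketch assumes the text already ends in '$' and that '$' sorts before every other character (true for ASCII letters).

```python
def build_suffix_array_naive(text: str) -> list[int]:
    """Return the starting positions of the suffixes of `text`, in sorted order.

    Assumes `text` already ends with the sentinel '$'.
    """
    # Sort positions by the suffix that begins there (simple, but slow).
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "ABANANABANDANA$"
sa = build_suffix_array_naive(text)
for i in sa:
    print(f"{i:2d}  {text[i:]}")
```

Printing the suffix next to each position reproduces the sorted list shown on the slide.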

Suffix Arrays

● Last time, we saw how to find all instances of a pattern P in a text T using suffix trees.
● How could we do that with suffix arrays?
● Question: What’s a good algorithm for finding an element of a sorted array?

Suffix Arrays

● Reminder: Our text string T has length m. Our pattern string P has length n.
● Claim: With a suffix array, we can determine whether P matches in T in time O(n log m).
  ● Binary search has O(log m) rounds.
  ● Each probe takes time O(n).
  ● This bound can be made tight. (How?)
● Figure that m is often much bigger than n, so this is a huge win over a raw scan.

Suffix Arrays

● Claim: With a suffix array, we can find all matches of a pattern P in T in time O(n log m + z), where z is the number of matches.
● Idea: Binary search can be used to find a range of values equal to some key. Adapt that idea to find all suffixes beginning with the same prefix.

Storing Suffix Arrays

● The way we’ve drawn suffix arrays is terribly space-inefficient.
● It always uses space Θ(m²), since that’s how many total characters occur in all suffixes.
● Can we do better?

Storing Suffix Arrays

● We reduced the space usage of suffix trees by representing edge labels implicitly, as ranges within the original string.
● Idea: Don’t store the suffixes themselves. Just store the starting positions of the suffixes.
● Space: Θ(m), and with only one machine word used per character of input.

Suffix array for ABANANABANDANA$ (positions 012345678901234), stored as starting positions:

  14 13 0 6 11 4 2 8 1 7 10 12 5 3 9

Storing Suffix Arrays

● Although the list of starting positions is how we’d represent the suffix array in memory, for this lecture we’ll draw things out the longer way.
● This is just to build intuition; we wouldn’t actually do that in practice.

The Story So Far

● Suffix arrays store all the suffixes of a string in sorted order.
● They provide an ⟨O(m), O(n log m + z)⟩ solution to the substring search problem.
● Intuition: Suffix trees are valuable in large part because they just keep the suffixes sorted.
● What else are suffix trees doing?
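The O(n log m + z) search can be sketched with two binary searches over the suffix array, one for each end of the block of suffixes that begin with the pattern. This is an illustrative sketch, not necessarily how the lecture would implement it; the name `find_occurrences` is made up here, `text` is assumed to include the '$' sentinel, and `sa` is assumed to hold starting positions.

```python
def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the start positions of `pattern` in `text`, in suffix-array order."""
    n = len(pattern)

    def prefix(k: int) -> str:
        # First n characters of the k-th smallest suffix: one O(n) probe.
        return text[sa[k]:sa[k] + n]

    # Lower bound: first suffix whose n-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose n-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[first:lo] begins with the pattern.
    return sa[first:lo]


text = "ABANANABANDANA$"
sa = [14, 13, 0, 6, 11, 4, 2, 8, 1, 7, 10, 12, 5, 3, 9]
print(find_occurrences(text, sa, "ANA"))
```

The matches come back in suffix-array order rather than text order; sorting them would add an extra log factor on top of the z term.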

Branching Words

● Recall: If T is a string, then ω is a branching word in T$ if there are characters a ≠ b such that ωa and ωb are substrings of T$.

Branching Words

● Notice that, by sorting suffixes, we’ve made it easier to spot branching words.
● Specifically, all suffixes starting with a branching word will be adjacent in the suffix array.
● The branching word will be the longest common prefix (or LCP) of those adjacent suffixes.

Branching Words

● Theorem: A string ω is a branching word in string T$ if and only if it’s the longest common prefix of two adjacent suffixes in T’s suffix array.
● Proof idea: If ω is the longest common prefix of two adjacent suffixes, let a and b be the characters immediately following ω in those two suffixes. Then ωa and ωb are substrings of T$. If ω is branching, choose the lexicographically smallest a and b making the definition work. Then the last suffix starting with ωa and the first suffix starting with ωb are adjacent in the suffix array. ■

[Figure: the suffix tree for ABANANABANDANA$ drawn next to its suffix array.]

ω is an internal node in the suffix tree for T

if and only if

ω is a branching word in T$

if and only if

ω is the LCP of two adjacent suffixes in the suffix array for T

Key Intuition: Adjacent suffixes with long shared prefixes correspond to subtrees of the suffix tree.

Harnessing this Connection

Longest Repeated Substring

● Last time, we saw how to solve the longest repeated substring problem by using suffix trees.
● Algorithm: Find the internal node in the suffix tree with the longest label.
● Question: Can we do this with just a suffix array?

Longest Repeated Substring

● We can list all branching words from a suffix array in time O(m²).
  ● O(m) pairs; each pair takes time O(m) to process.
  ● This worst-case bound can be realized. (Do you see how?)
● Contrast this with O(m) for a suffix tree.
● Can we do better?

Longest Repeated Substring

● Observation: We don’t actually need to know what all the branching words are to find the longest repeated substring.
● We just need to know how long they are.
● That way, we can figure out which is longest.
● Is there some nice way to do this?

LCP Arrays

● The LCP array, often denoted H, is an array where H[i] is the length of the LCP of the ith and (i+1)st suffixes in the suffix array.
● (The letter H comes from “height.”)

Suffix array and LCP array for ABANANABANDANA$ (each H value is the LCP of the suffix on its row and the suffix on the next row):

  H   suffix
  0   $
  1   A$
  4   ABANANABANDANA$
  1   ABANDANA$
  3   ANA$
  3   ANABANDANA$
  2   ANANABANDANA$
  0   ANDANA$
  3   BANANABANDANA$
  0   BANDANA$
  0   DANA$
  2   NA$
  2   NABANDANA$
  1   NANABANDANA$
      NDANA$

[Figure: the suffix tree for ABANANABANDANA$ drawn next to its suffix array and LCP array.]

Key intuition: The suffix array gives the leaves of the suffix tree. The LCP array gives the internal nodes of the suffix tree.

Using LCP Arrays

● If you already have a suffix array and LCP array, you can solve longest repeated substring in time O(m):
  ● Find the largest element in the LCP array.
  ● Return the string it corresponds to.
● Question: How fast can we construct an LCP array?
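The two steps above fit in a few lines. This is a sketch under the lecture's convention that H[i] describes the ith and (i+1)st suffixes; the function name `longest_repeated_substring` is made up here, and the arrays are the ones for ABANANABANDANA$.

```python
def longest_repeated_substring(text: str, sa: list[int], lcp: list[int]) -> str:
    """Return a longest substring that occurs at least twice in `text`."""
    if not lcp:
        return ""
    # Index of the largest LCP entry (the first one, if there are ties).
    i = max(range(len(lcp)), key=lambda k: lcp[k])
    # lcp[i] is shared by the suffixes at sa[i] and sa[i + 1]; read it off the first.
    return text[sa[i]:sa[i] + lcp[i]]


text = "ABANANABANDANA$"
sa = [14, 13, 0, 6, 11, 4, 2, 8, 1, 7, 10, 12, 5, 3, 9]
lcp = [0, 1, 4, 1, 3, 3, 2, 0, 3, 0, 0, 2, 2, 1]
print(longest_repeated_substring(text, sa, lcp))
```

Finding the maximum is a single pass over the LCP array, so the whole query is O(m) once both arrays exist.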

Building LCP Arrays

● It never hurts to start with the naive algorithm and see what happens!
● Algorithm: For each consecutive pair of strings in the suffix array, compute the length of their longest common prefix.
● We can upper-bound the runtime at O(m²).
● Question: Can we realize this upper bound?
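The naive algorithm can be sketched directly from that description. The name `build_lcp_naive` is made up for this example, and `sa` is assumed to hold starting positions into `text`.

```python
def build_lcp_naive(text: str, sa: list[int]) -> list[int]:
    """H[i] = length of the LCP of the ith and (i+1)st suffixes; O(m^2) worst case."""
    m = len(text)
    lcp = []
    for i in range(len(sa) - 1):
        a, b = sa[i], sa[i + 1]
        k = 0
        # Scan both suffixes from scratch until they differ or one runs out.
        while a + k < m and b + k < m and text[a + k] == text[b + k]:
            k += 1
        lcp.append(k)
    return lcp


text = "ABANANABANDANA$"
sa = [14, 13, 0, 6, 11, 4, 2, 8, 1, 7, 10, 12, 5, 3, 9]
print(build_lcp_naive(text, sa))
```

Each pair is rescanned from its first character, which is exactly the repeated work the following slides set out to avoid.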

Building LCP Arrays

● Why is our naive algorithm slow?
● Intuition: We aren’t able to carry work from one suffix over to the next.

Building LCP Arrays

● Key intuition: Suffixes overlap one another! It should be possible to share LCP information across suffixes.
● For example, suppose we compute the LCP entry for ABANANABANDANA$ and ABANDANA$, which is 4.
● Look at the suffixes formed by dropping the first letter of these two suffixes: BANANABANDANA$ and BANDANA$.
● What do we know about their LCP? (In the figure, it is 3.)

Building LCP Arrays

● Let’s do another example. Suppose we know the LCP of ANA$ and ANABANDANA$, which is 3.
● As before, drop the first letter from each suffix, giving NA$ and NABANDANA$.
● What can we say about the LCP of the resulting suffixes? (In the figure, it is 2.)

Building LCP Arrays


● Sometimes, in dropping the first letter, two adjacent suffixes get spread out.
● Claim: Look at the second suffix in the pair. Its LCP with the suffix before it is at least the previous LCP minus one.
● Why?
  ● Think about the suffix tree. The two shorter suffixes are in the same subtree, so everything between them is also in that subtree, and therefore shares their common prefix of length (previous LCP minus one).

Building LCP Arrays

● We know that these two new suffixes must have an LCP of at least 1, because the two old suffixes have an LCP of 2.
● However, the LCP may be longer than 1, since we’ve never seen one of these two suffixes.
● We still need to do some scanning, but we won’t necessarily have to rescan the entire suffix.

Kasai’s Algorithm

● For each suffix of the original string, except the last:
  ● Find that suffix in the suffix array.
  ● Look at the suffix that comes before it.
  ● (★) Find the length of the longest common prefix of those suffixes.
  ● Write that down in the H array.
● Use the insight from the previous slides to speed up step (★).

Kasai’s Algorithm


● For each suffix of the original string, except the last:
  ● Find that suffix in the suffix array.
    With O(m) preprocessing time, this can be done in time O(1). Question to Ponder: How would you do this?
  ● Look at the suffix that comes before it.
  ● (★) Find the length of the longest common prefix of those suffixes.
    The runtime of this step is proportional to how much the LCP increases on that step.
  ● Write that down in the H array.
● Use the insight from the previous slides to speed up step (★).
  ● The LCP value decreases by at most one per suffix. (We saw this earlier.)
  ● The LCP value maxes out at m. (Can’t match more than all the characters.)
  ● Therefore, the LCP value can grow at most 2m times. (Prove this!)
  ● Claim: Across all iterations, this step takes a total of O(m) time.

[Figure: comparing ABANANABANDANA$ with ABANDANA$; a leading portion is already known to match, and only the characters after it had to be scanned.]

Kasai’s Algorithm
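The steps of Kasai’s algorithm can be sketched as follows. This is one possible rendering, not necessarily the lecture’s own code: the name `build_lcp_kasai` is made up here, and the `rank` array (the inverse of the suffix array) is one way to answer the “find that suffix in the suffix array” question in O(1) after O(m) preprocessing. The output follows the convention that H[i] describes the ith and (i+1)st suffixes.

```python
def build_lcp_kasai(text: str, sa: list[int]) -> list[int]:
    """Build the LCP array H in O(m) time from the text and its suffix array."""
    m = len(text)
    # rank[p] = index in sa of the suffix starting at position p.
    rank = [0] * m
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * (m - 1)
    h = 0  # characters already known to match, carried from the previous suffix
    for i in range(m):  # visit suffixes in text order: longest first
        r = rank[i]
        if r == 0:
            # Smallest suffix: nothing comes before it in the suffix array.
            h = 0
            continue
        j = sa[r - 1]  # start of the suffix just before this one
        # Step (★): extend the match, skipping the first h characters.
        while i + h < m and j + h < m and text[i + h] == text[j + h]:
            h += 1
        lcp[r - 1] = h
        # Dropping the first letter loses at most one matched character.
        if h > 0:
            h -= 1
    return lcp


text = "ABANANABANDANA$"
sa = [14, 13, 0, 6, 11, 4, 2, 8, 1, 7, 10, 12, 5, 3, 9]
print(build_lcp_kasai(text, sa))
```

The inner `while` loop only ever advances `h`, and `h` drops by at most one per outer iteration, which is the amortized argument sketched on the slide.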


● Total runtime of Kasai’s algorithm: O(m).

More to Explore

● We could easily spend a whole quarter talking about suffix arrays. Here’s what we didn’t cover:

  ● Bottom-up tree simulations: Using LCP arrays, you can simulate any O(m)-time suffix tree algorithm that works with a bottom-up DFS in time O(m).
  ● Faster substring searching: Using LCP arrays, plus RMQ, you can improve the cost of a substring search to O(n + z + log m).
  ● Burrows-Wheeler transform: Suffix arrays, plus LCP arrays, can be used to significantly improve the performance of text compressors.
● Find this stuff interesting? These could make for really interesting final project topics!

Next Time

● Building Suffix Trees
  ● … is not that bad once you have a suffix array.
● Building Suffix Arrays
  ● … with the mother of all divide-and-conquer algorithms!