Persistent Trees

Full copy bad

Fat no des metho d

� replace each singlevalued eld of by list of all values

taken sorted by time

� requires O space p er data change unavoidable if keep old date

� to lo okup data eld need to search based on time

� store values in binary

� checkingchanging a data item takes O log m time after m up dates

� slowdown of O log m in structure access

Path copying

� much of data structure consists of xedsize nodes conencted by pointers

� can only reach no de by traversing p ointers starting from root

� changes to a no de only visible to ancestors in p ointer structure

)

k to ro ot of data structure

� when change a no de copy it and ancestors bac

� keep list of ro ots sorted by up date time

� O log m time to nd right ro ot or const if time is integers

� same access time as original structure

� additive instead of multiplicative O log m

� mo dication time and space usage equals numb er of ancestors p ossibly

huge

Combined Solution trees only

� in each no de store extra timestamp ed eld

� if full overrides one of standard elds for any accesses later than stamp ed

time

� access rule

standard access just check for overrides while following p ointers

constant factor increase in access time

� up date rule

when need to changecopy p ointer use extra if available

{ otherwise make new copy of no de with new info and recursively

mo dify parent

� Analysis

{ live no de p ointed at by current ro ot

{ p otential function numb er of ful l live no des

{ copying a no de is free new copy not full pays for copy spacetime

{ pay for lling an extra p ointer do only once since can stop at that

p oint

{ amortized space p er up date O

Power of twos Like Fib heaps Show of mo dications

Application p ersistent trees

� amortized cost O to change a eld

has O log n amortized eld change p er access

� O log n space p er access

� drawback rotations on access mean unbounded space usage

Redblack trees

� aggressive rebalancers

� store redblack bit in each no de

k bit to rebalance

� use redblac

� depth O log n

� search standard binary tree search no changes

� up date causes changes in redblack elds on path to item O rotations

� result log n space per insertdelete

� geometry do es O n changes so O n log n space

� O log n query time

Improvement

� redblack bits used only for up dates

� only need current version of redblack bits

� dont store old versions just overwrite

� only up dates needed are for O rotations

� so O space p er up date

� so O n space overall

Result O n space O log n query time for planar p oint lo cation

Extensions

� metho d extends to arbitrary p ointerbased structures

� O cost p er up date for any p ointerbased structure with any constant

indegree s

� full p ersistence with same b ounds

Sux Trees

Cro chemore and Rytter Text Algorithms

Guseld Algorithms on Strings Trees and Sequences

Weiner Linear Patternmatching algorithms IEEE conference on automata

and switching theory

McCreight A spaceeconomical sux tree construction algorithm JACM

Chen and Seifras Ecient and Elegegant Sux tree construction in Ap os

tolicoGalil Combninatorial Algorithms on Words

Another search structure dedicated to strings

Basic problem match a pattern to text

� goal decide if a given string pattern is a of the text

� p ossibly created by concatenating short ones eg newspap er

� application in IR also computational bio DNA seqs

� if pattern avilable rst can build DFA run in time linear in text

� if text available rst can build sux tree run in time linear in pattern

tree on strings Inecient b ecause run over pattern many

First idea binary

times

Tries

� Idea like bucket heaps use b ounded alphab et

� used to store dictionary of strings

� trees with children indexed by alphab et

� time to search equal length of query string

� insertion ditto

� optimal since even hashing requires this time to hash

� but b etter b ecause no hash function computed

� space an issue

{ using array increases stroage cost by jj

{ using binary tree on alphab et increases search time by log jj

{ ok for const alphab et

size in worst case sum of word lengths so pretty much solves dictionary

problem

But what ab out

2

substrings idea of all n

equivalent to trie of all n suxes

put marker at end so no sux contained in other otherwise some

sux can b e an internal no de hidden by piece of other sux

means one leaf p er sux

Naive construction insert each sux

basic alg

{ text a a

1 m

{ dene s a a

i i m

{ for i to m

{ insert s

i

2

time space O m

Better construction

uch smaller aaaaaaa note trie size may b e m

algorithm with time O jT j

idea avoid rep eated work by memoizing

also shades of nger idea

supp ose just inserted aw

next insert is w

big prex of w might already b e in trie

� avoid traversing skip to end of prex

Sux links

� any no de in trie corresp onds to string

� arrange for no de corresp to ax to p oint at no de corresp to x

� supp ose just inserted aw

� walk up tree till nd sux link

� follow link puts you on path corresp to w

� walk down tree adding no des to insert rest of w

Memoizing save your work

� can add sux link to every no de we walked up

� since walked up end of aw and are putting in w now

� charging scheme charge traversal up a no de to creation of sux link

� traversal up also covers same length traversal down

� once no de has sux link never passed up again

� thus total time sp ent going updown equals numb er of sux links

� one sux link p er no de so time O jT j

Sux Trees

2

m

problem maybe jT j is large

compress paths in sux trie

path on letters a a corresp to substring of text

i j

replace by edge lab elled by i j implicit nodes

gives tree where every no de has indegree at least

in such a tree size is order numb er of leaves O m

Search still works

{ preserves invariant at most one edge starting with given character

leaves a no de

{ store edges in array indexed by starting character

{ walk down same as trie but use indexing into text to nd chars

{ called slownd for later

Construction

� obvious build sux trie compress

2

� drawback may take m time and intermediate space

� b etter use original construction idea work in compressed domain

� as b efore insert suxes in order s s

1 m

� compressed tree of what inserted so far

� to insert s walk down tree

i

� at some p oint path diverges from whats in tree

� may force us to break an edge show

� tack on one new edge for rest of string cheap

MacReight

� use sux link idea of uplinkdown

� problem cant sux link every character only explict no des

� want to work prop ortional to real no des traversed

� need to skip characters inside edges since cant pay for them

� intro duced fastnd

{ idea fast alg for descending tree if know string present in tree

{ just check rst char on edge then skip numb er of chars equal to edge

length

{ may land you in middle of edge sp ecied o

b er of explicit no des in path

{ cost of search num

{ pay for with sux links

Analysis

� supp ose just inserted string aw

� sitting on its leaf which has parent

� invariant every internal no de except for parent of current leaf has sux

link to another explicit no de

� plausible

{ supp ose s and s diverge creating explicit no de at v

j k

{ claim s and s diverge at suxv creating another explicit

j +1 k +1

no de

{ only problem if s not yet present

k +1

{ happ ens only if k is current sux

{ only blo cks parent of current leaf

� insertion step

{ consider parent p and grandparent parent of parent g of current

i i

no de

{ g to p link has string w

i i 1

{ p to l link w

i i 2

{ go up to grandparent

{ follow sux link

{ fastnd w

1

{ claim know w is present in tree

1

es in variant

{ create sux link from p preserv

i

{ slownd w stopping when leave current tree

2

{ break current edge

{ add new edge for rest of w

2

Analysis

� Break into two costs from sufg to sufp w then sufp to p

i i 1 i i+1

w

2

� slownd cost is time to get from sufp to p plus const

i i+1

� note p is descendant of sufp

i+1 i

P

� so total cost O jp p j O n

i+1 i

what ab out time to nd sufp

i

less than slownd so at most jg j jg j to reach g

fastnd costs

i+1 i i+1

Sums to linear

Done if g b elow sufp double counts but who cares

i+1 i

what if g ab ove sufp

i+1 i

can only happ en if sufp p this is only no de b elow g

i i+1 i+1

in this case fastnd takes step to go from g to p landing in middle

i+1 i+1

of edge

only case where fastnd necessary but cant tell in advance

Weiner algorithm insert strings backwards use prex links Ukonnen online

version

Sux arrays

Applications

� prepro cess b ottom up storing rst last num of suxes in subtree

� allows to answer queries what rst last count of w in text in time O jw j

enumerate k o ccurrences in time O w jk j traverse subtree binary so

size order of numb er of o ccurences

longest common substring work b ottom up deciding if subtree contains

suxes of b oth strings Take deep est no de satisfying