String Complexity

Wojciech Szpankowski
Purdue University, W. Lafayette, IN 47907

June 1, 2015

Dedicated to Svante Janson for his 60th Birthday

Outline

1. Working with Svante
2. String Complexity
3. Joint String Complexity

Joint Papers

1. S. Janson and W. Szpankowski, Analysis of an asymmetric leader election algorithm, Electronic J. of Combinatorics, 4, R17, 1997.

2. S. Janson, S. Lonardi, and W. Szpankowski, On Average Sequence Complexity, Theoretical Computer Science, 326, 213–227, 2004 (also Combinatorial Pattern Matching Conference, CPM’04, Istanbul, 2004).

3. S. Janson and W. Szpankowski, Partial Fillup and Search Time in LC Tries, ACM Trans. on Algorithms, 3, 44:1–44:14, 2007 (also ANALCO, Miami, 2006).

4. A. Magner, S. Janson, G. Kollias, and W. Szpankowski, On Symmetry of Uniform and Preferential Attachment Graphs, Electronic J. Combinatorics, 21, P3.32, 2014 (also 25th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, AofA'14, Paris, 2014).


Working with Svante is easy ....

Outline

1. Working with Svante
2. String Complexity
3. Joint String Complexity

Some Definitions

String Complexity of a single sequence is the number of distinct substrings.

Throughout, we write $X$ for the string and denote by $I(X)$ the set of distinct substrings of $X$ over alphabet $\mathcal{A}$.

Example. If $X = aabaa$, then

\[
I(X) = \{\epsilon, a, b, aa, ab, ba, aab, aba, baa, aaba, abaa, aabaa\},
\]
and $|I(X)| = 12$. But if $X = aaaaa$, then
\[
I(X) = \{\epsilon, a, aa, aaa, aaaa, aaaaa\},
\]
and $|I(X)| = 6$.

The string complexity is the cardinality of $I(X)$, and we study here the average string complexity.
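A quick brute-force check of this example (my own sketch, not from the talk): enumerate all slices of the string and count the distinct ones, including the empty word.

```python
def string_complexity(x: str) -> int:
    """Count distinct substrings of x, including the empty word (x[i:i])."""
    return len({x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)})

assert string_complexity("aabaa") == 12
assert string_complexity("aaaaa") == 6
```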

\[
E[\,|I(X)|\,] = \sum_{X\in\mathcal{A}^n} P(X)\, |I(X)|.
\]

Suffix Trees and String Complexity

[Figure: non-compact suffix trie for $X = abaababa\$$, with leaves $1$–$9$ for the nine suffixes; string complexity $|I(X)| = 24$.]

[Figure: the corresponding compact suffix trees for $X = abaababa\$$, with unary paths collapsed into edges labeled by substrings such as $ababa\$$ and $ba\$$.]

String complexity = number of internal nodes in the non-compact suffix trie.
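This correspondence is easy to test. Below is a small sketch of mine (not code from the talk) that builds the non-compact suffix trie of $X\$$ as nested dictionaries and counts internal nodes, excluding the root (which corresponds to the empty word), reproducing the count 24 from the figure.

```python
def internal_nodes(x: str) -> int:
    """Count internal (non-leaf) nodes of the non-compact suffix trie of x$,
    excluding the root; each such node is a distinct non-empty substring of x."""
    text = x + "$"
    root: dict = {}
    for i in range(len(text)):            # insert every suffix text[i:]
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    def count(node: dict) -> int:         # internal nodes strictly below `node`
        return sum(1 + count(child) for child in node.values() if child)
    return count(root)

assert internal_nodes("abaababa") == 24   # matches |I(X)| = 24 from the figure
```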

Some Simple Facts

Let $O(w)$ denote the number of times that the word $w$ occurs in $X$. Then

\[
|I(X)| = \sum_{w\in\mathcal{A}^*} \min\{1, O(w)\}.
\]
Since between any two positions in $X$ there is one and only one substring,

\[
\sum_{w\in\mathcal{A}^*} O(w) = \frac{(|X|+1)|X|}{2}.
\]
Hence
\[
|I(X)| = \frac{(|X|+1)|X|}{2} - \sum_{w\in\mathcal{A}^*} \max\{0,\, O(w)-1\}.
\]
Define $C_n := E[\,|I(X)|\,:\,|X| = n\,]$. Then
\[
C_n = \frac{(n+1)n}{2} - \sum_{w\in\mathcal{A}^*} \sum_{k\ge 2} (k-1)\, P(O_n(w) = k).
\]
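Both identities can be verified numerically; the following sketch (an illustration, not from the talk) tabulates $O(w)$ over the non-empty substrings of $X = abaababa$.

```python
from collections import Counter

x = "abaababa"
n = len(x)
O = Counter(x[i:j] for i in range(n) for j in range(i + 1, n + 1))  # O(w), w non-empty

assert sum(O.values()) == (n + 1) * n // 2          # one occurrence per position pair
assert len(O) == (n + 1) * n // 2 - sum(k - 1 for k in O.values() if k >= 2)
assert len(O) == 24                                 # |I(X)| for X = abaababa
```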

We need to study $O_n(w)$ probabilistically, that is, the number of occurrences of $w$ in a text $X$ generated by a probabilistic source.

New Book on Pattern Matching

P. Jacquet and W. Szpankowski, Analytic Pattern Matching: From DNA to Twitter, Cambridge University Press, 2015. ISBN 978-0-521-87608-7.

[Slide shows the book cover and back-cover blurb: "How do you distinguish a cat from a dog by their DNA? Did Shakespeare really write all of his plays?" The book demonstrates the probabilistic approach to pattern matching via analytic combinatorics and analytic information theory; Part I compiles known results on pattern matching, and Part II applies the methods to data structures on words such as digital trees, suffix trees, string complexity, and string-based data compression.]

Book Contents

Chapter 1: Probabilistic Models

Chapter 2: Exact String Matching

Chapter 3: Constrained Exact String Matching

Chapter 4: Generalized String Matching

Chapter 5: Subsequence String Matching

Chapter 6: Algorithms and Data Structures

Chapter 7: Digital Trees

Chapter 8: Suffix Trees & Lempel-Ziv’77

Chapter 9: Lempel-Ziv’78 Compression Algorithm

Chapter 10: String Complexity

Some Results

The last expression allows us to write
\[
C_n = \frac{(n+1)n}{2} + E[S_n] - E[L_n],
\]
where $E[S_n]$ and $E[L_n]$ are, respectively, the average size and path length of the associated (compact) suffix trees.

We know that
\[
E[S_n] = \frac{1}{h}\bigl(n + \Psi(\log n)\bigr) + o(n), \qquad
E[L_n] = \frac{n\log n}{h} + n\,\Psi_2(\log n) + o(n),
\]
where $\Psi(\log n)$ and $\Psi_2(\log n)$ are periodic functions (when the $\log p_a$, $a\in\mathcal{A}$, are rationally related) and $h$ is the entropy rate. Therefore,
\[
C_n = \frac{(n+1)n}{2} - \frac{n}{h}\bigl(\log n - 1 + Q_0(\log n) + o(1)\bigr),
\]
where $Q_0(x)$ is a periodic function.

Theorem from 2004 Proved with Bare-Hands

In 2004 Svante, Stefano and I published the first result of this kind for a symmetric memoryless source (all symbol probabilities are the same).

Theorem 1 (Janson, Lonardi, W.S., 2004). Let $C_n$ be the string complexity for an unbiased memoryless source over alphabet $\mathcal{A}$. Then
\[
E(C_n) = \binom{n+1}{2} - n\log_{|\mathcal{A}|} n + \left(\frac{1}{2} + \frac{1-\gamma}{\ln|\mathcal{A}|} + \varphi_{|\mathcal{A}|}(\log_{|\mathcal{A}|} n)\right) n + O\bigl(\sqrt{n\log n}\bigr),
\]
where $\gamma \approx 0.577$ is Euler's constant and
\[
\varphi_{|\mathcal{A}|}(x) = -\frac{1}{\ln|\mathcal{A}|} \sum_{j\neq 0} \Gamma\!\left(-1 - \frac{2\pi i j}{\ln|\mathcal{A}|}\right) e^{2\pi i j x}
\]
is a continuous function with period 1. $|\varphi_{|\mathcal{A}|}(x)|$ is very small for small $|\mathcal{A}|$:
\[
|\varphi_2(x)| < 2\cdot 10^{-7}, \qquad |\varphi_3(x)| < 5\cdot 10^{-5}, \qquad |\varphi_4(x)| < 3\cdot 10^{-4}.
\]
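As a sanity check (my own Monte Carlo sketch, not from the paper), one can compare the theorem's prediction for $|\mathcal{A}| = 2$ against simulated random binary strings; the tiny fluctuation $\varphi_2$ is dropped, and agreement is only expected up to the $O(\sqrt{n\log n})$ error term.

```python
import math, random

def string_complexity(x: str) -> int:
    return len({x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)})

random.seed(1)
n, trials, gamma = 100, 100, 0.5772156649
empirical = sum(string_complexity("".join(random.choice("ab") for _ in range(n)))
                for _ in range(trials)) / trials
# Theorem 1 with |A| = 2, dropping the tiny phi_2 term (|phi_2| < 2e-7):
predicted = (n + 1) * n / 2 - n * math.log2(n) + (0.5 + (1 - gamma) / math.log(2)) * n
print(empirical, predicted)   # should agree up to the O(sqrt(n log n)) error term
```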

Outline

1. Working with Svante
2. String Complexity
3. Joint String Complexity

Joint String Complexity


For $X$ and $Y$, let $J(X,Y)$ be the set of common words between $X$ and $Y$. The joint string complexity is
\[
|J(X,Y)| = |I(X) \cap I(Y)|.
\]

Example. If $X = aabaa$ and $Y = abbba$, then $J(X,Y) = \{\varepsilon, a, b, ab, ba\}$.

Goal. Estimate
\[
J_{n,m} = E[\,|J(X,Y)|\,]
\]
when $|X| = n$ and $|Y| = m$.

Some Observations. For any word $w \in \mathcal{A}^*$,
\[
|J(X,Y)| = \sum_{w\in\mathcal{A}^*} \min\{1, O_X(w)\}\,\min\{1, O_Y(w)\}.
\]
When $|X| = n$ and $|Y| = m$, we have
\[
J_{n,m} = E[\,|J(X,Y)|\,] - 1 = \sum_{w\in\mathcal{A}^*-\{\varepsilon\}} P(O^1_n(w) \ge 1)\, P(O^2_m(w) \ge 1),
\]
where $O^i_n(w)$ is the number of $w$-occurrences in a string generated by source $i = 1, 2$ (i.e., $X$ and $Y$), which we assume to be memoryless sources.

Independent Joint String Complexity

Consider two sets of $n$ independently generated (memoryless) strings. Let $\Omega^i_n(w)$ be the number of strings for which $w$ is a prefix, when the $n$ strings are generated by source $i = 1, 2$. Define
\[
C_{n,m} = \sum_{w\in\mathcal{A}^*-\{\varepsilon\}} P(\Omega^1_n(w) \ge 1)\, P(\Omega^2_m(w) \ge 1).
\]

Theorem 2. There exists $\varepsilon > 0$ such that
\[
J_{n,m} - C_{n,m} = O(\min\{n,m\}^{-\varepsilon})
\]
for large $n$.

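Theorem 2 can be probed by simulation. In this sketch of mine, the sources $P_1 = (0.5, 0.5)$ and $P_2 = (0.8, 0.2)$ and the truncation length $L = 40$ are arbitrary choices: $J_{n,n}$ is estimated from common non-empty substrings of two random texts, and $C_{n,n}$ from common non-empty prefixes of two independent families of $n$ long random strings. The two averages should be close for large $n$.

```python
import random

A = "ab"
p1, p2 = [0.5, 0.5], [0.8, 0.2]       # two assumed memoryless sources
rng = random.Random(7)

def word(p, length):
    return "".join(rng.choices(A, weights=p, k=length))

def J_sample(n, m):
    """Common non-empty substrings of one text from each source."""
    subs = lambda s: {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return len(subs(word(p1, n)) & subs(word(p2, m)))

def C_sample(n, m, L=40):
    """Common non-empty prefixes of n (resp. m) independent strings, truncated at L."""
    prefs = lambda p, cnt: {w[:k] for w in (word(p, L) for _ in range(cnt))
                            for k in range(1, L + 1)}
    return len(prefs(p1, n) & prefs(p2, m))

avg = lambda f, t: sum(f() for _ in range(t)) / t
n = 50
print(avg(lambda: J_sample(n, n), 200), avg(lambda: C_sample(n, n), 200))
```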

Recurrence for $C_{n,m}$

\[
C_{n,m} = 1 + \sum_{a\in\mathcal{A}} \sum_{k,\ell\ge 0} \binom{n}{k} P_1(a)^k (1-P_1(a))^{n-k} \binom{m}{\ell} P_2(a)^{\ell} (1-P_2(a))^{m-\ell}\, C_{k,\ell},
\]
with $C_{0,m} = C_{n,0} = 0$.
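The recurrence can be solved numerically. One subtlety: the $(k,\ell) = (n,m)$ term on the right-hand side involves $C_{n,m}$ itself, with coefficient $\sum_a P_1(a)^n P_2(a)^m < 1$, so it is moved to the left-hand side before solving. A minimal sketch of mine, with two assumed binary sources:

```python
from math import comb

P1, P2 = [0.5, 0.5], [0.8, 0.2]           # two assumed binary memoryless sources

def joint_C(N, M):
    C = [[0.0] * (M + 1) for _ in range(N + 1)]    # C[0][m] = C[n][0] = 0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            rhs, diag = 1.0, 0.0
            for p, q in zip(P1, P2):
                for k in range(n + 1):
                    bk = comb(n, k) * p**k * (1 - p)**(n - k)
                    for l in range(m + 1):
                        bl = comb(m, l) * q**l * (1 - q)**(m - l)
                        if (k, l) == (n, m):
                            diag += bk * bl        # coefficient of the unknown C[n][m]
                        else:
                            rhs += bk * bl * C[k][l]
            C[n][m] = rhs / (1.0 - diag)           # solve for C[n][m]
    return C

print(joint_C(20, 20)[20][20])
```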

Generating Functions, Mellin Transform, DePoissonization ...

Poisson Transform. The Poisson transform $C(z_1,z_2)$ of $C_{n,m}$ is
\[
C(z_1,z_2) = \sum_{n,m\ge 0} C_{n,m}\, \frac{z_1^n z_2^m}{n!\,m!}\, e^{-z_1-z_2},
\]
which, after summing up the recurrence, becomes the functional equation
\[
C(z_1,z_2) = (1-e^{-z_1})(1-e^{-z_2}) + \sum_{a\in\mathcal{A}} C\bigl(P_1(a)z_1,\, P_2(a)z_2\bigr).
\]
Clearly, $n!\,m!\,C_{n,m} = [z_1^n][z_2^m]\, C(z_1,z_2)\, e^{z_1+z_2}$.

Mellin Transform. The two-dimensional Mellin transform is defined as
\[
C^*(s_1,s_2) = \int_0^{\infty}\!\!\int_0^{\infty} C(z_1,z_2)\, z_1^{s_1-1} z_2^{s_2-1}\, dz_1\, dz_2.
\]
From the above functional equation we find, for $-2 < \Re(s_i) < -1$,
\[
C^*(s_1,s_2) = \Gamma(s_1)\Gamma(s_2)\left(\frac{1}{H(s_1,s_2)} + \frac{s_1}{H(-1,s_2)} + \frac{s_2}{H(s_1,-1)} + \frac{s_1 s_2}{H(-1,-1)}\right),
\]
where the kernel is defined as
\[
H(s_1,s_2) = 1 - \sum_{a\in\mathcal{A}} (P_1(a))^{-s_1} (P_2(a))^{-s_2}.
\]

Finding $C_{n,n}$

Here we only consider $m = n$ and $z_1 = z_2 = z$. To recover $C_{n,n}$ we first find the inverse Mellin transform
\[
C(z,z) = \frac{1}{(2i\pi)^2} \int_{\Re(s_1)=c_1} \int_{\Re(s_2)=c_2} C^*(s_1,s_2)\, z^{-s_1-s_2}\, ds_1\, ds_2,
\]
which turns out to be
\[
C(z,z) = \left(\frac{1}{2i\pi}\right)^2 \int_{\Re(s_1)=\rho_1} \int_{\Re(s_2)=\rho_2} \frac{\Gamma(s_1)\Gamma(s_2)}{H(s_1,s_2)}\, z^{-s_1-s_2}\, ds_1\, ds_2 + o(z^{-M})
\]
for any $M > 0$, as $z \to \infty$ in a cone around the real axis.

The final step, to recover $C_{n,n} \sim C(n,n)$, is to apply the two-dimensional depoissonization.

Main Results

Assume that for all $a \in \mathcal{A}$ we have $P_1(a) = P_2(a) = p_a$.

Theorem 3. For a biased memoryless source, the joint complexity is asymptotically
\[
C_{n,n} = \frac{2\log 2}{h}\, n + Q(\log n)\, n + o(n),
\]
where $Q(x)$ is a small periodic function (with amplitude smaller than $10^{-6}$) which is nonzero only when the $\log p_a$, $a \in \mathcal{A}$, are rationally related, that is, $\log p_a/\log p_b \in \mathbb{Q}$.

Now assume that $P_1(a) \neq P_2(a)$.

Theorem 4. Define $\kappa = \min_{(s_1,s_2)\in\mathcal{K}\cap\mathbb{R}^2}\{-s_1-s_2\} < 1$, where $\mathcal{K}$ is the set of roots $(s_1,s_2)$ of
\[
H(s_1,s_2) = 1 - \sum_{a\in\mathcal{A}} (P_1(a))^{-s_1} (P_2(a))^{-s_2} = 0.
\]
Then
\[
C_{n,n} = \frac{n^{\kappa}}{\sqrt{\log n}} \left( \frac{\Gamma(c_1)\,\Gamma(c_2)}{\sqrt{\pi\,\Delta H(c_1,c_2)}\;\nabla H(c_1,c_2)} + Q(\log n) + O(1/\log n) \right),
\]
where $Q$ is a double periodic function.

Very Brief Sketch of Proof

1. Set $P_1(a) = 1/|\mathcal{A}|$; then the kernel is
\[
H(s_1,s_2) = 1 - |\mathcal{A}|^{s_1} \sum_{a\in\mathcal{A}} p_a^{s_2}.
\]
Define $r(s_2) = \sum_{a\in\mathcal{A}} p_a^{s_2}$ and $L(s_2) = \log_{|\mathcal{A}|} r(s_2)$.

2. The roots of $H(s_1,s_2) = 0$ are
\[
s_1 = -\log_{|\mathcal{A}|}(r(s_2)) + \frac{2ik\pi}{\log(|\mathcal{A}|)}, \qquad k \in \mathbb{Z},
\]
which are poles of $C(z,z)$, leading to
\[
C(z,z) \sim \frac{1}{2i\pi} \int_{\Re(s)=c_2} \sum_{k} \Gamma\!\left(-L(s) + \frac{2ik\pi}{\log(|\mathcal{A}|)}\right) \Gamma(s)\, z^{L(s)-s-2ik\pi/\log(|\mathcal{A}|)}\, ds.
\]

Integrating over $s = s_2$ requires the saddle point method.

Saddle Point

3. The function $L(s) - s$ achieves its minimum at $c_2 =: \rho$, the dominant real saddle point. But there is more ...

Infinitely many saddle points:

3a. $L(c_2 + it)$ is a periodic function with period $2\pi/\log\nu$.

3b. The saddle points are at $c_2 + 2\pi i\ell/\log\nu$.

3c. The infinitely many saddle points define the fluctuating function $Q$.

[Figure: the infinitely many saddle points $\rho + 2\pi i j/\log r$, $j \in \mathbb{Z}$, on the vertical line $\Re(s) = \rho$.]

4. The growth of $C(z,z)$ is defined by $z^{L(c_2)-c_2} = z^{\kappa}$, where
\[
\kappa = \min_{s\in\mathbb{R}}\{\log_{|\mathcal{A}|}(r(s)) - s\}, \qquad
c_2 = \arg\min_{s\in\mathbb{R}}\{\log_{|\mathcal{A}|}(r(s)) - s\},
\]
where here $s = s_2$, and recall that $L(s_2) = \log_{|\mathcal{A}|} r(s_2)$.

The factor $1/\sqrt{\log n}$ comes from the saddle point approximation. This completes the sketch.
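As a numeric companion to the classification below (my own sketch, with assumed binary sources), $\kappa$ can be computed from the characterization in Theorem 4: writing $t_i = -s_i$, minimize $t_1 + t_2$ subject to $\sum_a P_1(a)^{t_1} P_2(a)^{t_2} = 1$. For identical sources the constraint forces $t_1 + t_2 = 1$, hence $\kappa = 1$; for different sources the minimum (searched here over $t_2 \in [0,1]$, where it lies for these examples) drops below 1.

```python
# kappa = min{ t1 + t2 : sum_a P1(a)^t1 * P2(a)^t2 = 1 },  with t_i = -s_i
def kappa(P1, P2, grid=1000):
    def t1_of(t2):
        # the constraint is decreasing in t1 and has a root t1 >= 0: bisection
        f = lambda t1: sum(a**t1 * b**t2 for a, b in zip(P1, P2)) - 1.0
        lo, hi = 0.0, 1.0
        while f(hi) > 0:
            hi *= 2
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
        return (lo + hi) / 2
    return min(t1_of(j / grid) + j / grid for j in range(grid + 1))

print(kappa([0.5, 0.5], [0.5, 0.5]))   # identical sources: kappa = 1
print(kappa([0.5, 0.5], [0.8, 0.2]))   # different sources: kappa < 1
```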

Classification of Sources

The growth of $C_{n,n}$ is:

• $\Theta(n)$ for identical sources;
• $\Theta(n^{\kappa}/\sqrt{\log n})$ for nonidentical sources with $\kappa < 1$.

[Figure 1: Joint complexity: (a) English text vs. French, Greek, Polish, and Finnish texts; (b) real and simulated texts (3rd-order Markov) of the English, French, Greek, Polish, and Finnish languages.]

That's It

THANK YOU, SVANTE!