Stefano Lonardi March, 2000

Compression of Biological Sequences by Greedy Off-line Textual Substitution

Alberto Apostolico Stefano Lonardi

Purdue University Università di Padova

Genetic Databases

§ Massive § Growing exponentially

Example: GenBank contains approximately 4,654,000,000 bases in 5,355,000 sequence records as of December 1999

Data Compression Conference 2000 1 Stefano Lonardi March, 2000

DNA Sequence Records Composed by annotations (in English) and DNA bases (on the alphabet {A,C,G,T,U,M,R,W,S,Y,K,V,H,D,B,X,N})

>RTS2 RTS2 upstream sequence, from -200 to -1 TCTGTTATAGTACATATTATAGTACACCAATGTAAATCTGGTCCGGGTTACACAACACTT TGTCCTGTACTTTGAAAACTGGAAAAACTCCGCTAGTTGAAATTAATATCAAATGGAAAA GTCAGTATCATCATTCTTTTCTTGACAAGTCCTAAAAAGAGCGAAAACACAGGGTTGTTT GATTGTAGAAAATCACAGCG >MEK1 MEK1 upstream sequence, from -200 to -1 TTCCAATCATAAAGCATACCGTGGTYATTTAGCCGGGGAAAAGAAGAATGATGGCGGCTA AATTTCGGCGGCTATTTCATTCATTCAAGTATAAAAGGGAGAGGTTTGACTAATTTTTTA CTTGAGCTCCTTCTGGAGTGCTCTTGTACGTTTCAAATTTTATTAAGGACCAAATATACA ACAGAAAGAAGAAGAGCGGA >NDJ1 NDJ1 upstream sequence, from -200 to -1 ATAAAATCACTAAGACTAGCAACCACGTTTTGTTTTGTAGTTGAGAGTAATAGTTACAAA TGGAAGATATATATCCGTTTCGTACTCAGTGACGTACCGGGCGTAGAAGTTGGGCGGCTA TTTGACAGATATATCAAAAATATTGTCATGAACTATACCATATACAACTTAGGATAAAA ATACAGGTAGAAAAACTATA

Problem

Textual compression of DNA data is difficult, i.e., “standard” methods do not seem to exploit the redundancies (if any) inherent to DNA sequences

cfr. C.Nevill-Manning, I.H.Witten, “Protein is incompressible”, DCC99

Data Compression Conference 2000 2 Stefano Lonardi March, 2000

Findings and Improvements

§ A third scheme (Off-line3) has been designed § Compression time has been improved using a few “tuned” heuristics § Compression performance on a single DNA sequences is substantially better than other generic textual compression methods § Compression performance approaches the methods specifically designed for DNA sequences § The best performance is in the compression of families of DNA sequences

Yeast Chromosomes

450,000 400,000 350,000 300,000 (Huffman) (LZ-78) 250,000 (LZ-77) 200,000 (BWT) 150,000 Off-Line3 100,000 50,000 0

chrI chrIII chrV mito chrVII chrIX chrXI chrXIII chrXV

Data Compression Conference 2000 3 Stefano Lonardi March, 2000

Yeast Chromosome-III

2.4 2.33 2.3

2.2 2.19 2.18 2.17 2.1

2 bits per symbol 1.97 1.94 1.92 1.9 1.913

1.8

Off-line3 AED [A98] Gzip (LZ-77) BZip2 (BWT) Cdna [LY97] Pack (Huffman) Compress (LZ-78) Biocomp2 [GT94]

Results on Families

300,000

250,000 Pack (Huffman) 200,000 Compress (LZ-78) 150,000 GZip (LZ-77) BZip2 (BWT) 100,000 Off-Line3 50,000

0

Spor_All All_Up_1M Spor_EarlSpor_EarlyII yI Helden_All Spor_All_2x Helden_GCNSpor_Middle All_Up_400k

dataset at http://www.cs.purdue.edu/homes/stelo/off-line/

Data Compression Conference 2000 4 Stefano Lonardi March, 2000

Overall Structure of Off-line

Off-line (string x) repeat § build an index T of the substrings w of the text x, and collect fw (count of non-overlapping occurrences) § choose Q substrings s1,…,sQ in T which maximize the gain function G

§ substitute the occurrences of s1,…,sQ in x with pointers until no further compression of x can be obtained

Data Structures

§ index T: min. augmented suffix tree – construction O(n log2(n)) – annotation with the count of non- overlapping occurrences O(n) § text x stored in a balanced tree of text fragments – frequent deletions and string searches

Data Compression Conference 2000 5 Stefano Lonardi March, 2000

Min. Augmented Suffix Tree

a b a ab a ba aba aba..$ 13 8 5 3 2 2 1 1

$ $ aba..$ $ ba$ 21 19 6 14 9

ba aba aba..$ 3 2 4

$ ba$ 17 12

ab a b a ab a aba..$ 4 3 3 2 2 1 3

aba..$ $ ba$ 8 16 11

ba a ba ba a ba aba..$ 8 4 3 2 2 1 2

$ aba..$ $ ba$ 20 7 15 10

ba aba aba..$ 3 2 5

$ $ ba$ 22 18 13

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 a b a a b a b a a b a a b a b a a b a b a $

Off-line1

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 abaababaabaababaababa$

B fw mw

1 2 3 4 5 6 7 aba 1,4,9,12,17,… bababa$ 3,… 5,…

B mw + log2(mw) + log2(fw) + fw log2(n)

Data Compression Conference 2000 6 Stefano Lonardi March, 2000

Off-line2

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 abaababaabaababaababa$ L L L L L L L L L L L L L L L L L L L L L L

(B + 1) fw mw

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2

8,3 5,3 baaba -3,3 ba -5,3 ba$ P P L L L L L P L L P L L L

(B + 1) mw + (fw - 1) (log2(n) + log2(mw) + 1)

Off-line3

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 abaababaabaababaababa$ L L L L L L L L L L L L L L L L L L L L L L

(B + 1) fw mw

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 aba 1 1 ba 1 1 ba 1 ba$ 3,… P P L L P P L L P L L L

B mw + (log2(d) + 1) fw + log2(mw)

Data Compression Conference 2000 7 Stefano Lonardi March, 2000

Off-line Comparison

Paper2 Paper2 Mito Mito

size time [min] size time [min]

Off-line1 30,848 3.21 16,426 1.66

Off-line2 33,757 3.01 17,741 2.24

Off-line3 30,219 2.38 16,086 2.38

300 Mhz/128 MB machine running Solaris

Heuristics

§ Queue – collect Q substrings from T with “high utilization” potential § Pruning – consider only substrings of length < L § “Standard” suffix tree – faster to build but less accurate (i.e., counts overlapping occurrences)

Data Compression Conference 2000 8 Stefano Lonardi March, 2000

Size vs. Iterations (mito)

17100 17000 16900 16800 16700 16600 16500 16400 16300 Final s ize (Off-lin e3) 16200 16100 16000 0 50 100 150 200 250 300 Iteration

Size vs. Iterations (paper2)

30000

29900

29800

29700

29600

Final s ize (Off-line3) 29500

29400 0 50 100 150 200 250 300 Iteration

Data Compression Conference 2000 9 Stefano Lonardi March, 2000

Final RemarksRemarks

§ Off-line appears to be a solid first step to tackle the problem of compression of genetic sequences § Next: specialize Off-line for DNA with “biological knowledge” (e.g., palindromic/approximate occurrences)

Data Compression Conference 2000 10 Stefano Lonardi March, 2000

Dictionary

Complexity Hierarchy

Optimal encoding for general macro schemes

Optimal encoding for a given dictionary

Off-Line encoding

LZ-77 encoding LZ-78 encoding

Exponential Polynomial Linear Time Complexity

Data Compression Conference 2000 11