Genetic Databases
Compression of Biological Sequences by Greedy Off-line Textual Substitution

Alberto Apostolico, Stefano Lonardi
Purdue University, Università di Padova
Stefano Lonardi, March 2000
Data Compression Conference 2000

Genetic Databases
§ Massive
§ Growing exponentially
Example: GenBank contains approximately 4,654,000,000 bases in 5,355,000 sequence records as of December 1999.

DNA Sequence Records
Composed of annotations (in English) and DNA bases (over the alphabet {A,C,G,T,U,M,R,W,S,Y,K,V,H,D,B,X,N})

>RTS2 RTS2 upstream sequence, from -200 to -1
TCTGTTATAGTACATATTATAGTACACCAATGTAAATCTGGTCCGGGTTACACAACACTT
TGTCCTGTACTTTGAAAACTGGAAAAACTCCGCTAGTTGAAATTAATATCAAATGGAAAA
GTCAGTATCATCATTCTTTTCTTGACAAGTCCTAAAAAGAGCGAAAACACAGGGTTGTTT
GATTGTAGAAAATCACAGCG
>MEK1 MEK1 upstream sequence, from -200 to -1
TTCCAATCATAAAGCATACCGTGGTYATTTAGCCGGGGAAAAGAAGAATGATGGCGGCTA
AATTTCGGCGGCTATTTCATTCATTCAAGTATAAAAGGGAGAGGTTTGACTAATTTTTTA
CTTGAGCTCCTTCTGGAGTGCTCTTGTACGTTTCAAATTTTATTAAGGACCAAATATACA
ACAGAAAGAAGAAGAGCGGA
>NDJ1 NDJ1 upstream sequence, from -200 to -1
ATAAAATCACTAAGACTAGCAACCACGTTTTGTTTTGTAGTTGAGAGTAATAGTTACAAA
TGGAAGATATATATCCGTTTCGTACTCAGTGACGTACCGGGCGTAGAAGTTGGGCGGCTA
TTTGACAGATATATCAAAAATATTGTCATGAACTATACCATATACAACTTAGGATAAAA
ATACAGGTAGAAAAACTATA

Problem
Textual compression of DNA data is difficult: “standard” methods do not seem to exploit the redundancies (if any) inherent in DNA sequences; cf.
C. Nevill-Manning and I. H. Witten, “Protein is incompressible”, DCC '99.

Findings and Improvements
§ A third scheme (Off-line3) has been designed
§ Compression time has been improved using a few “tuned” heuristics
§ Compression performance on a single DNA sequence is substantially better than that of other generic textual compression methods
§ Compression performance approaches that of methods specifically designed for DNA sequences
§ The best performance is in the compression of families of DNA sequences

Yeast Chromosomes
[Figure: bar chart comparing Pack (Huffman), Compress (LZ-78), GZip (LZ-77), BZip2 (BWT), and Off-Line3. Horizontal-axis labels: chrI, chrIII, chrV, mito, chrVII, chrIX, chrXI, chrXIII, chrXV. Vertical axis: 0 to 450,000.]

Yeast Chromosome-III
[Figure: bar chart in bits per symbol for Off-line3, AED [A98], Gzip (LZ-77), BZip2 (BWT), Cdna [LY97], Pack (Huffman), Compress (LZ-78), and Biocomp2 [GT94]. Values printed on the bars: 2.33, 2.19, 2.18, 2.17, 1.97, 1.94, 1.92, 1.913. Vertical axis: 1.8 to 2.4.]

Results on Families
[Figure: bar chart comparing Pack (Huffman), Compress (LZ-78), GZip (LZ-77), BZip2 (BWT), and Off-Line3. Horizontal-axis labels: Spor_All, All_Up_1M, Spor_EarlyI, Spor_EarlyII, Helden_All, Spor_All_2x, Helden_GCN, Spor_Middle, All_Up_400k. Vertical axis: 0 to 300,000.]
Dataset at http://www.cs.purdue.edu/homes/stelo/off-line/

Overall Structure of Off-line
Off-line (string x)
  repeat
    § build an index T of the substrings w of the text x, and collect fw (the count of non-overlapping occurrences of w)
    § choose Q substrings s1,…,sQ in T that maximize the gain function G
    § substitute the occurrences of s1,…,sQ in x with pointers
  until no further compression of x can be obtained

Data Structures
§ index T: minimal
augmented suffix tree
  – construction: O(n log^2 n)
  – annotation with the count of non-overlapping occurrences: O(n)
§ text x stored in a balanced tree of text fragments
  – supports frequent deletions and string searches

Minimal Augmented Suffix Tree
[Figure: minimal augmented suffix tree for the example string abaababaabaababaababa$ (positions 1 to 22), with numeric annotations on its nodes.]

Off-line1
Text: abaababaabaababaababa$
Before substitution: B fw mw
After substitution: aba is stored once with its position list 1,4,9,12,17,…; the remaining text is bababa$ (further lists on the slide: 3,… and 5,…)
Cost: B mw + log2(mw) + log2(fw) + fw log2(n)

Off-line2
Text: abaababaabaababaababa$ (each of the 22 symbols carries the flag L)
Before substitution: (B + 1) fw mw
After substitution (flag row, then item row):
  P    P    L L L L L  P     L L  P     L L L
  8,3  5,3  b a a b a  -3,3  b a  -5,3  b a $
Cost: (B + 1) mw + (fw - 1) (log2(n) + log2(mw) + 1)

Off-line3
Text: abaababaabaababaababa$ (each of the 22 symbols carries the flag L)
Before substitution: (B + 1) fw mw
Dictionary: aba 3,…
After substitution (flag row, then item row):
  P P L L P P L L P L L L
  1 1 b a 1 1 b a 1 b a $
Cost: B mw + (log2(d) + 1) fw + log2(mw)

Off-line Comparison
             Paper2                Mito
             size     time [min]   size     time [min]
Off-line1    30,848   3.21         16,426   1.66
Off-line2    33,757   3.01         17,741   2.24
Off-line3    30,219   2.38         16,086   2.38
Measured on a 300 MHz / 128 MB machine running Solaris.

Heuristics
§ Queue – collect Q substrings from T with “high utilization” potential
§ Pruning – consider only substrings of length < L
§ “Standard” suffix tree – faster to build but less accurate (i.e., it counts overlapping occurrences)
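
The three cost expressions on the Off-line1, Off-line2, and Off-line3 slides, and the difference between overlapping and non-overlapping counts mentioned under Heuristics, can be checked on the slides' own example string. The following is a minimal sketch, not the authors' implementation: it scans the string directly instead of using a suffix tree, and it assumes that B is the number of bits per literal symbol, mw the length of w, fw its non-overlapping count, n the text length, and d the number of dictionary words (none of these are defined explicitly on the slides). All function names are hypothetical.

```python
import math


def nonoverlapping_positions(text, w):
    """1-based start positions of w, found left to right without overlaps."""
    positions, i = [], text.find(w)
    while i != -1:
        positions.append(i + 1)
        i = text.find(w, i + len(w))  # resume after the end of this match
    return positions


def overlapping_count(text, w):
    """Number of occurrences of w when overlaps are allowed."""
    count, i = 0, text.find(w)
    while i != -1:
        count += 1
        i = text.find(w, i + 1)  # resume one symbol further
    return count


# Cost expressions copied from the slides; the meaning of B and d is assumed.
def offline1_cost(B, m_w, f_w, n):
    return B * m_w + math.log2(m_w) + math.log2(f_w) + f_w * math.log2(n)


def offline2_cost(B, m_w, f_w, n):
    return (B + 1) * m_w + (f_w - 1) * (math.log2(n) + math.log2(m_w) + 1)


def offline3_cost(B, m_w, f_w, d):
    return B * m_w + (math.log2(d) + 1) * f_w + math.log2(m_w)


if __name__ == "__main__":
    text, w = "abaababaabaababaababa$", "aba"
    pos = nonoverlapping_positions(text, w)
    n, m_w, f_w = len(text), len(w), len(pos)
    B, d = 8, 1  # assumed: 8 bits per symbol, a single dictionary word
    print("non-overlapping positions:", pos)
    print("overlapping count:", overlapping_count(text, w))
    print("Off-line1 cost:", offline1_cost(B, m_w, f_w, n))
    print("Off-line2 cost:", offline2_cost(B, m_w, f_w, n))
    print("Off-line3 cost:", offline3_cost(B, m_w, f_w, d))
```

For w = aba the non-overlapping scan finds the positions 1, 4, 9, 12, 17 listed on the Off-line1 slide, while the overlapping count is larger, which is why a "standard" suffix tree overestimates fw.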
Size vs. Iterations (mito)
[Figure: final size (Off-line3) plotted against iteration. Horizontal axis: Iteration, 0 to 300. Vertical axis: 16000 to 17100.]

Size vs. Iterations (paper2)
[Figure: final size (Off-line3) plotted against iteration. Horizontal axis: Iteration, 0 to 300. Vertical axis: 29400 to 30000.]

Final Remarks
§ Off-line appears to be a solid first step toward tackling the problem of compressing genetic sequences
§ Next: specialize Off-line for DNA with “biological knowledge” (e.g., palindromic/approximate occurrences)

Dictionary Complexity Hierarchy
[Figure: encoding schemes arranged along a time-complexity axis labeled Exponential, Polynomial, Linear. Schemes shown: optimal encoding for general macro schemes, optimal encoding for a given dictionary, Off-Line encoding, LZ-77 encoding, LZ-78 encoding.]
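
The repeat/until loop from the Overall Structure of Off-line slide can be made concrete with a small brute-force sketch. It is only an illustration under several assumptions, not the authors' method: candidate substrings are enumerated directly rather than read from a minimal augmented suffix tree, one substring is chosen per round (Q = 1), the text is kept as a plain list of fragments rather than a balanced tree, and the gain G is assumed to be the Off-line3 "before" cost minus the "after" cost, with B taken as bits per symbol and d as the dictionary size after adding the word. The names `offline_greedy`, `best_substring`, `substitute`, and `expand` are hypothetical.

```python
import math


def count_nonoverlapping(fragment, w):
    """Count occurrences of w in fragment, left to right, without overlaps."""
    count, i = 0, fragment.find(w)
    while i != -1:
        count += 1
        i = fragment.find(w, i + len(w))
    return count


def gain(m_w, f_w, B, d):
    """Assumed gain: Off-line3 cost before substitution minus cost after."""
    before = (B + 1) * f_w * m_w
    after = B * m_w + (math.log2(d) + 1) * f_w + math.log2(m_w)
    return before - after


def best_substring(fragments, B, d, L):
    """Pick the literal substring of length 2..L-1 with the largest positive gain."""
    literals = [f for f in fragments if isinstance(f, str)]
    candidates = set()
    for frag in literals:
        for i in range(len(frag)):
            for m in range(2, L):  # pruning heuristic: length < L
                if i + m <= len(frag):
                    candidates.add(frag[i:i + m])
    best_w, best_gain = None, 0.0
    for w in sorted(candidates):  # sorted for deterministic tie-breaking
        f_w = sum(count_nonoverlapping(frag, w) for frag in literals)
        if f_w >= 2 and gain(len(w), f_w, B, d) > best_gain:
            best_w, best_gain = w, gain(len(w), f_w, B, d)
    return best_w


def substitute(fragments, w, index):
    """Replace every non-overlapping occurrence of w by the integer pointer index."""
    out = []
    for frag in fragments:
        if not isinstance(frag, str):
            out.append(frag)  # existing pointer, left untouched
            continue
        for k, piece in enumerate(frag.split(w)):
            if k > 0:
                out.append(index)
            if piece:
                out.append(piece)
    return out


def offline_greedy(text, B=8, L=8):
    """Repeat substitution rounds until no candidate has positive gain."""
    fragments, dictionary = [text], []
    while True:
        w = best_substring(fragments, B, len(dictionary) + 1, L)
        if w is None:
            return fragments, dictionary
        fragments = substitute(fragments, w, len(dictionary))
        dictionary.append(w)


def expand(fragments, dictionary):
    """Invert the substitution: pointers become dictionary words again."""
    return "".join(f if isinstance(f, str) else dictionary[f] for f in fragments)
```

Calling `expand(*offline_greedy(text))` should reproduce the input, which is a quick check that the substitution loses nothing. The sketch ignores the fact that the cost of earlier pointers changes as d grows, and its enumeration is far slower than an index-based search, so it is suitable only for short strings.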