Genetic Databases

Compression of Biological Sequences by Greedy Off-line Textual Substitution

Alberto Apostolico, Stefano Lonardi
Purdue University, Università di Padova
Data Compression Conference 2000, March 2000

Genetic Databases

  • Massive
  • Growing exponentially

Example: GenBank contained approximately 4,654,000,000 bases in 5,355,000 sequence records as of December 1999.

DNA Sequence Records

Composed of annotations (in English) and DNA bases (over the alphabet {A,C,G,T,U,M,R,W,S,Y,K,V,H,D,B,X,N}).

>RTS2 RTS2 upstream sequence, from -200 to -1
TCTGTTATAGTACATATTATAGTACACCAATGTAAATCTGGTCCGGGTTACACAACACTT
TGTCCTGTACTTTGAAAACTGGAAAAACTCCGCTAGTTGAAATTAATATCAAATGGAAAA
GTCAGTATCATCATTCTTTTCTTGACAAGTCCTAAAAAGAGCGAAAACACAGGGTTGTTT
GATTGTAGAAAATCACAGCG
>MEK1 MEK1 upstream sequence, from -200 to -1
TTCCAATCATAAAGCATACCGTGGTYATTTAGCCGGGGAAAAGAAGAATGATGGCGGCTA
AATTTCGGCGGCTATTTCATTCATTCAAGTATAAAAGGGAGAGGTTTGACTAATTTTTTA
CTTGAGCTCCTTCTGGAGTGCTCTTGTACGTTTCAAATTTTATTAAGGACCAAATATACA
ACAGAAAGAAGAAGAGCGGA
>NDJ1 NDJ1 upstream sequence, from -200 to -1
ATAAAATCACTAAGACTAGCAACCACGTTTTGTTTTGTAGTTGAGAGTAATAGTTACAAA
TGGAAGATATATATCCGTTTCGTACTCAGTGACGTACCGGGCGTAGAAGTTGGGCGGCTA
TTTGACAGATATATCAAAAATATTGTCATGAACTATACCATATACAACTTAGGATAAAA
ATACAGGTAGAAAAACTATA

Problem

Textual compression of DNA data is difficult: “standard” methods do not seem to exploit the redundancies (if any) inherent in DNA sequences. Cf. C. Nevill-Manning, I. H. Witten, “Protein is incompressible”, DCC99.

Findings and Improvements

  • A third scheme (Off-line3) has been designed
  • Compression time has been improved using a few “tuned” heuristics
  • Compression performance on single DNA sequences is substantially better than that of other generic textual compression methods
  • Compression performance approaches that of methods specifically designed for DNA sequences
  • The best performance is in the compression of families of DNA sequences

Yeast Chromosomes

[Figure: bar chart, vertical axis from 0 to 450,000, for chrI, chrIII, chrV, mito, chrVII, chrIX, chrXI, chrXIII, chrXV. Methods compared: Pack (Huffman), Compress (LZ-78), GZip (LZ-77), BZip2 (BWT), Off-Line3.]

Yeast Chromosome-III

[Figure: bar chart of bits per symbol for Off-line3, AED [A98], Gzip (LZ-77), BZip2 (BWT), Cdna [LY97], Pack (Huffman), Compress (LZ-78), Biocomp2 [GT94]. Plotted values: 2.33, 2.19, 2.18, 2.17, 1.97, 1.94, 1.92, 1.913; the pairing of values with methods is not recoverable from the extracted text.]

Results on Families

[Figure: bar chart, vertical axis from 0 to 300,000, for the datasets Spor_All, Spor_All_2x, Spor_EarlyI, Spor_EarlyII, Spor_Middle, Helden_All, Helden_GCN, All_Up_1M, All_Up_400k. Methods compared: Pack (Huffman), Compress (LZ-78), GZip (LZ-77), BZip2 (BWT), Off-Line3.]

Dataset at http://www.cs.purdue.edu/homes/stelo/off-line/

Overall Structure of Off-line

Off-line (string x)
  repeat
    • build an index T of the substrings w of the text x, and collect f_w (the count of non-overlapping occurrences of w)
    • choose Q substrings s_1, …, s_Q in T that maximize the gain function G
    • substitute the occurrences of s_1, …, s_Q in x with pointers
  until no further compression of x can be obtained

Data Structures

  • index T: minimal augmented suffix tree
    – construction: O(n log^2 n)
    – annotation with the count of non-overlapping occurrences: O(n)
  • text x stored in a balanced tree of text fragments
    – frequent deletions and string searches

Min. Augmented Suffix Tree

[Figure: minimal augmented suffix tree for the 22-symbol string abaababaabaababaababa$ (positions 1 to 22), with each node annotated by a count of non-overlapping occurrences.]

Off-line1

(In the formulas of the three schemes, f_w is the count of non-overlapping occurrences of w; m_w appears to be the length of w, B the number of bits per symbol, and n the length of the text. P and L appear to flag pointers and literals.)

  Text (positions 1 to 22):                 abaababaabaababaababa$
  Cost of the occurrences of w in the text: B f_w m_w
  Encoding:                                 w = aba at positions 1,4,9,12,17,…; remaining text (positions 1 to 7) bababa$; 3,… 5,…
  Cost after substitution:                  B m_w + log2(m_w) + log2(f_w) + f_w log2(n)

Off-line2

  Text (positions 1 to 22):                 abaababaabaababaababa$ (all 22 symbols flagged L)
  Cost of the occurrences of w in the text: (B + 1) f_w m_w
  Encoding:                                 (8,3) (5,3) b a a b a (-3,3) b a (-5,3) b a $
  Flags:                                    P P L L L L L P L L P L L L
  Cost after substitution:                  (B + 1) m_w + (f_w - 1) (log2(n) + log2(m_w) + 1)

Off-line3

  Text (positions 1 to 22):                 abaababaabaababaababa$ (all 22 symbols flagged L)
  Cost of the occurrences of w in the text: (B + 1) f_w m_w
  Encoding:                                 dictionary entry aba; stream 1 1 b a 1 1 b a 1 b a $; 3,…
  Flags:                                    P P L L P P L L P L L L
  Cost after substitution:                  B m_w + (log2(d) + 1) f_w + log2(m_w)

Off-line Comparison

               Paper2                 Mito
               size     time [min]    size     time [min]
  Off-line1    30,848   3.21          16,426   1.66
  Off-line2    33,757   3.01          17,741   2.24
  Off-line3    30,219   2.38          16,086   2.38

Measured on a 300 MHz / 128 MB machine running Solaris.

Heuristics

  • Queue: collect Q substrings from T with “high utilization” potential
  • Pruning: consider only substrings of length < L
  • “Standard” suffix tree: faster to build but less accurate (i.e., counts overlapping occurrences)
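The repeat/until loop of Off-line, combined with the Off-line1 cost, can be illustrated with a small Python sketch. It is not the authors' implementation and rests on several assumptions: the gain G is taken to be the difference between the two Off-line1 bit counts shown on the slides; the suffix-tree index is replaced by brute-force enumeration of substrings; one substring is chosen per iteration (Q = 1); occurrences are deleted from the text and their positions recorded, as in the Off-line1 example; and `max_len` stands in for the pruning bound L. All function names and the default of 8 bits per symbol are hypothetical.

```python
import math


def nonoverlapping_positions(text, w):
    """0-based starts of a left-to-right scan for non-overlapping copies of w."""
    positions = []
    i = text.find(w)
    while i != -1:
        positions.append(i)
        i = text.find(w, i + len(w))  # resume after the copy just found
    return positions


def gain_offline1(f, m, n, bits_per_symbol):
    """Assumed gain: plain cost B*f*m minus the Off-line1 encoded cost."""
    encoded = (bits_per_symbol * m + math.log2(m)
               + math.log2(f) + f * math.log2(n))
    return bits_per_symbol * f * m - encoded


def best_substring(text, bits_per_symbol, max_len):
    """Brute-force stand-in for the index T: return the best (w, gain)."""
    n = len(text)
    best_w, best_gain = None, 0.0
    seen = set()
    for m in range(2, max_len + 1):
        for i in range(n - m + 1):
            w = text[i:i + m]
            if w in seen:
                continue
            seen.add(w)
            f = len(nonoverlapping_positions(text, w))
            if f < 2:
                continue  # a single occurrence cannot save anything
            g = gain_offline1(f, m, n, bits_per_symbol)
            if g > best_gain:
                best_w, best_gain = w, g
    return best_w, best_gain


def offline_sketch(text, bits_per_symbol=8, max_len=16):
    """Greedy loop: stop when no substring has positive gain."""
    dictionary = []
    while True:
        w, _ = best_substring(text, bits_per_symbol, max_len)
        if w is None:
            break
        dictionary.append((w, nonoverlapping_positions(text, w)))
        text = text.replace(w, "")  # same left-to-right, non-overlapping scan
    return dictionary, text


def decode(dictionary, residual):
    """Undo the deletions, last substitution first."""
    text = residual
    for w, positions in reversed(dictionary):
        for p in positions:  # ascending, in pre-deletion coordinates
            text = text[:p] + w + text[p:]
    return text
```

On the slide's example string, `nonoverlapping_positions("abaababaabaababaababa$", "aba")` yields the 0-based positions 0, 3, 8, 11, 16, which correspond to the 1-based positions 1, 4, 9, 12, 17 listed for Off-line1, and deleting those copies leaves `bababa$`. The brute-force search is far slower than an index built once per iteration; it is used here only to keep the sketch short.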
Size vs. Iterations

[Figure: two line plots of final size (Off-line3) against iteration (0 to 300). For mito the size axis runs from 16,000 to 17,100; for paper2 it runs from 29,400 to 30,000.]

Final Remarks

  • Off-line appears to be a solid first step toward the compression of genetic sequences
  • Next: specialize Off-line for DNA with “biological knowledge” (e.g., palindromic/approximate occurrences)

Dictionary Complexity Hierarchy

[Figure: encoding schemes arranged along a time-complexity axis labeled Exponential, Polynomial, Linear. Schemes listed, in order: optimal encoding for general macro schemes; optimal encoding for a given dictionary; Off-Line encoding; LZ-77 encoding; LZ-78 encoding.]
