DECIPHER: A Search-Based Approach to Chimera Identification for 16S rRNA Sequences
Erik S. Wright ([email protected]), L. Safak Yilmaz, Daniel R. Noguera

Introduction

Why are chimeras bad?
Chimeras are DNA sequences of mythical organisms that can cause false interpretations of diversity.

Which one of these sequences is a chimera?
[Figure: example sequences]

Where do chimeras come from?
1. Imagine a single strand of DNA represented by a line.
2. There might be a lesion in the DNA strand caused by age, UV radiation, or some other factor.
3. During PCR, DNA polymerase replicates the single strand of DNA.
4. When DNA polymerase reaches the lesion, it may stop replication.
5. In the next PCR cycle, the incomplete template may serve as a primer.
6. The result is a PCR product that is a concatenation of two different organisms' DNA sequences.

Methods

How do you find chimeras with DECIPHER?
This method for chimera finding was derived by modeling a set of several hundred real chimeric sequences present in public sequence repositories and thousands of artificial chimeras generated from the rRNA sequences of isolated organisms:
1. Break the classified sequence into overlapping 30 nucleotide fragments.
2. Categorize fragments by prevalence (search hits):
- High frequency in classified group
- Low frequency in-group and high frequency out-of-group (suspect fragments)
3. Apply rules to the chimeric region:
- At least 6 suspect fragments for full-length sequences (fs); at least 2 suspect fragments for short-length sequences (ss)
- At least 70 nucleotides long for full-length sequences (fs); at least 40 nucleotides long for short-length sequences (ss)
- At least 1 nucleotide overlapping the first or last 200 nucleotides
- Greater than 60% coverage with suspect fragments

How was the solution implemented?
Hundreds of thousands of suspect fragments can be queried against a set of over a million 16S rRNA sequences. This is accomplished in linear time by using an Aho-Corasick string search algorithm. This method requires pre-processing the 30-mer fragments into a finite state machine that enables querying for all of the sequence patterns simultaneously. The example shows the state machine for the pattern set: {TG, TCA, TCT, GA}.
[Figure: Aho-Corasick state machine for the pattern set {TG, TCA, TCT, GA}]

Results

What is the accuracy of this method?
The method was validated by testing thousands of artificially generated chimeras developed by joining the rRNA sequences of randomly selected isolate organisms. Using this method resulted in many chimeras that were frequently difficult to differentiate from their parent sequences. Therefore, this sequence set served as an excellent benchmark for comparison between different chimera finding programs.
[Figure: Cumulative False Positives (0.0% to 3.0%) versus Max. Sequence Length (100 to 1500 nucleotides) for fs_DECIPHER and ss_DECIPHER]
[Figure: False Negatives (0% to 40%) versus Sequence Length (100 to 1500 nucleotides) for fs_DECIPHER and ss_DECIPHER]

How did this method compare to others?
Qualitative rating for each characteristic           DECIPHER [1]   Uchime [2]   Chimera Slayer [3]   Pintail [4] (WigeoN)
                                                     fs     ss
Detection in short sequences                         +      +++     ++++         +                    +
Detection in mid-range sequences                     +      ++++    +++          ++                   +
Detection in long sequences                          +++    ++++    +++          ++                   ++
Detection of short chimeric regions                  ++     ++++    ++           ++                   +
Detection of complex chimeras                        ++++   ++++    ++++         +                    +++
Detection of chimeras from low divergence parents    +      +       ++++         +++                  +
Independence from reference dataset                  +++    +++     ++           ++                   ++
Low false positives                                  ++++   ++      ++           +++                  ++
[Figure: Percent Detection (0% to 100%) versus Sequence Length (0 to 1500 nucleotides) for ss_DECIPHER, fs_DECIPHER, Uchime, ChimeraSlayer, and WigeoN]
[Figure: Percent Detection (0% to 100%) versus Length of Chimeric Range (0 to 700 nucleotides) for ss_DECIPHER, fs_DECIPHER, Uchime, ChimeraSlayer, and WigeoN]

How was this applied to a real problem?
Major public DNA sequence repositories contain millions of 16S ribosomal RNA sequences. The curators of these databases screen for sequence abnormalities such as chimeras, yet additional chimeras are still present in these public sequence repositories.
[Figure: Chimeric Sequences (0% to 2.0%) in RDP, Silva, and greengenes, as found by ss_DECIPHER and fs_DECIPHER]

Where is there more information?
Find chimeras online at: DECIPHER.cee.wisc.edu
Contact Erik at: [email protected]

1. Wright, E. S., L. S. Yilmaz, and D. R. Noguera. 2012. DECIPHER: a search-based approach to chimera identification for 16S rRNA sequences. Appl. Environ. Microbiol., doi:10.1128/AEM.06516-11.
2. Edgar, R. C., B. J. Haas, J. C. Clemente, C. Quince, and R. Knight. 2011. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics, Epub ahead of print, doi:10.1093/bioinformatics/btr381.
3. Quince, C., A. Lanzen, T. P. Curtis, R. J. Davenport, N. Hall, I. M. Head, L. F. Read, and W. T. Sloan. 2009. Accurate determination of microbial diversity from 454 pyrosequencing data. Nat. Methods 6:639-U627.
4. Ashelford, K. E., N. A. Chuzhanova, J. C. Fry, A. J. Jones, and A. J. Weightman. 2005. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl. Environ. Microbiol. 71:7724-7736.
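
The rules applied to a chimeric region in the Methods can be sketched in code. This is only an illustration under stated assumptions, not the DECIPHER implementation: the poster does not say how the chimeric region or its coverage is computed, so the sketch assumes the region runs from the start of the first suspect fragment to the end of the last one, and that coverage is the share of region positions lying inside at least one suspect fragment. The names `Thresholds` and `passes_rules` are hypothetical; the numeric thresholds are the ones listed on the poster.

```python
from dataclasses import dataclass

FRAGMENT_LENGTH = 30  # fragment size stated on the poster


@dataclass(frozen=True)
class Thresholds:
    min_fragments: int      # minimum number of suspect fragments
    min_region_length: int  # minimum chimeric region length (nucleotides)


# Values quoted from the poster's rule list.
FULL_LENGTH = Thresholds(min_fragments=6, min_region_length=70)   # fs
SHORT_LENGTH = Thresholds(min_fragments=2, min_region_length=40)  # ss


def passes_rules(suspect_starts, sequence_length, thresholds,
                 end_window=200, min_coverage=0.60):
    """Return True if the suspect fragments satisfy all four rules.

    Assumption: the chimeric region spans from the first suspect
    fragment's start to the last suspect fragment's end (0-based,
    end exclusive).
    """
    starts = sorted(set(suspect_starts))
    if len(starts) < thresholds.min_fragments:
        return False

    region_start = starts[0]
    region_end = starts[-1] + FRAGMENT_LENGTH
    region_length = region_end - region_start
    if region_length < thresholds.min_region_length:
        return False

    # At least one nucleotide must fall in the first or last `end_window`.
    touches_end = (region_start < end_window
                   or region_end > sequence_length - end_window)
    if not touches_end:
        return False

    # Assumed coverage: positions inside at least one suspect fragment,
    # divided by the region length; must exceed `min_coverage`.
    covered = set()
    for start in starts:
        covered.update(range(start, start + FRAGMENT_LENGTH))
    return len(covered) / region_length > min_coverage
```

For example, six suspect fragments starting at positions 0, 10, 20, 30, 40, and 50 of a 1400-nucleotide sequence form an 80-nucleotide region at the sequence start with full coverage, so they pass the full-length rules; the same fragments placed around position 600 fail because the region touches neither end.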
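
The Aho-Corasick search named in the implementation panel can be illustrated with the poster's own pattern set {TG, TCA, TCT, GA}. The following is a minimal textbook sketch, not DECIPHER's code: the function names `build_automaton` and `find_all` are hypothetical, and the real method pre-processes 30-mer fragments rather than these short patterns.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of patterns."""
    goto = [{}]    # goto[state][char] -> next state (a trie)
    output = [[]]  # patterns recognized on reaching each state
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        output[state].append(pattern)

    # Breadth-first pass: each state's failure link points to the state
    # for the longest proper suffix that is also a prefix of some pattern.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            # Inherit matches that end at the failure state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, patterns):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, output = build_automaton(patterns)
    state = 0
    hits = []
    for index, char in enumerate(text):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for pattern in output[state]:
            hits.append((index - len(pattern) + 1, pattern))
    return hits


print(find_all("TCTGA", ["TG", "TCA", "TCT", "GA"]))
# [(0, 'TCT'), (2, 'TG'), (3, 'GA')]
```

The text is scanned once regardless of how many patterns are loaded, which is why the search time grows linearly with the sequence length plus the number of matches rather than with the number of fragments.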