(A) Overview of Sequence Curation Pipeline for PacBio Reads
[Supplementary Fig 1, panel A: pipeline diagram. Steps recovered from the figure:]
1. Sequencing: soil samples (x3); long-range PCR of the rDNA operon (18S, ITS + 5.8S, 28S); 3 SMRT cells on Sequel.
2. Generate Circular Consensus Sequences (CCS) from subreads with SMRT Analysis: min no. of passes = 2; min predicted accuracy > 0.99.
3. Filter CCS. Discard a CCS if: homopolymer length > 6; length < 2,500 bp or > 6,000 bp; it does not contain 18S and/or 28S (Barrnap).
4. Precluster and generate consensus sequences: cluster with vsearch (99% similarity) to get preclusters; if precluster size > 2, align sequences with MAFFT; 50% majority rule consensus.
5. Chimera detection and prokaryotic sequence removal: de novo chimera detection with UCHIME; BLAST of 18S against a reference (name not recoverable from the figure) to remove prokaryotic sequences.
6. Extract 18S and 28S: use in-house perl script.
7. Align and cluster into OTUs: align with MAFFT; cluster at 97% similarity; 650 OTUs based on 18S (1804 including singletons).

[Panel B: histogram; x-axis: number of passes in each CCS; y-axis: abundance. Panel C: violin plot; x-axis: 18S, 28S, full CCS; y-axis: length (bp).]

Supplementary Fig 1. (A) Overview of sequence curation pipeline for PacBio reads. (B) Distribution of number of passes in Circular Consensus Sequences (CCS) after curation of the sequences (i.e., after step 5 in the curation pipeline in panel A). As sequencing errors are randomly distributed in each pass, building a consensus from the passes greatly reduces the error rate. A CCS generated from 10 passes has an accuracy of ~99.9%. (C) Violin plot showing the range of lengths (in bp) of the curated sequences for the 18S, 28S, and 18S-28S (including the ITS region).

Supplementary Fig 2. (on following page). ML tree generated for taxonomic annotation of queries. The tree was inferred from an alignment of the 18S gene (1589 bp) with 650 queries (branches in red) and 1661 reference sequences derived from the SILVA SSU database (branches in black; see Methods for details on taxon sampling). This best unconstrained tree was selected from 20 ML tree searches, and bipartition support was derived from 300 bootstrap (BS) runs.
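The discard criteria in step 3 of the curation pipeline (Supplementary Fig 1A) can be sketched as a simple predicate. This is an illustrative sketch only, not the authors' script: the function names are hypothetical, the length rule is read as "shorter than 2,500 bp or longer than 6,000 bp", and the presence of 18S and 28S is assumed to have been determined beforehand (for example from Barrnap output) and passed in as booleans.

```python
import re

# Thresholds taken from the figure; how they were applied in practice is an assumption.
MAX_HOMOPOLYMER = 6
MIN_LENGTH = 2500
MAX_LENGTH = 6000


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of a single repeated base."""
    longest = 0
    for match in re.finditer(r"(.)\1*", seq.upper()):
        longest = max(longest, len(match.group(0)))
    return longest


def keep_ccs(seq: str, has_18s: bool, has_28s: bool) -> bool:
    """Return True if a CCS passes all three filters of step 3.

    has_18s / has_28s are hypothetical flags, assumed to come from a
    separate rRNA gene prediction step.
    """
    if longest_homopolymer(seq) > MAX_HOMOPOLYMER:
        return False
    if not (MIN_LENGTH <= len(seq) <= MAX_LENGTH):
        return False
    # Assumption: a CCS is kept only if both genes were detected.
    return has_18s and has_28s
```

For example, `keep_ccs("ACGT" * 750, True, True)` passes all three checks, while the same sequence with seven consecutive A bases appended is rejected by the homopolymer filter.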
[Supplementary Fig 2: tree image. Clade labels recovered from the figure: Rhizaria (Cercozoa, Endomyxa, Retaria); Stramenopila (Ochrophyta, Peronomycetes); Alveolata (Apicomplexa, Dinoflagellata, Ciliophora); Excavata; Archaeplastida (Chloroplastida); Cryptophyta + Breviata + other Incertae Sedis; Amoebozoa; Archaeplastida (Rhodophyta) + Haptophyta + Centrohelida; Opisthokonta (Ascomycota, Basidiomycota, "Early-diverging fungi", Metazoa + other Holozoa). Scale bar: 0.2.]

[Supplementary Fig 3: plots. Panel A title: taxonomic annotation by vsearch. Panel B title: phylogeny-aware taxonomic annotation. Panels A and B: x-axis: Similarity to reference sequence (%); y-axis: number of queries; legend (level): genus or lower, deep-branching lineage or lower, supergroup or lower, unassigned, chimera. Panel C: x-axis: Similarity to reference sequence (%); y-axis: EDPL.]

Supplementary Fig 3. Performance of our phylogeny-aware taxonomic annotation pipeline. (A) and (B) Density plots showing the lowest taxonomic rank assigned to each query based on its similarity to the closest reference. Pink corresponds to queries that were annotated down to either the genus or species level; cyan corresponds to queries that were annotated down to a major eukaryotic lineage (listed in Fig 2) or lower (family or sub-family); green corresponds to queries annotated to ranks above major eukaryotic lineages. (A) Density plot when assigning taxonomy using a similarity-based annotation strategy (in this case, vsearch). There is no correlation between the lowest taxonomic rank assigned to a query and its similarity to a reference, i.e., if the closest reference is labelled to species level, the query will also be classified down to species level regardless of whether the reference was identical or only 80% similar. (B) In comparison, our phylogeny-aware annotation method results in a better correlation between the taxonomic rank assigned and the similarity to the closest reference sequence.
(C) Scatterplot showing the EDPL (Expected Distance between Placement Locations) for queries placed on the pruned 18S tree with EPA (Evolutionary Placement Algorithm) in 'Strategy 2' of our taxonomic annotation pipeline. A high EDPL indicates that a query cannot be placed confidently at a single location in the tree. The figure shows that the EPA places queries confidently even if there is no close reference sequence available, as seen from the low EDPL values even at low sequence identities. (D) Zoomed-in section of the 18S tree with reference tips labelled in black and query tips (in red) relabelled to their assigned taxonomy. The top-most query is correctly annotated down to the species level based on its position relative to its nearest neighbour. The other two are labelled only down to the family level because of their relatively deep-branching position in the tree.

Supplementary Fig 4. Comparison of phylogenetic placement confidence for long queries (V4 to the end of the 18S gene) vs. short queries (generated in silico by truncating to the V4 region only). More long queries than V4 queries are placed with high confidence in the tree. (A) The cumulative frequency plot shows the LWR (likelihood weight ratio) value of the most likely placement of each query. At the far right, there is a steep rise to 100% caused by all the "remaining" queries, which have an LWR of exactly 1 (i.e., perfectly confident placements). Here, the orange V4 line shows that fewer V4 queries reach the highest possible LWR. (B) The EDPL cumulative frequency plot shows that more than 50% of the long reads have an EDPL of 0, while only ~35% of the V4 reads have an EDPL of 0. No queries (long or V4) have an EDPL higher than 0.1. Importantly, the long-read line is always above the V4 line, meaning that the EDPL values are smaller for the long reads.
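To make the EDPL and the cumulative frequency comparison concrete, the following is a minimal sketch of how an EDPL value might be computed for one query. It assumes the commonly used definition as the sum, over pairs of placement locations, of the product of their LWRs times the path distance between them; tools differ in details such as whether each pair is counted once or twice, so the convention here is an assumption. The function names and inputs are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Sequence


def edpl(lwrs: Sequence[float], dist: Sequence[Sequence[float]]) -> float:
    """Expected distance between placement locations for a single query.

    lwrs[i]    : likelihood weight ratio of placement i (assumed to sum to ~1)
    dist[i][j] : path distance on the tree between placements i and j
                 (assumed symmetric with a zero diagonal)
    Convention assumed here: the full double sum over ordered pairs (i, j).
    """
    n = len(lwrs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += lwrs[i] * lwrs[j] * dist[i][j]
    return total


def fraction_at_or_below(values: Sequence[float], threshold: float) -> float:
    """Fraction of queries whose value is <= threshold (one point on a
    cumulative frequency curve)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v <= threshold) / len(values)
```

Under this convention, a query with a single placement of LWR 1 has an EDPL of 0, and a query split evenly between two locations 0.2 apart has an EDPL of 0.1. Evaluating `fraction_at_or_below` at a threshold of 0 for long and for V4 queries gives the starting points of the two curves described for panel B.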
[Figure: phylogenetic tree of Apicomplexa with support values on branches; the caption is not present in this excerpt. Clade labels recovered from the figure: 'Terrestrial clade II' (Rueckert et al 2011) (insect-infecting eugregarines); Cephaloidophorids (marine eugregarines); Putative novel eugregarines; Archigregarines; Eugregarine (marine); Cryptosporidia; 'Terrestrial clade I' (Rueckert et al 2011) (monocystid eugregarines + neogregarines); Neogregarines; Eugregarines; Apicomplexa. Reference tips include Gregarina blattarum, Gregarina tropica, Gregarina sp. Phaedon/Daegwallyeong/2010, Gregarina coronata, Gregarina niphandrodes, Heliospora caprellae, Heliospora cf. longissima, Ganymedes sp. SR-2010a, Cephaloidophora cf. communis, Thiriotia pugettiae, Selenidium pygospionis, Selenidium sensimae, Ancora cf. sagittata, Cryptosporidium muris RN66, Cryptosporidium parvum, Monocystis agilis, Prismatospora evansi, Syncystis mirabilis, Ascogregarina taiwanensis, and Neogregarinorida sp. OPPPC1. Query tips are OTU sequences labelled either by CCS read name (e.g., m54032_170210_154134_50266641_ccs_Otu00002_1322) or by precluster consensus name (e.g., c-8534_conseq_Otu00009_435).]