Supplementary Information for Mosaic Genome of Endobacteria in Arbuscular Mycorrhizal Fungi: Trans-Kingdom Gene Transfer in an Ancient Mycoplasma-Fungus Association
Gloria Torres-Cortés, Stefano Ghignone, Paola Bonfante and Arthur Schüßler

Supporting Online Material

Table of Contents
1. MATERIAL AND METHODS (Text S1)
1.1. Endobacteria DNA extraction for sequencing
1.2. Semiquantitative analysis of DhMRE phylotypes I and II abundance
1.3. Illumina sequencing and assembly
1.4. Phylogenetic analyses
1.5. Identification of horizontal gene transfer
2. SUPPLEMENTARY TABLES: Tables S1 to S7
3. SUPPLEMENTARY FIGURES: Figures S1 to S8
4. REFERENCES

Text S1

1. MATERIAL AND METHODS

1.1 Endobacteria DNA extraction for sequencing

For DhMRE genomic preparation, Dentiscutata heterogama spores were crushed in 1 ml extraction buffer (250 mM sucrose, 10 mM MES pH 6.5, 25 mM KCl, 20 mM MgCl2, 1 mM dithiothreitol) at 4°C by using a glass homogenizer. The major spore debris was pelleted by centrifugation at 500 g for 2 min, the supernatant was then again centrifuged at 1,000 g for 2 min to remove nuclei and other high density debris. The newly formed supernatant was filtered through an 8 μm and then a 3 μm polycarbonate filter (Whatman). The resulting bacteria suspension was centrifuged at 25,000 g for 15 min to pellet the bacteria; the pellet was re-suspended in re-suspension buffer (10 mM Tris-HCl pH 8, 250 mM sucrose) and treated for 60 min with DNase at 4°C to remove free DNA. After inactivating DNase activity by heat treatment, DNA was extracted with the MasterPure Gram-positive DNA purification kit (Epicentre) according to the manufacturer’s recommendations.

1.2 Semiquantitative analysis of DhMRE phylotypes I and II abundance

For the semiquantitative analysis of DhMRE phylotypes, DNA was extracted from 20 D. heterogama spores and from the suspension resulting after the endobacteria DNA extraction protocol. These DNA samples were used in the construction of each clone library.
PCR was performed using MRE-specific primers and the Phusion High-Fidelity DNA Polymerase (New England Biolabs). PCR products were TOPO cloned (Invitrogen) and transformed into Top10 chemically competent Escherichia coli. Colonies were then screened by PCR for phylotype-specific insert length differences and by RsaI digestion (phylotype I); 498 clones were analyzed.

1.3 Illumina sequencing and assembly

Three different sequencing libraries were constructed using the transposon-based Nextera™ DNA Sample Prep Kit (Illumina), with 50 ng of DNA as starting material. One additional library was produced with the transposon-based Nextera™ XT kit (Illumina). After library production, sequencing was performed on the Illumina MiSeq platform at the Genomics Service Unit of the Ludwig-Maximilian-University Munich Biocenter, generating 41 × 10^6 paired-end 150 bp raw reads. The paired-end reads were quality trimmed using CLC workbench v5 (CLC Bio) with default parameters. Trimmed reads were mapped against the main bacterial contaminant genomes to remove contaminant reads. The remaining, cleaned reads were assembled using the CLC de novo assembly algorithm with a k-mer size of 23, resulting in 3,655 contigs. These were then filtered according to two criteria: contigs with a G+C content > 45% and contigs with a coverage < 20 were discarded, so that only contigs with G+C < 45% and coverage > 20 were retained. This resulted in 119 contigs (1.17 Mb; SI Appendix, Table S1) used for further analyses. To identify the putative DhMRE sequences among the resulting 119 contigs, we performed BLASTX searches against the NCBI database.

To determine possible contamination by sequences from the fungal host, raw reads obtained after DhMRE spore metagenome sequencing were mapped (> 60% identity over 0.5 of the sequence length) against the published Rhizophagus irregularis genome assembly (1); only 0.18% of the reads mapped to it.

To further validate the CLC assembly, two additional strategies were followed.
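The two-criteria contig filter described in this section can be illustrated with a short sketch. It is not the CLC-based workflow used in this study: the `Contig` class, the function names and the toy sequences are hypothetical, and the strict inequalities at exactly 45% G+C or 20-fold coverage are an assumption, since the text does not state how boundary values were treated. Following Table S1, a contig is retained only if it passes both criteria (G+C < 45% and coverage > 20).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Contig:
    """Hypothetical container for one assembled contig."""
    name: str
    sequence: str
    coverage: float  # mean read depth over the contig


def gc_percent(sequence: str) -> float:
    """Return the G+C content of a nucleotide sequence in percent.

    Empty sequences are reported as 0.0 (an arbitrary choice for this sketch).
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def filter_contigs(
    contigs: Iterable[Contig],
    max_gc: float = 45.0,        # threshold taken from the text
    min_coverage: float = 20.0,  # threshold taken from the text
) -> List[Contig]:
    """Keep contigs with G+C below max_gc AND coverage above min_coverage."""
    return [
        c for c in contigs
        if gc_percent(c.sequence) < max_gc and c.coverage > min_coverage
    ]


if __name__ == "__main__":
    # Toy data for illustration only; these are not contigs from the study.
    toy = [
        Contig("low_gc_high_cov", "ATATATGCAT", 172.0),
        Contig("high_gc", "GCGCGCGCAT", 150.0),
        Contig("low_cov", "ATATATATGC", 5.0),
    ]
    for contig in filter_contigs(toy):
        print(contig.name, gc_percent(contig.sequence), contig.coverage)
```

Both thresholds are exposed as parameters so that the effect of the cut-offs on the number of retained contigs can be explored.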
In the first strategy, raw reads were mapped against the scaffolds with 90% sequence similarity and length coverage as criteria, using CLC Genomics v. 5.2. Areas with low coverage were also included and manually inspected. In the second, we tracked and visualized the paired-end connections between scaffolds, following the instructions of Albertsen et al. (2) and using Cytoscape for the visualization (3).

To test for the hypothesized existence of divergent genomes, we carried out BLAST searches of putative DhMRE contigs against the total of all contigs obtained. We could not identify any contigs with > 75% identity over > 1,500 bp length with the query, indicating that the assembled data are not composed of multiple closely related genomes. To analyze the expected presence of both 16S rRNA gene phylotypes in the raw reads, we mapped the raw data against both major 16S phylotypes known from Sanger sequencing approaches, but mainly identified one of them (only 24 reads in total mapped to phylotype II; Fig. S4). To exclude the possibility that reads belonging to different phylotypes were joined in the assembly, a QualitySNPng analysis (4) was performed on the contig containing the 16S rRNA gene. In addition, a second assembly was performed with MIRA (5). This approach yielded the same sequence information as the CLC assembly, but with higher fragmentation.

Correlations between Z-scores of tetranucleotide composition were assessed using TETRA (6). DNAPlotter (7) was used for the circular and linear representations of the scaffolds.

1.4 Phylogenetic analyses

To study candidate HGT proteins of DhMRE by phylogenetic analyses, homologous sequences were selected after BLAST searches. For this, 40 BLASTP hits were selected for each DhMRE query protein, represented by the five best BLASTP hits of (i) the non-redundant protein sequences (nr) database from the NCBI, (ii) the nr database excluding R. irregularis, (iii) the nr database including fungal sequences only, but excluding R. irregularis, and (iv) the nr database including bacterial sequences only. Redundant hits were removed.

1.5 Identification of horizontal gene transfer

BLASTP analyses of the DhMRE proteome against the non-redundant protein sequences (nr) database from the NCBI were conducted with the software Blast2GO to identify protein sequences with similarity to proteins from the AMF R. irregularis, which lacks endobacteria. BLASTP affiliation was based on the best hit with a cut-off value of e-03. Eukaryotic domains were identified by analyzing the results obtained from the Interpro and SUPERFAMILY databases integrated in the MicroScope platform (8). The presence of genomic islands in the DhMRE draft genome was studied using the software IslandViewer.

2. SUPPLEMENTARY TABLES

Table S1. Metagenome assembly data.

| Category | Parameter | Value |
| --- | --- | --- |
| Raw data | Reads (bp) | 2 × 150 |
| Raw data | No. of reads | 41 × 10^6 |
| Raw data | Primary sequence data (Gb) | 6 |
| Cleaned data | No. of reads | 15.8 × 10^6 |
| Cleaned data | Contigs sequence data (Mb) / No. of contigs | 15 / 3655 |
| Cleaned data | No. of contigs with G+C < 45% | 1494 |
| Cleaned data | No. of contigs with G+C < 45% and coverage > 20 | 119 |
| Cleaned data | Cleaned sequence data (Mb) | 1.17 |
| DhMRE sequences | DhMRE sequence data | 0.702 |
| DhMRE sequences | No. of DhMRE contigs | 24 |
| DhMRE sequences | Average DhMRE contig length (bp) | 29,244 |
| DhMRE sequences | DhMRE average coverage | 172x |
| DhMRE sequences | N50 contig size (bp) | 147,306 |
| DhMRE sequences | Longest scaffold (bp) | 222,151 |

Contigs recovered after removing the main contaminants from the raw data were filtered according to G+C content and coverage. Contigs with a G+C content higher than 45% and contigs with a coverage below 20 were discarded. DhMRE contigs were identified among the resulting contigs by BLASTX searches against the NCBI non-redundant protein sequences database.

Table S2. Genome features of DhMRE in comparison to members of the Tenericutes and Firmicutes.

| Feature | Sca. A | Sca. B | Sca. C | Mgen | Upar | Mhyo | Mflo | CaPhy | Linn | Saga |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Group | DhMRE | DhMRE | DhMRE | Tenericutes | Tenericutes | Tenericutes | Tenericutes | Tenericutes | Firmicutes | Firmicutes |
| Length (Mb) | 0.649 | 0.0604 | 0.0038 | 0.580 | 0.752 | 0.840 | 0.793 | 0.602 | 3.01 | 2.13 |
| G+C ratio | 34.06 | 32.3 | 34.06 | 32 | 25 | 25.88 | 27.02 | 21.39 | 37.4 | 35.65 |
| CDS [a] | 606 | 58 | 6 | 482 | 613 | 663 | 683 | 482 | 3141 | 2196 |
| Coding region [b] | 80.96 | 83.69 | 57.13 | 92.14 | 93 | 85.3 | 93.3 | 78.75 | 89.3 | 86.8 |
| RNA operons | 1 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 6 | 7 |
| tRNAs | 35 | 0 | 0 | 36 | 39 | 30 | 29 | 32 | 66 | 80 |
| Lifestyle (Host) [c] | O (F) | O (F) | O (F) | P (A) | P (A) | P (A) | FL | P (Pl) | FL | P (A) |

[a] Number of protein-coding sequences in the corresponding scaffold/chromosome.
[b] Percentage of coding regions in the total scaffold/chromosome.
[c] Lifestyle of local taxa; O: obligate endosymbiont; P: pathogen; FL: free-living; F: fungi; A: animal; Pl: plants.

Data for Mgen, Upar, Mhyo, Mflo and CaPhy were obtained from the Molligen (9) database and for Linn and Saga from the MicroScope database (8) under the following accession numbers: Mgen: Mycoplasma genitalium G37 (NC_000908); Upar: Ureaplasma parvum serovar 3 ATCC 700970 (NC_002162); Mhyo: Mycoplasma hyorhinis HUB-1 (CP002170); Mflo: Mesoplasma florum L1 (NC_006055); CaPhy: Ca. Phytoplasma mali (NC_011047); Linn: Listeria innocua (NC_003212); Saga: Streptococcus agalactiae A909 (NC_007432). Sca. A, B and C: Scaffold A, B and C from the DhMRE draft genome.

Table S3. Pearson correlation coefficients for Z-score of tetranucleotide frequency.
| | DhMRE A | DhMRE B | Mgen | Upar | Mhyo | Mflo | CaPhy | Linn | Saga | Rirr 523 | Rirr 4192 | Dacid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DhMRE A | 1 | | | | | | | | | | | |
| DhMRE B | 0.89 | 1 | | | | | | | | | | |
| Mgen | 0.59 | 0.52 | 1 | | | | | | | | | |
| Upar | 0.55 | 0.50 | 0.66 | 1 | | | | | | | | |
| Mhyo | 0.43 | 0.38 | 0.47 | 0.83 | 1 | | | | | | | |
| Mflo | 0.41 | 0.35 | 0.60 | 0.80 | 0.83 | 1 | | | | | | |
| CaPhy | 0.56 | 0.50 | 0.65 | 0.86 | 0.80 | 0.80 | 1 | | | | | |
| Linn | 0.65 | 0.61 | 0.58 | 0.70 | 0.69 | 0.71 | 0.74 | 1 | | | | |
| Saga | 0.65 | 0.61 | 0.68 | 0.65 | 0.59 | 0.71 | 0.74 | 0.83 | 1 | | | |
| Rirr 523 | 0.33 | 0.25 | 0.34 | 0.48 | 0.55 | 0.53 | 0.56 | 0.48 | 0.43 | 1 | | |
| Rirr 4192 | 0.43 | 0.38 | 0.36 | 0.62 | 0.65 | 0.62 | 0.64 | 0.51 | 0.46 | 0.65 | 1 | |
| Dacid | 0.06 | 0.001 | 0.19 | 0.35 | 0.39 | 0.46 | 0.33 | 0.27 | 0.25 | 0.34 | 0.30 | 1 |

Abbreviations and GenBank accession numbers: DhMRE, Dentiscutata heterogama MRE scaffold; Mgen, Mycoplasma genitalium G37 (NC_000908); Upar, Ureaplasma parvum serovar 3 ATCC 700970 (NC_002162); Mhyo, Mycoplasma hyorhinis HUB-1 (CP002170); Mflo, Mesoplasma florum L1 (NC_006055.1); CaPhy, Candidatus Phytoplasma mali (NC_011047); Linn, Listeria innocua (NC_003212); Saga, Streptococcus agalactiae (NC_007432); Rirr, R.