
Additional Supplemental file for “The Genome of the foraminiferan Reticulomyxa filosa”

Content

1. The Genome
  1.1 Fosmid end sequencing
  1.2 Repeats
  1.3 Repetitiveness between fosmid clones
  1.4 Transposons
  1.5 Highly repetitive sequences tend to be not represented in the whole genome assembly
  1.6 Genome size estimation
2. Completeness
  2.1 Completeness calculation using CEGMA
  2.2 Transcriptome coverage
  2.3 Next generation transcriptome sequencing
3. Gene repertoires and potential pseudogenes
  3.1 tRNA genes
  3.2 Protein coding gene prediction
  3.3 Homopolymer runs in the predicted proteins
  3.4 Gene families
  3.5 Detailed analysis of some gene families
    3.5.1 GPCRs
    3.5.2 Signaling proteins
    3.5.3 Adhesion and phagocytosis
    3.5.4 Motor domain proteins
    3.5.5 Meiosis related genes
    3.5.6 Transcription factors
4. Comparative genomics
  4.1 Identification of common genes between R. filosa and other Foraminifera
  4.2 Common and divergent gene sets in Rhizaria and other Eukaryote lineages
  4.3 Horizontal gene transfer (HGT)
    4.3.1 Global analysis of potential HGT
    4.3.2 HGT from eukaryotes
5. References

1. The Genome

1.1 Fosmid end sequencing

We sequenced more than 8,000 fosmids from each end using classical Sanger-based technology. This yielded 16,715 raw sequences, which, after quality clipping, we mapped to the complete assembly with varying thresholds. Only 25% of the fosmid end sequences could be mapped to the assembled genome over their entire length when we assumed an error rate of 2% in the sequencing reads. If we allow partial matches with varying identity thresholds, we can match more than 15,000 fosmid end sequences to the assembly (Fig. S1). Most of the reads can be matched to the genome if lower threshold values are applied, but higher thresholds prevent mapping of more than half of the sequences. This indicates that this half of the reads matches repetitive contigs with slight sequence variation. Therefore, we concluded that most of the genome is repetitive and not represented by the assembly.

Visual inspection of randomly selected alignments showed that nearly identical sections of fosmid end sequences are indeed often interspersed with other sequences, which prevents alignment of the whole read to the assembly. This finding underscores that nearly identical sequences are dispersed throughout the genome, which would prevent a whole-genome assembly even if long-range mapping information were available.

Figure S1: Matching of fosmid end sequences to the whole genome assembly. The different curves were obtained by defining the minimum matching length, in bases, of the fosmid end sequence to the assembly.

1.2 Repeats

To analyze the genome structure in more detail and resolve the discrepancy between previous genome size estimates (350 to 550 Mb), we completely sequenced 15 fosmids using traditional Sanger sequencing technology. Thirteen were picked randomly, and two were selected because their ends matched one of the sequenced fosmids, in order to enlarge the contiguous sequenced area.
We then used the fosmid sequences as a backbone for mapping Illumina reads. From the sequenced fosmid clones we isolated sequence sections that are highly overrepresented, as indicated by sequence coverage. One of those sections is 351 bases long and is an inverted repeat. The coverage indicates that this small simple repeat makes up several Mb of the R. filosa genome (Fig. S2).

A
GACTAGTCAACCATAGGCCAAGTCATAGACTAACATTTTATACATAGACTTAGACTCATTGAAGCAATA
CACTAGGGTGCTTATGCAACCTCATATGACTAGTAGGCTAACCTATTAGGCCAAAGTTATACTGATGA
TCGATATCACTATCACTAAGACCTCGATAGGCTTAGCCCAGTAGGCCTATTTTGGTCTTAGTGATGGC
GATATTGATCACTAGGCCAAACTTTGTCCTATATGGTAGACTACTAGTCCTATGAGGTTACATAAGCA
CCCTACCCTATTGCTTCAATTGAGGTAGGCCAAGTATAAAATGTATAGTTTATGACTTGGCCTAGTAG
GGCAGTCTAC

B

Figure S2: The most abundant simple repeat observed. A: Sequence of the inverted repeat. B: Graphical representation of the repeat structure of A.

1.3 Repetitiveness between fosmid clones

In addition to the simple repeats identified in the fosmid clones, we found that longer stretches of fosmid sequence were duplicated in other fosmid clones (Fig. S3). The respective fosmid clones are not overlapping clones from the same chromosomal region, since the similar sequences differ from one another. Because we sequenced a haploid genome, these sequences do not represent allelic differences but rather segmental duplications.

Figure S3: A miropeat graph [1] showing repeated structures in three fosmids. The simple repeat of Fig. S2 is represented by the end fragment of the third fosmid clone and is present in several copies in the others. Fosmids Rf14 and FSTFLP are nearly identical, showing that extensive similar segments exist in this genome.

1.4 Transposons

We searched the predicted genes for the presence of transposon domains and found 114 predicted genes with such domains. A coverage analysis of all contigs carrying these transposon-derived sequences revealed that the contig sequences are not overrepresented in the genome (Fig. S4).
Closer inspection revealed that some contig parts exhibit a higher coverage than normal, but these parts do not contain the transposon domains. We also found that the coverage of these contigs has a more pronounced double peak than that of the whole genome. Indeed, most of the transposon sequences appear to be derived from the Rickettsia-like endonuclear parasite.

Figure S4: Coverage plot for all transposon-containing contigs. The dotted line shows coverage of all bases, while the solid line indicates the coverage of contigs larger than 2 kb.

1.5 Highly repetitive sequences tend to be not represented in the whole genome assembly

We checked whether our assembly contained the previously sequenced SSU RNA gene. Interestingly, we did not find a contig matching this sequence in the genome assembly. The transcriptome data, however, yielded three individual contigs that match almost identically and together cover the whole published SSU sequence. We then mapped the raw sequencing reads from the Illumina run to this template and found that more than 400,000 reads could be mapped to it (Fig. S5). The mean coverage was more than 20,000, indicating that at least 200 SSU copies are present in the R. filosa genome. None of the other assembly approaches, including a separate assembly of the 454 reads alone, yielded SSU contigs. We conclude that the high coverage is unfavorable for the assembly because of sequencing errors and incomplete or varying copies.

We also checked whether proteins found to be missing from the assembly match the 4x coverage raw sequencing reads from the 454 Titanium runs. In no case did we find such matches, indicating that the observed failure to assemble the SSU regions is restricted to highly amplified segments.

Figure S5: Illumina reads matching
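
The copy-number argument in Section 1.5 is a ratio of read depths: the coverage of the SSU template divided by the coverage expected for single-copy sequence. The following is a minimal sketch of that arithmetic, not the procedure actually used. The function name is hypothetical, and the single-copy coverage of 100 is an assumption: it is the baseline implied by the quoted figures (20,000 and 200) and is not stated in this section.

```python
def estimate_copy_number(region_coverage: float, single_copy_coverage: float) -> float:
    """Estimate the number of copies of a region from read depth alone.

    Both arguments are mean per-base coverages; the result is their ratio.
    """
    if single_copy_coverage <= 0:
        raise ValueError("single_copy_coverage must be positive")
    return region_coverage / single_copy_coverage


# Mean coverage of the SSU template quoted in the text ("more than 20,000"),
# so this value is a lower bound.
ssu_coverage = 20_000

# ASSUMPTION: coverage of single-copy sequence. This value is not given in
# this section; 100 is simply what the quoted numbers imply.
assumed_single_copy_coverage = 100

ssu_copies = estimate_copy_number(ssu_coverage, assumed_single_copy_coverage)
print(f"Estimated SSU copies: at least {ssu_copies:.0f}")
```

Because the quoted SSU coverage is itself a lower bound, the resulting estimate is also a lower bound. A different single-copy baseline would scale the estimate proportionally.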